2022-03-25 · 2 min read



  • "Workload" scheduler and orchestrator
  • Supports container, VM, raw forked binary, or custom "driver" workloads.
  • Devs declaratively specify workloads in a Job file, which describes services, versions, ports, dependencies, etc.
  • Nomad compares the desired set of jobs with what is actually running and tries to make reality match the desired state.
  • Nomad schedules workloads across available hardware. It restarts jobs that have crashed. It migrates jobs from failed machines.
  • It lets you specify a rollout strategy for version upgrades.
  • Developers consume infrastructure via APIs. Nomad provides these "northbound" APIs.
  • Ops manages infrastructure via APIs. Nomad provides these "southbound" APIs.
  • Custom Driver Plugins (how to run and manage a task)
  • Custom Device Plugins (custom resources that can be exposed). Could be useful for SGX maybe?
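
The "make reality match desired jobs" idea above can be sketched as a small reconciliation function. This is a toy model, not Nomad's scheduler: `Alloc`, `reconcile`, and the least-loaded placement rule are hypothetical names and simplifications chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alloc:
    """One running instance of a job on a node (hypothetical model)."""
    job: str
    node: str


def reconcile(desired: dict[str, int], running: list[Alloc],
              healthy_nodes: list[str]) -> list[tuple[str, str, str]]:
    """Return ("start" | "stop", job, node) actions that move reality toward `desired`.

    `desired` maps a job name to the number of instances wanted.
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes to place work on")

    # Instances on failed nodes are treated as lost, so they get replaced elsewhere.
    live = [a for a in running if a.node in healthy_nodes]
    load = Counter(a.node for a in live)

    actions = []
    for job, want in sorted(desired.items()):
        mine = [a for a in live if a.job == job]
        # Too few instances: place the missing ones on the least-loaded node.
        # (Assumption: real schedulers use richer placement logic than this.)
        for _ in range(want - len(mine)):
            node = min(healthy_nodes, key=lambda n: load[n])
            load[node] += 1
            actions.append(("start", job, node))
        # Too many instances: stop the surplus.
        for alloc in mine[want:]:
            load[alloc.node] -= 1
            actions.append(("stop", job, alloc.node))

    # Jobs that are no longer desired at all are stopped.
    for alloc in live:
        if alloc.job not in desired:
            actions.append(("stop", alloc.job, alloc.node))
    return actions


# Node n2 has failed, so its "web" and "api" instances must be placed again.
running = [Alloc("web", "n1"), Alloc("web", "n2"), Alloc("api", "n2")]
print(reconcile({"web": 2, "api": 1}, running, ["n1", "n3"]))
# [('start', 'api', 'n3'), ('start', 'web', 'n1')]
```

Running this function in a loop against fresh cluster state is what gives the restart and migrate behaviour described in the list: a crashed or lost instance shows up as a gap between desired and running counts.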

Example

CircleCI

  • CircleCI uses Nomad as a Job Queue
  • New commits from customers trigger an event.
  • Each commit event is then submitted to Nomad as a Job (for example, to test the customer's code).
  • Nomad can act like a Job queue and buffer Jobs until there is available capacity in the fleet. For example, CircleCI might get 2k+ jobs/min but only have 1k jobs/min hardware capacity during peak hours. Nomad will correctly buffer jobs until there is capacity.
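
To make the buffering behaviour concrete, here is a minimal backlog simulation. The 2k arrivals/min and 1k/min capacity come from the note above; the off-peak arrival numbers and the function name `backlog_over_time` are made up for illustration, and this says nothing about how Nomad implements its queue internally.

```python
def backlog_over_time(arrivals_per_min: list[int], capacity_per_min: int) -> list[int]:
    """Return the number of jobs still waiting at the end of each minute."""
    pending = 0
    history = []
    for arrivals in arrivals_per_min:
        pending += arrivals
        # The fleet can start at most `capacity_per_min` jobs each minute.
        pending -= min(pending, capacity_per_min)
        history.append(pending)
    return history


# Three peak minutes at 2k jobs/min, then traffic drops off (hypothetical numbers).
print(backlog_over_time([2000, 2000, 2000, 500, 500, 0], capacity_per_min=1000))
# [1000, 2000, 3000, 2500, 2000, 1000]
```

The backlog grows by 1k jobs for every peak minute and only drains once arrivals fall below capacity, so the queue trades latency for not having to provision hardware for the peak.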

Citadel

  • Citadel cares about how quickly, in wall-clock time, they can compute a job, rather than total cost, number of cores, etc.
  • Their issue is: how many containers can we run in a short period of time?
  • Need to support bursty workloads (analysts want to run big models as quickly as possible).
  • Need to work across existing DCs and "burst" to cloud on-demand.
  • Want to run 40M containers in a short period.
  • A 2017 talk mentions 3k+ containers scheduled per second.
  • related blog post:
    • HashiCorp Nomad scheduled 2,000,000 Docker containers on 6,100 hosts in 10 AWS regions in 22 minutes.
    • That works out to roughly 1.5k containers/sec.
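
As a back-of-the-envelope check on the figures above, the snippet below recomputes the blog post's rate and estimates how long 40M containers would take. It assumes the quoted rates would hold steady at that scale, which neither source claims; the helper names are hypothetical.

```python
def rate_per_sec(containers: int, minutes: float) -> float:
    """Average scheduling rate in containers per second."""
    return containers / (minutes * 60)


def hours_needed(containers: int, per_sec: float) -> float:
    """Hours to schedule `containers` at a constant rate of `per_sec`."""
    return containers / per_sec / 3600


blog_rate = rate_per_sec(2_000_000, 22)
print(round(blog_rate))                                # 1515
print(round(hours_needed(40_000_000, blog_rate), 1))   # 7.3
print(round(hours_needed(40_000_000, 3_000), 1))       # 3.7
```

So at the rates mentioned in these notes, 40M containers is a matter of hours, not minutes.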