Skip to main content
  1. concepts/

Scheduler #

scheduler = policy engine that turns scarce capacity into ordered, placed, admitted work

A scheduler is more than a queue:

queue      = policy-governed waiting set
scheduler  = queue + capacity model + decision discipline + binding

Central tension (the objective space):

fairness, latency, throughput, locality, utilization, isolation
cannot all be maximized at once.
Every discipline is a weighting of this vector.
Every failure mode is an under-weighted objective presenting its bill.

Design Axes (the core module) #

A “scheduler type” is a point in a five-axis space. Design by choosing per axis; recognize systems by reading their vector.

Axis 1 — Decision Question #

ordering:    what runs next          (queue discipline)
placement:   where it runs           (node/worker/shard choice)
admission:   whether/how much enters (rate, quota, waitlist)

Real schedulers usually answer two or three at once (k8s: admission via quota + placement via filter/score; SQS: ordering + admission).

Interrogation:

What is being scheduled? What resource is scarce?
Is this about order, placement, admission, or all three?
Which question does each component answer? (name them separately)

Axis 2 — Discipline / Objective Weighting #

Disciplines are substitutable policies per question, not scheduler types.

ordering disciplines:
  FIFO            predictability          pays: HoL blocking, big-delays-small
  priority        importance first        pays: starvation, inversion, abuse
  fair-share      isolation per class     pays: latency, idle capacity waste
  deadline (EDF)  meet due times          pays: needs admission test + good estimates
  SJF/SRPT        mean latency            pays: large-job starvation, estimate gaming

placement objectives (weighted terms in one score, not rivals):
  bin-pack        utilization             pays: hotspots, correlated failure, fragmentation
  spread          failure decorrelation   pays: utilization
  locality        data/cache/device near  pays: wait-for-local-slot, hot-data hotspot

admission disciplines:
  token/rate bucket    shape arrival rate       pays: wrong key, burst tuning, distributed skew
  concurrency limit    cap in-flight, not rate  pays: permit leak, long jobs hold all slots
  quota/waitlist       hold until eligible      pays: starvation, stale promotion

Interrogation:

Hard constraints vs soft preferences — which is which?
What is the fairness definition? Over what window?
Can work starve? What ages it out of starvation?
Can work be preempted? What does preemption cost?
Whose estimates does the discipline trust, and who can game them?
Which objective did you down-weight, and what bill will it present?

Axis 3 — Decision Topology (who decides) #

central:        omniscient view, one decider     (k8s scheduler, Borg, Trino coordinator)
pull/claim:     workers ask, service selects     (SQS, Pub/Sub, Temporal task queues)
peer stealing:  workers rebalance among selves   (ForkJoin, Go runtime, Ray)
deterministic:  no runtime decider; hash + membership  (consistent hashing, Dynamo ring)
coordinator-lease: central grants exclusive shard ownership  (Kafka groups, Kinesis)

The tradeoff: global view vs freshness vs bottleneck.

central        sees everything, but is a bottleneck with a stale view
pull           fresh worker state, but myopic; no global optimization
stealing       adaptive, but locality loss + victim contention
deterministic  zero decision latency, but blind to load; hot keys unfixable by policy
coord-lease    exclusive ownership, but rebalance pauses + generation bookkeeping

Interrogation:

Who holds the capacity model, and how stale is it?
Is the decider itself a scalability or availability bottleneck?
What happens to scheduling when the decider is down? (does placed work keep running?)
Per-item decision or amortized assignment? (request LB vs shard lease)

Axis 4 — Binding Strength #

advisory:  suggestion; executor may ignore         (locality preference)
durable:   written fact; survives restarts         (pod.spec.nodeName)
leased:    ownership expires unless renewed        (visibility timeout, shard lease)
fenced:    lease + monotonic token; old owner's writes rejected  (generation/epoch)

Interrogation:

What artifact does a decision produce, and where does it live?
Placement is not ownership: who may act on the placed work, enforced how?
If the old owner wakes up after expiry, what rejects it? (no fence -> double execution)
Binding is not materialization: what confirms the work actually started?

Axis 5 — Unit Granularity #

request:     per-call, stateless-ish          (load balancer pick)
task/item:   claimable unit                   (queue item)
job:         long-lived allocation
gang:        all-or-nothing set               (MPI, distributed training)
shard:       long-lived exclusive ownership   (partition, split)
replica set: unit + topology constraint       (spread across domains)

Larger/atomic units need heavier machinery:

gang   -> partial-allocation deadlock; needs all-or-nothing admission (Volcano/Kueue)
big    -> reservation: hold capacity before commit; pays leaked holds, overbooking,
          stranded capacity, expired-hold-used-later
shard  -> lease + checkpoint must move together; rebalance thrash

Interrogation:

What is the atomic unit of scheduling?
Can it be partially allocated? Is that a deadlock?
Does the unit outlive the decision? (then binding needs lease/fence)
Is a two-phase reserve/commit needed? Who expires reservations?

Scheduler As Protocol (the crossing point — keep verbatim) #

Every vector implements some subset of:

receive or observe work intent
admit or reject
build candidate set
filter hard constraints
score soft preferences
choose target/order
reserve or bind
notify/materialize
observe completion/failure
rebalance, preempt, or retry

Kubernetes instantiation:

watch unscheduled Pods
filter nodes (predicates)
score feasible nodes (weighted plugins: pack + spread + locality in one vector)
bind (write nodeName, optimistic concurrency)
kubelet materializes; PodStatus feeds back

Pull/claim instantiation:

worker polls -> queue selects eligible item -> visibility lease granted
worker executes -> ack/delete, or lease expires and item is claimable again

Technical Bottleneck: Decision-View vs Execution-Reality* #

every scheduler decides on a model of the world;
the world moves between decision and effect.

Essential, no general solution. Half of all scheduler failure modes are this one gap in different clothes:

stale capacity view          binding race
stale locality info          stale endpoint / bad health signal
stale shard owner            capacity changed after binding
chosen worker died           lease expired mid-execution

Known recipes (bounded, composable, none universal):

optimistic concurrency on bind   (k8s: bind fails if node state moved; retry)
lease + expiry                   (ownership is rented, never owned)
fencing token / generation       (old decider's or owner's writes rejected)
feedback + reconcile             (every decision provisional; observe, correct)
preemption                       (undo a decision the world invalidated)

A strong design says explicitly:

what is scarce,
what discipline allocates it,
what binding is produced,
what feedback corrects bad decisions,
and what happens when the chosen placement fails.

Named Configurations (lookup table) #

Vector = {question, discipline/objective, topology, binding, unit}.

NameVectorCanonical study objectSignature failure
FIFO schedulerordering, FIFO, central-ish, advisory, taskthread-pool executorHoL blocking; big delays small
Priority schedulerordering, priority, central, durable, task/podk8s PriorityClass + preemptionstarvation; inversion; preemption storm
Fair-shareordering, fair-share, central, —, task/classLinux CFS; WFQwrong weights; fair-but-slow; gaming
Deadline (EDF)ordering+admission, deadline, central, —, taskEDF real-timeimpossible admits; cascade misses; bad estimates
SJF/SRPTordering, size-aware, central, —, taskSRPT disciplinelarge-job starvation; understated costs
Token/rate limiteradmission, rate-shape, central or distributed, —, requesttoken bucketwrong key; distributed counter skew
Concurrency limiteradmission, in-flight cap, local, permit, requestconnection pool; Envoy adaptivepermit leak; slots held by long jobs; limit ≠ bottleneck
Central placementplacement+admission, pack+spread+locality score, central, durable (optimistic bind), podk8s filter/score/bind; Borgstale view; binding race; fragmentation; scheduler bottleneck
Bin-packingplacement, pack objective, central, durable, job/VMBorg/k8s scoringhotspot; correlated failure; future big job can’t fit
Spread/anti-affinityplacement, spread objective, central, durable, replica settopology spread constraintsutilization loss; unschedulable replicas; wrong domain model
Locality-awareplacement, locality objective, central, advisory→durable, taskMapReduce/Spark locality levelswait-for-local; hot data hotspot; stale locality
Pull/claimordering, FIFO-ish, pull, leased, taskSQS visibility timeout; Temporal pollingduplicate execution; lease expiry mid-work; hoarding
Work-stealingplacement, load-balance, peer, ephemeral, taskForkJoin dequesteal overhead; locality loss; victim contention
Hash/ringplacement, deterministic, no-decider, durable-by-math, keyDynamo ring + vnodeshot key; churn movement; imbalance
Lease shard assignmentplacement, sticky-balance, coordinator-lease, leased+fenced, shardKafka group coordinatorstale owner; double processing; rebalance thrash; checkpoint/owner split
Query fragmentplacement+ordering, locality+slots, central coordinator, durable, fragment/gang-ishTrino planning/executionstraggler; skew; coordinator bottleneck; partial-retry pain
Traffic/LBplacement, load metric per pick, deterministic or local, advisory, requestEnvoy LB policiesstale endpoint; retry storm; bad health signal
Reservationadmission, hold-then-commit, central, leased reservation, large unitcapacity/inventory holdleaked hold; overbooking; stranded capacity
Gangplacement+admission, all-or-nothing, central, durable, gangVolcano/Kueue; HPC batchpartial-allocation deadlock; fragmentation; long waits
Waitlist/admissionadmission, priority+eligibility, central, —, job/podk8s pending-pod queuestarvation; stale promotion; infinite wait

Vocabulary #

work item  candidate set  capacity  constraint  score
priority  deadline  weight  fair share  quota
reservation  binding  lease  fence  generation  preemption
affinity  anti-affinity  locality  fragmentation
permit  slot  admission  promotion
feedback  reconcile  health  straggler  skew

Deep Lesson #

Scheduler bugs come from confusing pairs on different axes:

queue order          vs  scheduling discipline    (queue block vs axis 2)
available capacity   vs  usable capacity          (view vs reality*)
placement            vs  ownership                (axis 4: durable vs leased/fenced)
admission            vs  execution                (axis 1: separate questions)
priority             vs  fairness                 (two disciplines, two objectives)
local optimum        vs  global health            (axis 3: topology limits the view)
autoscaling          vs  scheduling               (changing capacity vs allocating it)
binding              vs  materialization          (decision artifact vs confirmed effect)

Design procedure: name the scarce resource, walk the five axes, state the binding strength, wire the feedback loop, plan for the dead placement. The named types are recognition shortcuts, not the design space.