Scheduler #
scheduler = policy engine that turns scarce capacity into ordered, placed, admitted work
A scheduler is more than a queue:
queue = policy-governed waiting set
scheduler = queue + capacity model + decision discipline + binding
Central tension (the objective space):
fairness, latency, throughput, locality, utilization, isolation
cannot all be maximized at once.
Every discipline is a weighting of this vector.
Every failure mode is an under-weighted objective presenting its bill.
Design Axes (the core module) #
A “scheduler type” is a point in a five-axis space. Design by choosing per axis; recognize systems by reading their vector.
Axis 1 — Decision Question #
ordering: what runs next (queue discipline)
placement: where it runs (node/worker/shard choice)
admission: whether/how much enters (rate, quota, waitlist)
Real schedulers usually answer two or three at once (k8s: admission via quota + placement via filter/score; SQS: ordering + admission).
Interrogation:
What is being scheduled? What resource is scarce?
Is this about order, placement, admission, or all three?
Which question does each component answer? (name them separately)
Axis 2 — Discipline / Objective Weighting #
Disciplines are substitutable policies per question, not scheduler types.
ordering disciplines:
FIFO predictability pays: HoL blocking, big-delays-small
priority importance first pays: starvation, inversion, abuse
fair-share isolation per class pays: latency, idle capacity waste
deadline (EDF) meet due times pays: needs admission test + good estimates
SJF/SRPT mean latency pays: large-job starvation, estimate gaming
placement objectives (weighted terms in one score, not rivals):
bin-pack utilization pays: hotspots, correlated failure, fragmentation
spread failure decorrelation pays: utilization
locality data/cache/device near pays: wait-for-local-slot, hot-data hotspot
admission disciplines:
token/rate bucket shape arrival rate pays: wrong key, burst tuning, distributed skew
concurrency limit cap in-flight, not rate pays: permit leak, long jobs hold all slots
quota/waitlist hold until eligible pays: starvation, stale promotion
Interrogation:
Hard constraints vs soft preferences — which is which?
What is the fairness definition? Over what window?
Can work starve? What ages it out of starvation?
Can work be preempted? What does preemption cost?
Whose estimates does the discipline trust, and who can game them?
Which objective did you down-weight, and what bill will it present?
Axis 3 — Decision Topology (who decides) #
central: omniscient view, one decider (k8s scheduler, Borg, Trino coordinator)
pull/claim: workers ask, service selects (SQS, Pub/Sub, Temporal task queues)
peer stealing: workers rebalance among selves (ForkJoin, Go runtime, Ray)
deterministic: no runtime decider; hash + membership (consistent hashing, Dynamo ring)
coordinator-lease: central grants exclusive shard ownership (Kafka groups, Kinesis)
The tradeoff: global view vs freshness vs bottleneck.
central sees everything, but is a bottleneck with a stale view
pull fresh worker state, but myopic; no global optimization
stealing adaptive, but locality loss + victim contention
deterministic zero decision latency, but blind to load; hot keys unfixable by policy
coord-lease exclusive ownership, but rebalance pauses + generation bookkeeping
Interrogation:
Who holds the capacity model, and how stale is it?
Is the decider itself a scalability or availability bottleneck?
What happens to scheduling when the decider is down? (does placed work keep running?)
Per-item decision or amortized assignment? (request LB vs shard lease)
Axis 4 — Binding Strength #
advisory: suggestion; executor may ignore (locality preference)
durable: written fact; survives restarts (pod.spec.nodeName)
leased: ownership expires unless renewed (visibility timeout, shard lease)
fenced: lease + monotonic token; old owner's writes rejected (generation/epoch)
Interrogation:
What artifact does a decision produce, and where does it live?
Placement is not ownership: who may act on the placed work, enforced how?
If the old owner wakes up after expiry, what rejects it? (no fence -> double execution)
Binding is not materialization: what confirms the work actually started?
Axis 5 — Unit Granularity #
request: per-call, stateless-ish (load balancer pick)
task/item: claimable unit (queue item)
job: long-lived allocation
gang: all-or-nothing set (MPI, distributed training)
shard: long-lived exclusive ownership (partition, split)
replica set: unit + topology constraint (spread across domains)
Larger/atomic units need heavier machinery:
gang -> partial-allocation deadlock; needs all-or-nothing admission (Volcano/Kueue)
big -> reservation: hold capacity before commit; pays leaked holds, overbooking,
stranded capacity, expired-hold-used-later
shard -> lease + checkpoint must move together; rebalance thrash
Interrogation:
What is the atomic unit of scheduling?
Can it be partially allocated? Is that a deadlock?
Does the unit outlive the decision? (then binding needs lease/fence)
Is a two-phase reserve/commit needed? Who expires reservations?
Scheduler As Protocol (the crossing point — keep verbatim) #
Every vector implements some subset of:
receive or observe work intent
admit or reject
build candidate set
filter hard constraints
score soft preferences
choose target/order
reserve or bind
notify/materialize
observe completion/failure
rebalance, preempt, or retry
Kubernetes instantiation:
watch unscheduled Pods
filter nodes (predicates)
score feasible nodes (weighted plugins: pack + spread + locality in one vector)
bind (write nodeName, optimistic concurrency)
kubelet materializes; PodStatus feeds back
Pull/claim instantiation:
worker polls -> queue selects eligible item -> visibility lease granted
worker executes -> ack/delete, or lease expires and item is claimable again
Technical Bottleneck: Decision-View vs Execution-Reality* #
every scheduler decides on a model of the world;
the world moves between decision and effect.
Essential, no general solution. Half of all scheduler failure modes are this one gap in different clothes:
stale capacity view binding race
stale locality info stale endpoint / bad health signal
stale shard owner capacity changed after binding
chosen worker died lease expired mid-execution
Known recipes (bounded, composable, none universal):
optimistic concurrency on bind (k8s: bind fails if node state moved; retry)
lease + expiry (ownership is rented, never owned)
fencing token / generation (old decider's or owner's writes rejected)
feedback + reconcile (every decision provisional; observe, correct)
preemption (undo a decision the world invalidated)
A strong design says explicitly:
what is scarce,
what discipline allocates it,
what binding is produced,
what feedback corrects bad decisions,
and what happens when the chosen placement fails.
Named Configurations (lookup table) #
Vector = {question, discipline/objective, topology, binding, unit}.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| FIFO scheduler | ordering, FIFO, central-ish, advisory, task | thread-pool executor | HoL blocking; big delays small |
| Priority scheduler | ordering, priority, central, durable, task/pod | k8s PriorityClass + preemption | starvation; inversion; preemption storm |
| Fair-share | ordering, fair-share, central, —, task/class | Linux CFS; WFQ | wrong weights; fair-but-slow; gaming |
| Deadline (EDF) | ordering+admission, deadline, central, —, task | EDF real-time | impossible admits; cascade misses; bad estimates |
| SJF/SRPT | ordering, size-aware, central, —, task | SRPT discipline | large-job starvation; understated costs |
| Token/rate limiter | admission, rate-shape, central or distributed, —, request | token bucket | wrong key; distributed counter skew |
| Concurrency limiter | admission, in-flight cap, local, permit, request | connection pool; Envoy adaptive | permit leak; slots held by long jobs; limit ≠ bottleneck |
| Central placement | placement+admission, pack+spread+locality score, central, durable (optimistic bind), pod | k8s filter/score/bind; Borg | stale view; binding race; fragmentation; scheduler bottleneck |
| Bin-packing | placement, pack objective, central, durable, job/VM | Borg/k8s scoring | hotspot; correlated failure; future big job can’t fit |
| Spread/anti-affinity | placement, spread objective, central, durable, replica set | topology spread constraints | utilization loss; unschedulable replicas; wrong domain model |
| Locality-aware | placement, locality objective, central, advisory→durable, task | MapReduce/Spark locality levels | wait-for-local; hot data hotspot; stale locality |
| Pull/claim | ordering, FIFO-ish, pull, leased, task | SQS visibility timeout; Temporal polling | duplicate execution; lease expiry mid-work; hoarding |
| Work-stealing | placement, load-balance, peer, ephemeral, task | ForkJoin deque | steal overhead; locality loss; victim contention |
| Hash/ring | placement, deterministic, no-decider, durable-by-math, key | Dynamo ring + vnodes | hot key; churn movement; imbalance |
| Lease shard assignment | placement, sticky-balance, coordinator-lease, leased+fenced, shard | Kafka group coordinator | stale owner; double processing; rebalance thrash; checkpoint/owner split |
| Query fragment | placement+ordering, locality+slots, central coordinator, durable, fragment/gang-ish | Trino planning/execution | straggler; skew; coordinator bottleneck; partial-retry pain |
| Traffic/LB | placement, load metric per pick, deterministic or local, advisory, request | Envoy LB policies | stale endpoint; retry storm; bad health signal |
| Reservation | admission, hold-then-commit, central, leased reservation, large unit | capacity/inventory hold | leaked hold; overbooking; stranded capacity |
| Gang | placement+admission, all-or-nothing, central, durable, gang | Volcano/Kueue; HPC batch | partial-allocation deadlock; fragmentation; long waits |
| Waitlist/admission | admission, priority+eligibility, central, —, job/pod | k8s pending-pod queue | starvation; stale promotion; infinite wait |
Vocabulary #
work item candidate set capacity constraint score
priority deadline weight fair share quota
reservation binding lease fence generation preemption
affinity anti-affinity locality fragmentation
permit slot admission promotion
feedback reconcile health straggler skew
Deep Lesson #
Scheduler bugs come from confusing pairs on different axes:
queue order vs scheduling discipline (queue block vs axis 2)
available capacity vs usable capacity (view vs reality*)
placement vs ownership (axis 4: durable vs leased/fenced)
admission vs execution (axis 1: separate questions)
priority vs fairness (two disciplines, two objectives)
local optimum vs global health (axis 3: topology limits the view)
autoscaling vs scheduling (changing capacity vs allocating it)
binding vs materialization (decision artifact vs confirmed effect)
Design procedure: name the scarce resource, walk the five axes, state the binding strength, wire the feedback loop, plan for the dead placement. The named types are recognition shortcuts, not the design space.