Scheduler #

scheduler = policy engine that turns scarce capacity into ordered, placed, admitted work

A scheduler is more than a queue:

queue      = policy-governed waiting set
scheduler  = queue + capacity model + decision discipline + binding

Central tension (the objective space):

fairness, latency, throughput, locality, utilization, isolation
cannot all be maximized at once.
Every discipline is a weighting of this vector.
Every failure mode is an under-weighted objective presenting its bill.

Design Axes (the core module) #

A “scheduler type” is a point in a five-axis space. Design by choosing per axis; recognize systems by reading their vector.

Axis 1 — Decision Question #

ordering:    what runs next          (queue discipline)
placement:   where it runs           (node/worker/shard choice)
admission:   whether/how much enters (rate, quota, waitlist)

Real schedulers usually answer two or three at once (k8s: admission via quota + placement via filter/score; SQS: ordering + admission).

Interrogation:

What is being scheduled? What resource is scarce?
Is this about order, placement, admission, or all three?
Which question does each component answer? (name them separately)

Axis 2 — Discipline / Objective Weighting #

Disciplines are substitutable policies per question, not scheduler types.

ordering disciplines:
  FIFO            predictability          pays: HoL blocking, big-delays-small
  priority        importance first        pays: starvation, inversion, abuse
  fair-share      isolation per class     pays: latency, idle capacity waste
  deadline (EDF)  meet due times          pays: needs admission test + good estimates
  SJF/SRPT        mean latency            pays: large-job starvation, estimate gaming

placement objectives (weighted terms in one score, not rivals):
  bin-pack        utilization             pays: hotspots, correlated failure, fragmentation
  spread          failure decorrelation   pays: utilization
  locality        data/cache/device near  pays: wait-for-local-slot, hot-data hotspot

admission disciplines:
  token/rate bucket    shape arrival rate       pays: wrong key, burst tuning, distributed skew
  concurrency limit    cap in-flight, not rate  pays: permit leak, long jobs hold all slots
  quota/waitlist       hold until eligible      pays: starvation, stale promotion

Interrogation:

Hard constraints vs soft preferences — which is which?
What is the fairness definition? Over what window?
Can work starve? What ages it out of starvation?
Can work be preempted? What does preemption cost?
Whose estimates does the discipline trust, and who can game them?
Which objective did you down-weight, and what bill will it present?

Axis 3 — Decision Topology (who decides) #

central:        omniscient view, one decider     (k8s scheduler, Borg, Trino coordinator)
pull/claim:     workers ask, service selects     (SQS, Pub/Sub, Temporal task queues)
peer stealing:  workers rebalance among selves   (ForkJoin, Go runtime, Ray)
deterministic:  no runtime decider; hash + membership  (consistent hashing, Dynamo ring)
coordinator-lease: central grants exclusive shard ownership  (Kafka groups, Kinesis)

The tradeoff: global view vs freshness vs bottleneck.

central        sees everything, but is a bottleneck with a stale view
pull           fresh worker state, but myopic; no global optimization
stealing       adaptive, but locality loss + victim contention
deterministic  zero decision latency, but blind to load; hot keys unfixable by policy
coord-lease    exclusive ownership, but rebalance pauses + generation bookkeeping

Interrogation:

Who holds the capacity model, and how stale is it?
Is the decider itself a scalability or availability bottleneck?
What happens to scheduling when the decider is down? (does placed work keep running?)
Per-item decision or amortized assignment? (request LB vs shard lease)

Axis 4 — Binding Strength #

advisory:  suggestion; executor may ignore         (locality preference)
durable:   written fact; survives restarts         (pod.spec.nodeName)
leased:    ownership expires unless renewed        (visibility timeout, shard lease)
fenced:    lease + monotonic token; old owner's writes rejected  (generation/epoch)

Interrogation:

What artifact does a decision produce, and where does it live?
Placement is not ownership: who may act on the placed work, enforced how?
If the old owner wakes up after expiry, what rejects it? (no fence -> double execution)
Binding is not materialization: what confirms the work actually started?

Axis 5 — Unit Granularity #

request:     per-call, stateless-ish          (load balancer pick)
task/item:   claimable unit                   (queue item)
job:         long-lived allocation
gang:        all-or-nothing set               (MPI, distributed training)
shard:       long-lived exclusive ownership   (partition, split)
replica set: unit + topology constraint       (spread across domains)

Larger/atomic units need heavier machinery:

gang   -> partial-allocation deadlock; needs all-or-nothing admission (Volcano/Kueue)
big    -> reservation: hold capacity before commit; pays leaked holds, overbooking,
          stranded capacity, expired-hold-used-later
shard  -> lease + checkpoint must move together; rebalance thrash

Interrogation:

What is the atomic unit of scheduling?
Can it be partially allocated? Is that a deadlock?
Does the unit outlive the decision? (then binding needs lease/fence)
Is a two-phase reserve/commit needed? Who expires reservations?

Scheduler As Protocol (the crossing point — keep verbatim) #

Every vector implements some subset of:

receive or observe work intent
admit or reject
build candidate set
filter hard constraints
score soft preferences
choose target/order
reserve or bind
notify/materialize
observe completion/failure
rebalance, preempt, or retry

Kubernetes instantiation:

watch unscheduled Pods
filter nodes (predicates)
score feasible nodes (weighted plugins: pack + spread + locality in one vector)
bind (write nodeName, optimistic concurrency)
kubelet materializes; PodStatus feeds back

Pull/claim instantiation:

worker polls -> queue selects eligible item -> visibility lease granted
worker executes -> ack/delete, or lease expires and item is claimable again

Technical Bottleneck: Decision-View vs Execution-Reality* #

every scheduler decides on a model of the world;
the world moves between decision and effect.

Essential, no general solution. Half of all scheduler failure modes are this one gap in different clothes:

stale capacity view          binding race
stale locality info          stale endpoint / bad health signal
stale shard owner            capacity changed after binding
chosen worker died           lease expired mid-execution

Known recipes (bounded, composable, none universal):

optimistic concurrency on bind   (k8s: bind fails if node state moved; retry)
lease + expiry                   (ownership is rented, never owned)
fencing token / generation       (old decider's or owner's writes rejected)
feedback + reconcile             (every decision provisional; observe, correct)
preemption                       (undo a decision the world invalidated)

A strong design says explicitly:

what is scarce,
what discipline allocates it,
what binding is produced,
what feedback corrects bad decisions,
and what happens when the chosen placement fails.

Named Configurations (lookup table) #

Vector = {question, discipline/objective, topology, binding, unit}.

Name	Vector	Canonical study object	Signature failure
FIFO scheduler	ordering, FIFO, central-ish, advisory, task	thread-pool executor	HoL blocking; big delays small
Priority scheduler	ordering, priority, central, durable, task/pod	k8s PriorityClass + preemption	starvation; inversion; preemption storm
Fair-share	ordering, fair-share, central, —, task/class	Linux CFS; WFQ	wrong weights; fair-but-slow; gaming
Deadline (EDF)	ordering+admission, deadline, central, —, task	EDF real-time	impossible admits; cascade misses; bad estimates
SJF/SRPT	ordering, size-aware, central, —, task	SRPT discipline	large-job starvation; understated costs
Token/rate limiter	admission, rate-shape, central or distributed, —, request	token bucket	wrong key; distributed counter skew
Concurrency limiter	admission, in-flight cap, local, permit, request	connection pool; Envoy adaptive	permit leak; slots held by long jobs; limit ≠ bottleneck
Central placement	placement+admission, pack+spread+locality score, central, durable (optimistic bind), pod	k8s filter/score/bind; Borg	stale view; binding race; fragmentation; scheduler bottleneck
Bin-packing	placement, pack objective, central, durable, job/VM	Borg/k8s scoring	hotspot; correlated failure; future big job can’t fit
Spread/anti-affinity	placement, spread objective, central, durable, replica set	topology spread constraints	utilization loss; unschedulable replicas; wrong domain model
Locality-aware	placement, locality objective, central, advisory→durable, task	MapReduce/Spark locality levels	wait-for-local; hot data hotspot; stale locality
Pull/claim	ordering, FIFO-ish, pull, leased, task	SQS visibility timeout; Temporal polling	duplicate execution; lease expiry mid-work; hoarding
Work-stealing	placement, load-balance, peer, ephemeral, task	ForkJoin deque	steal overhead; locality loss; victim contention
Hash/ring	placement, deterministic, no-decider, durable-by-math, key	Dynamo ring + vnodes	hot key; churn movement; imbalance
Lease shard assignment	placement, sticky-balance, coordinator-lease, leased+fenced, shard	Kafka group coordinator	stale owner; double processing; rebalance thrash; checkpoint/owner split
Query fragment	placement+ordering, locality+slots, central coordinator, durable, fragment/gang-ish	Trino planning/execution	straggler; skew; coordinator bottleneck; partial-retry pain
Traffic/LB	placement, load metric per pick, deterministic or local, advisory, request	Envoy LB policies	stale endpoint; retry storm; bad health signal
Reservation	admission, hold-then-commit, central, leased reservation, large unit	capacity/inventory hold	leaked hold; overbooking; stranded capacity
Gang	placement+admission, all-or-nothing, central, durable, gang	Volcano/Kueue; HPC batch	partial-allocation deadlock; fragmentation; long waits
Waitlist/admission	admission, priority+eligibility, central, —, job/pod	k8s pending-pod queue	starvation; stale promotion; infinite wait

Vocabulary #

work item  candidate set  capacity  constraint  score
priority  deadline  weight  fair share  quota
reservation  binding  lease  fence  generation  preemption
affinity  anti-affinity  locality  fragmentation
permit  slot  admission  promotion
feedback  reconcile  health  straggler  skew

Deep Lesson #

Scheduler bugs come from confusing pairs on different axes:

queue order          vs  scheduling discipline    (queue block vs axis 2)
available capacity   vs  usable capacity          (view vs reality*)
placement            vs  ownership                (axis 4: durable vs leased/fenced)
admission            vs  execution                (axis 1: separate questions)
priority             vs  fairness                 (two disciplines, two objectives)
local optimum        vs  global health            (axis 3: topology limits the view)
autoscaling          vs  scheduling               (changing capacity vs allocating it)
binding              vs  materialization          (decision artifact vs confirmed effect)

Design procedure: name the scarce resource, walk the five axes, state the binding strength, wire the feedback loop, plan for the dead placement. The named types are recognition shortcuts, not the design space.