Container Scheduling / Placement (Kubernetes Scheduler-Class) Analysis Note #
This note captures the full step-by-step analysis for a Kubernetes-scheduler-class container placement system: workload intent, node resource state, scheduling decisions, binding/claim semantics, and reconciliation.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a container scheduling / placement system like the Kubernetes scheduler
- users submit workloads (pods/jobs)
- system places workloads onto eligible nodes
- placement must respect resource capacity and policy constraints
- failed nodes or changed cluster state should trigger rescheduling/reconciliation
- system scales across large clusters
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| User creates or updates workload intent | Client | overwrite state | S1, update target: WorkloadSpec | C1 |
| System records node resource/health state | System | overwrite state | S1, update target: NodeState | C1 |
| Scheduler computes placement decision | System | state transition | S1, update target: PlacementDecision | C1 |
| Scheduler binds workload to node | System | state transition | S1, update target: BindingState | C1 |
| System reconciles workload after node failure or drift | System | async process | S1, hidden write target: BindingState | C1 |
| System propagates placement/config snapshot to node agents | System | async process | S1, hidden write target: NodeSnapshot | C1 |
| User reads workload/scheduling status | Client | read projection | S1, read projection target: SchedulingStatusView | R2 |
Notes on normalization:
- workload spec is overwrite-state desired intent
- node state is overwrite-state current capacity/health truth
- placement decision and binding are lifecycle transitions
- reconciliation is internal async process
- node snapshots are control-plane dissemination to agents
This system is a composition of:
- Optimization / Matching Decision
- Control Plane + Data Plane
The core correctness center is the scheduling/binding decision.
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Create/update workload intent | C1 | desired workload truth drives all placement |
| Record node resource/health state | C1 | stale node truth causes invalid placements |
| Compute placement decision | C1 | scheduler must pick an eligible node |
| Bind workload to node | C1 | one current binding should win per schedulable workload instance |
| Reconcile after node failure/drift | C1 | system must restore desired placement safely |
| Propagate placement/config snapshot to node agents | C1 | stale node view can delay or corrupt execution |
| Read scheduling status | R2 | operational only |
Critical paths:
- P1: update workload spec
- P2: update node state
- P3: compute placement decision
- P4: bind workload to node
- P5: reconcile after failure/drift
- P6: propagate node snapshot
Step 3 — Primary State Extraction #
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| WorkloadSpec | direct noun | Yes | keep as candidate | intent | Yes | service | overwrite | instance | workload_id |
| NodeState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | node_id |
| PlacementDecision | hidden write target | Yes | keep as candidate | process | Yes | service | overwrite | instance | workload_id |
| BindingState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | workload_instance_id |
| NodeSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | node_id |
| SchedulingStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | namespace or workload scope |
Minimal primary set:
- `WorkloadSpec`
- `NodeState`
- `PlacementDecision`
- `BindingState`
- `NodeSnapshot`
Important modeling choices:
WorkloadSpec #
This is desired state / intent.
PlacementDecision #
Worth keeping explicit because:
- the scheduler computes a chosen placement from current cluster inputs
- decision quality and legality matter
BindingState #
Primary because:
- binding a workload instance to a node has lifecycle and exclusivity semantics
NodeSnapshot #
Important because:
- node agents act on propagated desired assignments and config, not synchronous control-plane reads for every reconciliation step
Step 4 — Hard Invariants #
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 update workload spec | HARD | ordering | Workload-spec revisions are ordered by monotonic version within workload_id. |
| P2 update node state | HARD | ordering | Node-state updates are ordered by monotonic observation revision/timestamp within node_id. |
| P3 compute placement decision | HARD | eligibility | place_workload is valid only if the selected node is eligible under current workload constraints, node capacity, and policy state at decision time. |
| P4 bind workload to node | HARD | uniqueness | workload_instance_id maps to at most one current active binding within that workload-instance scope. |
| P4 bind workload to node | HARD | eligibility | bind_workload is valid only if the selected node still has sufficient allocatable capacity and the binding state allows the transition. |
| P5 reconcile after drift/failure | HARD | eligibility | reconcile_binding is valid only if the current observed cluster state shows the active binding is invalid, failed, or drifted relative to desired workload state. |
| P6 propagate node snapshot | HARD | freshness | NodeSnapshot(node_id) reflects authoritative desired assignments/config within the configured propagation bound. |
What matters most:
- scheduler must not place onto ineligible or over-capacity nodes
- only one binding should win per workload instance
- node failure/drift should trigger legal rebinding
- node agents must converge to current desired assignment state
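The P3 eligibility invariant can be illustrated with a filter-then-score pass. This is a minimal sketch, not the note's prescribed algorithm: the `Workload` and `Node` fields, the label-selector policy check, and the "most free CPU after placement" score are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Workload:  # hypothetical shape of the scheduling-relevant part of WorkloadSpec
    workload_id: str
    cpu_millis: int
    memory_mib: int
    node_selector: Dict[str, str] = field(default_factory=dict)


@dataclass
class Node:  # hypothetical shape of NodeState as seen by the scheduler
    node_id: str
    healthy: bool
    free_cpu_millis: int
    free_memory_mib: int
    labels: Dict[str, str] = field(default_factory=dict)


def is_eligible(w: Workload, n: Node) -> bool:
    """Filter step: health, capacity, then policy (label selector)."""
    if not n.healthy:
        return False
    if n.free_cpu_millis < w.cpu_millis or n.free_memory_mib < w.memory_mib:
        return False
    return all(n.labels.get(k) == v for k, v in w.node_selector.items())


def choose_node(w: Workload, nodes: List[Node]) -> Optional[str]:
    """Score step: prefer the node with the most CPU left after placement.

    Ties are broken by node_id so the decision is deterministic.
    Returns None when no node is eligible (the workload stays pending).
    """
    candidates = [n for n in nodes if is_eligible(w, n)]
    if not candidates:
        return None
    best = min(candidates, key=lambda n: (-(n.free_cpu_millis - w.cpu_millis), n.node_id))
    return best.node_id
```

Because the scheduler may be reading a bounded-stale node view, a result from `choose_node` is only a proposal; the bind step still has to re-check capacity against authoritative state.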
Step 5 — Execution Context #
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical scheduling system with many nodes/agents |
| Write coordination scope | per object scope | correctness is per workload instance, node, and node snapshot scope |
| Read consistency target | bounded stale allowed | scheduler can often operate on slightly stale node views, but binding must still be guarded |
| Holder model | node | nodes temporarily hold capacity/binding for workload instances |
| Compensation acceptable? | No | illegal overcommit or duplicate active binding is not a compensable workflow in baseline correctness |
Derived:
- `holder_may_crash = true`
- `bounded_staleness_allowed = true`
- `exclusive_claim_required = true`
- `guarded_by_current_state = true`
This implies:
- authoritative workload/node state in control plane
- guarded placement/binding transitions
- bounded-stale scheduling inputs but authoritative bind checks
Step 6 — Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 update workload spec | overwrite current value | CAS on version | workload version |
| P2 update node state | overwrite current value | CAS on version or monotonic overwrite | node revision/timestamp |
| P3 compute placement decision | guarded state transition | single writer scheduler decision | decision version |
| P4 bind workload to node | guarded state transition | CAS on (state, version) or single writer bind with optimistic concurrency | binding version, resource check |
| P5 reconcile after drift/failure | guarded state transition | single writer reconciler decision + CAS bind update | binding version |
| P6 propagate node snapshot | overwrite current value | single writer snapshot publication | snapshot version |
Why these fit:
- workload/node state are current-state config/status values
- placement is an eligibility-driven decision
- binding is a guarded current-state transition with uniqueness/exclusivity
- reconciliation is another guarded rebinding path
- node snapshot propagation is versioned overwrite dissemination
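The P4 mechanism, CAS on (state, version) combined with a resource check, can be sketched with an in-memory store. Everything here is an illustrative assumption: the `BindingStore` class, the `PENDING`/`BOUND` state names, and CPU as the only tracked resource. A real store would need to perform the compare and the write atomically, which this single-threaded sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class BindConflict(Exception):
    """Raised when a bind loses the CAS or fails the capacity check."""


@dataclass
class Binding:
    state: str = "PENDING"          # hypothetical lifecycle: PENDING -> BOUND
    node_id: Optional[str] = None
    version: int = 0


class BindingStore:
    """In-memory stand-in for the authoritative binding-state owner."""

    def __init__(self, free_cpu_by_node: Dict[str, int]) -> None:
        self._free = dict(free_cpu_by_node)
        self._bindings: Dict[str, Binding] = {}

    def get(self, instance_id: str) -> Binding:
        return self._bindings.setdefault(instance_id, Binding())

    def free_cpu(self, node_id: str) -> int:
        return self._free.get(node_id, 0)

    def bind_workload(self, instance_id: str, node_id: str,
                      cpu_millis: int, expected_version: int) -> int:
        b = self.get(instance_id)
        # Guard 1: CAS on (state, version). A stale or duplicate bind loses here.
        if b.version != expected_version or b.state != "PENDING":
            raise BindConflict(f"{instance_id}: stale version or state {b.state}")
        # Guard 2: capacity is re-checked at bind time, not trusted from the decision.
        if self.free_cpu(node_id) < cpu_millis:
            raise BindConflict(f"{node_id}: insufficient capacity")
        self._free[node_id] -= cpu_millis
        b.state, b.node_id, b.version = "BOUND", node_id, b.version + 1
        return b.version
```

A caller that receives `BindConflict` would typically re-read the binding and node state and recompute the placement rather than retry the same bind blindly.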
Step 7 — Read Model / Source of Truth #
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 workload intent | WorkloadSpec | read source directly | authoritative workload store |
| C2 node resource/health state | NodeState | read source directly | authoritative node-state store |
| C3 placement decision | PlacementDecision | read source directly | recompute from workload + node inputs |
| C4 binding lifecycle | BindingState | read source directly | authoritative binding store |
| C5 node snapshot | NodeSnapshot | materialized view | rebuild from latest authoritative desired assignment state |
| C6 scheduling status | derived | materialized view | recompute from primary state |
Important point:
Node agents should act from propagated desired snapshots rather than synchronously querying the scheduler for every local action.
The scheduler itself may read bounded-stale node state, but binding must still validate against sufficiently current authoritative state.
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| workload update | retry with workload version | stale update loses CAS | committed workload spec survives crash if persisted | snapshot propagation may lag | n/a |
| node-state update | retry with monotonic observation revision | latest valid node view wins | committed node state survives crash if persisted | snapshot propagation may lag | n/a |
| placement decision | retry safe from workload/node inputs | single scheduler/decision version wins | scheduler crash delays placement; next scheduler retries | n/a | stale decision rejected by newer binding/version |
| bind workload | retry with binding version | stale/duplicate bind loses CAS | committed bind survives crash if persisted | node-snapshot propagation may lag | stale node binding fenced by newer binding version |
| reconcile after failure/drift | retry safe from current cluster state | only one active rebinding should win | reconciler crash delays recovery; next reconciler retries | snapshot propagation may lag | stale node/agent should stop acting after newer desired snapshot |
| snapshot propagation | retry with versioned snapshot | older snapshot loses to newer version | node keeps last good snapshot until refresh | failed push retried or pulled | n/a |
What matters most:
- one binding wins per workload instance
- bind must validate against current capacity and state
- node agents move monotonically forward by snapshot version
- reconciliation must safely replace failed/invalid bindings
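The "move monotonically forward by snapshot version" rule can be sketched from the node agent's side. The `NodeSnapshot` fields and the `NodeAgent` class below are hypothetical shapes chosen for illustration; the only rule taken from the note is that an older or duplicate snapshot loses to a newer version and the agent keeps its last good snapshot otherwise.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class NodeSnapshot:  # hypothetical payload: desired assignments for one node
    node_id: str
    version: int
    assignments: FrozenSet[str]


class NodeAgent:
    """Keeps the last good snapshot and only ever moves forward by version."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.current: Optional[NodeSnapshot] = None

    def apply(self, snap: NodeSnapshot) -> bool:
        """Return True if the snapshot was accepted, False if it was dropped."""
        if snap.node_id != self.node_id:
            return False  # misrouted snapshot
        if self.current is not None and snap.version <= self.current.version:
            return False  # stale or duplicate delivery; keep last good snapshot
        self.current = snap
        return True
```

This makes retried pushes and out-of-order deliveries harmless: a redelivered snapshot is rejected, and a workload removed in a newer snapshot is not resurrected by a late older one.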
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| large scheduling queue bursts | write throughput hotspot | shard scheduler work queues and parallelize scoring/filtering |
| very large cluster/node state | read hotspot | partition node state by region/zone/pool and cache scheduler inputs |
| frequent node-state churn | contention hotspot | dampen updates and aggregate node-state deltas |
| binding/reconciliation storms after failures | contention hotspot | batch reconciliation and rate-limit rebinding |
| snapshot propagation to many agents | fan-out hotspot | publish incremental node snapshots and support pull-on-version-miss |
| status/dashboard reads | read hotspot | derived views only |
What scales well:
- scheduler can parallelize scoring
- node agents act from local snapshots
- desired-state model allows reconciliation workers to operate independently
What fails first:
- cluster-wide failure causing rebinding storms
- huge node-state churn
- large fanout of node snapshots
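One way to rate-limit rebinding during a failure storm is a token bucket in front of the reconciler. This is a sketch of one possible approach, not something the note specifies: the `TokenBucket` class, the `drain_rebinds` helper, and the rate and burst values are all assumptions. The clock is passed in explicitly so the behavior is deterministic; a real caller might pass `time.monotonic()`.

```python
from typing import List, Tuple


class TokenBucket:
    """Allows `burst` immediate operations, then `rate_per_sec` sustained."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def try_acquire(self, now: float) -> bool:
        # Refill based on elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def drain_rebinds(pending: List[str], bucket: TokenBucket,
                  now: float) -> Tuple[List[str], List[str]]:
    """Split pending workload instances into (admitted now, deferred)."""
    admitted: List[str] = []
    deferred: List[str] = []
    for instance_id in pending:
        (admitted if bucket.try_acquire(now) else deferred).append(instance_id)
    return admitted, deferred
```

Deferred instances stay in the reconciliation queue and are retried on the next pass, so the limit delays recovery without dropping any rebinding work.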
Canonical design conclusion:
- archetype composition:
  - Optimization / Matching Decision
  - Control Plane + Data Plane
- primary truth:
  - `WorkloadSpec`
  - `NodeState`
  - `PlacementDecision`
  - `BindingState`
  - `NodeSnapshot`
- hot path:
  - scheduler computes placement from current cluster state
  - guarded bind commits assignment
  - node agents act from local snapshots
Concrete Substrate #
- control plane in Go/Java
- authoritative cluster state store in etcd / strongly consistent control-plane DB
- scheduler fleet consuming pending workload queue
- node agents/kubelet-like workers consuming versioned desired assignment snapshots
- optional event stream for status changes and analytics
Operation Layer #
- `PutWorkloadSpec(workload_id, spec, expected_version?)`
  - entry point: control-plane API
  - authoritative decider: workload store owner
  - transition: overwrite `WorkloadSpec`
- `ReportNodeState(node_id, revision, capacity, health, labels)`
  - entry point: node-state API
  - authoritative decider: node-state owner
  - transition: overwrite `NodeState`
- internal scheduling
  - reads pending workload + candidate node set
  - computes `PlacementDecision`
- `BindWorkload(workload_instance_id, node_id, expected_binding_version?)`
  - entry point: scheduler
  - authoritative decider: binding-state owner
  - transition: guarded update to `BindingState`
- internal reconciliation
  - recompute/rebind workloads whose nodes failed or drifted
- snapshot propagation
  - publish latest `NodeSnapshot(version)` to node agents
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| workload update | control-plane API | workload store owner | control-plane node | scheduling platform |
| node-state update | node-state API | node-state owner | control-plane node | scheduling platform |
| placement decision | scheduler | scheduler + binding-state owner | scheduler/control-plane | scheduling platform |
| bind workload | scheduler | binding-state owner | control-plane node | scheduling platform |
| snapshot propagation | node agent / control plane | snapshot publisher | control/data-plane | scheduling platform |
| local execution by node agent | node agent | local node snapshot | node agent | scheduling platform |
Concrete HLD #
Main components:
- control-plane API
- workload state store
- node-state store
- scheduler fleet
- binding/reconciliation controller
- node-snapshot distribution layer
- node agents / kubelet-like workers
Short interview version #
“I’d design the scheduler as an optimization-plus-control-plane system. Users write desired workload state, nodes report current resource and health state, and a scheduler computes legal placements from those inputs. Binding a workload to a node is a guarded state transition with uniqueness semantics, and reconciliation later repairs failed or drifted bindings. Node agents don’t query the scheduler for every action; they consume versioned desired-assignment snapshots and act locally.”