
Container Scheduling / Placement (Kubernetes Scheduler-Class) Analysis Note #

This note captures the full step-by-step analysis for a Kubernetes-scheduler-class container placement system: workload intent, node resource state, scheduling decisions, binding/claim semantics, and reconciliation.

Step 1 — Normalize #

Assume the baseline prompt is:

design a container scheduling / placement system like the Kubernetes scheduler
users submit workloads (pods/jobs)
system places workloads onto eligible nodes
placement must respect resource capacity and policy constraints
failed nodes or changed cluster state should trigger rescheduling/reconciliation
system scales across large clusters

Requirement	Actor	Operation	State touched	Priority
User creates or updates workload intent	Client	overwrite state	`S1` `update target` `WorkloadSpec`	C1
System records node resource/health state	System	overwrite state	`S1` `update target` `NodeState`	C1
Scheduler computes placement decision	System	state transition	`S1` `update target` `PlacementDecision`	C1
Scheduler binds workload to node	System	state transition	`S1` `update target` `BindingState`	C1
System reconciles workload after node failure or drift	System	async process	`S1` `hidden write target` `BindingState`	C1
System propagates placement/config snapshot to node agents	System	async process	`S1` `hidden write target` `NodeSnapshot`	C1
User reads workload/scheduling status	Client	read projection	`S1` `read projection target` `SchedulingStatusView`	R2

Notes on normalization:

workload spec is overwrite-state desired intent
node state is overwrite-state current capacity/health truth
placement decision and binding are lifecycle transitions
reconciliation is internal async process
node snapshots are control-plane dissemination to agents

This system is a composition of:

Optimization / Matching Decision
Control Plane + Data Plane

The core correctness center is the scheduling/binding decision.

Step 2 — Critical Path Selection #

Requirement	Priority class	Why
Create/update workload intent	C1	desired workload truth drives all placement
Record node resource/health state	C1	stale node truth causes invalid placements
Compute placement decision	C1	scheduler must pick an eligible node
Bind workload to node	C1	one current binding should win per schedulable workload instance
Reconcile after node failure/drift	C1	system must restore desired placement safely
Propagate placement/config snapshot to node agents	C1	stale node view can delay or corrupt execution
Read scheduling status	R2	operational only

Critical paths:

P1 update workload spec
P2 update node state
P3 compute placement decision
P4 bind workload to node
P5 reconcile after failure/drift
P6 propagate node snapshot

Step 3 — Primary State Extraction #

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
WorkloadSpec	direct noun	Yes	keep as candidate	intent	Yes	service	overwrite	instance	workload_id
NodeState	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	instance	node_id
PlacementDecision	hidden write target	Yes	keep as candidate	process	Yes	service	overwrite	instance	workload_id
BindingState	lifecycle object	Yes	keep as candidate	process	Yes	service	state machine	instance	workload_instance_id
NodeSnapshot	hidden write target	Yes	keep as candidate	projection	Yes	service	overwrite	instance	node_id
SchedulingStatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	namespace or workload scope

Minimal primary set:

WorkloadSpec
NodeState
PlacementDecision
BindingState
NodeSnapshot
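As a rough illustration, the minimal primary set could be modeled with plain records like the ones below. This is only a sketch: the scope keys (`workload_id`, `node_id`, `workload_instance_id`) come from the table above, while every other field name (`cpu_millis`, `memory_mib`, `node_selector`, `assigned_instances`) and the `BindingPhase` values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class BindingPhase(Enum):
    # Hypothetical lifecycle phases for the BindingState state machine.
    PENDING = "pending"
    BOUND = "bound"
    FAILED = "failed"


@dataclass
class WorkloadSpec:
    # Desired intent, overwritten per workload_id under a monotonic version.
    workload_id: str
    version: int
    cpu_millis: int
    memory_mib: int
    node_selector: Dict[str, str] = field(default_factory=dict)


@dataclass
class NodeState:
    # Current capacity/health truth, overwritten per node_id.
    node_id: str
    revision: int
    allocatable_cpu_millis: int
    allocatable_memory_mib: int
    healthy: bool = True
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class PlacementDecision:
    workload_id: str
    decision_version: int
    node_id: Optional[str]  # None when no eligible node was found


@dataclass
class BindingState:
    # State machine scoped to a single workload instance.
    workload_instance_id: str
    version: int
    phase: BindingPhase = BindingPhase.PENDING
    node_id: Optional[str] = None


@dataclass
class NodeSnapshot:
    # Versioned desired assignments disseminated to one node agent.
    node_id: str
    version: int
    assigned_instances: Dict[str, str] = field(default_factory=dict)
```

`SchedulingStatusView` is deliberately absent: it is derived from these records rather than stored as primary state.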

Important modeling choices:

`WorkloadSpec` #

This is desired state / intent.

`PlacementDecision` #

Worth keeping explicit because:

the scheduler computes a chosen placement from current cluster inputs
decision quality and legality matter

`BindingState` #

Primary because:

binding a workload instance to a node has lifecycle and exclusivity semantics

`NodeSnapshot` #

Important because:

node agents act on propagated desired assignments and config rather than issuing synchronous control-plane reads for every reconciliation step

Step 4 — Hard Invariants #

Path	Tier	Type	Invariant statement
`P1` update workload spec	HARD	ordering	Workload-spec revisions are ordered by monotonic version within `workload_id`.
`P2` update node state	HARD	ordering	Node-state updates are ordered by monotonic observation revision/timestamp within `node_id`.
`P3` compute placement decision	HARD	eligibility	`place_workload` is valid only if selected node is eligible under current workload constraints, node capacity, and policy state at decision time.
`P4` bind workload to node	HARD	uniqueness	Key `workload_instance_id` maps to at most one logical outcome `current active binding` within that workload-instance scope.
`P4` bind workload to node	HARD	eligibility	`bind_workload` is valid only if selected node still has sufficient allocatable capacity and binding state allows the transition.
`P5` reconcile after failure/drift	HARD	eligibility	`reconcile_binding` is valid only if the currently observed cluster state shows the active binding is invalid, failed, or drifted relative to desired workload state.
`P6` propagate node snapshot	HARD	freshness	`NodeSnapshot(node_id)` reflects authoritative desired assignments/config within configured propagation bound.

What matters most:

scheduler must not place onto ineligible or over-capacity nodes
only one binding should win per workload instance
node failure/drift should trigger legal rebinding
node agents must converge to current desired assignment state
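The `P3` eligibility invariant can be sketched as a filter step followed by a scoring step. The example below is a minimal illustration, not the actual Kubernetes scheduler algorithm: the `Request` and `Node` types, their fields, and the "most free CPU first" scoring rule are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Request:
    cpu_millis: int
    memory_mib: int
    node_selector: Dict[str, str] = field(default_factory=dict)


@dataclass
class Node:
    node_id: str
    free_cpu_millis: int
    free_memory_mib: int
    healthy: bool = True
    labels: Dict[str, str] = field(default_factory=dict)


def is_eligible(req: Request, node: Node) -> bool:
    """Filter step: health, capacity, and label policy must all hold."""
    if not node.healthy:
        return False
    if node.free_cpu_millis < req.cpu_millis:
        return False
    if node.free_memory_mib < req.memory_mib:
        return False
    return all(node.labels.get(k) == v for k, v in req.node_selector.items())


def place_workload(req: Request, nodes: List[Node]) -> Optional[str]:
    """Return the chosen node_id, or None if no node is eligible."""
    candidates = [n for n in nodes if is_eligible(req, n)]
    if not candidates:
        return None
    # Score step (assumed rule): most free CPU, then most free memory,
    # with node_id as a deterministic tie-breaker.
    best = min(
        candidates,
        key=lambda n: (-n.free_cpu_millis, -n.free_memory_mib, n.node_id),
    )
    return best.node_id
```

A decision produced this way is only valid at decision time; the bind step still has to re-check capacity against authoritative state.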

Step 5 — Execution Context #

Field	Value	Why
Topology	single service distributed	one logical scheduling system with many nodes/agents
Write coordination scope	per object scope	correctness is per workload instance, node, and node snapshot scope
Read consistency target	bounded stale allowed	scheduler can often operate on slightly stale node views, but binding must still be guarded
Holder model	node	nodes temporarily hold capacity/binding for workload instances
Compensation acceptable?	No	illegal overcommit or duplicate active binding is not a compensable workflow in baseline correctness

Derived:

holder_may_crash = true
bounded_staleness_allowed = true
exclusive_claim_required = true
guarded_by_current_state = true

This implies:

authoritative workload/node state in control plane
guarded placement/binding transitions
bounded-stale scheduling inputs but authoritative bind checks

Step 6 — Deterministic Mechanism Selection #

Path	Write shape	Base mechanism	Required companions
`P1` update workload spec	overwrite current value	CAS on version	workload version
`P2` update node state	overwrite current value	CAS on version or monotonic overwrite	node revision/timestamp
`P3` compute placement decision	guarded state transition	single writer scheduler decision	decision version
`P4` bind workload to node	guarded state transition	CAS on `(state, version)` or single writer bind with optimistic concurrency	binding version, resource check
`P5` reconcile after failure/drift	guarded state transition	single writer reconciler decision + CAS bind update	binding version
`P6` propagate node snapshot	overwrite current value	single writer snapshot publication	snapshot version

Why these fit:

workload/node state are current-state config/status values
placement is an eligibility-driven decision
binding is a guarded current-state transition with uniqueness/exclusivity
reconciliation is another guarded rebinding path
node snapshot propagation is versioned overwrite dissemination
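The "CAS on version" mechanism used for `P1` (and, with a different key, for the other overwrite paths) can be sketched with an in-memory store. This is an illustration under assumptions: `VersionedStore` and `VersionConflict` are hypothetical names, a lock stands in for whatever the real store uses to serialize writes, and `expected_version=0` is taken to mean "the key must not exist yet".

```python
import threading
from typing import Any, Dict, Optional, Tuple


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, Any]] = {}

    def get(self, key: str) -> Optional[Tuple[int, Any]]:
        """Return (version, value), or None if the key is absent."""
        with self._lock:
            return self._data.get(key)

    def put(self, key: str, value: Any, expected_version: int) -> int:
        """Compare-and-set: write only if the stored version matches."""
        with self._lock:
            current_version = self._data.get(key, (0, None))[0]
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected {expected_version}, found {current_version}"
                )
            new_version = current_version + 1
            self._data[key] = (new_version, value)
            return new_version
```

A writer that loses the race gets `VersionConflict`, re-reads the current value, and decides whether its update still applies. That is the "stale update loses CAS" behavior the failure table relies on.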

Step 7 — Read Model / Source of Truth #

Concept	Truth	Read path	Rebuild path
`C1` workload intent	`WorkloadSpec`	read source directly	authoritative workload store
`C2` node resource/health state	`NodeState`	read source directly	authoritative node-state store
`C3` placement decision	`PlacementDecision`	read source directly	recompute from workload + node inputs
`C4` binding lifecycle	`BindingState`	read source directly	authoritative binding store
`C5` node snapshot	`NodeSnapshot`	materialized view	rebuild from latest authoritative desired assignment state
`C6` scheduling status	derived	materialized view	recompute from primary state

Important point:

Node agents should act on propagated desired snapshots rather than synchronously querying the scheduler for every local action.

The scheduler itself may read bounded-stale node state, but binding must still validate against sufficiently current authoritative state.
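The split between bounded-stale scheduling inputs and an authoritative bind check can be sketched as follows. `BindingAuthority` and `BindRejected` are hypothetical names, CPU is the only resource tracked, and a single lock stands in for the store's transactional guarantee; a real system would need capacity and binding state to be checked and updated atomically by some equivalent means.

```python
import threading
from typing import Dict, Optional, Tuple


class BindRejected(Exception):
    """Raised when the guarded bind transition is not allowed."""


class BindingAuthority:
    """Authoritative owner of binding state and allocatable capacity."""

    def __init__(self, free_cpu_millis: Dict[str, int]) -> None:
        self._lock = threading.Lock()
        self._free_cpu = dict(free_cpu_millis)
        # instance id -> (binding version, node id); absent means unbound.
        self._bindings: Dict[str, Tuple[int, Optional[str]]] = {}

    def bind_workload(
        self,
        workload_instance_id: str,
        node_id: str,
        cpu_millis: int,
        expected_binding_version: int = 0,
    ) -> int:
        with self._lock:
            version, _ = self._bindings.get(workload_instance_id, (0, None))
            # Uniqueness guard: a stale or duplicate bind loses here.
            if version != expected_binding_version:
                raise BindRejected("stale binding version")
            # Capacity guard: checked against authoritative state,
            # not against the scheduler's cached view.
            if self._free_cpu.get(node_id, 0) < cpu_millis:
                raise BindRejected("insufficient capacity")
            self._free_cpu[node_id] -= cpu_millis
            self._bindings[workload_instance_id] = (version + 1, node_id)
            return version + 1
```

If two schedulers both read a cached view showing 2000 free millicores on a node and each tries to bind a 1500-millicore instance there, the first bind commits and the second is rejected, so the stale read costs a retry rather than an overcommit.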

Step 8 — Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
workload update	retry with workload version	stale update loses CAS	committed workload spec survives crash if persisted	snapshot propagation may lag	n/a
node-state update	retry with monotonic observation revision	latest valid node view wins	committed node state survives crash if persisted	snapshot propagation may lag	n/a
placement decision	retry safe from workload/node inputs	single scheduler/decision version wins	scheduler crash delays placement; next scheduler retries	n/a	stale decision rejected by newer binding/version
bind workload	retry with binding version	stale/duplicate bind loses CAS	committed bind survives crash if persisted	node-snapshot propagation may lag	stale node binding fenced by newer binding version
reconcile after failure/drift	retry safe from current cluster state	only one active rebinding should win	reconciler crash delays recovery; next reconciler retries	snapshot propagation may lag	stale node/agent should stop acting after newer desired snapshot
snapshot propagation	retry with versioned snapshot	older snapshot loses to newer version	node keeps last good snapshot until refresh	failed push retried or pulled	n/a

What matters most:

one binding wins per workload instance
bind must validate against current capacity and state
node agents move monotonically forward by snapshot version
reconciliation must safely replace failed/invalid bindings
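The "node agents move monotonically forward by snapshot version" rule can be sketched as a small agent-side guard. The `NodeAgent` type, its fields, and the shape of the assignments map are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodeAgent:
    node_id: str
    snapshot_version: int = 0
    # instance id -> binding version, as carried by the snapshot (assumed shape).
    assignments: Dict[str, int] = field(default_factory=dict)

    def apply_snapshot(self, version: int, assignments: Dict[str, int]) -> bool:
        """Accept only strictly newer snapshots; older or duplicate ones are ignored."""
        if version <= self.snapshot_version:
            return False
        self.snapshot_version = version
        self.assignments = dict(assignments)
        return True

    def may_run(self, workload_instance_id: str) -> bool:
        """A stale holder stops acting once a newer snapshot drops the instance."""
        return workload_instance_id in self.assignments
```

Because a rejected snapshot leaves local state untouched, retries and out-of-order deliveries are harmless, and the agent keeps its last good snapshot until a newer one arrives.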

Step 9 — Scale Adjustments #

Hotspot	Type	First response
large scheduling queue bursts	write throughput hotspot	shard scheduler work queues and parallelize scoring/filtering
very large cluster/node state	read hotspot	partition node state by region/zone/pool and cache scheduler inputs
frequent node-state churn	contention hotspot	dampen updates and aggregate node-state deltas
binding/reconciliation storms after failures	contention hotspot	batch reconciliation and rate-limit rebinding
snapshot propagation to many agents	fan-out hotspot	publish incremental node snapshots and support pull-on-version-miss
status/dashboard reads	read hotspot	derived views only

What scales well:

scheduler can parallelize scoring
node agents act from local snapshots
desired-state model allows reconciliation workers to operate independently

What fails first:

cluster-wide failure causing rebinding storms
huge node-state churn
large fanout of node snapshots
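One way to rate-limit rebinding during a failure storm is a token bucket in front of the rebind queue. The sketch below is an assumption about how that could look, not a prescribed design: `TokenBucket` and `drain_rebinds` are hypothetical names, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import List, Tuple


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Refill based on elapsed time, then take one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def drain_rebinds(
    pending: List[str], bucket: TokenBucket, now: float
) -> Tuple[List[str], List[str]]:
    """Split pending rebinds into those admitted now and those deferred."""
    admitted: List[str] = []
    deferred: List[str] = []
    for instance_id in pending:
        if bucket.try_acquire(now):
            admitted.append(instance_id)
        else:
            deferred.append(instance_id)
    return admitted, deferred
```

Deferred instances stay in the queue and are retried on the next pass, which bounds the write rate against the binding store at the cost of slower recovery.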

Canonical design conclusion:

archetype composition:
- Optimization / Matching Decision
- Control Plane + Data Plane
primary truth:
- WorkloadSpec
- NodeState
- PlacementDecision
- BindingState
- NodeSnapshot
hot path:
- scheduler computes placement from current cluster state
- guarded bind commits assignment
- node agents act from local snapshots

Concrete Substrate #

control plane in Go/Java
authoritative cluster state store in etcd or another strongly consistent control-plane DB
scheduler fleet consuming pending workload queue
node agents/kubelet-like workers consuming versioned desired assignment snapshots
optional event stream for status changes and analytics

Operation Layer #

PutWorkloadSpec(workload_id, spec, expected_version?)

entry point: control-plane API
authoritative decider: workload store owner
transition: overwrite WorkloadSpec

ReportNodeState(node_id, revision, capacity, health, labels)

entry point: node-state API
authoritative decider: node-state owner
transition: overwrite NodeState

internal scheduling

reads pending workload + candidate node set
computes PlacementDecision

BindWorkload(workload_instance_id, node_id, expected_binding_version?)

entry point: scheduler
authoritative decider: binding-state owner
transition: guarded update to BindingState

internal reconciliation

recompute/rebind workloads whose nodes failed or drifted

snapshot propagation

publish latest NodeSnapshot(version) to node agents
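The internal reconciliation operation can be sketched as a pure function that compares desired instances with current bindings and node health. The function name and input shapes are assumptions, and "drift" is reduced here to two cases: an instance that is unbound or bound to an unhealthy node, and a binding with no matching desired instance.

```python
from typing import Dict, List, Set, Tuple


def plan_reconciliation(
    desired_instances: Set[str],
    bindings: Dict[str, str],
    healthy_nodes: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (to_rebind, to_release) for one reconciliation pass.

    bindings maps workload_instance_id -> node_id for active bindings.
    """
    # Desired instances that are unbound or bound to a failed node.
    to_rebind = sorted(
        i for i in desired_instances if bindings.get(i) not in healthy_nodes
    )
    # Active bindings that no longer correspond to desired state.
    to_release = sorted(i for i in bindings if i not in desired_instances)
    return to_rebind, to_release
```

The plan only names candidates; each actual rebind would still go through the guarded `BindWorkload` transition so that only one active binding wins per workload instance.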

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
workload update	control-plane API	workload store owner	control-plane node	scheduling platform
node-state update	node-state API	node-state owner	control-plane node	scheduling platform
placement decision	scheduler	scheduler + binding-state owner	scheduler/control-plane	scheduling platform
bind workload	scheduler	binding-state owner	control-plane node	scheduling platform
snapshot propagation	node agent / control plane	snapshot publisher	control/data-plane	scheduling platform
local execution by node agent	node agent	local node snapshot	node agent	scheduling platform

Concrete HLD #

Main components:

control-plane API
workload state store
node-state store
scheduler fleet
binding/reconciliation controller
node-snapshot distribution layer
node agents / kubelet-like workers

Short interview version #

“I’d design the scheduler as an optimization-plus-control-plane system. Users write desired workload state, nodes report current resource and health state, and a scheduler computes legal placements from those inputs. Binding a workload to a node is a guarded state transition with uniqueness semantics, and reconciliation later repairs failed or drifted bindings. Node agents don’t query the scheduler for every action; they consume versioned desired-assignment snapshots and act locally.”