Table of Contents

Distributed Job Scheduler (Airflow / Cron-Class) Analysis Note #

This note captures the full step-by-step analysis for a distributed job scheduler: schedule intent, due-time evaluation, run creation, worker claim, execution lifecycle, and result/status projection.

Step 1 — Normalize #

Assume the baseline prompt is:

design a distributed job scheduler like cron / Airflow scheduler
users define one-time or recurring schedules
scheduler creates runs when schedules become due
workers execute runs
status/results should be visible
retries and crash recovery matter

Requirement	Actor	Operation	State touched	Priority
User creates/updates schedule definition	Client	overwrite state	`S1` `update target` `ScheduleSpec`	C1
System determines that schedule is due	System	state transition	`S1` `update target` `ScheduleState`	C1
System creates a job run	System	append event	`S1` `create target` `JobRun`	C1
Worker claims runnable job run	Worker	state transition	`S1` `update target` `RunState`	C1
Worker reports completion/failure	Worker	state transition	`S1` `update target` `RunState`	C1
System retries or reschedules failed/timed-out run	System	async process	`S1` `hidden write target` `RunState`	C1
User reads schedule/run status	Client	read projection	`S1` `read projection target` `SchedulerStatusView`	R2

Notes on normalization:

schedule definition is overwrite-state intent
due-time recognition is a lifecycle transition on schedule state
run creation is append-only fact
worker claim/completion/retry are execution lifecycle transitions
status view is derived

This is a composition of:

Future Constraint + Claimable Run
Claimable Execution Process

Step 2 — Critical Path Selection #

Requirement	Priority class	Why
Create/update schedule definition	C1	changes future execution truth
Mark schedule as due	C1	wrong due detection causes missed or duplicate runs
Create job run	C1	run creation is core execution truth
Claim runnable job run	C1	one execution attempt should own a run at a time
Report run completion/failure	C1	run lifecycle truth depends on it
Retry/reschedule failed/timed-out run	C1	crash recovery and retry semantics depend on it
Read schedule/run status	R2	operational/user-facing only

Critical paths:

P1 update schedule definition
P2 mark schedule due
P3 create job run
P4 claim run
P5 complete/fail run
P6 retry/reschedule run

Step 3 — Primary State Extraction #

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
ScheduleSpec	direct noun	Yes	keep as candidate	intent	Yes	service	overwrite	instance	schedule_id
ScheduleState	lifecycle object	Yes	keep as candidate	process	Yes	service	state machine	instance	schedule_id
JobRun	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	run_id
RunState	lifecycle object	Yes	keep as candidate	process	Yes	service	state machine	instance	run_id
SchedulerStatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	schedule_id or run_id

Minimal primary set:

ScheduleSpec
ScheduleState
JobRun
RunState

Important modeling choices:

`ScheduleSpec` #

This is future intent.

`ScheduleState` #

Worth making explicit because:

next due time
pause/resume
backfill/last-fired tracking
lifecycle of the schedule itself

`JobRun` #

Primary immutable fact:

each actual execution attempt/run identity

`RunState` #

Primary because:

claim, running, success, failure, retry, timeout are all current lifecycle state

Step 4 — Hard Invariants #

Path	Tier	Type	Invariant statement
`P1` update schedule definition	HARD	ordering	Schedule-spec revisions are ordered by monotonic version within `schedule_id`.
`P2` mark schedule due	HARD	eligibility	`mark_due` is valid only if current time and schedule lifecycle state make the schedule due under current schedule spec.
`P3` create job run	HARD	uniqueness	Key `(schedule_id, fire_time)` maps to at most one logical outcome `job run creation` within schedule firing scope.
`P4` claim run	HARD	uniqueness	Key `run_id` maps to at most one logical outcome `current active execution owner` within run-attempt scope.
`P5` complete/fail run	HARD	eligibility	`complete_or_fail_run` is valid only if current `RunState` is owned by the reporting execution attempt and lifecycle state allows completion/failure.
`P6` retry/reschedule run	HARD	eligibility	`retry_or_reschedule_run` is valid only if current run state and retry policy allow a new attempt or state transition.

What matters most:

no duplicate run creation for same schedule fire instant
one active worker/owner per run attempt
legal run lifecycle transitions only
due-time logic must not miss or duplicate runs

Step 5 — Execution Context #

Field	Value	Why
Topology	single service distributed	one logical scheduler system with many workers
Write coordination scope	per object scope	correctness is per schedule and per run
Read consistency target	strong only	due-time detection and claim/run lifecycle need authoritative current state
Holder model	worker	workers temporarily own active run execution
Compensation acceptable?	No	duplicate or missed runs are not safely compensable in baseline correctness

Derived:

holder_may_crash = true
bounded_staleness_allowed = false
exclusive_claim_required = true
guarded_by_current_state = true

This implies:

authoritative schedule/run state
claimable execution ownership
strong lifecycle checks

Step 6 — Deterministic Mechanism Selection #

Path	Write shape	Base mechanism	Required companions
`P1` update schedule definition	overwrite current value	CAS on version	schedule version
`P2` mark schedule due	guarded state transition	CAS on `(state, version)` or single-writer scheduler decision	schedule version, time source
`P3` create job run	append-only event	append log	dedup key `(schedule_id, fire_time)`
`P4` claim run	exclusive claim	lease	fencing token, heartbeat
`P5` complete/fail run	guarded state transition	CAS on `(state, version)`	attempt token / execution epoch
`P6` retry/reschedule run	guarded state transition	CAS on `(state, version)`	retry policy version

Why these fit:

schedule spec is current intent/config
due-time transition depends on current schedule state and clock
run creation is immutable fact
execution ownership is a claim/lease problem
completion/retry are guarded lifecycle transitions

Step 7 — Read Model / Source of Truth #

Concept	Truth	Read path	Rebuild path
`C1` schedule definition	`ScheduleSpec`	read source directly	authoritative schedule store
`C2` current schedule lifecycle	`ScheduleState`	read source directly	authoritative schedule-state store
`C3` run creation fact	`JobRun`	read source directly	authoritative run store
`C4` current run lifecycle	`RunState`	read source directly	authoritative run-state store
`C5` status/dashboard	derived	materialized view	recompute from primary state

Important point:

scheduler decisions should read authoritative schedule and run state
workers should claim against authoritative run lifecycle state
status is a derived projection only

Step 8 — Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
schedule update	retry with schedule version	stale update loses CAS	committed schedule survives crash if persisted	status propagation may lag	n/a
mark due	retry safe with version/time guard	duplicate due transition loses guard	scheduler crash delays due marking; next scheduler retries	n/a	n/a
create run	retry with `(schedule_id, fire_time)` dedup	duplicate creation loses unique dedup check	committed run creation survives crash if persisted	status propagation may lag	n/a
claim run	retry with execution epoch	only one worker should hold current run lease	if claim committed and worker crashes, lease expires and run can be retried/reclaimed	n/a	stale worker fenced by attempt epoch
complete/fail run	retry with current attempt token	stale or wrong attempt loses guarded transition	committed completion/failure survives crash if persisted	result/status publication may lag	stale worker cannot complete after lease loss
retry/reschedule run	retry with current run version	duplicate retry transition loses guard	scheduler crash delays retry; next scheduler retries	n/a	stale scheduler/worker rejected by newer run state

What matters most:

dedup run creation by schedule fire instant
claim fencing for worker ownership
strong lifecycle validation on completion and retry
crash recovery from expired worker claims

Step 9 — Scale Adjustments #

Hotspot	Type	First response
huge number of schedules becoming due simultaneously	write throughput hotspot	shard schedule evaluation by due-time buckets and parallelize due detection
worker-claim contention on runnable queue	contention hotspot	partition runnable runs and distribute claim load
scheduler crash/recovery backlog	contention hotspot	lease-based scheduler ownership of due partitions and incremental recovery
large status/dashboard reads	read hotspot	derived views only
retry storms after worker failures	write throughput hotspot	backoff, jitter, and bounded concurrent retries
time-bucket scans	write throughput hotspot	indexed next-due buckets and incremental scans

What scales well:

schedule definitions and run lifecycles are partitionable
run claims can be partitioned across runnable queues
workers can scale horizontally as claimers

What fails first:

synchronized cron spikes
claim contention for runnable work
retry storms after systemic worker failures

Canonical design conclusion:

archetype composition:
- Future Constraint + Claimable Run
- Claimable Execution Process
primary truth:
- ScheduleSpec
- ScheduleState
- JobRun
- RunState
hot execution path:
- due detection
- append run creation
- worker claim
- guarded completion/retry

Concrete Substrate #

control/scheduler plane in Go/Java
authoritative schedule and run state in strongly consistent store
due-time index over schedules
durable run queue/log for created runs
worker fleet claiming runs with leases
derived status/log views from primary schedule/run state

Operation Layer #

PutSchedule(schedule_id, spec, expected_version?)

entry point: scheduler API
authoritative decider: schedule store owner
transition: overwrite ScheduleSpec

internal due detection

scan/index schedules by next due time
guarded transition to due state and append run creation

CreateRun(schedule_id, fire_time)

internal scheduler action
authoritative decider: run store owner
transition: append JobRun

ClaimRun(run_id, worker_id, attempt_epoch)

entry point: worker
authoritative decider: run-state owner
transition: exclusive claim on RunState

CompleteRun(run_id, attempt_epoch, result) / FailRun(...)

entry point: worker
authoritative decider: run-state owner
transition: guarded completion/failure

internal retry/reschedule

transition run state to retryable / create next attempt as allowed by policy

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
schedule update	scheduler API	schedule store owner	control-plane node	job scheduler
due detection	scheduler	schedule/run-state owner	internal	job scheduler
run creation	scheduler	run store owner	internal	job scheduler
claim run	worker	run-state owner	control-plane node / run-state owner	job scheduler
complete/fail run	worker	run-state owner	control-plane node	job scheduler
retry/reschedule	scheduler	run-state owner	internal	job scheduler

Concrete HLD #

Main components:

scheduler API
schedule store + due-time index
scheduler fleet
run store / runnable-run queue
worker fleet
status/dashboard views

Short interview version #

“I’d design the distributed scheduler as a future-constraint plus claimable-run system. Schedule definitions are authoritative intent, due detection creates immutable run records keyed by (schedule_id, fire_time), and workers claim runs with fenced ownership before executing them. Completion and retry are guarded lifecycle transitions on run state, so duplicate claims or stale workers cannot corrupt execution truth. The main scaling levers are sharded due-time buckets, partitioned runnable queues, and careful handling of synchronized schedule spikes.”