Distributed Job Scheduler (Airflow / Cron-Class) Analysis Note #
This note captures the full step-by-step analysis for a distributed job scheduler: schedule intent, due-time evaluation, run creation, worker claim, execution lifecycle, and result/status projection.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a distributed job scheduler like cron / Airflow scheduler
- users define one-time or recurring schedules
- scheduler creates runs when schedules become due
- workers execute runs
- status/results should be visible
- retries and crash recovery matter
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| User creates/updates schedule definition | Client | overwrite state | S1: update, target ScheduleSpec | C1 |
| System determines that schedule is due | System | state transition | S1: update, target ScheduleState | C1 |
| System creates a job run | System | append event | S1: create, target JobRun | C1 |
| Worker claims runnable job run | Worker | state transition | S1: update, target RunState | C1 |
| Worker reports completion/failure | Worker | state transition | S1: update, target RunState | C1 |
| System retries or reschedules failed/timed-out run | System | async process | S1: hidden write, target RunState | C1 |
| User reads schedule/run status | Client | read projection | S1: read projection, target SchedulerStatusView | R2 |
Notes on normalization:
- schedule definition is overwrite-state intent
- due-time recognition is a lifecycle transition on schedule state
- run creation is append-only fact
- worker claim/completion/retry are execution lifecycle transitions
- status view is derived
This is a composition of:
- Future Constraint + Claimable Run
- Claimable Execution Process
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Create/update schedule definition | C1 | changes future execution truth |
| Mark schedule as due | C1 | wrong due detection causes missed or duplicate runs |
| Create job run | C1 | run creation is core execution truth |
| Claim runnable job run | C1 | one execution attempt should own a run at a time |
| Report run completion/failure | C1 | run lifecycle truth depends on it |
| Retry/reschedule failed/timed-out run | C1 | crash recovery and retry semantics depend on it |
| Read schedule/run status | R2 | operational/user-facing only |
Critical paths:
- P1: update schedule definition
- P2: mark schedule due
- P3: create job run
- P4: claim run
- P5: complete/fail run
- P6: retry/reschedule run
Step 3 — Primary State Extraction #
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| ScheduleSpec | direct noun | Yes | keep as candidate | intent | Yes | service | overwrite | instance | schedule_id |
| ScheduleState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | schedule_id |
| JobRun | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | run_id |
| RunState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | run_id |
| SchedulerStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | schedule_id or run_id |
Minimal primary set:
- ScheduleSpec
- ScheduleState
- JobRun
- RunState
Important modeling choices:
ScheduleSpec #
This is future intent.
ScheduleState #
Worth making explicit because it carries:
- next due time
- pause/resume
- backfill/last-fired tracking
- lifecycle of the schedule itself
JobRun #
Primary immutable fact:
- each actual execution attempt/run identity
RunState #
Primary because:
- claim, running, success, failure, retry, timeout are all current lifecycle state
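The four primary objects can be sketched as plain records to make their evolution styles concrete. This is a minimal sketch, not a schema from the note: the field names (`interval_seconds`, `next_due`, `attempt_epoch`, and so on) and the `RunStatus` values are assumptions chosen for illustration. `JobRun` is frozen because it is the append-only fact; the other three are mutable and carry a `version` for guarded updates.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RunStatus(Enum):
    # Hypothetical lifecycle states; the note lists claim, running,
    # success, failure, retry, and timeout without fixing exact names.
    PENDING = "pending"
    CLAIMED = "claimed"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class ScheduleSpec:
    """Overwrite-state intent, ordered by version within schedule_id."""
    schedule_id: str
    interval_seconds: int  # assumption: fixed-interval recurrence only
    version: int = 1


@dataclass
class ScheduleState:
    """State machine for the schedule itself (next due, pause, last fired)."""
    schedule_id: str
    next_due: datetime
    paused: bool = False
    last_fired: Optional[datetime] = None
    version: int = 1


@dataclass(frozen=True)
class JobRun:
    """Immutable fact: a run was created for (schedule_id, fire_time)."""
    run_id: str
    schedule_id: str
    fire_time: datetime


@dataclass
class RunState:
    """Current lifecycle state of one run."""
    run_id: str
    status: RunStatus = RunStatus.PENDING
    owner: Optional[str] = None
    attempt_epoch: int = 0
    version: int = 1
```

Keeping `JobRun` and `RunState` separate means the fact that a run exists never changes, while ownership and status can move through guarded transitions.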
Step 4 — Hard Invariants #
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 update schedule definition | HARD | ordering | Schedule-spec revisions are ordered by monotonic version within schedule_id. |
| P2 mark schedule due | HARD | eligibility | mark_due is valid only if current time and schedule lifecycle state make the schedule due under current schedule spec. |
| P3 create job run | HARD | uniqueness | Key (schedule_id, fire_time) maps to at most one job run within the schedule's firing scope. |
| P4 claim run | HARD | uniqueness | Key run_id maps to at most one active execution owner within run-attempt scope. |
| P5 complete/fail run | HARD | eligibility | complete_or_fail_run is valid only if current RunState is owned by the reporting execution attempt and lifecycle state allows completion/failure. |
| P6 retry/reschedule run | HARD | eligibility | retry_or_reschedule_run is valid only if current run state and retry policy allow a new attempt or state transition. |
What matters most:
- no duplicate run creation for same schedule fire instant
- one active worker/owner per run attempt
- legal run lifecycle transitions only
- due-time logic must not miss or duplicate runs
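The P3 uniqueness invariant can be illustrated with an in-memory store that treats `(schedule_id, fire_time)` as a dedup key. This is a sketch under assumptions: `RunStore`, `create_run`, and the run-id format are hypothetical names, and a real deployment would more likely enforce the same rule with a unique constraint in the authoritative store rather than a Python dict.

```python
from datetime import datetime, timezone


class RunStore:
    """In-memory stand-in for the authoritative run store (hypothetical)."""

    def __init__(self):
        # (schedule_id, fire_time) -> run_id; plays the role of a unique index.
        self._runs = {}

    def create_run(self, schedule_id, fire_time):
        """Create the run for this fire instant, or return the existing one.

        Returns (run_id, created). Retried or competing callers get
        created=False and the same run_id, so the call is idempotent.
        """
        key = (schedule_id, fire_time)
        if key in self._runs:
            return self._runs[key], False
        run_id = f"{schedule_id}@{fire_time.isoformat()}"
        self._runs[key] = run_id
        return run_id, True

    def run_count(self):
        return len(self._runs)
```

A scheduler that crashes after committing a run and then retries the same fire instant sees `created=False` and no second run appears.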
Step 5 — Execution Context #
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical scheduler system with many workers |
| Write coordination scope | per object scope | correctness is per schedule and per run |
| Read consistency target | strong only | due-time detection and claim/run lifecycle need authoritative current state |
| Holder model | worker | workers temporarily own active run execution |
| Compensation acceptable? | No | duplicate or missed runs are not safely compensable in baseline correctness |
Derived:
- holder_may_crash = true
- bounded_staleness_allowed = false
- exclusive_claim_required = true
- guarded_by_current_state = true
This implies:
- authoritative schedule/run state
- claimable execution ownership
- strong lifecycle checks
Step 6 — Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 update schedule definition | overwrite current value | CAS on version | schedule version |
| P2 mark schedule due | guarded state transition | CAS on (state, version) or single-writer scheduler decision | schedule version, time source |
| P3 create job run | append-only event | append log | dedup key (schedule_id, fire_time) |
| P4 claim run | exclusive claim | lease | fencing token, heartbeat |
| P5 complete/fail run | guarded state transition | CAS on (state, version) | attempt token / execution epoch |
| P6 retry/reschedule run | guarded state transition | CAS on (state, version) | retry policy version |
Why these fit:
- schedule spec is current intent/config
- due-time transition depends on current schedule state and clock
- run creation is immutable fact
- execution ownership is a claim/lease problem
- completion/retry are guarded lifecycle transitions
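As an illustration of the "CAS on version" mechanism for P1, the sketch below applies a schedule update only when the caller's expected version matches the stored one. `ScheduleStore`, `VersionConflict`, and the dict-shaped spec are assumptions for the example; the only element taken from the note is the optional `expected_version` argument of `PutSchedule`.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class ScheduleStore:
    """In-memory stand-in for the authoritative schedule store (hypothetical)."""

    def __init__(self):
        self._specs = {}  # schedule_id -> (version, spec)

    def put_schedule(self, schedule_id, spec, expected_version=None):
        """Overwrite the spec if expected_version matches; return the new version.

        A missing schedule is treated as version 0. Passing
        expected_version=None skips the check (unconditional overwrite).
        """
        current_version = self._specs.get(schedule_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"{schedule_id}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._specs[schedule_id] = (new_version, spec)
        return new_version

    def get(self, schedule_id):
        return self._specs[schedule_id]
```

Two clients that both read version 1 and then write will see one succeed and the other raise `VersionConflict`, which is the "stale update loses CAS" behavior. In a real store the compare and the write must be one atomic operation; here that holds only because the example is single-threaded.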
Step 7 — Read Model / Source of Truth #
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 schedule definition | ScheduleSpec | read source directly | authoritative schedule store |
| C2 current schedule lifecycle | ScheduleState | read source directly | authoritative schedule-state store |
| C3 run creation fact | JobRun | read source directly | authoritative run store |
| C4 current run lifecycle | RunState | read source directly | authoritative run-state store |
| C5 status/dashboard | derived | materialized view | recompute from primary state |
Important point:
- scheduler decisions should read authoritative schedule and run state
- workers should claim against authoritative run lifecycle state
- status is a derived projection only
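Because the status view is derived, it can be rebuilt from primary state at any time. The function below is a small sketch of such a rebuild; `build_status_view` and the input shapes (a list of `(run_id, schedule_id)` pairs and a dict of run statuses) are assumptions made for the example, not a format given in the note.

```python
from collections import Counter


def build_status_view(runs, run_states):
    """Recompute a per-schedule status summary from primary state.

    runs:       iterable of (run_id, schedule_id) pairs (the JobRun facts)
    run_states: dict of run_id -> status string (the RunState rows)
    Runs with no state row yet are counted as "pending".
    """
    view = {}
    for run_id, schedule_id in runs:
        status = run_states.get(run_id, "pending")
        view.setdefault(schedule_id, Counter())[status] += 1
    # Plain dicts make the projection easy to serialize for a dashboard.
    return {schedule_id: dict(counts) for schedule_id, counts in view.items()}
```

Since the function reads only primary state and writes nothing back, a lagging or lost view costs freshness rather than correctness.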
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| schedule update | retry with schedule version | stale update loses CAS | committed schedule survives crash if persisted | status propagation may lag | n/a |
| mark due | retry safe with version/time guard | duplicate due transition loses guard | scheduler crash delays due marking; next scheduler retries | n/a | n/a |
| create run | retry with (schedule_id, fire_time) dedup | duplicate creation loses unique dedup check | committed run creation survives crash if persisted | status propagation may lag | n/a |
| claim run | retry with execution epoch | only one worker should hold current run lease | if claim committed and worker crashes, lease expires and run can be retried/reclaimed | n/a | stale worker fenced by attempt epoch |
| complete/fail run | retry with current attempt token | stale or wrong attempt loses guarded transition | committed completion/failure survives crash if persisted | result/status publication may lag | stale worker cannot complete after lease loss |
| retry/reschedule run | retry with current run version | duplicate retry transition loses guard | scheduler crash delays retry; next scheduler retries | n/a | stale scheduler/worker rejected by newer run state |
What matters most:
- dedup run creation by schedule fire instant
- claim fencing for worker ownership
- strong lifecycle validation on completion and retry
- crash recovery from expired worker claims
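The claim and completion rows above can be sketched as a lease table with an attempt epoch used as the fencing token. Everything named here (`RunLeaseTable`, the status strings, the explicit `now` argument) is an assumption for illustration. Note one simplification: completion is fenced on the epoch only, so a holder whose lease has expired but has not yet been reclaimed can still complete; a stricter policy would also check the expiry time.

```python
class RunLeaseTable:
    """In-memory stand-in for the run-state owner (hypothetical)."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self._rows = {}  # run_id -> {"status", "owner", "epoch", "expires_at"}

    def add_run(self, run_id):
        self._rows[run_id] = {
            "status": "pending", "owner": None, "epoch": 0, "expires_at": 0.0,
        }

    def claim(self, run_id, worker_id, now):
        """Return a new attempt epoch on success, or None if not claimable."""
        row = self._rows[run_id]
        if row["status"] in ("succeeded", "failed"):
            return None  # terminal state, nothing to claim
        if row["status"] == "claimed" and now < row["expires_at"]:
            return None  # another worker holds a live lease
        # Pending, or the previous holder's lease expired: bump the epoch so
        # the previous holder is fenced out of later writes.
        row.update(
            status="claimed",
            owner=worker_id,
            epoch=row["epoch"] + 1,
            expires_at=now + self.lease_seconds,
        )
        return row["epoch"]

    def heartbeat(self, run_id, epoch, now):
        """Extend the lease only for the current, unexpired holder."""
        row = self._rows[run_id]
        if row["status"] != "claimed" or row["epoch"] != epoch:
            return False
        if now >= row["expires_at"]:
            return False
        row["expires_at"] = now + self.lease_seconds
        return True

    def complete(self, run_id, epoch, succeeded):
        """Guarded transition: only the current attempt epoch may finish."""
        row = self._rows[run_id]
        if row["status"] != "claimed" or row["epoch"] != epoch:
            return False
        row["status"] = "succeeded" if succeeded else "failed"
        return True
```

If worker `w1` claims at epoch 1 and stops heartbeating, `w2` can reclaim at epoch 2 once the lease expires, and a late `complete` from `w1` with epoch 1 is rejected.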
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| huge number of schedules becoming due simultaneously | write throughput hotspot | shard schedule evaluation by due-time buckets and parallelize due detection |
| worker-claim contention on runnable queue | contention hotspot | partition runnable runs and distribute claim load |
| scheduler crash/recovery backlog | contention hotspot | lease-based scheduler ownership of due partitions and incremental recovery |
| large status/dashboard reads | read hotspot | derived views only |
| retry storms after worker failures | write throughput hotspot | backoff, jitter, and bounded concurrent retries |
| time-bucket scans | write throughput hotspot | indexed next-due buckets and incremental scans |
What scales well:
- schedule definitions and run lifecycles are partitionable
- run claims can be partitioned across runnable queues
- workers can scale horizontally as claimers
What fails first:
- synchronized cron spikes
- claim contention for runnable work
- retry storms after systemic worker failures
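For the retry-storm hotspot, the "backoff, jitter" response can be sketched as exponential backoff with full jitter. The function name, the `base` and `cap` defaults, and the choice of full jitter specifically are assumptions; the note names backoff and jitter but does not pick a formula.

```python
import random


def retry_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a delay in seconds for the given retry attempt (0-based).

    Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)] so that
    runs which failed together do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests. The cap bounds how long a run can wait; bounding how many retries run concurrently needs a separate limit, which this sketch does not cover.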
Canonical design conclusion:
- archetype composition:
  - Future Constraint + Claimable Run
  - Claimable Execution Process
- primary truth:
  - ScheduleSpec
  - ScheduleState
  - JobRun
  - RunState
- hot execution path:
  - due detection
  - append run creation
  - worker claim
  - guarded completion/retry
Concrete Substrate #
- control/scheduler plane in Go/Java
- authoritative schedule and run state in a strongly consistent store
- due-time index over schedules
- durable run queue/log for created runs
- worker fleet claiming runs with leases
- derived status/log views from primary schedule/run state
Operation Layer #
- PutSchedule(schedule_id, spec, expected_version?)
  - entry point: scheduler API
  - authoritative decider: schedule store owner
  - transition: overwrite ScheduleSpec
- internal due detection
  - scan/index schedules by next due time
  - guarded transition to due state and append run creation
- CreateRun(schedule_id, fire_time)
  - internal scheduler action
  - authoritative decider: run store owner
  - transition: append JobRun
- ClaimRun(run_id, worker_id, attempt_epoch)
  - entry point: worker
  - authoritative decider: run-state owner
  - transition: exclusive claim on RunState
- CompleteRun(run_id, attempt_epoch, result) / FailRun(...)
  - entry point: worker
  - authoritative decider: run-state owner
  - transition: guarded completion/failure
- internal retry/reschedule
  - transition run state to retryable / create next attempt as allowed by policy
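The internal due-detection step can be sketched with a min-heap acting as the due-time index. This is an illustration under assumptions: `DueIndex`, `fire_due`, numeric timestamps, and fixed-interval schedules are all hypothetical simplifications, and the `create_run` callback stands in for the deduplicated CreateRun operation.

```python
import heapq


class DueIndex:
    """Min-heap of (next_due, schedule_id); a stand-in for a due-time index."""

    def __init__(self):
        self._heap = []

    def add(self, schedule_id, next_due):
        heapq.heappush(self._heap, (next_due, schedule_id))

    def pop_due(self, now):
        """Remove and return every entry whose next_due is at or before now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap))
        return due


def fire_due(index, intervals, now, create_run):
    """One scan pass: create a run per due schedule and re-index it.

    intervals:  dict of schedule_id -> interval in the same unit as now
    create_run: callback (schedule_id, fire_time); assumed idempotent on
                (schedule_id, fire_time) so a repeated pass is harmless.
    Each pass fires at most one instant per schedule; a schedule that is
    several intervals behind catches up over successive passes.
    """
    created = []
    for fire_time, schedule_id in index.pop_due(now):
        create_run(schedule_id, fire_time)
        created.append((schedule_id, fire_time))
        index.add(schedule_id, fire_time + intervals[schedule_id])
    return created
```

Using the scheduled `fire_time` rather than the wall-clock time of the scan as the run key is what lets a restarted scheduler re-run the same pass without creating a second run for the same instant.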
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| schedule update | scheduler API | schedule store owner | control-plane node | job scheduler |
| due detection | scheduler | schedule/run-state owner | internal | job scheduler |
| run creation | scheduler | run store owner | internal | job scheduler |
| claim run | worker | run-state owner | control-plane node / run-state owner | job scheduler |
| complete/fail run | worker | run-state owner | control-plane node | job scheduler |
| retry/reschedule | scheduler | run-state owner | internal | job scheduler |
Concrete HLD #
Main components:
- scheduler API
- schedule store + due-time index
- scheduler fleet
- run store / runnable-run queue
- worker fleet
- status/dashboard views
Short interview version #
“I’d design the distributed scheduler as a future-constraint plus claimable-run system. Schedule definitions are authoritative intent, due detection creates immutable run records keyed by (schedule_id, fire_time), and workers claim runs with fenced ownership before executing them. Completion and retry are guarded lifecycle transitions on run state, so duplicate claims or stale workers cannot corrupt execution truth. The main scaling levers are sharded due-time buckets, partitioned runnable queues, and careful handling of synchronized schedule spikes.”