Distributed Job Scheduler (Airflow / Cron-Class) Analysis Note #

This note captures the full step-by-step analysis for a distributed job scheduler: schedule intent, due-time evaluation, run creation, worker claim, execution lifecycle, and result/status projection.

Step 1 — Normalize #

Assume the baseline prompt is:

  • design a distributed job scheduler like cron / Airflow scheduler
  • users define one-time or recurring schedules
  • scheduler creates runs when schedules become due
  • workers execute runs
  • status/results should be visible
  • retries and crash recovery matter

| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| User creates/updates schedule definition | Client | overwrite state | S1, update target: ScheduleSpec | C1 |
| System determines that schedule is due | System | state transition | S1, update target: ScheduleState | C1 |
| System creates a job run | System | append event | S1, create target: JobRun | C1 |
| Worker claims runnable job run | Worker | state transition | S1, update target: RunState | C1 |
| Worker reports completion/failure | Worker | state transition | S1, update target: RunState | C1 |
| System retries or reschedules failed/timed-out run | System | async process | S1, hidden write target: RunState | C1 |
| User reads schedule/run status | Client | read projection | S1, read projection target: SchedulerStatusView | R2 |

Notes on normalization:

  • schedule definition is overwrite-state intent
  • due-time recognition is a lifecycle transition on schedule state
  • run creation is append-only fact
  • worker claim/completion/retry are execution lifecycle transitions
  • status view is derived

This is a composition of:

  • Future Constraint + Claimable Run
  • Claimable Execution Process

Step 2 — Critical Path Selection #

| Requirement | Priority class | Why |
|---|---|---|
| Create/update schedule definition | C1 | changes future execution truth |
| Mark schedule as due | C1 | wrong due detection causes missed or duplicate runs |
| Create job run | C1 | run creation is core execution truth |
| Claim runnable job run | C1 | one execution attempt should own a run at a time |
| Report run completion/failure | C1 | run lifecycle truth depends on it |
| Retry/reschedule failed/timed-out run | C1 | crash recovery and retry semantics depend on it |
| Read schedule/run status | R2 | operational/user-facing only |

Critical paths:

  • P1 update schedule definition
  • P2 mark schedule due
  • P3 create job run
  • P4 claim run
  • P5 complete/fail run
  • P6 retry/reschedule run

Step 3 — Primary State Extraction #

| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| ScheduleSpec | direct noun | Yes | keep as candidate | intent | Yes | service | overwrite | instance | schedule_id |
| ScheduleState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | schedule_id |
| JobRun | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | run_id |
| RunState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | run_id |
| SchedulerStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | schedule_id or run_id |

Minimal primary set:

  • ScheduleSpec
  • ScheduleState
  • JobRun
  • RunState

Important modeling choices:

ScheduleSpec #

This is future intent.

ScheduleState #

Worth making explicit because it carries:

  • next due time
  • pause/resume
  • backfill/last-fired tracking
  • lifecycle of the schedule itself

JobRun #

Primary immutable fact:

  • each actual execution attempt/run identity

RunState #

Primary because:

  • claim, running, success, failure, retry, timeout are all current lifecycle state
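
The four primary objects could be sketched as plain records. This is a minimal illustration, not a schema from the design: field names such as `cron_expr`, `next_due`, `last_fired`, `owner`, and the string status values are assumptions, and only `JobRun` is frozen to reflect its append-only nature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduleSpec:
    """Overwrite-style intent, scoped by schedule_id."""
    schedule_id: str
    cron_expr: str            # hypothetical field: how the schedule is expressed
    version: int = 1          # monotonic revision counter


@dataclass
class ScheduleState:
    """Lifecycle of the schedule itself (state machine)."""
    schedule_id: str
    status: str = "active"    # assumed values: "active", "paused"
    next_due: Optional[int] = None
    last_fired: Optional[int] = None
    version: int = 0


@dataclass(frozen=True)
class JobRun:
    """Immutable fact: a run was created for one fire instant."""
    run_id: str
    schedule_id: str
    fire_time: int            # epoch seconds (assumed representation)


@dataclass
class RunState:
    """Current execution lifecycle of a run (state machine)."""
    run_id: str
    status: str = "pending"
    owner: Optional[str] = None
    attempt_epoch: int = 0
    version: int = 0
```

Keeping `JobRun` separate from `RunState` means the creation fact never changes while the lifecycle record absorbs every claim, completion, and retry.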

Step 4 — Hard Invariants #

| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 update schedule definition | HARD | ordering | Schedule-spec revisions are ordered by a monotonic version within schedule_id. |
| P2 mark schedule due | HARD | eligibility | mark_due is valid only if the current time and schedule lifecycle state make the schedule due under the current schedule spec. |
| P3 create job run | HARD | uniqueness | Key (schedule_id, fire_time) maps to at most one job run creation within the schedule-firing scope. |
| P4 claim run | HARD | uniqueness | Key run_id maps to at most one current active execution owner within the run-attempt scope. |
| P5 complete/fail run | HARD | eligibility | complete_or_fail_run is valid only if the current RunState is owned by the reporting execution attempt and the lifecycle state allows completion/failure. |
| P6 retry/reschedule run | HARD | eligibility | retry_or_reschedule_run is valid only if the current run state and retry policy allow a new attempt or state transition. |

What matters most:

  • no duplicate run creation for same schedule fire instant
  • one active worker/owner per run attempt
  • legal run lifecycle transitions only
  • due-time logic must not miss or duplicate runs
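
"Legal run lifecycle transitions only" can be made concrete with a transition table. The sketch below is one possible state machine: the state names and the exact edges are assumptions chosen for illustration, since the note lists the lifecycle states but not the edges between them.

```python
# Assumed run lifecycle: each key maps to the states it may move to.
LEGAL_TRANSITIONS = {
    "pending": {"claimed"},
    "claimed": {"running", "timed_out"},
    "running": {"succeeded", "failed", "timed_out"},
    "failed": {"retryable"},
    "timed_out": {"retryable"},
    "retryable": {"pending"},
    "succeeded": set(),          # terminal: nothing may follow success
}


def can_transition(current: str, target: str) -> bool:
    """Return True only if current -> target is an allowed lifecycle edge."""
    return target in LEGAL_TRANSITIONS.get(current, set())
```

A guarded write would call `can_transition` against the authoritative current state before committing, so a late or duplicate report cannot move a run out of a terminal state.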

Step 5 — Execution Context #

| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical scheduler system with many workers |
| Write coordination scope | per object scope | correctness is per schedule and per run |
| Read consistency target | strong only | due-time detection and claim/run lifecycle need authoritative current state |
| Holder model | worker | workers temporarily own active run execution |
| Compensation acceptable? | No | duplicate or missed runs are not safely compensable in baseline correctness |

Derived:

  • holder_may_crash = true
  • bounded_staleness_allowed = false
  • exclusive_claim_required = true
  • guarded_by_current_state = true

This implies:

  • authoritative schedule/run state
  • claimable execution ownership
  • strong lifecycle checks

Step 6 — Deterministic Mechanism Selection #

| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 update schedule definition | overwrite current value | CAS on version | schedule version |
| P2 mark schedule due | guarded state transition | CAS on (state, version) or single-writer scheduler decision | schedule version, time source |
| P3 create job run | append-only event | append log | dedup key (schedule_id, fire_time) |
| P4 claim run | exclusive claim | lease | fencing token, heartbeat |
| P5 complete/fail run | guarded state transition | CAS on (state, version) | attempt token / execution epoch |
| P6 retry/reschedule run | guarded state transition | CAS on (state, version) | retry policy version |

Why these fit:

  • schedule spec is current intent/config
  • due-time transition depends on current schedule state and clock
  • run creation is immutable fact
  • execution ownership is a claim/lease problem
  • completion/retry are guarded lifecycle transitions
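
As an illustration of the CAS-on-version mechanism for P1, here is a minimal in-memory sketch. The `ScheduleStore` class and its `put` method are hypothetical names, and a lock stands in for whatever atomic compare-and-set the real store provides. Treating a missing `expected_version` as an unconditional write is also an assumption.

```python
import threading
from typing import Dict, Optional, Tuple


class ScheduleStore:
    """In-memory stand-in for an authoritative schedule store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._specs: Dict[str, Tuple[int, dict]] = {}  # schedule_id -> (version, spec)

    def put(self, schedule_id: str, spec: dict,
            expected_version: Optional[int] = None) -> Optional[int]:
        """Write spec if expected_version matches; return new version or None."""
        with self._lock:
            current_version = self._specs.get(schedule_id, (0, {}))[0]
            if expected_version is not None and expected_version != current_version:
                return None  # stale writer loses the CAS
            new_version = current_version + 1
            self._specs[schedule_id] = (new_version, dict(spec))
            return new_version

    def get(self, schedule_id: str) -> Optional[Tuple[int, dict]]:
        with self._lock:
            return self._specs.get(schedule_id)
```

Two clients that both read version 1 and then write will see exactly one success. The loser re-reads and decides whether its change still applies.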

Step 7 — Read Model / Source of Truth #

| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 schedule definition | ScheduleSpec | read source directly | authoritative schedule store |
| C2 current schedule lifecycle | ScheduleState | read source directly | authoritative schedule-state store |
| C3 run creation fact | JobRun | read source directly | authoritative run store |
| C4 current run lifecycle | RunState | read source directly | authoritative run-state store |
| C5 status/dashboard | derived | materialized view | recompute from primary state |

Important point:

  • scheduler decisions should read authoritative schedule and run state
  • workers should claim against authoritative run lifecycle state
  • status is a derived projection only
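
Because the status view is derived, it should be possible to rebuild it from primary state at any time. A minimal sketch of such a rebuild follows. The function name `build_status_view` and the per-schedule status counts are assumptions about what a dashboard might show.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_status_view(
    run_states: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, int]]:
    """Recompute per-schedule status counts.

    Each input item is (run_id, schedule_id, status), read from the
    authoritative run-state store.
    """
    counts: Dict[str, Counter] = {}
    for _run_id, schedule_id, status in run_states:
        counts.setdefault(schedule_id, Counter())[status] += 1
    return {schedule_id: dict(c) for schedule_id, c in counts.items()}
```

Nothing in the scheduler or worker paths reads this view, so it can lag or be dropped and recomputed without affecting correctness.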

Step 8 — Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| schedule update | retry with schedule version | stale update loses CAS | committed schedule survives crash if persisted | status propagation may lag | n/a |
| mark due | retry safe with version/time guard | duplicate due transition loses guard | scheduler crash delays due marking; next scheduler retries | n/a | n/a |
| create run | retry with (schedule_id, fire_time) dedup | duplicate creation loses unique dedup check | committed run creation survives crash if persisted | status propagation may lag | n/a |
| claim run | retry with execution epoch | only one worker should hold current run lease | if claim committed and worker crashes, lease expires and run can be retried/reclaimed | n/a | stale worker fenced by attempt epoch |
| complete/fail run | retry with current attempt token | stale or wrong attempt loses guarded transition | committed completion/failure survives crash if persisted | result/status publication may lag | stale worker cannot complete after lease loss |
| retry/reschedule run | retry with current run version | duplicate retry transition loses guard | scheduler crash delays retry; next scheduler retries | n/a | stale scheduler/worker rejected by newer run state |

What matters most:

  • dedup run creation by schedule fire instant
  • claim fencing for worker ownership
  • strong lifecycle validation on completion and retry
  • crash recovery from expired worker claims
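
The claim, heartbeat, and fenced-completion behaviour can be sketched together. This is a single-process illustration under stated assumptions: `RunLease` and the three functions are hypothetical names, time is passed in explicitly, and a real system would perform each check-and-update atomically in the authoritative run-state store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunLease:
    status: str = "pending"
    owner: Optional[str] = None
    attempt_epoch: int = 0          # fencing token, bumped on every claim
    lease_expires_at: float = 0.0


def _holds_lease(run: RunLease, epoch: int, now: float) -> bool:
    return (run.status == "claimed" and epoch == run.attempt_epoch
            and now < run.lease_expires_at)


def claim(run: RunLease, worker_id: str, now: float, ttl: float) -> Optional[int]:
    """Claim a pending run or reclaim one whose lease expired."""
    lease_live = run.status == "claimed" and now < run.lease_expires_at
    if run.status not in ("pending", "claimed") or lease_live:
        return None
    run.status, run.owner = "claimed", worker_id
    run.attempt_epoch += 1
    run.lease_expires_at = now + ttl
    return run.attempt_epoch


def heartbeat(run: RunLease, epoch: int, now: float, ttl: float) -> bool:
    """Extend the lease only for the current, unexpired holder."""
    if not _holds_lease(run, epoch, now):
        return False
    run.lease_expires_at = now + ttl
    return True


def complete(run: RunLease, epoch: int, now: float) -> bool:
    """Guarded completion: stale epochs and expired leases are rejected."""
    if not _holds_lease(run, epoch, now):
        return False
    run.status = "succeeded"
    return True
```

A worker that stalls past its lease and later tries to report success carries an old epoch, so its `complete` call fails rather than overwriting the newer attempt.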

Step 9 — Scale Adjustments #

| Hotspot | Type | First response |
|---|---|---|
| huge number of schedules becoming due simultaneously | write throughput hotspot | shard schedule evaluation by due-time buckets and parallelize due detection |
| worker-claim contention on runnable queue | contention hotspot | partition runnable runs and distribute claim load |
| scheduler crash/recovery backlog | contention hotspot | lease-based scheduler ownership of due partitions and incremental recovery |
| large status/dashboard reads | read hotspot | derived views only |
| retry storms after worker failures | write throughput hotspot | backoff, jitter, and bounded concurrent retries |
| time-bucket scans | write throughput hotspot | indexed next-due buckets and incremental scans |

What scales well:

  • schedule definitions and run lifecycles are partitionable
  • run claims can be partitioned across runnable queues
  • workers can scale horizontally as claimers
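
One way to picture the partitioning is a stable hash for ownership plus a coarse time bucket for due scans. Both helpers below are hypothetical, and the 60-second bucket width and SHA-256 choice are assumptions made only for the example.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a schedule_id or run_id to a partition, stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def due_bucket(next_due: int, bucket_seconds: int = 60) -> int:
    """Group schedules by the time window in which they become due."""
    return next_due // bucket_seconds
```

A scheduler instance that owns a partition would then scan only the buckets at or before the current time for its own keys, which keeps each scan small and incremental.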

What fails first:

  • synchronized cron spikes
  • claim contention for runnable work
  • retry storms after systemic worker failures
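
For retry storms, the backoff-and-jitter response could look like the sketch below. It uses the common "full jitter" variant of exponential backoff. The function name, the base delay, and the cap are assumptions, not values from the design.

```python
import random


def retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                rng: random.Random = random.Random()) -> float:
    """Return a delay in seconds drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Spreading delays across the whole interval keeps runs that failed together from retrying together. Bounding how many retries may be in flight at once is a separate control that this sketch does not cover.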

Canonical design conclusion:

  • archetype composition:
    • Future Constraint + Claimable Run
    • Claimable Execution Process
  • primary truth:
    • ScheduleSpec
    • ScheduleState
    • JobRun
    • RunState
  • hot execution path:
    • due detection
    • append run creation
    • worker claim
    • guarded completion/retry

Concrete Substrate #

  • control/scheduler plane in Go/Java
  • authoritative schedule and run state in a strongly consistent store
  • due-time index over schedules
  • durable run queue/log for created runs
  • worker fleet claiming runs with leases
  • derived status/log views from primary schedule/run state
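
The due-time index can be illustrated with a min-heap keyed by next due time. This is an in-memory sketch with hypothetical names (`DueIndex`, `add`, `pop_due`). A production index would be durable and would most likely be a sorted secondary index in the store itself.

```python
import heapq
from typing import List, Tuple


class DueIndex:
    """Min-heap of (next_due, schedule_id) entries."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, str]] = []

    def add(self, schedule_id: str, next_due: int) -> None:
        heapq.heappush(self._heap, (next_due, schedule_id))

    def pop_due(self, now: int) -> List[Tuple[str, int]]:
        """Remove and return every (schedule_id, fire_time) with fire_time <= now."""
        due: List[Tuple[str, int]] = []
        while self._heap and self._heap[0][0] <= now:
            fire_time, schedule_id = heapq.heappop(self._heap)
            due.append((schedule_id, fire_time))
        return due
```

Entries popped here are only candidates. The scheduler still has to confirm each one against the authoritative ScheduleState before creating a run, because the schedule may have been paused or rewritten since it was indexed.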

Operation Layer #

  1. PutSchedule(schedule_id, spec, expected_version?)
     • entry point: scheduler API
     • authoritative decider: schedule store owner
     • transition: overwrite ScheduleSpec
  2. internal due detection
     • scan/index schedules by next due time
     • guarded transition to due state and append run creation
  3. CreateRun(schedule_id, fire_time)
     • internal scheduler action
     • authoritative decider: run store owner
     • transition: append JobRun
  4. ClaimRun(run_id, worker_id, attempt_epoch)
     • entry point: worker
     • authoritative decider: run-state owner
     • transition: exclusive claim on RunState
  5. CompleteRun(run_id, attempt_epoch, result) / FailRun(...)
     • entry point: worker
     • authoritative decider: run-state owner
     • transition: guarded completion/failure
  6. internal retry/reschedule
     • transition run state to retryable / create next attempt as allowed by policy
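
CreateRun is the operation where the (schedule_id, fire_time) dedup key does its work. A minimal sketch of an idempotent append follows. `RunLog`, its `create_run` method, and the derived run_id format are all hypothetical, and a real run store would enforce the key with a unique constraint rather than a dictionary.

```python
from typing import Dict, List, Tuple


class RunLog:
    """Append-only run log with a uniqueness check on (schedule_id, fire_time)."""

    def __init__(self) -> None:
        self.runs: List[dict] = []
        self._by_key: Dict[Tuple[str, int], str] = {}

    def create_run(self, schedule_id: str, fire_time: int) -> Tuple[str, bool]:
        """Return (run_id, created); a repeat call returns the existing run."""
        key = (schedule_id, fire_time)
        existing = self._by_key.get(key)
        if existing is not None:
            return existing, False          # duplicate creation loses the dedup check
        run_id = f"{schedule_id}:{fire_time}"  # assumed id scheme for the example
        self.runs.append(
            {"run_id": run_id, "schedule_id": schedule_id, "fire_time": fire_time}
        )
        self._by_key[key] = run_id
        return run_id, True
```

With this shape, a scheduler that crashes after committing the append but before recording that it fired can simply repeat the call and get the same run back.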

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| schedule update | scheduler API | schedule store owner | control-plane node | job scheduler |
| due detection | scheduler | schedule/run-state owner | internal | job scheduler |
| run creation | scheduler | run store owner | internal | job scheduler |
| claim run | worker | run-state owner | control-plane node / run-state owner | job scheduler |
| complete/fail run | worker | run-state owner | control-plane node | job scheduler |
| retry/reschedule | scheduler | run-state owner | internal | job scheduler |

Concrete HLD #

Main components:

  • scheduler API
  • schedule store + due-time index
  • scheduler fleet
  • run store / runnable-run queue
  • worker fleet
  • status/dashboard views

Short interview version #

“I’d design the distributed scheduler as a future-constraint plus claimable-run system. Schedule definitions are authoritative intent, due detection creates immutable run records keyed by (schedule_id, fire_time), and workers claim runs with fenced ownership before executing them. Completion and retry are guarded lifecycle transitions on run state, so duplicate claims or stale workers cannot corrupt execution truth. The main scaling levers are sharded due-time buckets, partitioned runnable queues, and careful handling of synchronized schedule spikes.”