Skip to main content
  1. System Design Components/

Distributed Task Queue (Celery / Sidekiq-class)

Distributed Task Queue (Celery / Sidekiq-class) #

This note models a Celery/Sidekiq-class background job system where producers enqueue tasks, workers claim and execute them, and the system supports retries, scheduled retries, and worker crash recovery.


Step 1 - Normalize #

Assume the baseline prompt is:

  • design a distributed task queue
  • producers enqueue jobs
  • workers pull and execute jobs
  • jobs are acked on success
  • failures may retry or move to dead-letter storage
  • system scales across many nodes

Normalize into state-affecting paths.

RequirementActorOperationState touchedPriority
Producer enqueues taskClientappend eventS1
create target
TaskEnvelope
C1
Worker reserves/claims next taskClientstate transitionS1
update target
TaskExecutionState
C1
Worker acknowledges successful taskClientstate transitionS1
update target
TaskExecutionState
C1
Worker marks task failed/retryableClientstate transitionS1
update target
TaskExecutionState
C1
System requeues timed-out or retryable taskSystemasync processS1
hidden write target
TaskExecutionState
C1
System promotes delayed/retry-scheduled task when dueSystemasync processS1
hidden write target
ScheduledTaskState
C1
System routes queue/shard to current ownerSystemread sourceS1
read source target
PartitionMap
C1
System reassigns shard ownership after node failureSystemstate transitionS1
update target
PartitionOwnership
C1
Client reads queue depth / worker statsClientread projectionS1
read projection target
QueueStatsView
R2

Notes on normalization #

Important choices:

  • enqueue is append event
    • task submission is an immutable fact
  • reserve/claim is state transition
    • task moves from ready to claimed/in-progress
  • ack/fail are state transition
    • task lifecycle changes
  • retries and delayed tasks are explicit internal paths
  • routing/ownership are explicit because this is distributed infra

This system is closer to:

  • claimable execution queue

than to:

  • replayable event log

Step 2 - Critical Path Selection #

RequirementPriority classWhy
Enqueue taskC1task intent must not be lost or duplicated incorrectly
Reserve/claim next taskC1task execution correctness depends on valid exclusive claim
Ack successful taskC1completion semantics depend on current claim holder
Mark failed/retryable taskC1retry and terminal failure semantics depend on valid lifecycle transition
Requeue timed-out or retryable taskC1worker crash recovery and retries depend on this path
Promote delayed/retry-scheduled taskC1scheduled work must become runnable at the right time
Route queue/shard to ownerC1wrong routing can split claims and task order
Reassign shard ownershipC1failover must preserve task lifecycle correctness
Queue depth / worker statsR2operational only

Baseline critical paths #

Main C1 paths:

  • P1 enqueue task
  • P2 reserve/claim task
  • P3 ack task success
  • P4 fail/retry task
  • P5 requeue timed-out or retryable task
  • P6 promote delayed task
  • P7 route to shard owner
  • P8 reassign shard ownership

This design is driven by:

  • durable task intent
  • one active worker claim per task attempt
  • guarded lifecycle transitions
  • retry and timeout recovery

Step 3 - Primary State Extraction #

For a distributed task queue, the minimal primary state is the task itself, its execution lifecycle, delayed scheduling state, and routing/ownership state.

Candidate object labelCandidate sourceCandidate needed for C1/R1?Candidate decomposition actionClassPrimary?OwnerEvolutionScope kindScope value
TaskEnvelopedirect nounYeskeep as candidateeventYesserviceappend-onlyinstancetask_id
TaskExecutionStatelifecycle objectYeskeep as candidateprocessYesservicestate machineinstancetask_id
ScheduledTaskStatehidden write targetYeskeep as candidateprocessYesservicestate machineinstancetask_id
PartitionOwnershiphidden write targetYeskeep as candidateprocessYesservicestate machineinstanceshard_id
PartitionMaphidden write targetYeskeep as candidateentityYesserviceoverwritecollectionqueue shards
QueueStatsViewderived read modelNoreject as UI artifactprojectionNoderivedoverwritecollectionqueue_id
ClaimAttempthidden write targetNoreject as implementation choiceeventNoderivedappend-onlycollectiontask_id

Important modeling choices #

TaskEnvelope #

Primary because:

  • enqueue is immutable fact recording
  • payload, queue, headers, retries, and metadata live here

TaskExecutionState #

This is the central task-queue object. It captures lifecycle like:

  • READY
  • CLAIMED(worker_id, claim_token, expiry)
  • SUCCEEDED
  • FAILED
  • RETRY_PENDING
  • DEAD_LETTERED

ScheduledTaskState #

Kept explicit because delayed jobs and scheduled retries are common in Celery/Sidekiq-class systems. It captures:

  • run_at
  • delayed / promoted lifecycle

PartitionOwnership #

Needed because:

  • one owner should control claim transitions for a shard at a time

PartitionMap #

Needed because:

  • producers and workers must route consistently to the right shard authority

Minimal strict primary set #

The strongest minimal set is:

  • TaskEnvelope
  • TaskExecutionState
  • ScheduledTaskState
  • PartitionOwnership
  • PartitionMap

Step 4 - Hard Invariants #

For a Celery/Sidekiq-class task queue, the hard invariants are about durable task existence, exclusive active claim per task attempt, guarded completion/failure transitions, and safe retry promotion.

PathTierTypeInvariant statement
P1 enqueue taskHARDuniquenessKey enqueue request_id maps to at most one logical outcome enqueued task within queue scope.
P2 reserve/claim taskHARDeligibilityAction claim_task is valid only if TaskExecutionState(task_id) is READY at decision time.
P2 reserve/claim taskHARDuniquenessKey task_id maps to at most one logical outcome active claim holder within claim-timeout scope.
P3 ack task successHARDeligibilityAction ack_task is valid only if TaskExecutionState(task_id) is CLAIMED and claim_token matches current holder at decision time.
P4 fail/retry taskHARDeligibilityAction fail_task is valid only if TaskExecutionState(task_id) is CLAIMED and claim_token matches current holder at decision time.
P5 requeue timed-out or retryable taskHARDeligibilityAction requeue_task is valid only if TaskExecutionState(task_id) is in a retryable or expired claimed state and current claim token/epoch still matches at decision time.
P6 promote delayed taskHARDeligibilityAction promote_delayed_task is valid only if ScheduledTaskState(task_id) is pending and run_at <= now at decision time.
P7 route to shard ownerHARDuniquenessKey shard_id maps to at most one logical outcome current authoritative owner within shard_id.
P8 reassign shard ownershipHARDeligibilityAction reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time.

What matters most #

1. One active claim per task at a time #

This is the central execution invariant.

2. Ack/fail must be fenced by current claim token #

A stale worker must not be able to ack or fail a task after timeout and reassignment.

3. Retry promotion must be guarded #

Only due delayed tasks or legitimately retryable tasks may become runnable again.

4. At-least-once execution is normal #

This system usually allows duplicate execution on worker crash or lost ack. Exactly-once is not the default invariant.


Step 5 - Execution Context #

For the strict baseline task queue:

FieldValueWhy
Topologysingle service distributedone logical task-queue service spread across many nodes
Write coordination scopeper object scopecorrectness is per task_id and per shard ownership scope
Read consistency targetstrong onlyclaim/ack/fail paths must use authoritative task state
Holder modelclientworkers temporarily hold claimed tasks
Compensation acceptable?Nowrong claim/ack/fail decisions cannot be repaired safely afterward

Derived implications #

  • holder_may_crash = true

    • workers can crash while holding claimed tasks
  • cross_service_write = false

    • baseline keeps task state, routing, and ownership within one logical service
  • bounded_staleness_allowed = false

    • claim/ack/fail paths must use authoritative execution state
  • cross_service_atomicity_required = false

    • no multi-service transaction in baseline
  • exclusive_claim_required = true

    • one worker should hold a task attempt at a time
  • guarded_by_current_state = true

    • claim, ack, fail, retry, and promotion all depend on current lifecycle state

What this implies #

This pushes us toward:

  • one authoritative owner per shard
  • explicit task execution state machine
  • claim timeout modeled as lease-like worker ownership
  • strong reads on the task-state hot path

Step 6 - Deterministic Mechanism Selection #

PathWrite shapeBase mechanismRequired companions
P1 enqueue taskappend-only eventappend logrequest id dedup
P2 reserve/claim taskexclusive claimleaseclaim token, claim timeout
P3 ack task successguarded state transitionCAS on (state, version) or leader-applied guarded transitionclaim token
P4 fail/retry taskguarded state transitionCAS on (state, version) or leader-applied guarded transitionclaim token, retry policy
P5 requeue timed-out or retryable taskguarded state transitionleader-applied guarded transitionclaim token, timeout scan
P6 promote delayed taskguarded state transitionleader-applied guarded transitiondue-time index
P7 route to shard ownerexclusive claimleasefencing token, heartbeat
P8 reassign shard ownershipguarded state transitionCAS on (state, version)fencing token, shard catch-up check

Why these fit #

Enqueue #

Task submission is immutable fact recording, so append-only fits naturally.

Claim #

Task execution ownership is temporary and exclusive, which is effectively a lease.

Ack / fail #

These are not blind updates. They must verify the current claim token and state, so they are guarded transitions.

Retry / delayed promotion #

These depend on current lifecycle state plus time predicates, so guarded transitions fit.

Canonical substrate implied #

The baseline now points to:

  • sharded task queue service
  • one owner per shard
  • append-only task storage
  • per-task execution lifecycle state
  • lease-like task claim with token/timeout
  • delayed-task time index for scheduled retries

Step 7 - Read Model / Source of Truth #

For a Celery/Sidekiq-class task queue, truth is mostly direct source state. Operational queue metrics are derived.

ConceptTruthRead pathRebuild path
C1 task payload and metadataTaskEnveloperead source directlyauthoritative task store
C2 current execution lifecycleTaskExecutionStateread source directlyauthoritative execution-state store or replay from committed log
C3 delayed scheduling lifecycleScheduledTaskStateread source directlyauthoritative delayed-task store
C4 shard ownershipPartitionOwnershipread source directlyauthoritative ownership store
C5 shard routing mapPartitionMapread source directlyauthoritative routing metadata
C6 queue depth / worker statsderived from task and state datamaterialized viewrecompute from authoritative state

Important point #

For the core semantics:

  • enqueue writes authoritative task state
  • claim/ack/fail/retry operate on authoritative TaskExecutionState
  • delayed promotions operate on authoritative due-time state
  • queue depth and worker dashboards are projections

Step 8 - Failure Handling #

PathRetryCompeting writersCrash after commitPublish failureStale holder
P1 enqueue taskretry with request_id to avoid duplicate enqueuecompeting enqueues coexist; dedup matters only for same logical requestcommitted enqueue survives owner crash if replicated past commit pointproducer may retry safely with idempotency keyn/a
P2 reserve/claim taskretry claim safely; may return another task or later same taskonly one active claim should win for a task at a timeif claim committed and worker crashes, task stays claimed until timeout or failovern/astale worker fenced by claim token/epoch
P3 ack task successretry ack with same claim tokenstale/wrong claim token loses guarded transitioncommitted success survives crash if replicated past commit pointn/astale holder cannot ack after timeout and re-claim
P4 fail/retry taskretry fail with same claim tokenstale token loses guarded transitioncommitted fail/retry survives crash if replicated past commit pointn/astale holder cannot fail after timeout and reassignment
P5 requeue timed-out or retryable tasktimeout/retry scan retry safeonly one requeue transition should win for current expired/retryable statescanner crash delays recovery; next scan retriesn/aold claim token becomes invalid after requeue
P6 promote delayed taskdue-scan retry safeonly one promotion transition should win for current delayed statescanner crash delays runnable promotion; next scan retriesn/an/a
P7 route to shard ownerretry after refreshing shard maponly one valid owner should existif owner changed, refreshed map points to new ownern/astale owner rejected by fencing token
P8 reassign shard ownershipretry failover transition safelyonly one reassignment wins current ownership statepromoted owner crash triggers later reassignmentn/aold owner fenced and must not continue serving

What matters most #

1. Claim-token fencing #

This is the key task-queue safety mechanism.

Bad case:

  • worker A claims task
  • claim times out
  • worker B reclaims task
  • worker A later acks success

Without fencing:

  • A could incorrectly finish B’s active attempt

So ack/fail must be tied to the current claim token.

2. At-least-once execution #

If a worker crashes after side effects but before ack:

  • task may run again

So application handlers should be idempotent or dedup-aware.

3. Enqueue retry dedup #

If producer retries after response loss:

  • dedup by request_id if API promises producer idempotency

Step 9 - Scale Adjustments #

HotspotTypeFirst response
hot queue / shard with many claimscontention hotspotincrease shard count and isolate hot queues
claim/ack pressure on same shardwrite throughput hotspotbatch claim fetches and spread queue families across shards
delayed-task promotion scanswrite throughput hotspotbucket due times by time wheel / min-heap and scan incrementally
retry storms after dependency outagecontention hotspotexponential backoff, jitter, retry caps, and dead-letter cutoffs
ownership churn during failurescontention hotspotstabilize leases/elections and avoid aggressive reassignment
stats/dashboard readsread hotspotkeep them as derived views only

What scales well #

This queue scales by:

  • sharding task queues
  • giving each shard one authoritative owner
  • keeping claim/ack/fail local to that owner
  • using efficient due-time indexes for delayed jobs and retries

What fails first #

Usually:

  • hot queue concentration on one shard
  • inefficient delayed-job scans
  • stale-claim fencing mistakes
  • retry storms overwhelming ready queues

Canonical design conclusion #

The mechanical outcome is:

  • primary state:
    • TaskEnvelope
    • TaskExecutionState
    • ScheduledTaskState
    • PartitionOwnership
    • PartitionMap
  • critical invariants:
    • durable task enqueue
    • one active worker claim per task attempt
    • ack/fail valid only for current claim token
    • timed-out or retryable tasks become runnable again safely
    • delayed tasks promote when due
  • mechanisms:
    • append log
    • lease
    • guarded ack/fail/requeue transitions
    • fenced shard ownership
  • reads:
    • direct authoritative reads for task and execution state
    • projections only for queue depth/stats

Polished interview answer #

“I’d design the task queue as a sharded service with one leader per shard. Enqueue appends an immutable task record. Workers reserve tasks through an exclusive claim that issues a claim token and timeout, and ack/fail are guarded transitions that succeed only for the current claim token. If a worker crashes or times out, the task is safely requeued for at-least-once execution, and delayed tasks or scheduled retries are promoted through a due-time index. The main scaling levers are more shards, hot-queue isolation, efficient delayed-task scanning, and retry backoff control.”


Concrete Substrate #

I’ll choose a service-owned sharded task queue with shard leaders as the concrete baseline, because it matches the mechanics we derived:

  • append-only enqueue
  • lease-like task claim
  • guarded ack/fail/requeue
  • delayed-task due-time promotion
  • one owner per shard

Concrete tech family:

  • service in Go or Java
  • local durable storage:
    • RocksDB or log-segment files
  • shard replication:
    • Raft or leader-follower replication with commit index
  • metadata/control:
    • etcd or internal metadata quorum

Each shard leader stores:

  • task records
  • current TaskExecutionState
  • ready queue / runnable index
  • in-flight expiry index
  • delayed-task due-time index

Operation Layer #

1. Enqueue task #

API

  • EnqueueTask(queue_id, payload, request_id?, run_at?)

Initiator

  • producer/client

Entry point

  • gateway or any task-queue node

Authoritative decider

  • current leader for target shard

Precondition

  • valid shard routing
  • optional dedup request id if producer idempotency is required

Transition

  • append TaskEnvelope
  • initialize TaskExecutionState = READY if runnable now
  • or create ScheduledTaskState(run_at) if delayed

Response

  • {task_id}

Failure cases

  • stale routing -> retry with updated shard map
  • response loss -> retry may duplicate unless request_id dedup is used

2. Reserve/claim task #

API

  • ReserveTask(queue_id, max_tasks, lease_timeout)

Initiator

  • worker/client

Entry point

  • gateway or any task-queue node

Authoritative decider

  • current shard leader(s) serving that queue

Precondition

  • task chosen must currently be READY

Transition

  • selected task:
    • READY -> CLAIMED(worker_id, claim_token, expiry)
  • remove from ready index
  • insert into claim-expiry index

Response

  • {tasks: [{task_id, payload, claim_token, lease_expiry}]}

Failure cases

  • no ready task -> empty response / long poll timeout
  • stale owner -> retry after redirect

3. Ack task success #

API

  • AckTask(queue_id, task_id, claim_token)

Initiator

  • worker/client

Entry point

  • gateway or any task-queue node

Authoritative decider

  • shard leader owning the task

Precondition

  • current TaskExecutionState is CLAIMED
  • claim_token matches current holder

Transition

  • CLAIMED -> SUCCEEDED

Response

  • success / no-op failure

Failure cases

  • stale claim token -> reject
  • task already timed out and reassigned -> reject stale ack

4. Fail or retry task #

API

  • FailTask(queue_id, task_id, claim_token, error, retry_at?)

Initiator

  • worker/client

Entry point

  • gateway or any task-queue node

Authoritative decider

  • shard leader owning the task

Precondition

  • current TaskExecutionState is CLAIMED
  • claim_token matches current holder

Transition

  • if retryable:
    • CLAIMED -> RETRY_PENDING
    • update ScheduledTaskState(retry_at)
  • else:
    • CLAIMED -> FAILED or DEAD_LETTERED

Response

  • success

5. Requeue timed-out or retryable task #

API

  • internal background process

Initiator

  • system

Entry point

  • shard leader

Authoritative decider

  • shard leader

Precondition

  • task is still CLAIMED and expired
  • or task is RETRY_PENDING and due for requeue

Transition

  • move task back to READY
  • remove from expired/retry tracking
  • insert into ready queue

6. Promote delayed task #

API

  • internal background process

Initiator

  • system

Entry point

  • shard leader

Authoritative decider

  • shard leader

Precondition

  • delayed task is pending and run_at <= now

Transition

  • ScheduledTaskState becomes promoted
  • TaskExecutionState -> READY
  • insert into ready queue

7. Reassign shard ownership #

API

  • internal failover flow:
    • AcquireShardLease(shard_id)
    • or Raft leader election

Initiator

  • system

Entry point

  • follower/candidate node or coordination layer

Authoritative decider

  • shard quorum or metadata lease store

Precondition

  • current owner failed or relinquished
  • candidate replica sufficiently current

Transition

  • new owner/leader epoch established
  • old owner fenced

Entry Point vs Decider vs Responder #

PathEntry pointAuthoritative deciderPhysical responderLogical responder
EnqueueTaskgateway / any nodeshard leaderleader or front nodetask queue service
ReserveTaskgateway / any nodeshard leaderleader or front nodetask queue service
AckTaskgateway / any nodeshard leaderleader or front nodetask queue service
FailTaskgateway / any nodeshard leaderleader or front nodetask queue service
delayed/retry promotionshard leadershard leaderinternaltask queue service
shard failoverfollower / coordination layershard quorum / lease storenew leader / control planetask queue service

Concrete HLD #

Main components:

  • gateway/router
    • resolves queue/shard
    • forwards client and worker requests
  • shard leader
    • authoritative owner of task lifecycle for shard
    • manages ready, claimed, and delayed indexes
  • shard followers
    • replicate shard mutations
  • metadata/control service
    • tracks shard ownership and routing
  • background timeout / delayed promotion worker
    • usually runs on shard leader

Short Interview Version #

I’d build the task queue as a sharded service with one leader per shard. Enqueue appends an immutable task record. Workers reserve tasks through a lease-like claim that issues a claim token and expiry, and ack or fail are guarded transitions that succeed only for the current claim token. If a worker crashes or times out, the task is safely requeued for at-least-once execution, and delayed jobs or scheduled retries are promoted from a due-time index when they become runnable. Replication happens per shard, and failover promotes only a sufficiently current replica.