Queue #
queue = policy-governed waiting set
A queue is the statement:
work exists now,
execution happens later,
under some discipline.
It is where time, work, ownership, fairness, and failure meet.
Design Axes (the core module) #
A “queue type” is not a primitive. It is a point in a five-axis parameter space. Design by choosing a value on each axis; recognize systems by reading their vector.
Axis 1 — Retention Semantics (the structural cleave) #
remove-on-ack: item deleted globally once acknowledged
retained-log: items kept; each consumer advances its own offset
This is the only axis that changes the interface shape, not just policy:
remove-on-ack -> shared mutable ack/visibility state at the crossing point (thick)
retained-log -> progress tracking pushed entirely to consumer (thin)
Consequences:
remove-on-ack: one logical consumer group per item lifecycle; redelivery machinery
retained-log: replay, multiple independent consumers, retention window becomes a contract
Interrogation:
Is work removed globally or retained for replay?
When is progress committed, and by whom?
What deletes history, and is anyone still depending on it?
Axis 2 — Selection Policy #
FIFO: arrival order / approximate fairness
priority: importance, cost, deadline
fair/weighted: per-class or per-tenant share (WFQ, DRR)
coalescing: keyed latest-wins; dedupe obsolete work
Substitutable independently of retention. A Kafka partition is FIFO-selected retained-log; a controller workqueue is coalescing-selected remove-on-ack.
Interrogation:
Is order required or approximate?
Can priority override arrival order?
Can one source starve others?
Is repeated work on the same key wasteful? (-> coalesce)
Does aging exist to prevent starvation?
Axis 3 — Eligibility / Timing #
immediate: eligible on enqueue
delayed-until-T: timer/due-time (scheduled jobs, visibility delay, Temporal timers)
delayed-until-X: capacity or condition (waitlist, pending pods, accept backlog)
Interrogation:
Can items be delayed? By clock or by condition?
What happens to a due item after a crash? (missed timer)
Can a waiting request go stale before promotion?
Clock skew: whose clock decides "due"?
Axis 4 — Topology Position (failure routing) #
Retry queues and DLQs are not types; they are positions in a graph:
main path -> [fail] -> retry queue (delay + attempt metadata) -> main path
-> [attempts exhausted / poison] -> dead letter queue
A retry queue is a delay queue whose items carry {attempt, backoff, error}. A DLQ is any queue at the terminal-failure edge: quarantine + audit + redrive.
Interrogation:
How many attempts before terminal?
Backoff + jitter, or retry storm?
Is the failure recoverable at all? (poison classification)
Who watches the DLQ? (ignored DLQ = silent data loss)
Does redrive just repeat the failure?
PII/secrets in failure payloads?
Axis 5 — Capacity Policy #
unbounded: backlog absorbs everything; latency unbounded
bounded: overload becomes an explicit signal
-> backpressure (producer stalls)
-> rejection (drop / shed / timeout)
Interrogation:
Is the queue bounded?
How is overload expressed, and to whom?
Is backlog observable? (depth, age of oldest item)
Does a large buffer hide a capacity problem? (backlog != capacity)
The Ownership Protocol (cuts across all axes) #
Every queue with side-effecting consumers runs some subset of:
publish/enqueue
receive/pull
claim/lease <- ownership begins
process
ack/commit/delete <- ownership + item lifecycle end
nack/retry/extend
expire/redeliver <- lease death; ownership recycled
dead-letter
observe backlog
Interrogation:
Who owns an in-flight item?
Is ownership leased? What is the lease/visibility timeout?
What happens when the worker dies mid-lease?
Can two workers hold the same item? (yes, transiently — design for it)
Technical Bottleneck: the Commit Point* #
commit point = the moment progress is durably acknowledged,
relative to the side effect
Essential, and no general solution exists — only per-case recipes. Nearly every classic queue bug is downstream of it:
worker crashes after side effect, before ack -> duplicate execution
offset committed before processing -> lost work
offset committed after processing -> duplicate work
redelivery storm -> commit too slow / lease too short
"exactly once" -> commit point + dedupe boundary, nothing more
Known recipes (each bounded, none universal):
idempotent consumer (dedupe key at the effect)
transactional outbox (effect + commit in one transaction)
exactly-once-within-scope (Kafka txn: only inside the log's own boundary)
at-least-once + idempotency as the honest default
A strong design states explicitly:
when work becomes visible,
who owns it,
when it is complete,
what happens if completion is unknown,
and how overload is bounded.
Named Configurations (lookup table) #
Famous points in the parameter space. Vector = {retention, selection, eligibility, position, capacity}.
| Name | Vector | Canonical systems | Signature failure |
|---|---|---|---|
| FIFO queue | remove-on-ack, FIFO, immediate, main, varies | SQS FIFO, RabbitMQ queue | head-of-line blocking; order breaks under parallelism |
| Work queue | remove-on-ack, FIFO-ish, immediate, main, varies | Celery, SQS tasks, k8s workqueue, thread pool | crash-after-effect-before-ack; stuck in-flight |
| Broker/message queue | remove-on-ack, routing+FIFO, immediate, main, varies | RabbitMQ/AMQP, NATS queue group, Pub/Sub sub | loss via ack/persistence misconfig; slow subscriber backlog |
| Durable log | retained-log, FIFO per partition, immediate, main, retention-bounded | Kafka partition, Pulsar, Kinesis shard | consumer lag; bad offset commit; rebalance pause; retention deletes needed history |
| Delay/timer queue | either, timestamp-priority, delayed-until-T, main, varies | scheduled job table, SQS delay, Redis zset, Temporal timers | clock skew; timer storm; missed timer after crash |
| Priority queue | remove-on-ack, priority, immediate, main, varies | scheduler queues, k8s scheduling queue | starvation; priority inversion; priority abuse |
| Retry queue | remove-on-ack, FIFO, delayed-until-T, retry edge, varies | backoff topics, Temporal retry, DLQ redrive | retry storm; retrying unrecoverable work; duplicate effects |
| Dead letter queue | remove-on-ack, FIFO, immediate, terminal edge, unbounded-ish | SQS DLQ, Pub/Sub dead-letter topic, RabbitMQ DLX | ignored; second unbounded backlog; secrets in payloads |
| Bounded queue | either, FIFO, immediate, main, bounded | thread pool queue, socket buffer, bounded channel | too large -> latency; too small -> rejection; producer stall |
| Fair/multi-queue | remove-on-ack, weighted/DRR, immediate, main, per-class bounds | per-tenant queues, packet schedulers | unfair weights; starvation; utilization loss under idle tenants |
| Coalescing queue | remove-on-ack, keyed latest-wins, immediate, main, small | k8s controller workqueue, UI event loop, reindex queue | lost intermediate event that mattered; wrong dedupe key; hot-key starvation |
| Waitlist | remove-on-ack, priority/FIFO, delayed-until-X, main, bounded by policy | pending pods, GPU job queue, accept backlog | stale promotion; overbooking; infinite wait |
Vocabulary #
enqueue dequeue peek
claim lease visibility deadline
ack nack redelivery attempt backoff jitter
offset checkpoint replay retention
priority aging preemption fairness deficit-round-robin
head-of-line backpressure capacity load-shedding
dedupe coalescing latest-wins
poison DLQ quarantine redrive
Deep Lesson #
Queue bugs come from confusing pairs that sit on different axes:
delivery vs processing (ownership protocol vs effect)
ack vs successful effect (commit point)
FIFO vs global ordering (selection vs retention/partitioning)
queue vs log (axis 1)
retry vs idempotency (topology vs commit point)
backlog vs capacity (axis 5)
visibility timeout vs correctness (lease is liveness, not safety)
Design procedure: walk the five axes, state the commit point, bound the overload. The named types are recognition shortcuts, not the design space.