Retry / Idempotency #
The block’s premise, and the catalog’s most annoying fact:
a timeout does not tell you whether the operation happened.
That is state_machine.md’s ignorance* verbatim — and this block is what you DO about it:
retry = repeat an operation after an uncertain outcome
idempotency = make the repetition safe
Role in the catalog: the second reunion block — the duplication protocol. lease_fencing.md gathered nine files’ ownership fragments; this file gathers the commit point* — stated in queue.md, cited by six files since — and gives it its recipe book. queue.md owns the star’s statement; this file owns its practice. The homecomings:
message redelivery → queue.md's claim + lease_fencing's delivery lease
sequence + epoch → lease_fencing's token discipline (Kafka producer epoch is already cited there)
CAS / conditional write → lease_fencing's check-at-effect
retry budget + backoff → backpressure.md's egress natives
the outbox → name-dropped in three files; it lives here
Central tension:
recover from transient failure
vs
duplicate side effects and overload amplification
Design Axes (the core module) #
Axis 1 — Operation Identity (the native core) #
Retry safety BEGINS when the server can distinguish “same operation again” from “new operation” — which requires a logical operation ID minted by the caller:
not the request (a retry is a new request)
not the connection (a retry may arrive on a new one)
the INTENT — "charge order 4711 once" — named before the first attempt
The idempotency-key failure modes are all identity-scoping failures:
key too broad: dedupes DISTINCT intents (two real orders, one charge)
key too narrow: misses the retry (new key per attempt = no dedupe at all)
same key, different body: identity unbound from content; the request FINGERPRINT must travel with the key, and a mismatch is an error, not a dedupe hit
key expires early: the retry outlives the memory of the original
Interrogation:
What is the logical operation, in business terms? Who mints its ID, when?
Is the fingerprint checked, so a reused key with new content is rejected?
How long can a retry plausibly arrive? (the dedupe window ≥ that, always)
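The identity rules above can be sketched as a small dedupe store. This is a minimal illustration, not a production design: `IdempotencyStore`, `fingerprint`, and `FingerprintMismatch` are hypothetical names, the store is in-memory, and the marker is NOT committed atomically with the effect (that problem belongs to the star*). It shows only three things: the key names the intent, the fingerprint travels with the key, and the window decides how long a retry is still recognized.

```python
import hashlib
import json
import time


class FingerprintMismatch(Exception):
    """Same key reused with different content: an error, not a dedupe hit."""


def fingerprint(body: dict) -> str:
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


class IdempotencyStore:
    """Hypothetical in-memory dedupe store; a real one must be durable."""

    def __init__(self, window_seconds: float, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._records = {}  # key -> (fingerprint, result, stored_at)

    def execute(self, key: str, body: dict, effect):
        now = self.clock()
        record = self._records.get(key)
        if record is not None and now - record[2] >= self.window:
            record = None  # expired: the original has been forgotten
        fp = fingerprint(body)
        if record is not None:
            if record[0] != fp:
                raise FingerprintMismatch(key)
            return record[1]  # same intent again: return the previous result
        result = effect(body)
        self._records[key] = (fp, result, now)
        return result
```

A retry that arrives after `window_seconds` is treated as a new operation and the effect runs again, which is exactly the "key expires early" failure listed above.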
Axis 2 — The Dedupe Ladder (the seventh strength ladder) #
Rungs cost progressively more state and coordination; the design question is the CHEAPEST rung the effect tolerates:
rung 1, idempotent by algebra: set-to-X, upsert, add-to-set. Repetition is free, NO state needed (the rung to reach for first; CRDT logic from replication.md axis 3)
rung 2, conditional / CAS: guard on expected version (ETag If-Match, etcd Txn, resourceVersion). Retry-safe IF the client handles conflict (and ABA if the version is not monotonic)
rung 3, sequence + epoch: per-producer ordering; duplicates AND zombies rejected in one mechanism (Kafka idempotent producer: ID, epoch, sequence → lease_fencing's token, applied per-record)
rung 4, dedupe record: remember completed operations (Stripe keys, processed-message table). Now the marker's PLACEMENT matters (see the star*)
rung 5, transactional coupling: marker and effect in one atomic commit; the top rung, and the star's whole subject
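The sequence + epoch rung is the least obvious of the five, so here is a rough sketch of the check it implies. `SequencedLog` is a hypothetical class and the logic is deliberately simplified; it is not Kafka's actual protocol, only an illustration of how one comparison rejects both duplicates (old sequence) and zombies (old epoch).

```python
class SequencedLog:
    """Hypothetical log that accepts each (producer, epoch, sequence) at most once."""

    def __init__(self):
        self.records = []
        self._producers = {}  # producer_id -> {"epoch": int, "last_seq": int}

    def append(self, producer_id: str, epoch: int, seq: int, payload) -> str:
        state = self._producers.setdefault(
            producer_id, {"epoch": epoch, "last_seq": -1}
        )
        if epoch < state["epoch"]:
            return "fenced"  # a zombie from an older epoch
        if epoch > state["epoch"]:
            # A newer incarnation takes over; its sequence starts again at 0.
            state["epoch"] = epoch
            state["last_seq"] = -1
        if seq <= state["last_seq"]:
            return "duplicate"  # already appended: the retry is absorbed
        if seq != state["last_seq"] + 1:
            return "out_of_order"  # a gap: something was lost in between
        state["last_seq"] = seq
        self.records.append(payload)
        return "appended"
```

In this sketch a sequence reset happens only when the epoch advances, and the epoch check runs first, so the reset does not re-arm an older producer.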
Interrogation:
Which rung, per effect — chosen, or defaulted?
Could the effect be REWRITTEN to a lower rung? (increment → set-to-total
moves you from dedupe-record to algebra, and deletes a table)
Sequence: what resets it, and does a reset re-arm the zombie?
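The increment → set-to-total rewrite can be seen in a few lines. This is a toy sketch with hypothetical function names; it assumes the caller already knows the intended total, which in practice may itself need a read or a CAS guard.

```python
def apply_increment(balances: dict, account: str, delta: int) -> None:
    # NOT idempotent: every repetition changes the state again.
    balances[account] = balances.get(account, 0) + delta


def apply_set_total(balances: dict, account: str, total: int) -> None:
    # Idempotent by algebra: repeating the write leaves the same state.
    balances[account] = total


def deliver_twice(apply, *args) -> int:
    """Apply the same operation twice, as a retry after a timeout would."""
    state = {"acct": 10}
    apply(state, *args)
    apply(state, *args)  # the retry
    return state["acct"]
```

With a starting balance of 10, delivering "add 5" twice ends at 20, while delivering "set to 15" twice ends at 15 and needs no dedupe table.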
Axis 3 — Retry Placement (and the multiplication hazard) #
Retries live at four layers, and the layers cannot see each other:
transport: HTTP/gRPC client, LB retry, TCP retransmit
application: job retry, workflow activity, payment capture
broker: redelivery after lease expiry (→ queue.md, lease_fencing)
transaction: abort-and-retry loop (FoundationDB (FDB), Spanner, serializable SQL)
The native hazard is MULTIPLICATION:
3 transport × 3 application × broker redelivery
≈ a storm manufactured from one failure —
and each layer's retries are invisible to the others.
(backpressure.md's retry-budget discipline applies at EVERY layer,
and hidden retries multiply through the service chain.)
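The multiplication is plain arithmetic, sketched below. The per-layer counts are assumptions for illustration: the text gives 3 transport and 3 application attempts, and the broker count of 2 (one delivery plus one redelivery) is invented here. The function names are hypothetical.

```python
from math import prod


def worst_case_attempts(attempts_per_layer: dict) -> int:
    """Worst case: every layer exhausts its attempts for each attempt above it."""
    return prod(attempts_per_layer.values())


def chain_amplification(attempts_per_hop: int, depth: int) -> int:
    """Same effect along a call chain: each hop retries every call it receives."""
    return attempts_per_hop ** depth


layers = {"transport": 3, "application": 3, "broker": 2}
total = worst_case_attempts(layers)      # 3 * 3 * 2
chain = chain_amplification(3, depth=3)  # 3 ** 3
```

Here `total` is 18 and `chain` is 27: one failure at the bottom can become dozens of requests against the dependency that is already struggling.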
The transaction layer carries its own trap:
the abort-retry loop re-executes the BODY —
external effects inside it duplicate silently.
FDB's discipline: effects outside the loop, or guarded by a rung above.
"abort" means the DATABASE did nothing; it says nothing about the
email your body sent on attempt one.
Interrogation:
Enumerate the layers between intent and effect. Multiply their attempts.
Which errors are retryable AT EACH layer — and is overload (429/503)
correctly non-retryable-without-budget? (backpressure.md)
What sits inside the transaction retry body that isn't a database write?
Axis 4 — Effect Reversibility (what kind of thing are you protecting?) #
idempotent by nature: safe already (axis 2, rung 1)
dedupe-guarded: made safe by rungs 2–5
compensable: the saga's residence — and compensation deserves
its full weight:
COMPENSATION IS NOT UNDO. it is a new forward
action against a world that already SAW the
intermediate state (the refund is not an
un-charge; the customer got the email).
state_machine.md's compensation edges, composed.
irreversible: the sent email, the fired notification, the
launched missile — the only defense is dedupe
UPSTREAM of the effect, because nothing
downstream can help.
Interrogation:
Classify every external effect on this path — one of the four, explicitly.
For compensable: is the compensation itself idempotent and retryable?
(a saga whose compensations can duplicate has two problems now)
For irreversible: which rung stands upstream, and is it the top one?
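A small sketch of the compensable class, assuming a hypothetical `Saga` runner. It illustrates two points from this axis: compensation is a new forward action (the ledger keeps both the charge and the refund), and the compensation itself is guarded so that retrying it does not duplicate. The guard here is in-memory and not atomic with the compensating effect, so a real design would still need a rung from axis 2.

```python
class Saga:
    """Hypothetical saga runner with retry-safe compensation."""

    def __init__(self):
        self.completed = []        # step IDs whose forward action finished
        self.compensated = set()   # step IDs already compensated

    def run(self, steps) -> bool:
        """steps: list of (step_id, action, compensation) tuples."""
        try:
            for step_id, action, _ in steps:
                action()
                self.completed.append(step_id)
        except Exception:
            self.compensate(steps)
            return False
        return True

    def compensate(self, steps) -> None:
        by_id = {step_id: comp for step_id, _, comp in steps}
        # Walk completed steps in reverse; skip any already compensated.
        for step_id in reversed(self.completed):
            if step_id in self.compensated:
                continue
            by_id[step_id]()
            self.compensated.add(step_id)
```

If a "charge" step succeeds and a later "ship" step fails, the ledger ends as charge followed by refund, not as an empty ledger, and calling `compensate` a second time adds nothing.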
Technical Bottleneck: The Atomicity Domain* #
What this block owns that queue.md’s commit-point statement doesn’t:
exactly-once is not a property of a system.
it is a property of a BOUNDARY.
The dedupe marker and the effect must commit ATOMICALLY — and atomic commitment exists only inside one atomicity domain:
same DB transaction: outbox (business row + event row, one commit);
processed-message table (marker + state change,
one commit)
same log transaction: Kafka exactly-once — which works precisely and
ONLY within Kafka's own commit protocol
(consume-transform-produce + offsets, one txn;
the moment the effect leaves Kafka, so does
the guarantee)
same conditional write: ETag PUT — the EXTERNAL system holds the marker,
fused with the effect by its own CAS
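The same-DB-transaction domain can be sketched with the standard-library `sqlite3` module. Table names, column names, and the `relay` function are hypothetical; a real relay would typically be a poller or a CDC pipeline. The point is where the domain ends: the business row and the event row commit together, but publishing happens outside the transaction and can repeat.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    db.execute(
        "CREATE TABLE outbox ("
        "event_id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)"
    )
    db.commit()
    return db


def place_order(db: sqlite3.Connection, order_id: str) -> None:
    # One transaction: both rows commit, or neither does.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"evt-{order_id}", f"order_placed:{order_id}"),
        )


def relay(db: sqlite3.Connection, publish) -> None:
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, payload in rows:
        # The publish is OUTSIDE the domain: a crash before the UPDATE below
        # means the event is published again on the next pass.
        publish(event_id, payload)
        with db:
            db.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
            )
```

Because the relay is at-least-once, the consumer still has to dedupe on `event_id`.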
Every domain crossing re-derives the whole problem from scratch:
"exactly-once delivery" is not a thing;
"exactly-once effect WITHIN A NAMED DOMAIN" is.
where no domain spans the effect, you have left exactly-once territory
and entered compensation's (axis 4).
Marker-placement corollaries (the classic bugs, derived):
marker before effect, not atomic → crash between = effect lost forever
(the marker lies: "done" and it wasn't)
effect before marker, not atomic → crash between = duplicate on retry
(queue.md's crash-after-effect, at home)
marker in a different store → you built a distributed transaction
by accident, and it doesn't work
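For contrast with the three bugs above, here is a sketch of the placement that works: marker and effect inside one transaction, using the standard-library `sqlite3` module. `open_ledger`, `apply_once`, and the schema are hypothetical, and the `crash_before_commit` flag exists only to simulate a crash between the two writes. For brevity the duplicate path returns a status string rather than the previous result.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    db.execute("CREATE TABLE processed (op_id TEXT PRIMARY KEY)")
    db.execute("INSERT INTO accounts VALUES ('acct', 0)")
    db.commit()
    return db


def apply_once(db, op_id, account, delta, crash_before_commit=False) -> str:
    try:
        with db:  # commits on success, rolls back on any exception
            # Marker: the primary key rejects a second insert of the same op_id.
            db.execute("INSERT INTO processed (op_id) VALUES (?)", (op_id,))
            # Effect: in the same transaction as the marker.
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (delta, account),
            )
            if crash_before_commit:
                raise RuntimeError("simulated crash between writes and commit")
    except sqlite3.IntegrityError:
        return "duplicate"
    return "applied"
```

A simulated crash rolls back both the marker and the balance change, so the retry applies the operation exactly once; a second retry hits the marker and changes nothing.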
A strong design says explicitly:
what the logical operation identity is and who mints it (axis 1),
which dedupe rung guards each effect (axis 2),
how many retry layers stack and their multiplied worst case (axis 3),
the reversibility class of every external effect (axis 4),
and the NAMED atomicity domain where marker and effect commit together —
or the honest admission that none spans it, and compensation begins.
Retry/Idempotency As Protocol (the crossing-point spec — keep) #
Request path:
mint logical operation ID (+ fingerprint)
send with ID / sequence / expected version
server: already completed? → return PREVIOUS result (not an error)
server: new? → perform effect ATOMICALLY WITH dedupe marker (the star*)
client retries on timeout with the SAME ID, backoff + jitter + budget
expire dedupe state only after the plausible-retry window closes
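The client half of the request path might look like the sketch below. `RetryBudget` and `call_with_retry` are hypothetical names, the budget is a simple token counter, and the backoff uses "full jitter" (a uniform delay up to an exponentially growing cap). These are illustrative choices, not the catalog's prescribed algorithm; `send` stands in for whatever transport call the client makes.

```python
import random
import time
import uuid


class RetryBudget:
    """Hypothetical token budget: each retry spends one token."""

    def __init__(self, tokens: int):
        self.tokens = tokens

    def try_spend(self) -> bool:
        if self.tokens <= 0:
            return False
        self.tokens -= 1
        return True


def call_with_retry(send, body, budget, max_attempts=4,
                    base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, rng=random.random):
    # Mint the operation ID ONCE, before the first attempt.
    op_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(op_id, body)  # every attempt carries the SAME ID
        except TimeoutError:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # out of attempts or out of budget: stop retrying
            # Full jitter: uniform delay in [0, min(max_delay, base * 2^attempt)).
            sleep(rng() * min(max_delay, base_delay * 2 ** attempt))
```

Passing `sleep` and `rng` as parameters keeps the function testable without real delays.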
Message processing (queue.md’s claim, made safe):
receive with lease → check processed-table
idempotent effect or transaction (marker inside the domain)
record completion → ack
crash before ack → redelivery repeats SAFELY (that's the whole point)
Transaction loop (the FDB discipline):
read at version → prepare writes → commit with conflict check
conflict/unknown → retry the WHOLE body
external effects: outside the loop, or guarded by a higher rung
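A toy version of the loop shows why the body must be free of external effects. `VersionedStore`, `Conflict`, and `run_transaction` are hypothetical; the store uses one store-wide version for conflict detection, which is far coarser than a real database. The list `attempts_seen` stands in for an external effect placed inside the body, and the first attempt deliberately simulates a concurrent writer.

```python
class Conflict(Exception):
    """Raised when the store changed since the transaction read it."""


class VersionedStore:
    def __init__(self):
        self.data = {}
        self.version = 0

    def read(self, key):
        return self.data.get(key, 0), self.version

    def commit(self, key, value, read_version):
        if read_version != self.version:
            raise Conflict(key)  # the database did nothing
        self.data[key] = value
        self.version += 1


def run_transaction(store, body, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return body(store)  # on conflict the WHOLE body runs again
        except Conflict:
            continue
    raise Conflict("gave up after max_attempts")


attempts_seen = []  # stands in for an external effect inside the body


def increment(store):
    value, version = store.read("counter")
    attempts_seen.append(version)  # the trap: runs once per ATTEMPT
    if len(attempts_seen) == 1:
        store.commit("other", 1, store.version)  # simulated concurrent writer
    store.commit("counter", value + 1, version)
    return value + 1


store = VersionedStore()
result = run_transaction(store, increment)
receipts = ["receipt"]  # the safe place: after the loop, once
```

The counter ends at 1, but `attempts_seen` has two entries: anything in that position, such as an email, would have gone out twice.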
Named Configurations (lookup table) #
Vector = {identity, rung, placement, reversibility, domain}. Rows marked → are owned elsewhere.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Transport retry | none(!), —, transport, varies, — | gRPC/HTTP retry semantics | retrying non-idempotent POST; storm; retry-after-commit (ignorance*) |
| Application retry | operation ID, chosen rung, application, business effects, named | Temporal activity + activity ID | duplicate payment/email; permanent failure retried forever |
| Message redelivery → queue.md | message ID + receipt, rung 4-ish, broker, —, processed-table | SQS visibility / ack deadline | (owned: lease expiry duplication, poison loops) |
| Transaction retry | txn version, CAS-native, transaction, body-effect trap, the DB itself | FoundationDB retry loop | external effect inside the body; contention livelock |
| Idempotency key | caller-minted + fingerprint, rung 4, application, any, server-side store | Stripe idempotency keys | scope too broad/narrow; body mismatch; early expiry; “done” cached before done |
| Sequence + epoch → lease_fencing.md | producer ID, rung 3, transport-to-log, —, the log | Kafka idempotent producer | zombie resumes (fenced); sequence reset; per-partition scope misunderstood |
| CAS / conditional | expected version, rung 2, any, —, the target itself | etcd Txn; ETag If-Match | conflict ignored; ABA on non-monotonic version; blind overwrite |
| Outbox | event ID, rung 5, application→broker, —, the DB transaction | transactional outbox + CDC | publish duplicated (consumer must dedupe — the domain ends at the DB); stuck rows |
| Saga / compensation | step IDs, per-step rungs, application, compensable, none spans it | order/payment saga | compensation ≠ undo; compensation fails; intermediate state seen |
| Retry budget → backpressure.md | —, —, every layer, —, — | SRE budgets; backoff+jitter | (owned: storms, synchronization, hidden multiplication) |
| Dedupe store | operation ID, rung 4, consumer-side, —, must share the effect’s domain | processed-message table | marker/effect not atomic*; store down = fail how?; TTL shorter than retry window |
| Exactly-once discipline | txn ID + epoch, rung 5, log-scoped, —, Kafka’s own protocol, only | Kafka transactions | the guarantee stops at Kafka’s edge*; zombie producer (fenced); unknown commit (ignorance*) |
Vocabulary #
attempt, timeout, unknown outcome, retryable, permanent
operation ID, intent, fingerprint, scope, window
idempotent, upsert, algebra, CAS, expected version, ABA
sequence, epoch, zombie, fence
dedupe record, processed table, marker placement
outbox, transactional coupling, atomicity domain
saga, compensation, forward action, irreversible
backoff, jitter, budget, multiplication, storm
exactly-once effect (never: exactly-once delivery)
Deep Lesson #
Retry bugs come from confusing pairs on different axes:
timeout vs failure (the premise: ignorance*, imported)
retry vs safe retry (axis 2: name the rung first)
idempotent request vs idempotent effect (the server deduped; the email sent anyway)
message ack vs business commit (queue.md's seam: two ladders, two coordinates)
dedupe marker vs actual completion (the star*: marker placement is everything)
exactly-once delivery vs exactly-once effect (the star*: a boundary property, not a system one)
compensation vs undo (axis 4: the world already saw it)
Design procedure: mint the identity with the intent, pick the cheapest sufficient rung per effect, count the stacked retry layers and budget them all, classify every effect’s reversibility — and then draw the atomicity domain on the whiteboard, put the marker inside it, and say out loud where the guarantee ends. The named types are recognition shortcuts, not the design space.