
Retry / Idempotency

The block’s premise, and the catalog’s most annoying fact:

a timeout does not tell you whether the operation happened.

That is state_machine.md’s ignorance* verbatim — and this block is what you DO about it:

retry       = repeat an operation after an uncertain outcome
idempotency = make the repetition safe

Role in the catalog: the second reunion block — the duplication protocol. lease_fencing.md gathered nine files’ ownership fragments; this file gathers the commit point* — stated in queue.md, cited by six files since — and gives it its recipe book. queue.md owns the star’s statement; this file owns its practice. The homecomings:

message redelivery      → queue.md's claim + lease_fencing's delivery lease
sequence + epoch        → lease_fencing's token discipline (Kafka producer
                          epoch is already cited there)
CAS / conditional write → lease_fencing's check-at-effect
retry budget + backoff  → backpressure.md's egress natives
the outbox              → name-dropped in three files; it lives here

Central tension:

recover from transient failure
        vs
duplicate side effects and overload amplification

Design Axes (the core module)

Axis 1 — Operation Identity (the native core)

Retry safety BEGINS when the server can distinguish “same operation again” from “new operation” — which requires a logical operation ID minted by the caller:

not the request (a retry is a new request)
not the connection (a retry may arrive on a new one)
the INTENT — "charge order 4711 once" — named before the first attempt

The idempotency-key failure modes are all identity-scoping failures:

key too broad              dedupes DISTINCT intents (two real orders, one charge)
key too narrow             misses the retry (new key per attempt = no dedupe at all)
same key, different body   identity unbound from content —
                           the request FINGERPRINT must travel with the key,
                           and a mismatch is an error, not a dedupe hit
key expires early          the retry outlives the memory of the original
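
A minimal in-memory sketch of the failure modes listed above. Every name here (IdempotencyStore, KeyReuseError, window_seconds) is hypothetical, and the store is deliberately NOT atomic with the effect; it only illustrates key scope, the fingerprint check, and the dedupe window.

```python
import hashlib
import json
import time


class KeyReuseError(Exception):
    """Same idempotency key, different request body."""


def fingerprint(body):
    # Canonical JSON, so that key order does not change the fingerprint.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """In-memory stand-in for a server-side dedupe store (assumption)."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._records = {}  # key -> (fingerprint, result, stored_at)

    def execute(self, key, body, effect):
        now = self.clock()
        fp = fingerprint(body)
        record = self._records.get(key)
        if record is not None and now - record[2] < self.window_seconds:
            if record[0] != fp:
                # Reused key with new content: an error, not a dedupe hit.
                raise KeyReuseError(key)
            return record[1]  # same operation again: replay the previous result
        # Unknown key, or the record outlived its window: treated as new.
        result = effect(body)
        self._records[key] = (fp, result, now)
        return result
```

Once the window has passed, a late retry with the same key runs the effect a second time, which is the "key expires early" row in executable form.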

Interrogation:

What is the logical operation, in business terms? Who mints its ID, when?
Is the fingerprint checked, so a reused key with new content is rejected?
How long can a retry plausibly arrive? (the dedupe window ≥ that, always)

Axis 2 — The Dedupe Ladder (the seventh strength ladder)

Rungs cost progressively more state and coordination; the design question is the CHEAPEST rung the effect tolerates:

idempotent by algebra:   set-to-X, upsert, add-to-set — repetition is
                         free, NO state needed (the rung to reach for
                         first; CRDT logic from replication.md axis 3)
conditional / CAS:       guard on expected version — ETag If-Match,
                         etcd Txn, resourceVersion. retry-safe IF the
                         client handles conflict (and ABA if the version
                         is not monotonic)
sequence + epoch:        per-producer ordering — duplicates AND zombies
                         rejected in one mechanism (Kafka idempotent
                         producer: ID, epoch, sequence → lease_fencing's
                         token, applied per-record)
dedupe record:           remember completed operations — Stripe keys,
                         processed-message table. now the marker's
                         PLACEMENT matters (see the star*)
transactional coupling:  marker and effect in one atomic commit —
                         the top rung, and the star's whole subject
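
The sequence + epoch rung can be sketched as a toy log that tracks one (epoch, last sequence) pair per producer. This is loosely modelled on the idea behind Kafka's idempotent producer, not on its actual protocol; the class and exception names are hypothetical, and the epoch bump is simplified to "first write seen with a higher epoch".

```python
class FencedError(Exception):
    """The write carries an epoch older than the one the log has seen."""


class OutOfOrderError(Exception):
    """The write skips ahead of the next expected sequence number."""


class DedupingLog:
    def __init__(self):
        self.records = []
        self._state = {}  # producer_id -> (epoch, last_sequence)

    def append(self, producer_id, epoch, sequence, record):
        """Return True if appended, False if recognised as a duplicate."""
        current_epoch, last_seq = self._state.get(producer_id, (epoch, -1))
        if epoch < current_epoch:
            # Zombie: a producer from a superseded epoch is rejected outright.
            raise FencedError(f"{producer_id}: epoch {epoch} < {current_epoch}")
        if epoch > current_epoch:
            # A new epoch restarts the sequence; the old epoch stays fenced.
            last_seq = -1
        if sequence <= last_seq:
            return False  # duplicate of something already appended
        if sequence != last_seq + 1:
            raise OutOfOrderError(f"{producer_id}: expected {last_seq + 1}")
        self.records.append(record)
        self._state[producer_id] = (epoch, sequence)
        return True
```

In this sketch a sequence reset only happens together with an epoch bump, so the reset does not re-arm the zombie: its old epoch is rejected before its sequence numbers are even inspected.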

Interrogation:

Which rung, per effect — chosen, or defaulted?
Could the effect be REWRITTEN to a lower rung? (increment → set-to-total
  moves you from dedupe-record to algebra, and deletes a table)
Sequence: what resets it, and does a reset re-arm the zombie?
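
The increment → set-to-total rewrite and the CAS rung can be compared on a toy versioned store. This is a sketch under stated assumptions: VersionedStore and its methods are hypothetical, the version is a per-key monotonic counter (so ABA cannot occur here), and nothing is persisted.

```python
class VersionConflict(Exception):
    """The stored version no longer matches what the caller expected."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def put(self, key, value):
        # Rung 1, set-to-X: repeating the call leaves the same value.
        version, _ = self.get(key)
        self._data[key] = (version + 1, value)

    def increment(self, key, delta):
        # NOT idempotent: every repetition changes the value again.
        version, value = self.get(key)
        self._data[key] = (version + 1, (value or 0) + delta)

    def compare_and_set(self, key, expected_version, value):
        # Rung 2, CAS: the write lands only if nobody wrote in between.
        version, _ = self.get(key)
        if version != expected_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {version}"
            )
        self._data[key] = (version + 1, value)
        return version + 1
```

A retried compare_and_set with the original expected version raises VersionConflict, including when the first attempt actually succeeded. The caller has to read back and decide, which is the "IF the client handles conflict" condition on that rung.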

Axis 3 — Retry Placement (and the multiplication hazard)

Retries live at four layers, and the layers cannot see each other:

transport:    HTTP/gRPC client, LB retry, TCP retransmit
application:  job retry, workflow activity, payment capture
broker:       redelivery after lease expiry (→ queue.md, lease_fencing)
transaction:  abort-and-retry loop (FDB, Spanner, serializable SQL)

The native hazard is MULTIPLICATION:

3 transport × 3 application × broker redelivery
≈ a storm manufactured from one failure —
and each layer's retries are invisible to the others.
(backpressure.md's retry-budget discipline applies at EVERY layer,
and hidden retries multiply through the service chain.)
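
The multiplication is plain arithmetic, and worth writing down. The per-layer counts below are assumed for illustration (the text above does not fix the broker's redelivery count); worst_case_attempts is a hypothetical helper.

```python
from math import prod


def worst_case_attempts(attempts_per_layer):
    """Upper bound on effect attempts when every layer retries independently."""
    return prod(attempts_per_layer.values())


# Assumed numbers: 3 transport attempts, 3 application attempts, and a broker
# that delivers the message at most twice.
layers = {"transport": 3, "application": 3, "broker": 2}
print(worst_case_attempts(layers))  # 18

# The same arithmetic applies down a service chain: 3 attempts at each of 4 hops.
print(worst_case_attempts({f"hop{i}": 3 for i in range(4)}))  # 81
```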

The transaction layer carries its own trap:

the abort-retry loop re-executes the BODY —
external effects inside it duplicate silently.
FDB's discipline: effects outside the loop, or guarded by a rung above.
"abort" means the DATABASE did nothing; it says nothing about the
email your body sent on attempt one.

Interrogation:

Enumerate the layers between intent and effect. Multiply their attempts.
Which errors are retryable AT EACH layer — and is overload (429/503)
  correctly non-retryable-without-budget? (backpressure.md)
What sits inside the transaction retry body that isn't a database write?

Axis 4 — Effect Reversibility (what kind of thing are you protecting?)

idempotent by nature:   safe already (axis 2, rung 1)
dedupe-guarded:         made safe by rungs 2–5
compensable:            the saga's residence — and compensation deserves
                        its full weight:
                        COMPENSATION IS NOT UNDO. it is a new forward
                        action against a world that already SAW the
                        intermediate state (the refund is not an
                        un-charge; the customer got the email).
                        state_machine.md's compensation edges, composed.
irreversible:           the sent email, the fired notification, the
                        launched missile — the only defense is dedupe
                        UPSTREAM of the effect, because nothing
                        downstream can help.

Interrogation:

Classify every external effect on this path — one of the four, explicitly.
For compensable: is the compensation itself idempotent and retryable?
  (a saga whose compensations can duplicate has two problems now)
For irreversible: which rung stands upstream, and is it the top one?

Technical Bottleneck: The Atomicity Domain*

What this block owns that queue.md’s commit-point statement doesn’t:

exactly-once is not a property of a system.
it is a property of a BOUNDARY.

The dedupe marker and the effect must commit ATOMICALLY — and atomic commitment exists only inside one atomicity domain:

same DB transaction:      outbox (business row + event row, one commit);
                          processed-message table (marker + state change,
                          one commit)
same log transaction:     Kafka exactly-once — which works precisely and
                          ONLY within Kafka's own commit protocol
                          (consume-transform-produce + offsets, one txn;
                          the moment the effect leaves Kafka, so does
                          the guarantee)
same conditional write:   ETag PUT — the EXTERNAL system holds the marker,
                          fused with the effect by its own CAS

Every domain crossing re-derives the whole problem from scratch:

"exactly-once delivery" is not a thing;
"exactly-once effect WITHIN A NAMED DOMAIN" is.
where no domain spans the effect, you have left exactly-once territory
and entered compensation's (axis 4).

Marker-placement corollaries (the classic bugs, derived):

marker before effect, not atomic → crash between = effect lost forever
                                   (the marker lies: "done" and it wasn't)
effect before marker, not atomic → crash between = duplicate on retry
                                   (queue.md's crash-after-effect, at home)
marker in a different store      → you built a distributed transaction
                                   by accident, and it doesn't work

A strong design says explicitly:

what the logical operation identity is and who mints it (axis 1),
which dedupe rung guards each effect (axis 2),
how many retry layers stack and their multiplied worst case (axis 3),
the reversibility class of every external effect (axis 4),
and the NAMED atomicity domain where marker and effect commit together —
or the honest admission that none spans it, and compensation begins.

Retry/Idempotency As Protocol (the crossing-point spec — keep)

Request path:

mint logical operation ID (+ fingerprint)
send with ID / sequence / expected version
server: already completed? → return PREVIOUS result (not an error)
new? → perform effect ATOMICALLY WITH dedupe marker (the star*)
client retries on timeout with the SAME ID, backoff + jitter + budget
expire dedupe state only after the plausible-retry window closes
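
The client half of the request path, as a sketch. RetryBudget and call_with_retry are hypothetical names; the jitter is the "full jitter" variant (uniform between zero and the capped exponential delay), and TimeoutError stands in for whatever the transport raises on an unknown outcome.

```python
import random
import time
import uuid


class RetryBudget:
    """Each retry spends one token; an empty budget means stop retrying."""

    def __init__(self, tokens):
        self.tokens = tokens

    def try_spend(self):
        if self.tokens <= 0:
            return False
        self.tokens -= 1
        return True


def call_with_retry(send, body, budget, max_attempts=4,
                    base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, rng=random.random):
    # Minted ONCE, before the first attempt: every retry names the same intent.
    operation_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(operation_id, body)
        except TimeoutError:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # outcome still unknown; the caller must handle that
            # Full jitter: uniform in [0, capped exponential backoff).
            sleep(rng() * min(max_delay, base_delay * 2 ** attempt))
```

The budget is passed in rather than created per call so that it can be shared across calls; a per-call budget would only restate max_attempts.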

Message processing (queue.md’s claim, made safe):

receive with lease → check processed-table
idempotent effect or transaction (marker inside the domain)
record completion → ack
crash before ack → redelivery repeats SAFELY (that's the whole point)
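
A consumer-side sketch of the steps above, again with sqlite3 as the atomicity domain. The table names and the handle function are hypothetical; the lease and the ack belong to the broker and are left out.

```python
import sqlite3


def open_consumer_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
    conn.commit()
    return conn


def handle(conn, message_id, account, delta):
    """Apply the message once. Return True if applied, False if a duplicate."""
    try:
        # Marker and state change share one transaction: both commit or neither.
        with conn:
            # The primary key makes a second insert of the same ID fail.
            conn.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
            conn.execute(
                "INSERT OR IGNORE INTO balances (account, amount) VALUES (?, 0)",
                (account,),
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
        return True
    except sqlite3.IntegrityError:
        # Already processed: the transaction rolled back, nothing changed.
        return False
```

Either return value is safe to ack. A crash before the ack leads to redelivery, and the redelivered message takes the duplicate path.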

Transaction loop (the FDB discipline):

read at version → prepare writes → commit with conflict check
conflict/unknown → retry the WHOLE body
external effects: outside the loop, or guarded by a higher rung
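
A toy version of the loop, to show where the body re-executes. ToyDB, Txn, and run_transaction are hypothetical; the conflict check is a single database-wide version (far coarser than FoundationDB's), reads are not snapshot-isolated, and only the conflict case is modelled, not the unknown-commit case.

```python
class Conflict(Exception):
    """Another writer committed since this transaction began."""


class ToyDB:
    def __init__(self):
        self.version = 0
        self.data = {}


class Txn:
    def __init__(self, db):
        self.db = db
        self.read_version = db.version  # "read at version"
        self.writes = {}

    def get(self, key):
        return self.writes.get(key, self.db.data.get(key))

    def set(self, key, value):
        self.writes[key] = value  # "prepare writes": buffered until commit

    def commit(self):
        # "commit with conflict check"
        if self.db.version != self.read_version:
            raise Conflict("database changed since read_version")
        self.db.data.update(self.writes)
        self.db.version += 1


def run_transaction(db, body, max_attempts=5):
    for _ in range(max_attempts):
        txn = Txn(db)
        result = body(txn)  # the WHOLE body runs again on every attempt
        try:
            txn.commit()
        except Conflict:
            continue
        return result
    raise Conflict(f"gave up after {max_attempts} attempts")


def mark_paid(txn):
    txn.set("order:4711", "paid")
    # Describe the external effect; do not perform it inside the body.
    return "receipt for order 4711"


db = ToyDB()
email_to_send = run_transaction(db, mark_paid)
# The email would be sent here, once, after the commit succeeded.
```

Sending after the loop removes the duplicate, but a crash between commit and send now loses the email. Closing that gap needs a higher rung, such as writing the email to an outbox row inside the body.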

Named Configurations (lookup table)

Vector = {identity, rung, placement, reversibility, domain}. Rows marked → are owned elsewhere.

| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Transport retry | none(!), —, transport, varies, — | gRPC/HTTP retry semantics | retrying non-idempotent POST; storm; retry-after-commit (ignorance*) |
| Application retry | operation ID, chosen rung, application, business effects, named | Temporal activity + activity ID | duplicate payment/email; permanent failure retried forever |
| Message redelivery → queue.md | message ID + receipt, rung 4-ish, broker, —, processed-table | SQS visibility / ack deadline | (owned: lease expiry duplication, poison loops) |
| Transaction retry | txn version, CAS-native, transaction, body-effect trap, the DB itself | FoundationDB retry loop | external effect inside the body; contention livelock |
| Idempotency key | caller-minted + fingerprint, rung 4, application, any, server-side store | Stripe idempotency keys | scope too broad/narrow; body mismatch; early expiry; “done” cached before done |
| Sequence + epoch → lease_fencing.md | producer ID, rung 3, transport-to-log, —, the log | Kafka idempotent producer | zombie resumes (fenced); sequence reset; per-partition scope misunderstood |
| CAS / conditional | expected version, rung 2, any, —, the target itself | etcd Txn; ETag If-Match | conflict ignored; ABA on non-monotonic version; blind overwrite |
| Outbox | event ID, rung 5, application→broker, —, the DB transaction | transactional outbox + CDC | publish duplicated (consumer must dedupe — the domain ends at the DB); stuck rows |
| Saga / compensation | step IDs, per-step rungs, application, compensable, none spans it | order/payment saga | compensation ≠ undo; compensation fails; intermediate state seen |
| Retry budget → backpressure.md | —, —, every layer, —, — | SRE budgets; backoff+jitter | (owned: storms, synchronization, hidden multiplication) |
| Dedupe store | operation ID, rung 4, consumer-side, —, must share the effect’s domain | processed-message table | marker/effect not atomic*; store down = fail how?; TTL shorter than retry window |
| Exactly-once discipline | txn ID + epoch, rung 5, log-scoped, —, Kafka’s own protocol, only | Kafka transactions | the guarantee stops at Kafka’s edge*; zombie producer (fenced); unknown commit (ignorance*) |

Vocabulary

attempt  timeout  unknown outcome  retryable  permanent
operation ID  intent  fingerprint  scope  window
idempotent  upsert  algebra  CAS  expected version  ABA
sequence  epoch  zombie  fence
dedupe record  processed table  marker  placement
outbox  transactional coupling  atomicity domain
saga  compensation  forward action  irreversible
backoff  jitter  budget  multiplication  storm
exactly-once effect  (never: exactly-once delivery)

Deep Lesson

Retry bugs come from confusing pairs on different axes:

timeout                vs  failure               (the premise: ignorance*, imported)
retry                  vs  safe retry            (axis 2: name the rung first)
idempotent request     vs  idempotent effect     (the server deduped; the email sent anyway)
message ack            vs  business commit       (queue.md's seam: two ladders, two coordinates)
dedupe marker          vs  actual completion     (the star*: marker placement is everything)
exactly-once delivery  vs  exactly-once effect   (the star*: a boundary property, not a system one)
compensation           vs  undo                  (axis 4: the world already saw it)

Design procedure: mint the identity with the intent, pick the cheapest sufficient rung per effect, count the stacked retry layers and budget them all, classify every effect’s reversibility — and then draw the atomicity domain on the whiteboard, put the marker inside it, and say out loud where the guarantee ends. The named types are recognition shortcuts, not the design space.