Resilience Mechanisms to Archetypes Mapping #
This note maps cross-cutting resilience mechanisms to the system design archetypes already used in the cheat sheets.
It complements those cheat sheets, which are archetype-first. This note is mechanism-first.
The inspiration for this note is resilient.txt, especially its themes around:
- timeouts
- retries
- idempotency
- thundering herds
- rate limiting
- resilience observability
How To Use This Note #
For any path:
- identify the archetype
- identify the failure shape
- choose the resilience mechanism that fits
- decide whether the path should:
  - fail closed
  - fail open
  - degrade gracefully
  - retry later
  - buffer and replay
1. Timeouts #
What it solves #
- hung calls
- long-tail latency
- stuck downstream dependencies
- thread/connection pool exhaustion
Archetypes that need it most #
- Critical Transaction Process
- Workflow / Lifecycle State
- Matching / Assignment
- Future Constraint + Claimable Run
- Frontier + Claimable Run
- Shared Mutable Subject
Why #
These archetypes often involve:
- external calls
- human-in-the-loop waits
- claim/lease holding
- multi-step workflows
- synchronous dependency chains
Interview default #
- use timeouts on all remote calls
- make timeouts shorter than lease/hold durations
- never let a hung dependency silently consume all capacity
Key caution #
Timeout is not the same as failure.
Especially for:
- payments
- booking
- external side effects
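A timeout often means the outcome is unknown, not that the operation failed. In that case:
- record the operation as pending
- resolve it through reconciliation rather than assuming failure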
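The caution above can be sketched as a small classifier: a call that errors before its deadline is a known failure, while a call that times out is recorded as pending. This is a minimal sketch under assumptions; `Outcome`, `call_with_timeout`, and `hung_processor` are hypothetical names, and a thread pool stands in for whatever client-side deadline the real remote call supports.

```python
import concurrent.futures
import enum
import time


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PENDING = "pending"  # outcome unknown; hand off to reconciliation


def call_with_timeout(side_effect, timeout_s):
    """Run side_effect with a deadline and classify the result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(side_effect)
    try:
        future.result(timeout=timeout_s)
        return Outcome.SUCCEEDED
    except concurrent.futures.TimeoutError:
        # The remote side may still complete, so this is not a known failure.
        return Outcome.PENDING
    except Exception:
        # A definite error arrived before the deadline.
        return Outcome.FAILED
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the hung call


def hung_processor():
    """Hypothetical stand-in for a slow external payment call."""
    time.sleep(0.3)
```

Calling `call_with_timeout(hung_processor, 0.05)` yields `Outcome.PENDING`, which is the state a reconciliation job would later resolve.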
2. Retries #
What it solves #
- transient failures
- packet loss
- temporary dependency unavailability
- leader failover gaps
Archetypes that need it most #
- Append-Only Child Object
- Workflow / Lifecycle State
- Critical Transaction Process
- Frontier + Claimable Run
- Future Constraint + Claimable Run
- Derived Projection
Why #
These systems frequently have:
- asynchronous processing
- retryable side effects
- queue-based recovery
- replayable writes
Interview default #
- retry only transient failures
- cap retries
- use exponential backoff
- add jitter
- pair retries with idempotency
Key caution #
Retries without idempotency create duplicate effects.
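The interview defaults for retries can be sketched as one helper: retry only transient failures, cap the attempts, and sleep with jittered exponential backoff between them. This is an illustrative sketch; `TransientError` and `retry` are hypothetical names, and the wrapped operation is assumed to be idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry(operation, max_attempts=4, base_delay_s=0.1, max_delay_s=2.0,
          sleep=time.sleep):
    """Retry only TransientError, with capped attempts and jittered backoff.

    Assumes `operation` is idempotent; any other exception propagates at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the error
            # Exponential backoff, capped, with a random (jittered) sleep.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so tests can inject a fake; production callers would leave the default.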
3. Idempotency #
What it solves #
- duplicate requests
- duplicate deliveries
- ambiguous retries
- replay after partial failure
Archetypes that need it most #
- Critical Transaction Process
- Workflow / Lifecycle State
- Append-Only Child Object
- Relation / Edge
- Versioned Namespace + Immutable Content-Addressed Units
Why #
These systems commonly suffer from:
- retried API requests
- duplicate event delivery
- replayed workflow steps
- ambiguous success after network loss
Interview default #
- put idempotency at the natural operation boundary
- use stable request/event keys
- persist dedup truth durably
Key caution #
Do not confuse:
- idempotency
- optimistic concurrency control
- leases
They solve different failure shapes.
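One way to picture "persist dedup truth durably" is to commit the idempotency key and the side effect in the same transaction, so a retried request either finds the key or applies the effect exactly once. The sketch below assumes SQLite as the store; the table names, `open_store`, and `apply_credit` are hypothetical and chosen only for illustration.

```python
import sqlite3


def open_store():
    """Create an in-memory store; a real system would use a durable database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_requests (key TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts VALUES ('acct-1', 0)")
    conn.commit()
    return conn


def apply_credit(conn, idempotency_key, account_id, amount):
    """Apply a credit at most once per key."""
    try:
        with conn:  # one transaction: both statements commit, or neither does
            conn.execute(
                "INSERT INTO processed_requests (key) VALUES (?)",
                (idempotency_key,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, account_id),
            )
        return "applied"
    except sqlite3.IntegrityError:
        # The key was already recorded, so the side effect is skipped.
        return "duplicate"
```

Sending the same key twice credits the account once; the second call returns `"duplicate"`. This handles duplicate requests only. Concurrent conflicting edits and exclusive ownership still need optimistic concurrency control or leases.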
4. Backoff and Jitter #
What it solves #
- synchronized retries
- cascading overload
- retry storms
- coordinated reconnects
Archetypes that need it most #
- Realtime Fanout
- Future Constraint + Claimable Run
- Frontier + Claimable Run
- Matching / Assignment
- Shared Mutable Subject
Why #
These systems can create bursty behavior:
- many clients reconnecting
- many workers retrying
- many scheduled jobs waking together
- many users refreshing the same object
Interview default #
- always mention jitter with retries
- mention staggered reconnects for client-heavy systems
Key caution #
Backoff alone is not enough. Without jitter, clients still synchronize.
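To make the synchronization point concrete, the sketch below compares plain exponential backoff with a "full jitter" variant that picks a uniform random delay below the exponential ceiling. The function names and default values are assumptions for illustration, not a prescribed policy.

```python
import random


def plain_delay(attempt, base_s=0.5, cap_s=30.0):
    """Exponential backoff without jitter: every client gets the same delay."""
    return min(cap_s, base_s * 2 ** attempt)


def jittered_delay(attempt, base_s=0.5, cap_s=30.0, rng=random):
    """Full jitter: a uniform random delay between 0 and the backoff ceiling."""
    return rng.uniform(0, plain_delay(attempt, base_s, cap_s))


if __name__ == "__main__":
    # 100 clients that all failed at the same moment and are on attempt 3.
    plain = {plain_delay(3) for _ in range(100)}
    jittered = {round(jittered_delay(3), 3) for _ in range(100)}
    print("distinct plain delays:", len(plain))
    print("distinct jittered delays:", len(jittered))
```

Without jitter, every client computes the same 4.0-second delay at attempt 3 and retries in the same instant; with jitter, the retries spread across the whole 0 to 4 second window.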
5. Thundering Herd Protection #
What it solves #
- many actors hitting same hot object/path at once
- cache stampedes
- reconnect storms
- mass lease reclaims
Archetypes that need it most #
- Derived Projection
- Search-First Product
- Realtime Fanout
- Shared Mutable Subject
- Time-Bounded Exclusive Allocation
- Auction / Competitive Window
Why #
These systems tend to create:
- hot keys
- hot auctions/events/documents
- many simultaneous polls or refreshes
- synchronized wakeups after failure
Interview default #
- request coalescing
- single-flight cache fill
- bounded fanout
- virtual waiting room if product allows
- staggered reconnect / retry
Key caution #
If you only scale stateless servers but keep one hot key/path, the herd problem remains.
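Request coalescing and single-flight cache fill can be sketched with a small in-process helper: the first caller for a key runs the loader, and concurrent callers for the same key wait for that result instead of hitting the backend again. `SingleFlight` is a hypothetical class written for this note; it covers a single process only, so a multi-node deployment would still need a shared lock or per-key routing.

```python
import threading


class SingleFlight:
    """Coalesce concurrent loads of the same key into one loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = loader()
            except Exception as exc:
                holder["error"] = exc  # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()  # follower: reuse the leader's result
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

A cache miss handler would call `flight.do(cache_key, load_from_db)`, where `load_from_db` is whatever expensive fill the path already has.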
6. Rate Limiting #
What it solves #
- abusive traffic
- noisy neighbors
- overload from spikes
- unfair resource consumption
Archetypes that need it most #
- Current-Value Entity
- Search-First Product
- Append-Only Child Object
- Realtime Fanout
- Critical Transaction Process
- Matching / Assignment
Why #
These systems face:
- public-facing APIs
- expensive queries
- write floods
- hot user/tenant abuse
- fairness-sensitive capacity
Interview default #
- rate limit at the edge for abuse
- add per-tenant/user/object limits when fairness or noisy neighbors matter
- mention fail-open vs fail-closed explicitly
Fail-open vs fail-closed #
- fail closed:
  - payments
  - expensive writes
  - abuse-sensitive actions
- fail open:
  - some read paths during limiter outage
  - non-critical observability endpoints
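The fail-open vs fail-closed choice only matters when the limiter itself is unavailable, which the sketch below makes explicit. It pairs a simple token bucket with a wrapper that applies a per-path policy on limiter errors. `TokenBucket` and `check_limit` are hypothetical names, and the token bucket is one common algorithm chosen for illustration, not one the note prescribes.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def check_limit(limiter, fail_open):
    """Return True if the request may proceed.

    If the limiter is unavailable, admit the request when fail_open is True
    (e.g. some read paths) and reject it when False (e.g. payments).
    """
    try:
        return limiter.allow()
    except Exception:
        return fail_open
```

A payment path would call `check_limit(limiter, fail_open=False)`, while a low-risk read path might pass `fail_open=True`.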
7. Circuit Breaking / Load Shedding #
What it solves #
- dependency meltdown
- cascading failure
- expensive low-value requests consuming scarce capacity
Archetypes that need it most #
- Critical Transaction Process
- Workflow / Lifecycle State
- Search-First Product
- Realtime Fanout
- Derived Projection
Why #
When a downstream dependency fails or slows:
- blindly continuing can take down the caller too
Interview default #
- shed nonessential work first
- preserve core correctness path
- reject low-priority requests early
Key caution #
Do not shed correctness-critical commits before shedding:
- projections
- notifications
- nonessential read enhancements
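The circuit-breaking half of this section can be sketched as a small state holder: after a run of consecutive failures it rejects calls immediately for a cool-down period, then lets a trial call through. `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds are arbitrary, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=10.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency calls are being shed")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping only the nonessential dependencies (projections, notifications, read enhancements) in breakers is one way to shed them first while the commit path keeps its own, stricter handling.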
8. Queueing / Buffering #
What it solves #
- burst smoothing
- temporary downstream outage
- async decoupling
- durable retry
Archetypes that need it most #
- Append-Only Child Object
- Workflow / Lifecycle State
- Derived Projection
- Future Constraint + Claimable Run
- Frontier + Claimable Run
- Critical Transaction Process
Why #
These systems naturally support:
- async side effects
- replayable work
- lag-tolerant projections
Interview default #
- buffer noncritical downstream work
- keep source truth writes independent when possible
Key caution #
Queueing improves resilience only if:
- source truth is durable
- consumers are idempotent
- lag is observable
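The three conditions above can be illustrated with a toy buffer: work stays queued until its handler succeeds, duplicate deliveries are skipped, and the age of the oldest item is exposed as lag. This is a sketch only; `BufferedWork` is a hypothetical class, and its in-memory deque and set stand in for a durable queue and a durable dedup store.

```python
import collections
import time


class BufferedWork:
    def __init__(self, clock=time.monotonic):
        self.queue = collections.deque()  # stand-in for a durable queue
        self.processed_ids = set()        # stand-in for a durable dedup store
        self.clock = clock

    def enqueue(self, event_id, payload):
        self.queue.append((event_id, payload, self.clock()))

    def lag_s(self):
        """Age in seconds of the oldest unprocessed item (0.0 when empty)."""
        if not self.queue:
            return 0.0
        return self.clock() - self.queue[0][2]

    def drain(self, handler):
        """Process items in order; a handler error leaves the item queued."""
        while self.queue:
            event_id, payload, _ = self.queue[0]
            if event_id not in self.processed_ids:
                handler(payload)  # may raise; the item is retried on the next drain
                self.processed_ids.add(event_id)
            self.queue.popleft()
```

If the handler fails during a downstream outage, `lag_s()` keeps growing until a later `drain` succeeds, which is the signal an alert would watch.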
9. Fallback to Source Truth #
What it solves #
- cache/index/projection failure
- stale secondary view
- missing denormalized read model
Archetypes that need it most #
- Current-Value Entity
- Relation / Edge
- Append-Only Child Object
- Derived Projection
- Search-First Product
Why #
These archetypes often have:
- primary truth store
- optional cache/index/projection
Interview default #
- if the projection fails, fall back to source if load allows
- otherwise serve bounded stale result or partial response
Key caution #
Do not assume source fallback is always safe at scale.
Sometimes fallback causes:
- DB overload
- wider outage
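A guarded fallback can be sketched as a read path that tries the projection first, falls back to source truth only while a concurrency budget has room, and otherwise serves a bounded-stale value. `read_with_fallback` and its parameters are hypothetical; the semaphore is one simple way to express "if load allows" and is an assumption of this sketch.

```python
import threading


def read_with_fallback(key, read_projection, read_source, stale, source_slots):
    """Return (value, origin), where origin names which tier answered."""
    try:
        return read_projection(key), "projection"
    except Exception:
        pass  # cache/index/projection unavailable; try the next tier

    # Fall back to source truth only if a concurrency slot is free, so a
    # projection outage cannot stampede the primary store.
    if source_slots.acquire(blocking=False):
        try:
            return read_source(key), "source"
        finally:
            source_slots.release()

    if key in stale:
        return stale[key], "stale"  # bounded-stale copy kept for degraded mode
    return None, "unavailable"
```

Here `source_slots` would be something like `threading.Semaphore(20)`, sized well below what the primary store can absorb.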
10. Reconciliation #
What it solves #
- uncertain external outcomes
- drift between systems
- missing side effects
- pointer/content mismatch
Archetypes that need it most #
- Critical Transaction Process
- Versioned Namespace + Immutable Content-Addressed Units
- Workflow / Lifecycle State
- Time-Bounded Exclusive Allocation
Why #
These systems often interact with:
- external processors
- blob stores
- async workers
- partially published state
Interview default #
- explicit pending/reconciliation state
- periodic drift scan
- repair from durable source of truth
Key caution #
Reconciliation is not a substitute for correctness controls on the main write path.
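A periodic drift scan over pending records can be sketched as below: each pending record is checked against the external system, and only a definite answer repairs local state. `reconcile`, the status strings, and the `lookup_external` callback are hypothetical names used for illustration.

```python
def reconcile(local_status, lookup_external):
    """Resolve 'pending' records using the external system as source of truth.

    local_status maps record id to a status string and is updated in place.
    lookup_external(record_id) is assumed to return "succeeded", "failed",
    or None when the external outcome is still unknown.
    Returns the ids that were repaired.
    """
    repaired = []
    for record_id, status in local_status.items():
        if status != "pending":
            continue  # only uncertain outcomes need reconciliation
        truth = lookup_external(record_id)
        if truth in ("succeeded", "failed"):
            local_status[record_id] = truth
            repaired.append(record_id)
        # Otherwise the record stays pending and is checked on the next scan.
    return repaired
```

A scheduler would run this on an interval and alert when a record stays pending longer than the external system's settlement window.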
11. Observability of Resilience #
What it solves #
- hidden failure accumulation
- silent queue lag
- stalled workflows
- slow degradation before outage
Archetypes that need it most #
All archetypes, but especially:
- Frontier + Claimable Run
- Future Constraint + Claimable Run
- Derived Projection
- Critical Transaction Process
- Realtime Fanout
Key things to observe #
- timeouts
- retry rate
- dedup/idempotency hit rate
- queue lag
- consumer lag
- dead-letter volume
- stale projection age
- lease expiration/reclaim counts
- cache stampede rate
- rate limiter rejections
Interview default #
- mention both correctness metrics and liveness metrics
- mention lag and backlog, not just errors
Fast Mapping By Failure Shape #
If the problem is hung calls #
Use:
- timeout
- circuit break
- fallback if safe
If the problem is duplicate execution #
Use:
- idempotency
- guarded transitions
- dedup
If the problem is transient dependency failure #
Use:
- retries
- backoff
- jitter
- queueing
If the problem is hot-key overload #
Use:
- herd protection
- rate limiting
- coalescing
- cache strategy
If the problem is uncertain external outcome #
Use:
- pending state
- reconciliation
- idempotency
If the problem is lagging downstream side effects #
Use:
- queueing
- replay
- lag observability
Most Useful Mechanisms For Interviews #
If you only remember a small subset, remember these:
- timeout
- retry with exponential backoff and jitter
- idempotency key
- queue / buffer for async retry
- circuit break / load shed
- fallback to source truth
- reconciliation for uncertain external outcomes
- thundering herd protection
- rate limiting
- resilience observability
These cover a large fraction of system design failure discussions.