Resilience Mechanisms to Archetypes Mapping #

This note maps cross-cutting resilience mechanisms to the system design archetypes already used in the cheat sheets.

It complements the archetype cheat sheets. Those notes are archetype-first; this note is mechanism-first.

The source inspiration here is resilient.txt, especially its themes around:

timeouts
retries
idempotency
thundering herds
rate limiting
resilience observability

How To Use This Note #

For any path:

identify the archetype
identify the failure shape
choose the resilience mechanism that fits
decide whether the path should:
- fail closed
- fail open
- degrade gracefully
- retry later
- buffer and replay

1. Timeouts #

What it solves #

hung calls
long-tail latency
stuck downstream dependencies
thread/connection pool exhaustion

Archetypes that need it most #

Critical Transaction Process
Workflow / Lifecycle State
Matching / Assignment
Future Constraint + Claimable Run
Frontier + Claimable Run
Shared Mutable Subject

Why #

These archetypes often involve:

external calls
human-in-the-loop waits
claim/lease holding
multi-step workflows
synchronous dependency chains

Interview default #

use timeouts on all remote calls
make timeouts shorter than lease/hold durations
never let a hung dependency silently consume all capacity

Key caution #

Timeout is not the same as failure.

Especially for:

payments
booking
external side effects

A timeout often means:

outcome unknown
move to pending/reconciliation state
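
A minimal sketch of the caution above, assuming an in-process timeout around a blocking call. The names `Outcome` and `call_with_timeout` are hypothetical; a real client would normally use the timeout parameter of its HTTP or RPC library instead of a worker thread.

```python
import concurrent.futures as cf
import time
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PENDING = "pending"  # outcome unknown; hand off to reconciliation


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread and classify what the caller knows.

    A timeout maps to PENDING rather than FAILED, because the remote
    side effect may still have happened.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        future.result(timeout=timeout_s)
        return Outcome.SUCCEEDED
    except cf.TimeoutError:
        return Outcome.PENDING
    except Exception:
        return Outcome.FAILED
    finally:
        # Do not block the caller on the hung call.
        pool.shutdown(wait=False)


def slow_charge():
    time.sleep(0.3)  # stands in for a stuck payment processor


print(call_with_timeout(slow_charge, timeout_s=0.05))
```

The caller gets control back after the timeout, but the underlying call may still complete, which is why the result is recorded as pending instead of being retried blindly.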

2. Retries #

What it solves #

transient failures
packet loss
temporary dependency unavailability
leader failover gaps

Archetypes that need it most #

Append-Only Child Object
Workflow / Lifecycle State
Critical Transaction Process
Frontier + Claimable Run
Future Constraint + Claimable Run
Derived Projection

Why #

These systems frequently have:

asynchronous processing
retryable side effects
queue-based recovery
replayable writes

Interview default #

retry only transient failures
cap retries
use exponential backoff
add jitter
pair retries with idempotency

Key caution #

Retries without idempotency create duplicate effects.
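
The defaults above can be sketched as a small retry helper. This is an illustration under assumptions: `TransientError` is a hypothetical marker for retryable failures, the default limits are arbitrary, and `op` is assumed to be idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(op, max_attempts=4, base_s=0.1, cap_s=2.0,
          sleep=time.sleep, rng=random.random):
    """Retry op on TransientError with capped exponential backoff and jitter.

    Any other exception propagates immediately, so permanent failures
    are never retried.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            # Full jitter: random delay up to the capped exponential bound.
            delay = rng() * min(cap_s, base_s * 2 ** attempt)
            sleep(delay)
```

Injecting `sleep` and `rng` keeps the helper testable without real waiting.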

3. Idempotency #

What it solves #

duplicate requests
duplicate deliveries
ambiguous retries
replay after partial failure

Archetypes that need it most #

Critical Transaction Process
Workflow / Lifecycle State
Append-Only Child Object
Relation / Edge
Versioned Namespace + Immutable Content-Addressed Units

Why #

These systems commonly suffer from:

retried API requests
duplicate event delivery
replayed workflow steps
ambiguous success after network loss

Interview default #

put idempotency at the natural operation boundary
use stable request/event keys
persist dedup truth durably

Key caution #

Do not confuse:

idempotency
optimistic concurrency control
leases

They solve different failure shapes.
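
A minimal sketch of an idempotency key at the operation boundary. `PaymentService` and its fields are hypothetical, and the in-memory dict only stands in for a durable dedup store; in a real system the key check and the side effect would need to be recorded atomically (for example, in one transaction).

```python
class PaymentService:
    def __init__(self):
        self._results = {}  # idempotency key -> stored result (durable in practice)
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, account, amount):
        """Perform the charge once per key; replay the stored result after that."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount))
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


svc = PaymentService()
first = svc.charge("req-123", "acct-1", 50)
retried = svc.charge("req-123", "acct-1", 50)  # client retry after a lost response
print(len(svc.charges))  # 1
```

The retried request returns the original result and performs no second charge. This does not detect conflicting concurrent edits (optimistic concurrency control) or exclusive ownership of work (leases).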

4. Backoff and Jitter #

What it solves #

synchronized retries
cascading overload
retry storms
coordinated reconnects

Archetypes that need it most #

Realtime Fanout
Future Constraint + Claimable Run
Frontier + Claimable Run
Matching / Assignment
Shared Mutable Subject

Why #

These systems can create bursty behavior:

many clients reconnecting
many workers retrying
many scheduled jobs waking together
many users refreshing the same object

Interview default #

always mention jitter with retries
mention staggered reconnects for client-heavy systems

Key caution #

Backoff alone is not enough. Without jitter, clients still synchronize.
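
To illustrate why backoff alone leaves clients synchronized, the sketch below compares the retry delays of 1000 simulated clients on the same attempt number. The function names, base, and cap are assumptions chosen for the example.

```python
import random


def backoff_delay(attempt, base_s=1.0, cap_s=30.0):
    """Capped exponential backoff with no jitter."""
    return min(cap_s, base_s * 2 ** attempt)


def jittered_delay(attempt, rng, base_s=1.0, cap_s=30.0):
    """Full jitter: uniform between 0 and the backoff bound."""
    return rng.uniform(0, backoff_delay(attempt, base_s, cap_s))


rng = random.Random(42)  # seeded only to make the demo repeatable
clients = 1000

plain = {backoff_delay(3) for _ in range(clients)}
jittered = {jittered_delay(3, rng) for _ in range(clients)}

print(len(plain))     # 1: every client wakes at the same instant
print(len(jittered))  # many distinct wake-up times spread over the window
```

Without jitter, all clients that failed together retry together; with jitter, the same load is spread across the backoff window.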

5. Thundering Herd Protection #

What it solves #

many actors hitting same hot object/path at once
cache stampedes
reconnect storms
mass lease reclaims

Archetypes that need it most #

Derived Projection
Search-First Product
Realtime Fanout
Shared Mutable Subject
Time-Bounded Exclusive Allocation
Auction / Competitive Window

Why #

These systems tend to create:

hot keys
hot auctions/events/documents
many simultaneous polls or refreshes
synchronized wakeups after failure

Interview default #

request coalescing
single-flight cache fill
bounded fanout
virtual waiting room if product allows
staggered reconnect / retry

Key caution #

If you only scale stateless servers but keep one hot key/path, the herd problem remains.
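
Request coalescing and single-flight cache fill can be sketched as follows. `SingleFlight` is a hypothetical helper written for this note, not a standard-library class, and it only coalesces callers within one process.

```python
import threading
import time


class SingleFlight:
    """Let one caller per key run the loader; concurrent callers share its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result box)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = loader()
            except Exception as exc:
                box["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()  # follower: wait for the leader's result
        if "error" in box:
            raise box["error"]
        return box["value"]


sf = SingleFlight()
loads = []
results = []


def slow_load():
    loads.append(1)
    time.sleep(0.2)  # stands in for an expensive source read
    return "value"


threads = [
    threading.Thread(target=lambda: results.append(sf.do("hot-key", slow_load)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(loads), len(results))
```

Eight concurrent readers of the same hot key trigger a single load here, because they all arrive while the first load is still in flight.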

6. Rate Limiting #

What it solves #

abusive traffic
noisy neighbors
overload from spikes
unfair resource consumption

Archetypes that need it most #

Current-Value Entity
Search-First Product
Append-Only Child Object
Realtime Fanout
Critical Transaction Process
Matching / Assignment

Why #

These systems face:

public-facing APIs
expensive queries
write floods
hot user/tenant abuse
fairness-sensitive capacity

Interview default #

rate limit at the edge for abuse
limit per tenant/user/object where fairness matters
mention fail-open vs fail-closed explicitly

Fail-open vs fail-closed #

fail closed:
- payments
- expensive writes
- abuse-sensitive actions
fail open:
- some read paths during limiter outage
- non-critical observability endpoints
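
A minimal sketch of a token-bucket limiter plus an explicit fail-open or fail-closed decision for when the limiter itself is unavailable. `TokenBucket`, `admit`, and `BrokenLimiter` are hypothetical names; the token bucket is one common limiter design, swapped in here for illustration.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        """Refill by elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admit(limiter, fail_open):
    """Ask the limiter; if the limiter itself errors, apply the path's policy."""
    try:
        return limiter.allow()
    except Exception:
        return fail_open


class BrokenLimiter:
    """Stands in for a limiter backend that is down."""

    def allow(self):
        raise ConnectionError("limiter unavailable")


print(admit(BrokenLimiter(), fail_open=True))   # True: e.g. a non-critical read path
print(admit(BrokenLimiter(), fail_open=False))  # False: e.g. a payment path
```

Making `fail_open` a required argument forces each call site to state its policy rather than inherit a default.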

7. Circuit Breaking / Load Shedding #

What it solves #

dependency meltdown
cascading failure
expensive low-value requests consuming scarce capacity

Archetypes that need it most #

Critical Transaction Process
Workflow / Lifecycle State
Search-First Product
Realtime Fanout
Derived Projection

Why #

When a downstream dependency fails or slows:

blindly continuing can take down the caller too
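
A minimal single-threaded sketch of a circuit breaker that stops calling a failing dependency. The class name, threshold, and cool-down are assumptions for illustration; production breakers usually add thread safety and rolling failure windows.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast")
            # Cool-down elapsed: fall through and let a trial call run.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of tying up threads and connections on a dependency that is already struggling.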

Interview default #

shed nonessential work first
preserve core correctness path
reject low-priority requests early

Key caution #

Do not shed correctness-critical commits before shedding:

projections
notifications
nonessential read enhancements
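
The shed order above can be sketched as a priority-aware admission check. The `Priority` tiers and the utilization thresholds are assumptions for illustration, not recommended values.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0     # correctness-critical commits
    NORMAL = 1       # ordinary requests
    BEST_EFFORT = 2  # projections, notifications, read enhancements


def should_admit(priority, utilization):
    """Decide admission from a 0..1 load signal; lowest-value work is shed first."""
    if priority is Priority.BEST_EFFORT:
        return utilization < 0.70
    if priority is Priority.NORMAL:
        return utilization < 0.90
    return True  # critical work is never shed by this check


for load in (0.5, 0.8, 0.95):
    admitted = [p.name for p in Priority if should_admit(p, load)]
    print(load, admitted)
```

As load rises, best-effort work drops out first, then normal work, while the core commit path stays admitted.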

8. Queueing / Buffering #

What it solves #

burst smoothing
temporary downstream outage
async decoupling
durable retry

Archetypes that need it most #

Append-Only Child Object
Workflow / Lifecycle State
Derived Projection
Future Constraint + Claimable Run
Frontier + Claimable Run
Critical Transaction Process

Why #

These systems naturally support:

async side effects
replayable work
lag-tolerant projections

Interview default #

buffer noncritical downstream work
keep source truth writes independent when possible

Key caution #

Queueing improves resilience only if:

source truth is durable
consumers are idempotent
lag is observable
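
The three conditions above can be sketched together in one toy model. `BufferedProjection` is hypothetical, and the in-memory list, deque, and set stand in for a durable store, a durable queue, and a persisted dedup table.

```python
from collections import deque


class BufferedProjection:
    def __init__(self):
        self.source = []      # source-of-truth writes (durable in practice)
        self.queue = deque()  # buffered downstream work
        self.projection = {}  # lag-tolerant read model
        self.applied = set()  # event ids already applied (consumer idempotency)

    def write(self, event_id, key, value):
        """The source write succeeds regardless of downstream health."""
        self.source.append((event_id, key, value))
        self.queue.append((event_id, key, value))

    def drain(self, downstream_up=True):
        """Apply buffered events; duplicates are skipped by event id."""
        while downstream_up and self.queue:
            event_id, key, value = self.queue.popleft()
            if event_id in self.applied:
                continue
            self.projection[key] = value
            self.applied.add(event_id)

    def lag(self):
        """Backlog size, exposed so lag can be monitored."""
        return len(self.queue)


bp = BufferedProjection()
bp.write("e1", "user:1", "alice")
bp.write("e1", "user:1", "alice")  # duplicate delivery
bp.drain(downstream_up=False)      # downstream outage: work stays buffered
print(bp.lag())  # 2
bp.drain()
print(bp.lag())  # 0
```

During the simulated outage the writes still land in the source and the backlog is visible; once the consumer recovers, the duplicate is applied only once.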

9. Fallback to Source Truth #

What it solves #

cache/index/projection failure
stale secondary view
missing denormalized read model

Archetypes that need it most #

Current-Value Entity
Relation / Edge
Append-Only Child Object
Derived Projection
Search-First Product

Why #

These archetypes often have:

primary truth store
optional cache/index/projection

Interview default #

if projection fails, fall back to source if load allows
otherwise serve bounded stale result or partial response

Key caution #

Do not assume source fallback is always safe at scale.

Sometimes fallback causes:

DB overload
wider outage
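
A minimal sketch of a guarded fallback, assuming a simple budget that limits how many reads may fall through to the source. `SourceBudget` and `read_with_fallback` are hypothetical names; a real guard would more likely be a concurrency limit or a rate limit.

```python
class SourceBudget:
    """Hypothetical guard that caps how many fallback reads hit the source."""

    def __init__(self, max_fallbacks):
        self.remaining = max_fallbacks

    def try_acquire(self):
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


def read_with_fallback(key, read_projection, read_source, budget, last_known):
    """Return (value, origin): projection, then guarded source, then bounded stale."""
    try:
        return read_projection(key), "projection"
    except Exception:
        pass  # projection unavailable; consider the source
    if budget.try_acquire():
        return read_source(key), "source"
    if key in last_known:
        return last_known[key], "stale"
    return None, "unavailable"


def broken_projection(key):
    raise ConnectionError("index down")


source = {"doc:1": "v2"}
stale = {"doc:1": "v1"}
budget = SourceBudget(max_fallbacks=1)

print(read_with_fallback("doc:1", broken_projection, source.get, budget, stale))
print(read_with_fallback("doc:1", broken_projection, source.get, budget, stale))
```

The first read falls back to the source; the second finds the budget spent and serves the stale copy, which keeps a projection outage from turning into source overload.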

10. Reconciliation #

What it solves #

uncertain external outcomes
drift between systems
missing side effects
pointer/content mismatch

Archetypes that need it most #

Critical Transaction Process
Versioned Namespace + Immutable Content-Addressed Units
Workflow / Lifecycle State
Time-Bounded Exclusive Allocation

Why #

These systems often interact with:

external processors
blob stores
async workers
partially published state

Interview default #

explicit pending/reconciliation state
periodic drift scan
repair from durable source of truth

Key caution #

Reconciliation is not a substitute for correctness controls on the main write path.
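
A periodic drift scan for pending outcomes can be sketched as below. The record shape, state names, and `processor_lookup` callback are assumptions; here the external processor is treated as the durable truth for the payment outcome.

```python
def reconcile(local_payments, processor_lookup):
    """Resolve pending records against the processor; return the ids repaired.

    processor_lookup(payment_id) is assumed to return a final state string,
    or None when the processor has no answer yet.
    """
    repaired = []
    for payment_id, record in local_payments.items():
        if record["state"] != "pending":
            continue  # already resolved on the main write path
        truth = processor_lookup(payment_id)
        if truth is None:
            continue  # still unknown; leave pending for the next scan
        record["state"] = truth
        repaired.append(payment_id)
    return repaired


payments = {
    "p1": {"state": "pending"},
    "p2": {"state": "captured"},
    "p3": {"state": "pending"},
}
processor = {"p1": "captured"}  # p3 has no answer yet

print(reconcile(payments, processor.get))  # ['p1']
```

The scan only repairs records that the main path already marked as pending; it does not replace the checks that put them in that state.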

11. Observability of Resilience #

What it solves #

hidden failure accumulation
silent queue lag
stalled workflows
slow degradation before outage

Archetypes that need it most #

all archetypes, but especially:
- Frontier + Claimable Run
- Future Constraint + Claimable Run
- Derived Projection
- Critical Transaction Process
- Realtime Fanout

Key things to observe #

timeouts
retry rate
dedup/idempotency hit rate
queue lag
consumer lag
dead-letter volume
stale projection age
lease expiration/reclaim counts
cache stampede rate
rate limiter rejections

Interview default #

mention both correctness metrics and liveness metrics
mention lag and backlog, not just errors

Fast Mapping By Failure Shape #

If the problem is hung calls #

Use:

timeout
circuit break
fallback if safe

If the problem is duplicate execution #

Use:

idempotency
guarded transitions
dedup

If the problem is transient dependency failure #

Use:

retries
backoff
jitter
queueing

If the problem is hot-key overload #

Use:

herd protection
rate limiting
coalescing
cache strategy

If the problem is uncertain external outcome #

Use:

pending state
reconciliation
idempotency

If the problem is lagging downstream side effects #

Use:

queueing
replay
lag observability

Most Useful Mechanisms For Interviews #

If you only remember a small subset, remember these:

timeout
retry with exponential backoff and jitter
idempotency key
queue / buffer for async retry
circuit break / load shed
fallback to source truth
reconciliation for uncertain external outcomes
thundering herd protection
rate limiting
resilience observability

These cover a large fraction of system design failure discussions.