Resilience Mechanisms to Archetypes Mapping #

This note maps cross-cutting resilience mechanisms to the system design archetypes already used in the cheat sheets.

It complements the archetype cheat sheets: those notes are archetype-first, this note is mechanism-first.

The source inspiration here is resilient.txt, especially its themes around:

  • timeouts
  • retries
  • idempotency
  • thundering herds
  • rate limiting
  • resilience observability

How To Use This Note #

For any path:

  1. identify the archetype
  2. identify the failure shape
  3. choose the resilience mechanism that fits
  4. decide whether the path should:
    • fail closed
    • fail open
    • degrade gracefully
    • retry later
    • buffer and replay

1. Timeouts #

What it solves #

  • hung calls
  • long-tail latency
  • stuck downstream dependencies
  • thread/connection pool exhaustion

Archetypes that need it most #

  • Critical Transaction Process
  • Workflow / Lifecycle State
  • Matching / Assignment
  • Future Constraint + Claimable Run
  • Frontier + Claimable Run
  • Shared Mutable Subject

Why #

These archetypes often involve:

  • external calls
  • human-in-the-loop waits
  • claim/lease holding
  • multi-step workflows
  • synchronous dependency chains

Interview default #

  • use timeouts on all remote calls
  • make timeouts shorter than lease/hold durations
  • never let a hung dependency silently consume all capacity

Key caution #

A timeout is not the same as a failure.

Especially for:

  • payments
  • booking
  • external side effects

A timeout often means:

  • the outcome is unknown
  • the operation should move to a pending/reconciliation state
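The "timeout means unknown" rule can be sketched as a three-way classification. This is a minimal sketch, not a production client: `Outcome`, `call_with_timeout`, and `slow_dependency` are hypothetical names introduced only for illustration, and a worker thread stands in for a real remote call.

```python
import concurrent.futures
import time
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PENDING = "pending"  # outcome unknown; needs reconciliation


def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread and classify the result (illustrative only)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        future.result(timeout=timeout_s)
        return Outcome.SUCCEEDED
    except concurrent.futures.TimeoutError:
        # The dependency may still complete the side effect later,
        # so the caller must not record this as a failure.
        return Outcome.PENDING
    except Exception:
        # An explicit error from the dependency is a known failure.
        return Outcome.FAILED
    finally:
        # Do not block the caller waiting for the hung call to finish.
        pool.shutdown(wait=False)


def slow_dependency():
    """Hypothetical dependency that responds too slowly."""
    time.sleep(0.3)
```

A caller that receives `Outcome.PENDING` would persist a pending record and leave the final answer to reconciliation rather than retrying blindly.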

2. Retries #

What it solves #

  • transient failures
  • packet loss
  • temporary dependency unavailability
  • leader failover gaps

Archetypes that need it most #

  • Append-Only Child Object
  • Workflow / Lifecycle State
  • Critical Transaction Process
  • Frontier + Claimable Run
  • Future Constraint + Claimable Run
  • Derived Projection

Why #

These systems frequently have:

  • asynchronous processing
  • retryable side effects
  • queue-based recovery
  • replayable writes

Interview default #

  • retry only transient failures
  • cap retries
  • use exponential backoff
  • add jitter
  • pair retries with idempotency

Key caution #

Retries without idempotency create duplicate effects.
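As a rough illustration of the defaults above, the sketch below retries only a designated transient error type, caps the attempts, and doubles the delay each time. `TransientError` and `retry_transient` are hypothetical names; how a real system classifies transient failures is an assumption left to the caller. Jitter is left out here to keep the example small.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_transient(op, max_attempts=4, base_delay_s=0.1, sleep=time.sleep):
    """Call op, retrying only TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the error
            # 0.1s, 0.2s, 0.4s, ... for attempts 1, 2, 3, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
        # Any other exception propagates immediately: it is not retried.
```

The `sleep` parameter exists only so the delay can be replaced in tests; `op` is assumed to be safe to call more than once.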

3. Idempotency #

What it solves #

  • duplicate requests
  • duplicate deliveries
  • ambiguous retries
  • replay after partial failure

Archetypes that need it most #

  • Critical Transaction Process
  • Workflow / Lifecycle State
  • Append-Only Child Object
  • Relation / Edge
  • Versioned Namespace + Immutable Content-Addressed Units

Why #

These systems commonly suffer from:

  • retried API requests
  • duplicate event delivery
  • replayed workflow steps
  • ambiguous success after network loss

Interview default #

  • put idempotency at the natural operation boundary
  • use stable request/event keys
  • persist dedup truth durably
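A minimal sketch of key-based dedup at the operation boundary follows. `IdempotentHandler` is a hypothetical name, and the in-memory dict is only a stand-in for a durable store; in a real system the key record and the effect would need to be committed atomically.

```python
class IdempotentHandler:
    """Apply an effect at most once per idempotency key (illustrative only)."""

    def __init__(self, apply_effect):
        self._apply_effect = apply_effect
        # Stand-in for a durable table keyed by idempotency key.
        self._results = {}

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # Duplicate request or redelivery: return the recorded result
            # instead of applying the effect again.
            return self._results[idempotency_key]
        result = self._apply_effect(payload)
        # Assumption: in production this write and the effect share one
        # transaction, so a crash cannot separate them.
        self._results[idempotency_key] = result
        return result
```

The key must be stable across retries of the same logical operation, for example a client-generated request ID or an event ID.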

Key caution #

Do not confuse:

  • idempotency
  • optimistic concurrency control
  • leases

They solve different failure shapes.

4. Backoff and Jitter #

What it solves #

  • synchronized retries
  • cascading overload
  • retry storms
  • coordinated reconnects

Archetypes that need it most #

  • Realtime Fanout
  • Future Constraint + Claimable Run
  • Frontier + Claimable Run
  • Matching / Assignment
  • Shared Mutable Subject

Why #

These systems can create bursty behavior:

  • many clients reconnecting
  • many workers retrying
  • many scheduled jobs waking together
  • many users refreshing the same object

Interview default #

  • always mention jitter with retries
  • mention staggered reconnects for client-heavy systems

Key caution #

Backoff alone is not enough. Without jitter, clients still synchronize.
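One common way to add jitter is to pick a uniformly random delay between zero and the capped exponential ceiling (often called "full jitter"). The sketch below assumes that variant; `backoff_with_full_jitter` and its default values are illustrative, not prescribed by this note.

```python
import random


def backoff_with_full_jitter(attempt, base_s=0.1, cap_s=10.0, rng=random):
    """Return a randomized delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    # Randomizing over the whole interval spreads clients that failed at
    # the same moment, so they do not retry in lockstep.
    return rng.uniform(0.0, ceiling)
```

Two clients that fail at the same instant now draw different delays, which is the property plain exponential backoff lacks.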

5. Thundering Herd Protection #

What it solves #

  • many actors hitting the same hot object/path at once
  • cache stampedes
  • reconnect storms
  • mass lease reclaims

Archetypes that need it most #

  • Derived Projection
  • Search-First Product
  • Realtime Fanout
  • Shared Mutable Subject
  • Time-Bounded Exclusive Allocation
  • Auction / Competitive Window

Why #

These systems tend to create:

  • hot keys
  • hot auctions/events/documents
  • many simultaneous polls or refreshes
  • synchronized wakeups after failure

Interview default #

  • request coalescing
  • single-flight cache fill
  • bounded fanout
  • virtual waiting room if product allows
  • staggered reconnect / retry

Key caution #

If you only scale stateless servers but keep one hot key/path, the herd problem remains.
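Single-flight cache fill can be sketched with a lock and one event per in-flight key: the first caller for a key loads it, and concurrent callers wait for that load instead of hitting the backend. `SingleFlightCache` is a hypothetical, in-process illustration with no expiry or eviction.

```python
import threading


class SingleFlightCache:
    """Coalesce concurrent loads of the same key into one loader call."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._inflight = {}  # key -> Event for the load in progress
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake followers whether the load succeeded or failed.
                    with self._lock:
                        del self._inflight[key]
                    event.set()
            # Follower: wait for the leader, then loop to read the cached
            # value (or become the new leader if the load failed).
            event.wait()
```

This protects one process only. Across many servers the same idea needs a shared coordination point, which is why scaling stateless servers alone does not remove the hot key.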

6. Rate Limiting #

What it solves #

  • abusive traffic
  • noisy neighbors
  • overload from spikes
  • unfair resource consumption

Archetypes that need it most #

  • Current-Value Entity
  • Search-First Product
  • Append-Only Child Object
  • Realtime Fanout
  • Critical Transaction Process
  • Matching / Assignment

Why #

These systems face:

  • public-facing APIs
  • expensive queries
  • write floods
  • hot user/tenant abuse
  • fairness-sensitive capacity

Interview default #

  • rate limit at the edge to stop abuse
  • where fairness matters, also limit per tenant/user/object
  • mention fail-open vs fail-closed explicitly

Fail-open vs fail-closed #

  • fail closed:
    • payments
    • expensive writes
    • abuse-sensitive actions
  • fail open:
    • some read paths during limiter outage
    • non-critical observability endpoints
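The sketch below pairs a token bucket with an explicit fail-open/fail-closed policy for limiter outages. The token bucket is one common limiter algorithm chosen here as an assumption; `TokenBucket` and `check_limit` are hypothetical names.

```python
import time


class TokenBucket:
    """Allow bursts up to capacity, refilled at rate_per_s tokens per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self._rate = rate_per_s
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def check_limit(limiter, fail_open):
    """Ask the limiter; if the limiter itself is down, apply the path's policy."""
    try:
        return limiter.allow()
    except Exception:
        # fail_open=True suits some read paths; fail_open=False suits
        # payments, expensive writes, and abuse-sensitive actions.
        return fail_open
```

In practice there would be one bucket per limited dimension (tenant, user, or object), and `fail_open` would be set per path rather than globally.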

7. Circuit Breaking / Load Shedding #

What it solves #

  • dependency meltdown
  • cascading failure
  • expensive low-value requests consuming scarce capacity

Archetypes that need it most #

  • Critical Transaction Process
  • Workflow / Lifecycle State
  • Search-First Product
  • Realtime Fanout
  • Derived Projection

Why #

When a downstream dependency fails or slows:

  • blindly continuing can take down the caller too
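A circuit breaker stops the caller from continuing blindly. The following is a minimal single-threaded sketch with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and arbitrary default thresholds; it is not thread-safe and counts consecutive failures only.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without touching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Open: fail fast so caller threads are not tied up.
                raise CircuitOpenError("dependency circuit is open")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold or self._opened_at is not None:
                # Trip the breaker, or re-open after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, the caller can serve a fallback or reject early instead of queueing behind a dead dependency.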

Interview default #

  • shed nonessential work first
  • preserve core correctness path
  • reject low-priority requests early

Key caution #

Do not shed correctness-critical commits before shedding:

  • projections
  • notifications
  • nonessential read enhancements
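The shedding order above can be expressed as per-priority admission thresholds. This is a sketch under assumed numbers: `Priority`, `SHED_ABOVE`, `should_admit`, and the 0.70 / 0.90 utilization thresholds are all hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0     # correctness-critical commits
    NORMAL = 1       # ordinary user-facing requests
    BEST_EFFORT = 2  # projections, notifications, read enhancements


# Utilization (0.0 to 1.0) at or above which each class is shed.
# CRITICAL has no entry, so this layer never sheds it.
SHED_ABOVE = {
    Priority.BEST_EFFORT: 0.70,
    Priority.NORMAL: 0.90,
}


def should_admit(priority, utilization):
    """Reject lower-priority work first as utilization rises."""
    threshold = SHED_ABOVE.get(priority)
    return threshold is None or utilization < threshold
```

At 80% utilization this admits commits and normal requests but rejects best-effort work early, before it consumes scarce capacity.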

8. Queueing / Buffering #

What it solves #

  • burst smoothing
  • temporary downstream outage
  • async decoupling
  • durable retry

Archetypes that need it most #

  • Append-Only Child Object
  • Workflow / Lifecycle State
  • Derived Projection
  • Future Constraint + Claimable Run
  • Frontier + Claimable Run
  • Critical Transaction Process

Why #

These systems naturally support:

  • async side effects
  • replayable work
  • lag-tolerant projections

Interview default #

  • buffer noncritical downstream work
  • keep source truth writes independent when possible

Key caution #

Queueing improves resilience only if:

  • source truth is durable
  • consumers are idempotent
  • lag is observable
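The three conditions above can be seen in a small buffer-and-replay sketch. `ReplayBuffer` is a hypothetical name, and the in-memory deque is only a stand-in for a durable queue; delivery is at-least-once, so `deliver` is assumed to be idempotent.

```python
import time
from collections import deque


class ReplayBuffer:
    """Hold downstream work during an outage and replay it in order."""

    def __init__(self, clock=time.monotonic):
        self._pending = deque()  # stand-in for a durable queue
        self._clock = clock

    def enqueue(self, event_id, payload):
        self._pending.append((event_id, payload, self._clock()))

    def oldest_age_s(self):
        """Lag signal: age of the oldest undelivered event, 0.0 if empty."""
        if not self._pending:
            return 0.0
        return self._clock() - self._pending[0][2]

    def drain(self, deliver):
        """Deliver in order until one delivery fails; return the count sent."""
        delivered = 0
        while self._pending:
            event_id, payload, _ = self._pending[0]
            try:
                deliver(event_id, payload)
            except Exception:
                break  # downstream still unhealthy; keep the event queued
            # Removed only after success, so a crash between deliver and
            # popleft would redeliver: the consumer must dedup by event_id.
            self._pending.popleft()
            delivered += 1
        return delivered
```

Exporting `oldest_age_s()` as a metric is what makes the lag observable rather than silent.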

9. Fallback to Source Truth #

What it solves #

  • cache/index/projection failure
  • stale secondary view
  • missing denormalized read model

Archetypes that need it most #

  • Current-Value Entity
  • Relation / Edge
  • Append-Only Child Object
  • Derived Projection
  • Search-First Product

Why #

These archetypes often have:

  • primary truth store
  • optional cache/index/projection

Interview default #

  • if projection fails, fall back to source if load allows
  • otherwise serve bounded stale result or partial response

Key caution #

Do not assume source fallback is always safe at scale.

Sometimes fallback causes:

  • DB overload
  • wider outage
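One way to keep source fallback from causing a wider outage is to bound how many fallback reads may run at once and serve a stale value beyond that. The sketch below assumes this approach; `read_with_fallback` and its parameters are hypothetical, with plain callables standing in for the projection and the source store.

```python
import threading


def read_with_fallback(key, projection, source, source_slots, stale_cache):
    """Return (value, origin), where origin names the path that served it."""
    try:
        return projection(key), "projection"
    except Exception:
        pass  # projection/cache/index is down; try the bounded fallback

    # Non-blocking acquire: if all slots are busy, do not pile onto the DB.
    if source_slots.acquire(blocking=False):
        try:
            value = source(key)
            stale_cache[key] = value  # remember the last known good value
            return value, "source"
        finally:
            source_slots.release()

    if key in stale_cache:
        return stale_cache[key], "stale"
    raise LookupError(f"no projection, source capacity, or stale value for {key!r}")
```

Here `source_slots` would be something like `threading.BoundedSemaphore(n)`, with `n` sized to what the source store can absorb during a projection outage.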

10. Reconciliation #

What it solves #

  • uncertain external outcomes
  • drift between systems
  • missing side effects
  • pointer/content mismatch

Archetypes that need it most #

  • Critical Transaction Process
  • Versioned Namespace + Immutable Content-Addressed Units
  • Workflow / Lifecycle State
  • Time-Bounded Exclusive Allocation

Why #

These systems often interact with:

  • external processors
  • blob stores
  • async workers
  • partially published state

Interview default #

  • explicit pending/reconciliation state
  • periodic drift scan
  • repair from durable source of truth

Key caution #

Reconciliation is not a substitute for correctness controls on the main write path.
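A periodic drift scan over pending operations might look like the sketch below. `reconcile_pending` and `lookup_external` are hypothetical; the lookup is assumed to return the authoritative outcome ("succeeded" or "failed"), or `None` while the outcome is still unknown.

```python
def reconcile_pending(local_status, lookup_external):
    """Resolve pending operations against the authoritative record.

    local_status maps operation ID -> "pending" | "succeeded" | "failed".
    Returns the repairs applied as a dict of operation ID -> new status.
    """
    repaired = {}
    for op_id, status in local_status.items():
        if status != "pending":
            continue  # settled operations are left alone by this scan
        truth = lookup_external(op_id)
        if truth is not None:
            repaired[op_id] = truth
        # If truth is None the operation stays pending for the next scan.
    local_status.update(repaired)
    return repaired
```

The scan only repairs what the write path left uncertain; it does not replace the guards that stop a bad write in the first place.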

11. Observability of Resilience #

What it solves #

  • hidden failure accumulation
  • silent queue lag
  • stalled workflows
  • slow degradation before outage

Archetypes that need it most #

  • all archetypes, but especially:
    • Frontier + Claimable Run
    • Future Constraint + Claimable Run
    • Derived Projection
    • Critical Transaction Process
    • Realtime Fanout

Key things to observe #

  • timeout rate
  • retry rate
  • dedup/idempotency hit rate
  • queue lag
  • consumer lag
  • dead-letter volume
  • stale projection age
  • lease expiration/reclaim counts
  • cache stampede rate
  • rate limiter rejections

Interview default #

  • mention both correctness metrics and liveness metrics
  • mention lag and backlog, not just errors
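To make "lag and backlog, not just errors" concrete, the sketch below checks a snapshot of one path against limits. `PathSnapshot`, `unhealthy_signals`, and every threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PathSnapshot:
    error_rate: float              # fraction of requests that failed
    retry_rate: float              # fraction of requests that were retried
    oldest_queued_age_s: float     # queue/consumer lag
    stale_projection_age_s: float  # age of the derived read model


def unhealthy_signals(snap, max_error_rate=0.01, max_retry_rate=0.10,
                      max_queue_age_s=60.0, max_stale_s=300.0):
    """Return the names of the signals that exceed their limits."""
    checks = [
        ("error_rate", snap.error_rate, max_error_rate),
        ("retry_rate", snap.retry_rate, max_retry_rate),
        ("oldest_queued_age_s", snap.oldest_queued_age_s, max_queue_age_s),
        ("stale_projection_age_s", snap.stale_projection_age_s, max_stale_s),
    ]
    return [name for name, value, limit in checks if value > limit]
```

A path with zero errors but a 15-minute-old queue head still reports `oldest_queued_age_s`, which is the slow degradation an error-only dashboard would miss.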

Fast Mapping By Failure Shape #

If the problem is hung calls #

Use:

  • timeout
  • circuit break
  • fallback if safe

If the problem is duplicate execution #

Use:

  • idempotency
  • guarded transitions
  • dedup

If the problem is transient dependency failure #

Use:

  • retries
  • backoff
  • jitter
  • queueing

If the problem is hot-key overload #

Use:

  • herd protection
  • rate limiting
  • coalescing
  • cache strategy

If the problem is uncertain external outcome #

Use:

  • pending state
  • reconciliation
  • idempotency

If the problem is lagging downstream side effects #

Use:

  • queueing
  • replay
  • lag observability

Most Useful Mechanisms For Interviews #

If you only remember a small subset, remember these:

  1. timeout
  2. retry with exponential backoff and jitter
  3. idempotency key
  4. queue / buffer for async retry
  5. circuit break / load shed
  6. fallback to source truth
  7. reconciliation for uncertain external outcomes
  8. thundering herd protection
  9. rate limiting
  10. resilience observability

These cover a large fraction of system design failure discussions.