
Pattern Zoo: Distributed Systems Design Patterns #

A complete taxonomy of patterns organized by the functional requirement they fulfill. Every distributed system is a composition of patterns from these eight categories. Discovered and validated through dry-runs on Web Crawler, YouTube Top K, Dropbox, and Uber.


5+1 Primitives #

All patterns below are compositions of six primitives. If you understand these, you can derive the rest.

| Primitive | What It Does |
| --- | --- |
| Append-only Log | Ordered, immutable sequence of records. The foundation of durability and replication. |
| State Machine | Explicit states + transitions. Converts an ambiguous process into auditable, restartable computation. |
| Hash Partition | Distribute load across N nodes by key hash. Enables horizontal scale at the cost of cross-partition queries. |
| Replication | Copy state to N nodes. Buys fault tolerance and read throughput; sells consistency. |
| Compare-and-Swap | Conditional write: update iff current value matches expected. The universal concurrency primitive (Herlihy consensus number ∞). |
| Clock | Assign causal order to events. Logical (Lamport), vector, or hybrid (HLC). Everything distributed depends on time. |

Translation Layer: FR → Pattern #

Mapping a functional requirement to a pattern is not a lookup — it is a traversal of five discriminant questions. Earlier questions are more load-bearing: getting Q1 wrong produces a broken architecture, while getting Q4 wrong merely produces performance problems.

FR
└── Q1 scope      → narrows to a candidate cluster
    └── Q2 failure → eliminates half the cluster
        └── Q3 data → may collapse to a zero-cost solution
            └── Q4 access → picks among read/write tradeoffs
                └── Q5 coupling → finalizes event and sync boundary

Q1 — Coordination scope: where does contention happen? #

| Scope | Candidate patterns |
| --- | --- |
| Within a data structure | CAS, CRDT |
| Within a service (multiple instances) | Pessimistic Lock, Optimistic Lock, Lease |
| Across services | Saga (decomposable) / 2PC (not decomposable) |
| Across regions / async | Leaderless Replication, CRDT, Gossip |

Q2 — Failure model: what breaks? #

| Failure | Candidate patterns |
| --- | --- |
| Crash-stop | WAL + Checkpoint, Append-only Log, Idempotency Key |
| Network partition | Quorum, CRDT, Leaderless Replication, State Vector Sync |
| Slow degradation (not dead, just slow) | Circuit Breaker, Timeout, Bulkhead |
| Holder crashes while holding a resource | Lease (not Pessimistic Lock) |

Q3 — Data properties: what does the data allow? #

| Property | Implication |
| --- | --- |
| Commutative + associative operations | CRDT — coordination cost drops to zero |
| Immutable after write | Append-only Log, Event Sourcing — replay is safe |
| Content-addressable | Hash = natural idempotency key — dedup is free |
| Ordered by time or key | Range Partition, WAL, Windowing |
| Spatially structured | Spatial Partition |

Q3 can short-circuit the entire traversal. If the data is commutative, reach for CRDT before any lock-based pattern. If it is content-addressable, idempotency is solved at the storage layer without a separate idempotency table.

Q4 — Access pattern: how is it read and written? #

| Access pattern | Candidate patterns |
| --- | --- |
| Read » Write | Cache-Aside, CQRS + Materialized View, Denormalization |
| Write » Read | Hash Partition, Leaderless Replication, Append-only Log |
| Read shape ≠ Write shape | CQRS (separate models) |
| Query is geospatial | Spatial Partition before any other read optimization |
| Query fans across shards | Scatter-Gather |
| Reads are time-bounded | Windowing, TTL, Temporal Decay |

Q5 — Coupling: how tightly must producer and consumer synchronize? #

| Coupling | Candidate patterns |
| --- | --- |
| Synchronous, atomic | 2PC, Pessimistic Lock, Timeout |
| Synchronous, best-effort | Retry + Backoff + Jitter, Circuit Breaker |
| Asynchronous, guaranteed delivery | Outbox + Relay, Message Queue |
| Asynchronous, source-of-truth is the DB row | CDC |
| Fan-out at write time | Fan-out on Write |
| Fan-out at read time | Fan-out on Read |

Worked example: shared document editing #

| Question | Answer | Elimination |
| --- | --- | --- |
| Q1 scope | Multiple users, cross-region | Eliminates all lock-based patterns |
| Q2 failure | Network partition must not block editing | Eliminates OT (requires server round-trip); keeps CRDT |
| Q3 data | Sequence operations, not commutative by default | Requires sequence CRDT (YATA/RGA), not G-Counter |
| Q4 access | Each client holds full replica; reads are local | No read-path pattern needed |
| Q5 coupling | Async sync on reconnect | State Vector Sync for delta exchange |

Result: CRDT (YATA) + State Vector Sync. No locks, no 2PC, no Saga.


FR1 — Write Durable #

How do you ensure a write survives failure?

Append-only Log #

Every write is an append to an immutable, ordered log. Reads reconstruct state by replaying the log. Updates and deletes are new entries, not mutations.

  • When: Event sourcing, audit trail, replication source, undo/redo
  • Levers: Retention period, compaction policy, segment size
  • Failure mode: Log grows unbounded without compaction; replay time increases with log depth

WAL + Checkpoint #

Write-Ahead Log: record intent before applying. Checkpoint: snapshot current state to bound replay cost. Recovery = last checkpoint + WAL replay.

  • When: Any system needing crash recovery without full log replay (PostgreSQL, Flink, etcd)
  • Levers: Checkpoint interval (shorter = faster recovery, more I/O); WAL sync mode (fsync vs group commit)
  • Failure mode: Checkpoint too infrequent → long recovery; too frequent → write amplification
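The recovery rule above — last checkpoint plus WAL replay — can be sketched in a few lines. This is an illustrative in-memory model, not a real storage engine; the `KVStore` name and its methods are assumptions for the example.

```python
class KVStore:
    """Toy key-value store with a write-ahead log and checkpointing."""

    def __init__(self):
        self.state = {}              # live state (lost on crash)
        self.wal = []                # durable append-only list of (lsn, key, value)
        self.checkpoint_state = {}   # durable snapshot
        self.checkpoint_lsn = 0      # WAL position covered by the snapshot

    def put(self, key, value):
        lsn = len(self.wal) + 1
        self.wal.append((lsn, key, value))   # 1. record intent in the WAL
        self.state[key] = value              # 2. then apply to live state

    def checkpoint(self):
        # Snapshot current state so recovery replays only the WAL tail.
        self.checkpoint_state = dict(self.state)
        self.checkpoint_lsn = len(self.wal)

    def recover(self):
        # Recovery = last checkpoint + replay of WAL entries after it.
        state = dict(self.checkpoint_state)
        for lsn, key, value in self.wal[self.checkpoint_lsn:]:
            state[key] = value
        return state
```

The checkpoint-interval lever is visible here: a later checkpoint means a longer `self.wal[self.checkpoint_lsn:]` tail to replay.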

Event Sourcing #

Application state is never stored directly. Only events (facts) are stored. Current state = fold over event stream. Snapshots optionally truncate replay.

  • When: Audit-critical domains (payments, orders, ledgers), CQRS read model rebuild, temporal queries
  • Levers: Snapshot frequency; event schema versioning (schema evolution gap)
  • Failure mode: Event schema changes break replay — requires upcasting or versioned handlers
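"Current state = fold over event stream" is literal. A minimal sketch, assuming a toy account domain (the event names and `apply` reducer are illustrative):

```python
from functools import reduce

def apply(balance, event):
    """Reducer: fold one event into the running state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event {kind!r}")  # replay breaks on unknown schema

def current_state(events, snapshot=0, snapshot_offset=0):
    # A snapshot optionally truncates replay: fold only the events after it.
    return reduce(apply, events[snapshot_offset:], snapshot)

events = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
```

The `ValueError` branch is where the schema-evolution failure mode bites: an event kind the reducer doesn't know breaks replay, hence upcasting or versioned handlers.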

FR2 — Coordinate Concurrency #

How do you prevent conflicting concurrent writes?

Pessimistic Lock #

Acquire exclusive lock before read-modify-write. Other writers block until lock released. Serializable by construction.

  • When: Low-contention, short critical sections; correctness > throughput (inventory reservation, balance debit)
  • Levers: Lock timeout; lock granularity (row vs table vs range)
  • Failure mode: Deadlock (cycle in lock graph); lock timeout = cascading failure under load

Optimistic Lock (OCC) #

Read without locking. Write with version check: UPDATE ... WHERE version = $read_version. Retry on conflict. No locks held during think time.

  • When: High-read, low-conflict workloads; long transactions; distributed systems where locks don’t compose
  • Levers: Retry budget; backoff strategy; conflict rate (if > ~10%, OCC degrades to pessimistic)
  • Failure mode: Starvation under high contention — writers keep losing the version race
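The read-version-check-retry loop can be modeled without SQL; here the row is a plain dict standing in for a database row, and `occ_update` is an illustrative name:

```python
class ConflictError(Exception):
    pass

def occ_update(row, mutate, retries=3):
    """Read-modify-write with a version check; retry on conflict."""
    for _ in range(retries):
        read_version = row["version"]
        new_value = mutate(row["value"])        # think time: no lock held
        if row["version"] == read_version:      # the UPDATE ... WHERE version check
            row["value"] = new_value
            row["version"] += 1
            return row
        # lost the version race: re-read and retry
    raise ConflictError("retry budget exhausted")
```

The retry-budget lever is the `retries` parameter; starvation is exactly the `ConflictError` path firing repeatedly for the same writer.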

CRDT (Conflict-free Replicated Data Type) #

Data structure with a merge function that is commutative, associative, and idempotent. Concurrent writes always merge without conflict. No coordination needed.

  • When: Collaborative editing, distributed counters, shopping carts, presence systems
  • Levers: CRDT type (G-Counter, LWW-Register, OR-Set, YATA sequence); tombstone GC policy
  • Failure mode: Tombstone accumulation (deleted elements remain as metadata); eventual consistency means reads may lag
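A minimal sketch of the simplest CRDT, a G-Counter: one slot per node, merge is element-wise max. Because max is commutative, associative, and idempotent, replicas converge in any merge order.

```python
class GCounter:
    """Grow-only distributed counter; each node increments only its own slot."""

    def __init__(self, node_id, counts=None):
        self.node_id = node_id
        self.counts = dict(counts or {})  # node_id -> count

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        merged = dict(self.counts)
        for node, n in other.counts.items():
            merged[node] = max(merged.get(node, 0), n)
        return GCounter(self.node_id, merged)
```

Sequence CRDTs like YATA are far more involved, but the contract is the same: merge never conflicts.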

Saga #

Long-running transaction decomposed into a sequence of local transactions with compensating actions. If any step fails, run compensations in reverse order.

  • When: Distributed transactions spanning multiple services where 2PC is too expensive or unavailable
  • Levers: Choreography (event-driven) vs orchestration (central coordinator); compensation idempotency
  • Failure mode: Compensation failure (“double fault”); intermediate states visible to concurrent readers
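An orchestrated saga is a short loop: run each local transaction, record its compensation, and on failure unwind in reverse. A minimal sketch (the `run_saga` shape and step names are assumptions for illustration):

```python
def run_saga(steps, state):
    """steps: list of (action, compensation) callables, each taking state.
    Returns True on success; on failure, compensates completed steps in
    reverse order and returns False."""
    done = []
    try:
        for action, compensate in steps:
            action(state)
            done.append(compensate)   # only completed steps get compensated
    except Exception:
        for compensate in reversed(done):
            compensate(state)         # a failure here is the "double fault"
        return False
    return True
```

Note what the sketch does not handle: if a compensation itself raises, the loop aborts mid-unwind — the double-fault failure mode named above.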

Two-Phase Commit (2PC) #

Phase 1 (prepare): coordinator asks all participants to lock and vote yes/no. Phase 2 (commit/abort): coordinator broadcasts decision. All-or-nothing across participants.

  • When: Cross-shard transactions requiring atomicity; distributed databases; XA transactions
  • Levers: Coordinator failure recovery (persistent prepare log); participant timeout
  • Failure mode: Coordinator crashes after prepare, before commit → participants blocked (“in-doubt” state) until coordinator recovers

Compare-and-Swap (CAS) #

Atomic conditional update: write new value only if current value equals expected. Foundation for all lock-free data structures and leader election.

  • When: Driver status claim in dispatch, leader election, optimistic concurrency without version columns
  • Levers: Retry on CAS failure; ABA problem mitigation (add version/stamp)
  • Failure mode: ABA problem: value changes A→B→A; CAS succeeds but state has semantically changed
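The stamp mitigation can be shown directly. This sketch is loosely modeled on Java's `AtomicStampedReference`; the lock stands in for hardware atomicity, and `StampedRef` is an illustrative name:

```python
import threading

class StampedRef:
    """CAS cell with a version stamp to detect A->B->A histories."""

    def __init__(self, value):
        self._lock = threading.Lock()  # stands in for a hardware CAS
        self.value = value
        self.stamp = 0

    def compare_and_swap(self, expected_value, expected_stamp, new_value):
        with self._lock:
            if self.value == expected_value and self.stamp == expected_stamp:
                self.value = new_value
                self.stamp += 1        # every successful swap bumps the stamp
                return True
            return False
```

A plain CAS on the value alone would accept the third swap below; the stamp rejects it.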

Lease #

A time-bound exclusive claim on a resource. The holder has exclusive access for the duration; the lease expires automatically if the holder crashes, releasing the resource without explicit unlock.

  • When: Distributed locks (Redis SETNX + EXPIRE), leader election (etcd lease), driver presence detection, primary shard ownership, any resource that must be exclusively held but safely released on crash
  • Levers: Lease duration (shorter = faster recovery on crash, more renewal overhead); renewal interval (typically lease_duration / 3); fencing token (monotonically increasing generation number to reject stale lease holders)
  • Distinction from Pessimistic Lock: Pessimistic Lock has no expiry — a dead holder blocks forever. Lease has expiry — a dead holder is evicted automatically. Lease = Pessimistic Lock + TTL + fencing token.
  • Failure mode: Clock skew between holder and lease store causes premature expiry — holder believes it still owns the lease but the store has already granted it to another. Fix: use fencing tokens on every resource access, not wall-clock time
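The lease-plus-fencing-token mechanics fit in a small sketch. Time is injected as a parameter so expiry does not depend on the wall clock; `LeaseStore` and `Resource` are illustrative names:

```python
class LeaseStore:
    """Grants time-bound exclusive claims; tokens increase monotonically."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0
        self.token = 0

    def acquire(self, node, now):
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.duration
            self.token += 1        # fencing token: new generation per grant
            return self.token
        return None                # lease still held by someone else

class Resource:
    """Rejects writes carrying a token older than the highest seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False           # stale holder fenced off
        self.highest_token = token
        self.value = value
        return True
```

This is why the fix for clock skew is "fencing tokens on every resource access": even if the old holder believes its lease is alive, its token is stale.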

FR3 — Read Fast #

How do you serve reads with low latency at scale?

Cache-Aside (Lazy Loading) #

Application checks cache first. On miss, reads from DB, populates cache, returns result. Cache is a read-through acceleration layer.

  • When: Read-heavy, write-tolerant workloads; cache can be stale briefly
  • Levers: TTL; eviction policy (LRU, LFU); cache size
  • Failure mode: Cache stampede on cold start or TTL expiry — many simultaneous misses flood the DB. Fix: probabilistic early expiry or request coalescing (single-flight)

CQRS + Materialized View #

Command Query Responsibility Segregation: write path and read path use separate models. Read model is a pre-computed, denormalized view maintained by consuming the write event stream.

  • When: Read shape differs from write shape; high read:write ratio; multiple read models from one write model
  • Levers: View update latency (sync vs async); view rebuild strategy on schema change
  • Failure mode: Read model lag under write burst; view rebuild cost proportional to full event log

Denormalization #

Duplicate data across entities to avoid joins at read time. Embed related data at write time rather than join at query time.

  • When: Join-heavy queries on hot paths; NoSQL stores without join capability
  • Levers: Which fields to embed (high-read, low-churn); update fan-out cost (every embed must update on source change)
  • Failure mode: Stale embedded data if update fan-out fails or is skipped

Scatter-Gather #

Fan a query out to N shards in parallel. Each shard returns a partial result. Coordinator merges partial results and returns the top-K or aggregated answer.

  • When: Distributed search (Elasticsearch), distributed sort, global aggregations across shards
  • Levers: Timeout for slow shard responses; partial result tolerance; merge cost
  • Failure mode: Slow shard blocks the response (“long tail latency”). Fix: hedged requests; timeout with best-effort partial result
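A minimal scatter-gather sketch with best-effort partial results: fan out with a thread pool, collect whatever completes before the timeout, merge to a global top-k. The `scatter_gather` signature and shard representation are assumptions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as GatherTimeout

def scatter_gather(shards, query, k, timeout=1.0):
    """shards: callables mapping query -> list of (score, item) partials."""
    partials = []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(shard, query) for shard in shards]
        try:
            for fut in as_completed(futures, timeout=timeout):
                partials.extend(fut.result())
        except GatherTimeout:
            pass   # tolerate the long tail: return a partial answer
    return heapq.nlargest(k, partials)   # coordinator merge
```

The timeout lever trades completeness for latency: lowering it sheds slow shards from the merged answer instead of blocking on them.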

Spatial Partition #

Partition entities by geographic proximity using a hierarchical cell encoding. Queries become prefix lookups or neighbor enumeration in the cell hierarchy rather than range scans over continuous coordinates.

  • When: Driver matching, surge zone aggregation, geo-search, proximity queries, delivery ETAs
  • Implementations: Geohash (base-32 prefix), H3 (hexagonal hierarchical grid, Uber), S2 (spherical cells, Google), QuadTree (adaptive, good for non-uniform density)
  • Levers: Cell resolution (smaller = precise, more boundary effects; larger = coarser, simpler neighbors); neighbor enumeration depth
  • Failure mode: Cell boundary artifacts — entities near a cell edge are in different cells but physically adjacent. Fix: always query target cell + all neighboring cells

FR4 — Scale Writes #

How do you distribute write load across nodes?

Hash Partition (Consistent Hashing) #

Assign each key to a node by hash(key) mod N or via a consistent hash ring. Writes for a key always go to the same node (or its replicas).

  • When: Horizontally scaling any key-value or document store; Kafka topic partitioning; DynamoDB
  • Levers: Number of vnodes (virtual nodes) on the ring; rebalancing strategy on node add/remove
  • Failure mode: Hot partition — a skewed hot key (or a handful of keys that happen to land on the same node) concentrates load on one node. Fix: composite key with random suffix; adaptive capacity (DynamoDB)
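A minimal consistent-hash ring sketch with virtual nodes; MD5 stands in for any stable hash, and `HashRing` is an illustrative name. The key property: removing a node moves only the keys that lived on it.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []   # sorted (point, node) pairs around the ring
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First vnode clockwise from hash(key); wrap around at the end.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```

More vnodes per node smooths the load distribution at the cost of a larger ring.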

Range Partition #

Divide keyspace into contiguous ranges. Each range assigned to a node. Enables efficient range scans at the cost of potential hotspots at range boundaries.

  • When: Time-series data (partition by time range), ordered data, range queries are primary access pattern
  • Levers: Range split threshold; split strategy (automatic vs manual); range merge on low load
  • Failure mode: Write hotspot on the “latest” range for time-series data — all writes go to the current time partition. Fix: write to multiple partitions with time bucketing; use hash partition for write path and range for read path

Leaderless Replication (Dynamo-style) #

No designated leader. Any replica accepts writes. Quorum reads/writes (W + R > N) ensure overlap. Anti-entropy via gossip or Merkle tree sync.

  • When: High availability > consistency; geographically distributed writes; Cassandra, DynamoDB, Riak
  • Levers: N (replication factor), W (write quorum), R (read quorum); read repair vs background anti-entropy
  • Failure mode: Sloppy quorum during partition: W+R may not overlap with actual current data. Fix: read repair; hinted handoff with bounded staleness window

FR5 — Events Flow #

How do you decouple producers from consumers and propagate state changes?

Message Queue #

Producer enqueues messages; consumers dequeue and process them. Delivery is typically at-least-once, so consumers must be idempotent. The queue provides buffering (the capacitor in the circuit analogy) and rate decoupling between producer and consumer.

  • When: Work distribution, task offloading, rate smoothing between services
  • Levers: Queue depth limit (backpressure); visibility timeout; dead-letter queue threshold
  • Failure mode: Queue depth grows unbounded under sustained overload (capacitor overcharge). Fix: backpressure to producer; drop with DLQ; scale consumers

Pub/Sub #

Publisher emits events to a topic. Multiple subscribers each receive a copy. Fan-out is handled by the broker, not the publisher.

  • When: Notification systems, event-driven microservices, real-time feeds
  • Levers: Delivery guarantee (at-least-once vs exactly-once); subscriber filter expressions; retention period
  • Failure mode: Slow subscriber blocks topic progress in some implementations. Fix: per-subscriber queue with independent offset tracking (Kafka model)

Outbox + Relay (Transactional Outbox) #

Write event to an outbox table in the same DB transaction as the business write. A relay process reads the outbox and publishes to the message broker. Guarantees at-least-once event publication without distributed transaction.

  • When: Any service that must publish an event exactly when a DB write commits (payment confirmed → send email)
  • Levers: Relay polling interval; outbox cleanup after ACK; relay idempotency key
  • Failure mode: Relay falls behind under write burst → event delay. Fix: tail-based relay using WAL (CDC) instead of polling

CDC (Change Data Capture) #

Stream every row change from the DB write-ahead log to downstream consumers. No application-level outbox required.

  • When: Real-time replication to read replicas, search index, cache invalidation, audit log
  • Implementations: Debezium (Kafka Connector reading Postgres/MySQL WAL), DynamoDB Streams
  • Levers: Lag tolerance; schema change handling in consumer; log retention on source DB
  • Failure mode: Schema change on source table breaks CDC consumer — requires schema registry and versioned consumers

Fan-out on Write #

When an event occurs, immediately push it to all subscriber inboxes/feeds at write time. Read is O(1): just read your inbox.

  • When: Social feeds with low follower counts; real-time notifications; Dropbox shared folder change notify
  • Levers: Async vs sync fan-out; batch size; failure handling per recipient
  • Failure mode: Celebrity problem — high-follower accounts make write fan-out O(followers) → write amplification. Fix: hybrid fan-out (fan-out on write for normal users, fan-out on read for celebrities)

Fan-out on Read (Pull on Read) #

Events stored once at the source. Each reader fetches and merges from all followed sources at read time. Write is O(1); read is O(sources).

  • When: High-follower accounts; infrequently read feeds; storage-constrained systems
  • Levers: Read cache TTL; merge strategy; number of sources per reader
  • Failure mode: Read latency grows with number of followed sources. Fix: hybrid fan-out; pre-aggregated timeline cache with async refresh

FR6 — Tolerate Failure #

How do you make the system survive partial failures without cascading?

Idempotency Key #

Client generates a unique key per logical operation. Server stores key + result. On retry, server returns stored result instead of re-executing. Makes at-least-once delivery equivalent to exactly-once processing.

  • When: Payment capture, order submission, any non-idempotent operation over unreliable network
  • Levers: Key TTL; storage backend (Redis for speed, DB for durability); key scope (per-user vs global)
  • Failure mode: Key collision if client reuses keys; key store becomes a hot path. Fix: UUIDv4 keys; async key cleanup
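The store-key-and-result mechanic is a few lines. A minimal sketch with an in-memory dict standing in for Redis or a DB table; `IdempotentExecutor` is an illustrative name:

```python
import uuid

class IdempotentExecutor:
    """Replays the stored result for a seen key instead of re-executing."""

    def __init__(self):
        self.results = {}   # key -> stored result (Redis/DB in practice)

    def execute(self, key, operation):
        if key in self.results:
            return self.results[key]   # retry path: replay, don't re-run
        result = operation()
        self.results[key] = result     # real systems store atomically with the op
        return result
```

The comment on the last write matters: storing the result and performing the operation must be atomic (or the operation itself idempotent), otherwise a crash between them re-executes on retry.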

Retry + Backoff + Jitter #

On transient failure, retry after a delay. Exponential backoff increases delay geometrically. Jitter adds randomness to prevent synchronized retry storms (underdamped oscillation → reconnect storm).

  • When: Any network call, S3 upload, DB connection — the default failure-tolerance pattern
  • Levers: Base delay, max delay, multiplier, jitter range, max retry count
  • Failure mode: Without jitter: synchronized retries spike load exactly when the server is recovering. Without max delay: retries queue indefinitely and exhaust client resources.
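A sketch of the schedule with all five levers visible, using "full jitter" (each delay drawn uniformly from zero up to the capped exponential value) to desynchronize clients:

```python
import random

def backoff_delays(base=0.1, multiplier=2, max_delay=10.0, max_retries=5,
                   rng=random):
    """Return the per-attempt sleep durations for a retry loop."""
    delays = []
    for attempt in range(max_retries):
        capped = min(max_delay, base * multiplier ** attempt)  # exponential, capped
        delays.append(rng.uniform(0, capped))                  # full jitter
    return delays
```

Without the `rng.uniform` line every client computes the identical schedule, which is precisely the synchronized retry spike described above.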

Circuit Breaker #

Track error rate over a sliding window. If error rate exceeds threshold, open the circuit: fail fast without calling the downstream. After a timeout, enter half-open state and probe with one request.

  • When: Protecting against cascading failure when a dependency degrades; payment gateway isolation
  • States: Closed (normal) → Open (fail fast) → Half-open (probe) → Closed
  • Levers: Error rate threshold; window size; open duration; success count to close
  • Failure mode: Miscalibrated threshold trips the circuit on a transient spike → legitimate traffic fails. Fix: trip on a percentile signal (e.g. p99 latency) over a sufficiently wide window rather than on raw error counts
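The three-state machine can be sketched with a simple consecutive-failure counter; real implementations use a sliding window and percentile signals. Time is injected to keep the sketch testable; `CircuitBreaker` here is an assumption, not a particular library's API.

```python
class CircuitBreaker:
    """Closed -> Open (fail fast) -> Half-open (probe) -> Closed."""

    def __init__(self, failure_threshold=3, open_seconds=30):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, now):
        if self.state == "open":
            if now - self.opened_at >= self.open_seconds:
                self.state = "half-open"   # allow one probe through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # probe failed or threshold crossed
                self.opened_at = now
            raise
        self.failures = 0
        self.state = "closed"              # success (or successful probe) closes it
        return result
```

While open, the downstream is never called — that is the fail-fast protection against cascading failure.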

Bulkhead #

Isolate resources (thread pools, connection pools, memory) by caller or service. Failure in one bulkhead doesn’t exhaust resources for others.

  • When: Multi-tenant systems; high-priority vs low-priority traffic separation; protecting core path from batch jobs
  • Pattern: Separate connection pools per downstream service; separate thread pools per tenant tier
  • Failure mode: Under-provisioned bulkhead starves legitimate traffic. Fix: size bulkheads based on measured p99 concurrency, not peak

Timeout #

Every outbound call has a maximum wait time. Caller does not block indefinitely. Timed-out requests are abandoned and counted as errors.

  • When: Every network call — the baseline fault isolation primitive
  • Levers: Timeout value (must be less than the caller’s own timeout → timeout budget propagation)
  • Failure mode: Timeout too long → slow calls hold resources, cascade. Timeout too short → false failures on slow-but-healthy responses. Fix: measure p99 latency, set timeout at ~2–3× p99

FR7 — Nodes Agree #

How do distributed nodes reach consensus or stay in sync?

Leader-Follower (Primary-Replica) #

One node (leader) accepts all writes. Followers replicate from the leader. Reads can go to followers (with staleness). Leader failure triggers election.

  • When: Single-region databases; Kafka partition leadership; Redis Sentinel
  • Levers: Replication mode (sync = no data loss, async = lower latency); election timeout; follower lag threshold
  • Failure mode: Split-brain — both old and new leader accept writes. Fix: fencing token + quorum acknowledgment before stepping down

Quorum #

A write is durable once W of N nodes acknowledge. A read fetches from R of N nodes. Overlap (W + R > N) guarantees at least one node has the latest version.

  • When: Leaderless replication (Cassandra, DynamoDB), Raft log commit, distributed consensus
  • Levers: N, W, R values; tunable consistency (QUORUM vs ONE vs ALL)
  • Failure mode: Sloppy quorums during a partition — writes land on W substitute nodes on one side while reads hit R nodes on the other, with no overlap → stale reads and split-brain. Strict quorums avoid this: W + R > N makes disjoint success sets impossible
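The overlap guarantee is a pigeonhole fact, and small cases can be checked exhaustively. A brute-force sketch (`quorums_overlap` is an illustrative helper):

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True iff every write set of size w intersects every read set of size r
    drawn from n nodes — i.e. reads always see the latest write."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))
```

Intuitively: if W + R ≤ N, a write quorum and a read quorum can occupy disjoint halves of the cluster; once W + R > N they must share at least one node.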

Gossip Protocol #

Nodes periodically exchange state with random peers. Information spreads exponentially (like an epidemic), converging in O(log N) gossip rounds.

  • When: Membership management (Cassandra ring membership), failure detection, configuration propagation, distributed counters
  • Levers: Fanout (peers per round); gossip interval; anti-entropy via Merkle tree comparison
  • Failure mode: Slow convergence under high churn; “false negative” failure detection if gossip packets drop. Fix: suspicion mechanism before declaring node dead

State Vector Sync (Version Vectors) #

Each node maintains a vector of (node_id → sequence_number) representing the latest event seen from each node. Two nodes compare vectors to identify what each is missing and exchange only the delta.

  • When: Distributed sync (Dropbox, CRDTs, Dynamo-style conflict detection), collaborative editing
  • Levers: Vector clock vs hybrid logical clock; delta encoding for large vectors
  • Failure mode: Vector size grows with number of nodes; stale vectors after node removal leave dangling entries. Fix: dotted version vectors; periodic compaction
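Representing the vector as a dict of `node_id -> sequence_number`, the compare-and-exchange step is short. A minimal sketch (`missing_from` and `merge` are illustrative names):

```python
def missing_from(local, remote):
    """What the remote has seen that the local vector hasn't:
    {node: (local_seq, remote_seq)} — the delta ranges to request."""
    gaps = {}
    for node, remote_seq in remote.items():
        local_seq = local.get(node, 0)
        if remote_seq > local_seq:
            gaps[node] = (local_seq, remote_seq)
    return gaps

def merge(local, remote):
    """After exchanging deltas, both sides converge to the element-wise max."""
    return {node: max(local.get(node, 0), remote.get(node, 0))
            for node in set(local) | set(remote)}
```

Only the gap ranges travel over the wire — that is the "exchange only the delta" property.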

Merkle Tree #

Hash tree where each leaf is a block of data and each internal node is a hash of its children. Two nodes compare root hashes; if equal, data is identical. If different, binary search down the tree to find divergent ranges.

  • When: Efficient anti-entropy between replicas (Cassandra, Dynamo), blockchain integrity, Git objects, S3 multi-part checksum
  • Levers: Tree depth (log₂ N levels); leaf block size; rehash cost on update
  • Failure mode: Hot write path invalidates Merkle tree root on every write → expensive rehash. Fix: batch updates before rehashing; async background Merkle tree rebuild
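The compare-roots-then-descend search can be sketched over fixed-size leaf blocks. This assumes both replicas have the same leaf count; `build_tree` and `diff_leaves` are illustrative names.

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).hexdigest()

def build_tree(blocks):
    """tree[0] = leaf hashes, tree[-1] = [root]; odd nodes pair with themselves."""
    level = [_h(b) for b in blocks]
    tree = [level]
    while len(level) > 1:
        level = [_h((level[i] + (level[i + 1] if i + 1 < len(level) else level[i]))
                    .encode())
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def diff_leaves(tree_a, tree_b):
    """Indices of differing leaf blocks, touching only divergent subtrees."""
    if tree_a[-1] == tree_b[-1]:
        return []                    # equal roots prove equality in O(1)
    suspects = [0]                   # start at the root
    for depth in range(len(tree_a) - 2, -1, -1):
        next_suspects = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if (child < len(tree_a[depth])
                        and tree_a[depth][child] != tree_b[depth][child]):
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects
```

Two replicas that differ in one block compare O(log N) hashes instead of N blocks — the whole point of anti-entropy via Merkle trees.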

FR8 — Time and Approximation #

How do you reason about time, handle late data, and approximate at scale?

Windowing #

Divide an infinite event stream into finite, bounded chunks for aggregation. Three types: tumbling (non-overlapping fixed intervals), sliding (overlapping intervals), session (gap-based).

  • When: Metrics aggregation, rate limiting, analytics, surge demand calculation
  • Levers: Window size; slide interval (sliding); session gap timeout; allowed lateness
  • Failure mode: Late events arrive after window closes → missed from aggregation. Fix: allowed lateness with watermark; side output for late events
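A tumbling-window sketch with a watermark and a side output for late events. The watermark here is simply the highest timestamp seen, and a window closes once the watermark passes its end plus the allowed lateness; `tumble` is an illustrative name.

```python
from collections import defaultdict

def tumble(events, size, allowed_lateness=0):
    """events: (timestamp, value) pairs in arrival order.
    Returns ({window_start: [values]}, late_events)."""
    windows = defaultdict(list)
    late = []
    watermark = 0
    for ts, value in events:
        watermark = max(watermark, ts)           # event-time watermark
        window_start = (ts // size) * size       # non-overlapping fixed interval
        if window_start + size + allowed_lateness <= watermark:
            late.append((ts, value))             # side output: window already closed
        else:
            windows[window_start].append(value)
    return dict(windows), late
```

Sliding and session windows change only the assignment rule; the watermark/lateness mechanics stay the same.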

TTL (Time-to-Live) #

Entries expire automatically after a fixed duration. Expiry is handled by the store (Redis, DynamoDB, DNS TTL) without application logic.

  • When: Cache expiry, session invalidation, driver presence detection, version retention cleanup, DNS caching
  • Levers: TTL duration; whether expiry is lazy (on access) or eager (background sweep)
  • Failure mode: TTL too short → excessive cache misses. TTL too long → stale data served. Jitter on TTL prevents cache stampede (add ±10% randomness).

Approximate Counting #

Use probabilistic data structures instead of exact counts when memory or coordination cost of exact counting is prohibitive.

  • Structures: Count-Min Sketch (frequency estimation; overestimates by at most εN with tunable confidence), HyperLogLog (cardinality estimation, ~1.6% error, ~1.5KB for 10⁹ elements), Bloom Filter (membership test, no false negatives)
  • When: Top-K trending, unique visitor counts, spam filter membership, cache existence check before DB query
  • Levers: Error tolerance ε; confidence level δ; hash function count
  • Failure mode: Count-Min saturation — heavy hitters pollute frequency estimates for rare items. Fix: Count-Min with conservative update; separate heavy-hitter tracking

Staged Rollout #

Deploy a change to a small percentage of users/servers, monitor, expand incrementally. Limits blast radius of bugs.

  • When: Feature releases, infrastructure migrations, ML model updates, client app releases
  • Levers: Rollout percentage schedule; metric thresholds for automatic pause/rollback; canary population selection
  • Failure mode: Canary population is unrepresentative (e.g., internal users only) → issue missed until full rollout. Fix: random sampling from production traffic; monitor tail latency not just error rate

Scheduled Trigger #

A periodic job that runs at fixed intervals to perform maintenance, aggregation, or cleanup that would be too expensive inline.

  • When: Leaderboard snapshot, billing cycle, report generation, cache warming, anti-entropy reconciliation
  • Levers: Schedule frequency; idempotency of job (must be safe to re-run); job overlap prevention (distributed lock)
  • Failure mode: Job overlap when previous run exceeds schedule interval. Fix: distributed lock with TTL; skip if lock not acquired; alert on overlap

Temporal Decay #

Score or weight that decreases automatically as a function of time elapsed since the event. Ensures recent events dominate rankings without requiring explicit deletion of old entries.

  • When: Trending topics (Reddit hot sort), recommendation recency weighting, fraud signal decay, YouTube Top K video ranking
  • Formula examples: Reddit: score / (age_hours + 2)^gravity; Exponential: score × e^(-λt)
  • Levers: Decay rate λ; time granularity; floor value (minimum score to prevent numerical underflow)
  • Failure mode: Decay too fast → viral content disappears before it surfaces. Decay too slow → old viral content permanently occupies top-K slots. Fix: tune λ against measured content half-life in your domain
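Both formulas from the table are one-liners; a sketch to make the tuning concrete (the constants are the levers named above):

```python
import math

def reddit_hot(score, age_hours, gravity=1.8):
    """Reddit-style hot ranking: score / (age_hours + 2)^gravity."""
    return score / (age_hours + 2) ** gravity

def exponential_decay(score, t, lam):
    """Exponential decay: score * e^(-lambda * t); half-life = ln(2)/lambda."""
    return score * math.exp(-lam * t)
```

The half-life relation `ln(2)/λ` is the handle for the tuning advice above: measure how long content stays relevant in your domain and solve for λ.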

Composition Rules #

Serial Composition #

Patterns chain output-to-input: Outbox → Message Queue → CDC → Materialized View. Each adds a guarantee or capability. Latency adds; throughput is bounded by the slowest stage.

Parallel Composition #

Patterns run side-by-side: Hash Partition applied simultaneously to write path and read path. Throughput multiplies; correctness requires each path to be independently consistent.

XOR Composition #

Choose one pattern or another based on a condition: Fan-out on Write for normal users, Fan-out on Read for celebrities. The hybrid is a selector, not a combination.


Known Gaps #

| Gap | Why Out of Scope |
| --- | --- |
| Schema Evolution / Expand-Contract | Needed for live schema migrations; domain-specific to relational stores |
| Spatial Partition — 3D / indoor | H3/S2 handle Earth surface; 3D (drone routing, indoor navigation) requires different primitives |
| Byzantine Fault Tolerance (BFT) | Required for blockchain/consensus under adversarial nodes; out of scope for consumer distributed systems |
