
When Infrastructure, When Application, When Layered #

Eleven chapters have laid out the mechanisms, the algorithms, and the failure modes. This final chapter synthesizes them into a decision framework: given a coordination problem, which mechanism, at which layer, with which bridges?

The framework is organized around three questions, answered in order.

Question 1: Is the Error Recoverable? #

The most important question in coordination mechanism selection is not “what mechanism fits?” but “what happens if this goes wrong?”

Unrecoverable errors are errors whose consequence is permanent, uncorrectable data corruption:

  • Two leaders simultaneously in a distributed database (split-brain) — log entries diverge permanently
  • Two nodes simultaneously winning Raft election in the same term — consensus is broken
  • A distributed transaction partially committing — one database records the payment, the other does not record the order

For unrecoverable errors: use infrastructure-level coordination (CP systems). Correctness wins over throughput. Pay the consensus cost. Accept that the minority partition cannot make progress.

Recoverable errors are errors whose consequence is inconvenient but correctable:

  • Two users briefly holding the same seat — one receives an error at checkout, re-selects
  • A payment attempted twice — the second attempt is deduplicated via idempotency key
  • A job scheduled twice — the second execution finds the work already done (idempotent consumer)
  • A Bloom filter false positive — a URL is skipped; crawl quality degrades by 1%

For recoverable errors: use application-level coordination (AP systems). Throughput wins on the fast path. Correctness is enforced at the commitment layer with bridges.

The boundary test: if the failure mode is a duplicate execution, a retried operation, or a temporary inconsistency that corrects itself — recoverable, use the fast path with bridges. If the failure mode is corrupted data that cannot be detected or corrected after the fact — unrecoverable, use consensus.

Question 2: What Is the Concurrency Scale? #

The concurrency scale determines the implementation layer.

Small scale (< 1,000 concurrent coordinators): infrastructure-level coordination is appropriate.

  • Raft leader election: 3–7 nodes voting on a leader
  • etcd distributed lock: dozens of services competing for a coordinator role
  • Kubernetes controller leader: exactly one controller instance active per resource type

At this scale, the throughput cost of consensus is negligible. The simplicity of “one strong consistency system handles coordination” is worth the coordination overhead.

Large scale (> 100,000 concurrent users): application-level coordination is necessary.

  • Ticketmaster: 2 million users attempting to buy 60,000 seats
  • Uber: 100,000 concurrent ride requests being matched to drivers
  • Rate limiter: 10 million requests per second across all users

At this scale, running each coordination operation through Raft consensus would saturate the consensus log. Application-level coordination — Redis SET NX, optimistic locks, idempotency keys — handles the high-throughput fast path.

The crossover region: for coordination problems at 1,000–100,000 concurrent operations, either approach may work. The choice depends on the error cost (recoverable vs unrecoverable) and the latency budget.

Question 3: What Is the Mechanism? #

Given that you know the error type (recoverable/unrecoverable) and the scale, select the mechanism from the taxonomy:

Selection (competition for a finite resource) #

| Scale | Error type | Mechanism |
| --- | --- | --- |
| Small (nodes competing for leadership) | Unrecoverable | Raft + etcd lease + fencing token |
| Large (users competing for a seat/resource) | Recoverable | Redis SET NX + TTL + CAS at commitment |

Implementation pattern (large scale):

  1. Redis SET NX for the intent claim (optimistic, atomic, TTL-bounded)
  2. CAS at the database for the commitment (belt-and-suspenders)
  3. Idempotency key on the payment/charge step
  4. Fencing token if multiple processes may commit for the same intent
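The intent/commitment split in steps 1–2 can be sketched with an in-memory stand-in for Redis SET NX and a database unique constraint (the `IntentStore` class and `commit` function are illustrative stand-ins, not production clients):

```python
import time

class IntentStore:
    """In-memory stand-in for Redis SET NX with TTL (illustration only)."""
    def __init__(self):
        self._claims = {}  # key -> (owner, expires_at)

    def claim(self, key, owner, ttl_s):
        """Atomically claim the key if unclaimed or expired (SET NX semantics)."""
        now = time.monotonic()
        holder = self._claims.get(key)
        if holder is None or holder[1] <= now:
            self._claims[key] = (owner, now + ttl_s)
            return True
        return False

committed = {}  # seat -> owner; stands in for a DB row with a unique constraint

def commit(seat, owner):
    """CAS at the commitment layer: insert only if no row exists yet."""
    if seat in committed:
        return False
    committed[seat] = owner
    return True

store = IntentStore()
assert store.claim("seat:12A", "alice", ttl_s=600)   # alice holds the intent
assert not store.claim("seat:12A", "bob", ttl_s=600) # bob rejected on the fast path
assert commit("seat:12A", "alice")                   # commitment succeeds once
assert not commit("seat:12A", "alice")               # duplicate commit is a no-op
```

The belt-and-suspenders point is that even if the Redis claim expires mid-flight, the commitment layer still rejects a second writer.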

Agreement (all commit or all abort) #

| Scale | Error type | Mechanism |
| --- | --- | --- |
| Within one database | N/A | Local ACID transaction |
| Across two databases (same organization) | Unrecoverable | XA / 2PC with replicated coordinators |
| Across services (microservice boundary) | Recoverable | Saga with compensation + idempotency keys |
| Replicated state machine | Unrecoverable | Raft log replication |

The key distinction: 2PC for operations that must be atomic AND where both parties support XA. Saga for operations where at least one party does not support XA, or where long-held locks are unacceptable.
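A minimal Saga runner makes the compensation semantics concrete. This is a sketch under the assumption that each step pairs an action with a compensation (all names here are hypothetical):

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, run the
    compensations of already-completed steps in reverse order."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for compensate in reversed(completed):
                compensate()
            return False
    return True

log = []

def fail_issuance():
    raise RuntimeError("ticket issuance failed")

steps = [
    (lambda: log.append("charge payment"), lambda: log.append("refund payment")),
    (lambda: log.append("book seat"),      lambda: log.append("cancel booking")),
    (fail_issuance,                        lambda: log.append("void tickets")),
]
assert run_saga(steps) is False
assert log == ["charge payment", "book seat", "cancel booking", "refund payment"]
```

Unlike 2PC, no locks are held across steps; the price is the window of visible partial state before compensation completes.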

Assignment (distributing work) #

| Assignment type | Mechanism |
| --- | --- |
| Data to nodes (unbounded, grows over time) | Consistent hashing + virtual nodes |
| Work to consumers (streams, queues) | Kafka partition assignment (cooperative rebalance) |
| Range queries required | Range-based assignment (CockroachDB, etcd ranges) |
| Load balancing requests | Consistent hashing with bounded load |

Suppression (deduplication) #

| Volume | Mechanism |
| --- | --- |
| Low volume (< 1M operations/day) | Database unique constraint + ON CONFLICT DO NOTHING |
| High volume (< 10B distinct items) | Bloom filter (1% false positive) |
| Event publishing exactly-once | Outbox pattern + idempotent consumer |
| Kafka message deduplication | Idempotent producer (PID + sequence) |

Aggregation (combining values) #

| Accuracy required | Mechanism |
| --- | --- |
| Exact count, low cardinality | Exact counter (Redis INCR, database COUNT) |
| Frequency estimation (heavy hitters) | Count-Min Sketch |
| Cardinality estimation (distinct count) | HyperLogLog |
| Rate limiting (burst allowed) | Token bucket |
| Rate limiting (smooth) | Sliding window |
| Distributed rate limiting | Centralized Redis counter or local + periodic sync |
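As an illustration of the burst-allowing variant, here is a token bucket with injected timestamps (a sketch; a distributed version would keep the token count in Redis rather than in process memory):

```python
class TokenBucket:
    """Allows bursts up to `capacity`; refills at `rate` tokens per second."""
    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2, now=0.0)
assert bucket.allow(0.0)      # burst: first token
assert bucket.allow(0.0)      # burst: second token
assert not bucket.allow(0.0)  # bucket empty
assert bucket.allow(1.0)      # one second later: one token refilled
```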

Convergence (replicas that diverge must reconcile) #

| Data type | Mechanism |
| --- | --- |
| Monotonic counter | G-Counter |
| Increment/decrement counter | PN-Counter |
| Set (add only) | G-Set |
| Set (add and remove) | OR-Set |
| Single value (last write wins) | LWW-Register |
| Collaborative text editing (server-mediated) | OT (server total order) |
| Collaborative text editing (P2P) | RGA / YATA |
| Time-series metric | HyperLogLog (cardinality) / TDigest (percentiles) |
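The convergence property of the simplest entry, the G-Counter, fits in a few lines (a sketch; real implementations persist per-node state durably):

```python
def g_counter_merge(a, b):
    """Merge two G-Counter states: element-wise max per node id.
    Merge is commutative, associative, and idempotent, so replicas converge."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

def g_counter_value(state):
    """The counter's value is the sum over all nodes' local increments."""
    return sum(state.values())

replica_1 = {"n1": 3, "n2": 1}           # n1 saw 3 increments, n2 saw 1
replica_2 = {"n1": 2, "n2": 4, "n3": 1}  # stale n1, fresher n2, new n3

merged = g_counter_merge(replica_1, replica_2)
assert merged == {"n1": 3, "n2": 4, "n3": 1}
assert g_counter_value(merged) == 8
# Merging in the other order gives the same result (commutativity).
assert g_counter_merge(replica_2, replica_1) == merged
```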

The Full Decision Tree #

1. What coordination problem is this?
   ├── Selection (competition) → Q2
   ├── Agreement (atomicity) → Q3
   ├── Assignment (routing) → Q4
   ├── Suppression (deduplication) → Q5
   ├── Aggregation (combining) → Q6
   └── Convergence (reconciliation) → Q7

2. Selection
   Is error unrecoverable (split-brain)?
   ├── Yes → Raft + etcd lease + fencing token
   └── No → Redis SET NX + TTL + CAS commitment + idempotency key

3. Agreement
   Boundary type?
   ├── Single database → Local transaction (BEGIN/COMMIT)
   ├── Cross-database, XA supported → 2PC / XA
   ├── Cross-service, long-lived → Saga + compensation
   └── Replicated state machine → Raft

4. Assignment
   Range queries needed?
   ├── Yes → Range-based (CockroachDB ranges, etcd)
   └── No → Consistent hashing + vnodes

5. Suppression
   Volume and accuracy?
   ├── Low volume, exact → Unique constraint
   ├── High volume, approximate OK → Bloom filter
   └── Event publishing → Outbox + idempotent consumer

6. Aggregation
   What is being aggregated?
   ├── Request rate → Token bucket or sliding window
   ├── Frequency (top-K) → Count-Min Sketch
   └── Distinct count → HyperLogLog

7. Convergence
   Can writes be structured as monotonic increments?
   ├── Yes → CRDT (G-Counter, OR-Set, LWW-Register)
   └── No (text, structured documents) → OT (server-ordered or P2P)

Worked Examples #

Ticketmaster (Seat Reservation) #

Mechanism: Selection (users compete for finite seats)
Error type: Recoverable (double-hold is inconvenient; double-booking is prevented at commitment)
Scale: Large (millions of users)

Implementation:

  1. Virtual waiting room (Sequence number bridge) — metered access, fair FIFO
  2. Redis SET NX with 10-minute TTL (intent layer)
  3. CAS at database: INSERT INTO seat_purchases WHERE NOT EXISTS (commitment layer)
  4. Idempotency key on payment charge (Suppression bridge)
  5. Saga: payment → booking → ticket issuance (Agreement via Saga, not 2PC)

What not to use: Raft for seat assignment (scale mismatch — Raft handles tens of nodes, not millions of concurrent users). etcd for each seat lock (too many keys, wrong granularity).


Payment Processor (Exactly-Once Charge) #

Mechanism: Suppression (prevent double charges on retry)
Error type: Recoverable (duplicate charge is caught by idempotency; may require refund)
Scale: Medium (thousands of concurrent payments)

Implementation:

  1. Client generates idempotency key (UUID per payment intent)
  2. Server checks idempotency store before executing charge
  3. Idempotency key + charge result stored in same transaction
  4. Payment + order creation via Saga (refund compensation if order fails)
  5. Outbox pattern: payment event published to fulfillment service
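Steps 1–3 can be sketched as a check-then-record wrapper. In production the result is written in the same transaction as the charge; here `results` is an in-memory stand-in and all names are illustrative:

```python
results = {}  # idempotency_key -> charge result; stands in for a durable store

def charge_once(key, amount, execute_charge):
    """Return the recorded result if the key was seen; otherwise execute
    the charge and record its outcome under the key."""
    if key in results:
        return results[key]  # retry: replay recorded outcome, no new charge
    result = execute_charge(amount)
    results[key] = result    # in production: same transaction as the charge
    return result

calls = []
def fake_charge(amount):
    calls.append(amount)
    return {"status": "charged", "amount": amount}

r1 = charge_once("key-123", 50, fake_charge)
r2 = charge_once("key-123", 50, fake_charge)  # client retry with the same key
assert r1 == r2
assert len(calls) == 1  # only one real charge executed
```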

What not to use: relying on distributed locks for idempotency (wrong mechanism — idempotency keys are simpler and correct). 2PC across payment + order services (Saga is more appropriate across service boundaries).


Uber Driver Assignment #

Mechanism: Selection (one driver per ride)
Error type: Recoverable (failed assignment means retry with next driver)
Scale: Large (hundreds of thousands of concurrent ride requests)

Implementation:

  1. Geospatial index (H3 hex cells in Redis) for candidate selection
  2. Redis SET NX with 15-second TTL for exclusive offer to driver (intent layer)
  3. Database INSERT with ON CONFLICT for durable assignment (commitment layer)
  4. Idempotency on assignment commit (Suppression bridge)

Distributed Rate Limiter #

Mechanism: Aggregation (count requests across nodes)
Error type: Recoverable (occasional over-counting or under-counting at scale boundary)
Scale: Large (millions of requests per second)

Implementation:

  1. Local counter with local budget (reduces coordination frequency)
  2. Periodic sync to Redis (global count aggregation)
  3. Redis Lua atomic increment for exact counting when local budget exhausted
  4. Sliding window approximation with two time buckets (O(1) per user)
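Step 4's two-bucket approximation weights the previous fixed window by how much of it still overlaps the sliding window (a sketch with hypothetical names):

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_in_window, window):
    """Estimate requests in the last `window` seconds from two fixed buckets.
    The previous bucket contributes only the fraction still inside the window."""
    overlap = (window - elapsed_in_window) / window
    return prev_count * overlap + curr_count

# 15s into a 60s window: 75% of the previous window still counts.
estimate = sliding_window_estimate(prev_count=100, curr_count=20,
                                   elapsed_in_window=15, window=60)
assert estimate == 95.0
```

The estimate assumes requests were spread evenly across the previous window; that approximation is what buys the O(1) state per user.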

Web Crawler URL Deduplication #

Mechanism: Suppression (each URL crawled once)
Error type: Recoverable (Bloom filter false positive means a URL is skipped; acceptable)
Scale: Large (billions of URLs)

Implementation:

  1. Bloom filter in Redis (9.6 bits/URL for 1% false positive rate)
  2. Exact deduplication for high-priority URLs (database unique index)
  3. URL assignment by domain hash (Assignment mechanism) to avoid concurrent crawler overlap
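The 9.6 bits/URL figure in step 1 follows from the standard Bloom filter sizing formulas:

```python
import math

def bloom_params(n_items, fp_rate):
    """Optimal total bits m and hash count k for a Bloom filter:
    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2."""
    m = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    k = (m / n_items) * math.log(2)
    return m / n_items, k  # bits per item, hash functions

bits_per_item, hashes = bloom_params(n_items=1_000_000_000, fp_rate=0.01)
assert 9.5 < bits_per_item < 9.7  # ~9.6 bits/URL at 1% false positives
assert 6.0 < hashes < 7.0         # ~7 hash functions in practice
```

Note that bits per item depends only on the target false-positive rate, not on the item count, which is why the 9.6 bits/URL figure holds at any scale.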

Google Docs Collaborative Editing #

Mechanism: Convergence (concurrent edits from multiple users)
Error type: Recoverable (conflicting edits produce merged document; no data lost)
Scale: Medium (tens of concurrent users per document)

Implementation:

  1. Server-ordered OT (operational transforms with server as total order authority)
  2. Server assigns sequential revision to each operation
  3. Client transforms incoming operations against locally-applied-but-unacknowledged operations
  4. Selection mechanism: cursor position (each user’s cursor is a resource they exclusively own — no coordination needed, cursors are independent)
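The core of step 3 is a transform function. A heavily simplified sketch of the insert-vs-insert case only (real OT also handles deletes and uses a deterministic tiebreak for equal positions):

```python
def transform_position(pos, other_pos, other_len):
    """Shift a local insert position to account for a concurrent insert
    that the server ordered first (insert-vs-insert case only)."""
    if other_pos <= pos:
        return pos + other_len  # the earlier insert pushed our position right
    return pos                  # the earlier insert landed after us; no shift

# Server applied a 3-char insert at position 0 first; our insert at 2 shifts to 5.
assert transform_position(2, 0, 3) == 5
# A concurrent insert after our position leaves it unchanged.
assert transform_position(2, 7, 3) == 2
```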

Anti-Patterns to Avoid #

Using Redis locks for correctness: Redis SET NX is an efficiency lock. Using it as a correctness lock (the final arbiter of whether a payment processes) creates a category of silent data corruption when Redis fails or when the lock holder pauses beyond the TTL. Always back Redis intent locks with database-level constraints.

Using 2PC across service boundaries: 2PC requires all participants to hold locks until the commit decision arrives. For HTTP-based microservices with variable latency, lock hold times can extend into seconds, blocking other requests. Use Saga instead — accept the lack of isolation in exchange for lock-free inter-service coordination.

Building consensus when you need broadcast: If your requirement is “all nodes receive the same updates in the same order,” that is total order broadcast — implemented by Kafka, not Raft. Raft is for agreement on a value among a small set of replicas. Kafka is for ordering and distributing messages at scale. Using etcd as a message bus (writing every event as an etcd key) does not scale.

Using CRDTs when you need Agreement: CRDTs guarantee eventual convergence but not immediate consistency. If your requirement is “these two operations must both succeed or both fail,” CRDTs do not help. You need a transaction or a Saga.

Skipping the deduplication window: Idempotency stores have a retention window (typically 24 hours). Requests with idempotency keys older than the window are treated as new requests. For long-running sagas (days-long workflows), ensure the idempotency key is fresh for each retry, not reused from the original submission.

The Coordination Hierarchy #

The series has traced coordination from the infrastructure substrate to the application layer:

Application Layer
├── Selection (seat reservation, driver assignment)
├── Suppression (idempotency keys, Bloom filter dedup)
├── Aggregation (rate limiting, cardinality, Top-K)
└── Convergence (CRDTs, collaborative editing)
    Bridges: TTL, Idempotency Key, CAS, Fencing, Compensation, Sequence Number
Infrastructure Layer
├── Membership (gossip, Phi accrual, Raft joint consensus)
├── Selection (Raft leader election, etcd leases + fencing)
├── Agreement (Raft log replication, 2PC, Saga)
├── Assignment (consistent hashing, Kafka partition assignment)
└── Ordering (Lamport clocks, vector clocks, total order broadcast)
    Platform: etcd, Cassandra, Kafka, ZooKeeper, Redis Cluster, CockroachDB

Every distributed system, at every layer, is implementing combinations of these six mechanisms. The names change — leader election instead of Selection, ACID transaction instead of Agreement, rate limiter instead of Aggregation — but the underlying coordination problem remains the same.

Naming the mechanism does not make it simpler. But it makes the tradeoffs legible. You can reason about what happens if the mechanism fails, what properties it must satisfy, and what bridges you need between layers. That reasoning — not the choice of Redis vs etcd vs PostgreSQL — is what determines whether your system is correct.