
When Infrastructure, When Application, When Layered #

Eleven chapters have laid out the mechanisms, the algorithms, and the failure modes. This final chapter synthesizes them into a decision framework: given a coordination problem, which mechanism, at which layer, with which bridges?

The framework is organized around three questions, answered in order.

Question 1: Is the Error Recoverable? #

The most important question in coordination mechanism selection is not “what mechanism fits?” but “what happens if this goes wrong?”

Unrecoverable errors are errors whose consequence is permanent, uncorrectable data corruption:

  • Two leaders simultaneously in a distributed database (split-brain) — log entries diverge permanently
  • Two nodes simultaneously winning Raft election in the same term — consensus is broken
  • A distributed transaction partially committing — one database records the payment, the other does not record the order

For unrecoverable errors: use infrastructure-level coordination (CP systems). Correctness wins over throughput. Pay the consensus cost. Accept that the minority partition cannot make progress.

Recoverable errors are errors whose consequence is inconvenient but correctable:

  • Two users briefly holding the same seat — one receives an error at checkout, re-selects
  • A payment attempted twice — the second attempt is deduplicated via idempotency key
  • A job scheduled twice — the second execution finds the work already done (idempotent consumer)
  • A Bloom filter false positive — a URL is skipped; crawl quality degrades by 1%

For recoverable errors: use application-level coordination (AP systems). Throughput wins on the fast path. Correctness is enforced at the commitment layer with bridges.

The boundary test: if the failure mode is a duplicate execution, a retried operation, or a temporary inconsistency that corrects itself — recoverable, use the fast path with bridges. If the failure mode is corrupted data that cannot be detected or corrected after the fact — unrecoverable, use consensus.

Question 2: What Is the Concurrency Scale? #

The concurrency scale determines the implementation layer.

Small scale (< 1,000 concurrent coordinators): infrastructure-level coordination is appropriate.

  • Raft leader election: 3–7 nodes voting on a leader
  • etcd distributed lock: dozens of services competing for a coordinator role
  • Kubernetes controller leader: exactly one controller instance active per resource type

At this scale, the throughput cost of consensus is negligible. The simplicity of “one strong consistency system handles coordination” is worth the coordination overhead.

Large scale (> 100,000 concurrent users): application-level coordination is necessary.

  • Ticketmaster: 2 million users attempting to buy 60,000 seats
  • Uber: 100,000 concurrent ride requests being matched to drivers
  • Rate limiter: 10 million requests per second across all users

At this scale, running each coordination operation through Raft consensus would saturate the consensus log. Application-level coordination — Redis SET NX, optimistic locks, idempotency keys — handles the high-throughput fast path.

The crossover region: for coordination problems at 1,000–100,000 concurrent operations, either approach may work. The choice depends on the error cost (recoverable vs unrecoverable) and the latency budget.

Question 3: What Is the Mechanism? #

Given that you know the error type (recoverable/unrecoverable) and the scale, select the mechanism from the taxonomy:

Selection (competition for a finite resource) #

| Scale | Error type | Mechanism |
| --- | --- | --- |
| Small (nodes competing for leadership) | Unrecoverable | Raft + etcd lease + fencing token |
| Large (users competing for a seat/resource) | Recoverable | Redis SET NX + TTL + CAS at commitment |

Implementation pattern (large scale):

  1. Redis SET NX for the intent claim (optimistic, atomic, TTL-bounded)
  2. CAS at the database for the commitment (belt-and-suspenders)
  3. Idempotency key on the payment/charge step
  4. Fencing token if multiple processes may commit for the same intent
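The intent/commitment split in steps 1–2 can be sketched with an in-memory stand-in for Redis SET NX and a database unique constraint (the `IntentStore` class and `commit` function are illustrative stand-ins, not production clients):

```python
import time

class IntentStore:
    """In-memory stand-in for Redis SET NX with TTL (illustration only)."""
    def __init__(self):
        self._claims = {}  # key -> (owner, expires_at)

    def claim(self, key, owner, ttl_s):
        """Atomically claim the key if unclaimed or expired (SET NX semantics)."""
        now = time.monotonic()
        holder = self._claims.get(key)
        if holder is None or holder[1] <= now:
            self._claims[key] = (owner, now + ttl_s)
            return True
        return False

committed = {}  # seat -> owner; stands in for a DB row with a unique constraint

def commit(seat, owner):
    """CAS at the commitment layer: insert only if no row exists yet."""
    if seat in committed:
        return False
    committed[seat] = owner
    return True

store = IntentStore()
assert store.claim("seat:12A", "alice", ttl_s=600)   # alice holds the intent
assert not store.claim("seat:12A", "bob", ttl_s=600) # bob rejected on the fast path
assert commit("seat:12A", "alice")                   # commitment succeeds once
assert not commit("seat:12A", "alice")               # duplicate commit is a no-op
```

The belt-and-suspenders point is that even if the Redis claim expires mid-flight, the commitment layer still rejects a second writer.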

Agreement (all commit or all abort) #

| Scale | Error type | Mechanism |
| --- | --- | --- |
| Within one database | N/A | Local ACID transaction |
| Across two databases (same organization) | Unrecoverable | XA / 2PC with replicated coordinators |
| Across services (microservice boundary) | Recoverable | Saga with compensation + idempotency keys |
| Replicated state machine | Unrecoverable | Raft log replication |

The key distinction: 2PC for operations that must be atomic AND where both parties support XA. Saga for operations where at least one party does not support XA, or where long-held locks are unacceptable.
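A minimal Saga runner makes the compensation semantics concrete. This is a sketch under the assumption that each step pairs an action with a compensation (all names here are hypothetical):

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, run the
    compensations of already-completed steps in reverse order."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for compensate in reversed(completed):
                compensate()
            return False
    return True

log = []

def fail_issuance():
    raise RuntimeError("ticket issuance failed")

steps = [
    (lambda: log.append("charge payment"), lambda: log.append("refund payment")),
    (lambda: log.append("book seat"),      lambda: log.append("cancel booking")),
    (fail_issuance,                        lambda: log.append("void tickets")),
]
assert run_saga(steps) is False
assert log == ["charge payment", "book seat", "cancel booking", "refund payment"]
```

Unlike 2PC, no locks are held across steps; the price is the window of visible partial state before compensation completes.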

Assignment (distributing work) #

| Assignment type | Mechanism |
| --- | --- |
| Data to nodes (unbounded, grows over time) | Consistent hashing + virtual nodes |
| Work to consumers (streams, queues) | Kafka partition assignment (cooperative rebalance) |
| Range queries required | Range-based assignment (CockroachDB, etcd ranges) |
| Load balancing requests | Consistent hashing with bounded load |

Suppression (deduplication) #

| Volume | Mechanism |
| --- | --- |
| Low volume (< 1M operations/day) | Database unique constraint + ON CONFLICT DO NOTHING |
| High volume (< 10B distinct items) | Bloom filter (1% false positive) |
| Event publishing exactly-once | Outbox pattern + idempotent consumer |
| Kafka message deduplication | Idempotent producer (PID + sequence) |

Aggregation (combining values) #

| Accuracy required | Mechanism |
| --- | --- |
| Exact count, low cardinality | Exact counter (Redis INCR, database COUNT) |
| Frequency estimation (heavy hitters) | Count-Min Sketch |
| Cardinality estimation (distinct count) | HyperLogLog |
| Rate limiting (burst allowed) | Token bucket |
| Rate limiting (smooth) | Sliding window |
| Distributed rate limiting | Centralized Redis counter or local + periodic sync |
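As an illustration of the burst-allowing variant, here is a token bucket with injected timestamps (a sketch; a distributed version would keep the token count in Redis rather than in process memory):

```python
class TokenBucket:
    """Allows bursts up to `capacity`; refills at `rate` tokens per second."""
    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2, now=0.0)
assert bucket.allow(0.0)      # burst: first token
assert bucket.allow(0.0)      # burst: second token
assert not bucket.allow(0.0)  # bucket empty
assert bucket.allow(1.0)      # one second later: one token refilled
```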

Convergence (replicas that diverge must reconcile) #

| Data type | Mechanism |
| --- | --- |
| Monotonic counter | G-Counter |
| Increment/decrement counter | PN-Counter |
| Set (add only) | G-Set |
| Set (add and remove) | OR-Set |
| Single value (last write wins) | LWW-Register |
| Collaborative text editing (server-mediated) | OT (server total order) |
| Collaborative text editing (P2P) | RGA / YATA |
| Time-series metric | HyperLogLog (cardinality) / TDigest (percentiles) |
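The convergence property of the simplest entry, the G-Counter, fits in a few lines (a sketch; real implementations persist per-node state durably):

```python
def g_counter_merge(a, b):
    """Merge two G-Counter states: element-wise max per node id.
    Merge is commutative, associative, and idempotent, so replicas converge."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

def g_counter_value(state):
    """The counter's value is the sum over all nodes' local increments."""
    return sum(state.values())

replica_1 = {"n1": 3, "n2": 1}           # n1 saw 3 increments, n2 saw 1
replica_2 = {"n1": 2, "n2": 4, "n3": 1}  # stale n1, fresher n2, new n3

merged = g_counter_merge(replica_1, replica_2)
assert merged == {"n1": 3, "n2": 4, "n3": 1}
assert g_counter_value(merged) == 8
# Merging in the other order gives the same result (commutativity).
assert g_counter_merge(replica_2, replica_1) == merged
```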

The Full Decision Tree #

1. What coordination problem is this?
   ├── Selection (competition) → Q2
   ├── Agreement (atomicity) → Q3
   ├── Assignment (routing) → Q4
   ├── Suppression (deduplication) → Q5
   ├── Aggregation (combining) → Q6
   └── Convergence (reconciliation) → Q7

2. Selection
   Is error unrecoverable (split-brain)?
   ├── Yes → Raft + etcd lease + fencing token
   └── No → Redis SET NX + TTL + CAS commitment + idempotency key

3. Agreement
   Boundary type?
   ├── Single database → Local transaction (BEGIN/COMMIT)
   ├── Cross-database, XA supported → 2PC / XA
   ├── Cross-service, long-lived → Saga + compensation
   └── Replicated state machine → Raft

4. Assignment
   Range queries needed?
   ├── Yes → Range-based (CockroachDB ranges, etcd)
   └── No → Consistent hashing + vnodes

5. Suppression
   Volume and accuracy?
   ├── Low volume, exact → Unique constraint
   ├── High volume, approximate OK → Bloom filter
   └── Event publishing → Outbox + idempotent consumer

6. Aggregation
   What is being aggregated?
   ├── Request rate → Token bucket or sliding window
   ├── Frequency (top-K) → Count-Min Sketch
   └── Distinct count → HyperLogLog

7. Convergence
   Can writes be structured as monotonic increments?
   ├── Yes → CRDT (G-Counter, OR-Set, LWW-Register)
   └── No (text, structured documents) → OT (server-ordered or P2P)

Worked Examples #

Ticketmaster (Seat Reservation) #

Mechanism: Selection (users compete for finite seats)
Error type: Recoverable (double-hold is inconvenient; double-booking is prevented at commitment)
Scale: Large (millions of users)

Implementation:

  1. Virtual waiting room (Sequence number bridge) — metered access, fair FIFO
  2. Redis SET NX with 10-minute TTL (intent layer)
  3. CAS at database: INSERT INTO seat_purchases WHERE NOT EXISTS (commitment layer)
  4. Idempotency key on payment charge (Suppression bridge)
  5. Saga: payment → booking → ticket issuance (Agreement via Saga, not 2PC)

What not to use: Raft for seat assignment (scale mismatch — Raft handles tens of nodes, not millions of concurrent users). etcd for each seat lock (too many keys, wrong granularity).


Payment Processor (Exactly-Once Charge) #

Mechanism: Suppression (prevent double charges on retry)
Error type: Recoverable (duplicate charge is caught by idempotency; may require refund)
Scale: Medium (thousands of concurrent payments)

Implementation:

  1. Client generates idempotency key (UUID per payment intent)
  2. Server checks idempotency store before executing charge
  3. Idempotency key + charge result stored in same transaction
  4. Payment + order creation via Saga (refund compensation if order fails)
  5. Outbox pattern: payment event published to fulfillment service
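Steps 1–3 can be sketched as a check-then-record wrapper. In production the result is written in the same transaction as the charge; here `results` is an in-memory stand-in and all names are illustrative:

```python
results = {}  # idempotency_key -> charge result; stands in for a durable store

def charge_once(key, amount, execute_charge):
    """Return the recorded result if the key was seen; otherwise execute
    the charge and record its outcome under the key."""
    if key in results:
        return results[key]  # retry: replay recorded outcome, no new charge
    result = execute_charge(amount)
    results[key] = result    # in production: same transaction as the charge
    return result

calls = []
def fake_charge(amount):
    calls.append(amount)
    return {"status": "charged", "amount": amount}

r1 = charge_once("key-123", 50, fake_charge)
r2 = charge_once("key-123", 50, fake_charge)  # client retry with the same key
assert r1 == r2
assert len(calls) == 1  # only one real charge executed
```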

What not to use: relying on distributed locks for idempotency (wrong mechanism — idempotency keys are simpler and correct). 2PC across payment + order services (Saga is more appropriate across service boundaries).


Uber Driver Assignment #

Mechanism: Selection (one driver per ride)
Error type: Recoverable (failed assignment means retry with next driver)
Scale: Large (hundreds of thousands of concurrent ride requests)

Implementation:

  1. Geospatial index (H3 hex cells in Redis) for candidate selection
  2. Redis SET NX with 15-second TTL for exclusive offer to driver (intent layer)
  3. Database INSERT with ON CONFLICT for durable assignment (commitment layer)
  4. Idempotency on assignment commit (Suppression bridge)

Distributed Rate Limiter #

Mechanism: Aggregation (count requests across nodes)
Error type: Recoverable (occasional over-counting or under-counting at scale boundary)
Scale: Large (millions of requests per second)

Implementation:

  1. Local counter with local budget (reduces coordination frequency)
  2. Periodic sync to Redis (global count aggregation)
  3. Redis Lua atomic increment for exact counting when local budget exhausted
  4. Sliding window approximation with two time buckets (O(1) per user)
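Step 4's two-bucket approximation weights the previous fixed window by how much of it still overlaps the sliding window (a sketch with hypothetical names):

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_in_window, window):
    """Estimate requests in the last `window` seconds from two fixed buckets.
    The previous bucket contributes only the fraction still inside the window."""
    overlap = (window - elapsed_in_window) / window
    return prev_count * overlap + curr_count

# 15s into a 60s window: 75% of the previous window still counts.
estimate = sliding_window_estimate(prev_count=100, curr_count=20,
                                   elapsed_in_window=15, window=60)
assert estimate == 95.0
```

The estimate assumes requests were spread evenly across the previous window; that approximation is what buys the O(1) state per user.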

Web Crawler URL Deduplication #

Mechanism: Suppression (each URL crawled once)
Error type: Recoverable (Bloom filter false positive means a URL is skipped; acceptable)
Scale: Large (billions of URLs)

Implementation:

  1. Bloom filter in Redis (9.6 bits/URL for 1% false positive rate)
  2. Exact deduplication for high-priority URLs (database unique index)
  3. URL assignment by domain hash (Assignment mechanism) to avoid concurrent crawler overlap
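The 9.6 bits/URL figure in step 1 follows from the standard Bloom filter sizing formulas:

```python
import math

def bloom_params(n_items, fp_rate):
    """Optimal total bits m and hash count k for a Bloom filter:
    m = -n * ln(p) / (ln 2)^2,  k = (m / n) * ln 2."""
    m = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    k = (m / n_items) * math.log(2)
    return m / n_items, k  # bits per item, hash functions

bits_per_item, hashes = bloom_params(n_items=1_000_000_000, fp_rate=0.01)
assert 9.5 < bits_per_item < 9.7  # ~9.6 bits/URL at 1% false positives
assert 6.0 < hashes < 7.0         # ~7 hash functions in practice
```

Note that bits per item depends only on the target false-positive rate, not on the item count, which is why the 9.6 bits/URL figure holds at any scale.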

Google Docs Collaborative Editing #

Mechanism: Convergence (concurrent edits from multiple users)
Error type: Recoverable (conflicting edits produce merged document; no data lost)
Scale: Medium (tens of concurrent users per document)

Implementation:

  1. Server-ordered OT (operational transforms with server as total order authority)
  2. Server assigns sequential revision to each operation
  3. Client transforms incoming operations against locally-applied-but-unacknowledged operations
  4. Selection mechanism: cursor position (each user’s cursor is a resource they exclusively own — no coordination needed, cursors are independent)
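The core of step 3 is a transform function. A heavily simplified sketch of the insert-vs-insert case only (real OT also handles deletes and uses a deterministic tiebreak for equal positions):

```python
def transform_position(pos, other_pos, other_len):
    """Shift a local insert position to account for a concurrent insert
    that the server ordered first (insert-vs-insert case only)."""
    if other_pos <= pos:
        return pos + other_len  # the earlier insert pushed our position right
    return pos                  # the earlier insert landed after us; no shift

# Server applied a 3-char insert at position 0 first; our insert at 2 shifts to 5.
assert transform_position(2, 0, 3) == 5
# A concurrent insert after our position leaves it unchanged.
assert transform_position(2, 7, 3) == 2
```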

Anti-Patterns to Avoid #

Using Redis locks for correctness: Redis SET NX is an efficiency lock. Using it as a correctness lock (the final arbiter of whether a payment processes) creates a category of silent data corruption when Redis fails or when the lock holder pauses beyond the TTL. Always back Redis intent locks with database-level constraints.

Using 2PC across service boundaries: 2PC requires all participants to hold locks until the commit decision arrives. For HTTP-based microservices with variable latency, lock hold times can extend into seconds, blocking other requests. Use Saga instead — accept the lack of isolation in exchange for lock-free inter-service coordination.

Building consensus when you need broadcast: If your requirement is “all nodes receive the same updates in the same order,” that is total order broadcast — implemented by Kafka, not Raft. Raft is for agreement on a value among a small set of replicas. Kafka is for ordering and distributing messages at scale. Using etcd as a message bus (writing every event as an etcd key) does not scale.

Using CRDTs when you need Agreement: CRDTs guarantee eventual convergence but not immediate consistency. If your requirement is “these two operations must both succeed or both fail,” CRDTs do not help. You need a transaction or a Saga.

Skipping the deduplication window: Idempotency stores have a retention window (typically 24 hours). Requests with idempotency keys older than the window are treated as new requests. For long-running sagas (days-long workflows), ensure the idempotency key is fresh for each retry, not reused from the original submission.

The Coordination Hierarchy #

The series has traced coordination from the infrastructure substrate to the application layer:

Application Layer
├── Selection (seat reservation, driver assignment)
├── Suppression (idempotency keys, Bloom filter dedup)
├── Aggregation (rate limiting, cardinality, Top-K)
└── Convergence (CRDTs, collaborative editing)
    Bridges: TTL, Idempotency Key, CAS, Fencing, Compensation, Sequence Number
Infrastructure Layer
├── Membership (gossip, Phi accrual, Raft joint consensus)
├── Selection (Raft leader election, etcd leases + fencing)
├── Agreement (Raft log replication, 2PC, Saga)
├── Assignment (consistent hashing, Kafka partition assignment)
└── Ordering (Lamport clocks, vector clocks, total order broadcast)
    Platform: etcd, Cassandra, Kafka, ZooKeeper, Redis Cluster, CockroachDB

Every distributed system, at every layer, is implementing combinations of these six mechanisms. The names change — leader election instead of Selection, ACID transaction instead of Agreement, rate limiter instead of Aggregation — but the underlying coordination problem remains the same.

Naming the mechanism does not make it simpler. But it makes the tradeoffs legible. You can reason about what happens if the mechanism fails, what properties it must satisfy, and what bridges you need between layers. That reasoning — not the choice of Redis vs etcd vs PostgreSQL — is what determines whether your system is correct.