Distributed Coordination: The Hidden Component

Table of Contents

Every distributed system coordinates. When Ticketmaster prevents double-booking, when a payment processor charges exactly once across retries, when Google Docs merges concurrent edits from two users into the same document — coordination is happening. But unlike message brokers (Kafka), search indices (OpenSearch), or databases (PostgreSQL), coordination has no canonical wire protocol. Every system implements it independently, from scratch, under different names.

This series makes that hidden layer explicit. It names the mechanisms, traces them through infrastructure (etcd, Raft, Cassandra) and application logic (Ticketmaster, Payment, Uber), and shows the failure modes that arise when the mechanisms are applied incorrectly or at the wrong level.

The Framework #

Before reaching for a specific mechanism, any coordination problem can be analyzed with six questions:

Who participates? → Membership
What’s the structure? → Shape (centralized, ring, mesh)
What needs resolving? → Coordination mechanism
How is it resolved? → Algorithm and protocol
What if it fails? → Reliability and recovery
How fast must it work? → QoS constraints

Question 3 maps to six core mechanisms:

Mechanism	Question answered
Selection	Multiple actors compete for a resource — who wins?
Agreement	Multiple participants must commit to the same decision — how do they agree?
Aggregation	Values are spread across nodes — how are they combined?
Assignment	Work must be distributed — who handles what?
Suppression	Operations may be duplicated — how are duplicates eliminated?
Convergence	State diverges across replicas — how does it reconcile?

Chapters #

Part 1: Infrastructure Coordination #

Coordination as a Hidden Component — the framework, the taxonomy, why no uniform protocol exists
Membership — Knowing Who Is Alive — gossip protocols, Phi accrual failure detection, join and leave propagation
Selection — Leader Election and Distributed Locks — Raft leader election, fencing tokens, etcd leases, why Redis locks are unsafe for correctness
Agreement — Consensus, Quorum, and Transactions — Raft log replication, two-phase commit, the blocking problem, Saga as an alternative
Assignment — Consistent Hashing and Work Distribution — hash rings, virtual nodes, Cassandra’s coordination model, rebalancing under membership change
Ordering — Sequence Numbers and Total Order — Lamport clocks, vector clocks, total order broadcast, how etcd’s revision counter enables ordering
Aggregation — Scatter-Gather and Distributed Query Execution — OpenSearch scatter-gather, Cassandra quorum read merge, CockroachDB DistSQL, Kafka Streams windowed aggregation
Suppression — Exactly-Once at the Infrastructure Layer — Kafka idempotent producer, transactions, etcd watch revision guarantees, Raft log deduplication

Part 2: Application Coordination #

Selection at Application Level — Reservations and Assignments — seat reservation (Ticketmaster), driver assignment (Uber), optimistic locking, two-phase reservation
Agreement at Application Level — Sagas, Compensation, and Cross-Service Atomicity — Saga design for e-commerce and booking, compensation patterns, isolation gap, orchestration vs choreography
Assignment at Application Level — Work Distribution and Routing — task queues, job scheduler exactly-one execution, web crawler domain assignment, database sharding
Suppression — Idempotency and Deduplication — idempotency key design, database-level idempotency, Bloom filters, the outbox pattern
Aggregation — Distributed Counting and Rate Limiting — Count-Min Sketch, HyperLogLog, token bucket, sliding window, distributed rate limiting
Convergence — CRDTs and Collaborative Editing — state-based vs operation-based CRDTs, G-Counter, OR-Set, Operational Transforms, Google Docs
Bridging Intent and Commitment — the throughput-correctness split, TTL, CAS, fencing, compensation, sequence numbers as bridges
When Infrastructure, When Application, When Layered — correctness vs throughput decision framework, the recoverable vs unrecoverable error boundary

The Framework #

Chapters #

Part 1: Infrastructure Coordination #

Part 2: Application Coordination #

Ordering — Sequence Numbers and Total Order

Selection — Leader Election and Distributed Locks

Selection at Application Level — Reservations and Assignments

Suppression — Exactly-Once at the Infrastructure Layer

Suppression — Idempotency and Deduplication

When Infrastructure, When Application, When Layered