Skip to main content

Distributed Coordination: The Hidden Component

Every distributed system coordinates. When Ticketmaster prevents double-booking, when a payment processor charges exactly once across retries, when Google Docs merges concurrent edits from two users into the same document — coordination is happening. But unlike message brokers (Kafka), search indices (OpenSearch), or databases (PostgreSQL), coordination has no canonical wire protocol. Every system implements it independently, from scratch, under different names.

This series makes that hidden layer explicit. It names the mechanisms, traces them through infrastructure (etcd, Raft, Cassandra) and application logic (Ticketmaster, Payment, Uber), and shows the failure modes that arise when the mechanisms are applied incorrectly or at the wrong level.

The Framework #

Before reaching for a specific mechanism, any coordination problem can be analyzed with six questions:

  1. Who participates? → Membership
  2. What’s the structure? → Shape (centralized, ring, mesh)
  3. What needs resolving? → Coordination mechanism
  4. How is it resolved? → Algorithm and protocol
  5. What if it fails? → Reliability and recovery
  6. How fast must it work? → QoS constraints

Question 3 maps to six core mechanisms:

MechanismQuestion answered
SelectionMultiple actors compete for a resource — who wins?
AgreementMultiple participants must commit to the same decision — how do they agree?
AggregationValues are spread across nodes — how are they combined?
AssignmentWork must be distributed — who handles what?
SuppressionOperations may be duplicated — how are duplicates eliminated?
ConvergenceState diverges across replicas — how does it reconcile?

Chapters #

Part 1: Infrastructure Coordination #

  1. Coordination as a Hidden Component — the framework, the taxonomy, why no uniform protocol exists
  2. Membership — Knowing Who Is Alive — gossip protocols, Phi accrual failure detection, join and leave propagation
  3. Selection — Leader Election and Distributed Locks — Raft leader election, fencing tokens, etcd leases, why Redis locks are unsafe for correctness
  4. Agreement — Consensus, Quorum, and Transactions — Raft log replication, two-phase commit, the blocking problem, Saga as an alternative
  5. Assignment — Consistent Hashing and Work Distribution — hash rings, virtual nodes, Cassandra’s coordination model, rebalancing under membership change
  6. Ordering — Sequence Numbers and Total Order — Lamport clocks, vector clocks, total order broadcast, how etcd’s revision counter enables ordering
  7. Aggregation — Scatter-Gather and Distributed Query Execution — OpenSearch scatter-gather, Cassandra quorum read merge, CockroachDB DistSQL, Kafka Streams windowed aggregation
  8. Suppression — Exactly-Once at the Infrastructure Layer — Kafka idempotent producer, transactions, etcd watch revision guarantees, Raft log deduplication

Part 2: Application Coordination #

  1. Selection at Application Level — Reservations and Assignments — seat reservation (Ticketmaster), driver assignment (Uber), optimistic locking, two-phase reservation
  2. Agreement at Application Level — Sagas, Compensation, and Cross-Service Atomicity — Saga design for e-commerce and booking, compensation patterns, isolation gap, orchestration vs choreography
  3. Assignment at Application Level — Work Distribution and Routing — task queues, job scheduler exactly-one execution, web crawler domain assignment, database sharding
  4. Suppression — Idempotency and Deduplication — idempotency key design, database-level idempotency, Bloom filters, the outbox pattern
  5. Aggregation — Distributed Counting and Rate Limiting — Count-Min Sketch, HyperLogLog, token bucket, sliding window, distributed rate limiting
  6. Convergence — CRDTs and Collaborative Editing — state-based vs operation-based CRDTs, G-Counter, OR-Set, Operational Transforms, Google Docs
  7. Bridging Intent and Commitment — the throughput-correctness split, TTL, CAS, fencing, compensation, sequence numbers as bridges
  8. When Infrastructure, When Application, When Layered — correctness vs throughput decision framework, the recoverable vs unrecoverable error boundary