# Coordination as a Hidden Component
Draw a system design diagram for Ticketmaster. You will draw boxes for a load balancer, an API gateway, a database, a cache, a message queue, a search index. You will label them. But when the interviewer asks how two users are prevented from booking the same seat simultaneously, the answer involves a coordination mechanism that has no box in the diagram. It is implicit. It is assumed. It is handled somewhere.
This is the defining characteristic of distributed coordination: it is the component that every distributed system needs and no one explicitly names.
## Why There Is No Canonical Protocol
Every other component in a system design diagram has a canonical wire protocol that multiple implementations adopt:
- Message broker → Kafka wire protocol (Redpanda implements it independently)
- Search index → Elasticsearch REST API and Query DSL (OpenSearch, as a fork, speaks the same protocol)
- Database → PostgreSQL wire protocol (CockroachDB, Neon, RisingWave implement it)
Coordination has no equivalent. Each system implements coordination from scratch, under its own protocol:
| System | Coordination mechanism | Protocol |
|---|---|---|
| Kafka | KRaft — leader election, ISR | Proprietary |
| Cassandra | Gossip + Paxos variants | Proprietary |
| Kubernetes | etcd leases and watches | etcd gRPC |
| CockroachDB | Raft per range | Proprietary |
| Redis Cluster | Cluster bus gossip | Proprietary |
| ZooKeeper | ZAB (ZooKeeper Atomic Broadcast) | Proprietary |
The reason is structural. Unlike message brokers — which are external services consumed by applications — coordination is typically embedded infrastructure. Kafka does not call etcd for leader election; it implements KRaft internally. CockroachDB does not call ZooKeeper; it runs Raft per range internally. Every system builds coordination rather than consuming a standard protocol, because:
- Performance requirements vary too much — linearizable reads for etcd, high-throughput ordering for Kafka, per-range consensus for CockroachDB
- Coordination is a prerequisite, not a feature — it cannot be behind a network hop in the hot path
- There is no “implement the coordination protocol and get clients for free” incentive the way implementing the PostgreSQL wire protocol gives you all PostgreSQL tooling
## The Coordination Problem Appears Across System Design
Coordination is present — implicitly — in approximately 40% of canonical system design problems:
| System | Hidden coordination need |
|---|---|
| Ticketmaster | Seat hold — prevent double-booking during checkout |
| Online Auction | Bid acceptance — exactly one winner among concurrent bids |
| Job Scheduler | Leader election — only one node triggers each job |
| Google Docs | Collaborative editing — concurrent conflict resolution |
| Payment System | Idempotency — exactly-once execution across retries |
| Rate Limiter | Distributed counting — the problem is coordination itself |
| Web Crawler | URL frontier — distributed deduplication and assignment |
| Uber | Driver assignment — exactly one driver per ride |
| Robinhood | Order processing — exactly-once trade execution |
In every case, the standard answer is “use Redis for locking” or “use ZooKeeper” — a component name, never a protocol, never an analysis of why that mechanism is correct or what goes wrong if it is not.
## The Analytical Framework
Before selecting a mechanism, any coordination problem can be analyzed with six questions:
### 1. Who participates? → Membership
How many actors? Are they homogeneous (identical nodes) or heterogeneous (coordinator and workers)? Is membership static or dynamic (nodes join and leave)? What happens when members fail?
You cannot run consensus, assign quorums, or distribute work without knowing who is in the system. Membership is the substrate that all other coordination depends on.
### 2. What is the structure? → Shape
What is the topology of the coordination?
- Centralized: One coordinator, N participants (2PC, leader-follower)
- Ring: Ordered participant set (token ring, some consensus variants)
- Mesh: Every node talks to every other (gossip, Raft)
- Hierarchical: Coordinator delegates to sub-coordinators (Kafka consumer groups with a group coordinator)
Shape determines communication cost and single points of failure.
### 3. What needs resolving? → Coordination mechanism
The core question. One of six mechanisms:
| Mechanism | The problem |
|---|---|
| Selection | Multiple actors compete for a finite resource — who wins? |
| Agreement | Multiple participants must all commit or all abort — how do they agree? |
| Aggregation | Values are spread across nodes/users — how are they combined? |
| Assignment | Work must be distributed across participants — who handles what? |
| Suppression | Operations may be duplicated — how are duplicates eliminated? |
| Convergence | State diverges across replicas — how does it reconcile without central coordination? |
### 4. How is it resolved? → Algorithm and protocol
Given the mechanism, which algorithm? Raft for agreement, consistent hashing for assignment, HyperLogLog for approximate aggregation, CRDTs for convergence. The algorithm must satisfy the required properties of the mechanism.
### 5. What if it fails? → Reliability
What are the failure modes? Network partition, node crash, clock drift, GC pause. Every mechanism must state:
- Safety property: What must never happen (two leaders, double charge, split-brain)
- Liveness property: What must eventually happen (a leader is elected, payment commits)
FLP impossibility establishes that no deterministic consensus algorithm can guarantee both safety and liveness (termination) in an asynchronous system where even one process may crash. Every mechanism therefore makes an explicit tradeoff.
### 6. How fast must it work? → QoS
Latency, throughput, availability requirements. These constraints determine whether a single mechanism can serve both throughput and correctness needs, or whether layering is required (a fast approximate layer over a slow authoritative layer).
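The layering pattern can be sketched with a counter that is fast and approximate per node, and flushed in batches to a slow authoritative total. This is a minimal illustration (all names are mine); in practice the authoritative layer is a database or coordination service behind a network call.

```python
class LayeredCounter:
    """Sketch: fast approximate layer over a slow authoritative layer."""

    def __init__(self, flush_every: int = 100):
        self.local = 0           # fast path: per-node, no coordination
        self.authoritative = 0   # slow path: shared, correct (a DB in practice)
        self.flush_every = flush_every

    def increment(self) -> None:
        self.local += 1
        if self.local >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        # In a real system this is a network call to the authoritative store.
        self.authoritative += self.local
        self.local = 0

    def approximate_total(self) -> int:
        # May lag the true total by up to flush_every - 1 unflushed increments.
        return self.authoritative + self.local

c = LayeredCounter(flush_every=100)
for _ in range(250):
    c.increment()
print(c.authoritative, c.approximate_total())  # 200 250
```

The flush interval is the QoS dial: a smaller interval tightens accuracy at the cost of more coordination traffic.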
## The Mechanism Taxonomy in Detail
### Selection
Selection resolves a race: multiple actors compete for a finite resource and exactly one must win.
Infrastructure manifestations:
- Raft leader election: one node among N becomes the leader for a term
- etcd distributed lock: one process among N holds the lock at any time
Application manifestations:
- Seat reservation: one user among N concurrent buyers holds a seat
- Driver assignment: one driver among N available is assigned to a ride
Key properties required:
- Exactly one winner — no split-brain
- Fencing — a stale winner (whose lease has expired) must be prevented from acting as though it still holds the resource
First-write-wins (FWW) and last-write-wins (LWW) are the two resolution policies. FWW is appropriate when the first actor to claim a resource should win (seat reservation). LWW is appropriate when the most recent write should win (document versioning).
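The fencing requirement can be sketched with a lock service that issues monotonically increasing tokens and a resource that rejects writes carrying a stale token. This is a toy (class names are illustrative; real systems use etcd or ZooKeeper for the lock side), but the invariant is the real one: the resource, not the lock, enforces fencing.

```python
class FencedLock:
    """Toy lock that issues monotonically increasing fencing tokens."""

    def __init__(self):
        self.token = 0                  # highest token ever issued
        self.holder: str | None = None

    def acquire(self, client: str) -> int:
        # In a real system this goes through a lock service with leases.
        self.token += 1
        self.holder = client
        return self.token

class Resource:
    """The protected resource rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_seen = 0

    def write(self, token: int) -> bool:
        if token < self.highest_seen:
            return False                # stale winner: a newer holder exists
        self.highest_seen = token
        return True

lock, resource = FencedLock(), Resource()
t1 = lock.acquire("client-1")   # client-1 wins, then stalls (e.g. a GC pause)
t2 = lock.acquire("client-2")   # lease expires; client-2 wins with a higher token
assert resource.write(t2) is True
assert resource.write(t1) is False   # client-1's stale token is fenced off
```

Without the token check, client-1 would wake from its pause and write as though it still held the lock — exactly the split-brain that fencing exists to prevent.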
### Agreement
Agreement resolves a commit decision: multiple participants must all commit or all abort an operation.
Infrastructure manifestations:
- Raft log replication: all replicas must agree on the same log entry at the same index
- Two-phase commit: all participants in a distributed transaction must agree to commit
Application manifestations:
- Payment + fulfillment: payment service and order service must both succeed or both roll back
- Booking + inventory: reservation and stock decrement must be atomic
Key properties required:
- Atomicity: all commit or all abort — no partial commits
- Durability: committed decisions survive crashes
Agreement is the mechanism most constrained by the CAP theorem. In the presence of a network partition, you must choose between consistency (wait for agreement, possibly block) and availability (proceed without agreement, risk divergence).
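The all-or-nothing structure of two-phase commit can be sketched in a few lines. This shows only the vote logic (participant classes and names are mine); real 2PC additionally requires durable logging of the prepared state, and blocks if the coordinator fails between phases — which is why it trades availability for atomicity.

```python
class Participant:
    """Toy 2PC participant: votes in prepare, then commits or aborts."""

    def __init__(self, vote_yes: bool):
        self.vote_yes, self.state = vote_yes, "pending"

    def prepare(self) -> bool:
        return self.vote_yes

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    """Commit only if every participant votes yes; otherwise abort everywhere."""
    # Phase 1: prepare — collect votes
    if all(p.prepare() for p in participants):
        # Phase 2: commit everywhere
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts everywhere — no partial commits
    for p in participants:
        p.abort()
    return False

payment, order = Participant(True), Participant(False)
assert two_phase_commit([payment, order]) is False
assert payment.state == order.state == "aborted"   # atomicity: no half-done transaction
```

The key property is visible in the last assertion: even though the payment participant was willing to commit, the order participant's "no" vote forced both to abort.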
### Aggregation
Aggregation resolves a computation: values distributed across nodes must be combined into a single result.
Infrastructure manifestations:
- OpenSearch scatter-gather: partial results from each shard merged by the coordinating node
- Cassandra quorum read: responses from R replicas merged by the coordinator
Application manifestations:
- Rate limiter: request counts from all nodes combined to determine if limit is exceeded
- Ad click aggregator: click counts from all edge nodes summed for billing
- YouTube Top-K: view counts from all shards merged for leaderboard
Key properties:
- Exact or approximate: exact aggregation requires all values; approximate (HyperLogLog, Count-Min Sketch) is sufficient for many use cases
- Mergeable: partial results must be combinable — merging two HLL sketches is a union operation
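The mergeability property can be shown with exact distinct counting, where the partial state is a set and merge is union — the same shape as an HLL merge, except that a real HyperLogLog replaces the set with a fixed-size register array and accepts approximation. The shard data below is illustrative.

```python
def merge(a: set, b: set) -> set:
    """Merging partial distinct-count state is a union (exact analogue of HLL merge)."""
    return a | b

def distinct(counter: set) -> int:
    return len(counter)

# Per-shard partial results (illustrative data)
shard1 = {"u1", "u2", "u3"}
shard2 = {"u2", "u4"}
shard3 = {"u4", "u5"}

# Mergeable: combine in any order, any grouping — same answer
total = merge(merge(shard1, shard2), shard3)
assert total == merge(shard1, merge(shard2, shard3))
assert merge(shard1, shard2) == merge(shard2, shard1)
print(distinct(total))  # 5 distinct users across all shards
```

Mergeability is what makes scatter-gather possible: each shard computes independently and the coordinator only combines, never recomputes.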
### Assignment
Assignment resolves a distribution problem: work, data, or responsibility must be allocated across participants.
Infrastructure manifestations:
- Cassandra consistent hashing: keys assigned to nodes by position on hash ring
- Kafka partition assignment: partitions assigned to consumer group members
- etcd range assignment: key ranges assigned to nodes in distributed KV stores
Application manifestations:
- Job scheduler: tasks assigned to available workers
- Web crawler: URL domains assigned to crawler instances
Key properties:
- Stability: the same key should map to the same node across time (consistent hashing achieves this)
- Minimal disruption: when membership changes, only the minimum necessary work should move
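Both properties can be demonstrated with a minimal consistent-hash ring. This sketch omits virtual nodes (which real implementations use to even out load); node and key names are illustrative. The invariant to watch: when a node is added, the only keys that move are the ones the new node now owns.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring (no virtual nodes, illustrative only)."""

    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's position
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[i][1]

keys = [f"key-{i}" for i in range(1000)]
before = {k: HashRing(["n1", "n2", "n3"]).node_for(k) for k in keys}
after = {k: HashRing(["n1", "n2", "n3", "n4"]).node_for(k) for k in keys}

moved = sum(before[k] != after[k] for k in keys)
# Minimal disruption: every key that moved, moved TO the new node
assert all(after[k] == "n4" for k in keys if before[k] != after[k])
print(f"{moved} of 1000 keys moved")
```

With naive modulo hashing (`hash(key) % node_count`), adding a node would remap roughly three quarters of the keys; here only the new node's arc moves.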
### Suppression
Suppression resolves a duplicate problem: an operation that may be executed multiple times must produce the same effect as executing once.
Infrastructure manifestations:
- Kafka idempotent producer: duplicate messages filtered using producer ID + sequence number
- etcd watch: each event has a unique revision; clients can replay from a revision without duplicate processing
Application manifestations:
- Payment idempotency: the same charge attempted twice should produce one charge
- Web crawler deduplication: a URL encountered multiple times should be crawled once
Key properties:
- Idempotency: the operation produces the same result regardless of how many times it executes
- Deduplication window: how long must the system remember that an operation was already processed?
### Convergence
Convergence resolves a divergence problem: replicas that accepted writes independently must eventually reach identical state without central coordination.
Infrastructure manifestations:
- Cassandra eventual consistency: replicas diverge during partition, converge via anti-entropy repair
- Redis Cluster gossip: node metadata propagates and converges across the cluster bus
Application manifestations:
- Google Docs: concurrent edits from multiple users merge into a consistent document
- Dropbox: file edits on multiple devices synchronize to the same final state
Key properties:
- Commutativity: merge(A, B) = merge(B, A) — order of merging does not matter
- Associativity: merge(merge(A, B), C) = merge(A, merge(B, C)) — grouping does not matter
- Idempotency: merge(A, A) = A — merging with the same state twice is safe
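All three properties hold for the simplest CRDT, a grow-only counter: each replica tracks a per-node count, and merge takes the element-wise maximum. The replica data below is illustrative.

```python
def merge(a: dict, b: dict) -> dict:
    """G-counter CRDT merge: per-node max. Commutative, associative, idempotent."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

# Replicas accept increments independently (keys are node ids)
replica_a = {"a": 3, "b": 1}   # a saw 3 local increments, 1 of b's
replica_b = {"a": 2, "b": 4}   # b is behind on a's count, ahead on its own
replica_c = {"a": 1, "c": 2}

assert merge(replica_a, replica_b) == merge(replica_b, replica_a)   # commutative
assert merge(merge(replica_a, replica_b), replica_c) == \
       merge(replica_a, merge(replica_b, replica_c))                # associative
assert merge(replica_a, replica_a) == replica_a                     # idempotent
print(value(merge(replica_a, replica_b)))  # 7 — all replicas converge to this total
```

Because of these three properties, replicas can exchange state in any order, over any gossip topology, with arbitrary duplication, and still converge — no coordinator required.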
## The Two Levels
The mechanisms operate at two levels that are structurally distinct:
Infrastructure coordination answers: how do nodes in a distributed system coordinate with each other?
- Small number of participants — tens or hundreds of nodes, not millions of concurrent operations
- Errors are unrecoverable — split-brain in leader election corrupts data permanently
- Correctness wins over throughput — etcd accepts write latency to guarantee linearizability
Application coordination answers: how do concurrent user requests resolve?
- Large number of participants (millions of concurrent user operations)
- Errors are often recoverable — a failed seat hold releases via TTL, a failed payment can be retried with an idempotency key
- Throughput wins for the fast path, correctness is enforced at the commitment layer
The same mechanism (Selection) appears at both levels — Raft leader election at the infrastructure level, seat reservation at the application level — but with different implementations, different guarantees, and different failure modes.
## How DDD and Event Storming Surface Coordination Concerns
Domain-Driven Design and Event Storming are discovery tools. They do not specify coordination mechanisms, but they surface the coordination problems that need them.
The mapping:
| Event Storming artifact | Coordination mechanism |
|---|---|
| Aggregate boundary | Agreement — atomic consistency within |
| Cross-aggregate command | Agreement (Saga) or Suppression (idempotency) |
| Competing commands on same aggregate | Selection |
| Hotspot with multiple concurrent actors | Selection or Agreement |
| Eventual consistency between bounded contexts | Convergence |
| Policy (when X, do Y) | Notification (watch) |
| Read model / projection | Aggregation |
| External system boundary | Suppression (idempotency at entry point) |
The coordination framework translates the “what” surfaced by Event Storming into the “how” required by implementation: the mechanism type, its required properties, and the infrastructure or application pattern that satisfies them.