Agreement — Consensus, Quorum, and Transactions #
Agreement is the coordination mechanism that ensures multiple participants reach the same decision. Unlike Selection — where exactly one actor wins — Agreement requires that all participants commit or all abort. Agreement is the mechanism behind distributed transactions, replicated state machines, and any operation where partial completion is worse than no completion.
The correctness requirements are:
- Agreement: All non-faulty processes decide on the same value.
- Validity: The decided value must have been proposed by some process (no arbitrary decisions).
- Termination: All non-faulty processes eventually decide.
Like Selection, Agreement runs into FLP impossibility in a fully asynchronous model. Every practical Agreement mechanism assumes bounds on message delivery or crash detection.
Raft Log Replication as Agreement #
Chapter 3 covered Raft leader election. The Agreement mechanism in Raft is log replication — the protocol by which the leader ensures that all replicas have identical copies of the same log entries in the same order.
Log Replication Protocol #
For each client write, the leader:
- Appends the entry to its local log at the next index position.
- Sends `AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries[], leaderCommit)` to all followers in parallel.
- A follower accepts the entry if its local log contains an entry at `prevLogIndex` with term `prevLogTerm` — the log matching invariant. If not, the follower rejects, and the leader decrements `nextIndex` for that follower and retries.
- The leader commits the entry once a majority of nodes (including itself) have appended it.
- The leader advances its `commitIndex` and applies the entry to its state machine.
- The leader’s next AppendEntries (or a heartbeat) carries the updated `commitIndex`, which tells followers to commit the entry.
The Log Matching Invariant #
Raft guarantees that if two log entries at the same index have the same term, their content is identical and all preceding entries are also identical. This invariant is maintained by the `prevLogIndex`/`prevLogTerm` check: a follower only appends an entry if its log matches the leader’s log up to that point.
This means the leader’s AppendEntries RPC is idempotent: retrying a write never creates a duplicate entry. If a follower has already appended an entry at index i, an AppendEntries with the same index overwrites only if the terms differ (the follower has a conflicting entry from a stale leader).
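The follower-side consistency check can be sketched as a small predicate. This is an illustrative fragment (the `LogEntry` type and 1-based indexing convention are assumptions, not from a specific implementation):

```go
package main

import "fmt"

// LogEntry is a minimal log entry: the term in which it was created.
type LogEntry struct{ Term int }

// consistent reports whether a follower's log passes the
// AppendEntries check: the entry at prevLogIndex must exist and
// carry prevLogTerm. prevLogIndex == 0 is the empty-log sentinel,
// matching Raft's 1-based log indexing.
func consistent(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending from the start of the log
	}
	if prevLogIndex > len(log) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	log := []LogEntry{{Term: 1}, {Term: 1}, {Term: 2}}
	fmt.Println(consistent(log, 3, 2)) // true: entry 3 has term 2
	fmt.Println(consistent(log, 3, 1)) // false: conflicting term
	fmt.Println(consistent(log, 5, 2)) // false: log too short
}
```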
Leader Completeness #
The Raft election protocol (Chapter 3) guarantees that only a candidate with all committed entries can become leader. Combined with log replication, this means: once an entry is committed, it will never be lost. Every future leader will have it. This is the fundamental safety guarantee of Raft.
Linearizability #
Raft log replication provides linearizability: every operation appears to take effect instantaneously at some point between its invocation and its completion. This requires:
- Reads must be served by the leader (or a follower that has confirmed it is up-to-date with the leader via a lease or a round-trip confirmation).
- A stale leader that has been partitioned from the cluster must not serve reads — it may have stale state.
etcd achieves this by default: all reads go through the leader. Consistency can be traded for read performance via serializable read requests (which can be served by any node but may return stale data).
Two-Phase Commit #
Two-phase commit (2PC) is the standard protocol for distributed transactions — operations that span multiple participants (databases, services) that must all commit or all abort.
Protocol #
Phase 1: Prepare
The coordinator sends PREPARE(txn_id) to all participants. Each participant:
- Writes all transaction operations to a durable redo log (but does not commit them).
- Acquires all locks required for the transaction.
- Votes `YES` if it can commit (has the data, has the locks, redo log is durable) or `NO` if it cannot.
The YES vote is a promise: the participant is guaranteeing that it will be able to commit if asked, even if it crashes and recovers. This is what makes the redo log critical — on recovery, the participant can see the uncommitted transaction in its log and honor its promise.
Phase 2: Commit or Abort
If the coordinator receives YES from all participants:
- The coordinator writes `COMMIT` to its own durable log.
- The coordinator sends `COMMIT(txn_id)` to all participants.
- Each participant commits the transaction (applies redo log to state, releases locks), writes a COMMIT record to its own log, and acknowledges.
If any participant votes NO (or if a participant is unreachable after timeout):
- The coordinator writes `ABORT` to its own log.
- The coordinator sends `ABORT(txn_id)` to all participants.
- Each participant rolls back (discards redo log, releases locks).
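The two phases can be sketched end to end. This is a toy sketch, not a production coordinator: the `Participant` interface and all names are hypothetical, and durable decision logging, timeouts, and retries are omitted.

```go
package main

import "fmt"

// Participant is a hypothetical interface a resource manager might
// expose for 2PC (illustrative names, not from a specific library).
type Participant interface {
	Prepare(txnID string) bool // vote YES (true) or NO (false)
	Commit(txnID string)
	Abort(txnID string)
}

// runTwoPhaseCommit drives both phases. A real coordinator must
// write its COMMIT/ABORT decision to a durable log before phase 2.
func runTwoPhaseCommit(txnID string, parts []Participant) bool {
	// Phase 1: collect votes.
	for _, p := range parts {
		if !p.Prepare(txnID) {
			// Any NO vote aborts the whole transaction.
			for _, q := range parts {
				q.Abort(txnID)
			}
			return false
		}
	}
	// Phase 2: all voted YES, so commit everywhere.
	for _, p := range parts {
		p.Commit(txnID)
	}
	return true
}

// fakePart is an in-memory participant for demonstration.
type fakePart struct {
	voteYes bool
	state   string
}

func (f *fakePart) Prepare(string) bool { return f.voteYes }
func (f *fakePart) Commit(string)       { f.state = "committed" }
func (f *fakePart) Abort(string)        { f.state = "aborted" }

func main() {
	a, b := &fakePart{voteYes: true}, &fakePart{voteYes: false}
	fmt.Println(runTwoPhaseCommit("txn-1", []Participant{a, b})) // false
	fmt.Println(a.state, b.state)                                // aborted aborted
}
```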
The Blocking Problem #
2PC has a fundamental failure mode: if the coordinator crashes after participants have voted YES but before the coordinator has sent COMMIT or ABORT, the participants are blocked.
Each participant:
- Has written to its redo log.
- Holds locks for the transaction.
- Has voted YES (promised it can commit).
- Cannot unilaterally abort (another participant may have already committed based on the coordinator’s decision).
- Cannot unilaterally commit (other participants may have been sent ABORT).
The participants are stuck. They must wait for the coordinator to recover and re-send the decision. During this window, any row touched by the transaction is locked and unavailable.
This is not just a theoretical problem. In production systems, coordinator crashes during the prepare phase cause “in-doubt transactions” that hold locks for minutes until the coordinator recovers. Every major database (PostgreSQL, Oracle, MySQL) has mechanisms to detect and investigate in-doubt transactions because they surface in support incidents.
Three-phase commit (3PC) was proposed to resolve the blocking problem by adding a pre-commit phase. 3PC is non-blocking under the assumption that the network is reliable (no partition). Under a network partition, 3PC can still diverge — it trades blocking for the possibility of inconsistent decisions. In practice, 3PC is not used in production systems.
2PC in Modern Systems #
Despite its problems, 2PC remains the standard for strongly consistent distributed transactions:
- PostgreSQL distributed transactions (via `postgres_fdw` or application-level coordination) use 2PC with `PREPARE TRANSACTION` and `COMMIT PREPARED`.
- XA transactions (Java EE, JTA) are a standard interface over 2PC, exposing `prepare()`, `commit()`, and `rollback()` methods.
- Google Spanner uses 2PC with Paxos-replicated participant groups and TrueTime-bounded timestamps. Each participant is a Paxos group, so no single node crash can block the transaction — the Paxos group recovers.
The key insight behind Spanner’s design: make the coordinator and each participant a replicated state machine (Paxos group). Now a single node crash does not block 2PC because the Paxos group can elect a new leader and continue. The blocking problem is not eliminated — it requires a Paxos quorum failure, which is the same threshold as any other availability guarantee.
Quorum Intersection #
Quorum-based systems generalize Agreement by defining a threshold — a quorum — that must be reached before an operation is considered complete.
Quorum intersection theorem: for any two quorums Q₁ and Q₂ in a system, |Q₁ ∩ Q₂| ≥ 1. Every pair of quorums must share at least one node.
This ensures that a read quorum will always include at least one node that participated in the most recent write quorum, so reads always see the latest write.
For N replicas with write quorum W and read quorum R:
W + R > N → quorum intersection guaranteed
Common configurations:
| N | W | R | Notes |
|---|---|---|---|
| 3 | 2 | 2 | QUORUM reads and writes; tolerates 1 failure |
| 3 | 3 | 1 | Strong write, fast read; no write tolerance |
| 3 | 1 | 3 | Fast write, strong read; no read tolerance |
| 5 | 3 | 3 | QUORUM; tolerates 2 failures |
| 5 | 5 | 1 | ALL write, ONE read; no write fault tolerance |
Cassandra’s consistency levels map directly to quorum thresholds:
- `ONE`: W or R = 1
- `QUORUM`: W or R = floor(RF/2) + 1
- `ALL`: W or R = RF
- `LOCAL_QUORUM`: quorum within the local datacenter only (used for multi-datacenter deployments where cross-datacenter latency is unacceptable for the write path)
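These thresholds can be checked mechanically. A small Go sketch of the W + R > N condition and Cassandra's QUORUM size (function names are illustrative):

```go
package main

import "fmt"

// quorumSize is Cassandra's QUORUM threshold: floor(RF/2) + 1.
func quorumSize(rf int) int { return rf/2 + 1 }

// intersects reports whether a write quorum of size w and a read
// quorum of size r must overlap among n replicas: W + R > N.
func intersects(n, w, r int) bool { return w+r > n }

func main() {
	fmt.Println(quorumSize(3), quorumSize(5)) // 2 3
	fmt.Println(intersects(3, 2, 2))          // true: QUORUM reads see QUORUM writes
	fmt.Println(intersects(3, 1, 1))          // false: ONE reads can miss ONE writes
}
```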
The interplay with membership: quorum size is calculated from the current membership. If a node is removed from the cluster (membership change), the quorum denominator shrinks, which may cause pending writes to succeed with fewer acknowledgments than intended. Cassandra’s decommission process must complete data streaming to remaining nodes before removing the node from the ring to avoid this.
Paxos #
Paxos is the theoretical foundation of Agreement protocols. Raft was designed as a more understandable alternative to Paxos; they achieve the same correctness properties via different mechanisms.
Single-decree Paxos (agreeing on a single value):
Phase 1: Prepare
- A proposer sends `PREPARE(n)`, where n is a proposal number (must be unique per proposer, monotonically increasing).
- An acceptor receiving PREPARE(n) promises to never accept a proposal with number < n, and returns any previously accepted value.
Phase 2: Accept
- If the proposer receives PREPARE responses from a majority, it sends `ACCEPT(n, v)`, where v is either the value from the highest-numbered previously accepted proposal (from the PREPARE responses), or the proposer’s own value if no prior acceptance exists.
- An acceptor accepts ACCEPT(n, v) unless it has already promised a higher proposal number — that is, unless it has responded to a PREPARE(m) with m > n.
Multi-Paxos extends single-decree Paxos to a log by electing a distinguished leader (via a single round of Paxos) that can then propose entries directly without phase 1 for each log entry — essentially reducing to the same structure as Raft.
Raft vs Paxos in practice: Raft is more widely implemented because its leader-centric design and explicit term/index structure make reasoning about correctness simpler. Paxos is more general (no requirement for a single leader) but harder to implement correctly for a replicated log.
Saga: Agreement Without 2PC #
In microservice architectures, 2PC is often impractical:
- Services may not expose transactional semantics to external callers.
- Long-lived transactions (holding locks across service calls) harm throughput.
- The coordinator (typically the API gateway or orchestrator) may not be the right place to implement distributed transaction coordination.
The Saga pattern (Garcia-Molina & Salem, 1987) replaces atomicity with compensatability: instead of requiring all participants to hold locks and vote, a saga breaks the transaction into a sequence of local transactions, each with a corresponding compensation transaction that undoes its effect if a later step fails.
Saga Execution Patterns #
Choreography: each service publishes events and subscribes to events from other services. No central coordinator. Services react to events and publish their own outcomes.
OrderService PaymentService InventoryService
│ │ │
├──OrderCreated──────►│ │
│ ├──PaymentCompleted──►│
│ │ ├──InventoryReserved
│ │ │ ✓ Done
│ (failure path)
│ │◄─PaymentFailed──────┤
│◄──OrderCancelled───┤ │
Orchestration: a central orchestrator sends commands to each service and receives responses. The orchestrator maintains the saga state and decides whether to proceed or compensate.
// Saga orchestrator (simplified: a production orchestrator would persist
// saga state and retry failed compensations instead of ignoring their errors)
func (s *OrderSaga) Execute(ctx context.Context, order Order) error {
    // Step 1: Reserve inventory
    if err := s.inventory.Reserve(ctx, order.Items); err != nil {
        return err // nothing to compensate yet
    }
    // Step 2: Charge payment
    paymentResult, err := s.payment.Charge(ctx, order.Amount)
    if err != nil {
        // Compensate: release inventory reservation
        s.inventory.Release(ctx, order.Items)
        return err
    }
    // Step 3: Fulfill order
    if err := s.fulfillment.Ship(ctx, order); err != nil {
        // Compensate: refund payment, release inventory
        s.payment.Refund(ctx, paymentResult.ChargeID)
        s.inventory.Release(ctx, order.Items)
        return err
    }
    return nil
}
Saga Properties and Failure Modes #
What Saga guarantees: eventual consistency. If the saga completes successfully, all local transactions are committed. If the saga fails at step k, steps 1 through k-1 are compensated. The system eventually reaches a consistent state.
What Saga does not guarantee: isolation. Between the commit of step k and the compensation of earlier steps, intermediate states are visible. Another request may see an inventory reservation without a corresponding payment.
Idempotency requirement: every step and every compensation must be idempotent. If the orchestrator crashes after sending a command but before receiving acknowledgment, it will retry. The service must handle duplicate commands.
// Compensation must be idempotent
func (s *InventoryService) Release(ctx context.Context, items []Item, reservationID string) error {
    // If already released (idempotency key match), return success
    existing, err := s.db.GetReservation(ctx, reservationID)
    if err != nil {
        return err
    }
    if existing.Status == "released" {
        return nil // already compensated
    }
    return s.db.ReleaseReservation(ctx, reservationID, items)
}
Saga vs 2PC: When to Use Each #
| Criterion | 2PC | Saga |
|---|---|---|
| Isolation required | ✓ Use 2PC | ✗ Saga has no isolation |
| Long-lived transactions | ✗ 2PC blocks | ✓ Saga releases locks between steps |
| Cross-service boundary | ✗ Services may not support 2PC | ✓ Works with any service |
| Partial failure handling | Coordinator decides | Explicit compensation |
| Implementation complexity | Low (protocol-level) | High (compensation logic per step) |
The rule: use 2PC within a single database or within a system that supports XA. Use Saga across services or microservice boundaries where holding distributed locks is impractical.
Paxos-Based Systems: etcd and ZooKeeper #
etcd uses Raft (a Paxos-equivalent) for its key-value store. Every write goes through Raft log replication before it is acknowledged to the client. This makes all writes linearizable:
- Write latency is the Raft round-trip time: leader appends, followers acknowledge, leader commits, response returned.
- Typical p50 write latency for etcd on local SSDs: ~1–5ms. On spinning disks or high-latency networks: 10–50ms.
ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), a Paxos-adjacent protocol with similar properties to Raft. ZAB uses an explicit epoch concept (equivalent to Raft terms) and a ZXID (transaction ID) that encodes epoch + counter (equivalent to Raft’s term + log index).
etcd vs ZooKeeper for Agreement:
- Both are CP systems — they trade availability for consistency under partition.
- etcd exposes a richer transaction model (multi-key CAS, Watch).
- ZooKeeper exposes ephemeral nodes and sequential nodes, which are primitives for leader election and distributed locks.
- In Kubernetes-native systems, etcd is the standard. ZooKeeper remains common in JVM ecosystems (Kafka pre-KRaft, Solr, HBase).
CAP Position of Agreement Mechanisms #
| Mechanism | CAP choice | Behavior under partition |
|---|---|---|
| Raft log replication | CP | Minority partition cannot commit; majority partition continues |
| Two-phase commit (standard) | CA | Blocks if coordinator or any participant is unreachable |
| Saga | AP | Each step is local; overall saga may be partially committed |
| Cassandra QUORUM | CP | QUORUM write fails if fewer than RF/2+1 nodes reachable |
| Cassandra ONE | AP | Write succeeds even to a single node in a partition |
| etcd transaction | CP | Requires Raft quorum; unavailable if quorum lost |
Agreement at the infrastructure level is always CP — the price of consistency is availability under partition. The minority partition waits. This is the correct tradeoff: a partial commit in a distributed transaction is permanently inconsistent data.
Practical Checklist for Agreement #
| Requirement | Mechanism |
|---|---|
| Single service, PostgreSQL | BEGIN / COMMIT — local ACID |
| Two services, both PostgreSQL | XA or application-level 2PC with PREPARE TRANSACTION |
| Multiple microservices, long-lived | Saga with compensation + idempotency keys |
| Replicated state machine (etcd, CockroachDB) | Raft with linearizable writes |
| Strong consistency, high write throughput | Raft with pipelining (send AppendEntries before prior acknowledged) |
| Read scale without sacrificing consistency | Raft follower reads with lease (bounded staleness) |