Selection — Leader Election and Distributed Locks #
Selection is the most fundamental coordination mechanism: multiple actors compete for a resource, and exactly one must win. At the infrastructure level, Selection manifests as leader election — a process by which one node among N is designated to act as the authoritative coordinator for some period of time. The correctness requirements are strict: two leaders simultaneously (split-brain) is not a degraded state — it is a catastrophic one that corrupts data.
What Selection Requires #
For any Selection mechanism to be correct, two properties must hold:
Safety: At most one winner at any given time.
Liveness: Eventually a winner is selected (the system does not deadlock waiting for a decision).
These two properties are in tension under the FLP impossibility theorem: in a fully asynchronous system with even one possible failure, no deterministic algorithm can guarantee both. Every practical Selection mechanism makes an explicit assumption — typically about message delivery bounds — to escape FLP.
Raft assumes that messages are eventually delivered and that election timeouts are long enough relative to actual message delivery times. Under these assumptions, it achieves both safety and liveness. Under a fully asynchronous failure model (where messages may be delayed indefinitely), neither is guaranteed.
Raft Leader Election #
Raft structures time into terms — monotonically increasing integers that serve as logical clocks. Each term begins with an election. At most one leader is elected per term.
Node States #
Every Raft node is in one of three states at all times:
Follower: The default state. Followers receive heartbeats (AppendEntries RPCs carrying no log entries) from the leader. If no heartbeat arrives within the election timeout (a random interval, typically between 150ms and 300ms, configurable), the follower transitions to Candidate.
Candidate: A node seeking election. The candidate increments the current term, votes for itself, and sends RequestVote RPCs to all other nodes.
Leader: A node that has received votes from a majority of nodes in the current term. The leader sends periodic heartbeats to suppress elections by resetting followers’ election timeouts.
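The two transitions described above (election timeout and heartbeat receipt) can be sketched as a minimal Go model. The types and method names here are illustrative, not taken from any particular Raft implementation:

```go
package main

import "fmt"

// State is one of the three Raft node states.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node holds the minimal per-node election state.
type Node struct {
	state       State
	currentTerm int
	votedFor    int // -1 when no vote has been cast in currentTerm
}

// onElectionTimeout models the Follower -> Candidate transition:
// increment the term and vote for yourself.
func (n *Node) onElectionTimeout(selfID int) {
	if n.state == Leader {
		return // leaders do not start elections
	}
	n.state = Candidate
	n.currentTerm++
	n.votedFor = selfID
}

// onHeartbeat models receiving AppendEntries from a current leader:
// a heartbeat at an equal or higher term suppresses the election and
// forces the node back to Follower.
func (n *Node) onHeartbeat(leaderTerm int) {
	if leaderTerm > n.currentTerm {
		n.currentTerm = leaderTerm
		n.votedFor = -1 // votedFor resets when the term advances
	}
	if leaderTerm >= n.currentTerm {
		n.state = Follower
	}
}

func main() {
	n := &Node{state: Follower, currentTerm: 3, votedFor: -1}
	n.onElectionTimeout(1)
	fmt.Println(n.state == Candidate, n.currentTerm) // true 4
	n.onHeartbeat(6)
	fmt.Println(n.state == Follower, n.currentTerm) // true 6
}
```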
The RequestVote Protocol #
A Candidate sends RequestVote(term, candidateId, lastLogIndex, lastLogTerm) to all peers.
A node grants a vote if:
1. The candidate’s term is at least as large as the voter’s current term
2. The voter has not already voted in this term (each node votes at most once per term)
3. The candidate’s log is at least as up-to-date as the voter’s log: the candidate’s last log term is higher than the voter’s, or the terms are equal and the candidate’s log is at least as long
Condition 3 is the leader completeness guarantee: it ensures that only a candidate with all committed log entries can become leader. A candidate that is missing committed entries will have an outdated log and will not receive votes from nodes that have those entries.
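The three vote-granting conditions can be collapsed into a single predicate. A minimal Go sketch, with illustrative struct and field names:

```go
package main

import "fmt"

// VoteRequest carries the RequestVote fields used by the conditions.
type VoteRequest struct {
	Term         int
	CandidateID  int
	LastLogIndex int
	LastLogTerm  int
}

// Voter is the minimal voter-side state.
type Voter struct {
	CurrentTerm  int
	VotedFor     int // -1 if no vote has been cast in CurrentTerm
	LastLogIndex int
	LastLogTerm  int
}

// grantVote applies the three conditions in order.
func (v *Voter) grantVote(req VoteRequest) bool {
	if req.Term < v.CurrentTerm {
		return false // condition 1: candidate's term is stale
	}
	if req.Term > v.CurrentTerm {
		// a higher term resets the voter's per-term vote
		v.CurrentTerm = req.Term
		v.VotedFor = -1
	}
	if v.VotedFor != -1 && v.VotedFor != req.CandidateID {
		return false // condition 2: already voted this term
	}
	// condition 3: candidate's log must be at least as up-to-date
	upToDate := req.LastLogTerm > v.LastLogTerm ||
		(req.LastLogTerm == v.LastLogTerm && req.LastLogIndex >= v.LastLogIndex)
	if !upToDate {
		return false
	}
	v.VotedFor = req.CandidateID
	return true
}

func main() {
	v := &Voter{CurrentTerm: 5, VotedFor: -1, LastLogIndex: 10, LastLogTerm: 4}
	// a candidate with a shorter log at the same last term is rejected
	fmt.Println(v.grantVote(VoteRequest{Term: 6, CandidateID: 2, LastLogIndex: 8, LastLogTerm: 4})) // false
	// a candidate with an up-to-date log is granted
	fmt.Println(v.grantVote(VoteRequest{Term: 6, CandidateID: 3, LastLogIndex: 12, LastLogTerm: 4})) // true
}
```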
Split Vote and Randomized Timeouts #
If two nodes simultaneously become candidates, their votes may split — neither receives a majority. Both start a new election (incrementing the term again).
Raft resolves this with randomized election timeouts. Each follower waits a random duration before starting an election. The node with the shortest timeout becomes a candidate first and sends RequestVote before the others time out, so it typically wins before any other node starts competing.
The randomization window must be larger than the typical round-trip time for RequestVote RPCs; if the window is too narrow, simultaneous timeouts remain common. etcd defaults to an election timeout of 1000ms with a heartbeat interval of 100ms.
Log Replication and Commitment #
Once elected, the leader handles all client writes. For each write:
1. The leader appends the entry to its local log
2. The leader sends AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries[], leaderCommit) to all followers in parallel
3. Followers append the entry if prevLogIndex and prevLogTerm match their local log (the log matching property)
4. The leader waits for a majority of nodes (including itself) to acknowledge the append
5. The leader commits the entry (advances its commit index) and applies it to the state machine
6. The leader includes the updated commitIndex in subsequent AppendEntries, which causes followers to commit the entry
An entry is committed once it has been replicated to a majority. Committed entries are durable: they survive any single leader failure. A new leader is guaranteed to have all committed entries (enforced by the vote restriction, condition 3 of RequestVote).
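The commitment rule, advance the commit index to the highest entry replicated on a majority, can be computed from the leader's view of each node's replication progress. A Go sketch with illustrative names (real Raft additionally requires the entry at that index to be from the leader's current term):

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority
// of nodes. matchIndex[i] is the highest index known to be stored on
// node i; the leader includes its own last index in the slice.
func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// the value at position len/2 is held by at least a majority
	// of nodes (everything at or before it in descending order)
	return sorted[len(sorted)/2]
}

func main() {
	// 5 nodes: leader and two followers at index 7, two lagging
	fmt.Println(commitIndex([]int{7, 7, 7, 3, 2})) // 7
	// only two nodes at 7: not a majority, so commit stays at 3
	fmt.Println(commitIndex([]int{7, 7, 3, 3, 2})) // 3
}
```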
Failure Modes #
Network partition: If the leader is partitioned from a majority, it can no longer commit new entries (it cannot reach a majority). A new leader will be elected in the majority partition. The old leader may still accept writes from clients that can reach it, but those writes will not be committed. When the partition heals, the old leader discovers the higher term, steps down, and its uncommitted entries are overwritten.
Slow follower: If a follower is slow but not failed, the leader retries AppendEntries. Progress is not blocked — the leader commits entries once a majority responds, without waiting for the slow follower. The slow follower catches up asynchronously.
Leader crash: The leader crashes after committing an entry but before some followers have replicated it. The new leader must be a node whose log contains the committed entry (the vote restriction prevents a node missing it from winning the election), and it re-sends the entry to lagging followers. Safety is preserved.
etcd Leases: Time-Bounded Selection #
etcd builds a lease mechanism on top of Raft that is the foundation for distributed locks and leader election in Kubernetes-native systems.
A lease is a time-bounded grant issued by the etcd cluster. The lease has a TTL. The holder must periodically call LeaseKeepAlive to renew the lease before it expires. If the holder crashes, the lease expires and the grant is released.
// Grant a lease with 10-second TTL
lease, _ := client.Grant(ctx, 10)
// Attach a key to the lease
client.Put(ctx, "/election/leader", nodeId, clientv3.WithLease(lease.ID))
// Keep alive in background (drain the channel so responses don't back up)
keepAlive, _ := client.KeepAlive(ctx, lease.ID)
go func() {
	for range keepAlive {
	}
}()
// If this node crashes, the lease expires, the key is deleted,
// and another node can acquire it
When the lease expires, etcd atomically deletes every key attached to it. Any watcher on such a key receives a DELETE event and can attempt to acquire leadership.
etcd election API: etcd provides a higher-level election API that manages the lease acquisition and watch semantics:
election := concurrency.NewElection(session, "/election")
// Campaign blocks until this node becomes leader
election.Campaign(ctx, nodeId)
// Observe: watch for leadership changes
ch := election.Observe(ctx)
// Resign
election.Resign(ctx)
The election API uses lease-backed keys ordered by creation revision. The node whose key has the lowest creation revision wins: it was first to write its key in the current election cycle.
Fencing Tokens: Preventing Stale Winners #
A lease-based lock has a critical failure mode: the lease holder may pause (GC pause, CPU starvation, OS scheduling delay) for longer than the TTL. During the pause, the lease expires. A new node acquires the lease. The paused node resumes, unaware that its lease has expired, and attempts to write to the protected resource as though it is still the leader.
Two nodes now believe they are the leader. Without a fencing mechanism, both will write to the protected resource.
The fencing token solution (Kleppmann, 2016):
The lock service issues a monotonically increasing token with each lease grant. The protected resource tracks the highest token it has seen. Any write request with a token lower than the highest seen is rejected.
Lock service grants:
Node A: token=33, lease TTL=10s
(Node A pauses for 30s)
(Lease expires, new election)
Node B: token=34, lease TTL=10s
Node B writes to storage: { token=34, data=... } → ACCEPTED
last_seen_token = 34
Node A resumes, writes to storage: { token=33, data=... } → REJECTED
(33 < 34)
etcd implements fencing via its revision counter — a cluster-wide monotonically increasing integer that increments with every write. The creation revision of the lease key serves as the fencing token.
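The resource-side check in the trace above is a single comparison. A minimal Go sketch, with illustrative type and field names:

```go
package main

import (
	"errors"
	"fmt"
)

// Resource models a protected store that tracks the highest fencing
// token it has accepted.
type Resource struct {
	lastSeenToken int64
	data          string
}

var ErrStaleToken = errors.New("write rejected: stale fencing token")

// Write rejects any request whose token is lower than the highest
// token seen so far; an expired holder's token is always lower than
// its successor's.
func (r *Resource) Write(token int64, data string) error {
	if token < r.lastSeenToken {
		return ErrStaleToken
	}
	r.lastSeenToken = token
	r.data = data
	return nil
}

func main() {
	r := &Resource{}
	fmt.Println(r.Write(34, "from node B")) // <nil> (accepted)
	fmt.Println(r.Write(33, "from node A")) // rejected: stale token
}
```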
Why CAS alone is insufficient: CAS checks that the value has not changed since it was last read. A stale actor may read the value and find it unchanged (no other write occurred during the pause) and proceed with a stale CAS. Fencing checks actor authority, not value state. A stale actor has an expired lease regardless of whether the value changed.
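The distinction can be made concrete: a value-only CAS lets a stale actor through whenever no other write happened during its pause. A hypothetical sketch:

```go
package main

import "fmt"

// store holds a value with no fencing; cas checks only value equality.
type store struct{ value string }

// cas succeeds whenever expected matches the current value, regardless
// of who is asking. An actor whose lease expired during a pause still
// passes this check if nobody wrote in the meantime.
func (s *store) cas(expected, next string) bool {
	if s.value != expected {
		return false
	}
	s.value = next
	return true
}

func main() {
	s := &store{value: "v1"}
	// Node A reads "v1", then pauses past its lease TTL.
	// No other write occurs during the pause.
	// A resumes: its CAS still succeeds despite the expired lease.
	fmt.Println(s.cas("v1", "v2-from-stale-node")) // true: the hazard
}
```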
Why Redis Locks Are Unsafe for Correctness #
Redis SET key value NX EX ttl (set only if the key does not exist, with an expiry; often still called SETNX after the older command) is the most common distributed lock implementation. It is correct for efficiency locks (where a duplicate execution is inconvenient but recoverable) and incorrect for correctness locks (where a duplicate execution corrupts data).
The problems:
1. No fencing tokens. Redis SETNX does not issue monotonically increasing tokens. There is no mechanism for the protected resource to reject requests from stale lock holders. Both the expired holder and the new holder can write simultaneously.
2. Async replication. Redis replication is asynchronous by default. A lock acquired on the primary may not have been replicated to replicas before the primary crashes. If a replica is promoted, the lock is gone — another client can acquire it. The original holder still believes it holds the lock.
3. Clock-dependent TTL. The TTL depends on the lock holder’s and server’s wall clocks being synchronized. Clock drift between nodes changes the effective TTL. NTP corrections can cause time jumps that expire locks prematurely or extend them beyond their intended duration.
Redlock, Redis’s multi-node distributed lock algorithm, attempts to address the async replication problem by acquiring the lock on a majority (N/2+1) of N independent Redis nodes. It still does not solve the fencing token problem and still depends on timing assumptions that can be violated by GC pauses and clock drift.
The rule: Use Redis locks only where a duplicate execution is recoverable. Use etcd leases with fencing tokens where a duplicate execution corrupts state.
| Scenario | Acceptable lock | Reason |
|---|---|---|
| Cache warming (avoid duplicate computation) | Redis SETNX | Duplicate computation is merely wasteful |
| Leader election for a singleton job scheduler | etcd lease + fencing | Two schedulers running the same job causes duplicate side effects |
| Seat reservation in Ticketmaster | etcd lease + fencing OR DB transaction | Double-booking is a correctness violation |
| Rate limiting | Redis SETNX | Occasional over-counting is acceptable |
The etcd Transaction Model for Safe Selection #
For selection that requires both atomicity and fencing, etcd’s transaction model combines a compare-and-swap with multi-key atomicity:
// Attempt to acquire leadership atomically
txn := client.Txn(ctx)
resp, _ := txn.If(
clientv3.Compare(clientv3.CreateRevision("/election/leader"), "=", 0),
).Then(
clientv3.OpPut("/election/leader", nodeId, clientv3.WithLease(leaseID)),
).Else(
clientv3.OpGet("/election/leader"),
).Commit()
if resp.Succeeded {
// This node is now leader
// The create revision of the key is the fencing token
fencingToken = resp.Header.Revision
} else {
// Another node holds leadership
currentLeader = resp.Responses[0].GetResponseRange().Kvs[0]
}
The CreateRevision == 0 check is atomic with the put. No two nodes can simultaneously see CreateRevision == 0 and both succeed — one will lose the CAS and receive the else branch.
CAP Position of Selection Mechanisms #
| Mechanism | CAP choice | Behavior under partition |
|---|---|---|
| Raft leader election | CP | Minority partition cannot elect a leader; majority partition elects one |
| etcd lease | CP | Lease acquisition requires quorum write; unavailable if etcd quorum lost |
| Redis SETNX | AP | Lock can be acquired even in a partitioned minority node; split-brain possible |
| ZooKeeper ephemeral node | CP | ZAB consensus required; unavailable without quorum |
Selection at the infrastructure level is always CP — correctness requires consensus, which requires a majority. The minority partition cannot make progress. This is the correct tradeoff: two simultaneous leaders is worse than temporary unavailability.