
Ordering — Sequence Numbers and Total Order #

Every distributed coordination mechanism depends on ordering. Leader election requires knowing which term is current. Log replication requires knowing which entry comes before which. Fencing requires knowing which token was issued last. Conflict resolution requires knowing which write happened first — or whether two writes were concurrent and must be merged.

The fundamental problem: distributed systems have no shared clock. Different nodes may have wall clocks that differ by milliseconds or seconds. NTP typically synchronizes clocks to within a few milliseconds on a LAN (tens of milliseconds over the public internet), which is still insufficient for ordering events that occur microseconds apart. TrueTime (Google Spanner) provides bounded clock uncertainty, but requires specialized GPS/atomic-clock hardware.

Without a shared clock, we need logical time — mechanisms that capture causal relationships between events without relying on wall-clock accuracy.

Why Wall Clocks Are Insufficient #

Consider: node A records event e1 at wall clock time T, node B records event e2 at wall clock time T + 1ms. Can we conclude that e1 happened before e2? No — node A’s clock may be 5ms ahead of node B’s due to clock drift. e2 may have actually caused e1 (or the two may be genuinely concurrent).

The consequences of relying on wall clocks for ordering:

Last-write-wins with wall clocks: if a database uses wall clock timestamps to resolve concurrent writes, the winner depends on which node’s clock is ahead — not which write should win. Clock drift causes writes to be silently discarded.

Distributed tracing with wall clocks: if trace spans are ordered by wall clock time, a network round-trip may appear to precede the initiating request if the receiving node’s clock is behind.

Kafka offset ordering, by contrast: Kafka uses a monotonic offset counter per partition (not wall clock time) to order messages. Two messages with the same wall clock timestamp have distinct offsets and are strictly ordered.

The solution is logical clocks.

Lamport Clocks #


Leslie Lamport (1978) introduced the happens-before relation and the first logical clock:

Happens-before (→) is defined as:

  1. If a and b are events on the same process, and a occurred before b, then a → b.
  2. If a is the sending of a message and b is the receipt of that message, then a → b.
  3. Transitivity: if a → b and b → c, then a → c.

If neither a → b nor b → a, events a and b are concurrent.

Lamport Clock Rules #

Each process maintains a counter C, initialized to 0.

  • When a process executes a local event: C = C + 1.
  • When a process sends a message: C = C + 1, include C in the message.
  • When a process receives a message with timestamp T: C = max(C, T) + 1.

The clock condition: if a → b, then C(a) < C(b). The converse is not necessarily true: C(a) < C(b) does not imply a → b.

Lamport clocks impose a total order on all events by breaking ties with process ID: if C(a) = C(b) and process(a) < process(b), then a < b in the total order.
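These rules, plus the process-ID tie-break, can be sketched in Go (a minimal illustration; the `LamportClock` type and `Before` helper are assumptions for this sketch, not any library's API):

```go
package main

import "fmt"

// LamportClock is one process's clock. procID is used only to break
// timestamp ties in the total order.
type LamportClock struct {
	counter int
	procID  int
}

// Tick handles a local event: C = C + 1.
func (l *LamportClock) Tick() int {
	l.counter++
	return l.counter
}

// Send advances the clock and returns the timestamp to attach to the message.
func (l *LamportClock) Send() int {
	return l.Tick()
}

// Receive merges a message timestamp: C = max(C, T) + 1.
func (l *LamportClock) Receive(t int) int {
	if t > l.counter {
		l.counter = t
	}
	l.counter++
	return l.counter
}

// Before reports whether event (c1 on process p1) precedes (c2 on process p2)
// in the total order, breaking counter ties with the process ID.
func Before(c1, p1, c2, p2 int) bool {
	if c1 != c2 {
		return c1 < c2
	}
	return p1 < p2
}

func main() {
	a := &LamportClock{procID: 1}
	b := &LamportClock{procID: 2}

	a.Tick()      // a: C=1 (local event)
	t := a.Send() // a: C=2, message carries T=2
	b.Receive(t)  // b: C=max(0,2)+1 = 3
	fmt.Println(a.counter, b.counter) // prints "2 3"
}
```

The tie-break makes the order total and deterministic, but the choice between concurrent events with equal counters is arbitrary by construction.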

What Lamport clocks provide: a consistent global ordering of events that respects causality. If a → b, then a’s Lamport timestamp is less than b’s.

What Lamport clocks do not provide: they cannot distinguish between “a happened before b” and “a and b were concurrent but a happened to get a lower timestamp.” Two events with timestamps 5 and 6 might be causally related or concurrent — you cannot tell from Lamport timestamps alone.

Practical Use of Lamport Clocks #

Database versioning: each write is tagged with a Lamport timestamp. Last-write-wins (LWW) conflict resolution uses the Lamport timestamp to determine the winner. Since Lamport timestamps respect happens-before, if one write causally follows another, the later write wins. If two writes are concurrent, one wins based on the arbitrary tie-breaking order — this is acceptable because concurrent writes represent a genuine conflict that must be resolved somehow.

Kafka: each message in a partition has a monotonically increasing offset. This is a Lamport clock within the partition. Across partitions, there is no ordering guarantee.

etcd revision counter: every mutation to the etcd cluster increments a cluster-wide revision counter. The revision is a Lamport clock for the entire etcd store — it captures causality across all keys.

Vector Clocks #

Lamport clocks impose a total order but lose information about concurrency. Vector clocks (Fidge, 1988; Mattern, 1988) capture the full partial order — they can determine whether two events are causally related or genuinely concurrent.

Vector Clock Rules #

Each process i maintains a vector V of N counters (one per process).

  • When process i executes a local event: V[i] = V[i] + 1.
  • When process i sends a message: V[i] = V[i] + 1, include V in the message.
  • When process i receives a message with vector W: V[j] = max(V[j], W[j]) for all j, then V[i] = V[i] + 1.

Comparison: for vector clocks V(a) and V(b):

  • a → b: V(a)[i] ≤ V(b)[i] for all i, and V(a)[j] < V(b)[j] for at least one j.
  • b → a: V(b)[i] ≤ V(a)[i] for all i, and V(b)[j] < V(a)[j] for at least one j.
  • a ∥ b (concurrent): neither a → b nor b → a. Formally, V(a)[i] > V(b)[i] for some i AND V(b)[j] > V(a)[j] for some j.
Three processes: A, B, C
Initial state: A=[0,0,0], B=[0,0,0], C=[0,0,0]

A sends message to B:
  A local event: A=[1,0,0]
  A sends: V=[1,0,0]
  B receives: B=[max(0,1), 0+1, 0] = [1,1,0]

B sends message to C:
  B local event: B=[1,2,0]
  B sends: V=[1,2,0]
  C receives: C=[max(0,1), max(0,2), 0+1] = [1,2,1]

A sends message to C (independently):
  A local event: A=[2,0,0]
  A sends: V=[2,0,0]
  C receives: C=[max(1,2), max(2,0), 1+1] = [2,2,2]

Concurrent detection: suppose A has vector [2,0,0] and B has [1,2,0]. Is A’s event causally related to B’s?

  • A → B? A[0]=2 > B[0]=1. No, not A → B.
  • B → A? B[1]=2 > A[1]=0. No, not B → A.
  • A ∥ B: concurrent. These events happened without causal relationship.
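The update rules and this three-way comparison can be sketched in Go (a minimal illustration; the `VClock` type and `Compare` helper are assumptions for this sketch):

```go
package main

import "fmt"

// VClock is a vector clock for N known processes; index j is process j's counter.
type VClock []int

// Tick records a local event at process i.
func (v VClock) Tick(i int) { v[i]++ }

// Send records the send event and returns a copy to attach to the message.
func (v VClock) Send(i int) VClock {
	v[i]++
	w := make(VClock, len(v))
	copy(w, v)
	return w
}

// Receive takes the element-wise max with the incoming vector, then ticks.
func (v VClock) Receive(i int, w VClock) {
	for j := range v {
		if w[j] > v[j] {
			v[j] = w[j]
		}
	}
	v[i]++
}

// Compare returns -1 if a -> b, +1 if b -> a, and 0 if a ∥ b (or equal).
func Compare(a, b VClock) int {
	aBelow, bBelow := false, false
	for j := range a {
		if a[j] < b[j] {
			aBelow = true
		}
		if b[j] < a[j] {
			bBelow = true
		}
	}
	switch {
	case aBelow && !bBelow:
		return -1 // a happens-before b
	case bBelow && !aBelow:
		return 1 // b happens-before a
	default:
		return 0 // concurrent (or identical)
	}
}

func main() {
	// The two events from the example above: A=[2,0,0] and B=[1,2,0].
	fmt.Println(Compare(VClock{2, 0, 0}, VClock{1, 2, 0})) // prints "0": concurrent
}
```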

Vector Clocks in Practice: Dynamo-Style Databases #

Amazon Dynamo and other Dynamo-style stores such as Riak and Voldemort used vector clocks for conflict detection. (Cassandra, despite its Dynamo lineage, chose wall-clock LWW instead.) Each object carried a vector clock updated on every write. On read, if two versions had vector clocks where neither dominated the other, the application received both versions and was expected to merge them.

The versioned object problem: vector clocks have a practical limitation — they grow proportionally to the number of processes that have written to a key. In a system where any node can handle writes and nodes are added over time, vector clocks grow without bound.

Dynamo addressed this by pruning old vector clock entries beyond a configurable size threshold, at the cost of occasionally losing the ability to detect concurrency (treated as happened-before instead of concurrent). This is an engineering tradeoff: smaller vector clocks at the cost of occasionally incorrect concurrency detection.

Version Vectors vs Vector Clocks #

A version vector (not to be confused with vector clock) tracks the number of writes from each replica, not per-process event counts. Used in Riak (Basho) and CRDTs to track causality between replicas:

Object at replica 1: {data: "v1", vv: {replica1: 3, replica2: 2}}
Object at replica 2: {data: "v2", vv: {replica1: 3, replica2: 3}}

replica2’s version dominates (all entries ≥), so “v2” is the current value.
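The dominance check can be sketched in Go (illustrative; replica IDs are map keys, so entries missing from one vector read as zero):

```go
package main

import "fmt"

// VersionVector maps a replica ID to the number of writes seen from it.
type VersionVector map[string]int

// Dominates reports whether a has seen every write that b has seen.
func Dominates(a, b VersionVector) bool {
	for replica, n := range b {
		if a[replica] < n { // a missing key reads as 0
			return false
		}
	}
	return true
}

func main() {
	v1 := VersionVector{"replica1": 3, "replica2": 2}
	v2 := VersionVector{"replica1": 3, "replica2": 3}
	fmt.Println(Dominates(v2, v1)) // prints "true": v2 is the current value
	fmt.Println(Dominates(v1, v2)) // prints "false"
}
```

When `Dominates` is false in both directions, the versions are siblings and must be merged by the application (or by a CRDT merge function).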

Total Order Broadcast #

A total order broadcast (also called atomic broadcast) is a communication primitive that delivers messages to all nodes in the same order. This is the key building block for replicated state machines.

Properties of total order broadcast:

  • Reliable delivery: if a correct node delivers a message, all correct nodes eventually deliver it.
  • Total order: all correct nodes deliver messages in the same order.

Total order broadcast and consensus are equivalent — each can be implemented using the other:

  • Given consensus, implement total order broadcast: use consensus to agree on the next message in the sequence.
  • Given total order broadcast, implement consensus: each node broadcasts its proposed value; the first message delivered is the agreed-upon value.

This equivalence means that any system implementing a replicated log (Raft, Paxos, ZAB) is implementing total order broadcast.

Raft as Total Order Broadcast #

Raft’s log replication is total order broadcast:

  • Each log entry is a message.
  • All nodes deliver log entries in the same order (same log index, same term, same content — enforced by the log matching invariant).
  • If the leader delivers an entry at index i, all nodes that receive it will have the same entry at index i.

The total order is the log index. There is exactly one committed entry per log position; uncommitted entries from a deposed leader may be overwritten after a term change, but committed entries never are.

etcd’s Revision Counter #

etcd’s revision counter is a global, monotonically increasing integer that serves as the total order for all events in the etcd cluster:

  • Every write operation (Put, Delete, Txn) increments the revision.
  • The response to every operation includes the current revision.
  • Watch events include the revision at which the event occurred.
// import clientv3 "go.etcd.io/etcd/client/v3"

// Get current revision
resp, _ := client.Get(ctx, "somekey")
currentRevision := resp.Header.Revision

// Watch from a specific revision (guaranteed no gaps)
watcher := client.Watch(ctx, "prefix", clientv3.WithPrefix(),
    clientv3.WithRev(currentRevision))

This gives clients the ability to reconstruct the complete history of any key (or prefix) from a given revision forward. No events are missed, even if the client disconnects temporarily and reconnects — it can request events from the last seen revision.

The revision as fencing token (Chapter 3): the creation revision of a lease-backed key serves as a fencing token for distributed locks. Because the revision is globally monotonic, a new lock holder always has a higher revision than any previous holder. The protected resource can reject writes with lower revisions.
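The resource-side check can be sketched as follows (a hypothetical `Resource` type invented for illustration; a real implementation would persist the highest token seen so the check survives restarts):

```go
package main

import "fmt"

// Resource is a protected resource guarded by fencing tokens (e.g. etcd
// creation revisions). It accepts a write only if the caller's token is at
// least as high as the highest token seen so far.
type Resource struct {
	highestToken int64
}

// Write rejects stale tokens: a client that lost the lock (and whose lease
// expired) cannot overwrite work done by the new lock holder.
func (r *Resource) Write(token int64, data string) error {
	if token < r.highestToken {
		return fmt.Errorf("stale fencing token %d < %d", token, r.highestToken)
	}
	r.highestToken = token
	// ... apply the write ...
	return nil
}

func main() {
	r := &Resource{}
	fmt.Println(r.Write(7, "from new holder")) // prints "<nil>": accepted
	fmt.Println(r.Write(5, "from old holder")) // prints "stale fencing token 5 < 7"
}
```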

Hybrid Logical Clocks #

Lamport clocks are logical and cannot be compared to wall clock time. Vector clocks grow with system size. Pure wall clocks have drift problems. Hybrid Logical Clocks (HLC) (Kulkarni et al., 2014) combine logical clocks with wall clocks:

HLC(ts) = (l, c)
  where l = max(physical_clock, l_max_seen)
        c = counter (incremented when l does not advance)

HLC tracks wall clock time but uses a logical component to break ties and maintain causality when wall clocks would produce equal or decreasing timestamps.
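A minimal sketch of these update rules in Go (the `HLC` type and the injected `now` function are illustrative; the physical clock is frozen in the example to make the logical tie-breaking visible):

```go
package main

import "fmt"

// HLC is a hybrid logical clock: l tracks the maximum physical time seen,
// c is the logical counter used when l does not advance.
type HLC struct {
	l, c int64
	now  func() int64 // injected physical clock (e.g. wall time in ms)
}

// Tick updates the clock for a local or send event.
func (h *HLC) Tick() (int64, int64) {
	pt := h.now()
	if pt > h.l {
		h.l, h.c = pt, 0 // physical time advanced: reset the counter
	} else {
		h.c++ // physical time did not advance: break the tie logically
	}
	return h.l, h.c
}

// Receive merges a remote timestamp (ml, mc) per the HLC update rules.
func (h *HLC) Receive(ml, mc int64) (int64, int64) {
	pt := h.now()
	switch {
	case pt > h.l && pt > ml:
		h.l, h.c = pt, 0 // local physical clock dominates
	case h.l == ml:
		if mc > h.c {
			h.c = mc
		}
		h.c++ // equal l components: merge counters and advance
	case h.l > ml:
		h.c++ // local l dominates
	default: // ml > h.l: remote l dominates
		h.l, h.c = ml, mc+1
	}
	return h.l, h.c
}

func main() {
	// A node whose wall clock (frozen at 100) lags behind a peer's timestamps.
	h := &HLC{now: func() int64 { return 100 }}
	h.Tick()          // l=100, c=0
	h.Receive(105, 2) // peer is ahead: l=105, c=3
	l, c := h.Tick()  // wall clock still 100: l stays 105, c=4
	fmt.Println(l, c) // prints "105 4"
}
```

Note how causality is preserved even though the local wall clock never catches up: timestamps still only move forward.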

CockroachDB uses HLC for its transaction timestamps. CockroachDB needs to compare timestamps across nodes for serializable isolation, but Lamport clocks alone cannot be compared to wall clock time (important for TTL-based expiration). HLC gives CockroachDB:

  • Causal ordering (logical clock component)
  • Approximate wall clock time (physical component)
  • Clock skew resistance (max of physical clock and observed timestamps)

Google Spanner’s TrueTime is a different approach: hardware (GPS + atomic clocks) that bounds clock uncertainty to ε ≤ 7ms globally. Spanner’s commit protocol waits out the uncertainty window before committing, ensuring that the transaction’s timestamp is definitely in the past by the time the commit is acknowledged. This gives external consistency — transactions appear in serial order as observed from any external clock.

Total Order at Application Level #

Total order broadcast is expensive — every write goes through consensus. Most applications do not need total order across all operations; they need order within a specific scope.

Partition-level ordering: Kafka guarantees total order within a partition, not across partitions. Events in the same partition are delivered in the order they were written. Events in different partitions have no relative ordering guarantee. Applications achieve the needed ordering by routing events that must be ordered to the same partition (using a consistent partition key — order_id, user_id, etc.).
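The routing idea can be sketched in Go (Kafka's default partitioner actually hashes keys with murmur2; FNV is used here purely for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor deterministically maps a key to a partition. Because the
// mapping is a pure function of the key, all events for one entity
// (order_id, user_id, ...) land in the same partition and are therefore
// totally ordered relative to each other.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	p1 := partitionFor("order-42", 12)
	p2 := partitionFor("order-42", 12)
	fmt.Println(p1 == p2) // prints "true": same key, same partition
}
```

The tradeoff: changing the partition count changes the mapping, so events for a key written before and after a repartition may end up in different partitions.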

Per-entity ordering: in event-sourced systems, events for the same aggregate are stored in the same stream with monotonic sequence numbers. Order is guaranteed within the stream. Cross-aggregate ordering is not guaranteed and typically not required (different aggregates are independent).

Causal ordering: in some systems, ordering matters only along causal paths, not across unrelated events. Vector clocks enable causal consistency without total order. Operations that are causally related are always observed in causal order; operations that are concurrent may be observed in any order.

Ordering Summary #

| Clock type | Provides | Does not provide | Used in |
|---|---|---|---|
| Wall clock | Human-readable time | Causal ordering under drift | Logging, metrics, display |
| Lamport clock | Total order respecting causality | Concurrency detection | Kafka offsets, etcd revision |
| Vector clock | Partial order, concurrency detection | Total order | Dynamo, Riak, CRDTs |
| HLC | Causality + approximate wall time | Hard real-time bounds | CockroachDB, YugabyteDB |
| TrueTime | Bounded wall clock uncertainty | Zero uncertainty | Google Spanner |
| Raft log index | Total order for replicated log | Cross-log ordering | etcd, CockroachDB ranges |