Replication #
replication = maintain multiple copies of state
It answers:
where else does this state exist, and how do copies converge?
Role in the catalog: the topology layer — and the block whose rewrite is mostly treaty-writing, because it arrives after quorum.md, log.md, and lease_fencing.md claimed the strong machinery. The treaty:
quorum.md owns the commit proof (sync/quorum replication, SMR,
leaderless R/W — its two arms)
log.md owns the shipped structure (log shipping, ordering, the ladder)
lease_fencing.md owns authority transfer (failover fencing, split brain)
replication.md owns the TOPOLOGY (the geometry of copies, placement,
and everything that happens where
the strong machinery ISN'T running:
async tails, accepted conflict,
repair, promotion)
The native territory is precisely the part of the copy-universe that consensus does not govern.
Central tension:
availability, durability, locality, throughput
vs
consistency, latency, conflict complexity
Design Axes (the core module) #
Axis 1 — Write Topology (the structural cleave) #
single-writer: one primary; followers copy
(Postgres primary/replica, Kafka partition leader)
multi-writer coordinated: per-key/per-partition single-writers —
THE PRACTICAL DOMINANT CASE the classic list omits:
many leaders, zero conflicts, because no key has two
(Kafka's partition set; sharded primaries)
multi-writer conflicting: active-active; conflict ACCEPTED at write time,
deferred to resolution machinery (axis 3)
no-designated-writer: leaderless; coordination per request
(Dynamo/Cassandra → quorum.md's overlap arm)
The deep-lesson warning belongs here:
multi-region ≠ active-active correctness.
geography (axis 4) does not choose your position on THIS axis — you do,
and conflicts are the standing price of the right half.
Interrogation:
Who may write, per key — one, one-of-many-by-partition, or anyone?
If two writers CAN touch one key: is the conflict machinery (axis 3)
chosen, tested, and explained to users — or discovered?
Could partitioned single-writers deliver the locality that active-active
was about to buy with conflicts?
Axis 2 — Ack Discipline (what is true when the client hears “yes”) #
sync-to-quorum: the ack is a commit certificate → quorum.md, fully
semi-sync: at least one replica has it before ack —
the window shrunk to zero, at latency cost
async: THE NATIVE CASE — the ack precedes replication
The async tail is log.md’s uncommitted-suffix* relocated:
not the log lying about its tail —
the SYSTEM lying about durability.
the acked-but-unreplicated window is real, mobile, and usually unmeasured.
its name in disaster-recovery currency is RPO.
Interrogation:
How wide is the window right now — measured, graphed, alerted?
Who agreed to the RPO it implies, and before or after reading it?
"Failover with no data loss" — against which ack discipline is that
claim even coherent?
Axis 3 — Convergence Machinery (how divergent copies re-agree) #
ordered apply: ship the log, apply in order — divergence prevented,
determinism inherited (state_machine.md axis 1;
structure → log.md)
version + resolve: divergence represented, then resolved —
siblings + vector clocks (honest, pushes work to readers)
LWW (simple, and its clock lies — a chosen data loss)
CRDTs (convergence bought by ALGEBRA: commutative,
associative, idempotent operations converge under ANY
delivery order — the same mathematical requirement as
materialized.md's non-associative-aggregation trap,
in a different block)
anti-entropy: background repair — Merkle-tree diff, read repair;
with tombstone resurrection as its signature hazard:
GC's phase-two violation appearing in SPACE rather
than time (a replica that missed the delete "helpfully"
restores it)
And the boundary this axis does NOT cross:
convergence machinery converges REPLICAS, not READER EXPERIENCES.
read-your-writes and monotonic reads are session contracts
layered on axis 5 — "eventual repair vs user-visible consistency."
Interrogation:
When two copies disagree, what is the resolution — by order, by version
logic, or by algebra? Named, or emergent?
LWW: whose clock, and what does it silently discard?
Anti-entropy: does the repair respect tombstone grace (GC's treaty)?
What session guarantees do users actually assume, and what provides them?
Axis 4 — Placement Geometry (what the copies actually protect against) #
capacity.md’s statistical bet, with durability as collateral:
three replicas in one rack are one replica with extra steps.
the replica COUNT is a headline; the failure-domain SPREAD is the contract.
placement: rack / zone / region spread (boundary.md's failure domains;
scheduler.md's spread objective executes it)
geo: not a type — this axis at its widest, plus boundary.md's
data-residency constraint riding along (a copy in the wrong
jurisdiction is a compliance event, not a durability win)
erasure coding: the economics substitution on this axis —
same failure-domain math, storage overhead traded for
repair cost + small-write amplification
(capacity.md's amplification triangle deciding between
copy-count and coding; cold data votes coding, hot data
votes copies)
Interrogation:
Which correlated failures does THIS placement survive — enumerated?
What does the residency boundary forbid, and does async replication
respect it mid-stream?
Coding vs copies: what is the repair bandwidth bill when a domain dies?
Axis 5 — Read Contract (what readers may observe) #
One sentence dissolves the “read replica” type:
a replica serving reads is a CACHE whose invalidation protocol
is the replication stream itself.
So cache.md’s freshness ladder applies wholesale: follower-stale reads, lag-bounded reads, session-pinned reads (read-your-writes = route the session to a caught-up copy or the primary), up to quorum.md’s read rungs at the top. “Replica exists ≠ replica is current” is the ladder’s bottom rung, honestly labeled.
Interrogation:
Which rung does each read path buy — and is the routing enforcing it,
or hoping?
After a write, what stops the very next read landing on a lagging replica?
Is lag a metric or a contract? (per-replica staleness bound, or vibes?)
Special Seats #
Immutable-object replication — the demonstration that immutability is the cheapest replication strategy:
content-addressing collapses the problem: a digest cannot diverge,
cannot conflict, needs no convergence machinery. you replicate FACTS,
not state. (S3 objects, Kafka segments, Parquet files, OCI layers)
residual hazards — both ordering, both owned elsewhere:
manifest-before-replica (GC's publication protocol, inverted:
publish AFTER the copies exist)
source-GC races target-durability (the pin registry must count
in-flight replication as a pin)
Snapshot + delta bootstrap — materialized.md’s cutover steps 2–3, verbatim: backfill from a consistency point, catch up on the stream from that coordinate. The gap/overlap dial, sixth appearance — gap = changes between snapshot and stream missed; overlap = double-apply, paid for by idempotent apply. Plus the retention treaty: the snapshot must be young enough that the log still reaches back to it.
Technical Bottleneck: The Promotion Decision* #
when the writer dies, someone must choose a successor
from replicas whose positions differ —
a forced trade, executed under pressure:
promote the most-caught-up NOW → lose the acked-but-unshipped window
(RPO breach, silent truncation of
acknowledged history)
wait for certainty → extend the outage (RTO breach)
promote wrongly → manufacture split brain — which
fencing then CONTAINS but cannot
un-choose (lease_fencing.md)
Consensus dissolves this* — promotion becomes an election over quorum-committed state, and no window exists to lose ( quorum.md’s manufactured facts). Outside consensus, every recipe is partial:
semi-sync ack the window is zero by construction, latency pays
mandatory fencing the old primary is stopped by the world,
not by its own good judgment (lease_fencing.md)
lag-bounded eligibility only replicas within X may stand for election
honest RPO/RTO arithmetic declared BEFORE the incident —
the trade is designed, not improvised at 3am
A strong design says explicitly:
who owns writes, per key (axis 1),
what the ack certifies and how wide the window is (axis 2),
how divergence re-agrees — by order, version, or algebra (axis 3),
which failure domains the placement actually survives (axis 4),
which freshness rung each reader buys (axis 5),
and the promotion trade — chosen in daylight, executed by fencing.
Replication As Protocol (the crossing-point spec — keep) #
define replica set and placement (axis 4)
choose write authority (axis 1)
write/append → ship (structure: log.md)
replicas persist/apply
advance commit/freshness marker (proof: quorum.md; coordinate: log.md's ladder)
serve reads per contract (axis 5)
detect lag/divergence → repair (axis 3)
fail over: fence, elect, promote (bottleneck*; lease_fencing.md)
change membership safely (quorum.md's self-referential seat)
Named Configurations (lookup table) #
Vector = {topology, ack, convergence, placement, read contract}. Rows marked → are owned elsewhere.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Leader/follower | single-writer, sync or async, ordered apply, varies, lag-bounded reads | Postgres streaming; Kafka partition | stale-replica promotion*; split brain (→ fence); lag-blind reads |
| Sync/quorum → quorum.md | single-writer, quorum ack, ordered, spread, up to linearizable | Raft majority | (owned: intersection*, availability under quorum loss) |
| Async replication | single-writer, async — the native window, ordered, often geo, stale | Postgres async | data loss on failover*; silent lag growth; RYW violation |
| Multi-leader (partitioned) | multi-writer coordinated, per-key sync-ish, none needed, spread, per-key fresh | sharded primaries; Kafka | hot partition; rebalance (→ lease_fencing) — but zero conflicts |
| Multi-leader (active-active) | multi-writer conflicting, async, version/CRDT/LWW, geo, eventual | CRDT systems; multi-master DBs | conflict surprise; non-commutative ops; LWW clock-loss |
| Leaderless → quorum.md | no-designated-writer, R/W, version+repair, spread, R-rung | Dynamo paper | (owned: siblings, sloppy quorum; repair lag lands here) |
| Log shipping → log.md | single-writer, async, ordered apply, —, — | WAL shipping; CDC | retention outruns replica (treaty); schema breaks apply; duplicate on reconnect |
| Snapshot + delta | seat: bootstrap, —, coordinate-bound catch-up, —, — | Debezium snapshot+stream; base backup+WAL | gap-or-overlap at the seam (dial #6); snapshot older than retention |
| Anti-entropy | any, —, background repair, —, — | Dynamo Merkle trees; Cassandra repair | tombstone resurrection (GC in space); repair storms; wrong granularity |
| Read replicas | single-writer, async, ordered, locality, cache ladder | Postgres read replicas | = cache.md rungs: stale read, RYW, replica-after-write routing |
| Geo-replication | axis-4-at-widest + axis-1 choice, async mostly, chosen per above, regions + residency, — | DynamoDB Global Tables (AP-ish); Spanner (strong) | regional partition; cross-region conflict; residency violation; failback loss |
| Immutable objects | seat: facts not state, —, none needed (digest), spread, — | S3 CRR; OCI layers | manifest-before-replica; source-GC races target; partial copy |
| Erasure coding | axis-4 economics substitution, —, —, fragment placement, — | Ceph EC pools; HDFS EC | repair bandwidth bill; correlated fragment loss; small-write amplification |
Vocabulary #
replica primary follower topology placement failure domain
ack window RPO RTO semi-sync lag
sibling vector clock LWW CRDT commutative convergent
anti-entropy Merkle tree read repair hinted handoff resurrection
promotion eligibility failover failback split brain fence
snapshot delta consistency point catch-up
digest content-addressed manifest
erasure code fragment parity repair bandwidth
session guarantee read-your-writes monotonic reads
Deep Lesson #
Replication bugs come from confusing pairs on different axes:
copy vs commit (axis 2: the ack certifies a rung — log.md/quorum.md)
replication vs consensus (the treaty: topology vs proof)
replica exists vs replica is current (axis 5: bottom rung of the cache ladder)
failover vs no data loss (bottleneck*: the trade was always there)
read replica vs fresh read (axis 5: a replica is a cache)
multi-region vs active-active correctness (axis 4 doesn't choose axis 1)
eventual repair vs user-visible consistency (axis 3 converges replicas, not sessions)
tombstone vs permanent deletion (axis 3 + GC: resurrection in space)
Design procedure: choose the write topology per key, name the ack window and its RPO, pick convergence by order, version, or algebra, enumerate the failure domains the placement actually survives, assign every read path a rung — and settle the promotion trade in daylight, because the incident will not wait for the meeting. The named types are recognition shortcuts, not the design space.