Skip to main content
  1. concepts/

Replication #

replication = maintain multiple copies of state

It answers:

where else does this state exist, and how do copies converge?

Role in the catalog: the topology layer — and the block whose rewrite is mostly treaty-writing, because it arrives after quorum.md, log.md, and lease_fencing.md claimed the strong machinery. The treaty:

quorum.md owns the commit proof        (sync/quorum replication, SMR,
                                        leaderless R/W — its two arms)
log.md owns the shipped structure      (log shipping, ordering, the ladder)
lease_fencing.md owns authority transfer (failover fencing, split brain)
replication.md owns the TOPOLOGY       (the geometry of copies, placement,
                                        and everything that happens where
                                        the strong machinery ISN'T running:
                                        async tails, accepted conflict,
                                        repair, promotion)

The native territory is precisely the part of the copy-universe that consensus does not govern.

Central tension:

availability, durability, locality, throughput
        vs
consistency, latency, conflict complexity

Design Axes (the core module) #

Axis 1 — Write Topology (the structural cleave) #

single-writer:           one primary; followers copy
                         (Postgres primary/replica, Kafka partition leader)
multi-writer coordinated: per-key/per-partition single-writers —
                         THE PRACTICAL DOMINANT CASE the classic list omits:
                         many leaders, zero conflicts, because no key has two
                         (Kafka's partition set; sharded primaries)
multi-writer conflicting: active-active; conflict ACCEPTED at write time,
                         deferred to resolution machinery (axis 3)
no-designated-writer:    leaderless; coordination per request
                         (Dynamo/Cassandra → quorum.md's overlap arm)

The deep-lesson warning belongs here:

multi-region ≠ active-active correctness.
geography (axis 4) does not choose your position on THIS axis — you do,
and conflicts are the standing price of the right half.

Interrogation:

Who may write, per key — one, one-of-many-by-partition, or anyone?
If two writers CAN touch one key: is the conflict machinery (axis 3)
  chosen, tested, and explained to users — or discovered?
Could partitioned single-writers deliver the locality that active-active
  was about to buy with conflicts?

Axis 2 — Ack Discipline (what is true when the client hears “yes”) #

sync-to-quorum:   the ack is a commit certificate → quorum.md, fully
semi-sync:        at least one replica has it before ack —
                  the window shrunk to zero, at latency cost
async:            THE NATIVE CASE — the ack precedes replication

The async tail is log.md’s uncommitted-suffix* relocated:

not the log lying about its tail —
the SYSTEM lying about durability.
the acked-but-unreplicated window is real, mobile, and usually unmeasured.
its name in disaster-recovery currency is RPO.

Interrogation:

How wide is the window right now — measured, graphed, alerted?
Who agreed to the RPO it implies, and before or after reading it?
"Failover with no data loss" — against which ack discipline is that
  claim even coherent?

Axis 3 — Convergence Machinery (how divergent copies re-agree) #

ordered apply:      ship the log, apply in order — divergence prevented,
                    determinism inherited (state_machine.md axis 1;
                    structure → log.md)
version + resolve:  divergence represented, then resolved —
                    siblings + vector clocks (honest, pushes work to readers)
                    LWW (simple, and its clock lies — a chosen data loss)
                    CRDTs (convergence bought by ALGEBRA: commutative,
                    associative, idempotent operations converge under ANY
                    delivery order — the same mathematical requirement as
                    materialized.md's non-associative-aggregation trap,
                    in a different block)
anti-entropy:       background repair — Merkle-tree diff, read repair;
                    with tombstone resurrection as its signature hazard:
                    GC's phase-two violation appearing in SPACE rather
                    than time (a replica that missed the delete "helpfully"
                    restores it)

And the boundary this axis does NOT cross:

convergence machinery converges REPLICAS, not READER EXPERIENCES.
read-your-writes and monotonic reads are session contracts
layered on axis 5 — "eventual repair vs user-visible consistency."

Interrogation:

When two copies disagree, what is the resolution — by order, by version
  logic, or by algebra? Named, or emergent?
LWW: whose clock, and what does it silently discard?
Anti-entropy: does the repair respect tombstone grace (GC's treaty)?
What session guarantees do users actually assume, and what provides them?

Axis 4 — Placement Geometry (what the copies actually protect against) #

capacity.md’s statistical bet, with durability as collateral:

three replicas in one rack are one replica with extra steps.
the replica COUNT is a headline; the failure-domain SPREAD is the contract.
placement:      rack / zone / region spread (boundary.md's failure domains;
                scheduler.md's spread objective executes it)
geo:            not a type — this axis at its widest, plus boundary.md's
                data-residency constraint riding along (a copy in the wrong
                jurisdiction is a compliance event, not a durability win)
erasure coding: the economics substitution on this axis —
                same failure-domain math, storage overhead traded for
                repair cost + small-write amplification
                (capacity.md's amplification triangle deciding between
                copy-count and coding; cold data votes coding, hot data
                votes copies)

Interrogation:

Which correlated failures does THIS placement survive — enumerated?
What does the residency boundary forbid, and does async replication
  respect it mid-stream?
Coding vs copies: what is the repair bandwidth bill when a domain dies?

Axis 5 — Read Contract (what readers may observe) #

One sentence dissolves the “read replica” type:

a replica serving reads is a CACHE whose invalidation protocol
is the replication stream itself.

So cache.md’s freshness ladder applies wholesale: follower-stale reads, lag-bounded reads, session-pinned reads (read-your-writes = route the session to a caught-up copy or the primary), up to quorum.md’s read rungs at the top. “Replica exists ≠ replica is current” is the ladder’s bottom rung, honestly labeled.

Interrogation:

Which rung does each read path buy — and is the routing enforcing it,
  or hoping?
After a write, what stops the very next read landing on a lagging replica?
Is lag a metric or a contract? (per-replica staleness bound, or vibes?)

Special Seats #

Immutable-object replication — the demonstration that immutability is the cheapest replication strategy:

content-addressing collapses the problem: a digest cannot diverge,
cannot conflict, needs no convergence machinery. you replicate FACTS,
not state. (S3 objects, Kafka segments, Parquet files, OCI layers)
residual hazards — both ordering, both owned elsewhere:
  manifest-before-replica  (GC's publication protocol, inverted:
                            publish AFTER the copies exist)
  source-GC races target-durability (the pin registry must count
                            in-flight replication as a pin)

Snapshot + delta bootstrapmaterialized.md’s cutover steps 2–3, verbatim: backfill from a consistency point, catch up on the stream from that coordinate. The gap/overlap dial, sixth appearance — gap = changes between snapshot and stream missed; overlap = double-apply, paid for by idempotent apply. Plus the retention treaty: the snapshot must be young enough that the log still reaches back to it.


Technical Bottleneck: The Promotion Decision* #

when the writer dies, someone must choose a successor
from replicas whose positions differ —
a forced trade, executed under pressure:
promote the most-caught-up NOW    → lose the acked-but-unshipped window
                                    (RPO breach, silent truncation of
                                     acknowledged history)
wait for certainty                → extend the outage (RTO breach)
promote wrongly                   → manufacture split brain — which
                                    fencing then CONTAINS but cannot
                                    un-choose (lease_fencing.md)

Consensus dissolves this* — promotion becomes an election over quorum-committed state, and no window exists to lose ( quorum.md’s manufactured facts). Outside consensus, every recipe is partial:

semi-sync ack            the window is zero by construction, latency pays
mandatory fencing        the old primary is stopped by the world,
                         not by its own good judgment (lease_fencing.md)
lag-bounded eligibility  only replicas within X may stand for election
honest RPO/RTO arithmetic declared BEFORE the incident —
                         the trade is designed, not improvised at 3am

A strong design says explicitly:

who owns writes, per key (axis 1),
what the ack certifies and how wide the window is (axis 2),
how divergence re-agrees — by order, version, or algebra (axis 3),
which failure domains the placement actually survives (axis 4),
which freshness rung each reader buys (axis 5),
and the promotion trade — chosen in daylight, executed by fencing.

Replication As Protocol (the crossing-point spec — keep) #

define replica set and placement (axis 4)
choose write authority (axis 1)
write/append → ship (structure: log.md)
replicas persist/apply
advance commit/freshness marker (proof: quorum.md; coordinate: log.md's ladder)
serve reads per contract (axis 5)
detect lag/divergence → repair (axis 3)
fail over: fence, elect, promote (bottleneck*; lease_fencing.md)
change membership safely (quorum.md's self-referential seat)

Named Configurations (lookup table) #

Vector = {topology, ack, convergence, placement, read contract}. Rows marked → are owned elsewhere.

NameVectorCanonical study objectSignature failure
Leader/followersingle-writer, sync or async, ordered apply, varies, lag-bounded readsPostgres streaming; Kafka partitionstale-replica promotion*; split brain (→ fence); lag-blind reads
Sync/quorum → quorum.mdsingle-writer, quorum ack, ordered, spread, up to linearizableRaft majority(owned: intersection*, availability under quorum loss)
Async replicationsingle-writer, async — the native window, ordered, often geo, stalePostgres asyncdata loss on failover*; silent lag growth; RYW violation
Multi-leader (partitioned)multi-writer coordinated, per-key sync-ish, none needed, spread, per-key freshsharded primaries; Kafkahot partition; rebalance (→ lease_fencing) — but zero conflicts
Multi-leader (active-active)multi-writer conflicting, async, version/CRDT/LWW, geo, eventualCRDT systems; multi-master DBsconflict surprise; non-commutative ops; LWW clock-loss
Leaderless → quorum.mdno-designated-writer, R/W, version+repair, spread, R-rungDynamo paper(owned: siblings, sloppy quorum; repair lag lands here)
Log shipping → log.mdsingle-writer, async, ordered apply, —, —WAL shipping; CDCretention outruns replica (treaty); schema breaks apply; duplicate on reconnect
Snapshot + deltaseat: bootstrap, —, coordinate-bound catch-up, —, —Debezium snapshot+stream; base backup+WALgap-or-overlap at the seam (dial #6); snapshot older than retention
Anti-entropyany, —, background repair, —, —Dynamo Merkle trees; Cassandra repairtombstone resurrection (GC in space); repair storms; wrong granularity
Read replicassingle-writer, async, ordered, locality, cache ladderPostgres read replicas= cache.md rungs: stale read, RYW, replica-after-write routing
Geo-replicationaxis-4-at-widest + axis-1 choice, async mostly, chosen per above, regions + residency, —DynamoDB Global Tables (AP-ish); Spanner (strong)regional partition; cross-region conflict; residency violation; failback loss
Immutable objectsseat: facts not state, —, none needed (digest), spread, —S3 CRR; OCI layersmanifest-before-replica; source-GC races target; partial copy
Erasure codingaxis-4 economics substitution, —, —, fragment placement, —Ceph EC pools; HDFS ECrepair bandwidth bill; correlated fragment loss; small-write amplification

Vocabulary #

replica  primary  follower  topology  placement  failure domain
ack  window  RPO  RTO  semi-sync  lag
sibling  vector clock  LWW  CRDT  commutative  convergent
anti-entropy  Merkle tree  read repair  hinted handoff  resurrection
promotion  eligibility  failover  failback  split brain  fence
snapshot  delta  consistency point  catch-up
digest  content-addressed  manifest
erasure code  fragment  parity  repair bandwidth
session guarantee  read-your-writes  monotonic reads

Deep Lesson #

Replication bugs come from confusing pairs on different axes:

copy                vs  commit                 (axis 2: the ack certifies a rung — log.md/quorum.md)
replication         vs  consensus              (the treaty: topology vs proof)
replica exists      vs  replica is current     (axis 5: bottom rung of the cache ladder)
failover            vs  no data loss           (bottleneck*: the trade was always there)
read replica        vs  fresh read             (axis 5: a replica is a cache)
multi-region        vs  active-active correctness (axis 4 doesn't choose axis 1)
eventual repair     vs  user-visible consistency (axis 3 converges replicas, not sessions)
tombstone           vs  permanent deletion     (axis 3 + GC: resurrection in space)

Design procedure: choose the write topology per key, name the ack window and its RPO, pick convergence by order, version, or algebra, enumerate the failure domains the placement actually survives, assign every read path a rung — and settle the promotion trade in daylight, because the incident will not wait for the meeting. The named types are recognition shortcuts, not the design space.