Garbage Collection / Compaction #
The block’s own memory trick, promoted from appendix to thesis — it IS the top-level cleave:
GC asks: what PROVES data is no longer needed?
compaction asks: what PRESERVES MEANING while changing representation?
GC removes. Compaction rewrites.
Two operations, two axes, one shared spine and one shared bottleneck.
Role in the catalog: tier-two protocol block — the maintenance protocol every storage component conjugates. Structurally, checkpoint_replay.md’s dark twin: recovery preserves history, this block destroys it, and the pin registry (below) is the treaty line between them.
Central tension:
space / cost / read efficiency vs safety / freshness / write amplification
And the block’s distinguishing hazard, stated up front:
deletion is the catalog's only irreversible operation.
every other block's stale-claim mistake can be retried, compensated,
fenced, or reconciled. GC's mistake is permanent.
this is WHY everything here is conservative and two-phase.
Part I — GC: The Proof of Death (axis) #
What proves the candidate is garbage? Ordered roughly by strength:
age/policy: proof by decree — retention windows, TTL
(weakest rung: the clock knows nothing about need;
Kafka retention, Prometheus retention, backup expiry)
visibility: no active or promised read view needs it
(MVCC vacuum, snapshot expiration, oldest-active-transaction;
Iceberg expire-snapshots)
convergence: every replica has observed the deletion
(tombstone grace — Cassandra gc_grace_seconds)
reachability: no live root references it
(Git GC, Iceberg orphan files, manifest reachability;
strongest, and only as strong as the completeness
of the root set and reference graph)
Not on this axis — recomputability is not a death proof:
cache eviction removes data that is NOT garbage, merely reloadable.
owned by cache.md (eviction ≠ invalidation there; eviction ≠ GC here).
boundary confirmed: eviction needs a recompute proof; GC needs a death proof.
Interrogation:
Which proof is claimed — decree, visibility, convergence, or reachability?
Who are the roots, and is the root set COMPLETE? (hidden references:
external consumers, backups, manual pins, cross-system links)
Can a reader you don't know about exist? (then visibility is unprovable —
fall back to decree + generous grace)
What replay/recovery/audit promise floors the retention window?
(the checkpoint_replay.md treaty: retention floor = oldest checkpoint
anyone might restore from)
Part II — Compaction: The Equivalence Ladder (axis) #
What equivalence does the rewrite preserve? A weakening ladder:
exact visible-set: new segments ≡ old visible segments, bit-for-answer
(LSM SSTable merge, Lucene segment merge)
keyed-latest: latest state per key reconstructible; history discarded
(Kafka compacted topics, changelog compaction)
query-sufficient: coarser data, DECLARED resolution + aggregation semantics
(Prometheus/Thanos downsampling, rollups)
semantic summary: many facts → one summary, sufficient only for the
invariants it CLAIMS to serve
(balance replacing ledger events, event-sourced snapshot —
the rebuild side is checkpoint_replay.md; this block
owns the safety of discarding what the snapshot subsumes)
The ladder makes the deep-lesson confusion checkable:
"compacted = semantically identical" — identical UNDER WHICH EQUIVALENCE?
name the rung, and the invariant becomes testable.
Interrogation:
Which rung? Is the preserved equivalence written down, or assumed?
Lossy rungs: are resolution/aggregation semantics carried WITH the data?
Non-associative aggregation? (then rollup-of-rollups lies)
Who might later need what this rung discards? (audit, debugging, replay)
Can queries mix rungs without knowing? (mixed-resolution double-count)
Shared Spine 1 — The Pin Registry #
Everything that keeps a candidate alive, adjudicated in one place:
reader pins active snapshots, open transactions, time-travel promises
replica watermarks slowest replica's observed position (convergence proof input)
consumer lag compact only below the low-water-mark of all consumers
recovery checkpoints the checkpoint_replay.md treaty line
legal/compliance holds decree in the other direction — MUST retain
Signature failures are registry failures:
long-running transaction blocks vacuum forever (a pin with no expiry)
compact before consumers catch up (lag not registered)
delete data needed for replay (treaty violated)
break time travel / active reader (pin ignored)
Interrogation:
Is every liveness claim REGISTERED, or does some reader rely on convention?
Do pins expire, and what happens to a pinner who outlives the grace?
What is the oldest-needed-version computation, and what feeds it?
Shared Spine 2 — The Publication Protocol #
The safe ordering, identical for GC and compaction:
discover candidates
prove safety (Part I proof, or Part II equivalence + pin check)
write replacement state (if compacting)
publish new metadata ATOMICALLY <- checkpoint_replay.md's
binding coordinate* recipe:
manifest update, snapshot commit —
state and coordinate visible
together or not at all
wait for old readers/replicas to release
physically delete obsolete state <- the irreversible step, LAST
record progress; retry idempotently on crash
Ordering violations are the classic bugs:
publish output without all inputs merged delete input before output visible
crash between publish and delete (must be idempotent — re-run cleans up,
never re-deletes the wrong generation)
Technical Bottleneck: Proving Death Against a Stale View of Life* #
the reference graph you observed is a recorded claim about a moving world —
readers you don't know about, replicas that lag,
references published concurrently with your mark phase.
staleness-family enemy, with an amplifier no other block has:
the mistake is IRREVERSIBLE.
Known recipes (all conservative, all two-phase in spirit):
grace periods time as a proxy for a convergence proof you cannot obtain
pins force readers to declare themselves or lose protection
watermarks delete only below the minimum of all observed positions
safety windows orphan deletion delayed past any plausible in-flight publish
epoch/generation a concurrent publisher stamps a newer generation;
the collector's candidate list is fenced to its own view
Flagship: the tombstone — deletion’s own two-phase commit.
phase 1: replicate the FACT of deletion as data (the marker)
phase 2: remove the marker, only after convergence is proven
resurrection is exactly the 2PC violation:
remove the marker before every replica observed it, and anti-entropy
helpfully restores the value from a replica that never saw the delete.
gc_grace_seconds is the timeout on phase 2 —
and a replica down longer than the grace is a correctness event, not an ops event.
A strong design says explicitly:
what roots and pins keep data alive,
which proof of death is claimed and what makes it sound,
which equivalence compaction preserves (named rung, testable invariant),
the publication order (atomic publish first, physical delete last),
and how cleanup retries idempotently after a crash.
Imports (owned elsewhere — referenced, not restated) #
amplification triangle capacity.md axis 5 — compaction is where you PAY
read/write/space amplification, by merge-policy choice
compaction backlog backpressure.md — background work racing foreground;
a goodput problem (falling behind = read-amp debt compounding)
merge policy a scheduler (scheduler.md axis 2: which segments, when,
objective = amplification tradeoff)
compacted-revision watch checkpoint_replay.md control-plane row — "watcher asks
for compacted revision -> relist"; the invariant
"clients must be TOLD history is gone" is that row's contract
eviction cache.md — recompute proof, not death proof
event-sourced snapshot checkpoint_replay.md owns rebuild; this block owns
discard safety
Named Configurations (lookup table) #
Vector = {operation, proof or rung, pin sources, canonical study object}.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Reachability GC | GC, reachability, refs+manifests | Git GC / pack pruning; Iceberg orphan cleanup | hidden reference; race with concurrent publish; incomplete graph |
| Retention GC | GC, decree, recovery treaty + legal holds | Kafka log retention | deletes replay-needed data; clock skew; compliance breach |
| Tombstone GC | GC, convergence, replica watermarks | Cassandra gc_grace_seconds | resurrection (phase-2 too early); replica down past grace; tombstone buildup |
| Snapshot expiration | GC, visibility, reader pins + oldest-active-txn | Iceberg expire-snapshots; MVCC vacuum | broken time travel; eternal pin blocks vacuum; version needed by backup |
| Log compaction | compaction, keyed-latest, consumer-lag watermark | Kafka compacted topics | needed history dropped; compact past a lagging consumer; tombstone removed early |
| Segment merge | compaction, exact visible-set, snapshot pins | RocksDB/LevelDB LSM | dropped live version; duplicate row; publish-order violation; backlog |
| Index merge | compaction, exact visible-set, open searchers | Lucene segment merge | doc-ID remap error; deleted doc returns; merge IO starves serving |
| Downsampling | compaction, query-sufficient, retention tiers | Prometheus/Thanos blocks | lost needed resolution; double-count on overlap; mixed-resolution queries |
| Metadata compaction | compaction, keyed-latest-ish, watcher revisions | etcd compaction; Raft snapshot | watcher past compaction -> relist; snapshot inconsistent with log tail |
| Semantic compaction | compaction, semantic summary, audit requirements | event-sourced snapshot; aggregates | summary insufficient later; semantics drift; auditability lost; non-associative rollup |
One-Line Study Prompts (kept — the best study aid in the source doc) #
Git GC: what are the roots, and when is an object unreachable?
Kafka retention: what replay guarantee dies with a deleted segment?
Cassandra tombstones: why does deleting the delete marker resurrect data?
Kafka compaction: how does latest-per-key preserve state reconstruction?
RocksDB compaction: how does merge policy trade write amp for read/space?
Lucene merge: how do immutable segments handle deletes and ID remap?
Prometheus blocks: how do blocks merge without double-count or lost resolution?
Iceberg expiration: how do snapshots and manifests keep data files alive?
etcd compaction: what happens when a watcher asks for a compacted revision?
Event snapshot: what facts are lost when events become summary state?
Highest-yield five: Kafka retention+compaction, RocksDB LSM, Lucene merge, Iceberg expiration, etcd/Raft snapshotting.
Vocabulary #
root reference reachability orphan pin manifest
retention TTL grace period legal hold watermark oldest-needed-version
tombstone delete marker resurrection anti-entropy convergence
merge rewrite vacuum downsample rollup resolution
snapshot read view time travel oldest active transaction
publish obsolete safety window idempotent cleanup
read/write/space amplification (→ capacity.md) merge policy backlog
Deep Lesson #
GC/compaction bugs come from confusing pairs across the two parts:
old vs unreachable (Part I: decree is not proof)
deleted vs safe to forget (tombstone: fact first, marker later)
compacted vs semantically identical (Part II: identical under WHICH rung?)
cache eviction vs durable deletion (recompute proof vs death proof)
retention vs recovery guarantee (the checkpoint_replay treaty)
snapshot expiration vs file deletion (metadata death ≠ physical death;
publication order stands between them)
tombstone removal vs value deletion (phase 2 vs phase 1)
Design procedure: register every pin, choose the proof of death and defend its soundness, name the equivalence rung and test its invariant, publish atomically, delete last, retry idempotently — and remember the mistake here is the only one in the catalog you cannot take back.