Checkpoint / Replay #
checkpoint = compact record of progress/state
replay = reapply history after the checkpoint
The block’s founding claim:
I do not need to remember everything as current state,
as long as I can recover current state
from a checkpoint plus surviving history.
Role in the catalog: the recovery-side elaboration of state_machine.md axis 2 (persistence & reconstruction). That axis says whether a machine can be rebuilt; this file owns how, at what cost, bound by what coordinate.
Central tension (this is axis 4’s tradeoff):
checkpoint cost (pay now, continuously)
vs
replay cost (pay later, at recovery, against an RTO budget)
The recurring composite — five of the classic “types” are one recipe on different substrates:
authoritative history + state cache + binding coordinate
Postgres: WAL + pages + LSN
Kafka Streams: changelog + state store + applied offset
event sourcing: event log + snapshot + sequence number
Flink: source stream + operator state + barrier ID
Raft: log + snapshot + applied index
Recognizing this composite is the block’s main compression.
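The composite can be sketched in a few lines. This is a minimal illustration, not any listed system's implementation: `Snapshot`, `apply_entry`, and `recover` are hypothetical names, the "history" is an in-memory list, and the coordinate is simply the count of log entries already reflected in the state.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    state: dict   # state cache: derived from the log, rebuildable
    applied: int  # binding coordinate: number of log entries reflected in `state`


def apply_entry(state, entry):
    """Apply one history entry (key, delta) to the state in place."""
    key, delta = entry
    state[key] = state.get(key, 0) + delta


def recover(snapshot, log):
    """Restore the cached state, then replay only the history after its coordinate."""
    state = dict(snapshot.state)
    for entry in log[snapshot.applied:]:
        apply_entry(state, entry)
    return Snapshot(state, len(log))


# Authoritative history, plus a snapshot taken after the first two entries.
log = [("a", 1), ("b", 2), ("a", 3), ("b", -1)]
snapshot = Snapshot(state={"a": 1, "b": 2}, applied=2)

recovered = recover(snapshot, log)
print(recovered)  # Snapshot(state={'a': 4, 'b': 1}, applied=4)
```

Replaying from an empty snapshot at coordinate 0 yields the same state, which is what makes the snapshot an optimization rather than the truth in the history-authoritative case.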
The Checkpoint Vector Space #
Traverse this decision tree to map your recovery requirements to the correct system architecture:
1. Authority: what is authoritative?
- History is truth -> 2. Captured content: what is stored?
  - Progress as history -> Workflow Replay
    - Study: Temporal history
    - Safety: Determinism mandatory
  - Both bound -> 3. Scope
    - Single machine -> WAL + Checkpoint
      - Study: Postgres WAL / LSN
      - Safety: Idempotency / redo
    - Distributed -> Snapshot + Changelog
      - Study: Kafka Streams state store
      - Safety: Double-apply guards
  - Events only -> Event Sourcing
    - Study: Domain Aggregate Log
    - Safety: Projection idempotency
- State is truth -> 2. Captured content: what is stored?
  - Progress only -> Progress Checkpoint
    - Study: Kafka Committed Offsets / CDC
    - Safety: At-least-once / dedupe
  - State + offsets -> 3. Scope
    - Distributed cut -> Distributed Snapshot
      - Study: Flink / Chandy-Lamport
      - Safety: Transactional sinks
    - Single machine -> Incremental Checkpoint
      - Study: RocksDB incrementals
      - Safety: Chain-restore validation
  - State only, digest-keyed -> Artifact Memoization
    - Study: Bazel Action Cache
    - Safety: Hermeticity check
Design Axes (the core module) #
Axis 1 — What Is Authoritative (the structural cleave) #
history-authoritative: the log is truth; checkpoints are performance
optimizations, deletable and rebuildable
(WAL, event sourcing, Temporal, Raft)
state-authoritative: the checkpoint IS truth; history is discarded
(ML model checkpoints, backups, CI artifacts)
Consequences:
history-auth: rebuild any point; survive checkpoint corruption;
pay retention + replay machinery
state-auth: cheap and simple; a bad/corrupt checkpoint is unrecoverable;
"replay" does not exist — only restore
This is queue.md axis 1 (remove-on-ack vs retained-log) asked from the recovery side.
Interrogation:
What is authoritative: checkpoint, log, or some external system?
If the latest checkpoint is corrupt, what happens?
Can old checkpoints be deleted? Can old history? Who decides, on what signal?
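The "survive checkpoint corruption" consequence can be made concrete with a small sketch. It assumes a history-authoritative design in which the full log is still available; `seal`, `is_intact`, and `recover_with_fallback` are hypothetical helpers, and a SHA-256 over the serialized body stands in for whatever integrity check a real system uses.

```python
import hashlib
import json


def seal(state, applied):
    """Serialize a checkpoint together with a checksum of its body."""
    body = json.dumps({"state": state, "applied": applied}, sort_keys=True)
    return {"body": body, "sha256": hashlib.sha256(body.encode()).hexdigest()}


def is_intact(checkpoint):
    digest = hashlib.sha256(checkpoint["body"].encode()).hexdigest()
    return digest == checkpoint["sha256"]


def recover_with_fallback(checkpoints, log):
    """checkpoints: newest first. log: the full authoritative history from entry 0."""
    for checkpoint in checkpoints:
        if is_intact(checkpoint):
            doc = json.loads(checkpoint["body"])
            state, start = doc["state"], doc["applied"]
            break
    else:
        # No usable checkpoint: history-authoritative designs can still
        # rebuild from scratch, at full replay cost.
        state, start = {}, 0
    for key, delta in log[start:]:
        state[key] = state.get(key, 0) + delta
    return state
```

A state-authoritative design has no equivalent of the `else` branch: with no log to replay, a corrupt latest checkpoint leaves only older checkpoints, or nothing.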
Axis 2 — What Is Captured #
progress only: input coordinate; state lives elsewhere or nowhere
(Kafka committed offset, CDC cursor, crawler frontier)
state only: computation state; input identity implicit or absent
(model weights, game save)
both, bound: state + the exact input position it reflects
(the composite above — the workhorse configuration)
Interrogation:
What exact progress coordinate is recorded? (offset? LSN? event ID? digest?)
If state and progress are both captured, what binds them? (see bottleneck*)
Progress-only: is the state it implies actually reconstructible elsewhere?
Axis 3 — Consistency Scope of the Capture #
single actor: trivial — a local write
single machine: an LSN / sequence point suffices
distributed cut: "a moment in time" does not exist and must be
manufactured — markers flow through the dataflow
(Chandy-Lamport; Flink barriers)
Distributed-cut costs:
barrier alignment -> backpressure while fast inputs wait for slow ones
slow partition delays the whole checkpoint
repeated coordinated-checkpoint failure = no recovery point advancing
in-flight messages must be in the cut or replayable — never lost between
Interrogation:
Across how many actors must this capture be consistent?
Is the cut consistent (no message received-but-not-sent within it)?
What does the slowest participant cost everyone else?
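Barrier alignment is easier to see in a toy operator. The sketch below is a simplified illustration of the aligned-barrier idea, not Flink's implementation: `SummingOperator` is hypothetical, it assumes at most one checkpoint in flight, and it keeps everything in memory.

```python
BARRIER = object()  # marker injected into every input channel by a coordinator


class SummingOperator:
    """Sums values from several input channels and snapshots at aligned barriers."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.total = 0
        self.blocked = set()   # channels whose barrier has already arrived
        self.buffered = []     # post-barrier values held back during alignment
        self.snapshots = []

    def on_input(self, channel, item):
        if item is BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:
                # All barriers seen: state reflects exactly the pre-barrier input.
                self.snapshots.append(self.total)
                pending, self.buffered = self.buffered, []
                self.blocked.clear()
                for value in pending:
                    self.total += value
        elif channel in self.blocked:
            # Fast channel waits for the slow one: this is the backpressure cost.
            self.buffered.append(item)
        else:
            self.total += item


op = SummingOperator(["fast", "slow"])
op.on_input("fast", 1)
op.on_input("fast", BARRIER)
op.on_input("fast", 10)       # arrives after the barrier, so it is buffered
op.on_input("slow", 2)
op.on_input("slow", BARRIER)  # alignment completes here
print(op.snapshots, op.total)  # [3] 13
```

The value `10` is excluded from the snapshot even though it arrived before the slow channel's `2`, which is the sense in which the cut is manufactured rather than observed.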
Axis 4 — Checkpoint Economics #
full: simple restore; expensive to take
incremental: deltas over a base — cheap to take, restore walks a chain
(RocksDB incrementals, copy-on-write, block-level backup)
frequency: checkpoint interval bounds the replay volume, which must fit the recovery-time budget (RTO)
Incremental hazards (all reachability problems):
GC deletes a base still referenced by a live delta chain
delta chain too long -> restore latency explodes
manifest inconsistency -> fragments exist, checkpoint doesn't
Interrogation:
How much replay is acceptable? (RTO decides frequency, not aesthetics)
Full or incremental? Who garbage-collects, using what reachability model?
Does taking the checkpoint stall processing? (large sync snapshot = pause)
Is restore TESTED? (an unrestored backup is a hope, not a checkpoint)
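The incremental hazards can be framed as a reachability computation. The sketch below assumes a simple model in which each checkpoint manifest names its own files and an optional base checkpoint; `restore_chain` and `collectable` are hypothetical, and real systems (for example shared-file reference counting) may track liveness differently.

```python
def restore_chain(checkpoint_id, manifests):
    """Files needed to restore a checkpoint: its own plus everything down the base chain."""
    needed = set()
    current = checkpoint_id
    while current is not None:
        manifest = manifests[current]
        needed |= manifest["files"]
        current = manifest["base"]
    return needed


def collectable(all_files, retained, manifests):
    """Files that no retained checkpoint can reach, and so are safe to delete."""
    live = set()
    for checkpoint_id in retained:
        live |= restore_chain(checkpoint_id, manifests)
    return all_files - live


manifests = {
    "cp1": {"base": None, "files": {"f1", "f2"}},   # full checkpoint
    "cp2": {"base": "cp1", "files": {"f3"}},        # delta over cp1
    "cp3": {"base": "cp2", "files": {"f4"}},        # delta over cp2
    "cp4": {"base": None, "files": {"f5"}},         # new full checkpoint
}
all_files = {"f0", "f1", "f2", "f3", "f4", "f5"}   # f0 is an orphan fragment

# cp1 and cp2 are no longer retained by name, but cp3 still reaches their files.
print(sorted(collectable(all_files, ["cp3", "cp4"], manifests)))  # ['f0']
```

A collector that deleted files by "checkpoint no longer retained" instead of by reachability would remove `f1` to `f3` here and silently break `cp3`.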
Axis 5 — Replay Safety #
determinism: does replaying the same history rebuild the same state?
(wall clock, randomness, map order, external reads = violations;
Temporal's whole discipline; "usually same result" is not determinism)
idempotency: internal reapplication guarded by coordinate comparison
("LSN already applied -> skip")
external effects: the replayed path fired side effects the first time —
the commit point* (owned by queue.md), appearing here as
"offset commit vs successful side effect"
recipes: side-effect markers in history (Temporal),
transactional sinks (Flink two-phase commit),
idempotent effects + dedupe keys
Interrogation:
Is replay deterministic — provably, not habitually?
What guards double-application of a history entry?
Which external effects sit on the replay path, and what makes each safe?
Can old checkpoints/history be read after a code or schema change?
(schema evolution: bad event formats live in history forever)
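Two of the guards above, coordinate comparison for internal state and dedupe keys for external effects, can be sketched together. `Account` and `EmailSink` are hypothetical; in particular the sink's set of seen keys is held in memory here, whereas a real sink would need it to be durable.

```python
class Account:
    """Internal state guarded by coordinate comparison."""

    def __init__(self):
        self.balance = 0
        self.applied_lsn = 0  # coordinate of the last entry reflected in balance

    def apply(self, lsn, delta):
        if lsn <= self.applied_lsn:
            return False  # already applied: skip
        self.balance += delta
        self.applied_lsn = lsn
        return True


class EmailSink:
    """External effect made safe to repeat by a dedupe key."""

    def __init__(self):
        self.sent = []
        self.seen_keys = set()  # assumption: durable in a real system

    def send(self, dedupe_key, message):
        if dedupe_key in self.seen_keys:
            return
        self.seen_keys.add(dedupe_key)
        self.sent.append(message)


def process(history, account, sink):
    for lsn, delta in history:
        account.apply(lsn, delta)
        # Always attempt the effect: a crash may have hit between apply and send.
        sink.send(f"receipt-{lsn}", f"applied {delta}")


history = [(1, 100), (2, -30), (3, 5)]
account, sink = Account(), EmailSink()
process(history, account, sink)
process(history, account, sink)  # replay after a lost offset commit
print(account.balance, len(sink.sent))  # 75 3
```

The replay is harmless only because both layers have their own guard; the LSN check alone would not stop the second round of receipts.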
Off-Axis Seat: Digest-Keyed Memoization #
Artifact/build checkpointing (Bazel action cache, CI caches, compiler incremental state, resumable training) is not on the temporal axis at all:
coordinate = content digest of inputs, not a position in time
resume = cache hit
determinism = called hermeticity here
Signature failures are key failures: non-hermetic input omitted from the digest, environment not in the key, partial artifact reused. Same block, different geometry — checkpointing as memoization.
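A digest-keyed cache can be sketched as follows. This is an illustration of the idea only, not Bazel's actual key derivation: `action_key` and `ActionCache` are hypothetical, and the key is assumed to cover the command, the content of every input, and the environment.

```python
import hashlib
import json


def action_key(command, inputs, env):
    """Content digest over everything the action is allowed to depend on."""
    doc = {
        "command": command,
        "inputs": {path: hashlib.sha256(data).hexdigest() for path, data in inputs.items()},
        "env": env,  # omitting this is the "environment not in the key" failure
    }
    return hashlib.sha256(json.dumps(doc, sort_keys=True).encode()).hexdigest()


class ActionCache:
    def __init__(self):
        self.entries = {}
        self.executions = 0

    def run(self, command, inputs, env, action):
        key = action_key(command, inputs, env)
        if key in self.entries:
            return self.entries[key]      # resume = cache hit
        self.executions += 1
        output = action(inputs, env)      # if this raises, nothing is stored
        self.entries[key] = output        # publish only complete artifacts
        return output
```

Usage: calling `run` twice with identical command, inputs, and environment executes the action once; changing one byte of an input or one environment variable produces a different key and a fresh execution.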
Technical Bottleneck: The Binding Coordinate* #
captured state is only meaningful if tied to the exact point
in history it represents — and to the ownership epoch that took it.
Essential, no general solution. Count the failure modes that are this one problem:
checkpoint inconsistent with input offsets
snapshot/changelog mismatch
partial checkpoint published
checkpoint of a stale assignment
watch too old -> relist
status belongs to old generation
ACK applies to wrong nonce
Known recipes (the block’s crown jewels):
LSN / sequence: a total order to bind against
barrier / consistent cut: manufacture a coordinate where no global clock exists
atomic manifest publish: state + coordinate become visible together or not at all
generation-stamped checkpoint: bind to ownership epoch; a zombie's checkpoint
is rejectable (the fencing seam, again)
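Two of these recipes, atomic manifest publish and generation stamping, can be combined in a short sketch. It assumes a local filesystem where `os.replace` renames atomically within one directory; `publish`, `load`, `StaleGeneration`, and the `MANIFEST` file name are hypothetical. The generation check is read-then-write and therefore not race-free on its own: a real store would need a compare-and-set or a lock, and a durable rename would also need a directory fsync.

```python
import json
import os
import tempfile


class StaleGeneration(Exception):
    """Raised when a writer from an older ownership epoch tries to publish."""


def load(directory):
    path = os.path.join(directory, "MANIFEST")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def publish(directory, state, coordinate, generation):
    current = load(directory)
    if current is not None and generation < current["generation"]:
        raise StaleGeneration(f"generation {generation} < {current['generation']}")
    # Write state and coordinate into one temp file, then rename it into place:
    # readers see the old manifest or the new one, never a partial mix.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"state": state, "coordinate": coordinate, "generation": generation}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, os.path.join(directory, "MANIFEST"))


with tempfile.TemporaryDirectory() as directory:
    publish(directory, {"count": 7}, coordinate=120, generation=2)
    try:
        publish(directory, {"count": 5}, coordinate=90, generation=1)  # zombie owner
        rejected = False
    except StaleGeneration:
        rejected = True
    manifest = load(directory)

print(rejected, manifest["coordinate"], manifest["generation"])  # True 120 2
```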
Corollary — retention is the binding failing in time:
the checkpoint is only as good as the history that survives after it.
"WAL truncated too early" and "changelog retention too short"
are binding failures along the time dimension.
retention floor = oldest checkpoint anyone might restore from.
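The retention floor reduces to a minimum over coordinates. In this sketch a checkpoint coordinate `c` means "state reflects log entries below `c`", so restoring it needs every entry from `c` onward; `retention_floor`, `truncate`, and `can_restore` are hypothetical names.

```python
def retention_floor(checkpoints, log_start):
    """Lowest coordinate that must stay replayable.

    checkpoints maps a name to its coordinate. With no checkpoints at all,
    the whole surviving log is needed.
    """
    return min(checkpoints.values(), default=log_start)


def truncate(log, log_start, checkpoints):
    """Drop only history below the floor; return the kept log and its new start."""
    keep_from = max(retention_floor(checkpoints, log_start), log_start)
    return log[keep_from - log_start:], keep_from


def can_restore(coordinate, log_start):
    """A checkpoint is usable only if history from its coordinate still exists."""
    return coordinate >= log_start


log = list(range(100))  # entry i has coordinate i
checkpoints = {"daily": 40, "hourly": 90, "latest": 97}

kept, start = truncate(log, 0, checkpoints)
print(start, len(kept))      # 40 60
print(can_restore(40, 90))   # False
```

The last line is the "WAL truncated too early" case: truncating to 90 because the hourly checkpoint exists silently invalidates the daily one.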
A strong design says explicitly:
what state is captured,
what history remains after it,
what coordinate ties them together (and to which owner's epoch),
and which effects are safe to replay.
Recovery Protocol (the crossing-point spec — keep) #
General:
record progress coordinate
capture state / publish snapshot atomically with its coordinate
continue processing
on failure: choose a checkpoint
restore state
replay surviving history after the coordinate
dedupe / resolve repeated effects
publish recovered progress
Database instantiation:
mutation -> WAL (before data page, always)
apply to memory/pages
checkpoint dirty state, record recovery LSN
crash -> load checkpoint -> redo WAL past LSN -> undo incomplete txns
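The crash path of the database instantiation can be sketched as a redo pass followed by an undo pass. This is a deliberately simplified, ARIES-flavored illustration rather than any particular engine's recovery code (engines differ on whether an undo phase exists at all): `recover_pages` is hypothetical, pages are a plain dict, and each update record carries both old and new values.

```python
def recover_pages(checkpoint_pages, recovery_lsn, wal):
    """Load the checkpoint, redo WAL past its LSN, then undo uncommitted work."""
    pages = dict(checkpoint_pages)
    committed = {r["txn"] for r in wal if r["kind"] == "commit"}

    # Redo: repeat history past the recovery LSN, committed or not.
    for r in wal:
        if r["kind"] == "update" and r["lsn"] > recovery_lsn:
            pages[r["key"]] = r["new"]

    # Undo: roll back transactions that never committed, newest record first.
    # This includes their pre-checkpoint updates, which the checkpoint captured.
    for r in reversed(wal):
        if r["kind"] == "update" and r["txn"] not in committed:
            pages[r["key"]] = r["old"]
    return pages


wal = [
    {"lsn": 1, "txn": "T1", "kind": "update", "key": "x", "old": 0, "new": 1},
    {"lsn": 2, "txn": "T2", "kind": "update", "key": "y", "old": 0, "new": 5},
    {"lsn": 3, "txn": "T1", "kind": "commit"},
    {"lsn": 4, "txn": "T2", "kind": "update", "key": "x", "old": 1, "new": 7},
    {"lsn": 5, "txn": "T3", "kind": "update", "key": "z", "old": 0, "new": 9},
    {"lsn": 6, "txn": "T3", "kind": "commit"},
]
# Checkpoint taken at LSN 2: it contains T2's uncommitted write to y.
checkpoint_pages = {"x": 1, "y": 5}

print(recover_pages(checkpoint_pages, 2, wal))  # {'x': 1, 'y': 0, 'z': 9}
```

T2 never committed, so both of its writes are rolled back even though one of them was already inside the checkpoint.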
Stream-processor instantiation:
coordinator injects barrier
operators snapshot state at barrier, record source offsets
manifest persisted atomically
failure -> restore state, rewind sources to checkpoint offsets
Workflow instantiation:
append event -> worker replays full history deterministically
-> rebuilds state -> schedules next command
-> result appended as new event
(compaction = snapshot + continue-as-new)
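The workflow loop depends on replay returning recorded results instead of re-running effects. The sketch below illustrates that idea with hypothetical names (`Context`, `side_effect`, `charge_card`); it is not Temporal's SDK API, and the history is an in-memory list of `(name, result)` markers.

```python
import random


class Context:
    """Replays recorded side-effect results; executes and records only new ones."""

    def __init__(self, history):
        self.history = history  # append-only list of (name, result) markers
        self.cursor = 0

    def side_effect(self, name, fn):
        if self.cursor < len(self.history):
            recorded_name, result = self.history[self.cursor]
            if recorded_name != name:
                # Code no longer matches history: replay is not deterministic.
                raise RuntimeError(f"expected {recorded_name!r}, got {name!r}")
        else:
            result = fn()
            self.history.append((name, result))
        self.cursor += 1
        return result


charges = []  # stands in for an external payment system


def charge_card():
    charges.append("charged")
    return f"receipt-{len(charges)}"


def workflow(ctx):
    token = ctx.side_effect("token", lambda: random.randint(0, 10**9))
    receipt = ctx.side_effect("charge", charge_card)
    return f"{token}:{receipt}"


history = []
first = workflow(Context(history))    # original execution
second = workflow(Context(history))   # replay after a worker restart
print(first == second, charges)       # True ['charged']
```

Both the random token and the charge come back from history on replay, so state is rebuilt identically and the card is charged once.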
Named Configurations (lookup table) #
Vector = {authority, captured, scope, economics, replay safety}.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Progress checkpoint | external/state elsewhere, progress only, per-partition, trivial, at-least-once + idempotency | Kafka committed offset; CDC cursor | commit-before/after-effect; stale-assignment commit; retention outruns cursor |
| State checkpoint | state-auth or bound, state (+offsets), single or distributed, full/incremental, restore-compat | Flink operator state; model checkpoint | state/offset mismatch; format breaks on deploy; partial publish |
| WAL + checkpoint | history-auth, both bound by LSN, single machine, periodic, idempotent redo | Postgres WAL; RocksDB | page-before-WAL; early truncation; lying checkpoint metadata |
| Snapshot + changelog | history-auth, both bound by offset, per-store, compacted, double-apply guards | Kafka Streams state store | offset mismatch; short changelog retention; missing tombstone |
| Workflow replay | history-auth, progress-as-history, single logical actor, continue-as-new compaction, determinism mandatory | Temporal history + replay | non-deterministic code; replayed side effect; unbounded history; version mismatch |
| Event sourcing | history-auth, events + snapshot, per-aggregate, snapshot cadence, projection idempotency | domain aggregate log | bad schema immortal in history; replay semantics drift; correction pain |
| Distributed snapshot | source-auth, state+offsets bound by barrier, distributed cut, coordinated, transactional sinks | Chandy-Lamport; Flink barriers | inconsistent cut; alignment backpressure; slow partition stalls all |
| Incremental checkpoint | as parent, deltas+manifest, —, incremental, chain-restore | RocksDB incrementals in Flink | GC eats shared base; long chains; manifest inconsistency |
| Artifact/memoization | state-auth, digest-keyed, per-action, content-addressed, hermeticity | Bazel action cache | non-hermetic key; environment omitted; partial artifact reuse |
| Control-plane checkpoint | history-auth (watch stream), progress coordinate, per-resource, —, level-triggered tolerance | resourceVersion/relist; observedGeneration; xDS nonce | too-old watch; status of old generation; ACK/nonce mismatch (see xDS notes) |
Vocabulary #
checkpoint, snapshot, restore, replay, redo, undo
offset, cursor, LSN, revision, sequence, applied index, commit index
generation, epoch, nonce
barrier, consistent cut, alignment, in-flight message
manifest, delta, base, reference count, reachability
changelog, compaction, tombstone, retention
determinism, hermeticity, idempotency, digest
continue-as-new, relist, observedGeneration
Deep Lesson #
Checkpoint/replay bugs come from confusing pairs on different axes:
liveness heartbeat vs recoverable progress (axis 2: alive ≠ resumable)
offset commit vs successful side effect (axis 5: commit point* — queue.md)
snapshot vs authoritative truth (axis 1: cache vs source)
checkpoint vs transaction commit (recovery point ≠ atomicity point)
replay vs safe re-execution (axis 5: history reapply ≠ effect reapply)
determinism vs "usually same result" (axis 5: habit is not a proof)
retention vs recoverability (bottleneck* corollary: history must outlive its checkpoints)
Design procedure: decide what is authoritative, capture state with its coordinate atomically, scope the cut, budget checkpoint frequency against RTO, then prove replay safe — and test restore before you need it. The named types are recognition shortcuts, not the design space.