Checkpoint / Replay #

checkpoint = compact record of progress/state
replay     = reapply history after the checkpoint

The block’s founding claim:

I do not need to remember everything as current state,
as long as I can recover current state
from a checkpoint plus surviving history.

Role in the catalog: the recovery-side elaboration of state_machine.md axis 2 (persistence & reconstruction). That axis says whether a machine can be rebuilt; this file covers how it is rebuilt, at what cost, and bound to which coordinate.

Central tension (this is axis 4’s tradeoff):

checkpoint cost (pay now, continuously)
        vs
replay cost (pay later, at recovery, against an RTO budget)

The recurring composite — five of the classic “types” are one recipe on different substrates:

authoritative history + state cache + binding coordinate
  Postgres:      WAL        + pages     + LSN
  Kafka Streams: changelog  + state store + applied offset
  event sourcing: event log + snapshot  + sequence number
  Flink:         source stream + operator state + barrier ID
  Raft:          log        + snapshot  + applied index

Recognizing this composite is the block’s main compression.
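The composite can be sketched in a few lines. This is a minimal illustration, not the implementation of any system listed above: `Snapshot`, `apply_entry`, and `recover_state` are hypothetical names, the log is an in-memory list, and the binding coordinate is assumed to be the list index of the last entry the snapshot reflects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    state: dict       # the state cache
    coordinate: int   # index of the last log entry already folded into `state`


def apply_entry(state, entry):
    """Fold one history entry (key, delta) into the state."""
    key, delta = entry
    state[key] = state.get(key, 0) + delta


def recover_state(snapshot: Optional[Snapshot], log):
    """Rebuild current state from a snapshot plus the history after it."""
    state = dict(snapshot.state) if snapshot else {}
    start = snapshot.coordinate + 1 if snapshot else 0
    for entry in log[start:]:
        apply_entry(state, entry)
    return state


log = [("a", 1), ("b", 2), ("a", 3), ("b", -1)]
snap = Snapshot(state={"a": 1, "b": 2}, coordinate=1)

from_snapshot = recover_state(snap, log)   # replays entries 2 and 3 only
from_scratch = recover_state(None, log)    # replays the whole log
```

Both paths arrive at the same state; the snapshot only shortens the replay. An off-by-one in `coordinate` would silently skip or double-apply an entry, which is why the coordinate has to travel with the state.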


The Checkpoint Vector Space #

Traverse this decision tree to map your recovery requirements to the correct system architecture:

graph TD
    Start[Start: Choose Recovery Design] --> Auth{1. Authority: What is truth?}

    %% History-Authoritative Branch
    Auth -->|History is Truth| HistCap{2. Captured Content: What is stored?}
    HistCap -->|Progress-as-History| HistScope1{3. Scope}
    HistScope1 -->|Single Actor| WF["Workflow Replay<br/>- Study: Temporal history<br/>- Safety: Determinism mandatory"]
    HistCap -->|Both Bound| HistScope2{3. Scope}
    HistScope2 -->|Single Machine| WAL["WAL + Checkpoint<br/>- Study: Postgres WAL / LSN<br/>- Safety: Idempotency / redo"]
    HistScope2 -->|Distributed| Stream["Snapshot + Changelog<br/>- Study: Kafka Streams state store<br/>- Safety: Double-apply guards"]
    HistCap -->|Events Only| EventSource["Event Sourcing<br/>- Study: Domain Aggregate Log<br/>- Safety: Projection idempotency"]

    %% State-Authoritative Branch
    Auth -->|State is Truth| StateCap{2. Captured Content: What is stored?}
    StateCap -->|Progress Only| Progress["Progress Checkpoint<br/>- Study: Kafka Committed Offsets / CDC<br/>- Safety: At-least-once / dedupe"]
    StateCap -->|State + Offsets| StateScope{3. Scope}
    StateScope -->|Distributed Cut| DistSnap["Distributed Snapshot<br/>- Study: Flink / Chandy-Lamport<br/>- Safety: Transactional sinks"]
    StateScope -->|Single Machine| IncrCheck["Incremental Checkpoint<br/>- Study: RocksDB incrementals<br/>- Safety: Chain-restore validation"]
    StateCap -->|State Only - Digest-Keyed| Cache["Artifact Memoization<br/>- Study: Bazel Action Cache<br/>- Safety: Hermeticity check"]

    %% Styling
    classDef default fill:#1e293b,stroke:#475569,stroke-width:1px,color:#f1f5f9;
    classDef leaf fill:#0f172a,stroke:#3182ce,stroke-width:2px,color:#fff;
    class WF,WAL,Stream,EventSource,Progress,DistSnap,IncrCheck,Cache leaf;

Design Axes (the core module) #

Axis 1 — What Is Authoritative (the structural cleave) #

history-authoritative:  the log is truth; checkpoints are performance
                        optimizations, deletable and rebuildable
                        (WAL, event sourcing, Temporal, Raft)
state-authoritative:    the checkpoint IS truth; history is discarded
                        (ML model checkpoints, backups, CI artifacts)

Consequences:

history-auth: rebuild any point; survive checkpoint corruption;
              pay retention + replay machinery
state-auth:   cheap and simple; a bad/corrupt checkpoint is unrecoverable;
              "replay" does not exist — only restore

This is queue.md axis 1 (remove-on-ack vs retained-log) asked from the recovery side.

Interrogation:

What is authoritative: checkpoint, log, or some external system?
If the latest checkpoint is corrupt, what happens?
Can old checkpoints be deleted? Can old history? Who decides, on what signal?
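The "corrupt checkpoint" question can be made concrete with a small sketch. It assumes a history-authoritative design where each checkpoint carries a checksum; `seal`, `open_checkpoint`, and `recover_with_fallback` are illustrative names, and log entries are assumed to be simple `(key, value)` overwrites.

```python
import hashlib
import json


def seal(state, coordinate):
    """Serialize a checkpoint and attach a checksum of the serialized body."""
    body = json.dumps({"state": state, "coordinate": coordinate}, sort_keys=True)
    return {"body": body, "sha256": hashlib.sha256(body.encode()).hexdigest()}


def open_checkpoint(ckpt):
    """Return (state, coordinate), or None if the checksum does not match."""
    if hashlib.sha256(ckpt["body"].encode()).hexdigest() != ckpt["sha256"]:
        return None
    doc = json.loads(ckpt["body"])
    return doc["state"], doc["coordinate"]


def recover_with_fallback(checkpoints, log):
    """Use the newest valid checkpoint (list is oldest-first), else replay from zero."""
    state, coordinate = {}, -1
    for ckpt in reversed(checkpoints):
        opened = open_checkpoint(ckpt)
        if opened is not None:
            state, coordinate = opened
            break
    for key, value in log[coordinate + 1:]:
        state[key] = value
    return state


log = [("x", 1), ("y", 2), ("x", 3)]
good = seal({"x": 1}, 0)
corrupt = seal({"x": 1, "y": 2}, 1)
corrupt["body"] = corrupt["body"].replace("2", "9")   # simulate bit rot

recovered = recover_with_fallback([good, corrupt], log)
```

With the log retained, a corrupt newest checkpoint costs only extra replay. Passing an empty log models the state-authoritative case: the same corruption leaves nothing to rebuild from.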

Axis 2 — What Is Captured #

progress only:   input coordinate; state lives elsewhere or nowhere
                 (Kafka committed offset, CDC cursor, crawler frontier)
state only:      computation state; input identity implicit or absent
                 (model weights, game save)
both, bound:     state + the exact input position it reflects
                 (the composite above — the workhorse configuration)

Interrogation:

What exact progress coordinate is recorded? (offset? LSN? event ID? digest?)
If state and progress are both captured, what binds them? (see bottleneck*)
Progress-only: is the state it implies actually reconstructible elsewhere?

Axis 3 — Consistency Scope of the Capture #

single actor:       trivial — a local write
single machine:     an LSN / sequence point suffices
distributed cut:    "a moment in time" does not exist and must be
                    manufactured — markers flow through the dataflow
                    (Chandy-Lamport; Flink barriers)

Distributed-cut costs:

barrier alignment -> backpressure while fast inputs wait for slow ones
slow partition delays the whole checkpoint
repeated coordinated-checkpoint failure = no recovery point advancing
in-flight messages must be in the cut or replayable — never lost between

Interrogation:

Across how many actors must this capture be consistent?
Is the cut consistent (no message is recorded as received unless its send is also inside the cut)?
What does the slowest participant cost everyone else?
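Barrier alignment and its backpressure cost can be simulated for a single operator with two inputs. This is a toy model under stated assumptions, not Flink's implementation: `AligningOperator` is a hypothetical class, only one checkpoint is in flight at a time, and the operator state is a running sum.

```python
BARRIER = object()   # sentinel marker injected into every input channel


class AligningOperator:
    def __init__(self, channels):
        self.channels = set(channels)
        self.total = 0          # operator state: running sum of records
        self.blocked = set()    # channels whose barrier has already arrived
        self.buffered = []      # post-barrier records held back during alignment
        self.max_buffered = 0   # high-water mark of held-back records
        self.snapshots = []     # state captured at each completed cut

    def receive(self, channel, item):
        if item is BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.channels:
                # Every input has delivered its barrier: the cut is consistent.
                self.snapshots.append(self.total)
                pending, self.buffered = self.buffered, []
                self.blocked.clear()
                for value in pending:
                    self.total += value
        elif channel in self.blocked:
            # This record belongs after the cut, so it must wait for slow inputs.
            self.buffered.append(item)
            self.max_buffered = max(self.max_buffered, len(self.buffered))
        else:
            self.total += item


op = AligningOperator(["fast", "slow"])
for channel, item in [
    ("fast", 1), ("fast", BARRIER), ("fast", 10), ("fast", 20),
    ("slow", 2), ("slow", BARRIER),
]:
    op.receive(channel, item)
```

The snapshot contains only pre-barrier records from both inputs (1 and 2), while the fast channel's later records sit in the buffer until the slow channel's barrier arrives. That buffer is the backpressure described above.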

Axis 4 — Checkpoint Economics #

full:         simple restore; expensive to take
incremental:  deltas over a base — cheap to take, restore walks a chain
              (RocksDB incrementals, copy-on-write, block-level backup)
frequency:    checkpoint interval bounds the replay length, which must fit
              the recovery-time budget (RTO)

Incremental hazards (all reachability problems):

GC deletes a base still referenced by a live delta chain
delta chain too long -> restore latency explodes
manifest inconsistency -> fragments exist, checkpoint doesn't

Interrogation:

How much replay is acceptable? (RTO decides frequency, not aesthetics)
Full or incremental? Who garbage-collects, using what reachability model?
Does taking the checkpoint stall processing? (large sync snapshot = pause)
Is restore TESTED? (an unrestored backup is a hope, not a checkpoint)
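The incremental hazards are reachability problems, which a small sketch makes visible. It assumes each checkpoint is a dict with a `parent` link and a `data` delta, and that deltas only add or overwrite keys (no tombstones); `restore_chain` and `collect_garbage` are hypothetical helpers.

```python
def restore_chain(store, checkpoint_id):
    """Walk parent links back to the full base, then apply deltas oldest-first."""
    chain = []
    current = checkpoint_id
    while current is not None:
        if current not in store:
            raise KeyError(f"broken chain: {current} is missing")
        chain.append(store[current])
        current = store[current]["parent"]
    state = {}
    for ckpt in reversed(chain):
        state.update(ckpt["data"])
    return state, len(chain)   # chain length is a proxy for restore latency


def collect_garbage(store, retained):
    """Delete only checkpoints unreachable from every retained checkpoint."""
    live = set()
    for cid in retained:
        while cid is not None and cid not in live:
            live.add(cid)
            cid = store[cid]["parent"]
    for cid in list(store):
        if cid not in live:
            del store[cid]
    return live


store = {
    "c0": {"parent": None, "data": {"a": 1, "b": 2}},   # full base
    "c1": {"parent": "c0", "data": {"b": 3}},
    "c2": {"parent": "c1", "data": {"c": 4}},
    "x1": {"parent": "c0", "data": {"a": 9}},           # abandoned branch
}
live = collect_garbage(store, retained=["c2"])
state, chain_length = restore_chain(store, "c2")
```

Retaining only `c2` still keeps `c0` and `c1` alive because the chain references them. A collector that deleted by age instead of reachability would remove the base and make `restore_chain` raise.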

Axis 5 — Replay Safety #

determinism:   does replaying the same history rebuild the same state?
               (wall clock, randomness, map order, external reads = violations;
               Temporal's whole discipline; "usually same result" is not determinism)
idempotency:   internal reapplication guarded by coordinate comparison
               ("LSN already applied -> skip")
external effects: the replayed path fired side effects the first time —
               the commit point* (owned by queue.md), appearing here as
               "offset commit vs successful side effect"
               recipes: side-effect markers in history (Temporal),
               transactional sinks (Flink two-phase commit),
               idempotent effects + dedupe keys

Interrogation:

Is replay deterministic — provably, not habitually?
What guards double-application of a history entry?
Which external effects sit on the replay path, and what makes each safe?
Can old checkpoints/history be read after a code or schema change?
(schema evolution: bad event formats live in history forever)
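The two guards, coordinate comparison for internal state and dedupe keys for external effects, can be sketched together. `Account`, `EffectLedger`, and `replay` are illustrative names; the sketch assumes the ledger is durable and survives the crash, which a real system would have to guarantee separately.

```python
class Account:
    def __init__(self, balance=0, applied_lsn=0):
        self.balance = balance
        self.applied_lsn = applied_lsn   # binding coordinate stored with the state

    def apply(self, lsn, amount):
        """Apply an entry once; entries at or below applied_lsn are skipped."""
        if lsn <= self.applied_lsn:
            return False
        self.balance += amount
        self.applied_lsn = lsn
        return True


class EffectLedger:
    """Stand-in for a durable dedupe store keyed by an idempotency key."""

    def __init__(self, already_sent=()):
        self.sent = set(already_sent)

    def send_once(self, key):
        if key in self.sent:
            return False
        self.sent.add(key)
        return True


def replay(history, account, ledger):
    """Reapply history; return the effect keys that actually fired this time."""
    fired = []
    for lsn, amount in history:
        if not account.apply(lsn, amount):
            continue                      # state already reflects this entry
        key = f"receipt-{lsn}"
        if ledger.send_once(key):
            fired.append(key)
    return fired


history = [(1, 10), (2, 20), (3, 5), (4, 1)]
# Crash scenario: state was checkpointed at LSN 2, but receipts 1..3 had gone out.
account = Account(balance=30, applied_lsn=2)
ledger = EffectLedger(already_sent={"receipt-1", "receipt-2", "receipt-3"})

fired_first = replay(history, account, ledger)
fired_again = replay(history, account, ledger)
```

Entry 3 is reapplied to the state because the checkpoint predates it, yet its receipt does not fire a second time. The state guard and the effect guard are separate mechanisms and each covers a gap the other leaves open.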

Off-Axis Seat: Digest-Keyed Memoization #

Artifact/build checkpointing (Bazel action cache, CI caches, compiler incremental state, resumable training) is not on the temporal axis at all:

coordinate = content digest of inputs, not a position in time
resume     = cache hit
determinism = called hermeticity here

Signature failures are key failures: non-hermetic input omitted from the digest, environment not in the key, partial artifact reused. Same block, different geometry — checkpointing as memoization.
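A minimal sketch of digest-keyed memoization follows. It is not Bazel's key scheme: `action_key` and `ActionCache` are hypothetical, and the key is assumed to cover exactly three things (command, input content digests, declared environment).

```python
import hashlib
import json


def action_key(command, inputs, env):
    """Content digest over everything that may influence the output."""
    doc = {
        "command": command,
        "inputs": {name: hashlib.sha256(blob).hexdigest()
                   for name, blob in inputs.items()},
        "env": env,
    }
    return hashlib.sha256(json.dumps(doc, sort_keys=True).encode()).hexdigest()


class ActionCache:
    def __init__(self):
        self.entries = {}
        self.executions = 0   # how many times the action really ran

    def run(self, command, inputs, env, action):
        key = action_key(command, inputs, env)
        if key not in self.entries:          # miss: execute and memoize
            self.entries[key] = action(inputs)
            self.executions += 1
        return self.entries[key]             # hit: "resume" is a lookup


def word_count(inputs):
    return sum(len(blob.split()) for blob in inputs.values())


cache = ActionCache()
inputs = {"a.txt": b"one two", "b.txt": b"three"}
first = cache.run(["wc"], inputs, {"LANG": "C"}, word_count)
second = cache.run(["wc"], inputs, {"LANG": "C"}, word_count)       # cache hit
third = cache.run(["wc"], inputs, {"LANG": "en_US"}, word_count)    # env changed: miss
```

Dropping `env` from `action_key` would turn the third call into a hit and reproduce the "environment not in the key" failure: a stale artifact returned for an input the digest never saw.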


Technical Bottleneck: The Binding Coordinate* #

captured state is only meaningful if tied to the exact point
in history it represents — and to the ownership epoch that took it.

This problem is essential and has no general solution. Count the failure modes that are all this one problem:

checkpoint inconsistent with input offsets     snapshot/changelog mismatch
partial checkpoint published                   checkpoint of a stale assignment
watch too old -> relist                        status belongs to old generation
ACK applies to wrong nonce

Known recipes (the block’s crown jewels):

LSN / sequence            a total order to bind against
barrier / consistent cut  manufacture a coordinate where no global clock exists
atomic manifest publish   state + coordinate become visible together or not at all
generation-stamped checkpoint  bind to ownership epoch; a zombie's checkpoint
                          is rejectable (the fencing seam, again)
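Two of these recipes, atomic manifest publish and generation stamping, can be combined in a short file-based sketch. The `MANIFEST` file name and the `publish`/`load` helpers are assumptions for illustration. The generation check here is check-then-write, so a real system would still need the store itself to enforce the fence (for example with a compare-and-swap).

```python
import json
import os
import tempfile


def load(directory):
    """Return the published manifest, or None if nothing is published yet."""
    path = os.path.join(directory, "MANIFEST")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def publish(directory, state, coordinate, generation):
    """Publish state and coordinate together, rejecting stale owners."""
    current = load(directory)
    if current is not None and generation < current["generation"]:
        raise PermissionError("stale generation: checkpoint rejected")
    manifest = {"state": state, "coordinate": coordinate, "generation": generation}
    # Write to a temporary file in the same directory, then rename over the
    # target: readers see the old manifest or the new one, never a partial one.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, os.path.join(directory, "MANIFEST"))
```

A usage sketch: after an owner with generation 2 publishes, a zombie still holding generation 1 calls `publish` and gets `PermissionError`, and the manifest on disk is unchanged. State and coordinate live in one file, so they cannot become visible separately.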

Corollary — retention is the binding failing in time:

the checkpoint is only as good as the history that survives after it.
"WAL truncated too early" and "changelog retention too short"
are binding failures along the time dimension.
retention floor = oldest checkpoint anyone might restore from.
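The retention floor is a one-line computation, shown here as a sketch. It assumes integer coordinates starting at 0, where a checkpoint at coordinate `c` reflects every entry up to and including `c`; both function names are hypothetical.

```python
def safe_truncation_point(retained_checkpoint_coords, requested):
    """Highest coordinate that may be deleted without stranding a checkpoint.

    `requested` is what the retention policy would like to delete up to.
    Returns -1 when no checkpoint exists, meaning no history may be deleted.
    """
    if not retained_checkpoint_coords:
        return -1
    floor = min(retained_checkpoint_coords)   # oldest restorable checkpoint
    return min(requested, floor)


def can_restore(checkpoint_coord, oldest_surviving_entry):
    """A checkpoint is usable only if history resumes right after it."""
    return oldest_surviving_entry <= checkpoint_coord + 1


floor_bound = safe_truncation_point([120, 300, 450], requested=400)
policy_bound = safe_truncation_point([300, 450], requested=200)
```

With checkpoints at 120, 300, and 450 retained, a policy asking to truncate through 400 is clamped to 120. Deleting the checkpoint at 120 first is what makes further truncation safe, so checkpoint GC and history retention have to be decided together.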

A strong design says explicitly:

what state is captured,
what history remains after it,
what coordinate ties them together (and to which owner's epoch),
and which effects are safe to replay.

Recovery Protocol (the crossing-point spec) #

General:

record progress coordinate
capture state / publish snapshot atomically with its coordinate
continue processing
on failure: choose a checkpoint
restore state
replay surviving history after the coordinate
dedupe / resolve repeated effects
publish recovered progress

Database instantiation:

mutation -> WAL (before data page, always)
apply to memory/pages
checkpoint dirty state, record recovery LSN
crash -> load checkpoint -> redo WAL past LSN -> undo incomplete txns
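The redo-then-undo sequence can be sketched over an in-memory log. This is a heavily simplified model, not any database's recovery code. It assumes the LSN is the list index, the checkpoint pages reflect every record up to `checkpoint_lsn`, each write record carries the old and new value, and locking prevents another transaction from overwriting a key written by an uncommitted one.

```python
def recover_pages(checkpoint_pages, checkpoint_lsn, wal):
    """Load the checkpoint, redo the WAL past its LSN, undo uncommitted writes."""
    pages = dict(checkpoint_pages)
    committed = {rec[1] for rec in wal if rec[0] == "commit"}

    # Redo pass: reapply every logged write after the checkpoint's LSN.
    # Setting a key to its logged new value is idempotent.
    for lsn, rec in enumerate(wal):
        if lsn > checkpoint_lsn and rec[0] == "set":
            _, _txid, key, _old, new = rec
            pages[key] = new

    # Undo pass: walk backwards, reverting writes of transactions that
    # never committed, using the old value stored in each record.
    for rec in reversed(wal):
        if rec[0] == "set" and rec[1] not in committed:
            _, _txid, key, old, _new = rec
            if old is None:
                pages.pop(key, None)
            else:
                pages[key] = old
    return pages


wal = [
    ("set", 1, "a", None, 1),   # LSN 0
    ("commit", 1),              # LSN 1
    ("set", 2, "b", None, 5),   # LSN 2: transaction 2 never commits
    ("set", 3, "a", 1, 7),      # LSN 3
    ("commit", 3),              # LSN 4
]
recovered = recover_pages({"a": 1}, checkpoint_lsn=1, wal=wal)
```

Recovering from an empty checkpoint with `checkpoint_lsn=-1` yields the same pages, which is the history-authoritative property: the checkpoint shortens the redo pass without changing its result.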

Stream-processor instantiation:

coordinator injects barrier
operators snapshot state at barrier, record source offsets
manifest persisted atomically
failure -> restore state, rewind sources to checkpoint offsets

Workflow instantiation:

append event -> worker replays full history deterministically
             -> rebuilds state -> schedules next command
             -> result appended as new event
(compaction = snapshot + continue-as-new)
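Deterministic replay with recorded side-effect results can be sketched as follows. This is not Temporal's API: `WorkflowContext` and `side_effect` are hypothetical, and the history is assumed to hold only side-effect results in the order the workflow code requests them.

```python
import random


class WorkflowContext:
    def __init__(self, history):
        self.history = history   # authoritative log of recorded results
        self.cursor = 0          # position in history during this run
        self.executed = 0        # side effects actually run by this worker

    def side_effect(self, fn):
        """Run fn once ever; on replay, return the result recorded in history."""
        if self.cursor < len(self.history):
            result = self.history[self.cursor]   # replay: do not re-execute
        else:
            result = fn()                        # first execution
            self.executed += 1
            self.history.append(result)          # result appended as new event
        self.cursor += 1
        return result


def order_workflow(ctx):
    # Non-deterministic inputs must pass through the history to be replayable.
    order_id = ctx.side_effect(lambda: random.randint(1000, 9999))
    receipt = ctx.side_effect(lambda: f"charged:{order_id}")
    return receipt


history = []
first_worker = WorkflowContext(history)
first_result = order_workflow(first_worker)

# A new worker rebuilds the same state purely from history.
second_worker = WorkflowContext(history)
replayed_result = order_workflow(second_worker)
```

The second worker reaches the same result without drawing a new random number or charging again. If `order_workflow` called `random.randint` directly instead of going through `side_effect`, replay would diverge from the recorded history.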

Named Configurations (lookup table) #

Vector = {authority, captured, scope, economics, replay safety}.

| Name | Vector | Canonical study object | Signature failure |
| --- | --- | --- | --- |
| Progress checkpoint | external/state elsewhere, progress only, per-partition, trivial, at-least-once + idempotency | Kafka committed offset; CDC cursor | commit-before/after-effect; stale-assignment commit; retention outruns cursor |
| State checkpoint | state-auth or bound, state (+offsets), single or distributed, full/incremental, restore-compat | Flink operator state; model checkpoint | state/offset mismatch; format breaks on deploy; partial publish |
| WAL + checkpoint | history-auth, both bound by LSN, single machine, periodic, idempotent redo | Postgres WAL; RocksDB | page-before-WAL; early truncation; lying checkpoint metadata |
| Snapshot + changelog | history-auth, both bound by offset, per-store, compacted, double-apply guards | Kafka Streams state store | offset mismatch; short changelog retention; missing tombstone |
| Workflow replay | history-auth, progress-as-history, single logical actor, continue-as-new compaction, determinism mandatory | Temporal history + replay | non-deterministic code; replayed side effect; unbounded history; version mismatch |
| Event sourcing | history-auth, events + snapshot, per-aggregate, snapshot cadence, projection idempotency | domain aggregate log | bad schema immortal in history; replay semantics drift; correction pain |
| Distributed snapshot | source-auth, state+offsets bound by barrier, distributed cut, coordinated, transactional sinks | Chandy-Lamport; Flink barriers | inconsistent cut; alignment backpressure; slow partition stalls all |
| Incremental checkpoint | as parent, deltas+manifest, n/a, incremental, chain-restore | RocksDB incrementals in Flink | GC eats shared base; long chains; manifest inconsistency |
| Artifact/memoization | state-auth, digest-keyed, per-action, content-addressed, hermeticity | Bazel action cache | non-hermetic key; environment omitted; partial artifact reuse |
| Control-plane checkpoint | history-auth (watch stream), progress coordinate, per-resource, n/a, level-triggered tolerance | resourceVersion/relist; observedGeneration; xDS nonce | too-old watch; status of old generation; ACK/nonce mismatch (see xDS notes) |

Vocabulary #

checkpoint  snapshot  restore  replay  redo  undo
offset  cursor  LSN  revision  sequence  applied index  commit index
generation  epoch  nonce
barrier  consistent cut  alignment  in-flight message
manifest  delta  base  reference count  reachability
changelog  compaction  tombstone  retention
determinism  hermeticity  idempotency  digest
continue-as-new  relist  observedGeneration

Deep Lesson #

Checkpoint/replay bugs come from confusing pairs on different axes:

liveness heartbeat   vs  recoverable progress   (axis 2: alive ≠ resumable)
offset commit        vs  successful side effect (axis 5: commit point* — queue.md)
snapshot             vs  authoritative truth    (axis 1: cache vs source)
checkpoint           vs  transaction commit     (recovery point ≠ atomicity point)
replay               vs  safe re-execution      (axis 5: history reapply ≠ effect reapply)
determinism          vs  "usually same result"  (axis 5: habit is not a proof)
retention            vs  recoverability         (bottleneck* corollary: history must outlive its checkpoints)

Design procedure: decide what is authoritative, capture state with its coordinate atomically, scope the cut, budget checkpoint frequency against RTO, then prove replay safe — and test restore before you need it. The named types are recognition shortcuts, not the design space.