Materialized View / Projection #
projection = derived state computed from source state/events
materialized view = persisted projection optimized for reads
It answers:
what should we precompute so reads do not redo expensive work?
Role in the catalog: the family portrait — a Combination block. Cache, index, and materialized view are three siblings of one species:
derived, keyed, rebuildable copies of truth
This block creates no new primitive; it names the composite and states the family axes once. Roughly 70% of the classic “projection types” are already owned:
index/search projection → index_structures.md (whole)
cache projection → cache.md (whole, incl. key completeness)
streaming projection → checkpoint_replay.md's composite
(state + offsets bound by coordinate)
+ log.md's applied rung
batch publish discipline → garbage_collection.md's publication protocol
aggregate/rollup → GC Part II query-sufficient rung + index_structures.md
denormalized read model → cache.md's write-path axis, at schema level
What is native here: event time (the star) and the cutover protocol (a composition worth writing once). Everything else is a pointer to its owning file.
Central tension:
read speed and query simplicity vs freshness, correctness, maintenance cost
Family Axes (stated once for all three siblings) #
Axis A — Fill Trigger #
on-read / lazy: populated by misses; DEMAND decides what exists
(cache: pay per miss, coverage emergent)
on-write / eager: maintained by changes; COVERAGE decided upfront
(MV, index: pay per write — capacity.md's
write amplification, by subscription)
hybrid: read-through fill + background refresh
Interrogation:
Who pays — the reader on miss, or every writer forever?
Is the coverage set known upfront (eager viable) or emergent (lazy)?
What is the hit pattern that amortizes eager maintenance? (index_structures.md axis 5)
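A minimal sketch of the two fill triggers, to make "who pays" concrete. Everything here is hypothetical (`SOURCE`, `LazyTotals`, `EagerTotals`, `write_order`), and it assumes each order id is written exactly once, so the eager view can add without subtracting a previous amount.

```python
# Hypothetical source of truth: order_id -> (customer, amount).
SOURCE = {}

class LazyTotals:
    """On-read fill: a total exists only after someone has asked for it."""
    def __init__(self):
        self.cache = {}
        self.misses = 0

    def get(self, customer):
        if customer not in self.cache:
            self.misses += 1  # the reader pays: full scan of the source
            self.cache[customer] = sum(
                amount for c, amount in SOURCE.values() if c == customer
            )
        return self.cache[customer]

    def invalidate(self, customer):
        self.cache.pop(customer, None)

class EagerTotals:
    """On-write fill: every write maintains the view; coverage is decided upfront."""
    def __init__(self):
        self.totals = {}
        self.maintenance_writes = 0

    def on_write(self, customer, amount):
        self.maintenance_writes += 1  # every writer pays, read or not
        self.totals[customer] = self.totals.get(customer, 0) + amount

    def get(self, customer):
        return self.totals.get(customer, 0)

def write_order(order_id, customer, amount, lazy, eager):
    """One source write fans out to both derived copies."""
    SOURCE[order_id] = (customer, amount)
    lazy.invalidate(customer)         # lazy: drop the entry, refill on next miss
    eager.on_write(customer, amount)  # eager: update in the write path
```

After three writes and two reads for one customer, the lazy copy has paid one miss and holds one entry; the eager copy has paid three maintenance writes and holds every customer.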
Axis B — Codomain #
the answer itself: cache, MV — serve directly
a pointer to candidates: index — verification step mandatory,
silent omission* lives there (index_structures.md)
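A small sketch of the pointer codomain, assuming a hypothetical in-memory `rows` table and a lagging `index`. It shows why verification is mandatory and also what verification cannot do: it removes false candidates but cannot restore a row the index never mentions.

```python
# Hypothetical source rows and a derived index that has fallen behind.
rows = {1: {"color": "red"}, 2: {"color": "blue"}, 3: {"color": "red"}}
index = {"red": {1, 2}, "blue": set()}  # stale: 2 used to be red, 3 is not indexed yet

def lookup(color):
    """Treat index entries as candidates and re-check each against the source."""
    candidates = index.get(color, set())
    return sorted(
        rid for rid in candidates
        if rows.get(rid, {}).get("color") == color
    )

red = lookup("red")  # row 2 is filtered out; row 3 is silently missing
```

The stale entry for row 2 is caught by the check. Row 3 is the silent omission: no verification step on the read path can see it.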
Everything else the family needs is already owned: freshness contract → cache.md axis 1; maintenance coupling → index_structures.md axis 4; rebuildability + progress coordinate → checkpoint_replay.md; authoritative-vs-derived → log.md axis 1 / cache.md deep lesson.
Native Content 1 — Event Time (the block’s own territory) #
No other file owns event time. A projection consuming events must decide what “now” and “complete” mean:
event time: when it happened (the domain's truth)
processing time: when the projection saw it
the gap between them is unbounded and arrives OUT OF ORDER
The machinery:
watermark: "I declare hour 14 closed" — a completeness BET made
explicit. arrivals after it are LATE by definition.
allowed lateness: a grace window; late-but-tolerated events retract
and reissue the affected result
late-data policy: drop / side-output / retract-and-reissue — chosen,
not discovered in an incident
bitemporal record: store event time AND processing time, so
"what did we know when" is answerable
idempotent upsert: a late correction is an overwrite, not a double-count
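The machinery above can be sketched in one small class. This is an illustration under stated assumptions, not a reference implementation: `WindowedCounter` and its fields are hypothetical, windows are fixed-width tumbling windows on integer event time, and the watermark heuristic (largest event time seen minus a fixed delay) is one common choice among several.

```python
from dataclasses import dataclass, field

@dataclass
class WindowedCounter:
    size: int              # window width in event-time units
    watermark_delay: int   # the bet: events rarely arrive later than this
    allowed_lateness: int  # grace period after a window closes
    watermark: int = -1
    sums: dict = field(default_factory=dict)       # window start -> running sum
    seen: set = field(default_factory=set)         # event ids already handled
    published: dict = field(default_factory=dict)  # window start -> result (upsert target)
    side_output: list = field(default_factory=list)

    def on_event(self, event_id, event_time, value):
        if event_id in self.seen:
            return  # redelivery: applying twice would double-count
        self.seen.add(event_id)
        start = event_time - event_time % self.size
        end = start + self.size
        if self.watermark >= end + self.allowed_lateness:
            # Too late even for the grace window: the declared policy is side-output.
            self.side_output.append((event_id, event_time, value))
            return
        self.sums[start] = self.sums.get(start, 0) + value
        if start in self.published:
            # Late but tolerated: retract and reissue, expressed as an overwrite.
            self.published[start] = self.sums[start]
        self._advance(event_time - self.watermark_delay)

    def _advance(self, candidate):
        # The watermark only moves forward; passing a window's end closes it.
        self.watermark = max(self.watermark, candidate)
        for start, total in self.sums.items():
            if start not in self.published and self.watermark >= start + self.size:
                self.published[start] = total
```

Because `published` is keyed by window start, a reissued result replaces the earlier one rather than adding to it, which is the idempotent-upsert property the list names.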
The feature-store material is this axis under an ML costume:
point-in-time leakage = a training join that saw the future
(bitemporal join done wrong)
training/serving skew = the same feature computed under two different
event-time disciplines
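A sketch of the point-in-time rule, assuming each feature record carries both timestamps (the bitemporal record above). `feature_as_of` and `leaky_latest` are hypothetical names, and timestamps are plain integers for brevity.

```python
def feature_as_of(history, t):
    """Latest feature value that had both happened and been observed by time t."""
    visible = [
        r for r in history
        if r["event_time"] <= t and r["processing_time"] <= t
    ]
    if not visible:
        return None
    return max(visible, key=lambda r: r["event_time"])["value"]

def leaky_latest(history):
    """The bug: ignores the label's timestamp, so a training row can see the future."""
    return max(history, key=lambda r: r["event_time"])["value"]

history = [
    {"event_time": 1, "processing_time": 2, "value": 10},
    {"event_time": 5, "processing_time": 9, "value": 20},  # arrived late
    {"event_time": 8, "processing_time": 8, "value": 30},
]
```

For a label at time 6, the correct join returns 10: the value with event time 5 had happened but was not yet known, and the value with event time 8 had not happened. The leaky join returns 30. Skew follows when the online path uses one of these functions and the offline path uses the other.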
And two aggregation traps that only exist under reordering:
non-associative aggregation: reordered application changes the answer;
the function itself must be commutative and
associative, or inputs must be sequenced first
double-count: the same event applied twice because the
progress coordinate and the state diverged
(binding coordinate*, checkpoint_replay.md)
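Both traps fit in a few lines. The names are hypothetical, and `BoundState` assumes offsets arrive in increasing order within one partition; in a real system the state and its offset would also have to be committed in one atomic write, which this in-memory sketch only imitates.

```python
def merge_means(a, b):
    """Mean of two means: not associative, so regrouping changes the answer."""
    return (a + b) / 2

def merge_sum_count(a, b):
    """(sum, count) pairs merge commutatively and associatively."""
    return (a[0] + b[0], a[1] + b[1])

def mean_of(pair):
    return pair[0] / pair[1]

class BoundState:
    """Aggregate state that carries its own progress coordinate."""
    def __init__(self):
        self.total = 0
        self.offset = -1  # last applied offset, stored with the state

    def apply(self, offset, value):
        if offset <= self.offset:
            return False  # replay after a restore: already counted
        # State and coordinate change together, never one without the other.
        self.total, self.offset = self.total + value, offset
        return True
```

With inputs 1, 2, 6 the mean-of-means gives 3.75 or 2.5 depending on grouping, while the `(sum, count)` form gives `(9, 3)` either way. Replaying offset 1 into `BoundState` is rejected instead of being added a second time.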
Interrogation:
Event time or processing time — and does every window say which?
Where is the watermark, who advances it, and on what evidence?
The late-data policy: drop, side-output, or retract? Written down?
Can the aggregate absorb out-of-order application, or must input be sequenced?
For features: is the offline join point-in-time correct, and is the
online path computing the SAME function?
Native Content 2 — The Rebuild / Cutover Protocol #
A composition of three owned pieces, written once because “reindex without downtime” is a perennial question:
1. build new generation ALONGSIDE the old (never in place)
2. backfill from a source snapshot (checkpoint_replay.md rebuild path)
3. catch up on the live stream from the
snapshot's coordinate (the snapshot→stream seam:
the GAP/OVERLAP DIAL, 4th appearance —
gap = writes during rebuild missed;
overlap = old and new views mix /
events double-applied; you choose,
and idempotent upserts pay for overlap)
4. validate the new generation against source (drift check before cutover, not after)
5. cut over ATOMICALLY (GC's publication protocol:
generation pointer flips once;
readers see old or new, never a blend)
6. retire the old after readers drain (GC's pin registry: visibility
proof before physical deletion)
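The six steps can be sketched as a single-process toy. `ViewStore`, `rebuild`, and the `overlap` parameter are hypothetical; the sketch assumes the log holds `(offset, key, value)` upserts in order, that the snapshot reflects every offset up to `snapshot_offset`, and that validation can compare against a quiesced source (a live system would more likely sample).

```python
class ViewStore:
    """View generations behind one pointer, with reader pins."""
    def __init__(self):
        self.generations = {}  # generation id -> view (dict)
        self.current = None    # the single generation pointer
        self.pins = {}         # generation id -> active reader count

    def pin(self):
        gen = self.current
        self.pins[gen] = self.pins.get(gen, 0) + 1
        return gen

    def unpin(self, gen):
        self.pins[gen] -= 1

    def read(self, key, gen=None):
        gen = self.current if gen is None else gen  # resolve the pointer once
        return self.generations[gen].get(key)

    def publish(self, gen, view):
        self.generations[gen] = view
        old, self.current = self.current, gen  # the flip: one assignment
        return old

    def retire(self, gen):
        if self.pins.get(gen, 0) > 0:
            return False  # readers still draining; try again later
        self.generations.pop(gen, None)
        return True

def rebuild(store, gen, snapshot, snapshot_offset, log, source, overlap=2):
    view = dict(snapshot)              # steps 1-2: new copy, backfilled from snapshot
    for offset, key, value in log:     # step 3: catch up, choosing overlap over gap
        if offset > snapshot_offset - overlap:
            view[key] = value          # upsert: replaying the overlap is harmless
    if view != source:                 # step 4: validate before cutover
        raise ValueError("new generation drifts from source; not cutting over")
    old = store.publish(gen, view)     # step 5: atomic pointer flip
    if old is not None:
        store.retire(old)              # step 6: only succeeds once unpinned
    return old
```

The overlap is safe here only because replay is in order and every write is a last-writer-wins upsert; an additive view would need the deduplication from the event-time section instead.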
Interrogation:
Which step is skipped in the current design? (usually 4 or 6)
Cutover: is there a single generation pointer, or can queries straddle?
Catch-up seam: gap or overlap — and what makes overlap safe? (idempotency)
Can the source still produce the backfill? (the retention treaty:
a projection is only as rebuildable as the history that survives)
Technical Bottleneck: Completeness Is a Bet* #
a projection over event time can never know it has seen everything.
state_machine.md’s ignorance*, applied to COMPLETENESS rather than outcome: the projection is permanently in Unknown about whether more input exists, and the watermark is the timeout-driven exit from that state — a declared bet, with a declared policy for losing it.
"late events" is not a failure mode; it is the terrain.
the failure mode is a watermark policy chosen implicitly.
Known recipes (all above, gathered):
watermarks + declared lateness bounds
retraction/reissue for tolerated latecomers
bitemporal storage ("what did we know when")
idempotent upserts (corrections overwrite, never accumulate)
completeness metrics (watermark lag observed, not assumed)
A strong design says explicitly:
what source truth is projected, under which time semantics,
where the watermark is and what advancing it costs,
the late-data policy by name,
how progress and state stay bound (the coordinate),
and the six cutover steps for the day the projection is wrong.
Named Configurations (lookup table — mostly arrows, by design) #
| Name | Owner / native content | Signature failure |
|---|---|---|
| Denormalized read model | → cache.md write-path, schema-level; CQRS fanout | source/view divergence; delete not propagated; partial update |
| Aggregate view / rollup | → GC Part II rung + index_structures.md aggregate row; event-time traps native | double-count; late events; non-associative reorder; wrong grain |
| Index / search projection | → index_structures.md, whole | (see index_structures.md; refresh-lag surprise is cache freshness) |
| Streaming projection | → checkpoint_replay composite + log.md applied rung; watermarks native | offset ahead of state (binding coordinate*); duplicate apply; stale restore |
| Batch projection | → GC publication protocol (stage, validate, atomic publish, drain) | partial publish; rerun overwrites good output; late input missed |
| Incremental MV | → checkpoint_replay deltas + dependency invalidation (cache.md surrogate keys) | wrong delta logic; accumulated drift; missed dependency |
| Rebuild / reindex | native: the cutover protocol | mixed generations; writes-during-rebuild gap; cutover race; source history gone |
| Serving analytical (Pinot/Druid) | composition: segments (log) + indexes (index_structures.md) + freshness (cache) + routing (index_structures.md owner codomain) | freshness lag; segment/routing mismatch; wrong grain |
| Cache projection | → cache.md, whole | (see cache.md axis 4 for the auth-in-key failure) |
| Audit/provenance projection | lineage = the derivation’s own log (log.md fact records); Zanzibar decision trace | answers without explanation; lineage expired before question asked |
| Feature/ML projection | native: point-in-time discipline; offline/online = two coupled projections | training/serving skew; PIT leakage; backfill mismatch |
Vocabulary #
projection materialized view read model derived state
event time processing time watermark allowed lateness
late event retraction reissue bitemporal point-in-time
grain rollup window non-associative
backfill catch-up cutover generation drain
training/serving skew leakage feature freshness
(coordinate, rebuild, publication, freshness, key → owning files)
Deep Lesson #
Projection bugs come from confusing pairs — mostly already adjudicated:
derived state vs source of truth (cache.md / log.md: the family's premise)
freshness vs correctness (cache.md axis 1)
cache vs materialized view (family axis A: who pays, miss or write)
index hit vs authoritative existence (index_structures.md: candidate ≠ truth)
event delivery vs projection update (log.md rung 4 vs applied state)
checkpoint vs completed side effect (commit point*, queue.md)
rebuild vs atomic cutover (native: steps 1–3 are not step 5)
processing time vs event time (native: the star's terrain)
Design procedure: place the sibling (axis A, axis B), inherit its owner’s vector, then do the two jobs only this block owns — declare the time semantics with a watermark policy by name, and keep the six-step cutover protocol ready for the day the projection lies. The named types are recognition shortcuts; here, they are mostly arrows.