
Materialized View / Projection #

projection        = derived state computed from source state/events
materialized view = persisted projection optimized for reads

It answers:

what should we precompute so reads do not redo expensive work?

Role in the catalog: the family portrait — a Combination block. Cache, index, and materialized view are three siblings of one species:

derived, keyed, rebuildable copies of truth

This block creates no new primitive; it names the composite and states the family axes once. Roughly 70% of the classic “projection types” are already owned:

index/search projection   → index_structures.md (whole)
cache projection          → cache.md (whole, incl. key completeness)
streaming projection      → checkpoint_replay.md's composite
                            (state + offsets bound by coordinate)
                            + log.md's applied rung
batch publish discipline  → garbage_collection.md's publication protocol
aggregate/rollup          → GC Part II query-sufficient rung + index_structures.md
denormalized read model   → cache.md's write-path axis, at schema level

What is native here: event time (the star) and the cutover protocol (a composition worth writing once). Everything else points.

Central tension:

read speed and query simplicity  vs  freshness, correctness, maintenance cost

Family Axes (stated once for all three siblings) #

Axis A — Fill Trigger #

on-read / lazy:    populated by misses; DEMAND decides what exists
                   (cache: pay per miss, coverage emergent)
on-write / eager:  maintained by changes; COVERAGE decided upfront
                   (MV, index: pay per write — capacity.md's
                    write amplification, by subscription)
hybrid:            read-through fill + background refresh
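
The two pure fill triggers can be contrasted in a few lines. This is a minimal sketch, not a reference design: `LazyCache`, `EagerView`, `total_for`, and the toy `source` list are hypothetical names invented for illustration, and the "expensive work" is just a sum over an in-memory list.

```python
from typing import Callable, Dict, List, Tuple

class LazyCache:
    """On-read fill: the reader pays on a miss; coverage is whatever was asked for."""

    def __init__(self, compute: Callable[[str], int]) -> None:
        self._compute = compute
        self._store: Dict[str, int] = {}
        self.misses = 0

    def get(self, key: str) -> int:
        if key not in self._store:
            self.misses += 1                      # the reader pays here
            self._store[key] = self._compute(key)
        return self._store[key]

class EagerView:
    """On-write fill: every write pays to keep the view current; reads never compute."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = {}
        self.maintenance_writes = 0

    def on_write(self, key: str, amount: int) -> None:
        self.maintenance_writes += 1              # every writer pays here
        self._totals[key] = self._totals.get(key, 0) + amount

    def get(self, key: str) -> int:
        return self._totals.get(key, 0)

# Toy source of truth: (key, amount) writes.
source: List[Tuple[str, int]] = [("a", 1), ("b", 2), ("a", 3)]

def total_for(key: str) -> int:
    return sum(v for k, v in source if k == key)

cache = LazyCache(total_for)
view = EagerView()
for k, v in source:
    view.on_write(k, v)
```

In this sketch the cache holds nothing until someone asks for a key, while the view has already absorbed three maintenance writes before the first read. The sketch deliberately omits invalidation, which the freshness contract in cache.md covers.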

Interrogation:

Who pays — the reader on miss, or every writer forever?
Is the coverage set known upfront (eager viable) or emergent (lazy)?
What is the hit pattern that amortizes eager maintenance? (index_structures.md axis 5)

Axis B — Codomain #

the answer itself:      cache, MV — serve directly
a pointer to candidates: index — verification step mandatory,
                         silent omission* lives there (index_structures.md)
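
The "pointer to candidates" case can be sketched as a lookup that re-checks every candidate against the authoritative rows. The names `rows`, `name_index`, and `lookup` are hypothetical, and the index is made stale on purpose.

```python
from typing import Dict, List

# Authoritative rows, keyed by row id.
rows: Dict[int, Dict[str, object]] = {
    1: {"name": "ada", "deleted": False},
    2: {"name": "ada", "deleted": True},
    4: {"name": "ada", "deleted": False},   # never indexed
}

# Derived index, deliberately stale: row 2 was deleted, row 3 no longer
# exists, and row 4 was never added.
name_index: Dict[str, List[int]] = {"ada": [1, 2, 3]}

def lookup(name: str) -> List[int]:
    """Treat index entries as candidates and verify each against the source."""
    verified = []
    for row_id in name_index.get(name, []):
        row = rows.get(row_id)
        if row is not None and not row["deleted"] and row["name"] == name:
            verified.append(row_id)
    return verified
```

Verification removes the false positives (rows 2 and 3) but cannot recover row 4, which the index never pointed to. That asymmetry is why omission is the silent failure.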

Everything else the family needs is already owned: freshness contract → cache.md axis 1; maintenance coupling → index_structures.md axis 4; rebuildability + progress coordinate → checkpoint_replay.md; authoritative-vs-derived → log.md axis 1 / cache.md deep lesson.


Native Content 1 — Event Time (the block’s own territory) #

No other file owns event time. A projection consuming events must decide what “now” and “complete” mean:

event time:       when it happened        (the domain's truth)
processing time:  when the projection saw it
the gap between them is unbounded, and events arrive OUT OF ORDER

The machinery:

watermark:          "I declare hour 14 closed" — a completeness BET made
                    explicit. arrivals after it are LATE by definition.
allowed lateness:   a grace window; late-but-tolerated events retract
                    and reissue the affected result
late-data policy:   drop / side-output / retract-and-reissue — chosen,
                    not discovered in an incident
bitemporal record:  store event time AND processing time, so
                    "what did we know when" is answerable
idempotent upsert:  a late correction is an overwrite, not a double-count
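
The machinery above can be sketched as a single hourly aggregator. This is an illustration under stated assumptions, not a streaming engine: `HourlySum` and its fields are hypothetical, hours are plain integers, the watermark is advanced by the caller, and the late-data policy is hard-coded as retract-and-reissue inside the grace window and side-output beyond it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class HourlySum:
    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.watermark = 0   # hours below the watermark are declared closed
        # hour -> {event_id: value}; keying by event id makes re-delivery an overwrite
        self.windows: Dict[int, Dict[str, int]] = defaultdict(dict)
        self.published: Dict[int, int] = {}            # hour -> last published total
        self.emitted: List[Tuple[str, int, int]] = []  # (kind, hour, total)
        self.side_output: List[Tuple[str, int, int]] = []

    def _publish(self, hour: int, kind: str) -> None:
        total = sum(self.windows[hour].values())
        self.published[hour] = total   # upsert: a correction replaces, never adds
        self.emitted.append((kind, hour, total))

    def on_event(self, event_id: str, hour: int, value: int) -> None:
        if hour >= self.watermark:
            self.windows[hour][event_id] = value       # on time
        elif self.watermark - hour <= self.allowed_lateness:
            self.windows[hour][event_id] = value       # late but tolerated
            self._publish(hour, "reissue")
        else:
            self.side_output.append((event_id, hour, value))  # too late

    def advance_watermark(self, new_watermark: int) -> None:
        closing = sorted(h for h in self.windows
                         if self.watermark <= h < new_watermark)
        for hour in closing:
            self._publish(hour, "final")
        self.watermark = max(self.watermark, new_watermark)

agg = HourlySum(allowed_lateness=1)
agg.on_event("e1", 14, 10)
agg.on_event("e2", 14, 5)
agg.on_event("e2", 14, 5)    # duplicate delivery: overwrites, does not double-count
agg.advance_watermark(15)    # the bet: "hour 14 is closed"
agg.on_event("e3", 14, 2)    # late, inside the grace window
agg.advance_watermark(17)
agg.on_event("e4", 14, 1)    # late, outside the grace window
```

Each branch in `on_event` corresponds to one named policy, so the choice is visible in code rather than discovered in an incident. A real system would also expire window state once the grace window has passed, which this sketch omits.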

The feature-store material is this axis under an ML costume:

point-in-time leakage  = a training join that saw the future
                         (bitemporal join done wrong)
training/serving skew  = the same feature computed under two different
                         event-time disciplines
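
A point-in-time join can be sketched over bitemporal rows. The data and the names `feature_rows`, `as_known_at`, `point_in_time`, and `leaky_latest` are hypothetical, and times are small integers for readability. The assumption is that a training label at time t may only see feature values that both happened and were known by t.

```python
from typing import List, Optional, Tuple

# (event_time, processing_time, value)
feature_rows: List[Tuple[int, int, float]] = [
    (1, 1, 0.2),
    (5, 6, 0.9),   # happened at 5, seen at 6
    (4, 8, 0.7),   # late correction: happened at 4, seen at 8
]

def as_known_at(event_cutoff: int, knowledge_cutoff: int) -> Optional[float]:
    """Latest value with event_time <= event_cutoff, using only rows seen by knowledge_cutoff."""
    visible = [r for r in feature_rows
               if r[0] <= event_cutoff and r[1] <= knowledge_cutoff]
    if not visible:
        return None
    return max(visible, key=lambda r: (r[0], r[1]))[2]

def point_in_time(label_time: int) -> Optional[float]:
    """Training join: what the online path could have served at label_time."""
    return as_known_at(label_time, label_time)

def leaky_latest(label_time: int) -> float:
    """The bug: ignores label_time and joins against the newest row."""
    return max(feature_rows, key=lambda r: (r[0], r[1]))[2]
```

For a label at time 5, `point_in_time` returns the value from time 1, because the time-5 event was not seen until time 6. `leaky_latest` returns the time-5 value, which is the join that saw the future.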

And two aggregation traps that only exist under reordering:

non-associative aggregation: reordered application changes the answer —
                             the function itself must be commutative-
                             associative, or inputs must be sequenced first
double-count:                the same event applied twice because the
                             progress coordinate and the state diverged
                             (binding coordinate*, checkpoint_replay.md)
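
Both traps fit in a short sketch. All names here are hypothetical. `BoundState` assumes events carry a monotonically increasing sequence number and that replay re-delivers them in order; it keeps the progress coordinate in the same object as the state so the two cannot diverge.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, int]   # (sequence, value)

events: List[Event] = [(1, 10), (2, 20), (3, 5)]
reordered: List[Event] = [events[2], events[0], events[1]]

def total(evts: List[Event]) -> int:
    """Order-insensitive: addition is commutative and associative."""
    return sum(value for _, value in evts)

def last_wins(evts: List[Event]) -> Optional[int]:
    """Order-sensitive: the answer depends on application order."""
    result = None
    for _, value in evts:
        result = value
    return result

def last_wins_sequenced(evts: List[Event]) -> Optional[int]:
    """Fix by sequencing the input before applying it."""
    return last_wins(sorted(evts))

class BoundState:
    """State and progress coordinate updated together, so a replay is a no-op."""

    def __init__(self) -> None:
        self.total = 0
        self.applied_through = 0   # highest sequence already folded into total

    def apply(self, seq: int, value: int) -> bool:
        if seq <= self.applied_through:
            return False           # already applied: skip instead of double-counting
        self.total += value
        self.applied_through = seq
        return True

state = BoundState()
for seq, value in events:
    state.apply(seq, value)
replayed = [state.apply(seq, value) for seq, value in events]   # replay after restart
```

`total` gives the same answer for both orderings, `last_wins` does not, and the replay leaves `state.total` unchanged. In a real store the state and the coordinate would be written in one atomic commit; holding them in one object only stands in for that here.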

Interrogation:

Event time or processing time — and does every window say which?
Where is the watermark, who advances it, and on what evidence?
The late-data policy: drop, side-output, or retract? Written down?
Can the aggregate absorb out-of-order application, or must input be sequenced?
For features: is the offline join point-in-time correct, and is the
  online path computing the SAME function?

Native Content 2 — The Rebuild / Cutover Protocol #

A composition of three owned pieces, written once because “reindex without downtime” is a perennial question:

1. build new generation ALONGSIDE the old (never in place)
2. backfill from a source snapshot          (checkpoint_replay.md rebuild path)
3. catch up on the live stream from the
   snapshot's coordinate                    (the snapshot→stream seam:
                                             the GAP/OVERLAP DIAL, 4th appearance —
                                             gap = writes during rebuild missed;
                                             overlap = old and new views mix /
                                             events double-applied; you choose,
                                             and idempotent upserts pay for overlap)
4. validate the new generation against source (drift check before cutover, not after)
5. cut over ATOMICALLY                      (GC's publication protocol:
                                             generation pointer flips once;
                                             readers see old or new, never a blend)
6. retire the old after readers drain       (GC's pin registry: visibility
                                             proof before physical deletion)
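
The six steps can be sketched end to end with in-memory stand-ins. Everything named here is hypothetical: `ViewStore`, `rebuild`, the integer offsets, and the last-value-wins view. The "snapshot" is simulated by replaying the log prefix, and validation compares against an independent fold of the log, which is a stand-in for a real drift check. The sketch chooses overlap over gap and relies on upserts to make the overlap harmless.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, int]   # (offset, key, value), applied as an upsert

class ViewStore:
    def __init__(self) -> None:
        self.generations: Dict[int, Dict[str, int]] = {0: {}}
        self.current = 0                      # the single generation pointer
        self.pins: Dict[int, int] = {}        # generation -> active readers

    def begin_read(self) -> int:
        gen = self.current
        self.pins[gen] = self.pins.get(gen, 0) + 1
        return gen

    def end_read(self, gen: int) -> None:
        self.pins[gen] -= 1

    def retire(self, gen: int) -> bool:
        """Step 6: delete only once no reader can still see the generation."""
        if gen == self.current or self.pins.get(gen, 0) > 0:
            return False
        self.generations.pop(gen, None)
        return True

def rebuild(store: ViewStore, log: List[Event], snapshot_offset: int,
            overlap: int, new_gen: int) -> int:
    new_view: Dict[str, int] = {}             # step 1: alongside, never in place
    for offset, key, value in log:            # step 2: backfill from the snapshot
        if offset < snapshot_offset:
            new_view[key] = value
    start = max(0, snapshot_offset - overlap)
    for offset, key, value in log:            # step 3: catch up, with overlap
        if offset >= start:
            new_view[key] = value             # upsert: re-applying is harmless
    expected: Dict[str, int] = {}
    for _, key, value in log:                 # step 4: validate before cutover
        expected[key] = value
    if new_view != expected:
        raise ValueError("new generation drifted from source")
    store.generations[new_gen] = new_view
    old = store.current
    store.current = new_gen                   # step 5: one pointer flip
    return old

log: List[Event] = [(0, "a", 1), (1, "b", 2), (2, "a", 3), (3, "c", 4)]
store = ViewStore()
reader_gen = store.begin_read()               # a reader pinned to generation 0
old_gen = rebuild(store, log, snapshot_offset=2, overlap=1, new_gen=1)
retired_early = store.retire(old_gen)         # refused: the reader has not drained
store.end_read(reader_gen)
retired_late = store.retire(old_gen)
```

Readers resolve `store.current` once at `begin_read`, so each query sees old or new and never a blend. If the overlap were negative the catch-up would start past the snapshot, and the validation step is what would catch the resulting gap.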

Interrogation:

Which step is skipped in the current design? (usually 4 or 6)
Cutover: is there a single generation pointer, or can queries straddle?
Catch-up seam: gap or overlap — and what makes overlap safe? (idempotency)
Can the source still produce the backfill? (the retention treaty:
  a projection is only as rebuildable as the history that survives)

Technical Bottleneck: Completeness Is a Bet* #

a projection over event time can never know it has seen everything.

state_machine.md’s ignorance*, applied to COMPLETENESS rather than outcome: the projection is permanently in Unknown about whether more input exists, and the watermark is the timeout-driven exit from that state — a declared bet, with a declared policy for losing it.

"late events" is not a failure mode; it is the terrain.
the failure mode is a watermark policy chosen implicitly.

Known recipes (all above, gathered):

watermarks + declared lateness bounds
retraction/reissue for tolerated latecomers
bitemporal storage ("what did we know when")
idempotent upserts (corrections overwrite, never accumulate)
completeness metrics (watermark lag observed, not assumed)
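
The last recipe, observing lateness rather than assuming it, can be sketched as a small report over recorded arrivals. `lateness_report` and the sample `arrivals` are hypothetical; times are integers in one arbitrary unit, and the report only summarizes how a candidate lateness bound would have performed on past data.

```python
from typing import Dict, List, Tuple

def lateness_report(arrivals: List[Tuple[int, int]], bound: int) -> Dict[str, float]:
    """arrivals holds (event_time, processing_time) pairs; bound is a candidate lateness limit."""
    if not arrivals:
        return {"max_delay": 0, "late_fraction": 0.0}
    delays = [processing - event for event, processing in arrivals]
    late = sum(1 for delay in delays if delay > bound)
    return {"max_delay": max(delays), "late_fraction": late / len(delays)}

arrivals = [(100, 101), (100, 103), (101, 110), (102, 102)]
report = lateness_report(arrivals, bound=5)
```

With these four arrivals and a bound of 5, one event in four would have been late. A number like that turns the watermark bet into a measured one.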

A strong design says explicitly:

what source truth is projected, under which time semantics,
where the watermark is and what advancing it costs,
the late-data policy by name,
how progress and state stay bound (the coordinate),
and the six cutover steps for the day the projection is wrong.

Named Configurations (lookup table — mostly arrows, by design) #

| Name | Owner / native content | Signature failure |
| --- | --- | --- |
| Denormalized read model | cache.md write-path, schema-level; CQRS fanout | source/view divergence; delete not propagated; partial update |
| Aggregate view / rollup | → GC Part II rung + index_structures.md aggregate row; event-time traps native | double-count; late events; non-associative reorder; wrong grain |
| Index / search projection | index_structures.md, whole | (see index_structures.md; refresh-lag surprise is cache freshness) |
| Streaming projection | → checkpoint_replay composite + log.md applied rung; watermarks native | offset ahead of state (binding coordinate*); duplicate apply; stale restore |
| Batch projection | → GC publication protocol (stage, validate, atomic publish, drain) | partial publish; rerun overwrites good output; late input missed |
| Incremental MV | → checkpoint_replay deltas + dependency invalidation (cache.md surrogate keys) | wrong delta logic; accumulated drift; missed dependency |
| Rebuild / reindex | native: the cutover protocol | mixed generations; writes-during-rebuild gap; cutover race; source history gone |
| Serving analytical (Pinot/Druid) | composition: segments (log) + indexes (index_structures.md) + freshness (cache) + routing (index_structures.md owner codomain) | freshness lag; segment/routing mismatch; wrong grain |
| Cache projection | cache.md, whole | (see cache.md axis 4 for the auth-in-key failure) |
| Audit/provenance projection | lineage = the derivation’s own log (log.md fact records); Zanzibar decision trace | answers without explanation; lineage expired before question asked |
| Feature/ML projection | native: point-in-time discipline; offline/online = two coupled projections | training/serving skew; PIT leakage; backfill mismatch |

Vocabulary #

projection  materialized view  read model  derived state
event time  processing time  watermark  allowed lateness
late event  retraction  reissue  bitemporal  point-in-time
grain  rollup  window  non-associative
backfill  catch-up  cutover  generation  drain
training/serving skew  leakage  feature freshness
(coordinate, rebuild, publication, freshness, key → owning files)

Deep Lesson #

Projection bugs come from confusing pairs — mostly already adjudicated:

derived state      vs  source of truth     (cache.md / log.md: the family's premise)
freshness          vs  correctness         (cache.md axis 1)
cache              vs  materialized view   (family axis A: who pays, miss or write)
index hit          vs  authoritative existence (index_structures.md: candidate ≠ truth)
event delivery     vs  projection update   (log.md rung 4 vs applied state)
checkpoint         vs  completed side effect (commit point*, queue.md)
rebuild            vs  atomic cutover      (native: steps 1–3 are not step 5)
processing time    vs  event time          (native: the star's terrain)

Design procedure: place the sibling (axis A, axis B), inherit its owner’s vector, then do the two jobs only this block owns — declare the time semantics with a watermark policy by name, and keep the six-step cutover protocol ready for the day the projection lies. The named types are recognition shortcuts; here, they are mostly arrows.