Skip to main content
  1. concepts/

Index #

index = auxiliary structure that maps a question to candidate data

It answers:

where should I look?

Role in the catalog: a component block with a confession — an index is a cache: a materialized derived structure keyed for lookup. It imports cache.md’s freshness ladder wholesale (“index-source divergence” IS staleness), checkpoint_replay.md’s rebuild path (derived state must be reconstructible from source), and garbage_collection.md’s publication protocol (segment generations). What it owns natively is one asymmetry and one bottleneck — both stated below.

Central tension:

read speed  vs  write cost, storage cost, freshness, and complexity
(capacity.md axis 5: every index moves cost from read-time to
write-time + storage; the query pattern must amortize it)

Design Axes (the core module) #

Axis 1 — The Answer Contract (the structural cleave — a ladder) #

The source doc’s preamble stated it and never used it:

exact:        the entry IS the answer, or points directly to it —
              false positives and false negatives both forbidden
              (primary index, covering secondary)
candidate:    false positives fine, FALSE NEGATIVES CATASTROPHIC —
              answers must be verified against truth; pruning only
              (bloom filters, zone maps, sparse indexes, routing)
approximate:  false negatives permitted and PRICED — recall is a dial
              declared to the caller, not a bug
              (vector ANN: HNSW, IVF/PQ)

The contract decides everything downstream: what verification means, whether the protocol’s “verify candidates” step is optional or load-bearing, and what even counts as a bug.

Interrogation:

Which rung? Is it written down, or assumed by half the callers?
Candidate rung: where is the verification step, and can it be skipped?
Approximate rung: what recall, measured how, declared where?
Exact rung: what maintains exactness, and at what write cost?

Axis 2 — Predicate Shape #

Seven of the fourteen “types” are this one axis wearing different names. Each shape has a natural structure and a signature pathology:

point:            hash / B-tree            hot key
range/order:      B-tree, BRIN, sorted     hot range; sort-order mismatch
containment:      inverted index/postings  analyzer ≠ equality; stale ranking stats
set-combination:  bitmaps (Roaring)        cardinality blowup; row-ID remap
similarity:       HNSW / IVF               wrong metric; embedding version drift
spatial:          S2 / geohash / H3 / R-tree  boundary false positives; hot city cells
temporal:         time index               event-time vs ingest-time; late data
label-match:      label -> series IDs      cardinality explosion; churn

Interrogation:

What question, precisely, does the index answer — and is the query's
  predicate actually the indexed shape? (prefix ≠ contains; analyzed ≠ exact)
Which pathology does this shape invite, and what bounds it?
Selectivity: does the predicate narrow enough to beat a scan?

Axis 3 — Codomain Granularity (what the index maps TO) #

rows:            primary lookup, postings lists
blocks/files:    sparse indexes, zone maps, per-SSTable blooms —
                 pruning units, not answers
owners:          routing indexes — key/query -> shard/endpoint

The owner codomain carries the deep lesson’s sharpest row:

routing index ≠ ownership proof.
a routing entry is ADVISORY placement (scheduler.md axis 4);
treating it as leased/fenced ownership is the split-brain bug.
routing inherits view-vs-reality* — it owns nothing new.

Interrogation:

Row, block, or owner? Coarser codomain = cheaper index + mandatory second hop.
Owner codomain: what fences the route? (nothing — so what does, downstream?)
Does the route provably include ALL relevant data, or just the common case?

Axis 4 — Maintenance Coupling (the cache confession) #

synchronous:     index updated in the source's transaction —
                 divergence impossible, write latency paid
                 (B-tree in the same txn, LSI)
asynchronous:    divergence is a WHEN, not an if — the lag must be
                 named and observable
                 (GSI, search ingestion pipelines, rollups)
immutable-per-segment: written once with the segment; freshness by
                 generation replacement; merged under GC's
                 publication protocol
                 (Lucene segments, Parquet stats, SSTable blooms)

This axis is cache.md’s freshness ladder, applied:

sync ≈ write-through      async ≈ invalidation/eventual      segment ≈ version-pinned
"deleted doc still visible" = a stale cache entry
"stats not tied to schema version" = key incompleteness (cache.md axis 4)

Interrogation:

Is index update atomic with source update? If not — lag SLO, measured where?
Can the index be rebuilt from source truth? (if the index is the only copy,
  it stopped being an index and became the primary — checkpoint_replay.md)
How is drift DETECTED, not just repaired — sampling, checksums, counts?

Axis 5 — Economics (does it pay for itself?) #

selectivity:        low-selectivity secondary index costs more than it saves
cardinality:        bitmap and label indexes blow up past their regime
write amplification: every maintained index is a tax on every write
coverage:           covering index ends the lookup; partial index shrinks the tax

capacity.md’s unit-economics judgment, per artifact: cost moved from read to write+storage — is the movement profitable under the real query distribution, including its tail?

Interrogation:

What is the hit pattern that amortizes this index — measured or hoped?
What happens at 10× the cardinality? (label churn, term dictionary growth)
Which queries does it NOT cover, and do callers know they're scanning?

Technical Bottleneck: Silent Incompleteness* #

a false negative is invisible at query time.
you cannot see what the index didn't return —
the error announces itself never, or in an incident.

Epistemically distinct from the staleness family: staleness is a claim that WAS true and aged; silent incompleteness is an answer that lies by omission and carries no timestamp to interrogate. False positives degrade performance observably; false negatives corrupt answers undetectably. This asymmetry is WHY axis 1’s ladder exists.

The doc’s failure modes that are this one problem:

incorrect stats drop valid data          route omits relevant shard
bloom false negative (catastrophic)      index points past a missing row
query not covered but caller assumes it  deleted-doc bitset wrong way

Known recipes (bounded, none universal):

one-sided by construction    a Bloom filter CANNOT produce a false
                             negative — the math forbids it; that is why
                             it is the canonical pruning primitive.
                             prefer structures whose dangerous error is
                             impossible, not merely rare (flagship)
conservative pruning         skip only what is PROVABLY impossible —
                             min/max semantics must be sound under nulls,
                             NaN, collation, schema version
mandatory verification       on the candidate rung, verify-against-truth
                             is a protocol step, not an optimization
rebuildability               source truth survives independently;
                             a suspect index can be dropped and rebuilt
drift detection              sample/checksum index against source —
                             hunt the invisible error on a schedule

A strong design says explicitly:

what the index maps from and to,
which answer contract it honors (named rung),
how it is maintained and how stale it may be,
how candidates are verified against truth,
and how a silent omission would ever be noticed.

Index As Protocol (the crossing-point spec — keep) #

extract indexable fields from source
insert/update/delete entries (per axis-4 coupling)
publish index generation/segment (GC's atomic-publish discipline)
lookup predicate in index
return candidate locations/IDs
verify/filter candidates against truth   <- load-bearing on the candidate rung
merge results across shards/segments
repair/rebuild on detected drift

LSM/SSTable lookup (a contract stack in four lines):

check bloom filter            (candidate: one-sided, no false negatives)
sparse block index -> seek    (candidate: block granularity)
scan block for key            (exact: verification step)
merge newer sources by seqno  (freshness: axis 4)

Lucene search:

analyze terms -> term dictionary -> postings
intersect/union postings -> score -> merge top-k across segments
(analyzer mismatch = asking a different question than you indexed)

Analytical pruning:

manifest/segment stats -> evaluate predicate against min/max/null
skip impossible units -> scan remainder -> filter exact rows
(pruning is candidate-rung: stats prune, they never authorize)

Named Configurations (lookup table) #

Vector = {contract, shape, codomain, coupling, economics}.

NameVectorCanonical study objectSignature failure
Primary indexexact, point/range, row, sync, corePostgres B-tree; RocksDB lookuphot key; dangling pointer; poor key choice
Secondary indexexact-ish, point/range, row-ref, async (GSI) or sync (LSI), selectivity-pricedDynamoDB GSIindex-source divergence; low-selectivity waste; global hotspot
Sparse indexcandidate, range, block, segment-immutable, ~freeSSTable block indexwrong boundary; too sparse = scan, too dense = memory
Inverted indexexact-per-analyzer, containment, row (postings), segment, dictionary-sizedLucene segmentsanalyzer ≠ equality; deleted doc visible; stale ranking stats
Bitmap indexexact, set-combination, row, segment, cardinality-boundedDruid/Roaring bitmapshigh-cardinality blowup; row-ID remap; segment misalignment
Range indexexact, range/order, row, sync, coreB-tree scan; BRINover-broad scan; sort mismatch; hot range
Bloom filtercandidate, one-sided, point membership, block/segment, immutable, near-freeRocksDB bloomsFP rate erodes benefit; mis-tuned size/hashes; (FN impossible by construction)
Zone maps/statscandidate, range, file/row-group, written-once, ~freeParquet stats + pushdownunsound stats drop valid data*; unsorted data prunes nothing; schema drift
Vector indexapproximate, similarity, row, staleness-prone, memory-heavyHNSWrecall too low; stale after updates; wrong metric; embedding version skew
Aggregate indexexact-per-grain, group-by, precomputed cell, async, grain-pricedPinot StarTree; rollupswrong grain; non-additive metric; stale rollup; uncovered query scans
Routing indexcandidate, key->owner, owner, watch-updated, —Kafka partitioner; ES shard routingstale route; hot shard; route ≠ ownership (split-brain); omitted shard*
Label indexexact, label-match, series ID, sync-ish, cardinality-boundedPrometheus TSDBcardinality explosion; churn; fan-out too broad
Geo indexcandidate, spatial, cell->objects, —, precision-pricedS2 / H3boundary FPs; hot cells; wrong precision; normalization bug
Time indexcandidate, temporal, offset/block, segment, ~freeKafka time indexevent vs ingest time; late data outside block; clock skew

Vocabulary #

key  predicate  candidate set  posting list  bitmap
row ID  doc ID  block  segment  boundary  zone map
false positive  false negative  one-sided  recall
selectivity  cardinality  coverage  covering index
analyzer  term dictionary  grain  rollup
write amplification  maintenance  rebuild  drift
generation  segment publish  pruning  pushdown

Deep Lesson #

Index bugs come from confusing pairs on different axes:

candidate            vs  truth                (axis 1: verification is load-bearing)
index freshness      vs  source freshness     (axis 4: an index is a cache)
low-cardinality      vs  high-cardinality     (axis 5: structures have regimes)
routing index        vs  ownership proof      (axis 3: advisory ≠ fenced — scheduler.md)
analyzer             vs  exact equality       (axis 2: you indexed a different question)
approximate search   vs  exact recall         (axis 1: recall is declared, not assumed)
statistics pruning   vs  authoritative filter (axis 1 + bottleneck*: stats prune,
                                               never authorize — and their unsoundness
                                               is the silent kind)

Design procedure: name the answer contract, match the predicate shape, choose the codomain and its second hop, couple maintenance deliberately and measure the lag, run the unit-economics check — then ask the bottleneck’s question: if this index silently omitted a result, how would anyone ever know? The named types are recognition shortcuts, not the design space.