Partition / Shard #
partition = subset of a larger space
shard = independently placed/owned partition
It answers:
which slice owns this key/work/query/state?
Role in the catalog: the operator block. Every block so far has been a component or a protocol; this one is a modular operator wearing an infrastructure costume — partitioning is Splitting, applied to state. So the axes are the standard Splitting questions: what is the cleaving key, which invariants stay local, what crosses the cut, and how do the modules move. And the classic confusions are all one category error:
mistaking the split for the property you wanted the split to deliver.
tenant partitioning ≠ tenant isolation (a mechanism ≠ a concern —
boundary.md's matrix, verbatim)
even key distribution ≠ even load (cardinality balance ≠ cost balance)
per-partition order ≠ global order (local invariants ≠ global ones)
Central tension:
parallelism and scale
vs
skew, cross-shard coordination, and rebalancing complexity
Design Axes (the core module) #
Axis 1 — The Key Scheme (the structural cleave) #
hash: uniform spread, order destroyed — range queries become fanouts
(Kafka partitioner, Dynamo ring, Redis slots)
range: order preserved, splittable/mergeable ADAPTIVELY —
and hot-range prone (Bigtable tablets, HBase regions, TiKV)
time: range's special case where the key is MONOTONIC —
which is WHY the newest partition is always hot (every
writer holds the same key prefix), and WHY retention gets
cheap (GC by dropping whole partitions — garbage_collection's
decree proof at its most efficient)
directory: explicit map (tenant → shard) — arbitrary assignment,
movable one entry at a time; the price is that the map is
now a component with its own availability, freshness
(routing.md axis 3), and consistency story
The hash-vs-range exam question, honestly stated:
do you need the order, and can you afford the fanout?
Interrogation:
Which queries need order or ranges? (they vote range/time)
Which need uniform write spread? (they vote hash)
Who serves the directory, and what happens when it's stale or down?
Can the scheme change later? (Iceberg partition evolution says yes,
with metadata; most systems say no, with a migration)
Axis 2 — What Invariant Is Local to the Slice (the load-bearing axis) #
A partition is precisely a SCOPE OF CHEAP INVARIANTS:
order within a Kafka partition (log.md axis 3: per-partition scope)
transactions within a Spanner group, before 2PC engages
uniqueness within a range
aggregation within a segment
The design move is aligning the partition key with the invariant’s scope. The failure is the cross-slice operation — where every hard problem in the catalog re-enters at once:
cross-shard transaction → quorum.md's transaction-commit row
(2PC-over-groups; the Unknown state returns)
scatter/gather → silent omission* (index_structures.md): a missed shard
is an answer that lies, invisibly
fanout cost → capacity.md: the query's unit economics
multiply by shard count
cross-slice ordering → doesn't exist; buy it back with sequencing
or don't assume it
Interrogation:
List the invariants the application needs. Which are local under THIS key?
Every cross-slice operation: enumerated, priced, and does anyone believe
a partial scatter/gather result is complete?
The one-sentence design test: does the expensive invariant land INSIDE
a slice?
Axis 3 — Placement and Ownership #
a partition is a UNIT; a shard is a partition PLACED and OWNED:
placement → scheduler.md (bin-pack/spread objectives; failure-domain
spread is quorum/replication's placement bet)
ownership → lease_fencing.md (generation-fenced, or split-brain)
replica-partition = partition × replication.md axis 4 — a partition
with a replica set, leader, and ISR is three blocks
composed, not a new type
cell = this axis at boundary.md's blast-radius extreme:
a shard whose isolation concern is failure, dialed to
"shares nothing, not even the control plane"
Interrogation:
Who owns each slice, fenced how? (the zombie-owner question, always)
Are a slice's replicas spread across the failure domains that matter?
For cells: what is still shared, and does it break the blast-radius claim?
Axis 4 — Movement (split, merge, rebalance) #
Resharding is ownership handoff — the dial’s SEVENTH appearance:
gap: keys unowned mid-move (requests fail or queue)
overlap: two owners briefly — fenced by generation, made safe by
idempotent application
plus routing-metadata freshness ( routing.md axis 3: the shard map is routing state on the cache ladder), plus — for hash schemes — the movement-minimization math:
consistent hashing's entire contribution: membership change moves
~1/N of keys, not all of them. virtual nodes smooth the remainder.
Range schemes move differently: split under load, merge under idleness — adaptive, with SPLIT STORM as the signature failure (a hot range split recursively while the hotspot outruns the splitter).
Interrogation:
Gap or overlap during a move — chosen, fenced, written down?
How much data moves per membership change, and who pays the bandwidth?
What triggers a split — size, load, or an operator at 3am?
Does in-flight work survive the move? (work partitions: checkpoint +
generation, checkpoint_replay.md's stamped coordinate)
Axis 5 — Purposes Riding Along (arrows, deliberately) #
Tenant, geo/residency, blast-radius are boundary.md CONCERNS compiled onto the partition MECHANISM — and the compilation-strength question is boundary.md’s matrix, referenced not restated:
tenant-per-shard isolation at whatever strength the shard's
enforcement actually provides (a tenant ID in the
key is convention-strength; a cell is physical)
geo partition residency (a legal concern) + latency (a locality
concern) riding one placement decision — and a
user who moves is a tenant-move problem with
a passport
whale tenants outgrow their shard: the directory scheme's reason
to exist (move ONE entry), and skew*'s social form
Technical Bottleneck: Skew* #
uniform hashing solves the CARDINALITY problem
and leaves the COST problem untouched.
keys are not equal: one celebrity, one whale tenant, one current-hour
partition can exceed the aggregate of everything else.
Essential — every partitioned system meets it — and the recipes are all partial, and all live OUTSIDE the partitioner:
salting / key-splitting split the hot key's writes across sub-keys,
pay a read-side merge. this is the data-model
change routing.md said wasn't an LB knob —
that arrow lands here.
range-split under load the adaptive answer (Bigtable's); split storm
is its failure mode
two-level schemes hash within range, tenant then hash —
compound keys that give each force its own level
heat-based rebalancing move by OBSERVED COST, not key count —
which requires per-slice cost metering:
capacity.md's denomination lesson (the
partitioner counts keys; the bill counts work)
structural acceptance for time partitions, the hot-now partition is
STRUCTURAL: scale the write path of "now"
differently from the cold tail, rather than
pretending a key scheme will fix monotonicity
The star’s one-line form is the deep lesson’s first row, promoted:
even key distribution is not even load —
the partitioner balances what it can count,
and cost is not what it counts.
A strong design says explicitly:
what key defines the slice (axis 1),
which invariants are local and what crosses (axis 2),
who owns and places each slice, fenced how (axis 3),
how ownership moves — gap or overlap, and how much data travels (axis 4),
which boundary concerns ride along, at what strength (axis 5),
and how skew is DETECTED (cost metering) and CORRECTED (a recipe
chosen before the celebrity signs up).
Partition As Protocol (the crossing-point spec — keep) #
choose partition key (axis 1, aligned to axis 2's invariants)
map key/work/time/tenant → partition
route to owner/replica (routing.md; freshness rung named)
split/merge as size/load changes (axis 4)
move ownership: fence old, admit new (lease_fencing.md; the dial)
update routing metadata (atomically enough — GC's publication discipline)
repair/retry during movement (retry_idempotency.md: idempotent under overlap)
preserve the local invariants across all of the above (axis 2, the point)
Kafka instantiation:
producer picks partition by key hash (axis 1)
leader owns append order per partition (axis 2's local invariant)
consumer group assigns partitions (lease_fencing: generation)
offsets track progress per partition (checkpoint_replay: the coordinate)
rebalance moves ownership (the dial: eager=gap, cooperative=overlap)
Range-shard instantiation (Bigtable/HBase/TiKV):
key → range → owner serves
range hotspots or grows → split (skew*'s adaptive recipe)
metadata updates routing (the directory's freshness rung)
old/new owners coordinate handoff (fenced; gap-or-overlap chosen)
Named Configurations (lookup table) #
Vector = {key scheme, local invariant, ownership, movement, riding concerns}. Rows marked → are owned elsewhere.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Hash partition | hash, none-but-locality, ring membership, ~1/N moves, — | Dynamo ring + vnodes | hot key*; churn movement; cross-key query = fanout |
| Range partition | range, order + range scans, per-range owner, split/merge, — | Bigtable tablets; TiKV regions | hot range; split storm; unbalanced ranges |
| Time partition | monotonic range, window locality, —, drop-by-retention, event-time ( materialized.md) | Prometheus blocks; Iceberg time parts | hot-now is structural*; late data (watermarks); small-file explosion |
| Tenant partition | directory, per-tenant everything, movable per entry, one-entry moves, boundary matrix | SaaS tenant-shard map | tenant ID missing from query (convention-strength!); whale outgrows shard |
| Stream partition → log.md, queue.md | hash, per-partition order, generation-fenced consumers, rebalance dial, — | Kafka topic partition | global order assumed; too few partitions caps consumers; rebalance duplicates |
| Replica partition → replication.md | any × replica set, + durability, leader + ISR, —, failure-domain spread | Kafka ISR; Spanner groups | same-domain replicas; stale-replica promotion (promotion*) |
| Work partition → scheduler.md, checkpoint_replay.md | splits/frontier, completion state, leased, reassignment, — | MapReduce splits; Flink splits | straggler; duplicate after reassignment (stamped checkpoints); uneven sizes |
| Query partition → index_structures.md, capacity.md | scan splits, fragment locality, coordinator-assigned, —, — | Trino splits; ES fanout | skew*/straggler; partial-as-complete (silent omission*); tiny-split overhead |
| Geo partition | directory + range, residency + latency, regional owners, user-move, residency (boundary.md) | Spanner placement; Global Tables | cross-region txn latency; wrong residency; failover conflict ( replication.md axis 1) |
| Cell partition | directory, blast radius, per-cell everything, tenant migration, failure boundary | cell architectures | shared control plane breaks the claim; cell imbalance; migration pain |
| Hybrid | compound key, per-level forces, —, —, — | Iceberg partition evolution | wrong key ORDER (prune-ability dies); too many partitions; complex rebalance |
Vocabulary #
partition key shard key compound key key order
range bucket slot token tablet region segment
split merge split storm rebalance reshard
directory shard map routing metadata
hot key hot partition whale tenant skew heat
salting key splitting virtual node 1/N movement
fanout scatter/gather partial result
local invariant cross-shard cell blast radius
Deep Lesson #
Partitioning bugs come from confusing pairs on different axes:
even key distribution vs even load (skew*: cardinality ≠ cost)
partition vs replica (axis 3: a slice ≠ its copies — replication.md)
shard ownership vs routing freshness (axis 3 vs 4: the owner moved; the map didn't — routing.md)
per-partition order vs global order (axis 2: local invariant, assumed global — log.md)
tenant partitioning vs tenant isolation (axis 5: mechanism ≠ concern — boundary.md's matrix)
time partitioning vs event-time correctness (axis 1: storage layout ≠ watermark discipline — materialized.md)
hash partitioning vs range-query support (axis 1: the fanout you signed up for)
Design procedure: list the invariants first, choose the key so the expensive ones land inside slices, price every cross-slice operation, fence the ownership and choose gap-or-overlap for moves, name the strength of every concern riding the mechanism — and meter cost per slice from day one, because the celebrity, the whale, and the current hour are already on their way. The named types are recognition shortcuts, not the design space.