Skip to main content
  1. concepts/

Partition / Shard #

partition = subset of a larger space
shard     = independently placed/owned partition

It answers:

which slice owns this key/work/query/state?

Role in the catalog: the operator block. Every block so far has been a component or a protocol; this one is a modular operator wearing an infrastructure costume — partitioning is Splitting, applied to state. So the axes are the standard Splitting questions: what is the cleaving key, which invariants stay local, what crosses the cut, and how do the modules move. And the classic confusions are all one category error:

mistaking the split for the property you wanted the split to deliver.
tenant partitioning ≠ tenant isolation   (a mechanism ≠ a concern —
                                          boundary.md's matrix, verbatim)
even key distribution ≠ even load        (cardinality balance ≠ cost balance)
per-partition order ≠ global order       (local invariants ≠ global ones)

Central tension:

parallelism and scale
        vs
skew, cross-shard coordination, and rebalancing complexity

Design Axes (the core module) #

Axis 1 — The Key Scheme (the structural cleave) #

hash:       uniform spread, order destroyed — range queries become fanouts
            (Kafka partitioner, Dynamo ring, Redis slots)
range:      order preserved, splittable/mergeable ADAPTIVELY —
            and hot-range prone (Bigtable tablets, HBase regions, TiKV)
time:       range's special case where the key is MONOTONIC —
            which is WHY the newest partition is always hot (every
            writer holds the same key prefix), and WHY retention gets
            cheap (GC by dropping whole partitions — garbage_collection's
            decree proof at its most efficient)
directory:  explicit map (tenant → shard) — arbitrary assignment,
            movable one entry at a time; the price is that the map is
            now a component with its own availability, freshness
            (routing.md axis 3), and consistency story

The hash-vs-range exam question, honestly stated:

do you need the order, and can you afford the fanout?

Interrogation:

Which queries need order or ranges? (they vote range/time)
Which need uniform write spread? (they vote hash)
Who serves the directory, and what happens when it's stale or down?
Can the scheme change later? (Iceberg partition evolution says yes,
  with metadata; most systems say no, with a migration)

Axis 2 — What Invariant Is Local to the Slice (the load-bearing axis) #

A partition is precisely a SCOPE OF CHEAP INVARIANTS:

order        within a Kafka partition (log.md axis 3: per-partition scope)
transactions within a Spanner group, before 2PC engages
uniqueness   within a range
aggregation  within a segment

The design move is aligning the partition key with the invariant’s scope. The failure is the cross-slice operation — where every hard problem in the catalog re-enters at once:

cross-shard transaction   → quorum.md's transaction-commit row
                            (2PC-over-groups; the Unknown state returns)
scatter/gather            → silent omission* (index_structures.md): a missed shard
                            is an answer that lies, invisibly
fanout cost               → capacity.md: the query's unit economics
                            multiply by shard count
cross-slice ordering      → doesn't exist; buy it back with sequencing
                            or don't assume it

Interrogation:

List the invariants the application needs. Which are local under THIS key?
Every cross-slice operation: enumerated, priced, and does anyone believe
  a partial scatter/gather result is complete?
The one-sentence design test: does the expensive invariant land INSIDE
  a slice?

Axis 3 — Placement and Ownership #

a partition is a UNIT; a shard is a partition PLACED and OWNED:
  placement  → scheduler.md (bin-pack/spread objectives; failure-domain
               spread is quorum/replication's placement bet)
  ownership  → lease_fencing.md (generation-fenced, or split-brain)
  replica-partition = partition × replication.md axis 4 — a partition
               with a replica set, leader, and ISR is three blocks
               composed, not a new type
  cell       = this axis at boundary.md's blast-radius extreme:
               a shard whose isolation concern is failure, dialed to
               "shares nothing, not even the control plane"

Interrogation:

Who owns each slice, fenced how? (the zombie-owner question, always)
Are a slice's replicas spread across the failure domains that matter?
For cells: what is still shared, and does it break the blast-radius claim?

Axis 4 — Movement (split, merge, rebalance) #

Resharding is ownership handoff — the dial’s SEVENTH appearance:

gap:      keys unowned mid-move (requests fail or queue)
overlap:  two owners briefly — fenced by generation, made safe by
          idempotent application

plus routing-metadata freshness ( routing.md axis 3: the shard map is routing state on the cache ladder), plus — for hash schemes — the movement-minimization math:

consistent hashing's entire contribution: membership change moves
~1/N of keys, not all of them. virtual nodes smooth the remainder.

Range schemes move differently: split under load, merge under idleness — adaptive, with SPLIT STORM as the signature failure (a hot range split recursively while the hotspot outruns the splitter).

Interrogation:

Gap or overlap during a move — chosen, fenced, written down?
How much data moves per membership change, and who pays the bandwidth?
What triggers a split — size, load, or an operator at 3am?
Does in-flight work survive the move? (work partitions: checkpoint +
  generation, checkpoint_replay.md's stamped coordinate)

Axis 5 — Purposes Riding Along (arrows, deliberately) #

Tenant, geo/residency, blast-radius are boundary.md CONCERNS compiled onto the partition MECHANISM — and the compilation-strength question is boundary.md’s matrix, referenced not restated:

tenant-per-shard    isolation at whatever strength the shard's
                    enforcement actually provides (a tenant ID in the
                    key is convention-strength; a cell is physical)
geo partition       residency (a legal concern) + latency (a locality
                    concern) riding one placement decision — and a
                    user who moves is a tenant-move problem with
                    a passport
whale tenants       outgrow their shard: the directory scheme's reason
                    to exist (move ONE entry), and skew*'s social form

Technical Bottleneck: Skew* #

uniform hashing solves the CARDINALITY problem
and leaves the COST problem untouched.
keys are not equal: one celebrity, one whale tenant, one current-hour
partition can exceed the aggregate of everything else.

Essential — every partitioned system meets it — and the recipes are all partial, and all live OUTSIDE the partitioner:

salting / key-splitting    split the hot key's writes across sub-keys,
                           pay a read-side merge. this is the data-model
                           change routing.md said wasn't an LB knob —
                           that arrow lands here.
range-split under load     the adaptive answer (Bigtable's); split storm
                           is its failure mode
two-level schemes          hash within range, tenant then hash —
                           compound keys that give each force its own level
heat-based rebalancing     move by OBSERVED COST, not key count —
                           which requires per-slice cost metering:
                           capacity.md's denomination lesson (the
                           partitioner counts keys; the bill counts work)
structural acceptance      for time partitions, the hot-now partition is
                           STRUCTURAL: scale the write path of "now"
                           differently from the cold tail, rather than
                           pretending a key scheme will fix monotonicity

The star’s one-line form is the deep lesson’s first row, promoted:

even key distribution is not even load —
the partitioner balances what it can count,
and cost is not what it counts.

A strong design says explicitly:

what key defines the slice (axis 1),
which invariants are local and what crosses (axis 2),
who owns and places each slice, fenced how (axis 3),
how ownership moves — gap or overlap, and how much data travels (axis 4),
which boundary concerns ride along, at what strength (axis 5),
and how skew is DETECTED (cost metering) and CORRECTED (a recipe
chosen before the celebrity signs up).

Partition As Protocol (the crossing-point spec — keep) #

choose partition key (axis 1, aligned to axis 2's invariants)
map key/work/time/tenant → partition
route to owner/replica (routing.md; freshness rung named)
split/merge as size/load changes (axis 4)
move ownership: fence old, admit new (lease_fencing.md; the dial)
update routing metadata (atomically enough — GC's publication discipline)
repair/retry during movement (retry_idempotency.md: idempotent under overlap)
preserve the local invariants across all of the above (axis 2, the point)

Kafka instantiation:

producer picks partition by key hash (axis 1)
leader owns append order per partition (axis 2's local invariant)
consumer group assigns partitions (lease_fencing: generation)
offsets track progress per partition (checkpoint_replay: the coordinate)
rebalance moves ownership (the dial: eager=gap, cooperative=overlap)

Range-shard instantiation (Bigtable/HBase/TiKV):

key → range → owner serves
range hotspots or grows → split (skew*'s adaptive recipe)
metadata updates routing (the directory's freshness rung)
old/new owners coordinate handoff (fenced; gap-or-overlap chosen)

Named Configurations (lookup table) #

Vector = {key scheme, local invariant, ownership, movement, riding concerns}. Rows marked → are owned elsewhere.

NameVectorCanonical study objectSignature failure
Hash partitionhash, none-but-locality, ring membership, ~1/N moves, —Dynamo ring + vnodeshot key*; churn movement; cross-key query = fanout
Range partitionrange, order + range scans, per-range owner, split/merge, —Bigtable tablets; TiKV regionshot range; split storm; unbalanced ranges
Time partitionmonotonic range, window locality, —, drop-by-retention, event-time ( materialized.md)Prometheus blocks; Iceberg time partshot-now is structural*; late data (watermarks); small-file explosion
Tenant partitiondirectory, per-tenant everything, movable per entry, one-entry moves, boundary matrixSaaS tenant-shard maptenant ID missing from query (convention-strength!); whale outgrows shard
Stream partition → log.md, queue.mdhash, per-partition order, generation-fenced consumers, rebalance dial, —Kafka topic partitionglobal order assumed; too few partitions caps consumers; rebalance duplicates
Replica partition → replication.mdany × replica set, + durability, leader + ISR, —, failure-domain spreadKafka ISR; Spanner groupssame-domain replicas; stale-replica promotion (promotion*)
Work partition → scheduler.md, checkpoint_replay.mdsplits/frontier, completion state, leased, reassignment, —MapReduce splits; Flink splitsstraggler; duplicate after reassignment (stamped checkpoints); uneven sizes
Query partition → index_structures.md, capacity.mdscan splits, fragment locality, coordinator-assigned, —, —Trino splits; ES fanoutskew*/straggler; partial-as-complete (silent omission*); tiny-split overhead
Geo partitiondirectory + range, residency + latency, regional owners, user-move, residency (boundary.md)Spanner placement; Global Tablescross-region txn latency; wrong residency; failover conflict ( replication.md axis 1)
Cell partitiondirectory, blast radius, per-cell everything, tenant migration, failure boundarycell architecturesshared control plane breaks the claim; cell imbalance; migration pain
Hybridcompound key, per-level forces, —, —, —Iceberg partition evolutionwrong key ORDER (prune-ability dies); too many partitions; complex rebalance

Vocabulary #

partition key  shard key  compound key  key order
range  bucket  slot  token  tablet  region  segment
split  merge  split storm  rebalance  reshard
directory  shard map  routing metadata
hot key  hot partition  whale tenant  skew  heat
salting  key splitting  virtual node  1/N movement
fanout  scatter/gather  partial result
local invariant  cross-shard  cell  blast radius

Deep Lesson #

Partitioning bugs come from confusing pairs on different axes:

even key distribution  vs  even load            (skew*: cardinality ≠ cost)
partition              vs  replica              (axis 3: a slice ≠ its copies — replication.md)
shard ownership        vs  routing freshness    (axis 3 vs 4: the owner moved; the map didn't — routing.md)
per-partition order    vs  global order         (axis 2: local invariant, assumed global — log.md)
tenant partitioning    vs  tenant isolation     (axis 5: mechanism ≠ concern — boundary.md's matrix)
time partitioning      vs  event-time correctness (axis 1: storage layout ≠ watermark discipline — materialized.md)
hash partitioning      vs  range-query support  (axis 1: the fanout you signed up for)

Design procedure: list the invariants first, choose the key so the expensive ones land inside slices, price every cross-slice operation, fence the ownership and choose gap-or-overlap for moves, name the strength of every concern riding the mechanism — and meter cost per slice from day one, because the celebrity, the whale, and the current hour are already on their way. The named types are recognition shortcuts, not the design space.