Bigtable Process And Storage Traces #

Source: bigtable.pdf

Archetype Read #

Bigtable is primarily a storage system.

Dominant archetypes:

  • I19 Chunk / Block / File Storage as substrate support via GFS + SSTables
  • I16 Replicated KV flavor for row-keyed authoritative serving state
  • I06 Projection / Index / Search flavor in the metadata hierarchy and derived serving structure

Supporting process shapes:

  • membership view for tablet servers via Chubby
  • allocation / assignment for tablets to tablet servers
  • batch for compaction
  • reconcile target for master reassigning unserved tablets

Main judgment:

  • Bigtable is storage-primary
  • the main truth objects are rows, tablets, memtables, SSTables, commit logs, and tablet metadata
  • process mostly exists to keep tablets assigned, recoverable, and compacted

1. Storage Trace: Tablet Write Path #

This is the main Bigtable storage trace.

Core lines #

  • commit_log_sequence
  • memtable_state
  • minor_compaction_output
  • reader_visible_state
  • writer_trying_mutation

Reusable frame mapping #

  • truth locus -> commit log + memtable + published SSTables for the tablet
  • replica/derived state -> memtable_state, minor_compaction_output
  • durability boundary -> commit_log_sequence
  • visibility -> reader_visible_state
  • actor trying to write/read/apply -> writer_trying_mutation

Trace #

time →

commit_log_sequence      L100 ------------ L101 ------------- L101 ------------- L101
memtable_state           M100 ------------ M101 dirty ------- M101 frozen ------ empty/new
minor_compaction_output  S5 -------------- S5 --------------- S6 building ------ S6 published
reader_visible_state     sees S5+M100 ---- sees S5+M101 ----- sees S5+M101 ----- sees S5+S6
writer_trying_mutation   none ------------ append+apply ----- none ------------- none

What it teaches #

  • a mutation is appended to the commit log first and only then applied to the memtable
  • visibility may come from SSTables + memtable together
  • minor compaction converts mutable in-memory state into immutable SSTable state

State loci #

  • authoritative:
    • commit log (persisted in GFS) + current tablet state held by the serving tablet server
  • local/replica:
    • tablet server memtable for the tablet
  • cached/derived:
    • newly built SSTable output before full publication
  • repair:
    • recovery by replaying commit log or flushing via compaction

Core invariant #

  • a write must not become logically durable unless it is recorded in the commit log, and reads must merge memtable + SSTables consistently for the row
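The ordering in this trace can be sketched in a few lines. This is a minimal toy model, not Bigtable's implementation: `ToyTablet`, `redo_point`, and the dict-based SSTables are hypothetical names chosen for illustration, and the real commit log lives in GFS rather than in a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTablet:
    """Hypothetical single-tablet model: commit log, memtable, SSTables."""
    commit_log: list = field(default_factory=list)  # (seq, row, value) redo records
    memtable: dict = field(default_factory=dict)    # mutable in-memory state
    sstables: list = field(default_factory=list)    # treated as immutable, oldest first
    next_seq: int = 101
    redo_point: int = 0  # index of the first log record not yet covered by an SSTable

    def write(self, row, value):
        # Durability boundary: append the redo record before anything else.
        seq = self.next_seq
        self.commit_log.append((seq, row, value))
        self.next_seq += 1
        # Visibility: apply to the memtable only after the log append.
        self.memtable[row] = value
        return seq

    def minor_compaction(self):
        # Freeze the memtable and publish it as a new SSTable.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
        # Everything logged so far is now covered by SSTables.
        self.redo_point = len(self.commit_log)

    def read(self, row):
        # Newest source wins: memtable first, then SSTables newest to oldest.
        if row in self.memtable:
            return self.memtable[row]
        for sstable in reversed(self.sstables):
            if row in sstable:
                return sstable[row]
        return None

    def recover_memtable(self):
        # Rebuild lost in-memory state by replaying the log from the redo point.
        self.memtable = {}
        for _seq, row, value in self.commit_log[self.redo_point:]:
            self.memtable[row] = value
```

Losing the memtable after a write and calling `recover_memtable()` restores the same read result, which is the "repair" locus listed above.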

2. Storage Trace: Read Path Across Memtable And SSTables #

A Bigtable read is served from a merged view of the memtable and the tablet's SSTables.

Core lines #

  • memtable_version
  • sstable_set
  • timestamp_order_view
  • reader_visible_cell_versions
  • reader_trying_lookup

Reusable frame mapping #

  • truth locus -> merged row/cell state across memtable and SSTables
  • replica/derived state -> memtable_version, sstable_set
  • durability/currentness boundary -> timestamp_order_view
  • visibility -> reader_visible_cell_versions
  • actor trying to write/read/apply -> reader_trying_lookup

Trace #

time →

memtable_version             in-mem t9 ------- in-mem t10 ------ flushed
sstable_set                  {S1,S2} --------- {S1,S2} --------- {S1,S2,S3}
timestamp_order_view         t9,t6,t5 -------- t10,t9,t6 ------- t10,t9,t6
reader_visible_cell_versions latest=t9 ------- latest=t10 ------ latest=t10
reader_trying_lookup         get(row,col) ---- get(row,col) ---- get(row,col)

What it teaches #

  • the latest visible value may be in memory, not yet in SSTables
  • Bigtable’s storage truth is multi-versioned by timestamp
  • read visibility is a merge over recent in-memory and older immutable disk state

State loci #

  • authoritative:
    • serving tablet’s merged row state
  • local/replica:
    • memtable and SSTable readers on the tablet server
  • cached/derived:
    • block cache / in-memory SSTable indexes / client read result
  • repair:
    • compaction producing a cleaner merged disk view over time

Core invariant #

  • readers must observe the correct newest cell version by merging memtable and SSTables in timestamp order
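The merge in this trace can be illustrated with a small sketch. It assumes each source (the memtable and every SSTable) is a dict from `(row, col)` to a list of `(timestamp, value)` pairs already sorted newest first. That layout and the function names are assumptions made for the example, not Bigtable's on-disk format.

```python
import heapq


def merged_versions(memtable, sstables, row, col):
    """Return all (timestamp, value) versions of one cell, newest first."""
    sources = [memtable] + list(sstables)
    # Each run is already sorted by descending timestamp.
    runs = [source.get((row, col), []) for source in sources]
    # heapq.merge keeps the combined stream in descending timestamp order.
    return list(heapq.merge(*runs, key=lambda tv: tv[0], reverse=True))


def get_latest(memtable, sstables, row, col):
    """Return the newest (timestamp, value) for the cell, or None."""
    versions = merged_versions(memtable, sstables, row, col)
    return versions[0] if versions else None
```

With `t10` and `t9` in the memtable and `t6`, `t5` in two SSTables, `get_latest` returns the `t10` version even though no SSTable contains it yet.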

3. Storage Trace: Tablet Metadata Hierarchy #

This is the core Bigtable location-storage trace.

Core lines #

  • chubby_root_pointer
  • root_tablet_location
  • metadata_tablet_mapping
  • client_cached_location
  • client_trying_lookup_tablet

Reusable frame mapping #

  • truth locus -> chubby_root_pointer + root_tablet_location + metadata_tablet_mapping
  • replica/derived state -> client_cached_location
  • freshness boundary -> cached tablet-location validity
  • visibility -> client-resolved tablet location
  • actor trying to write/read/apply -> client_trying_lookup_tablet

Trace #

time →

chubby_root_pointer         R1 -------------- R1 ---------------- R1
root_tablet_location        TS7 ------------- TS7 --------------- TS9
metadata_tablet_mapping     T[a-m]->TS3 ----- T[a-m]->TS4 ------- T[a-m]->TS4
client_cached_location      TS3 ------------- stale TS3 --------- refreshed TS4
client_trying_lookup_tablet locate ---------- direct request ---- retry via metadata

What it teaches #

  • tablet location truth is hierarchical
  • client visibility is often a cached derived view
  • stale location cache is normal and repaired by walking back up the hierarchy

State loci #

  • authoritative:
    • Chubby root pointer + root tablet + METADATA tablets
  • local/replica:
    • tablet servers hosting root/METADATA/user tablets
  • cached/derived:
    • client cached location entries
  • repair:
    • client fallback up the hierarchy when cache is stale

Core invariant #

  • a client must be able to recover from stale cached tablet location by consulting the authoritative location hierarchy
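A minimal sketch of the cache-then-fallback behavior follows. It collapses the Chubby, root tablet, and METADATA walk into a single hypothetical `lookup_metadata` callable, and `StaleLocation`, `LocationCachingClient`, and `send` are likewise illustrative names rather than anything from the Bigtable client library.

```python
class StaleLocation(Exception):
    """Raised by a server that no longer hosts the requested tablet."""


class LocationCachingClient:
    def __init__(self, lookup_metadata, send):
        self.cache = {}                          # tablet -> server (derived view)
        self.lookup_metadata = lookup_metadata   # authoritative lookup (assumed)
        self.send = send                         # (server, tablet, request) -> response

    def request(self, tablet, req):
        server = self.cache.get(tablet)
        if server is None:
            # Cold cache: resolve through the authoritative hierarchy.
            server = self.cache[tablet] = self.lookup_metadata(tablet)
        try:
            return self.send(server, tablet, req)
        except StaleLocation:
            # Stale cache: refresh from the hierarchy and retry once.
            server = self.cache[tablet] = self.lookup_metadata(tablet)
            return self.send(server, tablet, req)
```

A single retry keeps the sketch short; a fuller client would walk further up the hierarchy if the refreshed entry is stale too.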

4. Storage Trace: Tablet Split #

Tablet split is a storage topology change.

Core lines #

  • parent_tablet_range
  • metadata_registration_state
  • child_tablets_visible
  • sstable_sharing_state
  • actor_trying_split_commit

Reusable frame mapping #

  • truth locus -> metadata_registration_state
  • replica/derived state -> child_tablets_visible, sstable_sharing_state
  • durability boundary -> committed child metadata entries
  • visibility -> authoritative child tablet publication
  • actor trying to write/read/apply -> actor_trying_split_commit

Trace #

time →

parent_tablet_range         [a,z) -------------- [a,z) splitting ---------- retired
metadata_registration_state parent only -------- child entries written ---- children authoritative
child_tablets_visible       none --------------- [a,m),[m,z) pending ------ [a,m),[m,z)
sstable_sharing_state       parent SSTables ---- shared by children ------- shared/compacted later
actor_trying_split_commit   none --------------- tablet server ------------ master notices

What it teaches #

  • split commits through metadata
  • children can initially share immutable SSTables of the parent
  • storage topology changes are metadata-first, data-light

State loci #

  • authoritative:
    • METADATA entries for parent/child tablets
  • local/replica:
    • splitting tablet server handling the parent tablet
  • cached/derived:
    • shared immutable SSTables referenced by children
  • repair:
    • master noticing/finishing split assignment if notification is lost

Core invariant #

  • child tablets must not become authoritative before their metadata entries are durably committed
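The metadata-first nature of a split can be sketched as a pure function over a metadata map. The half-open `(start, end)` range tuples, the dict-based metadata, and the name `split_tablet` are assumptions for illustration. Returning a fresh dict stands in for an atomic METADATA commit.

```python
def split_tablet(metadata, parent, split_key):
    """Return new metadata where `parent` is replaced by two child ranges.

    `metadata` maps (start, end) ranges to lists of SSTable names.
    """
    start, end = parent
    if not (start < split_key < end):
        raise ValueError("split key must fall strictly inside the parent range")
    parent_files = metadata[parent]
    # Copy every entry except the parent; the caller's dict is left untouched,
    # so nothing becomes visible until the returned mapping is installed.
    new_metadata = {rng: files for rng, files in metadata.items() if rng != parent}
    # Both children initially reference the parent's immutable SSTables.
    new_metadata[(start, split_key)] = list(parent_files)
    new_metadata[(split_key, end)] = list(parent_files)
    return new_metadata
```

No SSTable data is copied or rewritten here, which is the "data-light" property noted above.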

5. Storage Trace: Major Compaction / Garbage Collection #

Bigtable relies on immutable SSTables and cleanup of obsolete files.

Core lines #

  • live_sstable_set
  • obsolete_sstable_set
  • metadata_registered_files
  • gc_sweep_state
  • actor_trying_compact_or_collect

Reusable frame mapping #

  • truth locus -> metadata_registered_files
  • replica/derived state -> live_sstable_set, obsolete_sstable_set
  • durability boundary -> metadata no longer references obsolete files
  • visibility -> live_sstable_set
  • actor trying to write/read/apply -> actor_trying_compact_or_collect

Trace #

time →

live_sstable_set                {S1,S2,S3} ------ {S4} ------------- {S4}
obsolete_sstable_set            none ------------ {S1,S2,S3} ------- deleted
metadata_registered_files       S1,S2,S3 -------- S4 --------------- S4
gc_sweep_state                  idle ------------ mark obsolete ---- sweep done
actor_trying_compact_or_collect compaction ------ master GC -------- none

What it teaches #

  • immutable file replacement gives simple storage visibility
  • metadata registration determines which SSTables are live
  • garbage collection is a storage-control process over obsolete immutable files

State loci #

  • authoritative:
    • METADATA registration of live SSTables
  • local/replica:
    • tablet server local SSTable set and compaction output
  • cached/derived:
    • obsolete SSTable candidate set
  • repair:
    • master mark-and-sweep garbage collection

Core invariant #

  • an SSTable must not be deleted until it is no longer registered as live in authoritative metadata
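The deletion rule reduces to a set difference between the files that exist and the files still registered. The sketch below assumes metadata is a dict from tablet name to a list of SSTable names; `obsolete_sstables` is a hypothetical helper, not a Bigtable API.

```python
def obsolete_sstables(all_files, metadata):
    """Return the SSTables that no tablet's metadata entry references."""
    live = set()
    # Mark: every file registered under any tablet is live.
    for files in metadata.values():
        live.update(files)
    # Sweep candidates: files that exist but were never marked.
    return set(all_files) - live
```

Because children of a split may share parent SSTables, a file stays live as long as any one tablet still references it.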

6. Process Trace: Tablet Assignment #

This is the main Bigtable process trace.

Core lines #

  • tablet_assignment_state
  • assigned_tablet_server
  • assignment_generation
  • server_lock_validity
  • actor_trying_to_serve_or_reassign

Reusable frame mapping #

  • state -> tablet_assignment_state
  • owner/view -> assigned_tablet_server
  • monotonic marker -> assignment_generation
  • validity boundary -> server_lock_validity
  • actor trying to advance -> actor_trying_to_serve_or_reassign

Trace #

time →

tablet_assignment_state           UNASSIGNED ------ ASSIGNED -------- SERVING --------- REASSIGNING
assigned_tablet_server            none ------------ TS3 ------------- TS3 ------------- TS4
assignment_generation             10 -------------- 11 -------------- 11 -------------- 12
server_lock_validity              - --------------- valid ----------- lost/expired ---- valid
actor_trying_to_serve_or_reassign master ---------- TS3 load -------- TS3 serve ------- master move

What it teaches #

  • a tablet is assigned to one tablet server at a time
  • Chubby-backed liveness is the validity boundary
  • reassignment changes the serving generation

State loci #

  • authoritative:
    • master tablet-assignment map
  • local/execution:
    • assigned tablet server serving the tablet
  • cached/derived:
    • client cached tablet location, master in-memory assignment view
  • repair:
    • master reassigning tablets after loss of validity

Core invariant #

  • at most one currently valid tablet server may serve a tablet assignment generation
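The invariant can be checked with a small fencing sketch. The generation counter follows the `assignment_generation` line in the trace above and is used here as a modeling device; the sketch does not claim Bigtable implements assignment this way, and `AssignmentTable` is a hypothetical name.

```python
class AssignmentTable:
    """Toy assignment map: tablet -> (server, generation)."""

    def __init__(self):
        self.entries = {}

    def assign(self, tablet, server):
        # Every (re)assignment bumps the generation for that tablet.
        _, generation = self.entries.get(tablet, (None, 0))
        self.entries[tablet] = (server, generation + 1)
        return generation + 1

    def may_serve(self, tablet, server, generation, lock_valid):
        # A server may serve only while its lock is valid AND it holds
        # the current generation of the assignment.
        return lock_valid and self.entries.get(tablet) == (server, generation)
```

After a reassignment from TS3 to TS4, a request from TS3 carrying the old generation fails `may_serve` even if TS3 still believes its lock is valid.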

7. Process Trace: Tablet Server Membership View #

Bigtable tracks tablet-server membership through Chubby: each tablet server holds an exclusive lock on its own file in a Chubby directory that the master monitors.

Core lines #

  • live_server_set
  • membership_generation
  • server_lock_state
  • expiry_or_session_validity
  • actor_trying_to_act_as_server

Reusable frame mapping #

  • state -> current server membership/serving state
  • owner/view -> live_server_set
  • monotonic marker -> membership_generation
  • validity boundary -> expiry_or_session_validity
  • actor trying to advance -> actor_trying_to_act_as_server

Trace #

time →

live_server_set               {TS1,TS2,TS3} ---- {TS1,TS2,TS3} ------ {TS1,TS2}
membership_generation         m20 -------------- m20 ---------------- m21
server_lock_state             TS3 held --------- TS3 lost ----------- TS3 absent
expiry_or_session_validity    valid ------------ expired ------------ expired
actor_trying_to_act_as_server TS3 serves ------- TS3 stale tries ---- rejected/reassigned

What it teaches #

  • serving authority depends on continued Chubby session/lock validity
  • membership changes should fence out stale servers

State loci #

  • authoritative:
    • Chubby lock/session state and master’s live-server set
  • local/execution:
    • tablet server currently holding/renewing its lock
  • cached/derived:
    • master’s monitored membership view
  • repair:
    • master removing dead servers and reassigning their tablets

Core invariant #

  • a tablet server that has lost its lock/session must not continue serving tablets
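The master's side of this trace can be sketched as one pass over the lock state. Here `chubby_locks` is a plain dict standing in for whatever the master learns from Chubby, and `reconcile_membership` is a hypothetical helper. Both are assumptions for illustration.

```python
def reconcile_membership(known_servers, chubby_locks, assignments):
    """Return (live_servers, tablets_to_reassign).

    known_servers: set of server names the master has been tracking
    chubby_locks:  dict server -> True if its lock/session is still held
    assignments:   dict tablet -> server currently assigned
    """
    # A server counts as live only while its lock is held.
    live = {s for s in known_servers if chubby_locks.get(s, False)}
    # Tablets on servers that dropped out must be reassigned.
    orphaned = sorted(t for t, s in assignments.items() if s not in live)
    return live, orphaned
```

The stale server itself must also stop serving once it notices the lost lock; this sketch covers only the master's view.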

8. Process Trace: Master Recovery And Reconciliation #

The master reconstructs assignment state on restart.

Core lines #

  • master_state
  • live_servers_discovered
  • tablets_known_assigned
  • unassigned_tablet_set
  • master_trying_to_reconcile

Reusable frame mapping #

  • state -> master_state
  • owner/view -> live_servers_discovered, tablets_known_assigned
  • monotonic marker -> assignment knowledge convergence over startup
  • validity boundary -> reconciled metadata + live-server view before acting
  • actor trying to advance -> master_trying_to_reconcile

Trace #

time →

master_state               STARTING -------- SCANNING ------------ RECONCILING ------------------- ACTIVE
live_servers_discovered    none ------------ {TS1,TS2} ----------- {TS1,TS2} --------------------- {TS1,TS2}
tablets_known_assigned     partial --------- from servers -------- from metadata ----------------- complete
unassigned_tablet_set      unknown --------- partial ------------- computed ---------------------- draining
master_trying_to_reconcile lock Chubby ----- ask live servers ---- assign root, scan metadata ---- assign remaining

What it teaches #

  • master recovery is a reconcile loop over live servers plus metadata
  • assignment truth is reconstructed, not blindly assumed
  • root tablet bootstraps metadata discovery

State loci #

  • authoritative:
    • Chubby master lock + METADATA table + live tablet servers
  • local/execution:
    • recovering master process
  • cached/derived:
    • discovered assignments from servers and metadata scan results
  • repair:
    • reconciliation loop that rebuilds unassigned/assigned tablet sets

Core invariant #

  • master must not make new assignment decisions until it has reconciled live server state with metadata-derived tablet existence
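The reconcile step can be sketched as a diff between what live servers report and what the metadata says exists. The input shapes and the name `reconcile_assignments` are assumptions for the example. Acquiring the master lock and assigning the root tablet before the scan are left out.

```python
def reconcile_assignments(server_reports, metadata_tablets):
    """Return (assigned, unassigned) after a master restart.

    server_reports:   dict server -> iterable of tablets it says it serves
    metadata_tablets: iterable of every tablet found by scanning metadata
    """
    assigned = {}
    # What live servers already serve is taken as discovered assignment.
    for server, tablets in server_reports.items():
        for tablet in tablets:
            assigned[tablet] = server
    # Anything that exists in metadata but nobody serves needs assignment.
    unassigned = sorted(set(metadata_tablets) - set(assigned))
    return assigned, unassigned
```

Only once `unassigned` has been computed this way does the sketch's master have a basis for new assignment decisions, matching the invariant above.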

9. Process Trace: Tablet Recovery Via Minor Compaction #

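The paper describes speeding up tablet movement: before the master moves a tablet, the source tablet server runs a minor compaction, stops serving the tablet, then runs a second, usually short, minor compaction so the target can load the tablet without replaying any log entries.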

Core lines #

  • tablet_move_state
  • source_memtable_log_residue
  • minor_compaction_progress
  • recovery_requirement
  • actor_trying_to_move_tablet

Reusable frame mapping #

  • state -> tablet_move_state
  • owner/view -> source/target tablet server relationship
  • monotonic marker -> minor_compaction_progress
  • validity boundary -> recovery_requirement
  • actor trying to advance -> actor_trying_to_move_tablet

Trace #

time →

tablet_move_state           SERVING -------------- COMPACTING ------ READY_TO_MOVE ----- LOADED_ELSEWHERE
source_memtable_log_residue present -------------- shrinking ------- eliminated -------- none
minor_compaction_progress   none ----------------- first+second ---- done -------------- done
recovery_requirement        log replay needed ---- less replay ----- no replay --------- no replay
actor_trying_to_move_tablet source TS ------------ compaction ------ master assigns ---- target TS load

What it teaches #

  • compaction here is a process step used to simplify subsequent storage recovery
  • Bigtable reduces recovery cost by flushing mutable residue before reassignment

State loci #

  • authoritative:
    • tablet’s persisted SSTables plus commit log residue boundary
  • local/execution:
    • source tablet server compacting before unload, target loading after move
  • cached/derived:
    • temporary movement/recovery requirement state
  • repair:
    • minor-compaction-driven recovery avoidance during reassignment

Core invariant #

  • a tablet should not be reloaded elsewhere with hidden uncompacted state that would violate the expected recovery boundary
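The two-step compaction before a move can be sketched as follows. `MovingTablet`, `prepare_move`, and the `late_writes` parameter (writes that arrive while the first compaction runs) are hypothetical constructs for illustration only.

```python
class MovingTablet:
    """Toy source-side tablet that flushes all mutable state before a move."""

    def __init__(self):
        self.memtable = []    # uncompacted mutations
        self.sstables = []    # immutable flushed batches
        self.serving = True

    def write(self, mutation):
        if not self.serving:
            raise RuntimeError("tablet is no longer serving")
        self.memtable.append(mutation)

    def minor_compaction(self):
        # Flush whatever is in the memtable into a new immutable batch.
        if self.memtable:
            self.sstables.append(tuple(self.memtable))
            self.memtable = []

    def prepare_move(self, late_writes=()):
        # First compaction runs while the tablet is still serving.
        self.minor_compaction()
        for mutation in late_writes:
            self.write(mutation)
        # Stop serving, then flush the small remainder.
        self.serving = False
        self.minor_compaction()
        # True means the target can load without any log replay.
        return not self.memtable
```

The second compaction is short because it only covers writes that arrived during the first one.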

Minimum Bigtable Trace Set #

If you only want the representative Bigtable traces, drill these:

  • storage:
    • Tablet Write Path
    • Read Path Across Memtable And SSTables
    • Tablet Metadata Hierarchy
    • Tablet Split
  • process:
    • Tablet Assignment
    • Tablet Server Membership View
    • Master Recovery And Reconciliation

That is enough to capture the main Bigtable shapes.