Bigtable Process And Storage Traces #
Source: bigtable.pdf
Archetype Read #
Bigtable is primarily a storage system.
Dominant archetypes:
I19 Chunk / Block / File Storageas substrate support via GFS + SSTablesI16 Replicated KVflavor for row-keyed authoritative serving stateI06 Projection / Index / Searchflavor in the metadata hierarchy and derived serving structure
Supporting process shapes:
membership viewfor tablet servers via Chubbyallocation / assignmentfor tablets to tablet serversbatchfor compactionreconcile targetfor master reassigning unserved tablets
Main judgment:
- Bigtable is storage-primary
- the main truth objects are rows, tablets, memtables, SSTables, commit logs, and tablet metadata
- process mostly exists to keep tablets assigned, recoverable, and compacted
1. Storage Trace: Tablet Write Path #
This is the main Bigtable storage trace.
Core lines #
commit_log_sequencememtable_stateminor_compaction_outputreader_visible_statewriter_trying_mutation
Reusable frame mapping #
truth locus-> commit log + memtable + published SSTables for the tabletreplica/derived state->memtable_state,minor_compaction_outputdurability boundary->commit_log_sequencevisibility->reader_visible_stateactor trying to write/read/apply->writer_trying_mutation
Trace #
time →
commit_log_sequence L100 ------------ L101 ------------- L101 ------------- L101
memtable_state M100 ------------ M101 dirty ------- M101 frozen ------ empty/new
minor_compaction_output S5 -------------- S5 --------------- S6 building ------ S6 published
reader_visible_state sees S5+M100 ---- sees S5+M101 ----- sees S5+M101 ----- sees S5+S6
writer_trying_mutation none ------------ append+apply ----- none ------------- none
What it teaches #
- mutations first hit the commit log and memtable
- visibility may come from
SSTables + memtabletogether - minor compaction converts mutable in-memory state into immutable SSTable state
State loci #
- authoritative:
- commit log + current tablet state on the serving tablet server
- local/replica:
- tablet server memtable for the tablet
- cached/derived:
- newly built SSTable output before full publication
- repair:
- recovery by replaying commit log or flushing via compaction
Core invariant #
- a write must not become logically durable unless it is recorded in the commit log, and reads must merge memtable + SSTables consistently for the row
2. Storage Trace: Read Path Across Memtable And SSTables #
Bigtable reads are a merged storage view.
Core lines #
memtable_versionsstable_settimestamp_order_viewreader_visible_cell_versionsreader_trying_lookup
Reusable frame mapping #
truth locus-> merged row/cell state across memtable and SSTablesreplica/derived state->memtable_version,sstable_setdurability/currentness boundary->timestamp_order_viewvisibility->reader_visible_cell_versionsactor trying to write/read/apply->reader_trying_lookup
Trace #
time →
memtable_version in-mem t9 ------- in-mem t10 ------ flushed
sstable_set {S1,S2} --------- {S1,S2} --------- {S1,S2,S3}
timestamp_order_view t9,t6,t5 -------- t10,t9,t6 ------- t10,t9,t6
reader_visible_cell_versions latest=t9 ----- latest=t10 ------ latest=t10
reader_trying_lookup get(row,col) ---- get(row,col) ---- get(row,col)
What it teaches #
- the latest visible value may be in memory, not yet in SSTables
- Bigtable’s storage truth is multi-versioned by timestamp
- read visibility is a merge over recent in-memory and older immutable disk state
State loci #
- authoritative:
- serving tablet’s merged row state
- local/replica:
- memtable and SSTable readers on the tablet server
- cached/derived:
- block cache / in-memory SSTable indexes / client read result
- repair:
- compaction producing a cleaner merged disk view over time
Core invariant #
- readers must observe the correct newest cell version by merging memtable and SSTables in timestamp order
3. Storage Trace: Tablet Metadata Hierarchy #
This is the core Bigtable location-storage trace.
Core lines #
chubby_root_pointerroot_tablet_locationmetadata_tablet_mappingclient_cached_locationclient_trying_lookup_tablet
Reusable frame mapping #
truth locus->chubby_root_pointer+root_tablet_location+metadata_tablet_mappingreplica/derived state->client_cached_locationfreshness boundary-> cached tablet-location validityvisibility-> client-resolved tablet locationactor trying to write/read/apply->client_trying_lookup_tablet
Trace #
time →
chubby_root_pointer R1 -------------- R1 -------------- R1
root_tablet_location TS7 ------------- TS7 ------------- TS9
metadata_tablet_mapping T[a-m]->TS3 ----- T[a-m]->TS3 ----- T[a-m]->TS4
client_cached_location TS3 ------------- stale TS3 ------- refreshed TS4
client_trying_lookup_tablet locate ------- direct request --- retry via metadata
What it teaches #
- tablet location truth is hierarchical
- client visibility is often a cached derived view
- stale location cache is normal and repaired by walking back up the hierarchy
State loci #
- authoritative:
- Chubby root pointer + root tablet + METADATA tablets
- local/replica:
- tablet servers hosting root/METADATA/user tablets
- cached/derived:
- client cached location entries
- repair:
- client fallback up the hierarchy when cache is stale
Core invariant #
- a client must be able to recover from stale cached tablet location by consulting the authoritative location hierarchy
4. Storage Trace: Tablet Split #
Tablet split is a storage topology change.
Core lines #
parent_tablet_rangemetadata_registration_statechild_tablets_visiblesstable_sharing_stateactor_trying_split_commit
Reusable frame mapping #
truth locus->metadata_registration_statereplica/derived state->child_tablets_visible,sstable_sharing_statedurability boundary-> committed child metadata entriesvisibility-> authoritative child tablet publicationactor trying to write/read/apply->actor_trying_split_commit
Trace #
time →
parent_tablet_range [a,z] ---------- [a,z] splitting --- retired
metadata_registration_state parent only ---- child entries write - children authoritative
child_tablets_visible none ----------- [a,m],[m,z] pending - [a,m],[m,z]
sstable_sharing_state parent SSTables - shared by children - shared/compacted later
actor_trying_split_commit none ----------- tablet server ----- master notices
What it teaches #
- split commits through metadata
- children can initially share immutable SSTables of the parent
- storage topology changes are metadata-first, data-light
State loci #
- authoritative:
- METADATA entries for parent/child tablets
- local/replica:
- splitting tablet server handling the parent tablet
- cached/derived:
- shared immutable SSTables referenced by children
- repair:
- master noticing/finishing split assignment if notification is lost
Core invariant #
- child tablets must not become authoritative before their metadata entries are durably committed
5. Storage Trace: Major Compaction / Garbage Collection #
Bigtable relies on immutable SSTables and cleanup of obsolete files.
Core lines #
live_sstable_setobsolete_sstable_setmetadata_registered_filesgc_sweep_stateactor_trying_compact_or_collect
Reusable frame mapping #
truth locus->metadata_registered_filesreplica/derived state->live_sstable_set,obsolete_sstable_setdurability boundary-> metadata no longer references obsolete filesvisibility->live_sstable_setactor trying to write/read/apply->actor_trying_compact_or_collect
Trace #
time →
live_sstable_set {S1,S2,S3} ------ {S4} ------------- {S4}
obsolete_sstable_set none ------------ {S1,S2,S3} ------- deleted
metadata_registered_files S1,S2,S3 -------- S4 --------------- S4
gc_sweep_state idle ------------ mark obsolete ---- sweep done
actor_trying_compact_or_collect compaction -- master GC ------- none
What it teaches #
- immutable file replacement gives simple storage visibility
- metadata registration determines which SSTables are live
- garbage collection is a storage-control process over obsolete immutable files
State loci #
- authoritative:
- METADATA registration of live SSTables
- local/replica:
- tablet server local SSTable set and compaction output
- cached/derived:
- obsolete SSTable candidate set
- repair:
- master mark-and-sweep garbage collection
Core invariant #
- an SSTable must not be deleted until it is no longer registered as live in authoritative metadata
6. Process Trace: Tablet Assignment #
This is the main Bigtable process trace.
Core lines #
tablet_assignment_stateassigned_tablet_serverassignment_generationserver_lock_validityactor_trying_to_serve_or_reassign
Reusable frame mapping #
state->tablet_assignment_stateowner/view->assigned_tablet_servermonotonic marker->assignment_generationvalidity boundary->server_lock_validityactor trying to advance->actor_trying_to_serve_or_reassign
Trace #
time →
tablet_assignment_state UNASSIGNED ------ ASSIGNED -------- SERVING ---------- REASSIGNING
assigned_tablet_server none ----------- TS3 ------------- TS3 -------------- TS4
assignment_generation 10 ------------- 11 -------------- 11 --------------- 12
server_lock_validity - -------------- valid ----------- lost/expired ----- valid
actor_trying_to_serve_or_reassign master ---- TS3 load ------- TS3 serve -------- master move
What it teaches #
- a tablet is assigned to one tablet server at a time
- Chubby-backed liveness is the validity boundary
- reassignment changes the serving generation
State loci #
- authoritative:
- master tablet-assignment map
- local/execution:
- assigned tablet server serving the tablet
- cached/derived:
- client cached tablet location, master in-memory assignment view
- repair:
- master reassigning tablets after loss of validity
Core invariant #
- at most one currently valid tablet server may serve a tablet assignment generation
7. Process Trace: Tablet Server Membership View #
Bigtable’s later design uses Chubby-backed tablet-server membership.
Core lines #
live_server_setmembership_generationserver_lock_stateexpiry_or_session_validityactor_trying_to_act_as_server
Reusable frame mapping #
state-> current server membership/serving stateowner/view->live_server_setmonotonic marker->membership_generationvalidity boundary->expiry_or_session_validityactor trying to advance->actor_trying_to_act_as_server
Trace #
time →
live_server_set {TS1,TS2,TS3} --- {TS1,TS2,TS3} --- {TS1,TS2}
membership_generation m20 ------------- m20 ------------- m21
server_lock_state TS3 held -------- TS3 lost -------- TS3 absent
expiry_or_session_validity valid ---------- expired --------- expired
actor_trying_to_act_as_server TS3 serves --- TS3 stale tries - rejected/reassigned
What it teaches #
- serving authority depends on continued Chubby session/lock validity
- membership changes should fence out stale servers
State loci #
- authoritative:
- Chubby lock/session state and master’s live-server set
- local/execution:
- tablet server currently holding/renewing its lock
- cached/derived:
- master’s monitored membership view
- repair:
- master removing dead servers and reassigning their tablets
Core invariant #
- a tablet server that has lost its lock/session must not continue serving tablets
8. Process Trace: Master Recovery And Reconciliation #
The master reconstructs assignment state on restart.
Core lines #
master_statelive_servers_discoveredtablets_known_assignedunassigned_tablet_setmaster_trying_to_reconcile
Reusable frame mapping #
state->master_stateowner/view->live_servers_discovered,tablets_known_assignedmonotonic marker-> assignment knowledge convergence over startupvalidity boundary-> reconciled metadata + live-server view before actingactor trying to advance->master_trying_to_reconcile
Trace #
time →
master_state STARTING -------- SCANNING -------- RECONCILING ------ ACTIVE
live_servers_discovered none ------------ {TS1,TS2} ------ {TS1,TS2} -------- {TS1,TS2}
tablets_known_assigned partial --------- from servers ---- from metadata ---- complete
unassigned_tablet_set unknown --------- partial --------- computed --------- draining
master_trying_to_reconcile lock Chubby ---- ask servers ----- scan metadata ---- assign root/others
What it teaches #
- master recovery is a reconcile loop over live servers plus metadata
- assignment truth is reconstructed, not blindly assumed
- root tablet bootstraps metadata discovery
State loci #
- authoritative:
- Chubby master lock + METADATA table + live tablet servers
- local/execution:
- recovering master process
- cached/derived:
- discovered assignments from servers and metadata scan results
- repair:
- reconciliation loop that rebuilds unassigned/assigned tablet sets
Core invariant #
- master must not make new assignment decisions until it has reconciled live server state with metadata-derived tablet existence
9. Process Trace: Tablet Recovery Via Minor Compaction #
The paper explicitly describes speeding tablet movement/recovery by compacting uncompacted log state.
Core lines #
tablet_move_statesource_memtable_log_residueminor_compaction_progressrecovery_requirementactor_trying_to_move_tablet
Reusable frame mapping #
state->tablet_move_stateowner/view-> source/target tablet server relationshipmonotonic marker->minor_compaction_progressvalidity boundary->recovery_requirementactor trying to advance->actor_trying_to_move_tablet
Trace #
time →
tablet_move_state SERVING ---------- COMPACTING ------- READY_TO_MOVE ---- LOADED_ELSEWHERE
source_memtable_log_residue present --------- shrinking -------- eliminated ------- none
minor_compaction_progress none ------------ first+second ----- done ------------ done
recovery_requirement log replay needed - less replay ---- no replay ------- no replay
actor_trying_to_move_tablet source TS ------- compaction ------ master assigns --- target TS load
What it teaches #
- compaction here is a process step used to simplify subsequent storage recovery
- Bigtable reduces recovery cost by flushing mutable residue before reassignment
State loci #
- authoritative:
- tablet’s persisted SSTables plus commit log residue boundary
- local/execution:
- source tablet server compacting before unload, target loading after move
- cached/derived:
- temporary movement/recovery requirement state
- repair:
- minor-compaction-driven recovery avoidance during reassignment
Core invariant #
- a tablet should not be reloaded elsewhere with hidden uncompacted state that would violate the expected recovery boundary
Minimum Bigtable Trace Set #
If you only want the representative Bigtable traces, drill these:
storage:
Tablet Write PathRead Path Across Memtable And SSTablesTablet Metadata HierarchyTablet Split
process:
Tablet AssignmentTablet Server Membership ViewMaster Recovery And Reconciliation
That is enough to capture the main Bigtable shapes.