Borg Process And Storage Traces #
Source: borg.pdf
Borg is primarily a process system:
- jobs are admitted
- tasks move through pending -> running -> dead
- allocs reserve capacity
- the scheduler places tasks
- Borglets report health
- failed or evicted tasks are restarted or rescheduled
Its storage traces are mostly about Borg’s own control-state substrate:
- Borgmaster state in the Paxos-backed persistent store
- checkpoints + change log
- desired/observed cell state
- task/alloc/machine object state
It is not primarily a user-data storage system.
1. Process Trace: Task Run #
This is the main Borg unit of work.
Core lines #
task_stateassigned_machineattempt_idhealth_or_liveness_boundaryactor_trying_to_complete_or_restart
Reusable frame mapping #
state->task_stateowner/view->assigned_machinemonotonic marker->attempt_idvalidity boundary->health_or_liveness_boundaryactor trying to advance->actor_trying_to_complete_or_restart
Trace #
time →
task_state PENDING ------- ASSIGNED ------- RUNNING -------- DEAD/RESCHEDULED
assigned_machine none ---------- M1 ------------ M1 ------------- M2
attempt_id 0 ------------- 1 ------------- 1 -------------- 2
health_or_liveness_boundary - ------------- t5 ------------ health fail ----- t12
actor_trying_to_complete_or_restart scheduler ---- borglet start -- health monitor -- scheduler reassign
What it teaches #
- the scheduler assigns a task from the pending queue
- the Borglet starts and monitors it
- failure or loss causes a new execution generation
attempt_idis the epoch-like marker for task execution
State loci #
- authoritative:
- Borgmaster task state / pending-running-dead state machine
- local/execution:
- Borglet and task processes on the assigned machine
- cached/derived:
- scheduler cached cell state, monitoring views
- repair:
- scheduler + Borgmaster reschedule/restart path
Core invariant #
- only the current valid task attempt may be treated as the live execution of that task slot
2. Process Trace: Pending Queue To Placement #
This is the scheduler’s main progression path.
Core lines #
task_statepriority_bandfeasible_machine_setplacement_decision_versionscheduler_trying_to_place
Reusable frame mapping #
state->task_stateowner/view->feasible_machine_setmonotonic marker->placement_decision_versionvalidity boundary-> current-enough cell state for feasibility/scoringactor trying to advance->scheduler_trying_to_place
Trace #
time →
task_state PENDING -------- PENDING -------- ASSIGNED -------- RUNNING
priority_band prod ----------- prod ----------- prod ------------ prod
feasible_machine_set {M1,M2} -------- {M2} ----------- {M2} ------------ {M2}
placement_decision_version 10 ------------- 11 ------------- 11 -------------- 11
scheduler_trying_to_place scan ----------- rescore -------- assign M2 -------- none
What it teaches #
- scheduling is over a pending backlog
- feasibility and scoring drive placement
- cell state changes can invalidate an older placement view
State loci #
- authoritative:
- Borgmaster cell state and pending queue
- local/execution:
- scheduler process using its local cached copy
- cached/derived:
- feasible/scored machine candidates in scheduler memory
- repair:
- next scheduler pass after stale/failed placement
Core invariant #
- a placement decision must be made against current enough cell state and must not over-allocate machine resources
3. Process Trace: Alloc #
The paper explicitly calls out alloc as a reserved set of machine resources that may hold one or more tasks.
Core lines #
alloc_statealloc_owner_or_attached_jobsalloc_epochexpiry_or_relocation_boundaryactor_trying_to_consume_alloc
Reusable frame mapping #
state->alloc_stateowner/view->alloc_owner_or_attached_jobsmonotonic marker->alloc_epochvalidity boundary->expiry_or_relocation_boundaryactor trying to advance->actor_trying_to_consume_alloc
Trace #
time →
alloc_state FREE -------- RESERVED -------- IN_USE -------- RELOCATED
alloc_owner_or_attached_jobs none ------- allocset A ------ task T1 ------- allocset A on M2
alloc_epoch 0 ----------- 1 --------------- 1 ------------ 2
expiry_or_relocation_boundary - --------- t10 ------------ move needed --- t18
actor_trying_to_consume_alloc scheduler -- task start ----- task runs ----- scheduler rebinds
What it teaches #
- alloc is a first-class capacity reservation
- tasks may run inside allocs
- relocation creates a new authority generation
State loci #
- authoritative:
- Borgmaster alloc object state
- local/execution:
- machine hosting the alloc and tasks inside it
- cached/derived:
- scheduler/resource accounting view of reserved capacity
- repair:
- relocation / reschedule path when alloc must move
Core invariant #
- at most one current valid allocation generation may reserve and expose the same machine resource slice
4. Process Trace: Preemption / Eviction #
This is a core Borg process shape because higher-priority tasks can evict lower-priority ones.
Core lines #
victim_task_statepreemptor_priorityvictim_attempt_idtermination_notice_boundaryactor_trying_to_preempt
Reusable frame mapping #
state->victim_task_stateowner/view-> current victim/preemptor relationshipmonotonic marker->victim_attempt_idvalidity boundary->termination_notice_boundaryactor trying to advance->actor_trying_to_preempt
Trace #
time →
victim_task_state RUNNING ------- SIGTERM_SENT ---- KILLED -------- RESCHEDULED
preemptor_priority none ---------- prod task ------- prod task ----- prod task
victim_attempt_id 4 ------------- 4 -------------- 4 ------------- 5
termination_notice_boundary notice t3 ----- notice active --- expired ------- new start t9
actor_trying_to_preempt none ---------- scheduler ------- borglet kill --- scheduler reassign
What it teaches #
- eviction is an intentional control action, not just a crash
- notice/SIGTERM is the validity boundary before forced kill
- rescheduling creates a new execution attempt
State loci #
- authoritative:
- Borgmaster priority/preemption decision state
- local/execution:
- victim task and Borglet on the machine
- cached/derived:
- preemptor/victim scheduling view, disruption budget view
- repair:
- rescheduler creating a new attempt after eviction
Core invariant #
- once a task is preempted and a newer attempt exists, the older attempt must not continue to count as the live task
5. Process Trace: Admission Control #
Quota and priority are handled before scheduling.
Core lines #
job_submission_statequota_sufficiencypriority_bandadmission_decision_versionactor_trying_to_admit
Reusable frame mapping #
state->job_submission_stateowner/view->priority_bandplus quota evaluation contextmonotonic marker->admission_decision_versionvalidity boundary-> quota sufficiency at submission timeactor trying to advance->actor_trying_to_admit
Trace #
time →
job_submission_state SUBMITTED ------ QUOTA_CHECK ----- ADMITTED ------- PENDING
quota_sufficiency unknown -------- sufficient ------ sufficient ----- sufficient
priority_band batch ---------- batch ----------- batch ---------- batch
admission_decision_version 21 ------------ 22 ------------- 22 ------------- 22
actor_trying_to_admit client --------- borgmaster ------ accept --------- enqueue tasks
What it teaches #
- quota is an admission gate, not the placement algorithm itself
- a job may be admitted and still remain pending if the cell is full
State loci #
- authoritative:
- admission result and job object in Borgmaster state
- local/execution:
- client submission and Borgmaster RPC handling
- cached/derived:
- quota/priority evaluation context
- repair:
- later scheduler passes if the admitted job remains pending
Core invariant #
- a job lacking sufficient quota at the requested priority must not be admitted into schedulable cell state
6. Process Trace: Borglet Health And Restart #
The paper emphasizes health monitoring and automatic restarts.
Core lines #
task_health_statecurrent_attempthealth_check_deadlinerestart_generationactor_trying_to_restart
Reusable frame mapping #
state->task_health_stateowner/view-> current live attemptmonotonic marker->restart_generationvalidity boundary->health_check_deadlineactor trying to advance->actor_trying_to_restart
Trace #
time →
task_health_state healthy -------- unhealthy ------- failed -------- restarted
current_attempt 7 -------------- 7 -------------- 7 ------------- 8
health_check_deadline t4 ------------- t4 expired ----- expired ------- t12
restart_generation 7 -------------- 7 -------------- 7 ------------- 8
actor_trying_to_restart none ----------- monitor -------- borgmaster ---- borglet start
What it teaches #
- health-check failure is a control signal into restart logic
- restart increments the execution generation
State loci #
- authoritative:
- Borgmaster task-health and restart state
- local/execution:
- Borglet performing checks and restarting processes
- cached/derived:
- monitoring/health-check observations
- repair:
- restart path or reschedule path after health failure
Core invariant #
- a failed health state must eventually lead to a new valid attempt or terminal task death
7. Storage Trace: Borgmaster Control-State Store #
This is the primary storage trace in Borg itself.
Core lines #
in_memory_cell_state_versionpaxos_persistent_log_versioncheckpoint_versionread_visible_master_statewriter_trying_to_mutate_state
Reusable frame mapping #
truth locus->in_memory_cell_state_versionbacked bypaxos_persistent_log_versionreplica/derived state-> checkpointed snapshots and replica copiesdurability boundary->paxos_persistent_log_versionvisibility->read_visible_master_stateactor trying to write/read/apply->writer_trying_to_mutate_state
Trace #
time →
in_memory_cell_state_version 100 ------------ 101 pending ------- 101 ----------------
paxos_persistent_log_version 100 ------------ 101 -------------- 101 ----------------
checkpoint_version 95 ------------- 95 --------------- 100 snapshot -------
read_visible_master_state 100 ------------ 101 -------------- 101 ----------------
writer_trying_to_mutate_state none ----------- submit job ------- commit state change -
What it teaches #
- Borgmaster state is both in-memory and persisted via Paxos
- the persistent log is the durable truth boundary for control-state mutation
- checkpoints are periodic storage compaction/snapshot state, not the only source of truth
State loci #
- authoritative:
- elected Borgmaster plus Paxos-backed persistent store
- local/replica:
- each Borgmaster replica’s in-memory copy
- cached/derived:
- checkpoints and read/UI shards
- repair:
- Paxos replication and master failover recovery
Core invariant #
- cell state mutations must not become authoritative unless recorded through the Paxos-backed persistent store
8. Storage Trace: Checkpoint + Change Log Recovery #
The paper explicitly describes checkpoint plus change log reconstruction.
Core lines #
checkpoint_base_versionchange_log_tail_versionreconstructed_master_state_versionrecovery_visibilityactor_trying_to_recover_master
Reusable frame mapping #
truth locus->checkpoint_base_versionpluschange_log_tail_versionreplica/derived state->reconstructed_master_state_versiondurability boundary-> replayed change-log tail after checkpointvisibility->recovery_visibilityactor trying to write/read/apply->actor_trying_to_recover_master
Trace #
time →
checkpoint_base_version 95 ------------- 95 ------------- 100
change_log_tail_version 100 ------------ 101 ------------ 101
reconstructed_master_state_version 95 --------- 100 ------------ 101
recovery_visibility recovering ----- partial -------- authoritative
actor_trying_to_recover_master new master ----- replay log ----- serve traffic
What it teaches #
- checkpoint is not enough alone
- recovery reconstructs authoritative state by replaying the log tail after the checkpoint
- visibility of the recovered master state is delayed until replay catches up
State loci #
- authoritative:
- checkpoint plus persisted change log in Paxos store
- local/replica:
- recovering master’s reconstructed in-memory state
- cached/derived:
- snapshot/checkpoint files used for replay
- repair:
- master recovery logic replaying log tail
Core invariant #
- a recovering master must not serve authoritative state older than checkpoint-plus-replayed-log reconstruction
9. Storage Trace: Machine / Task Object View Synchronization #
The paper notes that Borglets report full state and link shards send diffs into state machines.
Core lines #
borglet_local_machine_viewmaster_object_viewlast_applied_report_versionmaster_visible_machine_stateactor_trying_to_apply_report
Reusable frame mapping #
truth locus->master_object_viewreplica/derived state->borglet_local_machine_viewdurability/freshness boundary->last_applied_report_versionvisibility->master_visible_machine_stateactor trying to write/read/apply->actor_trying_to_apply_report
Trace #
time →
borglet_local_machine_view v40 ------------ v41 health fail ---- v41
master_object_view v40 ------------ v40 --------------- v41
last_applied_report_version 40 ------------- 40 ---------------- 41
master_visible_machine_state healthy -------- healthy ----------- unhealthy
actor_trying_to_apply_report none ----------- link shard ------- state machine apply
What it teaches #
- observed state at the Borglet and authoritative control view at the master can lag
- report application is a storage-style observed-state synchronization trace
State loci #
- authoritative:
- master object state machines
- local/replica:
- Borglet local machine/task state
- cached/derived:
- link shard buffered/diff-applied report view
- repair:
- subsequent full-state reports and state-machine reconciliation
Core invariant #
- master-visible machine/task state must advance monotonically from applied Borglet reports and must not regress
Minimum Borg Trace Set #
If you only want the representative Borg traces, drill these:
process:
Task RunAllocPreemption / EvictionAdmission Control
storage:
Borgmaster Control-State StoreCheckpoint + Change Log Recovery
That is enough to capture the main Borg shapes.