Skip to main content
  1. System Design Components/

Borg Process And Storage Traces #

Source: borg.pdf

Borg is primarily a process system:

  • jobs are admitted
  • tasks move through pending -> running -> dead
  • allocs reserve capacity
  • the scheduler places tasks
  • Borglets report health
  • failed or evicted tasks are restarted or rescheduled

Its storage traces are mostly about Borg’s own control-state substrate:

  • Borgmaster state in the Paxos-backed persistent store
  • checkpoints + change log
  • desired/observed cell state
  • task/alloc/machine object state

It is not primarily a user-data storage system.

1. Process Trace: Task Run #

This is the main Borg unit of work.

Core lines #

  • task_state
  • assigned_machine
  • attempt_id
  • health_or_liveness_boundary
  • actor_trying_to_complete_or_restart

Reusable frame mapping #

  • state -> task_state
  • owner/view -> assigned_machine
  • monotonic marker -> attempt_id
  • validity boundary -> health_or_liveness_boundary
  • actor trying to advance -> actor_trying_to_complete_or_restart

Trace #

time →

task_state                    PENDING ------- ASSIGNED ------- RUNNING -------- DEAD/RESCHEDULED
assigned_machine              none ---------- M1 ------------ M1 ------------- M2
attempt_id                    0 ------------- 1 ------------- 1 -------------- 2
health_or_liveness_boundary   - ------------- t5 ------------ health fail ----- t12
actor_trying_to_complete_or_restart scheduler ---- borglet start -- health monitor -- scheduler reassign

What it teaches #

  • the scheduler assigns a task from the pending queue
  • the Borglet starts and monitors it
  • failure or loss causes a new execution generation
  • attempt_id is the epoch-like marker for task execution

State loci #

  • authoritative:
    • Borgmaster task state / pending-running-dead state machine
  • local/execution:
    • Borglet and task processes on the assigned machine
  • cached/derived:
    • scheduler cached cell state, monitoring views
  • repair:
    • scheduler + Borgmaster reschedule/restart path

Core invariant #

  • only the current valid task attempt may be treated as the live execution of that task slot

2. Process Trace: Pending Queue To Placement #

This is the scheduler’s main progression path.

Core lines #

  • task_state
  • priority_band
  • feasible_machine_set
  • placement_decision_version
  • scheduler_trying_to_place

Reusable frame mapping #

  • state -> task_state
  • owner/view -> feasible_machine_set
  • monotonic marker -> placement_decision_version
  • validity boundary -> current-enough cell state for feasibility/scoring
  • actor trying to advance -> scheduler_trying_to_place

Trace #

time →

task_state                  PENDING -------- PENDING -------- ASSIGNED -------- RUNNING
priority_band              prod ----------- prod ----------- prod ------------ prod
feasible_machine_set       {M1,M2} -------- {M2} ----------- {M2} ------------ {M2}
placement_decision_version 10 ------------- 11 ------------- 11 -------------- 11
scheduler_trying_to_place  scan ----------- rescore -------- assign M2 -------- none

What it teaches #

  • scheduling is over a pending backlog
  • feasibility and scoring drive placement
  • cell state changes can invalidate an older placement view

State loci #

  • authoritative:
    • Borgmaster cell state and pending queue
  • local/execution:
    • scheduler process using its local cached copy
  • cached/derived:
    • feasible/scored machine candidates in scheduler memory
  • repair:
    • next scheduler pass after stale/failed placement

Core invariant #

  • a placement decision must be made against current enough cell state and must not over-allocate machine resources

3. Process Trace: Alloc #

The paper explicitly calls out alloc as a reserved set of machine resources that may hold one or more tasks.

Core lines #

  • alloc_state
  • alloc_owner_or_attached_jobs
  • alloc_epoch
  • expiry_or_relocation_boundary
  • actor_trying_to_consume_alloc

Reusable frame mapping #

  • state -> alloc_state
  • owner/view -> alloc_owner_or_attached_jobs
  • monotonic marker -> alloc_epoch
  • validity boundary -> expiry_or_relocation_boundary
  • actor trying to advance -> actor_trying_to_consume_alloc

Trace #

time →

alloc_state                 FREE -------- RESERVED -------- IN_USE -------- RELOCATED
alloc_owner_or_attached_jobs none ------- allocset A ------ task T1 ------- allocset A on M2
alloc_epoch                 0 ----------- 1 --------------- 1 ------------ 2
expiry_or_relocation_boundary - --------- t10 ------------ move needed --- t18
actor_trying_to_consume_alloc scheduler -- task start ----- task runs ----- scheduler rebinds

What it teaches #

  • alloc is a first-class capacity reservation
  • tasks may run inside allocs
  • relocation creates a new authority generation

State loci #

  • authoritative:
    • Borgmaster alloc object state
  • local/execution:
    • machine hosting the alloc and tasks inside it
  • cached/derived:
    • scheduler/resource accounting view of reserved capacity
  • repair:
    • relocation / reschedule path when alloc must move

Core invariant #

  • at most one current valid allocation generation may reserve and expose the same machine resource slice

4. Process Trace: Preemption / Eviction #

This is a core Borg process shape because higher-priority tasks can evict lower-priority ones.

Core lines #

  • victim_task_state
  • preemptor_priority
  • victim_attempt_id
  • termination_notice_boundary
  • actor_trying_to_preempt

Reusable frame mapping #

  • state -> victim_task_state
  • owner/view -> current victim/preemptor relationship
  • monotonic marker -> victim_attempt_id
  • validity boundary -> termination_notice_boundary
  • actor trying to advance -> actor_trying_to_preempt

Trace #

time →

victim_task_state           RUNNING ------- SIGTERM_SENT ---- KILLED -------- RESCHEDULED
preemptor_priority          none ---------- prod task ------- prod task ----- prod task
victim_attempt_id           4 ------------- 4 -------------- 4 ------------- 5
termination_notice_boundary notice t3 ----- notice active --- expired ------- new start t9
actor_trying_to_preempt     none ---------- scheduler ------- borglet kill --- scheduler reassign

What it teaches #

  • eviction is an intentional control action, not just a crash
  • notice/SIGTERM is the validity boundary before forced kill
  • rescheduling creates a new execution attempt

State loci #

  • authoritative:
    • Borgmaster priority/preemption decision state
  • local/execution:
    • victim task and Borglet on the machine
  • cached/derived:
    • preemptor/victim scheduling view, disruption budget view
  • repair:
    • rescheduler creating a new attempt after eviction

Core invariant #

  • once a task is preempted and a newer attempt exists, the older attempt must not continue to count as the live task

5. Process Trace: Admission Control #

Quota and priority are handled before scheduling.

Core lines #

  • job_submission_state
  • quota_sufficiency
  • priority_band
  • admission_decision_version
  • actor_trying_to_admit

Reusable frame mapping #

  • state -> job_submission_state
  • owner/view -> priority_band plus quota evaluation context
  • monotonic marker -> admission_decision_version
  • validity boundary -> quota sufficiency at submission time
  • actor trying to advance -> actor_trying_to_admit

Trace #

time →

job_submission_state       SUBMITTED ------ QUOTA_CHECK ----- ADMITTED ------- PENDING
quota_sufficiency          unknown -------- sufficient ------ sufficient ----- sufficient
priority_band             batch ---------- batch ----------- batch ---------- batch
admission_decision_version 21 ------------ 22 ------------- 22 ------------- 22
actor_trying_to_admit     client --------- borgmaster ------ accept --------- enqueue tasks

What it teaches #

  • quota is an admission gate, not the placement algorithm itself
  • a job may be admitted and still remain pending if the cell is full

State loci #

  • authoritative:
    • admission result and job object in Borgmaster state
  • local/execution:
    • client submission and Borgmaster RPC handling
  • cached/derived:
    • quota/priority evaluation context
  • repair:
    • later scheduler passes if the admitted job remains pending

Core invariant #

  • a job lacking sufficient quota at the requested priority must not be admitted into schedulable cell state

6. Process Trace: Borglet Health And Restart #

The paper emphasizes health monitoring and automatic restarts.

Core lines #

  • task_health_state
  • current_attempt
  • health_check_deadline
  • restart_generation
  • actor_trying_to_restart

Reusable frame mapping #

  • state -> task_health_state
  • owner/view -> current live attempt
  • monotonic marker -> restart_generation
  • validity boundary -> health_check_deadline
  • actor trying to advance -> actor_trying_to_restart

Trace #

time →

task_health_state          healthy -------- unhealthy ------- failed -------- restarted
current_attempt            7 -------------- 7 -------------- 7 ------------- 8
health_check_deadline      t4 ------------- t4 expired ----- expired ------- t12
restart_generation         7 -------------- 7 -------------- 7 ------------- 8
actor_trying_to_restart    none ----------- monitor -------- borgmaster ---- borglet start

What it teaches #

  • health-check failure is a control signal into restart logic
  • restart increments the execution generation

State loci #

  • authoritative:
    • Borgmaster task-health and restart state
  • local/execution:
    • Borglet performing checks and restarting processes
  • cached/derived:
    • monitoring/health-check observations
  • repair:
    • restart path or reschedule path after health failure

Core invariant #

  • a failed health state must eventually lead to a new valid attempt or terminal task death

7. Storage Trace: Borgmaster Control-State Store #

This is the primary storage trace in Borg itself.

Core lines #

  • in_memory_cell_state_version
  • paxos_persistent_log_version
  • checkpoint_version
  • read_visible_master_state
  • writer_trying_to_mutate_state

Reusable frame mapping #

  • truth locus -> in_memory_cell_state_version backed by paxos_persistent_log_version
  • replica/derived state -> checkpointed snapshots and replica copies
  • durability boundary -> paxos_persistent_log_version
  • visibility -> read_visible_master_state
  • actor trying to write/read/apply -> writer_trying_to_mutate_state

Trace #

time →

in_memory_cell_state_version   100 ------------ 101 pending ------- 101 ----------------
paxos_persistent_log_version   100 ------------ 101 -------------- 101 ----------------
checkpoint_version             95 ------------- 95 --------------- 100 snapshot -------
read_visible_master_state      100 ------------ 101 -------------- 101 ----------------
writer_trying_to_mutate_state  none ----------- submit job ------- commit state change -

What it teaches #

  • Borgmaster state is both in-memory and persisted via Paxos
  • the persistent log is the durable truth boundary for control-state mutation
  • checkpoints are periodic storage compaction/snapshot state, not the only source of truth

State loci #

  • authoritative:
    • elected Borgmaster plus Paxos-backed persistent store
  • local/replica:
    • each Borgmaster replica’s in-memory copy
  • cached/derived:
    • checkpoints and read/UI shards
  • repair:
    • Paxos replication and master failover recovery

Core invariant #

  • cell state mutations must not become authoritative unless recorded through the Paxos-backed persistent store

8. Storage Trace: Checkpoint + Change Log Recovery #

The paper explicitly describes checkpoint plus change log reconstruction.

Core lines #

  • checkpoint_base_version
  • change_log_tail_version
  • reconstructed_master_state_version
  • recovery_visibility
  • actor_trying_to_recover_master

Reusable frame mapping #

  • truth locus -> checkpoint_base_version plus change_log_tail_version
  • replica/derived state -> reconstructed_master_state_version
  • durability boundary -> replayed change-log tail after checkpoint
  • visibility -> recovery_visibility
  • actor trying to write/read/apply -> actor_trying_to_recover_master

Trace #

time →

checkpoint_base_version         95 ------------- 95 ------------- 100
change_log_tail_version         100 ------------ 101 ------------ 101
reconstructed_master_state_version 95 --------- 100 ------------ 101
recovery_visibility             recovering ----- partial -------- authoritative
actor_trying_to_recover_master  new master ----- replay log ----- serve traffic

What it teaches #

  • checkpoint is not enough alone
  • recovery reconstructs authoritative state by replaying the log tail after the checkpoint
  • visibility of the recovered master state is delayed until replay catches up

State loci #

  • authoritative:
    • checkpoint plus persisted change log in Paxos store
  • local/replica:
    • recovering master’s reconstructed in-memory state
  • cached/derived:
    • snapshot/checkpoint files used for replay
  • repair:
    • master recovery logic replaying log tail

Core invariant #

  • a recovering master must not serve authoritative state older than checkpoint-plus-replayed-log reconstruction

9. Storage Trace: Machine / Task Object View Synchronization #

The paper notes that Borglets report full state and link shards send diffs into state machines.

Core lines #

  • borglet_local_machine_view
  • master_object_view
  • last_applied_report_version
  • master_visible_machine_state
  • actor_trying_to_apply_report

Reusable frame mapping #

  • truth locus -> master_object_view
  • replica/derived state -> borglet_local_machine_view
  • durability/freshness boundary -> last_applied_report_version
  • visibility -> master_visible_machine_state
  • actor trying to write/read/apply -> actor_trying_to_apply_report

Trace #

time →

borglet_local_machine_view     v40 ------------ v41 health fail ---- v41
master_object_view             v40 ------------ v40 --------------- v41
last_applied_report_version    40 ------------- 40 ---------------- 41
master_visible_machine_state   healthy -------- healthy ----------- unhealthy
actor_trying_to_apply_report   none ----------- link shard ------- state machine apply

What it teaches #

  • observed state at the Borglet and authoritative control view at the master can lag
  • report application is a storage-style observed-state synchronization trace

State loci #

  • authoritative:
    • master object state machines
  • local/replica:
    • Borglet local machine/task state
  • cached/derived:
    • link shard buffered/diff-applied report view
  • repair:
    • subsequent full-state reports and state-machine reconciliation

Core invariant #

  • master-visible machine/task state must advance monotonically from applied Borglet reports and must not regress

Minimum Borg Trace Set #

If you only want the representative Borg traces, drill these:

  • process:

    • Task Run
    • Alloc
    • Preemption / Eviction
    • Admission Control
  • storage:

    • Borgmaster Control-State Store
    • Checkpoint + Change Log Recovery

That is enough to capture the main Borg shapes.