Distributed Build System (Bazel Remote Execution-class) #

This note models a Bazel remote-execution-class build system where clients submit build actions, a scheduler assigns them to executors, artifacts are stored by content address, and results are reused through caching.

Step 1 - Normalize #

Assume the baseline prompt is:

design a distributed build system
clients submit builds made of many actions
actions run remotely on executors
outputs are uploaded and reused through cache
system scales across many workers

Normalize into state-affecting paths.

Requirement	Actor	Operation	State touched	Priority
Client submits build / action graph	Client	append event	`S1` `create target` `BuildRequest`	C1
Scheduler records runnable action	System	append event	`S1` `create target` `ActionRequest`	C1
Executor claims action for execution	Client	state transition	`S1` `update target` `ActionExecutionState`	C1
Executor uploads action result / artifacts	Client	append event	`S1` `create target` `ContentBlob`	C1
System commits action result metadata	System	state transition	`S1` `update target` `ActionResult`	C1
Client or scheduler checks action cache / CAS	Client	read source	`S1` `read source target` `ActionResult`	R1
System retries timed-out or failed action	System	async process	`S1` `hidden write target` `ActionExecutionState`	C1
System routes queue/shard to current owner	System	read source	`S1` `read source target` `PartitionMap`	C1
System reassigns shard ownership after node failure	System	state transition	`S1` `update target` `PartitionOwnership`	C1
Client reads build status / queue stats	Client	read projection	`S1` `read projection target` `BuildStatusView`	R2

Notes on normalization #

Important choices:

build submission is append event
- build intent is immutable fact recording
runnable action creation is also append event
- action requests are materialized work items
executor claim is state transition
- action moves from ready to claimed/running
artifact upload is append event
- CAS/blob storage is immutable by digest
action result commit is state transition
- current execution/result state changes
retries and ownership are explicit because this is distributed infra

This system is a hybrid of:

claimable execution queue
content-addressed immutable storage
build graph orchestration

Step 2 - Critical Path Selection #

Requirement	Priority class	Why
Submit build / actions	C1	build intent and runnable work must not be lost
Claim action for execution	C1	duplicate uncontrolled execution wastes resources and breaks scheduling correctness
Upload artifacts / results	C1	output correctness and cacheability depend on immutable artifact storage
Commit action result metadata	C1	build progress and cache hits depend on correct result linkage
Check action cache / CAS	R1	major serving path for avoiding re-execution
Retry timed-out or failed action	C1	worker crash recovery and completion depend on safe requeue
Route to shard owner	C1	wrong routing can split claims and action state
Reassign shard ownership	C1	failover must preserve action lifecycle correctness
Build status / stats	R2	operational only

Baseline critical paths #

Main C1 paths:

P1 submit build / action requests
P2 claim action
P3 upload artifacts
P4 commit action result
P5 retry timed-out or failed action
P6 route to shard owner
P7 reassign shard ownership

Main R1 path:

P8 read action cache / CAS

This design is driven by:

immutable artifact addressing
one active executor claim per action attempt
guarded action completion
result reuse through cache keys and digests

Step 3 - Primary State Extraction #

For a distributed build system, the minimal primary state is the build request, runnable action, action execution lifecycle, immutable content blobs, action result metadata, and routing/ownership state.

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
BuildRequest	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	build_id
ActionRequest	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	action_id
ActionExecutionState	lifecycle object	Yes	keep as candidate	process	Yes	service	state machine	instance	action_id
ContentBlob	direct noun	Yes	keep as candidate	event	Yes	service	append-only	instance	digest
ActionResult	lifecycle object	Yes	keep as candidate	entity	Yes	service	overwrite	instance	action_digest
PartitionOwnership	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	shard_id
PartitionMap	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	collection	action shards
BuildStatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	build_id

Important modeling choices #

`BuildRequest` #

Primary because:

build submission is immutable request recording
captures target set, platform, and request metadata

`ActionRequest` #

Primary because:

scheduler materializes runnable work as distinct action items
action identity is central to remote execution

`ActionExecutionState` #

This is the main execution lifecycle object. It captures states like:

READY
CLAIMED(executor_id, claim_token, expiry)
RUNNING
SUCCEEDED
FAILED
RETRY_PENDING
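
The lifecycle above can be written as an explicit transition table. This is a minimal sketch: the state names come from the list above, but the set of allowed edges is an assumption (for example, that expired or failed attempts pass through RETRY_PENDING before returning to READY). A simpler design could requeue straight to READY.

```python
from enum import Enum


class State(Enum):
    READY = "READY"
    CLAIMED = "CLAIMED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    RETRY_PENDING = "RETRY_PENDING"


# Assumed edges; a real scheduler may allow more or fewer.
ALLOWED = {
    State.READY: {State.CLAIMED},
    State.CLAIMED: {State.RUNNING, State.SUCCEEDED, State.FAILED, State.RETRY_PENDING},
    State.RUNNING: {State.SUCCEEDED, State.FAILED, State.RETRY_PENDING},
    State.FAILED: {State.RETRY_PENDING},
    State.RETRY_PENDING: {State.READY},
    State.SUCCEEDED: set(),  # terminal
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if the assumed lifecycle permits src -> dst."""
    return dst in ALLOWED[src]
```

Keeping the edges in one table makes every guarded transition a lookup plus a token check, rather than logic scattered across handlers.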

`ContentBlob` #

Primary because:

artifacts are immutable and addressed by digest
CAS storage is one of the core correctness layers

`ActionResult` #

Primary because:

action cache lookup uses current result metadata
links action digest to output digests and execution metadata

`PartitionOwnership` / `PartitionMap` #

Needed because:

scheduler and executors must consistently route action claims and state updates to authoritative owners
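
A rough sketch of how routing and ownership could fit together. Hash-mod sharding, the `Owner` record, and the integer `epoch` used as a fencing token are all illustrative assumptions, not something the note prescribes.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Owner:
    node: str
    epoch: int  # hypothetical fencing token, bumped on every reassignment


def shard_for(action_id: str, num_shards: int) -> int:
    """Stable hash so every router maps an action to the same shard."""
    h = hashlib.sha256(action_id.encode()).digest()
    return int.from_bytes(h[:8], "big") % num_shards


class PartitionMap:
    def __init__(self, owners: Dict[int, Owner]) -> None:
        # Shard ids are assumed to be 0..n-1.
        self.owners = dict(owners)

    def route(self, action_id: str) -> Owner:
        return self.owners[shard_for(action_id, len(self.owners))]

    def reassign(self, shard_id: int, expected_epoch: int, new_node: str) -> bool:
        """Compare-and-swap on the epoch: only one reassignment can win."""
        cur = self.owners[shard_id]
        if cur.epoch != expected_epoch:
            return False  # another reassignment already happened
        self.owners[shard_id] = Owner(new_node, cur.epoch + 1)
        return True
```

A stale owner can then be rejected by comparing the epoch it presents against the one in the map.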

Minimal strict primary set #

The strongest minimal set is:

BuildRequest
ActionRequest
ActionExecutionState
ContentBlob
ActionResult
PartitionOwnership
PartitionMap

Step 4 - Hard Invariants #

For a Bazel remote-execution-class system, the hard invariants are about durable action existence, one active executor claim per action attempt, immutable blob addressing, and correct result linkage.

Path	Tier	Type	Invariant statement
`P1` submit build / action requests	HARD	uniqueness	Key `request_id` maps to at most one logical outcome `recorded build request` within build scope.
`P2` claim action	HARD	eligibility	Action `claim_action` is valid only if `ActionExecutionState(action_id) is READY` at decision time.
`P2` claim action	HARD	uniqueness	Key `action_id` maps to at most one logical outcome `active executor claim holder` within claim-timeout scope.
`P3` upload artifact	HARD	uniqueness	Key `digest` maps to at most one logical outcome `immutable content blob` within CAS scope.
`P4` commit action result	HARD	eligibility	Action `commit_action_result` is valid only if `ActionExecutionState(action_id)` is in a live claimed state (`CLAIMED`, or `RUNNING` if tracked separately) and the claim token matches the current executor at decision time.
`P4` commit action result	HARD	accounting	`ActionResult(action_digest)` references only existing content-addressed outputs and metadata committed for that action execution.
`P5` retry timed-out or failed action	HARD	eligibility	Action `requeue_action` is valid only if `ActionExecutionState(action_id)` is in a retryable or expired claimed state and claim token/epoch still matches at decision time.
`P6` route to shard owner	HARD	uniqueness	Key `shard_id` maps to at most one logical outcome `current authoritative owner` within `shard_id`.
`P7` reassign shard ownership	HARD	eligibility	Action `reassign_shard` is valid only if `current owner is failed or relinquished and candidate owner is eligible and sufficiently current` on `shard_id` at decision time.
`P8` read action cache	HARD	freshness	Read path `lookup_action_result` reflects authoritative committed `ActionResult` state within configured consistency bound.

What matters most #

1. One active executor claim per action #

This is the execution-control invariant.

2. CAS immutability by digest #

Artifacts with the same digest must mean the same bytes.

3. Action result must only point to valid blobs #

Result metadata cannot reference outputs that were never durably stored.

4. Duplicate execution may still happen under retry #

Exactly-once execution is not the baseline guarantee; the baseline is at-least-once. Systems rely on deterministic actions and cacheability so that a duplicate run is wasteful but harmless.

Step 5 - Execution Context #

For the strict baseline distributed build system:

Field	Value	Why
Topology	single service distributed	one logical build execution system spread across many schedulers and executors
Write coordination scope	per object scope	correctness is per `action_id`, `digest`, and shard ownership scope
Read consistency target	strong only	claim/commit/cache-result paths must use authoritative action/result state
Holder model	client	executors temporarily hold claimed actions
Compensation acceptable?	No	wrong claim/result linkage cannot be repaired safely after publication

Derived implications #

holder_may_crash = true
- executors can crash while holding claimed actions
cross_service_write = false
- baseline keeps action state, CAS metadata, and ownership in one logical service
bounded_staleness_allowed = false
- claim/result-commit paths need authoritative current state
cross_service_atomicity_required = false
- no multi-service transaction across unrelated services in baseline
exclusive_claim_required = true
- one executor should hold an action attempt at a time
guarded_by_current_state = true
- claim, commit, retry, and failover depend on current action state

What this implies #

This pushes us toward:

one authoritative owner per action shard
explicit action execution state machine
claim timeout modeled as lease-like executor ownership
immutable CAS storage with digest validation

Step 6 - Deterministic Mechanism Selection #

Path	Write shape	Base mechanism	Required companions
`P1` submit build / action requests	append-only event	append log	request id dedup
`P2` claim action	exclusive claim	lease	claim token, claim timeout
`P3` upload artifact	append-only event	content-addressed append	digest verification
`P4` commit action result	guarded state transition	compare-and-swap on `(state, version)` or leader-applied guarded transition	claim token, digest references
`P5` retry timed-out or failed action	guarded state transition	leader-applied guarded transition	claim token, timeout scan, retry policy
`P6` route to shard owner	exclusive claim	lease	fencing token, heartbeat
`P7` reassign shard ownership	guarded state transition	compare-and-swap on `(state, version)`	fencing token, shard catch-up check

Why these fit #

Build / action submission #

These are immutable workload facts, so append-only fits.

Action claim #

Executor ownership is temporary and exclusive, so lease semantics fit.

Artifact upload #

CAS storage is immutable by digest, so append-only content-addressed writes fit.

Result commit #

Committing an action result is not a blind write. It must verify:

current action state
current claim token
referenced output digests exist

So it is a guarded transition.

Canonical substrate implied #

The baseline now points to:

sharded build scheduler / action queue
one owner per action shard
immutable content-addressed blob store
per-action execution lifecycle state
guarded result commit after artifact upload

Step 7 - Read Model / Source of Truth #

For a Bazel remote-execution-class system, truth is mostly direct source state. Status dashboards are derived.

Concept	Truth	Read path	Rebuild path
`C1` build submission	`BuildRequest`	read source directly	authoritative build-request store
`C2` runnable action record	`ActionRequest`	read source directly	authoritative action store
`C3` current action lifecycle	`ActionExecutionState`	read source directly	authoritative execution-state store
`C4` immutable artifact bytes	`ContentBlob`	read source directly	authoritative CAS store
`C5` current action result / cache entry	`ActionResult`	read source directly	authoritative result store
`C6` shard ownership	`PartitionOwnership`	read source directly	authoritative ownership store
`C7` shard routing map	`PartitionMap`	read source directly	authoritative routing metadata
`C8` build status / queue stats	derived from action and result state	materialized view	recompute from authoritative state

Important point #

For the core semantics:

cache lookup reads authoritative ActionResult
output downloads read authoritative CAS blobs
claim/commit paths operate on authoritative ActionExecutionState
dashboards are projections

Step 8 - Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
`P1` submit build / actions	retry with `request_id` to avoid duplicate submission	competing builds coexist; dedup matters only for same logical request	committed submission survives owner crash if replicated past commit point	client may retry safely with idempotency key	n/a
`P2` claim action	retry claim safely; a retry may return a different action, or the same action later	only one active claim should win for an action at a time	if claim committed and executor crashes, action stays claimed until timeout or failover	n/a	stale executor fenced by claim token/epoch
`P3` upload artifact	retry by digest safely	duplicate uploads of same digest collapse to same immutable blob	committed blob survives storage node crash if durable	partial upload rejected until digest verified	n/a
`P4` commit action result	retry with same claim token and result metadata	stale/wrong claim token loses guarded transition	committed result survives crash if replicated past commit point	if result publish fails after blob upload, commit can be retried	stale holder cannot publish result after timeout and reassignment
`P5` retry timed-out or failed action	the timeout/retry scan is safe to rerun	only one requeue transition should win for current expired/retryable state	scanner crash delays recovery; next scan retries	n/a	old claim token becomes invalid after requeue
`P6` route to shard owner	retry after refreshing shard map	only one valid owner should exist	if owner changed, refreshed map points to new owner	n/a	stale owner rejected by fencing token
`P7` reassign shard ownership	retry failover transition safely	only one reassignment wins current ownership state	promoted owner crash triggers later reassignment	n/a	old owner fenced and must not continue serving

What matters most #

1. Claim-token fencing #

This is the scheduler/executor safety mechanism.

Bad case:

executor A claims action
action times out
executor B reclaims action
A later uploads result and tries to commit it

Without fencing:

A could incorrectly publish a stale result

So result commit must be tied to the current claim token.
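
The bad case above can be replayed in a few lines. This is a toy, single-process sketch; the `Action` record, the integer tokens, and the function names are illustrative assumptions.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

_tokens = itertools.count(1)  # monotonically increasing claim tokens


@dataclass
class Action:
    state: str = "READY"
    claim_token: Optional[int] = None
    result: Optional[str] = None


def claim(action: Action) -> int:
    """READY -> CLAIMED; returns the fresh claim token."""
    if action.state != "READY":
        raise RuntimeError("action is not READY")
    action.state = "CLAIMED"
    action.claim_token = next(_tokens)
    return action.claim_token


def expire(action: Action) -> None:
    """Timeout path: invalidate the token and make the action runnable again."""
    action.state = "READY"
    action.claim_token = None


def commit(action: Action, token: int, result: str) -> bool:
    """Guarded commit: succeeds only for the current claim token."""
    if action.state != "CLAIMED" or action.claim_token != token:
        return False  # stale holder is fenced
    action.state = "SUCCEEDED"
    action.result = result
    return True


action = Action()
token_a = claim(action)   # executor A claims
expire(action)            # A's claim times out
token_b = claim(action)   # executor B reclaims
stale_ok = commit(action, token_a, "result-from-A")
fresh_ok = commit(action, token_b, "result-from-B")
print(stale_ok, fresh_ok, action.result)  # False True result-from-B
```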

2. CAS first, result commit second #

Artifact bytes should be durably stored and digest-verified before publishing ActionResult.

3. Duplicate execution is tolerated #

If an executor crashes after doing work but before result commit:

another executor may rerun the action

The system relies on deterministic execution and cache keys, not exactly-once execution.

Step 9 - Scale Adjustments #

Hotspot	Type	First response
hot scheduler shard with many claims	contention hotspot	increase shard count and isolate large builds or hot queues
CAS upload bandwidth	write throughput hotspot	separate blob service, multipart upload, and locality-aware caches
large fan-in artifact downloads	read hotspot	add edge/cache layers close to executors
retry storms after executor fleet issue	contention hotspot	exponential backoff, jitter, and executor health draining
delayed action requeues / timeout scans	write throughput hotspot	bucket expiries and scan incrementally
build status queries	read hotspot	keep them as materialized projections only
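
For the retry-storm row, one common shape is capped exponential backoff with full jitter. A minimal sketch; the `base` and `cap` defaults are placeholder values, not tuned recommendations.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    r = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return r.uniform(0.0, ceiling)
```

Spreading retries uniformly under the ceiling keeps a recovering executor fleet from hitting the scheduler shards in synchronized waves.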

What scales well #

This system scales by:

sharding action scheduling
giving each shard one authoritative owner
keeping claim/commit local to that owner
storing blobs in content-addressed immutable storage that can scale independently

What fails first #

Usually:

scheduler shard hot spots
CAS bandwidth bottlenecks
stale-result fencing mistakes
retry storms after executor failures

Canonical design conclusion #

The mechanical outcome is:

primary state:
- BuildRequest
- ActionRequest
- ActionExecutionState
- ContentBlob
- ActionResult
- PartitionOwnership
- PartitionMap
critical invariants:
- durable build/action submission
- one active executor claim per action attempt
- immutable CAS blobs by digest
- result commit valid only for current claim token and existing output digests
- timed-out or retryable actions become runnable again safely
mechanisms:
- append log
- lease
- content-addressed immutable storage
- guarded result commit / retry transitions
- fenced shard ownership
reads:
- direct authoritative reads for action/result/CAS state
- projections only for build status and queue stats

Polished interview answer #

“I’d design the distributed build system as a sharded remote-execution service with one leader per action shard. Build submission records immutable build and action requests. Executors reserve actions through a lease-like claim that issues a claim token and expiry, upload outputs into a content-addressed blob store by digest, and then commit action results through a guarded transition that succeeds only for the current claim token and only if referenced output digests already exist. If an executor crashes or times out, the action is safely requeued for at-least-once execution, while cache hits come from authoritative action-result metadata plus immutable CAS blobs. The main scaling levers are more scheduler shards, independent CAS scaling, locality-aware caches, and retry backoff control.”

Concrete Substrate #

I’ll choose a service-owned sharded build scheduler with executor fleet plus separate content-addressed storage as the concrete baseline, because it matches the mechanics we derived:

append-only build/action submission
lease-like action claim
immutable CAS artifact uploads
guarded action-result commit
one owner per shard

Concrete tech family:

scheduler service in Go or Java
local durable metadata storage:
- RocksDB or replicated metadata DB
shard replication:
- Raft or leader-follower replication with commit index
CAS storage:
- object storage or blob service addressed by digest
metadata/control:
- etcd or internal metadata quorum

Each shard leader stores:

action records
current ActionExecutionState
ready-action queue
claim-expiry index
build-to-action tracking metadata

CAS stores:

immutable input/output blobs
directory trees / manifests by digest

Operation Layer #

1. Submit build #

API

SubmitBuild(build_request, request_id?)

Initiator

client/build tool

Entry point

gateway or scheduler frontend

Authoritative decider

scheduler shard owner(s)

Precondition

request_id is optional; when present it is used to deduplicate retries of the same submission

Transition

append BuildRequest
materialize runnable ActionRequest records for ready actions

Response

{build_id}
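
A minimal sketch of the dedup behaviour, assuming an in-memory log and a `request_id -> build_id` index. The `BuildLog` class is hypothetical, and action materialization is left out.

```python
import uuid
from typing import Dict, List, Optional


class BuildLog:
    def __init__(self) -> None:
        self.log: List[dict] = []                 # append-only BuildRequest records
        self._by_request_id: Dict[str, str] = {}  # request_id -> build_id

    def submit_build(self, build_request: dict,
                     request_id: Optional[str] = None) -> str:
        # A retried submission with the same request_id returns the original build.
        if request_id is not None and request_id in self._by_request_id:
            return self._by_request_id[request_id]
        build_id = uuid.uuid4().hex
        self.log.append({"build_id": build_id, "request": build_request})
        if request_id is not None:
            self._by_request_id[request_id] = build_id
        return build_id
```

Submissions without a `request_id` are never deduplicated in this sketch, so each call appends a new record.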

2. Claim action #

API

ReserveAction(worker_capabilities, max_actions, lease_timeout)

Initiator

executor/client

Entry point

scheduler frontend or shard leader

Authoritative decider

shard leader owning the action

Precondition

action chosen must currently be READY
executor capabilities satisfy action requirements

Transition

selected action:
- READY -> CLAIMED(executor_id, claim_token, expiry)
remove from ready queue
insert into claim-expiry index

Response

{actions: [{action_id, command_digest, input_root_digest, claim_token, lease_expiry}]}
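
The claim transition can be sketched on a single shard as below. Capability matching is omitted, time is passed in as `now` for testability, and the `Shard` / `ActionState` classes are illustrative assumptions.

```python
import heapq
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class ActionState:
    action_id: str
    state: str = "READY"
    executor_id: Optional[str] = None
    claim_token: Optional[int] = None
    expiry: Optional[float] = None


class Shard:
    def __init__(self) -> None:
        self.actions: Dict[str, ActionState] = {}
        self.ready: Deque[str] = deque()
        # Min-heap of (expiry, claim_token, action_id) for the timeout scan.
        self.expiry_index: List[Tuple[float, int, str]] = []
        self._tokens = itertools.count(1)

    def add_ready(self, action_id: str) -> None:
        self.actions[action_id] = ActionState(action_id)
        self.ready.append(action_id)

    def reserve(self, executor_id: str, max_actions: int,
                lease_timeout: float, now: float) -> List[ActionState]:
        claimed = []
        while self.ready and len(claimed) < max_actions:
            a = self.actions[self.ready.popleft()]  # remove from ready queue
            if a.state != "READY":
                continue  # guard: only READY actions may be claimed
            a.state = "CLAIMED"
            a.executor_id = executor_id
            a.claim_token = next(self._tokens)
            a.expiry = now + lease_timeout
            heapq.heappush(self.expiry_index, (a.expiry, a.claim_token, a.action_id))
            claimed.append(a)
        return claimed
```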

3. Upload artifact #

API

PutBlob(digest, bytes)

Initiator

executor/client

Entry point

CAS service

Authoritative decider

CAS blob store

Precondition

uploaded bytes hash to stated digest

Transition

store immutable ContentBlob(digest)

Response

success / already exists
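
A small in-memory sketch of the digest check and the idempotent "already exists" response. SHA-256 hex digests and the `BlobStore` class are assumptions; a real CAS would also track blob size and durability.

```python
import hashlib
from typing import Dict


class DigestMismatch(Exception):
    pass


class BlobStore:
    """In-memory stand-in for a CAS keyed by SHA-256 hex digest."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_blob(self, digest: str, data: bytes) -> str:
        actual = hashlib.sha256(data).hexdigest()
        if actual != digest:
            # Precondition failed: bytes do not hash to the stated digest.
            raise DigestMismatch(f"stated {digest}, computed {actual}")
        if digest in self._blobs:
            return "already exists"  # duplicate uploads collapse to one blob
        self._blobs[digest] = data
        return "success"

    def has(self, digest: str) -> bool:
        return digest in self._blobs
```

Because the key is derived from the bytes, a retried upload can never overwrite a blob with different content.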

4. Commit action result #

API

CommitActionResult(action_id, claim_token, result_metadata)

Initiator

executor/client

Entry point

scheduler frontend or shard leader

Authoritative decider

shard leader owning the action

Precondition

current ActionExecutionState is CLAIMED
claim_token matches current holder
referenced output digests exist in CAS

Transition

CLAIMED -> SUCCEEDED
overwrite ActionResult(action_digest)

Response

success
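
The three preconditions can be checked in one guarded method on the shard leader. This is a sketch under assumptions: `ShardLeader`, the string return values, and the set of known CAS digests are illustrative, and the idempotent-retry branch is one possible way to honour "retry with same claim token and result metadata".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ActionState:
    action_digest: str
    state: str = "CLAIMED"
    claim_token: Optional[int] = None


@dataclass
class ShardLeader:
    cas_digests: Set[str]  # digests known to exist in CAS
    actions: Dict[str, ActionState] = field(default_factory=dict)
    results: Dict[str, List[str]] = field(default_factory=dict)  # action_digest -> outputs

    def commit_action_result(self, action_id: str, claim_token: int,
                             output_digests: List[str]) -> str:
        a = self.actions.get(action_id)
        if a is None:
            return "rejected: unknown action"
        # Idempotent retry of a commit that was already applied.
        if (a.state == "SUCCEEDED" and a.claim_token == claim_token
                and self.results.get(a.action_digest) == list(output_digests)):
            return "success"
        if a.state != "CLAIMED":
            return "rejected: not claimed"
        if a.claim_token != claim_token:
            return "rejected: stale claim token"
        if any(d not in self.cas_digests for d in output_digests):
            return "rejected: missing blobs"
        a.state = "SUCCEEDED"
        self.results[a.action_digest] = list(output_digests)
        return "success"
```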

5. Requeue timed-out or failed action #

API

internal background process

Initiator

system

Entry point

shard leader

Authoritative decider

shard leader

Precondition

action is still CLAIMED and expired
or action is failed and retryable by policy

Transition

move action back to READY
clear current claim token
reinsert into ready queue
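
The expired-claim branch of the scan could look like this, assuming a min-heap claim-expiry index of `(expiry, claim_token, action_id)` entries and plain dicts for action state. The failed-and-retryable branch and any retry-count policy are left out.

```python
import heapq
from typing import Dict, List, Tuple


def requeue_expired(actions: Dict[str, dict],
                    expiry_index: List[Tuple[float, int, str]],
                    ready: List[str],
                    now: float) -> List[str]:
    """Pop expired claims and requeue those whose token is still current."""
    requeued = []
    while expiry_index and expiry_index[0][0] <= now:
        _, token, action_id = heapq.heappop(expiry_index)
        a = actions[action_id]
        # Skip stale index entries: the action was committed or re-claimed since.
        if a["state"] != "CLAIMED" or a["claim_token"] != token:
            continue
        a["state"] = "READY"
        a["claim_token"] = None  # old token can no longer commit
        ready.append(action_id)
        requeued.append(action_id)
    return requeued


actions = {
    "a1": {"state": "CLAIMED", "claim_token": 1},
    "a2": {"state": "SUCCEEDED", "claim_token": 2},
    "a3": {"state": "CLAIMED", "claim_token": 3},
}
index = [(10.0, 1, "a1"), (11.0, 2, "a2"), (50.0, 3, "a3")]
heapq.heapify(index)
ready: List[str] = []
print(requeue_expired(actions, index, ready, now=20.0))  # ['a1']
```

Checking the token against current state is what makes the scan safe to rerun after a scanner crash.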

6. Lookup cached result #

API

GetActionResult(action_digest)

Initiator

client/build tool or scheduler

Entry point

result service / shard owner

Authoritative decider

authoritative result store

Precondition

none

Transition

none

Response

result metadata if present
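
A simplified sketch of the lookup. Two assumptions are baked in: the cache key is derived by hashing the command and input-root digests (real remote-execution protocols hash a richer action description), and a result whose outputs are missing from CAS is treated as a miss, which is a policy choice rather than something the note specifies.

```python
import hashlib
from typing import Dict, List, Optional, Set


def action_digest(command_digest: str, input_root_digest: str) -> str:
    """Simplified cache key: hash of the command and input-root digests."""
    payload = f"{command_digest}\n{input_root_digest}".encode()
    return hashlib.sha256(payload).hexdigest()


def get_action_result(results: Dict[str, List[str]], cas: Set[str],
                      digest: str) -> Optional[List[str]]:
    """Return output digests on a hit, or None on a miss."""
    outputs = results.get(digest)
    if outputs is None:
        return None  # no committed result for this action digest
    if any(d not in cas for d in outputs):
        return None  # dangling outputs: fall back to re-execution (assumed policy)
    return outputs
```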

Entry Point vs Decider vs Responder #

Path	Entry point	Authoritative decider	Physical responder	Logical responder
`SubmitBuild`	gateway / frontend	scheduler shard leader	leader or front node	build system
`ReserveAction`	frontend / shard leader	shard leader	leader or front node	build system
`PutBlob`	CAS endpoint	CAS blob store	blob node	build system
`CommitActionResult`	frontend / shard leader	shard leader	leader or front node	build system
retry / requeue	shard leader	shard leader	internal	build system
shard failover	follower / coordination layer	shard quorum / lease store	new leader / control plane	build system

Concrete HLD #

Main components:

build frontend / API gateway
- receives build submissions
scheduler shard leaders
- authoritative owners for action lifecycle
- manage ready, claimed, and retry indexes
executor fleet
- pulls runnable actions
- downloads inputs and uploads outputs
CAS / blob store
- stores immutable inputs and outputs by digest
metadata/control service
- tracks shard ownership and routing
background timeout / retry workers
- usually run on shard leaders

Short Interview Version #

I’d build the distributed build system as a sharded remote-execution service with one scheduler leader per action shard plus a separate content-addressed blob store. Build submission records immutable build and action requests. Executors reserve actions through a lease-like claim with a claim token and expiry, execute remotely, upload outputs to CAS by digest, and then publish action results through a guarded transition that succeeds only for the current claim token and only if referenced output digests exist. If an executor crashes or times out, the action is safely requeued for at-least-once execution, while cache hits come from authoritative action-result metadata plus immutable CAS blobs.
