- My Development Notes/
- System Design Components/
- Distributed Build System (Bazel Remote Execution-class)/
Distributed Build System (Bazel Remote Execution-class)
Distributed Build System (Bazel Remote Execution-class) #
This note models a Bazel remote-execution-class build system where clients submit build actions, a scheduler assigns them to executors, artifacts are stored by content address, and results are reused through caching.
Step 1 - Normalize #
Assume the baseline prompt is:
- design a distributed build system
- clients submit builds made of many actions
- actions run remotely on executors
- outputs are uploaded and reused through cache
- system scales across many workers
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client submits build / action graph | Client | append event | S1create targetBuildRequest | C1 |
| Scheduler records runnable action | System | append event | S1create targetActionRequest | C1 |
| Executor claims action for execution | Client | state transition | S1update targetActionExecutionState | C1 |
| Executor uploads action result / artifacts | Client | append event | S1create targetContentBlob | C1 |
| System commits action result metadata | System | state transition | S1update targetActionResult | C1 |
| Client or scheduler checks action cache / CAS | Client | read source | S1read source targetActionResult | R1 |
| System retries timed-out or failed action | System | async process | S1hidden write targetActionExecutionState | C1 |
| System routes queue/shard to current owner | System | read source | S1read source targetPartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1update targetPartitionOwnership | C1 |
| Client reads build status / queue stats | Client | read projection | S1read projection targetBuildStatusView | R2 |
Notes on normalization #
Important choices:
- build submission is
append event- build intent is immutable fact recording
- runnable action creation is also
append event- action requests are materialized work items
- executor claim is
state transition- action moves from ready to claimed/running
- artifact upload is
append event- CAS/blob storage is immutable by digest
- action result commit is
state transition- current execution/result state changes
- retries and ownership are explicit because this is distributed infra
This system is a hybrid of:
claimable execution queuecontent-addressed immutable storagebuild graph orchestration
Step 2 - Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Submit build / actions | C1 | build intent and runnable work must not be lost |
| Claim action for execution | C1 | duplicate uncontrolled execution wastes resources and breaks scheduling correctness |
| Upload artifacts / results | C1 | output correctness and cacheability depend on immutable artifact storage |
| Commit action result metadata | C1 | build progress and cache hits depend on correct result linkage |
| Check action cache / CAS | R1 | major serving path for avoiding re-execution |
| Retry timed-out or failed action | C1 | worker crash recovery and completion depend on safe requeue |
| Route to shard owner | C1 | wrong routing can split claims and action state |
| Reassign shard ownership | C1 | failover must preserve action lifecycle correctness |
| Build status / stats | R2 | operational only |
Baseline critical paths #
Main C1 paths:
P1submit build / action requestsP2claim actionP3upload artifactsP4commit action resultP5retry timed-out or failed actionP6route to shard ownerP7reassign shard ownership
Main R1 path:
P8read action cache / CAS
This design is driven by:
- immutable artifact addressing
- one active executor claim per action attempt
- guarded action completion
- result reuse through cache keys and digests
Step 3 - Primary State Extraction #
For a distributed build system, the minimal primary state is the build request, runnable action, action execution lifecycle, immutable content blobs, action result metadata, and routing/ownership state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| BuildRequest | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | build_id |
| ActionRequest | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | action_id |
| ActionExecutionState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | action_id |
| ContentBlob | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | digest |
| ActionResult | lifecycle object | Yes | keep as candidate | entity | Yes | service | overwrite | instance | action_digest |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | action shards |
| BuildStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | build_id |
Important modeling choices #
BuildRequest #
Primary because:
- build submission is immutable request recording
- captures target set, platform, and request metadata
ActionRequest #
Primary because:
- scheduler materializes runnable work as distinct action items
- action identity is central to remote execution
ActionExecutionState #
This is the main execution lifecycle object. It captures states like:
READYCLAIMED(executor_id, claim_token, expiry)RUNNINGSUCCEEDEDFAILEDRETRY_PENDING
ContentBlob #
Primary because:
- artifacts are immutable and addressed by digest
- CAS storage is one of the core correctness layers
ActionResult #
Primary because:
- action cache lookup uses current result metadata
- links action digest to output digests and execution metadata
PartitionOwnership / PartitionMap #
Needed because:
- scheduler and executors must consistently route action claims and state updates to authoritative owners
Minimal strict primary set #
The strongest minimal set is:
BuildRequestActionRequestActionExecutionStateContentBlobActionResultPartitionOwnershipPartitionMap
Step 4 - Hard Invariants #
For a Bazel remote-execution-class system, the hard invariants are about durable action existence, one active executor claim per action attempt, immutable blob addressing, and correct result linkage.
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
P1 submit build / action requests | HARD | uniqueness | Key build request_id maps to at most one logical outcome recorded build request within build scope. |
P2 claim action | HARD | eligibility | Action claim_action is valid only if ActionExecutionState(action_id) is READY at decision time. |
P2 claim action | HARD | uniqueness | Key action_id maps to at most one logical outcome active executor claim holder within claim-timeout scope. |
P3 upload artifact | HARD | uniqueness | Key digest maps to at most one logical outcome immutable content blob within CAS scope. |
P4 commit action result | HARD | eligibility | Action commit_action_result is valid only if ActionExecutionState(action_id) is in a current successful terminalizing state and claim token matches current executor at decision time. |
P4 commit action result | HARD | accounting | ActionResult(action_digest) references only existing content-addressed outputs and metadata committed for that action execution. |
P5 retry timed-out or failed action | HARD | eligibility | Action requeue_action is valid only if ActionExecutionState(action_id) is in a retryable or expired claimed state and claim token/epoch still matches at decision time. |
P6 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one logical outcome current authoritative owner within shard_id. |
P7 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
P8 read action cache | HARD | freshness | Read path lookup_action_result reflects authoritative committed ActionResult state within configured consistency bound. |
What matters most #
1. One active executor claim per action #
This is the execution-control invariant.
2. CAS immutability by digest #
Artifacts with the same digest must mean the same bytes.
3. Action result must only point to valid blobs #
Result metadata cannot reference outputs that were never durably stored.
4. Duplicate execution may still happen under retry #
At-most-once execution is usually not the baseline guarantee. Systems rely on deterministic actions and cacheability rather than exactly-once execution.
Step 5 - Execution Context #
For the strict baseline distributed build system:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical build execution system spread across many schedulers and executors |
| Write coordination scope | per object scope | correctness is per action_id, digest, and shard ownership scope |
| Read consistency target | strong only | claim/commit/cache-result paths must use authoritative action/result state |
| Holder model | client | executors temporarily hold claimed actions |
| Compensation acceptable? | No | wrong claim/result linkage cannot be repaired safely after publication |
Derived implications #
holder_may_crash = true- executors can crash while holding claimed actions
cross_service_write = false- baseline keeps action state, CAS metadata, and ownership in one logical service
bounded_staleness_allowed = false- claim/result-commit paths need authoritative current state
cross_service_atomicity_required = false- no multi-service transaction across unrelated services in baseline
exclusive_claim_required = true- one executor should hold an action attempt at a time
guarded_by_current_state = true- claim, commit, retry, and failover depend on current action state
What this implies #
This pushes us toward:
- one authoritative owner per action shard
- explicit action execution state machine
- claim timeout modeled as lease-like executor ownership
- immutable CAS storage with digest validation
Step 6 - Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
P1 submit build / action requests | append-only event | append log | request id dedup |
P2 claim action | exclusive claim | lease | claim token, claim timeout |
P3 upload artifact | append-only event | content-addressed append | digest verification |
P4 commit action result | guarded state transition | CAS on (state, version) or leader-applied guarded transition | claim token, digest references |
P5 retry timed-out or failed action | guarded state transition | leader-applied guarded transition | claim token, timeout scan, retry policy |
P6 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
P7 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
Why these fit #
Build / action submission #
These are immutable workload facts, so append-only fits.
Action claim #
Executor ownership is temporary and exclusive, so lease semantics fit.
Artifact upload #
CAS storage is immutable by digest, so append-only content-addressed writes fit.
Result commit #
Committing an action result is not a blind write. It must verify:
- current action state
- current claim token
- referenced output digests exist
So it is a guarded transition.
Canonical substrate implied #
The baseline now points to:
- sharded build scheduler / action queue
- one owner per action shard
- immutable content-addressed blob store
- per-action execution lifecycle state
- guarded result commit after artifact upload
Step 7 - Read Model / Source of Truth #
For a Bazel remote-execution-class system, truth is mostly direct source state. Status dashboards are derived.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
C1 build submission | BuildRequest | read source directly | authoritative build-request store |
C2 runnable action record | ActionRequest | read source directly | authoritative action store |
C3 current action lifecycle | ActionExecutionState | read source directly | authoritative execution-state store |
C4 immutable artifact bytes | ContentBlob | read source directly | authoritative CAS store |
C5 current action result / cache entry | ActionResult | read source directly | authoritative result store |
C6 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
C7 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
C8 build status / queue stats | derived from action and result state | materialized view | recompute from authoritative state |
Important point #
For the core semantics:
- cache lookup reads authoritative
ActionResult - output downloads read authoritative CAS blobs
- claim/commit paths operate on authoritative
ActionExecutionState - dashboards are projections
Step 8 - Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
P1 submit build / actions | retry with request_id to avoid duplicate submission | competing builds coexist; dedup matters only for same logical request | committed submission survives owner crash if replicated past commit point | client may retry safely with idempotency key | n/a |
P2 claim action | retry claim safely; may return another action or later same action | only one active claim should win for an action at a time | if claim committed and executor crashes, action stays claimed until timeout or failover | n/a | stale executor fenced by claim token/epoch |
P3 upload artifact | retry by digest safely | duplicate uploads of same digest collapse to same immutable blob | committed blob survives storage node crash if durable | partial upload rejected until digest verified | n/a |
P4 commit action result | retry with same claim token and result metadata | stale/wrong claim token loses guarded transition | committed result survives crash if replicated past commit point | if result publish fails after blob upload, commit can be retried | stale holder cannot publish result after timeout and reassignment |
P5 retry timed-out or failed action | timeout/retry scan retry safe | only one requeue transition should win for current expired/retryable state | scanner crash delays recovery; next scan retries | n/a | old claim token becomes invalid after requeue |
P6 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
P7 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
What matters most #
1. Claim-token fencing #
This is the scheduler/executor safety mechanism.
Bad case:
- executor A claims action
- action times out
- executor B reclaims action
- A later uploads result and tries to commit it
Without fencing:
- A could incorrectly publish a stale result
So result commit must be tied to the current claim token.
2. CAS first, result commit second #
Artifact bytes should be durably stored and digest-verified before publishing ActionResult.
3. Duplicate execution is tolerated #
If an executor crashes after doing work but before result commit:
- another executor may rerun the action
The system relies on deterministic execution and cache keys, not exactly-once execution.
Step 9 - Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| hot scheduler shard with many claims | contention hotspot | increase shard count and isolate large builds or hot queues |
| CAS upload bandwidth | write throughput hotspot | separate blob service, multipart upload, and locality-aware caches |
| large fan-in artifact downloads | read hotspot | add edge/cache layers close to executors |
| retry storms after executor fleet issue | contention hotspot | exponential backoff, jitter, and executor health draining |
| delayed action requeues / timeout scans | write throughput hotspot | bucket expiries and scan incrementally |
| build status queries | read hotspot | keep them as materialized projections only |
What scales well #
This system scales by:
- sharding action scheduling
- giving each shard one authoritative owner
- keeping claim/commit local to that owner
- storing blobs in content-addressed immutable storage that can scale independently
What fails first #
Usually:
- scheduler shard hot spots
- CAS bandwidth bottlenecks
- stale-result fencing mistakes
- retry storms after executor failures
Canonical design conclusion #
The mechanical outcome is:
- primary state:
BuildRequestActionRequestActionExecutionStateContentBlobActionResultPartitionOwnershipPartitionMap
- critical invariants:
- durable build/action submission
- one active executor claim per action attempt
- immutable CAS blobs by digest
- result commit valid only for current claim token and existing output digests
- timed-out or retryable actions become runnable again safely
- mechanisms:
append loglease- content-addressed immutable storage
- guarded result commit / retry transitions
- fenced shard ownership
- reads:
- direct authoritative reads for action/result/CAS state
- projections only for build status and queue stats
Polished interview answer #
“I’d design the distributed build system as a sharded remote-execution service with one leader per action shard. Build submission records immutable build and action requests. Executors reserve actions through a lease-like claim that issues a claim token and expiry, upload outputs into a content-addressed blob store by digest, and then commit action results through a guarded transition that succeeds only for the current claim token and only if referenced output digests already exist. If an executor crashes or times out, the action is safely requeued for at-least-once execution, while cache hits come from authoritative action-result metadata plus immutable CAS blobs. The main scaling levers are more scheduler shards, independent CAS scaling, locality-aware caches, and retry backoff control.”
Concrete Substrate #
I’ll choose a service-owned sharded build scheduler with executor fleet plus separate content-addressed storage as the concrete baseline, because it matches the mechanics we derived:
- append-only build/action submission
- lease-like action claim
- immutable CAS artifact uploads
- guarded action-result commit
- one owner per shard
Concrete tech family:
- scheduler service in
GoorJava - local durable metadata storage:
RocksDBor replicated metadata DB
- shard replication:
Raftor leader-follower replication with commit index
- CAS storage:
- object storage or blob service addressed by digest
- metadata/control:
etcdor internal metadata quorum
Each shard leader stores:
- action records
- current
ActionExecutionState - ready-action queue
- claim-expiry index
- build-to-action tracking metadata
CAS stores:
- immutable input/output blobs
- directory trees / manifests by digest
Operation Layer #
1. Submit build #
API
SubmitBuild(build_request, request_id?)
Initiator
- client/build tool
Entry point
- gateway or scheduler frontend
Authoritative decider
- scheduler shard owner(s)
Precondition
- request id optional for dedup
Transition
- append
BuildRequest - materialize runnable
ActionRequestrecords for ready actions
Response
{build_id}
2. Claim action #
API
ReserveAction(worker_capabilities, max_actions, lease_timeout)
Initiator
- executor/client
Entry point
- scheduler frontend or shard leader
Authoritative decider
- shard leader owning the action
Precondition
- action chosen must currently be
READY - executor capabilities satisfy action requirements
Transition
- selected action:
READY -> CLAIMED(executor_id, claim_token, expiry)
- remove from ready queue
- insert into claim-expiry index
Response
{actions: [{action_id, command_digest, input_root_digest, claim_token, lease_expiry}]}
3. Upload artifact #
API
PutBlob(digest, bytes)
Initiator
- executor/client
Entry point
- CAS service
Authoritative decider
- CAS blob store
Precondition
- uploaded bytes hash to stated digest
Transition
- store immutable
ContentBlob(digest)
Response
- success / already exists
4. Commit action result #
API
CommitActionResult(action_id, claim_token, result_metadata)
Initiator
- executor/client
Entry point
- scheduler frontend or shard leader
Authoritative decider
- shard leader owning the action
Precondition
- current
ActionExecutionStateisCLAIMED claim_tokenmatches current holder- referenced output digests exist in CAS
Transition
CLAIMED -> SUCCEEDED- overwrite
ActionResult(action_digest)
Response
- success
5. Requeue timed-out or failed action #
API
- internal background process
Initiator
- system
Entry point
- shard leader
Authoritative decider
- shard leader
Precondition
- action is still
CLAIMEDand expired - or action is failed and retryable by policy
Transition
- move action back to
READY - clear current claim token
- reinsert into ready queue
6. Lookup cached result #
API
GetActionResult(action_digest)
Initiator
- client/build tool or scheduler
Entry point
- result service / shard owner
Authoritative decider
- authoritative result store
Precondition
- none
Transition
- none
Response
- result metadata if present
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
SubmitBuild | gateway / frontend | scheduler shard leader | leader or front node | build system |
ReserveAction | frontend / shard leader | shard leader | leader or front node | build system |
PutBlob | CAS endpoint | CAS blob store | blob node | build system |
CommitActionResult | frontend / shard leader | shard leader | leader or front node | build system |
| retry / requeue | shard leader | shard leader | internal | build system |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | build system |
Concrete HLD #
Main components:
- build frontend / API gateway
- receives build submissions
- scheduler shard leaders
- authoritative owners for action lifecycle
- manage ready, claimed, and retry indexes
- executor fleet
- pulls runnable actions
- downloads inputs and uploads outputs
- CAS / blob store
- stores immutable inputs and outputs by digest
- metadata/control service
- tracks shard ownership and routing
- background timeout / retry workers
- usually run on shard leaders
Short Interview Version #
I’d build the distributed build system as a sharded remote-execution service with one scheduler leader per action shard plus a separate content-addressed blob store. Build submission records immutable build and action requests. Executors reserve actions through a lease-like claim with a claim token and expiry, execute remotely, upload outputs to CAS by digest, and then publish action results through a guarded transition that succeeds only for the current claim token and only if referenced output digests exist. If an executor crashes or times out, the action is safely requeued for at-least-once execution, while cache hits come from authoritative action-result metadata plus immutable CAS blobs.