Skip to main content
  1. System Design Components/

Distributed Build System (Bazel Remote Execution-class)

Distributed Build System (Bazel Remote Execution-class) #

This note models a Bazel remote-execution-class build system where clients submit build actions, a scheduler assigns them to executors, artifacts are stored by content address, and results are reused through caching.


Step 1 - Normalize #

Assume the baseline prompt is:

  • design a distributed build system
  • clients submit builds made of many actions
  • actions run remotely on executors
  • outputs are uploaded and reused through cache
  • system scales across many workers

Normalize into state-affecting paths.

RequirementActorOperationState touchedPriority
Client submits build / action graphClientappend eventS1
create target
BuildRequest
C1
Scheduler records runnable actionSystemappend eventS1
create target
ActionRequest
C1
Executor claims action for executionClientstate transitionS1
update target
ActionExecutionState
C1
Executor uploads action result / artifactsClientappend eventS1
create target
ContentBlob
C1
System commits action result metadataSystemstate transitionS1
update target
ActionResult
C1
Client or scheduler checks action cache / CASClientread sourceS1
read source target
ActionResult
R1
System retries timed-out or failed actionSystemasync processS1
hidden write target
ActionExecutionState
C1
System routes queue/shard to current ownerSystemread sourceS1
read source target
PartitionMap
C1
System reassigns shard ownership after node failureSystemstate transitionS1
update target
PartitionOwnership
C1
Client reads build status / queue statsClientread projectionS1
read projection target
BuildStatusView
R2

Notes on normalization #

Important choices:

  • build submission is append event
    • build intent is immutable fact recording
  • runnable action creation is also append event
    • action requests are materialized work items
  • executor claim is state transition
    • action moves from ready to claimed/running
  • artifact upload is append event
    • CAS/blob storage is immutable by digest
  • action result commit is state transition
    • current execution/result state changes
  • retries and ownership are explicit because this is distributed infra

This system is a hybrid of:

  • claimable execution queue
  • content-addressed immutable storage
  • build graph orchestration

Step 2 - Critical Path Selection #

RequirementPriority classWhy
Submit build / actionsC1build intent and runnable work must not be lost
Claim action for executionC1duplicate uncontrolled execution wastes resources and breaks scheduling correctness
Upload artifacts / resultsC1output correctness and cacheability depend on immutable artifact storage
Commit action result metadataC1build progress and cache hits depend on correct result linkage
Check action cache / CASR1major serving path for avoiding re-execution
Retry timed-out or failed actionC1worker crash recovery and completion depend on safe requeue
Route to shard ownerC1wrong routing can split claims and action state
Reassign shard ownershipC1failover must preserve action lifecycle correctness
Build status / statsR2operational only

Baseline critical paths #

Main C1 paths:

  • P1 submit build / action requests
  • P2 claim action
  • P3 upload artifacts
  • P4 commit action result
  • P5 retry timed-out or failed action
  • P6 route to shard owner
  • P7 reassign shard ownership

Main R1 path:

  • P8 read action cache / CAS

This design is driven by:

  • immutable artifact addressing
  • one active executor claim per action attempt
  • guarded action completion
  • result reuse through cache keys and digests

Step 3 - Primary State Extraction #

For a distributed build system, the minimal primary state is the build request, runnable action, action execution lifecycle, immutable content blobs, action result metadata, and routing/ownership state.

Candidate object labelCandidate sourceCandidate needed for C1/R1?Candidate decomposition actionClassPrimary?OwnerEvolutionScope kindScope value
BuildRequestdirect nounYeskeep as candidateeventYesserviceappend-onlyinstancebuild_id
ActionRequestdirect nounYeskeep as candidateeventYesserviceappend-onlyinstanceaction_id
ActionExecutionStatelifecycle objectYeskeep as candidateprocessYesservicestate machineinstanceaction_id
ContentBlobdirect nounYeskeep as candidateeventYesserviceappend-onlyinstancedigest
ActionResultlifecycle objectYeskeep as candidateentityYesserviceoverwriteinstanceaction_digest
PartitionOwnershiphidden write targetYeskeep as candidateprocessYesservicestate machineinstanceshard_id
PartitionMaphidden write targetYeskeep as candidateentityYesserviceoverwritecollectionaction shards
BuildStatusViewderived read modelNoreject as UI artifactprojectionNoderivedoverwritecollectionbuild_id

Important modeling choices #

BuildRequest #

Primary because:

  • build submission is immutable request recording
  • captures target set, platform, and request metadata

ActionRequest #

Primary because:

  • scheduler materializes runnable work as distinct action items
  • action identity is central to remote execution

ActionExecutionState #

This is the main execution lifecycle object. It captures states like:

  • READY
  • CLAIMED(executor_id, claim_token, expiry)
  • RUNNING
  • SUCCEEDED
  • FAILED
  • RETRY_PENDING

ContentBlob #

Primary because:

  • artifacts are immutable and addressed by digest
  • CAS storage is one of the core correctness layers

ActionResult #

Primary because:

  • action cache lookup uses current result metadata
  • links action digest to output digests and execution metadata

PartitionOwnership / PartitionMap #

Needed because:

  • scheduler and executors must consistently route action claims and state updates to authoritative owners

Minimal strict primary set #

The strongest minimal set is:

  • BuildRequest
  • ActionRequest
  • ActionExecutionState
  • ContentBlob
  • ActionResult
  • PartitionOwnership
  • PartitionMap

Step 4 - Hard Invariants #

For a Bazel remote-execution-class system, the hard invariants are about durable action existence, one active executor claim per action attempt, immutable blob addressing, and correct result linkage.

PathTierTypeInvariant statement
P1 submit build / action requestsHARDuniquenessKey build request_id maps to at most one logical outcome recorded build request within build scope.
P2 claim actionHARDeligibilityAction claim_action is valid only if ActionExecutionState(action_id) is READY at decision time.
P2 claim actionHARDuniquenessKey action_id maps to at most one logical outcome active executor claim holder within claim-timeout scope.
P3 upload artifactHARDuniquenessKey digest maps to at most one logical outcome immutable content blob within CAS scope.
P4 commit action resultHARDeligibilityAction commit_action_result is valid only if ActionExecutionState(action_id) is in a current successful terminalizing state and claim token matches current executor at decision time.
P4 commit action resultHARDaccountingActionResult(action_digest) references only existing content-addressed outputs and metadata committed for that action execution.
P5 retry timed-out or failed actionHARDeligibilityAction requeue_action is valid only if ActionExecutionState(action_id) is in a retryable or expired claimed state and claim token/epoch still matches at decision time.
P6 route to shard ownerHARDuniquenessKey shard_id maps to at most one logical outcome current authoritative owner within shard_id.
P7 reassign shard ownershipHARDeligibilityAction reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time.
P8 read action cacheHARDfreshnessRead path lookup_action_result reflects authoritative committed ActionResult state within configured consistency bound.

What matters most #

1. One active executor claim per action #

This is the execution-control invariant.

2. CAS immutability by digest #

Artifacts with the same digest must mean the same bytes.

3. Action result must only point to valid blobs #

Result metadata cannot reference outputs that were never durably stored.

4. Duplicate execution may still happen under retry #

At-most-once execution is usually not the baseline guarantee. Systems rely on deterministic actions and cacheability rather than exactly-once execution.


Step 5 - Execution Context #

For the strict baseline distributed build system:

FieldValueWhy
Topologysingle service distributedone logical build execution system spread across many schedulers and executors
Write coordination scopeper object scopecorrectness is per action_id, digest, and shard ownership scope
Read consistency targetstrong onlyclaim/commit/cache-result paths must use authoritative action/result state
Holder modelclientexecutors temporarily hold claimed actions
Compensation acceptable?Nowrong claim/result linkage cannot be repaired safely after publication

Derived implications #

  • holder_may_crash = true

    • executors can crash while holding claimed actions
  • cross_service_write = false

    • baseline keeps action state, CAS metadata, and ownership in one logical service
  • bounded_staleness_allowed = false

    • claim/result-commit paths need authoritative current state
  • cross_service_atomicity_required = false

    • no multi-service transaction across unrelated services in baseline
  • exclusive_claim_required = true

    • one executor should hold an action attempt at a time
  • guarded_by_current_state = true

    • claim, commit, retry, and failover depend on current action state

What this implies #

This pushes us toward:

  • one authoritative owner per action shard
  • explicit action execution state machine
  • claim timeout modeled as lease-like executor ownership
  • immutable CAS storage with digest validation

Step 6 - Deterministic Mechanism Selection #

PathWrite shapeBase mechanismRequired companions
P1 submit build / action requestsappend-only eventappend logrequest id dedup
P2 claim actionexclusive claimleaseclaim token, claim timeout
P3 upload artifactappend-only eventcontent-addressed appenddigest verification
P4 commit action resultguarded state transitionCAS on (state, version) or leader-applied guarded transitionclaim token, digest references
P5 retry timed-out or failed actionguarded state transitionleader-applied guarded transitionclaim token, timeout scan, retry policy
P6 route to shard ownerexclusive claimleasefencing token, heartbeat
P7 reassign shard ownershipguarded state transitionCAS on (state, version)fencing token, shard catch-up check

Why these fit #

Build / action submission #

These are immutable workload facts, so append-only fits.

Action claim #

Executor ownership is temporary and exclusive, so lease semantics fit.

Artifact upload #

CAS storage is immutable by digest, so append-only content-addressed writes fit.

Result commit #

Committing an action result is not a blind write. It must verify:

  • current action state
  • current claim token
  • referenced output digests exist

So it is a guarded transition.

Canonical substrate implied #

The baseline now points to:

  • sharded build scheduler / action queue
  • one owner per action shard
  • immutable content-addressed blob store
  • per-action execution lifecycle state
  • guarded result commit after artifact upload

Step 7 - Read Model / Source of Truth #

For a Bazel remote-execution-class system, truth is mostly direct source state. Status dashboards are derived.

ConceptTruthRead pathRebuild path
C1 build submissionBuildRequestread source directlyauthoritative build-request store
C2 runnable action recordActionRequestread source directlyauthoritative action store
C3 current action lifecycleActionExecutionStateread source directlyauthoritative execution-state store
C4 immutable artifact bytesContentBlobread source directlyauthoritative CAS store
C5 current action result / cache entryActionResultread source directlyauthoritative result store
C6 shard ownershipPartitionOwnershipread source directlyauthoritative ownership store
C7 shard routing mapPartitionMapread source directlyauthoritative routing metadata
C8 build status / queue statsderived from action and result statematerialized viewrecompute from authoritative state

Important point #

For the core semantics:

  • cache lookup reads authoritative ActionResult
  • output downloads read authoritative CAS blobs
  • claim/commit paths operate on authoritative ActionExecutionState
  • dashboards are projections

Step 8 - Failure Handling #

PathRetryCompeting writersCrash after commitPublish failureStale holder
P1 submit build / actionsretry with request_id to avoid duplicate submissioncompeting builds coexist; dedup matters only for same logical requestcommitted submission survives owner crash if replicated past commit pointclient may retry safely with idempotency keyn/a
P2 claim actionretry claim safely; may return another action or later same actiononly one active claim should win for an action at a timeif claim committed and executor crashes, action stays claimed until timeout or failovern/astale executor fenced by claim token/epoch
P3 upload artifactretry by digest safelyduplicate uploads of same digest collapse to same immutable blobcommitted blob survives storage node crash if durablepartial upload rejected until digest verifiedn/a
P4 commit action resultretry with same claim token and result metadatastale/wrong claim token loses guarded transitioncommitted result survives crash if replicated past commit pointif result publish fails after blob upload, commit can be retriedstale holder cannot publish result after timeout and reassignment
P5 retry timed-out or failed actiontimeout/retry scan retry safeonly one requeue transition should win for current expired/retryable statescanner crash delays recovery; next scan retriesn/aold claim token becomes invalid after requeue
P6 route to shard ownerretry after refreshing shard maponly one valid owner should existif owner changed, refreshed map points to new ownern/astale owner rejected by fencing token
P7 reassign shard ownershipretry failover transition safelyonly one reassignment wins current ownership statepromoted owner crash triggers later reassignmentn/aold owner fenced and must not continue serving

What matters most #

1. Claim-token fencing #

This is the scheduler/executor safety mechanism.

Bad case:

  • executor A claims action
  • action times out
  • executor B reclaims action
  • A later uploads result and tries to commit it

Without fencing:

  • A could incorrectly publish a stale result

So result commit must be tied to the current claim token.

2. CAS first, result commit second #

Artifact bytes should be durably stored and digest-verified before publishing ActionResult.

3. Duplicate execution is tolerated #

If an executor crashes after doing work but before result commit:

  • another executor may rerun the action

The system relies on deterministic execution and cache keys, not exactly-once execution.


Step 9 - Scale Adjustments #

HotspotTypeFirst response
hot scheduler shard with many claimscontention hotspotincrease shard count and isolate large builds or hot queues
CAS upload bandwidthwrite throughput hotspotseparate blob service, multipart upload, and locality-aware caches
large fan-in artifact downloadsread hotspotadd edge/cache layers close to executors
retry storms after executor fleet issuecontention hotspotexponential backoff, jitter, and executor health draining
delayed action requeues / timeout scanswrite throughput hotspotbucket expiries and scan incrementally
build status queriesread hotspotkeep them as materialized projections only

What scales well #

This system scales by:

  • sharding action scheduling
  • giving each shard one authoritative owner
  • keeping claim/commit local to that owner
  • storing blobs in content-addressed immutable storage that can scale independently

What fails first #

Usually:

  • scheduler shard hot spots
  • CAS bandwidth bottlenecks
  • stale-result fencing mistakes
  • retry storms after executor failures

Canonical design conclusion #

The mechanical outcome is:

  • primary state:
    • BuildRequest
    • ActionRequest
    • ActionExecutionState
    • ContentBlob
    • ActionResult
    • PartitionOwnership
    • PartitionMap
  • critical invariants:
    • durable build/action submission
    • one active executor claim per action attempt
    • immutable CAS blobs by digest
    • result commit valid only for current claim token and existing output digests
    • timed-out or retryable actions become runnable again safely
  • mechanisms:
    • append log
    • lease
    • content-addressed immutable storage
    • guarded result commit / retry transitions
    • fenced shard ownership
  • reads:
    • direct authoritative reads for action/result/CAS state
    • projections only for build status and queue stats

Polished interview answer #

“I’d design the distributed build system as a sharded remote-execution service with one leader per action shard. Build submission records immutable build and action requests. Executors reserve actions through a lease-like claim that issues a claim token and expiry, upload outputs into a content-addressed blob store by digest, and then commit action results through a guarded transition that succeeds only for the current claim token and only if referenced output digests already exist. If an executor crashes or times out, the action is safely requeued for at-least-once execution, while cache hits come from authoritative action-result metadata plus immutable CAS blobs. The main scaling levers are more scheduler shards, independent CAS scaling, locality-aware caches, and retry backoff control.”


Concrete Substrate #

I’ll choose a service-owned sharded build scheduler with executor fleet plus separate content-addressed storage as the concrete baseline, because it matches the mechanics we derived:

  • append-only build/action submission
  • lease-like action claim
  • immutable CAS artifact uploads
  • guarded action-result commit
  • one owner per shard

Concrete tech family:

  • scheduler service in Go or Java
  • local durable metadata storage:
    • RocksDB or replicated metadata DB
  • shard replication:
    • Raft or leader-follower replication with commit index
  • CAS storage:
    • object storage or blob service addressed by digest
  • metadata/control:
    • etcd or internal metadata quorum

Each shard leader stores:

  • action records
  • current ActionExecutionState
  • ready-action queue
  • claim-expiry index
  • build-to-action tracking metadata

CAS stores:

  • immutable input/output blobs
  • directory trees / manifests by digest

Operation Layer #

1. Submit build #

API

  • SubmitBuild(build_request, request_id?)

Initiator

  • client/build tool

Entry point

  • gateway or scheduler frontend

Authoritative decider

  • scheduler shard owner(s)

Precondition

  • request id optional for dedup

Transition

  • append BuildRequest
  • materialize runnable ActionRequest records for ready actions

Response

  • {build_id}

2. Claim action #

API

  • ReserveAction(worker_capabilities, max_actions, lease_timeout)

Initiator

  • executor/client

Entry point

  • scheduler frontend or shard leader

Authoritative decider

  • shard leader owning the action

Precondition

  • action chosen must currently be READY
  • executor capabilities satisfy action requirements

Transition

  • selected action:
    • READY -> CLAIMED(executor_id, claim_token, expiry)
  • remove from ready queue
  • insert into claim-expiry index

Response

  • {actions: [{action_id, command_digest, input_root_digest, claim_token, lease_expiry}]}

3. Upload artifact #

API

  • PutBlob(digest, bytes)

Initiator

  • executor/client

Entry point

  • CAS service

Authoritative decider

  • CAS blob store

Precondition

  • uploaded bytes hash to stated digest

Transition

  • store immutable ContentBlob(digest)

Response

  • success / already exists

4. Commit action result #

API

  • CommitActionResult(action_id, claim_token, result_metadata)

Initiator

  • executor/client

Entry point

  • scheduler frontend or shard leader

Authoritative decider

  • shard leader owning the action

Precondition

  • current ActionExecutionState is CLAIMED
  • claim_token matches current holder
  • referenced output digests exist in CAS

Transition

  • CLAIMED -> SUCCEEDED
  • overwrite ActionResult(action_digest)

Response

  • success

5. Requeue timed-out or failed action #

API

  • internal background process

Initiator

  • system

Entry point

  • shard leader

Authoritative decider

  • shard leader

Precondition

  • action is still CLAIMED and expired
  • or action is failed and retryable by policy

Transition

  • move action back to READY
  • clear current claim token
  • reinsert into ready queue

6. Lookup cached result #

API

  • GetActionResult(action_digest)

Initiator

  • client/build tool or scheduler

Entry point

  • result service / shard owner

Authoritative decider

  • authoritative result store

Precondition

  • none

Transition

  • none

Response

  • result metadata if present

Entry Point vs Decider vs Responder #

PathEntry pointAuthoritative deciderPhysical responderLogical responder
SubmitBuildgateway / frontendscheduler shard leaderleader or front nodebuild system
ReserveActionfrontend / shard leadershard leaderleader or front nodebuild system
PutBlobCAS endpointCAS blob storeblob nodebuild system
CommitActionResultfrontend / shard leadershard leaderleader or front nodebuild system
retry / requeueshard leadershard leaderinternalbuild system
shard failoverfollower / coordination layershard quorum / lease storenew leader / control planebuild system

Concrete HLD #

Main components:

  • build frontend / API gateway
    • receives build submissions
  • scheduler shard leaders
    • authoritative owners for action lifecycle
    • manage ready, claimed, and retry indexes
  • executor fleet
    • pulls runnable actions
    • downloads inputs and uploads outputs
  • CAS / blob store
    • stores immutable inputs and outputs by digest
  • metadata/control service
    • tracks shard ownership and routing
  • background timeout / retry workers
    • usually run on shard leaders

Short Interview Version #

I’d build the distributed build system as a sharded remote-execution service with one scheduler leader per action shard plus a separate content-addressed blob store. Build submission records immutable build and action requests. Executors reserve actions through a lease-like claim with a claim token and expiry, execute remotely, upload outputs to CAS by digest, and then publish action results through a guarded transition that succeeds only for the current claim token and only if referenced output digests exist. If an executor crashes or times out, the action is safely requeued for at-least-once execution, while cache hits come from authoritative action-result metadata plus immutable CAS blobs.