Mobile OS Rollout / Staged Delivery Analysis Note #

This note captures the full step-by-step analysis for a mobile OS rollout / staged delivery service: release metadata, compatibility policy, cohort/stage policy, device eligibility, versioned snapshot propagation, and staged rollout progression.

Step 1 — Normalize #

Assume the baseline prompt is:

  • design a mobile OS rollout / staged delivery service
  • devices check for updates
  • control plane decides which OS version a device is eligible for
  • rollout can be staged by cohort, percentage, region, device model, etc.
  • rollout can be paused, resumed, or rolled back
  • system scales across millions of devices
| Requirement | Actor | Operation | State touched | Priority |
| --- | --- | --- | --- | --- |
| Device checks for update | Client | read source | S1, read source target: EligibilityState | C1 |
| Admin publishes new release metadata | Admin | overwrite state | S1, update target: ReleaseConfig | C1 |
| Admin updates staged rollout policy | Admin | overwrite state | S1, update target: RolloutPolicy | C1 |
| Admin pauses/resumes/rolls back rollout | Admin | state transition | S1, update target: RolloutState | C1 |
| System computes effective device eligibility | System | state transition | S1, update target: EligibilityState | C1 |
| System propagates rollout snapshot to serving edges | System | async process | S1, hidden write target: ConfigSnapshot | C1 |
| Device reports install/update status | Client | append event | S1, create target: InstallEvent | R2 |
| Admin reads rollout/health dashboard | Client | read projection | S1, read projection target: RolloutStatusView | R2 |

Notes on normalization:

  • device update check is a read path against current effective eligibility state
  • release metadata and rollout policy are overwrite-state control-plane objects
  • rollout pause/resume/rollback is a lifecycle state transition
  • effective eligibility is a recomputed control-plane object
  • snapshot propagation is async control-plane dissemination
  • install reporting is append-only but secondary to the decision path

This system is a composition of:

  • Control Plane + Data Plane
  • Time-Bounded Exclusive Allocation flavor for staged eligibility windows

but the dominant architecture is control-plane/data-plane.

Step 2 — Critical Path Selection #

| Requirement | Priority class | Why |
| --- | --- | --- |
| Device checks for update | C1 | wrong eligibility sends wrong OS version or violates rollout safety |
| Publish new release metadata | C1 | changes future device update truth |
| Update staged rollout policy | C1 | changes future device eligibility |
| Pause/resume/rollback rollout | C1 | emergency safety path; correctness-critical |
| Compute effective device eligibility | C1 | control-plane to data-plane correctness bridge |
| Propagate rollout snapshot | C1 | stale serving nodes can serve wrong release |
| Device reports install/update status | R2 | important for monitoring/analytics, not baseline decision correctness |
| Read rollout/health dashboard | R2 | operational only |

Critical paths:

  • P1 evaluate device eligibility
  • P2 update release config
  • P3 update rollout policy
  • P4 transition rollout state
  • P5 compute effective eligibility
  • P6 propagate config snapshot

Step 3 — Primary State Extraction #

| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ReleaseConfig | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | release_id |
| RolloutPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | rollout_id or release_id |
| RolloutState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | rollout_id |
| EligibilityState | hidden write target | Yes | keep as candidate | process | Yes | service | overwrite | instance | release_id or environment_scope |
| ConfigSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | edge_node_id or serving_scope |
| InstallEvent | hidden write target | No | keep as candidate | event | No | derived | append-only | collection | device_id + release_id + timestamp |
| RolloutStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | rollout_id |

Minimal primary set:

  • ReleaseConfig
  • RolloutPolicy
  • RolloutState
  • EligibilityState
  • ConfigSnapshot

Important modeling choices:

RolloutState #

This is worth making explicit because rollout lifecycle matters:

  • DRAFT
  • ACTIVE
  • PAUSED
  • ROLLED_BACK
  • COMPLETED
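
The note lists the lifecycle states but not the edges between them. A minimal sketch of a guarded transition check follows, assuming one plausible set of legal transitions (the `LEGAL_TRANSITIONS` table and `can_transition` helper are hypothetical, not part of the original analysis):

```python
from enum import Enum


class RolloutState(Enum):
    DRAFT = "DRAFT"
    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"
    ROLLED_BACK = "ROLLED_BACK"
    COMPLETED = "COMPLETED"


# ASSUMPTION: these edges are illustrative; the note only names the states.
LEGAL_TRANSITIONS = {
    RolloutState.DRAFT: {RolloutState.ACTIVE},
    RolloutState.ACTIVE: {
        RolloutState.PAUSED,
        RolloutState.ROLLED_BACK,
        RolloutState.COMPLETED,
    },
    RolloutState.PAUSED: {RolloutState.ACTIVE, RolloutState.ROLLED_BACK},
    RolloutState.ROLLED_BACK: set(),  # treated as terminal here
    RolloutState.COMPLETED: set(),    # treated as terminal here
}


def can_transition(current: RolloutState, requested: RolloutState) -> bool:
    """Return True only if the lifecycle allows current -> requested."""
    return requested in LEGAL_TRANSITIONS[current]
```

Under these assumed edges, pause and resume are the only reversible moves; whether a rolled-back or completed rollout may be reopened is a policy decision the note leaves open.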

EligibilityState #

This is the compiled/effective decision state for hot-path checks:

  • release compatibility rules
  • staged rollout percentage/cohort config
  • current rollout state

ConfigSnapshot #

Serving/edge nodes should evaluate against versioned snapshots, not query control plane for every device check.
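
As a rough illustration of compiling the three inputs into one effective state, here is a sketch of a pure compile function. The field names (`compatible_models`, `percentage`, `compiled_version`) and the rule that only ACTIVE and COMPLETED rollouts serve updates are assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EligibilityState:
    release_id: str
    # (release_version, policy_version, rollout_version); hypothetical layout
    compiled_version: tuple
    serving_enabled: bool
    compatible_models: frozenset
    percentage: int


def compile_eligibility(release: dict, policy: dict, rollout: dict) -> EligibilityState:
    """Pure function of release config, rollout policy, and rollout state."""
    # ASSUMPTION: only ACTIVE and COMPLETED rollouts serve the release.
    serving_enabled = rollout["state"] in ("ACTIVE", "COMPLETED")
    return EligibilityState(
        release_id=release["release_id"],
        compiled_version=(release["version"], policy["version"], rollout["version"]),
        serving_enabled=serving_enabled,
        compatible_models=frozenset(release["compatible_models"]),
        # A paused or rolled-back rollout compiles to 0% exposure.
        percentage=policy["percentage"] if serving_enabled else 0,
    )
```

Because the function has no hidden inputs, rerunning it after a crash yields the same state, and a serving node can be handed the result as a versioned snapshot.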

Step 4 — Hard Invariants #

| Path | Tier | Type | Invariant statement |
| --- | --- | --- | --- |
| P1 device update check | HARD | eligibility | serve_update is valid only if device is eligible under current release config, rollout policy, rollout state, and compatibility rules for that request scope. |
| P2 update release config | HARD | ordering | Release-config revisions are ordered by monotonic version within release_id. |
| P3 update rollout policy | HARD | ordering | Rollout-policy revisions are ordered by monotonic version within rollout scope. |
| P4 transition rollout state | HARD | eligibility | Action transition_rollout_state is valid only if current rollout lifecycle allows the requested state transition. |
| P5 compute effective eligibility | HARD | accounting | Effective EligibilityState equals function of release config, rollout policy, and rollout lifecycle state. |
| P6 propagate config snapshot | HARD | freshness | Serving-node ConfigSnapshot reflects authoritative eligibility state within configured propagation bound. |
| P1 device update check | HARD | uniqueness | For fixed (release_id, device_key, config_version), rollout bucketing maps to at most one deterministic eligibility/treatment outcome within that rollout scope. |

What matters most:

  • devices must not be served releases they are not eligible for
  • staged rollout bucketing must be stable for same device and config version
  • rollout pause/rollback must take effect within bounded propagation time
  • effective eligibility must faithfully represent authoritative release + rollout inputs
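
The stable-bucketing requirement can be sketched with a hash of the device key. This is one common approach rather than a method the note prescribes; the `bucket` and `in_rollout` helpers and the 10,000-bucket granularity are assumptions:

```python
import hashlib

BUCKETS = 10_000  # hypothetical granularity: 0.01% per bucket


def bucket(release_id: str, device_key: str) -> int:
    """Map a device to a stable bucket in [0, BUCKETS) for a given release."""
    material = f"{release_id}:{device_key}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(release_id: str, device_key: str, percentage: int) -> bool:
    """True if the device's bucket falls below the rollout percentage (0..100)."""
    return bucket(release_id, device_key) < percentage * (BUCKETS // 100)
```

The hash here deliberately excludes config_version, so a device admitted at 5% stays admitted when the policy moves to 10%. The outcome is still deterministic for any fixed (release_id, device_key, config_version), which is all the invariant requires; hashing the version in as well would reshuffle devices on every policy edit.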

Step 5 — Execution Context #

| Field | Value | Why |
| --- | --- | --- |
| Topology | single service distributed | one logical rollout system with many serving/edge nodes |
| Write coordination scope | per object scope | correctness is per release/rollout/serving-snapshot scope |
| Read consistency target | bounded stale allowed | hot path typically uses recent snapshots, not strong control-plane reads per device check |
| Holder model | none | no lease-like per-request ownership is central |
| Compensation acceptable? | No | wrong update eligibility or accidental rollout exposure is not a compensable workflow for correctness purposes |

Derived:

  • bounded_staleness_allowed = true
  • exclusive_claim_required = false
  • guarded_by_current_state = true

This implies:

  • authoritative control plane
  • versioned snapshot propagation
  • local eligibility evaluation on hot path

Step 6 — Deterministic Mechanism Selection #

| Path | Write shape | Base mechanism | Required companions |
| --- | --- | --- | --- |
| P2 update release config | overwrite current value | CAS on version | config version |
| P3 update rollout policy | overwrite current value | CAS on version | policy version |
| P4 transition rollout state | guarded state transition | CAS on (state, version) | rollout version |
| P5 compute effective eligibility | overwrite current value | single writer control-plane recompute | compiled-state version |
| P6 propagate config snapshot | overwrite current value | single writer snapshot publication | config version |
| install reporting | append-only event | append log | event id or request id dedup if needed |

Hot path P1 is a read path.

Why these fit:

  • release config and rollout policy are current-state control-plane config
  • rollout lifecycle has real state-machine semantics
  • effective eligibility is a recomputed current view
  • snapshots are versioned overwrite dissemination
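
To make "CAS on version" concrete, here is an in-memory sketch of a versioned overwrite. `VersionedStore` and `VersionConflict` are hypothetical names; a real deployment would rely on the conditional-write facility of whichever authoritative store is chosen rather than a process-local lock:

```python
import threading


class VersionConflict(Exception):
    """Raised when the caller's expected_version is stale."""


class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); version 0 means the key does not exist."""
        with self._lock:
            return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        """Overwrite only if the stored version matches; return the new version."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            new_version = current_version + 1
            self._data[key] = (new_version, value)
            return new_version
```

A stale writer loses with `VersionConflict` and must re-read before retrying, which gives the monotonic per-key revision ordering that P2 and P3 require.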

Step 7 — Read Model / Source of Truth #

| Concept | Truth | Read path | Rebuild path |
| --- | --- | --- | --- |
| C1 release config | ReleaseConfig | read source directly | authoritative release store |
| C2 rollout policy | RolloutPolicy | read source directly | authoritative policy store |
| C3 rollout lifecycle state | RolloutState | read source directly | authoritative rollout-state store |
| C4 effective eligibility state | EligibilityState | read source directly | recompute from release + rollout inputs |
| C5 serving-node snapshot | ConfigSnapshot | materialized view | rebuild from latest effective eligibility state |
| C6 rollout status/analytics | derived | materialized view | recompute from primary state + install events |
| C7 install/update reporting | InstallEvent if retained | append/event analytics path | replay from event stream |

Important point:

For the device-check hot path:

  • serving node reads local ConfigSnapshot
  • not control-plane source state per request

Step 8 — Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
| --- | --- | --- | --- | --- | --- |
| release/policy update | retry with config version | stale update loses CAS | committed config survives control-plane crash if persisted | snapshot propagation may lag | n/a |
| rollout-state transition | retry with rollout version | stale/invalid transition loses CAS | committed lifecycle change survives crash if persisted | snapshot propagation may lag | n/a |
| effective-state recompute | retry safe from primary inputs | single recompute/version wins | recompute reruns after crash | snapshot propagation may lag | n/a |
| snapshot propagation | retry with versioned snapshot | older snapshot loses to newer version | serving node keeps last good snapshot until refresh | failed push retried or pulled | n/a |
| device update check | retries are application-level | many serving nodes can evaluate concurrently using same snapshot version | one node crash affects only local requests | n/a | stale snapshot bounded by version/TTL refresh |
| install reporting | retry with event id/request id | duplicate events coexist unless dedup applied | committed event survives if persisted | async analytics publication may lag | n/a |

What matters most:

  • rollout state transitions must be versioned and legal
  • serving nodes (including any sidecar evaluators) reject snapshots older than the version they already hold
  • pause/rollback must propagate within bounded time
  • deterministic bucketing must stay stable for the same device and config version
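
Rejecting older snapshots on the serving side can be sketched as a version-gated holder. The `SnapshotHolder` class is a hypothetical illustration and assumes snapshot versions are single monotonically increasing integers:

```python
class SnapshotHolder:
    """Keeps the newest snapshot seen; ignores stale or duplicate deliveries."""

    def __init__(self):
        self._version = 0      # 0 means no snapshot received yet
        self._payload = None

    def offer(self, version: int, payload: dict) -> bool:
        """Install the snapshot only if it is strictly newer. Returns True if installed."""
        if version <= self._version:
            return False  # stale or duplicate push: keep the last good snapshot
        self._version = version
        self._payload = payload
        return True

    def current(self):
        """Return (version, payload) of the snapshot used for local evaluation."""
        return self._version, self._payload
```

With this gate, retried or reordered pushes are harmless: a delayed delivery of the pre-pause snapshot cannot overwrite the paused one.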

Step 9 — Scale Adjustments #

| Hotspot | Type | First response |
| --- | --- | --- |
| very high device-check QPS | read hotspot | push eligibility evaluation to local edges/CDNs/SDKs and keep snapshots compact |
| config churn from rapid staged-rollout changes | fan-out hotspot | batch updates and publish incremental snapshots |
| huge device compatibility rule sets | read hotspot | compile rules into efficient evaluation structures and shard config by product/region/model |
| edge reconnect/snapshot storms | fan-out hotspot | backoff reconnects and support pull-on-version-miss |
| rollout analytics/event volume | write throughput hotspot | keep install reporting async and separate from decision hot path |
| admin/status reads | read hotspot | derived views only |

What scales well:

  • hot path is local eligibility evaluation from snapshot
  • control plane is narrow and versioned
  • propagation is incremental

What fails first:

  • snapshot fanout storms during urgent pause/rollback
  • large compatibility/policy rule trees
  • analytics overload if tied too closely to the decision path
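
One way to soften the fanout storm is capped exponential backoff with full jitter on reconnect, combined with pulling only when the advertised version is ahead of the local one. This is a sketch of that general technique; the `reconnect_delay` and `needs_pull` helpers and their default parameters are assumptions:

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    The ceiling doubles per attempt up to `cap`; the actual delay is drawn
    uniformly below the ceiling so edge nodes do not reconnect in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def needs_pull(local_version: int, advertised_version: int) -> bool:
    """Pull-on-version-miss: fetch only when the edge is behind."""
    return advertised_version > local_version
```

For an urgent pause or rollback the cap matters more than the base: it bounds how long a disconnected edge can keep serving the old snapshot, so it should be chosen against the configured propagation bound.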

Canonical design conclusion:

  • archetype: Control Plane + Data Plane
  • primary truth:
    • ReleaseConfig
    • RolloutPolicy
    • RolloutState
    • EligibilityState
    • ConfigSnapshot
  • hot path:
    • local snapshot read + deterministic eligibility/bucketing
  • control plane:
    • authoritative release config + rollout config + lifecycle + compiled eligibility state + snapshot publication

Concrete Substrate #

  • control plane in Go/Java
  • authoritative rollout/config store in etcd, Postgres, or another strongly consistent config DB
  • config distribution via watch streams / push channels / CDN-backed snapshot pull
  • local evaluators at edge nodes, update-check servers, or device-facing services
  • optional install-event pipeline via Kafka/PubSub/ClickHouse path

Operation Layer #

  1. CheckForUpdate(device_context, current_version, device_key)
     • entry point: edge/update-check server
     • authoritative decider: local ConfigSnapshot
     • transition: none on source truth
     • response: {eligible, target_release, config_version, reason}
  2. PutReleaseConfig(release_id, config, expected_version?)
     • entry point: control-plane API
     • authoritative decider: release store owner
     • transition: overwrite ReleaseConfig
  3. PutRolloutPolicy(scope, config, expected_version?)
     • entry point: control-plane API
     • authoritative decider: policy store owner
     • transition: overwrite RolloutPolicy
  4. TransitionRolloutState(rollout_id, new_state, expected_version?)
     • entry point: control-plane API
     • authoritative decider: rollout-state owner
     • transition: guarded update to RolloutState
  5. internal recompute
     • recompute EligibilityState from release config + rollout policy + rollout state
  6. snapshot propagation
     • publish latest ConfigSnapshot(version) to serving edges/update-check nodes
  7. RecordInstallEvent(device_id, release_id, status, request_id?)
     • entry point: async event ingestion endpoint
     • authoritative decider: analytics/event pipeline
     • transition: append InstallEvent
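
The CheckForUpdate operation can be sketched as a pure function over a local snapshot. The response keys follow the shape listed above; the snapshot fields, the `"model"` key in `device_context`, the reason strings, and the hash-based bucketing are all assumptions made for illustration:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSnapshot:  # hypothetical field layout
    config_version: int
    target_release: str
    serving_enabled: bool
    compatible_models: frozenset
    percentage: int  # 0..100


def _bucket(release: str, device_key: str) -> int:
    """Stable bucket in [0, 10000) derived from release and device key."""
    digest = hashlib.sha256(f"{release}:{device_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def check_for_update(snapshot, device_context, current_version, device_key):
    """Evaluate eligibility from the local snapshot only; no control-plane call."""
    def reply(eligible, reason):
        return {
            "eligible": eligible,
            "target_release": snapshot.target_release if eligible else None,
            "config_version": snapshot.config_version,
            "reason": reason,
        }

    if not snapshot.serving_enabled:
        return reply(False, "rollout_not_active")
    if current_version == snapshot.target_release:
        return reply(False, "already_up_to_date")
    if device_context.get("model") not in snapshot.compatible_models:
        return reply(False, "incompatible_model")
    if _bucket(snapshot.target_release, device_key) >= snapshot.percentage * 100:
        return reply(False, "outside_rollout_percentage")
    return reply(True, "eligible")
```

Returning config_version with every answer lets a device or operator tell which snapshot produced a decision, which helps when diagnosing a pause that has not yet reached a given edge.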

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
| --- | --- | --- | --- | --- |
| check for update | edge/update-check node | local config snapshot | edge/update-check node | rollout service |
| release/policy update | control-plane API | config/policy store owner | control-plane node | rollout service |
| rollout-state transition | control-plane API | rollout-state owner | control-plane node | rollout service |
| effective-state recompute | control plane | recompute worker | internal | rollout service |
| snapshot propagation | serving node / control plane | snapshot publisher | control/data-plane | rollout service |
| install reporting | async ingestion endpoint | event pipeline | ingestion node | rollout analytics subsystem |

Concrete HLD #

Main components:

  • control-plane API
  • release + rollout state store
  • effective-eligibility compiler/recompute worker
  • snapshot distribution layer
  • edge/update-check evaluators
  • optional install analytics pipeline

Short interview version #

“I’d design the mobile OS rollout system as a control-plane/data-plane service. The control plane stores release metadata, staged rollout policy, and rollout lifecycle state, then compiles them into an effective eligibility snapshot. Device checks don’t go to the control plane on every request; update-check servers evaluate locally from versioned snapshots and use deterministic bucketing so the same device stays in the same treatment for a given config version. Pause, resume, and rollback are explicit lifecycle transitions, and the main correctness boundary is bounded-stale snapshot propagation to the serving edge.”