Mobile OS Rollout / Staged Delivery Analysis Note
This note captures the full step-by-step analysis for a mobile OS rollout / staged delivery service: release metadata, compatibility policy, cohort/stage policy, device eligibility, versioned snapshot propagation, and staged rollout progression.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a mobile OS rollout / staged delivery service
- devices check for updates
- control plane decides which OS version a device is eligible for
- rollout can be staged by cohort, percentage, region, device model, etc.
- rollout can be paused, resumed, or rolled back
- system scales across millions of devices
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Device checks for update | Client | read source | S1: read source, target EligibilityState | C1 |
| Admin publishes new release metadata | Admin | overwrite state | S1: update, target ReleaseConfig | C1 |
| Admin updates staged rollout policy | Admin | overwrite state | S1: update, target RolloutPolicy | C1 |
| Admin pauses/resumes/rolls back rollout | Admin | state transition | S1: update, target RolloutState | C1 |
| System computes effective device eligibility | System | state transition | S1: update, target EligibilityState | C1 |
| System propagates rollout snapshot to serving edges | System | async process | S1: hidden write, target ConfigSnapshot | C1 |
| Device reports install/update status | Client | append event | S1: create, target InstallEvent | R2 |
| Admin reads rollout/health dashboard | Admin | read projection | S1: read projection, target RolloutStatusView | R2 |
Notes on normalization:
- device update check is a read path against current effective eligibility state
- release metadata and rollout policy are overwrite-state control-plane objects
- rollout pause/resume/rollback is a lifecycle state transition
- effective eligibility is a recomputed control-plane object
- snapshot propagation is async control-plane dissemination
- install reporting is append-only but secondary to the decision path
This system is a composition of:
- Control Plane + Data Plane
- Time-Bounded Exclusive Allocation flavor for staged eligibility windows
The dominant architecture, however, is control-plane/data-plane.
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Device checks for update | C1 | wrong eligibility sends wrong OS version or violates rollout safety |
| Publish new release metadata | C1 | changes future device update truth |
| Update staged rollout policy | C1 | changes future device eligibility |
| Pause/resume/rollback rollout | C1 | emergency safety path; correctness-critical |
| Compute effective device eligibility | C1 | control-plane to data-plane correctness bridge |
| Propagate rollout snapshot | C1 | stale serving nodes can serve wrong release |
| Device reports install/update status | R2 | important for monitoring/analytics, not baseline decision correctness |
| Read rollout/health dashboard | R2 | operational only |
Critical paths:
- P1: evaluate device eligibility
- P2: update release config
- P3: update rollout policy
- P4: transition rollout state
- P5: compute effective eligibility
- P6: propagate config snapshot
Step 3 — Primary State Extraction #
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| ReleaseConfig | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | release_id |
| RolloutPolicy | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | rollout_id or release_id |
| RolloutState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | rollout_id |
| EligibilityState | hidden write target | Yes | keep as candidate | process | Yes | service | overwrite | instance | release_id or environment_scope |
| ConfigSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | edge_node_id or serving_scope |
| InstallEvent | hidden write target | No | keep as candidate | event | No | derived | append-only | collection | device_id + release_id + timestamp |
| RolloutStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | rollout_id |
Minimal primary set:
- ReleaseConfig
- RolloutPolicy
- RolloutState
- EligibilityState
- ConfigSnapshot
Important modeling choices:
RolloutState #
This is worth making explicit because rollout lifecycle matters:
- DRAFT
- ACTIVE
- PAUSED
- ROLLED_BACK
- COMPLETED
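The note names the lifecycle states but does not spell out which transitions are legal. The following is a minimal sketch of a transition guard under one plausible edge set; the `ALLOWED` table and the `can_transition` helper are assumptions for illustration, not part of the original analysis.

```python
from enum import Enum


class RolloutState(Enum):
    DRAFT = "DRAFT"
    ACTIVE = "ACTIVE"
    PAUSED = "PAUSED"
    ROLLED_BACK = "ROLLED_BACK"
    COMPLETED = "COMPLETED"


# Assumed edge set: the note lists the states but not the transitions.
ALLOWED = {
    RolloutState.DRAFT: {RolloutState.ACTIVE},
    RolloutState.ACTIVE: {
        RolloutState.PAUSED,
        RolloutState.ROLLED_BACK,
        RolloutState.COMPLETED,
    },
    RolloutState.PAUSED: {RolloutState.ACTIVE, RolloutState.ROLLED_BACK},
    RolloutState.ROLLED_BACK: set(),   # treated as terminal in this sketch
    RolloutState.COMPLETED: set(),     # treated as terminal in this sketch
}


def can_transition(current: RolloutState, requested: RolloutState) -> bool:
    """Return True if the assumed lifecycle allows current -> requested."""
    return requested in ALLOWED[current]
```

Keeping the edge set in a single table means the guard can be reviewed in one place; a real deployment might, for example, allow rollback from COMPLETED.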
EligibilityState #
This is the compiled/effective decision state for hot-path checks, derived from:
- release compatibility rules
- staged rollout percentage/cohort config
- current rollout state
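As a rough illustration of "compiled" state, the three inputs above can be folded into one immutable record by a pure function. This is a sketch only: the field names (`percent`, `compatible_models`, `serving`) and the rule that only an ACTIVE rollout serves updates are assumptions, not definitions from this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EligibilityState:
    release_id: str
    version: int                 # compiled-state version
    serving: bool                # assumed: True only while the rollout is ACTIVE
    rollout_percent: int         # 0-100, taken from the rollout policy
    allowed_models: frozenset    # taken from the release compatibility rules


def compile_eligibility(release: dict, policy: dict,
                        rollout_state: str, version: int) -> EligibilityState:
    """Pure function of the three authoritative inputs (hypothetical shapes)."""
    return EligibilityState(
        release_id=release["release_id"],
        version=version,
        serving=(rollout_state == "ACTIVE"),
        rollout_percent=policy["percent"],
        allowed_models=frozenset(release["compatible_models"]),
    )
```

Because the function has no side effects, rerunning it after a crash with the same inputs yields the same record, which is what makes the recompute safe to retry.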
ConfigSnapshot #
Serving/edge nodes should evaluate against versioned snapshots, not query control plane for every device check.
Step 4 — Hard Invariants #
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 device update check | HARD | eligibility | serve_update is valid only if device is eligible under current release config, rollout policy, rollout state, and compatibility rules for that request scope. |
| P2 update release config | HARD | ordering | Release-config revisions are ordered by monotonic version within release_id. |
| P3 update rollout policy | HARD | ordering | Rollout-policy revisions are ordered by monotonic version within rollout scope. |
| P4 transition rollout state | HARD | eligibility | Action transition_rollout_state is valid only if current rollout lifecycle allows the requested state transition. |
| P5 compute effective eligibility | HARD | accounting | Effective EligibilityState equals function of release config, rollout policy, and rollout lifecycle state. |
| P6 propagate config snapshot | HARD | freshness | Serving-node ConfigSnapshot reflects authoritative eligibility state within configured propagation bound. |
| P1 device update check | HARD | uniqueness | For fixed (release_id, device_key, config_version), rollout bucketing maps to at most one deterministic eligibility/treatment outcome within that rollout scope. |
What matters most:
- devices must not be served releases they are not eligible for
- staged rollout bucketing must be stable for same device and config version
- rollout pause/rollback must take effect within bounded propagation time
- effective eligibility must faithfully represent authoritative release + rollout inputs
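One common way to get stable bucketing is to hash the rollout scope together with the device key and compare the result against the rollout percentage. The sketch below assumes SHA-256 and 10,000 buckets; the function names and the choice to leave `config_version` out of the hash are illustrative assumptions, not the note's prescribed method.

```python
import hashlib


def bucket(release_id: str, device_key: str, buckets: int = 10_000) -> int:
    """Map (release_id, device_key) to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{release_id}:{device_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_rollout(release_id: str, device_key: str, percent: float) -> bool:
    """True if the device's bucket falls below the rollout percentage (0-100)."""
    # round() avoids float truncation, e.g. 0.29 * 100 == 28.999999999999996
    return bucket(release_id, device_key) < round(percent * 100)
```

Leaving `config_version` out of the hash input is stricter than the invariant requires: a device keeps its bucket across config versions, so raising the percentage only adds devices and never removes ones already included.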
Step 5 — Execution Context #
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical rollout system with many serving/edge nodes |
| Write coordination scope | per object scope | correctness is per release/rollout/serving-snapshot scope |
| Read consistency target | bounded stale allowed | hot path typically uses recent snapshots, not strong control-plane reads per device check |
| Holder model | none | no lease-like per-request ownership is central |
| Compensation acceptable? | No | wrong update eligibility or accidental rollout exposure is not a compensable workflow for correctness purposes |
Derived:
- bounded_staleness_allowed = true
- exclusive_claim_required = false
- guarded_by_current_state = true
This implies:
- authoritative control plane
- versioned snapshot propagation
- local eligibility evaluation on hot path
Step 6 — Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P2 update release config | overwrite current value | CAS on version | config version |
| P3 update rollout policy | overwrite current value | CAS on version | policy version |
| P4 transition rollout state | guarded state transition | CAS on (state, version) | rollout version |
| P5 compute effective eligibility | overwrite current value | single writer control-plane recompute | compiled-state version |
| P6 propagate config snapshot | overwrite current value | single writer snapshot publication | config version |
| install reporting | append-only event | append log | event id or request id dedup if needed |
Hot path P1 is a read path.
Why these fit:
- release config and rollout policy are current-state control-plane config
- rollout lifecycle has real state-machine semantics
- effective eligibility is a recomputed current view
- snapshots are versioned overwrite dissemination
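To make "CAS on version" concrete, here is a minimal in-memory sketch of a versioned overwrite. `VersionedStore` and `VersionConflict` are hypothetical names; a real deployment would rely on the conditional-write primitive of whatever strongly consistent store it uses, with the lock here standing in for that store's atomicity.

```python
import threading


class VersionConflict(Exception):
    """Raised when expected_version does not match the stored version."""


class VersionedStore:
    """In-memory stand-in for a strongly consistent config store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value); a missing key is version 0

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def put(self, key, value, expected_version):
        """Overwrite only if the caller saw the current version."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected {expected_version}, found {current_version}"
                )
            new_version = current_version + 1
            self._data[key] = (new_version, value)
            return new_version
```

A writer that loses the race receives `VersionConflict`, re-reads the current value and version, and decides whether its change still applies.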
Step 7 — Read Model / Source of Truth #
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 release config | ReleaseConfig | read source directly | authoritative release store |
| C2 rollout policy | RolloutPolicy | read source directly | authoritative policy store |
| C3 rollout lifecycle state | RolloutState | read source directly | authoritative rollout-state store |
| C4 effective eligibility state | EligibilityState | read source directly | recompute from release + rollout inputs |
| C5 serving-node snapshot | ConfigSnapshot | materialized view | rebuild from latest effective eligibility state |
| C6 rollout status/analytics | derived | materialized view | recompute from primary state + install events |
| C7 install/update reporting | InstallEvent if retained | append/event analytics path | replay from event stream |
Important point:
For the device-check hot path:
- serving node reads local ConfigSnapshot
- not control-plane source state per request
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| release/policy update | retry with config version | stale update loses CAS | committed config survives control-plane crash if persisted | snapshot propagation may lag | n/a |
| rollout-state transition | retry with rollout version | stale/invalid transition loses CAS | committed lifecycle change survives crash if persisted | snapshot propagation may lag | n/a |
| effective-state recompute | retry safe from primary inputs | single recompute/version wins | recompute reruns after crash | snapshot propagation may lag | n/a |
| snapshot propagation | retry with versioned snapshot | older snapshot loses to newer version | serving node keeps last good snapshot until refresh | failed push retried or pulled | n/a |
| device update check | retries are application-level | many serving nodes can evaluate concurrently using same snapshot version | one node crash affects only local requests | n/a | stale snapshot bounded by version/TTL refresh |
| install reporting | retry with event id/request id | duplicate events coexist unless dedup applied | committed event survives if persisted | async analytics publication may lag | n/a |
What matters most:
- rollout state transitions must be versioned and legal
- serving/edge nodes must reject snapshots older than the version they already hold
- pause/rollback must propagate within bounded time
- deterministic bucketing must stay stable for the same device and config version
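The rule that serving nodes reject older snapshots can be sketched as a small holder that installs a snapshot only when its version is strictly newer. `SnapshotHolder` and its method names are hypothetical; the snapshot payload is an opaque dict here.

```python
import threading


class SnapshotHolder:
    """Holds the last good snapshot on a serving node."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._snapshot = None

    def offer(self, version: int, snapshot: dict) -> bool:
        """Install the snapshot only if it is strictly newer; report whether it was installed."""
        with self._lock:
            if version <= self._version:
                return False  # stale or duplicate delivery: keep the current snapshot
            self._version, self._snapshot = version, snapshot
            return True

    def current(self):
        """Return (version, snapshot) as last installed."""
        with self._lock:
            return self._version, self._snapshot
```

With this guard, out-of-order or retried deliveries are harmless, so the publisher can retry pushes freely and a late-arriving pre-pause snapshot cannot undo a pause.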
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| very high device-check QPS | read hotspot | push eligibility evaluation to local edges/CDNs/SDKs and keep snapshots compact |
| config churn from rapid staged-rollout changes | fan-out hotspot | batch updates and publish incremental snapshots |
| huge device compatibility rule sets | read hotspot | compile rules into efficient evaluation structures and shard config by product/region/model |
| edge reconnect/snapshot storms | fan-out hotspot | backoff reconnects and support pull-on-version-miss |
| rollout analytics/event volume | write throughput hotspot | keep install reporting async and separate from decision hot path |
| admin/status reads | read hotspot | derived views only |
What scales well:
- hot path is local eligibility evaluation from snapshot
- control plane is narrow and versioned
- propagation is incremental
What fails first:
- snapshot fanout storms during urgent pause/rollback
- large compatibility/policy rule trees
- analytics overload if tied too closely to the decision path
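For the reconnect/snapshot-storm hotspot, one widely used form of "backoff reconnects" is exponential backoff with full jitter, so that edges dropped at the same moment do not all reconnect at the same moment. The sketch below is illustrative; the `base` and `cap` defaults are arbitrary assumptions.

```python
import random


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                    rng=random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return rng.uniform(0.0, ceiling)
```

An edge node would sleep for `reconnect_delay(attempt)` before each retry and reset `attempt` to zero once a snapshot stream is re-established.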
Canonical design conclusion:
- archetype: Control Plane + Data Plane
- primary truth:
  - ReleaseConfig
  - RolloutPolicy
  - RolloutState
  - EligibilityState
  - ConfigSnapshot
- hot path:
  - local snapshot read + deterministic eligibility/bucketing
- control plane:
  - authoritative release config + rollout config + lifecycle + compiled eligibility state + snapshot publication
Concrete Substrate #
- control plane in Go/Java
- authoritative rollout/config store in etcd, Postgres, or another strongly consistent config DB
- config distribution via watch streams / push channels / CDN-backed snapshot pull
- local evaluators at edge nodes, update-check servers, or device-facing services
- optional install-event pipeline via Kafka/PubSub/ClickHouse path
Operation Layer #
CheckForUpdate(device_context, current_version, device_key)
- entry point: edge/update-check server
- authoritative decider: local ConfigSnapshot
- transition: none on source truth
- response: {eligible, target_release, config_version, reason}
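A minimal sketch of how the check might be evaluated entirely from a local snapshot and return the response shape listed above. The `Snapshot` fields, the reason strings, the use of `device_model` as a stand-in for `device_context`, and the whole-percent hash bucketing are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    config_version: int
    release_id: str
    target_version: str
    serving: bool               # assumed: False while paused or rolled back
    rollout_percent: int        # whole percent, 0-100
    allowed_models: frozenset


def check_for_update(snapshot: Snapshot, device_model: str,
                     current_version: str, device_key: str) -> dict:
    """Evaluate eligibility from the local snapshot only; no control-plane call."""
    def result(eligible: bool, reason: str) -> dict:
        return {
            "eligible": eligible,
            "target_release": snapshot.release_id if eligible else None,
            "config_version": snapshot.config_version,
            "reason": reason,
        }

    if not snapshot.serving:
        return result(False, "rollout_not_active")
    if current_version == snapshot.target_version:
        return result(False, "already_up_to_date")
    if device_model not in snapshot.allowed_models:
        return result(False, "incompatible_model")
    # Stable bucket in [0, 100) derived from the release and device key.
    digest = hashlib.sha256(
        f"{snapshot.release_id}:{device_key}".encode("utf-8")
    ).digest()
    if int.from_bytes(digest[:8], "big") % 100 >= snapshot.rollout_percent:
        return result(False, "outside_rollout_percentage")
    return result(True, "eligible")
```

Returning `config_version` with every decision lets operators see which snapshot version produced a given answer when a pause or rollback is still propagating.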
PutReleaseConfig(release_id, config, expected_version?)
- entry point: control-plane API
- authoritative decider: release store owner
- transition: overwrite ReleaseConfig
PutRolloutPolicy(scope, config, expected_version?)
- entry point: control-plane API
- authoritative decider: policy store owner
- transition: overwrite RolloutPolicy
TransitionRolloutState(rollout_id, new_state, expected_version?)
- entry point: control-plane API
- authoritative decider: rollout-state owner
- transition: guarded update to RolloutState
- internal recompute
  - recompute EligibilityState from release config + rollout policy + rollout state
- snapshot propagation
  - publish latest ConfigSnapshot(version) to serving edges/update-check nodes
RecordInstallEvent(device_id, release_id, status, request_id?)
- entry point: async event ingestion endpoint
- authoritative decider: analytics/event pipeline
- transition: append InstallEvent
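Since `request_id` is optional in the signature above, an append path might deduplicate only when the caller supplies one. The following in-memory sketch illustrates that behavior; `InstallEventLog` is a hypothetical name, and a real pipeline would persist both the log and the dedup set.

```python
from typing import Optional


class InstallEventLog:
    """Append-only event log with optional request_id dedup."""

    def __init__(self):
        self._events = []               # append-only list of event dicts
        self._seen_request_ids = set()

    def record(self, device_id: str, release_id: str, status: str,
               request_id: Optional[str] = None) -> bool:
        """Append the event; return False if request_id was already recorded."""
        if request_id is not None:
            if request_id in self._seen_request_ids:
                return False            # retried delivery of the same report
            self._seen_request_ids.add(request_id)
        self._events.append({
            "device_id": device_id,
            "release_id": release_id,
            "status": status,
            "request_id": request_id,
        })
        return True

    def events(self) -> list:
        """Return a copy so callers cannot mutate the log."""
        return list(self._events)
```

Events without a `request_id` are always appended, which matches the failure-handling note that duplicates coexist unless dedup is applied.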
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| check for update | edge/update-check node | local config snapshot | edge/update-check node | rollout service |
| release/policy update | control-plane API | config/policy store owner | control-plane node | rollout service |
| rollout-state transition | control-plane API | rollout-state owner | control-plane node | rollout service |
| effective-state recompute | control plane | recompute worker | internal | rollout service |
| snapshot propagation | serving node / control plane | snapshot publisher | control/data-plane | rollout service |
| install reporting | async ingestion endpoint | event pipeline | ingestion node | rollout analytics subsystem |
Concrete HLD #
Main components:
- control-plane API
- release + rollout state store
- effective-eligibility compiler/recompute worker
- snapshot distribution layer
- edge/update-check evaluators
- optional install analytics pipeline
Short interview version #
“I’d design the mobile OS rollout system as a control-plane/data-plane service. The control plane stores release metadata, staged rollout policy, and rollout lifecycle state, then compiles them into an effective eligibility snapshot. Devices don’t hit the control plane on every check; update-check servers evaluate locally from versioned snapshots and use deterministic bucketing so the same device stays in the same treatment for a given config version. Pause, resume, and rollback are explicit lifecycle transitions, and the main correctness boundary is bounded-stale snapshot propagation to the serving edge.”