Serverless Compute Platform (Lambda-Class) Analysis Note #
This note captures the full step-by-step analysis for a Lambda-class serverless compute platform: function config and code artifacts, versioned routing, invocation admission, worker/node placement, execution claims, and result delivery.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a serverless compute platform like AWS Lambda
- users deploy functions and update configuration
- clients invoke functions synchronously or asynchronously
- platform schedules invocations onto workers/nodes
- execution environment may scale up/down dynamically
- versions/aliases and rollout of function code/config matter
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Developer deploys or updates function code/config | Client | overwrite state | S1: update target `FunctionConfig` | C1 |
| Developer publishes version/alias routing | Client | overwrite state | S1: update target `VersionRouting` | C1 |
| Client invokes function | Client | append event | S1: create target `InvocationRequest` | C1 |
| System admits and schedules invocation to worker/node | System | state transition | S1: update target `InvocationState` | C1 |
| Worker claims invocation for execution | Worker | state transition | S1: update target `InvocationState` | C1 |
| Worker reports completion/result | Worker | state transition | S1: update target `InvocationState` | C1 |
| System manages worker/node capacity and liveness | System | overwrite state | S1: update target `WorkerState` | C1 |
| System propagates function/config snapshot to workers | System | async process | S1: hidden write target `WorkerSnapshot` | C1 |
| Client reads invocation status/logs | Client | read projection | S1: read projection target `InvocationStatusView` | R2 |
Notes on normalization:
- deployment/config updates are overwrite state
- invocation is append event
- invocation request itself is an immutable fact
- scheduling/claim/completion are execution lifecycle transitions
- worker capacity/liveness is overwrite state
- worker snapshot propagation is async process
This system is a composition of:
- Control Plane + Data Plane
- Future/Claimable Execution Process

The execution path is the main correctness center.
Step 2 — Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Deploy/update function config | C1 | changes future execution truth |
| Publish version/alias routing | C1 | changes which code/config invocations should use |
| Invoke function | C1 | an accepted invocation request must be neither lost nor duplicated |
| Admit/schedule invocation | C1 | invocation must be routed to valid execution capacity |
| Worker claims invocation | C1 | one execution attempt should own the invocation attempt at a time |
| Worker reports completion/result | C1 | result/lifecycle truth depends on it |
| Manage worker/node capacity and liveness | C1 | stale worker state breaks scheduling correctness |
| Propagate function/config snapshot to workers | C1 | stale worker snapshots can run wrong code/config |
| Read invocation status/logs | R2 | operational/user-facing but secondary to correctness baseline |
Critical paths:
- P1 update function config
- P2 update version routing
- P3 create invocation request
- P4 schedule invocation
- P5 claim invocation
- P6 complete invocation
- P7 update worker state
- P8 propagate worker snapshot
Step 3 — Primary State Extraction #
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| FunctionConfig | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | function_id or function_version |
| VersionRouting | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | function_alias |
| InvocationRequest | direct noun | Yes | keep as candidate | event | Yes | service | append-only | instance | invocation_id |
| InvocationState | lifecycle object | Yes | keep as candidate | process | Yes | service | state machine | instance | invocation_id |
| WorkerState | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | instance | worker_id |
| WorkerSnapshot | hidden write target | Yes | keep as candidate | projection | Yes | service | overwrite | instance | worker_id |
| InvocationStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | function_id or invocation_id |
Minimal primary set:
- `FunctionConfig`
- `VersionRouting`
- `InvocationRequest`
- `InvocationState`
- `WorkerState`
- `WorkerSnapshot`
Important modeling choices:
InvocationRequest #
Primary because:
- invocation request arrival is an immutable fact
InvocationState #
Primary because:
- scheduling, claim, execution, retry, and completion all depend on current execution lifecycle state
WorkerSnapshot #
Worth modeling explicitly because:
- workers often execute from a versioned local view of function code/config/runtime state
- stale snapshot/version mismatch matters
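The modeling choices above could be sketched as plain data types. This is a minimal illustration, not the platform's schema: the `Lifecycle` state names and every field other than the identifiers already used in this note (`invocation_id`, `worker_id`, `request_id`, `attempt_epoch`) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Lifecycle(Enum):
    # Hypothetical state names; the note only names the transitions.
    PENDING = "pending"
    SCHEDULED = "scheduled"
    CLAIMED = "claimed"
    COMPLETED = "completed"


@dataclass(frozen=True)
class InvocationRequest:
    """Append-only event: frozen so it cannot be mutated after creation."""
    invocation_id: str
    function_ref: str
    payload: bytes
    request_id: Optional[str] = None  # idempotency key, if the client sent one


@dataclass
class InvocationState:
    """State machine scoped to a single invocation_id."""
    invocation_id: str
    state: Lifecycle = Lifecycle.PENDING
    version: int = 0            # bumped on every lifecycle transition
    attempt_epoch: int = 0      # identifies the current execution attempt
    owner_worker_id: Optional[str] = None


@dataclass
class WorkerSnapshot:
    """Versioned local view of function/config state held by one worker."""
    worker_id: str
    snapshot_version: int
    # function_id -> function version known to this worker (assumed layout)
    function_versions: Dict[str, int] = field(default_factory=dict)
```

Keeping the request frozen and the lifecycle mutable mirrors the split between the immutable arrival fact and the state machine that scheduling, claim, and completion act on.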
Step 4 — Hard Invariants #
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 update function config | HARD | ordering | Function-config revisions are ordered by monotonic version within function_id or version scope. |
| P2 update version routing | HARD | ordering | Alias/version-routing revisions are ordered by monotonic version within routing scope. |
| P3 create invocation request | HARD | uniqueness | Each invoke request_id maps to at most one invocation request within function scope. |
| P4 schedule invocation | HARD | eligibility | schedule_invocation is valid only if a compatible, healthy worker/node is eligible under current function config, version routing, and worker state. |
| P5 claim invocation | HARD | uniqueness | Each invocation_id maps to at most one current active execution-attempt owner within execution-attempt scope. |
| P6 complete invocation | HARD | eligibility | complete_invocation is valid only if current InvocationState is owned by the reporting execution attempt and lifecycle state allows completion. |
| P7 update worker state | HARD | ordering | Worker-state updates are ordered by monotonic observation revision/timestamp within worker_id. |
| P8 propagate worker snapshot | HARD | freshness | WorkerSnapshot(worker_id) reflects authoritative function/version/config state within configured propagation bound. |
What matters most:
- one invocation attempt owner at a time
- workers must execute the intended function version/config
- stale worker/node health or stale snapshot must not cause incorrect execution
- invocation request dedup/idempotency matters for retries
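The P3 uniqueness invariant (one invocation request per `request_id` within function scope) can be illustrated with a small in-memory log. This is a sketch under stated assumptions: `InvocationLog` and its method names are hypothetical, and a real log would persist the event and the idempotency index atomically rather than keep them in process memory.

```python
import uuid
from typing import Dict, List, Optional, Tuple


class InvocationLog:
    """In-memory stand-in for a durable append-only invocation log."""

    def __init__(self) -> None:
        self._events: List[dict] = []
        # (function_ref, request_id) -> invocation_id
        self._by_request: Dict[Tuple[str, str], str] = {}

    def append(self, function_ref: str, payload: bytes,
               request_id: Optional[str] = None) -> str:
        """Append a request; a retry with the same key returns the same id."""
        key = (function_ref, request_id) if request_id is not None else None
        if key is not None and key in self._by_request:
            return self._by_request[key]  # duplicate invoke: no new event
        invocation_id = uuid.uuid4().hex
        self._events.append({
            "invocation_id": invocation_id,
            "function_ref": function_ref,
            "payload": payload,
            "request_id": request_id,
        })
        if key is not None:
            self._by_request[key] = invocation_id
        return invocation_id

    def __len__(self) -> int:
        return len(self._events)
```

Requests without a `request_id` are always appended as new events in this sketch, so deduplication only applies when the client supplies a key.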
Step 5 — Execution Context #
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical serverless platform across many schedulers and workers |
| Write coordination scope | per object scope | correctness is per function version/routing, per invocation, and per worker |
| Read consistency target | bounded stale allowed | workers can execute from recent versioned snapshots, but invocation lifecycle state should be authoritative |
| Holder model | worker | workers temporarily own execution attempts |
| Compensation acceptable? | No | wrong execution assignment or duplicate execution cannot be blindly compensated in the baseline |
Derived:
- `holder_may_crash = true`
- `bounded_staleness_allowed = true`
- `exclusive_claim_required = true`
- `guarded_by_current_state = true`
This implies:
- authoritative control plane for config/routing
- versioned worker snapshots
- claimable invocation execution lifecycle
Step 6 — Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 update function config | overwrite current value | CAS on version | config version |
| P2 update version routing | overwrite current value | CAS on version | routing version |
| P3 create invocation request | append-only event | append log | idempotency key |
| P4 schedule invocation | guarded state transition | single writer scheduler decision or CAS on state/version | placement version |
| P5 claim invocation | exclusive claim | lease | fencing token, heartbeat |
| P6 complete invocation | guarded state transition | CAS on (state, version) | attempt token / execution epoch |
| P7 update worker state | overwrite current value | CAS on version or monotonic overwrite | worker revision/timestamp |
| P8 propagate worker snapshot | overwrite current value | single writer snapshot publication | snapshot version |
Why these fit:
- function config and routing are versioned current-state control-plane objects
- invocation request is immutable
- scheduling and completion depend on current lifecycle state
- execution ownership is a classic exclusive claim problem
- worker state and snapshots are current-state overwrite semantics
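For the versioned control-plane objects (P1 and P2), compare-and-set on version might look like the sketch below. `VersionedStore` and `VersionConflict` are hypothetical names, and the example is single-threaded; a real store would have to perform the version check and the write as one atomic operation.

```python
from typing import Any, Dict, Optional, Tuple


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy key-value store where every write bumps a per-key version."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, Any]] = {}

    def get(self, key: str) -> Tuple[int, Any]:
        # Version 0 means "never written".
        return self._data.get(key, (0, None))

    def put(self, key: str, value: Any,
            expected_version: Optional[int] = None) -> int:
        current_version, _ = self.get(key)
        # The CAS guard: a writer that read an older version loses.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"{key}: expected {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, value)
        return new_version
```

Leaving `expected_version` unset gives an unconditional overwrite, which matches the optional `expected_version?` parameter style used for the deploy and routing operations in this note.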
Step 7 — Read Model / Source of Truth #
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 function config | FunctionConfig | read source directly | authoritative config store |
| C2 version/alias routing | VersionRouting | read source directly | authoritative routing store |
| C3 invocation request | InvocationRequest | read source directly | authoritative invocation log/store |
| C4 invocation lifecycle | InvocationState | read source directly | authoritative lifecycle store |
| C5 worker/node state | WorkerState | read source directly | authoritative worker-state store |
| C6 worker local snapshot | WorkerSnapshot | materialized view | rebuild from latest function/routing state |
| C7 invocation status/logs | derived | materialized view | recompute from primary state |
Important point:
Execution workers should not fetch full control-plane state synchronously for every invocation. They should:
- receive/pull versioned worker snapshots
- validate invocation/version compatibility against those snapshots
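A worker-side compatibility check against the local snapshot, with a single refresh on version miss, could be sketched as follows. The `LocalSnapshot` layout, the `fetch_snapshot` callback, and the `pulls` counter are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LocalSnapshot:
    snapshot_version: int = 0
    # function_id -> function version (assumed layout)
    function_versions: Dict[str, int] = field(default_factory=dict)


class Worker:
    def __init__(self, fetch_snapshot: Callable[[], LocalSnapshot]) -> None:
        self._fetch = fetch_snapshot  # stand-in for a control-plane pull
        self.snapshot = LocalSnapshot()
        self.pulls = 0                # counts control-plane round trips

    def can_execute(self, function_id: str, required_version: int) -> bool:
        """Validate against the local view; pull once if it does not match."""
        if self.snapshot.function_versions.get(function_id) != required_version:
            fresh = self._fetch()
            self.pulls += 1
            # Only adopt the fetched snapshot if it is strictly newer.
            if fresh.snapshot_version > self.snapshot.snapshot_version:
                self.snapshot = fresh
        return self.snapshot.function_versions.get(function_id) == required_version
```

The steady-state path touches only local state; the control plane is contacted only when the invocation names a version the worker has not seen.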
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| config/routing update | retry with version | stale update loses CAS | committed config survives control-plane crash if persisted | snapshot propagation may lag | n/a |
| invocation request | retry with invoke request_id | duplicate invoke request prevented by idempotency key | committed invocation request survives crash if persisted | enqueue/admission publication may lag | n/a |
| schedule invocation | retry with state/version guard | stale scheduler decision loses guarded transition | scheduler crash delays assignment; next scheduler retries | n/a | stale scheduler decision rejected by newer lifecycle state |
| claim invocation | retry with execution attempt epoch | only one worker should hold current active execution claim | if claim committed and worker crashes, lease expires and invocation can be retried/reassigned | n/a | stale worker fenced by attempt token/epoch |
| complete invocation | retry with current attempt token | stale or wrong attempt loses guarded transition | committed completion survives crash if persisted | result publication/logging may lag | stale worker cannot complete after lease loss |
| worker-state update | retry with monotonic worker revision | latest valid worker state wins | committed worker state survives crash if persisted | worker-snapshot propagation may lag | n/a |
| snapshot propagation | retry with versioned snapshot | older snapshot loses to newer version | worker keeps last good snapshot until refresh | failed push retried or pulled later | n/a |
What matters most:
- invocation request idempotency
- worker claim fencing
- worker snapshot versioning
- retry/reassignment after worker crash
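Claim fencing and reassignment after a worker crash can be illustrated with a lease table that issues a new attempt epoch on every successful claim. This is a sketch, not the note's exact `ClaimInvocation` contract: here the table issues the epoch instead of receiving it, time is passed in explicitly to keep the example deterministic, and `ClaimTable`, `Lease`, and the 30-second TTL are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    worker_id: str
    attempt_epoch: int
    expires_at: float


class ClaimTable:
    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self._ttl = ttl_seconds
        self._leases: Dict[str, Lease] = {}
        self._epochs: Dict[str, int] = {}  # last epoch issued per invocation

    def claim(self, invocation_id: str, worker_id: str,
              now: float) -> Optional[int]:
        """Return a new attempt epoch, or None if a live lease exists."""
        lease = self._leases.get(invocation_id)
        if lease is not None and lease.expires_at > now:
            return None
        epoch = self._epochs.get(invocation_id, 0) + 1
        self._epochs[invocation_id] = epoch
        self._leases[invocation_id] = Lease(worker_id, epoch, now + self._ttl)
        return epoch

    def is_current(self, invocation_id: str, attempt_epoch: int,
                   now: float) -> bool:
        """True only for the live lease holder's epoch (the fencing check)."""
        lease = self._leases.get(invocation_id)
        return (lease is not None
                and lease.attempt_epoch == attempt_epoch
                and lease.expires_at > now)

    def heartbeat(self, invocation_id: str, attempt_epoch: int,
                  now: float) -> bool:
        """Extend the lease; fails for stale epochs or expired leases."""
        if not self.is_current(invocation_id, attempt_epoch, now):
            return False
        self._leases[invocation_id].expires_at = now + self._ttl
        return True
```

Because epochs only increase, a worker that lost its lease keeps holding an old epoch and is rejected by `is_current` even if it resumes after the invocation was reassigned.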
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| invocation bursts | write throughput hotspot | shard invocation queues and spread scheduling across workers/zones |
| cold-start pressure | contention hotspot | warm pools / pre-initialized workers and image/runtime caching |
| config/routing churn | fan-out hotspot | batch updates and publish incremental snapshots |
| worker snapshot storms | fan-out hotspot | backoff reconnects and support pull-on-version-miss |
| worker-state heartbeat volume | write throughput hotspot | aggregate/pace heartbeats and compress worker-state deltas |
| invocation status/log reads | read hotspot | keep them as derived views only |
What scales well:
- control plane is narrow and versioned
- invocation requests are append-only
- execution claims are local to schedulers/workers
- workers execute from local snapshots
What fails first:
- bursty invocation admission
- cold-start storms
- worker-snapshot fanout
- stale worker-state causing poor placement
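As one possible first response to invocation bursts, invocation queues can be sharded by hashing the invocation id. The function below is an illustrative sketch; the choice of SHA-256 and of `invocation_id` as the shard key are assumptions, picked so that a single hot function does not pin all of its traffic to one shard.

```python
import hashlib


def shard_for(invocation_id: str, num_shards: int) -> int:
    """Map an invocation id to a queue shard, stable across processes."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(invocation_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce mod shards.
    return int.from_bytes(digest[:8], "big") % num_shards
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is salted per process for strings, so different schedulers would disagree on the shard.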
Canonical design conclusion:
- archetype composition:
  - Control Plane + Data Plane
  - Claimable Execution Process
- primary truth:
  - `FunctionConfig`
  - `VersionRouting`
  - `InvocationRequest`
  - `InvocationState`
  - `WorkerState`
  - `WorkerSnapshot`
- hot execution path:
  - invocation request append
  - scheduler placement
  - worker claim
  - guarded completion
- control plane:
  - authoritative config/routing + snapshot propagation
Concrete Substrate #
- control plane in `Go/Java`
- authoritative config/routing/state in a strongly consistent config/state store
- invocation request path via append-only invocation log or durable queue
- scheduler fleet placing work onto workers
- worker runtime fleet with local versioned snapshots of function/runtime/config state
- optional code artifact storage in object store/CDN-backed artifact distribution
Operation Layer #
DeployFunction(function_id, code_ref, config, expected_version?)
- entry point: control-plane API
- authoritative decider: config store owner
- transition: overwrite `FunctionConfig`
PutVersionRouting(function_alias, routing, expected_version?)
- entry point: control-plane API
- authoritative decider: routing store owner
- transition: overwrite `VersionRouting`
Invoke(function_ref, payload, request_id?)
- entry point: invoke API
- authoritative decider: invocation-ingest owner
- transition: append `InvocationRequest`
- response: invocation id or synchronous result handle

internal scheduling
- reads invocation request + worker state + routing
- updates `InvocationState` with placement decision
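The placement decision in internal scheduling has to respect the P4 eligibility rule: only compatible, healthy workers qualify. A minimal sketch follows, assuming a hypothetical `WorkerView` record in which compatibility is reduced to a `runtime` string and a minimum snapshot version; real compatibility checks would likely cover more dimensions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class WorkerView:
    worker_id: str
    healthy: bool
    free_slots: int
    runtime: str           # hypothetical compatibility attribute
    snapshot_version: int  # version of the worker's local snapshot


def pick_worker(workers: Iterable[WorkerView], runtime: str,
                min_snapshot_version: int) -> Optional[WorkerView]:
    """Return the eligible worker with the most free slots, or None."""
    eligible = [
        w for w in workers
        if w.healthy
        and w.free_slots > 0
        and w.runtime == runtime
        and w.snapshot_version >= min_snapshot_version
    ]
    if not eligible:
        return None  # nothing eligible: the invocation stays unscheduled
    # Ties on free_slots are broken by worker_id so the choice is deterministic.
    return max(eligible, key=lambda w: (w.free_slots, w.worker_id))
```

The returned choice is only a proposal; it still has to be recorded through the guarded `InvocationState` transition so that a stale scheduler decision loses.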
ClaimInvocation(invocation_id, worker_id, attempt_epoch)
- entry point: worker/scheduler
- authoritative decider: invocation-state owner
- transition: exclusive claim on invocation attempt
CompleteInvocation(invocation_id, attempt_epoch, result)
- entry point: worker
- authoritative decider: invocation-state owner
- transition: guarded completion
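Guarded completion can be sketched as a check on both the attempt epoch and the lifecycle state before the write. `InvocationRecord` and the string state names are assumptions for this example, and the in-process mutation stands in for a CAS on (state, version) against the lifecycle store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvocationRecord:
    invocation_id: str
    state: str = "claimed"   # hypothetical state names
    version: int = 0
    attempt_epoch: int = 1   # epoch of the attempt that currently owns it
    result: Optional[bytes] = None


def complete_invocation(rec: InvocationRecord, attempt_epoch: int,
                        result: bytes) -> bool:
    """Apply completion only for the owning attempt in a completable state."""
    if rec.attempt_epoch != attempt_epoch:
        return False  # stale or wrong attempt is fenced off
    if rec.state == "completed":
        return True   # retry of an already-applied completion; first result kept
    if rec.state != "claimed":
        return False  # lifecycle state does not allow completion
    rec.state = "completed"
    rec.result = result
    rec.version += 1
    return True
```

Treating a repeated completion from the same attempt as success makes the worker's retry safe without allowing a second result to overwrite the first.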
ReportWorkerState(worker_id, revision, capacity, health)
- entry point: worker-state API
- authoritative decider: worker-state owner
- transition: overwrite `WorkerState`

snapshot propagation
- publish latest `WorkerSnapshot(version)` to workers
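Both worker-state reports and snapshot publication rely on the same rule: an older revision must never replace a newer one. The sketch below shows that monotonic overwrite for `ReportWorkerState`; the `WorkerStateStore` class and the string `health` values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WorkerState:
    worker_id: str
    revision: int
    capacity: int
    health: str  # e.g. "healthy" / "draining"; values are illustrative


class WorkerStateStore:
    def __init__(self) -> None:
        self._states: Dict[str, WorkerState] = {}

    def report(self, update: WorkerState) -> bool:
        """Accept the update only if its revision is strictly newer."""
        current = self._states.get(update.worker_id)
        if current is not None and update.revision <= current.revision:
            return False  # late or duplicate report is dropped
        self._states[update.worker_id] = update
        return True

    def get(self, worker_id: str) -> Optional[WorkerState]:
        return self._states.get(worker_id)
```

With this rule, a delayed heartbeat that arrives after a newer one cannot make a drained worker look healthy again.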
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| deploy/update config | control-plane API | config/routing store owner | control-plane node | serverless platform |
| invoke function | invoke API | invocation-ingest owner | API node | serverless platform |
| schedule invocation | scheduler | invocation-state owner + scheduler | internal | serverless platform |
| claim invocation | worker/scheduler | invocation-state owner | state owner/scheduler | serverless platform |
| complete invocation | worker | invocation-state owner | state owner | serverless platform |
| worker-state update | worker-state API | worker-state owner | control-plane node | serverless platform |
| snapshot propagation | worker / control plane | snapshot publisher | control/data-plane | serverless platform |
Concrete HLD #
Main components:
- control-plane API
- function config and routing state store
- invocation ingest log/store
- scheduler fleet
- worker-state store
- worker runtime fleet
- snapshot distribution/artifact distribution layer
- derived invocation status/log views
Short interview version #
“I’d design the Lambda-class platform as a control-plane plus claimable execution system. The control plane stores function config and version routing, then publishes versioned snapshots to workers. Invocation requests are immutable events written to a durable invocation log, schedulers place them onto compatible workers, workers claim execution attempts with fenced ownership, and completion is a guarded lifecycle transition. The main correctness boundaries are request idempotency, one active execution attempt at a time, and workers executing with the intended function/runtime snapshot.”