System Design Derivation Framework #
A complete, mechanical procedure for deriving distributed system designs from functional requirements — from vague product language to operable architecture, with no hand-wavy steps.
Ordering Principle #
Every layer constrains the next. Skipping a layer makes later choices arbitrary.
Product requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6) ← the mechanical bridge
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
A design is complete only when all 20 steps produce explicit output artifacts. If you cannot produce the artifact for a step, you have not finished that step.
Step 1 — Problem Normalization #
Goal #
Convert product-language requirements into precise operations over state. This is the first act of removing ambiguity.
Why It Matters #
Product requirements blend UI framing, transport behavior, derived views, and domain semantics. Without normalization, you will:
- confuse what the user sees with what the system stores
- miss hidden writes behind apparently simple reads
- fail to detect that “subscribe to alerts” implies a future-constraint store and a trigger engine
- fail to detect that “view live leaderboard” implies a projection, not primary state
Procedure #
For each requirement, identify:
1. Actor — who initiates it?
- user, client, internal service, scheduler, admin, external partner
2. Operation — what state operation is actually happening?
| UI verb | Actual system operation |
|---|---|
| view / see | read base record, OR read projection, OR fetch historical slice, OR subscribe from cursor |
| submit / create | create record, OR append event |
| update / edit | overwrite mutable state, OR transition state machine |
| get notification | system evaluates condition + delivers push |
| subscribe | persist future constraint (intent object) |
| search | query indexed projection |
| swipe / like | append immutable decision event |
| bid | attempt conditional update on contention state |
3. State — which underlying data actually changes or is observed? (Not the screen. Not the endpoint. The data.)
Output Artifact #
| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| Users can bid on an item | User | attempt conditional update | Auction acceptance state |
| Users see current highest bid | Client | read projection | AuctionView (derived) |
| Users get price-drop notifications | System | evaluate + deliver | AlertSubscription + PriceObservation |
| Drivers go online | Driver | overwrite status | Driver presence state |
| Rider requests a ride | Rider | create | Trip + LocationPing |
| Users swipe right on a profile | User | append decision event | Swipe record |
| Users get a match notification | System | detect mutual condition + push | Swipe store + Match record |
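The normalization table lends itself to a structured representation, so the completion test can be checked mechanically rather than by eye. A minimal sketch, assuming a simple record type (all class and field names here are illustrative, not part of the framework):

```python
from dataclasses import dataclass

# Hypothetical record type for one row of the normalization table.
@dataclass(frozen=True)
class NormalizedRequirement:
    original: str       # product-language requirement
    actor: str          # who initiates it
    operation: str      # the actual state operation
    state_touched: str  # the underlying data, not the screen or endpoint

rows = [
    NormalizedRequirement("Users can bid on an item", "User",
                          "attempt conditional update", "Auction acceptance state"),
    NormalizedRequirement("Users see current highest bid", "Client",
                          "read projection", "AuctionView (derived)"),
]

# Completion test becomes checkable: every requirement has all three parts.
assert all(r.actor and r.operation and r.state_touched for r in rows)
```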
Completion Test #
- Every requirement is rewritten as actor-operation-state.
- Every “view” has been translated into the exact underlying read or projection type.
- Every apparently simple requirement has had hidden writes, triggers, or derived views exposed.
Step 2 — Object Extraction #
Goal #
Identify the minimal set of primary state objects. A primary object is one whose state must exist independently and cannot be fully reconstructed from other state.
Object Classes #
These are heuristic categories, not a rigid ontology. The real test is the validity rule below.
| Class | Definition | Examples |
|---|---|---|
| Stable entity | Long-lived, mutable, continuously existing | User, Product, Video, Driver, Problem |
| Event | Immutable, point-in-time occurrence | Bid, Swipe, LocationPing, PriceObservation, ActivitySample |
| Process object | Identity persists across state-machine lifecycle | Auction, Trip, Payment, Activity, JobRun |
| Intent / future constraint | Persists until a future condition is met | AlertSubscription, ScheduledJob, RetryPolicy |
| Relationship object | An edge between entities with its own lifecycle | Friendship, Match, SharePermission, FollowRelation |
| Derived view | Computable from other state | Leaderboard, CurrentHighestBid, Stack, FeedSlice, SurgeMultiplier |
Object Validity Rule #
An object is valid as a primary object if and only if it satisfies all four:
1. Ownership purity — there is one clear answer to “who writes this?”
- Bad: an object written by users, a pipeline, a scheduler, and a moderator under different rules → split it
- Good: Bid is written only by the bidding user
2. Evolution purity — the object evolves in one dominant way
| Evolution family | Description |
|---|---|
| Append-only | New records added; old records never modified |
| Overwrite | One current state updated in place |
| State machine | Guarded transitions between named states |
| Merge | Concurrent writes reconciled via merge law (CRDT) |
- Bad: one object that needs both append history and overwrite current state → it is two objects
- Good: Bid is append-only; Auction is a state machine
3. Ordering purity — there is one ordering relation among its versions/events, bound to a scope
- Bad: “events are ordered” (unbound) → ordered within what scope?
- Good: “Bids are ordered by acceptance time within one auction”
4. Non-derivability — the object cannot be reconstructed from other primary state
- CurrentHighestBid can be derived from accepted Bid records → it is a projection, not primary
- Leaderboard can be derived from Scores → it is a projection
Typical Invalid Objects #
| Candidate | Problem | Fix |
|---|---|---|
| Leaderboard | Derivable from Submissions + ranking rules | Classify as derived view |
| CurrentHighestBid | Derivable from Auction + Bids | Classify as derived view |
| UserFeed | Derivable from Follows + Posts | Classify as derived view |
| ActivityPage | UI artifact | Not a domain object at all |
| SurgeMultiplier | Derivable from supply/demand counts | Classify as derived view |
| Stack (Tinder) | Derivable from Users + Swipes + location | Classify as derived view |
Output Artifact #
Primary Objects: list with class
Derived / Rejected: list with reason for rejection
Completion Test #
- Every candidate object has been classified.
- Every primary object passes all four validity tests.
- Every rejected object has an explicit reason.
Step 3 — Axis Assignment #
Goal #
For every primary object, define exactly three properties. These three axes determine most architecture-relevant behavior.
The Three Axes #
Axis 1: Ownership #
Who is allowed to write this object, and under what contention model?
| Ownership pattern | Meaning |
|---|---|
| Single writer | Exactly one actor writes this object. No concurrency needed. |
| Single writer per partition | One actor per key/scope, but multiple keys in parallel. |
| Multi-writer, one winner | Many actors attempt writes; only one succeeds per atomic decision. |
| Multi-writer, all succeed (commutative) | Concurrent writes merge without conflict. |
| Multi-writer, pipeline | A processing pipeline is the sole writer; upstream actors just enqueue. |
| System-only | Only internal services write; users never write directly. |
Axis 2: Evolution #
How does the object’s state change over time?
| Evolution | Meaning | Implication |
|---|---|---|
| Append-only | New records added; old ones never mutated | Immutable; safe to replicate; dedup by content hash or ID |
| Overwrite | Current state replaced | Need version or timestamp to detect staleness |
| State machine | Transitions guarded by allowed-transition rules | CAS on (state, version); forbidden transitions must be enforced |
| Merge | Concurrent writes reconciled by merge law | Requires CRDT or OT; merge must be commutative + associative |
Axis 3: Ordering #
What ordering relation must hold, and within what scope?
| Ordering | Meaning | Implication |
|---|---|---|
| Total order within scope | Any two instances within the scope are comparable | Sequence number per partition key; single-writer or CAS to assign |
| Partial order | Causal relationship (A happened before B, or concurrent) | Vector clocks; version vectors |
| No meaningful order | Membership/set semantics only | No ordering mechanism needed |
| Causal lifecycle order | State machine transitions must follow valid paths | State machine enforcement; forbidden transitions rejected |
Critical rule: Always bind ordering to a scope. “Events are ordered” is incomplete. “Accepted bids are totally ordered within one auction” is correct.
Output Artifact #
Object: Auction
Ownership: Multi-writer (many bidders), one winner per auction — system enforces
Evolution: State machine (created → open → ended; current_highest updated on each acceptance)
Ordering: Causal lifecycle order for state transitions; total order on acceptance sequence within auction_id
Object: Bid
Ownership: Single writer (the bidding user)
Evolution: Append-only (bid attempt is immutable once created)
Ordering: Total order by created_at within auction_id (for display); acceptance order is separate
Object: Driver
Ownership: Multi-writer, one winner (many dispatchers compete; only one can claim a driver)
Evolution: State machine (offline → available → dispatched → on_trip → available)
Ordering: Causal lifecycle order (forbidden transitions: dispatched → offline without completing trip)
Object: Swipe
Ownership: Single writer (the swiping user)
Evolution: Append-only (a swipe decision is immutable)
Ordering: Total order by timestamp within swiper_id (for exclusion set scan)
Object: AlertSubscription
Ownership: Single writer (the subscribing user)
Evolution: Overwrite (user may update threshold, disable, delete)
Ordering: No meaningful order (set of active subscriptions per user)
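The axis assignments above can be encoded as data so that Step 6 can consume them mechanically. A minimal sketch with illustrative enum and field names (not part of the framework):

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative encodings of the three axes.
class Ownership(Enum):
    SINGLE_WRITER = "single_writer"
    SINGLE_WRITER_PER_PARTITION = "single_writer_per_partition"
    MULTI_WRITER_ONE_WINNER = "multi_writer_one_winner"
    MULTI_WRITER_COMMUTATIVE = "multi_writer_commutative"

class Evolution(Enum):
    APPEND_ONLY = "append_only"
    OVERWRITE = "overwrite"
    STATE_MACHINE = "state_machine"
    MERGE = "merge"

@dataclass(frozen=True)
class AxisAssignment:
    obj: str
    ownership: Ownership
    evolution: Evolution
    ordering: str  # must name its scope, e.g. "total order within auction_id"

auction = AxisAssignment("Auction", Ownership.MULTI_WRITER_ONE_WINNER,
                         Evolution.STATE_MACHINE,
                         "total order on acceptance sequence within auction_id")

# The critical rule from this step: ordering must be bound to a scope.
assert "within" in auction.ordering
```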
Why Axes Matter More Than Patterns #
The axes directly constrain which mechanisms are valid:
- Ownership = “multi-writer, one winner” + Evolution = “state machine” → the mechanism must be CAS or single-writer partition. CRDT is wrong (the operation is not commutative). 2PC is wrong (the scope is a single service, not cross-service). This eliminates the entire pattern space except CAS/Lock/Lease/Single-writer before Q1-Q5 are even applied.
- Ownership = “multi-writer, all succeed” + Evolution = “merge” → the mechanism must be CRDT. Lock is wrong. 2PC is wrong.
Axes are necessary to make Step 6 (mechanism selection) mechanical.
Completion Test #
- Every primary object has explicit ownership, evolution, and ordering.
- Every ordering is bound to a scope.
- No object has mixed evolution without being flagged for splitting.
Step 4 — Invariant Extraction #
Goal #
Convert product requirements into precise statements that must always be true in every valid execution. Invariants are the bridge from object model to architecture. Patterns and technologies are downstream of invariants; invariants are not downstream of anything.
What An Invariant Is #
A precise, testable property of valid system state:
“For every valid state S, condition X must hold.”
It must be:
- Precise — a stranger can determine whether the system satisfies it
- Testable — you can write a check that passes or fails
- Implementation-independent — it describes what must be true, not how to achieve it
- Concurrency-aware — it must hold even under simultaneous writes, retries, and crashes
Invariant Types #
Type 1: Eligibility invariants #
An action is valid only when a condition holds.
An action A on object O is valid only if condition C(O) is true at decision time.
Examples:
- A bid is accepted only if auction.status == OPEN AND server_time < auction.end_time AND bid.amount > auction.current_highest
- A driver is dispatched only if driver.status == AVAILABLE
- A job run starts only if run.status == CLAIMED AND run.claimed_by == this_worker
Implication: Some mechanism must atomically check C and perform A. If C can change between check and action, a concurrency mechanism is required.
Type 2: Ordering invariants #
Events or versions of an object must be comparable and consistent within a scope.
For any two instances I₁ and I₂ within scope S, there is a well-defined ordering, and all readers agree on it.
Examples:
- Accepted bid amounts form a strictly increasing sequence within one auction
- Comments are delivered in the same order to all subscribers of a video
- State machine transitions follow valid paths (no backwards transitions)
Implication: A single authority assigns order within the scope. No two actors may independently assign conflicting positions.
Type 3: Accounting invariants #
An aggregate must exactly reflect its source events.
aggregate(scope S) = f(all events in scope S)
Examples:
- All-time score for video V equals the sum of all ScoreEvent.delta where event.video_id = V
- Monthly active users for month M = count(distinct user_id where last_active ∈ M)
- Inventory quantity = initial_stock − sum(reservations) + sum(cancellations)
Implication: Every event that contributes to the aggregate must be processed exactly once. Late or duplicate events must be handled explicitly.
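An accounting invariant is directly executable as a check against the event source. A toy sketch with in-memory events (all names illustrative):

```python
# aggregate(scope S) = f(all events in scope S); here: score = sum of deltas.
events = [
    {"video_id": "v1", "delta": 10},
    {"video_id": "v2", "delta": 3},
    {"video_id": "v1", "delta": -2},
]

def all_time_score(video_id, events):
    return sum(e["delta"] for e in events if e["video_id"] == video_id)

stored_aggregate = {"v1": 8, "v2": 3}  # what the aggregate store holds

# The invariant check: the stored aggregate exactly reflects its source events.
for vid, total in stored_aggregate.items():
    assert total == all_time_score(vid, events)
```

A check like this, run against the real event log, is how accounting invariants become auditable in production.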
Type 4: Uniqueness / idempotency invariants #
The same logical request, submitted multiple times, produces the same logical outcome.
If request R is submitted N times (N ≥ 1), the observable outcome is identical to submitting it once.
Examples:
- Submitting the same bid request twice does not create two accepted bids
- Retrying a payment capture does not charge the user twice
- Swiping right on the same profile twice creates at most one Match check
Implication: Either the operation is naturally idempotent (append-only with content-addressed ID), or a dedup mechanism must detect and absorb duplicates.
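A dedup gate can be sketched in a few lines. This toy version is single-threaded; in a real system the lookup and the outcome write must themselves be atomic (names illustrative):

```python
# Minimal dedup gate: the same request_id always yields the same outcome,
# no matter how many times the request is retried.
outcomes = {}  # request_id -> stored result

def execute_once(request_id, operation):
    if request_id in outcomes:           # duplicate: absorb, return prior result
        return outcomes[request_id]
    result = operation()                 # first submission: actually execute
    outcomes[request_id] = result
    return result

charges = []
first = execute_once("req-1", lambda: charges.append("capture $10") or "charged")
retry = execute_once("req-1", lambda: charges.append("capture $10") or "charged")

assert first == retry == "charged"
assert len(charges) == 1  # the user was charged exactly once
```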
Type 5: Propagation invariants #
A derived view must eventually reflect its source of truth, and the staleness bound is explicit.
At any time T, derived_view(scope S) = f(source_of_truth(S)) as of time T − ε, where ε is the acceptable staleness.
Examples:
- The auction read view reflects the current highest bid within 100ms of acceptance
- The Tinder stack excludes profiles already swiped before the stack is displayed
- A cache entry is stale by at most TTL seconds
Implication: Some update propagation mechanism must exist. The staleness bound determines whether it must be synchronous (ε = 0) or asynchronous (ε > 0).
Type 6: Access-control invariants #
Reads and writes are gated by authorization.
Actor A may perform operation O on object X only if A has permission P on X.
Examples:
- A user may read only their own activities plus friends’ visible activities
- Only the auction owner may cancel an auction
- A driver may only update their own location
Implication: An authorization check at every service boundary. The authorization state itself (ACL, permission record) is a primary object with its own ownership/evolution model.
Invariant Extraction Procedure #
For each normalized requirement from Step 1:
- Ask: What must always be true for this requirement to be satisfied?
- Ask: What must never be violated, even under concurrent writes, retries, crashes, or late events?
- Classify the invariant by type.
- Write it in precise, testable form.
Output Artifact #
Numbered list:
1. [Eligibility] A bid is accepted only if auction.status == OPEN
AND server_time < auction.end_time AND bid.amount > auction.current_highest_amount.
2. [Ordering] Accepted bid amounts form a strictly increasing sequence within one auction.
3. [Uniqueness] The same bid request_id produces the same logical acceptance result.
4. [Eligibility] A driver is dispatched only if driver.status == AVAILABLE.
5. [Uniqueness] The same dispatch request produces at most one claimed driver.
6. [Accounting] The all-time score for a video equals the sum of all delta values
in ScoreEvent records where video_id matches.
7. [Propagation] The leaderboard for scope S reflects all ScoreEvents within 60 seconds.
8. [Eligibility] A Match is created only if both users have right-swiped each other.
9. [Uniqueness] A Match between user A and user B is created at most once.
Completion Test #
- Every requirement maps to at least one invariant.
- Every invariant is typed, precise, and testable.
- Every invariant is stated in terms of domain state, not implementation.
- Concurrency is explicitly considered for every eligibility invariant.
Step 5 — DP Derivation #
Goal #
Derive the minimal set of design parameters — the minimal mechanisms that enforce the invariant clusters.
A DP is not a technology (not “Postgres” or “Redis”). It is not a pattern name (not “CQRS” or “Event Sourcing”). It is the minimal runtime mechanism that makes an invariant cluster provably enforceable.
Derivation Chain #
requirement → state → invariant → dependency closure → minimal enforcing mechanism
Dependency closure: Ask “what other state is required to enforce this invariant?” This reveals cross-requirement dependencies that are invisible when requirements are read in isolation.
Derivation Procedure #
For each invariant or tightly coupled invariant cluster:
- Identify the state the invariant is about.
- Ask: what must exist at runtime for this invariant to remain true under adversarial conditions (concurrent writes, retries, crashes)?
- Name the minimal mechanism that enforces it.
Examples #
Invariant 1+2 (auction eligibility + ordering):
- State: auction acceptance state (status, current_highest, version)
- What must exist: something that atomically reads current state, validates eligibility, and conditionally updates — rejecting concurrent updates that lost the race
- Minimal DP: Auction acceptance / concurrency control mechanism
Invariant 3 (bid idempotency):
- State: prior outcomes keyed by request_id
- What must exist: a store that maps request_id → prior result, checked before every acceptance attempt
- Minimal DP: Bid idempotency store
Invariant 4+5 (driver dispatch):
- State: driver status
- What must exist: something that atomically claims a driver exclusively — rejecting competing claims
- Minimal DP: Driver claim / exclusion mechanism
Invariant 6+7 (leaderboard accounting + propagation):
- State: per-video aggregate score, top-K projection per scope
- What must exist: (a) a store for aggregate scores that processes every event exactly once; (b) a mechanism that updates top-K when a score crosses the cutoff; (c) a propagation path from score store to projection within 60s
- Minimal DPs: Score aggregate store, Top-K projection maintenance mechanism, Score-to-projection propagation path
DP Validation Rules #
A good DP:
- enforces exactly one invariant cluster
- is not redundant with another DP
- is not a technology or a brand name
- is not so broad as to be meaningless (“backend service” is not a DP)
Output Artifact #
| DP | Invariant cluster enforced |
|---|---|
| Auction acceptance mechanism | Eligibility (open, not expired, amount strictly increasing) |
| Bid idempotency store | Uniqueness of bid acceptance per request_id |
| Auction view / read projection | Propagation: current highest bid visible to readers |
| Score aggregate store | Accounting: all-time score = sum of all events |
| Top-K projection maintenance | Accounting + Propagation: leaderboard reflects aggregate within bound |
| Driver claim mechanism | Eligibility: driver exclusively assigned to one trip |
| Dispatch idempotency store | Uniqueness: same dispatch request produces one outcome |
| Match creation mechanism | Eligibility: Match created only on mutual right-swipe |
| Match idempotency store | Uniqueness: at most one Match per user pair |
Completion Test #
- Every invariant maps to at least one DP.
- No DP exists without a linked invariant.
- No two DPs enforce the same invariant redundantly.
- No DP is a technology name.
Step 6 — Mechanism Selection (The Mechanical Bridge) #
Goal #
For each DP, derive the concrete mechanism using a mechanical, table-driven procedure. This step fills the gap between “bid acceptance mechanism” (abstract DP) and “CAS on (auction_id, version, current_highest, status) in a transactional store” (concrete mechanism).
The procedure has four sub-steps applied in order. Earlier sub-steps constrain later ones.
Sub-step 6.1 — Classify the DP by invariant type #
| Invariant type | Mechanism family required |
|---|---|
| Eligibility | Concurrency control — something must atomically check and conditionally update |
| Ordering | Sequence authority — one actor assigns position within scope |
| Accounting | Aggregation pipeline — events processed exactly once into aggregate |
| Uniqueness / idempotency | Dedup gate — lookup prior result before executing |
| Propagation | Update propagation path — source-of-truth change reaches derived view within ε |
| Access-control | Authorization boundary — permission check before every operation |
Sub-step 6.2 — Apply ownership × evolution to select concurrency mechanism #
This table is the mechanical derivation for concurrency control. Look up the object’s ownership and evolution axes (from Step 3):
| Ownership | Evolution | Required concurrency mechanism | Why |
|---|---|---|---|
| Single writer | Any | None | One writer cannot race with itself |
| Single writer per partition key | Any | None within partition; partition routing is the mechanism | Route all writes for key K to the same owner |
| Multi-writer | Overwrite | CAS on version | Two writers compete to set the new value; one must lose |
| Multi-writer | State machine | CAS on (state, version) | Must reject invalid transitions and concurrent winners |
| Multi-writer | Merge (commutative + associative) | CRDT | Merge is safe without coordination |
| Multi-writer, one at a time, no crash concern | Exclusive hold | Pessimistic Lock | Blocks others for duration of operation |
| Multi-writer, one at a time, holder may crash | Exclusive hold | Lease (TTL + fencing token) | Auto-releases if holder dies; fencing token prevents stale commands |
| Cross-service, decomposable into local steps | Multi-step write | Saga | Each step is local; compensations undo on failure |
| Cross-service, not decomposable | Atomic multi-service write | 2PC | All-or-nothing across services; use only when Saga cannot apply |
Decision rules:
- If evolution = merge AND ownership = multi-writer → always CRDT. Do not use Lock or CAS.
- If ownership = single-writer → never add concurrency mechanism. It adds cost with no benefit.
- If holder may crash while holding exclusive access → Lease, not Pessimistic Lock. Lock held by dead process blocks forever.
- Use 2PC only as last resort. It blocks on coordinator failure.
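The CAS-on-version row can be sketched as an optimistic update. In this single-process toy the compare and write are plain Python; a real store performs them as one atomic operation (e.g. a conditional `UPDATE ... WHERE version = $v`). Names are illustrative:

```python
# Optimistic CAS on a version number, for "multi-writer + overwrite".
record = {"value": "a", "version": 1}

def cas_update(record, expected_version, new_value):
    # In a real store this read-compare-write is one atomic operation.
    if record["version"] != expected_version:
        return False  # a concurrent writer won; caller must re-read and retry
    record["value"] = new_value
    record["version"] += 1
    return True

assert cas_update(record, expected_version=1, new_value="b") is True
assert record == {"value": "b", "version": 2}
# A writer holding the stale version loses:
assert cas_update(record, expected_version=1, new_value="c") is False
assert record["value"] == "b"
```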
Sub-step 6.3 — Apply Q1-Q5 to discriminate within the mechanism family #
After Sub-step 6.2 has selected the mechanism family, Q1-Q5 selects the specific implementation and adds any additional mechanisms required.
Q1 — Coordination scope #
Where does contention actually localize?
| Scope | Implication |
|---|---|
| Within a single data structure (same process, shared memory) | Hardware CAS (compare-and-swap instruction). No network round-trip. |
| Within a service (multiple stateless instances, shared external store) | Distributed CAS: SET key value NX in Redis, conditional write in DynamoDB, UPDATE ... WHERE version = $v in Postgres |
| Cross-service (multiple independent services) | Saga (if decomposable) or 2PC (if not). No in-process lock can span services. |
| Cross-region (geographically distributed, partition must not block) | CRDT or Leaderless Replication. Any mechanism requiring a quorum write may block during partition. |
Applications:
- Auction: scope = within auction service → distributed CAS on shared store.
- Driver dispatch: scope = within dispatch service → distributed CAS on shared Redis.
- Collaborative editor: scope = cross-region, partition must not block → CRDT.
Q2 — Failure model #
What breaks, and what must the mechanism survive?
| Failure | Required additional mechanism |
|---|---|
| Writer crashes after write, before response | Idempotency Key: client retries; server detects duplicate via stored outcome |
| Lock holder crashes while holding lock | Lease (TTL): store auto-releases after timeout. Add fencing token to reject stale commands. |
| Network partition between services | CRDT (no coordination) or Leaderless with quorum. 2PC blocks during partition. |
| Slow degradation of downstream service | Circuit Breaker + Timeout. The downstream call must have a maximum wait and a fail-fast threshold. |
| Retry storm on recovery | Retry + Exponential Backoff + Jitter. Without jitter, all retrying clients hit at the same moment. |
| Partial write (step A succeeds, step B fails) | Outbox + Relay or Saga compensation. Partial state must be detectable and recoverable. |
Applications:
- Auction bid: writer (client) may crash after sending → add Idempotency Key on request_id.
- Driver dispatch: dispatcher may crash while driver is claimed → use Lease (TTL on claim), not plain lock.
- Payment service call: payment gateway may degrade → add Circuit Breaker + Timeout.
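The retry-storm row deserves a concrete shape: exponential backoff with full jitter (sample uniformly up to an exponentially growing ceiling) is one common variant. A sketch with illustrative parameters:

```python
import random

# Retry delay schedule with exponential backoff and full jitter.
def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.Random(42)):
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))   # 0.1, 0.2, 0.4, ... capped
        delays.append(rng.uniform(0, ceiling))      # full jitter spreads clients out
    return delays

delays = backoff_delays(6)
assert len(delays) == 6
assert all(0 <= d <= 10.0 for d in delays)
```

Without the jitter term, every client that failed at the same moment retries at the same moment, recreating the spike that caused the failure.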
Q3 — Data properties #
What does the data structure allow?
| Property | Implication |
|---|---|
| Operations are commutative + associative (order doesn’t matter, duplicates are idempotent) | CRDT — zero coordination cost. Use before any lock-based mechanism. |
| Data is content-addressed (identity = hash of content) | Hash is a natural idempotency key. Dedup is free at the storage layer without a separate idempotency table. |
| Data is monotonically increasing | CAS with monotonicity check is sufficient. No need for full version history. |
| Data requires total order within a scope | A single authority must assign sequence numbers within the scope. |
| Data is spatially structured (lat/lng or region) | Spatial Partition must precede any other read-path optimization. |
| Data is time-windowed | Windowing mechanism must handle late events and window boundaries explicitly. |
Applications:
- Score aggregation: ScoreEvent.delta is commutative (addition is commutative and associative) → stream aggregation with at-least-once delivery + idempotent delta application is sufficient. No lock needed.
- Block storage (Dropbox): block identity = SHA-256(content) → the content hash IS the idempotency key. Duplicate uploads are free to ignore.
- Geo-search (driver matching): data is spatially structured → Spatial Partition (H3/geohash) must be the primary index. Do not range-scan on (lat, lng) columns.
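The commutative-merge row is what CRDTs exploit. A grow-only counter (G-Counter) is the smallest example; this sketch shows the merge laws directly:

```python
# G-Counter CRDT: each writer increments only its own slot, and merge takes
# the per-slot maximum. Merge is commutative, associative, and idempotent,
# so replicas converge with zero coordination.
def increment(state, node, amount=1):
    state = dict(state)
    state[node] = state.get(node, 0) + amount
    return state

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

r1 = increment({}, "node-a")        # replica 1 sees one increment
r2 = increment({}, "node-b", 2)     # replica 2 sees a different increment

assert merge(r1, r2) == merge(r2, r1)             # commutative
assert merge(r1, merge(r1, r2)) == merge(r1, r2)  # idempotent
assert value(merge(r1, r2)) == 3                  # all increments counted once
```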
Q4 — Access pattern #
How is the data read and written?
| Pattern | Mechanism |
|---|---|
| Read » Write, staleness tolerable | Cache-Aside with TTL |
| Read » Write, read shape ≠ write shape | CQRS + Materialized View |
| Read » Write, read shape = write shape, join-heavy | Denormalization |
| Write » Read | Append-only Log, Hash Partition, Leaderless Replication |
| Query fans across multiple shards | Scatter-Gather |
| Query is geospatial (find entities within radius) | Spatial Partition (H3, Geohash, S2, QuadTree) |
| Reads are time-bounded (recent data dominates) | Temporal Decay, TTL, Windowing |
Applications:
- Auction read view: reads » writes on auction detail; read shape ≈ write shape → Cache-Aside with short TTL. Invalidate on acceptance event.
- Driver matching: geospatial query (“drivers within 3km”) → Spatial Partition (H3 index in Redis Geo). Cache-Aside alone does not help: without a spatial index the query is still an O(N) scan.
- Leaderboard: reads » writes; read shape (ranked list) ≠ write shape (score events) → CQRS + Materialized View (top-K projection as the read model).
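The Cache-Aside row can be sketched with an explicit TTL. This toy uses in-memory dicts for both cache and database, and timestamps are passed in to keep it deterministic (all names illustrative):

```python
import time

# Cache-Aside with TTL: read the cache first, fall back to the database on
# a miss or an expired entry, and repopulate the cache with a new expiry.
db = {"auction:1": {"current_highest": 120}}
cache = {}   # key -> (value, expires_at)
TTL = 0.05   # illustrative: 50 ms staleness bound

def read(key, now=None):
    now = time.monotonic() if now is None else now
    hit = cache.get(key)
    if hit and hit[1] > now:
        return hit[0]                   # fresh cache entry
    value = db[key]                     # miss/expired: go to source of truth
    cache[key] = (value, now + TTL)
    return value

assert read("auction:1", now=0.0) == {"current_highest": 120}
db["auction:1"] = {"current_highest": 130}
assert read("auction:1", now=0.01) == {"current_highest": 120}  # stale ≤ TTL
assert read("auction:1", now=0.06) == {"current_highest": 130}  # TTL expired
```

The two final assertions are the propagation invariant made concrete: the view may lag the source of truth, but by at most TTL.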
Q5 — Coupling requirement #
How tightly must the producer and consumer of a change synchronize?
| Coupling | Mechanism |
|---|---|
| Synchronous, atomic (producer needs confirmation before proceeding) | In-transaction write. CAS, 2PC, or Pessimistic Lock within same transaction. |
| Synchronous, best-effort (producer sends, expects delivery but tolerates retry) | Retry + Backoff + Jitter |
| Asynchronous, event produced by application logic | Outbox + Relay: write event to outbox table in same DB transaction; relay publishes to broker |
| Asynchronous, event is the DB row change itself | CDC: tail the DB write-ahead log; no application-level outbox needed |
| Fan-out to N consumers at write time | Fan-out on Write: push to each consumer’s inbox immediately |
| Fan-out to N consumers at read time | Fan-out on Read: each consumer pulls and merges at query time |
| Notification must be reliable even if consumer is offline | Persist notification record + async retry; never fire-and-forget |
Applications:
- Auction acceptance → view update: the view can be slightly stale (ε = 100ms) → async propagation via CDC or event publication → eventually consistent view.
- Bid submission → outbox: the bid result must trigger downstream events (notify watchers) atomically with the bid write → Outbox + Relay (write both in the same Postgres transaction).
- Match notification: the user may be offline → persist the Match record durably; the push notification is best-effort with retry; the user sees the match on app open regardless.
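The Outbox + Relay row can be sketched with in-memory lists standing in for the database tables and the broker. In a real system the two appends in accept_bid happen in one DB transaction, which is the whole point (names illustrative):

```python
# Outbox + Relay: the business write and the event go into the same
# "transaction", so the event cannot be lost if the write commits.
bids = []
outbox = []   # unpublished events, persisted alongside the write
broker = []   # what consumers eventually see

def accept_bid(bid):
    # In a real system both appends commit in one database transaction.
    bids.append(bid)
    outbox.append({"type": "bid_accepted", "bid_id": bid["id"]})

def relay_once():
    # A background relay publishes outbox rows, then marks them sent.
    while outbox:
        broker.append(outbox.pop(0))

accept_bid({"id": "b1", "amount": 100})
assert len(bids) == 1 and len(outbox) == 1 and broker == []
relay_once()
assert broker == [{"type": "bid_accepted", "bid_id": "b1"}]
assert outbox == []
```

Because the relay may retry, consumers must still be idempotent: the broker side of this path is at-least-once, exactly as Sub-step 6.4 warns.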
Sub-step 6.4 — Check for required mechanism combinations #
Some invariants require multiple mechanisms. Missing one creates a correctness gap.
| Invariant | Required mechanism | Additional mechanism always needed |
|---|---|---|
| Eligibility under concurrency | CAS or Lease | + Idempotency Key (client retry creates duplicate without it) |
| Eligibility + crash of holder | Lease | + Fencing token (stale holder must be rejected on every resource access) |
| Accounting via event stream | Stream aggregation | + Idempotency per event (duplicate events corrupt aggregate) |
| Propagation to derived view | CDC or Outbox | + Idempotency in consumer (message broker delivers at-least-once) |
| Ordering within scope | Sequence authority | + Same authority for history and live path (two authorities = split brain) |
| Cross-service write | Saga | + Idempotency on each local step (Saga steps may retry) |
The most commonly missed combination: CAS without Idempotency Key. CAS prevents concurrent duplicates but not retry duplicates. A client whose request timed out will retry. If the first request succeeded (CAS applied), the retry hits a changed version and either: (a) correctly fails (if client checks), or (b) silently succeeds again if the CAS predicate accidentally matches again. Always pair CAS with an Idempotency Key check first.
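The required combination reads as a three-step sequence: dedup lookup, CAS, record outcome. A toy single-threaded sketch (the CAS predicate is plain code here; a real store evaluates it atomically; all names illustrative):

```python
# Idempotency Key check FIRST, then CAS. CAS alone stops concurrent
# duplicates; the idempotency store stops retry duplicates.
auction = {"status": "OPEN", "current_highest": 100, "version": 7}
idempotency_store = {}  # request_id -> prior outcome

def place_bid(request_id, amount, expected_version):
    if request_id in idempotency_store:          # 1. dedup gate: absorb retries
        return idempotency_store[request_id]
    ok = (auction["version"] == expected_version  # 2. CAS predicate (atomic in
          and auction["status"] == "OPEN"         #    a real store)
          and amount > auction["current_highest"])
    if ok:
        auction["current_highest"] = amount
        auction["version"] += 1
    outcome = "accepted" if ok else "rejected"
    idempotency_store[request_id] = outcome      # 3. record outcome for retries
    return outcome

assert place_bid("req-9", 120, expected_version=7) == "accepted"
# A retry of the same request never re-runs the CAS predicate:
assert place_bid("req-9", 120, expected_version=7) == "accepted"
assert auction["current_highest"] == 120 and auction["version"] == 8
```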
Worked Examples of Mechanical Derivation #
Example A: Auction bid acceptance #
From Step 3:
- Object: Auction
- Ownership: Multi-writer, one winner per auction
- Evolution: State machine (current_highest is conditionally updated)
- Ordering: Total order on acceptance sequence within auction_id
From Step 4:
- Invariant 1 [Eligibility]: bid accepted only if OPEN, not expired, amount > current_highest
- Invariant 2 [Ordering]: accepted amounts strictly increase
- Invariant 3 [Uniqueness]: same request_id = same outcome
Step 6 derivation:
Sub-step 6.1: Eligibility + Ordering → concurrency control + sequence authority
Sub-step 6.2: Ownership=multi-writer one-winner + Evolution=state machine → CAS on (state, version)
Sub-step 6.3:
- Q1: within auction service → distributed CAS on shared store
- Q2: client may crash after sending → add Idempotency Key on request_id
- Q3: not commutative (two bids can’t both be highest) → CAS confirmed, not CRDT
- Q4: write path is the hot path; read path is read-heavy → Cache-Aside for read view
- Q5: bid write → view update can be async (ε acceptable) → CDC or event publication
Sub-step 6.4: CAS + Idempotency Key (required combination for retryable eligibility writes)
Result:
- CAS: `UPDATE auction SET current_highest=$amount, winning_bid_id=$bid_id, version=version+1 WHERE auction_id=$id AND version=$old_version AND status='OPEN' AND end_time > now() AND current_highest < $amount`
- Idempotency: check `idempotency_store[request_id]` before executing CAS; return stored result on hit
- Read view: Cache-Aside with TTL; invalidate on CDC event from auction table
Example B: Driver dispatch (ride-hailing) #
From Step 3:
- Object: Driver
- Ownership: Multi-writer, one winner (many dispatchers compete per driver)
- Evolution: State machine (available → dispatched → on_trip → available)
- Ordering: Causal lifecycle order (forbidden: dispatched → offline without completing trip)
From Step 4:
- Invariant [Eligibility]: driver dispatched only if driver.status == AVAILABLE
- Invariant [Uniqueness]: same dispatch request → same outcome (one driver claimed)
- Invariant [Eligibility crash-safety]: if dispatcher crashes, driver is not held indefinitely
Step 6 derivation:
Sub-step 6.1: Eligibility + crash-safety → concurrency control with timeout
Sub-step 6.2: Ownership=multi-writer one-winner + holder may crash → Lease (TTL + fencing token)
Sub-step 6.3:
- Q1: within dispatch service → distributed CAS on shared in-memory store (Redis)
- Q2: dispatcher crashes while driver is claimed → TTL auto-releases claim; fencing token rejects stale dispatcher commands
- Q3: not commutative → Lease confirmed, not CRDT
- Q4: high write rate (many dispatches/sec), O(1) latency required → Redis, not relational DB with row locks
- Q5: dispatch result must be synchronous (dispatcher needs immediate response) → in-request atomic claim
Sub-step 6.4: Lease + Idempotency Key; Lease + Fencing Token (both required for crash-safety)
Result:
- Claim: `SET driver:{id}:status dispatched NX EX {ttl_seconds}` (Redis: set if not exists, with TTL)
- Fencing token: monotonically increasing claim_id stored with status; dispatcher must present claim_id on every subsequent command; stale claim_id → reject
- Idempotency: check idempotency store by dispatch_request_id before attempting claim
This is the mechanical derivation of “CAS on driver status in Redis with TTL” from first principles — without pattern memorization.
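The claim-with-TTL and fencing-token mechanics can be sketched over a plain dict standing in for Redis, with a simulated clock in place of real TTL expiry (all names here are illustrative):

```python
import itertools

# Sketch of a Lease with TTL + fencing token. A dict stands in for
# Redis SET NX EX; `now` is an explicit clock. Names are illustrative.

class DriverClaims:
    def __init__(self):
        self.claims = {}                      # driver_id -> (claim_id, expires_at)
        self.next_token = itertools.count(1)  # monotonic fencing tokens

    def claim(self, driver_id, now, ttl):
        existing = self.claims.get(driver_id)
        if existing and existing[1] > now:    # live claim: reject (NX semantics)
            return None
        token = next(self.next_token)         # fencing token for this holder
        self.claims[driver_id] = (token, now + ttl)
        return token

    def command(self, driver_id, token, now):
        existing = self.claims.get(driver_id)
        if not existing or existing[1] <= now:   # TTL expired: auto-released
            return "rejected: no live claim"
        if token != existing[0]:                 # stale dispatcher
            return "rejected: stale fencing token"
        return "ok"

claims = DriverClaims()
t1 = claims.claim("d1", now=0, ttl=30)
assert t1 == 1
assert claims.claim("d1", now=10, ttl=30) is None     # still held
t2 = claims.claim("d1", now=40, ttl=30)               # TTL expired -> new claim
assert claims.command("d1", t1, now=41) == "rejected: stale fencing token"
assert claims.command("d1", t2, now=41) == "ok"
```

The fencing token is what makes a crashed-then-revived dispatcher harmless: its old claim_id no longer matches the live claim.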
Example C: Leaderboard (YouTube Top K) #
From Step 3:
- Object: ScoreEvent
  - Ownership: Pipeline (system writes after processing)
  - Evolution: Append-only
  - Ordering: Total order by timestamp within video_id
- Object: VideoAggregate (derived → not primary, but a necessary DP)
  - Ownership: Pipeline (single writer per video_id partition)
  - Evolution: Overwrite (increment on each event)
  - Ordering: No meaningful order (only latest value matters)
From Step 4:
- Invariant [Accounting]: all-time score = sum of all ScoreEvent.delta for video_id
- Invariant [Accounting]: window score = sum of ScoreEvent.delta for video_id within window bucket
- Invariant [Uniqueness]: duplicate ScoreEvent is processed at most once
- Invariant [Propagation]: top-K projection reflects aggregates within 60s
Step 6 derivation:
Sub-step 6.1: Accounting + Uniqueness + Propagation → aggregation pipeline + dedup + propagation path
Sub-step 6.2: Ownership=pipeline (single writer per partition) → no concurrency mechanism for the aggregate. Dedup required for at-least-once delivery.
Sub-step 6.3:
- Q1: within aggregation service, partitioned by video_id → single-writer per partition; no cross-service coordination
- Q2: event delivered at-least-once → Idempotency per event (dedup by event_id before incrementing aggregate)
- Q3: delta addition is commutative → stream aggregation is correct; CRDT G-Counter semantics apply
- Q4: read shape (ranked list) ≠ write shape (event stream) → CQRS + Materialized View; Scatter-Gather across scopes for multi-scope queries; Temporal Decay for recency weighting
- Q5: propagation from aggregate to projection can be async (60s bound) → CDC or internal event → projection updater
Sub-step 6.4: Stream aggregation + Idempotency per event + Temporal Decay for recency
Result:
- Ingest: Kafka topic partitioned by video_id
- Aggregate: Flink job, keyed by video_id, dedup by event_id, increment all-time and windowed aggregates
- Projection: Redis sorted set per scope; updated when new score crosses cutoff
- Temporal Decay: score in sorted set = raw_score × e^(−λ × age_hours) recomputed on each update
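The decay formula above is small enough to show directly (λ = 0.1 is an arbitrary illustrative constant, not from the text):

```python
import math

# Sketch of temporal-decay scoring: decayed = raw_score * e^(-lambda * age_hours).
# The lambda value is illustrative.

def decayed_score(raw_score, age_hours, lam=0.1):
    return raw_score * math.exp(-lam * age_hours)

# Fresh content can outrank an older video with a higher raw score.
old = decayed_score(1000, age_hours=48)   # heavily decayed
new = decayed_score(100, age_hours=1)     # barely decayed
assert new > old

# A projection stores the decayed value in the sorted set and
# recomputes it on each update.
```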
Output Artifact #
For each DP, produce:
| DP | Mechanism | Components | Failure handling |
|---|---|---|---|
| Auction acceptance | CAS on (auction_id, version) + Idempotency Key | Postgres conditional UPDATE + idempotency table | Retry: check idempotency first; CAS fail: reread and retry bounded times |
| Driver claim | Lease (Redis SETNX + EX) + Fencing token | Redis key per driver_id + claim_id | TTL auto-releases stale claim; reject stale claim_id |
| Score aggregate | Stream aggregation (Flink) + event dedup | Kafka + Flink keyed state | Exactly-once via Flink checkpointing; event dedup by event_id |
| Top-K projection | CQRS + Materialized View + Temporal Decay | Redis sorted set per scope | Rebuild from aggregate store on projection corruption |
Step 7 — Axiomatic Validation #
Goal #
Validate the design for: (1) single source of truth per concept, (2) no hidden coupling between DPs, (3) no circular dependencies.
Single Source of Truth #
For every concept, ask: if two systems disagree about this value, which one is correct?
If the answer is ambiguous, you have dual truth. Dual truth is the source of “sometimes stale, sometimes wrong” systems.
Source-of-truth table:
| Concept | Authoritative source | Derived projections |
|---|---|---|
| Current highest bid | Auction acceptance state (Postgres) | AuctionView cache (Redis) |
| Driver status | Driver status store (Redis) | Dispatch service in-memory state |
| All-time video score | Score aggregate store (Flink keyed state) | Top-K projection (Redis sorted set) |
| Comment order | Authoritative comment store (sequence assigned at write) | History slice, live subscription |
| Match existence | Match table (Postgres) | Notification (derived event) |
Invalid patterns:
- Redis cache is treated as authoritative because it is fast. Fix: Redis is a projection; Postgres is truth.
- History and live stream assign their own sequence numbers independently. Fix: one sequence authority for both.
- Two services each compute “current surge multiplier” from raw data independently. Fix: one aggregator is authoritative; others read its output.
Independence Check #
Build a DP-to-invariant matrix. Look for:
- Hidden shared decisions: two DPs that both depend on the same underlying choice (e.g., two DPs that both need “total order within video_id” — there should be one sequence authority, not two)
- Circular dependency: DP A depends on DP B’s output which depends on DP A’s output
- Accidental coupling: a cache that can diverge permanently from its source with no rebuild path
Acceptable dependency: Leaderboard projection depends on score aggregate. This is a real, one-directional dependency. Unacceptable dependency: Leaderboard projection can permanently diverge from score aggregate with no reconciliation. Fix: add rebuild path (replay aggregate store → rebuild projection).
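The rebuild path can be sketched in a few lines (data shapes are illustrative): the diverged projection is discarded entirely and recomputed from its authoritative upstream.

```python
# Sketch of a projection rebuild path: the corrupt projection is thrown
# away and recomputed from the source-of-truth aggregate store.
# Data shapes are illustrative.

aggregate_store = {"v1": 500, "v2": 900, "v3": 120}  # source of truth

def rebuild_top_k(aggregates, k):
    # Discard the corrupt projection; recompute from upstream.
    ranked = sorted(aggregates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

projection = [("v2", 999999), ("v9", 1)]  # diverged / corrupt
projection = rebuild_top_k(aggregate_store, k=2)
assert projection == [("v2", 900), ("v1", 500)]
```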
Output Artifact #
- Source-of-truth table (above)
- Dependency table: DP → depends on
- List of projections with: upstream source, maximum lag, rebuild path
Completion Test #
- Every concept has exactly one authoritative source.
- Every projection has a documented upstream source and a rebuild path.
- No correctness-critical concept has competing truths.
- No circular dependency exists.
Step 8 — Algorithm and State-Transition Design #
Goal #
Make every write path and lifecycle transition directly implementable. “Use CAS” is not an algorithm. “Read current version, validate predicates, attempt UPDATE WHERE version = $v, retry bounded times on failure” is an algorithm.
What to Specify #
For every write path:
- Input: what arrives
- State read: what must be read before deciding
- Validation: what conditions must hold (maps to eligibility invariants)
- Atomic update: what changes atomically and under what predicate
- Output: what is returned
- Failure behavior: what happens on retry, duplicate, race, timeout, or rejection
For every process object: define all states, all allowed transitions, all forbidden transitions, terminal states.
Algorithm Templates #
Template 1: Eligibility write with CAS #
function accept(request_id, object_id, payload, retries_remaining = MAX_RETRIES):
// Step 1: Idempotency check
if prior_result = idempotency_store.get(request_id):
return prior_result
// Step 2: Read current state
state = store.read(object_id)
// Step 3: Validate eligibility
if not eligible(state, payload):
result = rejection_reason(state, payload)
idempotency_store.set(request_id, result, ttl=24h)
return result
// Step 4: Attempt CAS
new_state = apply(state, payload)
success = store.conditional_update(
object_id,
expected_version = state.version,
new_state = new_state
)
// Step 5: Handle CAS result
if success:
result = accepted(new_state)
idempotency_store.set(request_id, result, ttl=24h)
emit_event(object_id, new_state) // for projections
return result
else:
// Another writer won the race
if retries_remaining > 0:
return accept(request_id, object_id, payload, retries_remaining - 1) // retry with fresh read
else:
return conflict_or_rejection()
Template 2: State machine transition #
allowed_transitions = {
CREATED: [OPEN],
OPEN: [ENDED],
ENDED: [], // terminal
}
function transition(object_id, new_state, request_id):
if prior = idempotency_store.get(request_id): return prior
current = store.read(object_id)
if new_state not in allowed_transitions[current.state]:
return invalid_transition_error(current.state, new_state)
success = store.conditional_update(
object_id,
expected_version = current.version,
new_state = {state: new_state, version: current.version + 1}
)
if not success: reread and retry (bounded attempts), else return conflict error
Template 3: Append-only event with dedup #
function ingest_event(event_id, payload):
if dedup_store.exists(event_id): return already_processed
// Must be atomic: append event and mark dedup in one step, or a crash between them reprocesses the event
store.append(event_id, payload)
dedup_store.set(event_id, processed, ttl=retention_period)
// Update downstream aggregate
aggregate = aggregate_store.increment(payload.key, payload.delta)
if aggregate crosses top_k_cutoff:
projection_store.update(payload.scope, payload.key, aggregate)
Template 4: Saga (multi-service write) #
function saga(saga_id, steps):
if prior = idempotency_store.get(saga_id): return prior
completed_steps = []
for step in steps:
result = execute_with_idempotency(saga_id + ":" + step.id, step)
if result.failed:
// Compensate all completed steps in reverse
for completed in reversed(completed_steps):
compensate_with_idempotency(saga_id + ":compensate:" + completed.id, completed)
outcome = saga_failed(result)
idempotency_store.set(saga_id, outcome)
return outcome
completed_steps.append(step)
outcome = saga_succeeded()
idempotency_store.set(saga_id, outcome)
return outcome
State Machine Specification #
For every process object, produce:
States: {CREATED, OPEN, ENDED}
Allowed transitions:
CREATED → OPEN (when: manually opened or scheduled open time reached)
OPEN → ENDED (when: end_time reached OR owner closes)
Forbidden transitions:
ENDED → any (ENDED is terminal)
OPEN → CREATED (no reversal)
Terminal states: {ENDED}
Clock authority: server_time from the auction service. Client clock is never trusted.
Duplicate Handling Rules #
Every write path must explicitly state its idempotency semantics:
| Write path | Idempotency key | Duplicate behavior |
|---|---|---|
| Bid submission | request_id (client-generated UUID) | Return same acceptance/rejection result |
| Driver dispatch | dispatch_request_id | Return same claimed/failed result |
| Score event ingestion | event_id (producer-assigned) | Skip if already processed |
| Match creation | canonical_key = (min(a,b), max(a,b)) | INSERT IF NOT EXISTS; return existing match |
| Payment capture | trip_id + attempt_number | Return stored capture result |
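The canonical-key row for match creation can be sketched as follows (a dict stands in for the match table; names are illustrative): the pair (a, b) and (b, a) map to the same key, so INSERT IF NOT EXISTS dedups both orders.

```python
# Sketch of canonical-key idempotency for match creation.
# A dict stands in for the match table; names are illustrative.

matches = {}  # canonical_key -> match record

def canonical_key(user_a, user_b):
    # Same key regardless of which user's swipe completed the match.
    return (min(user_a, user_b), max(user_a, user_b))

def create_match(user_a, user_b):
    key = canonical_key(user_a, user_b)
    if key in matches:                 # INSERT IF NOT EXISTS
        return matches[key]            # return existing match
    matches[key] = {"pair": key, "id": len(matches) + 1}
    return matches[key]

m1 = create_match("alice", "bob")
m2 = create_match("bob", "alice")     # reversed order, same canonical key
assert m1 is m2
assert len(matches) == 1
```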
Output Artifact #
For every write path: pseudocode including idempotency, CAS, retry, and rejection logic. For every process object: complete state machine diagram.
Step 9 — Logical Data Model #
Goal #
Convert primary objects, DPs, and access patterns into a concrete logical schema. This step is still logical (technology-independent). Physical decisions come in Step 10.
Schema Derivation Rules #
- Rule 1: Every primary object becomes at least one table/collection.
- Rule 2: Every derived view that is read frequently becomes an explicit read model table/collection.
- Rule 3: Partition key candidates are derived from the scope of the correctness invariant (not from convenience).
- Rule 4: Every idempotency requirement produces a dedup key in the schema.
- Rule 5: Source tables and projection tables are never the same table. Mark each clearly.
Schema Components #
For each table:
- Name and class (source-of-truth vs projection)
- Fields
- Primary key
- Important secondary / lookup keys
- Partition key candidate (derived from invariant scope)
- Uniqueness constraints (for dedup keys)
- Ordering keys (for range reads)
Example Schemas #
Auction (source-of-truth) #
auction_id UUID PK
seller_id UUID FK
starting_price Decimal
end_time Timestamp
status Enum(CREATED, OPEN, ENDED)
current_highest Decimal
winning_bid_id UUID nullable
version Integer optimistic lock
created_at Timestamp
Partition key: auction_id
Uniqueness: (auction_id) — one row per auction
Bid (source-of-truth, append-only) #
bid_id UUID PK
auction_id UUID FK, partition key candidate
bidder_id UUID
amount Decimal
created_at Timestamp
request_id UUID UNIQUE — dedup key
status Enum(PENDING, ACCEPTED, REJECTED)
Partition key: auction_id (scan all bids for an auction)
Secondary key: (bidder_id, created_at) — scan bids by bidder
Uniqueness: (request_id) — idempotency enforcement
IdempotencyRecord #
request_id UUID PK
result_json JSON
created_at Timestamp
expires_at Timestamp — TTL
Partition key: request_id
DriverStatus (source-of-truth, in Redis) #
Key: driver:{driver_id}:status
Value: {status, claim_id, trip_id, claimed_at}
TTL: set on claim; cleared on release
ScoreAggregate (source-of-truth) #
video_id UUID PK / partition key
all_time_score Long
updated_at Timestamp
version Long — for optimistic updates
TopKProjection (read model / derived view) #
scope String PK (composite: granularity:bucket_id)
rank Integer sort key
video_id UUID
score Double — with temporal decay applied
updated_at Timestamp
Note: this table is entirely rebuildable from ScoreAggregate + ScoreEvents
Read-Path Alignment #
Every frequent read must have a direct key or index path. For each important query, verify:
| Query | Required key |
|---|---|
| Get auction by id | auction_id PK |
| Get all bids for auction | auction_id partition key on Bid |
| Get top K videos for scope | scope PK on TopKProjection |
| Did user A swipe on user B? | (swiper_id, swiped_id) composite key on Swipe |
| Did user B swipe on user A? | (swiper_id=B, swiped_id=A) — same table, different lookup |
| Drivers within 3km of location | Spatial index on (lat, lng) — not a table key |
Output Artifact #
Schema definitions for all primary objects and derived view tables, with keys and constraints.
Step 10 — Technology Landscape #
Goal #
Map each DP’s required capabilities to a technology shape, then to specific products. Technology is chosen by capability fit, not by fashion or familiarity.
Technology Shape Categories #
| Shape | Best for | Examples |
|---|---|---|
| OLTP store | Transactional conditional writes, state machines, point reads/writes, foreign keys | PostgreSQL, MySQL, CockroachDB, DynamoDB (simple KV conditional) |
| Wide-column / time-series store | High-rate append, per-key range reads (time range), telemetry, events | Cassandra, Scylla, Bigtable, TimescaleDB |
| In-memory store | Sub-millisecond reads, ephemeral hot state, Lease implementation, spatial index, bounded sorted sets | Redis, Memcached |
| Append log / event stream | Ordered event ingestion, replay, async fan-out, buffering between producers and consumers | Kafka, Pulsar, Kinesis |
| Stream processor | Stateful event-by-event computation, windowing, watermarks, exactly-once aggregation | Flink, Spark Streaming, Kafka Streams |
| Object store | Cheap immutable large-object storage, archival, cold data | S3, GCS, Azure Blob |
| Search / analytical store | Full-text search, ad-hoc analytics, flexible secondary queries | Elasticsearch, ClickHouse, BigQuery |
Capability-to-Shape Mapping #
| Required capability | Technology shape |
|---|---|
| Conditional update (CAS, optimistic lock) | OLTP store |
| Lease (exclusive time-bound claim) | In-memory store (Redis SETNX + EX) |
| High-throughput append with per-key ordering | Wide-column store or append log |
| Spatial query (entities within radius) | In-memory store with geo index (Redis GEOADD/GEORADIUS) OR dedicated spatial DB |
| Sliding/tumbling window aggregation | Stream processor |
| Bounded sorted set (top-K) | In-memory store (Redis sorted set) |
| Async fan-out with replay | Append log |
| Idempotency store | OLTP store (write once, read on retry) |
| Full-text search | Search store |
| Cold archival | Object store |
Technology Selection Rules #
Match mutation pattern first. If the invariant requires CAS, the store must support atomic conditional writes. Redis sorted sets do not support CAS on set membership — use Postgres or DynamoDB for that.
Never let a projection store become truth. Redis is fast but not the source of truth for anything correctness-critical. If Redis is unavailable, rebuild from the OLTP store. This must be possible.
Partition key alignment. The technology must support efficient access on the partition key you chose in Step 9. If your partition key is (scope, bucket_id) and your queries are by this key, Cassandra with that as the partition key works. A single Postgres table without sharding does not work at high volume.
Check the consistency model of the technology. DynamoDB reads are eventually consistent by default — if your invariant requires strong consistency, use strongly consistent reads (at higher latency and cost). Cassandra QUORUM reads (paired with QUORUM writes) are consistent; ONE reads are not.
Output Artifact #
| DP | Required capability | Technology shape | Specific product | Consistency mode |
|---|---|---|---|---|
| Auction acceptance | CAS + transactions | OLTP | PostgreSQL | Serializable or Read Committed + optimistic lock |
| Bid idempotency | Write-once, read-on-retry | OLTP | PostgreSQL (same DB) | Read Committed |
| Driver status / claim | Lease (SETNX + TTL) | In-memory | Redis | Single-instance or Redlock |
| Driver spatial index | Geo query within radius | In-memory + geo | Redis Geo | Eventual (TTL-based refresh) |
| Score event ingestion | High-throughput append | Append log | Kafka (partitioned by video_id) | At-least-once |
| Score aggregation | Windowed stateful compute | Stream processor | Flink | Exactly-once (checkpointing) |
| Top-K projection | Bounded sorted set | In-memory | Redis sorted set | Eventual (rebuilt from Flink output) |
| Swipe store | High-write append, per-key scan | Wide-column | Cassandra (partition: swiper_id) | Quorum writes for right-swipes |
| Match store | CAS on (user_pair canonical key) | OLTP | PostgreSQL | Serializable |
Step 11 — Deployment Topology #
Goal #
Define service boundaries, partition ownership, failure domains, and data flow. Architecture becomes deployable here.
Service Boundary Rules #
A service boundary is justified when two responsibilities have:
- Different scaling characteristics (write-heavy vs read-heavy)
- Different failure isolation requirements (payment must not fail because recommendation is slow)
- Different consistency requirements (real-time bid acceptance vs eventual leaderboard)
- Different data ownership (auction service owns auction state; notification service owns delivery state)
Do not create service boundaries to satisfy a microservices philosophy. Create them when one of the above is true.
Partition Key Rules #
Partition key must align with the scope of the correctness invariant.
If the invariant is “accepted bids strictly increase within one auction,” the partition key for the acceptance mechanism is auction_id. Partitioning by seller_id puts auctions from the same seller together — this is wrong because two separate auctions from the same seller have no shared invariant.
| Invariant scope | Partition key |
|---|---|
| Per auction | auction_id |
| Per video | video_id |
| Per activity | activity_id |
| Per driver | driver_id |
| Per user pair (match) | canonical(min(a,b), max(a,b)) |
| Per geographic cell | H3 cell ID or geohash prefix |
| Per time window + scope | (granularity, bucket_id, scope_id) |
Failure Domain Classification #
For each component:
| Component | Correctness criticality | Failure impact | Recovery path |
|---|---|---|---|
| Auction acceptance (Postgres) | CRITICAL | No bids accepted | Fail closed; primary failover |
| Bid idempotency store | CRITICAL | Duplicate bids possible | Fail closed; replica promotion |
| Driver status (Redis) | CRITICAL | Dispatches fail | Fail closed; Redis Sentinel / Cluster |
| Score event ingest (Kafka) | HIGH | Events lost if unrecoverable | Retry from producer; DLQ |
| Score aggregate (Flink) | HIGH | Leaderboard stale | Restore from Flink checkpoint |
| Top-K projection (Redis) | MEDIUM | Stale leaderboard | Rebuild from aggregate store |
| Notification service | LOW | Match notification delayed | Retry; user sees match on app open |
Correctness-critical components fail closed. Projection-only components may degrade gracefully (serve stale data).
Output Artifact #
- Service boundary diagram with data flows
- Partition topology table
- Failure domain table (above)
Step 12 — Consistency Model #
Goal #
State explicitly which paths require strong consistency and which may be eventually consistent. Systems almost never need the same consistency everywhere. Applying strong consistency uniformly destroys scalability; applying eventual consistency uniformly introduces correctness bugs.
Classification Procedure #
For each write path and read path, ask:
Strong consistency required if: violating the invariant changes business truth — money is charged twice, a driver is double-dispatched, a bid is accepted that should have been rejected.
Eventual consistency acceptable if: the only consequence of a stale read is temporary UX lag — leaderboard is 60s stale, cache is 5s stale, notification arrives 2s after the event.
Output Artifact #
| Path | Consistency | Staleness bound | Why |
|---|---|---|---|
| Bid acceptance write | Strong | 0 | Eligibility invariant must hold at decision time |
| Auction view read | Eventual | 100ms | Stale view is visible briefly; correctness is in the write path |
| Driver status write (claim) | Strong | 0 | Double-dispatch is a correctness violation |
| Driver location read (for matching) | Eventual | 30s | Location ping is approximate; stale ping = slightly worse match |
| Score event ingest | At-least-once | N/A | Duplicates handled by dedup; missing events are not acceptable |
| Top-K leaderboard read | Eventual | 60s | Product accepts 60s staleness |
| Match creation write | Strong | 0 | Must not create duplicate match |
| Match notification delivery | Best-effort | N/A | Push is unreliable; durable match record is the truth |
Step 13 — Scaling Model #
Goal #
Identify how load grows and where bottlenecks appear. You are not yet tuning numbers; you are naming the dimensions.
Scale Type Taxonomy #
| Type | Definition | Bottleneck location | Strategy |
|---|---|---|---|
| Write-heavy / append | Many writes, few reads, low contention per key | Ingest throughput | Partition by key, parallel ingest, wide-column store |
| Contention-heavy | Many competing writes on the same key | Single-key throughput | Single-writer partition, or CAS with retry budget |
| Read-heavy | Many reads, low write rate | Read throughput + latency | Cache, CQRS, CDN |
| Fan-out heavy | One write causes many downstream reads or notifications | Fan-out path (notify N subscribers) | Fan-out on Write (small N), Fan-out on Read (large N), hybrid |
| Window / aggregation heavy | Events must be bucketed and aggregated in time windows | Aggregation throughput + state size | Stream processor, pre-aggregation, approximate counting |
Hotspot Identification #
A hotspot is a key or partition that receives disproportionately more traffic than others.
| Hotspot type | Example | Mitigation |
|---|---|---|
| Single-key write hotspot | One live auction with 10K bids/sec | Single-writer partition for that auction_id; queue bids in front |
| Single-key read hotspot | One video goes viral | CDN caching; in-memory cache with short TTL; cache stampede protection (single-flight) |
| Fan-out hotspot | Celebrity with 100M followers posts | Fan-out on Read for celebrity accounts; hybrid fan-out |
| Partition hotspot | All keys hash to same shard | Add random suffix to key for writes; aggregate at read time |
| Time-boundary hotspot | All window aggregates flush at the same second | Stagger flush times; jitter |
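The random-suffix mitigation from the table can be sketched as follows (shard count and names are illustrative): writes to one hot key spread across N sub-keys, and reads aggregate the sub-keys.

```python
import random

# Sketch of "random suffix" sharding for a partition hotspot.
# Shard count and key names are illustrative.

N_SHARDS = 8
counters = {}  # sharded key -> count (stands in for the hot partition)

def incr_hot_key(key):
    # Spread writes for one hot key across N sub-keys.
    shard = f"{key}#{random.randrange(N_SHARDS)}"
    counters[shard] = counters.get(shard, 0) + 1

def read_hot_key(key):
    # Aggregate at read time across all sub-keys.
    return sum(counters.get(f"{key}#{i}", 0) for i in range(N_SHARDS))

for _ in range(1000):
    incr_hot_key("viral_video")
assert read_hot_key("viral_video") == 1000
```

The trade: write throughput scales by N, while reads pay an N-way aggregation, which is why the table pairs "random suffix for writes" with "aggregate at read time."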
Output Artifact #
| Dimension | Driver | Hotspot key | Likely bottleneck | Strategy |
|---|---|---|---|---|
| Auction bids | bids per auction | Popular auction_id | CAS retry rate on hot auction | Single-writer queue |
| Score events | events per video | Viral video_id | Kafka partition throughput | Partition by video_id; increase partition count |
| Driver matches | dispatches per cell | Dense geo cell | Spatial index read hotspot | Cell-level cache; increase Redis cluster size |
| Match notifications | matches per second | Global (peak time) | Push notification throughput | Async queue with backpressure |
Step 14 — Failure Model #
Goal #
Enumerate what can go wrong and what correct behavior is in each case. A design is incomplete until every failure mode has an explicit response.
Failure Taxonomy #
Client retry / timeout #
Client does not know whether the write succeeded. May retry.
- Required: Idempotency key on every non-idempotent write.
- Every write path must be safe to retry N times.
Duplicate delivery #
Same event or message arrives twice from broker (at-least-once delivery).
- Required: Dedup by event_id before applying to aggregate or state.
- Never assume exactly-once from a message broker without explicit configuration.
Partial write #
Write to primary state succeeds; downstream event publication or projection update fails.
- Required: Outbox pattern (event written in same transaction as state change) OR CDC (event derived from WAL).
- The downstream failure must not leave the system in a state where the event is permanently lost.
Out-of-order arrival #
Events arrive in a different order than they were produced.
- Required: Explicit policy for each ordered object. Options:
- Reject and requeue (wait for in-order delivery)
- Accept and reconcile (watermark-based window)
- Idempotent application (if operation is order-independent)
Temporal boundary #
Event arrives near a window close or a deadline (e.g., bid arrives at exactly end_time).
- Required: Explicit clock authority. Client clock is never authoritative. Server time is used for all temporal decisions. Define behavior for events arriving within [end_time - ε, end_time + ε].
Infrastructure failure #
Node crashes, DB primary fails, cache is cold, queue consumer lags.
- Required: For each component, explicit answers to:
- What is the impact on correctness? (writes fail closed? reads serve stale?)
- What is the recovery path? (failover, rebuild from source, restart consumer)
Output Artifact #
| Failure mode | Can it happen? | Correct behavior | Mechanism enforcing it | Recovery if mechanism fails |
|---|---|---|---|---|
| Client retries bid | Yes | Same result as first attempt | Idempotency Key | Manual dedup audit |
| Duplicate score event | Yes (Kafka at-least-once) | Ignored; aggregate unchanged | Dedup by event_id | Recompute aggregate from event log |
| Bid write succeeds; event not published | Yes (crash after write) | Event eventually published | Outbox + Relay | Outbox relay catches up on restart |
| Auction end_time race | Yes (bid arrives at exact end_time) | Server time is authoritative; server rejects if server_time ≥ end_time | Server-side time check in CAS predicate | Audit log of rejection vs acceptance at boundary |
| Driver dispatch crashed holding claim | Yes | Claim auto-expires via TTL | Lease TTL | Redis TTL fires; driver returns to available |
| Score aggregate consumer lags | Yes | Leaderboard is stale by lag amount | SLO on freshness; alert on lag > threshold | Restart consumer; replay from last checkpoint |
Step 15 — SLO Definition #
Goal #
Convert broad product expectations into measurable promises. SLOs are the contract between architecture and operations.
SLO Types #
| Type | Definition | Example |
|---|---|---|
| Latency | Percentile latency of a request or event path | Bid acceptance p95 < 200ms |
| Availability | Fraction of time the service is usable | Bid service 99.9% available |
| Correctness | Rate at which invariants are violated | Zero accepted bids after auction close |
| Freshness | Maximum staleness of a projection | Leaderboard lag < 60s at p99 |
| Throughput | Sustained operation rate | 10K score events/sec ingested |
SLO Derivation #
SLOs are derived from:
- Product requirements (user-visible expectations)
- Invariant strength (correctness invariants have SLO = zero violations)
- Downstream SLO budgets (each call in a path consumes part of the latency budget)
Latency budget principle: if the end-to-end SLO is 500ms and the path has 3 hops, each hop must have a timeout budget. Sum of timeouts must be < 500ms. Every timeout in the system is derived from this budget, not set arbitrarily.
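A sketch of deriving per-hop timeouts from the end-to-end budget (hop names, weights, and the safety margin are illustrative assumptions, not from the text):

```python
# Sketch of the latency-budget principle: per-hop timeouts are derived
# from the end-to-end SLO, not set arbitrarily. Weights and margin are
# illustrative.

def derive_timeouts(end_to_end_ms, hop_weights, safety_margin=0.9):
    # Reserve a margin, then split the remainder proportionally per hop.
    budget = end_to_end_ms * safety_margin
    total = sum(hop_weights.values())
    return {hop: budget * w / total for hop, w in hop_weights.items()}

timeouts = derive_timeouts(500, {"gateway": 1, "service": 2, "db": 2})
assert sum(timeouts.values()) < 500           # sum of timeouts under SLO
assert timeouts["db"] == timeouts["service"]  # split by weight
```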
Output Artifact #
| Path | Metric | Target | Derived from |
|---|---|---|---|
| Bid acceptance | p95 latency | 200ms | Product UX requirement |
| Bid acceptance | Correctness violations | 0 per day | Invariant is absolute |
| Live comment delivery | p99 delivery lag | 500ms | Product UX requirement |
| Leaderboard freshness | p99 lag | 60s | Product requirement |
| Driver match latency | p95 end-to-end | 2s | Rider UX expectation |
| Payment capture | Duplicate charge rate | 0 | Legal + invariant |
| Score event ingest | Throughput | 50K events/sec | Capacity model |
Step 16 — Operational Parameters #
Goal #
Convert every architectural lever into a named, tunable parameter with its valid range and effect.
Every decision you made in Steps 6–11 that involved a number, a threshold, or a timeout must appear here.
Parameter Categories #
Concurrency parameters #
- CAS retry limit: max attempts before returning conflict error
- Lease TTL: how long a claim is held before auto-release
- Lock timeout: how long a pessimistic lock is held
Cache parameters #
- Cache TTL: staleness bound; sets the freshness SLO
- Cache size: eviction begins when exceeded
- Cache stampede protection: request coalescing window
Queue parameters #
- Retry budget: max retries before DLQ
- Backoff base + max: controls retry storm risk
- Jitter factor: fraction of backoff that is randomized
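These three queue parameters combine into one backoff schedule; a sketch (capped exponential with a randomized jitter slice; the function name and defaults are illustrative):

```python
import random

# Sketch of retry backoff built from the parameters above: retry budget,
# backoff base + max (cap), and a jitter factor that randomizes a fraction
# of each delay to avoid retry storms. Names and defaults are illustrative.

def backoff_delays(retry_budget, base, cap, jitter_factor, rng=random.random):
    delays = []
    for attempt in range(retry_budget):
        raw = min(cap, base * (2 ** attempt))   # exponential, capped at max
        jitter = raw * jitter_factor * rng()    # randomized slice of the delay
        delays.append(raw - jitter)
        # after the final attempt, the message goes to the DLQ
    return delays

delays = backoff_delays(retry_budget=5, base=0.1, cap=2.0, jitter_factor=0.5)
assert len(delays) == 5
assert all(0 <= d <= 2.0 for d in delays)       # never exceeds the cap
```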
Window parameters #
- Window size: tumbling or sliding interval
- Allowed lateness: how long after window close late events are accepted
- Watermark lag: how far behind real time the watermark is held
Idempotency parameters #
- Idempotency key TTL: how long outcomes are retained
- Key scope: per-user, per-session, or global
Output Artifact #
| Parameter | Component | Default | Min | Max | Effect of increasing | Effect of decreasing |
|---|---|---|---|---|---|---|
| CAS retry limit | Auction acceptance | 3 | 1 | 10 | More retries = higher acceptance rate under contention, higher tail latency | Fewer retries = faster failure, lower acceptance rate |
| Lease TTL | Driver claim | 30s | 5s | 120s | Longer = stale claims held longer after a crashed dispatcher before auto-release | Shorter = legitimately slow operations lose the claim prematurely |
| Cache TTL | AuctionView | 100ms | 0 | 5s | More staleness, less DB load | Less staleness, more DB load |
| Retry budget | Score event ingest | 5 | 1 | 20 | More attempts before DLQ | Fewer attempts = more events routed to the DLQ on transient failures |
| Backoff base | All retry paths | 100ms | 10ms | 1s | Slower retry, less storm | Faster retry, higher storm risk |
| Window size | Hourly leaderboard | 1h | 5m | 24h | Coarser time granularity | Finer granularity, more state |
| Watermark lag | Score events | 10s | 0 | 60s | More late events captured | More latency in window close |
Step 17 — Runbooks #
Goal #
For every known failure mode (from Step 14) and every hotspot (from Step 13), define the exact procedure an operator takes.
Runbook Template #
Incident: [Name]
Trigger: [What signal indicates this is happening — metric, alert, symptom]
Diagnosis: [How to confirm]
Immediate mitigation: [What to do right now to reduce impact]
Root cause investigation: [How to find why it happened]
Resolution: [How to fully recover]
Prevention: [What to change to prevent recurrence]
Example Runbooks #
Runbook: Auction CAS contention storm #
Trigger: CAS retry rate > 80% on bid acceptance service; p99 bid latency > 2s
Diagnosis: Check bid volume per auction_id; identify the hot auction
Immediate mitigation: Enable a bid queue in front of the hot auction_id; switch to single-writer for that auction
Root cause investigation: Unexpected viral auction; more bidders than designed for
Resolution: Queue drains; single-writer processes bids sequentially; switch back to CAS when load drops
Prevention: Detect auction heat (bid rate > threshold) earlier; pre-emptively queue before contention saturates
Runbook: Leaderboard freshness SLO breach #
Trigger: Leaderboard lag > 60s; alert fires
Diagnosis: Check Flink consumer lag on the Kafka score event topic; check Flink job health; check Redis write latency
Immediate mitigation: None needed if correctness is not affected (the leaderboard is a projection)
Root cause investigation: Flink checkpoint failure? Consumer restart? Kafka partition rebalance?
Resolution: Restart the Flink job from the last checkpoint; it will catch up
Prevention: Alert on consumer lag > 30s to catch the problem before the SLO breach; increase checkpoint frequency
Runbook: Driver not releasing after dispatch crash #
Trigger: Rider waiting > TTL seconds; driver marked dispatched but unresponsive
Diagnosis: Check driver TTL expiry in Redis; check dispatch service health
Immediate mitigation: The Redis TTL fires automatically; the driver returns to the available pool
Root cause investigation: Was the TTL set correctly at claim time? Did the dispatcher crash before setting the TTL?
Resolution: Verify the TTL is always set atomically with the claim (SETNX + EX in one command)
Prevention: Audit: never allow a claim without a TTL in the same Redis command
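The invariant this runbook audits for — claim and TTL set in one atomic step — can be sketched with an in-memory lease store standing in for Redis `SET ... NX EX`; the class and key names are illustrative:

```python
import time

class LeaseStore:
    """In-memory sketch of lease semantics. Because claim() writes the
    holder and the expiry together, a crash can never leave a claim
    without a TTL — the failure mode the runbook guards against."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._leases = {}  # resource -> (holder, expiry)

    def claim(self, resource, holder, ttl):
        """Analogous to Redis `SET resource holder NX EX ttl`."""
        lease = self._leases.get(resource)
        if lease and lease[1] > self.clock():
            return False                       # still held, claim rejected
        self._leases[resource] = (holder, self.clock() + ttl)
        return True
```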
Step 18 — Observability #
Goal #
Define what must be instrumented for every SLO to be measurable, every failure mode to be detectable, and every parameter to be tunable at runtime.
Observability Types #
| Type | Purpose | Examples |
|---|---|---|
| Metrics | Quantitative, time-series; powers SLO dashboards | CAS success rate, retry rate, latency percentiles, consumer lag |
| Traces | Request-scoped; shows where latency is spent | Distributed trace from bid submission through CAS through event publication |
| Logs | Discrete events; powers incident investigation | CAS rejections with reason, idempotency hits, transition failures |
| Alerts | Fires when metric crosses SLO threshold | Bid latency p95 > 200ms, consumer lag > 30s, CAS retry rate > 50% |
Required Instrumentation #
| Component | Required metric | Alert condition |
|---|---|---|
| Auction acceptance | CAS success rate, CAS retry count, p95/p99 latency | Retry rate > 50%, latency p95 > 200ms |
| Idempotency store | Hit rate, miss rate, TTL expiry rate | Hit rate < expected (suggests missing keys) |
| Driver claim | Claim success rate, TTL expiry rate, fencing token rejections | Fencing rejections > 0 (indicates stale dispatcher) |
| Score event ingest | Events/sec, lag, DLQ depth, dedup hit rate | Lag > 30s, DLQ depth > 0 |
| Leaderboard projection | Freshness lag, Redis write latency | Freshness > 60s |
| Match detection | Match creation rate, duplicate match attempts | Duplicate attempts > 0 |
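A minimal sketch of how such a table becomes executable alert evaluation — the metric names and thresholds below mirror a few rows above and are illustrative, not a complete rule set:

```python
# Hypothetical alert rules drawn from the instrumentation table.
ALERT_RULES = [
    ("cas_retry_rate", 0.50),     # Auction acceptance
    ("bid_latency_p95_ms", 200),  # Auction acceptance
    ("consumer_lag_s", 30),       # Score event ingest
    ("dlq_depth", 0),             # Score event ingest
]

def firing_alerts(metrics, rules=ALERT_RULES):
    """Return the rule names whose threshold the current snapshot exceeds.
    Missing metrics are skipped (absence is its own alert in practice)."""
    return [name for name, threshold in rules
            if metrics.get(name) is not None and metrics[name] > threshold]
```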
Output Artifact #
Instrumentation plan: for each component, list of metrics, log events, trace spans, and alert thresholds.
Step 19 — Cost Model #
Goal #
Quantify what grows with users, events, and hotspots so that cost surprises can be predicted and capacity planned.
Growth Dimensions #
For each component, identify:
- What user behavior drives its growth?
- What is the growth rate (linear, quadratic, logarithmic)?
- What is the dominant cost (storage, compute, network, IOPS)?
Example #
| Component | Growth driver | Growth rate | Dominant cost |
|---|---|---|---|
| Bid store | Bids per auction × active auctions | Linear in auctions | Storage + IOPS |
| Score event store (Kafka) | Score events per video × active videos | Linear in content | Kafka storage (retention × throughput) |
| Score aggregate (Flink state) | Active videos | Linear | Memory (Flink state size) |
| Top-K projection (Redis) | Scopes × K | Constant per scope (bounded by K) | Redis memory |
| Driver status (Redis) | Active drivers | Linear | Redis keys |
| Swipe store (Cassandra) | Swipes per user × active users | Quadratic in users (O(U²) worst case if everyone swipes everyone) | Cassandra storage |
| Idempotency store | Write rate × TTL | Constant at steady state | OLTP storage |
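The Kafka row can be turned into a back-of-envelope formula — a sketch with an assumed replication factor and event size, not figures from the source:

```python
def kafka_retention_bytes(events_per_sec, avg_event_bytes, retention_hours,
                          replication_factor=3):
    """Kafka storage = throughput x retention window x replication:
    every produced byte is kept for the full retention period on each replica."""
    retention_secs = retention_hours * 3600
    return events_per_sec * avg_event_bytes * retention_secs * replication_factor
```

At the 50K events/sec from the SLO table, 200-byte events, and 24h retention, this already lands in the terabytes — which is why retention, not throughput, usually dominates Kafka cost.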
Step 20 — Evolution Path #
Goal #
Define the smallest correct initial design and the signals that trigger each architectural upgrade.
Staged Design Principle #
Start with the minimum correct design. Add complexity only when a measured signal crosses a defined threshold.
Evolution Stages #
Stage 1: Single service, single OLTP DB #
- One Postgres instance handles auctions, bids, idempotency
- Leaderboard computed by query at read time
- No caching, no streaming
- Correct up to: ~100 writes/sec, ~1K reads/sec, <10K objects
Upgrade signal: Postgres CPU > 60%, p95 read latency > 500ms
Stage 2: Add caching and async projections #
- Redis cache for hot read paths (AuctionView, TopK)
- Background job to compute leaderboard
- Idempotency TTL cleanup job
Upgrade signal: Write contention on hot objects; CAS retry rate > 20%
Stage 3: Partition write-heavy objects #
- Shard Cassandra for append-heavy event stores (score events, swipe store)
- Kafka for score event fan-out to aggregation
- Flink for windowed aggregation
Upgrade signal: Kafka consumer lag > 30s; Flink state > available memory
Stage 4: Specialized stores and geographic distribution #
- Redis Geo for spatial driver matching
- H3 cell index for surge zone computation
- Multi-region Cassandra for swipe store
Upgrade signal: Geographic expansion; latency SLO from distant regions fails
Data Model Preservation #
At each stage, identify:
- What schema changes are additive (safe)?
- What schema changes require migration (risky)?
- What data model decisions lock you into a pattern (e.g., partition key choice in Cassandra cannot be changed without full table rewrite)?
Flag these decisions early. They are the ones that are hardest to reverse.
Pattern Reference #
Quick lookup: given a mechanism type from Step 6, find the pattern.
Concurrency Control #
| Mechanism | Pattern | Key lever | Failure mode |
|---|---|---|---|
| CAS on version | Optimistic Lock | Retry budget; version column | ABA problem — add timestamp or monotonic ID to version |
| CAS + idempotency | Optimistic Lock + Idempotency Key | Idempotency TTL | Key collision if client reuses keys |
| SETNX + EX | Lease | TTL duration; fencing token | Clock skew between holder and store — use fencing token on all resource access |
| Row lock + transaction | Pessimistic Lock | Lock timeout | Deadlock; lock not released on crash |
| Commutative merge | CRDT | CRDT type (G-Counter, OR-Set, YATA) | Tombstone accumulation for deleted elements |
| Saga | Saga | Compensation idempotency; saga state durability | Compensation failure (double fault) |
| 2PC | 2PC | Coordinator crash recovery | Blocked participants if coordinator crashes after prepare |
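The fencing-token lever in the Lease row deserves a concrete shape. A sketch of the resource-side check (the class is illustrative; the point is that tokens are monotonic per lease grant):

```python
class FencedResource:
    """A resource that rejects writes carrying a stale fencing token.
    Each lease grant carries a strictly higher token, so a holder that
    lost its lease (e.g. during a GC pause) cannot clobber newer state."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token <= self.highest_token:
            return False          # stale holder: reject the write
        self.highest_token = token
        self.value = value
        return True
```

This check lives on the resource, not the lock service — which is why it survives clock skew between holder and store.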
Read Acceleration #
| Pattern | When to use | Discriminant |
|---|---|---|
| Cache-Aside + TTL | Read ≫ write; staleness OK | Simplest; use first |
| CQRS + Materialized View | Read shape ≠ write shape | Necessary when shape differs |
| Denormalization | Join-heavy hot reads; same shape | When joins dominate latency |
| Scatter-Gather | Query fans across shards | When data is already partitioned |
| Spatial Partition | Geospatial query | Always use before other read patterns for geo queries |
Write Scaling #
| Pattern | When to use | Tradeoff |
|---|---|---|
| Hash Partition | Key-based access, uniform distribution | Loses range queries |
| Range Partition | Range queries dominant | Hot “latest” partition for time-series |
| Leaderless Replication | Availability > consistency, multi-region writes | Eventual consistency; conflict handling needed |
| Single-writer per partition | Very hot single key; CAS retry rate too high | Adds queue latency; operational complexity |
Event Flow #
| Pattern | When to use | Discriminant |
|---|---|---|
| Message Queue | Work distribution; rate decoupling | Use when one consumer per message |
| Pub/Sub | Fan-out; multiple independent consumers | Use when N consumers per message |
| Outbox + Relay | Event produced by application logic; must be atomic with state change | Application controls event content |
| CDC | Event IS the DB row change | No application-level event needed |
| Fan-out on Write | Small follower counts; real-time | Fails at celebrity scale |
| Fan-out on Read | Large follower counts; infrequent reads | Higher read latency |
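The Outbox + Relay row hinges on one property: state change and event commit together. An in-memory sketch (a real implementation writes both rows in one DB transaction; the relay delivers at-least-once and consumers dedup by event id):

```python
class OutboxDB:
    """In-memory sketch of the outbox pattern: the state row and the event
    row are written in the same (simulated) transaction, so an event exists
    if and only if its state change committed."""
    def __init__(self):
        self.state = {}
        self.outbox = []     # unpublished event rows

    def commit(self, key, value, event):
        # In a real system these two writes are one transaction.
        self.state[key] = value
        self.outbox.append(event)

def relay(db, publish):
    """The relay drains the outbox and publishes each event row in order."""
    while db.outbox:
        publish(db.outbox.pop(0))
```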
Fault Tolerance #
| Pattern | When to use | Lever |
|---|---|---|
| Idempotency Key | Every non-idempotent write on unreliable network | TTL; key scope |
| Retry + Backoff + Jitter | Any network call | Base delay; max delay; jitter fraction |
| Circuit Breaker | Downstream can degrade (not just fail) | Error rate threshold; half-open probe interval |
| Bulkhead | High-priority traffic must be isolated from low-priority | Thread pool or connection pool size per tier |
| Timeout | Every outbound call | Must be derived from caller’s latency budget |
Consensus and Sync #
| Pattern | When to use | Tradeoff |
|---|---|---|
| Leader-Follower | Single-region; strong consistency writes | Leader is bottleneck; failover requires election |
| Quorum (W + R > N) | Leaderless; tunable consistency | Network partition can still split quorum |
| Gossip | Membership; failure detection; eventual propagation | Convergence time = O(log N) |
| State Vector Sync | Distributed sync; two nodes converge incrementally | Vector size grows with node count |
| Merkle Tree | Anti-entropy between replicas; efficient diff | Rehash cost on hot write path |
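The quorum condition from the table reduces to two arithmetic checks:

```python
def quorum_is_strong(n, w, r):
    """W + R > N guarantees every read quorum overlaps every write quorum,
    so a read always contacts at least one replica holding the latest write."""
    return w + r > n

def write_failures_tolerated(n, w):
    """Replica failures a write can survive while still reaching W acks."""
    return n - w
```

For N=3, the common W=2/R=2 setting is strong and tolerates one failed replica per write; W=1/R=1 maximizes availability but gives up read-your-writes.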
Time and Approximation #
| Pattern | When to use | Lever |
|---|---|---|
| Windowing | Aggregate over finite time interval | Window size; allowed lateness; watermark lag |
| TTL | Auto-expire entries | Duration; lazy vs eager expiry; add ±10% jitter |
| Approximate Counting | Memory-constrained cardinality or frequency | Count-Min (frequency), HyperLogLog (cardinality), Bloom Filter (membership) |
| Temporal Decay | Recency-weighted ranking | Decay rate λ; floor value |
| Spatial Partition | Geographic proximity queries | Cell resolution; neighbor depth |
| Staged Rollout | Deploy changes safely | Rollout percentage; metric thresholds for pause |
| Scheduled Trigger | Periodic maintenance or aggregation | Schedule frequency; distributed lock to prevent overlap |
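Two rows above have one-line implementations worth making explicit — temporal decay and TTL jitter, sketched with illustrative parameter values:

```python
import math
import random

def decayed_score(raw_score, age_seconds, decay_rate, floor=0.0):
    """Exponential temporal decay: score = raw * e^(-lambda * age),
    so recent events dominate the ranking; `floor` prevents total decay."""
    return max(floor, raw_score * math.exp(-decay_rate * age_seconds))

def jittered_ttl(base_ttl, jitter=0.10):
    """TTL with +/-10% jitter, so a cohort of entries written together
    does not expire (and stampede the backing store) at the same instant."""
    return base_ttl * random.uniform(1 - jitter, 1 + jitter)
```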
Cross-Step Anti-Patterns #
These mistakes appear at specific steps and propagate forward as structural errors.
| Anti-pattern | Where it appears | Consequence | Fix |
|---|---|---|---|
| Treating a projection as primary state | Step 2 | Dual truth; inconsistent reads | Derivability test: can it be recomputed? If yes, it’s a projection. |
| Mixing evolution models in one object | Step 2-3 | CAS predicate is ambiguous; tests become impossible | Split into process object + append event stream |
| Ordering without scope | Step 3 | “Total order” applied globally when only per-partition order is needed → unnecessary coordination | Always bind ordering to a scope |
| Vague invariant (“system should be correct”) | Step 4 | Cannot derive mechanism; design is hand-wavy | Rewrite: “for every bid, amount > current_highest at decision time” |
| Technology as DP (“use Kafka”) | Step 5 | Technology lock-in before invariant is understood | Name the invariant-enforcing mechanism; choose technology in Step 10 |
| CAS without idempotency key | Step 6 | Client retry creates duplicate outcome | Always pair CAS with idempotency check first |
| Lock where Lease is needed | Step 6 | Holder crashes; resource blocked until manual intervention | If holder can crash, use Lease (TTL) |
| 2PC where Saga applies | Step 6 | Coordinator crash blocks all participants | Check: can each step be local with compensation? If yes, use Saga |
| CRDT for non-commutative domain | Step 6 | Merge produces incorrect state | Check commutativity first. “Bid accepted if highest” is not commutative → not CRDT. |
| Dual truth source | Step 7 | “Sometimes stale, sometimes wrong” bugs | One authoritative source; all others are projections with rebuild path |
| Partition key by convenience not invariant | Step 9-10 | Hotspot misaligned with contention; cross-partition transactions required | Partition key = scope of correctness invariant |
| Projection technology as truth | Step 10 | Redis unavailable = correctness unavailable | Redis is always a cache or projection. Rebuild path from OLTP source must exist. |
| Eventual consistency on correctness path | Step 12 | Silent correctness violations | Classify each path explicitly; strong consistency on eligibility-enforcing writes |
| Skipping failure model | Step 14 | Duplicates, partial writes, and race conditions not handled → production bugs | Every write path must answer: what happens on retry? On crash? On duplicate? |
Compact Procedure Checklist #
Use this to verify each step produced its required artifact.
Step 1 — Normalization
□ Every requirement rewritten as actor-operation-state
□ Hidden writes and projections surfaced
Step 2 — Object extraction
□ Primary objects identified with class
□ Derived views rejected with reason
□ Every object passes ownership/evolution/ordering/derivability tests
Step 3 — Axis assignment
□ Every primary object has ownership, evolution, ordering
□ Every ordering bound to a scope
Step 4 — Invariant extraction
□ Every requirement maps to typed, testable invariants
□ Concurrency considered for every eligibility invariant
Step 5 — DP derivation
□ Every invariant cluster maps to a DP
□ No DP is a technology name
Step 6 — Mechanism selection
□ 6.1: Every DP classified by invariant type
□ 6.2: Ownership × evolution table applied to select mechanism family
□ 6.3: Q1-Q5 applied to select specific implementation
□ 6.4: Required mechanism combinations identified (CAS + idempotency, etc.)
Step 7 — Axiomatic validation
□ Source-of-truth table complete
□ Dependency table complete; no circular dependencies
□ Every projection has upstream source and rebuild path
Step 8 — Algorithm design
□ Every write path has pseudocode
□ Every process object has complete state machine
□ Idempotency, retry, and rejection explicitly handled for every path
Step 9 — Logical data model
□ Every primary object has schema
□ Every derived view is a separate table marked as projection
□ Partition keys derived from invariant scope
□ Dedup keys explicit
Step 10 — Technology landscape
□ Every DP maps to capability → shape → specific product
□ Every correctness-critical invariant has named enforcing technology
Step 11 — Deployment topology
□ Service boundaries justified
□ Partition keys aligned with invariant scope
□ Failure domains classified by correctness criticality
Step 12 — Consistency model
□ Every write and read path classified (strong/eventual/bounded)
□ Correctness-critical paths are strong
□ Staleness bounds explicit for eventual paths
Step 13 — Scaling model
□ Scale type identified (write/contention/read/fanout/aggregation)
□ Hotspot keys identified
□ Strategy per hotspot
Step 14 — Failure model
□ Every write path: retry semantics explicit
□ Every aggregate: duplicate event handling explicit
□ Every component: correctness impact on failure classified
Step 15 — SLOs
□ Latency SLOs per path
□ Correctness SLOs per invariant
□ Freshness SLOs per projection
□ Latency budget allocated across hops
Step 16 — Operational parameters
□ Every tunable threshold named with valid range and effect
Step 17 — Runbooks
□ Every failure mode from Step 14 has a runbook
□ Every hotspot from Step 13 has a mitigation runbook
Step 18 — Observability
□ Every SLO is measurable
□ Every failure mode is detectable
□ Alerts defined for every SLO threshold
Step 19 — Cost model
□ Growth driver named per component
□ Dominant cost identified
Step 20 — Evolution
□ Stage 1 (minimum correct) defined
□ Upgrade signal per stage defined
□ Irreversible data model decisions flagged