
System Design Derivation Framework #

A complete, mechanical procedure for deriving distributed system designs from functional requirements — from vague product language to operable architecture, with no hand-wavy steps.


Ordering Principle #

Every layer constrains the next. Skipping a layer makes later choices arbitrary.

Product requirements
  → normalize into operations over state          (Step 1)
  → extract primary objects                       (Step 2)
  → assign ownership, ordering, evolution         (Step 3)
  → extract invariants                            (Step 4)
  → derive minimal DPs from invariants            (Step 5)
  → select concrete mechanisms                    (Step 6)  ← the mechanical bridge
  → validate independence and source-of-truth     (Step 7)
  → specify exact algorithms                      (Step 8)
  → define logical data model                     (Step 9)
  → map to technology landscape                   (Step 10)
  → define deployment topology                    (Step 11)
  → classify consistency per path                 (Step 12)
  → identify scaling dimensions and hotspots      (Step 13)
  → enumerate failure modes                       (Step 14)
  → define SLOs                                   (Step 15)
  → define operational parameters                 (Step 16)
  → write runbooks                                (Step 17)
  → define observability                          (Step 18)
  → estimate costs                                (Step 19)
  → plan evolution                                (Step 20)

A design is complete only when all 20 steps produce explicit output artifacts. If you cannot produce the artifact for a step, you have not finished that step.


Step 1 — Problem Normalization #

Goal #

Convert product-language requirements into precise operations over state. This is the first act of removing ambiguity.

Why It Matters #

Product requirements blend UI framing, transport behavior, derived views, and domain semantics. Without normalization, you will:

  • confuse what the user sees with what the system stores
  • miss hidden writes behind apparently simple reads
  • fail to detect that “subscribe to alerts” implies a future-constraint store and a trigger engine
  • fail to detect that “view live leaderboard” implies a projection, not primary state

Procedure #

For each requirement, identify:

1. Actor — who initiates it?

  • user, client, internal service, scheduler, admin, external partner

2. Operation — what state operation is actually happening?

| UI verb | Actual system operation |
|---|---|
| view / see | read base record, OR read projection, OR fetch historical slice, OR subscribe from cursor |
| submit / create | create record, OR append event |
| update / edit | overwrite mutable state, OR transition state machine |
| get notification | system evaluates condition + delivers push |
| subscribe | persist future constraint (intent object) |
| search | query indexed projection |
| swipe / like | append immutable decision event |
| bid | attempt conditional update on contention state |

3. State — which underlying data actually changes or is observed? (Not the screen. Not the endpoint. The data.)

Output Artifact #

| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| Users can bid on an item | User | attempt conditional update | Auction acceptance state |
| Users see current highest bid | Client | read projection | AuctionView (derived) |
| Users get price-drop notifications | System | evaluate + deliver | AlertSubscription + PriceObservation |
| Drivers go online | Driver | overwrite status | Driver presence state |
| Rider requests a ride | Rider | create | Trip + LocationPing |
| Users swipe right on a profile | User | append decision event | Swipe record |
| Users get a match notification | System | detect mutual condition + push | Swipe store + Match record |

Completion Test #

  • Every requirement is rewritten as actor-operation-state.
  • Every “view” has been translated into the exact underlying read or projection type.
  • Every apparently simple requirement has had hidden writes, triggers, or derived views exposed.

Step 2 — Object Extraction #

Goal #

Identify the minimal set of primary state objects. A primary object is one whose state must exist independently and cannot be fully reconstructed from other state.

Object Classes #

These are heuristic categories, not a rigid ontology. The real test is the validity rule below.

| Class | Definition | Examples |
|---|---|---|
| Stable entity | Long-lived, mutable, continuously existing | User, Product, Video, Driver, Problem |
| Event | Immutable, point-in-time occurrence | Bid, Swipe, LocationPing, PriceObservation, ActivitySample |
| Process object | Identity persists across a state-machine lifecycle | Auction, Trip, Payment, Activity, JobRun |
| Intent / future constraint | Persists until a future condition is met | AlertSubscription, ScheduledJob, RetryPolicy |
| Relationship object | An edge between entities with its own lifecycle | Friendship, Match, SharePermission, FollowRelation |
| Derived view | Computable from other state | Leaderboard, CurrentHighestBid, Stack, FeedSlice, SurgeMultiplier |

Object Validity Rule #

An object is valid as a primary object if and only if it satisfies all four:

1. Ownership purity — there is one clear answer to “who writes this?”

  • Bad: an object written by users, a pipeline, a scheduler, and a moderator under different rules → split it
  • Good: Bid is written only by the bidding user

2. Evolution purity — the object evolves in one dominant way

| Evolution family | Description |
|---|---|
| Append-only | New records added; old records never modified |
| Overwrite | One current state updated in place |
| State machine | Guarded transitions between named states |
| Merge | Concurrent writes reconciled via merge law (CRDT) |

  • Bad: one object that needs both append history and overwrite current state → it is two objects
  • Good: Bid is append-only; Auction is a state machine

3. Ordering purity — there is one ordering relation among its versions/events, bound to a scope

  • Bad: “events are ordered” (unbound) → ordered within what scope?
  • Good: “Bids are ordered by acceptance time within one auction”

4. Non-derivability — the object cannot be reconstructed from other primary state

  • CurrentHighestBid can be derived from accepted Bid records → it is a projection, not primary
  • Leaderboard can be derived from Scores → it is a projection
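
The non-derivability test can be run mechanically: if a candidate can be recomputed from primary records alone, it is a projection. A minimal sketch (record shapes are illustrative, not a prescribed schema):

```python
# CurrentHighestBid fails the non-derivability test: it can be recomputed
# from accepted Bid records alone, so it is a projection, not primary state.
accepted_bids = [
    {"auction_id": "a1", "amount": 100},
    {"auction_id": "a1", "amount": 120},
    {"auction_id": "a2", "amount": 50},
]

def current_highest(bids, auction_id):
    # Recompute the "current highest" view from the primary event records.
    amounts = [b["amount"] for b in bids if b["auction_id"] == auction_id]
    return max(amounts) if amounts else None
```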

Typical Invalid Objects #

| Candidate | Problem | Fix |
|---|---|---|
| Leaderboard | Derivable from Submissions + ranking rules | Classify as derived view |
| CurrentHighestBid | Derivable from Auction + Bids | Classify as derived view |
| UserFeed | Derivable from Follows + Posts | Classify as derived view |
| ActivityPage | UI artifact | Not a domain object at all |
| SurgeMultiplier | Derivable from supply/demand counts | Classify as derived view |
| Stack (Tinder) | Derivable from Users + Swipes + location | Classify as derived view |

Output Artifact #

  • Primary objects: list, with the class of each
  • Derived / rejected: list, with the reason for rejection

Completion Test #

  • Every candidate object has been classified.
  • Every primary object passes all four validity tests.
  • Every rejected object has an explicit reason.

Step 3 — Axis Assignment #

Goal #

For every primary object, define exactly three properties. These three axes determine most architecture-relevant behavior.

The Three Axes #

Axis 1: Ownership #

Who is allowed to write this object, and under what contention model?

| Ownership pattern | Meaning |
|---|---|
| Single writer | Exactly one actor writes this object. No concurrency needed. |
| Single writer per partition | One actor per key/scope, but multiple keys in parallel. |
| Multi-writer, one winner | Many actors attempt writes; only one succeeds per atomic decision. |
| Multi-writer, all succeed (commutative) | Concurrent writes merge without conflict. |
| Multi-writer, pipeline | A processing pipeline is the sole writer; upstream actors just enqueue. |
| System-only | Only internal services write; users never write directly. |

Axis 2: Evolution #

How does the object’s state change over time?

| Evolution | Meaning | Implication |
|---|---|---|
| Append-only | New records added; old ones never mutated | Immutable; safe to replicate; dedup by content hash or ID |
| Overwrite | Current state replaced | Need version or timestamp to detect staleness |
| State machine | Transitions guarded by allowed-transition rules | CAS on (state, version); forbidden transitions must be enforced |
| Merge | Concurrent writes reconciled by merge law | Requires CRDT or OT; merge must be commutative + associative |

Axis 3: Ordering #

What ordering relation must hold, and within what scope?

| Ordering | Meaning | Implication |
|---|---|---|
| Total order within scope | Any two instances within the scope are comparable | Sequence number per partition key; single-writer or CAS to assign |
| Partial order | Causal relationship (A happened before B, or concurrent) | Vector clocks; version vectors |
| No meaningful order | Membership/set semantics only | No ordering mechanism needed |
| Causal lifecycle order | State machine transitions must follow valid paths | State machine enforcement; forbidden transitions rejected |

Critical rule: Always bind ordering to a scope. “Events are ordered” is incomplete. “Accepted bids are totally ordered within one auction” is correct.

Output Artifact #

Object: Auction
  Ownership:   Multi-writer (many bidders), one winner per auction — system enforces
  Evolution:   State machine (created → open → ended; current_highest updated on each acceptance)
  Ordering:    Causal lifecycle order for state transitions; total order on acceptance sequence within auction_id

Object: Bid
  Ownership:   Single writer (the bidding user)
  Evolution:   Append-only (bid attempt is immutable once created)
  Ordering:    Total order by created_at within auction_id (for display); acceptance order is separate

Object: Driver
  Ownership:   Multi-writer, one winner (many dispatchers compete; only one can claim a driver)
  Evolution:   State machine (offline → available → dispatched → on_trip → available)
  Ordering:    Causal lifecycle order (forbidden transitions: dispatched → offline without completing trip)

Object: Swipe
  Ownership:   Single writer (the swiping user)
  Evolution:   Append-only (a swipe decision is immutable)
  Ordering:    Total order by timestamp within swiper_id (for exclusion set scan)

Object: AlertSubscription
  Ownership:   Single writer (the subscribing user)
  Evolution:   Overwrite (user may update threshold, disable, delete)
  Ordering:    No meaningful order (set of active subscriptions per user)

Why Axes Matter More Than Patterns #

The axes directly constrain which mechanisms are valid:

  • Ownership = “multi-writer, one winner” + Evolution = “state machine” → the mechanism must be CAS or single-writer partition. CRDT is wrong (not commutative). 2PC is wrong (single-service scope). This eliminates the entire pattern space except CAS/Lock/Lease/Single-writer before Q1-Q5 is applied.
  • Ownership = “multi-writer, all succeed” + Evolution = “merge” → the mechanism must be CRDT. Lock is wrong. 2PC is wrong.

Axes are necessary to make Step 6 (mechanism selection) mechanical.

Completion Test #

  • Every primary object has explicit ownership, evolution, and ordering.
  • Every ordering is bound to a scope.
  • No object has mixed evolution without being flagged for splitting.

Step 4 — Invariant Extraction #

Goal #

Convert product requirements into precise statements that must always be true in every valid execution. Invariants are the bridge from object model to architecture. Patterns and technologies are downstream of invariants; invariants are not downstream of anything.

What An Invariant Is #

A precise, testable property of valid system state:

“For every valid state S, condition X must hold.”

It must be:

  • Precise — a stranger can determine whether the system satisfies it
  • Testable — you can write a check that passes or fails
  • Implementation-independent — it describes what must be true, not how to achieve it
  • Concurrency-aware — it must hold even under simultaneous writes, retries, and crashes

Invariant Types #

Type 1: Eligibility invariants #

An action is valid only when a condition holds.

An action A on object O is valid only if condition C(O) is true at decision time.

Examples:

  • A bid is accepted only if auction.status == OPEN AND server_time < auction.end_time AND bid.amount > auction.current_highest
  • A driver is dispatched only if driver.status == AVAILABLE
  • A job run starts only if run.status == CLAIMED AND run.claimed_by == this_worker

Implication: Some mechanism must atomically check C and perform A. If C can change between check and action, a concurrency mechanism is required.

Type 2: Ordering invariants #

Events or versions of an object must be comparable and consistent within a scope.

For any two instances I₁ and I₂ within scope S, there is a well-defined ordering, and all readers agree on it.

Examples:

  • Accepted bid amounts form a strictly increasing sequence within one auction
  • Comments are delivered in the same order to all subscribers of a video
  • State machine transitions follow valid paths (no backwards transitions)

Implication: A single authority assigns order within the scope. No two actors may independently assign conflicting positions.

Type 3: Accounting invariants #

An aggregate must exactly reflect its source events.

aggregate(scope S) = f(all events in scope S)

Examples:

  • All-time score for video V equals the sum of all ScoreEvent.delta where event.video_id = V
  • Monthly active users for month M = count(distinct user_id where last_active ∈ M)
  • Inventory quantity = initial_stock − sum(reservations) + sum(cancellations)

Implication: Every event that contributes to the aggregate must be processed exactly once. Late or duplicate events must be handled explicitly.

Type 4: Uniqueness / idempotency invariants #

The same logical request, submitted multiple times, produces the same logical outcome.

If request R is submitted N times (N ≥ 1), the observable outcome is identical to submitting it once.

Examples:

  • Submitting the same bid request twice does not create two accepted bids
  • Retrying a payment capture does not charge the user twice
  • Swiping right on the same profile twice creates at most one Match check

Implication: Either the operation is naturally idempotent (append-only with content-addressed ID), or a dedup mechanism must detect and absorb duplicates.
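
The dedup mechanism implied here can be sketched in a few lines. This is an in-memory illustration, not a production store; the names are illustrative:

```python
# A minimal dedup gate: outcomes are stored by request_id, so submit()
# executes apply_fn at most once per logical request. Retries get the
# stored outcome back instead of re-executing.
outcomes = {}

def submit(request_id, apply_fn):
    if request_id in outcomes:
        return outcomes[request_id]   # duplicate absorbed
    result = apply_fn()
    outcomes[request_id] = result
    return result
```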

Type 5: Propagation invariants #

A derived view must eventually reflect its source of truth, and the staleness bound is explicit.

At any time T, derived_view(scope S) = f(source_of_truth(S)) as of time T − ε, where ε is the acceptable staleness.

Examples:

  • The auction read view reflects the current highest bid within 100ms of acceptance
  • The Tinder stack excludes profiles already swiped before the stack is displayed
  • A cache entry is stale by at most TTL seconds

Implication: Some update propagation mechanism must exist. The staleness bound determines whether it must be synchronous (ε = 0) or asynchronous (ε > 0).

Type 6: Access-control invariants #

Reads and writes are gated by authorization.

Actor A may perform operation O on object X only if A has permission P on X.

Examples:

  • A user may read only their own activities plus friends’ visible activities
  • Only the auction owner may cancel an auction
  • A driver may only update their own location

Implication: An authorization check must run at every service boundary. The authorization state itself (ACL, permission record) is a primary object with its own ownership/evolution model.

Invariant Extraction Procedure #

For each normalized requirement from Step 1:

  1. Ask: What must always be true for this requirement to be satisfied?
  2. Ask: What must never be violated, even under concurrent writes, retries, crashes, or late events?
  3. Classify the invariant by type.
  4. Write it in precise, testable form.

Output Artifact #

Numbered list:

1. [Eligibility] A bid is accepted only if auction.status == OPEN
   AND server_time < auction.end_time AND bid.amount > auction.current_highest_amount.
2. [Ordering] Accepted bid amounts form a strictly increasing sequence within one auction.
3. [Uniqueness] The same bid request_id produces the same logical acceptance result.
4. [Eligibility] A driver is dispatched only if driver.status == AVAILABLE.
5. [Uniqueness] The same dispatch request produces at most one claimed driver.
6. [Accounting] The all-time score for a video equals the sum of all delta values
   in ScoreEvent records where video_id matches.
7. [Propagation] The leaderboard for scope S reflects all ScoreEvents within 60 seconds.
8. [Eligibility] A Match is created only if both users have right-swiped each other.
9. [Uniqueness] A Match between user A and user B is created at most once.
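
Because every invariant must be testable, it helps to write it down as an executable check. For example, invariant 2 as a sketch (the function name is illustrative):

```python
# Invariant 2 [Ordering]: accepted bid amounts form a strictly increasing
# sequence within one auction. Pass the accepted amounts in acceptance order.
def strictly_increasing(amounts):
    return all(a < b for a, b in zip(amounts, amounts[1:]))
```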

Completion Test #

  • Every requirement maps to at least one invariant.
  • Every invariant is typed, precise, and testable.
  • Every invariant is stated in terms of domain state, not implementation.
  • Concurrency is explicitly considered for every eligibility invariant.

Step 5 — DP Derivation #

Goal #

Derive the minimal set of design parameters — the minimal mechanisms that enforce the invariant clusters.

A DP is not a technology (not “Postgres” or “Redis”). It is not a pattern name (not “CQRS” or “Event Sourcing”). It is the minimal runtime mechanism that makes an invariant cluster provably enforceable.

Derivation Chain #

requirement → state → invariant → dependency closure → minimal enforcing mechanism

Dependency closure: Ask “what other state is required to enforce this invariant?” This reveals cross-requirement dependencies that are invisible when requirements are read in isolation.

Derivation Procedure #

For each invariant or tightly coupled invariant cluster:

  1. Identify the state the invariant is about.
  2. Ask: what must exist at runtime for this invariant to remain true under adversarial conditions (concurrent writes, retries, crashes)?
  3. Name the minimal mechanism that enforces it.

Examples #

Invariant 1+2 (auction eligibility + ordering):

  • State: auction acceptance state (status, current_highest, version)
  • What must exist: something that atomically reads current state, validates eligibility, and conditionally updates — rejecting concurrent updates that lost the race
  • Minimal DP: Auction acceptance / concurrency control mechanism

Invariant 3 (bid idempotency):

  • State: prior outcomes keyed by request_id
  • What must exist: a store that maps request_id → prior result, checked before every acceptance attempt
  • Minimal DP: Bid idempotency store

Invariant 4+5 (driver dispatch):

  • State: driver status
  • What must exist: something that atomically claims a driver exclusively — rejecting competing claims
  • Minimal DP: Driver claim / exclusion mechanism

Invariant 6+7 (leaderboard accounting + propagation):

  • State: per-video aggregate score, top-K projection per scope
  • What must exist: (a) a store for aggregate scores that processes every event exactly once; (b) a mechanism that updates top-K when a score crosses the cutoff; (c) a propagation path from score store to projection within 60s
  • Minimal DPs: Score aggregate store, Top-K projection maintenance mechanism, Score-to-projection propagation path

DP Validation Rules #

A good DP:

  • enforces exactly one invariant cluster
  • is not redundant with another DP
  • is not a technology or a brand name
  • is not so broad as to be meaningless (“backend service” is not a DP)

Output Artifact #

| DP | Invariant cluster enforced |
|---|---|
| Auction acceptance mechanism | Eligibility (open, not expired, amount strictly increasing) |
| Bid idempotency store | Uniqueness of bid acceptance per request_id |
| Auction view / read projection | Propagation: current highest bid visible to readers |
| Score aggregate store | Accounting: all-time score = sum of all events |
| Top-K projection maintenance | Accounting + Propagation: leaderboard reflects aggregate within bound |
| Driver claim mechanism | Eligibility: driver exclusively assigned to one trip |
| Dispatch idempotency store | Uniqueness: same dispatch request produces one outcome |
| Match creation mechanism | Eligibility: Match created only on mutual right-swipe |
| Match idempotency store | Uniqueness: at most one Match per user pair |

Completion Test #

  • Every invariant maps to at least one DP.
  • No DP exists without a linked invariant.
  • No two DPs enforce the same invariant redundantly.
  • No DP is a technology name.

Step 6 — Mechanism Selection (The Mechanical Bridge) #

Goal #

For each DP, derive the concrete mechanism using a mechanical, table-driven procedure. This step fills the gap between “bid acceptance mechanism” (abstract DP) and “CAS on (auction_id, version, current_highest, status) in a transactional store” (concrete mechanism).

The procedure has four sub-steps applied in order. Earlier sub-steps constrain later ones.


Sub-step 6.1 — Classify the DP by invariant type #

| Invariant type | Mechanism family required |
|---|---|
| Eligibility | Concurrency control — something must atomically check and conditionally update |
| Ordering | Sequence authority — one actor assigns position within scope |
| Accounting | Aggregation pipeline — events processed exactly once into aggregate |
| Uniqueness / idempotency | Dedup gate — lookup prior result before executing |
| Propagation | Update propagation path — source-of-truth change reaches derived view within ε |
| Access-control | Authorization boundary — permission check before every operation |

Sub-step 6.2 — Apply ownership × evolution to select concurrency mechanism #

This table is the mechanical derivation for concurrency control. Look up the object’s ownership and evolution axes (from Step 3):

| Ownership | Evolution | Required concurrency mechanism | Why |
|---|---|---|---|
| Single writer | Any | None | One writer cannot race with itself |
| Single writer per partition key | Any | None within partition; partition routing is the mechanism | Route all writes for key K to the same owner |
| Multi-writer | Overwrite | CAS on version | Two writers compete to set the new value; one must lose |
| Multi-writer | State machine | CAS on (state, version) | Must reject invalid transitions and concurrent winners |
| Multi-writer | Merge (commutative + associative) | CRDT | Merge is safe without coordination |
| Multi-writer, one at a time, no crash concern | Exclusive hold | Pessimistic Lock | Blocks others for duration of operation |
| Multi-writer, one at a time, holder may crash | Exclusive hold | Lease (TTL + fencing token) | Auto-releases if holder dies; fencing token prevents stale commands |
| Cross-service, decomposable into local steps | Multi-step write | Saga | Each step is local; compensations undo on failure |
| Cross-service, not decomposable | Atomic multi-service write | 2PC | All-or-nothing across services; use only when Saga cannot apply |

Decision rules:

  • If evolution = merge AND ownership = multi-writer → always CRDT. Do not use Lock or CAS.
  • If ownership = single-writer → never add concurrency mechanism. It adds cost with no benefit.
  • If holder may crash while holding exclusive access → Lease, not Pessimistic Lock. Lock held by dead process blocks forever.
  • Use 2PC only as last resort. It blocks on coordinator failure.
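
Because the selection is table-driven, Sub-step 6.2 can be expressed as a literal lookup. A sketch, assuming simplified string labels for the axes (the labels themselves are illustrative):

```python
# Sub-step 6.2 as a literal lookup, to show the selection is mechanical.
# Keys mirror the ownership x evolution table above.
CONCURRENCY_MECHANISM = {
    ("single-writer", "any"): "none",
    ("multi-writer", "overwrite"): "CAS on version",
    ("multi-writer", "state-machine"): "CAS on (state, version)",
    ("multi-writer", "merge"): "CRDT",
    ("exclusive-hold", "no-crash-concern"): "pessimistic lock",
    ("exclusive-hold", "holder-may-crash"): "lease (TTL + fencing token)",
}

def select_mechanism(ownership, evolution):
    # "Any" rows match regardless of evolution.
    if (ownership, "any") in CONCURRENCY_MECHANISM:
        return CONCURRENCY_MECHANISM[(ownership, "any")]
    return CONCURRENCY_MECHANISM[(ownership, evolution)]
```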

Sub-step 6.3 — Apply Q1-Q5 to discriminate within the mechanism family #

After Sub-step 6.2 has selected the mechanism family, Q1-Q5 selects the specific implementation and adds any additional mechanisms required.

Q1 — Coordination scope #

Where does contention actually localize?

| Scope | Implication |
|---|---|
| Within a single data structure (same process, shared memory) | Hardware CAS (compare-and-swap instruction). No network round-trip. |
| Within a service (multiple stateless instances, shared external store) | Distributed CAS: SET key value NX in Redis, conditional write in DynamoDB, UPDATE ... WHERE version = $v in Postgres |
| Cross-service (multiple independent services) | Saga (if decomposable) or 2PC (if not). No in-process lock can span services. |
| Cross-region (geographically distributed, partition must not block) | CRDT or Leaderless Replication. Any mechanism requiring a quorum write may block during partition. |

  • Application to auction: scope = within auction service → distributed CAS on shared store.
  • Application to driver dispatch: scope = within dispatch service → distributed CAS on shared Redis.
  • Application to collaborative editor: scope = cross-region, partition must not block → CRDT.

Q2 — Failure model #

What breaks, and what must the mechanism survive?

| Failure | Required additional mechanism |
|---|---|
| Writer crashes after write, before response | Idempotency Key: client retries; server detects duplicate via stored outcome |
| Lock holder crashes while holding lock | Lease (TTL): store auto-releases after timeout. Add fencing token to reject stale commands. |
| Network partition between services | CRDT (no coordination) or Leaderless with quorum. 2PC blocks during partition. |
| Slow degradation of downstream service | Circuit Breaker + Timeout. The downstream call must have a maximum wait and a fail-fast threshold. |
| Retry storm on recovery | Retry + Exponential Backoff + Jitter. Without jitter, all retrying clients hit at the same moment. |
| Partial write (step A succeeds, step B fails) | Outbox + Relay or Saga compensation. Partial state must be detectable and recoverable. |

  • Application to auction bid: writer (client) may crash after sending → add Idempotency Key on request_id.
  • Application to driver dispatch: dispatcher may crash while driver is claimed → use Lease (TTL on claim), not plain lock.
  • Application to payment service call: payment gateway may degrade → add Circuit Breaker + Timeout.
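
The Retry + Exponential Backoff + Jitter mechanism can be sketched as follows (full-jitter variant; parameter values are illustrative):

```python
import random

def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5):
    # Full-jitter exponential backoff: each delay is drawn uniformly from
    # [0, min(cap, base * factor**n)], so recovering clients de-synchronize
    # instead of retrying in lockstep.
    return [random.uniform(0.0, min(cap, base * factor ** n))
            for n in range(attempts)]
```

A real client would sleep for each delay between attempts and stop on success; only the schedule is shown here.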

Q3 — Data properties #

What does the data structure allow?

| Property | Implication |
|---|---|
| Operations are commutative + associative (order doesn’t matter, duplicates are idempotent) | CRDT — zero coordination cost. Use before any lock-based mechanism. |
| Data is content-addressed (identity = hash of content) | Hash is a natural idempotency key. Dedup is free at the storage layer without a separate idempotency table. |
| Data is monotonically increasing | CAS with monotonicity check is sufficient. No need for full version history. |
| Data requires total order within a scope | A single authority must assign sequence numbers within the scope. |
| Data is spatially structured (lat/lng or region) | Spatial Partition must precede any other read-path optimization. |
| Data is time-windowed | Windowing mechanism must handle late events and window boundaries explicitly. |

  • Application to score aggregation: ScoreEvent.delta is applied by addition, which is commutative and associative → stream aggregation with at-least-once delivery + idempotent delta application is sufficient. No lock needed.
  • Application to block storage (Dropbox): block identity = SHA-256(content) → the content hash IS the idempotency key. Duplicate uploads are free to ignore.
  • Application to geo-search (driver matching): data is spatially structured → Spatial Partition (H3/geohash) must be the primary index. Do not range-scan on (lat, lng) columns.
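
The score-aggregation case can be sketched directly: order-independent addition plus per-event dedup, with no lock. An in-memory illustration (names are illustrative):

```python
# Score aggregation under at-least-once delivery: delta addition is
# commutative and associative, so arrival order does not matter, but each
# event must apply exactly once, so dedup by event_id.
seen_events = set()
scores = {}

def apply_score_event(event_id, video_id, delta):
    if event_id in seen_events:   # broker redelivered; absorb the duplicate
        return
    seen_events.add(event_id)
    scores[video_id] = scores.get(video_id, 0) + delta
```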

Q4 — Access pattern #

How is the data read and written?

| Pattern | Mechanism |
|---|---|
| Read » Write, staleness tolerable | Cache-Aside with TTL |
| Read » Write, read shape ≠ write shape | CQRS + Materialized View |
| Read » Write, read shape = write shape, join-heavy | Denormalization |
| Write » Read | Append-only Log, Hash Partition, Leaderless Replication |
| Query fans across multiple shards | Scatter-Gather |
| Query is geospatial (find entities within radius) | Spatial Partition (H3, Geohash, S2, QuadTree) |
| Reads are time-bounded (recent data dominates) | Temporal Decay, TTL, Windowing |

  • Application to auction read view: reads » writes on auction detail; read shape ≈ write shape → Cache-Aside with short TTL. Invalidate on acceptance event.
  • Application to driver matching: geospatial query (“drivers within 3km”) → Spatial Partition (H3 index in Redis Geo). Do not use Cache-Aside without a spatial index — it still requires an O(N) scan.
  • Application to leaderboard: reads » writes; read shape (ranked list) ≠ write shape (score events) → CQRS + Materialized View (top-K projection as the read model).
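
The Cache-Aside mechanism can be sketched in memory (illustrative names; a real deployment would use an external cache such as Redis):

```python
# Cache-aside with TTL: read the cache first; on a miss or an expired
# entry, read the source of truth and repopulate. Time is passed in
# explicitly to keep the sketch deterministic.
cache = {}

def get_auction_view(auction_id, db_read, ttl, now):
    entry = cache.get(auction_id)
    if entry is not None and entry[1] > now:
        return entry[0]                      # fresh cache hit
    value = db_read(auction_id)              # miss: go to the source
    cache[auction_id] = (value, now + ttl)
    return value
```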

Q5 — Coupling requirement #

How tightly must the producer and consumer of a change synchronize?

| Coupling | Mechanism |
|---|---|
| Synchronous, atomic (producer needs confirmation before proceeding) | In-transaction write. CAS, 2PC, or Pessimistic Lock within same transaction. |
| Synchronous, best-effort (producer sends, expects delivery but tolerates retry) | Retry + Backoff + Jitter |
| Asynchronous, event produced by application logic | Outbox + Relay: write event to outbox table in same DB transaction; relay publishes to broker |
| Asynchronous, event is the DB row change itself | CDC: tail the DB write-ahead log; no application-level outbox needed |
| Fan-out to N consumers at write time | Fan-out on Write: push to each consumer’s inbox immediately |
| Fan-out to N consumers at read time | Fan-out on Read: each consumer pulls and merges at query time |
| Notification must be reliable even if consumer is offline | Persist notification record + async retry; never fire-and-forget |

  • Application to auction acceptance → view update: the view can be slightly stale (ε = 100ms) → async propagation via CDC or event publication → eventually consistent view.
  • Application to bid submission → outbox: the bid result must trigger downstream events (notify watchers) atomically with the bid write → Outbox + Relay (write both in the same Postgres transaction).
  • Application to match notification: the user may be offline → persist the Match record durably; the push notification is best-effort retry; the user sees the match on app open regardless.
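
The Outbox + Relay mechanism can be sketched with SQLite standing in for the service's database; the schema here is hypothetical:

```python
import json
import sqlite3

# Minimal Outbox + Relay sketch: the domain write and its event commit in
# one transaction, so neither can exist without the other. A relay process
# would later publish unpublished outbox rows to the broker.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bids (bid_id TEXT PRIMARY KEY, amount INTEGER)")
db.execute("CREATE TABLE outbox (event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
           " payload TEXT, published INTEGER DEFAULT 0)")

def accept_bid(bid_id, amount):
    with db:   # single transaction: bid row + outbox event, atomically
        db.execute("INSERT INTO bids VALUES (?, ?)", (bid_id, amount))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "bid_accepted", "bid_id": bid_id}),))

def relay_pending():
    # What the relay would read and publish, marking rows published after.
    return db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0").fetchall()
```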


Sub-step 6.4 — Check for required mechanism combinations #

Some invariants require multiple mechanisms. Missing one creates a correctness gap.

| Invariant | Required mechanism | Additional mechanism always needed |
|---|---|---|
| Eligibility under concurrency | CAS or Lease | + Idempotency Key (client retry creates duplicate without it) |
| Eligibility + crash of holder | Lease | + Fencing token (stale holder must be rejected on every resource access) |
| Accounting via event stream | Stream aggregation | + Idempotency per event (duplicate events corrupt aggregate) |
| Propagation to derived view | CDC or Outbox | + Idempotency in consumer (message broker delivers at-least-once) |
| Ordering within scope | Sequence authority | + Same authority for history and live path (two authorities = split brain) |
| Cross-service write | Saga | + Idempotency on each local step (Saga steps may retry) |

The most commonly missed combination: CAS without an Idempotency Key. CAS prevents concurrent duplicates but not retry duplicates. A client whose request timed out will retry. If the first request succeeded (the CAS applied), the retry hits a changed version and either (a) correctly fails, if the client checks the stored outcome, or (b) silently succeeds again if the CAS predicate happens to match again. Always run the Idempotency Key check before the CAS.


Worked Examples of Mechanical Derivation #

Example A: Auction bid acceptance #

From Step 3:

  • Object: Auction
  • Ownership: Multi-writer, one winner per auction
  • Evolution: State machine (current_highest is conditionally updated)
  • Ordering: Total order on acceptance sequence within auction_id

From Step 4:

  • Invariant 1 [Eligibility]: bid accepted only if OPEN, not expired, amount > current_highest
  • Invariant 2 [Ordering]: accepted amounts strictly increase
  • Invariant 3 [Uniqueness]: same request_id = same outcome

Step 6 derivation:

Sub-step 6.1: Eligibility + Ordering → concurrency control + sequence authority

Sub-step 6.2: Ownership=multi-writer one-winner + Evolution=state machine → CAS on (state, version)

Sub-step 6.3:

  • Q1: within auction service → distributed CAS on shared store
  • Q2: client may crash after sending → add Idempotency Key on request_id
  • Q3: not commutative (two bids can’t both be highest) → CAS confirmed, not CRDT
  • Q4: write path is the hot path; read path is read-heavy → Cache-Aside for read view
  • Q5: bid write → view update can be async (ε acceptable) → CDC or event publication

Sub-step 6.4: CAS + Idempotency Key (required combination for retryable eligibility writes)

Result:

  • CAS: UPDATE auction SET current_highest=$amount, winning_bid_id=$bid_id, version=version+1 WHERE auction_id=$id AND version=$old_version AND status='OPEN' AND end_time > now() AND current_highest < $amount
  • Idempotency: check idempotency_store[request_id] before executing CAS; return stored result on hit
  • Read view: Cache-Aside with TTL; invalidate on CDC event from auction table

Example B: Driver dispatch (ride-hailing) #

From Step 3:

  • Object: Driver
  • Ownership: Multi-writer, one winner (many dispatchers compete per driver)
  • Evolution: State machine (available → dispatched → on_trip → available)
  • Ordering: Causal lifecycle order (forbidden: dispatched → offline without completing trip)

From Step 4:

  • Invariant [Eligibility]: driver dispatched only if driver.status == AVAILABLE
  • Invariant [Uniqueness]: same dispatch request → same outcome (one driver claimed)
  • Invariant [Eligibility crash-safety]: if dispatcher crashes, driver is not held indefinitely

Step 6 derivation:

Sub-step 6.1: Eligibility + crash-safety → concurrency control with timeout

Sub-step 6.2: Ownership=multi-writer one-winner + holder may crash → Lease (TTL + fencing token)

Sub-step 6.3:

  • Q1: within dispatch service → distributed CAS on shared in-memory store (Redis)
  • Q2: dispatcher crashes while driver is claimed → TTL auto-releases claim; fencing token rejects stale dispatcher commands
  • Q3: not commutative → Lease confirmed, not CRDT
  • Q4: high write rate (many dispatches/sec), O(1) latency required → Redis, not relational DB with row locks
  • Q5: dispatch result must be synchronous (dispatcher needs immediate response) → in-request atomic claim

Sub-step 6.4: Lease + Idempotency Key; Lease + Fencing Token (both required for crash-safety)

Result:

  • Claim: SET driver:{id}:status dispatched NX EX {ttl_seconds} (Redis: set if not exists, with TTL)
  • Fencing token: monotonically increasing claim_id stored with status; dispatcher must present claim_id on every subsequent command; stale claim_id → reject
  • Idempotency: dispatch_request_id → check idempotency store before attempting claim

This is the mechanical derivation of “CAS on driver status in Redis with TTL” from first principles — without pattern memorization.
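The claim-plus-fencing behavior can be simulated without Redis. A minimal pure-Python sketch — `DriverLeaseStore`, the `claim_id` counter, and the explicit `now` parameter are illustrative stand-ins for Redis `SET ... NX EX` and a fencing-token counter, not production code:

```python
import itertools


class DriverLeaseStore:
    """In-memory stand-in for the Redis claim: SET driver:{id}:status dispatched NX EX ttl."""

    def __init__(self):
        self._claims = {}                      # driver_id -> (claim_id, expires_at)
        self._claim_ids = itertools.count(1)   # monotonically increasing fencing tokens

    def try_claim(self, driver_id, ttl_seconds, now):
        """Atomic claim: succeeds only if no live claim exists (NX semantics)."""
        claim = self._claims.get(driver_id)
        if claim and claim[1] > now:           # live claim exists -> NX fails
            return None
        claim_id = next(self._claim_ids)       # fencing token for this claim
        self._claims[driver_id] = (claim_id, now + ttl_seconds)
        return claim_id

    def command(self, driver_id, claim_id, now):
        """Every subsequent dispatcher command must present the current claim_id."""
        claim = self._claims.get(driver_id)
        if claim is None or claim[1] <= now or claim[0] != claim_id:
            return False                       # expired lease or stale token -> reject
        return True
```

A crashed dispatcher simply stops issuing commands; the TTL expires, a new claim gets a higher `claim_id`, and any late command from the old dispatcher is rejected by the fencing check.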


Example C: Leaderboard (YouTube Top K) #

From Step 3:

  • Object: ScoreEvent
    • Ownership: Pipeline (system writes after processing)
    • Evolution: Append-only
    • Ordering: Total order by timestamp within video_id
  • Object: VideoAggregate (derived → not primary, but a necessary DP)
    • Ownership: Pipeline (single writer per video_id partition)
    • Evolution: Overwrite (increment on each event)
    • Ordering: No meaningful order (only latest value matters)

From Step 4:

  • Invariant [Accounting]: all-time score = sum of all ScoreEvent.delta for video_id
  • Invariant [Accounting]: window score = sum of ScoreEvent.delta for video_id within window bucket
  • Invariant [Uniqueness]: duplicate ScoreEvent is processed at most once
  • Invariant [Propagation]: top-K projection reflects aggregates within 60s

Step 6 derivation:

Sub-step 6.1: Accounting + Uniqueness + Propagation → aggregation pipeline + dedup + propagation path

Sub-step 6.2: Ownership=pipeline (single writer per partition) → no concurrency mechanism for the aggregate. Dedup required for at-least-once delivery.

Sub-step 6.3:

  • Q1: within aggregation service, partitioned by video_id → single-writer per partition; no cross-service coordination
  • Q2: event delivered at-least-once → Idempotency per event (dedup by event_id before incrementing aggregate)
  • Q3: delta addition is commutative → stream aggregation is correct; CRDT G-Counter semantics apply
  • Q4: read shape (ranked list) ≠ write shape (event stream) → CQRS + Materialized View; Scatter-Gather across scopes for multi-scope queries; Temporal Decay for recency weighting
  • Q5: propagation from aggregate to projection can be async (60s bound) → CDC or internal event → projection updater

Sub-step 6.4: Stream aggregation + Idempotency per event + Temporal Decay for recency

Result:

  • Ingest: Kafka topic partitioned by video_id
  • Aggregate: Flink job, keyed by video_id, dedup by event_id, increment all-time and windowed aggregates
  • Projection: Redis sorted set per scope; updated when new score crosses cutoff
  • Temporal Decay: score in sorted set = raw_score × e^(−λ × age_hours) recomputed on each update
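The temporal-decay formula above is directly computable. A small sketch — `DECAY_LAMBDA` is a hypothetical rate, tuned to the half-life the product wants:

```python
import math

DECAY_LAMBDA = 0.05  # hypothetical decay rate; ln(2)/lambda gives the half-life in hours


def decayed_score(raw_score: float, age_hours: float) -> float:
    """score = raw_score * e^(-lambda * age_hours), recomputed on each update."""
    return raw_score * math.exp(-DECAY_LAMBDA * age_hours)
```

With λ = 0.05 the half-life is ln(2)/0.05 ≈ 13.9 hours: a fresh item with half the raw score of a 14-hour-old item ranks roughly even with it.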

Output Artifact #

For each DP, produce:

| DP | Mechanism | Components | Failure handling |
| --- | --- | --- | --- |
| Auction acceptance | CAS on (auction_id, version) + Idempotency Key | Postgres conditional UPDATE + idempotency table | Retry: check idempotency first; CAS fail: reread and retry bounded times |
| Driver claim | Lease (Redis SETNX + EX) + Fencing token | Redis key per driver_id + claim_id | TTL auto-releases stale claim; reject stale claim_id |
| Score aggregate | Stream aggregation (Flink) + event dedup | Kafka + Flink keyed state | Exactly-once via Flink checkpointing; event dedup by event_id |
| Top-K projection | CQRS + Materialized View + Temporal Decay | Redis sorted set per scope | Rebuild from aggregate store on projection corruption |

Step 7 — Axiomatic Validation #

Goal #

Validate the design for: (1) single source of truth per concept, (2) no hidden coupling between DPs, (3) no circular dependencies.

Single Source of Truth #

For every concept, ask: if two systems disagree about this value, which one is correct?

If the answer is ambiguous, you have dual truth. Dual truth is the source of “sometimes stale, sometimes wrong” systems.

Source-of-truth table:

| Concept | Authoritative source | Derived projections |
| --- | --- | --- |
| Current highest bid | Auction acceptance state (Postgres) | AuctionView cache (Redis) |
| Driver status | Driver status store (Redis) | Dispatch service in-memory state |
| All-time video score | Score aggregate store (Flink keyed state) | Top-K projection (Redis sorted set) |
| Comment order | Authoritative comment store (sequence assigned at write) | History slice, live subscription |
| Match existence | Match table (Postgres) | Notification (derived event) |

Invalid patterns:

  • Redis cache is treated as authoritative because it is fast. Fix: Redis is a projection; Postgres is truth.
  • History and live stream assign their own sequence numbers independently. Fix: one sequence authority for both.
  • Two services each compute “current surge multiplier” from raw data independently. Fix: one aggregator is authoritative; others read its output.

Independence Check #

Build a DP-to-invariant matrix. Look for:

  • Hidden shared decisions: two DPs that both depend on the same underlying choice (e.g., two DPs that both need “total order within video_id” — there should be one sequence authority, not two)
  • Circular dependency: DP A depends on DP B’s output which depends on DP A’s output
  • Accidental coupling: a cache that can diverge permanently from its source with no rebuild path

Acceptable dependency: Leaderboard projection depends on score aggregate. This is a real, one-directional dependency. Unacceptable dependency: Leaderboard projection can permanently diverge from score aggregate with no reconciliation. Fix: add rebuild path (replay aggregate store → rebuild projection).

Output Artifact #

  1. Source-of-truth table (above)
  2. Dependency table: DP → depends on
  3. List of projections with: upstream source, maximum lag, rebuild path

Completion Test #

  • Every concept has exactly one authoritative source.
  • Every projection has a documented upstream source and a rebuild path.
  • No correctness-critical concept has competing truths.
  • No circular dependency exists.

Step 8 — Algorithm and State-Transition Design #

Goal #

Make every write path and lifecycle transition directly implementable. “Use CAS” is not an algorithm. “Read current version, validate predicates, attempt UPDATE WHERE version = $v, retry bounded times on failure” is an algorithm.

What to Specify #

For every write path:

  1. Input: what arrives
  2. State read: what must be read before deciding
  3. Validation: what conditions must hold (maps to eligibility invariants)
  4. Atomic update: what changes atomically and under what predicate
  5. Output: what is returned
  6. Failure behavior: what happens on retry, duplicate, race, timeout, or rejection

For every process object: define all states, all allowed transitions, all forbidden transitions, terminal states.

Algorithm Templates #

Template 1: Eligibility write with CAS #

function accept(request_id, object_id, payload):
  // Step 1: Idempotency check
  if prior_result = idempotency_store.get(request_id):
    return prior_result

  // Step 2: Read current state
  state = store.read(object_id)

  // Step 3: Validate eligibility
  if not eligible(state, payload):
    result = rejection_reason(state, payload)
    idempotency_store.set(request_id, result, ttl=24h)
    return result

  // Step 4: Attempt CAS
  new_state = apply(state, payload)
  success = store.conditional_update(
    object_id,
    expected_version = state.version,
    new_state = new_state
  )

  // Step 5: Handle CAS result
  if success:
    result = accepted(new_state)
    idempotency_store.set(request_id, result, ttl=24h)
    emit_event(object_id, new_state)  // for projections
    return result
  else:
    // Another writer won the race
    if retries_remaining > 0:
      return accept(request_id, object_id, payload)  // retry with fresh read
    else:
      return conflict_or_rejection()
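Template 1 can be exercised end-to-end with in-memory stand-ins. A runnable sketch — `VersionedStore`, `accept_bid`, and the eligibility predicate are illustrative, not the production schema:

```python
import threading


class VersionedStore:
    """Minimal stand-in for a store that supports conditional (CAS) updates."""

    def __init__(self, state):
        self._lock = threading.Lock()
        self._state = dict(state, version=0)

    def read(self):
        with self._lock:
            return dict(self._state)

    def conditional_update(self, expected_version, new_state):
        with self._lock:
            if self._state["version"] != expected_version:
                return False               # another writer won the race
            self._state = dict(new_state, version=expected_version + 1)
            return True


idempotency_store = {}                     # request_id -> stored outcome


def accept_bid(request_id, store, amount, retries=3):
    if request_id in idempotency_store:    # Step 1: idempotency check first
        return idempotency_store[request_id]
    for _ in range(retries):
        state = store.read()               # Step 2: read current state
        if not (state["status"] == "OPEN" and amount > state["current_highest"]):
            result = ("rejected", state["current_highest"])  # Step 3: eligibility
            idempotency_store[request_id] = result
            return result
        new_state = dict(state, current_highest=amount)
        if store.conditional_update(state["version"], new_state):  # Step 4: CAS
            result = ("accepted", amount)
            idempotency_store[request_id] = result
            return result
        # CAS lost the race: loop retries with a fresh read
    return ("conflict", None)
```

A timed-out client that retries with the same `request_id` hits the idempotency check and gets the stored outcome instead of re-running the CAS.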

Template 2: State machine transition #

allowed_transitions = {
  CREATED: [OPEN],
  OPEN: [ENDED],
  ENDED: [],  // terminal
}

function transition(object_id, new_state, request_id):
  if prior = idempotency_store.get(request_id): return prior

  current = store.read(object_id)
  if new_state not in allowed_transitions[current.state]:
    return invalid_transition_error(current.state, new_state)

  success = store.conditional_update(
    object_id,
    expected_version = current.version,
    new_state = {state: new_state, version: current.version + 1}
  )
  if success:
    result = transition_applied(new_state)
    idempotency_store.set(request_id, result, ttl=24h)
    return result
  else:
    // Lost a race: reread and retry bounded times, then return conflict
    retry with fresh read up to retry_limit; else return conflict_error()

Template 3: Append-only event with dedup #

function ingest_event(event_id, payload):
  if dedup_store.exists(event_id): return already_processed

  // Ideally atomic: the event append, dedup mark, and aggregate update belong
  // in one transaction (or checkpointed stream-processor state); shown
  // sequentially here for clarity
  store.append(event_id, payload)
  dedup_store.set(event_id, processed, ttl=retention_period)

  // Update downstream aggregate
  aggregate = aggregate_store.increment(payload.key, payload.delta)
  if aggregate crosses top_k_cutoff:
    projection_store.update(payload.scope, payload.key, aggregate)
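The dedup-before-increment behavior of Template 3 can be demonstrated with in-memory stores — the names are illustrative, and the atomicity caveat in the template still applies to a real system:

```python
processed_events = set()   # dedup store, keyed by event_id
aggregates = {}            # video_id -> running all-time score


def ingest_event(event_id, video_id, delta):
    """At-least-once delivery is safe: duplicates are dropped before applying."""
    if event_id in processed_events:
        return "already_processed"
    processed_events.add(event_id)
    aggregates[video_id] = aggregates.get(video_id, 0) + delta
    return "applied"
```

Redelivering the same event leaves the aggregate unchanged, which is exactly the Uniqueness invariant from Step 4.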

Template 4: Saga (multi-service write) #

function saga(saga_id, steps):
  if prior = idempotency_store.get(saga_id): return prior

  completed_steps = []

  for step in steps:
    result = execute_with_idempotency(saga_id + ":" + step.id, step)
    if result.failed:
      // Compensate all completed steps in reverse
      for completed in reversed(completed_steps):
        compensate_with_idempotency(saga_id + ":compensate:" + completed.id, completed)
      return saga_failed(result)
    completed_steps.append(step)

  return saga_succeeded()
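Template 4's compensate-in-reverse behavior can be sketched as runnable Python — the per-step idempotency wrapping from the template is omitted here for brevity:

```python
def run_saga(steps):
    """steps: list of (name, action, compensate); action and compensate
    are callables returning True on success."""
    completed = []
    for name, action, compensate in steps:
        if not action():
            # Compensate all completed steps in reverse order
            for _, _, comp in reversed(completed):
                comp()
            return ("failed_at", name)
        completed.append((name, action, compensate))
    return ("succeeded", None)
```

Only completed steps are compensated; the failed step itself must leave no partial effect (or its compensation must be registered before the effect is applied).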

State Machine Specification #

For every process object, produce:

States: {CREATED, OPEN, ENDED}
Allowed transitions:
  CREATED → OPEN    (when: manually opened or scheduled open time reached)
  OPEN → ENDED      (when: end_time reached OR owner closes)
Forbidden transitions:
  ENDED → any       (ENDED is terminal)
  OPEN → CREATED    (no reversal)
Terminal states: {ENDED}

Clock authority: server_time from the auction service. Client clock is never trusted.

Duplicate Handling Rules #

Every write path must explicitly state its idempotency semantics:

| Write path | Idempotency key | Duplicate behavior |
| --- | --- | --- |
| Bid submission | request_id (client-generated UUID) | Return same acceptance/rejection result |
| Driver dispatch | dispatch_request_id | Return same claimed/failed result |
| Score event ingestion | event_id (producer-assigned) | Skip if already processed |
| Match creation | canonical_key = (min(a,b), max(a,b)) | INSERT IF NOT EXISTS; return existing match |
| Payment capture | trip_id + attempt_number | Return stored capture result |

Output Artifact #

For every write path: pseudocode including idempotency, CAS, retry, and rejection logic. For every process object: complete state machine diagram.


Step 9 — Logical Data Model #

Goal #

Convert primary objects, DPs, and access patterns into a concrete logical schema. This step is still logical (technology-independent). Physical decisions come in Step 10.

Schema Derivation Rules #

Rule 1: Every primary object becomes at least one table/collection.
Rule 2: Every derived view that is read frequently becomes an explicit read model table/collection.
Rule 3: Partition key candidates are derived from the scope of the correctness invariant (not from convenience).
Rule 4: Every idempotency requirement produces a dedup key in the schema.
Rule 5: Source tables and projection tables are never the same table. Mark each clearly.

Schema Components #

For each table:

  • Name and class (source-of-truth vs projection)
  • Fields
  • Primary key
  • Important secondary / lookup keys
  • Partition key candidate (derived from invariant scope)
  • Uniqueness constraints (for dedup keys)
  • Ordering keys (for range reads)

Example Schemas #

Auction (source-of-truth) #

auction_id        UUID        PK
seller_id         UUID        FK
starting_price    Decimal
end_time          Timestamp
status            Enum(CREATED, OPEN, ENDED)
current_highest   Decimal
winning_bid_id    UUID nullable
version           Integer     optimistic lock
created_at        Timestamp

Partition key: auction_id
Uniqueness: (auction_id) — one row per auction

Bid (source-of-truth, append-only) #

bid_id            UUID        PK
auction_id        UUID        FK, partition key candidate
bidder_id         UUID
amount            Decimal
created_at        Timestamp
request_id        UUID        UNIQUE — dedup key
status            Enum(PENDING, ACCEPTED, REJECTED)

Partition key: auction_id (scan all bids for an auction)
Secondary key: (bidder_id, created_at) — scan bids by bidder
Uniqueness: (request_id) — idempotency enforcement

IdempotencyRecord #

request_id        UUID        PK
result_json       JSON
created_at        Timestamp
expires_at        Timestamp   — TTL

Partition key: request_id

DriverStatus (source-of-truth, in Redis) #

Key: driver:{driver_id}:status
Value: {status, claim_id, trip_id, claimed_at}
TTL: set on claim; cleared on release

ScoreAggregate (source-of-truth) #

video_id          UUID        PK / partition key
all_time_score    Long
updated_at        Timestamp
version           Long        — for optimistic updates

TopKProjection (read model / derived view) #

scope             String      PK (composite: granularity:bucket_id)
rank              Integer     sort key
video_id          UUID
score             Double      — with temporal decay applied
updated_at        Timestamp

Note: this table is entirely rebuildable from ScoreAggregate + ScoreEvents

Read-Path Alignment #

Every frequent read must have a direct key or index path. For each important query, verify:

| Query | Required key |
| --- | --- |
| Get auction by id | auction_id PK |
| Get all bids for auction | auction_id partition key on Bid |
| Get top K videos for scope | scope PK on TopKProjection |
| Did user A swipe on user B? | (swiper_id, swiped_id) composite key on Swipe |
| Did user B swipe on user A? | (swiper_id=B, swiped_id=A) — same table, different lookup |
| Drivers within 3km of location | Spatial index on (lat, lng) — not a table key |

Output Artifact #

Schema definitions for all primary objects and derived view tables, with keys and constraints.


Step 10 — Technology Landscape #

Goal #

Map each DP’s required capabilities to a technology shape, then to specific products. Technology is chosen by capability fit, not by fashion or familiarity.

Technology Shape Categories #

| Shape | Best for | Examples |
| --- | --- | --- |
| OLTP store | Transactional conditional writes, state machines, point reads/writes, foreign keys | PostgreSQL, MySQL, CockroachDB, DynamoDB (simple KV conditional) |
| Wide-column / time-series store | High-rate append, per-key range reads (time range), telemetry, events | Cassandra, Scylla, Bigtable, TimescaleDB |
| In-memory store | Sub-millisecond reads, ephemeral hot state, Lease implementation, spatial index, bounded sorted sets | Redis, Memcached |
| Append log / event stream | Ordered event ingestion, replay, async fan-out, buffering between producers and consumers | Kafka, Pulsar, Kinesis |
| Stream processor | Stateful event-by-event computation, windowing, watermarks, exactly-once aggregation | Flink, Spark Streaming, Kafka Streams |
| Object store | Cheap immutable large-object storage, archival, cold data | S3, GCS, Azure Blob |
| Search / analytical store | Full-text search, ad-hoc analytics, flexible secondary queries | Elasticsearch, ClickHouse, BigQuery |

Capability-to-Shape Mapping #

| Required capability | Technology shape |
| --- | --- |
| Conditional update (CAS, optimistic lock) | OLTP store |
| Lease (exclusive time-bound claim) | In-memory store (Redis SETNX + EX) |
| High-throughput append with per-key ordering | Wide-column store or append log |
| Spatial query (entities within radius) | In-memory store with geo index (Redis GEOADD/GEORADIUS) OR dedicated spatial DB |
| Sliding/tumbling window aggregation | Stream processor |
| Bounded sorted set (top-K) | In-memory store (Redis sorted set) |
| Async fan-out with replay | Append log |
| Idempotency store | OLTP store (write once, read on retry) |
| Full-text search | Search store |
| Cold archival | Object store |

Technology Selection Rules #

  1. Match mutation pattern first. If the invariant requires CAS, the store must support atomic conditional writes. Redis sorted sets do not support CAS on set membership — use Postgres or DynamoDB for that.

  2. Never let a projection store become truth. Redis is fast but not the source of truth for anything correctness-critical. If Redis is unavailable, rebuild from the OLTP store. This must be possible.

  3. Partition key alignment. The technology must support efficient access on the partition key you chose in Step 9. If your partition key is (scope, bucket_id) and your queries are by this key, Cassandra with that as the partition key works. A single Postgres table without sharding does not work at high volume.

  4. Check the consistency model of the technology. DynamoDB reads are eventually consistent by default — if your invariant requires strong consistency, request strongly consistent reads (at higher latency and cost). Cassandra QUORUM reads are consistent; ONE reads are not.

Output Artifact #

| DP | Required capability | Technology shape | Specific product | Consistency mode |
| --- | --- | --- | --- | --- |
| Auction acceptance | CAS + transactions | OLTP | PostgreSQL | Serializable or Read Committed + optimistic lock |
| Bid idempotency | Write-once, read-on-retry | OLTP | PostgreSQL (same DB) | Read Committed |
| Driver status / claim | Lease (SETNX + TTL) | In-memory | Redis | Single-instance or Redlock |
| Driver spatial index | Geo query within radius | In-memory + geo | Redis Geo | Eventual (TTL-based refresh) |
| Score event ingestion | High-throughput append | Append log | Kafka (partitioned by video_id) | At-least-once |
| Score aggregation | Windowed stateful compute | Stream processor | Flink | Exactly-once (checkpointing) |
| Top-K projection | Bounded sorted set | In-memory | Redis sorted set | Eventual (rebuilt from Flink output) |
| Swipe store | High-write append, per-key scan | Wide-column | Cassandra (partition: swiper_id) | Quorum writes for right-swipes |
| Match store | CAS on (user_pair canonical key) | OLTP | PostgreSQL | Serializable |

Step 11 — Deployment Topology #

Goal #

Define service boundaries, partition ownership, failure domains, and data flow. Architecture becomes deployable here.

Service Boundary Rules #

A service boundary is justified when two responsibilities have:

  • Different scaling characteristics (write-heavy vs read-heavy)
  • Different failure isolation requirements (payment must not fail because recommendation is slow)
  • Different consistency requirements (real-time bid acceptance vs eventual leaderboard)
  • Different data ownership (auction service owns auction state; notification service owns delivery state)

Do not create service boundaries to satisfy a microservices philosophy. Create them when one of the above is true.

Partition Key Rules #

Partition key must align with the scope of the correctness invariant.

If the invariant is “accepted bids strictly increase within one auction,” the partition key for the acceptance mechanism is auction_id. Partitioning by seller_id puts auctions from the same seller together — this is wrong because two separate auctions from the same seller have no shared invariant.

| Invariant scope | Partition key |
| --- | --- |
| Per auction | auction_id |
| Per video | video_id |
| Per activity | activity_id |
| Per driver | driver_id |
| Per user pair (match) | canonical(min(a,b), max(a,b)) |
| Per geographic cell | H3 cell ID or geohash prefix |
| Per time window + scope | (granularity, bucket_id, scope_id) |
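The per-user-pair row shows why canonicalization matters: both users must derive the same partition key regardless of who initiated the match. A one-line sketch:

```python
def canonical_match_key(user_a: str, user_b: str) -> tuple:
    """Order-independent key: both sides of a match compute the same partition key."""
    return (min(user_a, user_b), max(user_a, user_b))
```

With this key, `INSERT IF NOT EXISTS` on the match table is sufficient to prevent duplicate matches from concurrent mutual swipes.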

Failure Domain Classification #

For each component:

| Component | Correctness criticality | Failure impact | Recovery path |
| --- | --- | --- | --- |
| Auction acceptance (Postgres) | CRITICAL | No bids accepted | Fail closed; primary failover |
| Bid idempotency store | CRITICAL | Duplicate bids possible | Fail closed; replica promotion |
| Driver status (Redis) | CRITICAL | Dispatches fail | Fail closed; Redis Sentinel / Cluster |
| Score event ingest (Kafka) | HIGH | Events lost if unrecoverable | Retry from producer; DLQ |
| Score aggregate (Flink) | HIGH | Leaderboard stale | Restore from Flink checkpoint |
| Top-K projection (Redis) | MEDIUM | Stale leaderboard | Rebuild from aggregate store |
| Notification service | LOW | Match notification delayed | Retry; user sees match on app open |

Correctness-critical components fail closed. Projection-only components may degrade gracefully (serve stale data).

Output Artifact #

  • Service boundary diagram with data flows
  • Partition topology table
  • Failure domain table (above)

Step 12 — Consistency Model #

Goal #

State explicitly which paths require strong consistency and which may be eventually consistent. Systems almost never need the same consistency everywhere. Applying strong consistency uniformly destroys scalability; applying eventual consistency uniformly introduces correctness bugs.

Classification Procedure #

For each write path and read path, ask:

Strong consistency required if: violating the invariant changes business truth — money is charged twice, a driver is double-dispatched, a bid is accepted that should have been rejected.

Eventual consistency acceptable if: the only consequence of a stale read is temporary UX lag — leaderboard is 60s stale, cache is 5s stale, notification arrives 2s after the event.

Output Artifact #

| Path | Consistency | Staleness bound | Why |
| --- | --- | --- | --- |
| Bid acceptance write | Strong | 0 | Eligibility invariant must hold at decision time |
| Auction view read | Eventual | 100ms | Stale view is visible briefly; correctness is in the write path |
| Driver status write (claim) | Strong | 0 | Double-dispatch is a correctness violation |
| Driver location read (for matching) | Eventual | 30s | Location ping is approximate; stale ping = slightly worse match |
| Score event ingest | At-least-once | N/A | Duplicates handled by dedup; missing events are not acceptable |
| Top-K leaderboard read | Eventual | 60s | Product accepts 60s staleness |
| Match creation write | Strong | 0 | Must not create duplicate match |
| Match notification delivery | Best-effort | N/A | Push is unreliable; durable match record is the truth |

Step 13 — Scaling Model #

Goal #

Identify how load grows and where bottlenecks appear. You are not yet tuning numbers; you are naming the dimensions.

Scale Type Taxonomy #

| Type | Definition | Bottleneck location | Strategy |
| --- | --- | --- | --- |
| Write-heavy / append | Many writes, few reads, low contention per key | Ingest throughput | Partition by key, parallel ingest, wide-column store |
| Contention-heavy | Many competing writes on the same key | Single-key throughput | Single-writer partition, or CAS with retry budget |
| Read-heavy | Many reads, low write rate | Read throughput + latency | Cache, CQRS, CDN |
| Fan-out heavy | One write causes many downstream reads or notifications | Fan-out path (notify N subscribers) | Fan-out on Write (small N), Fan-out on Read (large N), hybrid |
| Window / aggregation heavy | Events must be bucketed and aggregated in time windows | Aggregation throughput + state size | Stream processor, pre-aggregation, approximate counting |

Hotspot Identification #

A hotspot is a key or partition that receives disproportionately more traffic than others.

| Hotspot type | Example | Mitigation |
| --- | --- | --- |
| Single-key write hotspot | One live auction with 10K bids/sec | Single-writer partition for that auction_id; queue bids in front |
| Single-key read hotspot | One video goes viral | CDN caching; in-memory cache with short TTL; cache stampede protection (single-flight) |
| Fan-out hotspot | Celebrity with 100M followers posts | Fan-out on Read for celebrity accounts; hybrid fan-out |
| Partition hotspot | All keys hash to same shard | Add random suffix to key for writes; aggregate at read time |
| Time-boundary hotspot | All window aggregates flush at the same second | Stagger flush times; jitter |
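The partition-hotspot mitigation (random suffix on write, aggregate on read) can be sketched in a few lines — `SHARDS` and the key format are illustrative:

```python
import random

SHARDS = 8  # hypothetical fan-out factor for a hot key


def sharded_write(store, key, delta, rng=random):
    """Spread writes for a hot key across N sub-keys to avoid one hot partition."""
    sub_key = f"{key}#{rng.randrange(SHARDS)}"
    store[sub_key] = store.get(sub_key, 0) + delta


def sharded_read(store, key):
    """Reassemble the value by summing the sub-keys at read time."""
    return sum(store.get(f"{key}#{i}", 0) for i in range(SHARDS))
```

The trade is a cheaper write path for a more expensive read path (N point reads instead of one), which is why this only makes sense for write hotspots on commutative updates.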

Output Artifact #

| Dimension | Driver | Hotspot key | Likely bottleneck | Strategy |
| --- | --- | --- | --- | --- |
| Auction bids | bids per auction | Popular auction_id | CAS retry rate on hot auction | Single-writer queue |
| Score events | events per video | Viral video_id | Kafka partition throughput | Partition by video_id; increase partition count |
| Driver matches | dispatches per cell | Dense geo cell | Spatial index read hotspot | Cell-level cache; increase Redis cluster size |
| Match notifications | matches per second | Global (peak time) | Push notification throughput | Async queue with backpressure |

Step 14 — Failure Model #

Goal #

Enumerate what can go wrong and what correct behavior is in each case. A design is incomplete until every failure mode has an explicit response.

Failure Taxonomy #

Client retry / timeout #

Client does not know whether the write succeeded. May retry.

  • Required: Idempotency key on every non-idempotent write.
  • Every write path must be safe to retry N times.

Duplicate delivery #

Same event or message arrives twice from broker (at-least-once delivery).

  • Required: Dedup by event_id before applying to aggregate or state.
  • Never assume exactly-once from a message broker without explicit configuration.

Partial write #

Write to primary state succeeds; downstream event publication or projection update fails.

  • Required: Outbox pattern (event written in same transaction as state change) OR CDC (event derived from WAL).
  • The downstream failure must not leave the system in a state where the event is permanently lost.

Out-of-order arrival #

Events arrive in a different order than they were produced.

  • Required: Explicit policy for each ordered object. Options:
    • Reject and requeue (wait for in-order delivery)
    • Accept and reconcile (watermark-based window)
    • Idempotent application (if operation is order-independent)

Temporal boundary #

Event arrives near a window close or a deadline (e.g., bid arrives at exactly end_time).

  • Required: Explicit clock authority. Client clock is never authoritative. Server time is used for all temporal decisions. Define behavior for events arriving within [end_time - ε, end_time + ε].

Infrastructure failure #

Node crashes, DB primary fails, cache is cold, queue consumer lags.

  • Required: For each component, explicit answers to:
    • What is the impact on correctness? (writes fail closed? reads serve stale?)
    • What is the recovery path? (failover, rebuild from source, restart consumer)

Output Artifact #

| Failure mode | Can it happen? | Correct behavior | Mechanism enforcing it | Recovery if mechanism fails |
| --- | --- | --- | --- | --- |
| Client retries bid | Yes | Same result as first attempt | Idempotency Key | Manual dedup audit |
| Duplicate score event | Yes (Kafka at-least-once) | Ignored; aggregate unchanged | Dedup by event_id | Recompute aggregate from event log |
| Bid write succeeds; event not published | Yes (crash after write) | Event eventually published | Outbox + Relay | Outbox relay catches up on restart |
| Auction end_time race | Yes (bid arrives at exact end_time) | Server time is authoritative; server rejects if server_time ≥ end_time | Server-side time check in CAS predicate | Audit log of rejection vs acceptance at boundary |
| Driver dispatch crashed holding claim | Yes | Claim auto-expires via TTL | Lease TTL | Redis TTL fires; driver returns to available |
| Score aggregate consumer lags | Yes | Leaderboard is stale by lag amount | SLO on freshness; alert on lag > threshold | Restart consumer; replay from last checkpoint |

Step 15 — SLO Definition #

Goal #

Convert broad product expectations into measurable promises. SLOs are the contract between architecture and operations.

SLO Types #

| Type | Definition | Example |
| --- | --- | --- |
| Latency | Percentile latency of a request or event path | Bid acceptance p95 < 200ms |
| Availability | Fraction of time the service is usable | Bid service 99.9% available |
| Correctness | Rate at which invariants are violated | Zero accepted bids after auction close |
| Freshness | Maximum staleness of a projection | Leaderboard lag < 60s at p99 |
| Throughput | Sustained operation rate | 10K score events/sec ingested |

SLO Derivation #

SLOs are derived from:

  • Product requirements (user-visible expectations)
  • Invariant strength (correctness invariants have SLO = zero violations)
  • Downstream SLO budgets (each call in a path consumes part of the latency budget)

Latency budget principle: if the end-to-end SLO is 500ms and the path has 3 hops, each hop must have a timeout budget. Sum of timeouts must be < 500ms. Every timeout in the system is derived from this budget, not set arbitrarily.
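The budget arithmetic can be made explicit. A sketch — `hop_budgets` and the 90% headroom factor are illustrative assumptions, not a prescribed policy:

```python
def hop_budgets(total_ms, weights):
    """Split an end-to-end latency budget across hops in proportion to their
    expected cost, reserving headroom so the sum of timeouts stays under the SLO."""
    headroom = 0.9  # spend only 90% of the budget on hop timeouts
    scale = total_ms * headroom / sum(weights)
    return [round(w * scale) for w in weights]
```

For a 500ms SLO over three hops weighted 1:2:2, this yields timeouts of 90ms, 180ms, and 180ms, summing to 450ms and leaving 50ms of slack for queueing and serialization.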

Output Artifact #

| Path | Metric | Target | Derived from |
| --- | --- | --- | --- |
| Bid acceptance | p95 latency | 200ms | Product UX requirement |
| Bid acceptance | Correctness violations | 0 per day | Invariant is absolute |
| Live comment delivery | p99 delivery lag | 500ms | Product UX requirement |
| Leaderboard freshness | p99 lag | 60s | Product requirement |
| Driver match latency | p95 end-to-end | 2s | Rider UX expectation |
| Payment capture | Duplicate charge rate | 0 | Legal + invariant |
| Score event ingest | Throughput | 50K events/sec | Capacity model |

Step 16 — Operational Parameters #

Goal #

Convert every architectural lever into a named, tunable parameter with its valid range and effect.

Every decision you made in Steps 6–11 that involved a number, a threshold, or a timeout must appear here.

Parameter Categories #

Concurrency parameters #

  • CAS retry limit: max attempts before returning conflict error
  • Lease TTL: how long a claim is held before auto-release
  • Lock timeout: how long a pessimistic lock is held

Cache parameters #

  • Cache TTL: staleness bound; sets the freshness SLO
  • Cache size: eviction begins when exceeded
  • Cache stampede protection: request coalescing window

Queue parameters #

  • Retry budget: max retries before DLQ
  • Backoff base + max: controls retry storm risk
  • Jitter factor: fraction of backoff that is randomized

Window parameters #

  • Window size: tumbling or sliding interval
  • Allowed lateness: how long after window close late events are accepted
  • Watermark lag: how far behind real time the watermark is held

Idempotency parameters #

  • Idempotency key TTL: how long outcomes are retained
  • Key scope: per-user, per-session, or global

Output Artifact #

| Parameter | Component | Default | Min | Max | Effect of increasing | Effect of decreasing |
| --- | --- | --- | --- | --- | --- | --- |
| CAS retry limit | Auction acceptance | 3 | 1 | 10 | More retries = higher acceptance rate under contention, higher tail latency | Fewer retries = faster failure, lower acceptance rate |
| Lease TTL | Driver claim | 30s | 5s | 120s | Longer = more time before stale claim auto-releases | Shorter = more dispatcher timeouts on slow operations |
| Cache TTL | AuctionView | 100ms | 0 | 5s | More staleness | Less staleness, more DB load |
| Retry budget | Score event ingest | 5 | 1 | 20 | More attempts before DLQ | More events lost on transient failure |
| Backoff base | All retry paths | 100ms | 10ms | 1s | Slower retry, less storm | Faster retry, higher storm risk |
| Window size | Hourly leaderboard | 1h | 5m | 24h | Coarser time granularity | Finer granularity, more state |
| Watermark lag | Score events | 10s | 0 | 60s | More late events captured | More latency in window close |
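The retry parameters combine into one formula. A sketch — `backoff_delay` and its defaults mirror the table's base/max/jitter levers but are illustrative:

```python
import random


def backoff_delay(attempt, base=0.1, max_delay=1.0, jitter=0.5, rng=random):
    """Exponential backoff: delay = min(max_delay, base * 2^attempt), with a
    randomized fraction (jitter) to desynchronize retry storms."""
    delay = min(max_delay, base * (2 ** attempt))
    # keep (1 - jitter) of the delay deterministic, randomize the rest
    return delay * (1 - jitter) + delay * jitter * rng.random()
```

Raising `base` slows every retry (less storm risk); raising `jitter` spreads simultaneous retries apart; `max_delay` caps how long a client waits between attempts.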

Step 17 — Runbooks #

Goal #

For every known failure mode (from Step 14) and every hotspot (from Step 13), define the exact procedure an operator takes.

Runbook Template #

Incident: [Name]
Trigger: [What signal indicates this is happening — metric, alert, symptom]
Diagnosis: [How to confirm]
Immediate mitigation: [What to do right now to reduce impact]
Root cause investigation: [How to find why it happened]
Resolution: [How to fully recover]
Prevention: [What to change to prevent recurrence]

Example Runbooks #

Runbook: Auction CAS contention storm #

Trigger: CAS retry rate > 80% on bid acceptance service; p99 bid latency > 2s
Diagnosis: Check bid volume per auction_id; identify hot auction
Immediate mitigation: Enable bid queue in front of hot auction_id; switch to single-writer for that auction
Root cause: Unexpected viral auction; more bidders than designed for
Resolution: Queue drains; single-writer processes bids sequentially; switch back to CAS when load drops
Prevention: Detect auction heat (bid rate > threshold) earlier; pre-emptively queue before contention saturates

Runbook: Leaderboard freshness SLO breach #

Trigger: Leaderboard lag > 60s; alert fires
Diagnosis: Check Flink consumer lag on Kafka score event topic; check Flink job health; check Redis write latency
Immediate mitigation: None needed if correctness is not affected (leaderboard is a projection)
Root cause investigation: Flink checkpoint failure? Consumer restart? Kafka partition rebalance?
Resolution: Restart Flink job from last checkpoint; it will catch up
Prevention: Alert on consumer lag > 30s to catch before SLO breach; increase checkpoint frequency

Runbook: Driver not releasing after dispatch crash #

Trigger: Rider waiting > TTL seconds; driver marked dispatched but unresponsive
Diagnosis: Check driver TTL expiry in Redis; check dispatch service health
Immediate mitigation: Redis TTL fires automatically; driver returns to available pool
Root cause investigation: Was TTL set correctly at claim time? Did dispatcher crash before setting TTL?
Resolution: Verify TTL is always set atomically with claim (SETNX + EX in one command)
Prevention: Audit: never allow claim without TTL in the same Redis command
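
The invariant this runbook's prevention step enforces — claim and TTL set atomically, with a fencing token checked on every resource access — can be sketched in-process. (In Redis the atomic claim is a single `SET key value NX EX seconds`; this in-memory version is purely illustrative, and all names are assumptions.)

```python
class LeaseStore:
    def __init__(self):
        self.leases = {}   # driver_id -> (holder, token, expires_at)
        self.token = 0     # monotonically increasing fencing token

    def claim(self, driver_id, holder, ttl, now):
        """Atomically claim a driver with a TTL; returns a fencing token or None."""
        lease = self.leases.get(driver_id)
        if lease and lease[2] > now:          # live lease held by someone else
            return None
        self.token += 1                       # claim and TTL assigned together
        self.leases[driver_id] = (holder, self.token, now + ttl)
        return self.token

    def validate(self, driver_id, token, now):
        """Resource access must present the current token; stale tokens are rejected."""
        lease = self.leases.get(driver_id)
        return bool(lease and lease[1] == token and lease[2] > now)
```

If the dispatcher crashes after claiming, the TTL expires on its own and the next claim succeeds with a higher token; any delayed write from the dead dispatcher then fails validation instead of corrupting state.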


Step 18 — Observability #

Goal #

Define what must be instrumented for every SLO to be measurable, every failure mode to be detectable, and every parameter to be tunable at runtime.

Observability Types #

| Type | Purpose | Examples |
|---|---|---|
| Metrics | Quantitative, time-series; powers SLO dashboards | CAS success rate, retry rate, latency percentiles, consumer lag |
| Traces | Request-scoped; shows where latency is spent | Distributed trace from bid submission through CAS through event publication |
| Logs | Discrete events; powers incident investigation | CAS rejections with reason, idempotency hits, transition failures |
| Alerts | Fires when metric crosses SLO threshold | Bid latency p95 > 200ms, consumer lag > 30s, CAS retry rate > 50% |

Required Instrumentation #

| Component | Required metric | Alert condition |
|---|---|---|
| Auction acceptance | CAS success rate, CAS retry count, p95/p99 latency | Retry rate > 50%, latency p95 > 200ms |
| Idempotency store | Hit rate, miss rate, TTL expiry rate | Hit rate < expected (suggests missing keys) |
| Driver claim | Claim success rate, TTL expiry rate, fencing token rejections | Fencing rejections > 0 (indicates stale dispatcher) |
| Score event ingest | Events/sec, lag, DLQ depth, dedup hit rate | Lag > 30s, DLQ depth > 0 |
| Leaderboard projection | Freshness lag, Redis write latency | Freshness > 60s |
| Match detection | Match creation rate, duplicate match attempts | Duplicate attempts > 0 |

Output Artifact #

Instrumentation plan: for each component, list of metrics, log events, trace spans, and alert thresholds.


Step 19 — Cost Model #

Goal #

Quantify what grows with users, events, and hotspots so that cost surprises can be predicted and capacity planned.

Growth Dimensions #

For each component, identify:

  • What user behavior drives its growth?
  • What is the growth rate (linear, quadratic, logarithmic)?
  • What is the dominant cost (storage, compute, network, IOPS)?

Example #

| Component | Growth driver | Growth rate | Dominant cost |
|---|---|---|---|
| Bid store | Bids per auction × active auctions | Linear in auctions | Storage + IOPS |
| Score event store (Kafka) | Score events per video × active videos | Linear in content | Kafka storage (retention × throughput) |
| Score aggregate (Flink state) | Active videos | Linear | Memory (Flink state size) |
| Top-K projection (Redis) | Scopes × K | Constant per scope (bounded by K) | Redis memory |
| Driver status (Redis) | Active drivers | Linear | Redis keys |
| Swipe store (Cassandra) | Swipes per user × active users | Quadratic in users (O(U²) worst case if everyone swipes everyone) | Cassandra storage |
| Idempotency store | Write rate × TTL | Constant at steady state | OLTP storage |
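
The Kafka row reduces to simple arithmetic: retained bytes = event size × throughput × retention window × replication factor. A back-of-envelope sketch with assumed numbers (1 KB events, 5K events/sec, 7-day retention, 3× replication — none of these figures come from the framework itself):

```python
event_size_bytes = 1_000
events_per_sec = 5_000
retention_days = 7
replication = 3

# total bytes held on disk across the cluster at steady state
bytes_stored = (event_size_bytes * events_per_sec
                * retention_days * 86_400 * replication)
print(f"{bytes_stored / 1e12:.1f} TB")   # → 9.1 TB
```

Running the same arithmetic per component turns the growth-driver table into a capacity plan: the driver names which input to project forward, and the dominant cost names which unit price to multiply by.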

Step 20 — Evolution Path #

Goal #

Define the smallest correct initial design and the signals that trigger each architectural upgrade.

Staged Design Principle #

Start with the minimum correct design. Add complexity only when a measured signal crosses a defined threshold.

Evolution Stages #

Stage 1: Single service, single OLTP DB #

  • One Postgres instance handles auctions, bids, idempotency
  • Leaderboard computed by query at read time
  • No caching, no streaming
  • Correct up to: ~100 writes/sec, ~1K reads/sec, <10K objects

Upgrade signal: Postgres CPU > 60%, p95 read latency > 500ms

Stage 2: Add caching and async projections #

  • Redis cache for hot read paths (AuctionView, TopK)
  • Background job to compute leaderboard
  • Idempotency TTL cleanup job

Upgrade signal: Write contention on hot objects; CAS retry rate > 20%

Stage 3: Partition write-heavy objects #

  • Shard Cassandra for append-heavy event stores (score events, swipe store)
  • Kafka for score event fan-out to aggregation
  • Flink for windowed aggregation

Upgrade signal: Kafka consumer lag > 30s; Flink state > available memory

Stage 4: Specialized stores and geographic distribution #

  • Redis Geo for spatial driver matching
  • H3 cell index for surge zone computation
  • Multi-region Cassandra for swipe store

Upgrade signal: Geographic expansion; latency SLO from distant regions fails

Data Model Preservation #

At each stage, identify:

  • What schema changes are additive (safe)?
  • What schema changes require migration (risky)?
  • What data model decisions lock you into a pattern (e.g., partition key choice in Cassandra cannot be changed without full table rewrite)?

Flag these decisions early. They are the ones that are hardest to reverse.


Pattern Reference #

Quick lookup: given a mechanism type from Step 6, find the pattern.

Concurrency Control #

| Mechanism | Pattern | Key lever | Failure mode |
|---|---|---|---|
| CAS on version | Optimistic Lock | Retry budget; version column | ABA problem — add timestamp or monotonic ID to version |
| CAS + idempotency | Optimistic Lock + Idempotency Key | Idempotency TTL | Key collision if client reuses keys |
| SETNX + EX | Lease | TTL duration; fencing token | Clock skew between holder and store — use fencing token on all resource access |
| Row lock + transaction | Pessimistic Lock | Lock timeout | Deadlock; lock not released on crash |
| Commutative merge | CRDT | CRDT type (G-Counter, OR-Set, YATA) | Tombstone accumulation for deleted elements |
| Saga | Saga | Compensation idempotency; saga state durability | Compensation failure (double fault) |
| 2PC | 2PC | Coordinator crash recovery | Blocked participants if coordinator crashes after prepare |
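
The discriminant for the CRDT row is commutative merge; a G-Counter is the smallest example. A sketch — per-node count maps, merged by element-wise max:

```python
def g_counter_merge(a, b):
    """Merge two grow-only counters: element-wise max over per-node counts.

    Merge is commutative, associative, and idempotent, so replicas
    converge regardless of delivery order or duplication.
    """
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}

def g_counter_value(counter):
    """Total count is the sum of every node's contribution."""
    return sum(counter.values())
```

By contrast, "bid accepted if highest" depends on the order in which bids arrive, so its merge is not commutative — which is why a CRDT is the wrong mechanism there.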

Read Acceleration #

| Pattern | When to use | Discriminant |
|---|---|---|
| Cache-Aside + TTL | Read » write; staleness OK | Simplest; use first |
| CQRS + Materialized View | Read shape ≠ write shape | Necessary when shape differs |
| Denormalization | Join-heavy hot reads; same shape | When joins dominate latency |
| Scatter-Gather | Query fans across shards | When data is already partitioned |
| Spatial Partition | Geospatial query | Always use before other read patterns for geo queries |
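
The first row — cache-aside with TTL — is small enough to sketch directly. An in-memory version with illustrative names; the `loader` stands in for a read from the source of truth:

```python
class CacheAside:
    def __init__(self, loader, ttl):
        self.loader = loader      # fallback read from the authoritative store
        self.ttl = ttl
        self.entries = {}         # key -> (value, expires_at)

    def get(self, key, now):
        hit = self.entries.get(key)
        if hit and hit[1] > now:
            return hit[0]                        # fresh hit: no DB load
        value = self.loader(key)                 # miss or expired: read through
        self.entries[key] = (value, now + self.ttl)
        return value
```

The TTL is the single staleness lever: raising it sheds DB load at the cost of serving older values, exactly as the parameter table in Step 16 describes.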

Write Scaling #

| Pattern | When to use | Tradeoff |
|---|---|---|
| Hash Partition | Key-based access, uniform distribution | Loses range queries |
| Range Partition | Range queries dominant | Hot “latest” partition for time-series |
| Leaderless Replication | Availability > consistency, multi-region writes | Eventual consistency; conflict handling needed |
| Single-writer per partition | Very hot single key; CAS retry rate too high | Adds queue latency; operational complexity |

Event Flow #

| Pattern | When to use | Discriminant |
|---|---|---|
| Message Queue | Work distribution; rate decoupling | Use when one consumer per message |
| Pub/Sub | Fan-out; multiple independent consumers | Use when N consumers per message |
| Outbox + Relay | Event produced by application logic; must be atomic with state change | Application controls event content |
| CDC | Event IS the DB row change | No application-level event needed |
| Fan-out on Write | Small follower counts; real-time | Fails at celebrity scale |
| Fan-out on Read | Large follower counts; infrequent reads | Higher read latency |

Fault Tolerance #

| Pattern | When to use | Lever |
|---|---|---|
| Idempotency Key | Every non-idempotent write on unreliable network | TTL; key scope |
| Retry + Backoff + Jitter | Any network call | Base delay; max delay; jitter fraction |
| Circuit Breaker | Downstream can degrade (not just fail) | Error rate threshold; half-open probe interval |
| Bulkhead | High-priority traffic must be isolated from low-priority | Thread pool or connection pool size per tier |
| Timeout | Every outbound call | Must be derived from caller’s latency budget |
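
The circuit breaker row names two levers — failure threshold and half-open probe interval. A minimal sketch of both (thresholds and names are illustrative; real implementations usually track an error *rate* over a window rather than a consecutive-failure count):

```python
class CircuitBreaker:
    def __init__(self, max_failures=5, probe_interval=30.0):
        self.max_failures = max_failures
        self.probe_interval = probe_interval
        self.failures = 0
        self.opened_at = None      # None while closed

    def allow(self, now):
        if self.opened_at is None:
            return True                                   # closed: traffic flows
        return now - self.opened_at >= self.probe_interval  # half-open probe

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None                         # probe succeeded: close
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now                      # trip open
```

While open, the breaker sheds load from a degrading downstream instead of queueing calls against it; the probe interval controls how quickly recovery is detected.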

Consensus and Sync #

| Pattern | When to use | Tradeoff |
|---|---|---|
| Leader-Follower | Single-region; strong consistency writes | Leader is bottleneck; failover requires election |
| Quorum (W + R > N) | Leaderless; tunable consistency | Network partition can still split quorum |
| Gossip | Membership; failure detection; eventual propagation | Convergence time = O(log N) |
| State Vector Sync | Distributed sync; two nodes converge incrementally | Vector size grows with node count |
| Merkle Tree | Anti-entropy between replicas; efficient diff | Rehash cost on hot write path |

Time and Approximation #

| Pattern | When to use | Lever |
|---|---|---|
| Windowing | Aggregate over finite time interval | Window size; allowed lateness; watermark lag |
| TTL | Auto-expire entries | Duration; lazy vs eager expiry; add ±10% jitter |
| Approximate Counting | Memory-constrained cardinality or frequency | Count-Min (frequency), HyperLogLog (cardinality), Bloom Filter (membership) |
| Temporal Decay | Recency-weighted ranking | Decay rate λ; floor value |
| Spatial Partition | Geographic proximity queries | Cell resolution; neighbor depth |
| Staged Rollout | Deploy changes safely | Rollout percentage; metric thresholds for pause |
| Scheduled Trigger | Periodic maintenance or aggregation | Schedule frequency; distributed lock to prevent overlap |
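
The temporal-decay row reduces to one formula: effective score = score × e^(−λ · age), clamped to the floor. A sketch with illustrative names:

```python
import math

def decayed_score(score, age_seconds, decay_rate, floor=0.0):
    """Recency-weighted score: exponential decay with a floor value.

    decay_rate (λ) sets the half-life: half_life = ln(2) / λ.
    The floor keeps very old items rankable instead of vanishing to zero.
    """
    return max(floor, score * math.exp(-decay_rate * age_seconds))
```

Choosing λ from a desired half-life (λ = ln 2 / half_life) is usually easier to reason about than picking λ directly.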

Cross-Step Anti-Patterns #

These mistakes appear at specific steps and propagate forward as structural errors.

| Anti-pattern | Where it appears | Consequence | Fix |
|---|---|---|---|
| Treating a projection as primary state | Step 2 | Dual truth; inconsistent reads | Derivability test: can it be recomputed? If yes, it’s a projection. |
| Mixing evolution models in one object | Step 2-3 | CAS predicate is ambiguous; tests become impossible | Split into process object + append event stream |
| Ordering without scope | Step 3 | “Total order” applied globally when only per-partition order is needed → unnecessary coordination | Always bind ordering to a scope |
| Vague invariant (“system should be correct”) | Step 4 | Cannot derive mechanism; design is hand-wavy | Rewrite: “for every bid, amount > current_highest at decision time” |
| Technology as DP (“use Kafka”) | Step 5 | Technology lock-in before invariant is understood | Name the invariant-enforcing mechanism; choose technology in Step 10 |
| CAS without idempotency key | Step 6 | Client retry creates duplicate outcome | Always pair CAS with idempotency check first |
| Lock where Lease is needed | Step 6 | Holder crashes; resource blocked until manual intervention | If holder can crash, use Lease (TTL) |
| 2PC where Saga applies | Step 6 | Coordinator crash blocks all participants | Check: can each step be local with compensation? If yes, use Saga |
| CRDT for non-commutative domain | Step 6 | Merge produces incorrect state | Check commutativity first. “Bid accepted if highest” is not commutative → not CRDT. |
| Dual truth source | Step 7 | “Sometimes stale, sometimes wrong” bugs | One authoritative source; all others are projections with rebuild path |
| Partition key by convenience not invariant | Step 9-10 | Hotspot misaligned with contention; cross-partition transactions required | Partition key = scope of correctness invariant |
| Projection technology as truth | Step 10 | Redis unavailable = correctness unavailable | Redis is always a cache or projection. Rebuild path from OLTP source must exist. |
| Eventual consistency on correctness path | Step 12 | Silent correctness violations | Classify each path explicitly; strong consistency on eligibility-enforcing writes |
| Skipping failure model | Step 14 | Duplicates, partial writes, and race conditions not handled → production bugs | Every write path must answer: what happens on retry? On crash? On duplicate? |

Compact Procedure Checklist #

Use this to verify each step produced its required artifact.

Step 1  — Normalization
  □ Every requirement rewritten as actor-operation-state
  □ Hidden writes and projections surfaced

Step 2  — Object extraction
  □ Primary objects identified with class
  □ Derived views rejected with reason
  □ Every object passes ownership/evolution/ordering/derivability tests

Step 3  — Axis assignment
  □ Every primary object has ownership, evolution, ordering
  □ Every ordering bound to a scope

Step 4  — Invariant extraction
  □ Every requirement maps to typed, testable invariants
  □ Concurrency considered for every eligibility invariant

Step 5  — DP derivation
  □ Every invariant cluster maps to a DP
  □ No DP is a technology name

Step 6  — Mechanism selection
  □ 6.1: Every DP classified by invariant type
  □ 6.2: Ownership × evolution table applied to select mechanism family
  □ 6.3: Q1-Q5 applied to select specific implementation
  □ 6.4: Required mechanism combinations identified (CAS + idempotency, etc.)

Step 7  — Axiomatic validation
  □ Source-of-truth table complete
  □ Dependency table complete; no circular dependencies
  □ Every projection has upstream source and rebuild path

Step 8  — Algorithm design
  □ Every write path has pseudocode
  □ Every process object has complete state machine
  □ Idempotency, retry, and rejection explicitly handled for every path

Step 9  — Logical data model
  □ Every primary object has schema
  □ Every derived view is a separate table marked as projection
  □ Partition keys derived from invariant scope
  □ Dedup keys explicit

Step 10 — Technology landscape
  □ Every DP maps to capability → shape → specific product
  □ Every correctness-critical invariant has named enforcing technology

Step 11 — Deployment topology
  □ Service boundaries justified
  □ Partition keys aligned with invariant scope
  □ Failure domains classified by correctness criticality

Step 12 — Consistency model
  □ Every write and read path classified (strong/eventual/bounded)
  □ Correctness-critical paths are strong
  □ Staleness bounds explicit for eventual paths

Step 13 — Scaling model
  □ Scale type identified (write/contention/read/fanout/aggregation)
  □ Hotspot keys identified
  □ Strategy per hotspot

Step 14 — Failure model
  □ Every write path: retry semantics explicit
  □ Every aggregate: duplicate event handling explicit
  □ Every component: correctness impact on failure classified

Step 15 — SLOs
  □ Latency SLOs per path
  □ Correctness SLOs per invariant
  □ Freshness SLOs per projection
  □ Latency budget allocated across hops

Step 16 — Operational parameters
  □ Every tunable threshold named with valid range and effect

Step 17 — Runbooks
  □ Every failure mode from Step 14 has a runbook
  □ Every hotspot from Step 13 has a mitigation runbook

Step 18 — Observability
  □ Every SLO is measurable
  □ Every failure mode is detectable
  □ Alerts defined for every SLO threshold

Step 19 — Cost model
  □ Growth driver named per component
  □ Dominant cost identified

Step 20 — Evolution
  □ Stage 1 (minimum correct) defined
  □ Upgrade signal per stage defined
  □ Irreversible data model decisions flagged
