System Design Derivation Framework #
A complete, mechanical procedure for deriving distributed system designs from functional requirements — from vague product language to operable architecture, with no hand-wavy steps.
Ordering Principle #
Every layer constrains the next. Skipping a layer makes later choices arbitrary.
Product requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6) ← the mechanical bridge
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
A design is complete only when all 20 steps produce explicit output artifacts. If you cannot produce the artifact for a step, you have not finished that step.
Step 1 — Problem Normalization #
Goal #
Convert product-language requirements into precise operations over state. This is the first act of removing ambiguity.
Why It Matters #
Product requirements blend UI framing, transport behavior, derived views, and domain semantics. Without normalization, you will:
- confuse what the user sees with what the system stores
- miss hidden writes behind apparently simple reads
- fail to detect that “subscribe to alerts” implies a future-constraint store and a trigger engine
- fail to detect that “view live leaderboard” implies a projection, not primary state
Procedure #
For each requirement, identify:
1. Actor — who initiates it?
- user, client, internal service, scheduler, admin, external partner
2. Operation — what state operation is actually happening?
| UI verb | Actual system operation |
|---|---|
| view / see | read base record, OR read projection, OR fetch historical slice, OR subscribe from cursor |
| submit / create | create record, OR append event |
| update / edit | overwrite mutable state, OR transition state machine |
| get notification | system evaluates condition + delivers push |
| subscribe | persist future constraint (intent object) |
| search | query indexed projection |
| swipe / like | append immutable decision event |
| bid | attempt conditional update on contention state |
3. State — which underlying data actually changes or is observed? (Not the screen. Not the endpoint. The data.)
Output Artifact #
| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| Users can bid on an item | User | attempt conditional update | Auction acceptance state |
| Users see current highest bid | Client | read projection | AuctionView (derived) |
| Users get price-drop notifications | System | evaluate + deliver | AlertSubscription + PriceObservation |
| Drivers go online | Driver | overwrite status | Driver presence state |
| Rider requests a ride | Rider | create | Trip + LocationPing |
| Users swipe right on a profile | User | append decision event | Swipe record |
| Users get a match notification | System | detect mutual condition + push | Swipe store + Match record |
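The normalization table lends itself to a structured representation, so the completion test can be checked mechanically rather than by eye. A minimal sketch, assuming a simple record type (all class and field names here are illustrative, not part of the framework):

```python
from dataclasses import dataclass

# Hypothetical record type for one row of the normalization table.
@dataclass(frozen=True)
class NormalizedRequirement:
    original: str       # product-language requirement
    actor: str          # who initiates it
    operation: str      # the actual state operation
    state_touched: str  # the underlying data, not the screen or endpoint

rows = [
    NormalizedRequirement("Users can bid on an item", "User",
                          "attempt conditional update", "Auction acceptance state"),
    NormalizedRequirement("Users see current highest bid", "Client",
                          "read projection", "AuctionView (derived)"),
]

# Completion test becomes checkable: every requirement has all three parts.
assert all(r.actor and r.operation and r.state_touched for r in rows)
```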
Completion Test #
- Every requirement is rewritten as actor-operation-state.
- Every “view” has been translated into the exact underlying read or projection type.
- Every apparently simple requirement has had hidden writes, triggers, or derived views exposed.
Step 2 — Object Extraction #
Goal #
Identify the minimal set of primary state objects. A primary object is one whose state must exist independently and cannot be fully reconstructed from other state.
Object Classes #
These are heuristic categories, not a rigid ontology. The real test is the validity rule below.
| Class | Definition | Examples |
|---|---|---|
| Stable entity | Long-lived, mutable, continuously existing | User, Product, Video, Driver, Problem |
| Event | Immutable, point-in-time occurrence | Bid, Swipe, LocationPing, PriceObservation, ActivitySample |
| Process object | Identity persists across state-machine lifecycle | Auction, Trip, Payment, Activity, JobRun |
| Intent / future constraint | Persists until a future condition is met | AlertSubscription, ScheduledJob, RetryPolicy |
| Relationship object | An edge between entities with its own lifecycle | Friendship, Match, SharePermission, FollowRelation |
| Derived view | Computable from other state | Leaderboard, CurrentHighestBid, Stack, FeedSlice, SurgeMultiplier |
Object Validity Rule #
An object is valid as a primary object if and only if it satisfies all four:
1. Ownership purity — there is one clear answer to “who writes this?”
- Bad: an object written by users, a pipeline, a scheduler, and a moderator under different rules → split it
- Good: Bid is written only by the bidding user
2. Evolution purity — the object evolves in one dominant way
| Evolution family | Description |
|---|---|
| Append-only | New records added; old records never modified |
| Overwrite | One current state updated in place |
| State machine | Guarded transitions between named states |
| Merge | Concurrent writes reconciled via merge law (CRDT) |
- Bad: one object that needs both append history and overwrite current state → it is two objects
- Good: Bid is append-only; Auction is a state machine
3. Ordering purity — there is one ordering relation among its versions/events, bound to a scope
- Bad: “events are ordered” (unbound) → ordered within what scope?
- Good: “Bids are ordered by acceptance time within one auction”
4. Non-derivability — the object cannot be reconstructed from other primary state
- CurrentHighestBid can be derived from accepted Bid records → it is a projection, not primary
- Leaderboard can be derived from Scores → it is a projection
Typical Invalid Objects #
| Candidate | Problem | Fix |
|---|---|---|
| Leaderboard | Derivable from Submissions + ranking rules | Classify as derived view |
| CurrentHighestBid | Derivable from Auction + Bids | Classify as derived view |
| UserFeed | Derivable from Follows + Posts | Classify as derived view |
| ActivityPage | UI artifact | Not a domain object at all |
| SurgeMultiplier | Derivable from supply/demand counts | Classify as derived view |
| Stack (Tinder) | Derivable from Users + Swipes + location | Classify as derived view |
Output Artifact #
Primary Objects: list with class
Derived / Rejected: list with reason for rejection
Completion Test #
- Every candidate object has been classified.
- Every primary object passes all four validity tests.
- Every rejected object has an explicit reason.
Step 3 — Axis Assignment #
Goal #
For every primary object, define exactly three properties. These three axes determine most architecture-relevant behavior.
The Three Axes #
Axis 1: Ownership #
Who is allowed to write this object, and under what contention model?
| Ownership pattern | Meaning |
|---|---|
| Single writer | Exactly one actor writes this object. No concurrency needed. |
| Single writer per partition | One actor per key/scope, but multiple keys in parallel. |
| Multi-writer, one winner | Many actors attempt writes; only one succeeds per atomic decision. |
| Multi-writer, all succeed (commutative) | Concurrent writes merge without conflict. |
| Multi-writer, pipeline | A processing pipeline is the sole writer; upstream actors just enqueue. |
| System-only | Only internal services write; users never write directly. |
Axis 2: Evolution #
How does the object’s state change over time?
| Evolution | Meaning | Implication |
|---|---|---|
| Append-only | New records added; old ones never mutated | Immutable; safe to replicate; dedup by content hash or ID |
| Overwrite | Current state replaced | Need version or timestamp to detect staleness |
| State machine | Transitions guarded by allowed-transition rules | CAS on (state, version); forbidden transitions must be enforced |
| Merge | Concurrent writes reconciled by merge law | Requires CRDT or OT; merge must be commutative + associative |
Axis 3: Ordering #
What ordering relation must hold, and within what scope?
| Ordering | Meaning | Implication |
|---|---|---|
| Total order within scope | Any two instances within the scope are comparable | Sequence number per partition key; single-writer or CAS to assign |
| Partial order | Causal relationship (A happened before B, or concurrent) | Vector clocks; version vectors |
| No meaningful order | Membership/set semantics only | No ordering mechanism needed |
| Causal lifecycle order | State machine transitions must follow valid paths | State machine enforcement; forbidden transitions rejected |
Critical rule: Always bind ordering to a scope. “Events are ordered” is incomplete. “Accepted bids are totally ordered within one auction” is correct.
Output Artifact #
Object: Auction
Ownership: Multi-writer (many bidders), one winner per auction — system enforces
Evolution: State machine (created → open → ended; current_highest updated on each acceptance)
Ordering: Causal lifecycle order for state transitions; total order on acceptance sequence within auction_id
Object: Bid
Ownership: Single writer (the bidding user)
Evolution: Append-only (bid attempt is immutable once created)
Ordering: Total order by created_at within auction_id (for display); acceptance order is separate
Object: Driver
Ownership: Multi-writer, one winner (many dispatchers compete; only one can claim a driver)
Evolution: State machine (offline → available → dispatched → on_trip → available)
Ordering: Causal lifecycle order (forbidden transitions: dispatched → offline without completing trip)
Object: Swipe
Ownership: Single writer (the swiping user)
Evolution: Append-only (a swipe decision is immutable)
Ordering: Total order by timestamp within swiper_id (for exclusion set scan)
Object: AlertSubscription
Ownership: Single writer (the subscribing user)
Evolution: Overwrite (user may update threshold, disable, delete)
Ordering: No meaningful order (set of active subscriptions per user)
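The axis assignments above can be encoded as data so that Step 6 can consume them mechanically. A minimal sketch with illustrative enum and field names (not part of the framework):

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative encodings of the three axes.
class Ownership(Enum):
    SINGLE_WRITER = "single_writer"
    SINGLE_WRITER_PER_PARTITION = "single_writer_per_partition"
    MULTI_WRITER_ONE_WINNER = "multi_writer_one_winner"
    MULTI_WRITER_COMMUTATIVE = "multi_writer_commutative"

class Evolution(Enum):
    APPEND_ONLY = "append_only"
    OVERWRITE = "overwrite"
    STATE_MACHINE = "state_machine"
    MERGE = "merge"

@dataclass(frozen=True)
class AxisAssignment:
    obj: str
    ownership: Ownership
    evolution: Evolution
    ordering: str  # must name its scope, e.g. "total order within auction_id"

auction = AxisAssignment("Auction", Ownership.MULTI_WRITER_ONE_WINNER,
                         Evolution.STATE_MACHINE,
                         "total order on acceptance sequence within auction_id")

# The critical rule from this step: ordering must be bound to a scope.
assert "within" in auction.ordering
```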
Why Axes Matter More Than Patterns #
The axes directly constrain which mechanisms are valid:
- Ownership = “multi-writer, one winner” + Evolution = “state machine” → the mechanism must be CAS or single-writer partition. CRDT is wrong (the operation is not commutative). 2PC is wrong (the scope is a single service, not cross-service). This eliminates the entire pattern space except CAS/Lock/Lease/Single-writer before Q1-Q5 are even applied.
- Ownership = “multi-writer, all succeed” + Evolution = “merge” → the mechanism must be CRDT. Lock is wrong. 2PC is wrong.
Axes are necessary to make Step 6 (mechanism selection) mechanical.
Completion Test #
- Every primary object has explicit ownership, evolution, and ordering.
- Every ordering is bound to a scope.
- No object has mixed evolution without being flagged for splitting.
Step 4 — Invariant Extraction #
Goal #
Convert product requirements into precise statements that must always be true in every valid execution. Invariants are the bridge from object model to architecture. Patterns and technologies are downstream of invariants; invariants are not downstream of anything.
What An Invariant Is #
A precise, testable property of valid system state:
“For every valid state S, condition X must hold.”
It must be:
- Precise — a stranger can determine whether the system satisfies it
- Testable — you can write a check that passes or fails
- Implementation-independent — it describes what must be true, not how to achieve it
- Concurrency-aware — it must hold even under simultaneous writes, retries, and crashes
Invariant Types #
Type 1: Eligibility invariants #
An action is valid only when a condition holds.
An action A on object O is valid only if condition C(O) is true at decision time.
Examples:
- A bid is accepted only if auction.status == OPEN AND server_time < auction.end_time AND bid.amount > auction.current_highest
- A driver is dispatched only if driver.status == AVAILABLE
- A job run starts only if run.status == CLAIMED AND run.claimed_by == this_worker
Implication: Some mechanism must atomically check C and perform A. If C can change between check and action, a concurrency mechanism is required.
Type 2: Ordering invariants #
Events or versions of an object must be comparable and consistent within a scope.
For any two instances I₁ and I₂ within scope S, there is a well-defined ordering, and all readers agree on it.
Examples:
- Accepted bid amounts form a strictly increasing sequence within one auction
- Comments are delivered in the same order to all subscribers of a video
- State machine transitions follow valid paths (no backwards transitions)
Implication: A single authority assigns order within the scope. No two actors may independently assign conflicting positions.
Type 3: Accounting invariants #
An aggregate must exactly reflect its source events.
aggregate(scope S) = f(all events in scope S)
Examples:
- All-time score for video V equals the sum of all ScoreEvent.delta where event.video_id = V
- Monthly active users for month M = count(distinct user_id where last_active ∈ M)
- Inventory quantity = initial_stock − sum(reservations) + sum(cancellations)
Implication: Every event that contributes to the aggregate must be processed exactly once. Late or duplicate events must be handled explicitly.
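An accounting invariant is directly executable as a check against the event source. A toy sketch with in-memory events (all names illustrative):

```python
# aggregate(scope S) = f(all events in scope S); here: score = sum of deltas.
events = [
    {"video_id": "v1", "delta": 10},
    {"video_id": "v2", "delta": 3},
    {"video_id": "v1", "delta": -2},
]

def all_time_score(video_id, events):
    return sum(e["delta"] for e in events if e["video_id"] == video_id)

stored_aggregate = {"v1": 8, "v2": 3}  # what the aggregate store holds

# The invariant check: the stored aggregate exactly reflects its source events.
for vid, total in stored_aggregate.items():
    assert total == all_time_score(vid, events)
```

A check like this, run against the real event log, is how accounting invariants become auditable in production.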
Type 4: Uniqueness / idempotency invariants #
The same logical request, submitted multiple times, produces the same logical outcome.
If request R is submitted N times (N ≥ 1), the observable outcome is identical to submitting it once.
Examples:
- Submitting the same bid request twice does not create two accepted bids
- Retrying a payment capture does not charge the user twice
- Swiping right on the same profile twice creates at most one Match check
Implication: Either the operation is naturally idempotent (append-only with content-addressed ID), or a dedup mechanism must detect and absorb duplicates.
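A dedup gate can be sketched in a few lines. This toy version is single-threaded; in a real system the lookup and the outcome write must themselves be atomic (names illustrative):

```python
# Minimal dedup gate: the same request_id always yields the same outcome,
# no matter how many times the request is retried.
outcomes = {}  # request_id -> stored result

def execute_once(request_id, operation):
    if request_id in outcomes:           # duplicate: absorb, return prior result
        return outcomes[request_id]
    result = operation()                 # first submission: actually execute
    outcomes[request_id] = result
    return result

charges = []
first = execute_once("req-1", lambda: charges.append("capture $10") or "charged")
retry = execute_once("req-1", lambda: charges.append("capture $10") or "charged")

assert first == retry == "charged"
assert len(charges) == 1  # the user was charged exactly once
```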
Type 5: Propagation invariants #
A derived view must eventually reflect its source of truth, and the staleness bound is explicit.
At any time T, derived_view(scope S) = f(source_of_truth(S)) as of time T − ε, where ε is the acceptable staleness.
Examples:
- The auction read view reflects the current highest bid within 100ms of acceptance
- The Tinder stack excludes profiles already swiped before the stack is displayed
- A cache entry is stale by at most TTL seconds
Implication: Some update propagation mechanism must exist. The staleness bound determines whether it must be synchronous (ε = 0) or asynchronous (ε > 0).
Type 6: Access-control invariants #
Reads and writes are gated by authorization.
Actor A may perform operation O on object X only if A has permission P on X.
Examples:
- A user may read only their own activities plus friends’ visible activities
- Only the auction owner may cancel an auction
- A driver may only update their own location
Implication: An authorization check at every service boundary. The authorization state itself (ACL, permission record) is a primary object with its own ownership/evolution model.
Invariant Extraction Procedure #
For each normalized requirement from Step 1:
- Ask: What must always be true for this requirement to be satisfied?
- Ask: What must never be violated, even under concurrent writes, retries, crashes, or late events?
- Classify the invariant by type.
- Write it in precise, testable form.
Output Artifact #
Numbered list:
1. [Eligibility] A bid is accepted only if auction.status == OPEN
AND server_time < auction.end_time AND bid.amount > auction.current_highest_amount.
2. [Ordering] Accepted bid amounts form a strictly increasing sequence within one auction.
3. [Uniqueness] The same bid request_id produces the same logical acceptance result.
4. [Eligibility] A driver is dispatched only if driver.status == AVAILABLE.
5. [Uniqueness] The same dispatch request produces at most one claimed driver.
6. [Accounting] The all-time score for a video equals the sum of all delta values
in ScoreEvent records where video_id matches.
7. [Propagation] The leaderboard for scope S reflects all ScoreEvents within 60 seconds.
8. [Eligibility] A Match is created only if both users have right-swiped each other.
9. [Uniqueness] A Match between user A and user B is created at most once.
Completion Test #
- Every requirement maps to at least one invariant.
- Every invariant is typed, precise, and testable.
- Every invariant is stated in terms of domain state, not implementation.
- Concurrency is explicitly considered for every eligibility invariant.
Step 5 — DP Derivation #
Goal #
Derive the minimal set of design parameters — the minimal mechanisms that enforce the invariant clusters.
A DP is not a technology (not “Postgres” or “Redis”). It is not a pattern name (not “CQRS” or “Event Sourcing”). It is the minimal runtime mechanism that makes an invariant cluster provably enforceable.
Derivation Chain #
requirement → state → invariant → dependency closure → minimal enforcing mechanism
Dependency closure: Ask “what other state is required to enforce this invariant?” This reveals cross-requirement dependencies that are invisible when requirements are read in isolation.
Derivation Procedure #
For each invariant or tightly coupled invariant cluster:
- Identify the state the invariant is about.
- Ask: what must exist at runtime for this invariant to remain true under adversarial conditions (concurrent writes, retries, crashes)?
- Name the minimal mechanism that enforces it.
Examples #
Invariant 1+2 (auction eligibility + ordering):
- State: auction acceptance state (status, current_highest, version)
- What must exist: something that atomically reads current state, validates eligibility, and conditionally updates — rejecting concurrent updates that lost the race
- Minimal DP: Auction acceptance / concurrency control mechanism
Invariant 3 (bid idempotency):
- State: prior outcomes keyed by request_id
- What must exist: a store that maps request_id → prior result, checked before every acceptance attempt
- Minimal DP: Bid idempotency store
Invariant 4+5 (driver dispatch):
- State: driver status
- What must exist: something that atomically claims a driver exclusively — rejecting competing claims
- Minimal DP: Driver claim / exclusion mechanism
Invariant 6+7 (leaderboard accounting + propagation):
- State: per-video aggregate score, top-K projection per scope
- What must exist: (a) a store for aggregate scores that processes every event exactly once; (b) a mechanism that updates top-K when a score crosses the cutoff; (c) a propagation path from score store to projection within 60s
- Minimal DPs: Score aggregate store, Top-K projection maintenance mechanism, Score-to-projection propagation path
DP Validation Rules #
A good DP:
- enforces exactly one invariant cluster
- is not redundant with another DP
- is not a technology or a brand name
- is not so broad as to be meaningless (“backend service” is not a DP)
Output Artifact #
| DP | Invariant cluster enforced |
|---|---|
| Auction acceptance mechanism | Eligibility (open, not expired, amount strictly increasing) |
| Bid idempotency store | Uniqueness of bid acceptance per request_id |
| Auction view / read projection | Propagation: current highest bid visible to readers |
| Score aggregate store | Accounting: all-time score = sum of all events |
| Top-K projection maintenance | Accounting + Propagation: leaderboard reflects aggregate within bound |
| Driver claim mechanism | Eligibility: driver exclusively assigned to one trip |
| Dispatch idempotency store | Uniqueness: same dispatch request produces one outcome |
| Match creation mechanism | Eligibility: Match created only on mutual right-swipe |
| Match idempotency store | Uniqueness: at most one Match per user pair |
Completion Test #
- Every invariant maps to at least one DP.
- No DP exists without a linked invariant.
- No two DPs enforce the same invariant redundantly.
- No DP is a technology name.
Step 6 — Mechanism Selection (The Mechanical Bridge) #
Goal #
For each DP, derive the concrete mechanism using a mechanical, table-driven procedure. This step fills the gap between “bid acceptance mechanism” (abstract DP) and “CAS on (auction_id, version, current_highest, status) in a transactional store” (concrete mechanism).
The procedure has four sub-steps applied in order. Earlier sub-steps constrain later ones.
Sub-step 6.1 — Classify the DP by invariant type #
| Invariant type | Mechanism family required |
|---|---|
| Eligibility | Concurrency control — something must atomically check and conditionally update |
| Ordering | Sequence authority — one actor assigns position within scope |
| Accounting | Aggregation pipeline — events processed exactly once into aggregate |
| Uniqueness / idempotency | Dedup gate — lookup prior result before executing |
| Propagation | Update propagation path — source-of-truth change reaches derived view within ε |
| Access-control | Authorization boundary — permission check before every operation |
Sub-step 6.2 — Apply ownership × evolution to select concurrency mechanism #
This table is the mechanical derivation for concurrency control. Look up the object’s ownership and evolution axes (from Step 3):
| Ownership | Evolution | Required concurrency mechanism | Why |
|---|---|---|---|
| Single writer | Any | None | One writer cannot race with itself |
| Single writer per partition key | Any | None within partition; partition routing is the mechanism | Route all writes for key K to the same owner |
| Multi-writer | Overwrite | CAS on version | Two writers compete to set the new value; one must lose |
| Multi-writer | State machine | CAS on (state, version) | Must reject invalid transitions and concurrent winners |
| Multi-writer | Merge (commutative + associative) | CRDT | Merge is safe without coordination |
| Multi-writer, one at a time, no crash concern | Exclusive hold | Pessimistic Lock | Blocks others for duration of operation |
| Multi-writer, one at a time, holder may crash | Exclusive hold | Lease (TTL + fencing token) | Auto-releases if holder dies; fencing token prevents stale commands |
| Cross-service, decomposable into local steps | Multi-step write | Saga | Each step is local; compensations undo on failure |
| Cross-service, not decomposable | Atomic multi-service write | 2PC | All-or-nothing across services; use only when Saga cannot apply |
Decision rules:
- If evolution = merge AND ownership = multi-writer → always CRDT. Do not use Lock or CAS.
- If ownership = single-writer → never add concurrency mechanism. It adds cost with no benefit.
- If holder may crash while holding exclusive access → Lease, not Pessimistic Lock. Lock held by dead process blocks forever.
- Use 2PC only as last resort. It blocks on coordinator failure.
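The CAS-on-version row can be sketched as an optimistic update. In this single-process toy the compare and write are plain Python; a real store performs them as one atomic operation (e.g. a conditional `UPDATE ... WHERE version = $v`). Names are illustrative:

```python
# Optimistic CAS on a version number, for "multi-writer + overwrite".
record = {"value": "a", "version": 1}

def cas_update(record, expected_version, new_value):
    # In a real store this read-compare-write is one atomic operation.
    if record["version"] != expected_version:
        return False  # a concurrent writer won; caller must re-read and retry
    record["value"] = new_value
    record["version"] += 1
    return True

assert cas_update(record, expected_version=1, new_value="b") is True
assert record == {"value": "b", "version": 2}
# A writer holding the stale version loses:
assert cas_update(record, expected_version=1, new_value="c") is False
assert record["value"] == "b"
```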
Sub-step 6.3 — Apply Q1-Q5 to discriminate within the mechanism family #
After Sub-step 6.2 has selected the mechanism family, Q1-Q5 selects the specific implementation and adds any additional mechanisms required.
Q1 — Coordination scope #
Where does contention actually localize?
| Scope | Implication |
|---|---|
| Within a single data structure (same process, shared memory) | Hardware CAS (compare-and-swap instruction). No network round-trip. |
| Within a service (multiple stateless instances, shared external store) | Distributed CAS: SET key value NX in Redis, conditional write in DynamoDB, UPDATE ... WHERE version = $v in Postgres |
| Cross-service (multiple independent services) | Saga (if decomposable) or 2PC (if not). No in-process lock can span services. |
| Cross-region (geographically distributed, partition must not block) | CRDT or Leaderless Replication. Any mechanism requiring a quorum write may block during partition. |
Applications:
- Auction: scope = within auction service → distributed CAS on shared store.
- Driver dispatch: scope = within dispatch service → distributed CAS on shared Redis.
- Collaborative editor: scope = cross-region, partition must not block → CRDT.
Q2 — Failure model #
What breaks, and what must the mechanism survive?
| Failure | Required additional mechanism |
|---|---|
| Writer crashes after write, before response | Idempotency Key: client retries; server detects duplicate via stored outcome |
| Lock holder crashes while holding lock | Lease (TTL): store auto-releases after timeout. Add fencing token to reject stale commands. |
| Network partition between services | CRDT (no coordination) or Leaderless with quorum. 2PC blocks during partition. |
| Slow degradation of downstream service | Circuit Breaker + Timeout. The downstream call must have a maximum wait and a fail-fast threshold. |
| Retry storm on recovery | Retry + Exponential Backoff + Jitter. Without jitter, all retrying clients hit at the same moment. |
| Partial write (step A succeeds, step B fails) | Outbox + Relay or Saga compensation. Partial state must be detectable and recoverable. |
Applications:
- Auction bid: writer (client) may crash after sending → add Idempotency Key on request_id.
- Driver dispatch: dispatcher may crash while driver is claimed → use Lease (TTL on claim), not plain lock.
- Payment service call: payment gateway may degrade → add Circuit Breaker + Timeout.
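The retry-storm row deserves a concrete shape: exponential backoff with full jitter (sample uniformly up to an exponentially growing ceiling) is one common variant. A sketch with illustrative parameters:

```python
import random

# Retry delay schedule with exponential backoff and full jitter.
def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.Random(42)):
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))   # 0.1, 0.2, 0.4, ... capped
        delays.append(rng.uniform(0, ceiling))      # full jitter spreads clients out
    return delays

delays = backoff_delays(6)
assert len(delays) == 6
assert all(0 <= d <= 10.0 for d in delays)
```

Without the jitter term, every client that failed at the same moment retries at the same moment, recreating the spike that caused the failure.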
Q3 — Data properties #
What does the data structure allow?
| Property | Implication |
|---|---|
| Operations are commutative + associative (order doesn’t matter, duplicates are idempotent) | CRDT — zero coordination cost. Use before any lock-based mechanism. |
| Data is content-addressed (identity = hash of content) | Hash is a natural idempotency key. Dedup is free at the storage layer without a separate idempotency table. |
| Data is monotonically increasing | CAS with monotonicity check is sufficient. No need for full version history. |
| Data requires total order within a scope | A single authority must assign sequence numbers within the scope. |
| Data is spatially structured (lat/lng or region) | Spatial Partition must precede any other read-path optimization. |
| Data is time-windowed | Windowing mechanism must handle late events and window boundaries explicitly. |
Applications:
- Score aggregation: ScoreEvent.delta is commutative (addition is commutative and associative) → stream aggregation with at-least-once delivery + idempotent delta application is sufficient. No lock needed.
- Block storage (Dropbox): block identity = SHA-256(content) → the content hash IS the idempotency key. Duplicate uploads are free to ignore.
- Geo-search (driver matching): data is spatially structured → Spatial Partition (H3/geohash) must be the primary index. Do not range-scan on (lat, lng) columns.
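The commutative-merge row is what CRDTs exploit. A grow-only counter (G-Counter) is the smallest example; this sketch shows the merge laws directly:

```python
# G-Counter CRDT: each writer increments only its own slot, and merge takes
# the per-slot maximum. Merge is commutative, associative, and idempotent,
# so replicas converge with zero coordination.
def increment(state, node, amount=1):
    state = dict(state)
    state[node] = state.get(node, 0) + amount
    return state

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

r1 = increment({}, "node-a")        # replica 1 sees one increment
r2 = increment({}, "node-b", 2)     # replica 2 sees a different increment

assert merge(r1, r2) == merge(r2, r1)             # commutative
assert merge(r1, merge(r1, r2)) == merge(r1, r2)  # idempotent
assert value(merge(r1, r2)) == 3                  # all increments counted once
```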
Q4 — Access pattern #
How is the data read and written?
| Pattern | Mechanism |
|---|---|
| Read » Write, staleness tolerable | Cache-Aside with TTL |
| Read » Write, read shape ≠ write shape | CQRS + Materialized View |
| Read » Write, read shape = write shape, join-heavy | Denormalization |
| Write » Read | Append-only Log, Hash Partition, Leaderless Replication |
| Query fans across multiple shards | Scatter-Gather |
| Query is geospatial (find entities within radius) | Spatial Partition (H3, Geohash, S2, QuadTree) |
| Reads are time-bounded (recent data dominates) | Temporal Decay, TTL, Windowing |
Applications:
- Auction read view: reads » writes on auction detail; read shape ≈ write shape → Cache-Aside with short TTL. Invalidate on acceptance event.
- Driver matching: geospatial query (“drivers within 3km”) → Spatial Partition (H3 index in Redis Geo). Cache-Aside alone does not help: without a spatial index the query is still an O(N) scan.
- Leaderboard: reads » writes; read shape (ranked list) ≠ write shape (score events) → CQRS + Materialized View (top-K projection as the read model).
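The Cache-Aside row can be sketched with an explicit TTL. This toy uses in-memory dicts for both cache and database, and timestamps are passed in to keep it deterministic (all names illustrative):

```python
import time

# Cache-Aside with TTL: read the cache first, fall back to the database on
# a miss or an expired entry, and repopulate the cache with a new expiry.
db = {"auction:1": {"current_highest": 120}}
cache = {}   # key -> (value, expires_at)
TTL = 0.05   # illustrative: 50 ms staleness bound

def read(key, now=None):
    now = time.monotonic() if now is None else now
    hit = cache.get(key)
    if hit and hit[1] > now:
        return hit[0]                   # fresh cache entry
    value = db[key]                     # miss/expired: go to source of truth
    cache[key] = (value, now + TTL)
    return value

assert read("auction:1", now=0.0) == {"current_highest": 120}
db["auction:1"] = {"current_highest": 130}
assert read("auction:1", now=0.01) == {"current_highest": 120}  # stale ≤ TTL
assert read("auction:1", now=0.06) == {"current_highest": 130}  # TTL expired
```

The two final assertions are the propagation invariant made concrete: the view may lag the source of truth, but by at most TTL.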
Q5 — Coupling requirement #
How tightly must the producer and consumer of a change synchronize?
| Coupling | Mechanism |
|---|---|
| Synchronous, atomic (producer needs confirmation before proceeding) | In-transaction write. CAS, 2PC, or Pessimistic Lock within same transaction. |
| Synchronous, best-effort (producer sends, expects delivery but tolerates retry) | Retry + Backoff + Jitter |
| Asynchronous, event produced by application logic | Outbox + Relay: write event to outbox table in same DB transaction; relay publishes to broker |
| Asynchronous, event is the DB row change itself | CDC: tail the DB write-ahead log; no application-level outbox needed |
| Fan-out to N consumers at write time | Fan-out on Write: push to each consumer’s inbox immediately |
| Fan-out to N consumers at read time | Fan-out on Read: each consumer pulls and merges at query time |
| Notification must be reliable even if consumer is offline | Persist notification record + async retry; never fire-and-forget |
Applications:
- Auction acceptance → view update: the view can be slightly stale (ε = 100ms) → async propagation via CDC or event publication → eventually consistent view.
- Bid submission → outbox: the bid result must trigger downstream events (notify watchers) atomically with the bid write → Outbox + Relay (write both in the same Postgres transaction).
- Match notification: the user may be offline → persist the Match record durably; the push notification is best-effort with retry; the user sees the match on app open regardless.
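The Outbox + Relay row can be sketched with in-memory lists standing in for the database tables and the broker. In a real system the two appends in accept_bid happen in one DB transaction, which is the whole point (names illustrative):

```python
# Outbox + Relay: the business write and the event go into the same
# "transaction", so the event cannot be lost if the write commits.
bids = []
outbox = []   # unpublished events, persisted alongside the write
broker = []   # what consumers eventually see

def accept_bid(bid):
    # In a real system both appends commit in one database transaction.
    bids.append(bid)
    outbox.append({"type": "bid_accepted", "bid_id": bid["id"]})

def relay_once():
    # A background relay publishes outbox rows, then marks them sent.
    while outbox:
        broker.append(outbox.pop(0))

accept_bid({"id": "b1", "amount": 100})
assert len(bids) == 1 and len(outbox) == 1 and broker == []
relay_once()
assert broker == [{"type": "bid_accepted", "bid_id": "b1"}]
assert outbox == []
```

Because the relay may retry, consumers must still be idempotent: the broker side of this path is at-least-once, exactly as Sub-step 6.4 warns.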
Sub-step 6.4 — Check for required mechanism combinations #
Some invariants require multiple mechanisms. Missing one creates a correctness gap.
| Invariant | Required mechanism | Additional mechanism always needed |
|---|---|---|
| Eligibility under concurrency | CAS or Lease | + Idempotency Key (client retry creates duplicate without it) |
| Eligibility + crash of holder | Lease | + Fencing token (stale holder must be rejected on every resource access) |
| Accounting via event stream | Stream aggregation | + Idempotency per event (duplicate events corrupt aggregate) |
| Propagation to derived view | CDC or Outbox | + Idempotency in consumer (message broker delivers at-least-once) |
| Ordering within scope | Sequence authority | + Same authority for history and live path (two authorities = split brain) |
| Cross-service write | Saga | + Idempotency on each local step (Saga steps may retry) |
The most commonly missed combination: CAS without Idempotency Key. CAS prevents concurrent duplicates but not retry duplicates. A client whose request timed out will retry. If the first request succeeded (CAS applied), the retry hits a changed version and either: (a) correctly fails (if client checks), or (b) silently succeeds again if the CAS predicate accidentally matches again. Always pair CAS with an Idempotency Key check first.
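The required combination reads as a three-step sequence: dedup lookup, CAS, record outcome. A toy single-threaded sketch (the CAS predicate is plain code here; a real store evaluates it atomically; all names illustrative):

```python
# Idempotency Key check FIRST, then CAS. CAS alone stops concurrent
# duplicates; the idempotency store stops retry duplicates.
auction = {"status": "OPEN", "current_highest": 100, "version": 7}
idempotency_store = {}  # request_id -> prior outcome

def place_bid(request_id, amount, expected_version):
    if request_id in idempotency_store:          # 1. dedup gate: absorb retries
        return idempotency_store[request_id]
    ok = (auction["version"] == expected_version  # 2. CAS predicate (atomic in
          and auction["status"] == "OPEN"         #    a real store)
          and amount > auction["current_highest"])
    if ok:
        auction["current_highest"] = amount
        auction["version"] += 1
    outcome = "accepted" if ok else "rejected"
    idempotency_store[request_id] = outcome      # 3. record outcome for retries
    return outcome

assert place_bid("req-9", 120, expected_version=7) == "accepted"
# A retry of the same request never re-runs the CAS predicate:
assert place_bid("req-9", 120, expected_version=7) == "accepted"
assert auction["current_highest"] == 120 and auction["version"] == 8
```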
Worked Examples of Mechanical Derivation #
Example A: Auction bid acceptance #
From Step 3:
- Object: Auction
- Ownership: Multi-writer, one winner per auction
- Evolution: State machine (current_highest is conditionally updated)
- Ordering: Total order on acceptance sequence within auction_id
From Step 4:
- Invariant 1 [Eligibility]: bid accepted only if OPEN, not expired, amount > current_highest
- Invariant 2 [Ordering]: accepted amounts strictly increase
- Invariant 3 [Uniqueness]: same request_id = same outcome
Step 6 derivation:
Sub-step 6.1: Eligibility + Ordering → concurrency control + sequence authority
Sub-step 6.2: Ownership=multi-writer one-winner + Evolution=state machine → CAS on (state, version)
Sub-step 6.3:
- Q1: within auction service → distributed CAS on shared store
- Q2: client may crash after sending → add Idempotency Key on request_id
- Q3: not commutative (two bids can’t both be highest) → CAS confirmed, not CRDT
- Q4: write path is the hot path; read path is read-heavy → Cache-Aside for read view
- Q5: bid write → view update can be async (ε acceptable) → CDC or event publication
Sub-step 6.4: CAS + Idempotency Key (required combination for retryable eligibility writes)
Result:
- CAS: `UPDATE auction SET current_highest=$amount, winning_bid_id=$bid_id, version=version+1 WHERE auction_id=$id AND version=$old_version AND status='OPEN' AND end_time > now() AND current_highest < $amount`
- Idempotency: check `idempotency_store[request_id]` before executing CAS; return stored result on hit
- Read view: Cache-Aside with TTL; invalidate on CDC event from auction table
Example B: Driver dispatch (ride-hailing) #
From Step 3:
- Object: Driver
- Ownership: Multi-writer, one winner (many dispatchers compete per driver)
- Evolution: State machine (available → dispatched → on_trip → available)
- Ordering: Causal lifecycle order (forbidden: dispatched → offline without completing trip)
From Step 4:
- Invariant [Eligibility]: driver dispatched only if driver.status == AVAILABLE
- Invariant [Uniqueness]: same dispatch request → same outcome (one driver claimed)
- Invariant [Eligibility crash-safety]: if dispatcher crashes, driver is not held indefinitely
Step 6 derivation:
Sub-step 6.1: Eligibility + crash-safety → concurrency control with timeout
Sub-step 6.2: Ownership=multi-writer one-winner + holder may crash → Lease (TTL + fencing token)
Sub-step 6.3:
- Q1: within dispatch service → distributed CAS on shared in-memory store (Redis)
- Q2: dispatcher crashes while driver is claimed → TTL auto-releases claim; fencing token rejects stale dispatcher commands
- Q3: not commutative → Lease confirmed, not CRDT
- Q4: high write rate (many dispatches/sec), O(1) latency required → Redis, not relational DB with row locks
- Q5: dispatch result must be synchronous (dispatcher needs immediate response) → in-request atomic claim
Sub-step 6.4: Lease + Idempotency Key; Lease + Fencing Token (both required for crash-safety)
Result:
- Claim: `SET driver:{id}:status dispatched NX EX {ttl_seconds}` (Redis: set if not exists, with TTL)
- Fencing token: monotonically increasing claim_id stored with status; dispatcher must present claim_id on every subsequent command; stale claim_id → reject
- Idempotency: check idempotency store by dispatch_request_id before attempting claim
This is the mechanical derivation of “CAS on driver status in Redis with TTL” from first principles — without pattern memorization.
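The claim-with-TTL and fencing-token mechanics can be sketched over a plain dict standing in for Redis, with a simulated clock in place of real TTL expiry (all names here are illustrative):

```python
import itertools

# Sketch of a Lease with TTL + fencing token. A dict stands in for
# Redis SET NX EX; `now` is an explicit clock. Names are illustrative.

class DriverClaims:
    def __init__(self):
        self.claims = {}                      # driver_id -> (claim_id, expires_at)
        self.next_token = itertools.count(1)  # monotonic fencing tokens

    def claim(self, driver_id, now, ttl):
        existing = self.claims.get(driver_id)
        if existing and existing[1] > now:    # live claim: reject (NX semantics)
            return None
        token = next(self.next_token)         # fencing token for this holder
        self.claims[driver_id] = (token, now + ttl)
        return token

    def command(self, driver_id, token, now):
        existing = self.claims.get(driver_id)
        if not existing or existing[1] <= now:   # TTL expired: auto-released
            return "rejected: no live claim"
        if token != existing[0]:                 # stale dispatcher
            return "rejected: stale fencing token"
        return "ok"

claims = DriverClaims()
t1 = claims.claim("d1", now=0, ttl=30)
assert t1 == 1
assert claims.claim("d1", now=10, ttl=30) is None     # still held
t2 = claims.claim("d1", now=40, ttl=30)               # TTL expired -> new claim
assert claims.command("d1", t1, now=41) == "rejected: stale fencing token"
assert claims.command("d1", t2, now=41) == "ok"
```

The fencing token is what makes a crashed-then-revived dispatcher harmless: its old claim_id no longer matches the live claim.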
Example C: Leaderboard (YouTube Top K) #
From Step 3:
- Object: ScoreEvent
  - Ownership: Pipeline (system writes after processing)
  - Evolution: Append-only
  - Ordering: Total order by timestamp within video_id
- Object: VideoAggregate (derived → not primary, but a necessary DP)
  - Ownership: Pipeline (single writer per video_id partition)
  - Evolution: Overwrite (increment on each event)
  - Ordering: No meaningful order (only latest value matters)
From Step 4:
- Invariant [Accounting]: all-time score = sum of all ScoreEvent.delta for video_id
- Invariant [Accounting]: window score = sum of ScoreEvent.delta for video_id within window bucket
- Invariant [Uniqueness]: duplicate ScoreEvent is processed at most once
- Invariant [Propagation]: top-K projection reflects aggregates within 60s
Step 6 derivation:
Sub-step 6.1: Accounting + Uniqueness + Propagation → aggregation pipeline + dedup + propagation path
Sub-step 6.2: Ownership=pipeline (single writer per partition) → no concurrency mechanism for the aggregate. Dedup required for at-least-once delivery.
Sub-step 6.3:
- Q1: within aggregation service, partitioned by video_id → single-writer per partition; no cross-service coordination
- Q2: event delivered at-least-once → Idempotency per event (dedup by event_id before incrementing aggregate)
- Q3: delta addition is commutative → stream aggregation is correct; CRDT G-Counter semantics apply
- Q4: read shape (ranked list) ≠ write shape (event stream) → CQRS + Materialized View; Scatter-Gather across scopes for multi-scope queries; Temporal Decay for recency weighting
- Q5: propagation from aggregate to projection can be async (60s bound) → CDC or internal event → projection updater
Sub-step 6.4: Stream aggregation + Idempotency per event + Temporal Decay for recency
Result:
- Ingest: Kafka topic partitioned by video_id
- Aggregate: Flink job, keyed by video_id, dedup by event_id, increment all-time and windowed aggregates
- Projection: Redis sorted set per scope; updated when new score crosses cutoff
- Temporal Decay: score in sorted set = raw_score × e^(−λ × age_hours) recomputed on each update
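The decay formula above is small enough to show directly (λ = 0.1 is an arbitrary illustrative constant, not from the text):

```python
import math

# Sketch of temporal-decay scoring: decayed = raw_score * e^(-lambda * age_hours).
# The lambda value is illustrative.

def decayed_score(raw_score, age_hours, lam=0.1):
    return raw_score * math.exp(-lam * age_hours)

# Fresh content can outrank an older video with a higher raw score.
old = decayed_score(1000, age_hours=48)   # heavily decayed
new = decayed_score(100, age_hours=1)     # barely decayed
assert new > old

# A projection stores the decayed value in the sorted set and
# recomputes it on each update.
```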
Output Artifact #
For each DP, produce:
| DP | Mechanism | Components | Failure handling |
|---|---|---|---|
| Auction acceptance | CAS on (auction_id, version) + Idempotency Key | Postgres conditional UPDATE + idempotency table | Retry: check idempotency first; CAS fail: reread and retry bounded times |
| Driver claim | Lease (Redis SETNX + EX) + Fencing token | Redis key per driver_id + claim_id | TTL auto-releases stale claim; reject stale claim_id |
| Score aggregate | Stream aggregation (Flink) + event dedup | Kafka + Flink keyed state | Exactly-once via Flink checkpointing; event dedup by event_id |
| Top-K projection | CQRS + Materialized View + Temporal Decay | Redis sorted set per scope | Rebuild from aggregate store on projection corruption |
Step 7 — Axiomatic Validation #
Goal #
Validate the design for: (1) single source of truth per concept, (2) no hidden coupling between DPs, (3) no circular dependencies.
Single Source of Truth #
For every concept, ask: if two systems disagree about this value, which one is correct?
If the answer is ambiguous, you have dual truth. Dual truth is the source of “sometimes stale, sometimes wrong” systems.
Source-of-truth table:
| Concept | Authoritative source | Derived projections |
|---|---|---|
| Current highest bid | Auction acceptance state (Postgres) | AuctionView cache (Redis) |
| Driver status | Driver status store (Redis) | Dispatch service in-memory state |
| All-time video score | Score aggregate store (Flink keyed state) | Top-K projection (Redis sorted set) |
| Comment order | Authoritative comment store (sequence assigned at write) | History slice, live subscription |
| Match existence | Match table (Postgres) | Notification (derived event) |
Invalid patterns:
- Redis cache is treated as authoritative because it is fast. Fix: Redis is a projection; Postgres is truth.
- History and live stream assign their own sequence numbers independently. Fix: one sequence authority for both.
- Two services each compute “current surge multiplier” from raw data independently. Fix: one aggregator is authoritative; others read its output.
Independence Check #
Build a DP-to-invariant matrix. Look for:
- Hidden shared decisions: two DPs that both depend on the same underlying choice (e.g., two DPs that both need “total order within video_id” — there should be one sequence authority, not two)
- Circular dependency: DP A depends on DP B’s output which depends on DP A’s output
- Accidental coupling: a cache that can diverge permanently from its source with no rebuild path
Acceptable dependency: Leaderboard projection depends on score aggregate. This is a real, one-directional dependency. Unacceptable dependency: Leaderboard projection can permanently diverge from score aggregate with no reconciliation. Fix: add rebuild path (replay aggregate store → rebuild projection).
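The rebuild path can be sketched in a few lines (data shapes are illustrative): the diverged projection is discarded entirely and recomputed from its authoritative upstream.

```python
# Sketch of a projection rebuild path: the corrupt projection is thrown
# away and recomputed from the source-of-truth aggregate store.
# Data shapes are illustrative.

aggregate_store = {"v1": 500, "v2": 900, "v3": 120}  # source of truth

def rebuild_top_k(aggregates, k):
    # Discard the corrupt projection; recompute from upstream.
    ranked = sorted(aggregates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

projection = [("v2", 999999), ("v9", 1)]  # diverged / corrupt
projection = rebuild_top_k(aggregate_store, k=2)
assert projection == [("v2", 900), ("v1", 500)]
```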
Output Artifact #
- Source-of-truth table (above)
- Dependency table: DP → depends on
- List of projections with: upstream source, maximum lag, rebuild path
Completion Test #
- Every concept has exactly one authoritative source.
- Every projection has a documented upstream source and a rebuild path.
- No correctness-critical concept has competing truths.
- No circular dependency exists.
Step 8 — Algorithm and State-Transition Design #
Goal #
Make every write path and lifecycle transition directly implementable. “Use CAS” is not an algorithm. “Read current version, validate predicates, attempt UPDATE WHERE version = $v, retry bounded times on failure” is an algorithm.
What to Specify #
For every write path:
- Input: what arrives
- State read: what must be read before deciding
- Validation: what conditions must hold (maps to eligibility invariants)
- Atomic update: what changes atomically and under what predicate
- Output: what is returned
- Failure behavior: what happens on retry, duplicate, race, timeout, or rejection
For every process object: define all states, all allowed transitions, all forbidden transitions, terminal states.
Algorithm Templates #
Template 1: Eligibility write with CAS #
function accept(request_id, object_id, payload, retries_remaining = MAX_RETRIES):
// Step 1: Idempotency check
if prior_result = idempotency_store.get(request_id):
return prior_result
// Step 2: Read current state
state = store.read(object_id)
// Step 3: Validate eligibility
if not eligible(state, payload):
result = rejection_reason(state, payload)
idempotency_store.set(request_id, result, ttl=24h)
return result
// Step 4: Attempt CAS
new_state = apply(state, payload)
success = store.conditional_update(
object_id,
expected_version = state.version,
new_state = new_state
)
// Step 5: Handle CAS result
if success:
result = accepted(new_state)
idempotency_store.set(request_id, result, ttl=24h)
emit_event(object_id, new_state) // for projections
return result
else:
// Another writer won the race
if retries_remaining > 0:
return accept(request_id, object_id, payload, retries_remaining - 1) // retry with fresh read
else:
return conflict_or_rejection()
Template 2: State machine transition #
allowed_transitions = {
CREATED: [OPEN],
OPEN: [ENDED],
ENDED: [], // terminal
}
function transition(object_id, new_state, request_id):
if prior = idempotency_store.get(request_id): return prior
current = store.read(object_id)
if new_state not in allowed_transitions[current.state]:
return invalid_transition_error(current.state, new_state)
success = store.conditional_update(
object_id,
expected_version = current.version,
new_state = {state: new_state, version: current.version + 1}
)
if not success: reread and retry (bounded attempts), else return conflict error
Template 3: Append-only event with dedup #
function ingest_event(event_id, payload):
if dedup_store.exists(event_id): return already_processed
// Must be atomic: append event and mark dedup in one step, or a crash between them reprocesses the event
store.append(event_id, payload)
dedup_store.set(event_id, processed, ttl=retention_period)
// Update downstream aggregate
aggregate = aggregate_store.increment(payload.key, payload.delta)
if aggregate crosses top_k_cutoff:
projection_store.update(payload.scope, payload.key, aggregate)
Template 4: Saga (multi-service write) #
function saga(saga_id, steps):
if prior = idempotency_store.get(saga_id): return prior
completed_steps = []
for step in steps:
result = execute_with_idempotency(saga_id + ":" + step.id, step)
if result.failed:
// Compensate all completed steps in reverse
for completed in reversed(completed_steps):
compensate_with_idempotency(saga_id + ":compensate:" + completed.id, completed)
outcome = saga_failed(result)
idempotency_store.set(saga_id, outcome)
return outcome
completed_steps.append(step)
outcome = saga_succeeded()
idempotency_store.set(saga_id, outcome)
return outcome
State Machine Specification #
For every process object, produce:
States: {CREATED, OPEN, ENDED}
Allowed transitions:
CREATED → OPEN (when: manually opened or scheduled open time reached)
OPEN → ENDED (when: end_time reached OR owner closes)
Forbidden transitions:
ENDED → any (ENDED is terminal)
OPEN → CREATED (no reversal)
Terminal states: {ENDED}
Clock authority: server_time from the auction service. Client clock is never trusted.
Duplicate Handling Rules #
Every write path must explicitly state its idempotency semantics:
| Write path | Idempotency key | Duplicate behavior |
|---|---|---|
| Bid submission | request_id (client-generated UUID) | Return same acceptance/rejection result |
| Driver dispatch | dispatch_request_id | Return same claimed/failed result |
| Score event ingestion | event_id (producer-assigned) | Skip if already processed |
| Match creation | canonical_key = (min(a,b), max(a,b)) | INSERT IF NOT EXISTS; return existing match |
| Payment capture | trip_id + attempt_number | Return stored capture result |
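The canonical-key row for match creation can be sketched as follows (a dict stands in for the match table; names are illustrative): the pair (a, b) and (b, a) map to the same key, so INSERT IF NOT EXISTS dedups both orders.

```python
# Sketch of canonical-key idempotency for match creation.
# A dict stands in for the match table; names are illustrative.

matches = {}  # canonical_key -> match record

def canonical_key(user_a, user_b):
    # Same key regardless of which user's swipe completed the match.
    return (min(user_a, user_b), max(user_a, user_b))

def create_match(user_a, user_b):
    key = canonical_key(user_a, user_b)
    if key in matches:                 # INSERT IF NOT EXISTS
        return matches[key]            # return existing match
    matches[key] = {"pair": key, "id": len(matches) + 1}
    return matches[key]

m1 = create_match("alice", "bob")
m2 = create_match("bob", "alice")     # reversed order, same canonical key
assert m1 is m2
assert len(matches) == 1
```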
Output Artifact #
For every write path: pseudocode including idempotency, CAS, retry, and rejection logic. For every process object: complete state machine diagram.
Step 9 — Logical Data Model #
Goal #
Convert primary objects, DPs, and access patterns into a concrete logical schema. This step is still logical (technology-independent). Physical decisions come in Step 10.
Schema Derivation Rules #
- Rule 1: Every primary object becomes at least one table/collection.
- Rule 2: Every derived view that is read frequently becomes an explicit read model table/collection.
- Rule 3: Partition key candidates are derived from the scope of the correctness invariant (not from convenience).
- Rule 4: Every idempotency requirement produces a dedup key in the schema.
- Rule 5: Source tables and projection tables are never the same table. Mark each clearly.
Schema Components #
For each table:
- Name and class (source-of-truth vs projection)
- Fields
- Primary key
- Important secondary / lookup keys
- Partition key candidate (derived from invariant scope)
- Uniqueness constraints (for dedup keys)
- Ordering keys (for range reads)
Example Schemas #
Auction (source-of-truth) #
auction_id UUID PK
seller_id UUID FK
starting_price Decimal
end_time Timestamp
status Enum(CREATED, OPEN, ENDED)
current_highest Decimal
winning_bid_id UUID nullable
version Integer optimistic lock
created_at Timestamp
Partition key: auction_id
Uniqueness: (auction_id) — one row per auction
Bid (source-of-truth, append-only) #
bid_id UUID PK
auction_id UUID FK, partition key candidate
bidder_id UUID
amount Decimal
created_at Timestamp
request_id UUID UNIQUE — dedup key
status Enum(PENDING, ACCEPTED, REJECTED)
Partition key: auction_id (scan all bids for an auction)
Secondary key: (bidder_id, created_at) — scan bids by bidder
Uniqueness: (request_id) — idempotency enforcement
IdempotencyRecord #
request_id UUID PK
result_json JSON
created_at Timestamp
expires_at Timestamp — TTL
Partition key: request_id
DriverStatus (source-of-truth, in Redis) #
Key: driver:{driver_id}:status
Value: {status, claim_id, trip_id, claimed_at}
TTL: set on claim; cleared on release
ScoreAggregate (source-of-truth) #
video_id UUID PK / partition key
all_time_score Long
updated_at Timestamp
version Long — for optimistic updates
TopKProjection (read model / derived view) #
scope String PK (composite: granularity:bucket_id)
rank Integer sort key
video_id UUID
score Double — with temporal decay applied
updated_at Timestamp
Note: this table is entirely rebuildable from ScoreAggregate + ScoreEvents
Read-Path Alignment #
Every frequent read must have a direct key or index path. For each important query, verify:
| Query | Required key |
|---|---|
| Get auction by id | auction_id PK |
| Get all bids for auction | auction_id partition key on Bid |
| Get top K videos for scope | scope PK on TopKProjection |
| Did user A swipe on user B? | (swiper_id, swiped_id) composite key on Swipe |
| Did user B swipe on user A? | (swiper_id=B, swiped_id=A) — same table, different lookup |
| Drivers within 3km of location | Spatial index on (lat, lng) — not a table key |
Output Artifact #
Schema definitions for all primary objects and derived view tables, with keys and constraints.
Step 10 — Technology Landscape #
Goal #
Map each DP’s required capabilities to a technology shape, then to specific products. Technology is chosen by capability fit, not by fashion or familiarity.
Technology Shape Categories #
| Shape | Best for | Examples |
|---|---|---|
| OLTP store | Transactional conditional writes, state machines, point reads/writes, foreign keys | PostgreSQL, MySQL, CockroachDB, DynamoDB (simple KV conditional) |
| Wide-column / time-series store | High-rate append, per-key range reads (time range), telemetry, events | Cassandra, Scylla, Bigtable, TimescaleDB |
| In-memory store | Sub-millisecond reads, ephemeral hot state, Lease implementation, spatial index, bounded sorted sets | Redis, Memcached |
| Append log / event stream | Ordered event ingestion, replay, async fan-out, buffering between producers and consumers | Kafka, Pulsar, Kinesis |
| Stream processor | Stateful event-by-event computation, windowing, watermarks, exactly-once aggregation | Flink, Spark Streaming, Kafka Streams |
| Object store | Cheap immutable large-object storage, archival, cold data | S3, GCS, Azure Blob |
| Search / analytical store | Full-text search, ad-hoc analytics, flexible secondary queries | Elasticsearch, ClickHouse, BigQuery |
Capability-to-Shape Mapping #
| Required capability | Technology shape |
|---|---|
| Conditional update (CAS, optimistic lock) | OLTP store |
| Lease (exclusive time-bound claim) | In-memory store (Redis SETNX + EX) |
| High-throughput append with per-key ordering | Wide-column store or append log |
| Spatial query (entities within radius) | In-memory store with geo index (Redis GEOADD/GEORADIUS) OR dedicated spatial DB |
| Sliding/tumbling window aggregation | Stream processor |
| Bounded sorted set (top-K) | In-memory store (Redis sorted set) |
| Async fan-out with replay | Append log |
| Idempotency store | OLTP store (write once, read on retry) |
| Full-text search | Search store |
| Cold archival | Object store |
Technology Selection Rules #
Match mutation pattern first. If the invariant requires CAS, the store must support atomic conditional writes. Redis sorted sets do not support CAS on set membership — use Postgres or DynamoDB for that.
Never let a projection store become truth. Redis is fast but not the source of truth for anything correctness-critical. If Redis is unavailable, rebuild from the OLTP store. This must be possible.
Partition key alignment. The technology must support efficient access on the partition key you chose in Step 9. If your partition key is (scope, bucket_id) and your queries are by this key, Cassandra with that as the partition key works. A single Postgres table without sharding does not work at high volume.
Check the consistency model of the technology. DynamoDB reads are eventually consistent by default — if your invariant requires strong consistency, use strongly consistent reads (at higher latency and cost). Cassandra QUORUM reads (paired with QUORUM writes) are consistent; ONE reads are not.
Output Artifact #
| DP | Required capability | Technology shape | Specific product | Consistency mode |
|---|---|---|---|---|
| Auction acceptance | CAS + transactions | OLTP | PostgreSQL | Serializable or Read Committed + optimistic lock |
| Bid idempotency | Write-once, read-on-retry | OLTP | PostgreSQL (same DB) | Read Committed |
| Driver status / claim | Lease (SETNX + TTL) | In-memory | Redis | Single-instance or Redlock |
| Driver spatial index | Geo query within radius | In-memory + geo | Redis Geo | Eventual (TTL-based refresh) |
| Score event ingestion | High-throughput append | Append log | Kafka (partitioned by video_id) | At-least-once |
| Score aggregation | Windowed stateful compute | Stream processor | Flink | Exactly-once (checkpointing) |
| Top-K projection | Bounded sorted set | In-memory | Redis sorted set | Eventual (rebuilt from Flink output) |
| Swipe store | High-write append, per-key scan | Wide-column | Cassandra (partition: swiper_id) | Quorum writes for right-swipes |
| Match store | CAS on (user_pair canonical key) | OLTP | PostgreSQL | Serializable |
Step 11 — Deployment Topology #
Goal #
Define service boundaries, partition ownership, failure domains, and data flow. Architecture becomes deployable here.
Service Boundary Rules #
A service boundary is justified when two responsibilities have:
- Different scaling characteristics (write-heavy vs read-heavy)
- Different failure isolation requirements (payment must not fail because recommendation is slow)
- Different consistency requirements (real-time bid acceptance vs eventual leaderboard)
- Different data ownership (auction service owns auction state; notification service owns delivery state)
Do not create service boundaries to satisfy a microservices philosophy. Create them when one of the above is true.
Partition Key Rules #
Partition key must align with the scope of the correctness invariant.
If the invariant is “accepted bids strictly increase within one auction,” the partition key for the acceptance mechanism is auction_id. Partitioning by seller_id puts auctions from the same seller together — this is wrong because two separate auctions from the same seller have no shared invariant.
| Invariant scope | Partition key |
|---|---|
| Per auction | auction_id |
| Per video | video_id |
| Per activity | activity_id |
| Per driver | driver_id |
| Per user pair (match) | canonical(min(a,b), max(a,b)) |
| Per geographic cell | H3 cell ID or geohash prefix |
| Per time window + scope | (granularity, bucket_id, scope_id) |
Failure Domain Classification #
For each component:
| Component | Correctness criticality | Failure impact | Recovery path |
|---|---|---|---|
| Auction acceptance (Postgres) | CRITICAL | No bids accepted | Fail closed; primary failover |
| Bid idempotency store | CRITICAL | Duplicate bids possible | Fail closed; replica promotion |
| Driver status (Redis) | CRITICAL | Dispatches fail | Fail closed; Redis Sentinel / Cluster |
| Score event ingest (Kafka) | HIGH | Events lost if unrecoverable | Retry from producer; DLQ |
| Score aggregate (Flink) | HIGH | Leaderboard stale | Restore from Flink checkpoint |
| Top-K projection (Redis) | MEDIUM | Stale leaderboard | Rebuild from aggregate store |
| Notification service | LOW | Match notification delayed | Retry; user sees match on app open |
Correctness-critical components fail closed. Projection-only components may degrade gracefully (serve stale data).
Output Artifact #
- Service boundary diagram with data flows
- Partition topology table
- Failure domain table (above)
Step 12 — Consistency Model #
Goal #
State explicitly which paths require strong consistency and which may be eventually consistent. Systems almost never need the same consistency everywhere. Applying strong consistency uniformly destroys scalability; applying eventual consistency uniformly introduces correctness bugs.
Classification Procedure #
For each write path and read path, ask:
Strong consistency required if: violating the invariant changes business truth — money is charged twice, a driver is double-dispatched, a bid is accepted that should have been rejected.
Eventual consistency acceptable if: the only consequence of a stale read is temporary UX lag — leaderboard is 60s stale, cache is 5s stale, notification arrives 2s after the event.
Output Artifact #
| Path | Consistency | Staleness bound | Why |
|---|---|---|---|
| Bid acceptance write | Strong | 0 | Eligibility invariant must hold at decision time |
| Auction view read | Eventual | 100ms | Stale view is visible briefly; correctness is in the write path |
| Driver status write (claim) | Strong | 0 | Double-dispatch is a correctness violation |
| Driver location read (for matching) | Eventual | 30s | Location ping is approximate; stale ping = slightly worse match |
| Score event ingest | At-least-once | N/A | Duplicates handled by dedup; missing events are not acceptable |
| Top-K leaderboard read | Eventual | 60s | Product accepts 60s staleness |
| Match creation write | Strong | 0 | Must not create duplicate match |
| Match notification delivery | Best-effort | N/A | Push is unreliable; durable match record is the truth |
Step 13 — Scaling Model #
Goal #
Identify how load grows and where bottlenecks appear. You are not yet tuning numbers; you are naming the dimensions.
Scale Type Taxonomy #
| Type | Definition | Bottleneck location | Strategy |
|---|---|---|---|
| Write-heavy / append | Many writes, few reads, low contention per key | Ingest throughput | Partition by key, parallel ingest, wide-column store |
| Contention-heavy | Many competing writes on the same key | Single-key throughput | Single-writer partition, or CAS with retry budget |
| Read-heavy | Many reads, low write rate | Read throughput + latency | Cache, CQRS, CDN |
| Fan-out heavy | One write causes many downstream reads or notifications | Fan-out path (notify N subscribers) | Fan-out on Write (small N), Fan-out on Read (large N), hybrid |
| Window / aggregation heavy | Events must be bucketed and aggregated in time windows | Aggregation throughput + state size | Stream processor, pre-aggregation, approximate counting |
Hotspot Identification #
A hotspot is a key or partition that receives disproportionately more traffic than others.
| Hotspot type | Example | Mitigation |
|---|---|---|
| Single-key write hotspot | One live auction with 10K bids/sec | Single-writer partition for that auction_id; queue bids in front |
| Single-key read hotspot | One video goes viral | CDN caching; in-memory cache with short TTL; cache stampede protection (single-flight) |
| Fan-out hotspot | Celebrity with 100M followers posts | Fan-out on Read for celebrity accounts; hybrid fan-out |
| Partition hotspot | All keys hash to same shard | Add random suffix to key for writes; aggregate at read time |
| Time-boundary hotspot | All window aggregates flush at the same second | Stagger flush times; jitter |
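The random-suffix mitigation from the table can be sketched as follows (shard count and names are illustrative): writes to one hot key spread across N sub-keys, and reads aggregate the sub-keys.

```python
import random

# Sketch of "random suffix" sharding for a partition hotspot.
# Shard count and key names are illustrative.

N_SHARDS = 8
counters = {}  # sharded key -> count (stands in for the hot partition)

def incr_hot_key(key):
    # Spread writes for one hot key across N sub-keys.
    shard = f"{key}#{random.randrange(N_SHARDS)}"
    counters[shard] = counters.get(shard, 0) + 1

def read_hot_key(key):
    # Aggregate at read time across all sub-keys.
    return sum(counters.get(f"{key}#{i}", 0) for i in range(N_SHARDS))

for _ in range(1000):
    incr_hot_key("viral_video")
assert read_hot_key("viral_video") == 1000
```

The trade: write throughput scales by N, while reads pay an N-way aggregation, which is why the table pairs "random suffix for writes" with "aggregate at read time."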
Output Artifact #
| Dimension | Driver | Hotspot key | Likely bottleneck | Strategy |
|---|---|---|---|---|
| Auction bids | bids per auction | Popular auction_id | CAS retry rate on hot auction | Single-writer queue |
| Score events | events per video | Viral video_id | Kafka partition throughput | Partition by video_id; increase partition count |
| Driver matches | dispatches per cell | Dense geo cell | Spatial index read hotspot | Cell-level cache; increase Redis cluster size |
| Match notifications | matches per second | Global (peak time) | Push notification throughput | Async queue with backpressure |
Step 14 — Failure Model #
Goal #
Enumerate what can go wrong and what correct behavior is in each case. A design is incomplete until every failure mode has an explicit response.
Failure Taxonomy #
Client retry / timeout #
Client does not know whether the write succeeded. May retry.
- Required: Idempotency key on every non-idempotent write.
- Every write path must be safe to retry N times.
Duplicate delivery #
Same event or message arrives twice from broker (at-least-once delivery).
- Required: Dedup by event_id before applying to aggregate or state.
- Never assume exactly-once from a message broker without explicit configuration.
Partial write #
Write to primary state succeeds; downstream event publication or projection update fails.
- Required: Outbox pattern (event written in same transaction as state change) OR CDC (event derived from WAL).
- The downstream failure must not leave the system in a state where the event is permanently lost.
Out-of-order arrival #
Events arrive in a different order than they were produced.
- Required: Explicit policy for each ordered object. Options:
- Reject and requeue (wait for in-order delivery)
- Accept and reconcile (watermark-based window)
- Idempotent application (if operation is order-independent)
Temporal boundary #
Event arrives near a window close or a deadline (e.g., bid arrives at exactly end_time).
- Required: Explicit clock authority. Client clock is never authoritative. Server time is used for all temporal decisions. Define behavior for events arriving within [end_time - ε, end_time + ε].
Infrastructure failure #
Node crashes, DB primary fails, cache is cold, queue consumer lags.
- Required: For each component, explicit answers to:
- What is the impact on correctness? (writes fail closed? reads serve stale?)
- What is the recovery path? (failover, rebuild from source, restart consumer)
Output Artifact #
| Failure mode | Can it happen? | Correct behavior | Mechanism enforcing it | Recovery if mechanism fails |
|---|---|---|---|---|
| Client retries bid | Yes | Same result as first attempt | Idempotency Key | Manual dedup audit |
| Duplicate score event | Yes (Kafka at-least-once) | Ignored; aggregate unchanged | Dedup by event_id | Recompute aggregate from event log |
| Bid write succeeds; event not published | Yes (crash after write) | Event eventually published | Outbox + Relay | Outbox relay catches up on restart |
| Auction end_time race | Yes (bid arrives at exact end_time) | Server time is authoritative; server rejects if server_time ≥ end_time | Server-side time check in CAS predicate | Audit log of rejection vs acceptance at boundary |
| Driver dispatch crashed holding claim | Yes | Claim auto-expires via TTL | Lease TTL | Redis TTL fires; driver returns to available |
| Score aggregate consumer lags | Yes | Leaderboard is stale by lag amount | SLO on freshness; alert on lag > threshold | Restart consumer; replay from last checkpoint |
Step 15 — SLO Definition #
Goal #
Convert broad product expectations into measurable promises. SLOs are the contract between architecture and operations.
SLO Types #
| Type | Definition | Example |
|---|---|---|
| Latency | Percentile latency of a request or event path | Bid acceptance p95 < 200ms |
| Availability | Fraction of time the service is usable | Bid service 99.9% available |
| Correctness | Rate at which invariants are violated | Zero accepted bids after auction close |
| Freshness | Maximum staleness of a projection | Leaderboard lag < 60s at p99 |
| Throughput | Sustained operation rate | 10K score events/sec ingested |
SLO Derivation #
SLOs are derived from:
- Product requirements (user-visible expectations)
- Invariant strength (correctness invariants have SLO = zero violations)
- Downstream SLO budgets (each call in a path consumes part of the latency budget)
Latency budget principle: if the end-to-end SLO is 500ms and the path has 3 hops, each hop must have a timeout budget. Sum of timeouts must be < 500ms. Every timeout in the system is derived from this budget, not set arbitrarily.
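A sketch of deriving per-hop timeouts from the end-to-end budget (hop names, weights, and the safety margin are illustrative assumptions, not from the text):

```python
# Sketch of the latency-budget principle: per-hop timeouts are derived
# from the end-to-end SLO, not set arbitrarily. Weights and margin are
# illustrative.

def derive_timeouts(end_to_end_ms, hop_weights, safety_margin=0.9):
    # Reserve a margin, then split the remainder proportionally per hop.
    budget = end_to_end_ms * safety_margin
    total = sum(hop_weights.values())
    return {hop: budget * w / total for hop, w in hop_weights.items()}

timeouts = derive_timeouts(500, {"gateway": 1, "service": 2, "db": 2})
assert sum(timeouts.values()) < 500           # sum of timeouts under SLO
assert timeouts["db"] == timeouts["service"]  # split by weight
```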
Output Artifact #
| Path | Metric | Target | Derived from |
|---|---|---|---|
| Bid acceptance | p95 latency | 200ms | Product UX requirement |
| Bid acceptance | Correctness violations | 0 per day | Invariant is absolute |
| Live comment delivery | p99 delivery lag | 500ms | Product UX requirement |
| Leaderboard freshness | p99 lag | 60s | Product requirement |
| Driver match latency | p95 end-to-end | 2s | Rider UX expectation |
| Payment capture | Duplicate charge rate | 0 | Legal + invariant |
| Score event ingest | Throughput | 50K events/sec | Capacity model |
Step 16 — Operational Parameters #
Goal #
Convert every architectural lever into a named, tunable parameter with its valid range and effect.
Every decision you made in Steps 6–11 that involved a number, a threshold, or a timeout must appear here.
Parameter Categories #
Concurrency parameters #
- CAS retry limit: max attempts before returning conflict error
- Lease TTL: how long a claim is held before auto-release
- Lock timeout: how long a pessimistic lock is held
Cache parameters #
- Cache TTL: staleness bound; sets the freshness SLO
- Cache size: eviction begins when exceeded
- Cache stampede protection: request coalescing window
Queue parameters #
- Retry budget: max retries before DLQ
- Backoff base + max: controls retry storm risk
- Jitter factor: fraction of backoff that is randomized
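These three queue parameters combine into one backoff schedule; a sketch (capped exponential with a randomized jitter slice; the function name and defaults are illustrative):

```python
import random

# Sketch of retry backoff built from the parameters above: retry budget,
# backoff base + max (cap), and a jitter factor that randomizes a fraction
# of each delay to avoid retry storms. Names and defaults are illustrative.

def backoff_delays(retry_budget, base, cap, jitter_factor, rng=random.random):
    delays = []
    for attempt in range(retry_budget):
        raw = min(cap, base * (2 ** attempt))   # exponential, capped at max
        jitter = raw * jitter_factor * rng()    # randomized slice of the delay
        delays.append(raw - jitter)
        # after the final attempt, the message goes to the DLQ
    return delays

delays = backoff_delays(retry_budget=5, base=0.1, cap=2.0, jitter_factor=0.5)
assert len(delays) == 5
assert all(0 <= d <= 2.0 for d in delays)       # never exceeds the cap
```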
Window parameters #
- Window size: tumbling or sliding interval
- Allowed lateness: how long after window close late events are accepted
- Watermark lag: how far behind real time the watermark is held
Idempotency parameters #
- Idempotency key TTL: how long outcomes are retained
- Key scope: per-user, per-session, or global
Output Artifact #
| Parameter | Component | Default | Min | Max | Effect of increasing | Effect of decreasing |
|---|---|---|---|---|---|---|
| CAS retry limit | Auction acceptance | 3 | 1 | 10 | More retries = higher acceptance rate under contention, higher tail latency | Fewer retries = faster failure, lower acceptance rate |
| Lease TTL | Driver claim | 30s | 5s | 120s | Longer = stale claims held longer after a crashed dispatcher before auto-release | Shorter = legitimately slow operations lose the claim prematurely |
| Cache TTL | AuctionView | 100ms | 0 | 5s | More staleness, less DB load | Less staleness, more DB load |
| Retry budget | Score event ingest | 5 | 1 | 20 | More attempts before DLQ | Fewer attempts = more events routed to the DLQ on transient failures |
| Backoff base | All retry paths | 100ms | 10ms | 1s | Slower retry, less storm | Faster retry, higher storm risk |
| Window size | Hourly leaderboard | 1h | 5m | 24h | Coarser time granularity | Finer granularity, more state |
| Watermark lag | Score events | 10s | 0 | 60s | More late events captured | More latency in window close |
Step 17 — Runbooks #
Goal #
For every known failure mode (from Step 14) and every hotspot (from Step 13), define the exact procedure an operator takes.
Runbook Template #
Incident: [Name]
Trigger: [What signal indicates this is happening — metric, alert, symptom]
Diagnosis: [How to confirm]
Immediate mitigation: [What to do right now to reduce impact]
Root cause investigation: [How to find why it happened]
Resolution: [How to fully recover]
Prevention: [What to change to prevent recurrence]
Example Runbooks #
Runbook: Auction CAS contention storm #
Trigger: CAS retry rate > 80% on bid acceptance service; p99 bid latency > 2s
Diagnosis: Check bid volume per auction_id; identify the hot auction
Immediate mitigation: Enable a bid queue in front of the hot auction_id; switch to single-writer for that auction
Root cause investigation: Unexpected viral auction; more bidders than designed for
Resolution: Queue drains; single-writer processes bids sequentially; switch back to CAS when load drops
Prevention: Detect auction heat (bid rate > threshold) earlier; pre-emptively queue before contention saturates
Runbook: Leaderboard freshness SLO breach #
Trigger: Leaderboard lag > 60s; alert fires
Diagnosis: Check Flink consumer lag on the Kafka score event topic; check Flink job health; check Redis write latency
Immediate mitigation: None needed if correctness is not affected (the leaderboard is a projection)
Root cause investigation: Flink checkpoint failure? Consumer restart? Kafka partition rebalance?
Resolution: Restart the Flink job from the last checkpoint; it will catch up
Prevention: Alert on consumer lag > 30s to catch the problem before the SLO breach; increase checkpoint frequency
Runbook: Driver not releasing after dispatch crash #
Trigger: Rider waiting > TTL seconds; driver marked dispatched but unresponsive
Diagnosis: Check driver TTL expiry in Redis; check dispatch service health
Immediate mitigation: The Redis TTL fires automatically; the driver returns to the available pool
Root cause investigation: Was the TTL set correctly at claim time? Did the dispatcher crash before setting the TTL?
Resolution: Verify the TTL is always set atomically with the claim (SETNX + EX in one command)
Prevention: Audit: never allow a claim without a TTL in the same Redis command
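The invariant this runbook audits for — claim and TTL set in one atomic step — can be sketched with an in-memory lease store standing in for Redis `SET ... NX EX`; the class and key names are illustrative:

```python
import time

class LeaseStore:
    """In-memory sketch of lease semantics. Because claim() writes the
    holder and the expiry together, a crash can never leave a claim
    without a TTL — the failure mode the runbook guards against."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._leases = {}  # resource -> (holder, expiry)

    def claim(self, resource, holder, ttl):
        """Analogous to Redis `SET resource holder NX EX ttl`."""
        lease = self._leases.get(resource)
        if lease and lease[1] > self.clock():
            return False                       # still held, claim rejected
        self._leases[resource] = (holder, self.clock() + ttl)
        return True
```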
Step 18 — Observability #
Goal #
Define what must be instrumented for every SLO to be measurable, every failure mode to be detectable, and every parameter to be tunable at runtime.
Observability Types #
| Type | Purpose | Examples |
|---|---|---|
| Metrics | Quantitative, time-series; powers SLO dashboards | CAS success rate, retry rate, latency percentiles, consumer lag |
| Traces | Request-scoped; shows where latency is spent | Distributed trace from bid submission through CAS through event publication |
| Logs | Discrete events; powers incident investigation | CAS rejections with reason, idempotency hits, transition failures |
| Alerts | Fires when metric crosses SLO threshold | Bid latency p95 > 200ms, consumer lag > 30s, CAS retry rate > 50% |
Required Instrumentation #
| Component | Required metric | Alert condition |
|---|---|---|
| Auction acceptance | CAS success rate, CAS retry count, p95/p99 latency | Retry rate > 50%, latency p95 > 200ms |
| Idempotency store | Hit rate, miss rate, TTL expiry rate | Hit rate < expected (suggests missing keys) |
| Driver claim | Claim success rate, TTL expiry rate, fencing token rejections | Fencing rejections > 0 (indicates stale dispatcher) |
| Score event ingest | Events/sec, lag, DLQ depth, dedup hit rate | Lag > 30s, DLQ depth > 0 |
| Leaderboard projection | Freshness lag, Redis write latency | Freshness > 60s |
| Match detection | Match creation rate, duplicate match attempts | Duplicate attempts > 0 |
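A minimal sketch of how such a table becomes executable alert evaluation — the metric names and thresholds below mirror a few rows above and are illustrative, not a complete rule set:

```python
# Hypothetical alert rules drawn from the instrumentation table.
ALERT_RULES = [
    ("cas_retry_rate", 0.50),     # Auction acceptance
    ("bid_latency_p95_ms", 200),  # Auction acceptance
    ("consumer_lag_s", 30),       # Score event ingest
    ("dlq_depth", 0),             # Score event ingest
]

def firing_alerts(metrics, rules=ALERT_RULES):
    """Return the rule names whose threshold the current snapshot exceeds.
    Missing metrics are skipped (absence is its own alert in practice)."""
    return [name for name, threshold in rules
            if metrics.get(name) is not None and metrics[name] > threshold]
```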
Output Artifact #
Instrumentation plan: for each component, list of metrics, log events, trace spans, and alert thresholds.
Step 19 — Cost Model #
Goal #
Quantify what grows with users, events, and hotspots so that cost surprises can be predicted and capacity planned.
Growth Dimensions #
For each component, identify:
- What user behavior drives its growth?
- What is the growth rate (linear, quadratic, logarithmic)?
- What is the dominant cost (storage, compute, network, IOPS)?
Example #
| Component | Growth driver | Growth rate | Dominant cost |
|---|---|---|---|
| Bid store | Bids per auction × active auctions | Linear in auctions | Storage + IOPS |
| Score event store (Kafka) | Score events per video × active videos | Linear in content | Kafka storage (retention × throughput) |
| Score aggregate (Flink state) | Active videos | Linear | Memory (Flink state size) |
| Top-K projection (Redis) | Scopes × K | Constant per scope (bounded by K) | Redis memory |
| Driver status (Redis) | Active drivers | Linear | Redis keys |
| Swipe store (Cassandra) | Swipes per user × active users | Quadratic in users (O(U²) worst case if everyone swipes everyone) | Cassandra storage |
| Idempotency store | Write rate × TTL | Constant at steady state | OLTP storage |
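The Kafka row can be turned into a back-of-envelope formula — a sketch with an assumed replication factor and event size, not figures from the source:

```python
def kafka_retention_bytes(events_per_sec, avg_event_bytes, retention_hours,
                          replication_factor=3):
    """Kafka storage = throughput x retention window x replication:
    every produced byte is kept for the full retention period on each replica."""
    retention_secs = retention_hours * 3600
    return events_per_sec * avg_event_bytes * retention_secs * replication_factor
```

At the 50K events/sec from the SLO table, 200-byte events, and 24h retention, this already lands in the terabytes — which is why retention, not throughput, usually dominates Kafka cost.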
Step 20 — Evolution Path #
Goal #
Define the smallest correct initial design and the signals that trigger each architectural upgrade.
Staged Design Principle #
Start with the minimum correct design. Add complexity only when a measured signal crosses a defined threshold.
Evolution Stages #
Stage 1: Single service, single OLTP DB #
- One Postgres instance handles auctions, bids, idempotency
- Leaderboard computed by query at read time
- No caching, no streaming
- Correct up to: ~100 writes/sec, ~1K reads/sec, <10K objects
Upgrade signal: Postgres CPU > 60%, p95 read latency > 500ms
Stage 2: Add caching and async projections #
- Redis cache for hot read paths (AuctionView, TopK)
- Background job to compute leaderboard
- Idempotency TTL cleanup job
Upgrade signal: Write contention on hot objects; CAS retry rate > 20%
Stage 3: Partition write-heavy objects #
- Shard Cassandra for append-heavy event stores (score events, swipe store)
- Kafka for score event fan-out to aggregation
- Flink for windowed aggregation
Upgrade signal: Kafka consumer lag > 30s; Flink state > available memory
Stage 4: Specialized stores and geographic distribution #
- Redis Geo for spatial driver matching
- H3 cell index for surge zone computation
- Multi-region Cassandra for swipe store
Upgrade signal: Geographic expansion; latency SLO from distant regions fails
Data Model Preservation #
At each stage, identify:
- What schema changes are additive (safe)?
- What schema changes require migration (risky)?
- What data model decisions lock you into a pattern (e.g., partition key choice in Cassandra cannot be changed without full table rewrite)?
Flag these decisions early. They are the ones that are hardest to reverse.
Pattern Reference #
Quick lookup: given a mechanism type from Step 6, find the pattern.
Concurrency Control #
| Mechanism | Pattern | Key lever | Failure mode |
|---|---|---|---|
| CAS on version | Optimistic Lock | Retry budget; version column | ABA problem — add timestamp or monotonic ID to version |
| CAS + idempotency | Optimistic Lock + Idempotency Key | Idempotency TTL | Key collision if client reuses keys |
| SETNX + EX | Lease | TTL duration; fencing token | Clock skew between holder and store — use fencing token on all resource access |
| Row lock + transaction | Pessimistic Lock | Lock timeout | Deadlock; lock not released on crash |
| Commutative merge | CRDT | CRDT type (G-Counter, OR-Set, YATA) | Tombstone accumulation for deleted elements |
| Saga | Saga | Compensation idempotency; saga state durability | Compensation failure (double fault) |
| 2PC | 2PC | Coordinator crash recovery | Blocked participants if coordinator crashes after prepare |
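The fencing-token lever in the Lease row deserves a concrete shape. A sketch of the resource-side check (the class is illustrative; the point is that tokens are monotonic per lease grant):

```python
class FencedResource:
    """A resource that rejects writes carrying a stale fencing token.
    Each lease grant carries a strictly higher token, so a holder that
    lost its lease (e.g. during a GC pause) cannot clobber newer state."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token <= self.highest_token:
            return False          # stale holder: reject the write
        self.highest_token = token
        self.value = value
        return True
```

This check lives on the resource, not the lock service — which is why it survives clock skew between holder and store.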
Read Acceleration #
| Pattern | When to use | Discriminant |
|---|---|---|
| Cache-Aside + TTL | Read ≫ write; staleness OK | Simplest; use first |
| CQRS + Materialized View | Read shape ≠ write shape | Necessary when shape differs |
| Denormalization | Join-heavy hot reads; same shape | When joins dominate latency |
| Scatter-Gather | Query fans across shards | When data is already partitioned |
| Spatial Partition | Geospatial query | Always use before other read patterns for geo queries |
Write Scaling #
| Pattern | When to use | Tradeoff |
|---|---|---|
| Hash Partition | Key-based access, uniform distribution | Loses range queries |
| Range Partition | Range queries dominant | Hot “latest” partition for time-series |
| Leaderless Replication | Availability > consistency, multi-region writes | Eventual consistency; conflict handling needed |
| Single-writer per partition | Very hot single key; CAS retry rate too high | Adds queue latency; operational complexity |
Event Flow #
| Pattern | When to use | Discriminant |
|---|---|---|
| Message Queue | Work distribution; rate decoupling | Use when one consumer per message |
| Pub/Sub | Fan-out; multiple independent consumers | Use when N consumers per message |
| Outbox + Relay | Event produced by application logic; must be atomic with state change | Application controls event content |
| CDC | Event IS the DB row change | No application-level event needed |
| Fan-out on Write | Small follower counts; real-time | Fails at celebrity scale |
| Fan-out on Read | Large follower counts; infrequent reads | Higher read latency |
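The Outbox + Relay row hinges on one property: state change and event commit together. An in-memory sketch (a real implementation writes both rows in one DB transaction; the relay delivers at-least-once and consumers dedup by event id):

```python
class OutboxDB:
    """In-memory sketch of the outbox pattern: the state row and the event
    row are written in the same (simulated) transaction, so an event exists
    if and only if its state change committed."""
    def __init__(self):
        self.state = {}
        self.outbox = []     # unpublished event rows

    def commit(self, key, value, event):
        # In a real system these two writes are one transaction.
        self.state[key] = value
        self.outbox.append(event)

def relay(db, publish):
    """The relay drains the outbox and publishes each event row in order."""
    while db.outbox:
        publish(db.outbox.pop(0))
```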
Fault Tolerance #
| Pattern | When to use | Lever |
|---|---|---|
| Idempotency Key | Every non-idempotent write on unreliable network | TTL; key scope |
| Retry + Backoff + Jitter | Any network call | Base delay; max delay; jitter fraction |
| Circuit Breaker | Downstream can degrade (not just fail) | Error rate threshold; half-open probe interval |
| Bulkhead | High-priority traffic must be isolated from low-priority | Thread pool or connection pool size per tier |
| Timeout | Every outbound call | Must be derived from caller’s latency budget |
Consensus and Sync #
| Pattern | When to use | Tradeoff |
|---|---|---|
| Leader-Follower | Single-region; strong consistency writes | Leader is bottleneck; failover requires election |
| Quorum (W + R > N) | Leaderless; tunable consistency | Network partition can still split quorum |
| Gossip | Membership; failure detection; eventual propagation | Convergence time = O(log N) |
| State Vector Sync | Distributed sync; two nodes converge incrementally | Vector size grows with node count |
| Merkle Tree | Anti-entropy between replicas; efficient diff | Rehash cost on hot write path |
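The quorum condition from the table reduces to two arithmetic checks:

```python
def quorum_is_strong(n, w, r):
    """W + R > N guarantees every read quorum overlaps every write quorum,
    so a read always contacts at least one replica holding the latest write."""
    return w + r > n

def write_failures_tolerated(n, w):
    """Replica failures a write can survive while still reaching W acks."""
    return n - w
```

For N=3, the common W=2/R=2 setting is strong and tolerates one failed replica per write; W=1/R=1 maximizes availability but gives up read-your-writes.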
Time and Approximation #
| Pattern | When to use | Lever |
|---|---|---|
| Windowing | Aggregate over finite time interval | Window size; allowed lateness; watermark lag |
| TTL | Auto-expire entries | Duration; lazy vs eager expiry; add ±10% jitter |
| Approximate Counting | Memory-constrained cardinality or frequency | Count-Min (frequency), HyperLogLog (cardinality), Bloom Filter (membership) |
| Temporal Decay | Recency-weighted ranking | Decay rate λ; floor value |
| Spatial Partition | Geographic proximity queries | Cell resolution; neighbor depth |
| Staged Rollout | Deploy changes safely | Rollout percentage; metric thresholds for pause |
| Scheduled Trigger | Periodic maintenance or aggregation | Schedule frequency; distributed lock to prevent overlap |
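Two rows above have one-line implementations worth making explicit — temporal decay and TTL jitter, sketched with illustrative parameter values:

```python
import math
import random

def decayed_score(raw_score, age_seconds, decay_rate, floor=0.0):
    """Exponential temporal decay: score = raw * e^(-lambda * age),
    so recent events dominate the ranking; `floor` prevents total decay."""
    return max(floor, raw_score * math.exp(-decay_rate * age_seconds))

def jittered_ttl(base_ttl, jitter=0.10):
    """TTL with +/-10% jitter, so a cohort of entries written together
    does not expire (and stampede the backing store) at the same instant."""
    return base_ttl * random.uniform(1 - jitter, 1 + jitter)
```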
Cross-Step Anti-Patterns #
These mistakes appear at specific steps and propagate forward as structural errors.
| Anti-pattern | Where it appears | Consequence | Fix |
|---|---|---|---|
| Treating a projection as primary state | Step 2 | Dual truth; inconsistent reads | Derivability test: can it be recomputed? If yes, it’s a projection. |
| Mixing evolution models in one object | Step 2-3 | CAS predicate is ambiguous; tests become impossible | Split into process object + append event stream |
| Ordering without scope | Step 3 | “Total order” applied globally when only per-partition order is needed → unnecessary coordination | Always bind ordering to a scope |
| Vague invariant (“system should be correct”) | Step 4 | Cannot derive mechanism; design is hand-wavy | Rewrite: “for every bid, amount > current_highest at decision time” |
| Technology as DP (“use Kafka”) | Step 5 | Technology lock-in before invariant is understood | Name the invariant-enforcing mechanism; choose technology in Step 10 |
| CAS without idempotency key | Step 6 | Client retry creates duplicate outcome | Always pair CAS with idempotency check first |
| Lock where Lease is needed | Step 6 | Holder crashes; resource blocked until manual intervention | If holder can crash, use Lease (TTL) |
| 2PC where Saga applies | Step 6 | Coordinator crash blocks all participants | Check: can each step be local with compensation? If yes, use Saga |
| CRDT for non-commutative domain | Step 6 | Merge produces incorrect state | Check commutativity first. “Bid accepted if highest” is not commutative → not CRDT. |
| Dual truth source | Step 7 | “Sometimes stale, sometimes wrong” bugs | One authoritative source; all others are projections with rebuild path |
| Partition key by convenience not invariant | Step 9-10 | Hotspot misaligned with contention; cross-partition transactions required | Partition key = scope of correctness invariant |
| Projection technology as truth | Step 10 | Redis unavailable = correctness unavailable | Redis is always a cache or projection. Rebuild path from OLTP source must exist. |
| Eventual consistency on correctness path | Step 12 | Silent correctness violations | Classify each path explicitly; strong consistency on eligibility-enforcing writes |
| Skipping failure model | Step 14 | Duplicates, partial writes, and race conditions not handled → production bugs | Every write path must answer: what happens on retry? On crash? On duplicate? |
Compact Procedure Checklist #
Use this to verify each step produced its required artifact.
Step 1 — Normalization
□ Every requirement rewritten as actor-operation-state
□ Hidden writes and projections surfaced
Step 2 — Object extraction
□ Primary objects identified with class
□ Derived views rejected with reason
□ Every object passes ownership/evolution/ordering/derivability tests
Step 3 — Axis assignment
□ Every primary object has ownership, evolution, ordering
□ Every ordering bound to a scope
Step 4 — Invariant extraction
□ Every requirement maps to typed, testable invariants
□ Concurrency considered for every eligibility invariant
Step 5 — DP derivation
□ Every invariant cluster maps to a DP
□ No DP is a technology name
Step 6 — Mechanism selection
□ 6.1: Every DP classified by invariant type
□ 6.2: Ownership × evolution table applied to select mechanism family
□ 6.3: Q1-Q5 applied to select specific implementation
□ 6.4: Required mechanism combinations identified (CAS + idempotency, etc.)
Step 7 — Axiomatic validation
□ Source-of-truth table complete
□ Dependency table complete; no circular dependencies
□ Every projection has upstream source and rebuild path
Step 8 — Algorithm design
□ Every write path has pseudocode
□ Every process object has complete state machine
□ Idempotency, retry, and rejection explicitly handled for every path
Step 9 — Logical data model
□ Every primary object has schema
□ Every derived view is a separate table marked as projection
□ Partition keys derived from invariant scope
□ Dedup keys explicit
Step 10 — Technology landscape
□ Every DP maps to capability → shape → specific product
□ Every correctness-critical invariant has named enforcing technology
Step 11 — Deployment topology
□ Service boundaries justified
□ Partition keys aligned with invariant scope
□ Failure domains classified by correctness criticality
Step 12 — Consistency model
□ Every write and read path classified (strong/eventual/bounded)
□ Correctness-critical paths are strong
□ Staleness bounds explicit for eventual paths
Step 13 — Scaling model
□ Scale type identified (write/contention/read/fanout/aggregation)
□ Hotspot keys identified
□ Strategy per hotspot
Step 14 — Failure model
□ Every write path: retry semantics explicit
□ Every aggregate: duplicate event handling explicit
□ Every component: correctness impact on failure classified
Step 15 — SLOs
□ Latency SLOs per path
□ Correctness SLOs per invariant
□ Freshness SLOs per projection
□ Latency budget allocated across hops
Step 16 — Operational parameters
□ Every tunable threshold named with valid range and effect
Step 17 — Runbooks
□ Every failure mode from Step 14 has a runbook
□ Every hotspot from Step 13 has a mitigation runbook
Step 18 — Observability
□ Every SLO is measurable
□ Every failure mode is detectable
□ Alerts defined for every SLO threshold
Step 19 — Cost model
□ Growth driver named per component
□ Dominant cost identified
Step 20 — Evolution
□ Stage 1 (minimum correct) defined
□ Upgrade signal per stage defined
□ Irreversible data model decisions flagged