Table of Contents
Local Delivery Service: System Design #
A complete derivation of a local grocery/item delivery system (Instacart / DoorDash Grocery pattern) using the 20-step derivation framework — from functional requirements to operable architecture, with every mechanism derived mechanically from invariants.
Ordering Principle #
Every layer constrains the next. Skipping a layer makes later choices arbitrary.
Functional requirements
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6) ← the mechanical bridge
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Step 1 — Problem Normalization #
Functional Requirements (Raw) #
- Customer browses items available at nearby stores within their delivery zone.
- Customer places an order for multiple items, possibly from one store.
- System assigns a shopper/courier to fulfill the order.
- Customer tracks shopper location in real-time during fulfillment.
- Shopper reports item unavailability and can propose substitutes; customer approves or rejects.
- Order status updates in real-time: accepted → shopping → picked → in-transit → delivered.
- Customer is charged only after delivery is confirmed (authorization at placement, capture at delivery).
- Items are reserved at order placement — no oversell is permitted.
Normalized Table #
| # | Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|---|
| 1 | Customer browses nearby items | Customer | read projection | AvailableItemView (derived from Item + InventoryReservation + DeliveryZone) |
| 1b | (hidden write) | System | evaluate zone eligibility | DeliveryZone membership check |
| 2a | Customer places order | Customer | create process object | Order + OrderItem records |
| 2b | (hidden write) | System | attempt conditional update per item | InventoryReservation per item (CAS on stock) |
| 3 | System assigns shopper | System | attempt conditional claim | Shopper.status state machine (multi-writer one-winner) |
| 4 | Customer tracks shopper | Customer | read derived view (live) | LocationPingView (derived from LocationPing stream) |
| 4b | (hidden write) | Shopper device | append event | LocationPing (immutable) |
| 5a | Shopper reports unavailability | Shopper | append event | SubstitutionRequest created |
| 5b | Customer approves/rejects substitute | Customer | transition state machine | SubstitutionRequest.status |
| 6 | Order status updates in real-time | System | transition state machine + push | Order.status |
| 7a | Payment authorization at order placement | System | create + authorize | Payment record + PSP auth call |
| 7b | Payment capture at delivery | System | transition + capture | Payment.status (authorized → captured) |
| 8 | Items reserved at order time | System | conditional write | InventoryReservation (quantity locked) |
Key Seams Exposed by Normalization #
- FR1 vs FR8: There is a race between “show available stock” (read projection) and “reserve stock” (conditional write). These are separate operations with a window between them. CAS on stock quantity is required.
- FR2 vs FR3: Order creation and shopper assignment are not atomic — shopper assignment is a downstream step triggered after order is persisted.
- FR5: Substitution is not an update to an OrderItem. It creates a new SubstitutionRequest object with its own approval lifecycle.
- FR7: Payment is a two-step process (auth then capture) separated by the entire fulfillment duration — potentially hours. Payment authorization expiry is a real failure mode.
Step 2 — Object Extraction #
Candidates and Classification #
Primary Objects #
| Object | Class | Rationale |
|---|---|---|
| Store | Stable entity | Long-lived; mutable (hours, address, active status); owned by operator |
| Item | Stable entity | Long-lived per store; mutable (price, stock count, description); owned by store operator |
| InventoryReservation | Process object | Lifecycle: pending → confirmed → released | fulfilled; created at order time, released on cancel/delivery |
| Order | Process object | Central lifecycle object; state machine drives the entire fulfillment pipeline |
| OrderItem | Relationship object | Links Order to Item with quantity; immutable once placed (substitution creates new record, not overwrite) |
| Shopper | Stable entity | Long-lived; mutable status (available / on_order / offline); location updated continuously |
| LocationPing | Event | Immutable point-in-time GPS record from shopper device |
| SubstitutionRequest | Intent / future constraint | Persists until customer approves or rejects; has expiry deadline |
| Payment | Process object | State machine: pending → authorized → captured | refunded | failed |
| DeliveryZone | Stable entity | Geo-boundary polygon; stable; written by operators only |
Derived / Rejected Objects #
| Candidate | Problem | Fix |
|---|---|---|
| AvailableStock (counter) | Derivable from initial_stock - sum(reservations) + sum(cancellations) | Either a derived view OR the single mutable counter that IS the source of truth (with CAS) — not both |
| ShopperLocationView | Derivable from latest LocationPing per shopper | Derived view; materialized in Redis Geo for spatial query |
| OrderStatusHistory | Derivable from Order state machine audit log | Derived view; project from Order event log |
| EstimatedDeliveryTime | Derivable from shopper location + store distance + traffic | Derived view; computed on read |
| CustomerCart | UI artifact (client-side state until order placed) | Not a domain object; lives in client session only |
Four Purity Tests Applied #
InventoryReservation #
- Ownership: System creates on order placement; system transitions on fulfillment or cancellation. Single writer pipeline. PASS.
- Evolution: State machine (pending → confirmed → released | fulfilled). Does not append, does not merge. PASS.
- Ordering: Reservations ordered by creation time within one item (to determine which get confirmed if stock runs out). PASS.
- Non-derivability: Cannot reconstruct which orders hold locks on which quantities without this object. PASS.
LocationPing #
- Ownership: Shopper device is the sole writer. PASS.
- Evolution: Append-only, immutable once written. PASS.
- Ordering: Ordered by timestamp within one shopper_id. PASS.
- Non-derivability: Raw GPS history cannot be derived from elsewhere. PASS.
SubstitutionRequest #
- Ownership: Shopper creates; customer transitions (approve/reject); system can expire. Multi-actor but pipeline-ordered (shopper writes first, customer responds). PASS.
- Evolution: State machine (pending_approval → approved | rejected | expired). PASS.
- Ordering: At most one active SubstitutionRequest per OrderItem at a time (uniqueness invariant). PASS.
- Non-derivability: Customer intent (approve/reject) cannot be derived from other state. PASS.
Step 3 — Axis Assignment #
Per-Object Axis Table #
| Object | Ownership | Evolution | Ordering scope |
|---|---|---|---|
| Store | Single writer per store (operator) | Overwrite | N/A (singleton per store_id) |
| Item | Single writer per item (store operator) | Overwrite | N/A (singleton per item_id) |
| InventoryReservation | System-only (order service) | State machine | By creation time within item_id |
| Order | Multi-stage pipeline (customer creates; system advances) | State machine | By status-transition time within order_id |
| OrderItem | Single writer (order service at placement) | Append-only (immutable after creation) | By position within order_id |
| Shopper | Multi-writer one-winner (assignment service claims) | State machine (availability + overwrite location) | N/A for status; by timestamp for location |
| LocationPing | Single writer per shopper (device) | Append-only | By device timestamp within shopper_id |
| SubstitutionRequest | Multi-actor pipeline | State machine | By creation time within order_item_id |
| Payment | System-only pipeline | State machine | By transition time within order_id |
| DeliveryZone | Single writer (operator/admin) | Overwrite | N/A (singleton per zone_id) |
Critical Axis Combinations #
Shopper (multi-writer one-winner + state machine): Multiple assignment service instances may simultaneously attempt to claim the same shopper for different orders. Only one claim can succeed. This combination resolves to a Lease (SETNX in Redis on shopper_id), as derived in Step 6.
InventoryReservation (system-only + state machine, but race at creation): The creation step is a conditional write — “create reservation only if available_quantity > 0.” Multiple concurrent orders may race for the same item. This is effectively multi-writer one-winner at creation, resolving to CAS on a quantity counter.
Order (multi-stage pipeline + state machine): No two services should advance the same order state simultaneously. Transitions must be guarded by CAS on (current_state, version).
Step 4 — Invariant Extraction #
Invariant Table #
| ID | Type | Statement | Objects involved |
|---|---|---|---|
| I1 | Eligibility | A customer may only order items from stores within their delivery zone | Customer, Store, DeliveryZone, Order |
| I2 | Eligibility | An order may only be placed if every item in the order has available_quantity ≥ requested_quantity at placement time | Item, InventoryReservation, Order |
| I3 | Ordering | The sum of all active reservations for an item must never exceed the item’s current stock | InventoryReservation, Item |
| I4 | Accounting | available_quantity = initial_stock − confirmed_reservations + released_reservations | Item, InventoryReservation |
| I5 | Uniqueness/Idempotency | At most one active InventoryReservation may exist per (order_id, item_id) | InventoryReservation |
| I6 | Ordering | Order status transitions must follow the legal state machine: pending → accepted → shopping → picked → in_transit → delivered; cancellation from pending or accepted only | Order |
| I7 | Uniqueness/Idempotency | At most one shopper may be assigned to an order at any time | Order, Shopper |
| I8 | Uniqueness/Idempotency | At most one Shopper record may be in on_order status per order | Shopper |
| I9 | Ordering | An order may not transition to delivered without all OrderItems having a resolution (fulfilled or substituted or removed with consent) | Order, OrderItem, SubstitutionRequest |
| I10 | Accounting | A Payment may only be captured after the corresponding Order has reached delivered status | Payment, Order |
| I11 | Accounting | A Payment capture amount must not exceed the authorized amount (plus allowed variance for substitutions) | Payment |
| I12 | Access-control | Only the assigned shopper may advance Order status from shopping → picked → in_transit → delivered | Order, Shopper |
| I13 | Access-control | Only the customer who placed the order may approve or reject a SubstitutionRequest | SubstitutionRequest, Customer |
| I14 | Propagation | When an Order is cancelled, all active InventoryReservations for that order must be released | Order, InventoryReservation |
| I15 | Propagation | When a SubstitutionRequest expires without approval, the OrderItem must be automatically removed and any excess reservation released | SubstitutionRequest, OrderItem, InventoryReservation |
| I16 | Eligibility | A SubstitutionRequest may only be created for an item that is in confirmed_reservation state | SubstitutionRequest, InventoryReservation |
| I17 | Idempotency | Payment authorization and capture operations must be idempotent — retrying must not double-charge | Payment |
| I18 | Uniqueness | At most one active SubstitutionRequest per (order_id, item_id) | SubstitutionRequest |
Step 5 — Design Point (DP) Derivation #
Each DP is the minimal mechanism that enforces its invariant cluster, derived bottom-up from invariant types.
| DP | Invariant cluster | Enforcing mechanism |
|---|---|---|
| DP1 | I1 | Zone eligibility check at order placement; store’s zone_id must be in customer’s eligible zones. Enforced at Order service write path. |
| DP2 | I2, I3, I4, I5 | Atomic CAS on item stock counter. Reservation creation is conditional on counter being positive. Counter is the single source of truth for available stock. |
| DP3 | I6 | Order state machine with guarded transitions. CAS on (order_id, current_status) — only allowed transitions accepted. |
| DP4 | I7, I8 | Distributed lease on shopper_id. Assignment service acquires SETNX on shopper_id:lock with TTL. Only one winner per shopper per time window. |
| DP5 | I9 | Pre-condition check before Order → delivered transition. All OrderItems must be in resolved state. |
| DP6 | I10, I11 | Payment capture gated on Order.status = delivered. Amount bounded by authorized amount ± substitution delta. |
| DP7 | I12 | Access-control check: Order service validates caller identity = order.assigned_shopper_id before allowing shopping-phase transitions. |
| DP8 | I13 | Access-control check: SubstitutionRequest service validates caller identity = order.customer_id before approve/reject. |
| DP9 | I14 | Saga compensation: on Order cancellation, fan-out release messages to each InventoryReservation. Each release is idempotent. |
| DP10 | I15 | Deadline-based scheduler: SubstitutionRequest TTL enforced by a delayed job (Redis sorted set or Kafka delay topic). On expiry, compensation removes OrderItem and releases reservation. |
| DP11 | I16 | Pre-condition check at SubstitutionRequest creation: item’s InventoryReservation must be in confirmed state. |
| DP12 | I17 | Idempotency keys on all PSP calls. Key = (order_id, operation_type). PSP returns cached result for duplicate key. |
| DP13 | I18 | Unique constraint on (order_id, item_id) in SubstitutionRequest table with status IN (pending_approval). |
Step 6 — Mechanism Selection #
The mechanical bridge maps invariant clusters and axis assignments to concrete implementation mechanisms.
6.1 Invariant Type → Mechanism Family #
| Invariant type | Mechanism family |
|---|---|
| Eligibility | Precondition check at write path (synchronous gate) |
| Ordering (count bound) | CAS on mutable counter (optimistic locking) |
| Accounting (derived value) | Single authoritative counter OR event-sourced derivation — pick one; never two |
| Uniqueness/Idempotency | Unique constraint + idempotency key store |
| Propagation (cross-object) | Saga (decomposable) or 2PC (non-decomposable) |
| Access-control | Identity check at service boundary |
6.2 Ownership × Evolution → Concurrency Mechanism #
| Object | Ownership | Evolution | Derived mechanism |
|---|---|---|---|
| InventoryReservation (creation) | Multi-writer, one winner | State machine | CAS on quantity counter + unique constraint on (order_id, item_id) |
| Order | Pipeline | State machine | CAS on (status, version) |
| Shopper (assignment) | Multi-writer, one winner | State machine | Lease (SETNX on shopper_id) |
| SubstitutionRequest | Multi-actor pipeline | State machine | CAS on (status, version) |
| Payment | System pipeline | State machine | CAS on (status, version) + Idempotency Key on PSP calls |
| LocationPing | Single writer per shopper | Append-only | No concurrency needed; idempotent by content |
| Order cancellation → Reservation release | Cross-service | Propagation | Saga with compensation |
6.3 Full Mechanical Derivation for Four Key Flows #
Derivation 1: Inventory Reservation (Multi-Item Atomicity) #
Problem statement: When a customer places an order with N items, all N items must be reserved or none. If item 3 is out of stock after items 1 and 2 are reserved, the partial reservation must be rolled back.
Q1 (Scope): Is this within one service or cross-service?
- The inventory check and reservation creation happen within the Order service’s write path, but reservation records may span multiple store-partitions. This is cross-object but within-service for the reservation logic. However, compensation (unreserve) crosses logical boundaries. → Cross-service decomposable (compensation is well-defined: “release reservation”).
Q2 (Failure): What happens if the writer crashes mid-reservation (e.g., after reserving items 1-2 but before attempting item 3)?
- The reservation records for items 1-2 exist in a confirmed-pending state. No compensating transaction has been applied. → Saga with compensation is required. Each step must be independently compensable. → Idempotency Key on each reservation step (keyed by order_id + item_id) ensures that retries after crash do not double-reserve.
Q3 (Data): Is the stock state commutative?
- No. “Reserve 5 units” and “Reserve 3 units” applied concurrently when only 5 are available must not both succeed. Not commutative under scarcity. → CAS on quantity counter (not CRDT).
Q4 (Access): Read pattern for stock availability?
- Read » Write (millions of customers browse, few place orders). → Cache-Aside for browse reads (AvailableItemView cached in Redis). The reservation write path hits Postgres directly (cache is advisory; Postgres is authoritative).
Q5 (Coupling): How does reservation propagate to the shopper view?
- Shopper needs to see which items are reserved (i.e., what to pick). → Outbox + Relay: Order service writes reservation to Postgres + Outbox atomically; relay publishes to Kafka; Shopper service consumes. Avoids dual-write.
Resulting mechanism combination:
Saga (per-item steps)
+ CAS on quantity counter (prevents oversell at each step)
+ Idempotency Key per step (keyed by order_id:item_id:reserve)
+ Compensation (release reservation) on Saga rollback
+ Outbox + Relay (propagate reservation to downstream consumers)
Required by 6.4: CAS + Idempotency Key — satisfied. (Saga does not require Lease; Lease applies to shopper assignment below.)
Derivation 2: Shopper Assignment #
Problem statement: When an order is accepted, the system must claim exactly one available shopper. Multiple assignment service instances may race to claim the same shopper for different orders.
Q1 (Scope): Within-service or cross-service?
- Assignment is within the Assignment service, but shopper state is shared across all instances. → Distributed CAS (not cross-region; no CRDT because not commutative).
Q2 (Failure): What if the assignment service instance crashes after claiming the shopper but before persisting the assignment to Order?
- Shopper remains locked (claimed) but Order has no shopper_id. The shopper lease must expire so another instance can re-claim. → Lease (SETNX with TTL). Lease TTL = assignment processing timeout (e.g., 30s). If Order is not updated within TTL, lease expires and shopper is returned to available pool.
Q3 (Data): Is shopper availability commutative?
- No. “Claim shopper X” by two different orders — only one should win. → Not CRDT. Lease is correct.
Q4 (Access): How is the nearest available shopper found?
- Geospatial query: find shoppers within N km of store, sorted by distance. → Spatial Partition + Scatter-Gather. Redis Geo index on shopper locations; GEORADIUS query to find candidates; claim attempt in order of proximity.
Q5 (Coupling): How does assignment propagate to Order?
- Assignment service must atomically: (a) acquire lease on shopper, (b) update Order.shopper_id, (c) update Shopper.status = on_order. Steps b and c must both succeed or both roll back. → Saga with compensation: if Order update fails, release shopper lease. If Shopper status update fails (shopper went offline), release lease and retry with next candidate.
Resulting mechanism combination:
Lease (Redis SETNX on "shopper:{shopper_id}:claim" with TTL=30s)
+ fencing token (lease token included in all subsequent writes; stale writes rejected)
+ Saga (multi-step assignment: lease → Order.shopper_id → Shopper.status)
+ Spatial query (Redis GEORADIUS) for candidate generation
+ CAS on Order.status (pending → accepted) to prevent duplicate assignment
Required by 6.4: Lease + fencing token — satisfied.
Derivation 3: Substitution Approval Workflow #
Problem statement: When a shopper finds an item unavailable, they propose a substitute. The customer must approve or reject within a time window. If the window expires, the item is dropped automatically. This is an intent/future-constraint object with a deadline.
Q1 (Scope): Within-service or cross-service?
- SubstitutionRequest is created by Shopper service; approved/rejected by Customer service; expired by Scheduler service. Cross-service. → Cross-service decomposable (approval and rejection are compensable in both directions).
Q2 (Failure): What if the deadline expires and the scheduler crashes before processing the expiry?
- The SubstitutionRequest remains in pending_approval state indefinitely. → Deadline enforced by a durable delayed job: scheduled via Redis sorted set (score = expiry_unix_timestamp) + a sweeper that runs every 10s and processes expired requests. The sweeper is idempotent (CAS on SubstitutionRequest.status from pending_approval → expired). If sweeper crashes mid-way, it retries; CAS prevents double-processing.
Q3 (Data): Can multiple SubstitutionRequests for the same (order_id, item_id) exist simultaneously?
- No. I18 prohibits this. → Unique constraint on (order_id, item_id) filtered to status = pending_approval. Database enforces; service layer validates.
Q4 (Access): Customer must see the substitution request in real-time (push notification + UI update).
- → Fan-out on Write: when SubstitutionRequest is created, publish to Kafka → push notification service → APNs/FCM. Also pub/sub channel per order_id for live UI update (WebSocket).
Q5 (Coupling): On approval, the OrderItem must be updated (replace item_id + update reservation). On rejection, the InventoryReservation for the original item must be released.
- → Outbox + Relay: SubstitutionRequest service writes to DB + Outbox atomically; relay publishes event to Kafka; downstream services (InventoryReservation service, OrderItem service) consume and apply their respective changes. Each downstream step is idempotent (keyed by substitution_request_id + action).
Resulting mechanism combination:
SubstitutionRequest state machine (pending_approval → approved | rejected | expired)
+ CAS on (status, version) for all transitions
+ Unique constraint on (order_id, item_id) for active requests
+ Durable deadline: Redis ZSET (key=substitution_request_id, score=expiry_ts)
+ Idempotent sweeper (CAS on status prevents double-expiry)
+ Outbox + Relay (propagate approval/rejection to InventoryReservation + OrderItem)
+ Fan-out on Write (push notification + WebSocket on creation)
Derivation 4: Payment Authorization and Capture #
Problem statement: Payment is authorized at order placement (hold on card) but captured only after delivery confirmation. The authorization and capture are separate PSP calls, potentially hours apart. Authorization can expire (typically 7 days for most card networks). Double-capture must be impossible.
Q1 (Scope): Cross-service (Payment service → PSP external call). → Cross-service non-decomposable at PSP level (PSP calls cannot be Saga-compensated within milliseconds; compensation = refund, which is a different operation). Within the system, Payment state transitions are a state machine.
Q2 (Failure):
- PSP call for authorization times out: retry with same idempotency key → PSP returns cached result or processes once. → Idempotency Key on all PSP calls (key = order_id:auth or order_id:capture).
- System crashes after PSP confirms authorization but before persisting Payment.status = authorized: on restart, query PSP with idempotency key to recover status, then persist. → Idempotency Key + recovery reconciliation job.
- Authorization expiry: if delivery takes > 7 days (edge case), auth expires. → Scheduler: daily job checks Payment records in authorized state older than 6 days and triggers re-authorization.
Q3 (Data): Is payment state commutative?
- No. Capture must happen exactly once. Not commutative. → CAS on Payment.status (authorized → captured). Two concurrent capture attempts → only one CAS succeeds.
Q4 (Access): Payment reads are infrequent (per-order); PSP calls are the bottleneck.
- No special caching needed. Payment record is small; read from Postgres.
Q5 (Coupling): Capture must be triggered by Order reaching delivered status. How is this propagated?
- → CDC (Change Data Capture): Order service updates Order.status = delivered in Postgres; Debezium publishes change event to Kafka; Payment service consumes and initiates capture. This ensures the trigger is the DB row change, not a dual-write.
Resulting mechanism combination:
Payment state machine (pending → authorized → captured | refunded | failed)
+ CAS on (status, version) for all transitions
+ Idempotency Key on all PSP calls (key = order_id:operation)
+ CDC (Debezium on Order table) → Kafka → Payment capture trigger
+ Auth expiry scheduler (daily reconciliation job)
+ Recovery reconciliation (on startup, query PSP for in-flight payments)
6.4 Required Combinations Check #
| Combination | Applied where | Satisfied? |
|---|---|---|
| CAS + Idempotency Key | Inventory reservation (CAS on counter + idempotency key per step) | Yes |
| CAS + Idempotency Key | Payment capture (CAS on status + idempotency key on PSP calls) | Yes |
| Lease + fencing token | Shopper assignment (SETNX lease + token in downstream writes) | Yes |
Step 7 — Axiomatic Validation (Source-of-Truth Table) #
Rule: Each piece of state has exactly one source of truth. No dual truth.
| State | Single source of truth | How derived views are computed | Dual-truth risk |
|---|---|---|---|
| Item.available_quantity | Postgres Item.stock_count counter (decremented by CAS on reservation) | AvailableItemView in Redis cache derived from Postgres | RISK: cache must not be used as authority for reservation decisions. Cache is advisory only. Write path always reads from Postgres with FOR UPDATE. |
| Order.status | Postgres Order table | OrderStatusView in Redis/Kafka for real-time push | No dual-truth: Postgres is authority; Redis/Kafka are read projections |
| Shopper.current_location | Redis Geo index (latest ping) | LocationPing history in Cassandra | No dual-truth: Redis is live view; Cassandra is history. They serve different queries. |
| Shopper.status | Redis Lease (on_order = lease held) + Postgres Shopper table | Shopper availability view | RISK: Redis lease and Postgres must agree. Reconciliation: on lease expiry, system must reset Shopper.status in Postgres. Fencing token prevents stale Postgres write. |
| Payment.status | Postgres Payment table | N/A | No derived view; single source |
| SubstitutionRequest.status | Postgres SubstitutionRequest table | Redis ZSET for deadline index (advisory) | No dual-truth: ZSET is a deadline index, not a status authority. Status in Postgres is canonical. |
| InventoryReservation.status | Postgres InventoryReservation table | N/A | Single source |
Dual-Truth Resolution Log #
Item.available_quantity: Resolution — cache is advisory only. All reservation write paths use
SELECT FOR UPDATEon Postgres item row. Cache is populated by Postgres read after write; cache miss falls through to Postgres.Shopper.status (Redis vs Postgres): Resolution — Postgres Shopper.status is updated only with the valid fencing token from the Redis lease. On lease expiry, a lease-expiry handler (not the expired holder) updates Postgres.status = available using a system-level write (no fencing token required because the lease has expired and no other writer holds it).
Step 8 — Algorithm Design #
Algorithm 1: Multi-Item Order Placement (Saga) #
function placeOrder(customer_id, store_id, items[{item_id, quantity}]):
# Step 0: Eligibility check
zone_ok = checkDeliveryZone(customer_id, store_id)
if not zone_ok: raise EligibilityError
# Step 1: Create Order in pending state
order_id = generateUUID()
idempotency_key = hash(customer_id + cart_fingerprint)
existing = lookupIdempotencyKey(idempotency_key)
if existing: return existing # idempotent replay
order = INSERT INTO orders (id, customer_id, store_id, status='pending', version=0)
persistIdempotencyKey(idempotency_key, order_id)
# Step 2: Authorize payment (async, but must succeed before reservations)
auth_result = authorizePayment(customer_id, estimatedTotal(items),
idempotency_key=order_id+":auth")
if auth_result.failed:
markOrder(order_id, status='cancelled', reason='payment_auth_failed')
return Error("payment_authorization_failed")
INSERT INTO payments (order_id, status='authorized', auth_token=auth_result.token,
authorized_amount=auth_result.amount)
# Step 3: Reserve items (Saga — each step compensable)
reserved = []
for item in items:
step_key = order_id + ":" + item.item_id + ":reserve"
if lookupIdempotencyKey(step_key):
reserved.append(item)
continue # already reserved (retry path)
# CAS on stock counter
result = executeSQL("""
UPDATE items
SET stock_count = stock_count - :qty,
version = version + 1
WHERE item_id = :item_id
AND stock_count >= :qty
AND version = :expected_version
""", item_id=item.item_id, qty=item.quantity,
expected_version=item.read_version)
if result.rows_affected == 0:
# Saga rollback: release all previously reserved items
for prev in reserved:
releaseReservation(order_id, prev.item_id) # idempotent
markOrder(order_id, status='cancelled', reason='item_unavailable:'+item.item_id)
voidPaymentAuth(order_id, idempotency_key=order_id+":void")
return Error("item_out_of_stock", item_id=item.item_id)
INSERT INTO inventory_reservations (
order_id, item_id, quantity, status='confirmed'
)
persistIdempotencyKey(step_key, "confirmed")
reserved.append(item)
# Step 4: Advance order to accepted
CAS_update(orders, order_id,
expected_status='pending', new_status='accepted',
expected_version=0, new_version=1)
publishToOutbox(order_id, event='order_accepted', payload=order)
return order_id
function releaseReservation(order_id, item_id):
# Idempotent: safe to call multiple times
reservation = SELECT ... WHERE order_id=order_id AND item_id=item_id
if reservation.status == 'released': return # already done
UPDATE items SET stock_count = stock_count + reservation.quantity
WHERE item_id = item_id
UPDATE inventory_reservations SET status='released'
WHERE order_id=order_id AND item_id=item_id AND status != 'released'
Algorithm 2: Shopper Assignment #
function assignShopper(order_id, store_id):
order = getOrder(order_id)
assert order.status == 'accepted'
MAX_ATTEMPTS = 5
attempts = 0
while attempts < MAX_ATTEMPTS:
# Find nearby available shoppers
store_location = getStoreLocation(store_id)
candidates = GEORADIUS(key='shoppers:available',
lat=store_location.lat, lon=store_location.lon,
radius=10km, sort=ASC, count=10)
for shopper_id in candidates:
lease_key = "shopper:" + shopper_id + ":claim"
fencing_token = generateFencingToken()
# Attempt to acquire lease
acquired = SETNX(lease_key,
value=json(order_id=order_id, token=fencing_token),
EX=30) # 30s TTL
if not acquired: continue # already claimed by another order
# Verify shopper is still available in Postgres (lease acquired)
shopper = SELECT ... FROM shoppers WHERE id=shopper_id FOR UPDATE
if shopper.status != 'available':
DEL(lease_key) # release lease — stale Redis state
continue
# Saga step 1: Update Order with shopper
order_updated = CAS_update(orders, order_id,
expected_status='accepted', new_status='shopping',
expected_version=order.version,
new_version=order.version+1,
new_shopper_id=shopper_id,
new_fencing_token=fencing_token)
if not order_updated:
DEL(lease_key) # compensate
return # concurrent assignment won — OK
# Saga step 2: Update Shopper status (with fencing token validation)
UPDATE shoppers SET status='on_order', assigned_order_id=order_id
WHERE id=shopper_id
AND status='available'
AND (last_fencing_token IS NULL OR last_fencing_token < fencing_token)
# Lease remains held; will be released when order reaches delivered/cancelled
publishToOutbox(order_id, event='shopper_assigned', shopper_id=shopper_id)
return order_id
attempts++
sleep(exponential_backoff(attempts))
# No shopper available — requeue order for retry
publishDelayed(order_id, event='assignment_retry', delay=60s)
Algorithm 3: Substitution Approval Flow #
function proposeSubstitution(shopper_id, order_id, item_id, substitute_item_id, reason):
# Verify shopper is assigned to this order
order = getOrder(order_id)
assert order.assigned_shopper_id == shopper_id
assert order.status == 'shopping'
# Verify original item reservation is active
reservation = getReservation(order_id, item_id)
assert reservation.status == 'confirmed'
# Uniqueness check (also enforced by DB constraint)
existing = SELECT ... FROM substitution_requests
WHERE order_id=order_id AND item_id=item_id AND status='pending_approval'
if existing: raise UniqueConstraintError
expiry = now() + 10 minutes # configurable window
sub_req = INSERT INTO substitution_requests (
id=generateUUID(), order_id, item_id, substitute_item_id, reason,
status='pending_approval', version=0, expires_at=expiry
)
# Register deadline in Redis ZSET
ZADD('substitution:deadlines', expiry.unix_timestamp, sub_req.id)
# Outbox event → Kafka → push notification + WebSocket
publishToOutbox(sub_req.id, event='substitution_proposed',
payload={order_id, item_id, substitute_item_id, expiry})
return sub_req.id
function approveSubstitution(customer_id, substitution_request_id):
sub = getSubstitutionRequest(substitution_request_id) FOR UPDATE
assert sub.status == 'pending_approval'
assert sub.expires_at > now()
assert getOrder(sub.order_id).customer_id == customer_id # access control
# CAS transition
CAS_update(substitution_requests, sub.id,
expected_status='pending_approval', new_status='approved',
expected_version=sub.version, new_version=sub.version+1)
# Remove deadline from Redis ZSET
ZREM('substitution:deadlines', sub.id)
# Outbox → Kafka → downstream effects:
# - InventoryReservation service: release old item reservation, create new
# - OrderItem service: update item_id to substitute
publishToOutbox(sub.id, event='substitution_approved',
payload={order_id=sub.order_id, old_item=sub.item_id,
new_item=sub.substitute_item_id})
function sweepExpiredSubstitutions():
# Runs every 10 seconds
now_ts = unix_timestamp()
expired_ids = ZRANGEBYSCORE('substitution:deadlines', 0, now_ts, LIMIT 100)
for sub_id in expired_ids:
sub = getSubstitutionRequest(sub_id) FOR UPDATE
if sub.status != 'pending_approval':
ZREM('substitution:deadlines', sub_id)
continue # already resolved
# Idempotent CAS
updated = CAS_update(substitution_requests, sub_id,
expected_status='pending_approval', new_status='expired',
expected_version=sub.version)
if updated:
ZREM('substitution:deadlines', sub_id)
publishToOutbox(sub_id, event='substitution_expired',
payload={order_id=sub.order_id, item_id=sub.item_id})
# Downstream: release InventoryReservation for original item
# remove OrderItem from order
Algorithm 4: Payment Capture #
function capturePayment(order_id):
# Triggered by CDC event: Order.status changed to 'delivered'
payment = SELECT ... FROM payments WHERE order_id=order_id FOR UPDATE
if payment.status != 'authorized':
log("Ignoring capture trigger — payment status: " + payment.status)
return # idempotent: already captured or not authorizable
# Calculate final amount (may differ from auth due to substitutions/removals)
final_amount = calculateFinalAmount(order_id)
assert final_amount <= payment.authorized_amount * 1.10 # max 10% variance
# Call PSP with idempotency key
capture_result = psp.capture(
auth_token=payment.auth_token,
amount=final_amount,
idempotency_key=order_id + ":capture"
)
if capture_result.success:
CAS_update(payments, payment.id,
expected_status='authorized', new_status='captured',
expected_version=payment.version,
new_captured_amount=final_amount,
captured_at=now())
else if capture_result.auth_expired:
# Authorization expired — re-authorize then capture
reauth_result = psp.authorize(
customer_payment_method=payment.payment_method_token,
amount=final_amount,
idempotency_key=order_id + ":reauth"
)
if reauth_result.success:
psp.capture(auth_token=reauth_result.token, amount=final_amount,
idempotency_key=order_id + ":capture:v2")
UPDATE payments SET status='captured', ... WHERE id=payment.id
else:
UPDATE payments SET status='failed', failure_reason='reauth_declined'
# Alert: manual review required
publishAlert('payment_capture_failed', order_id=order_id)
else:
# Transient PSP error — exponential retry (max 3 attempts over 5 minutes)
scheduleRetry('capture_payment', order_id=order_id, attempt=1, delay=30s)
Order State Machine #
States: pending → accepted → shopping → substitution_pending → picked → in_transit → delivered
└──────────────────────────────────────────────────────────────────────────────────→ cancelled
Legal transitions:
pending → accepted (system: shopper assigned)
pending → cancelled (customer: within cancellation window)
accepted → shopping (shopper: arrived at store, started shopping)
accepted → cancelled (customer: before shopper starts)
shopping → substitution_pending (system: when SubstitutionRequest created)
substitution_pending → shopping (system: when all pending substitutions resolved)
shopping → picked (shopper: all items confirmed picked)
picked → in_transit (shopper: left store with order)
in_transit → delivered (shopper: confirmed delivery)
any non-terminal → cancelled (system: forced cancellation with compensation)
Note: substitution_pending is a synthetic state — the Order may have multiple concurrent
SubstitutionRequests. The state returns to 'shopping' when all are resolved.
Guard: Order may only transition to picked when zero SubstitutionRequests are in pending_approval state.
Step 9 — Logical Data Model #
Schema #
-- Stable entities
CREATE TABLE delivery_zones (
zone_id UUID PRIMARY KEY,
name TEXT NOT NULL,
boundary GEOMETRY(POLYGON, 4326) NOT NULL, -- PostGIS
active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index: GIST(boundary) for zone containment queries
CREATE TABLE stores (
store_id UUID PRIMARY KEY,
name TEXT NOT NULL,
zone_id UUID NOT NULL REFERENCES delivery_zones(zone_id),
location GEOMETRY(POINT, 4326) NOT NULL,
active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index: store_id (PK), zone_id, GIST(location)
CREATE TABLE items (
item_id UUID PRIMARY KEY,
store_id UUID NOT NULL REFERENCES stores(store_id),
name TEXT NOT NULL,
price_cents INTEGER NOT NULL,
stock_count INTEGER NOT NULL DEFAULT 0 CHECK (stock_count >= 0),
version BIGINT NOT NULL DEFAULT 0,
active BOOLEAN DEFAULT TRUE,
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Partition key: store_id (logical partition; physical by store_id range or hash)
-- Index: (store_id, active) for browse queries
-- Note: stock_count is the single source of truth for available inventory.
-- It is decremented by CAS at reservation time.
CREATE TABLE shoppers (
shopper_id UUID PRIMARY KEY,
name TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'offline'
CHECK (status IN ('offline', 'available', 'on_order')),
assigned_order_id UUID,
zone_id UUID REFERENCES delivery_zones(zone_id),
last_fencing_token TEXT,
version BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (status, zone_id) for availability queries
-- Process objects
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
store_id UUID NOT NULL REFERENCES stores(store_id),
shopper_id UUID REFERENCES shoppers(shopper_id),
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','accepted','shopping',
'substitution_pending','picked',
'in_transit','delivered','cancelled')),
cancellation_reason TEXT,
version BIGINT NOT NULL DEFAULT 0,
idempotency_key TEXT UNIQUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Partition key: customer_id (most queries scoped to customer)
-- Index: (customer_id, created_at DESC), (shopper_id, status), (status, created_at)
CREATE TABLE order_items (
order_item_id UUID PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(order_id),
item_id UUID NOT NULL REFERENCES items(item_id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price_cents INTEGER NOT NULL, -- price at order time, snapshot
resolution TEXT DEFAULT 'pending'
CHECK (resolution IN ('pending','fulfilled','substituted','removed')),
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (order_id) — always queried by order
CREATE TABLE inventory_reservations (
reservation_id UUID PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(order_id),
item_id UUID NOT NULL REFERENCES items(item_id),
quantity INTEGER NOT NULL,
status TEXT NOT NULL DEFAULT 'confirmed'
CHECK (status IN ('confirmed','released','fulfilled')),
idempotency_key TEXT UNIQUE, -- order_id:item_id:reserve
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (order_id, item_id) -- enforces I5
);
-- Partition key: item_id (for stock aggregation queries)
-- Index: (order_id), (item_id, status)
CREATE TABLE substitution_requests (
substitution_id UUID PRIMARY KEY,
order_id UUID NOT NULL REFERENCES orders(order_id),
item_id UUID NOT NULL REFERENCES items(item_id),
substitute_item_id UUID NOT NULL REFERENCES items(item_id),
shopper_reason TEXT,
status TEXT NOT NULL DEFAULT 'pending_approval'
CHECK (status IN ('pending_approval','approved','rejected','expired')),
version BIGINT NOT NULL DEFAULT 0,
expires_at TIMESTAMPTZ NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Unique constraint: only one pending per (order_id, item_id)
-- Enforced by partial unique index:
CREATE UNIQUE INDEX uq_substitution_active
ON substitution_requests(order_id, item_id)
WHERE status = 'pending_approval';
-- Index: (expires_at) for sweeper queries, (order_id)
CREATE TABLE payments (
payment_id UUID PRIMARY KEY,
order_id UUID NOT NULL UNIQUE REFERENCES orders(order_id),
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','authorized','captured','refunded','failed')),
authorized_amount INTEGER, -- in cents
captured_amount INTEGER,
auth_token TEXT, -- PSP auth token
payment_method_token TEXT, -- tokenized card reference
failure_reason TEXT,
version BIGINT NOT NULL DEFAULT 0,
authorized_at TIMESTAMPTZ,
captured_at TIMESTAMPTZ,
auth_expires_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (order_id) unique, (status, authorized_at) for expiry reconciliation
-- Events (append-only, high volume)
-- LocationPing stored in Cassandra (see Step 10)
-- Schema shown here for reference:
-- CREATE TABLE location_pings (
-- shopper_id UUID,
-- ts TIMESTAMP,
-- lat DOUBLE,
-- lon DOUBLE,
-- accuracy_m FLOAT,
-- PRIMARY KEY ((shopper_id), ts)
-- ) WITH CLUSTERING ORDER BY (ts DESC);
-- Outbox table (per-service, for transactional outbox pattern)
CREATE TABLE outbox_events (
event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_id UUID NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
published BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (published, created_at) for relay queries
-- Idempotency key store
CREATE TABLE idempotency_keys (
idem_key TEXT PRIMARY KEY,
result_ref TEXT NOT NULL, -- order_id or reservation_id
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- TTL managed by background cleanup job (keys older than 24h deleted)
Step 10 — Technology Landscape #
Technology Selection Table #
| Component | Technology | Justification |
|---|---|---|
| Primary relational store | PostgreSQL 16 with PostGIS | ACID transactions for CAS; PostGIS for delivery zone geo queries; partial indexes for substitution uniqueness; row-level locking (SELECT FOR UPDATE) |
| Shopper lease + availability index | Redis 7 Cluster | SETNX with TTL for leases; GEORADIUS for spatial queries; Pub/Sub for live order status per customer session |
| Real-time location index | Redis Geo (GEOADD, GEORADIUS) | O(N+log M) radius query; TTL-expired entries auto-purge shoppers who go offline |
| Location ping history | Apache Cassandra | Time-series append-only workload; partition by shopper_id, cluster by timestamp; wide-row model fits perfectly; LSM tree optimizes write-heavy load |
| Event streaming | Apache Kafka | Durable ordered log; topics: order-events, location-stream, inventory-events, substitution-events, payment-events; supports CDC relay, outbox relay, notification fan-out |
| CDC relay | Debezium | Tails Postgres WAL; publishes row changes to Kafka without application dual-write |
| Push notifications | FCM / APNs (via Kafka consumer) | Fan-out from substitution-events and order-events Kafka topics |
| Item images | S3 + CloudFront | Static binary blobs; CDN for global low-latency serve |
| API gateway | Kong / AWS API Gateway | Auth, rate limiting, routing |
| Service mesh | Istio / Linkerd | mTLS, circuit breakers, retries between internal services |
| Shopper app real-time | WebSocket (via API gateway) | Bi-directional; shopper sends location pings, receives order updates |
| Customer app real-time | WebSocket + SSE fallback | Customer receives order status + substitution requests |
| Scheduler (delayed jobs) | Redis ZSET + sweeper workers | Substitution expiry deadlines; auth expiry reminders |
| Monitoring | Prometheus + Grafana | Metrics; Jaeger for distributed traces |
| Logging | ELK stack (Elasticsearch + Logstash + Kibana) | Structured log aggregation |
Why Not Alternatives #
| Alternative | Why rejected |
|---|---|
| DynamoDB for Orders | Multi-item transactions with CAS across multiple tables require expensive DynamoDB transactions (TransactWriteItems); Postgres row-level locking is simpler and cheaper at this scale |
| MongoDB | No partial unique indexes until recently; CAS less ergonomic; PostGIS integration poor |
| CRDT for inventory | Inventory is not commutative under scarcity — CAS is required |
| Zookeeper for leases | Operationally heavier than Redis for this workload; Redis SETNX is sufficient |
| Single Postgres for location pings | Write throughput: 1M+ pings/minute cannot be sustained by Postgres; Cassandra’s LSM tree is the right tool |
Step 11 — Deployment Topology #
Service Decomposition #
┌─────────────────────────────────────────────────────────────────────────┐
│ API Gateway (Kong) │
│ TLS termination, auth, rate limiting, routing │
└──────┬────────────┬──────────────┬─────────────┬────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Customer│ │ Order │ │ Shopper │ │ Payment │
│ Service │ │ Service │ │ Service │ │ Service │
│ │ │ │ │ │ │ │
│ Profile │ │ Place │ │ Location │ │ Auth/Capture │
│ Auth │ │ Status │ │ Tracking │ │ Reconcile │
└────┬────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────────┐
│ Postgres (Primary) │
│ orders, order_items, payments, customers, shoppers, │
│ inventory_reservations, substitution_requests, outbox │
│ Read replicas for browse/reporting queries │
└────────────────────────┬─────────────────────────────────┘
│ Debezium CDC
▼
┌──────────────────────────────────────────────────────────┐
│ Apache Kafka (3 brokers) │
│ Topics: order-events, location-stream, inventory-events │
│ substitution-events, payment-events, outbox-relay│
└──────┬────────────────┬─────────────────┬────────────────┘
│ │ │
▼ ▼ ▼
┌───────────┐ ┌─────────────┐ ┌─────────────────┐
│Notification│ │ Analytics │ │ Outbox Relay │
│ Service │ │ Consumer │ │ (Debezium sink) │
│ FCM/APNs │ │ (Spark/Flink│ │ │
└───────────┘ └─────────────┘ └─────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Redis Cluster │
│ - Shopper leases (SETNX) │
│ - Shopper geo index (GEOADD) │
│ - Substitution deadline ZSET │
│ - Order status pub/sub channels │
│ - Available item cache (read-through) │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ Cassandra Cluster (3 nodes) │
│ - LocationPing history │
│ Partition: shopper_id; Cluster: timestamp DESC │
└──────────────────────────────────────────────────────────┘
Regions and Replication #
- Primary region: Active-active multi-AZ within one region (3 AZs).
- Postgres: Primary + 2 synchronous replicas (for failover). Read replicas for browse queries. Streaming replication.
- Redis Cluster: 3 primary shards × 1 replica per shard = 6 nodes across 3 AZs.
- Kafka: 3 brokers, replication factor 3, min.insync.replicas=2.
- Cassandra: Replication factor 3, LOCAL_QUORUM reads and writes.
- Multi-region: Phase 2 evolution (see Step 20). Initially single-region.
Container Layout #
Each service is a separate Kubernetes Deployment:
- Order Service: 6-20 pods (HPA on CPU + pending order queue depth)
- Shopper Service: 4-12 pods (HPA on active shoppers × location ping rate)
- Assignment Service: 3-6 pods (HPA on assignment queue depth)
- Payment Service: 3-6 pods (stateless; PSP calls are the bottleneck)
- Notification Service: 3-6 pods (fan-out workload)
- Outbox Relay: 2 pods (low volume; single-writer per outbox shard)
- Substitution Sweeper: 2 pods (leader-elected; single active at a time)
Step 12 — Consistency Model #
Per-Path Classification #
| Path | Consistency model | Justification |
|---|---|---|
| Inventory reservation (write) | Linearizable | CAS with SELECT FOR UPDATE on Postgres primary. No stale read permitted — oversell is a hard business invariant. |
| Inventory browse (read) | Eventual (cache-assisted) | Cache may be up to 30s stale. Fine for browse; write path is linearizable. |
| Order status read | Read-your-writes (at minimum) | Customer reads from Postgres replica with session consistency guarantee. Or: writes propagate to Redis pub/sub < 500ms; customer WebSocket sees near-real-time. |
| Order status write (transition) | Linearizable | CAS on (status, version) on Postgres primary. |
| Shopper location (live view) | Eventual (< 5s lag) | LocationPing → Kafka → Redis Geo update. Customers see location that is at most one ping interval stale. Acceptable for navigation. |
| Shopper location (history) | Eventually consistent | Cassandra LOCAL_QUORUM on write. Read from Cassandra for history. |
| Shopper assignment claim | Linearizable (within Redis) | Redis SETNX is atomic. The subsequent Postgres writes use fencing token to reject stale writes. |
| Payment authorization | Linearizable | Postgres write + PSP call. |
| Payment capture | Linearizable | CAS on Payment.status + PSP call with idempotency key. |
| Substitution approval | Linearizable | CAS on SubstitutionRequest.status on Postgres. |
| Substitution expiry | Monotonic (sweep-based) | Sweeper applies idempotent CAS; processes each expired request exactly once due to ZRANGEBYSCORE + CAS guard. |
Step 13 — Scaling Model #
Traffic Estimation #
| Metric | Assumption | Peak value |
|---|---|---|
| DAU | 1M customers, 10% place orders/day | 100K orders/day |
| Orders/hour (peak) | 3× average, evening peak | ~20K orders/hour |
| Items/order | Average 10 items | 200K item reservations/hour |
| LocationPing rate | 10K active shoppers × 1 ping/10s | 1,000 pings/second |
| Browse QPS | 1M DAU × 20 browses/day / 86400 | ~230 QPS baseline; 5× peak = 1,150 QPS |
| Payment operations | 1 auth + 1 capture per order | ~40K PSP calls/hour at peak |
Scaling Dimensions #
| Dimension | Bottleneck | Mitigation |
|---|---|---|
| Inventory CAS writes | High-contention items (popular grocery SKUs) | Shard by item_id; Postgres row-level locking scoped to one row per item; CAS contention is localized |
| Browse reads | Read amplification | Cache in Redis (30s TTL); serve from Postgres read replicas for cache misses |
| Location pings | Write throughput to Cassandra | Cassandra scales horizontally; Kafka buffers bursts; Cassandra ring hash by shopper_id |
| Shopper geo queries | Redis GEORADIUS at scale | Redis Cluster sharded by zone_id; each zone’s shoppers indexed separately |
| Order event fan-out | Notification delivery latency | Kafka partitioned by order_id; notification workers scale horizontally |
| Assignment contention | Many orders competing for few shoppers | Exponential backoff; assignment is queued, not synchronous |
| Payment PSP | PSP rate limits | Connection pooling; retry with exponential backoff; circuit breaker on PSP |
Hotspot Analysis #
- Popular item at single store: stock_count row is a hotspot. At extreme scale (Black Friday), use a reservation queue (Kafka) rather than synchronous CAS — serialize reservation requests per item through a single partition consumer. This is a Phase 2 mitigation.
- Shopper in high-density area: Redis Geo shard by zone; zone is the partition boundary. One zone’s shoppers live on one Redis shard.
- Kafka partition hotspot: Order events partitioned by order_id (random distribution). Location stream partitioned by shopper_id (even distribution by shopper count, not geography).
Step 14 — Failure Model #
Failure Enumeration #
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Order service pod crash mid-Saga (reservation partially complete) | Health check → Kubernetes restart | Items 1..k reserved; items k+1..N not; Order in pending | On pod restart: idempotency key lookup resumes Saga at last unconfirmed step; if order is in pending state older than 5min, compensate and cancel |
| Redis lease holder crashes before persisting shopper assignment to Postgres | Lease TTL expires (30s) | Shopper locked but order has no shopper_id | After TTL, assignment service retries; shopper returned to available pool |
| Postgres primary failure | Streaming replication monitoring; auto-failover via Patroni | Writes blocked during failover (~30s) | Replica promoted; application reconnects; in-flight writes retry with idempotency keys |
| Kafka broker failure | Kafka replication (factor=3, ISR=2) | No data loss if ISR maintained; 1-broker loss tolerated | Automatic leader re-election; producers retry with idempotency |
| Cassandra node failure (1 of 3) | Gossip failure detection | Location ping writes succeed (quorum = 2); history reads succeed | Cassandra hinted handoff; node replacement; repair job |
| PSP timeout during payment auth | HTTP timeout + circuit breaker | Order placement blocked | Retry with idempotency key (same key = PSP returns cached result); max 3 retries over 60s; order cancelled if all fail |
| PSP timeout during payment capture | CDC trigger retry | Payment remains authorized | Outbox relay re-triggers capture; idempotency key prevents double capture; alert if capture fails after 3 retries |
| Payment auth expiry before delivery | Scheduled reconciliation job | Capture fails with auth_expired | Re-authorize then capture; if customer card declines, alert + flag order |
| Substitution sweeper crash | Pod health check; Kubernetes restart | Expired substitutions not processed; order stuck in substitution_pending | Sweeper is stateless + idempotent; next sweep processes all overdue entries; ZSET persists in Redis |
| Network partition between services | Circuit breaker (Hystrix/Resilience4j) | Degraded functionality | Fail-open for reads (serve cached data); fail-closed for writes (reject until partition healed) |
| Shopper app location ping failure | Shopper device connectivity loss | Location view goes stale | Last known location displayed; staleness indicator shown after 30s; assignment not affected |
| Duplicate Kafka message delivery | Kafka at-least-once semantics | Double-processing of events | All consumers implement idempotent processing (CAS on state or unique constraint) |
Step 15 — SLOs #
Service Level Objectives #
| Metric | SLO | Measurement window |
|---|---|---|
| Order placement P99 latency | < 3s (includes payment auth + N item reservations) | Rolling 5-minute window |
| Order placement P50 latency | < 800ms | Rolling 5-minute window |
| Shopper assignment time (P95) | < 60s from order accepted | Per-order measurement |
| Location update latency (device → customer view) | < 5s P95 | Sampled end-to-end trace |
| Substitution push notification delivery | < 10s P95 from shopper submission | Per-event measurement |
| Payment capture P99 latency | < 30s from delivered event | Per-payment measurement |
| Inventory reservation accuracy | Zero oversell events | Continuous invariant monitoring |
| Order service availability | 99.9% (< 8.7h downtime/year) | Rolling 30-day window |
| Payment service availability | 99.95% (< 4.4h downtime/year) | Rolling 30-day window |
| Browse API availability | 99.5% (cache-backed; acceptable degraded mode) | Rolling 30-day window |
| Data durability | 99.999% (no order data loss) | All-time |
Error Budget #
- Order service: 43.8 minutes downtime per month.
- Payment service: 21.9 minutes downtime per month.
- Burn rate alerting: if error budget burns > 5× expected rate for 1 hour → page on-call.
Step 16 — Operational Parameters #
Tunable Parameters #
| Parameter | Default | Min | Max | Effect of increasing |
|---|---|---|---|---|
| inventory_reservation.ttl_pending_minutes | 20 | 5 | 60 | Longer window before abandoned orders release stock; increases risk of stock lockout for other customers |
| shopper_assignment.lease_ttl_seconds | 30 | 10 | 120 | Longer lease → more time for multi-step assignment saga; higher risk of shopper being locked if assignment service crashes |
| shopper_assignment.search_radius_km | 10 | 1 | 50 | Larger radius finds more shoppers but increases drive time |
| substitution_request.approval_window_minutes | 10 | 2 | 30 | Longer window → better customer response rate; longer order hold |
| substitution_sweeper.interval_seconds | 10 | 5 | 60 | More frequent sweeping → faster expiry processing; higher Redis ZRANGEBYSCORE load |
| payment.auth_preexpiry_days | 6 | 1 | 7 | Alert/reauth triggered this many days before auth expiry |
| item_cache.ttl_seconds | 30 | 5 | 300 | Higher TTL → lower Postgres read load; higher risk of stale stock display |
| assignment.max_attempts | 5 | 1 | 20 | More attempts before falling back to delayed retry queue |
| kafka.consumer.max_poll_records | 500 | 1 | 2000 | Higher → more throughput per consumer poll; higher memory pressure |
Circuit Breaker Parameters #
| Circuit | Failure threshold | Wait duration | Half-open probe count |
|---|---|---|---|
| PSP auth calls | 50% failure in 10s window | 60s | 3 |
| PSP capture calls | 50% failure in 10s window | 60s | 3 |
| Cassandra writes (location) | 30% failure in 5s window | 30s | 5 |
| Redis (lease operations) | 40% failure in 5s window | 15s | 3 |
Step 17 — Runbooks #
Runbook 1: Inventory Oversell Detection #
Trigger: Alert fires: inventory_negative_stock metric > 0 for any item_id.
Root cause: CAS guard bypassed (bug) OR stock_count initialized incorrectly OR compensating transaction double-applied.
Steps:
- Query:
SELECT item_id, stock_count FROM items WHERE stock_count < 0. Identify affected items. - Query:
SELECT * FROM inventory_reservations WHERE item_id = :item_id AND status = 'confirmed'. Sum quantities. - Compare sum(reservations) with initial_stock (from audit log). Identify discrepancy.
- If stock_count < 0: immediate lock — set item.active = false to prevent further orders.
- Identify affected orders (reservations referencing this item_id). Triage: which orders are still in shopping phase vs picked/delivered.
- For picked/delivered orders: allow completion; adjust stock manually.
- For shopping orders: contact shoppers; may need to cancel items.
- Root cause analysis: enable DEBUG logging on item CAS path; replay recent reservations from Kafka topic.
- Fix stock_count:
UPDATE items SET stock_count = max(0, initial_stock - confirmed_reservations). - Re-enable item:
UPDATE items SET active = true WHERE item_id = :item_id.
Runbook 2: Shopper Lease Stuck (Shopper Permanently in on_order) #
Trigger: Alert: shopper_id has been in on_order status for > 4 hours with no location ping for > 30 minutes.
Steps:
- Verify:
SELECT * FROM shoppers WHERE shopper_id = :id. Check assigned_order_id. - Check order status:
SELECT status FROM orders WHERE order_id = :assigned_order_id. - Check Redis lease:
GET shopper:{shopper_id}:claim. If key exists, note TTL. - If order is delivered/cancelled but shopper status is still on_order: this is a missed status reset.
- Check outbox_events for the relevant order transition — was it published? Was the Shopper service consumer lagging?
- Manual fix:
UPDATE shoppers SET status='available', assigned_order_id=NULL WHERE shopper_id=:id. DEL shopper:{shopper_id}:claimif key still exists.
- If order is in active state but shopper is unreachable: attempt to contact shopper via ops tooling. If genuinely lost, escalate to manual delivery or reassignment.
- For reassignment: cancel current assignment (system-level), trigger new assignment saga for the order.
Runbook 3: Payment Capture Failure #
Trigger: Alert: payment_capture_failed event, or Payment record in authorized state for > 24h after Order.status = delivered.
Steps:
- Query:
SELECT * FROM payments WHERE order_id = :order_id. - Check PSP portal for auth_token status. Is auth still valid?
- If auth expired: manually trigger re-authorization via internal admin API (
POST /admin/payments/:id/reauth). This uses the payment_method_token to issue a new auth. - If PSP is degraded: check circuit breaker status in Grafana. If circuit is open, wait for PSP recovery; circuit will auto-retry.
- If customer card declined during re-auth: flag order for finance team. Send customer notification. Create manual review ticket.
- If capture succeeds eventually: close ticket.
- Post-mortem: if multiple orders affected simultaneously → PSP incident. Coordinate with PSP support.
Runbook 4: Substitution Sweeper Backlog #
Trigger: Alert: substitution_deadlines_backlog (ZCARD of substitution:deadlines > 1000 entries with score < now).
Steps:
- Check sweeper pod health:
kubectl get pods -l app=substitution-sweeper. - If pod is crashed:
kubectl describe podto see crash reason. Check logs. Restart pod. - If sweeper is running but backlog is growing: check Postgres write latency. SubstitutionRequest CAS updates may be slow.
- Verify no lock contention:
SELECT * FROM pg_stat_activity WHERE wait_event_type='Lock'. - If backlog is non-critical (< 10 min delay): let sweeper process naturally; it will catch up.
- If critical (hour+ delay with customer-visible stuck orders): deploy emergency parallel sweeper with a different ZSET range split (e.g., two sweepers each taking half the score range). Coordinate via Redis distributed lock to prevent double-processing — or rely on CAS guard.
Step 18 — Observability #
Metrics (Prometheus) #
| Metric | Type | Labels | Alert threshold |
|---|---|---|---|
order_placement_duration_seconds | Histogram | status (success/failure), error_type | P99 > 3s |
inventory_reservation_cas_retries_total | Counter | item_id | Rate > 10/s per item → hotspot |
inventory_negative_stock_count | Gauge | item_id | > 0 → page immediately |
shopper_assignment_duration_seconds | Histogram | — | P95 > 60s |
shopper_assignment_attempts | Histogram | — | P99 > 4 attempts |
shopper_lease_acquisition_total | Counter | result (success/failure) | Failure rate > 50% |
location_ping_ingest_rate | Gauge | zone_id | Drop > 30% → shopper app issue |
location_ping_lag_seconds | Histogram | — | P95 > 10s → pipeline issue |
substitution_request_created_total | Counter | — | — |
substitution_request_expired_total | Counter | — | > 20% of created → approval UX issue |
substitution_sweeper_backlog | Gauge | — | > 500 → warn; > 2000 → page |
payment_authorization_duration_seconds | Histogram | psp_name | P99 > 5s |
payment_capture_success_total | Counter | — | — |
payment_capture_failure_total | Counter | failure_reason | > 0 → alert |
payment_auth_expiry_count | Gauge | — | > 0 → pre-expiry action needed |
kafka_consumer_lag | Gauge | topic, consumer_group | > 10,000 → alert |
circuit_breaker_state | Gauge | circuit_name, state | state=open → alert |
Distributed Tracing (Jaeger) #
Every request carries a trace-id propagated via HTTP headers and Kafka message headers. Key trace spans:
order.place→ includes child spans:zone_check,payment.authorize,inventory.reserve[item_id=X](one per item),order.status_transitionassignment.run→ includes:geo_query,lease.acquire[shopper_id=X],order.update_shopper,shopper.update_statussubstitution.propose→ includes:reservation.check,db.insert,deadline.register,notification.publishpayment.capture→ includes:order.verify_delivered,psp.capture_call,payment.status_transition
Sampling: 10% of requests traced by default; 100% for error paths and P99+ latency paths.
Structured Logs (ELK) #
Every log line includes: trace_id, service, pod_id, order_id (where applicable), level, timestamp, message, duration_ms (for external calls).
Key log events:
inventory_cas_failure: includesitem_id,requested_qty,available_qty,attempt_numbersaga_compensation_triggered: includesorder_id,step_failed,items_releasedshopper_lease_acquired: includesshopper_id,order_id,fencing_tokenshopper_lease_expired: includesshopper_id,previous_order_idpayment_capture_attempted: includesorder_id,amount_cents,idempotency_key,psp_response_codesubstitution_expired: includessubstitution_id,order_id,item_id,delay_seconds
Dashboards #
- Order Funnel Dashboard: placement rate, acceptance rate, cancellation rate by stage, P50/P95/P99 per stage.
- Inventory Health Dashboard: items with negative stock (should always be 0), reservation hot spots, CAS retry rates.
- Shopper Assignment Dashboard: assignment latency, lease acquisition success rate, available shopper count by zone.
- Payment Dashboard: auth success rate, capture success rate, PSP latency, auth expiry count.
- Real-time Ops Dashboard: active orders by status, active shoppers, pending substitutions, Kafka consumer lag.
Step 19 — Cost Model #
Infrastructure Cost Estimate (Monthly, AWS us-east-1) #
| Component | Configuration | Est. monthly cost |
|---|---|---|
| Postgres (RDS Multi-AZ) | db.r6g.2xlarge (primary + 1 replica), 500GB gp3 | $1,200 |
| Postgres read replicas (2×) | db.r6g.xlarge × 2 | $800 |
| Redis Cluster | 6× cache.r7g.large (3 primary + 3 replica) | $900 |
| Kafka (MSK) | 3× kafka.m5.xlarge brokers, 500GB storage each | $1,100 |
| Cassandra (EC2) | 3× r6g.2xlarge + 1TB gp3 each | $1,500 |
| EKS worker nodes | 10× m6g.2xlarge (mixed services) | $1,800 |
| S3 (item images) | 1TB storage + 50M GET requests | $100 |
| CloudFront (CDN) | 10TB egress | $850 |
| Data transfer | Internal VPC traffic | $300 |
| Monitoring (Prometheus/Grafana stack) | 2× t3.medium | $60 |
| Miscellaneous (ALB, Route53, NAT) | — | $400 |
| Total infrastructure | ~$9,010/month |
Variable Costs #
| Cost driver | Unit cost | At 100K orders/day |
|---|---|---|
| PSP transaction fee (auth) | $0.005/transaction | $150/day |
| PSP transaction fee (capture) | 2.9% + $0.30 per capture | Variable (depends on order value) |
| FCM/APNs pushes | Negligible (free tier) | $0 |
| Kafka message storage | $0.10/GB-month | ~$50/month at 1KB/msg × 10M msgs/day |
Cost per Order (Infrastructure only) #
At 100K orders/day = 3M orders/month:
- Infrastructure: $9,010 / 3,000,000 = $0.003/order (0.3 cents)
- PSP fees are the dominant variable cost (separate revenue model consideration)
Cost Optimization Levers #
- Cassandra: Move location pings older than 90 days to S3 (lifecycle policy). Cassandra cluster shrinks by 60% after 6 months.
- Read replicas: Scale down Postgres read replicas off-peak (scheduled RDS scaling).
- Kafka: Use S3 tiered storage (MSK Connect) for topics older than 7 days. Reduces broker storage cost by 70%.
- Item images: Use CloudFront origin shield to reduce S3 request costs.
Step 20 — Evolution Stages #
Stage 1: MVP (Months 0-3) #
Goal: Working system in one city, one delivery zone, ~1,000 orders/day.
What to build:
- Monolithic Order Service (handles placement + assignment + status)
- Postgres single instance (no replicas)
- Redis single node (no cluster)
- No Cassandra — location pings stored in Postgres (time-partitioned table)
- Basic FCM push (no WebSocket)
- Manual payment capture trigger (via internal admin UI)
- No Kafka — synchronous in-process event handling
What to defer:
- Cassandra migration (location pings)
- Kafka (event streaming)
- CDC / Debezium
- Substitution sweeper as a separate service
- Multi-zone, multi-city
Invariants that must hold even in MVP:
- I3 (no oversell) — CAS on stock_count must be implemented from day 1
- I7 (one shopper per order) — must work; use Postgres advisory lock if Redis not available
- I17 (idempotent payments) — idempotency keys on PSP calls from day 1
Stage 2: City Scale (Months 3-9) #
Goal: 5 cities, 10 zones, ~10K orders/day.
What to add:
- Extract Assignment Service from Order Service
- Add Kafka (3-broker cluster) — decouple notification fan-out
- Add Cassandra — migrate location pings from Postgres
- Add Redis Cluster — replace single node
- Add Postgres read replicas — separate browse reads
- Implement Outbox pattern — replace in-process event dispatch
- Add Debezium CDC — payment capture triggered by DB change
- Implement WebSocket for real-time customer updates
- Add circuit breakers on PSP calls
Migrations:
- Postgres location_pings table → Cassandra: dual-write for 2 weeks, then cut over reads, then drop table
- In-process events → Outbox + Kafka: feature-flagged per service, gradual rollout
Stage 3: National Scale (Months 9-24) #
Goal: 50 cities, 500 zones, ~100K orders/day, SLA 99.9%.
What to add:
- Horizontal Postgres sharding by zone_id (Citus or manual application-level sharding)
- Kafka partition expansion (increase to 50 partitions per topic)
- Cassandra cluster expansion to 9 nodes
- Redis Cluster expansion to 9 shards
- CQRS pattern — separate read model (Elasticsearch) for item search with full-text
- Auth expiry reconciliation job (scheduled daily)
- Full observability stack (Prometheus + Grafana + Jaeger + ELK)
- Chaos engineering (Chaos Monkey) — validate failure recovery
Stage 4: Multi-Region (Month 24+) #
Goal: < 100ms latency in each region; regional failure isolation; 99.95% SLA.
Design changes:
- Active-active multi-region with conflict resolution
- Order data: region-pinned (order belongs to the region where it was placed; no cross-region write)
- Shopper data: replicated read-only to other regions; writes go to home region
- Item catalog: replicated globally; writes in home region, CDC to all regions
- Payment: globally replicated Payment table; capture always goes to original auth region (PSP requires it)
- Kafka MirrorMaker 2 for cross-region topic replication
- Global API gateway (Cloudflare or AWS Global Accelerator) for geo-routing
What does NOT need cross-region coordination:
- Inventory reservation (inventory is physical — a store in Seattle doesn’t serve a customer in Austin)
- Shopper assignment (shoppers are local)
- Location pings (local to the order’s region)
What DOES need cross-region:
- Customer profile and payment method tokens (global read; home-region write)
- Item catalog (global read; store-local write)
- Analytics/reporting (global aggregation)
Evolution Decision Points #
| Trigger | Action |
|---|---|
| Location ping write latency P99 > 50ms on Postgres | Migrate to Cassandra |
| CAS contention rate > 20% on popular items | Implement per-item reservation queue (Kafka-serialized) |
| Shopper assignment P95 > 30s | Add predictive pre-assignment (assign before order is placed, based on cart signal) |
| Order placement P99 > 3s | Async reservation (optimistic placement + async confirmation); trade consistency for speed |
| PSP failure rate > 2% | Add secondary PSP fallback routing |
| Item browse cache miss rate > 10% | Increase Redis memory; add read replicas for cache warming |
Summary: Key Design Decisions and Their Derivations #
| Decision | Derived from | Invariant |
|---|---|---|
| CAS on stock_count (not a mutable available_quantity column derived from reservation sum) | Accounting invariant I4: one source of truth | I3, I4 |
| Saga for multi-item reservation (not 2PC) | Cross-service decomposable: each item reservation is independently compensable | I2, I14 |
| Redis SETNX Lease for shopper assignment | Multi-writer one-winner + state machine (Step 6.2) | I7, I8 |
| Lease + fencing token (not just Lease) | Step 6.4 required combination | I7 |
| Outbox + Relay (not dual-write) | Propagation invariant + atomicity: DB write and event publish must be atomic | I14, I15 |
| CDC for payment capture trigger | async-source-is-DB-row (Step 6.3 Q5) | I10 |
| Idempotency keys on all PSP calls | Failure model Q2: writer-crash recovery | I17 |
| SubstitutionRequest as separate object (not an update to OrderItem) | Evolution purity (Step 2): OrderItem is append-only/immutable; substitution has its own state machine | I13, I16, I18 |
| Redis ZSET for substitution deadlines | Intent/future constraint with expiry (Step 2 class) + idempotent sweeper | I15 |
| Cassandra for LocationPing | Append-only + single writer + time-series access pattern (Step 3 axes) | — |
| available_quantity cache is advisory only | Source-of-truth validation (Step 7): dual truth eliminated | I4 |
There's no articles to list here yet.