Skip to main content
  1. System Design Components/

Table of Contents

Local Delivery Service: System Design #

A complete derivation of a local grocery/item delivery system (Instacart / DoorDash Grocery pattern) using the 20-step derivation framework — from functional requirements to operable architecture, with every mechanism derived mechanically from invariants.


Ordering Principle #

Every layer constrains the next. Skipping a layer makes later choices arbitrary.

Functional requirements
  → normalize into operations over state          (Step 1)
  → extract primary objects                       (Step 2)
  → assign ownership, ordering, evolution         (Step 3)
  → extract invariants                            (Step 4)
  → derive minimal DPs from invariants            (Step 5)
  → select concrete mechanisms                    (Step 6)  ← the mechanical bridge
  → validate independence and source-of-truth     (Step 7)
  → specify exact algorithms                      (Step 8)
  → define logical data model                     (Step 9)
  → map to technology landscape                   (Step 10)
  → define deployment topology                    (Step 11)
  → classify consistency per path                 (Step 12)
  → identify scaling dimensions and hotspots      (Step 13)
  → enumerate failure modes                       (Step 14)
  → define SLOs                                   (Step 15)
  → define operational parameters                 (Step 16)
  → write runbooks                                (Step 17)
  → define observability                          (Step 18)
  → estimate costs                                (Step 19)
  → plan evolution                                (Step 20)

Step 1 — Problem Normalization #

Functional Requirements (Raw) #

  1. Customer browses items available at nearby stores within their delivery zone.
  2. Customer places an order for multiple items, possibly from one store.
  3. System assigns a shopper/courier to fulfill the order.
  4. Customer tracks shopper location in real-time during fulfillment.
  5. Shopper reports item unavailability and can propose substitutes; customer approves or rejects.
  6. Order status updates in real-time: accepted → shopping → picked → in-transit → delivered.
  7. Customer is charged only after delivery is confirmed (authorization at placement, capture at delivery).
  8. Items are reserved at order placement — no oversell is permitted.

Normalized Table #

#Original RequirementActorOperationState Touched
1Customer browses nearby itemsCustomerread projectionAvailableItemView (derived from Item + InventoryReservation + DeliveryZone)
1b(hidden write)Systemevaluate zone eligibilityDeliveryZone membership check
2aCustomer places orderCustomercreate process objectOrder + OrderItem records
2b(hidden write)Systemattempt conditional update per itemInventoryReservation per item (CAS on stock)
3System assigns shopperSystemattempt conditional claimShopper.status state machine (multi-writer one-winner)
4Customer tracks shopperCustomerread derived view (live)LocationPingView (derived from LocationPing stream)
4b(hidden write)Shopper deviceappend eventLocationPing (immutable)
5aShopper reports unavailabilityShopperappend eventSubstitutionRequest created
5bCustomer approves/rejects substituteCustomertransition state machineSubstitutionRequest.status
6Order status updates in real-timeSystemtransition state machine + pushOrder.status
7aPayment authorization at order placementSystemcreate + authorizePayment record + PSP auth call
7bPayment capture at deliverySystemtransition + capturePayment.status (authorized → captured)
8Items reserved at order timeSystemconditional writeInventoryReservation (quantity locked)

Key Seams Exposed by Normalization #

  • FR1 vs FR8: There is a race between “show available stock” (read projection) and “reserve stock” (conditional write). These are separate operations with a window between them. CAS on stock quantity is required.
  • FR2 vs FR3: Order creation and shopper assignment are not atomic — shopper assignment is a downstream step triggered after order is persisted.
  • FR5: Substitution is not an update to an OrderItem. It creates a new SubstitutionRequest object with its own approval lifecycle.
  • FR7: Payment is a two-step process (auth then capture) separated by the entire fulfillment duration — potentially hours. Payment authorization expiry is a real failure mode.

Step 2 — Object Extraction #

Candidates and Classification #

Primary Objects #

ObjectClassRationale
StoreStable entityLong-lived; mutable (hours, address, active status); owned by operator
ItemStable entityLong-lived per store; mutable (price, stock count, description); owned by store operator
InventoryReservationProcess objectLifecycle: pending → confirmed → released | fulfilled; created at order time, released on cancel/delivery
OrderProcess objectCentral lifecycle object; state machine drives the entire fulfillment pipeline
OrderItemRelationship objectLinks Order to Item with quantity; immutable once placed (substitution creates new record, not overwrite)
ShopperStable entityLong-lived; mutable status (available / on_order / offline); location updated continuously
LocationPingEventImmutable point-in-time GPS record from shopper device
SubstitutionRequestIntent / future constraintPersists until customer approves or rejects; has expiry deadline
PaymentProcess objectState machine: pending → authorized → captured | refunded | failed
DeliveryZoneStable entityGeo-boundary polygon; stable; written by operators only

Derived / Rejected Objects #

CandidateProblemFix
AvailableStock (counter)Derivable from initial_stock - sum(reservations) + sum(cancellations)Either a derived view OR the single mutable counter that IS the source of truth (with CAS) — not both
ShopperLocationViewDerivable from latest LocationPing per shopperDerived view; materialized in Redis Geo for spatial query
OrderStatusHistoryDerivable from Order state machine audit logDerived view; project from Order event log
EstimatedDeliveryTimeDerivable from shopper location + store distance + trafficDerived view; computed on read
CustomerCartUI artifact (client-side state until order placed)Not a domain object; lives in client session only

Four Purity Tests Applied #

InventoryReservation #

  1. Ownership: System creates on order placement; system transitions on fulfillment or cancellation. Single writer pipeline. PASS.
  2. Evolution: State machine (pending → confirmed → released | fulfilled). Does not append, does not merge. PASS.
  3. Ordering: Reservations ordered by creation time within one item (to determine which get confirmed if stock runs out). PASS.
  4. Non-derivability: Cannot reconstruct which orders hold locks on which quantities without this object. PASS.

LocationPing #

  1. Ownership: Shopper device is the sole writer. PASS.
  2. Evolution: Append-only, immutable once written. PASS.
  3. Ordering: Ordered by timestamp within one shopper_id. PASS.
  4. Non-derivability: Raw GPS history cannot be derived from elsewhere. PASS.

SubstitutionRequest #

  1. Ownership: Shopper creates; customer transitions (approve/reject); system can expire. Multi-actor but pipeline-ordered (shopper writes first, customer responds). PASS.
  2. Evolution: State machine (pending_approval → approved | rejected | expired). PASS.
  3. Ordering: At most one active SubstitutionRequest per OrderItem at a time (uniqueness invariant). PASS.
  4. Non-derivability: Customer intent (approve/reject) cannot be derived from other state. PASS.

Step 3 — Axis Assignment #

Per-Object Axis Table #

ObjectOwnershipEvolutionOrdering scope
StoreSingle writer per store (operator)OverwriteN/A (singleton per store_id)
ItemSingle writer per item (store operator)OverwriteN/A (singleton per item_id)
InventoryReservationSystem-only (order service)State machineBy creation time within item_id
OrderMulti-stage pipeline (customer creates; system advances)State machineBy status-transition time within order_id
OrderItemSingle writer (order service at placement)Append-only (immutable after creation)By position within order_id
ShopperMulti-writer one-winner (assignment service claims)State machine (availability + overwrite location)N/A for status; by timestamp for location
LocationPingSingle writer per shopper (device)Append-onlyBy device timestamp within shopper_id
SubstitutionRequestMulti-actor pipelineState machineBy creation time within order_item_id
PaymentSystem-only pipelineState machineBy transition time within order_id
DeliveryZoneSingle writer (operator/admin)OverwriteN/A (singleton per zone_id)

Critical Axis Combinations #

Shopper (multi-writer one-winner + state machine): Multiple assignment service instances may simultaneously attempt to claim the same shopper for different orders. Only one claim can succeed. This combination resolves to a Lease (SETNX in Redis on shopper_id), as derived in Step 6.

InventoryReservation (system-only + state machine, but race at creation): The creation step is a conditional write — “create reservation only if available_quantity > 0.” Multiple concurrent orders may race for the same item. This is effectively multi-writer one-winner at creation, resolving to CAS on a quantity counter.

Order (multi-stage pipeline + state machine): No two services should advance the same order state simultaneously. Transitions must be guarded by CAS on (current_state, version).


Step 4 — Invariant Extraction #

Invariant Table #

IDTypeStatementObjects involved
I1EligibilityA customer may only order items from stores within their delivery zoneCustomer, Store, DeliveryZone, Order
I2EligibilityAn order may only be placed if every item in the order has available_quantity ≥ requested_quantity at placement timeItem, InventoryReservation, Order
I3OrderingThe sum of all active reservations for an item must never exceed the item’s current stockInventoryReservation, Item
I4Accountingavailable_quantity = initial_stock − confirmed_reservations + released_reservationsItem, InventoryReservation
I5Uniqueness/IdempotencyAt most one active InventoryReservation may exist per (order_id, item_id)InventoryReservation
I6OrderingOrder status transitions must follow the legal state machine: pending → accepted → shopping → picked → in_transit → delivered; cancellation from pending or accepted onlyOrder
I7Uniqueness/IdempotencyAt most one shopper may be assigned to an order at any timeOrder, Shopper
I8Uniqueness/IdempotencyAt most one Shopper record may be in on_order status per orderShopper
I9OrderingAn order may not transition to delivered without all OrderItems having a resolution (fulfilled or substituted or removed with consent)Order, OrderItem, SubstitutionRequest
I10AccountingA Payment may only be captured after the corresponding Order has reached delivered statusPayment, Order
I11AccountingA Payment capture amount must not exceed the authorized amount (plus allowed variance for substitutions)Payment
I12Access-controlOnly the assigned shopper may advance Order status from shopping → picked → in_transit → deliveredOrder, Shopper
I13Access-controlOnly the customer who placed the order may approve or reject a SubstitutionRequestSubstitutionRequest, Customer
I14PropagationWhen an Order is cancelled, all active InventoryReservations for that order must be releasedOrder, InventoryReservation
I15PropagationWhen a SubstitutionRequest expires without approval, the OrderItem must be automatically removed and any excess reservation releasedSubstitutionRequest, OrderItem, InventoryReservation
I16EligibilityA SubstitutionRequest may only be created for an item that is in confirmed_reservation stateSubstitutionRequest, InventoryReservation
I17IdempotencyPayment authorization and capture operations must be idempotent — retrying must not double-chargePayment
I18UniquenessAt most one active SubstitutionRequest per (order_id, item_id)SubstitutionRequest

Step 5 — Design Point (DP) Derivation #

Each DP is the minimal mechanism that enforces its invariant cluster, derived bottom-up from invariant types.

DPInvariant clusterEnforcing mechanism
DP1I1Zone eligibility check at order placement; store’s zone_id must be in customer’s eligible zones. Enforced at Order service write path.
DP2I2, I3, I4, I5Atomic CAS on item stock counter. Reservation creation is conditional on counter being positive. Counter is the single source of truth for available stock.
DP3I6Order state machine with guarded transitions. CAS on (order_id, current_status) — only allowed transitions accepted.
DP4I7, I8Distributed lease on shopper_id. Assignment service acquires SETNX on shopper_id:lock with TTL. Only one winner per shopper per time window.
DP5I9Pre-condition check before Order → delivered transition. All OrderItems must be in resolved state.
DP6I10, I11Payment capture gated on Order.status = delivered. Amount bounded by authorized amount ± substitution delta.
DP7I12Access-control check: Order service validates caller identity = order.assigned_shopper_id before allowing shopping-phase transitions.
DP8I13Access-control check: SubstitutionRequest service validates caller identity = order.customer_id before approve/reject.
DP9I14Saga compensation: on Order cancellation, fan-out release messages to each InventoryReservation. Each release is idempotent.
DP10I15Deadline-based scheduler: SubstitutionRequest TTL enforced by a delayed job (Redis sorted set or Kafka delay topic). On expiry, compensation removes OrderItem and releases reservation.
DP11I16Pre-condition check at SubstitutionRequest creation: item’s InventoryReservation must be in confirmed state.
DP12I17Idempotency keys on all PSP calls. Key = (order_id, operation_type). PSP returns cached result for duplicate key.
DP13I18Unique constraint on (order_id, item_id) in SubstitutionRequest table with status IN (pending_approval).

Step 6 — Mechanism Selection #

The mechanical bridge maps invariant clusters and axis assignments to concrete implementation mechanisms.

6.1 Invariant Type → Mechanism Family #

Invariant typeMechanism family
EligibilityPrecondition check at write path (synchronous gate)
Ordering (count bound)CAS on mutable counter (optimistic locking)
Accounting (derived value)Single authoritative counter OR event-sourced derivation — pick one; never two
Uniqueness/IdempotencyUnique constraint + idempotency key store
Propagation (cross-object)Saga (decomposable) or 2PC (non-decomposable)
Access-controlIdentity check at service boundary

6.2 Ownership × Evolution → Concurrency Mechanism #

ObjectOwnershipEvolutionDerived mechanism
InventoryReservation (creation)Multi-writer, one winnerState machineCAS on quantity counter + unique constraint on (order_id, item_id)
OrderPipelineState machineCAS on (status, version)
Shopper (assignment)Multi-writer, one winnerState machineLease (SETNX on shopper_id)
SubstitutionRequestMulti-actor pipelineState machineCAS on (status, version)
PaymentSystem pipelineState machineCAS on (status, version) + Idempotency Key on PSP calls
LocationPingSingle writer per shopperAppend-onlyNo concurrency needed; idempotent by content
Order cancellation → Reservation releaseCross-servicePropagationSaga with compensation

6.3 Full Mechanical Derivation for Four Key Flows #


Derivation 1: Inventory Reservation (Multi-Item Atomicity) #

Problem statement: When a customer places an order with N items, all N items must be reserved or none. If item 3 is out of stock after items 1 and 2 are reserved, the partial reservation must be rolled back.

Q1 (Scope): Is this within one service or cross-service?

  • The inventory check and reservation creation happen within the Order service’s write path, but reservation records may span multiple store-partitions. This is cross-object but within-service for the reservation logic. However, compensation (unreserve) crosses logical boundaries. → Cross-service decomposable (compensation is well-defined: “release reservation”).

Q2 (Failure): What happens if the writer crashes mid-reservation (e.g., after reserving items 1-2 but before attempting item 3)?

  • The reservation records for items 1-2 exist in a confirmed-pending state. No compensating transaction has been applied. → Saga with compensation is required. Each step must be independently compensable. → Idempotency Key on each reservation step (keyed by order_id + item_id) ensures that retries after crash do not double-reserve.

Q3 (Data): Is the stock state commutative?

  • No. “Reserve 5 units” and “Reserve 3 units” applied concurrently when only 5 are available must not both succeed. Not commutative under scarcity. → CAS on quantity counter (not CRDT).

Q4 (Access): Read pattern for stock availability?

  • Read » Write (millions of customers browse, few place orders). → Cache-Aside for browse reads (AvailableItemView cached in Redis). The reservation write path hits Postgres directly (cache is advisory; Postgres is authoritative).

Q5 (Coupling): How does reservation propagate to the shopper view?

  • Shopper needs to see which items are reserved (i.e., what to pick). → Outbox + Relay: Order service writes reservation to Postgres + Outbox atomically; relay publishes to Kafka; Shopper service consumes. Avoids dual-write.

Resulting mechanism combination:

Saga (per-item steps)
  + CAS on quantity counter (prevents oversell at each step)
  + Idempotency Key per step (keyed by order_id:item_id:reserve)
  + Compensation (release reservation) on Saga rollback
  + Outbox + Relay (propagate reservation to downstream consumers)

Required by 6.4: CAS + Idempotency Key — satisfied. (Saga does not require Lease; Lease applies to shopper assignment below.)


Derivation 2: Shopper Assignment #

Problem statement: When an order is accepted, the system must claim exactly one available shopper. Multiple assignment service instances may race to claim the same shopper for different orders.

Q1 (Scope): Within-service or cross-service?

  • Assignment is within the Assignment service, but shopper state is shared across all instances. → Distributed CAS (not cross-region; no CRDT because not commutative).

Q2 (Failure): What if the assignment service instance crashes after claiming the shopper but before persisting the assignment to Order?

  • Shopper remains locked (claimed) but Order has no shopper_id. The shopper lease must expire so another instance can re-claim. → Lease (SETNX with TTL). Lease TTL = assignment processing timeout (e.g., 30s). If Order is not updated within TTL, lease expires and shopper is returned to available pool.

Q3 (Data): Is shopper availability commutative?

  • No. “Claim shopper X” by two different orders — only one should win. → Not CRDT. Lease is correct.

Q4 (Access): How is the nearest available shopper found?

  • Geospatial query: find shoppers within N km of store, sorted by distance. → Spatial Partition + Scatter-Gather. Redis Geo index on shopper locations; GEORADIUS query to find candidates; claim attempt in order of proximity.

Q5 (Coupling): How does assignment propagate to Order?

  • Assignment service must atomically: (a) acquire lease on shopper, (b) update Order.shopper_id, (c) update Shopper.status = on_order. Steps b and c must both succeed or both roll back. → Saga with compensation: if Order update fails, release shopper lease. If Shopper status update fails (shopper went offline), release lease and retry with next candidate.

Resulting mechanism combination:

Lease (Redis SETNX on "shopper:{shopper_id}:claim" with TTL=30s)
  + fencing token (lease token included in all subsequent writes; stale writes rejected)
  + Saga (multi-step assignment: lease → Order.shopper_id → Shopper.status)
  + Spatial query (Redis GEORADIUS) for candidate generation
  + CAS on Order.status (pending → accepted) to prevent duplicate assignment

Required by 6.4: Lease + fencing token — satisfied.


Derivation 3: Substitution Approval Workflow #

Problem statement: When a shopper finds an item unavailable, they propose a substitute. The customer must approve or reject within a time window. If the window expires, the item is dropped automatically. This is an intent/future-constraint object with a deadline.

Q1 (Scope): Within-service or cross-service?

  • SubstitutionRequest is created by Shopper service; approved/rejected by Customer service; expired by Scheduler service. Cross-service. → Cross-service decomposable (approval and rejection are compensable in both directions).

Q2 (Failure): What if the deadline expires and the scheduler crashes before processing the expiry?

  • The SubstitutionRequest remains in pending_approval state indefinitely. → Deadline enforced by a durable delayed job: scheduled via Redis sorted set (score = expiry_unix_timestamp) + a sweeper that runs every 10s and processes expired requests. The sweeper is idempotent (CAS on SubstitutionRequest.status from pending_approval → expired). If sweeper crashes mid-way, it retries; CAS prevents double-processing.

Q3 (Data): Can multiple SubstitutionRequests for the same (order_id, item_id) exist simultaneously?

  • No. I18 prohibits this. → Unique constraint on (order_id, item_id) filtered to status = pending_approval. Database enforces; service layer validates.

Q4 (Access): Customer must see the substitution request in real-time (push notification + UI update).

  • Fan-out on Write: when SubstitutionRequest is created, publish to Kafka → push notification service → APNs/FCM. Also pub/sub channel per order_id for live UI update (WebSocket).

Q5 (Coupling): On approval, the OrderItem must be updated (replace item_id + update reservation). On rejection, the InventoryReservation for the original item must be released.

  • Outbox + Relay: SubstitutionRequest service writes to DB + Outbox atomically; relay publishes event to Kafka; downstream services (InventoryReservation service, OrderItem service) consume and apply their respective changes. Each downstream step is idempotent (keyed by substitution_request_id + action).

Resulting mechanism combination:

SubstitutionRequest state machine (pending_approval → approved | rejected | expired)
  + CAS on (status, version) for all transitions
  + Unique constraint on (order_id, item_id) for active requests
  + Durable deadline: Redis ZSET (key=substitution_request_id, score=expiry_ts)
  + Idempotent sweeper (CAS on status prevents double-expiry)
  + Outbox + Relay (propagate approval/rejection to InventoryReservation + OrderItem)
  + Fan-out on Write (push notification + WebSocket on creation)

Derivation 4: Payment Authorization and Capture #

Problem statement: Payment is authorized at order placement (hold on card) but captured only after delivery confirmation. The authorization and capture are separate PSP calls, potentially hours apart. Authorization can expire (typically 7 days for most card networks). Double-capture must be impossible.

Q1 (Scope): Cross-service (Payment service → PSP external call). → Cross-service non-decomposable at PSP level (PSP calls cannot be Saga-compensated within milliseconds; compensation = refund, which is a different operation). Within the system, Payment state transitions are a state machine.

Q2 (Failure):

  • PSP call for authorization times out: retry with same idempotency key → PSP returns cached result or processes once. → Idempotency Key on all PSP calls (key = order_id:auth or order_id:capture).
  • System crashes after PSP confirms authorization but before persisting Payment.status = authorized: on restart, query PSP with idempotency key to recover status, then persist. → Idempotency Key + recovery reconciliation job.
  • Authorization expiry: if delivery takes > 7 days (edge case), auth expires. → Scheduler: daily job checks Payment records in authorized state older than 6 days and triggers re-authorization.

Q3 (Data): Is payment state commutative?

  • No. Capture must happen exactly once. Not commutative. → CAS on Payment.status (authorized → captured). Two concurrent capture attempts → only one CAS succeeds.

Q4 (Access): Payment reads are infrequent (per-order); PSP calls are the bottleneck.

  • No special caching needed. Payment record is small; read from Postgres.

Q5 (Coupling): Capture must be triggered by Order reaching delivered status. How is this propagated?

  • CDC (Change Data Capture): Order service updates Order.status = delivered in Postgres; Debezium publishes change event to Kafka; Payment service consumes and initiates capture. This ensures the trigger is the DB row change, not a dual-write.

Resulting mechanism combination:

Payment state machine (pending → authorized → captured | refunded | failed)
  + CAS on (status, version) for all transitions
  + Idempotency Key on all PSP calls (key = order_id:operation)
  + CDC (Debezium on Order table) → Kafka → Payment capture trigger
  + Auth expiry scheduler (daily reconciliation job)
  + Recovery reconciliation (on startup, query PSP for in-flight payments)

6.4 Required Combinations Check #

CombinationApplied whereSatisfied?
CAS + Idempotency KeyInventory reservation (CAS on counter + idempotency key per step)Yes
CAS + Idempotency KeyPayment capture (CAS on status + idempotency key on PSP calls)Yes
Lease + fencing tokenShopper assignment (SETNX lease + token in downstream writes)Yes

Step 7 — Axiomatic Validation (Source-of-Truth Table) #

Rule: Each piece of state has exactly one source of truth. No dual truth.

StateSingle source of truthHow derived views are computedDual-truth risk
Item.available_quantityPostgres Item.stock_count counter (decremented by CAS on reservation)AvailableItemView in Redis cache derived from PostgresRISK: cache must not be used as authority for reservation decisions. Cache is advisory only. Write path always reads from Postgres with FOR UPDATE.
Order.statusPostgres Order tableOrderStatusView in Redis/Kafka for real-time pushNo dual-truth: Postgres is authority; Redis/Kafka are read projections
Shopper.current_locationRedis Geo index (latest ping)LocationPing history in CassandraNo dual-truth: Redis is live view; Cassandra is history. They serve different queries.
Shopper.statusRedis Lease (on_order = lease held) + Postgres Shopper tableShopper availability viewRISK: Redis lease and Postgres must agree. Reconciliation: on lease expiry, system must reset Shopper.status in Postgres. Fencing token prevents stale Postgres write.
Payment.statusPostgres Payment tableN/ANo derived view; single source
SubstitutionRequest.statusPostgres SubstitutionRequest tableRedis ZSET for deadline index (advisory)No dual-truth: ZSET is a deadline index, not a status authority. Status in Postgres is canonical.
InventoryReservation.statusPostgres InventoryReservation tableN/ASingle source

Dual-Truth Resolution Log #

  1. Item.available_quantity: Resolution — cache is advisory only. All reservation write paths use SELECT FOR UPDATE on Postgres item row. Cache is populated by Postgres read after write; cache miss falls through to Postgres.

  2. Shopper.status (Redis vs Postgres): Resolution — Postgres Shopper.status is updated only with the valid fencing token from the Redis lease. On lease expiry, a lease-expiry handler (not the expired holder) updates Postgres.status = available using a system-level write (no fencing token required because the lease has expired and no other writer holds it).


Step 8 — Algorithm Design #

Algorithm 1: Multi-Item Order Placement (Saga) #

function placeOrder(customer_id, store_id, items[{item_id, quantity}]):
    # Step 0: Eligibility check
    zone_ok = checkDeliveryZone(customer_id, store_id)
    if not zone_ok: raise EligibilityError

    # Step 1: Create Order in pending state
    order_id = generateUUID()
    idempotency_key = hash(customer_id + cart_fingerprint)
    existing = lookupIdempotencyKey(idempotency_key)
    if existing: return existing  # idempotent replay

    order = INSERT INTO orders (id, customer_id, store_id, status='pending', version=0)
    persistIdempotencyKey(idempotency_key, order_id)

    # Step 2: Authorize payment (async, but must succeed before reservations)
    auth_result = authorizePayment(customer_id, estimatedTotal(items),
                                   idempotency_key=order_id+":auth")
    if auth_result.failed:
        markOrder(order_id, status='cancelled', reason='payment_auth_failed')
        return Error("payment_authorization_failed")

    INSERT INTO payments (order_id, status='authorized', auth_token=auth_result.token,
                          authorized_amount=auth_result.amount)

    # Step 3: Reserve items (Saga — each step compensable)
    reserved = []
    for item in items:
        step_key = order_id + ":" + item.item_id + ":reserve"
        if lookupIdempotencyKey(step_key):
            reserved.append(item)
            continue  # already reserved (retry path)

        # CAS on stock counter
        result = executeSQL("""
            UPDATE items
            SET stock_count = stock_count - :qty,
                version = version + 1
            WHERE item_id = :item_id
              AND stock_count >= :qty
              AND version = :expected_version
        """, item_id=item.item_id, qty=item.quantity,
             expected_version=item.read_version)

        if result.rows_affected == 0:
            # Saga rollback: release all previously reserved items
            for prev in reserved:
                releaseReservation(order_id, prev.item_id)  # idempotent
            markOrder(order_id, status='cancelled', reason='item_unavailable:'+item.item_id)
            voidPaymentAuth(order_id, idempotency_key=order_id+":void")
            return Error("item_out_of_stock", item_id=item.item_id)

        INSERT INTO inventory_reservations (
            order_id, item_id, quantity, status='confirmed'
        )
        persistIdempotencyKey(step_key, "confirmed")
        reserved.append(item)

    # Step 4: Advance order to accepted
    CAS_update(orders, order_id,
               expected_status='pending', new_status='accepted',
               expected_version=0, new_version=1)

    publishToOutbox(order_id, event='order_accepted', payload=order)
    return order_id


function releaseReservation(order_id, item_id):
    # Idempotent: safe to call multiple times
    reservation = SELECT ... WHERE order_id=order_id AND item_id=item_id
    if reservation.status == 'released': return  # already done

    UPDATE items SET stock_count = stock_count + reservation.quantity
      WHERE item_id = item_id
    UPDATE inventory_reservations SET status='released'
      WHERE order_id=order_id AND item_id=item_id AND status != 'released'

Algorithm 2: Shopper Assignment #

function assignShopper(order_id, store_id):
    order = getOrder(order_id)
    assert order.status == 'accepted'

    MAX_ATTEMPTS = 5
    attempts = 0

    while attempts < MAX_ATTEMPTS:
        # Find nearby available shoppers
        store_location = getStoreLocation(store_id)
        candidates = GEORADIUS(key='shoppers:available',
                               lat=store_location.lat, lon=store_location.lon,
                               radius=10km, sort=ASC, count=10)

        for shopper_id in candidates:
            lease_key = "shopper:" + shopper_id + ":claim"
            fencing_token = generateFencingToken()

            # Attempt to acquire lease
            acquired = SETNX(lease_key,
                             value=json(order_id=order_id, token=fencing_token),
                             EX=30)  # 30s TTL

            if not acquired: continue  # already claimed by another order

            # Verify shopper is still available in Postgres (lease acquired)
            shopper = SELECT ... FROM shoppers WHERE id=shopper_id FOR UPDATE
            if shopper.status != 'available':
                DEL(lease_key)  # release lease — stale Redis state
                continue

            # Saga step 1: Update Order with shopper
            order_updated = CAS_update(orders, order_id,
                expected_status='accepted', new_status='shopping',
                expected_version=order.version,
                new_version=order.version+1,
                new_shopper_id=shopper_id,
                new_fencing_token=fencing_token)

            if not order_updated:
                DEL(lease_key)  # compensate
                return  # concurrent assignment won — OK

            # Saga step 2: Update Shopper status (with fencing token validation)
            UPDATE shoppers SET status='on_order', assigned_order_id=order_id
              WHERE id=shopper_id
                AND status='available'
                AND (last_fencing_token IS NULL OR last_fencing_token < fencing_token)

            # Lease remains held; will be released when order reaches delivered/cancelled
            publishToOutbox(order_id, event='shopper_assigned', shopper_id=shopper_id)
            return order_id

        attempts++
        sleep(exponential_backoff(attempts))

    # No shopper available — requeue order for retry
    publishDelayed(order_id, event='assignment_retry', delay=60s)

Algorithm 3: Substitution Approval Flow #

function proposeSubstitution(shopper_id, order_id, item_id, substitute_item_id, reason):
    # Verify shopper is assigned to this order
    order = getOrder(order_id)
    assert order.assigned_shopper_id == shopper_id
    assert order.status == 'shopping'

    # Verify original item reservation is active
    reservation = getReservation(order_id, item_id)
    assert reservation.status == 'confirmed'

    # Uniqueness check (also enforced by DB constraint)
    existing = SELECT ... FROM substitution_requests
      WHERE order_id=order_id AND item_id=item_id AND status='pending_approval'
    if existing: raise UniqueConstraintError

    expiry = now() + 10 minutes  # configurable window
    sub_req = INSERT INTO substitution_requests (
        id=generateUUID(), order_id, item_id, substitute_item_id, reason,
        status='pending_approval', version=0, expires_at=expiry
    )

    # Register deadline in Redis ZSET
    ZADD('substitution:deadlines', expiry.unix_timestamp, sub_req.id)

    # Outbox event → Kafka → push notification + WebSocket
    publishToOutbox(sub_req.id, event='substitution_proposed',
                    payload={order_id, item_id, substitute_item_id, expiry})
    return sub_req.id


function approveSubstitution(customer_id, substitution_request_id):
    sub = getSubstitutionRequest(substitution_request_id) FOR UPDATE
    assert sub.status == 'pending_approval'
    assert sub.expires_at > now()
    assert getOrder(sub.order_id).customer_id == customer_id  # access control

    # CAS transition
    CAS_update(substitution_requests, sub.id,
               expected_status='pending_approval', new_status='approved',
               expected_version=sub.version, new_version=sub.version+1)

    # Remove deadline from Redis ZSET
    ZREM('substitution:deadlines', sub.id)

    # Outbox → Kafka → downstream effects:
    #   - InventoryReservation service: release old item reservation, create new
    #   - OrderItem service: update item_id to substitute
    publishToOutbox(sub.id, event='substitution_approved',
                    payload={order_id=sub.order_id, old_item=sub.item_id,
                             new_item=sub.substitute_item_id})


function sweepExpiredSubstitutions():
    # Runs every 10 seconds
    now_ts = unix_timestamp()
    expired_ids = ZRANGEBYSCORE('substitution:deadlines', 0, now_ts, LIMIT 100)

    for sub_id in expired_ids:
        sub = getSubstitutionRequest(sub_id) FOR UPDATE
        if sub.status != 'pending_approval':
            ZREM('substitution:deadlines', sub_id)
            continue  # already resolved

        # Idempotent CAS
        updated = CAS_update(substitution_requests, sub_id,
                             expected_status='pending_approval', new_status='expired',
                             expected_version=sub.version)
        if updated:
            ZREM('substitution:deadlines', sub_id)
            publishToOutbox(sub_id, event='substitution_expired',
                            payload={order_id=sub.order_id, item_id=sub.item_id})
            # Downstream: release InventoryReservation for original item
            #             remove OrderItem from order

Algorithm 4: Payment Capture #

function capturePayment(order_id):
    # Triggered by CDC event: Order.status changed to 'delivered'

    payment = SELECT ... FROM payments WHERE order_id=order_id FOR UPDATE
    if payment.status != 'authorized':
        log("Ignoring capture trigger — payment status: " + payment.status)
        return  # idempotent: already captured or not authorizable

    # Calculate final amount (may differ from auth due to substitutions/removals)
    final_amount = calculateFinalAmount(order_id)
    assert final_amount <= payment.authorized_amount * 1.10  # max 10% variance

    # Call PSP with idempotency key
    capture_result = psp.capture(
        auth_token=payment.auth_token,
        amount=final_amount,
        idempotency_key=order_id + ":capture"
    )

    if capture_result.success:
        CAS_update(payments, payment.id,
                   expected_status='authorized', new_status='captured',
                   expected_version=payment.version,
                   new_captured_amount=final_amount,
                   captured_at=now())
    else if capture_result.auth_expired:
        # Authorization expired — re-authorize then capture
        reauth_result = psp.authorize(
            customer_payment_method=payment.payment_method_token,
            amount=final_amount,
            idempotency_key=order_id + ":reauth"
        )
        if reauth_result.success:
            psp.capture(auth_token=reauth_result.token, amount=final_amount,
                        idempotency_key=order_id + ":capture:v2")
            UPDATE payments SET status='captured', ... WHERE id=payment.id
        else:
            UPDATE payments SET status='failed', failure_reason='reauth_declined'
            # Alert: manual review required
            publishAlert('payment_capture_failed', order_id=order_id)
    else:
        # Transient PSP error — exponential retry (max 3 attempts over 5 minutes)
        scheduleRetry('capture_payment', order_id=order_id, attempt=1, delay=30s)

Order State Machine #

States: pending → accepted → shopping → substitution_pending → picked → in_transit → delivered
         └──────────────────────────────────────────────────────────────────────────────────→ cancelled

Legal transitions:
  pending        → accepted           (system: shopper assigned)
  pending        → cancelled          (customer: within cancellation window)
  accepted       → shopping           (shopper: arrived at store, started shopping)
  accepted       → cancelled          (customer: before shopper starts)
  shopping       → substitution_pending (system: when SubstitutionRequest created)
  substitution_pending → shopping     (system: when all pending substitutions resolved)
  shopping       → picked             (shopper: all items confirmed picked)
  picked         → in_transit         (shopper: left store with order)
  in_transit     → delivered          (shopper: confirmed delivery)
  any non-terminal → cancelled        (system: forced cancellation with compensation)

Note: substitution_pending is a synthetic state — the Order may have multiple concurrent
SubstitutionRequests. The state returns to 'shopping' when all are resolved.
Guard: Order may only transition to picked when zero SubstitutionRequests are in pending_approval state.

Step 9 — Logical Data Model #

Schema #

-- Stable entities

CREATE TABLE delivery_zones (
    zone_id         UUID PRIMARY KEY,
    name            TEXT NOT NULL,
    boundary        GEOMETRY(POLYGON, 4326) NOT NULL,  -- PostGIS
    active          BOOLEAN DEFAULT TRUE,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Index: GIST(boundary) for zone containment queries

CREATE TABLE stores (
    store_id        UUID PRIMARY KEY,
    name            TEXT NOT NULL,
    zone_id         UUID NOT NULL REFERENCES delivery_zones(zone_id),
    location        GEOMETRY(POINT, 4326) NOT NULL,
    active          BOOLEAN DEFAULT TRUE,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Index: store_id (PK), zone_id, GIST(location)

CREATE TABLE items (
    item_id         UUID PRIMARY KEY,
    store_id        UUID NOT NULL REFERENCES stores(store_id),
    name            TEXT NOT NULL,
    price_cents     INTEGER NOT NULL,
    stock_count     INTEGER NOT NULL DEFAULT 0 CHECK (stock_count >= 0),
    version         BIGINT NOT NULL DEFAULT 0,
    active          BOOLEAN DEFAULT TRUE,
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Partition key: store_id (logical partition; physical by store_id range or hash)
-- Index: (store_id, active) for browse queries
-- Note: stock_count is the single source of truth for available inventory.
--       It is decremented by CAS at reservation time.

CREATE TABLE shoppers (
    shopper_id      UUID PRIMARY KEY,
    name            TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'offline'
                    CHECK (status IN ('offline', 'available', 'on_order')),
    assigned_order_id UUID,
    zone_id         UUID REFERENCES delivery_zones(zone_id),
    last_fencing_token TEXT,
    version         BIGINT NOT NULL DEFAULT 0,
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (status, zone_id) for availability queries

-- Process objects

CREATE TABLE orders (
    order_id        UUID PRIMARY KEY,
    customer_id     UUID NOT NULL,
    store_id        UUID NOT NULL REFERENCES stores(store_id),
    shopper_id      UUID REFERENCES shoppers(shopper_id),
    status          TEXT NOT NULL DEFAULT 'pending'
                    CHECK (status IN ('pending','accepted','shopping',
                                     'substitution_pending','picked',
                                     'in_transit','delivered','cancelled')),
    cancellation_reason TEXT,
    version         BIGINT NOT NULL DEFAULT 0,
    idempotency_key TEXT UNIQUE,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Partition key: customer_id (most queries scoped to customer)
-- Index: (customer_id, created_at DESC), (shopper_id, status), (status, created_at)

CREATE TABLE order_items (
    order_item_id   UUID PRIMARY KEY,
    order_id        UUID NOT NULL REFERENCES orders(order_id),
    item_id         UUID NOT NULL REFERENCES items(item_id),
    quantity        INTEGER NOT NULL CHECK (quantity > 0),
    unit_price_cents INTEGER NOT NULL,  -- price at order time, snapshot
    resolution      TEXT DEFAULT 'pending'
                    CHECK (resolution IN ('pending','fulfilled','substituted','removed')),
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (order_id) — always queried by order

CREATE TABLE inventory_reservations (
    reservation_id  UUID PRIMARY KEY,
    order_id        UUID NOT NULL REFERENCES orders(order_id),
    item_id         UUID NOT NULL REFERENCES items(item_id),
    quantity        INTEGER NOT NULL,
    status          TEXT NOT NULL DEFAULT 'confirmed'
                    CHECK (status IN ('confirmed','released','fulfilled')),
    idempotency_key TEXT UNIQUE,  -- order_id:item_id:reserve
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE (order_id, item_id)  -- enforces I5
);
-- Partition key: item_id (for stock aggregation queries)
-- Index: (order_id), (item_id, status)

CREATE TABLE substitution_requests (
    substitution_id     UUID PRIMARY KEY,
    order_id            UUID NOT NULL REFERENCES orders(order_id),
    item_id             UUID NOT NULL REFERENCES items(item_id),
    substitute_item_id  UUID NOT NULL REFERENCES items(item_id),
    shopper_reason      TEXT,
    status              TEXT NOT NULL DEFAULT 'pending_approval'
                        CHECK (status IN ('pending_approval','approved','rejected','expired')),
    version             BIGINT NOT NULL DEFAULT 0,
    expires_at          TIMESTAMPTZ NOT NULL,
    created_at          TIMESTAMPTZ DEFAULT NOW(),
    updated_at          TIMESTAMPTZ DEFAULT NOW()
);
-- Unique constraint: only one pending per (order_id, item_id)
-- Enforced by partial unique index:
CREATE UNIQUE INDEX uq_substitution_active
    ON substitution_requests(order_id, item_id)
    WHERE status = 'pending_approval';
-- Index: (expires_at) for sweeper queries, (order_id)

CREATE TABLE payments (
    payment_id          UUID PRIMARY KEY,
    order_id            UUID NOT NULL UNIQUE REFERENCES orders(order_id),
    status              TEXT NOT NULL DEFAULT 'pending'
                        CHECK (status IN ('pending','authorized','captured','refunded','failed')),
    authorized_amount   INTEGER,  -- in cents
    captured_amount     INTEGER,
    auth_token          TEXT,     -- PSP auth token
    payment_method_token TEXT,    -- tokenized card reference
    failure_reason      TEXT,
    version             BIGINT NOT NULL DEFAULT 0,
    authorized_at       TIMESTAMPTZ,
    captured_at         TIMESTAMPTZ,
    auth_expires_at     TIMESTAMPTZ,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (order_id) unique, (status, authorized_at) for expiry reconciliation

-- Events (append-only, high volume)

-- LocationPing stored in Cassandra (see Step 10)
-- Schema shown here for reference:
-- CREATE TABLE location_pings (
--     shopper_id  UUID,
--     ts          TIMESTAMP,
--     lat         DOUBLE,
--     lon         DOUBLE,
--     accuracy_m  FLOAT,
--     PRIMARY KEY ((shopper_id), ts)
-- ) WITH CLUSTERING ORDER BY (ts DESC);

-- Outbox table (per-service, for transactional outbox pattern)

CREATE TABLE outbox_events (
    event_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id    UUID NOT NULL,
    event_type      TEXT NOT NULL,
    payload         JSONB NOT NULL,
    published       BOOLEAN DEFAULT FALSE,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- Index: (published, created_at) for relay queries

-- Idempotency key store

CREATE TABLE idempotency_keys (
    idem_key        TEXT PRIMARY KEY,
    result_ref      TEXT NOT NULL,  -- order_id or reservation_id
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
-- TTL managed by background cleanup job (keys older than 24h deleted)

Step 10 — Technology Landscape #

Technology Selection Table #

ComponentTechnologyJustification
Primary relational storePostgreSQL 16 with PostGISACID transactions for CAS; PostGIS for delivery zone geo queries; partial indexes for substitution uniqueness; row-level locking (SELECT FOR UPDATE)
Shopper lease + availability indexRedis 7 ClusterSETNX with TTL for leases; GEORADIUS for spatial queries; Pub/Sub for live order status per customer session
Real-time location indexRedis Geo (GEOADD, GEORADIUS)O(N+log M) radius query; TTL-expired entries auto-purge shoppers who go offline
Location ping historyApache CassandraTime-series append-only workload; partition by shopper_id, cluster by timestamp; wide-row model fits perfectly; LSM tree optimizes write-heavy load
Event streamingApache KafkaDurable ordered log; topics: order-events, location-stream, inventory-events, substitution-events, payment-events; supports CDC relay, outbox relay, notification fan-out
CDC relayDebeziumTails Postgres WAL; publishes row changes to Kafka without application dual-write
Push notificationsFCM / APNs (via Kafka consumer)Fan-out from substitution-events and order-events Kafka topics
Item imagesS3 + CloudFrontStatic binary blobs; CDN for global low-latency serve
API gatewayKong / AWS API GatewayAuth, rate limiting, routing
Service meshIstio / LinkerdmTLS, circuit breakers, retries between internal services
Shopper app real-timeWebSocket (via API gateway)Bi-directional; shopper sends location pings, receives order updates
Customer app real-timeWebSocket + SSE fallbackCustomer receives order status + substitution requests
Scheduler (delayed jobs)Redis ZSET + sweeper workersSubstitution expiry deadlines; auth expiry reminders
MonitoringPrometheus + GrafanaMetrics; Jaeger for distributed traces
LoggingELK stack (Elasticsearch + Logstash + Kibana)Structured log aggregation

Why Not Alternatives #

AlternativeWhy rejected
DynamoDB for OrdersMulti-item transactions with CAS across multiple tables require expensive DynamoDB transactions (TransactWriteItems); Postgres row-level locking is simpler and cheaper at this scale
MongoDBNo partial unique indexes until recently; CAS less ergonomic; PostGIS integration poor
CRDT for inventoryInventory is not commutative under scarcity — CAS is required
Zookeeper for leasesOperationally heavier than Redis for this workload; Redis SETNX is sufficient
Single Postgres for location pingsWrite throughput: 1M+ pings/minute cannot be sustained by Postgres; Cassandra’s LSM tree is the right tool

Step 11 — Deployment Topology #

Service Decomposition #

┌─────────────────────────────────────────────────────────────────────────┐
│                          API Gateway (Kong)                             │
│          TLS termination, auth, rate limiting, routing                  │
└──────┬────────────┬──────────────┬─────────────┬────────────────────────┘
       │            │              │             │
       ▼            ▼              ▼             ▼
┌─────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐
│ Customer│  │  Order   │  │ Shopper  │  │   Payment    │
│ Service │  │ Service  │  │ Service  │  │   Service    │
│         │  │          │  │          │  │              │
│ Profile │  │ Place    │  │ Location │  │ Auth/Capture │
│ Auth    │  │ Status   │  │ Tracking │  │ Reconcile    │
└────┬────┘  └────┬─────┘  └────┬─────┘  └──────┬───────┘
     │            │              │                │
     ▼            ▼              ▼                ▼
┌──────────────────────────────────────────────────────────┐
│                    Postgres (Primary)                     │
│  orders, order_items, payments, customers, shoppers,     │
│  inventory_reservations, substitution_requests, outbox   │
│  Read replicas for browse/reporting queries              │
└────────────────────────┬─────────────────────────────────┘
                         │ Debezium CDC
                         ▼
┌──────────────────────────────────────────────────────────┐
│                  Apache Kafka (3 brokers)                 │
│  Topics: order-events, location-stream, inventory-events │
│          substitution-events, payment-events, outbox-relay│
└──────┬────────────────┬─────────────────┬────────────────┘
       │                │                 │
       ▼                ▼                 ▼
┌───────────┐   ┌─────────────┐  ┌─────────────────┐
│Notification│  │  Analytics  │  │ Outbox Relay    │
│  Service  │   │  Consumer   │  │ (Debezium sink) │
│ FCM/APNs  │   │ (Spark/Flink│  │                 │
└───────────┘   └─────────────┘  └─────────────────┘

┌──────────────────────────────────────────────────────────┐
│                   Redis Cluster                          │
│  - Shopper leases (SETNX)                               │
│  - Shopper geo index (GEOADD)                           │
│  - Substitution deadline ZSET                           │
│  - Order status pub/sub channels                        │
│  - Available item cache (read-through)                  │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│                  Cassandra Cluster (3 nodes)             │
│  - LocationPing history                                  │
│  Partition: shopper_id; Cluster: timestamp DESC         │
└──────────────────────────────────────────────────────────┘

Regions and Replication #

  • Primary region: Active-active multi-AZ within one region (3 AZs).
  • Postgres: Primary + 2 synchronous replicas (for failover). Read replicas for browse queries. Streaming replication.
  • Redis Cluster: 3 primary shards × 1 replica per shard = 6 nodes across 3 AZs.
  • Kafka: 3 brokers, replication factor 3, min.insync.replicas=2.
  • Cassandra: Replication factor 3, LOCAL_QUORUM reads and writes.
  • Multi-region: Phase 2 evolution (see Step 20). Initially single-region.

Container Layout #

Each service is a separate Kubernetes Deployment:

  • Order Service: 6-20 pods (HPA on CPU + pending order queue depth)
  • Shopper Service: 4-12 pods (HPA on active shoppers × location ping rate)
  • Assignment Service: 3-6 pods (HPA on assignment queue depth)
  • Payment Service: 3-6 pods (stateless; PSP calls are the bottleneck)
  • Notification Service: 3-6 pods (fan-out workload)
  • Outbox Relay: 2 pods (low volume; single-writer per outbox shard)
  • Substitution Sweeper: 2 pods (leader-elected; single active at a time)

Step 12 — Consistency Model #

Per-Path Classification #

PathConsistency modelJustification
Inventory reservation (write)LinearizableCAS with SELECT FOR UPDATE on Postgres primary. No stale read permitted — oversell is a hard business invariant.
Inventory browse (read)Eventual (cache-assisted)Cache may be up to 30s stale. Fine for browse; write path is linearizable.
Order status readRead-your-writes (at minimum)Customer reads from Postgres replica with session consistency guarantee. Or: writes propagate to Redis pub/sub < 500ms; customer WebSocket sees near-real-time.
Order status write (transition)LinearizableCAS on (status, version) on Postgres primary.
Shopper location (live view)Eventual (< 5s lag)LocationPing → Kafka → Redis Geo update. Customers see location that is at most one ping interval stale. Acceptable for navigation.
Shopper location (history)Eventually consistentCassandra LOCAL_QUORUM on write. Read from Cassandra for history.
Shopper assignment claimLinearizable (within Redis)Redis SETNX is atomic. The subsequent Postgres writes use fencing token to reject stale writes.
Payment authorizationLinearizablePostgres write + PSP call.
Payment captureLinearizableCAS on Payment.status + PSP call with idempotency key.
Substitution approvalLinearizableCAS on SubstitutionRequest.status on Postgres.
Substitution expiryMonotonic (sweep-based)Sweeper applies idempotent CAS; processes each expired request exactly once due to ZRANGEBYSCORE + CAS guard.

Step 13 — Scaling Model #

Traffic Estimation #

MetricAssumptionPeak value
DAU1M customers, 10% place orders/day100K orders/day
Orders/hour (peak)3× average, evening peak~20K orders/hour
Items/orderAverage 10 items200K item reservations/hour
LocationPing rate10K active shoppers × 1 ping/10s1,000 pings/second
Browse QPS1M DAU × 20 browses/day / 86400~230 QPS baseline; 5× peak = 1,150 QPS
Payment operations1 auth + 1 capture per order~40K PSP calls/hour at peak

Scaling Dimensions #

DimensionBottleneckMitigation
Inventory CAS writesHigh-contention items (popular grocery SKUs)Shard by item_id; Postgres row-level locking scoped to one row per item; CAS contention is localized
Browse readsRead amplificationCache in Redis (30s TTL); serve from Postgres read replicas for cache misses
Location pingsWrite throughput to CassandraCassandra scales horizontally; Kafka buffers bursts; Cassandra ring hash by shopper_id
Shopper geo queriesRedis GEORADIUS at scaleRedis Cluster sharded by zone_id; each zone’s shoppers indexed separately
Order event fan-outNotification delivery latencyKafka partitioned by order_id; notification workers scale horizontally
Assignment contentionMany orders competing for few shoppersExponential backoff; assignment is queued, not synchronous
Payment PSPPSP rate limitsConnection pooling; retry with exponential backoff; circuit breaker on PSP

Hotspot Analysis #

  • Popular item at single store: stock_count row is a hotspot. At extreme scale (Black Friday), use a reservation queue (Kafka) rather than synchronous CAS — serialize reservation requests per item through a single partition consumer. This is a Phase 2 mitigation.
  • Shopper in high-density area: Redis Geo shard by zone; zone is the partition boundary. One zone’s shoppers live on one Redis shard.
  • Kafka partition hotspot: Order events partitioned by order_id (random distribution). Location stream partitioned by shopper_id (even distribution by shopper count, not geography).

Step 14 — Failure Model #

Failure Enumeration #

FailureDetectionImpactRecovery
Order service pod crash mid-Saga (reservation partially complete)Health check → Kubernetes restartItems 1..k reserved; items k+1..N not; Order in pendingOn pod restart: idempotency key lookup resumes Saga at last unconfirmed step; if order is in pending state older than 5min, compensate and cancel
Redis lease holder crashes before persisting shopper assignment to PostgresLease TTL expires (30s)Shopper locked but order has no shopper_idAfter TTL, assignment service retries; shopper returned to available pool
Postgres primary failureStreaming replication monitoring; auto-failover via PatroniWrites blocked during failover (~30s)Replica promoted; application reconnects; in-flight writes retry with idempotency keys
Kafka broker failureKafka replication (factor=3, ISR=2)No data loss if ISR maintained; 1-broker loss toleratedAutomatic leader re-election; producers retry with idempotency
Cassandra node failure (1 of 3)Gossip failure detectionLocation ping writes succeed (quorum = 2); history reads succeedCassandra hinted handoff; node replacement; repair job
PSP timeout during payment authHTTP timeout + circuit breakerOrder placement blockedRetry with idempotency key (same key = PSP returns cached result); max 3 retries over 60s; order cancelled if all fail
PSP timeout during payment captureCDC trigger retryPayment remains authorizedOutbox relay re-triggers capture; idempotency key prevents double capture; alert if capture fails after 3 retries
Payment auth expiry before deliveryScheduled reconciliation jobCapture fails with auth_expiredRe-authorize then capture; if customer card declines, alert + flag order
Substitution sweeper crashPod health check; Kubernetes restartExpired substitutions not processed; order stuck in substitution_pendingSweeper is stateless + idempotent; next sweep processes all overdue entries; ZSET persists in Redis
Network partition between servicesCircuit breaker (Hystrix/Resilience4j)Degraded functionalityFail-open for reads (serve cached data); fail-closed for writes (reject until partition healed)
Shopper app location ping failureShopper device connectivity lossLocation view goes staleLast known location displayed; staleness indicator shown after 30s; assignment not affected
Duplicate Kafka message deliveryKafka at-least-once semanticsDouble-processing of eventsAll consumers implement idempotent processing (CAS on state or unique constraint)

Step 15 — SLOs #

Service Level Objectives #

MetricSLOMeasurement window
Order placement P99 latency< 3s (includes payment auth + N item reservations)Rolling 5-minute window
Order placement P50 latency< 800msRolling 5-minute window
Shopper assignment time (P95)< 60s from order acceptedPer-order measurement
Location update latency (device → customer view)< 5s P95Sampled end-to-end trace
Substitution push notification delivery< 10s P95 from shopper submissionPer-event measurement
Payment capture P99 latency< 30s from delivered eventPer-payment measurement
Inventory reservation accuracyZero oversell eventsContinuous invariant monitoring
Order service availability99.9% (< 8.7h downtime/year)Rolling 30-day window
Payment service availability99.95% (< 4.4h downtime/year)Rolling 30-day window
Browse API availability99.5% (cache-backed; acceptable degraded mode)Rolling 30-day window
Data durability99.999% (no order data loss)All-time

Error Budget #

  • Order service: 43.8 minutes downtime per month.
  • Payment service: 21.9 minutes downtime per month.
  • Burn rate alerting: if error budget burns > 5× expected rate for 1 hour → page on-call.

Step 16 — Operational Parameters #

Tunable Parameters #

ParameterDefaultMinMaxEffect of increasing
inventory_reservation.ttl_pending_minutes20560Longer window before abandoned orders release stock; increases risk of stock lockout for other customers
shopper_assignment.lease_ttl_seconds3010120Longer lease → more time for multi-step assignment saga; higher risk of shopper being locked if assignment service crashes
shopper_assignment.search_radius_km10150Larger radius finds more shoppers but increases drive time
substitution_request.approval_window_minutes10230Longer window → better customer response rate; longer order hold
substitution_sweeper.interval_seconds10560More frequent sweeping → faster expiry processing; higher Redis ZRANGEBYSCORE load
payment.auth_preexpiry_days617Alert/reauth triggered this many days before auth expiry
item_cache.ttl_seconds305300Higher TTL → lower Postgres read load; higher risk of stale stock display
assignment.max_attempts5120More attempts before falling back to delayed retry queue
kafka.consumer.max_poll_records50012000Higher → more throughput per consumer poll; higher memory pressure

Circuit Breaker Parameters #

CircuitFailure thresholdWait durationHalf-open probe count
PSP auth calls50% failure in 10s window60s3
PSP capture calls50% failure in 10s window60s3
Cassandra writes (location)30% failure in 5s window30s5
Redis (lease operations)40% failure in 5s window15s3

Step 17 — Runbooks #

Runbook 1: Inventory Oversell Detection #

Trigger: Alert fires: inventory_negative_stock metric > 0 for any item_id.

Root cause: CAS guard bypassed (bug) OR stock_count initialized incorrectly OR compensating transaction double-applied.

Steps:

  1. Query: SELECT item_id, stock_count FROM items WHERE stock_count < 0. Identify affected items.
  2. Query: SELECT * FROM inventory_reservations WHERE item_id = :item_id AND status = 'confirmed'. Sum quantities.
  3. Compare sum(reservations) with initial_stock (from audit log). Identify discrepancy.
  4. If stock_count < 0: immediate lock — set item.active = false to prevent further orders.
  5. Identify affected orders (reservations referencing this item_id). Triage: which orders are still in shopping phase vs picked/delivered.
  6. For picked/delivered orders: allow completion; adjust stock manually.
  7. For shopping orders: contact shoppers; may need to cancel items.
  8. Root cause analysis: enable DEBUG logging on item CAS path; replay recent reservations from Kafka topic.
  9. Fix stock_count: UPDATE items SET stock_count = max(0, initial_stock - confirmed_reservations).
  10. Re-enable item: UPDATE items SET active = true WHERE item_id = :item_id.

Runbook 2: Shopper Lease Stuck (Shopper Permanently in on_order) #

Trigger: Alert: shopper_id has been in on_order status for > 4 hours with no location ping for > 30 minutes.

Steps:

  1. Verify: SELECT * FROM shoppers WHERE shopper_id = :id. Check assigned_order_id.
  2. Check order status: SELECT status FROM orders WHERE order_id = :assigned_order_id.
  3. Check Redis lease: GET shopper:{shopper_id}:claim. If key exists, note TTL.
  4. If order is delivered/cancelled but shopper status is still on_order: this is a missed status reset.
    • Check outbox_events for the relevant order transition — was it published? Was the Shopper service consumer lagging?
    • Manual fix: UPDATE shoppers SET status='available', assigned_order_id=NULL WHERE shopper_id=:id.
    • DEL shopper:{shopper_id}:claim if key still exists.
  5. If order is in active state but shopper is unreachable: attempt to contact shopper via ops tooling. If genuinely lost, escalate to manual delivery or reassignment.
  6. For reassignment: cancel current assignment (system-level), trigger new assignment saga for the order.

Runbook 3: Payment Capture Failure #

Trigger: Alert: payment_capture_failed event, or Payment record in authorized state for > 24h after Order.status = delivered.

Steps:

  1. Query: SELECT * FROM payments WHERE order_id = :order_id.
  2. Check PSP portal for auth_token status. Is auth still valid?
  3. If auth expired: manually trigger re-authorization via internal admin API (POST /admin/payments/:id/reauth). This uses the payment_method_token to issue a new auth.
  4. If PSP is degraded: check circuit breaker status in Grafana. If circuit is open, wait for PSP recovery; circuit will auto-retry.
  5. If customer card declined during re-auth: flag order for finance team. Send customer notification. Create manual review ticket.
  6. If capture succeeds eventually: close ticket.
  7. Post-mortem: if multiple orders affected simultaneously → PSP incident. Coordinate with PSP support.

Runbook 4: Substitution Sweeper Backlog #

Trigger: Alert: substitution_deadlines_backlog (ZCARD of substitution:deadlines > 1000 entries with score < now).

Steps:

  1. Check sweeper pod health: kubectl get pods -l app=substitution-sweeper.
  2. If pod is crashed: kubectl describe pod to see crash reason. Check logs. Restart pod.
  3. If sweeper is running but backlog is growing: check Postgres write latency. SubstitutionRequest CAS updates may be slow.
  4. Verify no lock contention: SELECT * FROM pg_stat_activity WHERE wait_event_type='Lock'.
  5. If backlog is non-critical (< 10 min delay): let sweeper process naturally; it will catch up.
  6. If critical (hour+ delay with customer-visible stuck orders): deploy emergency parallel sweeper with a different ZSET range split (e.g., two sweepers each taking half the score range). Coordinate via Redis distributed lock to prevent double-processing — or rely on CAS guard.

Step 18 — Observability #

Metrics (Prometheus) #

MetricTypeLabelsAlert threshold
order_placement_duration_secondsHistogramstatus (success/failure), error_typeP99 > 3s
inventory_reservation_cas_retries_totalCounteritem_idRate > 10/s per item → hotspot
inventory_negative_stock_countGaugeitem_id> 0 → page immediately
shopper_assignment_duration_secondsHistogramP95 > 60s
shopper_assignment_attemptsHistogramP99 > 4 attempts
shopper_lease_acquisition_totalCounterresult (success/failure)Failure rate > 50%
location_ping_ingest_rateGaugezone_idDrop > 30% → shopper app issue
location_ping_lag_secondsHistogramP95 > 10s → pipeline issue
substitution_request_created_totalCounter
substitution_request_expired_totalCounter> 20% of created → approval UX issue
substitution_sweeper_backlogGauge> 500 → warn; > 2000 → page
payment_authorization_duration_secondsHistogrampsp_nameP99 > 5s
payment_capture_success_totalCounter
payment_capture_failure_totalCounterfailure_reason> 0 → alert
payment_auth_expiry_countGauge> 0 → pre-expiry action needed
kafka_consumer_lagGaugetopic, consumer_group> 10,000 → alert
circuit_breaker_stateGaugecircuit_name, statestate=open → alert

Distributed Tracing (Jaeger) #

Every request carries a trace-id propagated via HTTP headers and Kafka message headers. Key trace spans:

  • order.place → includes child spans: zone_check, payment.authorize, inventory.reserve[item_id=X] (one per item), order.status_transition
  • assignment.run → includes: geo_query, lease.acquire[shopper_id=X], order.update_shopper, shopper.update_status
  • substitution.propose → includes: reservation.check, db.insert, deadline.register, notification.publish
  • payment.capture → includes: order.verify_delivered, psp.capture_call, payment.status_transition

Sampling: 10% of requests traced by default; 100% for error paths and P99+ latency paths.

Structured Logs (ELK) #

Every log line includes: trace_id, service, pod_id, order_id (where applicable), level, timestamp, message, duration_ms (for external calls).

Key log events:

  • inventory_cas_failure: includes item_id, requested_qty, available_qty, attempt_number
  • saga_compensation_triggered: includes order_id, step_failed, items_released
  • shopper_lease_acquired: includes shopper_id, order_id, fencing_token
  • shopper_lease_expired: includes shopper_id, previous_order_id
  • payment_capture_attempted: includes order_id, amount_cents, idempotency_key, psp_response_code
  • substitution_expired: includes substitution_id, order_id, item_id, delay_seconds

Dashboards #

  1. Order Funnel Dashboard: placement rate, acceptance rate, cancellation rate by stage, P50/P95/P99 per stage.
  2. Inventory Health Dashboard: items with negative stock (should always be 0), reservation hot spots, CAS retry rates.
  3. Shopper Assignment Dashboard: assignment latency, lease acquisition success rate, available shopper count by zone.
  4. Payment Dashboard: auth success rate, capture success rate, PSP latency, auth expiry count.
  5. Real-time Ops Dashboard: active orders by status, active shoppers, pending substitutions, Kafka consumer lag.

Step 19 — Cost Model #

Infrastructure Cost Estimate (Monthly, AWS us-east-1) #

ComponentConfigurationEst. monthly cost
Postgres (RDS Multi-AZ)db.r6g.2xlarge (primary + 1 replica), 500GB gp3$1,200
Postgres read replicas (2×)db.r6g.xlarge × 2$800
Redis Cluster6× cache.r7g.large (3 primary + 3 replica)$900
Kafka (MSK)3× kafka.m5.xlarge brokers, 500GB storage each$1,100
Cassandra (EC2)3× r6g.2xlarge + 1TB gp3 each$1,500
EKS worker nodes10× m6g.2xlarge (mixed services)$1,800
S3 (item images)1TB storage + 50M GET requests$100
CloudFront (CDN)10TB egress$850
Data transferInternal VPC traffic$300
Monitoring (Prometheus/Grafana stack)2× t3.medium$60
Miscellaneous (ALB, Route53, NAT)$400
Total infrastructure~$9,010/month

Variable Costs #

Cost driverUnit costAt 100K orders/day
PSP transaction fee (auth)$0.005/transaction$150/day
PSP transaction fee (capture)2.9% + $0.30 per captureVariable (depends on order value)
FCM/APNs pushesNegligible (free tier)$0
Kafka message storage$0.10/GB-month~$50/month at 1KB/msg × 10M msgs/day

Cost per Order (Infrastructure only) #

At 100K orders/day = 3M orders/month:

  • Infrastructure: $9,010 / 3,000,000 = $0.003/order (0.3 cents)
  • PSP fees are the dominant variable cost (separate revenue model consideration)

Cost Optimization Levers #

  1. Cassandra: Move location pings older than 90 days to S3 (lifecycle policy). Cassandra cluster shrinks by 60% after 6 months.
  2. Read replicas: Scale down Postgres read replicas off-peak (scheduled RDS scaling).
  3. Kafka: Use S3 tiered storage (MSK Connect) for topics older than 7 days. Reduces broker storage cost by 70%.
  4. Item images: Use CloudFront origin shield to reduce S3 request costs.

Step 20 — Evolution Stages #

Stage 1: MVP (Months 0-3) #

Goal: Working system in one city, one delivery zone, ~1,000 orders/day.

What to build:

  • Monolithic Order Service (handles placement + assignment + status)
  • Postgres single instance (no replicas)
  • Redis single node (no cluster)
  • No Cassandra — location pings stored in Postgres (time-partitioned table)
  • Basic FCM push (no WebSocket)
  • Manual payment capture trigger (via internal admin UI)
  • No Kafka — synchronous in-process event handling

What to defer:

  • Cassandra migration (location pings)
  • Kafka (event streaming)
  • CDC / Debezium
  • Substitution sweeper as a separate service
  • Multi-zone, multi-city

Invariants that must hold even in MVP:

  • I3 (no oversell) — CAS on stock_count must be implemented from day 1
  • I7 (one shopper per order) — must work; use Postgres advisory lock if Redis not available
  • I17 (idempotent payments) — idempotency keys on PSP calls from day 1

Stage 2: City Scale (Months 3-9) #

Goal: 5 cities, 10 zones, ~10K orders/day.

What to add:

  • Extract Assignment Service from Order Service
  • Add Kafka (3-broker cluster) — decouple notification fan-out
  • Add Cassandra — migrate location pings from Postgres
  • Add Redis Cluster — replace single node
  • Add Postgres read replicas — separate browse reads
  • Implement Outbox pattern — replace in-process event dispatch
  • Add Debezium CDC — payment capture triggered by DB change
  • Implement WebSocket for real-time customer updates
  • Add circuit breakers on PSP calls

Migrations:

  • Postgres location_pings table → Cassandra: dual-write for 2 weeks, then cut over reads, then drop table
  • In-process events → Outbox + Kafka: feature-flagged per service, gradual rollout

Stage 3: National Scale (Months 9-24) #

Goal: 50 cities, 500 zones, ~100K orders/day, SLA 99.9%.

What to add:

  • Horizontal Postgres sharding by zone_id (Citus or manual application-level sharding)
  • Kafka partition expansion (increase to 50 partitions per topic)
  • Cassandra cluster expansion to 9 nodes
  • Redis Cluster expansion to 9 shards
  • CQRS pattern — separate read model (Elasticsearch) for item search with full-text
  • Auth expiry reconciliation job (scheduled daily)
  • Full observability stack (Prometheus + Grafana + Jaeger + ELK)
  • Chaos engineering (Chaos Monkey) — validate failure recovery

Stage 4: Multi-Region (Month 24+) #

Goal: < 100ms latency in each region; regional failure isolation; 99.95% SLA.

Design changes:

  • Active-active multi-region with conflict resolution
  • Order data: region-pinned (order belongs to the region where it was placed; no cross-region write)
  • Shopper data: replicated read-only to other regions; writes go to home region
  • Item catalog: replicated globally; writes in home region, CDC to all regions
  • Payment: globally replicated Payment table; capture always goes to original auth region (PSP requires it)
  • Kafka MirrorMaker 2 for cross-region topic replication
  • Global API gateway (Cloudflare or AWS Global Accelerator) for geo-routing

What does NOT need cross-region coordination:

  • Inventory reservation (inventory is physical — a store in Seattle doesn’t serve a customer in Austin)
  • Shopper assignment (shoppers are local)
  • Location pings (local to the order’s region)

What DOES need cross-region:

  • Customer profile and payment method tokens (global read; home-region write)
  • Item catalog (global read; store-local write)
  • Analytics/reporting (global aggregation)

Evolution Decision Points #

TriggerAction
Location ping write latency P99 > 50ms on PostgresMigrate to Cassandra
CAS contention rate > 20% on popular itemsImplement per-item reservation queue (Kafka-serialized)
Shopper assignment P95 > 30sAdd predictive pre-assignment (assign before order is placed, based on cart signal)
Order placement P99 > 3sAsync reservation (optimistic placement + async confirmation); trade consistency for speed
PSP failure rate > 2%Add secondary PSP fallback routing
Item browse cache miss rate > 10%Increase Redis memory; add read replicas for cache warming

Summary: Key Design Decisions and Their Derivations #

DecisionDerived fromInvariant
CAS on stock_count (not a mutable available_quantity column derived from reservation sum)Accounting invariant I4: one source of truthI3, I4
Saga for multi-item reservation (not 2PC)Cross-service decomposable: each item reservation is independently compensableI2, I14
Redis SETNX Lease for shopper assignmentMulti-writer one-winner + state machine (Step 6.2)I7, I8
Lease + fencing token (not just Lease)Step 6.4 required combinationI7
Outbox + Relay (not dual-write)Propagation invariant + atomicity: DB write and event publish must be atomicI14, I15
CDC for payment capture triggerasync-source-is-DB-row (Step 6.3 Q5)I10
Idempotency keys on all PSP callsFailure model Q2: writer-crash recoveryI17
SubstitutionRequest as separate object (not an update to OrderItem)Evolution purity (Step 2): OrderItem is append-only/immutable; substitution has its own state machineI13, I16, I18
Redis ZSET for substitution deadlinesIntent/future constraint with expiry (Step 2 class) + idempotent sweeperI15
Cassandra for LocationPingAppend-only + single writer + time-series access pattern (Step 3 axes)
available_quantity cache is advisory onlySource-of-truth validation (Step 7): dual truth eliminatedI4

There's no articles to list here yet.