Agreement at Application Level — Sagas, Compensation, and Cross-Service Atomicity


Chapter 4 introduced Agreement at the infrastructure level: Raft log replication ensures that replicas agree on the same entry, and 2PC coordinates multiple database participants. At the application level, Agreement is the same problem at a different boundary: multiple business operations, spanning services, databases, and external APIs, must all succeed or all be undone.

The structural difference is critical:

Infrastructure Agreement: a small number of participants (nodes in a cluster), each implementing a well-defined protocol (Raft, XA). Failures are handled by the consensus algorithm. The application sees a single atomic operation.

Application Agreement: participants are independent services with their own data stores, their own failure modes, and no shared transaction coordinator. Failures must be handled explicitly by the application, using compensation logic written by the developer.

This is where infrastructure Agreement stops and application Agreement begins: at the service boundary.

When Application Agreement Is Needed

Application Agreement is required when:

An operation spans multiple services and partial completion leaves the system in a worse state than no completion.
The services do not share a database (preventing a single transaction).
The operations are too long-lived for 2PC (the user is browsing between steps).

Domain	The Agreement problem
E-commerce checkout	Payment + inventory decrement + order creation must all succeed or all roll back
Travel booking	Hotel + flight + car rental must all be confirmed or all released
Bank transfer	Debit source account + credit destination account must be atomic
Account creation	Create user record + send welcome email + provision storage must all succeed
Order cancellation	Refund payment + release inventory + cancel fulfillment must all succeed

The Saga Pattern at Application Level

Chapter 4 introduced Saga mechanics. This chapter focuses on how to design Sagas correctly for real application domains — the compensation design problem, the isolation gap, and the failure modes that emerge in production.

Designing Compensations

Every Saga step whose effect may need to be undone requires a compensation action. The compensation must satisfy:

Idempotency: if the compensation is executed twice (for example, because the orchestrator crashes mid-compensation and retries on recovery), the second execution must be a no-op. This requires either natural idempotency (releasing a reservation that is already released succeeds silently) or an idempotency key on the compensation call.

Semantic correctness: the compensation undoes the business effect of the step, not just the technical operation. A payment refund is not DELETE FROM charges WHERE id = X — it is a refund transaction in the payment provider’s system, which may have fees, time limits, and regulatory requirements.

Not all steps are compensatable: sending an email, publishing a public tweet, or transmitting data to a regulatory body cannot be undone. These steps must be placed last in the Saga — only executed once all prior steps have succeeded. If a non-compensatable step fails, the Saga cannot roll back; it must move forward to a completed state via retry or manual intervention.

WRONG order:
  1. Send confirmation email (non-compensatable)
  2. Charge payment (compensatable: refund)
  3. Reserve inventory (compensatable: release)

  → If step 2 fails, email was already sent. Cannot un-send.

CORRECT order:
  1. Reserve inventory (compensatable: release)
  2. Charge payment (compensatable: refund)
  3. Send confirmation email (non-compensatable — placed last)
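The ordering rule can be checked mechanically before a Saga runs. The following is a minimal sketch, assuming steps are represented as (name, compensation) pairs in which a compensation of None marks a non-compensatable step. The `validate_step_order` helper, the `noop` placeholder, and this representation are illustrative assumptions, not part of any framework.

```python
from typing import Callable, Optional, Sequence, Tuple

# (step name, compensation callable, or None if the step cannot be undone)
Step = Tuple[str, Optional[Callable[[], None]]]


def validate_step_order(steps: Sequence[Step]) -> None:
    """Raise ValueError if a compensatable step follows a non-compensatable one."""
    first_non_compensatable: Optional[str] = None
    for name, compensation in steps:
        if compensation is None:
            if first_non_compensatable is None:
                first_non_compensatable = name
        elif first_non_compensatable is not None:
            raise ValueError(
                f"step '{name}' runs after non-compensatable step "
                f"'{first_non_compensatable}'"
            )


def noop() -> None:
    """Placeholder compensation used only for this illustration."""


wrong = [
    ("send_confirmation_email", None),
    ("charge_payment", noop),
    ("reserve_inventory", noop),
]
correct = [
    ("reserve_inventory", noop),
    ("charge_payment", noop),
    ("send_confirmation_email", None),
]

validate_step_order(correct)  # passes silently
try:
    validate_step_order(wrong)
except ValueError as err:
    print(f"rejected: {err}")
```

Running a check like this when the Saga is defined, rather than when it executes, surfaces an ordering mistake before any customer-visible side effect occurs.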

E-Commerce Checkout Saga

saga_steps = [
    # Step, Compensation
    (reserve_inventory,    release_inventory_reservation),
    (create_order,         cancel_order),
    (charge_payment,       refund_payment),
    (ship_order,           cancel_shipment),      # Only if fulfillment started
    (send_confirmation,    None),                  # Non-compensatable — last
]

Reserve inventory: creates an inventory reservation with a TTL. Compensation: release the reservation (idempotent — releasing an already-released reservation is a no-op).

Create order: inserts an order record in PENDING state. Compensation: transition order to CANCELLED state (not DELETE — the order ID may have been communicated to the user already).

Charge payment: calls payment provider API with idempotency key. Compensation: call refund API with the charge ID. The refund is not instant — it may take days to reflect on the customer’s statement.

Ship order: sends fulfillment request. Compensation: cancel shipment if not yet dispatched; initiate return if already dispatched.

Send confirmation: sends email. No compensation — placed last so it only executes on full success.
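The compensations described above depend on idempotency, which can come from the operation itself or from an idempotency key. The sketch below shows both forms using in-memory stand-ins. The `InventoryService` and `PaymentService` classes, their method signatures, and the key format are assumptions made for illustration and do not correspond to any real provider's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class InventoryService:
    """In-memory stand-in for an inventory service (illustrative only)."""
    reservations: Set[str] = field(default_factory=set)

    def reserve(self, reservation_id: str) -> None:
        self.reservations.add(reservation_id)

    def release(self, reservation_id: str) -> None:
        # Naturally idempotent: discarding a missing reservation is a no-op.
        self.reservations.discard(reservation_id)


@dataclass
class PaymentService:
    """In-memory stand-in for a payment provider (illustrative only)."""
    refunds_by_key: Dict[str, int] = field(default_factory=dict)
    total_refunded: int = 0

    def refund(self, charge_id: str, amount: int, idempotency_key: str) -> int:
        # Key-based idempotency: a repeated key returns the earlier result
        # instead of issuing a second refund.
        if idempotency_key in self.refunds_by_key:
            return self.refunds_by_key[idempotency_key]
        self.total_refunded += amount
        self.refunds_by_key[idempotency_key] = amount
        return amount


inventory = InventoryService()
inventory.reserve("res-1")
inventory.release("res-1")
inventory.release("res-1")  # second call: nothing left to release

payments = PaymentService()
key = "refund:charge-42"  # derived from the charge ID, so retries reuse it
payments.refund("charge-42", 1500, key)
payments.refund("charge-42", 1500, key)  # retry after a simulated crash
print(payments.total_refunded)  # 1500, not 3000
```

Deriving the key from the charge ID, rather than generating a fresh one per attempt, is what lets a retried compensation be recognized as a duplicate.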

The Isolation Gap

The most important property that Sagas do not provide is isolation. Each step commits locally, so from the commit of an early step until the Saga either completes or finishes compensating, the intermediate state is visible to other transactions.

Saga executing:
  Step 1: Inventory reserved (visible: seat 101 shows "HELD")
  Step 2: Order created (visible: order #456 in PENDING state)
  Step 3: Payment fails

Compensation:
  Cancel order #456 (order transitions to CANCELLED)
  Release inventory reservation for seat 101

Window between step 2 commit and compensation:
  Another user queries orders → sees order #456 in PENDING
  Another user queries inventory → seat 101 shows HELD

Countermeasure patterns:

Semantic locks: mark resources with an “in-progress” flag during the Saga. Other transactions that see the flag either wait (blocking) or fail fast (non-blocking). This restores some isolation at the cost of coordination:

UPDATE seats SET status = 'SAGA_IN_PROGRESS', saga_id = $saga_id
WHERE seat_id = $seat_id AND status = 'AVAILABLE';
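Whether the lock was acquired can be read from the number of rows the UPDATE changed. The sketch below shows the fail-fast variant using Python's built-in `sqlite3` module with an in-memory database. The table layout and the `try_acquire_seat` and `release_seat` helpers are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats (seat_id INTEGER PRIMARY KEY, status TEXT NOT NULL, saga_id TEXT)"
)
conn.execute("INSERT INTO seats (seat_id, status) VALUES (101, 'AVAILABLE')")
conn.commit()


def try_acquire_seat(conn: sqlite3.Connection, seat_id: int, saga_id: str) -> bool:
    """Fail-fast semantic lock: True only if this Saga claimed the seat."""
    cur = conn.execute(
        "UPDATE seats SET status = 'SAGA_IN_PROGRESS', saga_id = ? "
        "WHERE seat_id = ? AND status = 'AVAILABLE'",
        (saga_id, seat_id),
    )
    conn.commit()
    # Zero rows changed means another Saga already holds the seat.
    return cur.rowcount == 1


def release_seat(conn: sqlite3.Connection, seat_id: int, saga_id: str) -> None:
    """Compensation: clear the flag, but only if this Saga still holds it."""
    conn.execute(
        "UPDATE seats SET status = 'AVAILABLE', saga_id = NULL "
        "WHERE seat_id = ? AND saga_id = ? AND status = 'SAGA_IN_PROGRESS'",
        (seat_id, saga_id),
    )
    conn.commit()


first = try_acquire_seat(conn, 101, "saga-A")   # seat is free
second = try_acquire_seat(conn, 101, "saga-B")  # seat is held by saga-A
release_seat(conn, 101, "saga-A")
third = try_acquire_seat(conn, 101, "saga-B")   # seat is free again
print(first, second, third)  # True False True
```

Matching on `saga_id` in the release keeps the compensation idempotent: a repeated release by the same Saga changes no rows and cannot free a seat that another Saga has since claimed.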

Pivot transaction: identify the step after which the Saga is committed to finishing. Everything before the pivot is compensatable; everything after the pivot is treated as non-compensatable and must be retried until it completes. The pivot step is the “point of no return.” Design the Saga so the pivot is as early as possible, reducing the visible intermediate state window.

For e-commerce: the payment charge is the pivot. Before payment: everything is reversible. After payment: the order must complete (fulfillment retries indefinitely) or be explicitly cancelled with a refund.
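The split around the pivot can be sketched as a small runner: a failure before or at the pivot triggers compensation in reverse order, while a failure after it triggers retries. This is a minimal illustration under stated assumptions. The `run_with_pivot` function, its status strings, and the bounded `max_attempts` (used here in place of indefinite retries so the example terminates) are all hypothetical.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]


def run_with_pivot(
    compensatable: List[Tuple[Action, Action]],  # (forward, compensation) pairs
    pivot: Action,
    retriable: List[Action],
    max_attempts: int = 5,
) -> str:
    """Run a Saga split around a pivot step and return its final status."""
    done: List[Action] = []  # compensations for steps that have committed
    try:
        for forward, compensate in compensatable:
            forward()
            done.append(compensate)
        pivot()  # point of no return
    except Exception:
        # Failure before or at the pivot: undo committed steps in reverse order.
        for compensate in reversed(done):
            compensate()
        return "ROLLED_BACK"

    # After the pivot there is no rollback, only forward progress via retry.
    for step in retriable:
        for _ in range(max_attempts):
            try:
                step()
                break
            except Exception:
                continue
        else:
            return "NEEDS_MANUAL_INTERVENTION"
    return "COMPLETED"


log: List[str] = []
attempts = {"ship": 0}


def flaky_ship() -> None:
    """Fails twice, then succeeds, to exercise the retry path."""
    attempts["ship"] += 1
    if attempts["ship"] < 3:
        raise RuntimeError("fulfillment unavailable")
    log.append("shipped")


status = run_with_pivot(
    compensatable=[(lambda: log.append("reserved"), lambda: log.append("released"))],
    pivot=lambda: log.append("charged"),
    retriable=[flaky_ship],
)
print(status, log)
```

A production runner would also persist progress between steps and back off between retries; both are omitted here to keep the pivot logic visible.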

Orchestration vs Choreography at Application Level

Orchestration (explicit saga coordinator):

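class CheckoutOrchestrator:
    # self.db, self.inventory, self.orders, and self.payments are injected clients.

    def execute(self, order_id: str):
        # On crash recovery this reloads persisted state; otherwise it starts fresh.
        saga_state = self.db.load_saga(order_id) or SagaState(order_id)

        # (step name, forward action, compensation), in execution order
        steps = [
            ('reserve_inventory', self.inventory.reserve, self.inventory.release),
            ('create_order',      self.orders.create,    self.orders.cancel),
            ('charge_payment',    self.payments.charge,  self.payments.refund),
        ]

        for step_name, forward, _ in steps:
            if step_name in saga_state.completed_steps:
                continue  # Resume: skip steps already recorded as completed

            try:
                # A crash after forward() but before save_saga() re-runs this
                # step on resume, so forward() itself must be idempotent.
                result = forward(order_id)
                saga_state.record_completion(step_name, result)
                self.db.save_saga(saga_state)
            except Exception:
                self._compensate(saga_state, steps)
                raise

    def _compensate(self, saga_state, steps):
        # Undo completed steps in reverse order. Each compensation must be
        # idempotent, because a crash here means it may run again on recovery.
        for step_name, _, compensate in reversed(steps):
            if step_name not in saga_state.completed_steps:
                continue
            compensate(saga_state.get_result(step_name))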

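The orchestrator persists saga state to the database. On crash recovery, it reloads the state and resumes after the last completed step; steps already recorded as completed are skipped. Because a crash can occur after a step executes but before its completion is saved, each forward action must itself be idempotent.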

Choreography (event-driven, no central coordinator):

Each service listens for events from prior steps and publishes events for subsequent steps:

OrderService:       publishes OrderCreated
InventoryService:   listens OrderCreated → reserves → publishes InventoryReserved
PaymentService:     listens InventoryReserved → charges → publishes PaymentCharged
FulfillmentService: listens PaymentCharged → ships → publishes OrderShipped

On failure:
PaymentService:     publishes PaymentFailed
InventoryService:   listens PaymentFailed → releases → publishes InventoryReleased
OrderService:       listens InventoryReleased → cancels order
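The event flow above can be simulated with a synchronous in-memory bus. This is a sketch only: the `EventBus` class stands in for a real message broker, the handlers reduce each service to the event it publishes, and the `wire_services` helper and `payment_succeeds` flag are assumptions introduced for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker (illustrative only)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # event types in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(event)


def wire_services(
    bus: EventBus, payment_succeeds: bool
) -> Tuple[Dict[str, str], Callable[[str], None]]:
    orders: Dict[str, str] = {}  # order_id -> status, owned by OrderService

    def create_order(order_id: str) -> None:  # OrderService entry point
        orders[order_id] = "PENDING"
        bus.publish("OrderCreated", {"order_id": order_id})

    def reserve_inventory(evt: Event) -> None:  # InventoryService
        bus.publish("InventoryReserved", evt)

    def charge_payment(evt: Event) -> None:  # PaymentService
        bus.publish("PaymentCharged" if payment_succeeds else "PaymentFailed", evt)

    def ship(evt: Event) -> None:  # FulfillmentService
        bus.publish("OrderShipped", evt)

    def release_inventory(evt: Event) -> None:  # InventoryService compensation
        bus.publish("InventoryReleased", evt)

    def cancel_order(evt: Event) -> None:  # OrderService compensation
        orders[evt["order_id"]] = "CANCELLED"

    bus.subscribe("OrderCreated", reserve_inventory)
    bus.subscribe("InventoryReserved", charge_payment)
    bus.subscribe("PaymentCharged", ship)
    bus.subscribe("PaymentFailed", release_inventory)
    bus.subscribe("InventoryReleased", cancel_order)
    return orders, create_order


bus = EventBus()
orders, create_order = wire_services(bus, payment_succeeds=False)
create_order("order-456")
print(bus.history)
print(orders)
```

No component in this sketch holds the whole flow: the sequence exists only as the set of subscriptions, which is why tracing a choreographed Saga usually depends on correlating events by an identifier such as the order ID.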

Choreography has no orchestrator to act as a single point of failure, but it is harder to trace and debug, and adding a new step usually means modifying multiple services.

Choosing between them: use orchestration when the Saga involves more than three steps, when visibility into Saga state is needed (monitoring, debugging), or when compensation logic is complex. Use choreography when the flow is simple and the services are owned by independent teams.

Database Transactions Within a Single Service

Not every Agreement problem requires a Saga. If the operation stays within a single service’s database boundary, a local transaction is correct, simpler, and isolated by the database itself:

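# Agreement within one service: use a transaction, not a Saga.
# `db` is the service's database handle; db.transaction() is assumed to
# commit on normal exit and roll back if the block raises.
class InsufficientFunds(Exception):
    pass

def transfer_between_accounts(from_id: str, to_id: str, amount: int):
    if amount <= 0:
        # A negative amount would reverse the transfer and bypass the balance check.
        raise ValueError("amount must be positive")
    with db.transaction():
        # FOR UPDATE locks the source row so concurrent transfers cannot overdraw it.
        from_account = db.lock_and_get("SELECT * FROM accounts WHERE id = $1 FOR UPDATE", from_id)
        if from_account.balance < amount:
            raise InsufficientFunds()
        db.execute("UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from_id)
        db.execute("UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to_id)
        db.execute("INSERT INTO transfers (from_id, to_id, amount) VALUES ($1, $2, $3)",
                   from_id, to_id, amount)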

The rule: use a database transaction when all participants share the same database. Only escalate to a Saga when the operation crosses a service boundary with its own data store.

Agreement Decision Matrix

Scenario	Mechanism	Rationale
Two tables in same database	Local transaction	Single ACID boundary
Two databases in same service	XA (if supported) or Saga	Crossing storage boundary
Two microservices, each with own DB	Saga	No shared transaction coordinator
Long-lived user flow (minutes between steps)	Saga with TTL-bounded reservations	2PC would hold locks for minutes
Operation includes non-compensatable step	Saga with pivot — non-compensatable last	Cannot undo email/regulatory submission
Strong isolation required across services	Rethink service boundary	Saga cannot provide isolation

The last row is important: if the isolation gap in Sagas is unacceptable for your use case, the answer is not a better coordination mechanism — it is a different service boundary. Operations that truly require isolation should share a database.