Agreement at Application Level — Sagas, Compensation, and Cross-Service Atomicity
Chapter 4 introduced Agreement at the infrastructure level: Raft log replication ensures replicas agree on the same entry, 2PC coordinates multiple database participants. At the application level, Agreement is the same mechanism operating at a different boundary: multiple business operations — spanning services, databases, and external APIs — must all succeed or all be undone.
The structural difference is critical:
Infrastructure Agreement: a small number of participants (nodes in a cluster), each implementing a well-defined protocol (Raft, XA). Failures are handled by the consensus algorithm. The application sees a single atomic operation.
Application Agreement: participants are independent services with their own data stores, their own failure modes, and no shared transaction coordinator. Failures must be handled explicitly by the application, using compensation logic written by the developer.
This is where infrastructure Agreement stops and application Agreement begins: at the service boundary.
When Application Agreement Is Needed
Application Agreement is required when:
- An operation spans multiple services and partial completion leaves the system in a worse state than no completion.
- The services do not share a database (preventing a single transaction).
- The operations are too long-lived for 2PC (the user is browsing between steps).
| Domain | The Agreement problem |
|---|---|
| E-commerce checkout | Payment + inventory decrement + order creation must all succeed or all roll back |
| Travel booking | Hotel + flight + car rental must all be confirmed or all released |
| Bank transfer | Debit source account + credit destination account must be atomic |
| Account creation | Create user record + send welcome email + provision storage must all succeed |
| Order cancellation | Refund payment + release inventory + cancel fulfillment must all succeed |
The Saga Pattern at Application Level
Chapter 4 introduced Saga mechanics. This chapter focuses on how to design Sagas correctly for real application domains — the compensation design problem, the isolation gap, and the failure modes that emerge in production.
Designing Compensations
Every Saga step that can be compensated requires a compensation action. The compensation must satisfy:
Idempotency: if the compensation is executed twice (for example, because the orchestrator crashed mid-compensation and retried after recovery), the second execution must be a no-op. This requires either natural idempotency (releasing a reservation that is already released succeeds silently) or an idempotency key on the compensation call.
Semantic correctness: the compensation undoes the business effect of the step, not just the technical operation. A payment refund is not DELETE FROM charges WHERE id = X — it is a refund transaction in the payment provider’s system, which may have fees, time limits, and regulatory requirements.
Not all steps are compensatable: sending an email, publishing a public tweet, or transmitting data to a regulatory body cannot be undone. These steps must be placed last in the Saga — only executed once all prior steps have succeeded. If a non-compensatable step fails, the Saga cannot roll back; it must move forward to a completed state via retry or manual intervention.
    WRONG order:
      1. Send confirmation email (non-compensatable)
      2. Charge payment (compensatable: refund)
      3. Reserve inventory (compensatable: release)
      → If step 2 fails, email was already sent. Cannot un-send.

    CORRECT order:
      1. Reserve inventory (compensatable: release)
      2. Charge payment (compensatable: refund)
      3. Send confirmation email (non-compensatable, placed last)
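The ordering rule can be checked mechanically before a Saga ever runs. The sketch below assumes each step is described as a (name, compensation) pair, with None marking a non-compensatable step. The function `check_step_order` and the placeholder `noop` are hypothetical names introduced for illustration only.

```python
from typing import Callable, Optional, Sequence, Tuple

# (step name, compensation callable, or None if the step cannot be undone)
Step = Tuple[str, Optional[Callable[[], None]]]


def check_step_order(steps: Sequence[Step]) -> None:
    """Raise ValueError if a compensatable step runs after a non-compensatable one."""
    blocker = None  # name of the first non-compensatable step seen so far
    for name, compensation in steps:
        if compensation is None:
            blocker = blocker or name
        elif blocker is not None:
            raise ValueError(
                f"{name!r} is compensatable but runs after non-compensatable {blocker!r}"
            )


def noop() -> None:
    """Placeholder compensation used only by this example."""


wrong = [("send_confirmation", None), ("charge_payment", noop), ("reserve_inventory", noop)]
correct = [("reserve_inventory", noop), ("charge_payment", noop), ("send_confirmation", None)]

check_step_order(correct)  # passes silently
try:
    check_step_order(wrong)
    wrong_rejected = False
except ValueError:
    wrong_rejected = True
```

A check like this could run in a unit test or at service start-up, so a misordered Saga definition is caught before it reaches production traffic.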
E-Commerce Checkout Saga
    saga_steps = [
        # (forward step, compensation)
        (reserve_inventory, release_inventory_reservation),
        (create_order, cancel_order),
        (charge_payment, refund_payment),
        (ship_order, cancel_shipment),  # Only if fulfillment started
        (send_confirmation, None),  # Non-compensatable: placed last
    ]
Reserve inventory: creates an inventory reservation with a TTL. Compensation: release the reservation (idempotent — releasing an already-released reservation is a no-op).
Create order: inserts an order record in PENDING state. Compensation: transition order to CANCELLED state (not DELETE — the order ID may have been communicated to the user already).
Charge payment: calls payment provider API with idempotency key. Compensation: call refund API with the charge ID. The refund is not instant — it may take days to reflect on the customer’s statement.
Ship order: sends fulfillment request. Compensation: cancel shipment if not yet dispatched; initiate return if already dispatched.
Send confirmation: sends email. No compensation — placed last so it only executes on full success.
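The reservation step and its compensation can be sketched in memory. This is a minimal illustration, not a real inventory service: `ReservationStore`, its method names, and the injected `clock` parameter are assumptions made for the example. It shows the two properties described above: a reservation that expires on its own after a TTL, and a release that is a no-op when repeated.

```python
import time
from typing import Callable, Dict, Tuple


class ReservationStore:
    """In-memory stand-in for an inventory service (hypothetical, for illustration)."""

    def __init__(self, stock: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.stock = stock
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # reservation_id -> (quantity, expiry time)
        self.held: Dict[str, Tuple[int, float]] = {}

    def _expire(self) -> None:
        """Return stock from reservations whose TTL has passed."""
        now = self.clock()
        expired = [rid for rid, (_, expires) in self.held.items() if expires <= now]
        for rid in expired:
            self.release(rid)

    def reserve(self, reservation_id: str, quantity: int) -> None:
        self._expire()
        if reservation_id in self.held:
            return  # forward step replayed: keep the existing hold
        if quantity > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= quantity
        self.held[reservation_id] = (quantity, self.clock() + self.ttl_seconds)

    def release(self, reservation_id: str) -> bool:
        """Compensation. Returns False when there was nothing left to release."""
        entry = self.held.pop(reservation_id, None)
        if entry is None:
            return False  # already released or expired: no-op
        self.stock += entry[0]
        return True


now = [0.0]  # fake clock so the example is deterministic
store = ReservationStore(stock=5, ttl_seconds=600, clock=lambda: now[0])
store.reserve("order-456", 2)
first_release = store.release("order-456")   # compensation runs
second_release = store.release("order-456")  # compensation replayed after a crash
store.reserve("order-789", 5)
now[0] = 601.0                               # TTL elapses without a release call
store.reserve("order-790", 1)                # expiry of order-789 frees its stock first
```

The TTL acts as a safety net: even if the compensation is never delivered, the held stock eventually returns to the pool.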
The Isolation Gap
The most important property that Sagas do not provide is isolation. Each step commits locally, so from the moment an early step commits until the Saga either completes or compensates that step, the intermediate state is visible to other transactions.
    Saga executing:
      Step 1: Inventory reserved (visible: seat 101 shows "HELD")
      Step 2: Order created (visible: order #456 in PENDING state)
      Step 3: Payment fails

    Compensation:
      Cancel order #456 (order transitions to CANCELLED)
      Release inventory reservation for seat 101

    Window between step 2 commit and compensation:
      Another user queries orders → sees order #456 in PENDING
      Another user queries inventory → seat 101 shows HELD
Countermeasure patterns:
Semantic locks: mark resources with an “in-progress” flag during the Saga. Other transactions that see the flag either wait (blocking) or fail fast (non-blocking). This restores some isolation at the cost of coordination:
    UPDATE seats SET status = 'SAGA_IN_PROGRESS', saga_id = $saga_id
    WHERE seat_id = $seat_id AND status = 'AVAILABLE';
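The conditional UPDATE only acts as a lock if the caller checks whether a row was actually changed. The following sketch runs the same statement against an in-memory SQLite database using Python's standard `sqlite3` module. The table schema, the `?` placeholders (in place of `$saga_id` and `$seat_id`), and the helper names `try_acquire_seat` and `release_seat` are assumptions for the example; it shows the fail-fast (non-blocking) variant.

```python
import sqlite3


def try_acquire_seat(conn: sqlite3.Connection, seat_id: int, saga_id: str) -> bool:
    """Take the semantic lock. Returns False if another Saga already holds the seat."""
    cur = conn.execute(
        "UPDATE seats SET status = 'SAGA_IN_PROGRESS', saga_id = ? "
        "WHERE seat_id = ? AND status = 'AVAILABLE'",
        (saga_id, seat_id),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows changed means the seat was not AVAILABLE


def release_seat(conn: sqlite3.Connection, seat_id: int, saga_id: str) -> None:
    """Compensation: clear the flag, but only if this Saga is the one holding it."""
    conn.execute(
        "UPDATE seats SET status = 'AVAILABLE', saga_id = NULL "
        "WHERE seat_id = ? AND saga_id = ? AND status = 'SAGA_IN_PROGRESS'",
        (seat_id, saga_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats (seat_id INTEGER PRIMARY KEY, status TEXT NOT NULL, saga_id TEXT)"
)
conn.execute("INSERT INTO seats VALUES (101, 'AVAILABLE', NULL)")
conn.commit()

won = try_acquire_seat(conn, 101, "saga-A")   # first Saga takes the lock
lost = try_acquire_seat(conn, 101, "saga-B")  # second Saga fails fast
```

Scoping the release to the owning `saga_id` keeps a late or replayed compensation from one Saga from clearing a lock that another Saga has since acquired.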
Pivot transaction: identify the step after which the Saga can no longer be rolled back. Everything before the pivot is compensatable; everything after the pivot is non-compensatable and must be retried until it eventually completes. The pivot step is the “point of no return.” Design the Saga so the pivot is as early as possible, reducing the visible intermediate state window.
For e-commerce: the payment charge is the pivot. Before payment: everything is reversible. After payment: the order must complete (fulfillment retries indefinitely) or be explicitly cancelled with a refund.
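The two failure policies on either side of the pivot can be sketched as a small runner. This is an illustration under assumptions, not a production executor: `run_saga`, the `(name, forward, compensate)` step shape, the bounded `max_retries`, and the returned status strings are all hypothetical. A failure at or before the pivot triggers compensation in reverse order; a failure after the pivot is retried and, if retries run out, escalated rather than rolled back.

```python
from typing import Callable, List, Optional, Tuple

# (step name, forward action, compensation or None)
Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_saga(pre_pivot: List[Step], pivot: Step, post_pivot: List[Step],
             max_retries: int = 3) -> str:
    done: List[Step] = []
    try:
        for step in pre_pivot + [pivot]:
            step[1]()          # run the forward action
            done.append(step)
    except Exception:
        # Failure at or before the pivot: undo completed steps in reverse order.
        for _, _, compensate in reversed(done):
            if compensate is not None:
                compensate()
        return "ROLLED_BACK"

    # Past the pivot: never roll back, only retry forward.
    for _, forward, _ in post_pivot:
        for _attempt in range(max_retries):
            try:
                forward()
                break
            except Exception:
                continue
        else:
            return "NEEDS_MANUAL_INTERVENTION"
    return "COMPLETED"


log: List[str] = []
attempts = {"ship": 0}


def flaky_ship() -> None:
    attempts["ship"] += 1
    if attempts["ship"] < 3:
        raise RuntimeError("fulfillment unavailable")
    log.append("shipped")


def declined_charge() -> None:
    raise RuntimeError("card declined")


reserve: Step = ("reserve", lambda: log.append("reserved"), lambda: log.append("released"))

happy = run_saga([reserve], ("charge", lambda: log.append("charged"), None),
                 [("ship", flaky_ship, None)])
happy_log = list(log)

log.clear()
failed = run_saga([reserve], ("charge", declined_charge, None), [])
failed_log = list(log)
```

In a real system the retry loop would usually be a durable queue with backoff rather than an in-process loop, so that retries survive a restart.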
Orchestration vs Choreography at Application Level
Orchestration (explicit saga coordinator):
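    class CheckoutOrchestrator:
        def __init__(self, db, inventory, orders, payments):
            # Injected dependencies: a saga-state store and one client per service
            self.db = db
            self.inventory = inventory
            self.orders = orders
            self.payments = payments

        def execute(self, order_id: str):
            # Reload persisted state after a crash, or start a new Saga
            saga_state = self.db.load_saga(order_id) or SagaState(order_id)
            steps = [
                # (step name, forward action, compensation)
                ('reserve_inventory', self.inventory.reserve, self.inventory.release),
                ('create_order', self.orders.create, self.orders.cancel),
                ('charge_payment', self.payments.charge, self.payments.refund),
            ]
            for step_name, forward, _ in steps:
                if step_name in saga_state.completed_steps:
                    continue  # Idempotent resume: skip already-completed steps
                try:
                    # forward must itself be idempotent: a crash between this call
                    # and save_saga makes the step run again on resume
                    result = forward(order_id)
                    saga_state.record_completion(step_name, result)
                    self.db.save_saga(saga_state)
                except Exception:
                    self._compensate(saga_state, steps)
                    raise

        def _compensate(self, saga_state, steps):
            # Undo completed steps in reverse order, passing each step's saved result
            for step_name, _, compensate in reversed(steps):
                if step_name not in saga_state.completed_steps:
                    continue
                compensate(saga_state.get_result(step_name))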
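The orchestrator persists saga state to the database after each step. On crash recovery, it reloads the state and resumes at the first step that has not been recorded as completed; completed steps are skipped (idempotent resume). A step that finished just before the crash but was not yet recorded will run again, so the forward actions must be idempotent as well.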
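The resume behaviour can be demonstrated in isolation. The sketch below is a stripped-down stand-in for the orchestrator above, with a plain dictionary in place of the database and an exception in place of a process crash. `saga_store`, `run_steps`, and the step functions are hypothetical names. It omits compensation, because a crashed process runs no handler at all.

```python
from typing import Callable, Dict, List, Tuple

# order_id -> names of completed steps (stand-in for a persisted saga table)
saga_store: Dict[str, List[str]] = {}

calls = {"reserve_inventory": 0, "create_order": 0}
crash_armed = {"value": True}


def run_steps(order_id: str, steps: List[Tuple[str, Callable[[str], None]]]) -> None:
    completed = saga_store.setdefault(order_id, [])
    for name, forward in steps:
        if name in completed:
            continue  # already done in an earlier run: skip on resume
        forward(order_id)
        completed.append(name)  # "persist" progress right after the step succeeds


def reserve_inventory(order_id: str) -> None:
    calls["reserve_inventory"] += 1


def create_order(order_id: str) -> None:
    if crash_armed["value"]:
        crash_armed["value"] = False
        raise RuntimeError("simulated crash before the step ran")
    calls["create_order"] += 1


steps = [("reserve_inventory", reserve_inventory), ("create_order", create_order)]

try:
    run_steps("order-1", steps)  # first run dies during step 2
except RuntimeError:
    pass
run_steps("order-1", steps)      # recovery run resumes at step 2
```

After the second run, each forward action has executed exactly once, because the first step was recorded before the simulated crash.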
Choreography (event-driven, no central coordinator):
Each service listens for events from prior steps and publishes events for subsequent steps:
    OrderService:       publishes OrderCreated
    InventoryService:   listens OrderCreated → reserves → publishes InventoryReserved
    PaymentService:     listens InventoryReserved → charges → publishes PaymentCharged
    FulfillmentService: listens PaymentCharged → ships → publishes OrderShipped

    On failure:
      PaymentService:   publishes PaymentFailed
      InventoryService: listens PaymentFailed → releases → publishes InventoryReleased
      OrderService:     listens InventoryReleased → cancels order
Choreography removes the orchestrator as a single point of failure, but it is harder to trace, harder to debug, and harder to extend: adding a new step means modifying multiple services.
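The event flow above can be simulated with an in-process event bus. This is a sketch under assumptions: `EventBus`, `wire_services`, and `run_checkout` are hypothetical, each service is reduced to a one-line handler, and the final `OrderCancelled` event is an addition for observability that does not appear in the flow above. A real deployment would use a message broker with durable delivery.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, List, Tuple


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
        self.queue: Deque[Tuple[str, str]] = deque()
        self.history: List[str] = []  # event types in delivery order

    def subscribe(self, event_type: str, handler: Callable[[str], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, order_id: str) -> None:
        self.queue.append((event_type, order_id))

    def run(self) -> None:
        while self.queue:
            event_type, order_id = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.handlers[event_type]:
                handler(order_id)


def wire_services(bus: EventBus, payment_succeeds: bool) -> None:
    # InventoryService: reserve on OrderCreated, release on PaymentFailed
    bus.subscribe("OrderCreated", lambda oid: bus.publish("InventoryReserved", oid))
    bus.subscribe("PaymentFailed", lambda oid: bus.publish("InventoryReleased", oid))
    # PaymentService: charge once inventory is reserved
    bus.subscribe(
        "InventoryReserved",
        lambda oid: bus.publish("PaymentCharged" if payment_succeeds else "PaymentFailed", oid),
    )
    # FulfillmentService: ship once payment is charged
    bus.subscribe("PaymentCharged", lambda oid: bus.publish("OrderShipped", oid))
    # OrderService: cancel once inventory is released (OrderCancelled is an assumed event)
    bus.subscribe("InventoryReleased", lambda oid: bus.publish("OrderCancelled", oid))


def run_checkout(payment_succeeds: bool) -> List[str]:
    bus = EventBus()
    wire_services(bus, payment_succeeds)
    bus.publish("OrderCreated", "order-1")
    bus.run()
    return bus.history


success_flow = run_checkout(payment_succeeds=True)
failure_flow = run_checkout(payment_succeeds=False)
```

Note that the only place the whole flow is visible is the recorded history. No single component holds the sequence, which is the tracing difficulty described above.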
Choosing between them: use orchestration when the Saga involves more than three steps, when visibility into Saga state is needed (monitoring, debugging), or when compensation logic is complex. Use choreography when the flow is simple and the services are owned by independent teams.
Database Transactions Within a Single Service
Not every Agreement problem requires a Saga. If the operation is within a single service’s database boundary, a local transaction is correct, simpler, and has full isolation:
    # Agreement within one service: use a transaction, not a Saga
    # (db is an assumed database handle; InsufficientFunds is an application exception)
    def transfer_between_accounts(from_id: str, to_id: str, amount: int):
        with db.transaction():
            # Lock the source row so a concurrent transfer cannot spend the same balance
            from_account = db.lock_and_get("SELECT * FROM accounts WHERE id = $1 FOR UPDATE", from_id)
            if from_account.balance < amount:
                raise InsufficientFunds()
            db.execute("UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from_id)
            db.execute("UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to_id)
            db.execute("INSERT INTO transfers (from_id, to_id, amount) VALUES ($1, $2, $3)",
                       from_id, to_id, amount)
The rule: use a database transaction when all participants share the same database. Only escalate to a Saga when the operation crosses a service boundary with its own data store.
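For a version that runs as written, the same transfer can be expressed against SQLite with Python's standard `sqlite3` module. This is a sketch under assumptions: the schema, the `transfer` and `balance` helpers, and the account names are invented for the example. SQLite has no `FOR UPDATE`, so `BEGIN IMMEDIATE` is used here to take the write lock before the balance is read. The second call shows the atomicity guarantee: the debit is rolled back when the credit fails.

```python
import sqlite3


class InsufficientFunds(Exception):
    pass


def transfer(conn: sqlite3.Connection, from_id: str, to_id: str, amount: int) -> None:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading the balance
        row = conn.execute("SELECT balance FROM accounts WHERE id = ?", (from_id,)).fetchone()
        if row is None or row[0] < amount:
            raise InsufficientFunds(from_id)
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        credited = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id)
        )
        if credited.rowcount != 1:
            # Raising here rolls back the debit made two statements earlier.
            raise LookupError(f"unknown destination account {to_id!r}")
        conn.execute(
            "INSERT INTO transfers (from_id, to_id, amount) VALUES (?, ?, ?)",
            (from_id, to_id, amount),
        )


def balance(conn: sqlite3.Connection, account_id: str) -> int:
    return conn.execute("SELECT balance FROM accounts WHERE id = ?", (account_id,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE transfers (from_id TEXT, to_id TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

transfer(conn, "alice", "bob", 30)         # succeeds: both updates commit together
try:
    transfer(conn, "alice", "nobody", 50)  # credit fails, so the debit is rolled back
except LookupError:
    pass
```

No compensation logic is needed anywhere in this example; the database's rollback does the work that a Saga would have to implement by hand.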
Agreement Decision Matrix
| Scenario | Mechanism | Rationale |
|---|---|---|
| Two tables in same database | Local transaction | Single ACID boundary |
| Two databases in same service | XA (if supported) or Saga | Crossing storage boundary |
| Two microservices, each with own DB | Saga | No shared transaction coordinator |
| Long-lived user flow (minutes between steps) | Saga with TTL-bounded reservations | 2PC would hold locks for minutes |
| Operation includes non-compensatable step | Saga with pivot — non-compensatable last | Cannot undo email/regulatory submission |
| Strong isolation required across services | Rethink service boundary | Saga cannot provide isolation |
The last row is important: if the isolation gap in Sagas is unacceptable for your use case, the answer is not a better coordination mechanism — it is a different service boundary. Operations that truly require isolation should share a database.