Payment Gateway: System Design #

A payment gateway sits between a merchant and the card networks / banks. It authorizes transactions, routes them to the right payment processor, handles 3D Secure authentication, manages captures and refunds, and delivers webhooks. This design follows the standard interview delivery framework and is grounded in Hyperswitch (open-source payment switch, Rust), Stripe’s OpenAPI spec, and the key industry standards.


1. Requirements (~5 minutes) #

Functional Requirements #

Core payment flow

  • Accept a payment request (card, bank transfer, wallet) from a merchant
  • Route the request to an appropriate payment connector (Stripe, Adyen, PayPal, Braintree)
  • Handle 3D Secure / SCA (Strong Customer Authentication) challenge when required
  • Return synchronous authorization result or async status

Payment lifecycle

  • Authorize and auto-capture, or authorize-only with manual capture
  • Void an authorized payment
  • Refund a captured payment (full or partial)
  • Handle disputes / chargebacks

Reliability

  • Idempotent payment creation and confirmation (safe to retry)
  • Webhook delivery for all payment events (with retries)
  • Automatic retry on soft declines using gateway status mapping

Merchant operations

  • Multiple connectors per merchant with routing rules
  • Connector fallback when primary fails
  • Payment method management (vault card tokens)

Non-Functional Requirements #

Requirement | Target
Authorization latency (P99) | < 2s (connector RTT dominates)
API latency (P99, excluding connector) | < 200ms
Availability | 99.99% (< 1h downtime/year)
Exactly-once processing | No double charges
PCI DSS compliance | Level 1 (> 6M transactions/year)
Audit trail | Immutable log of all state transitions
Throughput | 1,000 TPS peak authorization

Out of scope: ledger / double-entry accounting, FX conversion, card issuing, bank onboarding.

Capacity Estimation #

Traffic

  • 1,000 TPS peak → ~86M transactions/day if the peak were sustained for 24 hours (an upper bound used for sizing)
  • Read:write ratio ~5:1 (status checks, reporting)

Storage (per PaymentIntent ~2 KB, PaymentAttempt ~1 KB)

  • 86M × 3 KB = ~260 GB/day raw
  • With 1-year retention: ~95 TB (260 GB × 365; cold-tiered after 90 days)
  • Redis (idempotency + session cache): ~50 GB hot

Connector calls

  • 1,000 TPS × avg 600ms connector RTT → ~600 concurrent outbound requests in flight (Little’s law), spread across all connectors
  • Webhook delivery: ~1,000 events/s → async queue
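The arithmetic behind these estimates can be checked with a few helper functions. This is a minimal sketch: the function names are made up for illustration, and the inputs are the section's stated assumptions (1,000 TPS, 600 ms RTT, 3 KB per transaction), not measurements.

```rust
/// Little's law: requests in flight = arrival rate x average time in system.
fn in_flight_requests(tps: f64, avg_rtt_ms: f64) -> f64 {
    tps * (avg_rtt_ms / 1000.0)
}

/// Upper bound on daily transactions if the peak rate were sustained for 24 hours.
fn transactions_per_day(tps: u64) -> u64 {
    tps * 86_400
}

/// Raw storage per day in decimal gigabytes (1 GB = 1_000_000 KB).
fn storage_gb_per_day(transactions: u64, kb_per_transaction: u64) -> f64 {
    (transactions * kb_per_transaction) as f64 / 1_000_000.0
}
```

With the section's inputs, `in_flight_requests(1000.0, 600.0)` gives 600 and `storage_gb_per_day(86_400_000, 3)` gives 259.2, which the text rounds to ~260 GB/day.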

2. Core Entities (~2 minutes) #

Merchant
  └── MerchantConnectorAccount (credentials per connector)
  └── RoutingAlgorithm (static rules / dynamic config)

Customer
  └── PaymentMethod (vaulted card token, bank account)

PaymentIntent          ← top-level payment object
  ├── status: IntentStatus
  ├── amount, currency
  ├── customer_id
  └── PaymentAttempt[] ← one per routing/retry attempt
        ├── status: AttemptStatus
        ├── connector
        ├── connector_transaction_id
        └── error_code, error_message

Refund (linked to PaymentIntent)
Dispute (linked to PaymentAttempt)
WebhookEvent (outbox, linked to PaymentIntent)
ProcessTracker (async job: payment_sync, webhook_retry, refund)

Two-level state model (from Hyperswitch common_enums/src/enums.rs):

IntentStatus — the merchant-facing view:

RequiresPaymentMethod → RequiresConfirmation → RequiresCustomerAction (3DS)
  → RequiresMerchantAction (fraud review) → RequiresCapture → Processing
  → Succeeded | Failed | Cancelled | Expired

AttemptStatus — the connector-level view:

Started → Pending → AuthenticationPending → Authorized → CaptureInitiated
  → Charged | CaptureFailed | AuthorizationFailed | Failure

IntentStatus is derived from the active AttemptStatus. When a connector returns Authorized, PaymentIntent moves to RequiresCapture (for manual capture) or triggers auto-capture.
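The derivation can be pictured as a pure function from the active attempt's status and the capture method to the intent status. The sketch below uses trimmed-down enums and a mapping chosen for illustration; it is an assumption about how such a derivation could look, not Hyperswitch's actual mapping.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CaptureMethod { Automatic, Manual }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AttemptStatus {
    Started, Pending, AuthenticationPending, Authorized, CaptureInitiated,
    Charged, CaptureFailed, AuthorizationFailed, Failure,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntentStatus { RequiresCustomerAction, RequiresCapture, Processing, Succeeded, Failed }

/// Hypothetical mapping: merchant-facing status derived from the connector-level status.
fn derive_intent_status(attempt: AttemptStatus, capture: CaptureMethod) -> IntentStatus {
    match (attempt, capture) {
        (AttemptStatus::AuthenticationPending, _) => IntentStatus::RequiresCustomerAction,
        // Manual capture: the merchant must call /capture next.
        (AttemptStatus::Authorized, CaptureMethod::Manual) => IntentStatus::RequiresCapture,
        // Auto-capture: the capture call is still to come, so the intent is in progress.
        (AttemptStatus::Authorized, CaptureMethod::Automatic) => IntentStatus::Processing,
        (AttemptStatus::Started | AttemptStatus::Pending | AttemptStatus::CaptureInitiated, _) => {
            IntentStatus::Processing
        }
        (AttemptStatus::Charged, _) => IntentStatus::Succeeded,
        (AttemptStatus::CaptureFailed | AttemptStatus::AuthorizationFailed | AttemptStatus::Failure, _) => {
            IntentStatus::Failed
        }
    }
}
```

Because the match is exhaustive, adding a new attempt status forces a decision about its merchant-facing meaning at compile time.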


3. API Interface (~5 minutes) #

The API follows Stripe’s OpenAPI spec, treated here as the industry standard, which Hyperswitch’s REST API mirrors.

Create Payment Intent #

POST /payments

{
  "amount": 1000,
  "currency": "USD",
  "payment_method": "pm_card_visa",
  "capture_method": "automatic" | "manual",
  "confirm": true,
  "customer_id": "cus_xxx",
  "metadata": {},
  "return_url": "https://merchant.com/return"
}

→ 200 PaymentIntent {
  "payment_id": "pay_xxx",
  "status": "processing" | "requires_customer_action",
  "client_secret": "pay_xxx_secret_yyy",  // for browser-side 3DS
  "next_action": {
    "type": "redirect_to_url",
    "redirect_to_url": { "url": "https://..." }
  }
}

Idempotency: Idempotency-Key: <uuid> header. Safe to retry — same key returns same response if already processed.

Confirm Payment #

POST /payments/{payment_id}/confirm

{
  "payment_method": "pm_card_visa",
  "browser_info": { ... }   // for 3DS fingerprinting
}

Capture / Void / Refund #

POST /payments/{payment_id}/capture    { "amount_to_capture": 800 }
POST /payments/{payment_id}/cancel
POST /refunds                          { "payment_id": "pay_xxx", "amount": 500 }

Webhooks (outbound, to merchant) #

POST <merchant_webhook_url>
{
  "event_type": "payment.succeeded" | "payment.failed" | "refund.created",
  "data": { "object": <PaymentIntent> },
  "idempotency_key": "evt_xxx"
}

Standards reference: API security profile follows FAPI 2.0 (OAuth2 + mTLS or DPoP for merchant authentication). Connector-level protocol is ISO 8583 (card networks) or ISO 20022 (bank-to-bank / SEPA / SWIFT). For ACH and wire transfers, Moov provides open-source implementations of NACHA ACH, Fedwire, and ISO 8583 framing, which serve as the transport substrate reference for bank rail connectors. PCI DSS requires any stored PAN to be rendered unreadable (encrypted, truncated, hashed, or tokenized) and forbids storing sensitive authentication data such as the CVV after authorization; in this design raw card numbers are confined to the vault and never touch application servers.


4. Data Flow (~5 minutes) #

Happy Path: Card Authorization with Auto-Capture #

Merchant → POST /payments/{payment_id}/confirm
    │
    ▼
[API Gateway]  TLS termination, merchant auth (API key / OAuth2)
    │
    ▼
[Router Service]
    1. Load PaymentIntent (status = RequiresConfirmation)
    2. Run routing engine → select connector (e.g., Stripe)
    3. Persist AttemptStatus = Started
    4. Build connector request (ConnectorIntegration::build_request)
    │
    ▼
[Connector Gateway]  outbound mTLS to Stripe/Adyen
    │  ISO 8583 or connector REST API
    ▼
[Card Network]  Visa/Mastercard authorization
    │
    ◄── connector response (Authorized / Declined)
    │
[Router Service]
    5. Map connector response → AttemptStatus, then derive IntentStatus
    6. If Authorized + auto-capture: POST /capture inline
    7. Persist PaymentAttempt (status = Charged)
    8. Persist PaymentIntent (status = Succeeded)
    9. Enqueue WebhookEvent → ProcessTracker
    │
    ▼
[Webhook Scheduler]  async delivery to merchant
    │
    ▼
Merchant ← POST <webhook_url> { event: "payment.succeeded" }

3DS Flow (RequiresCustomerAction) #

Router → connector returns RedirectResponse (ACS URL)
    → IntentStatus = RequiresCustomerAction
    → return next_action.redirect_to_url to browser

Browser → redirect to ACS (bank's 3DS page)
    → customer enters OTP
    → ACS posts result back to return_url

Merchant → POST /payments/{id}/confirm (with authentication result)
    → Router resumes from RequiresCustomerAction
    → proceeds to authorization

5. High-Level Design (~10-15 minutes) #

┌────────────────────────────────────────────────────────────┐
│                      API Gateway                           │
│  TLS termination · rate limiting · merchant auth · routing │
└──────────────────────────┬─────────────────────────────────┘
                           │
              ┌────────────▼────────────┐
              │     Router Service       │
              │  ┌────────────────────┐ │
              │  │  Payment State     │ │
              │  │  Machine           │ │
              │  │  (IntentStatus     │ │
              │  │   AttemptStatus)   │ │
              │  └────────┬───────────┘ │
              │           │             │
              │  ┌────────▼───────────┐ │
              │  │  Routing Engine    │ │
              │  │  Static (Euclid)   │ │
              │  │  Dynamic (gRPC)    │ │
              │  │  GSM fallback      │ │
              │  └────────┬───────────┘ │
              └───────────┼─────────────┘
                          │
         ┌────────────────▼──────────────────┐
         │       Connector Gateway            │
         │  ConnectorIntegration<T,Req,Resp>  │
         │  Stripe · Adyen · PayPal · ...     │
         └─────────────┬──────────┬───────────┘
                       │          │
                  Card Networks   Bank APIs
                  (ISO 8583)      (ISO 20022)
                       │
         ┌─────────────▼─────────────────────┐
         │       Storage Layer               │
         │  PostgreSQL: PaymentIntent,        │
         │    PaymentAttempt, ProcessTracker  │
         │  Redis: idempotency keys,          │
         │    session tokens, rate limits     │
         └─────────────┬─────────────────────┘
                       │
         ┌─────────────▼─────────────────────┐
         │       Scheduler Service           │
         │  Producer: polls ProcessTracker   │
         │  Consumer: Redis Streams          │
         │  Workflows: PaymentSync,          │
         │    WebhookRetry, RefundSync,      │
         │    RevenueRecovery                │
         └───────────────────────────────────┘

Key Components #

API Gateway: merchant authentication (API key → merchant_id lookup), rate limiting per merchant, request routing. It is also the PCI DSS network segmentation boundary: raw PANs are either tokenized here or never pass through the merchant’s servers at all, because the merchant uses Stripe.js / the Hyperswitch SDK, which POSTs card data directly to the gateway’s PCI-scoped endpoint.

Router Service: the core. Implements the Operation trait pipeline (ValidateRequest → GetTracker → Domain → UpdateTracker → CallConnector → PostUpdateTracker). All state transitions are database writes before and after the connector call, ensuring crash recovery.

Connector Gateway: each connector (Stripe, Adyen, PayPal) implements ConnectorIntegration<T, Req, Resp>. The trait provides build_request, handle_response, get_error_response. The connector layer translates between Hyperswitch’s internal domain types and each connector’s REST/SOAP API.

Storage: PostgreSQL is the source of truth. Redis caches idempotency key lookups (fast-path dedup) and stores ephemeral 3DS session tokens. ProcessTracker table is the durable queue for async work.

Scheduler Service: producer polls ProcessTracker for due tasks → pushes to Redis Streams → consumer dispatches to 12 workflow types. Retry schedule is a configurable RetryMapping per workflow type. No external workflow engine.

Vault / Tokenization: raw PANs are never stored in the application DB. Either a third-party vault (Basis Theory, Skyflow) or the gateway’s own HSM-backed vault stores PANs and issues vault tokens; network tokens, where used, are issued by the card networks and obtained through the vault. The gateway stores only the token reference.


6. Deep Dives (~10 minutes) #

Deep Dive 1: Payment State Machine #

The state machine has two levels (source: hyperswitch/crates/common_enums/src/enums.rs):

AttemptStatus (28 states, connector-granular):

pub enum AttemptStatus {
    Started, Pending, AuthenticationPending, AuthenticationSuccessful,
    Authorized, AuthorizationFailed, Charged, CaptureFailed,
    Unresolved,  // FRM manual review
    Failure, Expired,
    // ...
}

IntentStatus (16 states, merchant-facing):

pub enum IntentStatus {
    RequiresPaymentMethod, RequiresConfirmation,
    RequiresCustomerAction,   // 3DS redirect
    RequiresMerchantAction,   // fraud hold
    RequiresCapture,          // authorized, not yet captured
    Processing, Succeeded, Failed, Cancelled, Expired,
    // ...
}

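impl IntentStatus {
    // Terminal states: no further transition is expected for the intent.
    pub fn is_in_terminal_state(self) -> bool {
        matches!(
            self,
            Self::Succeeded | Self::Failed | Self::Cancelled | Self::CancelledPostCapture
                | Self::PartiallyCaptured | Self::Expired
        )
    }
}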

Transition logic on confirm (source: operations/payment_confirm.rs):

let (intent_status, attempt_status, error) =
    match (frm_suggestion, authentication.as_ref()) {
        (Some(FrmSuggestion::FrmCancelTransaction), _) =>
            (IntentStatus::Failed, AttemptStatus::Failure, Some(fraud_error)),
        (Some(FrmSuggestion::FrmManualReview), _) =>
            (IntentStatus::RequiresMerchantAction, AttemptStatus::Unresolved, None),
        (_, Some(auth)) if auth.is_separate_authn_required() =>
            (IntentStatus::RequiresCustomerAction, AttemptStatus::AuthenticationPending, None),
        _ =>
            (IntentStatus::Processing, AttemptStatus::Pending, None),
    };

Why two levels? AttemptStatus is connector-specific (a CaptureInitiated may stay pending for hours on some connectors). IntentStatus is the stable merchant-facing view — it only flips terminal when the attempt is definitively settled.

Deep Dive 2: Idempotency #

The problem: a merchant retries POST /payments/confirm after a timeout. The connector may have already charged the card. Without idempotency, the card is charged twice.

Hyperswitch’s approach (source: diesel_models/src/query/payment_intent.rs):

// "In an active-active setup, a lookup table should be implemented,
//  and the merchant reference ID will serve as the idempotency key"

Three-layer idempotency:

Layer 1 (API level): the Idempotency-Key header is looked up in Redis. If the key is found and its response is complete, return the cached response (TTL: 24h). If the key is found but the request is still in flight, wait and return the same result.

Layer 2 (state machine level): PaymentIntent transitions are guarded. Re-confirming a Processing intent returns the current state without re-calling the connector. Terminal states (Succeeded, Failed) are immutable.

Layer 3 (connector level): where the connector supports idempotent requests, it is called with the payment_id as its idempotency key (passed in the Idempotency-Key or X-Request-Id header, depending on the connector). If the connector already received the request and charged the card, the retry returns the original transaction ID.

Merchant retry
    → Redis lookup (Idempotency-Key) → HIT → return cached 200
    → MISS → DB lookup PaymentIntent state
        → status = Succeeded → return current state (no connector call)
        → status = Processing → return current state or wait for completion (no second connector call)
        → status = RequiresConfirmation → proceed (connector not yet called)
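The API-level layer can be sketched as a small key store with three outcomes: proceed, replay, or in progress. The version below is an in-memory stand-in for Redis; the type and method names (`IdempotencyStore`, `begin`, `complete`) are hypothetical, and a production version would need an atomic set-if-absent on the shared store rather than a local `HashMap`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
enum Entry {
    InFlight,
    Completed(String), // cached response body
}

#[derive(Debug, PartialEq)]
enum Begin {
    Proceed,        // first time this key is seen: the caller runs the request
    Replay(String), // finished earlier: return the cached response
    InProgress,     // another request with this key is still running
}

struct IdempotencyStore {
    ttl: Duration,
    entries: HashMap<String, (Entry, Instant)>,
}

impl IdempotencyStore {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Register the key, or report what an earlier request with the same key did.
    fn begin(&mut self, key: &str, now: Instant) -> Begin {
        // Expired keys behave as if they had never been seen.
        let expired = matches!(
            self.entries.get(key),
            Some((_, stored_at)) if now.duration_since(*stored_at) >= self.ttl
        );
        if expired {
            self.entries.remove(key);
        }
        let existing = self.entries.get(key).map(|(entry, _)| entry.clone());
        match existing {
            Some(Entry::Completed(body)) => Begin::Replay(body),
            Some(Entry::InFlight) => Begin::InProgress,
            None => {
                self.entries.insert(key.to_string(), (Entry::InFlight, now));
                Begin::Proceed
            }
        }
    }

    /// Store the final response so later retries can replay it.
    fn complete(&mut self, key: &str, response: String, now: Instant) {
        self.entries.insert(key.to_string(), (Entry::Completed(response), now));
    }
}
```

A handler would call `begin` before touching the state machine and `complete` once the response is final; `InProgress` maps to the "wait and return same result" branch.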

Deep Dive 3: Routing Engine #

Hyperswitch supports three routing strategies (source: router/src/core/payments/routing.rs):

Static routing (Euclid DSL): rule-based connector selection. A merchant configures rules like “if currency=USD and amount>100, use Stripe; else use Adyen”. Rules are compiled by the Euclid decision engine into an evaluable form. No runtime latency beyond rule evaluation.

fn perform_static_routing_locally(...) -> RouterResult<Vec<RoutableConnectorChoice>>;
fn resolve_or_fallback_with_approach(...) -> RouterResult<ConnectorCallType>;

Dynamic routing (gRPC): a separate routing service scores connectors by historical success rate, latency, and cost. The router calls this service before each payment:

fn perform_hybrid_routing_if_enabled(...) -> RouterResult<...>;

GSM (Gateway Status Map): when a connector returns an error, the GSM maps the (connector, error_code, error_message) to a decision: retry, do_not_retry, step_up (upgrade to 3DS). Source: router/src/core/payments/retry.rs:

pub async fn do_gsm_actions(
    state: &SessionState,
    payment_data: &mut PaymentData<F>,
    ...) -> RouterResult<Option<api::ConnectorData>> {
    // Looks up GatewayStatusMap for this connector + error code
    // Returns: Some(new_connector) to retry on different connector
    //          None to decline
}

Fallback chain: static routing → primary connector → GSM check → retry on secondary connector → decline.
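The fallback chain can be sketched as a lookup followed by a step to the next connector in the routed list. This is a simplified illustration: the map is keyed on (connector, error_code) only, the names `GatewayStatusMap::decide` and `next_connector` are hypothetical, and unknown errors default to no retry as a conservative assumption.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GsmDecision { Retry, DoNotRetry, StepUp }

struct GatewayStatusMap {
    rules: HashMap<(String, String), GsmDecision>, // keyed by (connector, error_code)
}

impl GatewayStatusMap {
    fn new(entries: &[(&str, &str, GsmDecision)]) -> Self {
        let rules = entries
            .iter()
            .map(|(connector, code, decision)| ((connector.to_string(), code.to_string()), *decision))
            .collect();
        Self { rules }
    }

    /// Unknown (connector, error_code) pairs are treated as non-retryable.
    fn decide(&self, connector: &str, error_code: &str) -> GsmDecision {
        self.rules
            .get(&(connector.to_string(), error_code.to_string()))
            .copied()
            .unwrap_or(GsmDecision::DoNotRetry)
    }
}

/// Returns the next connector to try, or None to decline.
/// StepUp is not a connector switch, so it also returns None here.
fn next_connector<'a>(
    gsm: &GatewayStatusMap,
    ordered: &[&'a str],
    failed_index: usize,
    error_code: &str,
) -> Option<&'a str> {
    let failed = *ordered.get(failed_index)?;
    match gsm.decide(failed, error_code) {
        GsmDecision::Retry => ordered.get(failed_index + 1).copied(),
        GsmDecision::DoNotRetry | GsmDecision::StepUp => None,
    }
}
```

The routed list is the output of static or dynamic routing; the GSM only decides whether moving down that list is allowed for a given error.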

Deep Dive 4: Connector Abstraction #

Every payment connector implements (source: hyperswitch_interfaces/src/api.rs):

pub trait ConnectorIntegration<T, Req, Resp>: ConnectorCommon {
    fn get_url(&self, req: &RouterData<T, Req, Resp>, connectors: &Connectors) -> Result<String>;
    fn get_headers(&self, ...) -> Result<Vec<(String, Maskable<String>)>>;
    fn get_request_body(&self, ...) -> Result<RequestContent>;
    fn build_request(&self, ...) -> Result<Option<Request>>;
    fn handle_response(&self, ...) -> Result<RouterData<T, Req, Resp>>;
    fn get_error_response(&self, res: Response) -> Result<ErrorResponse>;
}

T is the flow type: Authorize, Capture, Void, Refund, PSync (payment sync), RSync (refund sync). This makes the connector’s implementation for each operation independently type-checked.

The ConnectorCommon base provides base_url(), get_auth_header(), build_error_response() — shared across all flows for that connector.

Adding a new connector requires implementing ConnectorIntegration for each supported flow — the compiler enforces completeness. No connector-specific logic leaks into the router.
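The per-flow type checking can be illustrated with a reduced version of the pattern. The sketch below collapses the three type parameters into a single flow marker and returns plain strings instead of Results; `ExampleConnector`, its URLs, and the `RouterData` fields are invented for illustration and are not Hyperswitch's definitions.

```rust
use std::marker::PhantomData;

// Flow marker types: zero-sized, used only at the type level.
struct Authorize;
struct Refund;

struct RouterData<Flow> {
    amount_minor: u64,
    currency: &'static str,
    connector_transaction_id: Option<String>,
    _flow: PhantomData<Flow>,
}

/// Simplified stand-in for the connector trait: one impl per supported flow.
trait ConnectorIntegration<Flow> {
    fn get_url(&self, req: &RouterData<Flow>) -> String;
    fn get_request_body(&self, req: &RouterData<Flow>) -> String;
}

struct ExampleConnector {
    base_url: String,
}

impl ConnectorIntegration<Authorize> for ExampleConnector {
    fn get_url(&self, _req: &RouterData<Authorize>) -> String {
        format!("{}/v1/authorizations", self.base_url)
    }
    fn get_request_body(&self, req: &RouterData<Authorize>) -> String {
        format!("amount={}&currency={}", req.amount_minor, req.currency)
    }
}

impl ConnectorIntegration<Refund> for ExampleConnector {
    fn get_url(&self, _req: &RouterData<Refund>) -> String {
        format!("{}/v1/refunds", self.base_url)
    }
    fn get_request_body(&self, req: &RouterData<Refund>) -> String {
        let txn = req.connector_transaction_id.as_deref().unwrap_or("");
        format!("transaction={}&amount={}", txn, req.amount_minor)
    }
}

/// Router-side code is generic over the flow and contains no connector-specific logic.
fn build_request<Flow, C: ConnectorIntegration<Flow>>(
    connector: &C,
    req: &RouterData<Flow>,
) -> (String, String) {
    (connector.get_url(req), connector.get_request_body(req))
}
```

If the `Refund` impl were deleted, any call to `build_request` with a `RouterData<Refund>` would fail to compile, which is the kind of enforcement the section describes.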

Deep Dive 5: Async Processing — ProcessTracker vs Durable Execution Engine #

Payment processing has two distinct classes of async work. They have very different requirements and the right tool differs for each.

Class A: Simple Retry Loops #

Webhook delivery, payment sync after redirect, refund status polling. These are: one operation, linear retry schedule, no branching, no compensation.

Hyperswitch handles these with ProcessTracker (source: diesel_models/src/process_tracker.rs):

pub struct ProcessTracker {
    pub id: String,
    pub runner: Option<String>,       // "PaymentsSyncWorkflow", "OutgoingWebhookRetryWorkflow"
    pub retry_count: i32,
    pub schedule_time: Option<PrimitiveDateTime>,
    pub tracking_data: serde_json::Value,  // payment_id, merchant_id, etc.
    pub status: ProcessTrackerStatus,      // New → Pending → Processing → Finished | Errored
}

Retry schedule (source: scheduler/src/consumer/types/process_data.rs):

pub struct RetryMapping {
    pub start_after: i32,               // seconds before first attempt
    pub frequencies: Vec<(i32, i32)>,   // (interval_seconds, attempt_count)
}

// Webhook retry: 16 attempts over ~4 days
OutgoingWebhookRetryProcessTrackerMapping {
    start_after: 60,
    frequencies: vec![
        (300, 3),    // 3× every 5 min
        (600, 2),    // 2× every 10 min
        (1800, 3),   // 3× every 30 min
        (3600, 4),   // 4× every 1 hour
        (86400, 4),  // 4× every 1 day
    ]
}
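To make the schedule concrete, the sketch below walks the `frequencies` table to find the delay before a given retry. It is one plausible reading of the mapping, not a copy of the scheduler's code: `retry_count` 0 is assumed to use `start_after`, and `next_delay_secs` and `total_schedule_secs` are hypothetical helper names.

```rust
struct RetryMapping {
    start_after: i32,             // seconds before the first attempt
    frequencies: Vec<(i32, i32)>, // (interval_seconds, attempt_count)
}

/// Delay in seconds before the attempt numbered `retry_count`; None once the schedule is exhausted.
fn next_delay_secs(mapping: &RetryMapping, retry_count: i32) -> Option<i32> {
    if retry_count <= 0 {
        return Some(mapping.start_after);
    }
    let mut remaining = retry_count;
    for &(interval, count) in &mapping.frequencies {
        if remaining <= count {
            return Some(interval);
        }
        remaining -= count;
    }
    None
}

/// Total time covered by the schedule, including the initial delay.
fn total_schedule_secs(mapping: &RetryMapping) -> i64 {
    let retries: i64 = mapping
        .frequencies
        .iter()
        .map(|&(interval, count)| i64::from(interval) * i64::from(count))
        .sum();
    i64::from(mapping.start_after) + retries
}
```

For the webhook mapping above, the total comes to 367,560 seconds, about 4.25 days, which matches the "~4 days" in the comment.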

The flow: producer polls ProcessTracker table → batches into Redis Streams → consumer dispatches to one of 12 ProcessTrackerWorkflow implementations. Jitter (0–600s random) prevents thundering herd on restarts. For Class A work, ProcessTracker is the right choice — minimal operational overhead, PostgreSQL provides durability, the retry logic is a table lookup.

Where ProcessTracker breaks down: the RevenueRecoveryWorkflow in Hyperswitch is already 43KB of Rust. It manually implements what a workflow engine provides natively: conditional branching, sub-workflows, timers, signals. This is the tell — when your “simple” workflow grows to thousands of lines, you are building a workflow engine by hand.

Class B: Multi-Step, Long-Running Flows #

These exist in every mature payment system:

Flow | Duration | Why it’s hard
Dispute / chargeback lifecycle | 30–90 days | Multi-party (merchant, acquirer, card network), human review, evidence submission deadlines
Dunning / revenue recovery | Days–weeks | Retry failed subscription charge, send email D+1, retry D+3, cancel D+14 if unpaid
3DS with long timeout | Minutes | Customer may take minutes to complete OTP; gateway must hold state, not poll
Settlement reconciliation | Hours | Wait for end-of-day file from bank, match against authorized transactions, flag mismatches
Fraud manual review | Hours | RequiresMerchantAction state: a human approves or rejects, then the flow resumes

For Class B, a durable execution engine (Temporal, Restate, AWS Step Functions) is the right substrate. What it provides that ProcessTracker cannot:

1. Durable execution via event sourcing

A workflow’s entire execution history is persisted as an append-only event log. If the worker crashes mid-execution (after step 3 of 8), it replays from the log and resumes at step 4. No application code needed to handle partial completion.

// Temporal: dispute workflow pseudocode
async fn dispute_workflow(dispute_id: &str) {
    let evidence = collect_evidence_activity(dispute_id).await;  // merchant uploads
    // If worker crashes here, replays from log — collect_evidence not re-run
    let deadline = wait_for_timer(Duration::days(7)).await;      // Temporal timer
    let result = submit_to_card_network_activity(evidence).await;
    match result {
        Won  => release_hold_activity(dispute_id).await,
        Lost => debit_merchant_activity(dispute_id).await,
    }
}
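The replay idea can be reduced to a toy model: a journal of completed step results, and a context that returns journaled results instead of re-running steps. This is a deliberately simplified in-memory sketch with invented names (`Journal`, `WorkflowContext`, `activity`); a real engine persists the journal and also handles timers, retries, and non-determinism checks.

```rust
/// Results of completed steps, in execution order. A real engine persists this log.
struct Journal {
    results: Vec<String>,
}

struct WorkflowContext<'a> {
    journal: &'a mut Journal,
    cursor: usize,               // index of the next step in the journal
    executed: Vec<&'static str>, // steps actually run in this process (for inspection)
}

impl<'a> WorkflowContext<'a> {
    fn new(journal: &'a mut Journal) -> Self {
        Self { journal, cursor: 0, executed: Vec::new() }
    }

    /// Return the journaled result if this step already completed; otherwise run it and record it.
    fn activity<F: FnOnce() -> String>(&mut self, name: &'static str, step: F) -> String {
        let recorded = self.journal.results.get(self.cursor).cloned();
        let result = match recorded {
            Some(previous) => previous, // replay: the side effect is not repeated
            None => {
                let fresh = step();
                self.journal.results.push(fresh.clone());
                self.executed.push(name);
                fresh
            }
        };
        self.cursor += 1;
        result
    }
}

/// Workflow code is written as if it ran once from top to bottom.
fn dispute_workflow(ctx: &mut WorkflowContext<'_>, dispute_id: &str) -> String {
    let evidence = ctx.activity("collect_evidence", || format!("evidence-for-{dispute_id}"));
    let submission = ctx.activity("submit_to_card_network", || format!("submitted:{evidence}"));
    ctx.activity("close_dispute", || format!("closed:{submission}"))
}
```

Starting `dispute_workflow` with a journal that already holds the first result models a crash after step 1: only the remaining two steps execute, and the final value is the same as in an uninterrupted run.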

2. Timers that survive restarts

wait_for_timer(Duration::days(30)) persists the timer to the Temporal server’s durable log. The worker process can restart 100 times; the timer fires correctly. With ProcessTracker, you’d poll the DB every hour checking if the deadline passed — 720 unnecessary DB reads per dispute.

3. Signals: external events into a running workflow

// Temporal signal: card network sends webhook → signals running dispute workflow
temporal_client.signal_workflow(
    dispute_workflow_id,
    "card_network_decision",
    payload  // Won | Lost | NeedsMoreEvidence
)

The running dispute_workflow is waiting on wait_for_signal("card_network_decision"). The signal unblocks it. With ProcessTracker, you’d need to update a DB row, detect the change in the next poll cycle (up to loop_interval delay), and rebuild all execution context from scratch.

4. Saga compensation

If a multi-step payment flow fails midway, undo the completed steps:

// Saga: reserve inventory → authorize card → create order
// If create_order fails → void_authorization → release_inventory
async fn checkout_workflow(cart_id: &str) {
    let reservation = reserve_inventory_activity(cart_id).await;
    let auth = authorize_card_activity(cart_id).await;
    match create_order_activity(cart_id).await {
        Ok(order) => confirm_activities(reservation, auth, order).await,
        Err(e) => {
            // Compensation: both activities run even if worker crashes here
            void_authorization_activity(auth.id).await;
            release_inventory_activity(reservation.id).await;
        }
    }
}

ProcessTracker has no compensation concept — you’d write the undo logic manually and hope the DB row tracking completion is accurate.
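The compensation ordering itself is simple to express: remember which steps completed and undo them in reverse when a later step fails. The sketch below shows only that ordering, in memory and without durability; `SagaStep`, `run_saga`, and the stub activities are hypothetical, and the crash-safety of the undo steps is exactly what a durable engine would add on top.

```rust
type Log = Vec<String>;

struct SagaStep {
    name: &'static str,
    action: fn(&mut Log) -> Result<(), String>,
    compensation: fn(&mut Log),
}

/// Run steps in order; on the first failure, compensate completed steps in reverse order.
fn run_saga(steps: &[SagaStep], log: &mut Log) -> Result<(), String> {
    let mut completed: Vec<&SagaStep> = Vec::new();
    for step in steps {
        match (step.action)(log) {
            Ok(()) => completed.push(step),
            Err(e) => {
                for done in completed.iter().rev() {
                    (done.compensation)(log);
                }
                return Err(format!("{} failed: {}", step.name, e));
            }
        }
    }
    Ok(())
}

// Stub activities that only record what they did.
fn reserve_inventory(log: &mut Log) -> Result<(), String> { log.push("reserve_inventory".into()); Ok(()) }
fn release_inventory(log: &mut Log) { log.push("release_inventory".into()); }
fn authorize_card(log: &mut Log) -> Result<(), String> { log.push("authorize_card".into()); Ok(()) }
fn void_authorization(log: &mut Log) { log.push("void_authorization".into()); }
fn create_order(log: &mut Log) -> Result<(), String> {
    log.push("create_order".into());
    Err("order service unavailable".into())
}
fn nothing_to_undo(_log: &mut Log) {}

fn checkout_steps() -> Vec<SagaStep> {
    vec![
        SagaStep { name: "reserve_inventory", action: reserve_inventory, compensation: release_inventory },
        SagaStep { name: "authorize_card", action: authorize_card, compensation: void_authorization },
        SagaStep { name: "create_order", action: create_order, compensation: nothing_to_undo },
    ]
}
```

Running `run_saga(&checkout_steps(), &mut log)` with the failing `create_order` stub leaves the log showing `void_authorization` followed by `release_inventory`, the reverse of the order in which the steps completed.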

Design Decision: Which to Use #

Is the flow a single operation with linear retry?
    YES → ProcessTracker (webhook delivery, refund sync, payment sync)

Does the flow span multiple steps, involve waits measured in hours/days,
require human signals, or need compensation?
    YES → Durable execution engine (Temporal / Restate)

For a payment gateway MVP: start with ProcessTracker only. Add a durable execution engine when the first Class B flow (dispute lifecycle, dunning) becomes complex enough that the workflow code exceeds ~500 lines.

Durable Execution Engine Trade-offs #

Temporal has established the dominant conceptual model for durable execution — workflows, activities, signals, timers, child workflows. This vocabulary is now how the industry talks about the problem. But the SDK APIs are not a de facto standard the way Stripe’s REST API is: code written for Temporal does not run on Restate, Step Functions, or Conductor. Each engine has incompatible wire protocols and SDK surfaces.

The landscape, mapped to the payment use cases above:

Engine | Model | Payment fit | Operational cost
Temporal | Workflow-as-code (Go/Java/Python/TS SDK) | Best for dispute lifecycle, dunning, reconciliation. Signals map cleanly to card network webhooks | Separate Temporal cluster (server + DB). Highest operational cost
Restate | Durable handlers run in the service process via SDK (Rust/JVM/TS), coordinated by a separate Restate server | Same conceptual model as Temporal, with a Rust SDK, so a natural fit for Hyperswitch | Low: a single server binary deployed next to the router service rather than a multi-component cluster
AWS Step Functions | State machine DSL (JSON/YAML). States, transitions, wait states | Good for settlement reconciliation (well-defined DAG). Poor for dynamic flows (dispute with conditional branching) | Managed, zero ops. Vendor lock-in
Azure Durable Functions | Orchestrator + activity functions (C#/Python/JS) | Similar model to Temporal. Not idiomatic for Rust | Managed on Azure
Conductor (Netflix) | JSON workflow definitions + microservice tasks | Fits large teams with separate workflow-ops ownership. Complex to self-host | High
Inngest / Trigger.dev | Event-driven durable functions, serverless-first | Good for webhook-triggered flows (dispute opened → start workflow). Limited timer precision | Low, managed

Temporal is to durable execution what SQL was to databases in 1985: the dominant conceptual model (tables/joins/transactions → workflows/activities/signals), but vendor dialects are incompatible in practice. ANSI SQL took years to emerge; no equivalent standard exists yet for durable execution APIs.

Practical guidance for this design:

Reference Temporal’s programming model as the vocabulary — disputes, dunning, and reconciliation flows are described as workflows with activities and signals regardless of which engine runs them. The implementation choice is operational:

  • Hyperswitch (Rust, self-hosted) → Restate: handlers live in the service process through the Rust SDK, and the engine is a single Restate server binary rather than a multi-component cluster
  • Cloud-native / managed → AWS Step Functions for DAG-shaped flows, Temporal Cloud for code-centric flows
  • ProcessTracker stays for Class A work (webhook retries, payment sync) regardless — adding a durable execution engine for Class B does not replace it

The boundary rule: if describing a flow requires drawing a sequence diagram with conditional branches, human wait states, or compensation arrows — it belongs in the durable execution engine. If it is a retry loop with a schedule table — ProcessTracker is sufficient.

What If: Payment Flows Modeled Using the Temporal Programming Model #

The following models the Class B payment flows using Temporal’s vocabulary. The pseudocode is language-agnostic — the same concepts apply to Restate, Step Functions, or any durable execution engine.

Core primitives used:

  • activity(name, input) — a durable, retriable unit of work. If the worker crashes mid-execution, the engine replays from the event log and skips completed activities
  • sleep(duration) — a durable timer persisted on the engine server; the worker process can restart freely
  • wait_for_signal(name) — blocks the workflow until an external event arrives (webhook, API call, UI action)
  • race(branches...) — a selector that unblocks on whichever branch completes first (timer vs signal)
  • child_workflow(name, input) — spawns a composable sub-workflow, independently observable and retriable
  • compensate(...) — runs undo activities if a later step fails (saga pattern)
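The race(signal, timer) primitive has a familiar in-process analogue: waiting on a channel with a timeout. The sketch below uses the standard library's `Receiver::recv_timeout` to show the control flow only; `race_signal_or_timer` and `RaceOutcome` are illustrative names, and unlike a durable engine nothing here survives a process restart.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum RaceOutcome<T> {
    Signal(T), // the external event arrived first
    TimedOut,  // the timer fired first
}

/// Block until a signal arrives or the timeout elapses, whichever happens first.
fn race_signal_or_timer<T>(signals: &Receiver<T>, timeout: Duration) -> RaceOutcome<T> {
    match signals.recv_timeout(timeout) {
        Ok(payload) => RaceOutcome::Signal(payload),
        // A timeout and a dropped sender both mean no signal was observed in time.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => RaceOutcome::TimedOut,
    }
}
```

In the flows that follow, the signal branch corresponds to an API call or webhook and the timer branch to the deadline; the difference in a durable engine is that both the pending wait and the deadline are persisted on the engine side.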

Flow 1: Dispute / Chargeback Lifecycle (30–90 days)

workflow DisputeWorkflow(dispute_id):

    dispute = activity(fetch_dispute, dispute_id)
    activity(freeze_merchant_funds, dispute)
    activity(notify_merchant, dispute)          -- merchant has 7 days to submit evidence

    evidence = race(
        on signal("evidence_submitted") => receive evidence,
        on sleep(7 days)               => evidence = none   -- deadline passed, proceed without
    )

    submission = activity(submit_evidence_to_network, dispute, evidence)

    decision = race(
        on signal("network_decision") => receive decision,
        on sleep(75 days)             => decision = { outcome: timed_out }
    )

    if decision.outcome == won:
        activity(release_merchant_funds, dispute)

    else if decision.outcome == lost:
        activity(debit_merchant_account, dispute)

    else if decision.outcome == needs_representment:
        child_workflow(RepresentmentWorkflow, dispute_id)   -- second chargeback round

    activity(close_dispute, dispute, decision)

The evidence_submitted signal fires when the merchant POSTs evidence via the API. The network_decision signal fires when the card network webhook arrives. The workflow sleeps between steps — no DB polling, no context reconstruction on each wake.


Flow 2: Dunning / Revenue Recovery (21 days, 6 steps: 4 charge retries and 2 emails)

workflow DunningWorkflow(subscription_id):

    retry_schedule = [
        (wait: 1 day,  action: retry),
        (wait: 1 day,  action: send_email "payment_failed"),
        (wait: 2 days, action: retry),
        (wait: 3 days, action: send_email "update_payment_method"),
        (wait: 7 days, action: retry),
        (wait: 7 days, action: retry),           -- D+21, final attempt
    ]

    for step in retry_schedule:

        sleep(step.wait)

        sub = activity(fetch_subscription, subscription_id)
        if sub.status == cancelled: return       -- customer cancelled while we slept

        if step.action == retry:
            result = activity(charge_subscription, subscription_id, sub.payment_method_id)
            if result.success:
                activity(send_receipt, sub)
                return                           -- recovered

            if sub.backup_payment_method_id exists:
                result = activity(charge_subscription, subscription_id, sub.backup_payment_method_id)
                if result.success: return

        else if step.action == send_email:
            activity(send_email, sub, step.template)

    -- all retries exhausted
    compensate:
        activity(cancel_subscription, subscription_id)
        activity(send_cancellation_email, subscription_id)

Compare to Hyperswitch’s RevenueRecoveryWorkflow: 43KB of Rust that manually reconstructs execution context on every invocation. Here the retry schedule is visible inline as data, and the “if customer cancelled while we slept” check is a single line rather than a DB state machine.
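Treating the schedule as data also makes it easy to check. The sketch below encodes the same six steps and computes the day on which each one runs; the `Step` and `Action` types and the `timeline` helper are illustrative names, not part of any engine's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Retry,
    SendEmail(&'static str), // email template name
}

struct Step {
    wait_days: u32,
    action: Action,
}

fn dunning_schedule() -> Vec<Step> {
    vec![
        Step { wait_days: 1, action: Action::Retry },
        Step { wait_days: 1, action: Action::SendEmail("payment_failed") },
        Step { wait_days: 2, action: Action::Retry },
        Step { wait_days: 3, action: Action::SendEmail("update_payment_method") },
        Step { wait_days: 7, action: Action::Retry },
        Step { wait_days: 7, action: Action::Retry },
    ]
}

/// Day offset (from the first failed charge) at which each step runs.
fn timeline(schedule: &[Step]) -> Vec<(u32, Action)> {
    let mut day = 0;
    schedule
        .iter()
        .map(|step| {
            day += step.wait_days;
            (day, step.action)
        })
        .collect()
}
```

For this schedule the steps land on days 1, 2, 4, 7, 14, and 21, with four charge retries and two emails.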


Flow 3: 3DS Authentication with Customer Timeout (15 minutes)

workflow ThreeDSWorkflow(payment_id):

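    -- Step 1: collect browser fingerprint (JS has up to 30s to POST)
    activity(initiate_device_data_collection, payment_id)
    race(
        on signal("device_data_collected") => continue,   -- hypothetical signal name
        on sleep(30 seconds)               => continue    -- proceed without fingerprint
    )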

    -- Step 2: redirect customer to bank's ACS challenge page
    challenge = activity(initiate_challenge, payment_id)

    -- Step 3: wait for customer to complete OTP
    auth_result = race(
        on signal("challenge_completed") => receive auth_result,
        on sleep(15 minutes)             => auth_result = { status: timed_out }
    )

    if auth_result.status != success:
        reason = "3DS_TIMEOUT" if auth_result.status == timed_out else "3DS_FAILED"
        activity(fail_payment, payment_id, reason: reason)
        return

    -- Step 4: resume authorization with 3DS data attached
    activity(authorize_with_3ds, payment_id, auth_result)

The challenge_completed signal fires when the bank’s ACS posts back to the gateway’s return URL. The gateway’s inbound webhook handler issues signal_workflow(payment_id, "challenge_completed", result). The IntentStatus in PostgreSQL stays RequiresCustomerAction while the workflow waits — the durable engine holds the timer, not a polling loop.


Flow 4: End-of-Day Settlement Reconciliation (nightly per acquirer)

workflow SettlementReconciliationWorkflow(acquirer_id, settlement_date):

    -- acquirer sends ISO 20022 settlement file between 00:00 and 06:00
    settlement_file = race(
        on signal("settlement_file_received") => receive file,
        on sleep(6 hours)                     => file = none
    )

    if settlement_file == none:
        activity(alert_missing_settlement, acquirer_id, settlement_date)
        return error("settlement file not received within window")

    mismatches = activity(parse_and_match_settlement, acquirer_id, settlement_file)

    -- fan out: each mismatch is an independently retriable child workflow
    for mismatch in mismatches:
        child_workflow(MismatchInvestigationWorkflow, mismatch)
            -- may involve sending dispute to acquirer, waiting days for response

    activity(mark_settlement_complete, acquirer_id, settlement_date)

Each MismatchInvestigationWorkflow child is independently observable (query its current state), independently retriable if it fails, and may itself contain multi-day waits for acquirer responses. The fan-out runs in parallel as long as the parent starts every child before awaiting any of them; the engine then runs the children concurrently.


Flow 5: Fraud Manual Review (human-in-the-loop)

workflow FraudReviewWorkflow(payment_id):

    review_case = activity(create_fraud_review_case, payment_id)
    activity(notify_fraud_team, review_case)    -- Slack + email + internal ticket

    -- analyst must decide within 4 hours; auto-decline on timeout (conservative default)
    decision = race(
        on signal("analyst_decision") => receive decision,
        on sleep(4 hours)             => decision = { action: decline, reason: "review_timeout" }
    )

    if decision.action == approve:
        activity(approve_and_authorize, payment_id)

    else if decision.action == decline:
        activity(decline_payment, payment_id, decision.reason)
        activity(notify_merchant_decline, payment_id)

    else if decision.action == request_more_info:
        activity(request_merchant_kyc, payment_id, decision.required_docs)
        child_workflow(KYCSubmissionWorkflow, payment_id)
            -- waits for merchant to upload documents, with its own deadline

    activity(close_fraud_review_case, review_case.id, decision)

The analyst clicks Approve/Decline in the internal fraud review UI. The UI calls signal_workflow(payment_id, "analyst_decision", decision). The IntentStatus stays RequiresMerchantAction in PostgreSQL — the source of truth for the merchant-facing API — while the workflow engine holds the sleep.


Summary: ProcessTracker vs Durable Workflow by flow

Flow | ProcessTracker | Durable Workflow
Webhook retry (16 attempts / 4 days) | RetryMapping config, 8 lines | Overkill
Payment sync after redirect | Single polling loop | Overkill
3DS timeout (15 min) | Polls PaymentIntent status every N seconds | race(signal, sleep(15min)), no polling
Dunning (6 steps / 21 days) | 43KB RevenueRecoveryWorkflow in Hyperswitch | ~40-line sequential workflow
Dispute lifecycle (75 days) | Not modeled, too complex | Native: signals + child workflows
Settlement reconciliation | Not modeled | Fan-out child workflows per mismatch
Fraud manual review | Not modeled | Signal from review UI unblocks workflow

Standards Reference #

Standard / Source | Layer | Role in this design
ISO 8583 | Transport | Wire format between acquirer and card network (Visa/Mastercard). Binary, field-coded. Implemented inside card network connector adapters
ISO 20022 | Transport | Bank-to-bank message format (SEPA, Fedwire, SWIFT). Used for bank transfer connectors
NACHA / ACH | Transport | US ACH file format for batch bank transfers. Fixed-width record format
Moov | Implementation | Open-source Go implementations of NACHA ACH, Fedwire, ISO 8583 framing. Transport substrate reference for bank rail connectors
EMVCo 3DS2 | Auth | 3D Secure authentication spec. Drives RequiresCustomerAction → redirect → RequiresConfirmation state transitions
Berlin Group NextGenPSD2 | API | Open banking REST API spec for European bank account access (PSD2 compliance). Account-to-account payment initiation
Open Banking UK | API | UK’s open banking standard. REST + OAuth2. Basis for UK bank connector implementations
FAPI 2.0 | Security | Financial-grade API security profile (OAuth2 + mTLS or DPoP). Merchant → gateway authentication
PCI DSS Level 1 | Compliance | Requires stored PANs to be rendered unreadable and, in this design, kept out of the application DB (HSM/vault). Also requires network segmentation, audit logs, quarterly vulnerability scans, and at least annual penetration tests
Stripe OpenAPI | API contract | De facto REST API standard for payment gateways. Hyperswitch mirrors this spec
Hyperswitch source | Implementation | Open-source Rust payment switch (~/{code}/hyperswitch). All deep-dive code references above