Distributed Coordination Issues, Incidents, and Mitigation Strategies #
Deep Dive Series #
For a ground-up treatment of distributed coordination — covering the mechanism taxonomy (Selection, Agreement, Assignment, Suppression, Aggregation, Convergence), infrastructure-level protocols (Raft, etcd, Cassandra gossip, Kafka), and application-level patterns (seat reservation, idempotency, rate limiting, CRDTs) — see the series:
Distributed Coordination: The Hidden Component
Leader Election & Distributed Locks #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Split-brain / dual leader | S: Distributed job scheduler cluster T: Ensure exactly one node triggers each scheduled job A: Network partition caused both partitions to elect a leader; same job ran twice R: Duplicate emails sent to 2 million users | - Fixed-timeout failure detection - No fencing tokens - Redis SETNX as correctness lock - Async replication without quorum writes | Raft leader election + etcd lease + fencing token Protected resource rejects writes with stale token | - GitHub dual-primary MySQL incident (2012) - Redis Sentinel split-brain during network partition |
| Stale lock holder writes after TTL expiry | S: Payment processor holding a distributed lock during transaction T: Ensure only one processor commits a transaction A: GC pause exceeded TTL; lock expired; new holder acquired; original resumed and double-committed R: Double charge to customer | - GC pauses exceeding lock TTL - No fencing token - Using Redis SETNX for correctness | Fencing token (monotonic revision) Storage layer rejects writes with token ≤ last seen | - Kleppmann (2016) — “How to do distributed locking” |
| Lock not released on crash | S: Inventory management service holding a lock T: Release held inventory on service failure A: Service crashed without releasing Redis lock; no TTL set R: Inventory permanently unavailable until manual intervention | - No TTL on lock key - Missing crash-recovery cleanup - Pessimistic lock without expiry | TTL-bounded locks (always set EX) Heartbeat extension for long operations | - Common Redis lock misuse pattern in production |
| Election storms on network partition | S: 5-node etcd cluster during datacenter network flap T: Maintain stable leadership during transient failures A: Election timeout too short; every network jitter triggered election; no stable leader R: 30 seconds of unavailability during flap | - Election timeout < network jitter - Randomization window too narrow - Heartbeat interval too long relative to timeout | Randomized election timeouts (150–300ms) Heartbeat interval = timeout/3 | - etcd production tuning guides |
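The fencing-token pattern from the first two rows can be sketched as a toy storage layer that tracks the highest token it has seen, with the token standing in for a monotonic revision from the lock service (e.g. an etcd lease revision). All names here are illustrative:

```python
class FencedStore:
    """Toy protected resource: rejects any write whose fencing token is not
    strictly greater than the highest token already accepted."""

    def __init__(self):
        self.last_token = 0   # highest fencing token accepted so far
        self.data = {}

    def write(self, key, value, token):
        # A holder paused past its lock TTL resumes with a stale token and
        # is rejected here, even though it still believes it holds the lock.
        if token <= self.last_token:
            raise PermissionError(
                f"stale fencing token {token} (last seen {self.last_token})")
        self.last_token = token
        self.data[key] = value

store = FencedStore()
store.write("balance", 100, token=33)     # original lock holder
store.write("balance", 90, token=34)      # new holder after TTL expiry
try:
    store.write("balance", 80, token=33)  # old holder resumes after a GC pause
except PermissionError:
    pass                                  # double-commit prevented
assert store.data["balance"] == 90
```

The essential point is that the check lives in the protected resource, not the lock service: the lock alone cannot stop a paused-then-resumed holder.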
Distributed Transactions & Saga #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| 2PC coordinator crash leaves in-doubt transactions | S: Cross-database financial transaction T: Atomically debit account A, credit account B A: Coordinator crashed after sending PREPARE but before COMMIT; both participants held locks for 8 minutes R: Account transfers blocked until coordinator recovered | - Single-node coordinator - No coordinator replication - Long-held locks | Raft-replicated coordinator (Spanner model) Or: use Saga instead of 2PC across service boundaries | - Oracle RAC distributed transaction hangs - PostgreSQL in-doubt transaction incidents |
| Saga partial failure without compensation | S: Hotel + flight booking saga T: Book both hotel and flight atomically A: Flight booking succeeded; hotel booking failed; no compensation triggered R: Customer charged for flight with no hotel | - Missing compensation actions - No saga orchestrator - Fire-and-forget event choreography | Saga with explicit compensation + idempotent steps Orchestrator retries compensation until confirmed | - Microservice booking platform incidents |
| Non-idempotent saga compensation | S: E-commerce order cancellation T: Refund payment when order cancelled A: Refund service crashed after processing refund; orchestrator retried; double refund issued R: Revenue loss from duplicate refunds | - Compensation not idempotent - No deduplication on refund API - Missing idempotency key on compensation | Idempotency key on every saga step and compensation | - Payment platform double-refund incidents |
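The idempotent-compensation pattern from the last two rows can be sketched as a toy refund service; the in-memory store and all names are illustrative, and in production the (key, result) record would be persisted atomically with the refund itself:

```python
import uuid

class RefundService:
    """Toy compensation endpoint that deduplicates by idempotency key."""

    def __init__(self):
        self.processed = {}   # idempotency_key -> result

    def refund(self, idempotency_key, order_id, amount):
        # A retried compensation with the same key returns the cached
        # result instead of issuing a second refund.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        result = {"order": order_id, "refunded": amount}
        self.processed[idempotency_key] = result
        return result

service = RefundService()
key = str(uuid.uuid4())                      # generated once per saga step
first = service.refund(key, "order-42", 25)
retry = service.refund(key, "order-42", 25)  # orchestrator retry after a timeout
assert first == retry
assert len(service.processed) == 1           # exactly one refund issued
```

With this in place the orchestrator can safely retry compensation until confirmed, which is what makes the saga pattern recoverable.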
Membership & Failure Detection #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| GC pause triggers false failure detection | S: Cassandra cluster during JVM GC tuning T: Detect genuinely failed nodes quickly A: Fixed 10s timeout declared GC-paused node dead; rebalancing started; paused node resumed mid-rebalance R: Data inconsistency and rebalancing storm | - Fixed timeout failure detection - Timeout shorter than max GC pause - Aggressive decommission on suspicion | Phi accrual failure detector Adaptive to observed heartbeat intervals | - Cassandra GC-triggered rebalancing incidents |
| Seed node unavailability prevents cluster bootstrap | S: Cassandra cluster restart after datacenter maintenance T: Bring cluster back online after full restart A: Seed nodes restarted last; non-seed nodes couldn’t join; cluster offline for 45 minutes R: Extended downtime during maintenance window | - Seeds restarted in wrong order - Only one seed per datacenter - No automated seed node selection | Seeds first, then non-seeds At least one seed per AZ | - Cassandra operational runbook anti-patterns |
| Membership divergence during partition | S: Cassandra multi-datacenter cluster during WAN link failure T: Continue serving reads/writes in each datacenter A: Each DC continued operating with LOCAL_QUORUM; after partition heal, inconsistent data on overlapping writes R: Data divergence requiring manual repair | - LOCAL_QUORUM without understanding cross-DC consistency - No read repair monitoring - Missing anti-entropy repair schedule | NetworkTopologyStrategy + scheduled repair (nodetool repair) Monitor inconsistency metrics | - Multi-DC Cassandra consistency incidents |
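The phi accrual idea from the first row can be sketched under a simplifying assumption that heartbeat intervals are normally distributed; real implementations (Cassandra, Akka) refine this with windowing and a minimum deviation, and the threshold value here is illustrative:

```python
import math
import statistics

class PhiAccrualDetector:
    """Simplified phi accrual failure detector: suspicion is a continuous
    value derived from the observed heartbeat-interval distribution,
    rather than a fixed timeout."""

    def __init__(self, threshold=8.0):
        self.intervals = []
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-3)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat still arrives) under a normal model; phi = -log10(P)
        z = (elapsed - mean) / std
        p = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p, 1e-300))

    def suspect(self, now):
        return self.phi(now) > self.threshold

det = PhiAccrualDetector()
for t in [0.0, 1.0, 2.1, 2.9, 4.0, 5.2, 6.0, 7.1]:   # jittery ~1s heartbeats
    det.heartbeat(t)
assert det.phi(7.5) < 1.0    # shortly after a heartbeat: low suspicion
assert det.suspect(30.0)     # long silence: suspicion crosses the threshold
```

A GC-paused node's phi climbs gradually as silence grows beyond what history predicts, instead of a fixed 10s timer firing on one bad sample.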
Consistency & Quorum #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Stale reads after leader change | S: etcd-backed configuration store T: Read latest configuration after write A: Read served by follower before new leader’s writes propagated R: Service started with stale config after leader failover | - Follower reads without lease - Missing linearizable read requirement - etcd Serializable read used where Linearizable required | etcd Linearizable reads (default) Use WithSerializable() only for explicitly stale-OK reads | - Kubernetes controller stale config reads |
| Quorum write succeeds with dead replicas | S: Cassandra RF=3, ONE write consistency, node failure T: Accept writes even with one node down A: Write to ONE succeeded; node came back with stale data; subsequent QUORUM read returned stale value R: Stale data served after node recovery | - Write at ONE, read at QUORUM (W+R=1+2=3, not > N=3) - No read repair triggered - Hinted handoff replay not completed | W + R > N (QUORUM write + QUORUM read) Or: schedule repair after node recovery | - Cassandra consistency tuning incidents |
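The W + R > N condition from the second row can be verified by brute force over replica subsets: a read is guaranteed to see the latest write exactly when every possible read set intersects every possible write set.

```python
from itertools import combinations

def overlap_guaranteed(n, w, r):
    """True iff every size-r read set intersects every size-w write set
    among n replicas -- equivalent to the arithmetic condition W + R > N."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# The RF=3 settings from the incident above:
assert not overlap_guaranteed(3, 1, 2)  # ONE write + QUORUM read: stale reads possible
assert overlap_guaranteed(3, 2, 2)      # QUORUM write + QUORUM read: overlap guaranteed

# Sanity check the equivalence with W + R > N for small clusters:
for n in range(1, 6):
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            assert overlap_guaranteed(n, w, r) == (w + r > n)
```
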
Suppression — Idempotency & Deduplication #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Double charge on payment retry | S: Checkout service with retry logic T: Charge customer exactly once despite network failures A: Payment API timed out; client retried with new request; both requests processed R: Customer charged twice | - No idempotency key - Server not idempotent - Retry without deduplication | Client-generated idempotency key (UUID) Server stores (key, result) in same transaction as business operation | - Stripe, Braintree payment double-charge incidents |
| Outbox event published twice | S: Order service publishing OrderCreated events T: Publish event exactly once after order creation A: Outbox poller published event; crashed before marking published=true; re-published on restart R: Downstream inventory service decremented twice | - Non-idempotent event consumer - Outbox poller without atomic mark-as-published - Missing consumer-side dedup | Outbox pattern + idempotent consumer Consumer deduplicates by event ID | - Microservice event sourcing incidents |
| Bloom filter false positive blocks valid URL | S: Web crawler deduplication at scale T: Avoid re-crawling URLs already processed A: Bloom filter sized too small; false positive rate reached 10%; valid new URLs silently skipped R: Crawler missed 10% of new content over two weeks | - Bloom filter undersized for actual URL volume - No false positive rate monitoring - No fallback exact check for high-value URLs | Size Bloom filter for expected N × 1.5 Monitor false positive rate; exact fallback for critical paths | - Web crawler quality degradation |
| Idempotency key reused across different operations | S: Batch payment processor reusing request IDs T: Deduplicate payment retries within a batch A: Same idempotency key used for two different charges (key not scoped to amount + recipient) R: Second charge silently returned cached result of first; revenue lost | - Idempotency key not encoding operation semantics - Key scoped too broadly (user ID only, not user+amount+recipient) | Key = hash(user_id + operation_type + entity_id + amount) Scope key to the full intent, not just the caller | - Payment platform silent revenue loss incidents |
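Key scoping from the last row can be sketched by deriving the key from the full intent of the request; the field names and toy in-memory store are illustrative, and in production the (key, result) record lives in the same transaction as the charge itself:

```python
import hashlib
import json

def intent_key(user_id, operation, entity_id, amount):
    """Derive the idempotency key from the full intent of the request,
    so two different charges by the same caller can never collide."""
    payload = json.dumps([user_id, operation, entity_id, amount], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class PaymentAPI:
    """Toy endpoint: retries with a seen key get the cached result."""

    def __init__(self):
        self.results = {}   # idempotency_key -> result

    def charge(self, key, amount):
        if key in self.results:
            return self.results[key]   # retry: no second charge
        self.results[key] = {"charged": amount}
        return self.results[key]

api = PaymentAPI()
k1 = intent_key("u1", "charge", "invoice-7", 50)
k2 = intent_key("u1", "charge", "invoice-8", 80)
assert k1 != k2                        # different intent, different key
api.charge(k1, 50)
api.charge(k1, 50)                     # retried request deduplicated
api.charge(k2, 80)
assert len(api.results) == 2           # both distinct charges processed
```
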
Aggregation — Rate Limiting & Counting #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Rate limiter bypass via multiple nodes | S: API gateway rate limiting 1000 req/min per user T: Enforce limit across all gateway instances A: Each instance maintained a local counter; user sent 999 req to each of 5 instances; 4995 requests processed R: Downstream service overwhelmed; SLA breached | - Local-only counter with no cross-node sync - No centralized rate state - Per-instance limit instead of per-user global | Centralized Redis counter (INCR + EXPIRE) Or: local budget + periodic sync to global Redis counter | - API gateway incidents during DDoS-like traffic spikes |
| Token bucket allows burst after idle | S: Payment API with 100 req/sec token bucket T: Prevent abuse bursts on sensitive endpoints A: Bucket accumulated 6000 tokens during 60-second idle; user sent 6000 requests instantly R: Downstream fraud service overwhelmed by burst | - Max capacity too high relative to burst tolerance - Bucket capacity = rate × idle period without ceiling | Cap burst capacity separately from rate Set capacity = max_burst (e.g., 200), not rate × window | - Payment API burst incidents |
| Count-Min Sketch undersized causes over-rate-limiting | S: CDN edge rate limiting by IP using Count-Min Sketch T: Block abusive IPs at edge without per-IP state A: Sketch saturated with high-traffic legitimate IPs; hash collisions inflated counts for innocent IPs R: Legitimate users rate-limited; support tickets spiked | - Sketch width too small for traffic volume - No overflow monitoring - Using sketch where exact count required | Size sketch for expected distinct IPs × safety factor Use exact counter for allowlisted / high-value IPs | - CDN edge filtering false positive incidents |
| HyperLogLog merge produces incorrect cardinality | S: Analytics platform counting unique daily users across 10 shards T: Merge per-shard HLL sketches for global unique count A: Sketches used different hash seeds per shard; element-wise max merge invalid across different hash spaces R: Reported unique users 40% lower than actual | - Different hash seeds per HLL instance - Merging incompatible sketches | All HLL instances must use the same hash function and seed Redis PFMERGE handles this correctly | - Analytics platform cardinality undercount |
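The capped-burst fix from the token-bucket row can be sketched as follows; the rate and capacity numbers are illustrative, and time is passed in explicitly to keep the example deterministic:

```python
class TokenBucket:
    """Token bucket whose burst capacity is capped independently of the
    refill rate, so idle time cannot bank an unbounded burst."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # hard ceiling -- NOT rate x idle time
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        elapsed = now - self.last
        # Refill is clamped to capacity, regardless of how long we were idle.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# 100 req/sec sustained rate, but burst capped at 200 tokens:
bucket = TokenBucket(rate=100.0, capacity=200.0)
# After 60 idle seconds a client fires 6000 near-instant requests...
allowed = sum(bucket.allow(60.0 + i * 1e-6) for i in range(6000))
assert allowed == 200   # ...and only the capped burst gets through
```

Without the `min(...)` clamp, the 60-second idle period would bank 6000 tokens, reproducing the burst incident in the table.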
Convergence — CRDTs & Eventual Consistency #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| LWW conflict resolution silently drops writes | S: Multi-region Cassandra storing user profile updates T: Accept concurrent writes from two regions A: User updated email in EU region and phone in US region simultaneously; LWW kept only the higher-timestamp write; other field silently discarded R: User’s profile update lost; support ticket filed | - LWW applied at row level, not column level - Wall clock timestamps with inter-region drift - No conflict detection or notification | Column-level LWW with NTP-synchronized clocks Or: use conditional updates (CAS) for critical fields | - Cassandra multi-region data loss incidents |
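The −15 in the PN-Counter row below falls directly out of the merge rule. A minimal state-based sketch (illustrative, not production code):

```python
class PNCounter:
    """Minimal state-based PN-Counter: one increment vector and one
    decrement vector per replica; merge is element-wise max of each."""

    def __init__(self, replicas):
        self.p = {r: 0 for r in replicas}   # increments per replica
        self.n = {r: 0 for r in replicas}   # decrements per replica

    def incr(self, replica, amount=1):
        self.p[replica] += amount

    def decr(self, replica, amount=1):
        self.n[replica] += amount           # no lower bound is enforced

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        for r in self.p:
            self.p[r] = max(self.p[r], other.p[r])
            self.n[r] = max(self.n[r], other.n[r])

# Stock of 10 replicated to two regions, then a partition:
eu = PNCounter(["eu", "us"])
eu.incr("eu", 10)
us = PNCounter(["eu", "us"])
us.merge(eu)
eu.decr("eu", 12)   # each side sells independently during the partition...
us.decr("us", 13)
eu.merge(us)
assert eu.value() == -15   # ...and the merged count goes negative: oversell
```

The CRDT converges exactly as designed; the bug is using an unbounded counter for a quantity with an invariant (stock ≥ 0), which is why the canonical fix is reserved per-region capacity or strong consistency.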
| CRDT counter goes negative | S: Inventory system using PN-Counter CRDT T: Track available stock across regional replicas A: Both regions decremented independently below zero during partition; after merge, stock showed −15 R: Orders placed against negative inventory; oversell | - PN-Counter has no lower bound - No inventory constraint enforced at CRDT level - Partition allowed unbounded decrements | Reserve capacity in each region before decrement Or: use strong consistency (CP) for inventory, not CRDTs | - Multi-region inventory oversell incidents |
| Operational transform divergence on concurrent edits | S: Collaborative document editor T: Merge concurrent edits from two users into consistent document A: OT client applied remote operations without transforming against locally-buffered unacknowledged ops; document diverged between clients R: Different users saw different document state; data lost on save | - OT transform not applied to in-flight local ops - Client applied remote ops before server acknowledgment - Missing TP2 transform property | Buffer local unacknowledged ops; transform all incoming ops against buffer Server-ordered OT simplifies: server is total order authority | - Collaborative editor divergence incidents |
| Gossip convergence too slow during membership storm | S: Redis Cluster during rolling restart of 50 nodes T: Propagate slot ownership changes to all nodes quickly A: Gossip fanout too low; convergence took 3 minutes; stale slot routing caused MOVED errors to spike R: 3 minutes of elevated error rate during rolling restart | - Default gossip interval too long for cluster size - Fanout too low (even f=2 for 50 nodes converges in ~6 rounds ≈ 6s at a 1s interval; 3 min points to a longer interval or lower effective fanout) - No gossip acceleration during membership change | Increase gossip fanout during topology changes Monitor gossip convergence time as an SLI | - Redis Cluster MOVED error spike during maintenance |