Distributed Coordination Issues, Incidents, and Mitigation Strategies #
Deep Dive Series #
For a ground-up treatment of distributed coordination — covering the mechanism taxonomy (Selection, Agreement, Assignment, Suppression, Aggregation, Convergence), infrastructure-level protocols (Raft, etcd, Cassandra gossip, Kafka), and application-level patterns (seat reservation, idempotency, rate limiting, CRDTs) — see the series:
Distributed Coordination: The Hidden Component
Leader Election & Distributed Locks #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Split-brain / dual leader | S: Distributed job scheduler cluster T: Ensure exactly one node triggers each scheduled job A: Network partition caused both partitions to elect a leader; same job ran twice R: Duplicate emails sent to 2 million users | - Fixed-timeout failure detection - No fencing tokens - Redis SETNX as correctness lock - Async replication without quorum writes | Raft leader election + etcd lease + fencing token Protected resource rejects writes with stale token | - GitHub dual-primary MySQL incident (2012) - Redis Sentinel split-brain during network partition |
| Stale lock holder writes after TTL expiry | S: Payment processor holding a distributed lock during transaction T: Ensure only one processor commits a transaction A: GC pause exceeded TTL; lock expired; new holder acquired; original resumed and double-committed R: Double charge to customer | - GC pauses exceeding lock TTL - No fencing token - Using Redis SETNX for correctness | Fencing token (monotonic revision) Storage layer rejects writes with token ≤ last seen | - Kleppmann (2016) — “How to do distributed locking” |
| Lock not released on crash | S: Inventory management service holding a lock T: Release held inventory on service failure A: Service crashed without releasing Redis lock; no TTL set R: Inventory permanently unavailable until manual intervention | - No TTL on lock key - Missing crash-recovery cleanup - Pessimistic lock without expiry | TTL-bounded locks (always set EX) Heartbeat extension for long operations | - Common Redis lock misuse pattern in production |
| Election storms on network partition | S: 5-node etcd cluster during datacenter network flap T: Maintain stable leadership during transient failures A: Election timeout too short; every network jitter triggered election; no stable leader R: 30 seconds of unavailability during flap | - Election timeout < network jitter - Randomization window too narrow - Heartbeat interval too long relative to timeout | Randomized election timeouts (150–300ms) Heartbeat interval = timeout/3 | - etcd production tuning guides |
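The fencing-token pattern from the first two rows can be sketched as a toy storage layer that tracks the highest token it has seen, with the token standing in for a monotonic revision from the lock service (e.g. an etcd lease revision). All names here are illustrative:

```python
class FencedStore:
    """Toy protected resource: rejects any write whose fencing token is not
    strictly greater than the highest token already accepted."""

    def __init__(self):
        self.last_token = 0   # highest fencing token accepted so far
        self.data = {}

    def write(self, key, value, token):
        # A holder paused past its lock TTL resumes with a stale token and
        # is rejected here, even though it still believes it holds the lock.
        if token <= self.last_token:
            raise PermissionError(
                f"stale fencing token {token} (last seen {self.last_token})")
        self.last_token = token
        self.data[key] = value

store = FencedStore()
store.write("balance", 100, token=33)     # original lock holder
store.write("balance", 90, token=34)      # new holder after TTL expiry
try:
    store.write("balance", 80, token=33)  # old holder resumes after a GC pause
except PermissionError:
    pass                                  # double-commit prevented
assert store.data["balance"] == 90
```

The essential point is that the check lives in the protected resource, not the lock service: the lock alone cannot stop a paused-then-resumed holder.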
Distributed Transactions & Saga #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| 2PC coordinator crash leaves in-doubt transactions | S: Cross-database financial transaction T: Atomically debit account A, credit account B A: Coordinator crashed after sending PREPARE but before COMMIT; both participants held locks for 8 minutes R: Account transfers blocked until coordinator recovered | - Single-node coordinator - No coordinator replication - Long-held locks | Raft-replicated coordinator (Spanner model) Or: use Saga instead of 2PC across service boundaries | - Oracle RAC distributed transaction hangs - PostgreSQL in-doubt transaction incidents |
| Saga partial failure without compensation | S: Hotel + flight booking saga T: Book both hotel and flight atomically A: Flight booking succeeded; hotel booking failed; no compensation triggered R: Customer charged for flight with no hotel | - Missing compensation actions - No saga orchestrator - Fire-and-forget event choreography | Saga with explicit compensation + idempotent steps Orchestrator retries compensation until confirmed | - Microservice booking platform incidents |
| Non-idempotent saga compensation | S: E-commerce order cancellation T: Refund payment when order cancelled A: Refund service crashed after processing refund; orchestrator retried; double refund issued R: Revenue loss from duplicate refunds | - Compensation not idempotent - No deduplication on refund API - Missing idempotency key on compensation | Idempotency key on every saga step and compensation | - Payment platform double-refund incidents |
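The idempotent-compensation pattern from the last two rows can be sketched as a toy refund service; the in-memory store and all names are illustrative, and in production the (key, result) record would be persisted atomically with the refund itself:

```python
import uuid

class RefundService:
    """Toy compensation endpoint that deduplicates by idempotency key."""

    def __init__(self):
        self.processed = {}   # idempotency_key -> result

    def refund(self, idempotency_key, order_id, amount):
        # A retried compensation with the same key returns the cached
        # result instead of issuing a second refund.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        result = {"order": order_id, "refunded": amount}
        self.processed[idempotency_key] = result
        return result

service = RefundService()
key = str(uuid.uuid4())                      # generated once per saga step
first = service.refund(key, "order-42", 25)
retry = service.refund(key, "order-42", 25)  # orchestrator retry after a timeout
assert first == retry
assert len(service.processed) == 1           # exactly one refund issued
```

With this in place the orchestrator can safely retry compensation until confirmed, which is what makes the saga pattern recoverable.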
Membership & Failure Detection #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| GC pause triggers false failure detection | S: Cassandra cluster during JVM GC tuning T: Detect genuinely failed nodes quickly A: Fixed 10s timeout declared GC-paused node dead; rebalancing started; paused node resumed mid-rebalance R: Data inconsistency and rebalancing storm | - Fixed timeout failure detection - Timeout shorter than max GC pause - Aggressive decommission on suspicion | Phi accrual failure detector Adaptive to observed heartbeat intervals | - Cassandra GC-triggered rebalancing incidents |
| Seed node unavailability prevents cluster bootstrap | S: Cassandra cluster restart after datacenter maintenance T: Bring cluster back online after full restart A: Seed nodes restarted last; non-seed nodes couldn’t join; cluster offline for 45 minutes R: Extended downtime during maintenance window | - Seeds restarted in wrong order - Only one seed per datacenter - No automated seed node selection | Seeds first, then non-seeds At least one seed per AZ | - Cassandra operational runbook anti-patterns |
| Membership divergence during partition | S: Cassandra multi-datacenter cluster during WAN link failure T: Continue serving reads/writes in each datacenter A: Each DC continued operating with LOCAL_QUORUM; after partition heal, inconsistent data on overlapping writes R: Data divergence requiring manual repair | - LOCAL_QUORUM without understanding cross-DC consistency - No read repair monitoring - Missing anti-entropy repair schedule | NetworkTopologyStrategy + scheduled repair (nodetool repair) Monitor inconsistency metrics | - Multi-DC Cassandra consistency incidents |
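The phi accrual idea from the first row can be sketched under a simplifying assumption that heartbeat intervals are normally distributed; real implementations (Cassandra, Akka) refine this with windowing and a minimum deviation, and the threshold value here is illustrative:

```python
import math
import statistics

class PhiAccrualDetector:
    """Simplified phi accrual failure detector: suspicion is a continuous
    value derived from the observed heartbeat-interval distribution,
    rather than a fixed timeout."""

    def __init__(self, threshold=8.0):
        self.intervals = []
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-3)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat still arrives) under a normal model; phi = -log10(P)
        z = (elapsed - mean) / std
        p = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p, 1e-300))

    def suspect(self, now):
        return self.phi(now) > self.threshold

det = PhiAccrualDetector()
for t in [0.0, 1.0, 2.1, 2.9, 4.0, 5.2, 6.0, 7.1]:   # jittery ~1s heartbeats
    det.heartbeat(t)
assert det.phi(7.5) < 1.0    # shortly after a heartbeat: low suspicion
assert det.suspect(30.0)     # long silence: suspicion crosses the threshold
```

A GC-paused node's phi climbs gradually as silence grows beyond what history predicts, instead of a fixed 10s timer firing on one bad sample.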
Consistency & Quorum #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Stale reads after leader change | S: etcd-backed configuration store T: Read latest configuration after write A: Read served by follower before new leader’s writes propagated R: Service started with stale config after leader failover | - Follower reads without lease - Missing linearizable read requirement - etcd Serializable read used where Linearizable required | etcd Linearizable reads (default) Use WithSerializable() only for explicitly stale-OK reads | - Kubernetes controller stale config reads |
| Quorum write succeeds with dead replicas | S: Cassandra RF=3, ONE write consistency, node failure T: Accept writes even with one node down A: Write to ONE succeeded; node came back with stale data; subsequent QUORUM read returned stale value R: Stale data served after node recovery | - Write at ONE, read at QUORUM (W+R=1+2=3, not > N=3) - No read repair triggered - Hinted handoff replay not completed | W + R > N (QUORUM write + QUORUM read) Or: schedule repair after node recovery | - Cassandra consistency tuning incidents |
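The W + R > N condition from the second row can be verified by brute force over replica subsets: a read is guaranteed to see the latest write exactly when every possible read set intersects every possible write set.

```python
from itertools import combinations

def overlap_guaranteed(n, w, r):
    """True iff every size-r read set intersects every size-w write set
    among n replicas -- equivalent to the arithmetic condition W + R > N."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

# The RF=3 settings from the incident above:
assert not overlap_guaranteed(3, 1, 2)  # ONE write + QUORUM read: stale reads possible
assert overlap_guaranteed(3, 2, 2)      # QUORUM write + QUORUM read: overlap guaranteed

# Sanity check the equivalence with W + R > N for small clusters:
for n in range(1, 6):
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            assert overlap_guaranteed(n, w, r) == (w + r > n)
```
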
Suppression — Idempotency & Deduplication #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Double charge on payment retry | S: Checkout service with retry logic T: Charge customer exactly once despite network failures A: Payment API timed out; client retried with new request; both requests processed R: Customer charged twice | - No idempotency key - Server not idempotent - Retry without deduplication | Client-generated idempotency key (UUID) Server stores (key, result) in same transaction as business operation | - Stripe, Braintree payment double-charge incidents |
| Outbox event published twice | S: Order service publishing OrderCreated events T: Publish event exactly once after order creation A: Outbox poller published event; crashed before marking published=true; re-published on restart R: Downstream inventory service decremented twice | - Non-idempotent event consumer - Outbox poller without atomic mark-as-published - Missing consumer-side dedup | Outbox pattern + idempotent consumer Consumer deduplicates by event ID | - Microservice event sourcing incidents |
| Bloom filter false positive blocks valid URL | S: Web crawler deduplication at scale T: Avoid re-crawling URLs already processed A: Bloom filter sized too small; false positive rate reached 10%; valid new URLs silently skipped R: Crawler missed 10% of new content over two weeks | - Bloom filter undersized for actual URL volume - No false positive rate monitoring - No fallback exact check for high-value URLs | Size Bloom filter for expected N × 1.5 Monitor false positive rate; exact fallback for critical paths | - Web crawler quality degradation |
| Idempotency key reused across different operations | S: Batch payment processor reusing request IDs T: Deduplicate payment retries within a batch A: Same idempotency key used for two different charges (key not scoped to amount + recipient) R: Second charge silently returned cached result of first; revenue lost | - Idempotency key not encoding operation semantics - Key scoped too broadly (user ID only, not user+amount+recipient) | Key = hash(user_id + operation_type + entity_id + amount) Scope key to the full intent, not just the caller | - Payment platform silent revenue loss incidents |
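Key scoping from the last row can be sketched by deriving the key from the full intent of the request; the field names and toy in-memory store are illustrative, and in production the (key, result) record lives in the same transaction as the charge itself:

```python
import hashlib
import json

def intent_key(user_id, operation, entity_id, amount):
    """Derive the idempotency key from the full intent of the request,
    so two different charges by the same caller can never collide."""
    payload = json.dumps([user_id, operation, entity_id, amount], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class PaymentAPI:
    """Toy endpoint: retries with a seen key get the cached result."""

    def __init__(self):
        self.results = {}   # idempotency_key -> result

    def charge(self, key, amount):
        if key in self.results:
            return self.results[key]   # retry: no second charge
        self.results[key] = {"charged": amount}
        return self.results[key]

api = PaymentAPI()
k1 = intent_key("u1", "charge", "invoice-7", 50)
k2 = intent_key("u1", "charge", "invoice-8", 80)
assert k1 != k2                        # different intent, different key
api.charge(k1, 50)
api.charge(k1, 50)                     # retried request deduplicated
api.charge(k2, 80)
assert len(api.results) == 2           # both distinct charges processed
```
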
Aggregation — Rate Limiting & Counting #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Rate limiter bypass via multiple nodes | S: API gateway rate limiting 1000 req/min per user T: Enforce limit across all gateway instances A: Each instance maintained a local counter; user sent 999 req to each of 5 instances; 4995 requests processed R: Downstream service overwhelmed; SLA breached | - Local-only counter with no cross-node sync - No centralized rate state - Per-instance limit instead of per-user global | Centralized Redis counter (INCR + EXPIRE) Or: local budget + periodic sync to global Redis counter | - API gateway incidents during DDoS-like traffic spikes |
| Token bucket allows burst after idle | S: Payment API with 100 req/sec token bucket T: Prevent abuse bursts on sensitive endpoints A: Bucket accumulated 6000 tokens during 60-second idle; user sent 6000 requests instantly R: Downstream fraud service overwhelmed by burst | - Max capacity too high relative to burst tolerance - Bucket capacity = rate × idle period without ceiling | Cap burst capacity separately from rate Set capacity = max_burst (e.g., 200), not rate × window | - Payment API burst incidents |
| Count-Min Sketch undersized causes over-rate-limiting | S: CDN edge rate limiting by IP using Count-Min Sketch T: Block abusive IPs at edge without per-IP state A: Sketch saturated with high-traffic legitimate IPs; hash collisions inflated counts for innocent IPs R: Legitimate users rate-limited; support tickets spiked | - Sketch width too small for traffic volume - No overflow monitoring - Using sketch where exact count required | Size sketch for expected distinct IPs × safety factor Use exact counter for allowlisted / high-value IPs | - CDN edge filtering false positive incidents |
| HyperLogLog merge produces incorrect cardinality | S: Analytics platform counting unique daily users across 10 shards T: Merge per-shard HLL sketches for global unique count A: Sketches used different hash seeds per shard; element-wise max merge invalid across different hash spaces R: Reported unique users 40% lower than actual | - Different hash seeds per HLL instance - Merging incompatible sketches | All HLL instances must use the same hash function and seed Redis PFMERGE handles this correctly | - Analytics platform cardinality undercount |
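The capped-burst fix from the token-bucket row can be sketched as follows; the rate and capacity numbers are illustrative, and time is passed in explicitly to keep the example deterministic:

```python
class TokenBucket:
    """Token bucket whose burst capacity is capped independently of the
    refill rate, so idle time cannot bank an unbounded burst."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # hard ceiling -- NOT rate x idle time
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        elapsed = now - self.last
        # Refill is clamped to capacity, regardless of how long we were idle.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# 100 req/sec sustained rate, but burst capped at 200 tokens:
bucket = TokenBucket(rate=100.0, capacity=200.0)
# After 60 idle seconds a client fires 6000 near-instant requests...
allowed = sum(bucket.allow(60.0 + i * 1e-6) for i in range(6000))
assert allowed == 200   # ...and only the capped burst gets through
```

Without the `min(...)` clamp, the 60-second idle period would bank 6000 tokens, reproducing the burst incident in the table.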
Convergence — CRDTs & Eventual Consistency #
| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| LWW conflict resolution silently drops writes | S: Multi-region Cassandra storing user profile updates T: Accept concurrent writes from two regions A: User updated email in EU region and phone in US region simultaneously; LWW kept only the higher-timestamp write; other field silently discarded R: User’s profile update lost; support ticket filed | - LWW applied at row level, not column level - Wall clock timestamps with inter-region drift - No conflict detection or notification | Column-level LWW with NTP-synchronized clocks Or: use conditional updates (CAS) for critical fields | - Cassandra multi-region data loss incidents |
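The −15 in the PN-Counter row below falls directly out of the merge rule. A minimal state-based sketch (illustrative, not production code):

```python
class PNCounter:
    """Minimal state-based PN-Counter: one increment vector and one
    decrement vector per replica; merge is element-wise max of each."""

    def __init__(self, replicas):
        self.p = {r: 0 for r in replicas}   # increments per replica
        self.n = {r: 0 for r in replicas}   # decrements per replica

    def incr(self, replica, amount=1):
        self.p[replica] += amount

    def decr(self, replica, amount=1):
        self.n[replica] += amount           # no lower bound is enforced

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        for r in self.p:
            self.p[r] = max(self.p[r], other.p[r])
            self.n[r] = max(self.n[r], other.n[r])

# Stock of 10 replicated to two regions, then a partition:
eu = PNCounter(["eu", "us"])
eu.incr("eu", 10)
us = PNCounter(["eu", "us"])
us.merge(eu)
eu.decr("eu", 12)   # each side sells independently during the partition...
us.decr("us", 13)
eu.merge(us)
assert eu.value() == -15   # ...and the merged count goes negative: oversell
```

The CRDT converges exactly as designed; the bug is using an unbounded counter for a quantity with an invariant (stock ≥ 0), which is why the canonical fix is reserved per-region capacity or strong consistency.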
| CRDT counter goes negative | S: Inventory system using PN-Counter CRDT T: Track available stock across regional replicas A: Both regions decremented independently below zero during partition; after merge, stock showed −15 R: Orders placed against negative inventory; oversell | - PN-Counter has no lower bound - No inventory constraint enforced at CRDT level - Partition allowed unbounded decrements | Reserve capacity in each region before decrement Or: use strong consistency (CP) for inventory, not CRDTs | - Multi-region inventory oversell incidents |
| Operational transform divergence on concurrent edits | S: Collaborative document editor T: Merge concurrent edits from two users into consistent document A: OT client applied remote operations without transforming against locally-buffered unacknowledged ops; document diverged between clients R: Different users saw different document state; data lost on save | - OT transform not applied to in-flight local ops - Client applied remote ops before server acknowledgment - Missing TP2 transform property | Buffer local unacknowledged ops; transform all incoming ops against buffer Server-ordered OT simplifies: server is total order authority | - Collaborative editor divergence incidents |
| Gossip convergence too slow during membership storm | S: Redis Cluster during rolling restart of 50 nodes T: Propagate slot ownership changes to all nodes quickly A: Gossip fanout too low; convergence took 3 minutes; stale slot routing caused MOVED errors to spike R: 3 minutes of elevated error rate during rolling restart | - Default gossip interval too long for cluster size - Fanout too low (even f=2 for 50 nodes converges in ~6 rounds ≈ 6s at a 1s interval; 3 min points to a longer interval or lower effective fanout) - No gossip acceleration during membership change | Increase gossip fanout during topology changes Monitor gossip convergence time as an SLI | - Redis Cluster MOVED error spike during maintenance |