Distributed Coordination Issues, Incidents, and Mitigation Strategies #

Deep Dive Series #

For a ground-up treatment of distributed coordination — covering the mechanism taxonomy (Selection, Agreement, Assignment, Suppression, Aggregation, Convergence), infrastructure-level protocols (Raft, etcd, Cassandra gossip, Kafka), and application-level patterns (seat reservation, idempotency, rate limiting, CRDTs) — see the series:

Distributed Coordination: The Hidden Component


Leader Election & Distributed Locks #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Split-brain / dual leader | **S:** Distributed job scheduler cluster<br>**T:** Ensure exactly one node triggers each scheduled job<br>**A:** Network partition caused both partitions to elect a leader; same job ran twice<br>**R:** Duplicate emails sent to 2 million users | - Fixed-timeout failure detection<br>- No fencing tokens<br>- Redis SETNX as correctness lock<br>- Async replication without quorum writes | Raft leader election + etcd lease + fencing token<br>Protected resource rejects writes with stale token | - GitHub dual-primary MySQL incident (2012)<br>- Redis Sentinel split-brain during network partition |
| Stale lock holder writes after TTL expiry | **S:** Payment processor holding a distributed lock during transaction<br>**T:** Ensure only one processor commits a transaction<br>**A:** GC pause exceeded TTL; lock expired; new holder acquired; original resumed and double-committed<br>**R:** Double charge to customer | - GC pauses exceeding lock TTL<br>- No fencing token<br>- Using Redis SETNX for correctness | Fencing token (monotonic revision)<br>Storage layer rejects writes with token ≤ last seen | - Kleppmann (2016) — “How to do distributed locking” |
| Lock not released on crash | **S:** Inventory management service holding a lock<br>**T:** Release held inventory on service failure<br>**A:** Service crashed without releasing Redis lock; no TTL set<br>**R:** Inventory permanently unavailable until manual intervention | - No TTL on lock key<br>- Missing crash-recovery cleanup<br>- Pessimistic lock without expiry | TTL-bounded locks (always set EX)<br>Heartbeat extension for long operations | - Common Redis lock misuse pattern in production |
| Election storms on network partition | **S:** 5-node etcd cluster during datacenter network flap<br>**T:** Maintain stable leadership during transient failures<br>**A:** Election timeout too short; every network jitter triggered an election; no stable leader<br>**R:** 30 seconds of unavailability during flap | - Election timeout < network jitter<br>- Randomization window too narrow<br>- Heartbeat interval too long relative to timeout | Randomized election timeouts (150–300 ms)<br>Heartbeat interval = timeout/3 | - etcd production tuning guides |
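
The fencing-token pattern from the first two rows can be sketched in a few lines of Python. `FencedStore` is a hypothetical storage layer written for illustration; in practice the monotonic token would come from the lock service itself (for example, an etcd lease's revision number):

```python
class StaleTokenError(Exception):
    pass

class FencedStore:
    """Hypothetical storage layer that refuses writes carrying a stale
    fencing token (a monotonically increasing number issued with the lock)."""
    def __init__(self):
        self.last_token = -1
        self.data = {}

    def write(self, key, value, token):
        # A lock holder that paused past its TTL resumes with an old token
        # and is rejected here, even though it still "thinks" it holds the lock.
        if token <= self.last_token:
            raise StaleTokenError(f"token {token} <= last seen {self.last_token}")
        self.last_token = token
        self.data[key] = value

store = FencedStore()
store.write("invoice-42", "charged", token=7)            # current lock holder
try:
    store.write("invoice-42", "charged-again", token=6)  # resumed stale holder
except StaleTokenError:
    pass                                                 # double commit prevented
```

The essential point is that lock expiry alone is not enough: the protected resource must check the token, because a paused holder has no way to know its lock has already expired.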

Distributed Transactions & Saga #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| 2PC coordinator crash leaves in-doubt transactions | **S:** Cross-database financial transaction<br>**T:** Atomically debit account A, credit account B<br>**A:** Coordinator crashed after sending PREPARE but before COMMIT; both participants held locks for 8 minutes<br>**R:** Account transfers blocked until coordinator recovered | - Single-node coordinator<br>- No coordinator replication<br>- Long-held locks | Raft-replicated coordinator (Spanner model)<br>Or: use Saga instead of 2PC across service boundaries | - Oracle RAC distributed transaction hangs<br>- PostgreSQL in-doubt transaction incidents |
| Saga partial failure without compensation | **S:** Hotel + flight booking saga<br>**T:** Book both hotel and flight atomically<br>**A:** Flight booking succeeded; hotel booking failed; no compensation triggered<br>**R:** Customer charged for flight with no hotel | - Missing compensation actions<br>- No saga orchestrator<br>- Fire-and-forget event choreography | Saga with explicit compensation + idempotent steps<br>Orchestrator retries compensation until confirmed | - Microservice booking platform incidents |
| Non-idempotent saga compensation | **S:** E-commerce order cancellation<br>**T:** Refund payment when order cancelled<br>**A:** Refund service crashed after processing refund; orchestrator retried; double refund issued<br>**R:** Revenue loss from duplicate refunds | - Compensation not idempotent<br>- No deduplication on refund API<br>- Missing idempotency key on compensation | Idempotency key on every saga step and compensation | - Payment platform double-refund incidents |
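
The double-refund row hinges on making the compensation itself idempotent. A minimal sketch, assuming a hypothetical `RefundService` that caches `(key, result)` before returning:

```python
import uuid

class RefundService:
    """Hypothetical refund API that deduplicates by idempotency key,
    making the compensation step safe for the orchestrator to retry."""
    def __init__(self):
        self.processed = {}      # idempotency_key -> cached result
        self.refunds_issued = 0

    def refund(self, idempotency_key, amount):
        # A retried compensation replays the same key and gets the
        # cached result instead of triggering a second refund.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.refunds_issued += 1
        result = {"status": "refunded", "amount": amount}
        self.processed[idempotency_key] = result
        return result

svc = RefundService()
key = str(uuid.uuid4())   # generated once, persisted with the saga state
svc.refund(key, 40)       # first attempt times out after processing...
svc.refund(key, 40)       # ...so the orchestrator retries with the SAME key
```

In a real service, recording the key and executing the refund must happen in one transaction; otherwise a crash between the two reintroduces the duplicate.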

Membership & Failure Detection #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| GC pause triggers false failure detection | **S:** Cassandra cluster during JVM GC tuning<br>**T:** Detect genuinely failed nodes quickly<br>**A:** Fixed 10 s timeout declared GC-paused node dead; rebalancing started; paused node resumed mid-rebalance<br>**R:** Data inconsistency and rebalancing storm | - Fixed-timeout failure detection<br>- Timeout shorter than max GC pause<br>- Aggressive decommission on suspicion | Phi accrual failure detector<br>Adaptive to observed heartbeat intervals | - Cassandra GC-triggered rebalancing incidents |
| Seed node unavailability prevents cluster bootstrap | **S:** Cassandra cluster restart after datacenter maintenance<br>**T:** Bring cluster back online after full restart<br>**A:** Seed nodes restarted last; non-seed nodes couldn’t join; cluster offline for 45 minutes<br>**R:** Extended downtime during maintenance window | - Seeds restarted in wrong order<br>- Only one seed per datacenter<br>- No automated seed node selection | Seeds first, then non-seeds<br>At least one seed per AZ | - Cassandra operational runbook anti-patterns |
| Membership divergence during partition | **S:** Cassandra multi-datacenter cluster during WAN link failure<br>**T:** Continue serving reads/writes in each datacenter<br>**A:** Each DC continued operating with LOCAL_QUORUM; after partition heal, inconsistent data on overlapping writes<br>**R:** Data divergence requiring manual repair | - LOCAL_QUORUM without understanding cross-DC consistency<br>- No read repair monitoring<br>- Missing anti-entropy repair schedule | NetworkTopologyStrategy + scheduled repair (nodetool repair)<br>Monitor inconsistency metrics | - Multi-DC Cassandra consistency incidents |
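
The phi accrual detector from the first row can be sketched with the exponential inter-arrival approximation (phi = −log₁₀ of the probability that the node is merely slow, not dead). The class below is illustrative only, not Cassandra's actual implementation:

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Sketch of a phi accrual failure detector: suspicion is a continuous
    value derived from observed heartbeat intervals, not a fixed timeout."""
    def __init__(self, window=100, threshold=8.0):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # Exponential model: P(silence > t) = exp(-t/mean),
        # so phi = (t / mean) * log10(e).
        return (now - self.last_heartbeat) / mean * math.log10(math.e)

    def is_available(self, now):
        return self.phi(now) < self.threshold

d = PhiAccrualDetector()
for t in range(11):        # steady 1 s heartbeats at t = 0..10
    d.heartbeat(t)
d.is_available(11)         # 1 s of silence: phi is tiny, node still "up"
d.is_available(30)         # 20 s of silence: phi has crossed the threshold
```

Because phi scales with the node's own observed heartbeat statistics, a cluster whose GC pauses routinely stretch intervals automatically tolerates longer silences before declaring a node dead.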

Consistency & Quorum #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Stale reads after leader change | **S:** etcd-backed configuration store<br>**T:** Read latest configuration after write<br>**A:** Read served by follower before new leader’s writes propagated<br>**R:** Service started with stale config after leader failover | - Follower reads without lease<br>- Missing linearizable read requirement<br>- etcd Serializable read used where Linearizable required | etcd Linearizable reads (default)<br>Use WithSerializable() only for explicitly stale-OK reads | - Kubernetes controller stale config reads |
| Quorum write succeeds with dead replicas | **S:** Cassandra RF=3, ONE write consistency, node failure<br>**T:** Accept writes even with one node down<br>**A:** Write at ONE succeeded; node came back with stale data; subsequent QUORUM read returned stale value<br>**R:** Stale data served after node recovery | - Write at ONE, read at QUORUM (W+R=1+2=3, not > N=3)<br>- No read repair triggered<br>- Hinted handoff replay not completed | W + R > N (QUORUM write + QUORUM read)<br>Or: schedule repair after node recovery | - Cassandra consistency tuning incidents |
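
The W + R > N rule in the second row is plain arithmetic. This helper (hypothetical, purely for illustration) makes the two configurations from the incident explicit:

```python
def overlap_guaranteed(n, w, r):
    """True when every read quorum must intersect every write quorum,
    i.e. a read always includes at least one replica that acknowledged
    the latest write."""
    return w + r > n

# RF = 3, write at ONE, read at QUORUM: 1 + 2 = 3, not > 3, so a
# QUORUM read can land entirely on replicas that missed the write.
stale_possible = not overlap_guaranteed(n=3, w=1, r=2)

# RF = 3, QUORUM write + QUORUM read: 2 + 2 = 4 > 3, overlap guaranteed.
safe = overlap_guaranteed(n=3, w=2, r=2)
```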

Suppression — Idempotency & Deduplication #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Double charge on payment retry | **S:** Checkout service with retry logic<br>**T:** Charge customer exactly once despite network failures<br>**A:** Payment API timed out; client retried with a new request; both requests processed<br>**R:** Customer charged twice | - No idempotency key<br>- Server not idempotent<br>- Retry without deduplication | Client-generated idempotency key (UUID)<br>Server stores (key, result) in same transaction as business operation | - Stripe, Braintree payment double-charge incidents |
| Outbox event published twice | **S:** Order service publishing OrderCreated events<br>**T:** Publish event exactly once after order creation<br>**A:** Outbox poller published event; crashed before marking published=true; re-published on restart<br>**R:** Downstream inventory service decremented twice | - Non-idempotent event consumer<br>- Outbox poller without atomic mark-as-published<br>- Missing consumer-side dedup | Outbox pattern + idempotent consumer<br>Consumer deduplicates by event ID | - Microservice event sourcing incidents |
| Bloom filter false positive blocks valid URL | **S:** Web crawler deduplication at scale<br>**T:** Avoid re-crawling URLs already processed<br>**A:** Bloom filter sized too small; false positive rate reached 10%; valid new URLs silently skipped<br>**R:** Crawler missed 10% of new content over two weeks | - Bloom filter undersized for actual URL volume<br>- No false positive rate monitoring<br>- No fallback exact check for high-value URLs | Size Bloom filter for expected N × 1.5<br>Monitor false positive rate; exact fallback for critical paths | - Web crawler quality degradation |
| Idempotency key reused across different operations | **S:** Batch payment processor reusing request IDs<br>**T:** Deduplicate payment retries within a batch<br>**A:** Same idempotency key used for two different charges (key not scoped to amount + recipient)<br>**R:** Second charge silently returned cached result of first; revenue lost | - Idempotency key not encoding operation semantics<br>- Key scoped too broadly (user ID only, not user+amount+recipient) | Key = hash(user_id + operation_type + entity_id)<br>Scope key to the full intent, not just the caller | - Payment platform silent revenue loss incidents |
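
The last row's fix — scoping the key to the full intent of the operation — can be sketched as a deterministic hash over every field that distinguishes one charge from another. The field list below is illustrative (it extends the table's `user_id + operation_type + entity_id` formula with the amount and recipient the incident called out); include whatever defines "the same operation" in your domain:

```python
import hashlib

def idempotency_key(user_id, operation_type, entity_id, amount, recipient):
    """Derive the key from the operation's full intent, not just the caller,
    so two different charges by the same user never collide, while a retry
    of one charge deterministically reproduces the same key."""
    payload = f"{user_id}:{operation_type}:{entity_id}:{amount}:{recipient}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = idempotency_key("u1", "charge", "batch-9", 100, "acct-A")
k2 = idempotency_key("u1", "charge", "batch-9", 250, "acct-B")  # different charge
k3 = idempotency_key("u1", "charge", "batch-9", 100, "acct-A")  # retry of the first
```

A derived key like this also removes a whole failure class: callers cannot accidentally reuse a stale UUID for a semantically different operation.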

Aggregation — Rate Limiting & Counting #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| Rate limiter bypass via multiple nodes | **S:** API gateway rate limiting 1000 req/min per user<br>**T:** Enforce limit across all gateway instances<br>**A:** Each instance maintained a local counter; user sent 999 req to each of 5 instances; 4995 requests processed<br>**R:** Downstream service overwhelmed; SLA breached | - Local-only counter with no cross-node sync<br>- No centralized rate state<br>- Per-instance limit instead of per-user global | Centralized Redis counter (INCR + EXPIRE)<br>Or: local budget + periodic sync to global Redis counter | - API gateway incidents during DDoS-like traffic spikes |
| Token bucket allows burst after idle | **S:** Payment API with 100 req/sec token bucket<br>**T:** Prevent abuse bursts on sensitive endpoints<br>**A:** Bucket accumulated 6000 tokens during 60-second idle; user sent 6000 requests instantly<br>**R:** Downstream fraud service overwhelmed by burst | - Max capacity too high relative to burst tolerance<br>- Bucket capacity = rate × idle period without ceiling | Cap burst capacity separately from rate<br>Set capacity = max_burst (e.g., 200), not rate × window | - Payment API burst incidents |
| Count-Min Sketch undersized causes over-rate-limiting | **S:** CDN edge rate limiting by IP using Count-Min Sketch<br>**T:** Block abusive IPs at edge without per-IP state<br>**A:** Sketch saturated with high-traffic legitimate IPs; hash collisions inflated counts for innocent IPs<br>**R:** Legitimate users rate-limited; support tickets spiked | - Sketch width too small for traffic volume<br>- No overflow monitoring<br>- Using sketch where exact count required | Size sketch for expected distinct IPs × safety factor<br>Use exact counter for allowlisted / high-value IPs | - CDN edge filtering false positive incidents |
| HyperLogLog merge produces incorrect cardinality | **S:** Analytics platform counting unique daily users across 10 shards<br>**T:** Merge per-shard HLL sketches for global unique count<br>**A:** Sketches used different hash seeds per shard; element-wise max merge invalid across different hash spaces<br>**R:** Reported unique users 40% lower than actual | - Different hash seeds per HLL instance<br>- Merging incompatible sketches | All HLL instances must use the same hash function and seed<br>Redis PFMERGE handles this correctly | - Analytics platform cardinality undercount |
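
The idle-burst row comes down to one line of code: the refill must be clamped at a capacity chosen for burst tolerance, not derived from rate × idle time. A minimal single-node sketch (timestamps are injected rather than read from the clock, for determinism):

```python
class TokenBucket:
    """Token bucket whose burst capacity is capped independently of the
    refill rate: idle time cannot bank an unbounded burst."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # max_burst, NOT rate * idle_window
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill since the last call, clamped at capacity: 60 s of idleness
        # at 100 tok/s refills at most `capacity` tokens, not 6000.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Reproducing the incident with the fix in place: capacity = 200, not 6000.
bucket = TokenBucket(rate=100, capacity=200)
burst_at_start = sum(bucket.allow(now=0.0) for _ in range(6000))
burst_after_idle = sum(bucket.allow(now=60.0) for _ in range(6000))
```

With the cap in place both bursts are limited to 200 requests; the remaining traffic is shed and must trickle in at the 100 req/s refill rate.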

Convergence — CRDTs & Eventual Consistency #

| Issue | STAR Incident Example | Contributing Patterns | Canonical Solution Pattern | Real-world Incidents |
|---|---|---|---|---|
| LWW conflict resolution silently drops writes | **S:** Multi-region Cassandra storing user profile updates<br>**T:** Accept concurrent writes from two regions<br>**A:** User updated email in EU region and phone in US region simultaneously; LWW kept only the higher-timestamp write; other field silently discarded<br>**R:** User’s profile update lost; support ticket filed | - LWW applied at row level, not column level<br>- Wall clock timestamps with inter-region drift<br>- No conflict detection or notification | Column-level LWW with NTP-synchronized clocks<br>Or: use conditional updates (CAS) for critical fields | - Cassandra multi-region data loss incidents |
| CRDT counter goes negative | **S:** Inventory system using PN-Counter CRDT<br>**T:** Track available stock across regional replicas<br>**A:** Both regions decremented independently below zero during partition; after merge, stock showed −15<br>**R:** Orders placed against negative inventory; oversell | - PN-Counter has no lower bound<br>- No inventory constraint enforced at CRDT level<br>- Partition allowed unbounded decrements | Reserve capacity in each region before decrement<br>Or: use strong consistency (CP) for inventory, not CRDTs | - Multi-region inventory oversell incidents |
| Operational transform divergence on concurrent edits | **S:** Collaborative document editor<br>**T:** Merge concurrent edits from two users into consistent document<br>**A:** OT client applied remote operations without transforming against locally-buffered unacknowledged ops; document diverged between clients<br>**R:** Different users saw different document state; data lost on save | - OT transform not applied to in-flight local ops<br>- Client applied remote ops before server acknowledgment<br>- Missing TP2 transform property | Buffer local unacknowledged ops; transform all incoming ops against buffer<br>Server-ordered OT simplifies: server is total order authority | - Collaborative editor divergence incidents |
| Gossip convergence too slow during membership storm | **S:** Redis Cluster during rolling restart of 50 nodes<br>**T:** Propagate slot ownership changes to all nodes quickly<br>**A:** Gossip fanout too low; convergence took 3 minutes; stale slot routing caused MOVED errors to spike<br>**R:** 3 minutes of elevated error rate during rolling restart | - Default gossip interval too long for cluster size<br>- Fanout too low (f=2 for 50 nodes needs ~6 rounds at 1 s = 6 s; 3 min suggests a fanout problem)<br>- No gossip acceleration during membership change | Increase gossip fanout during topology changes<br>Monitor gossip convergence time as an SLI | - Redis Cluster MOVED error spike during maintenance |
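
The negative-inventory row is easy to reproduce with a toy PN-Counter, because nothing in the merge function knows about a lower bound. The class below is a standard state-based PN-Counter sketch written for this example, not any particular library's implementation:

```python
class PNCounter:
    """State-based PN-Counter: per-replica grow-only increment and decrement
    maps; merge is element-wise max; value is sum(P) - sum(N)."""
    def __init__(self, replicas):
        self.p = {r: 0 for r in replicas}  # increments observed per replica
        self.n = {r: 0 for r in replicas}  # decrements observed per replica

    def increment(self, replica, amount=1):
        self.p[replica] += amount

    def decrement(self, replica, amount=1):
        # Nothing here can check global stock: during a partition each
        # replica only sees its own decrements.
        self.n[replica] += amount

    def merge(self, other):
        for r in self.p:
            self.p[r] = max(self.p[r], other.p[r])
            self.n[r] = max(self.n[r], other.n[r])

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

# Stock of 10 replicated to two regions; a partition begins.
eu = PNCounter(["eu", "us"]); eu.increment("eu", 10)
us = PNCounter(["eu", "us"]); us.merge(eu)   # both regions see stock = 10
eu.decrement("eu", 8)                        # EU sells 8 during the partition
us.decrement("us", 7)                        # US concurrently sells 7
eu.merge(us); us.merge(eu)                   # partition heals
oversold = eu.value()                        # 10 - 15 = -5: oversell
```

This is exactly why the table's fix reserves per-region capacity before decrementing (each region may only sell its own allocation), or moves inventory to a CP store where the constraint can be enforced.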
