
Membership — Knowing Who Is Alive #

Before a distributed system can elect a leader, assign work, or form a quorum, it must answer a simpler question: who is currently in the system? Membership is the coordination substrate — the information layer that all other mechanisms depend on. You cannot run consensus among an unknown set of participants. You cannot assign a shard to a node you do not know exists. You cannot form a quorum without knowing how many nodes are alive.

Membership is solved at the infrastructure level. Applications do not typically build membership detection; they inherit it from the platform (Cassandra’s gossip, Kubernetes via etcd, Kafka via ZooKeeper or KRaft). But understanding how membership works — and how it fails — is essential for understanding why the coordination mechanisms built on top of it behave as they do.

What Membership Must Provide #

A membership system must answer:

  1. Who is in the cluster? — the set of nodes that are expected to participate
  2. Who is alive? — the subset of members that are reachable and functioning
  3. Who has joined recently? — new nodes that should start receiving work
  4. Who has left or failed? — nodes that should no longer receive work

Membership changes trigger downstream coordination: shard rebalancing (Assignment), leader re-election (Selection), quorum recalculation (Agreement). The accuracy and latency of membership information directly determines the correctness and performance of these mechanisms.

Gossip Protocols: Epidemic Propagation #

The dominant membership protocol for large distributed systems is gossip — an epidemic broadcast algorithm inspired by how rumors spread in social networks.

Gossip propagation: O(log N) rounds to reach all nodes

Mechanics #

Each node maintains a partial view of cluster state. On a configurable interval (typically 1 second in Cassandra), every node:

  1. Updates its heartbeat — increments a local counter, creating a (generation, version) pair. Generation is a monotonic timestamp from when the node started; version increments roughly every second.
  2. Selects gossip targets — randomly picks one or more live peers to exchange information with.
  3. Optionally contacts unreachable nodes — with lower probability, attempts to contact nodes currently suspected of failure.
  4. Gossips with seed nodes — if no peers are reachable, contacts configured seed nodes. Seed nodes are the bootstrap points for new nodes joining the cluster.

The exchange is a push-pull model: the initiating node sends its known state, the receiving node replies with its own state, and both nodes merge the two views by keeping the highest version numbers for each key.
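The push-pull merge can be sketched in a few lines. This is an illustrative model, not Cassandra's actual implementation: a view is assumed to be a map from node id to its (generation, version) heartbeat pair, and all names are hypothetical.

```python
def merge_views(local, remote):
    """Merge two membership views, keeping the newest heartbeat per node.

    A (generation, version) pair is newer if its generation is higher
    (the node restarted) or, within the same generation, its version is
    higher (a more recent heartbeat). Tuple comparison captures both.
    """
    merged = dict(local)
    for node, heartbeat in remote.items():
        if node not in merged or heartbeat > merged[node]:
            merged[node] = heartbeat
    return merged


# Push-pull: the initiator sends its state, the receiver replies with
# its own, and both sides converge on the element-wise maximum.
a = {"n1": (100, 7), "n2": (90, 3)}
b = {"n2": (90, 5), "n3": (120, 1)}
assert merge_views(a, b) == merge_views(b, a) == {
    "n1": (100, 7),
    "n2": (90, 5),
    "n3": (120, 1),
}
```

Because the merge is a commutative, idempotent maximum, it does not matter in which order exchanges happen or whether the same state is received twice.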

Convergence Guarantee #

Gossip achieves logarithmic convergence: with fanout f, a state change propagates to all N nodes in O(log_f(N)) gossip rounds. For Cassandra with fanout 2 and a 25,000-node cluster, approximately 15 rounds are sufficient. At 1-second gossip intervals, cluster-wide propagation takes about 15 seconds in the worst case.

This is eventual consistency for membership state — not immediate, not guaranteed to be synchronized at any given moment, but bounded convergence with well-known latency.
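The arithmetic behind the convergence bound can be checked directly. A minimal sketch of the idealized model (each round multiplies the number of informed nodes by the fanout; function name is illustrative):

```python
import math


def gossip_rounds(n_nodes, fanout):
    """Rounds for a state change to reach all nodes under the
    O(log_f(N)) bound from the text."""
    return math.ceil(math.log(n_nodes, fanout))


# 25,000 nodes at fanout 2: ~15 rounds, i.e. ~15 s at 1 s intervals.
assert gossip_rounds(25_000, 2) == 15
```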

Message Complexity #

Each node sends a fixed number of messages per gossip round, independent of cluster size. This is what makes gossip scalable: adding more nodes does not increase the per-node messaging rate. Total cluster message rate is O(N) per gossip round, not O(N²).

Push, Pull, Push-Pull #

Three gossip strategies exist:

  • Push: Nodes with new updates actively send to random peers. Efficient when updates are infrequent.
  • Pull: Nodes periodically query random peers for updates. Efficient when updates are frequent.
  • Push-pull: Nodes both send their updates and request the peer’s state in the same exchange. Fastest convergence; Cassandra uses this model.

Failure Detection: From Timeout to Suspicion #

A membership system must distinguish between a node that is slow and a node that has failed. The naive approach — declare a node dead if it has not heartbeated within a fixed timeout — is fragile. Too short a timeout produces false positives (slow nodes declared dead during GC pauses). Too long a timeout means failures are detected slowly, leaving the cluster in a degraded state for longer.

Fixed Timeout Failure Detection #

The simplest approach: if a heartbeat has not been received within T seconds, mark the node as failed.

Failure mode: GC pauses in JVM-based systems can easily exceed 10–30 seconds. During a pause, the node sends no heartbeats. A fixed timeout shorter than the maximum GC pause produces false positives — healthy nodes declared dead.
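The failure mode is easy to see in a minimal sketch of a fixed-timeout detector (all names and numbers are illustrative):

```python
def is_alive_fixed(last_heartbeat, now, timeout_s=10.0):
    """Fixed-timeout detector: alive iff a heartbeat arrived within
    the last timeout_s seconds."""
    return (now - last_heartbeat) <= timeout_s


# A healthy node in a 25 s GC pause sends no heartbeats. With a 10 s
# timeout it is declared dead even though it will resume shortly.
assert is_alive_fixed(last_heartbeat=100.0, now=105.0) is True
assert is_alive_fixed(last_heartbeat=100.0, now=125.0) is False  # false positive
```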

This was the failure mode behind many Cassandra production incidents: GC-paused nodes were declared dead, triggering rebalancing. The paused node resumed, found its data had been redistributed, and the cluster entered a confused state.

Phi Accrual Failure Detector #

The Phi accrual failure detector (Hayashibara et al., 2004) replaces the binary alive/dead judgment with a continuous suspicion level φ. Rather than a fixed threshold, it computes the probability that the node has failed given the observed heartbeat arrival distribution.

How it works:

The detector samples the inter-arrival times of heartbeats from each node. It models the distribution of these intervals (typically as an exponential distribution). When a heartbeat is overdue, it computes:

φ(t) = -log₁₀(P(T > t_now - t_last))

where t_last is the last heartbeat time and P(T > Δt) is the probability that a heartbeat would be this late given the observed distribution.

As the overdue period increases, φ grows. The operator configures a threshold φ_threshold:

  • If φ < φ_threshold: node is considered alive
  • If φ ≥ φ_threshold: node is suspected of failure

Properties:

  • Adaptive: the suspicion threshold adjusts to observed network conditions. A node on a slow link has longer inter-arrival times; the detector learns this and does not false-positive.
  • Configurable accuracy: increasing φ_threshold reduces false positives at the cost of slower detection. Cassandra’s default is φ = 8; since φ is defined as -log₁₀ of the mistake probability, this corresponds to roughly a 10⁻⁸ chance that a node suspected at that level is actually alive.
  • Gradual escalation: rather than a cliff edge from “alive” to “dead”, operators can observe φ increasing and take action before the threshold is crossed.
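The detector described above can be sketched using the exponential model from the text. This is a simplified illustration (the production detector also bounds its sample window and handles edge cases); class and variable names are hypothetical.

```python
import math


class PhiAccrualDetector:
    """Sketch of a Phi accrual failure detector, modeling heartbeat
    inter-arrival times as an exponential distribution."""

    def __init__(self, window=1000):
        self.intervals = []      # recent inter-arrival samples
        self.window = window
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat arrival and update the samples."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]
        self.last_heartbeat = now

    def phi(self, now):
        """phi = -log10(P(T > now - t_last)) with T ~ Exp(mean).

        For the exponential model P(T > dt) = exp(-dt / mean), so
        phi = (dt / mean) * log10(e): suspicion grows linearly with
        how overdue the next heartbeat is.
        """
        mean = sum(self.intervals) / len(self.intervals)
        dt = now - self.last_heartbeat
        return (dt / mean) * math.log10(math.e)


d = PhiAccrualDetector()
for t in range(10):              # heartbeats arriving every 1 s
    d.heartbeat(float(t))
assert d.phi(10.0) < 8           # 1 s overdue: low suspicion
assert d.phi(9.0 + 30.0) > 8     # 30 s overdue: past Cassandra's default threshold
```

A node on a slow link simply accumulates larger interval samples, so its mean grows and the same absolute delay produces a lower φ, which is the adaptivity described above.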

Cassandra’s Failure Detection in Practice #

Cassandra uses the Phi accrual failure detector for intra-cluster failure detection. Each node computes φ for every other node based on gossip heartbeat arrivals.

The gossip state for each node includes:

  • Heartbeat state: (generation, version) — used to compute inter-arrival times
  • Application state: node status (NORMAL, LEAVING, REMOVING, JOINING), load, schema version, tokens
  • Endpoint state: collection of heartbeat + application state

When a node’s φ exceeds the threshold, the local node marks it as UNREACHABLE and gossips this suspicion. A consensus of suspicions across multiple nodes transitions the failed node to DOWN in the cluster’s view.

Membership in Raft: Cluster Configuration #

Raft handles membership differently from gossip. Rather than epidemic propagation of heartbeat state, Raft treats cluster membership as a committed log entry. Adding or removing a node requires a two-phase configuration change (joint consensus) that is replicated through the log like any other state change.

Why joint consensus?

If you simply switch from configuration C_old (3 nodes: A, B, C) to C_new (5 nodes: A, B, C, D, E) by updating the config on each node independently, there is a window where different nodes use different configurations to calculate majorities. A and B might form a quorum under C_old (2 of 3), while C, D, and E form a quorum under C_new (3 of 5). These two majorities are disjoint, so two leaders could be elected simultaneously.

Joint consensus (C_old,new) requires majorities from both the old and new configuration for any decision during the transition. Only after the joint configuration is committed does the cluster transition to C_new alone.

etcd implements this via learner nodes: new nodes are added as learners (receive log entries but do not vote) until they are caught up, then promoted to voters via a configuration change committed through Raft.
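The joint-consensus rule can be expressed as a small predicate. A sketch under the example configurations above (function names are illustrative):

```python
def has_quorum(votes, config):
    """Majority of a single configuration."""
    return len(votes & config) > len(config) // 2


def joint_quorum(votes, c_old, c_new):
    """During joint consensus (C_old,new), a decision needs a
    majority of BOTH the old and the new configuration."""
    return has_quorum(votes, c_old) and has_quorum(votes, c_new)


c_old = {"A", "B", "C"}
c_new = {"A", "B", "C", "D", "E"}

# If each side consulted only its own config, disjoint majorities exist:
assert has_quorum({"A", "B"}, c_old)
assert has_quorum({"C", "D", "E"}, c_new)

# Joint consensus rules both out: neither set is a majority of BOTH.
assert not joint_quorum({"A", "B"}, c_old, c_new)
assert not joint_quorum({"C", "D", "E"}, c_old, c_new)
```

Any two sets that each satisfy `joint_quorum` must intersect in both C_old and C_new, which is what prevents two leaders during the transition.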

The Seed Node Problem #

Gossip-based membership requires bootstrap. When a node first joins, it has no peers. Seed nodes are pre-configured addresses that new nodes contact first:

# cassandra.yaml
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"

Failure mode: If all seed nodes are unavailable when a non-seed node starts, the node cannot join the cluster. It cannot bootstrap its membership state.

Mitigation: Configure at least one seed node per availability zone. Ensure seed nodes are restarted before non-seed nodes in a full cluster restart.

Important: Seed nodes have no special role after bootstrapping. They are not leaders or coordinators. Once a node has joined and learned the cluster topology via gossip, it treats seeds like any other peer (Cassandra’s gossip does still contact a seed occasionally, per step 4 above, to keep partitioned views from forming).

Membership and the Quorum Calculation #

Membership state is used to calculate quorums. In Cassandra, the replication factor and consistency level determine the quorum:

QUORUM = floor(replication_factor / 2) + 1

For RF=3: QUORUM = 2. Writes succeed if 2 of 3 replicas acknowledge. Reads return the latest value if 2 of 3 replicas respond.

Consistency guarantee: W + R > N (write quorum + read quorum > replication factor) ensures that the write set and read set always intersect — any read will see at least one node that has the latest write.

For RF=3: QUORUM write (2) + QUORUM read (2) = 4 > 3 ✓
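The quorum arithmetic is simple enough to verify directly. A minimal sketch (function names are illustrative):

```python
def quorum(rf):
    """Cassandra-style QUORUM: floor(RF / 2) + 1."""
    return rf // 2 + 1


def quorums_intersect(w, r, rf):
    """W + R > N guarantees the write set and read set overlap in
    at least one replica, so a quorum read sees the latest write."""
    return w + r > rf


assert quorum(3) == 2
for rf in (3, 5, 7):
    q = quorum(rf)
    assert quorums_intersect(q, q, rf)   # QUORUM/QUORUM always overlaps

# ONE/ONE at RF=3 does not: 1 + 1 = 2, which is not > 3.
assert not quorums_intersect(1, 1, 3)
```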

Failure mode under membership change: if a node is removed from the cluster but its data has not been streamed to other nodes, read quorums may include the removed node. Queries that require the removed node’s copy will fail. Decommissioning must complete data streaming before the node is removed from the ring.

Membership vs Consensus: Different Consistency Models #

A critical distinction:

Gossip membership is eventually consistent. Different nodes may have different views of cluster membership at any given moment. This is acceptable because:

  • Temporary inconsistency in membership view does not cause data corruption
  • The consequence is that a node may route a request to a node that recently failed — the request fails and is retried
  • Eventually (within seconds), all nodes converge on the same membership view

Raft/Paxos-based membership (etcd, ZooKeeper) is strongly consistent. Membership changes are replicated through the consensus log and are linearizable — all nodes see the same membership state at the same logical time. This is required for systems where even a brief split in membership view could cause two leaders to be elected (which would be catastrophic).

The choice between eventual and strong consistency for membership depends on the coordination mechanism built on top of it:

  • If membership drives quorum calculation in a database: eventual consistency for membership is acceptable (Cassandra)
  • If membership drives leader election in a consensus system: strong consistency for membership is required (etcd, ZooKeeper)

Failure Modes Summary #

| Failure | Mechanism affected | Symptom | Mitigation |
|---|---|---|---|
| GC pause exceeds timeout | Fixed-timeout failure detection | False positive — healthy node marked dead | Switch to Phi accrual failure detector |
| Seed nodes unavailable | Gossip bootstrap | New nodes cannot join cluster | Seed nodes per AZ; restart seeds first |
| Gossip network partition | Membership state | Different sides have different cluster views | Wait for partition to heal; do not simultaneously decommission nodes |
| Membership change during quorum operation | Quorum size calculation | Write succeeds with stale quorum size | Use joint consensus (Raft) or complete data streaming before removal (Cassandra) |
| Cascading failure detection | Rebalancing | Many nodes declared dead simultaneously triggers mass rebalancing | Rate-limit decommissioning; pause rebalancing during incidents |