Skip to main content
  1. Distributed Coordination: The Hidden Component/

Convergence — CRDTs and Collaborative Editing

Convergence — CRDTs and Collaborative Editing #

Convergence is the coordination mechanism for systems where replicas accept writes independently — without prior coordination — and must eventually reach identical state. Unlike Agreement (all replicas agree before committing), Convergence allows temporary divergence and resolves it after the fact.

The correctness requirements:

Strong Eventual Consistency (SEC): any two replicas that have received the same set of updates will be in the same state, regardless of the order in which updates were received. Availability: writes succeed even when other replicas are unreachable.

The tradeoff: in exchange for availability and partition tolerance, the application must define a merge function that resolves conflicting concurrent updates without human intervention.

Why Merge Functions Are Hard #

Consider two users concurrently editing a counter:

  • Replica A: counter = 5, user increments by 3 → counter = 8
  • Replica B: counter = 5, user decrements by 2 → counter = 3

When the two replicas synchronize, what is the correct value? 6 (5 + 3 - 2)? 8 (A wins)? 3 (B wins)? Without additional metadata, the system cannot determine which is correct.

The answer depends on the data type and the intended semantics. CRDTs (Conflict-free Replicated Data Types) encode the semantics into the data structure itself, making the merge function unambiguous.

State-Based CRDTs (CvRDTs) #

A state-based CRDT (Convergent Replicated Data Type) works by:

  1. Each replica maintains its full local state.
  2. Periodically, replicas exchange states (gossip).
  3. On receiving a remote state, the replica merges it with its local state using a join operation.
  4. The join must form a join-semilattice: it must be commutative, associative, and idempotent.

The semilattice property guarantees convergence: regardless of the order or number of state exchanges, all replicas converge to the same state.

G-Counter (Grow-Only Counter) #

CRDT counter: element-wise max merge

A G-Counter allows each replica to increment its own counter independently. Each replica maintains a vector of counters, one per replica:

class GCounter:
    def __init__(self, node_id: str, num_nodes: int):
        self.node_id = node_id
        self.counters = [0] * num_nodes  # one slot per replica

    def increment(self, by: int = 1):
        node_index = int(self.node_id.split('_')[1])
        self.counters[node_index] += by

    def value(self) -> int:
        return sum(self.counters)

    def merge(self, other: 'GCounter') -> 'GCounter':
        result = GCounter(self.node_id, len(self.counters))
        result.counters = [max(a, b) for a, b in zip(self.counters, other.counters)]
        return result

Why element-wise max works: each slot belongs to exactly one replica. Only that replica increments its slot. The element-wise max is always correct — if replica A’s slot is 3 and replica B’s knowledge of A’s slot is 2, A has made increments that B hasn’t seen. Taking the max (3) captures those increments.

Convergence proof: the merge is commutative (max(A,B) = max(B,A)), associative (max(max(A,B),C) = max(A,max(B,C))), and idempotent (max(A,A) = A). The state only moves “up” in the lattice (counters only increase). All replicas converge to the same maximum values.

PN-Counter (Increment and Decrement) #

A PN-Counter extends the G-Counter to support both increments and decrements by maintaining two G-Counters: one for increments (P) and one for decrements (N):

class PNCounter:
    def __init__(self, node_id: str, num_nodes: int):
        self.p = GCounter(node_id, num_nodes)  # Positive (increments)
        self.n = GCounter(node_id, num_nodes)  # Negative (decrements)

    def increment(self, by: int = 1):
        self.p.increment(by)

    def decrement(self, by: int = 1):
        self.n.increment(by)

    def value(self) -> int:
        return self.p.value() - self.n.value()

    def merge(self, other: 'PNCounter') -> 'PNCounter':
        result = PNCounter(self.p.node_id, len(self.p.counters))
        result.p = self.p.merge(other.p)
        result.n = self.n.merge(other.n)
        return result

The drawback: the value can go negative. If increments from one partition and decrements from another partition are both large, the merged value may be negative even if the intended semantics are “count should be ≥ 0.” Applications must handle or constrain this.

LWW-Register (Last-Write-Wins) #

An LWW-Register stores a single value with a timestamp. Concurrent writes resolve by taking the write with the highest timestamp:

class LWWRegister:
    def __init__(self, value=None, timestamp=0):
        self.value = value
        self.timestamp = timestamp

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp

    def merge(self, other: 'LWWRegister') -> 'LWWRegister':
        if other.timestamp > self.timestamp:
            return LWWRegister(other.value, other.timestamp)
        return LWWRegister(self.value, self.timestamp)

The timestamp problem: LWW-Register depends on clocks. If two concurrent writes have the same timestamp (wall clock collision), the merge is non-deterministic. Mitigation: use hybrid logical clocks (HLC) or vector clocks as timestamps.

Use in Cassandra: Cassandra uses LWW semantics for column updates by default. Each column value is stored with a microsecond timestamp. On read, if two replicas have different values for the same column, the value with the higher timestamp wins. Clock drift between nodes can cause the “wrong” write to win if the clocks are not synchronized.

OR-Set (Observed-Remove Set) #

A naïve 2P-Set (two-phase set) supports add and remove but only allows an element to be removed once — after removal, it can never be re-added. The OR-Set (Observed-Remove Set) allows re-addition:

Each element in the set is paired with a unique tag (UUID). Adding an element creates a (element, tag) pair. Removing an element removes all known (element, tag) pairs at the time of the remove operation.

class ORSet:
    def __init__(self):
        self.added = set()    # Set of (element, tag) pairs
        self.removed = set()  # Set of (element, tag) pairs

    def add(self, element):
        tag = uuid.uuid4()
        self.added.add((element, tag))

    def remove(self, element):
        # Remove all currently known tags for this element
        to_remove = {(e, t) for (e, t) in self.added if e == element}
        self.removed.update(to_remove)

    def contains(self, element) -> bool:
        active = self.added - self.removed
        return any(e == element for (e, _) in active)

    def merge(self, other: 'ORSet') -> 'ORSet':
        result = ORSet()
        result.added = self.added | other.added
        result.removed = self.removed | other.removed
        return result

Concurrent add and remove: if replica A removes element X (removing all its tags) while replica B concurrently adds element X (creating a new tag), after merge: the new add tag is in added but not in removed (the remove only captured tags known at remove time). X is in the set. The add “wins” over the concurrent remove — by design.

The semantics choice: “add wins over concurrent remove” is one policy. “Remove wins” is another. OR-Sets encode “add wins.” There is no universally correct policy; the application must choose.

LWW-Element-Set and Shopping Cart #

The LWW-Element-Set stores the last add or remove timestamp per element. An element is in the set if the latest timestamp for that element is an “add”:

class LWWElementSet:
    def __init__(self):
        self.adds = {}     # element -> timestamp
        self.removes = {}  # element -> timestamp

    def add(self, element, timestamp):
        if timestamp > self.adds.get(element, 0):
            self.adds[element] = timestamp

    def remove(self, element, timestamp):
        if timestamp > self.removes.get(element, 0):
            self.removes[element] = timestamp

    def contains(self, element) -> bool:
        add_ts = self.adds.get(element, 0)
        rem_ts = self.removes.get(element, 0)
        return add_ts > rem_ts  # Or >= for "add wins on tie"

    def merge(self, other: 'LWWElementSet') -> 'LWWElementSet':
        result = LWWElementSet()
        for elem, ts in {**self.adds, **other.adds}.items():
            result.adds[elem] = max(self.adds.get(elem, 0), other.adds.get(elem, 0))
        for elem, ts in {**self.removes, **other.removes}.items():
            result.removes[elem] = max(self.removes.get(elem, 0), other.removes.get(elem, 0))
        return result

Amazon’s Dynamo paper used a shopping cart implemented as an OR-Set-like structure. Each cart item add/remove was tagged, and conflicting carts were merged by the client. The system allowed customers to see items they had removed (false adds) rather than lose items they had added (false removes) — an intentional business decision.

Operation-Based CRDTs (CmRDTs) #

An operation-based CRDT (Commutative Replicated Data Type) works differently:

  1. Instead of exchanging full state, replicas exchange operations.
  2. Operations must be commutative (apply(op_A, apply(op_B, state)) = apply(op_B, apply(op_A, state))).
  3. The delivery layer must ensure each operation is delivered exactly once (at-least-once + deduplication).

Op-based CRDTs are more efficient for large states (send only the delta, not the full state) but require a reliable broadcast layer.

class OpBasedCounter:
    def __init__(self):
        self.value = 0
        self.applied = set()  # Seen operation IDs

    def apply_increment(self, op_id: str, delta: int):
        if op_id in self.applied:
            return  # Deduplication
        self.value += delta
        self.applied.add(op_id)

    def apply_decrement(self, op_id: str, delta: int):
        if op_id in self.applied:
            return
        self.value -= delta
        self.applied.add(op_id)

The applied set is the deduplication mechanism — this is Suppression (Chapter 8) in service of Convergence.

Operational Transforms: Google Docs #

CRDTs are not the only approach to Convergence. Operational Transforms (OT) are an older technique (Ellis & Gibbs, 1989) used in collaborative text editing — including Google Docs.

The Problem #

Two users edit the same document concurrently:

Initial: "Hello World"
User A (position 6): insert " Beautiful" → "Hello Beautiful World"
User B (position 0): delete "Hello" → " World"

If User A’s insert is applied first, then User B’s delete must be adjusted. If “Hello” is deleted first, User A’s insert position has shifted.

Without adjustment: both operations applied naively produce “Hello World” (User A’s insert at 6 gets applied to the already-modified document at the wrong position).

The Transform Function #

OT defines a transform(op_a, op_b) function that adjusts op_a assuming op_b has already been applied:

def transform_insert_after_delete(insert_op, delete_op):
    """Adjust insert position assuming delete has already been applied."""
    if delete_op.position < insert_op.position:
        # Delete shifted our position left
        return InsertOp(
            position=insert_op.position - delete_op.length,
            text=insert_op.text
        )
    return insert_op  # Delete was after our position; no adjustment needed

def transform_delete_after_insert(delete_op, insert_op):
    """Adjust delete position assuming insert has already been applied."""
    if insert_op.position <= delete_op.position:
        # Insert shifted our position right
        return DeleteOp(
            position=delete_op.position + len(insert_op.text),
            length=delete_op.length
        )
    elif insert_op.position < delete_op.position + delete_op.length:
        # Insert is inside the delete range; extend delete range
        return DeleteOp(
            position=delete_op.position,
            length=delete_op.length + len(insert_op.text)
        )
    return delete_op

Server-Ordered OT (Google Docs) #

The challenge with OT is defining the transform function for all pairs of operation types and proving that the transform satisfies the convergence conditions (TP1 and TP2). This is mathematically complex and error-prone.

Google Docs avoids this complexity by using a server as the total order authority:

  1. All clients send operations to the server.
  2. The server assigns each operation a sequential revision number.
  3. The server transforms each operation against all operations that have been committed at a higher revision.
  4. Clients receive operations from the server in revision order.

This reduces OT to a simpler single-server problem: the server only needs to transform each incoming operation against the operations that have arrived since the client’s last known revision. The transform function is simpler because the server is the single source of truth for ordering.

Client A sends: op_A at client revision 10
Server has: op_X (revision 11), op_Y (revision 12) committed since rev 10
Server: transform(op_A, op_X), transform(result, op_Y) → op_A'
Server commits op_A' as revision 13
Server broadcasts op_A' to all clients

Client A receives op_A’ (its own operation, transformed) and updates its revision to 13. Other clients receive the transformed op_A’ and apply it.

Why not CRDTs for Google Docs? CRDTs for text editing (e.g., YATA, RGA — Replicated Growable Array) have been developed and are used in Yjs, Automerge, and Figma. They support peer-to-peer collaboration without a server. Google Docs predates these developments and uses a server for simplicity and as a scaling point. The server also handles access control, version history, and conflict resolution UI — which are easier with a central coordinator.

CRDT Use Cases in Production #

Riak: uses a CRDT data model for its key-value store. Maps (nested CRDTs), Sets (OR-Sets), Counters (PN-Counters), Flags (LWW-Flags), and Registers (LWW-Registers) are first-class Riak data types.

Redis: Redis CRDT module (part of Redis Enterprise) provides CRDT-based replicated data types for multi-region active-active configurations.

Figma multiplayer: Figma uses CRDTs (specifically an OT-CRDT hybrid) for its collaborative design tool. Design objects are CRDTs; text editing uses OT within text elements.

Distributed caches: cache invalidation with OR-Sets — adding and removing cache entries from multiple nodes without coordination, converging to the correct state.

Convergence Properties Summary #

CRDT TypeOperationsMergeNotes
G-Counterincrementelement-wise maxNo decrement
PN-Counterincrement, decrementelement-wise max (P and N separately)Can go negative
LWW-Registerwritemax-timestamp winsRequires synchronized clocks
OR-Setadd, removeunion of tags, union of removesAdd wins over concurrent remove
LWW-Element-Setadd, removemax-timestamp per elementAdd or remove wins based on timestamp
RGA / YATAinsert, deletecausal ordering by unique IDUsed for collaborative text editing

The key insight: any data type that supports monotonically growing state with an element-wise max merge is a state-based CRDT. The art of CRDT design is modeling the desired semantics as a lattice where state only moves “up.”