Google Docs: System Design #
A real-time collaborative document editor. Multiple users edit the same document simultaneously; every peer converges to identical state without a locking protocol. Grounded in the Yjs CRDT library (YATA algorithm), Automerge’s causal-hash sync protocol, and ShareDB’s Operational Transformation pipeline.
1. Requirements #
Functional Requirements #
- Concurrent editing — multiple users edit the same document simultaneously; all peers converge to the same final state
- Real-time sync — changes propagate to all connected peers within ~100ms
- Offline editing — a client can edit while disconnected; changes merge correctly on reconnect
- Persistent storage — document state survives server restarts; full edit history queryable
- Presence and awareness — show cursor positions, selections, and online status of collaborators in real time
- Rich text — bold, italic, headings, links, lists, embedded images, comments
- Undo / redo — per-user undo that doesn’t revert other users’ changes
- Comments and suggestions — threaded comments anchored to document ranges; suggestion mode (tracked changes)
- Access control — owner, editor, commenter, viewer roles; link-sharing
- Version history — restore named versions; view document at any past point in time
Non-Functional Requirements #
- Availability — 99.99% uptime; document state must not be lost even on server crash mid-edit
- Latency — local keystrokes applied immediately (optimistic); peer propagation P99 < 150ms
- Consistency — strong eventual consistency: all peers that have received the same set of operations arrive at identical state, regardless of order of arrival
- Scalability — single document supports up to ~100 concurrent editors; millions of documents per cluster
- Storage — full operation log retained indefinitely; snapshot for fast load
Capacity Estimation #
| Metric | Estimate |
|---|---|
| Active documents | 1B+ total; ~50M with active sessions at peak |
| Concurrent editors per doc | Avg 2–3; max ~100 |
| Operation rate per active doc | ~5 ops/sec at peak typing speed |
| Peak cluster write rate | 50M docs × 5 ops/sec = ~250M ops/sec (distributed across doc shards) |
| Avg document size | ~50KB (text + formatting) |
| Operation log storage | ~1KB/op × 1M ops/doc for heavily edited docs |
| Presence updates | ~10 updates/sec per active user (cursor moves) |
Presence updates are the highest-frequency traffic and must be handled separately from document mutations — they are ephemeral (not persisted) and can be lost without correctness impact.
2. Core Entities #
Item (Yjs) #
Everything inserted into a Yjs document — a character, a formatting marker, an embedded object — is represented as an Item:
// From yjs/src/structs/Item.js
class Item extends AbstractStruct {
constructor(id, left, origin, right, rightOrigin, parent, parentSub, content) {
// id: { client: uint53, clock: uint } — Lamport timestamp, globally unique
this.id = id
this.origin = origin // ID | null: the item that was to our LEFT at insert time
this.left = left // Item | null: current left neighbor (may differ from origin after merges)
this.right = right // Item | null: current right neighbor
this.rightOrigin = rightOrigin // ID | null: the item that was to our RIGHT at insert time
this.parent = parent // YType | ID | string: the containing shared type
this.parentSub = parentSub // string | null: map key if parent is YMap
this.content = content // AbstractContent: the actual data
// info bitfield:
// bit1: keep (do not GC even if deleted)
// bit2: countable (contributes to length)
// bit3: deleted
// bit4: fast-search marker
this.info = this.content.isCountable() ? binary.BIT2 : 0
}
}
origin and rightOrigin are the causal anchors: they record the left and right neighbors at the time of insertion, not the current neighbors. This is the information that enables conflict resolution when concurrent inserts happen at the same position — the YATA algorithm uses these IDs to deterministically order conflicting inserts across all peers.
The id = { client, clock } is a Lamport timestamp. client is a random 53-bit integer assigned when the Doc is created (generateNewClientId()). clock increments for each inserted item. Deletions do not increment the clock — they are handled as a separate state-based CRDT (a delete set), not as operations.
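A minimal sketch of this ID scheme (illustrative helper names, not the actual Yjs API) — each peer owns a random client ID and a clock that increments once per inserted item:

```javascript
// Sketch of Yjs-style Lamport IDs. Names here are illustrative.
function createPeer() {
  return {
    clientID: Math.floor(Math.random() * 2 ** 53), // random integer, like generateNewClientId()
    clock: 0,
    nextId() {
      // Each inserted item consumes the next clock value
      return { client: this.clientID, clock: this.clock++ }
    }
  }
}

const peer = createPeer()
const a = peer.nextId()
const b = peer.nextId()
// IDs from one peer are totally ordered by clock; IDs across peers are
// globally unique because clientID collisions are negligibly likely.
```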
Doc (Yjs) #
// From yjs/src/utils/Doc.js
class Doc extends ObservableV2 {
constructor(opts) {
this.clientID = generateNewClientId() // random uint53: this peer's identity
this.guid = opts.guid ?? uuidv4() // document identity (stable across sessions)
this.gc = opts.gc ?? true // garbage collect deleted items?
this.share = new Map() // name → YType (the shared types)
this.store = new StructStore() // all Items indexed by (client, clock)
this._transaction = null // currently active transaction
}
transact(f, origin, local) { ... } // all mutations must happen inside a transaction
get(key, typeName) { ... } // get or create a named shared type
}
share is the document’s named root types: ydoc.get('content') returns a YText (or YArray, YMap) that all peers share. Multiple calls return the same object.
StructStore (Yjs) #
// From yjs/src/utils/StructStore.js
class StructStore {
constructor() {
// Map<clientID, Array<GC|Item|Skip>> — items per client in clock order
this.clients = new Map()
this.pendingStructs = null // updates waiting for missing causal deps
this.pendingDs = null // pending delete sets
}
getClock(client) { ... } // next expected clock for this client
}
// The state vector: Map<clientID, clock> — "what is the highest clock I have seen from each client"
export const getStateVector = store => {
const sm = new Map()
store.clients.forEach((structs, client) => {
const last = structs[structs.length - 1]
sm.set(client, last.id.clock + last.length)
})
return sm
}
The state vector is the compact summary of what a peer has received: { clientA: 150, clientB: 42 } means “I have the first 150 operations from A and the first 42 from B.” Two peers exchange state vectors to compute exactly which operations each one is missing from the other. No full document scan required.
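The diff computation both peers perform can be sketched with plain maps (the real protocol encodes the result as a binary update; the range list here is an illustrative simplification):

```javascript
// Given my state vector and a remote one, compute which (client, clock)
// ranges the remote peer is missing. Sketch only — real Yjs encodes the
// answer as a binary update, not a range list.
function computeMissing(localSV, remoteSV) {
  const missing = []
  for (const [client, localClock] of localSV) {
    const remoteClock = remoteSV.get(client) ?? 0
    if (localClock > remoteClock) {
      missing.push({ client, from: remoteClock, to: localClock }) // ops [from, to)
    }
  }
  return missing
}

const mine = new Map([['A', 150], ['B', 100], ['C', 50]])
const theirs = new Map([['A', 150], ['B', 42]])
const diff = computeMissing(mine, theirs)
// → B's ops 42..100, plus all 50 of C's ops; nothing from A
```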
Transaction (Yjs) #
// From yjs/src/utils/Transaction.js
class Transaction {
constructor(doc, origin, local) {
this.doc = doc
this.deleteSet = createIdSet() // items deleted in this transaction
this.insertSet = createIdSet() // items inserted in this transaction
this.changed = new Map() // YType → Set<parentSub>: which types were modified
this._beforeState = null // state vector before transaction started
this._afterState = null // state vector after transaction committed
this.origin = origin // who triggered this transaction (for loop detection)
this.local = local // true = originated on this peer
}
}
All mutations are wrapped in a transaction. On commit, the transaction generates a binary update message — the set of new items plus the delete set — and broadcasts it to connected peers via the doc.on('update', ...) event. Remote updates arrive through Y.applyUpdate(doc, update), which wraps them in a transaction with local=false.
Change (Automerge) #
Automerge groups operations into Change objects — atomic units of history:
// From automerge/rust/automerge/src/change.rs
pub struct Change {
stored: StoredChange<'static, Verified>, // columnar-encoded operations
len: usize, // number of operations in this change
}
impl Change {
pub fn actor_id(&self) -> &ActorId { ... } // who made this change
pub fn start_op(&self) -> NonZeroU64 { ... } // first operation counter in this change
pub fn max_op(&self) -> u64 {
self.stored.start_op().get() + (self.len as u64) - 1
}
pub fn message(&self) -> Option<&String> { ... } // optional commit message
}
Each Change has a ChangeHash (SHA-256 of its encoded bytes). Changes reference their dependencies: Vec<ChangeHash> — the set of changes that must have been applied before this one. This forms a causal DAG (directed acyclic graph): changes are a partial order, not a total order. Two changes from different actors with no dependency relationship are concurrent.
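Concurrency detection over this DAG reduces to ancestor reachability — a sketch using hypothetical plain-object changes, not Automerge's actual types:

```javascript
// Two changes are concurrent iff neither is an ancestor of the other in
// the dependency DAG. changes: Map<hash, { deps: [hash] }>.
function isAncestor(changes, a, b) {
  // Is `a` an ancestor of `b`? Walk b's deps transitively.
  const stack = [...changes.get(b).deps]
  const seen = new Set()
  while (stack.length > 0) {
    const h = stack.pop()
    if (h === a) return true
    if (seen.has(h)) continue
    seen.add(h)
    stack.push(...changes.get(h).deps)
  }
  return false
}

function concurrent(changes, a, b) {
  return !isAncestor(changes, a, b) && !isAncestor(changes, b, a)
}

// h2 and h3 both depend on h1 but not on each other → concurrent
const changes = new Map([
  ['h1', { deps: [] }],
  ['h2', { deps: ['h1'] }],
  ['h3', { deps: ['h1'] }]
])
```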
Snapshot (ShareDB) #
// From sharedb/lib/snapshot.js (inferred from backend usage)
{
id: string, // document ID
v: number, // version number (monotonically increasing integer)
type: string, // OT type URI (e.g., 'http://sharejs.org/types/rich-text')
data: any, // current document content
m: object // metadata (ctime, mtime)
}
In ShareDB’s OT model, every submitted operation targets a specific version v. If the server’s snapshot is at version v and the client submits an op for version v, it applies directly. If the snapshot has advanced to v+k, the server fetches the intervening k ops and transforms the submitted op through each of them before applying.
3. API / System Interface #
WebSocket (real-time sync) #
The primary interface for collaborative editing. All session traffic — document operations, awareness, presence — flows over a single persistent WebSocket connection per client.
Yjs sync protocol messages (binary encoded):
SyncStep1: { type: 0, stateVector: Uint8Array }
// "Here is my state vector — tell me what I'm missing"
SyncStep2: { type: 1, update: Uint8Array }
// "Here are the operations you're missing (diff of your sv vs mine)"
Update: { type: 2, update: Uint8Array }
// "I made a change — apply this update"
QueryAwareness: { type: 3, clientIDs: number[] }
// Request awareness state for specific clients
Awareness: { type: 4, added: [], updated: [], removed: [], states: Map }
// Cursor positions, user names, colors — ephemeral, not persisted
The sync protocol is intentionally separated from the transport. Yjs defines the message format; providers (y-websocket, y-webrtc, y-dat) implement the transport.
ShareDB Wire Protocol #
ShareDB uses a JSON-over-WebSocket protocol. Key messages:
// Client → Server: subscribe to document updates
{ a: 'sub', c: 'documents', d: 'doc-id', v: 42 }
// Client → Server: submit an operation
{ a: 'op', c: 'documents', d: 'doc-id', v: 42, op: [...] }
// Server → Client: operation from another client
{ a: 'op', c: 'documents', d: 'doc-id', v: 43, op: [...], src: 'client-xyz' }
// Server → Client: acknowledge submitted op
{ a: 'ack', v: 43 }
The v field is the critical coordination mechanism — it is the document version the operation was written against. The server transforms the op to the current snapshot version if they differ.
REST API (document management) #
POST /api/documents/ # create document
GET /api/documents/{id}/ # fetch document (current snapshot)
GET /api/documents/{id}/history # list versions
GET /api/documents/{id}/history/{version} # snapshot at version
PATCH /api/documents/{id}/permissions # update ACL
Awareness Protocol (presence) #
Awareness is handled outside the document CRDT — it’s ephemeral state (cursor position, user color, online status) that does not need to persist:
// Each client publishes its own awareness state:
awareness.setLocalState({
user: { name: 'Alice', color: '#ff6b6b' },
cursor: { anchor: { type: 'text', index: 1523 }, head: { type: 'text', index: 1530 } }
})
// Remote peers receive state updates via the Awareness protocol message (type 4)
// States expire if no heartbeat is received within 30 seconds
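A sketch of the expiry mechanic (hypothetical helper, far simpler than the y-protocols Awareness implementation):

```javascript
// Awareness states are kept per client with a lastSeen timestamp and
// dropped after a 30s heartbeat timeout. Illustrative only.
const TIMEOUT_MS = 30_000

function pruneAwareness(states, now) {
  const removed = []
  for (const [clientID, s] of states) {
    if (now - s.lastSeen > TIMEOUT_MS) {
      states.delete(clientID)
      removed.push(clientID)
    }
  }
  return removed // broadcast as { removed: [...] } to remaining peers
}

const states = new Map([
  [1, { user: 'Alice', lastSeen: 0 }],
  [2, { user: 'Bob', lastSeen: 25_000 }]
])
const removed = pruneAwareness(states, 40_000)
// Alice (idle 40s) is dropped; Bob (idle 15s) survives
```

Because nothing here is persisted, a dropped state is simply re-published on the client's next heartbeat — losing it has no correctness impact.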
Standards Reference #
| Standard / Spec | Body | Scope |
|---|---|---|
| YATA (Yet Another Transformation Approach) | Nicolaescu et al. 2016 | The CRDT algorithm implemented by Yjs |
| Automerge sync protocol | arxiv.org/abs/2012.00472 | Bloom-filter-based change set reconciliation |
| OT (Operational Transformation) | Ellis & Gibbs 1989; Jupiter protocol 1995 | Foundation of ShareDB’s transform model |
| JSON-OT / rich-text OT | ottypes/rich-text | OT type for rich text (used by ShareDB) |
| WebSocket RFC 6455 | IETF | Transport for real-time sync |
| CRDT survey | Shapiro et al. 2011 | Theoretical foundation for CRDTs |
4. Data Flow #
Local edit path (optimistic) #
User types a character
│
▼
Rich text editor (ProseMirror / Quill / TipTap)
│ generates local transaction
▼
ydoc.transact(f, origin='local')
│ creates Item{id: {clientID, clock++}, origin: leftNeighbor.id, ...}
│ calls item.integrate(transaction) → linked into document list
│ generates binary update message
▼
Local state updated immediately (optimistic apply)
│
▼
Update message broadcast to server + peers via WebSocket
Remote update path (apply + integrate) #
WebSocket message arrives: { type: Update, update: Uint8Array }
│
▼
Y.applyUpdate(ydoc, update)
│ decode update: extract new Items + DeleteSet
│ for each Item:
│ check causal dependencies (item.origin must already exist in store)
│ if deps missing: add to pendingStructs, wait
│ if deps present: item.integrate(transaction) → conflict resolution
▼
Observers fire: ydoc.on('update', ...) → forward to other connected peers
│
▼
UI re-renders with merged state
ShareDB submit path (server-coordinated OT) #
Client submits op at version v
│ { a: 'op', v: 42, op: [...] }
▼
Server Agent receives op
│
▼
SubmitRequest.submit()
│ fetch current snapshot from db
│ if snapshot.v === op.v: apply directly (no transform needed)
│ if snapshot.v > op.v:
│ db.getOpsToSnapshot(from=op.v, to=snapshot.v) ← fetch intervening ops
│ for each intervening op:
│ op.op = type.transform(op.op, intervening.op, 'left')
│ op.v = snapshot.v
▼
ot.apply(snapshot, op) → new snapshot
│
▼
db.commit(snapshot, op)
│
▼
pubsub.publish(docId, op) → broadcast to all subscribed agents
│
▼
Client receives ack { a: 'ack', v: 43 }
Other clients receive op { a: 'op', v: 43, op: [...], src: 'submitter' }
5. High Level Design #
┌─────────────────────────────────────────────────────────────┐
│ Browser Clients │
│ Rich text editor (ProseMirror / Quill / TipTap) │
│ ydoc (Yjs Doc) + awareness │
└──────────────┬─────────────────────────┬────────────────────┘
│ WebSocket (binary Yjs) │ REST (doc mgmt)
┌──────────────▼─────────────────────────▼────────────────────┐
│ Collaboration Server │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Sync Handler │ │ Awareness Hub │ │
│ │ │ │ │ │
│ │ SyncStep1/2 │ │ ephemeral state │ │
│ │ Update forward │ │ cursor fanout │ │
│ │ state vector │ │ 30s TTL │ │
│ └──────────┬───────┘ └──────────────────┘ │
│ │ │
│ ┌──────────▼───────────────────────────────────┐ │
│ │ Document Room (per doc) │ │
│ │ connected clients list │ │
│ │ in-memory Yjs doc (optional hot cache) │ │
│ └──────────┬───────────────────────────────────┘ │
└─────────────┼───────────────────────────────────────────────┘
│
┌─────────────▼───────────────────────────────────────────────┐
│ Persistence Layer │
│ │
│ Document Store (PostgreSQL / Firestore) │
│ full Yjs update log (binary encoded) │
│ periodic snapshots (encoded ydoc state) │
│ │
│ Snapshot Cache (Redis) │
│ encoded ydoc state for hot documents │
│ TTL: evict after 1h of inactivity │
│ │
│ Object Storage (S3) │
│ large documents, attachments, version archives │
└─────────────────────────────────────────────────────────────┘
Horizontal scaling: Each document is handled by one collaboration server node at a time (the “room” lives on one node). When a node receives a connection for a document that is resident on another node, it either:
- Routes via a load balancer using consistent hashing on doc ID, or
- Forwards updates through a pub/sub bus (Redis) so any node can serve any document
For Google’s scale, documents are sharded across thousands of nodes. Within a node, a document room fans out updates to all connected WebSocket sessions.
6. Deep Dives #
Deep Dive 1: OT vs CRDT — The Fundamental Choice #
Both Operational Transformation and CRDTs solve the same problem: how do two concurrent edits to the same document converge to the same result? They make different trade-offs.
Operational Transformation (OT) — ShareDB’s approach:
// From sharedb/lib/ot.js
exports.transform = function(type, op, appliedOp) {
// Two concurrent ops: op (ours) and appliedOp (theirs, already committed)
// Transform op as if appliedOp had already been applied:
if ('op' in appliedOp && 'op' in op) {
op.op = type.transform(op.op, appliedOp.op, 'left')
// 'left' = tie-breaking side: if both ops insert at the same position,
// 'left' wins (goes first). The other peer uses 'right'.
}
if (op.v != null) op.v++ // advance the version
}
OT requires a central server that serializes all operations. The server is the single source of truth for operation ordering. When a client’s op arrives at version v but the server is at version v+k, the server fetches the k intervening ops and transforms the incoming op through each one. SubmitRequest.prototype._transformOp() implements this:
// From sharedb/lib/submit-request.js
SubmitRequest.prototype._transformOp = function(ops) {
for (var i = 0; i < ops.length; i++) {
var err = ot.transform(type, this.op, ops[i])
if (err) return err
}
}
The OT invariant: apply(apply(doc, A), transform(B, A)) === apply(apply(doc, B), transform(A, B)) — both orderings of concurrent ops A and B must produce the same result.
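The invariant can be checked mechanically for a toy insert-only OT type (a deliberately minimal sketch, not ShareDB's rich-text type):

```javascript
// Toy OT type: ops are { pos, ch } single-character inserts into a string.
function apply(doc, op) {
  return doc.slice(0, op.pos) + op.ch + doc.slice(op.pos)
}

// transform(a, b, side): shift `a` right if `b` inserted before it;
// on a position tie, 'left' wins (stays put) and 'right' yields.
function transform(a, b, side) {
  if (b.pos < a.pos || (b.pos === a.pos && side === 'right')) {
    return { pos: a.pos + 1, ch: a.ch }
  }
  return a
}

// Verify: apply(apply(doc, A), transform(B, A)) === apply(apply(doc, B), transform(A, B))
const doc = 'abc'
const A = { pos: 1, ch: 'X' } // the 'left' peer
const B = { pos: 1, ch: 'Y' } // the 'right' peer
const viaA = apply(apply(doc, A), transform(B, A, 'right'))
const viaB = apply(apply(doc, B), transform(A, B, 'left'))
// Both application orders converge on the same string
```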
The OT problem: maintaining this invariant for complex rich-text operations across more than two peers is extremely hard. The Jupiter protocol (1995, Google’s OT basis) achieves this by routing all ops through a central server, eliminating the multi-peer transformation problem. The server linearizes history; clients only transform against the server’s log.
CRDT approach (Yjs):
CRDTs embed conflict resolution in the data structure itself. No transformation step needed. The Item.integrate() method in Yjs resolves conflicts at the point of insertion:
// From yjs/src/structs/Item.js — the YATA conflict resolution algorithm
integrate(transaction, offset) {
if (this.parent) {
// Find the correct position to insert this item.
// Conflict: another peer's item o was also inserted after the same origin.
// Resolution: scan right from origin, collecting conflicting items.
// Stop when: o.origin is to the LEFT of our origin (we win by causality)
// OR o.id < this.id (tie-break by client ID)
let left = this.left
let o = left !== null ? left.right : parent._start
const conflictingItems = new Set()
const itemsBeforeOrigin = new Set()
while (o !== null && o !== this.right) {
itemsBeforeOrigin.add(o)
conflictingItems.add(o)
if (compareIDs(this.origin, o.origin)) {
if (o.id.client < this.id.client) { left = o }
} else if (itemsBeforeOrigin.has(store.getItem(o.origin))) {
left = o
}
o = o.right
}
// Insert this item after left
this.left = left
...
}
}
The YATA algorithm guarantees: any two peers that apply the same set of Items in any order arrive at the same linked list. No server required. Peers can sync peer-to-peer.
Trade-off summary:
| Dimension | OT (ShareDB) | CRDT (Yjs) |
|---|---|---|
| Server requirement | Central server mandatory | Optional (P2P possible) |
| Conflict resolution | Server transforms ops | Local algorithm on each peer |
| History | Linear version numbers | Causal DAG (partial order) |
| Complexity | Simple per-op transform; hard to prove correct for N peers | Complex integration algorithm; provably convergent by construction |
| Offline support | Limited (needs server on reconnect) | First-class (merge on reconnect) |
| Undo semantics | Easy (invert op, transform forward) | Hard (must un-apply specific items) |
| Google Docs actual choice | Original implementation used OT (Jupiter); later moved to a hybrid | Modern editors (Notion, Linear) use CRDT |
Deep Dive 2: YATA — The Yjs Conflict Resolution Algorithm #
The core invariant Yjs must maintain: given any two concurrent insertions at the same position, every peer must order them identically.
The problem: Alice and Bob both type after the character at position 5. Alice inserts ‘X’, Bob inserts ‘Y’. On Alice’s machine: ...5, X, .... On Bob’s: ...5, Y, .... When these updates merge, which comes first: X or Y? And critically: every peer must make the same choice.
Yjs’s solution: Each Item records its origin (left neighbor at insert time) and rightOrigin (right neighbor at insert time). These are stable IDs — they don’t change as the document evolves.
The integrate() method scans right from the origin position, collecting “conflicting” items — items from other peers that also claim the same origin. It orders them using a deterministic rule:
For a conflicting item o:
if o.origin is to the LEFT of our origin:
→ our item goes BEFORE o (we are more specifically positioned)
elif o.origin == our origin AND o.id.client < this.id.client:
→ our item goes AFTER o (tie-break: lower client ID wins)
Because all peers apply the same deterministic rule, they arrive at the same ordering without communication.
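A drastically simplified integration covering only the same-origin conflict case illustrates the convergence (a sketch, not the full integrate() algorithm, which also handles items with differing origins and uses rightOrigin):

```javascript
function sameId(a, b) {
  if (a === null || b === null) return a === b
  return a.client === b.client && a.clock === b.clock
}

// Insert `item` into `list` (array in document order): start just after
// the origin, then skip concurrent siblings with the same origin and a
// smaller client ID — the deterministic tie-break rule.
function integrate(list, item) {
  let i = item.origin === null
    ? 0
    : list.findIndex(it => sameId(it.id, item.origin)) + 1
  while (i < list.length &&
         sameId(list[i].origin, item.origin) &&
         list[i].id.client < item.id.client) {
    i++
  }
  list.splice(i, 0, item)
}

// Alice (client 1) and Bob (client 2) both insert after the same item
const base = { id: { client: 0, clock: 0 }, origin: null, ch: '5' }
const alice = { id: { client: 1, clock: 0 }, origin: base.id, ch: 'X' }
const bob = { id: { client: 2, clock: 0 }, origin: base.id, ch: 'Y' }

const order1 = [base]; integrate(order1, alice); integrate(order1, bob)
const order2 = [base]; integrate(order2, bob); integrate(order2, alice)
// Both arrival orders produce 5, X, Y — the peers converge
```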
The rightOrigin field handles the case where multiple concurrent inserts happen at the same location, reducing the number of conflicts that need tie-breaking and improving practical performance.
Item storage in two structures (from Yjs INTERNALS.md):
- Linked list in document order: item.left / item.right — the traversal structure for rendering text
- StructStore indexed by (client, clock): binary-searchable array per client — used for sync (finding which items a peer is missing) and for the origin lookup during integrate()
// Fast-search marker: cache of (index, item) pairs
// Yjs maintains up to 80 markers — the most recently looked-up positions
// When inserting at position N, binary search the markers, then linear scan from nearest marker
// This turns O(n) position lookup into O(√n) in practice
Deep Dive 3: Deletions as a State-Based CRDT #
Yjs treats insertions and deletions differently — a deliberate design choice.
Insertions are operation-based: each insert is an Item with a unique ID, broadcast to all peers. They form the linked list.
Deletions are state-based: a deletion just sets a bit on an Item.info field (bit3: deleted). No delete record is added to the StructStore. The deleted item remains in the linked list but is skipped during rendering.
// info bitfield on Item:
// bit3 = deleted
markDeleted() { this.info |= binary.BIT3 }
// Checking deletion:
get deleted() { return (this.info & binary.BIT3) > 0 }
The delete set is propagated as a compact IdSet alongside the insert operations in each update message:
// From yjs/src/utils/Transaction.js
class Transaction {
this.deleteSet = createIdSet() // {clientID → [{clock, length}]} — run-length encoded
}
Why this design: Deletes don’t need causal ordering the way inserts do. If two peers concurrently delete the same character, applying both deletes is idempotent (the bit is already set). If one peer inserts and another deletes the same character, both operations are preserved — the character exists in the linked list (because the insert is in the StructStore) but is marked deleted. This is a “delete wins” semantic: a deletion is never undone by a concurrent operation.
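The idempotence is easy to see in a sketch of the bit manipulation:

```javascript
// Sketch of the deleted bit: applying the same delete twice is a no-op.
const BIT3 = 0b100

const item = { content: 'x', info: 0 }
const markDeleted = it => { it.info |= BIT3 }
const isDeleted = it => (it.info & BIT3) > 0

markDeleted(item) // peer A's delete arrives
markDeleted(item) // peer B's concurrent delete of the same char: idempotent
// The item stays in the list (its insert lives in the StructStore),
// but rendering skips it because the bit is set.
```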
Garbage collection: When gc=true, Yjs eventually replaces deleted Items with lightweight GC structs that only record the client ID and clock range. The content is discarded. This is safe once all peers have received the deletion — the GC struct preserves the causal anchor (the ID range still exists for origin lookups) while freeing the content memory.
Deep Dive 4: State Vector Sync Protocol #
The sync protocol is the mechanism by which two peers discover and exchange missing operations. It uses state vectors — not operation logs.
Step 1 — Exchange state vectors:
// Peer A sends to Peer B:
SyncStep1: { stateVector: { clientA: 150, clientB: 42 } }
// "I have the first 150 ops from A and 42 from B"
Step 2 — Send the diff:
// Peer B receives SyncStep1
// B computes: what does A need?
// B has clientB: 100 → A is missing B's ops 42..100
// B has clientC: 50 → A has no ops from C at all
// B encodes these missing items as a binary update and sends SyncStep2:
SyncStep2: { update: [items from clientB[42..100], items from clientC[0..50]] }
The getStateVector(store) function generates the state vector in O(n_clients) time — just reading the last clock from each client’s array:
// From yjs/src/utils/StructStore.js
export const getStateVector = store => {
const sm = new Map()
store.clients.forEach((structs, client) => {
const last = structs[structs.length - 1]
sm.set(client, last.id.clock + last.length)
})
return sm
}
Pending structs: If a received item’s origin doesn’t exist yet in the store (a causally earlier item hasn’t arrived), the item is added to store.pendingStructs. When a subsequent update arrives and fills in the gap, Yjs retries integrating the pending items.
// From yjs/src/utils/StructStore.js
this.pendingStructs = null // { missing: Map<client, clock>, update: Uint8Array }
This handles network reordering without requiring ordered delivery — WebSocket (TCP) provides ordering in practice, but this makes the protocol robust enough to run over UDP or P2P transports.
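A sketch of dependency-gated integration (a toy retry queue, much simpler than Yjs’s pendingStructs bookkeeping):

```javascript
// An item whose origin hasn't arrived yet is parked in `pending` and
// retried whenever new items land. Illustrative structure only.
function createStore() {
  return { have: new Set(), order: [], pending: [] }
}

function receive(store, item) {
  store.pending.push(item)
  let progress = true
  while (progress) { // keep retrying until no pending item can integrate
    progress = false
    store.pending = store.pending.filter(it => {
      const ready = it.origin === null || store.have.has(it.origin)
      if (ready) {
        store.have.add(it.id)
        store.order.push(it.id)
        progress = true
      }
      return !ready // keep only items still blocked on a missing dep
    })
  }
}

// Deliver out of order: 'b' depends on 'a' but arrives first
const store = createStore()
receive(store, { id: 'b', origin: 'a' })  // parked: origin missing
receive(store, { id: 'a', origin: null }) // integrates 'a', then retries 'b'
// store.order ends up ['a', 'b'] with nothing left pending
```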
Deep Dive 5: Automerge’s Causal Hash DAG and Bloom-Filter Sync #
Automerge takes a different approach to both storage and sync. Where Yjs uses Lamport timestamps and state vectors, Automerge uses content-addressed Change objects and a bloom-filter-based reconciliation protocol.
Change DAG:
Every change in Automerge has a ChangeHash — a SHA-256 of its content. A change records its dependencies: Vec<ChangeHash> — the hashes of all changes that must have been applied before this one. This creates a DAG where edges point from a change to its causal predecessors.
// From automerge/rust/automerge/src/change.rs
pub struct Change {
stored: StoredChange<'static, Verified>,
len: usize,
}
// hash = SHA-256(encoded bytes) — content-addressed, immutable
// deps = changes that causally precede this one
The document’s current state is summarized by its heads — the set of change hashes that have no successors. Two peers that have the same set of heads have identical state.
Sync protocol — bloom filters:
Automerge’s sync protocol (based on arxiv.org/abs/2012.00472) avoids sending the full set of known change hashes, which could be millions of entries. Instead it uses bloom filters:
// From automerge/rust/automerge/src/sync/state.rs
pub struct State {
pub shared_heads: Vec<ChangeHash>, // hashes both peers definitely have
pub last_sent_heads: Vec<ChangeHash>, // heads we last sent
pub their_heads: Option<Vec<ChangeHash>>,
pub their_have: Option<Vec<Have>>, // bloom filter of what they have
pub sent_hashes: BTreeSet<ChangeHash>,
pub in_flight: bool,
}
pub struct Have {
pub last_sync: Vec<ChangeHash>, // their heads at last successful sync
pub bloom: BloomFilter, // all changes they've added since last_sync
}
The protocol loop:
// From automerge/rust/automerge/src/sync.rs (example in module docs)
let mut peer1_state = sync::State::new();
let message = peer1.sync().generate_sync_message(&mut peer1_state);
// peer1 sends: { have: [Have { last_sync: [], bloom: <all my changes> }], heads: [...] }
peer2.sync().receive_sync_message(&mut peer2_state, message)?;
// peer2 checks bloom filter: which of my changes does peer1 probably not have?
// peer2 sends back those changes
// Loop until neither peer has anything new to send
The bloom filter allows each peer to say “here is a probabilistic summary of all changes I have.” The receiving peer queries its own changes against the filter — if a change is NOT in the filter, the sender definitely doesn’t have it. False positives cause unnecessary sends but never cause missing changes.
Why this differs from Yjs: Yjs’s state vector approach requires that every operation from every client has a monotonically increasing clock — it is a total order per client. Automerge’s change DAG is a partial order — there is no global clock. This makes Automerge more natural for long-running offline edits where many independent changes accumulate, but requires the bloom filter protocol to reconcile efficiently rather than a simple clock comparison.
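The no-false-negatives property is the crux, and a toy Bloom filter demonstrates it (illustrative parameters, nothing like Automerge’s actual filter encoding):

```javascript
// FNV-style string hash with a seed, used as the k hash functions
function hash(str, seed) {
  let h = seed
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 0x01000193) >>> 0
  }
  return h
}

function createBloom(bits = 256, k = 3) {
  const arr = new Uint8Array(bits)
  return {
    add(item) { for (let s = 0; s < k; s++) arr[hash(item, s + 1) % bits] = 1 },
    mightHave(item) {
      for (let s = 0; s < k; s++) if (!arr[hash(item, s + 1) % bits]) return false
      return true
    }
  }
}

// Peer A summarizes its change hashes; peer B sends only what A lacks
const bloom = createBloom()
const aHas = ['hash-1', 'hash-2', 'hash-3']
aHas.forEach(h => bloom.add(h))
const bHas = [...aHas, 'hash-4', 'hash-5']
const toSend = bHas.filter(h => !bloom.mightHave(h))
// Every change A added passes mightHave, so toSend never re-sends A's
// changes; a false positive could only suppress a send (handled by
// further sync rounds), never deliver a wrong change.
```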
Deep Dive 6: ShareDB’s Server-Coordinated Transform Pipeline #
ShareDB implements the Jupiter OT protocol: all operations flow through a single server which linearizes history.
// From sharedb/lib/backend.js
function Backend(options) {
this.db = options.db || new MemoryDB() // persistent op log + snapshots
this.pubsub = options.pubsub || new MemoryPubSub() // fan-out to subscribed agents
this.milestoneDb = options.milestoneDb || new NoOpMilestoneDB() // periodic snapshots
this.middleware = Object.create(null) // hooks: submit, apply, commit, afterWrite
}
The SubmitRequest is the server-side unit of work for processing one submitted op:
// From sharedb/lib/submit-request.js (simplified)
SubmitRequest.prototype.submit = function(callback) {
backend.db.getSnapshot(collection, id, fields, options, (err, snapshot) => {
if (snapshot.v === op.v) {
// Fast path: no transform needed
return applyAndCommit()
}
if (snapshot.v > op.v) {
// Transform path: fetch ops from op.v to snapshot.v, transform through each
backend.db.getOpsToSnapshot(collection, id, op.v, snapshot, (err, ops) => {
err = request._transformOp(ops) // mutates op.op in place
if (err) return retry()
applyAndCommit()
})
}
})
}
SubmitRequest.prototype._transformOp = function(ops) {
for (var i = 0; i < ops.length; i++) {
var err = ot.transform(type, this.op, ops[i])
if (err) return err
}
}
Retry on conflict: If two clients submit simultaneously (both at version v), one wins the database commit (first write wins) and the other gets a version mismatch error and retries. On retry it fetches the snapshot, transforms through the winning op, and submits again:
// From sharedb/lib/submit-request.js
SubmitRequest.prototype.retry = function(callback) {
this.retries++
this.backend.emit('timing', 'submit.retry', ...)
this.submit(callback) // re-run the full submit pipeline
}
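The first-writer-wins commit plus transform-and-retry loop can be sketched with an in-memory version counter (hypothetical names, not the ShareDB API):

```javascript
// Toy db: accepts an op only if it targets the current version.
function createDb() {
  return {
    v: 0,
    ops: [], // committed log; ops[i] produced version i+1
    commit(op) {
      if (op.v !== this.v) return false // version conflict → caller retries
      this.ops.push(op)
      this.v++
      return true
    }
  }
}

// Losing submitter transforms through every op committed since the
// version it wrote against, then retries at the new version.
function submit(db, op, transform) {
  while (!db.commit(op)) {
    for (const applied of db.ops.slice(op.v)) {
      op = transform(op, applied)
    }
    op = { ...op, v: db.v }
  }
}

const db = createDb()
const shiftIfBefore = (op, applied) =>
  applied.pos <= op.pos ? { ...op, pos: op.pos + 1 } : op

submit(db, { v: 0, pos: 3, ch: 'a' }, shiftIfBefore) // wins: commits at v0
submit(db, { v: 0, pos: 5, ch: 'b' }, shiftIfBefore) // loses, transforms, retries
// Second op lands at position 6, shifted past the first insert
```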
Middleware hooks: ShareDB exposes hooks at each stage of the submit pipeline:
// From sharedb/lib/backend.js
Backend.prototype.MIDDLEWARE_ACTIONS = {
submit: 'submit', // op received, before any processing
apply: 'apply', // op transformed, about to be applied to snapshot
commit: 'commit', // new snapshot computed, about to write to db
afterWrite: 'afterWrite' // successfully committed; side effects here
}
This middleware architecture allows access control enforcement (submit), server-side document validation (apply), and audit logging (afterWrite) to be injected without modifying core logic.
Presence in ShareDB: Cursor positions are handled via the transformPresence function:
// From sharedb/lib/ot.js
exports.transformPresence = function(presence, op, isOwnOp) {
presence.p = type.transformPresence(presence.p, op.op, isOwnOp)
presence.v++
}
When an op shifts text positions (e.g., an insert at position 5 shifts all cursors at positions ≥5 right by the insert length), the OT type’s transformPresence function adjusts cursor positions accordingly. This keeps remote cursors visually anchored to the correct text positions.
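A sketch of the position adjustment for a single-index cursor (simplified; real presence types transform richer anchor/head ranges):

```javascript
// Transform a cursor index past an insert or delete op.
function transformCursor(index, op) {
  // op: { type: 'insert' | 'delete', pos, len }
  if (op.type === 'insert') {
    return op.pos <= index ? index + op.len : index
  }
  // delete: cursors inside the deleted range clamp to its start
  if (index <= op.pos) return index
  return Math.max(op.pos, index - op.len)
}

const cursor = 10
const afterInsert = transformCursor(cursor, { type: 'insert', pos: 5, len: 3 })  // shifted right
const afterDelete = transformCursor(cursor, { type: 'delete', pos: 2, len: 4 })  // shifted left
const untouched = transformCursor(cursor, { type: 'insert', pos: 20, len: 3 })   // unaffected
```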
Deep Dive 7: Persistence and Snapshots #
The storage problem: A document with 1M edits cannot be replayed from scratch on every load. Two mechanisms address this:
1. Yjs update log + snapshot:
Persistence layout:
updates/doc-id/ ← append-only log of binary Yjs updates
snapshots/doc-id ← periodic encoded ydoc state (Y.encodeStateAsUpdate)
Load sequence:
1. Fetch latest snapshot (encoded ydoc state)
2. Apply snapshot to new Y.Doc
3. Fetch updates newer than snapshot timestamp
4. Apply incremental updates
5. Doc is current
Y.encodeStateAsUpdate(ydoc) produces a binary encoding of all Items in the StructStore — equivalent to a full document snapshot but in the Yjs update format. Loading it yields the same state as replaying all history, but much faster, since the entire state is applied as one bulk update rather than as many individually integrated items.
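The load sequence can be sketched against a hypothetical storage interface (illustrative names; a real implementation would apply binary Yjs updates to a Y.Doc):

```javascript
// Restore from the latest snapshot, then apply only the updates
// recorded after it.
function loadDocument(storage, docId) {
  const doc = { ops: [] } // stand-in for a fresh document
  const snapshot = storage.latestSnapshot(docId)
  let from = 0
  if (snapshot) {
    doc.ops.push(...snapshot.ops) // one bulk apply instead of a full replay
    from = snapshot.upTo
  }
  for (const update of storage.updatesSince(docId, from)) {
    doc.ops.push(update) // incremental tail since the snapshot
  }
  return doc
}

// Toy storage: the snapshot covers the first 3 updates, 2 arrived later
const storage = {
  latestSnapshot: () => ({ ops: ['u1', 'u2', 'u3'], upTo: 3 }),
  updatesSince: (id, from) => ['u1', 'u2', 'u3', 'u4', 'u5'].slice(from)
}
const doc = loadDocument(storage, 'doc-1')
// doc holds all five updates: snapshot bulk + incremental tail
```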
2. ShareDB milestones:
ShareDB’s milestoneDb stores periodic snapshots at specific versions. On load, the backend fetches the nearest milestone snapshot and replays ops from that version forward:
// From sharedb/lib/backend.js
this.milestoneDb = options.milestoneDb || new NoOpMilestoneDB()
// When a milestone is saved (configurable interval, e.g., every 100 ops):
// milestoneDb.saveMilestoneSnapshot(collection, snapshot, callback)
// On load: milestoneDb.getMilestoneSnapshot(collection, id, version, callback)
// On fetch:
// 1. Get milestone snapshot at version M
// 2. Get ops from M to current
// 3. Apply ops to snapshot → current state
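The milestone fetch path follows the same shape. The sketch below models each op as a function of the previous state; `fetchAtVersion` and this ops-as-functions model are assumptions for illustration, not ShareDB's implementation:

```javascript
// Sketch: find the nearest milestone at or before the requested version,
// then replay ops from there to reach the target version.
function fetchAtVersion(milestones, ops, target) {
  const base = milestones
    .filter(m => m.v <= target)
    .reduce((a, b) => (a.v >= b.v ? a : b), { v: 0, data: '' });
  let data = base.data;
  for (let v = base.v; v < target; v++) data = ops[v](data); // replay forward
  return { v: target, data };
}
```

This is also how "view document at any past point in time" works: the target version need not be the latest one.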
Conflict between GC and version history: Yjs’s garbage collector discards deleted item content once it is no longer needed. But if you want to restore a document to a past version (when those items were not yet deleted), you need the original content. The gc: false option disables GC but increases memory:
// Yjs usage; the gc option is defined in yjs/src/utils/Doc.js
const doc = new Y.Doc({ gc: false })
// All deleted items retain their content
// Required for: version history, undo manager, suggestion mode
Google Docs solves this by maintaining two document variants: a GC-enabled version for the live document, and a GC-disabled history snapshot for version restoration.
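The trade-off can be shown with a toy item model (not Yjs internals): with GC on, a delete keeps only a tombstone; with GC off, the content survives, so a past version can be reconstructed.

```javascript
// Toy model of the GC trade-off for deleted CRDT items (illustrative).
function deleteItem(item, gc) {
  return gc
    ? { id: item.id, deleted: true }                         // tombstone only
    : { id: item.id, deleted: true, content: item.content }; // restorable
}

// Restoring a past version needs the original content back:
function restore(item) {
  if (item.content === undefined) throw new Error('content was GC\'d');
  return { id: item.id, deleted: false, content: item.content };
}
```

Restoring succeeds only on the GC-disabled variant, which is exactly why the history copy keeps `gc: false` at the cost of memory.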
Deep Dive 8: Undo / Redo in a CRDT #
Undo in a collaborative editor is semantically harder than in a single-user editor. The naive approach — invert the last operation and apply it — fails when other users have edited the document since.
The correct semantics: Undo should revert your own recent changes without touching other users’ changes. If Alice types “hello” and Bob concurrently types “world” at a different position, Alice undoing her “hello” should not affect Bob’s “world.”
Yjs implements this with UndoManager:
// UndoManager tracks which Items were inserted/deleted in each transaction
// It records: { insertSet: Set<ItemID>, deleteSet: Set<ItemID> }
// Undo = delete the items in insertSet + undelete the items in deleteSet
// BUT: only for items created by the local user (origin === undoManager.origin)
// AND: transform the undo against all intervening remote changes
const undoManager = new Y.UndoManager(ytext, {
captureTimeout: 500, // bundle changes within 500ms into one undo step
trackedOrigins: new Set([localOrigin]) // only track local changes
})
undoManager.undo() // revert last local change, skipping remote changes
undoManager.redo() // re-apply the undone change
The key insight: because Items have stable IDs, “delete item X” is safe even if X has moved in the document. The undo is expressed in terms of Item IDs, not positions. Position-based undo (e.g., “delete the character at position 50”) would be wrong after concurrent inserts shifted positions.
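The difference can be demonstrated with a toy item list. The item shape and both helpers are illustrative, not Yjs's UndoManager internals:

```javascript
// Items carry stable IDs; undo-by-ID cannot hit the wrong character even
// after concurrent inserts shift positions. Undo-by-position can.
function undoInsertById(doc, id) {
  return doc.filter(item => item.id !== id);
}
function undoInsertByPosition(doc, pos) {
  return doc.slice(0, pos).concat(doc.slice(pos + 1)); // fragile under concurrency
}

// Alice inserted 'h' (it was at position 1 when she typed it); Bob then
// concurrently inserted 'X' at position 0, shifting everything right.
const doc = [
  { id: 'bob-1', ch: 'X' },
  { id: 'init-0', ch: 'a' },
  { id: 'alice-1', ch: 'h' }
];
const byId = undoInsertById(doc, 'alice-1'); // removes 'h', as intended
const byPos = undoInsertByPosition(doc, 1);  // removes 'a', the wrong item
```

Undo-by-ID yields `"Xa"`; undo at the stale recorded position yields `"Xh"`, deleting Bob's neighbor instead of Alice's insert.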
Suggestion mode (tracked changes): Rather than applying changes directly, the document uses a shadow Y.Doc with gc: false and a flag isSuggestionDoc: true:
// From yjs/src/utils/Doc.js
class Doc {
constructor({ isSuggestionDoc = false } = {}) {
this.isSuggestionDoc = isSuggestionDoc
this.cleanupFormatting = !isSuggestionDoc
// Suggestion docs preserve all formatting items, even redundant ones,
// so that suggestions can be accepted/rejected independently
}
}
A suggestion is a set of Items in the suggestion doc that have not yet been promoted to the main doc. Accepting a suggestion copies its Items into the main doc. Rejecting it discards the suggestion doc changes.
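A toy sketch of that accept/reject step, assuming an illustrative `resolveSuggestion` helper and a flat item list rather than the real Yjs structures:

```javascript
// A suggestion is a batch of pending items; accepting promotes them into
// the main doc, rejecting drops them, and other suggestions are untouched.
function resolveSuggestion(mainItems, suggestions, id, accept) {
  const s = suggestions.find(x => x.id === id);
  return {
    mainItems: accept ? mainItems.concat(s.items) : mainItems,
    suggestions: suggestions.filter(x => x.id !== id)
  };
}
```

Because each suggestion is resolved independently, accepting one batch never forces a decision on the others.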
7. What If: Durable Execution for Document Lifecycle Flows #
Most collaborative editing flows are synchronous and handled by the WebSocket layer. But several document lifecycle flows are Class B async work — multi-step, long-running, human-signal-dependent:
Export Workflow (PDF / DOCX) #
workflow DocumentExportWorkflow(doc_id, format, user_id):
snapshot = activity(take_export_snapshot, doc_id)
// snapshot is immutable — export is against a point-in-time view
export_job = activity(enqueue_render_job, snapshot, format)
result = race(
on signal("render_complete") => receive result,
on sleep(10 minutes) => result = { status: timed_out }
)
if result.status == success:
url = activity(upload_to_storage, result.bytes, format)
activity(notify_user_export_ready, user_id, url)
else:
activity(notify_user_export_failed, user_id, result.error)
Comment Resolution Workflow #
workflow CommentResolutionWorkflow(doc_id, comment_id):
comment = activity(fetch_comment, doc_id, comment_id)
activity(notify_mentioned_users, comment)
resolution = race(
on signal("comment_resolved") => receive resolution,
on signal("comment_reopened") => receive resolution,
on sleep(30 days) => resolution = { status: auto_archived }
)
activity(update_comment_status, doc_id, comment_id, resolution)
if resolution.status == resolved:
activity(notify_thread_participants, comment, resolution)
Access Request Workflow (human-in-the-loop) #
workflow AccessRequestWorkflow(doc_id, requester_id, requested_role):
doc = activity(fetch_document_metadata, doc_id)
activity(notify_owner_access_request, doc.owner_id, requester_id, requested_role)
decision = race(
on signal("access_granted") => receive decision,
on signal("access_denied") => receive decision,
on sleep(7 days) => decision = { outcome: expired }
)
if decision.outcome == granted:
activity(update_acl, doc_id, requester_id, decision.role)
activity(notify_requester_access_granted, requester_id, doc_id)
elif decision.outcome == denied:
activity(notify_requester_access_denied, requester_id, doc_id)
else:
activity(notify_requester_request_expired, requester_id, doc_id)
These workflows are structurally identical to the payment dispute and ACH dispute workflows: a multi-day waiting period, a human signal that resolves the wait, and compensation paths for each outcome.
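The race-with-timeout core shared by all three workflows can be sketched with plain promises. A real system would use a durable workflow engine that persists the wait across restarts; `raceSignalOrTimeout` is a hypothetical helper:

```javascript
// Race a human signal against a timeout; whichever settles first wins.
function raceSignalOrTimeout(signalPromise, ms, timeoutResult) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve(timeoutResult), ms);
  });
  return Promise.race([signalPromise, timeout])
    .finally(() => clearTimeout(timer)); // don't leave a live timer behind
}
```

In the access-request flow, the signal promise would resolve when the owner clicks grant or deny, and the timeout result would be the `expired` outcome after 7 days.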