Dropbox: System Design #
Derived using the 20-step derivation framework. Every step produces an explicit output artifact. No hand-wavy steps.
Ordering Principle #
Product requirements (upload, sync, share)
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6)
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Step 1 — Problem Normalization #
Goal: Convert product-language requirements into precise operations over state.
| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| User uploads a file | User | create-or-overwrite | File (block_list, version); Block (content, hash) |
| User downloads a file | Client | read | File (block_list); Block (content by hash) |
| File auto-syncs across all user devices | System | read change log + apply delta | FileChangeLog (events); SyncCursor (offset per device); Device local state |
| Conflict resolution when offline edits diverge | System | detect divergence + create Conflict object | FileVersion (version vectors); Conflict (process object) |
| User shares a file/folder with another user | User | create relationship | SharePermission (grantee, resource, access level) |
| Shared user accesses the file | Grantee | eligibility check + read | SharePermission; File; Block |
| View version history of a file | User | read projection | FileVersion (append-only log) |
| Delta transfer (don’t re-upload unchanged bytes) | Client | compute local hashes, read missing list, upload only missing blocks | Block (content-addressed); File (block_list diff) |
| Work offline, sync on reconnect | Client | buffer local events offline; replay on reconnect | FileChangeLog (local event buffer); SyncCursor (replayed from offset) |
| Revoke share permission | Owner | delete relationship | SharePermission |
| Delete a file | User | state transition | File (active → deleted); FileChangeLog (deletion event appended) |
| Restore a deleted file | User | state transition + overwrite | File (deleted → active, block_list restored from FileVersion) |
Hidden write exposures:
- “Auto-sync” is not a simple read. It requires the system to track which device has seen which change (SyncCursor is an offset), detect delta (which blocks changed), and apply the delta to local state. Three operations, not one.
- “Conflict resolution” hides a process object (Conflict) with its own state machine. The conflict is created when divergence is detected, then resolved (manually or automatically) into a new FileVersion.
- “Version history” is a derived view over the FileVersion append-only log — not primary state to be designed independently.
Step 2 — Object Extraction #
Goal: Identify the minimal set of primary state objects. Apply all four purity tests.
Primary Objects #
| Object | Class | Justification |
|---|---|---|
| File | Stable entity | Long-lived, has identity (file_id), evolves via block_list overwrites and state transitions (active/deleted) |
| Block | Event | Immutable once stored; content-addressed by SHA-256(content); never mutated |
| FileVersion | Event | Immutable snapshot of a file’s block_list at a point in time; append-only |
| FileChangeLog | Event stream | Append-only log of all mutations to a File (upload, delete, restore, conflict-resolve) |
| Device | Stable entity | Tracks per-device presence and sync state |
| Conflict | Process object | Created when version vectors diverge; has state machine (open → resolved); must persist across the resolution lifecycle |
| SharePermission | Relationship object | An edge between a User and a File/Folder with access level and its own lifecycle (active → revoked) |
| User | Stable entity | Account identity, quota, plan |
Derived / Rejected Objects #
| Candidate | Problem | Disposition |
|---|---|---|
| SyncCursor | Derivable: it is a pointer (offset) into FileChangeLog per device. If stored as mutable primary state alongside the log, there is dual truth — the log AND the cursor both describe “what this device has seen.” Correct: SyncCursor is a materialized offset bookmark, not primary state. | Derived view — offset into FileChangeLog per (device_id, file_id) |
| VersionHistory | Derivable from FileVersion records filtered by file_id, ordered by created_at | Derived projection |
| StorageQuotaUsed | Derivable from File records owned by user × block sizes | Derived projection (cached) |
| FolderContents | Derivable from File records with parent_folder_id | Derived projection |
Four Purity Tests per Object #
File #
- Ownership purity: Written by the owning user (uploads, deletes, restores) and by the conflict-resolution path. These are distinct operations with distinct guards — ownership is clear. ✓
- Evolution purity: Overwrite of block_list on each new upload version; state machine for active/deleted. These are on different fields and different guards. The split into File (current state) + FileVersion (history log) keeps each pure. ✓
- Ordering purity: FileVersions are totally ordered by version number within file_id. ✓
- Non-derivability: The current block_list of a File cannot be derived without knowing which version is current — that pointer lives in File. ✓
Block #
- Ownership purity: Written by the upload service, never modified, never deleted (content-addressed storage). Single writer under append semantics. ✓
- Evolution purity: Append-only. Once a block with a given hash exists, it never changes. ✓
- Ordering purity: No meaningful ordering — blocks are a content-addressed set. Hash is the identity. ✓
- Non-derivability: Block content cannot be reconstructed without the actual bytes. ✓
Conflict #
- Ownership purity: Created by the sync service (when divergence detected); resolved by user or auto-resolver. Two distinct writers, but at distinct lifecycle phases — creation is system-only, resolution is user-or-system. ✓
- Evolution purity: State machine: detected → open → resolved. Each transition has a guard. ✓
- Ordering purity: Causal lifecycle order — transitions must follow valid paths. ✓
- Non-derivability: A Conflict is not derivable from FileVersions alone. The Conflict object carries user choice, resolution metadata, and the resulting FileVersion reference. ✓
SharePermission #
- Ownership purity: Written by the file/folder owner (grant, revoke). Single writer per permission. ✓
- Evolution purity: State machine: active → revoked. Revocation is terminal. ✓
- Ordering purity: No meaningful ordering within a grantee’s permission set. ✓
- Non-derivability: Access rights cannot be derived from File metadata alone. ✓
Step 3 — Axis Assignment #
Goal: For every primary object, define ownership, evolution, and ordering (bound to scope).
Object: File
Ownership: Multi-writer, one winner per file_id (multiple devices of the same user may write; CAS ensures only one write wins per version)
Evolution: Overwrite (block_list replaced on new version); State machine (active/deleted)
Ordering: Total order on version_number within file_id
Object: Block
Ownership: Single writer (upload service); content-addressed so identity is the hash
Evolution: Append-only (immutable once stored)
Ordering: No meaningful order (set semantics; hash = identity)
Object: FileVersion
Ownership: System-only (created by upload service and conflict resolver; users never write directly)
Evolution: Append-only (immutable snapshot)
Ordering: Total order by version_number within file_id
Object: FileChangeLog
Ownership: System-only (written by upload, delete, restore, conflict-resolve services)
Evolution: Append-only (events are immutable)
Ordering: Total order by sequence_number within file_id
Object: Device
Ownership: Single writer per device_id (device registers itself; service updates last_seen)
Evolution: Overwrite (last_seen, sync_offset fields updated in place)
Ordering: No meaningful order across devices
Object: Conflict
Ownership: Multi-writer across lifecycle phases: system creates, user or auto-resolver resolves
Evolution: State machine (detected → open → resolved)
Ordering: Causal lifecycle order (transitions must follow valid paths)
Object: SharePermission
Ownership: Single writer (the file/folder owner)
Evolution: State machine (active → revoked)
Ordering: No meaningful order within a grantee's permission set
Object: User
Ownership: Single writer (user self-writes profile; billing system writes quota)
Evolution: Overwrite (mutable fields: name, email, plan, quota_used_bytes)
Ordering: No meaningful order
Circuit topology insight: The Dropbox sync system is a transmission line. Think of two capacitors (local device state, server state) connected through a sync medium. File changes are charge flowing to equalize. Delta sync is minimizing the charge transfer needed. A conflict is what happens when both capacitors are charged differently during isolation (offline period) — they cannot simply merge charge; the system must detect divergence and arbitrate.
Step 4 — Invariant Extraction #
Goal: Convert requirements into precise, testable invariants. These are implementation-independent.
Eligibility Invariants #
E1 — Upload eligibility: A user may upload a file only if their quota_used_bytes + new_file_bytes ≤ quota_limit_bytes.
E2 — Access eligibility: A user may read a file only if: (a) they own the file, OR (b) there exists a SharePermission record where grantee_id = user_id AND resource_id covers the file AND status = active AND access_level ∈ {read, write}.
E3 — Write eligibility on shared file: A user may upload a new version of a file they do not own only if: there exists a SharePermission where access_level = write AND status = active.
E4 — Delete eligibility: Only the owner may delete a file. Grantees with write access may not delete.
Ordering Invariants #
O1 — Version monotonicity: For any file_id, if version V exists, version V+1 must have created_at > V.created_at. No two FileVersions for the same file may have the same version_number.
O2 — Change log monotonicity: For any file_id, FileChangeLog sequence numbers are strictly monotonically increasing. Events are never reordered or deleted.
Accounting Invariants #
A1 — Quota consistency: User.quota_used_bytes = SUM(Block.size_bytes for all blocks reachable from active Files owned by user). This must hold after every upload and delete. (In practice, computed asynchronously; the synchronous enforcement is a pessimistic quota check at upload time against a cached counter.)
A2 — Block reference integrity: Every block_hash in any File.block_list must have a corresponding Block record in the block store. A file must never reference a block that has not been durably committed.
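The A1 enforcement strategy (pessimistic pre-check against a cached counter, synchronous deduction, asynchronous reconciliation) can be sketched in a few lines. `QuotaGate` and its field names are illustrative, not part of the design:

```python
class QuotaError(Exception):
    pass

class QuotaGate:
    """Pessimistic quota gate (E1) backed by a cached counter (A1)."""

    def __init__(self, quota_limit_bytes: int, cached_used_bytes: int = 0):
        self.quota_limit_bytes = quota_limit_bytes
        self.cached_used_bytes = cached_used_bytes  # may lag the true value

    def check(self, new_file_bytes: int) -> None:
        # E1: reject before any bytes are accepted.
        if self.cached_used_bytes + new_file_bytes > self.quota_limit_bytes:
            raise QuotaError("over quota")

    def commit(self, new_file_bytes: int) -> None:
        # Deduct synchronously on commit; over-quota risk is bounded by
        # how stale cached_used_bytes can get before reconciliation.
        self.cached_used_bytes += new_file_bytes

    def reconcile(self, true_used_bytes: int) -> None:
        # Async reconciler recomputes SUM(block sizes) and overwrites the cache.
        self.cached_used_bytes = true_used_bytes
```

The staleness window between `commit` and `reconcile` is exactly the bounded over-quota risk the DP table in Step 5 accepts.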
Uniqueness / Idempotency Invariants #
U1 — Block global dedup: There is at most one Block record for any given content_hash. If two uploads produce the same hash, only one Block is stored. Both uploads succeed, but the bytes are stored once.
U2 — Idempotent block upload: Uploading a block with the same content_hash twice must be a no-op. The second upload must not corrupt or duplicate the block.
U3 — Upload idempotency: A client may retry an upload request with the same idempotency key and receive the same result without creating duplicate FileVersions.
Propagation Invariants #
P1 — Sync completeness: If a file changes on device A and device B is online and subscribed, device B must eventually receive the change event. “Eventually” is bounded by SLO (Step 15).
P2 — Conflict detection completeness: If two devices modify the same file while one is offline, the system must detect the divergence when the offline device reconnects and create a Conflict object. The divergence must not be silently overwritten.
P3 — Deletion propagation: If a file is deleted on device A, device B must receive a deletion event. Device B must not continue to serve the file as live after the event is processed.
Access-Control Invariants #
AC1 — Permission revocation is immediate: Once a SharePermission is revoked (status = revoked), all subsequent access attempts by the grantee must be denied. No caching of permissions beyond a bounded TTL (configurable; must be ≤ 60 seconds in strong mode).
AC2 — Grantee cannot escalate: A grantee with read access cannot perform write operations. Access level enforcement is the sole authority — no capability tokens that can be forged client-side.
Step 5 — Design Point Derivation #
Goal: For each invariant cluster, derive the minimal enforcing mechanism. One DP per cluster; no over-engineering.
| Invariant Cluster | Design Point | Reasoning |
|---|---|---|
| E1 (quota), A1 (quota consistency) | Quota gate with cached counter + pessimistic check | Exact enforcement requires a distributed counter; exact counter is expensive at upload scale. Pessimistic: pre-check cached quota_used_bytes; deduct on commit; reconcile asynchronously. Over-quota risk is bounded by cache staleness window. |
| E2, E3, E4, AC1, AC2 (access control) | Permission check service with bounded-TTL cache | Permissions are read at every file access; must be fast. Cache permissions with TTL ≤ 60s for strong mode. Revocation propagates via cache invalidation event. Source of truth is SharePermission table. |
| O1 (version monotonicity), U3 (upload idempotency) | CAS on (file_id, current_version_number) + Idempotency Key store | Each upload atomically increments version_number; if CAS fails, caller retries with fresh version. Idempotency key prevents duplicate FileVersions on retry. |
| U1, U2 (block dedup + idempotency) | Content-addressed block store keyed by SHA-256 hash | Hash = identity = idempotency key. Existence check before upload eliminates duplicate bytes. No separate dedup index needed — the hash IS the key. |
| O2 (log monotonicity) | Single-partition append-only log per file_id | Sequence numbers assigned by the log; append is atomic. Consumers read from offset. |
| P1 (sync propagation) | Change event stream + per-device subscription + push notification | FileChangeLog events are published; devices subscribe; server pushes or client polls from stored offset. |
| P2 (conflict detection) | Version vector per (file_id, device_id); divergence detection on reconnect | Each device tracks a version vector. On reconnect, server compares device’s vector against server’s vector. Divergence → Conflict object created. |
| P3 (deletion propagation) | Deletion event in FileChangeLog + device processes event | Deletion is an event appended to the log. All subscribed devices receive it and mark local copy as deleted. Soft-delete on server (retain block_list for restore); hard-delete after TTL or explicit purge. |
| AC1 (revocation immediacy) | Cache invalidation event on revocation + TTL backstop | On revoke, publish invalidation to permission cache. TTL ≤ 60s ensures stale cache expires even if invalidation is lost. |
| A2 (block reference integrity) | Commit blocks first, then commit file metadata | A block must exist before a File references it. Write blocks first; write file metadata only after all blocks are confirmed durable. This is a sequencing invariant, not a 2PC invariant; no two-phase commit is needed. |
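The A2 sequencing rule (blocks become durable before any metadata references them) can be sketched with in-memory stand-ins. `block_store` and `metadata_db` are hypothetical placeholders for the content-addressed store and the metadata database:

```python
import hashlib

block_store: dict = {}   # stand-in for the content-addressed block store
metadata_db: dict = {}   # stand-in for file metadata: file_id -> block_list

def commit_file(file_id: str, blocks: list) -> list:
    # Phase 1: write every block first. Idempotent: hash is the key (U1/U2).
    hashes = []
    for content in blocks:
        h = hashlib.sha256(content).hexdigest()
        block_store.setdefault(h, content)
        hashes.append(h)
    # Phase 2: only after all blocks are durable, commit the metadata (A2).
    for h in hashes:
        assert h in block_store, "A2 violated: dangling block reference"
    metadata_db[file_id] = hashes
    return hashes
```

If the process crashes between the two phases, the result is an orphaned block (garbage to collect later), never a file pointing at missing bytes.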
Step 6 — Mechanism Selection #
Goal: Mechanical bridge from invariant + axis → concrete implementation mechanism. Apply the full derivation table for four key paths.
6.1 Invariant Type → Mechanism Family #
| Invariant Type | Mechanism Family |
|---|---|
| Eligibility | Guard + atomic state check |
| Ordering | Sequence number + CAS |
| Accounting | Pessimistic counter + async reconciliation |
| Uniqueness/Idempotency | Content hash as key + existence check |
| Propagation | Event stream + subscription |
| Access-control | Permission store + TTL cache + invalidation |
6.2 Ownership × Evolution → Concurrency Mechanism #
| Object | Ownership | Evolution | Table Lookup Result |
|---|---|---|---|
| File | Multi-writer, one winner | Overwrite + state machine | CAS on (file_id, version_number) |
| Block | Single writer | Append-only | No concurrency mechanism needed; hash prevents collision by design |
| FileVersion | System-only | Append-only | Idempotency key on (file_id, version_number) |
| Conflict | Multi-writer across lifecycle phases | State machine | CAS on (conflict_id, status) |
| SharePermission | Single writer | State machine | CAS on (permission_id, status) |
6.3 Mechanical Derivation — Four Key Paths #
Path A: File Write Conflict Detection #
Invariant driving this: P2 (divergence must be detected) + O1 (version monotonicity).
Ownership × Evolution: File is multi-writer (multiple devices of same user may write while offline), overwrite + state machine.
Table lookup: Multi-writer + overwrite → CAS on version.
Q1 (scope): Divergence spans multiple devices (cross-service/cross-process). Not within a single service. But it is also not cross-region in the primary case. Scope = cross-device within one user’s account. The conflict is detected at the server on reconnect. Mechanism: CAS on server-side version vector, not distributed 2PC (devices do not coordinate with each other).
Q2 (failure): What if the device crashes mid-upload? → Idempotency Key. What if the network partitions during upload? → CAS detects stale version on retry; client must re-read current state.
Q3 (data): Version vectors are not commutative in the way CRDT requires (concurrent overwrites of a file are not mergeable without user intent). Content-addressed blocks help detect what changed but not resolve which change wins. → CAS + version vector.
Q4 (access): Reads » Writes (most devices read the synced state; conflicts are rare). Current state is read on every sync. → CQRS-lite: write path updates the version vector; read path serves from a read replica.
Q5 (coupling): The Conflict object must be created when divergence is detected — this is an async-guaranteed propagation from the sync service to the conflict service. → Outbox + Relay pattern. The sync service writes a “conflict_detected” event to its outbox table in the same transaction as updating the file’s version vector; a relay publishes it to the conflict service.
Required combination: CAS + Idempotency Key always (Step 6.4 rule).
Resulting mechanism: Version vector per (file_id, device_id) stored in FileChangeLog. On upload from device D, server computes current version vector. If device D’s local version vector and the server’s vector are concurrent (neither dominates), a Conflict is created via Outbox + Relay. If device D’s vector dominates (no offline divergence), the upload proceeds with CAS on version_number.
Conflict detection algorithm sketch:
function detect_conflict(device_id, file_id, device_vv):
    server_vv = read_version_vector(file_id)
    if dominates(device_vv, server_vv):
        # device is ahead — fast-forward server
        return ACCEPT
    elif dominates(server_vv, device_vv):
        # server is ahead — device needs to pull
        return STALE
    else:
        # concurrent — conflict
        return CONFLICT
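The `dominates` comparison the sketch relies on can be made concrete. A minimal runnable version, assuming a version vector is a map from device_id to a per-device counter:

```python
def dominates(vv_a: dict, vv_b: dict) -> bool:
    """vv_a dominates vv_b iff every counter in vv_a is >= its counterpart."""
    keys = set(vv_a) | set(vv_b)
    return all(vv_a.get(k, 0) >= vv_b.get(k, 0) for k in keys)

def detect_conflict(device_vv: dict, server_vv: dict) -> str:
    if dominates(device_vv, server_vv):
        return "ACCEPT"    # device is ahead (or equal); fast-forward server
    if dominates(server_vv, device_vv):
        return "STALE"     # server is ahead; device must pull
    return "CONFLICT"      # concurrent: neither dominates
```

Note that equal vectors land in ACCEPT, which is harmless: fast-forwarding to an identical state is a no-op.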
Path B: Block Dedup #
Invariant driving this: U1 (global dedup — at most one block per hash) + U2 (idempotency of block upload).
Ownership × Evolution: Block is single writer, append-only. But it is a global dedup — across all users.
Q1 (scope): Global dedup is cross-user. The block store is a single global namespace keyed by content_hash. Scope: global, single-writer-per-key (two clients uploading the same block race, but both should succeed idempotently).
Q2 (failure): If the uploading client crashes mid-block-upload, the block is incomplete. The client must retry. → Idempotency Key = content_hash. The server checks: does a block with this hash exist? If yes, skip upload. If no (or partial), upload resumes from offset (resumable upload protocol).
Q3 (data): Blocks are content-addressed. Content-addressed → hash = natural idempotency key (Q3 rule from framework). This eliminates the need for a separate idempotency key store for block uploads. The hash IS the key.
Q4 (access): Block uploads (writes) are less frequent than block downloads (reads), but block existence checks happen on every upload attempt for every block in the manifest. → Content-addressed object store with O(1) existence check (HTTP HEAD on S3 key = hash).
Q5 (coupling): Block existence must be confirmed before File metadata references it (A2). This is an in-transaction-sequencing coupling, not a saga. Write block first; commit file metadata only after block is durable. No async coupling needed here.
Resulting mechanism: S3 (or equivalent) with key = SHA-256 hex of block content. Upload protocol:
- Client computes SHA-256(block_content) locally.
- Client sends manifest (list of hashes) to server.
- Server performs bulk existence check (S3 HEAD per hash, or batched lookup in a block index table).
- Server returns list of missing hashes.
- Client uploads only missing blocks (PUT to S3 pre-signed URL per hash).
- Server confirms receipt (S3 HEAD again or use S3 event notification).
- Server commits File metadata with new block_list.
The dedup ratio is high for common file types. A 10MB PDF split into 4MB blocks: if block 1 is identical to a block already stored by another user, it is not re-uploaded. Cross-user dedup is a storage multiplier.
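The manifest/missing-blocks exchange in the protocol above can be sketched end to end; an in-memory set stands in for the S3 HEAD existence check, and all names are illustrative:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, as in the protocol

def compute_manifest(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    # Client-side: split into fixed-size blocks and hash each one.
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def missing_blocks(manifest: list, stored: set) -> list:
    # Server-side: return only the hashes the block store does not hold.
    # Order is preserved so the client can map hashes back to offsets.
    return [h for h in manifest if h not in stored]
```

A client then uploads only the returned hashes; re-running the exchange after a crash is safe because the hash itself is the idempotency key.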
Path C: Sync Notification #
Invariant driving this: P1 (sync completeness) + P3 (deletion propagation).
Ownership × Evolution: FileChangeLog is system-only, append-only.
Q1 (scope): Propagation from the server to many devices. This is a fan-out problem: one file change → N device notifications. Scope: cross-service (sync service → push notification service → devices).
Q2 (failure): Device is offline when event is published. → The device must catch up from its stored offset (SyncCursor = offset in FileChangeLog). The event log is durable; the device resumes from its last-processed offset on reconnect. This is the “async-guaranteed” pattern. Missing push notification is survivable because the device polls on reconnect from its offset.
Q3 (data): Events are not commutative (they must be replayed in order). → Not CRDT. The log is totally ordered per file_id (O2).
Q4 (access): Write (device uploads change) triggers fan-out read to many devices. → Fan-out on Write: when a change is committed, publish to a per-user change channel. Devices subscribed to the channel receive the notification. Devices not connected store their offset and catch up later.
Q5 (coupling): Sync notification must be async-guaranteed: the upload service must not block on device notification delivery. → Outbox + Relay: upload service writes FileChangeLog event to its own DB in the same transaction as committing the new FileVersion; a relay publishes to the notification channel (Redis pub/sub or Kafka topic).
Resulting mechanism:
- FileChangeLog stored in Cassandra (high-write append, per-file_id partition key).
- On commit, the relay publishes the event to the Kafka topic `file-changes` (partitioned by user_id for ordering).
- Notification service consumes Kafka and pushes to devices via WebSocket or long-poll.
- Device stores `sync_cursor = last_processed_sequence_number` locally and in the Device table.
- On reconnect, the device sends its cursor; the server streams all events since that cursor from FileChangeLog.
- Reconnect storm mitigation: exponential backoff + full jitter on reconnect attempts after an outage (per the AWS Architecture Blog recommendation).
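The full-jitter backoff in the reconnect-storm bullet follows the standard formula: wait a uniform random time in [0, min(cap, base * 2^attempt)]. A minimal sketch (base and cap values are illustrative):

```python
import random

def backoff_with_full_jitter(attempt: int,
                             base: float = 0.5,
                             cap: float = 60.0) -> float:
    """Seconds to sleep before reconnect attempt N (0-indexed)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter is what actually breaks up the storm: without the random component, all devices that lost connectivity at the same instant would retry in synchronized waves.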
Path D: Share Permission Enforcement #
Invariant driving this: E2, E3, E4 (access eligibility) + AC1 (revocation immediacy) + AC2 (no escalation).
Ownership × Evolution: SharePermission is single writer per record, state machine (active → revoked).
Q1 (scope): Permission check is within-service (the file service checks permissions before serving content). Cross-service for access from different tools (e.g., a linked app). Within the core service: permission check on every read/write request.
Q2 (failure): What if the permission service is slow or unavailable? → Circuit Breaker: if the permission service is down, fail closed (deny access), not open (allow all). This preserves AC1 and AC2 under degradation.
Q3 (data): Permissions are read » written (many reads per grant/revoke event). → Cache-Aside: cache SharePermission records in Redis keyed by (grantee_id, resource_id). TTL ≤ 60s for strong mode. On revocation, publish a cache invalidation event (via Kafka or Redis pub/sub) to all file service instances.
Q4 (access): read » write → Cache-Aside is correct. The cache entry is a materialized permission check result: {can_read: true, can_write: false}.
Q5 (coupling): Revocation must propagate with bounded delay (≤ 60s, per AC1). This is async-guaranteed propagation. → Outbox + Relay for revocation events: SharePermission service writes revocation event to outbox; relay publishes to cache invalidation channel; all file service instances invalidate their local cache entry for that (grantee_id, resource_id) pair.
CAS + Lease requirement (Step 6.4):
- CAS on SharePermission.status: the revocation is a CAS from `active` to `revoked`. Prevents double-revocation or a race between grant and revoke.
- No lease needed here because the single-writer invariant holds (only the owner writes).
Resulting mechanism:
- PostgreSQL `share_permissions` table with a `status` column.
- Revocation: `UPDATE share_permissions SET status='revoked', revoked_at=NOW() WHERE permission_id=? AND status='active'` (CAS-equivalent via conditional update).
- Cache: Redis hash `perm:{grantee_id}:{resource_id}` → `{can_read, can_write, ttl}`.
- Cache invalidation: on revocation, write to the outbox; the relay publishes `perm.invalidate:{grantee_id}:{resource_id}` to Redis pub/sub; all file service instances subscribe and delete their cache entry.
- Circuit Breaker: if Redis is unavailable, fall back to a direct DB query. If the DB is also unavailable, fail closed.
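A minimal sketch of the cache-aside check with the fail-closed rule; `cache` and `db` are in-memory stand-ins for Redis and Postgres, and all names are hypothetical:

```python
def can_access(grantee_id, resource_id, op, cache, db) -> bool:
    """op is 'read' or 'write'. Fail closed on any backend failure (AC1/AC2)."""
    key = (grantee_id, resource_id)
    perm = cache.get(key)
    if perm is None:
        try:
            perm = db[key]          # cache miss: read the source of truth
        except KeyError:
            return False            # no grant exists
        except Exception:
            return False            # DB unavailable: deny, never allow-all
        cache[key] = perm           # populate cache (TTL omitted in sketch)
    if perm.get("status") != "active":
        return False                # revoked grants deny even if cached
    return perm.get("can_" + op, False)
```

The revocation invalidation event corresponds to deleting `key` from `cache`; the TTL (not modeled here) is the backstop if that event is lost.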
6.4 Required Combinations Applied #
| Path | CAS Required? | Idempotency Key Required? | Lease Required? | Fencing Token? |
|---|---|---|---|---|
| File write conflict | YES — CAS on version_number | YES — upload idempotency key | NO (no crash-holder scenario) | NO |
| Block dedup | NO (hash is the identity; existence check is idempotent by design) | YES — content_hash IS the idempotency key | NO | NO |
| Sync notification | NO | YES — sequence_number prevents duplicate processing | NO | NO |
| Share permission | YES — conditional update on status | YES — idempotency key on grant/revoke request | NO | NO |
Step 7 — Axiomatic Validation #
Goal: Source-of-truth table. No dual truth.
| State Question | Source of Truth | Notes |
|---|---|---|
| What bytes does this file contain? | Block store (S3), keyed by SHA-256 hash | Block content is immutable; hash is the identity |
| What is the current block_list of a file? | files table, block_list column (Postgres) | Updated on every new version via CAS |
| What versions has this file had? | file_versions table (Cassandra or Postgres append-only) | Immutable append log; never updated in place |
| What events have happened to this file? | file_change_log table (Cassandra, per file_id partition) | Immutable event log; source of truth for sync |
| What has device D seen? | devices table, sync_cursor column (Postgres or Redis) | This is a bookmark, not a duplicate of the log. The log is source of truth; the cursor is a pointer. NO DUAL TRUTH. |
| Does user U have access to file F? | share_permissions table (Postgres) | Redis cache is a read cache, NOT a source of truth. Invalidated on revocation. |
| How much storage has user U used? | users table, quota_used_bytes column | Cached counter; asynchronously reconciled against actual block references |
| Is there a conflict on file F? | conflicts table (Postgres) | Conflict object is primary state with its own lifecycle |
Dual truth check:
The SyncCursor is the canonical example of dual-truth risk. If both the FileChangeLog and a separate SyncCursor table claim to be the source of truth for “what device D has seen,” updates to both must be kept in sync — which is exactly the dual-truth problem. The correct design is: FileChangeLog is the source of truth for what happened; SyncCursor is a bookmark (offset) stored in Device or a separate cursor table, pointing into the log. If the cursor is lost, it can be reset to 0 (full resync) or to a known checkpoint. The log is never lost.
Validation result: No dual truth found. Each state question has exactly one source of truth.
Step 8 — Algorithm Design #
Goal: Pseudocode for every write path and state machines.
8.1 Block Upload (Delta Sync) Algorithm #
// Client-side: compute local manifest
function compute_manifest(file_path) -> List[BlockManifestEntry]:
    blocks = split_file_into_blocks(file_path, block_size=4MB)
    manifest = []
    for block in blocks:
        hash = sha256(block.content)
        manifest.append({hash: hash, offset: block.offset, size: block.size})
    return manifest

// Client sends manifest to server
// Server checks which blocks are missing
function check_missing_blocks(file_id, manifest) -> List[hash]:
    missing = []
    for entry in manifest:
        if not block_store.exists(entry.hash):  // S3 HEAD request
            missing.append(entry.hash)
    return missing

// Client uploads only missing blocks
function upload_missing_blocks(missing_hashes, manifest):
    for hash in missing_hashes:
        block = find_block_in_manifest(manifest, hash)
        presigned_url = get_presigned_upload_url(hash)
        http_put(presigned_url, block.content)
        // Retry with exponential backoff on failure
        // Hash = idempotency key; re-uploading the same hash is safe
// Server commits new file version
function commit_file_version(file_id, user_id, device_id, manifest, idempotency_key):
    // Idempotency check (U3)
    if idempotency_store.exists(idempotency_key):
        return idempotency_store.get_result(idempotency_key)
    // CAS on version_number (O1)
    current = db.select("SELECT version_number FROM files WHERE file_id=?", file_id)
    new_version = current.version_number + 1
    block_list = [entry.hash for entry in manifest]
    // Verify all blocks exist before committing (A2)
    for hash in block_list:
        assert block_store.exists(hash), "Block not durable: " + hash
    // Atomic commit: new FileVersion + update File.block_list + append to FileChangeLog
    db.transaction():
        db.execute("""
            UPDATE files
            SET block_list=?, version_number=?, updated_at=NOW()
            WHERE file_id=? AND version_number=?
        """, block_list, new_version, file_id, current.version_number)
        // CAS: if version_number changed since read, this UPDATE affects 0 rows → retry
        if db.rows_affected == 0:
            raise ConcurrentWriteError("Retry with fresh version")
        db.execute("""
            INSERT INTO file_versions (file_id, version_number, block_list, created_at, created_by_device)
            VALUES (?, ?, ?, NOW(), ?)
        """, file_id, new_version, block_list, device_id)
        db.execute("""
            INSERT INTO file_change_log (file_id, event_type, version_number, created_at)
            VALUES (?, 'upload', ?, NOW())
        """, file_id, new_version)
        // Outbox entry for sync notification relay
        db.execute("""
            INSERT INTO outbox (event_type, payload, created_at)
            VALUES ('file_changed', ?, NOW())
        """, json({file_id, new_version, user_id}))
    idempotency_store.put(idempotency_key, {version_number: new_version})
    return {version_number: new_version}
8.2 Conflict Detection State Machine #
States: detected → open → resolved
Transitions:
NONE → detected : trigger = sync service detects concurrent version vectors
detected → open : trigger = Conflict object created in DB, user notified
open → resolved : trigger = user chooses a winner (or auto-resolver picks)
resolved → NONE : terminal (new FileVersion created, Conflict closed)
function on_device_reconnect(device_id, file_id, device_vv, device_manifest):
    server_vv = read_version_vector(file_id)
    relation = compare_version_vectors(device_vv, server_vv)
    if relation == DOMINATES:
        // Device is strictly ahead of the server — accept its version
        commit_file_version(file_id, device_manifest)
    elif relation == DOMINATED:
        // Server is ahead — tell the device to pull
        return {action: "pull", server_version: server_vv}
    elif relation == CONCURRENT:
        // Conflict: neither vector dominates the other
        conflict_id = create_conflict(file_id, device_id, device_vv, server_vv)
        // Outbox entry → notify user
        return {action: "conflict", conflict_id: conflict_id}
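The `compare_version_vectors` helper used above is referenced but never specified. A minimal sketch, assuming a version vector is represented as a map from device_id to the highest per-device sequence number (missing entries count as zero); the relation names mirror the pseudocode:

```python
# Possible relations between two version vectors.
DOMINATES = "dominates"    # first argument strictly ahead
DOMINATED = "dominated"    # second argument strictly ahead
EQUAL = "equal"            # identical histories
CONCURRENT = "concurrent"  # neither dominates -> conflict

def compare_version_vectors(a: dict, b: dict) -> str:
    """Compare version vectors a (device) and b (server).

    Each vector maps an actor id (device_id) to the highest sequence
    number observed from that actor; absent keys are treated as 0.
    """
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return CONCURRENT   # divergent offline edits -> create Conflict
    if a_ahead:
        return DOMINATES
    if b_ahead:
        return DOMINATED
    return EQUAL
```

Note that sequence numbers, not wall clocks, drive the comparison, which is what makes the clock-skew failure mode in Step 14 a non-issue here.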
function resolve_conflict(conflict_id, resolution_choice):
    // CAS on conflict status: only an 'open' conflict can be resolved
    conflict = db.select("SELECT * FROM conflicts WHERE conflict_id=? AND status='open'", conflict_id)
    if conflict is None:
        return ALREADY_RESOLVED
    if resolution_choice == KEEP_SERVER:
        winning_block_list = conflict.server_block_list
    elif resolution_choice == KEEP_DEVICE:
        winning_block_list = conflict.device_block_list
    elif resolution_choice == KEEP_BOTH:
        // Create a copy of the device version with a suffix (e.g., "file (conflicted copy).txt")
        create_conflict_copy(conflict.file_id, conflict.device_block_list)
        winning_block_list = conflict.server_block_list
    db.transaction():
        db.execute("UPDATE conflicts SET status='resolved' WHERE conflict_id=? AND status='open'", conflict_id)
        commit_file_version(conflict.file_id, winning_block_list, idempotency_key=conflict_id + "_resolve")
8.3 Sync Pull Algorithm (Device on Reconnect) #
function sync_on_reconnect(device_id, file_id):
    // Read stored cursor (last processed sequence number)
    cursor = db.select("SELECT sync_cursor FROM devices WHERE device_id=?", device_id)
    last_seq = cursor.sync_cursor ?? 0
    // Read all events since last_seq
    events = db.select("""
        SELECT * FROM file_change_log
        WHERE file_id=? AND sequence_number > ?
        ORDER BY sequence_number ASC
    """, file_id, last_seq)
    for event in events:
        apply_event_to_local_state(device_id, event)
        // Advance the cursor atomically after each event (CAS on the old value)
        db.execute("""
            UPDATE devices SET sync_cursor=? WHERE device_id=? AND sync_cursor=?
        """, event.sequence_number, device_id, last_seq)
        last_seq = event.sequence_number
function apply_event_to_local_state(device_id, event):
    if event.type == 'upload' or event.type == 'restore':
        delta = compute_block_delta(local_block_list, event.block_list)
        download_missing_blocks(delta.missing_blocks)
        update_local_file(event.block_list)
    elif event.type == 'delete':
        mark_local_file_deleted(event.file_id)
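The `compute_block_delta` step above is where delta sync pays off: because blocks are content-addressed, the device only fetches hashes it does not already hold. A minimal sketch, assuming block lists are ordered lists of hash strings:

```python
def compute_block_delta(local_block_list, remote_block_list):
    """Return which blocks the device must download and which it can reuse.

    Content addressing means membership by hash is sufficient; the order
    of remote_block_list only matters when reassembling the file.
    """
    have = set(local_block_list)
    missing, seen = [], set()
    for h in remote_block_list:
        if h not in have and h not in seen:
            missing.append(h)  # fetch each new hash exactly once
            seen.add(h)
    reused = [h for h in remote_block_list if h in have]
    return {"missing_blocks": missing, "reused_blocks": reused}
```

For a 1 GB file where one 4 MB block changed, `missing_blocks` contains a single hash, so the device downloads 4 MB instead of 1 GB.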
8.4 Quota Enforcement Algorithm #
function check_quota_before_upload(user_id, new_file_bytes):
    // Pessimistic check against the cached counter (fall back to DB on cache miss)
    current_usage = redis.get("quota:" + user_id) ?? db.select("SELECT quota_used_bytes FROM users WHERE user_id=?", user_id)
    quota_limit = db.select("SELECT quota_limit_bytes FROM users WHERE user_id=?", user_id)
    if current_usage + new_file_bytes > quota_limit:
        raise QuotaExceededError()
    // Reserve space optimistically
    redis.incrby("quota:" + user_id, new_file_bytes)
    // On upload failure: release the reservation (DECRBY)
    // On upload success: no-op (counter already incremented)
    // An async reconciliation job recomputes exact usage periodically
Step 9 — Logical Data Model #
Goal: Schema with partition keys derived from invariant scope.
Tables #
files #
CREATE TABLE files (
file_id UUID PRIMARY KEY,
owner_user_id UUID NOT NULL REFERENCES users(user_id),
parent_folder_id UUID REFERENCES folders(folder_id),
name TEXT NOT NULL,
block_list TEXT[] NOT NULL, -- ordered list of SHA-256 hashes
version_number BIGINT NOT NULL DEFAULT 1,
status TEXT NOT NULL DEFAULT 'active', -- active | deleted
size_bytes BIGINT NOT NULL,
content_hash TEXT, -- hash of full file (optional, for quick comparison)
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL,
CONSTRAINT files_status_check CHECK (status IN ('active', 'deleted'))
);
-- Partition key for access: owner_user_id (all files by user)
-- CAS key: (file_id, version_number) for optimistic concurrency
CREATE INDEX files_owner_idx ON files(owner_user_id, status);
CREATE INDEX files_folder_idx ON files(parent_folder_id, status);
file_versions #
-- Append-only; never updated. Cassandra or Postgres.
CREATE TABLE file_versions (
file_id UUID NOT NULL,
version_number BIGINT NOT NULL,
block_list TEXT[] NOT NULL,
size_bytes BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
created_by_device UUID,
created_by_user UUID NOT NULL,
PRIMARY KEY (file_id, version_number)
);
-- Partition key: file_id (all versions of one file co-located)
-- Ordering: version_number ASC within file_id
file_change_log #
-- Append-only event log. Cassandra preferred for high-write throughput.
CREATE TABLE file_change_log (
file_id UUID NOT NULL,
sequence_number BIGINT NOT NULL, -- monotonically increasing within file_id
event_type TEXT NOT NULL, -- upload | delete | restore | conflict_resolved | shared | unshared
version_number BIGINT, -- which FileVersion this event references (if applicable)
actor_device_id UUID,
actor_user_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
metadata JSONB,
PRIMARY KEY (file_id, sequence_number)
);
-- Partition key: file_id → all events for one file co-located, ordered by sequence_number
blocks (index table — not the block content itself) #
-- Block content stored in S3. This table is an index for existence checks and metadata.
CREATE TABLE blocks (
content_hash TEXT PRIMARY KEY, -- SHA-256 hex
size_bytes BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
reference_count BIGINT NOT NULL DEFAULT 1 -- for GC: when 0, eligible for deletion
);
devices #
CREATE TABLE devices (
device_id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(user_id),
device_name TEXT,
platform TEXT, -- mac | windows | linux | ios | android | web
sync_cursor BIGINT NOT NULL DEFAULT 0, -- offset into file_change_log per user
last_seen_at TIMESTAMPTZ,
registered_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX devices_user_idx ON devices(user_id);
Note on SyncCursor: The sync_cursor in the devices table is a bookmark into the FileChangeLog, not a duplicate of the log. It is the device’s “read position.” The log is the source of truth; this cursor is how the device knows where to resume. If lost, resync from 0.
conflicts #
CREATE TABLE conflicts (
conflict_id UUID PRIMARY KEY,
file_id UUID NOT NULL REFERENCES files(file_id),
status TEXT NOT NULL DEFAULT 'open', -- open | resolved
server_version_number BIGINT NOT NULL,
device_id UUID NOT NULL,
device_block_list TEXT[] NOT NULL,
server_block_list TEXT[] NOT NULL,
resolution TEXT, -- keep_server | keep_device | keep_both
resolved_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL,
CONSTRAINT conflicts_status_check CHECK (status IN ('open', 'resolved'))
);
CREATE INDEX conflicts_file_idx ON conflicts(file_id, status);
share_permissions #
CREATE TABLE share_permissions (
permission_id UUID PRIMARY KEY,
resource_id UUID NOT NULL, -- file_id or folder_id
resource_type TEXT NOT NULL, -- file | folder
grantor_user_id UUID NOT NULL REFERENCES users(user_id),
grantee_user_id UUID NOT NULL REFERENCES users(user_id),
access_level TEXT NOT NULL, -- read | write | admin
status TEXT NOT NULL DEFAULT 'active', -- active | revoked
created_at TIMESTAMPTZ NOT NULL,
revoked_at TIMESTAMPTZ,
CONSTRAINT sp_status_check CHECK (status IN ('active', 'revoked')),
CONSTRAINT sp_access_check CHECK (access_level IN ('read', 'write', 'admin'))
);
CREATE INDEX sp_grantee_idx ON share_permissions(grantee_user_id, status);
CREATE INDEX sp_resource_idx ON share_permissions(resource_id, status);
idempotency_keys #
CREATE TABLE idempotency_keys (
idempotency_key TEXT PRIMARY KEY,
result JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ NOT NULL -- TTL 24h
);
outbox #
CREATE TABLE outbox (
outbox_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
published_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'pending' -- pending | published
);
CREATE INDEX outbox_pending_idx ON outbox(status, created_at) WHERE status='pending';
users #
CREATE TABLE users (
user_id UUID PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
plan TEXT NOT NULL DEFAULT 'free', -- free | plus | professional | business
quota_limit_bytes BIGINT NOT NULL DEFAULT 2147483648, -- 2 GB default
quota_used_bytes BIGINT NOT NULL DEFAULT 0, -- cached counter; reconciled async
created_at TIMESTAMPTZ NOT NULL
);
Step 10 — Technology Landscape #
Goal: Capability → shape → product mapping.
| Capability Needed | Shape Required | Product Selected | Reason |
|---|---|---|---|
| Block storage (immutable, content-addressed, globally durable) | Object store: key = hash, value = bytes; S3-compatible; high throughput PUT/GET | AWS S3 | Industry standard; 11-nines durability; CDN-compatible; multi-region replication; pre-signed URLs eliminate proxy hop |
| File metadata (current state, version, status) | Relational: ACID transactions, CAS via optimistic locking, foreign keys | PostgreSQL | Strong consistency; supports CAS via conditional UPDATE; row-level locking; JSONB for block_list |
| File change log + version history (high-write append, large volume, time-series) | Wide-column store: append-only, partition by file_id, time-ordered | Apache Cassandra | Optimized for append-heavy workloads; tunable consistency; partition key = file_id for co-location |
| Sync notification (real-time push to connected devices) | Pub/sub with persistence; consumer groups for multiple devices | Apache Kafka | Ordered per partition (user_id); consumer offset = device sync cursor; replay from any offset; durable |
| Permission cache (low-latency read, TTL, pub/sub invalidation) | In-memory cache with TTL and pub/sub | Redis | O(1) GET/SET; TTL support; Redis pub/sub for invalidation events; Sentinel/Cluster for HA |
| Quota counter (high-throughput increment/decrement) | In-memory atomic counter with persistence | Redis | INCRBY/DECRBY atomic; persist to Postgres on write; reconcile async |
| Conflict management (process object with state machine, strong consistency) | Relational: ACID, CAS on status | PostgreSQL | Same cluster as files; conflict references file; transactional integrity |
| Idempotency key store (short-lived, key-value, TTL 24h) | Key-value with TTL | Redis (or Postgres with TTL index) | Redis for speed; Postgres if durability is required across restarts |
| CDN (block download acceleration for frequently accessed files) | Edge cache with origin pull | Cloudflare CDN | Global PoPs; origin = S3; cache key = block hash (content-addressed = perfect cache hit rate for identical blocks) |
| Outbox relay | Background job consuming outbox table | Custom relay + Kafka producer | Polls outbox table; publishes events to Kafka; marks published |
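The outbox relay row above is the one custom component in the table. Its polling pass can be sketched as below; `db_rows` stands in for a `SELECT ... WHERE status='pending' ORDER BY created_at` result set and `publish` for the Kafka producer call, both of which are assumptions here:

```python
import json

def relay_once(db_rows, publish):
    """One polling pass of the outbox relay: publish pending rows in
    creation order, then mark each one published."""
    published = 0
    pending = sorted((r for r in db_rows if r["status"] == "pending"),
                     key=lambda r: r["created_at"])
    for row in pending:
        # A crash between publish and the status update re-publishes the
        # row on the next pass: at-least-once delivery, so consumers
        # must deduplicate (here, by sequence_number).
        publish(row["event_type"], json.dumps(row["payload"]))
        row["status"] = "published"  # UPDATE outbox SET status='published'
        published += 1
    return published
```

This is why Step 12 classifies sync event delivery as at-least-once: the relay deliberately favors duplicates over loss.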
Step 11 — Deployment Topology #
Goal: Service boundaries and failure domains.
Services #
┌────────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ Desktop (Mac/Win/Linux) Mobile (iOS/Android) Web (Browser) │
└────────────────────────────┬───────────────────────────────────────┘
│ HTTPS + WebSocket
┌────────────────────────────▼───────────────────────────────────────┐
│ API Gateway │
│ Rate limiting, auth, TLS termination, routing │
└──┬──────────┬──────────┬──────────┬───────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌────────────────┐
│Upload│ │Sync │ │Share │ │Notification │
│Svc │ │Svc │ │Svc │ │Svc (WebSocket) │
└──┬───┘ └──┬───┘ └──┬───┘ └──────┬─────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────┐
│ Data Layer │
│ │
│ PostgreSQL (files, conflicts, shares, users) │
│ Cassandra (file_change_log, file_versions) │
│ Redis (permission cache, quota counters, │
│ pub/sub, idempotency keys) │
│ Kafka (file-changes topic, sync events) │
│ S3 (block content) │
│ Cloudflare CDN (block download acceleration) │
└─────────────────────────────────────────────────┘
Service Responsibilities #
| Service | Responsibility | Failure Domain |
|---|---|---|
| Upload Service | Manifest validation, block existence check, block upload coordination, commit FileVersion, append FileChangeLog, write outbox | Stateless; horizontally scaled; failure = upload fails, client retries |
| Sync Service | Reconnect handling, version vector comparison, conflict detection, streaming events from FileChangeLog to device | Stateless; horizontally scaled; failure = device retries from stored cursor |
| Share Service | Grant/revoke permissions, check eligibility, publish invalidation events | Stateless; failure = permission operation fails, client retries |
| Notification Service | WebSocket connections to devices, receive Kafka events, push to connected devices | Stateful (WebSocket connections); failure = device falls back to poll |
| Outbox Relay | Poll outbox table, publish to Kafka, mark published | Background job; failure = events delayed, not lost |
| Conflict Resolver | Receive conflict events from Kafka, present to user, apply resolution | Stateless; failure = conflict remains open until user next opens app |
Failure Domains #
- US-EAST-1 failure: Failover to US-WEST-2 (standby). PostgreSQL primary in US-EAST-1; read replicas in US-WEST-2. S3 cross-region replication. Cassandra multi-region. Kafka MirrorMaker.
- Single service failure: Other services unaffected. Clients experience partial degradation (e.g., no push notifications if Notification Service is down, but sync still works on reconnect via Sync Service).
- Redis failure: Fall back to direct Postgres permission check. Quota counter reconciled from Postgres.
- Kafka failure: Outbox relay accumulates pending events. Devices fall back to poll-based sync.
Step 12 — Consistency Model #
Goal: Per-path classification of consistency guarantees.
| Operation Path | Consistency Model | Justification |
|---|---|---|
| Block upload (PUT to S3) | Strong read-after-write per key (since 2020); cross-region replication eventual | S3 offers strong read-after-write consistency for PUT on a single key; replication to other regions is eventual |
| File metadata commit (CAS on version_number) | Strong (serializable via Postgres transaction) | CAS requires seeing the latest version; Postgres serializable transaction ensures this |
| FileChangeLog append | Strong per partition (Cassandra quorum write + quorum read) | With R + W > RF, a quorum read sees the latest quorum write; events are ordered by sequence_number within the partition |
| Permission check (cache hit) | Eventual (bounded by TTL ≤ 60s) | Cache may be up to 60s stale after revocation; acceptable per AC1 SLO |
| Permission check (cache miss → DB) | Strong (Postgres primary read) | Cache miss goes to source of truth |
| Sync event delivery | Eventual (Kafka delivery guarantee: at-least-once) | Events may be delivered more than once; consumers deduplicate by sequence_number |
| Quota check (Redis counter) | Eventual (cached counter; reconciled async) | May over-allow up to cache staleness window; bounded over-quota by reconciliation job |
| Conflict detection | Strong (version vector comparison at server; Postgres transaction on conflict creation) | Conflict must not be silently dropped; strong consistency required |
Step 13 — Scaling Model #
Goal: Scale type per dimension, hotspots, and strategy.
Scaling Dimensions #
| Dimension | Scale Type | Hotspot Risk | Strategy |
|---|---|---|---|
| Block storage (bytes) | Horizontal partition by hash prefix | None (content-addressed; uniform distribution) | S3 auto-scales; no action needed |
| File metadata (rows) | Horizontal sharding by user_id | Popular users (celebrity accounts in B2B context) | Shard Postgres by user_id; per-shard PostgreSQL instance |
| FileChangeLog (events/sec) | Write-scale by file_id partition in Cassandra | Hot file (shared file edited frequently) | Cassandra partition by file_id; consistent hashing; increase replication factor for hot partitions |
| Upload throughput (Gbps) | Horizontal: more Upload Service instances; S3 transfer acceleration | None (direct-to-S3 bypasses proxy) | Pre-signed URLs: client uploads directly to S3, bypassing service tier entirely; service only handles manifest and commit |
| Sync notification fanout | Write fan-out to N devices | User with thousands of devices (enterprise) | Cap devices per user; batch notification; Kafka partition by user_id; Notification Service shards by user_id |
| Permission reads | Read-heavy; cache-dominated | None after cache warm-up | Redis cluster; permission cache hit rate > 99% after warm-up |
| Reconnect storm (after outage) | Thundering herd | All devices reconnect simultaneously after region recovery | Exponential backoff + full jitter on client; rate limiting at API Gateway; Sync Service queue with backpressure |
Block-Level Dedup Savings #
Global dedup across users reduces storage by an estimated 30-70% for typical enterprise workloads (common file types like PDFs, Office documents, and images with identical blocks). The dedup ratio compounds with block size: 4MB blocks capture document-level dedup; 64KB blocks capture within-document dedup but increase manifest overhead.
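The block-size trade-off above can be made concrete with a little arithmetic: each block contributes one 32-byte SHA-256 hash to the manifest, so smaller blocks multiply both block count and manifest size.

```python
def manifest_overhead(file_bytes, block_bytes, hash_bytes=32):
    """Blocks per file and manifest size (bytes) for a given block size.

    hash_bytes=32 because block_list entries are SHA-256 hashes.
    """
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks, blocks * hash_bytes

gib = 1 << 30  # a 1 GiB file
print(manifest_overhead(gib, 4 * 1024 * 1024))  # 4 MB blocks  -> (256, 8192)
print(manifest_overhead(gib, 64 * 1024))        # 64 KB blocks -> (16384, 524288)
```

A 1 GiB file needs a 8 KB manifest at 4 MB blocks but a 512 KB manifest at 64 KB blocks, which is why 4 MB is the default in Step 16 despite the coarser dedup granularity.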
Partition Strategy for FileChangeLog #
The FileChangeLog in Cassandra is partitioned by file_id. This means all events for one file are co-located, and reads for sync (streaming events from an offset) are efficient. The risk is a hot partition for a file with very high event rate (e.g., a frequently edited shared document with many collaborators). Mitigation: rate-limit writes per file_id at the Upload Service level.
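The reconnect-storm mitigation in the table above relies on exponential backoff with full jitter on the client. A sketch using the Step 16 defaults (base 1 s, cap 60 s):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: sleep a uniform random
    duration in [0, min(cap, base * 2**attempt)] before retry `attempt`."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Full jitter (rather than `min(cap, base * 2**attempt)` alone) spreads reconnects uniformly across the window, so devices that disconnected at the same instant do not retry in synchronized waves.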
Step 14 — Failure Model #
Goal: Per failure type enumeration with detection, impact, and mitigation.
| Failure | Detection | Impact | Mitigation |
|---|---|---|---|
| Block upload fails mid-stream (network drop) | Client receives error or timeout | Partial block upload; file commit not attempted | Resumable upload: client stores upload offset locally; resumes from offset on retry; block hash = idempotency key ensures no corruption |
| File commit fails after blocks uploaded (DB crash) | Client receives 500 or timeout | Blocks uploaded to S3 but file metadata not committed; orphaned blocks | Idempotency key: client retries commit; server checks idempotency key store; if not found, re-runs commit; orphaned blocks cleaned by GC job after TTL |
| Sync service crash during sync | Client receives disconnect | Device’s sync_cursor not advanced past the crash point | Client stores cursor locally; on reconnect, resumes from last local cursor; server replays from that point |
| Kafka partition leader failure | Kafka replication detects; leader re-elected | Brief delay in sync notification delivery | Kafka auto-elects new leader; producers retry; at-least-once delivery guaranteed; consumers deduplicate |
| Redis cache failure | Health check + circuit breaker | Permission cache misses; fallback to Postgres | Circuit breaker opens; all permission reads go to Postgres primary; higher latency but correct; Redis recovers and cache warms |
| Reconnect storm after regional outage | Spike in sync request rate; API Gateway metrics | Upload Service and Sync Service CPU spike; latency increase; possible retry cascade | Exponential backoff + full jitter on client; API Gateway rate limiting per user; Sync Service autoscaling with pre-provisioned capacity |
| Ghost file in device cache (file deleted on server, still cached locally) | Deletion event in FileChangeLog not yet processed by device | Device shows deleted file as available | Deletion event is durable in FileChangeLog; device processes on reconnect; TTL on local cache entries (24h backstop); device validates file existence on open |
| Version vector clock skew (device clock wrong) | Conflict detection produces false positive | Spurious conflict object created | Use server-assigned sequence numbers for ordering; do not trust device clocks for version comparison; version vectors use sequence numbers, not wall clock |
| S3 regional outage | S3 health checks; upload failures | Uploads fail; downloads may serve from CDN cache for recently accessed blocks | CDN caches recently downloaded blocks; uploads queue locally on device and retry when S3 recovers; read-only mode for cached files |
| Database primary failure (Postgres) | pg_stat_replication lag monitoring; health check timeout | All writes fail; reads fall to replica (stale) | Automatic failover to replica (Patroni or RDS Multi-AZ); failover time ≈ 30s; uploads fail-fast during failover; clients retry |
| Conflict resolution failure (resolver crashes) | Conflict stuck in ‘open’ state | User cannot resolve conflict | Conflict remains in ‘open’ state; user can retry resolution; no data loss (both versions preserved) |
| Quota counter corruption (Redis restart) | Quota reconciliation job detects mismatch | Users may upload beyond quota limit until reconciliation | Reconciliation job runs every 15 minutes; computes exact usage from file records; corrects Redis counter and Postgres field |
Step 15 — SLOs #
| SLO | Target | Measurement |
|---|---|---|
| Upload availability | 99.9% (43.8 min/month downtime) | HTTP success rate for upload commits, measured per 5-min window |
| Upload P99 latency (small file <1MB) | < 500ms | Time from first HTTP byte to commit response |
| Upload P99 latency (large file 1GB, delta = 0 new bytes) | < 2s | Manifest check + commit without block uploads |
| Download P99 latency (cache hit at CDN) | < 50ms | Time from request to first byte of first block |
| Download P99 latency (cache miss, S3 origin) | < 500ms | Time from request to first byte, S3 origin |
| Sync propagation latency (device online, file changed on another device) | < 5s end-to-end (P99) | Time from commit on device A to event received on device B |
| Permission revocation propagation | < 60s (P99) | Time from revocation commit to all file service instances invalidating cache |
| Conflict detection latency | < 2s from reconnect to conflict object created | Time from device reconnect to conflict notification received |
| Version history availability | 99.9% | HTTP success rate for version list API |
| Data durability (uploaded blocks) | 11-nines (≥ 99.999999999%) | Inherited from S3; supplemented by cross-region replication |
Step 16 — Operational Parameters #
| Parameter | Default Value | Tunable? | Notes |
|---|---|---|---|
| Block size | 4 MB | Yes (1MB–64MB) | Larger blocks = better dedup ratio; smaller blocks = finer-grained delta; 4MB is a common sweet spot |
| Block hash algorithm | SHA-256 | No | Cryptographically secure; collision probability negligible at any realistic block count |
| Idempotency key TTL | 24 hours | Yes | After 24h, a retry of an old upload will create a new version |
| Permission cache TTL | 60 seconds (strong mode); 300 seconds (eventual mode) | Yes | Trade-off: lower TTL = faster revocation propagation; higher TTL = lower permission check latency |
| FileChangeLog retention | 180 days | Yes | After 180 days, old events are archived to cold storage (S3 Glacier) |
| Version history retention | Per plan (Free: 30 days; Plus: 180 days; Professional: 365 days; Business: unlimited) | Per plan | Older versions archived or deleted per plan |
| Reconnect backoff base | 1 second | Yes | Exponential backoff with full jitter; cap at 60 seconds |
| Reconnect backoff cap | 60 seconds | Yes | Prevents thundering herd from collapsing to fixed interval |
| Upload max file size | 50 GB (Business plan) | Yes | Enforced at API Gateway |
| Max devices per user | 100 | Yes | Caps notification fan-out overhead |
| Quota reconciliation interval | 15 minutes | Yes | How often the background job recomputes exact usage from DB |
| Block GC delay | 7 days after last reference | Yes | After all files referencing a block are deleted, block is eligible for S3 deletion after 7 days |
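The first two parameters above (4 MB blocks, SHA-256) define how a client turns a file into an upload manifest. A minimal sketch of that chunking step:

```python
import hashlib
import io

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB default from the table above

def block_manifest(stream, block_size=BLOCK_SIZE):
    """Split a file stream into fixed-size blocks and return the ordered
    list of SHA-256 hex hashes — the block_list the client sends in the
    upload manifest."""
    hashes = []
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes
```

Because the hash is computed purely from block content, two users uploading the same file produce identical manifests, which is exactly what makes global dedup and infinite CDN TTLs work.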
Step 17 — Runbooks #
Runbook 1: High Upload Failure Rate #
Trigger: Upload success rate drops below 99% for 5 consecutive minutes.
Diagnosis sequence:
- Check API Gateway error logs: distinguish 4xx (client errors, e.g., quota exceeded) from 5xx (service errors).
- If 5xx: check Upload Service logs for DB connection errors, S3 timeout errors, or CAS contention errors.
- If DB CAS contention high: check `pg_stat_activity` for long-running transactions blocking the `files` table.
- If S3 errors: check AWS S3 Service Health Dashboard; check S3 error rate in CloudWatch.
- If Upload Service crash: check pod restart count in Kubernetes; check OOM events.
Mitigation:
- DB contention: kill long-running queries; identify and fix the offending query.
- S3 outage: activate read-only mode (downloads continue from CDN; uploads queue locally on device).
- Upload Service crash: Kubernetes auto-restarts; clients retry with exponential backoff.
Runbook 2: Sync Propagation Latency Spike #
Trigger: P99 sync latency exceeds 10 seconds for 5 minutes.
Diagnosis sequence:
- Check Kafka consumer group lag for the `file-changes` topic. High lag = notification delivery is behind.
- Check Notification Service throughput and connection count. Dropped WebSocket connections?
- Check Sync Service queue depth. Reconnect storm?
- Check FileChangeLog read latency in Cassandra. Hot partition?
Mitigation:
- Kafka consumer lag: scale up Notification Service replicas; increase Kafka partition count.
- Reconnect storm: check if a regional incident just recovered; engage rate limiting at API Gateway.
- Cassandra hot partition: identify the hot file_id; rate-limit uploads to that file_id at Upload Service.
Runbook 3: Permission Revocation Not Propagating #
Trigger: Alert from permission audit job: grantee can still access file 5 minutes after revocation.
Diagnosis sequence:
- Verify revocation is committed in the Postgres `share_permissions` table (status = 'revoked').
- Check Redis cache entry for `(grantee_id, resource_id)`: is TTL still active?
- Check Outbox table: is the revocation event still in `pending` status (Outbox relay stalled)?
- Check Kafka: did the `perm.invalidate` message reach the file service instances?
Mitigation:
- Cache entry stale: force TTL expiry via Redis DEL on the cache key (manual intervention).
- Outbox relay stalled: restart outbox relay service; it will republish pending events.
- Kafka consumer stalled: restart notification service consumer group.
Runbook 4: Reconnect Storm After Outage #
Trigger: Sync request rate spikes to > 10x normal immediately after service recovery.
Actions (in order):
- Verify API Gateway rate limiting is active: 100 sync requests/user/minute.
- Check that clients are using exponential backoff + jitter (not fixed-interval retry).
- If Sync Service CPU > 80%: scale up horizontally (Kubernetes HPA should trigger; verify).
- If Cassandra read latency spikes due to backlog: temporarily increase read timeout; increase Cassandra read replica count.
- Monitor until request rate normalizes (typically 5-10 minutes with proper backoff).
Runbook 5: Orphaned Block Cleanup #
Trigger: Block GC job reports orphaned blocks (blocks in S3 with reference_count = 0 for > 7 days).
Actions:
- Verify the `reference_count` calculation: run a reconciliation query across `files.block_list` and the `blocks` table.
- If reference_count is correct (truly unreferenced): schedule S3 deletion via lifecycle policy.
- Log block hashes before deletion for audit trail.
- Do NOT delete blocks with reference_count > 0 (even if files are in `deleted` state — blocks may be referenced by FileVersions for restore purposes).
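The reconciliation step in this runbook can be sketched as below. It is a sketch under the assumption that the caller passes the block lists of *all* files and FileVersions (per the warning above, versions keep blocks alive) and a snapshot of the `blocks` table as a hash → reference_count map:

```python
from collections import Counter

def reconcile_reference_counts(block_lists, stored_counts):
    """Recompute block reference counts from block_lists (files AND
    file_versions) and diff against the blocks table snapshot.

    Returns (corrections, orphans): counts that need fixing, and hashes
    whose true count is 0 — the only ones eligible for GC."""
    true_counts = Counter(h for bl in block_lists for h in bl)
    corrections = {h: true_counts.get(h, 0)
                   for h, stored in stored_counts.items()
                   if stored != true_counts.get(h, 0)}
    orphans = [h for h in stored_counts if true_counts.get(h, 0) == 0]
    return corrections, orphans
```

Only hashes in `orphans` proceed to the 7-day GC delay; a nonzero recomputed count always wins over a stale stored counter.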
Step 18 — Observability #
Metrics #
| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| `upload.requests.total` | Counter | status (success/failure), error_type | — |
| `upload.latency.seconds` | Histogram | percentile | P99 > 2s for 5min |
| `upload.bytes.total` | Counter | — | — |
| `block.dedup.ratio` | Gauge | — | < 0.3 (unexpectedly low dedup) |
| `sync.propagation.latency.seconds` | Histogram | — | P99 > 10s for 5min |
| `kafka.consumer.lag` | Gauge | consumer_group, topic | > 10000 messages |
| `permission.cache.hit_rate` | Gauge | — | < 0.95 for 5min |
| `conflict.open.count` | Gauge | — | > 1000 open conflicts |
| `quota.check.failures.total` | Counter | — | > 100/min |
| `db.connection_pool.waiting` | Gauge | db_name | > 10 waiting |
| `s3.error.rate` | Gauge | operation | > 0.01 (1%) |
| `device.reconnect.rate` | Gauge | — | > 10x baseline |
Traces #
- Every upload request carries a trace ID from client through Upload Service → S3 → DB commit → Outbox.
- Every sync event carries the FileChangeLog sequence_number as a trace tag.
- Conflict detection traces: version vector comparison and conflict creation are traced as a single span.
Logs #
| Log Event | Fields | Purpose |
|---|---|---|
| `upload.committed` | file_id, version_number, user_id, device_id, block_count, new_block_count, duration_ms | Audit + dedup stats |
| `conflict.detected` | file_id, device_id, device_vv, server_vv, conflict_id | Conflict rate monitoring |
| `conflict.resolved` | conflict_id, resolution, resolved_by | Resolution tracking |
| `permission.granted` | permission_id, grantor, grantee, resource_id, access_level | Access audit |
| `permission.revoked` | permission_id, grantor, grantee, resource_id, revoked_at | Access audit |
| `quota.exceeded` | user_id, current_bytes, limit_bytes, file_size_bytes | Quota enforcement |
| `sync.resumed` | device_id, cursor_from, cursor_to, events_replayed | Sync health |
Dashboards #
- Upload health: request rate, success rate, P50/P95/P99 latency, block upload rate, dedup ratio.
- Sync health: propagation latency, Kafka consumer lag, reconnect rate, conflict rate.
- Storage: total blocks stored, bytes stored, dedup savings (bytes not uploaded due to dedup), orphaned blocks.
- Access control: permission grant/revoke rate, cache hit rate, revocation propagation latency.
- Infrastructure: DB connection pool, Cassandra read/write latency, Redis memory usage, S3 error rate.
Step 19 — Cost Model #
Cost Drivers #
| Component | Unit Cost (approximate) | Volume (hypothetical 10M users) | Monthly Cost Estimate |
|---|---|---|---|
| S3 block storage | $0.023/GB/month | 10M users × 2GB average = 20PB; with 50% dedup = 10PB | $230,000 |
| S3 PUT requests (block uploads) | $0.005 per 1000 PUT | 10M uploads/day × 20 blocks avg = 200M PUT/day = 6B/month | $30,000 |
| S3 GET requests (block downloads) | $0.0004 per 1000 GET | 50M downloads/day × 20 blocks avg = 1B GET/day = 30B/month | $12,000 |
| Cloudflare CDN | ~$0.01/GB (egress) | 30B GET/month; CDN hit rate 80%; 20% from S3 = 6B × 20KB avg = 120TB egress from S3 | $1,200 (S3 egress) + Cloudflare plan |
| PostgreSQL (RDS Multi-AZ db.r6g.4xlarge) | ~$1,200/month per instance | 2 instances (primary + replica) | $2,400 |
| Cassandra (3 nodes × i3.4xlarge) | ~$1,200/month per node | 3 nodes in US-EAST-1; 3 in US-WEST-2 | $7,200 |
| Redis (ElastiCache r6g.xlarge) | ~$250/month | 2 instances (primary + replica) | $500 |
| Kafka (MSK 3 brokers × kafka.m5.2xlarge) | ~$400/month per broker | 3 brokers | $1,200 |
| Compute (ECS/Kubernetes, Upload + Sync + Share + Notification services) | ~$0.05/vCPU-hour | 50 vCPU average | $1,800 |
| Data transfer (S3 to EC2, same region) | Free | — | $0 |
| Total (estimated) | | | ~$286,000/month |
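The table's totals are straightforward to check; the arithmetic below reproduces the hypothetical figures (decimal units: 1 PB = 10^6 GB, per the $230,000 storage line):

```python
# Line items from the cost table above (monthly, USD).
stored_gb = 10_000_000            # 20 PB raw, 50% dedup -> 10 PB stored
line_items = {
    "s3_storage": stored_gb * 0.023,  # $230,000
    "s3_put": 30_000,
    "s3_get": 12_000,
    "s3_egress": 1_200,
    "postgres_rds": 2_400,
    "cassandra": 7_200,
    "redis": 500,
    "kafka_msk": 1_200,
    "compute": 1_800,
}
total = sum(line_items.values())
print(round(total))  # 286300 -> ~$286,000/month
```

Storage dominates at ~80% of the bill, which is why the dedup ratio is called out below as the biggest optimization lever.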
Cost Optimization Levers #
- Dedup ratio is the biggest lever: Global cross-user block dedup at 4MB block size typically achieves 40-60% storage reduction for enterprise document workloads. Each percentage point of the stored footprint (1% of 10 PB ≈ 100 TB) is worth about $2,300/month at this scale.
- CDN cache hit rate: Blocks are content-addressed and immutable → perfect CDN cache invalidation (never invalidate; TTL = infinite). High cache hit rate dramatically reduces S3 GET requests.
- S3 Intelligent-Tiering: Blocks not accessed for 30 days automatically move to cheaper storage tiers ($0.0125/GB vs. $0.023/GB).
- Cassandra compression: Native LZ4 compression shrinks the FileChangeLog to roughly a third of its raw size.
- Right-size Cassandra: At lower scales, Cassandra can be replaced with Postgres (fewer moving parts; lower operational cost).
Step 20 — Evolution Stages #
Stage 1: MVP (0 → 100K users) #
What to build:
- Single-region deployment (US-EAST-1).
- Upload Service + Sync Service as a monolith.
- PostgreSQL for everything (files, file_versions, file_change_log, share_permissions).
- S3 for block storage.
- No real-time push: device polls every 30 seconds.
- Basic conflict detection: last-write-wins (simpler; document the caveat to users).
- No global block dedup: per-user dedup only.
What to defer:
- Cassandra (Postgres handles the write load at this scale).
- Kafka (polling is sufficient).
- Global dedup (per-user dedup is simpler and sufficient).
- Redis (permissions checked directly from Postgres on every request).
- Multi-region.
Graduation criteria: 100K users, upload P99 < 1s, sync delay < 30s (polling interval).
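The Stage-1 polling model can be sketched in a few lines. Here `fetch` and `apply` are hypothetical client callbacks standing in for the HTTP call against a change-log endpoint and the local-state mutation, respectively:

```python
import time

POLL_INTERVAL_S = 30  # Stage-1 sync delay bound (the polling interval)

def poll_changes(fetch, apply, cursor=0):
    """One Stage-1 sync iteration: pull change-log events after `cursor`,
    apply each delta to local device state, and return the advanced cursor.
    `fetch(cursor)` and `apply(event)` are hypothetical client callbacks."""
    events = fetch(cursor)              # e.g. GET /changes?after=<cursor>
    for event in events:
        apply(event)                    # apply delta to local state
        cursor = event["offset"]        # advance only after apply succeeds
    return cursor

# Usage: the device loop simply repeats with a fixed sleep.
# while True:
#     cursor = poll_changes(fetch, apply, cursor)
#     time.sleep(POLL_INTERVAL_S)
```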
Stage 2: Growth (100K → 1M users) #
Adds:
- Separate Upload, Sync, Share, and Notification services (decompose monolith by service boundary).
- Redis for permission cache (reduce Postgres read load).
- WebSocket-based push notifications (reduce sync latency from 30s to < 5s).
- Global block dedup (same content hash = skip upload globally).
- Idempotency key store for upload retries.
- Outbox + Relay for async event propagation.
- Full version vector conflict detection (replace last-write-wins).
Graduation criteria: 1M users, upload P99 < 500ms, sync latency P99 < 5s, conflict detection working.
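The version-vector comparison that replaces last-write-wins can be sketched as follows; this is a minimal version assuming vectors are plain `device_id → counter` maps:

```python
# Version-vector comparison for Stage-2 conflict detection. Concurrent
# offline edits on two devices produce vectors where neither dominates
# the other, which is exactly the case that spawns a Conflict object.

def compare(a: dict, b: dict) -> str:
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict'."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"    # divergent offline edits → create Conflict object
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

print(compare({"laptop": 2, "phone": 1}, {"laptop": 1, "phone": 2}))
# -> conflict
```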
Stage 3: Scale (1M → 50M users) #
Adds:
- Cassandra for FileChangeLog and FileVersions (write throughput beyond Postgres single-shard capacity).
- Kafka for all async event propagation (replace polling-based Outbox with Kafka consumers).
- Postgres sharding by user_id (horizontal scale for file metadata).
- Multi-region active-active (US-EAST-1 + EU-WEST-1; data sovereignty for EU users).
- Reconnect storm protection (exponential backoff enforcement + API Gateway rate limiting).
- S3 Intelligent-Tiering for cost optimization.
- CDN integration for block downloads.
- Quota reconciliation background job (async exact computation).
Graduation criteria: 50M users, upload availability 99.9%, sync propagation P99 < 5s globally.
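Reconnect storm protection relies on clients spreading their retries rather than reconnecting in lockstep after an outage. A minimal full-jitter backoff sketch (the base and cap values are illustrative, not the system's actual parameters):

```python
import random

def reconnect_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Full-jitter exponential backoff for WebSocket reconnects: the delay
    window doubles with each attempt up to a cap, and the actual delay is
    drawn uniformly from it so a fleet of devices does not hammer the API
    Gateway simultaneously (reconnect storm protection)."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))
```

Server-side rate limiting at the API Gateway remains the backstop; client-side jitter just keeps well-behaved devices from synchronizing their retries.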
Stage 4: Maturity (50M+ users) #
Adds:
- Differential sync at sub-block level (rsync-style rolling checksum for large files with small edits — reduce block upload count further).
- CRDT-based collaborative editing for specific file types (Google Docs-style; Conflict object becomes less necessary for text files).
- Tiered storage with user-controlled cold archive (files not accessed for 1 year move to Glacier with restore-on-demand).
- Zero-knowledge encryption option (client-side encryption; server stores ciphertext; server cannot read blocks; dedup becomes per-user only, not global, since ciphertext of same plaintext differs per user key).
- Federated deployment for enterprise (on-premises Dropbox Business with same protocol; client handles sync identically).
Trade-off noted for zero-knowledge encryption: Global block dedup (Step 5’s U1 invariant) is incompatible with per-user client-side encryption keys. If user A and user B both encrypt the same file, the resulting blocks differ (different keys → different ciphertext → different hashes). Global dedup collapses to zero. The system must choose: dedup (store plaintext on server) or zero-knowledge (sacrifice global dedup). This is a product decision, not a technical limitation.
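The incompatibility can be demonstrated in a few lines. HMAC stands in here for real client-side encryption (e.g. AES-GCM); it is enough to show that identical plaintext under different user keys yields different content addresses, so cross-user dedup finds nothing to collapse:

```python
import hashlib
import hmac

def block_id(user_key: bytes, plaintext: bytes) -> str:
    """Content address of an encrypted block. A keyed HMAC stands in for
    real client-side encryption: same plaintext, different user keys →
    different ciphertext → different block hashes."""
    return hmac.new(user_key, plaintext, hashlib.sha256).hexdigest()

doc = b"identical quarterly report"
# Users A and B upload the same file, each encrypted with their own key:
print(block_id(b"user-a-key", doc) == block_id(b"user-b-key", doc))  # False
```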