Dropbox: System Design #
Derived using the 20-step derivation framework. Every step produces an explicit output artifact. No hand-wavy steps.
Ordering Principle #
Product requirements (upload, sync, share)
→ normalize into operations over state (Step 1)
→ extract primary objects (Step 2)
→ assign ownership, ordering, evolution (Step 3)
→ extract invariants (Step 4)
→ derive minimal DPs from invariants (Step 5)
→ select concrete mechanisms (Step 6)
→ validate independence and source-of-truth (Step 7)
→ specify exact algorithms (Step 8)
→ define logical data model (Step 9)
→ map to technology landscape (Step 10)
→ define deployment topology (Step 11)
→ classify consistency per path (Step 12)
→ identify scaling dimensions and hotspots (Step 13)
→ enumerate failure modes (Step 14)
→ define SLOs (Step 15)
→ define operational parameters (Step 16)
→ write runbooks (Step 17)
→ define observability (Step 18)
→ estimate costs (Step 19)
→ plan evolution (Step 20)
Step 1 — Problem Normalization #
Goal: Convert product-language requirements into precise operations over state.
| Original Requirement | Actor | Operation | State Touched |
|---|---|---|---|
| User uploads a file | User | create-or-overwrite | File (block_list, version); Block (content, hash) |
| User downloads a file | Client | read | File (block_list); Block (content by hash) |
| File auto-syncs across all user devices | System | read change log + apply delta | FileChangeLog (events); SyncCursor (offset per device); Device local state |
| Conflict resolution when offline edits diverge | System | detect divergence + create Conflict object | FileVersion (version vectors); Conflict (process object) |
| User shares a file/folder with another user | User | create relationship | SharePermission (grantee, resource, access level) |
| Shared user accesses the file | Grantee | eligibility check + read | SharePermission; File; Block |
| View version history of a file | User | read projection | FileVersion (append-only log) |
| Delta transfer (don’t re-upload unchanged bytes) | Client | compute local hashes, read missing list, upload only missing blocks | Block (content-addressed); File (block_list diff) |
| Work offline, sync on reconnect | Client | buffer local events offline; replay on reconnect | FileChangeLog (local event buffer); SyncCursor (replayed from offset) |
| Revoke share permission | Owner | delete relationship | SharePermission |
| Delete a file | User | state transition | File (active → deleted); FileChangeLog (deletion event appended) |
| Restore a deleted file | User | state transition + overwrite | File (deleted → active, block_list restored from FileVersion) |
Hidden write exposures:
- “Auto-sync” is not a simple read. It requires the system to track which device has seen which change (SyncCursor is an offset), detect delta (which blocks changed), and apply the delta to local state. Three operations, not one.
- “Conflict resolution” hides a process object (Conflict) with its own state machine. The conflict is created when divergence is detected, then resolved (manually or automatically) into a new FileVersion.
- “Version history” is a derived view over the FileVersion append-only log — not primary state to be designed independently.
Step 2 — Object Extraction #
Goal: Identify the minimal set of primary state objects. Apply all four purity tests.
Primary Objects #
| Object | Class | Justification |
|---|---|---|
| File | Stable entity | Long-lived, has identity (file_id), evolves via block_list overwrites and state transitions (active/deleted) |
| Block | Event | Immutable once stored; content-addressed by SHA-256(content); never mutated |
| FileVersion | Event | Immutable snapshot of a file’s block_list at a point in time; append-only |
| FileChangeLog | Event stream | Append-only log of all mutations to a File (upload, delete, restore, conflict-resolve) |
| Device | Stable entity | Tracks per-device presence and sync state |
| Conflict | Process object | Created when version vectors diverge; has state machine (open → resolved); must persist across the resolution lifecycle |
| SharePermission | Relationship object | An edge between a User and a File/Folder with access level and its own lifecycle (active → revoked) |
| User | Stable entity | Account identity, quota, plan |
Derived / Rejected Objects #
| Candidate | Problem | Disposition |
|---|---|---|
| SyncCursor | Derivable: it is a pointer (offset) into FileChangeLog per device. If stored as mutable primary state alongside the log, there is dual truth — the log AND the cursor both describe “what this device has seen.” Correct: SyncCursor is a materialized offset bookmark, not primary state. | Derived view — offset into FileChangeLog per (device_id, file_id) |
| VersionHistory | Derivable from FileVersion records filtered by file_id, ordered by created_at | Derived projection |
| StorageQuotaUsed | Derivable from File records owned by user × block sizes | Derived projection (cached) |
| FolderContents | Derivable from File records with parent_folder_id | Derived projection |
Four Purity Tests per Object #
File #
- Ownership purity: Written by the owning user (uploads, deletes, restores) and by the conflict-resolution path. These are distinct operations with distinct guards — ownership is clear. ✓
- Evolution purity: Overwrite of block_list on each new upload version; state machine for active/deleted. These are on different fields and different guards. The split into File (current state) + FileVersion (history log) keeps each pure. ✓
- Ordering purity: FileVersions are totally ordered by version number within file_id. ✓
- Non-derivability: The current block_list of a File cannot be derived without knowing which version is current — that pointer lives in File. ✓
Block #
- Ownership purity: Written by the upload service, never modified, never deleted (content-addressed storage). Single writer under append semantics. ✓
- Evolution purity: Append-only. Once a block with a given hash exists, it never changes. ✓
- Ordering purity: No meaningful ordering — blocks are a content-addressed set. Hash is the identity. ✓
- Non-derivability: Block content cannot be reconstructed without the actual bytes. ✓
Conflict #
- Ownership purity: Created by the sync service (when divergence detected); resolved by user or auto-resolver. Two distinct writers, but at distinct lifecycle phases — creation is system-only, resolution is user-or-system. ✓
- Evolution purity: State machine: detected → open → resolved. Each transition has a guard. ✓
- Ordering purity: Causal lifecycle order — transitions must follow valid paths. ✓
- Non-derivability: A Conflict is not derivable from FileVersions alone. The Conflict object carries user choice, resolution metadata, and the resulting FileVersion reference. ✓
SharePermission #
- Ownership purity: Written by the file/folder owner (grant, revoke). Single writer per permission. ✓
- Evolution purity: State machine: active → revoked. Revocation is terminal. ✓
- Ordering purity: No meaningful ordering within a grantee’s permission set. ✓
- Non-derivability: Access rights cannot be derived from File metadata alone. ✓
Step 3 — Axis Assignment #
Goal: For every primary object, define ownership, evolution, and ordering (bound to scope).
Object: File
Ownership: Multi-writer, one winner per file_id (multiple devices of the same user may write; CAS ensures only one write wins per version)
Evolution: Overwrite (block_list replaced on new version); State machine (active/deleted)
Ordering: Total order on version_number within file_id
Object: Block
Ownership: Single writer (upload service); content-addressed so identity is the hash
Evolution: Append-only (immutable once stored)
Ordering: No meaningful order (set semantics; hash = identity)
Object: FileVersion
Ownership: System-only (created by upload service and conflict resolver; users never write directly)
Evolution: Append-only (immutable snapshot)
Ordering: Total order by version_number within file_id
Object: FileChangeLog
Ownership: System-only (written by upload, delete, restore, conflict-resolve services)
Evolution: Append-only (events are immutable)
Ordering: Total order by sequence_number within file_id
Object: Device
Ownership: Single writer per device_id (device registers itself; service updates last_seen)
Evolution: Overwrite (last_seen, sync_offset fields updated in place)
Ordering: No meaningful order across devices
Object: Conflict
Ownership: Multi-writer across lifecycle phases: system creates, user or auto-resolver resolves
Evolution: State machine (detected → open → resolved)
Ordering: Causal lifecycle order (transitions must follow valid paths)
Object: SharePermission
Ownership: Single writer (the file/folder owner)
Evolution: State machine (active → revoked)
Ordering: No meaningful order within a grantee's permission set
Object: User
Ownership: Single writer (user self-writes profile; billing system writes quota)
Evolution: Overwrite (mutable fields: name, email, plan, quota_used_bytes)
Ordering: No meaningful order
Circuit topology insight: The Dropbox sync system is a transmission line. Think of two capacitors (local device state, server state) connected through a sync medium. File changes are charge flowing to equalize. Delta sync is minimizing the charge transfer needed. A conflict is what happens when both capacitors are charged differently during isolation (offline period) — they cannot simply merge charge; the system must detect divergence and arbitrate.
Step 4 — Invariant Extraction #
Goal: Convert requirements into precise, testable invariants. These are implementation-independent.
Eligibility Invariants #
E1 — Upload eligibility: A user may upload a file only if their quota_used_bytes + new_file_bytes ≤ quota_limit_bytes.
E2 — Access eligibility: A user may read a file only if: (a) they own the file, OR (b) there exists a SharePermission record where grantee_id = user_id AND resource_id covers the file AND status = active AND access_level ∈ {read, write}.
E3 — Write eligibility on shared file: A user may upload a new version of a file they do not own only if: there exists a SharePermission where access_level = write AND status = active.
E4 — Delete eligibility: Only the owner may delete a file. Grantees with write access may not delete.
Ordering Invariants #
O1 — Version monotonicity: For any file_id, if version V exists, version V+1 must have created_at > V.created_at. No two FileVersions for the same file may have the same version_number.
O2 — Change log monotonicity: For any file_id, FileChangeLog sequence numbers are strictly monotonically increasing. Events are never reordered or deleted.
Accounting Invariants #
A1 — Quota consistency: User.quota_used_bytes = SUM(Block.size_bytes for all blocks reachable from active Files owned by user). This must hold after every upload and delete. (In practice, computed asynchronously; the synchronous enforcement is a pessimistic quota check at upload time against a cached counter.)
A2 — Block reference integrity: Every block_hash in any File.block_list must have a corresponding Block record in the block store. A file must never reference a block that has not been durably committed.
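The A1 enforcement strategy (pessimistic pre-check against a cached counter, synchronous deduction, asynchronous reconciliation) can be sketched in a few lines. `QuotaGate` and its field names are illustrative, not part of the design:

```python
class QuotaError(Exception):
    pass

class QuotaGate:
    """Pessimistic quota gate (E1) backed by a cached counter (A1)."""

    def __init__(self, quota_limit_bytes: int, cached_used_bytes: int = 0):
        self.quota_limit_bytes = quota_limit_bytes
        self.cached_used_bytes = cached_used_bytes  # may lag the true value

    def check(self, new_file_bytes: int) -> None:
        # E1: reject before any bytes are accepted.
        if self.cached_used_bytes + new_file_bytes > self.quota_limit_bytes:
            raise QuotaError("over quota")

    def commit(self, new_file_bytes: int) -> None:
        # Deduct synchronously on commit; over-quota risk is bounded by
        # how stale cached_used_bytes can get before reconciliation.
        self.cached_used_bytes += new_file_bytes

    def reconcile(self, true_used_bytes: int) -> None:
        # Async reconciler recomputes SUM(block sizes) and overwrites the cache.
        self.cached_used_bytes = true_used_bytes
```

The staleness window between `commit` and `reconcile` is exactly the bounded over-quota risk the DP table in Step 5 accepts.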
Uniqueness / Idempotency Invariants #
U1 — Block global dedup: There is at most one Block record for any given content_hash. If two uploads produce the same hash, only one Block is stored. Both uploads succeed, but the bytes are stored once.
U2 — Idempotent block upload: Uploading a block with the same content_hash twice must be a no-op. The second upload must not corrupt or duplicate the block.
U3 — Upload idempotency: A client may retry an upload request with the same idempotency key and receive the same result without creating duplicate FileVersions.
Propagation Invariants #
P1 — Sync completeness: If a file changes on device A and device B is online and subscribed, device B must eventually receive the change event. “Eventually” is bounded by SLO (Step 15).
P2 — Conflict detection completeness: If two devices modify the same file while one is offline, the system must detect the divergence when the offline device reconnects and create a Conflict object. The divergence must not be silently overwritten.
P3 — Deletion propagation: If a file is deleted on device A, device B must receive a deletion event. Device B must not continue to serve the file as live after the event is processed.
Access-Control Invariants #
AC1 — Permission revocation is immediate: Once a SharePermission is revoked (status = revoked), all subsequent access attempts by the grantee must be denied. No caching of permissions beyond a bounded TTL (configurable; must be ≤ 60 seconds in strong mode).
AC2 — Grantee cannot escalate: A grantee with read access cannot perform write operations. Access level enforcement is the sole authority — no capability tokens that can be forged client-side.
Step 5 — Design Point Derivation #
Goal: For each invariant cluster, derive the minimal enforcing mechanism. One DP per cluster; no over-engineering.
| Invariant Cluster | Design Point | Reasoning |
|---|---|---|
| E1 (quota), A1 (quota consistency) | Quota gate with cached counter + pessimistic check | Exact enforcement requires a distributed counter; exact counter is expensive at upload scale. Pessimistic: pre-check cached quota_used_bytes; deduct on commit; reconcile asynchronously. Over-quota risk is bounded by cache staleness window. |
| E2, E3, E4, AC1, AC2 (access control) | Permission check service with bounded-TTL cache | Permissions are read at every file access; must be fast. Cache permissions with TTL ≤ 60s for strong mode. Revocation propagates via cache invalidation event. Source of truth is SharePermission table. |
| O1 (version monotonicity), U3 (upload idempotency) | CAS on (file_id, current_version_number) + Idempotency Key store | Each upload atomically increments version_number; if CAS fails, caller retries with fresh version. Idempotency key prevents duplicate FileVersions on retry. |
| U1, U2 (block dedup + idempotency) | Content-addressed block store keyed by SHA-256 hash | Hash = identity = idempotency key. Existence check before upload eliminates duplicate bytes. No separate dedup index needed — the hash IS the key. |
| O2 (log monotonicity) | Single-partition append-only log per file_id | Sequence numbers assigned by the log; append is atomic. Consumers read from offset. |
| P1 (sync propagation) | Change event stream + per-device subscription + push notification | FileChangeLog events are published; devices subscribe; server pushes or client polls from stored offset. |
| P2 (conflict detection) | Version vector per (file_id, device_id); divergence detection on reconnect | Each device tracks a version vector. On reconnect, server compares device’s vector against server’s vector. Divergence → Conflict object created. |
| P3 (deletion propagation) | Deletion event in FileChangeLog + device processes event | Deletion is an event appended to the log. All subscribed devices receive it and mark local copy as deleted. Soft-delete on server (retain block_list for restore); hard-delete after TTL or explicit purge. |
| AC1 (revocation immediacy) | Cache invalidation event on revocation + TTL backstop | On revoke, publish invalidation to permission cache. TTL ≤ 60s ensures stale cache expires even if invalidation is lost. |
| A2 (block reference integrity) | Commit blocks first, then commit file metadata | A block must exist before a File references it. Write blocks first; write file metadata only after all blocks are confirmed durable. This is a sequencing invariant, not a 2PC invariant; no two-phase commit is needed. |
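The A2 sequencing rule (blocks become durable before any metadata references them) can be sketched with in-memory stand-ins. `block_store` and `metadata_db` are hypothetical placeholders for the content-addressed store and the metadata database:

```python
import hashlib

block_store: dict = {}   # stand-in for the content-addressed block store
metadata_db: dict = {}   # stand-in for file metadata: file_id -> block_list

def commit_file(file_id: str, blocks: list) -> list:
    # Phase 1: write every block first. Idempotent: hash is the key (U1/U2).
    hashes = []
    for content in blocks:
        h = hashlib.sha256(content).hexdigest()
        block_store.setdefault(h, content)
        hashes.append(h)
    # Phase 2: only after all blocks are durable, commit the metadata (A2).
    for h in hashes:
        assert h in block_store, "A2 violated: dangling block reference"
    metadata_db[file_id] = hashes
    return hashes
```

If the process crashes between the two phases, the result is an orphaned block (garbage to collect later), never a file pointing at missing bytes.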
Step 6 — Mechanism Selection #
Goal: Mechanical bridge from invariant + axis → concrete implementation mechanism. Apply the full derivation table for four key paths.
6.1 Invariant Type → Mechanism Family #
| Invariant Type | Mechanism Family |
|---|---|
| Eligibility | Guard + atomic state check |
| Ordering | Sequence number + CAS |
| Accounting | Pessimistic counter + async reconciliation |
| Uniqueness/Idempotency | Content hash as key + existence check |
| Propagation | Event stream + subscription |
| Access-control | Permission store + TTL cache + invalidation |
6.2 Ownership × Evolution → Concurrency Mechanism #
| Object | Ownership | Evolution | Table Lookup Result |
|---|---|---|---|
| File | Multi-writer, one winner | Overwrite + state machine | CAS on (file_id, version_number) |
| Block | Single writer | Append-only | No concurrency mechanism needed; hash prevents collision by design |
| FileVersion | System-only | Append-only | Idempotency key on (file_id, version_number) |
| Conflict | Multi-writer across lifecycle phases | State machine | CAS on (conflict_id, status) |
| SharePermission | Single writer | State machine | CAS on (permission_id, status) |
6.3 Mechanical Derivation — Four Key Paths #
Path A: File Write Conflict Detection #
Invariant driving this: P2 (divergence must be detected) + O1 (version monotonicity).
Ownership × Evolution: File is multi-writer (multiple devices of same user may write while offline), overwrite + state machine.
Table lookup: Multi-writer + overwrite → CAS on version.
Q1 (scope): Divergence spans multiple devices (cross-service/cross-process). Not within a single service. But it is also not cross-region in the primary case. Scope = cross-device within one user’s account. The conflict is detected at the server on reconnect. Mechanism: CAS on server-side version vector, not distributed 2PC (devices do not coordinate with each other).
Q2 (failure): What if the device crashes mid-upload? → Idempotency Key. What if the network partitions during upload? → CAS detects stale version on retry; client must re-read current state.
Q3 (data): Version vectors are not commutative in the way CRDT requires (concurrent overwrites of a file are not mergeable without user intent). Content-addressed blocks help detect what changed but not resolve which change wins. → CAS + version vector.
Q4 (access): Reads » Writes (most devices read the synced state; conflicts are rare). Current state is read on every sync. → CQRS-lite: write path updates the version vector; read path serves from a read replica.
Q5 (coupling): The Conflict object must be created when divergence is detected — this is an async-guaranteed propagation from the sync service to the conflict service. → Outbox + Relay pattern. The sync service writes a “conflict_detected” event to its outbox table in the same transaction as updating the file’s version vector; a relay publishes it to the conflict service.
Required combination: CAS + Idempotency Key always (Step 6.4 rule).
Resulting mechanism: Version vector per (file_id, device_id) stored in FileChangeLog. On upload from device D, server computes current version vector. If device D’s local version vector and the server’s vector are concurrent (neither dominates), a Conflict is created via Outbox + Relay. If device D’s vector dominates (no offline divergence), the upload proceeds with CAS on version_number.
Conflict detection algorithm sketch:
function detect_conflict(device_id, file_id, device_vv):
    server_vv = read_version_vector(file_id)
    if dominates(device_vv, server_vv):
        # device is ahead — fast-forward server
        return ACCEPT
    elif dominates(server_vv, device_vv):
        # server is ahead — device needs to pull
        return STALE
    else:
        # concurrent — conflict
        return CONFLICT
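The `dominates` comparison the sketch relies on can be made concrete. A minimal runnable version, assuming a version vector is a map from device_id to a per-device counter:

```python
def dominates(vv_a: dict, vv_b: dict) -> bool:
    """vv_a dominates vv_b iff every counter in vv_a is >= its counterpart."""
    keys = set(vv_a) | set(vv_b)
    return all(vv_a.get(k, 0) >= vv_b.get(k, 0) for k in keys)

def detect_conflict(device_vv: dict, server_vv: dict) -> str:
    if dominates(device_vv, server_vv):
        return "ACCEPT"    # device is ahead (or equal); fast-forward server
    if dominates(server_vv, device_vv):
        return "STALE"     # server is ahead; device must pull
    return "CONFLICT"      # concurrent: neither dominates
```

Note that equal vectors land in ACCEPT, which is harmless: fast-forwarding to an identical state is a no-op.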
Path B: Block Dedup #
Invariant driving this: U1 (global dedup — at most one block per hash) + U2 (idempotency of block upload).
Ownership × Evolution: Block is single writer, append-only. But it is a global dedup — across all users.
Q1 (scope): Global dedup is cross-user. The block store is a single global namespace keyed by content_hash. Scope: global, single-writer-per-key (two clients uploading the same block race, but both should succeed idempotently).
Q2 (failure): If the uploading client crashes mid-block-upload, the block is incomplete. The client must retry. → Idempotency Key = content_hash. The server checks: does a block with this hash exist? If yes, skip upload. If no (or partial), upload resumes from offset (resumable upload protocol).
Q3 (data): Blocks are content-addressed. Content-addressed → hash = natural idempotency key (Q3 rule from framework). This eliminates the need for a separate idempotency key store for block uploads. The hash IS the key.
Q4 (access): Block uploads (writes) are less frequent than block downloads (reads), but block existence checks happen on every upload attempt for every block in the manifest. → Content-addressed object store with O(1) existence check (HTTP HEAD on S3 key = hash).
Q5 (coupling): Block existence must be confirmed before File metadata references it (A2). This is an in-transaction-sequencing coupling, not a saga. Write block first; commit file metadata only after block is durable. No async coupling needed here.
Resulting mechanism: S3 (or equivalent) with key = SHA-256 hex of block content. Upload protocol:
- Client computes SHA-256(block_content) locally.
- Client sends manifest (list of hashes) to server.
- Server performs bulk existence check (S3 HEAD per hash, or batched lookup in a block index table).
- Server returns list of missing hashes.
- Client uploads only missing blocks (PUT to S3 pre-signed URL per hash).
- Server confirms receipt (S3 HEAD again or use S3 event notification).
- Server commits File metadata with new block_list.
The dedup ratio is high for common file types. A 10MB PDF split into 4MB blocks: if block 1 is identical to a block already stored by another user, it is not re-uploaded. Cross-user dedup is a storage multiplier.
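The manifest/missing-blocks exchange in the protocol above can be sketched end to end; an in-memory set stands in for the S3 HEAD existence check, and all names are illustrative:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, as in the protocol

def compute_manifest(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    # Client-side: split into fixed-size blocks and hash each one.
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def missing_blocks(manifest: list, stored: set) -> list:
    # Server-side: return only the hashes the block store does not hold.
    # Order is preserved so the client can map hashes back to offsets.
    return [h for h in manifest if h not in stored]
```

A client then uploads only the returned hashes; re-running the exchange after a crash is safe because the hash itself is the idempotency key.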
Path C: Sync Notification #
Invariant driving this: P1 (sync completeness) + P3 (deletion propagation).
Ownership × Evolution: FileChangeLog is system-only, append-only.
Q1 (scope): Propagation from the server to many devices. This is a fan-out problem: one file change → N device notifications. Scope: cross-service (sync service → push notification service → devices).
Q2 (failure): Device is offline when event is published. → The device must catch up from its stored offset (SyncCursor = offset in FileChangeLog). The event log is durable; the device resumes from its last-processed offset on reconnect. This is the “async-guaranteed” pattern. Missing push notification is survivable because the device polls on reconnect from its offset.
Q3 (data): Events are not commutative (they must be replayed in order). → Not CRDT. The log is totally ordered per file_id (O2).
Q4 (access): Write (device uploads change) triggers fan-out read to many devices. → Fan-out on Write: when a change is committed, publish to a per-user change channel. Devices subscribed to the channel receive the notification. Devices not connected store their offset and catch up later.
Q5 (coupling): Sync notification must be async-guaranteed: the upload service must not block on device notification delivery. → Outbox + Relay: upload service writes FileChangeLog event to its own DB in the same transaction as committing the new FileVersion; a relay publishes to the notification channel (Redis pub/sub or Kafka topic).
Resulting mechanism:
- FileChangeLog stored in Cassandra (high-write append, per-file_id partition key).
- On commit, the relay publishes the event to the Kafka topic `file-changes` (partitioned by user_id for ordering).
- Notification service consumes Kafka and pushes to devices via WebSocket or long-poll.
- Device stores `sync_cursor = last_processed_sequence_number` locally and in the Device table.
- On reconnect, the device sends its cursor; the server streams all events since that cursor from FileChangeLog.
- Reconnect storm mitigation: exponential backoff + full jitter on reconnect attempts after an outage (per the AWS Architecture Blog recommendation).
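The full-jitter backoff in the reconnect-storm bullet follows the standard formula: wait a uniform random time in [0, min(cap, base * 2^attempt)]. A minimal sketch (base and cap values are illustrative):

```python
import random

def backoff_with_full_jitter(attempt: int,
                             base: float = 0.5,
                             cap: float = 60.0) -> float:
    """Seconds to sleep before reconnect attempt N (0-indexed)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter is what actually breaks up the storm: without the random component, all devices that lost connectivity at the same instant would retry in synchronized waves.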
Path D: Share Permission Enforcement #
Invariant driving this: E2, E3, E4 (access eligibility) + AC1 (revocation immediacy) + AC2 (no escalation).
Ownership × Evolution: SharePermission is single writer per record, state machine (active → revoked).
Q1 (scope): Permission check is within-service (the file service checks permissions before serving content). Cross-service for access from different tools (e.g., a linked app). Within the core service: permission check on every read/write request.
Q2 (failure): What if the permission service is slow or unavailable? → Circuit Breaker: if the permission service is down, fail closed (deny access), not open (allow all). This preserves AC1 and AC2 under degradation.
Q3 (data): Permissions are read » written (many reads per grant/revoke event). → Cache-Aside: cache SharePermission records in Redis keyed by (grantee_id, resource_id). TTL ≤ 60s for strong mode. On revocation, publish a cache invalidation event (via Kafka or Redis pub/sub) to all file service instances.
Q4 (access): read » write → Cache-Aside is correct. The cache entry is a materialized permission check result: {can_read: true, can_write: false}.
Q5 (coupling): Revocation must propagate with bounded delay (≤ 60s, per AC1). This is async-guaranteed propagation. → Outbox + Relay for revocation events: SharePermission service writes revocation event to outbox; relay publishes to cache invalidation channel; all file service instances invalidate their local cache entry for that (grantee_id, resource_id) pair.
CAS + Lease requirement (Step 6.4):
- CAS on SharePermission.status: the revocation is a CAS from `active` to `revoked`. Prevents double-revocation or a race between grant and revoke.
- No lease needed here because the single-writer invariant holds (only the owner writes).
Resulting mechanism:
- PostgreSQL `share_permissions` table with a `status` column.
- Revocation: `UPDATE share_permissions SET status='revoked', revoked_at=NOW() WHERE permission_id=? AND status='active'` (CAS-equivalent via conditional update).
- Cache: Redis hash `perm:{grantee_id}:{resource_id}` → `{can_read, can_write, ttl}`.
- Cache invalidation: on revocation, write to the outbox; the relay publishes `perm.invalidate:{grantee_id}:{resource_id}` to Redis pub/sub; all file service instances subscribe and delete their cache entry.
- Circuit Breaker: if Redis is unavailable, fall back to a direct DB query. If the DB is also unavailable, fail closed.
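A minimal sketch of the cache-aside check with the fail-closed rule; `cache` and `db` are in-memory stand-ins for Redis and Postgres, and all names are hypothetical:

```python
def can_access(grantee_id, resource_id, op, cache, db) -> bool:
    """op is 'read' or 'write'. Fail closed on any backend failure (AC1/AC2)."""
    key = (grantee_id, resource_id)
    perm = cache.get(key)
    if perm is None:
        try:
            perm = db[key]          # cache miss: read the source of truth
        except KeyError:
            return False            # no grant exists
        except Exception:
            return False            # DB unavailable: deny, never allow-all
        cache[key] = perm           # populate cache (TTL omitted in sketch)
    if perm.get("status") != "active":
        return False                # revoked grants deny even if cached
    return perm.get("can_" + op, False)
```

The revocation invalidation event corresponds to deleting `key` from `cache`; the TTL (not modeled here) is the backstop if that event is lost.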
6.4 Required Combinations Applied #
| Path | CAS Required? | Idempotency Key Required? | Lease Required? | Fencing Token? |
|---|---|---|---|---|
| File write conflict | YES — CAS on version_number | YES — upload idempotency key | NO (no crash-holder scenario) | NO |
| Block dedup | NO (hash is the identity; existence check is idempotent by design) | YES — content_hash IS the idempotency key | NO | NO |
| Sync notification | NO | YES — sequence_number prevents duplicate processing | NO | NO |
| Share permission | YES — conditional update on status | YES — idempotency key on grant/revoke request | NO | NO |
Step 7 — Axiomatic Validation #
Goal: Source-of-truth table. No dual truth.
| State Question | Source of Truth | Notes |
|---|---|---|
| What bytes does this file contain? | Block store (S3), keyed by SHA-256 hash | Block content is immutable; hash is the identity |
| What is the current block_list of a file? | files table, block_list column (Postgres) | Updated on every new version via CAS |
| What versions has this file had? | file_versions table (Cassandra or Postgres append-only) | Immutable append log; never updated in place |
| What events have happened to this file? | file_change_log table (Cassandra, per file_id partition) | Immutable event log; source of truth for sync |
| What has device D seen? | devices table, sync_cursor column (Postgres or Redis) | This is a bookmark, not a duplicate of the log. The log is source of truth; the cursor is a pointer. NO DUAL TRUTH. |
| Does user U have access to file F? | share_permissions table (Postgres) | Redis cache is a read cache, NOT a source of truth. Invalidated on revocation. |
| How much storage has user U used? | users table, quota_used_bytes column | Cached counter; asynchronously reconciled against actual block references |
| Is there a conflict on file F? | conflicts table (Postgres) | Conflict object is primary state with its own lifecycle |
Dual truth check:
The SyncCursor is the canonical example of dual-truth risk. If both the FileChangeLog and a separate SyncCursor table claim to be the source of truth for “what device D has seen,” updates to both must be kept in sync — which is exactly the dual-truth problem. The correct design is: FileChangeLog is the source of truth for what happened; SyncCursor is a bookmark (offset) stored in Device or a separate cursor table, pointing into the log. If the cursor is lost, it can be reset to 0 (full resync) or to a known checkpoint. The log is never lost.
Validation result: No dual truth found. Each state question has exactly one source of truth.
Step 8 — Algorithm Design #
Goal: Pseudocode for every write path and state machines.
8.1 Block Upload (Delta Sync) Algorithm #
// Client-side: compute local manifest
function compute_manifest(file_path) -> List[BlockManifestEntry]:
    blocks = split_file_into_blocks(file_path, block_size=4MB)
    manifest = []
    for block in blocks:
        hash = sha256(block.content)
        manifest.append({hash: hash, offset: block.offset, size: block.size})
    return manifest

// Client sends manifest to server
// Server checks which blocks are missing
function check_missing_blocks(file_id, manifest) -> List[hash]:
    missing = []
    for entry in manifest:
        if not block_store.exists(entry.hash):  // S3 HEAD request
            missing.append(entry.hash)
    return missing

// Client uploads only missing blocks
function upload_missing_blocks(missing_hashes, manifest):
    for hash in missing_hashes:
        block = find_block_in_manifest(manifest, hash)
        presigned_url = get_presigned_upload_url(hash)
        http_put(presigned_url, block.content)
        // Retry with exponential backoff on failure
        // Hash = idempotency key; re-uploading the same hash is safe
// Server commits new file version
function commit_file_version(file_id, user_id, device_id, manifest, idempotency_key):
    // Idempotency check (U3)
    if idempotency_store.exists(idempotency_key):
        return idempotency_store.get_result(idempotency_key)
    // CAS on version_number (O1)
    current = db.select("SELECT version_number FROM files WHERE file_id=?", file_id)
    new_version = current.version_number + 1
    block_list = [entry.hash for entry in manifest]
    // Verify all blocks exist before committing (A2)
    for hash in block_list:
        assert block_store.exists(hash), "Block not durable: " + hash
    // Atomic commit: new FileVersion + update File.block_list + append to FileChangeLog
    db.transaction():
        db.execute("""
            UPDATE files
            SET block_list=?, version_number=?, updated_at=NOW()
            WHERE file_id=? AND version_number=?
        """, block_list, new_version, file_id, current.version_number)
        // CAS: if version_number changed since read, this UPDATE affects 0 rows → retry
        if db.rows_affected == 0:
            raise ConcurrentWriteError("Retry with fresh version")
        db.execute("""
            INSERT INTO file_versions (file_id, version_number, block_list, created_at, created_by_device)
            VALUES (?, ?, ?, NOW(), ?)
        """, file_id, new_version, block_list, device_id)
        db.execute("""
            INSERT INTO file_change_log (file_id, event_type, version_number, created_at)
            VALUES (?, 'upload', ?, NOW())
        """, file_id, new_version)
        // Outbox entry for sync notification relay
        db.execute("""
            INSERT INTO outbox (event_type, payload, created_at)
            VALUES ('file_changed', ?, NOW())
        """, json({file_id, new_version, user_id}))
    idempotency_store.put(idempotency_key, {version_number: new_version})
    return {version_number: new_version}
8.2 Conflict Detection State Machine #
States: detected → open → resolved
Transitions:
NONE → detected : trigger = sync service detects concurrent version vectors
detected → open : trigger = Conflict object created in DB, user notified
open → resolved : trigger = user chooses a winner (or auto-resolver picks)
resolved → NONE : terminal (new FileVersion created, Conflict closed)
function on_device_reconnect(device_id, file_id, device_vv, device_manifest):
    server_vv = read_version_vector(file_id)
    relation = compare_version_vectors(device_vv, server_vv)
    if relation == DOMINATES:
        // Device is strictly ahead of the server — accept its version
        commit_file_version(file_id, device_manifest)
    elif relation == DOMINATED:
        // Server is ahead — tell the device to pull
        return {action: "pull", server_version: server_vv}
    elif relation == CONCURRENT:
        // Conflict: neither vector dominates the other
        conflict_id = create_conflict(file_id, device_id, device_vv, server_vv)
        // Outbox entry → notify user
        return {action: "conflict", conflict_id: conflict_id}
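The `compare_version_vectors` helper used above is referenced but never specified. A minimal sketch, assuming a version vector is represented as a map from device_id to the highest per-device sequence number (missing entries count as zero); the relation names mirror the pseudocode:

```python
# Possible relations between two version vectors.
DOMINATES = "dominates"    # first argument strictly ahead
DOMINATED = "dominated"    # second argument strictly ahead
EQUAL = "equal"            # identical histories
CONCURRENT = "concurrent"  # neither dominates -> conflict

def compare_version_vectors(a: dict, b: dict) -> str:
    """Compare version vectors a (device) and b (server).

    Each vector maps an actor id (device_id) to the highest sequence
    number observed from that actor; absent keys are treated as 0.
    """
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return CONCURRENT   # divergent offline edits -> create Conflict
    if a_ahead:
        return DOMINATES
    if b_ahead:
        return DOMINATED
    return EQUAL
```

Note that sequence numbers, not wall clocks, drive the comparison, which is what makes the clock-skew failure mode in Step 14 a non-issue here.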
function resolve_conflict(conflict_id, resolution_choice):
    // CAS on conflict status: only an 'open' conflict can be resolved
    conflict = db.select("SELECT * FROM conflicts WHERE conflict_id=? AND status='open'", conflict_id)
    if conflict is None:
        return ALREADY_RESOLVED
    if resolution_choice == KEEP_SERVER:
        winning_block_list = conflict.server_block_list
    elif resolution_choice == KEEP_DEVICE:
        winning_block_list = conflict.device_block_list
    elif resolution_choice == KEEP_BOTH:
        // Create a copy of the device version with a suffix (e.g., "file (conflicted copy).txt")
        create_conflict_copy(conflict.file_id, conflict.device_block_list)
        winning_block_list = conflict.server_block_list
    db.transaction():
        db.execute("UPDATE conflicts SET status='resolved' WHERE conflict_id=? AND status='open'", conflict_id)
        commit_file_version(conflict.file_id, winning_block_list, idempotency_key=conflict_id + "_resolve")
8.3 Sync Pull Algorithm (Device on Reconnect) #
function sync_on_reconnect(device_id, file_id):
    // Read stored cursor (last processed sequence number)
    cursor = db.select("SELECT sync_cursor FROM devices WHERE device_id=?", device_id)
    last_seq = cursor.sync_cursor ?? 0
    // Read all events since last_seq
    events = db.select("""
        SELECT * FROM file_change_log
        WHERE file_id=? AND sequence_number > ?
        ORDER BY sequence_number ASC
    """, file_id, last_seq)
    for event in events:
        apply_event_to_local_state(device_id, event)
        // Advance the cursor atomically after each event (CAS on the old value)
        db.execute("""
            UPDATE devices SET sync_cursor=? WHERE device_id=? AND sync_cursor=?
        """, event.sequence_number, device_id, last_seq)
        last_seq = event.sequence_number
function apply_event_to_local_state(device_id, event):
    if event.type == 'upload' or event.type == 'restore':
        delta = compute_block_delta(local_block_list, event.block_list)
        download_missing_blocks(delta.missing_blocks)
        update_local_file(event.block_list)
    elif event.type == 'delete':
        mark_local_file_deleted(event.file_id)
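The `compute_block_delta` step above is where delta sync pays off: because blocks are content-addressed, the device only fetches hashes it does not already hold. A minimal sketch, assuming block lists are ordered lists of hash strings:

```python
def compute_block_delta(local_block_list, remote_block_list):
    """Return which blocks the device must download and which it can reuse.

    Content addressing means membership by hash is sufficient; the order
    of remote_block_list only matters when reassembling the file.
    """
    have = set(local_block_list)
    missing, seen = [], set()
    for h in remote_block_list:
        if h not in have and h not in seen:
            missing.append(h)  # fetch each new hash exactly once
            seen.add(h)
    reused = [h for h in remote_block_list if h in have]
    return {"missing_blocks": missing, "reused_blocks": reused}
```

For a 1 GB file where one 4 MB block changed, `missing_blocks` contains a single hash, so the device downloads 4 MB instead of 1 GB.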
8.4 Quota Enforcement Algorithm #
function check_quota_before_upload(user_id, new_file_bytes):
    // Pessimistic check against the cached counter (fall back to DB on cache miss)
    current_usage = redis.get("quota:" + user_id) ?? db.select("SELECT quota_used_bytes FROM users WHERE user_id=?", user_id)
    quota_limit = db.select("SELECT quota_limit_bytes FROM users WHERE user_id=?", user_id)
    if current_usage + new_file_bytes > quota_limit:
        raise QuotaExceededError()
    // Reserve space optimistically
    redis.incrby("quota:" + user_id, new_file_bytes)
    // On upload failure: release the reservation (DECRBY)
    // On upload success: no-op (counter already incremented)
    // An async reconciliation job recomputes exact usage periodically
Step 9 — Logical Data Model #
Goal: Schema with partition keys derived from invariant scope.
Tables #
files #
CREATE TABLE files (
file_id UUID PRIMARY KEY,
owner_user_id UUID NOT NULL REFERENCES users(user_id),
parent_folder_id UUID REFERENCES folders(folder_id),
name TEXT NOT NULL,
block_list TEXT[] NOT NULL, -- ordered list of SHA-256 hashes
version_number BIGINT NOT NULL DEFAULT 1,
status TEXT NOT NULL DEFAULT 'active', -- active | deleted
size_bytes BIGINT NOT NULL,
content_hash TEXT, -- hash of full file (optional, for quick comparison)
created_at TIMESTAMPTZ NOT NULL,
updated_at TIMESTAMPTZ NOT NULL,
CONSTRAINT files_status_check CHECK (status IN ('active', 'deleted'))
);
-- Partition key for access: owner_user_id (all files by user)
-- CAS key: (file_id, version_number) for optimistic concurrency
CREATE INDEX files_owner_idx ON files(owner_user_id, status);
CREATE INDEX files_folder_idx ON files(parent_folder_id, status);
file_versions #
-- Append-only; never updated. Cassandra or Postgres.
CREATE TABLE file_versions (
file_id UUID NOT NULL,
version_number BIGINT NOT NULL,
block_list TEXT[] NOT NULL,
size_bytes BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
created_by_device UUID,
created_by_user UUID NOT NULL,
PRIMARY KEY (file_id, version_number)
);
-- Partition key: file_id (all versions of one file co-located)
-- Ordering: version_number ASC within file_id
file_change_log #
-- Append-only event log. Cassandra preferred for high-write throughput.
CREATE TABLE file_change_log (
file_id UUID NOT NULL,
sequence_number BIGINT NOT NULL, -- monotonically increasing within file_id
event_type TEXT NOT NULL, -- upload | delete | restore | conflict_resolved | shared | unshared
version_number BIGINT, -- which FileVersion this event references (if applicable)
actor_device_id UUID,
actor_user_id UUID NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
metadata JSONB,
PRIMARY KEY (file_id, sequence_number)
);
-- Partition key: file_id → all events for one file co-located, ordered by sequence_number
blocks (index table — not the block content itself) #
-- Block content stored in S3. This table is an index for existence checks and metadata.
CREATE TABLE blocks (
content_hash TEXT PRIMARY KEY, -- SHA-256 hex
size_bytes BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
reference_count BIGINT NOT NULL DEFAULT 1 -- for GC: when 0, eligible for deletion
);
devices #
CREATE TABLE devices (
device_id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users(user_id),
device_name TEXT,
platform TEXT, -- mac | windows | linux | ios | android | web
sync_cursor BIGINT NOT NULL DEFAULT 0, -- offset into file_change_log per user
last_seen_at TIMESTAMPTZ,
registered_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX devices_user_idx ON devices(user_id);
Note on SyncCursor: The sync_cursor in the devices table is a bookmark into the FileChangeLog, not a duplicate of the log. It is the device’s “read position.” The log is the source of truth; this cursor is how the device knows where to resume. If lost, resync from 0.
conflicts #
CREATE TABLE conflicts (
conflict_id UUID PRIMARY KEY,
file_id UUID NOT NULL REFERENCES files(file_id),
status TEXT NOT NULL DEFAULT 'open', -- open | resolved
server_version_number BIGINT NOT NULL,
device_id UUID NOT NULL,
device_block_list TEXT[] NOT NULL,
server_block_list TEXT[] NOT NULL,
resolution TEXT, -- keep_server | keep_device | keep_both
resolved_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL,
CONSTRAINT conflicts_status_check CHECK (status IN ('open', 'resolved'))
);
CREATE INDEX conflicts_file_idx ON conflicts(file_id, status);
share_permissions #
CREATE TABLE share_permissions (
permission_id UUID PRIMARY KEY,
resource_id UUID NOT NULL, -- file_id or folder_id
resource_type TEXT NOT NULL, -- file | folder
grantor_user_id UUID NOT NULL REFERENCES users(user_id),
grantee_user_id UUID NOT NULL REFERENCES users(user_id),
access_level TEXT NOT NULL, -- read | write | admin
status TEXT NOT NULL DEFAULT 'active', -- active | revoked
created_at TIMESTAMPTZ NOT NULL,
revoked_at TIMESTAMPTZ,
CONSTRAINT sp_status_check CHECK (status IN ('active', 'revoked')),
CONSTRAINT sp_access_check CHECK (access_level IN ('read', 'write', 'admin'))
);
CREATE INDEX sp_grantee_idx ON share_permissions(grantee_user_id, status);
CREATE INDEX sp_resource_idx ON share_permissions(resource_id, status);
idempotency_keys #
CREATE TABLE idempotency_keys (
idempotency_key TEXT PRIMARY KEY,
result JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
expires_at TIMESTAMPTZ NOT NULL -- TTL 24h
);
outbox #
CREATE TABLE outbox (
outbox_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
published_at TIMESTAMPTZ,
status TEXT NOT NULL DEFAULT 'pending' -- pending | published
);
CREATE INDEX outbox_pending_idx ON outbox(status, created_at) WHERE status='pending';
users #
CREATE TABLE users (
user_id UUID PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
plan TEXT NOT NULL DEFAULT 'free', -- free | plus | professional | business
quota_limit_bytes BIGINT NOT NULL DEFAULT 2147483648, -- 2 GB default
quota_used_bytes BIGINT NOT NULL DEFAULT 0, -- cached counter; reconciled async
created_at TIMESTAMPTZ NOT NULL
);
Step 10 — Technology Landscape #
Goal: Capability → shape → product mapping.
| Capability Needed | Shape Required | Product Selected | Reason |
|---|---|---|---|
| Block storage (immutable, content-addressed, globally durable) | Object store: key = hash, value = bytes; S3-compatible; high throughput PUT/GET | AWS S3 | Industry standard; 11-nines durability; CDN-compatible; multi-region replication; pre-signed URLs eliminate proxy hop |
| File metadata (current state, version, status) | Relational: ACID transactions, CAS via optimistic locking, foreign keys | PostgreSQL | Strong consistency; supports CAS via conditional UPDATE; row-level locking; JSONB for block_list |
| File change log + version history (high-write append, large volume, time-series) | Wide-column store: append-only, partition by file_id, time-ordered | Apache Cassandra | Optimized for append-heavy workloads; tunable consistency; partition key = file_id for co-location |
| Sync notification (real-time push to connected devices) | Pub/sub with persistence; consumer groups for multiple devices | Apache Kafka | Ordered per partition (user_id); consumer offset = device sync cursor; replay from any offset; durable |
| Permission cache (low-latency read, TTL, pub/sub invalidation) | In-memory cache with TTL and pub/sub | Redis | O(1) GET/SET; TTL support; Redis pub/sub for invalidation events; Sentinel/Cluster for HA |
| Quota counter (high-throughput increment/decrement) | In-memory atomic counter with persistence | Redis | INCRBY/DECRBY atomic; persist to Postgres on write; reconcile async |
| Conflict management (process object with state machine, strong consistency) | Relational: ACID, CAS on status | PostgreSQL | Same cluster as files; conflict references file; transactional integrity |
| Idempotency key store (short-lived, key-value, TTL 24h) | Key-value with TTL | Redis (or Postgres with TTL index) | Redis for speed; Postgres if durability is required across restarts |
| CDN (block download acceleration for frequently accessed files) | Edge cache with origin pull | Cloudflare CDN | Global PoPs; origin = S3; cache key = block hash (content-addressed = perfect cache hit rate for identical blocks) |
| Outbox relay | Background job consuming outbox table | Custom relay + Kafka producer | Polls outbox table; publishes events to Kafka; marks published |
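The outbox relay row above is the one custom component in the table. Its polling pass can be sketched as below; `db_rows` stands in for a `SELECT ... WHERE status='pending' ORDER BY created_at` result set and `publish` for the Kafka producer call, both of which are assumptions here:

```python
import json

def relay_once(db_rows, publish):
    """One polling pass of the outbox relay: publish pending rows in
    creation order, then mark each one published."""
    published = 0
    pending = sorted((r for r in db_rows if r["status"] == "pending"),
                     key=lambda r: r["created_at"])
    for row in pending:
        # A crash between publish and the status update re-publishes the
        # row on the next pass: at-least-once delivery, so consumers
        # must deduplicate (here, by sequence_number).
        publish(row["event_type"], json.dumps(row["payload"]))
        row["status"] = "published"  # UPDATE outbox SET status='published'
        published += 1
    return published
```

This is why Step 12 classifies sync event delivery as at-least-once: the relay deliberately favors duplicates over loss.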
Step 11 — Deployment Topology #
Goal: Service boundaries and failure domains.
Services #
┌────────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ Desktop (Mac/Win/Linux) Mobile (iOS/Android) Web (Browser) │
└────────────────────────────┬───────────────────────────────────────┘
│ HTTPS + WebSocket
┌────────────────────────────▼───────────────────────────────────────┐
│ API Gateway │
│ Rate limiting, auth, TLS termination, routing │
└──┬──────────┬──────────┬──────────┬───────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌────────────────┐
│Upload│ │Sync │ │Share │ │Notification │
│Svc │ │Svc │ │Svc │ │Svc (WebSocket) │
└──┬───┘ └──┬───┘ └──┬───┘ └──────┬─────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────┐
│ Data Layer │
│ │
│ PostgreSQL (files, conflicts, shares, users) │
│ Cassandra (file_change_log, file_versions) │
│ Redis (permission cache, quota counters, │
│ pub/sub, idempotency keys) │
│ Kafka (file-changes topic, sync events) │
│ S3 (block content) │
│ Cloudflare CDN (block download acceleration) │
└─────────────────────────────────────────────────┘
Service Responsibilities #
| Service | Responsibility | Failure Domain |
|---|---|---|
| Upload Service | Manifest validation, block existence check, block upload coordination, commit FileVersion, append FileChangeLog, write outbox | Stateless; horizontally scaled; failure = upload fails, client retries |
| Sync Service | Reconnect handling, version vector comparison, conflict detection, streaming events from FileChangeLog to device | Stateless; horizontally scaled; failure = device retries from stored cursor |
| Share Service | Grant/revoke permissions, check eligibility, publish invalidation events | Stateless; failure = permission operation fails, client retries |
| Notification Service | WebSocket connections to devices, receive Kafka events, push to connected devices | Stateful (WebSocket connections); failure = device falls back to poll |
| Outbox Relay | Poll outbox table, publish to Kafka, mark published | Background job; failure = events delayed, not lost |
| Conflict Resolver | Receive conflict events from Kafka, present to user, apply resolution | Stateless; failure = conflict remains open until user next opens app |
Failure Domains #
- US-EAST-1 failure: Failover to US-WEST-2 (standby). PostgreSQL primary in US-EAST-1; read replicas in US-WEST-2. S3 cross-region replication. Cassandra multi-region. Kafka MirrorMaker.
- Single service failure: Other services unaffected. Clients experience partial degradation (e.g., no push notifications if Notification Service is down, but sync still works on reconnect via Sync Service).
- Redis failure: Fall back to direct Postgres permission check. Quota counter reconciled from Postgres.
- Kafka failure: Outbox relay accumulates pending events. Devices fall back to poll-based sync.
Step 12 — Consistency Model #
Goal: Per-path classification of consistency guarantees.
| Operation Path | Consistency Model | Justification |
|---|---|---|
| Block upload (PUT to S3) | Strong read-after-write per key (since 2020); cross-region replication eventual | S3 offers strong read-after-write consistency for PUT on a single key; replication to other regions is eventual |
| File metadata commit (CAS on version_number) | Strong (serializable via Postgres transaction) | CAS requires seeing the latest version; Postgres serializable transaction ensures this |
| FileChangeLog append | Strong per partition (Cassandra quorum write + quorum read) | With R + W > RF, a quorum read sees the latest quorum write; events are ordered by sequence_number within the partition |
| Permission check (cache hit) | Eventual (bounded by TTL ≤ 60s) | Cache may be up to 60s stale after revocation; acceptable per AC1 SLO |
| Permission check (cache miss → DB) | Strong (Postgres primary read) | Cache miss goes to source of truth |
| Sync event delivery | Eventual (Kafka delivery guarantee: at-least-once) | Events may be delivered more than once; consumers deduplicate by sequence_number |
| Quota check (Redis counter) | Eventual (cached counter; reconciled async) | May over-allow up to cache staleness window; bounded over-quota by reconciliation job |
| Conflict detection | Strong (version vector comparison at server; Postgres transaction on conflict creation) | Conflict must not be silently dropped; strong consistency required |
Step 13 — Scaling Model #
Goal: Scale type per dimension, hotspots, and strategy.
Scaling Dimensions #
| Dimension | Scale Type | Hotspot Risk | Strategy |
|---|---|---|---|
| Block storage (bytes) | Horizontal partition by hash prefix | None (content-addressed; uniform distribution) | S3 auto-scales; no action needed |
| File metadata (rows) | Horizontal sharding by user_id | Popular users (celebrity accounts in B2B context) | Shard Postgres by user_id; per-shard PostgreSQL instance |
| FileChangeLog (events/sec) | Write-scale by file_id partition in Cassandra | Hot file (shared file edited frequently) | Cassandra partition by file_id; consistent hashing; increase replication factor for hot partitions |
| Upload throughput (Gbps) | Horizontal: more Upload Service instances; S3 transfer acceleration | None (direct-to-S3 bypasses proxy) | Pre-signed URLs: client uploads directly to S3, bypassing service tier entirely; service only handles manifest and commit |
| Sync notification fanout | Write fan-out to N devices | User with thousands of devices (enterprise) | Cap devices per user; batch notification; Kafka partition by user_id; Notification Service shards by user_id |
| Permission reads | Read-heavy; cache-dominated | None after cache warm-up | Redis cluster; permission cache hit rate > 99% after warm-up |
| Reconnect storm (after outage) | Thundering herd | All devices reconnect simultaneously after region recovery | Exponential backoff + full jitter on client; rate limiting at API Gateway; Sync Service queue with backpressure |
Block-Level Dedup Savings #
Global dedup across users reduces storage by an estimated 30-70% for typical enterprise workloads (common file types like PDFs, Office documents, and images with identical blocks). The dedup ratio compounds with block size: 4MB blocks capture document-level dedup; 64KB blocks capture within-document dedup but increase manifest overhead.
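The block-size trade-off above can be made concrete with a little arithmetic: each block contributes one 32-byte SHA-256 hash to the manifest, so smaller blocks multiply both block count and manifest size.

```python
def manifest_overhead(file_bytes, block_bytes, hash_bytes=32):
    """Blocks per file and manifest size (bytes) for a given block size.

    hash_bytes=32 because block_list entries are SHA-256 hashes.
    """
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks, blocks * hash_bytes

gib = 1 << 30  # a 1 GiB file
print(manifest_overhead(gib, 4 * 1024 * 1024))  # 4 MB blocks  -> (256, 8192)
print(manifest_overhead(gib, 64 * 1024))        # 64 KB blocks -> (16384, 524288)
```

A 1 GiB file needs a 8 KB manifest at 4 MB blocks but a 512 KB manifest at 64 KB blocks, which is why 4 MB is the default in Step 16 despite the coarser dedup granularity.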
Partition Strategy for FileChangeLog #
The FileChangeLog in Cassandra is partitioned by file_id. This means all events for one file are co-located, and reads for sync (streaming events from an offset) are efficient. The risk is a hot partition for a file with very high event rate (e.g., a frequently edited shared document with many collaborators). Mitigation: rate-limit writes per file_id at the Upload Service level.
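The reconnect-storm mitigation in the table above relies on exponential backoff with full jitter on the client. A sketch using the Step 16 defaults (base 1 s, cap 60 s):

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: sleep a uniform random
    duration in [0, min(cap, base * 2**attempt)] before retry `attempt`."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Full jitter (rather than `min(cap, base * 2**attempt)` alone) spreads reconnects uniformly across the window, so devices that disconnected at the same instant do not retry in synchronized waves.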
Step 14 — Failure Model #
Goal: Per failure type enumeration with detection, impact, and mitigation.
| Failure | Detection | Impact | Mitigation |
|---|---|---|---|
| Block upload fails mid-stream (network drop) | Client receives error or timeout | Partial block upload; file commit not attempted | Resumable upload: client stores upload offset locally; resumes from offset on retry; block hash = idempotency key ensures no corruption |
| File commit fails after blocks uploaded (DB crash) | Client receives 500 or timeout | Blocks uploaded to S3 but file metadata not committed; orphaned blocks | Idempotency key: client retries commit; server checks idempotency key store; if not found, re-runs commit; orphaned blocks cleaned by GC job after TTL |
| Sync service crash during sync | Client receives disconnect | Device’s sync_cursor not advanced past the crash point | Client stores cursor locally; on reconnect, resumes from last local cursor; server replays from that point |
| Kafka partition leader failure | Kafka replication detects; leader re-elected | Brief delay in sync notification delivery | Kafka auto-elects new leader; producers retry; at-least-once delivery guaranteed; consumers deduplicate |
| Redis cache failure | Health check + circuit breaker | Permission cache misses; fallback to Postgres | Circuit breaker opens; all permission reads go to Postgres primary; higher latency but correct; Redis recovers and cache warms |
| Reconnect storm after regional outage | Spike in sync request rate; API Gateway metrics | Upload Service and Sync Service CPU spike; latency increase; possible retry cascade | Exponential backoff + full jitter on client; API Gateway rate limiting per user; Sync Service autoscaling with pre-provisioned capacity |
| Ghost file in device cache (file deleted on server, still cached locally) | Deletion event in FileChangeLog not yet processed by device | Device shows deleted file as available | Deletion event is durable in FileChangeLog; device processes on reconnect; TTL on local cache entries (24h backstop); device validates file existence on open |
| Version vector clock skew (device clock wrong) | Conflict detection produces false positive | Spurious conflict object created | Use server-assigned sequence numbers for ordering; do not trust device clocks for version comparison; version vectors use sequence numbers, not wall clock |
| S3 regional outage | S3 health checks; upload failures | Uploads fail; downloads may serve from CDN cache for recently accessed blocks | CDN caches recently downloaded blocks; uploads queue locally on device and retry when S3 recovers; read-only mode for cached files |
| Database primary failure (Postgres) | pg_stat_replication lag monitoring; health check timeout | All writes fail; reads fall to replica (stale) | Automatic failover to replica (Patroni or RDS Multi-AZ); failover time ≈ 30s; uploads fail-fast during failover; clients retry |
| Conflict resolution failure (resolver crashes) | Conflict stuck in ‘open’ state | User cannot resolve conflict | Conflict remains in ‘open’ state; user can retry resolution; no data loss (both versions preserved) |
| Quota counter corruption (Redis restart) | Quota reconciliation job detects mismatch | Users may upload beyond quota limit until reconciliation | Reconciliation job runs every 15 minutes; computes exact usage from file records; corrects Redis counter and Postgres field |
Step 15 — SLOs #
| SLO | Target | Measurement |
|---|---|---|
| Upload availability | 99.9% (43.8 min/month downtime) | HTTP success rate for upload commits, measured per 5-min window |
| Upload P99 latency (small file <1MB) | < 500ms | Time from first HTTP byte to commit response |
| Upload P99 latency (large file 1GB, delta = 0 new bytes) | < 2s | Manifest check + commit without block uploads |
| Download P99 latency (cache hit at CDN) | < 50ms | Time from request to first byte of first block |
| Download P99 latency (cache miss, S3 origin) | < 500ms | Time from request to first byte, S3 origin |
| Sync propagation latency (device online, file changed on another device) | < 5s end-to-end (P99) | Time from commit on device A to event received on device B |
| Permission revocation propagation | < 60s (P99) | Time from revocation commit to all file service instances invalidating cache |
| Conflict detection latency | < 2s from reconnect to conflict object created | Time from device reconnect to conflict notification received |
| Version history availability | 99.9% | HTTP success rate for version list API |
| Data durability (uploaded blocks) | 11-nines (≥ 99.999999999%) | Inherited from S3; supplemented by cross-region replication |
Step 16 — Operational Parameters #
| Parameter | Default Value | Tunable? | Notes |
|---|---|---|---|
| Block size | 4 MB | Yes (1MB–64MB) | Larger blocks = better dedup ratio; smaller blocks = finer-grained delta; 4MB is a common sweet spot |
| Block hash algorithm | SHA-256 | No | Cryptographically secure; collision probability negligible at any realistic block count |
| Idempotency key TTL | 24 hours | Yes | After 24h, a retry of an old upload will create a new version |
| Permission cache TTL | 60 seconds (strong mode); 300 seconds (eventual mode) | Yes | Trade-off: lower TTL = faster revocation propagation; higher TTL = lower permission check latency |
| FileChangeLog retention | 180 days | Yes | After 180 days, old events are archived to cold storage (S3 Glacier) |
| Version history retention | Per plan (Free: 30 days; Plus: 180 days; Professional: 365 days; Business: unlimited) | Per plan | Older versions archived or deleted per plan |
| Reconnect backoff base | 1 second | Yes | Exponential backoff with full jitter; cap at 60 seconds |
| Reconnect backoff cap | 60 seconds | Yes | Prevents thundering herd from collapsing to fixed interval |
| Upload max file size | 50 GB (Business plan) | Yes | Enforced at API Gateway |
| Max devices per user | 100 | Yes | Caps notification fan-out overhead |
| Quota reconciliation interval | 15 minutes | Yes | How often the background job recomputes exact usage from DB |
| Block GC delay | 7 days after last reference | Yes | After all files referencing a block are deleted, block is eligible for S3 deletion after 7 days |
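The first two parameters above (4 MB blocks, SHA-256) define how a client turns a file into an upload manifest. A minimal sketch of that chunking step:

```python
import hashlib
import io

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB default from the table above

def block_manifest(stream, block_size=BLOCK_SIZE):
    """Split a file stream into fixed-size blocks and return the ordered
    list of SHA-256 hex hashes — the block_list the client sends in the
    upload manifest."""
    hashes = []
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes
```

Because the hash is computed purely from block content, two users uploading the same file produce identical manifests, which is exactly what makes global dedup and infinite CDN TTLs work.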
Step 17 — Runbooks #
Runbook 1: High Upload Failure Rate #
Trigger: Upload success rate drops below 99% for 5 consecutive minutes.
Diagnosis sequence:
- Check API Gateway error logs: distinguish 4xx (client errors, e.g., quota exceeded) from 5xx (service errors).
- If 5xx: check Upload Service logs for DB connection errors, S3 timeout errors, or CAS contention errors.
- If DB CAS contention high: check `pg_stat_activity` for long-running transactions blocking the `files` table.
- If S3 errors: check AWS S3 Service Health Dashboard; check S3 error rate in CloudWatch.
- If Upload Service crash: check pod restart count in Kubernetes; check OOM events.
Mitigation:
- DB contention: kill long-running queries; identify and fix the offending query.
- S3 outage: activate read-only mode (downloads continue from CDN; uploads queue locally on device).
- Upload Service crash: Kubernetes auto-restarts; clients retry with exponential backoff.
Runbook 2: Sync Propagation Latency Spike #
Trigger: P99 sync latency exceeds 10 seconds for 5 minutes.
Diagnosis sequence:
- Check Kafka consumer group lag for the `file-changes` topic. High lag = notification delivery is behind.
- Check Notification Service throughput and connection count. Dropped WebSocket connections?
- Check Sync Service queue depth. Reconnect storm?
- Check FileChangeLog read latency in Cassandra. Hot partition?
Mitigation:
- Kafka consumer lag: scale up Notification Service replicas; increase Kafka partition count.
- Reconnect storm: check if a regional incident just recovered; engage rate limiting at API Gateway.
- Cassandra hot partition: identify the hot file_id; rate-limit uploads to that file_id at Upload Service.
Runbook 3: Permission Revocation Not Propagating #
Trigger: Alert from permission audit job: grantee can still access file 5 minutes after revocation.
Diagnosis sequence:
- Verify revocation is committed in the Postgres `share_permissions` table (status = 'revoked').
- Check Redis cache entry for `(grantee_id, resource_id)`: is TTL still active?
- Check Outbox table: is the revocation event still in `pending` status (Outbox relay stalled)?
- Check Kafka: did the `perm.invalidate` message reach the file service instances?
Mitigation:
- Cache entry stale: force TTL expiry via Redis DEL on the cache key (manual intervention).
- Outbox relay stalled: restart outbox relay service; it will republish pending events.
- Kafka consumer stalled: restart notification service consumer group.
Runbook 4: Reconnect Storm After Outage #
Trigger: Sync request rate spikes to > 10x normal immediately after service recovery.
Actions (in order):
- Verify API Gateway rate limiting is active: 100 sync requests/user/minute.
- Check that clients are using exponential backoff + jitter (not fixed-interval retry).
- If Sync Service CPU > 80%: scale up horizontally (Kubernetes HPA should trigger; verify).
- If Cassandra read latency spikes due to backlog: temporarily increase read timeout; increase Cassandra read replica count.
- Monitor until request rate normalizes (typically 5-10 minutes with proper backoff).
Runbook 5: Orphaned Block Cleanup #
Trigger: Block GC job reports orphaned blocks (blocks in S3 with reference_count = 0 for > 7 days).
Actions:
- Verify the `reference_count` calculation: run a reconciliation query across `files.block_list` and the `blocks` table.
- If reference_count is correct (truly unreferenced): schedule S3 deletion via lifecycle policy.
- Log block hashes before deletion for audit trail.
- Do NOT delete blocks with reference_count > 0 (even if files are in `deleted` state — blocks may be referenced by FileVersions for restore purposes).
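The reconciliation step in this runbook can be sketched as below. It is a sketch under the assumption that the caller passes the block lists of *all* files and FileVersions (per the warning above, versions keep blocks alive) and a snapshot of the `blocks` table as a hash → reference_count map:

```python
from collections import Counter

def reconcile_reference_counts(block_lists, stored_counts):
    """Recompute block reference counts from block_lists (files AND
    file_versions) and diff against the blocks table snapshot.

    Returns (corrections, orphans): counts that need fixing, and hashes
    whose true count is 0 — the only ones eligible for GC."""
    true_counts = Counter(h for bl in block_lists for h in bl)
    corrections = {h: true_counts.get(h, 0)
                   for h, stored in stored_counts.items()
                   if stored != true_counts.get(h, 0)}
    orphans = [h for h in stored_counts if true_counts.get(h, 0) == 0]
    return corrections, orphans
```

Only hashes in `orphans` proceed to the 7-day GC delay; a nonzero recomputed count always wins over a stale stored counter.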
Step 18 — Observability #
Metrics #
| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| `upload.requests.total` | Counter | status (success/failure), error_type | — |
| `upload.latency.seconds` | Histogram | percentile | P99 > 2s for 5min |
| `upload.bytes.total` | Counter | — | — |
| `block.dedup.ratio` | Gauge | — | < 0.3 (unexpectedly low dedup) |
| `sync.propagation.latency.seconds` | Histogram | — | P99 > 10s for 5min |
| `kafka.consumer.lag` | Gauge | consumer_group, topic | > 10000 messages |
| `permission.cache.hit_rate` | Gauge | — | < 0.95 for 5min |
| `conflict.open.count` | Gauge | — | > 1000 open conflicts |
| `quota.check.failures.total` | Counter | — | > 100/min |
| `db.connection_pool.waiting` | Gauge | db_name | > 10 waiting |
| `s3.error.rate` | Gauge | operation | > 0.01 (1%) |
| `device.reconnect.rate` | Gauge | — | > 10x baseline |
Traces #
- Every upload request carries a trace ID from client through Upload Service → S3 → DB commit → Outbox.
- Every sync event carries the FileChangeLog sequence_number as a trace tag.
- Conflict detection traces: version vector comparison and conflict creation are traced as a single span.
Logs #
| Log Event | Fields | Purpose |
|---|---|---|
| `upload.committed` | file_id, version_number, user_id, device_id, block_count, new_block_count, duration_ms | Audit + dedup stats |
| `conflict.detected` | file_id, device_id, device_vv, server_vv, conflict_id | Conflict rate monitoring |
| `conflict.resolved` | conflict_id, resolution, resolved_by | Resolution tracking |
| `permission.granted` | permission_id, grantor, grantee, resource_id, access_level | Access audit |
| `permission.revoked` | permission_id, grantor, grantee, resource_id, revoked_at | Access audit |
| `quota.exceeded` | user_id, current_bytes, limit_bytes, file_size_bytes | Quota enforcement |
| `sync.resumed` | device_id, cursor_from, cursor_to, events_replayed | Sync health |
Dashboards #
- Upload health: request rate, success rate, P50/P95/P99 latency, block upload rate, dedup ratio.
- Sync health: propagation latency, Kafka consumer lag, reconnect rate, conflict rate.
- Storage: total blocks stored, bytes stored, dedup savings (bytes not uploaded due to dedup), orphaned blocks.
- Access control: permission grant/revoke rate, cache hit rate, revocation propagation latency.
- Infrastructure: DB connection pool, Cassandra read/write latency, Redis memory usage, S3 error rate.
Step 19 — Cost Model #
Cost Drivers #
| Component | Unit Cost (approximate) | Volume (hypothetical 10M users) | Monthly Cost Estimate |
|---|---|---|---|
| S3 block storage | $0.023/GB/month | 10M users × 2GB average = 20PB; with 50% dedup = 10PB | $230,000 |
| S3 PUT requests (block uploads) | $0.005 per 1000 PUT | 10M uploads/day × 20 blocks avg = 200M PUT/day = 6B/month | $30,000 |
| S3 GET requests (block downloads) | $0.0004 per 1000 GET | 50M downloads/day × 20 blocks avg = 1B GET/day = 30B/month | $12,000 |
| Cloudflare CDN | ~$0.01/GB (egress) | 30B GET/month; CDN hit rate 80%; 20% from S3 = 6B × 20KB avg = 120TB egress from S3 | $1,200 (S3 egress) + Cloudflare plan |
| PostgreSQL (RDS Multi-AZ db.r6g.4xlarge) | ~$1,200/month per instance | 2 instances (primary + replica) | $2,400 |
| Cassandra (3 nodes × i3.4xlarge) | ~$1,200/month per node | 3 nodes in US-EAST-1; 3 in US-WEST-2 | $7,200 |
| Redis (ElastiCache r6g.xlarge) | ~$250/month | 2 instances (primary + replica) | $500 |
| Kafka (MSK 3 brokers × kafka.m5.2xlarge) | ~$400/month per broker | 3 brokers | $1,200 |
| Compute (ECS/Kubernetes, Upload + Sync + Share + Notification services) | ~$0.05/vCPU-hour | 50 vCPU average | $1,800 |
| Data transfer (S3 to EC2, same region) | Free | — | $0 |
| Total (estimated) | | | ~$286,000/month |
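The table's totals are straightforward to check; the arithmetic below reproduces the hypothetical figures (decimal units: 1 PB = 10^6 GB, per the $230,000 storage line):

```python
# Line items from the cost table above (monthly, USD).
stored_gb = 10_000_000            # 20 PB raw, 50% dedup -> 10 PB stored
line_items = {
    "s3_storage": stored_gb * 0.023,  # $230,000
    "s3_put": 30_000,
    "s3_get": 12_000,
    "s3_egress": 1_200,
    "postgres_rds": 2_400,
    "cassandra": 7_200,
    "redis": 500,
    "kafka_msk": 1_200,
    "compute": 1_800,
}
total = sum(line_items.values())
print(round(total))  # 286300 -> ~$286,000/month
```

Storage dominates at ~80% of the bill, which is why the dedup ratio is called out below as the biggest optimization lever.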
Cost Optimization Levers #
- Dedup ratio is the biggest lever: Global cross-user block dedup at 4MB block size typically achieves 40-60% storage reduction for enterprise document workloads. Each percentage point of the stored footprint (1% of 10 PB ≈ 100 TB) is worth about $2,300/month at this scale.
- CDN cache hit rate: Blocks are content-addressed and immutable → perfect CDN cache invalidation (never invalidate; TTL = infinite). High cache hit rate dramatically reduces S3 GET requests.
- S3 Intelligent-Tiering: Blocks not accessed for 30 days automatically move to cheaper storage tiers ($0.0125/GB vs. $0.023/GB).
- Cassandra compression: Native LZ4 compression shrinks the FileChangeLog to roughly a third of its raw size.
- Right-size Cassandra: At lower scales, Cassandra can be replaced with Postgres (fewer moving parts; lower operational cost).
Step 20 — Evolution Stages #
Stage 1: MVP (0 → 100K users) #
What to build:
- Single-region deployment (US-EAST-1).
- Upload Service + Sync Service as a monolith.
- PostgreSQL for everything (files, file_versions, file_change_log, share_permissions).
- S3 for block storage.
- No real-time push: device polls every 30 seconds.
- Basic conflict detection: last-write-wins (simpler; document the caveat to users).
- No global block dedup: per-user dedup only.
What to defer:
- Cassandra (Postgres handles the write load at this scale).
- Kafka (polling is sufficient).
- Global dedup (per-user dedup is simpler and sufficient).
- Redis (permissions checked directly from Postgres on every request).
- Multi-region.
Graduation criteria: 100K users, upload P99 < 1s, sync delay < 30s (polling interval).
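The Stage-1 polling model can be sketched in a few lines. Here `fetch` and `apply` are hypothetical client callbacks standing in for the HTTP call against a change-log endpoint and the local-state mutation, respectively:

```python
import time

POLL_INTERVAL_S = 30  # Stage-1 sync delay bound (the polling interval)

def poll_changes(fetch, apply, cursor=0):
    """One Stage-1 sync iteration: pull change-log events after `cursor`,
    apply each delta to local device state, and return the advanced cursor.
    `fetch(cursor)` and `apply(event)` are hypothetical client callbacks."""
    events = fetch(cursor)              # e.g. GET /changes?after=<cursor>
    for event in events:
        apply(event)                    # apply delta to local state
        cursor = event["offset"]        # advance only after apply succeeds
    return cursor

# Usage: the device loop simply repeats with a fixed sleep.
# while True:
#     cursor = poll_changes(fetch, apply, cursor)
#     time.sleep(POLL_INTERVAL_S)
```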
Stage 2: Growth (100K → 1M users) #
Adds:
- Separate Upload, Sync, Share, and Notification services (decompose monolith by service boundary).
- Redis for permission cache (reduce Postgres read load).
- WebSocket-based push notifications (reduce sync latency from 30s to < 5s).
- Global block dedup (same content hash = skip upload globally).
- Idempotency key store for upload retries.
- Outbox + Relay for async event propagation.
- Full version vector conflict detection (replace last-write-wins).
Graduation criteria: 1M users, upload P99 < 500ms, sync latency P99 < 5s, conflict detection working.
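The version-vector comparison that replaces last-write-wins can be sketched as follows; this is a minimal version assuming vectors are plain `device_id → counter` maps:

```python
# Version-vector comparison for Stage-2 conflict detection. Concurrent
# offline edits on two devices produce vectors where neither dominates
# the other, which is exactly the case that spawns a Conflict object.

def compare(a: dict, b: dict) -> str:
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict'."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"    # divergent offline edits → create Conflict object
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

print(compare({"laptop": 2, "phone": 1}, {"laptop": 1, "phone": 2}))
# -> conflict
```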
Stage 3: Scale (1M → 50M users) #
Adds:
- Cassandra for FileChangeLog and FileVersions (write throughput beyond Postgres single-shard capacity).
- Kafka for all async event propagation (replace polling-based Outbox with Kafka consumers).
- Postgres sharding by user_id (horizontal scale for file metadata).
- Multi-region active-active (US-EAST-1 + EU-WEST-1; data sovereignty for EU users).
- Reconnect storm protection (exponential backoff enforcement + API Gateway rate limiting).
- S3 Intelligent-Tiering for cost optimization.
- CDN integration for block downloads.
- Quota reconciliation background job (async exact computation).
Graduation criteria: 50M users, upload availability 99.9%, sync propagation P99 < 5s globally.
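Reconnect storm protection relies on clients spreading their retries rather than reconnecting in lockstep after an outage. A minimal full-jitter backoff sketch (the base and cap values are illustrative, not the system's actual parameters):

```python
import random

def reconnect_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Full-jitter exponential backoff for WebSocket reconnects: the delay
    window doubles with each attempt up to a cap, and the actual delay is
    drawn uniformly from it so a fleet of devices does not hammer the API
    Gateway simultaneously (reconnect storm protection)."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))
```

Server-side rate limiting at the API Gateway remains the backstop; client-side jitter just keeps well-behaved devices from synchronizing their retries.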
Stage 4: Maturity (50M+ users) #
Adds:
- Differential sync at sub-block level (rsync-style rolling checksum for large files with small edits — reduce block upload count further).
- CRDT-based collaborative editing for specific file types (Google Docs-style; Conflict object becomes less necessary for text files).
- Tiered storage with user-controlled cold archive (files not accessed for 1 year move to Glacier with restore-on-demand).
- Zero-knowledge encryption option (client-side encryption; server stores ciphertext; server cannot read blocks; dedup becomes per-user only, not global, since ciphertext of same plaintext differs per user key).
- Federated deployment for enterprise (on-premises Dropbox Business with same protocol; client handles sync identically).
Trade-off noted for zero-knowledge encryption: Global block dedup (Step 5’s U1 invariant) is incompatible with per-user client-side encryption keys. If user A and user B both encrypt the same file, the resulting blocks differ (different keys → different ciphertext → different hashes). Global dedup collapses to zero. The system must choose: dedup (store plaintext on server) or zero-knowledge (sacrifice global dedup). This is a product decision, not a technical limitation.
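The incompatibility can be demonstrated in a few lines. HMAC stands in here for real client-side encryption (e.g. AES-GCM); it is enough to show that identical plaintext under different user keys yields different content addresses, so cross-user dedup finds nothing to collapse:

```python
import hashlib
import hmac

def block_id(user_key: bytes, plaintext: bytes) -> str:
    """Content address of an encrypted block. A keyed HMAC stands in for
    real client-side encryption: same plaintext, different user keys →
    different ciphertext → different block hashes."""
    return hmac.new(user_key, plaintext, hashlib.sha256).hexdigest()

doc = b"identical quarterly report"
# Users A and B upload the same file, each encrypted with their own key:
print(block_id(b"user-a-key", doc) == block_id(b"user-b-key", doc))  # False
```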