
Dropbox: System Design #

Derived using the 20-step derivation framework. Every step produces an explicit output artifact. No hand-wavy steps.


Ordering Principle #

Product requirements (upload, sync, share)
  → normalize into operations over state          (Step 1)
  → extract primary objects                       (Step 2)
  → assign ownership, ordering, evolution         (Step 3)
  → extract invariants                            (Step 4)
  → derive minimal DPs from invariants            (Step 5)
  → select concrete mechanisms                    (Step 6)
  → validate independence and source-of-truth     (Step 7)
  → specify exact algorithms                      (Step 8)
  → define logical data model                     (Step 9)
  → map to technology landscape                   (Step 10)
  → define deployment topology                    (Step 11)
  → classify consistency per path                 (Step 12)
  → identify scaling dimensions and hotspots      (Step 13)
  → enumerate failure modes                       (Step 14)
  → define SLOs                                   (Step 15)
  → define operational parameters                 (Step 16)
  → write runbooks                                (Step 17)
  → define observability                          (Step 18)
  → estimate costs                                (Step 19)
  → plan evolution                                (Step 20)

Step 1 — Problem Normalization #

Goal: Convert product-language requirements into precise operations over state.

| Original Requirement | Actor | Operation | State Touched |
| --- | --- | --- | --- |
| User uploads a file | User | create-or-overwrite | File (block_list, version); Block (content, hash) |
| User downloads a file | Client | read | File (block_list); Block (content by hash) |
| File auto-syncs across all user devices | System | read change log + apply delta | FileChangeLog (events); SyncCursor (offset per device); Device local state |
| Conflict resolution when offline edits diverge | System | detect divergence + create Conflict object | FileVersion (version vectors); Conflict (process object) |
| User shares a file/folder with another user | User | create relationship | SharePermission (grantee, resource, access level) |
| Shared user accesses the file | Grantee | eligibility check + read | SharePermission; File; Block |
| View version history of a file | User | read projection | FileVersion (append-only log) |
| Delta transfer (don’t re-upload unchanged bytes) | Client | compute local hashes, read missing list, upload only missing blocks | Block (content-addressed); File (block_list diff) |
| Work offline, sync on reconnect | Client | buffer local events offline; replay on reconnect | FileChangeLog (local event buffer); SyncCursor (replayed from offset) |
| Revoke share permission | Owner | delete relationship | SharePermission |
| Delete a file | User | state transition | File (active → deleted); FileChangeLog (deletion event appended) |
| Restore a deleted file | User | state transition + overwrite | File (deleted → active, block_list restored from FileVersion) |

Hidden write exposures:

  • “Auto-sync” is not a simple read. It requires the system to track which device has seen which change (SyncCursor is an offset), detect delta (which blocks changed), and apply the delta to local state. Three operations, not one.
  • “Conflict resolution” hides a process object (Conflict) with its own state machine. The conflict is created when divergence is detected, then resolved (manually or automatically) into a new FileVersion.
  • “Version history” is a derived view over the FileVersion append-only log — not primary state to be designed independently.

Step 2 — Object Extraction #

Goal: Identify the minimal set of primary state objects. Apply all four purity tests.

Primary Objects #

| Object | Class | Justification |
| --- | --- | --- |
| File | Stable entity | Long-lived, has identity (file_id), evolves via block_list overwrites and state transitions (active/deleted) |
| Block | Event | Immutable once stored; content-addressed by SHA-256(content); never mutated |
| FileVersion | Event | Immutable snapshot of a file’s block_list at a point in time; append-only |
| FileChangeLog | Event stream | Append-only log of all mutations to a File (upload, delete, restore, conflict-resolve) |
| Device | Stable entity | Tracks per-device presence and sync state |
| Conflict | Process object | Created when version vectors diverge; has state machine (open → resolved); must persist across the resolution lifecycle |
| SharePermission | Relationship object | An edge between a User and a File/Folder with access level and its own lifecycle (active → revoked) |
| User | Stable entity | Account identity, quota, plan |

Derived / Rejected Objects #

| Candidate | Problem | Disposition |
| --- | --- | --- |
| SyncCursor | Derivable: it is a pointer (offset) into FileChangeLog per device. If stored as mutable primary state alongside the log, there is dual truth — the log AND the cursor both describe “what this device has seen.” Correct treatment: SyncCursor is a materialized offset bookmark, not primary state. | Derived view — offset into FileChangeLog per (device_id, file_id) |
| VersionHistory | Derivable from FileVersion records filtered by file_id, ordered by created_at | Derived projection |
| StorageQuotaUsed | Derivable from File records owned by user × block sizes | Derived projection (cached) |
| FolderContents | Derivable from File records with parent_folder_id | Derived projection |

Four Purity Tests per Object #

File #

  1. Ownership purity: Written by the owning user (uploads, deletes, restores) and by the conflict-resolution path. These are distinct operations with distinct guards — ownership is clear. ✓
  2. Evolution purity: Overwrite of block_list on each new upload version; state machine for active/deleted. These are on different fields and different guards. The split into File (current state) + FileVersion (history log) keeps each pure. ✓
  3. Ordering purity: FileVersions are totally ordered by version number within file_id. ✓
  4. Non-derivability: The current block_list of a File cannot be derived without knowing which version is current — that pointer lives in File. ✓

Block #

  1. Ownership purity: Written by the upload service, never modified, never deleted (content-addressed storage). Single writer under append semantics. ✓
  2. Evolution purity: Append-only. Once a block with a given hash exists, it never changes. ✓
  3. Ordering purity: No meaningful ordering — blocks are a content-addressed set. Hash is the identity. ✓
  4. Non-derivability: Block content cannot be reconstructed without the actual bytes. ✓

Conflict #

  1. Ownership purity: Created by the sync service (when divergence detected); resolved by user or auto-resolver. Two distinct writers, but at distinct lifecycle phases — creation is system-only, resolution is user-or-system. ✓
  2. Evolution purity: State machine: detected → open → resolved. Each transition has a guard. ✓
  3. Ordering purity: Causal lifecycle order — transitions must follow valid paths. ✓
  4. Non-derivability: A Conflict is not derivable from FileVersions alone. The Conflict object carries user choice, resolution metadata, and the resulting FileVersion reference. ✓

SharePermission #

  1. Ownership purity: Written by the file/folder owner (grant, revoke). Single writer per permission. ✓
  2. Evolution purity: State machine: active → revoked. Revocation is terminal. ✓
  3. Ordering purity: No meaningful ordering within a grantee’s permission set. ✓
  4. Non-derivability: Access rights cannot be derived from File metadata alone. ✓

Step 3 — Axis Assignment #

Goal: For every primary object, define ownership, evolution, and ordering (bound to scope).

Object: File
  Ownership:   Multi-writer, one winner per file_id (multiple devices of the same user may write; CAS ensures only one write wins per version)
  Evolution:   Overwrite (block_list replaced on new version); State machine (active/deleted)
  Ordering:    Total order on version_number within file_id

Object: Block
  Ownership:   Single writer (upload service); content-addressed so identity is the hash
  Evolution:   Append-only (immutable once stored)
  Ordering:    No meaningful order (set semantics; hash = identity)

Object: FileVersion
  Ownership:   System-only (created by upload service and conflict resolver; users never write directly)
  Evolution:   Append-only (immutable snapshot)
  Ordering:    Total order by version_number within file_id

Object: FileChangeLog
  Ownership:   System-only (written by upload, delete, restore, conflict-resolve services)
  Evolution:   Append-only (events are immutable)
  Ordering:    Total order by sequence_number within file_id

Object: Device
  Ownership:   Single writer per device_id (device registers itself; service updates last_seen)
  Evolution:   Overwrite (last_seen, sync_offset fields updated in place)
  Ordering:    No meaningful order across devices

Object: Conflict
  Ownership:   Multi-writer across lifecycle phases: system creates, user or auto-resolver resolves
  Evolution:   State machine (detected → open → resolved)
  Ordering:    Causal lifecycle order (transitions must follow valid paths)

Object: SharePermission
  Ownership:   Single writer (the file/folder owner)
  Evolution:   State machine (active → revoked)
  Ordering:    No meaningful order within a grantee's permission set

Object: User
  Ownership:   Single writer (user self-writes profile; billing system writes quota)
  Evolution:   Overwrite (mutable fields: name, email, plan, quota_used_bytes)
  Ordering:    No meaningful order

Circuit topology insight: The Dropbox sync system is a transmission line. Think of two capacitors (local device state, server state) connected through a sync medium. File changes are charge flowing to equalize. Delta sync is minimizing the charge transfer needed. A conflict is what happens when both capacitors are charged differently during isolation (offline period) — they cannot simply merge charge; the system must detect divergence and arbitrate.


Step 4 — Invariant Extraction #

Goal: Convert requirements into precise, testable invariants. These are implementation-independent.

Eligibility Invariants #

E1 — Upload eligibility: A user may upload a file only if their quota_used_bytes + new_file_bytes ≤ quota_limit_bytes.

E2 — Access eligibility: A user may read a file only if: (a) they own the file, OR (b) there exists a SharePermission record where grantee_id = user_id AND resource_id covers the file AND status = active AND access_level ∈ {read, write}.

E3 — Write eligibility on shared file: A user may upload a new version of a file they do not own only if: there exists a SharePermission where access_level = write AND status = active.

E4 — Delete eligibility: Only the owner may delete a file. Grantees with write access may not delete.
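
The eligibility invariants E2–E4 reduce to pure predicates over SharePermission records. A minimal sketch, assuming in-memory records; the class and function names (`SharePermission`, `can_read`, `can_write`, `can_delete`) are illustrative, not a real service API:

```python
from dataclasses import dataclass

@dataclass
class SharePermission:
    grantee_id: str
    resource_id: str
    access_level: str  # "read" or "write"
    status: str        # "active" or "revoked"

def can_read(user_id, file_id, owner_id, grants):
    # E2: owner always reads; otherwise an active grant covering the file
    if user_id == owner_id:
        return True
    return any(g.grantee_id == user_id and g.resource_id == file_id
               and g.status == "active" and g.access_level in ("read", "write")
               for g in grants)

def can_write(user_id, file_id, owner_id, grants):
    # E3: non-owners need an active write-level grant
    if user_id == owner_id:
        return True
    return any(g.grantee_id == user_id and g.resource_id == file_id
               and g.status == "active" and g.access_level == "write"
               for g in grants)

def can_delete(user_id, owner_id):
    # E4: only the owner may delete, regardless of write grants
    return user_id == owner_id
```

Note that a revoked grant fails the `status == "active"` check, so revocation denies access with no extra logic.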

Ordering Invariants #

O1 — Version monotonicity: For any file_id, if versions V and V+1 both exist, then V+1.created_at > V.created_at. No two FileVersions for the same file may share a version_number.

O2 — Change log monotonicity: For any file_id, FileChangeLog sequence numbers are strictly monotonically increasing. Events are never reordered or deleted.

Accounting Invariants #

A1 — Quota consistency: User.quota_used_bytes = SUM(Block.size_bytes for all blocks reachable from active Files owned by user). This must hold after every upload and delete. (In practice, computed asynchronously; the synchronous enforcement is a pessimistic quota check at upload time against a cached counter.)
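
The pessimistic scheme in the parenthetical can be sketched as a small gate: check a cached counter before upload, deduct on commit, reconcile later. `QuotaGate` is a hypothetical stand-in; the real counter would live in the users table plus a cache:

```python
class QuotaGate:
    """Pessimistic quota gate (E1/A1 sketch). The cached counter may lag
    the true block-reference sum; over-quota risk is bounded by that lag."""
    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0  # cached counter, not the source of truth

    def try_reserve(self, new_bytes):
        # E1: deny if the cached counter says the limit would be exceeded
        if self.used_bytes + new_bytes > self.limit_bytes:
            return False
        self.used_bytes += new_bytes  # deduct at commit time
        return True

    def reconcile(self, actual_block_sizes):
        # A1: asynchronously reset the counter from actual block references
        self.used_bytes = sum(actual_block_sizes)
```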

A2 — Block reference integrity: Every block_hash in any File.block_list must have a corresponding Block record in the block store. A file must never reference a block that has not been durably committed.

Uniqueness / Idempotency Invariants #

U1 — Block global dedup: There is at most one Block record for any given content_hash. If two uploads produce the same hash, only one Block is stored. Both uploads succeed, but the bytes are stored once.

U2 — Idempotent block upload: Uploading a block with the same content_hash twice must be a no-op. The second upload must not corrupt or duplicate the block.

U3 — Upload idempotency: A client may retry an upload request with the same idempotency key and receive the same result without creating duplicate FileVersions.
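
U3 in miniature: replay the stored result for a repeated key instead of re-running the commit. An in-memory sketch (`IdempotencyStore` and `run_once` are illustrative names; a real store would be a durable table keyed by the idempotency key):

```python
class IdempotencyStore:
    """U3 sketch: the first call with a key runs the commit and records its
    result; any retry with the same key returns that result unchanged."""
    def __init__(self):
        self._results = {}

    def run_once(self, key, commit_fn):
        if key in self._results:
            return self._results[key]  # retry: same result, no new FileVersion
        result = commit_fn()
        self._results[key] = result
        return result
```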

Propagation Invariants #

P1 — Sync completeness: If a file changes on device A and device B is online and subscribed, device B must eventually receive the change event. “Eventually” is bounded by SLO (Step 15).

P2 — Conflict detection completeness: If two devices modify the same file while one is offline, the system must detect the divergence when the offline device reconnects and create a Conflict object. The divergence must not be silently overwritten.

P3 — Deletion propagation: If a file is deleted on device A, device B must receive a deletion event. Device B must not continue to serve the file as live after the event is processed.

Access-Control Invariants #

AC1 — Permission revocation is immediate: Once a SharePermission is revoked (status = revoked), all subsequent access attempts by the grantee must be denied. No caching of permissions beyond a bounded TTL (configurable; must be ≤ 60 seconds in strong mode).

AC2 — Grantee cannot escalate: A grantee with read access cannot perform write operations. Access level enforcement is the sole authority — no capability tokens that can be forged client-side.


Step 5 — Design Point Derivation #

Goal: For each invariant cluster, derive the minimal enforcing mechanism. One DP per cluster; no over-engineering.

| Invariant Cluster | Design Point | Reasoning |
| --- | --- | --- |
| E1 (quota), A1 (quota consistency) | Quota gate with cached counter + pessimistic check | Exact enforcement requires a distributed counter, which is expensive at upload scale. Pessimistic: pre-check cached quota_used_bytes; deduct on commit; reconcile asynchronously. Over-quota risk is bounded by the cache staleness window. |
| E2, E3, E4, AC1, AC2 (access control) | Permission check service with bounded-TTL cache | Permissions are read at every file access; must be fast. Cache permissions with TTL ≤ 60s for strong mode. Revocation propagates via cache invalidation event. Source of truth is the SharePermission table. |
| O1 (version monotonicity), U3 (upload idempotency) | CAS on (file_id, current_version_number) + idempotency key store | Each upload atomically increments version_number; if CAS fails, caller retries with fresh version. Idempotency key prevents duplicate FileVersions on retry. |
| U1, U2 (block dedup + idempotency) | Content-addressed block store keyed by SHA-256 hash | Hash = identity = idempotency key. Existence check before upload eliminates duplicate bytes. No separate dedup index needed — the hash IS the key. |
| O2 (log monotonicity) | Single-partition append-only log per file_id | Sequence numbers assigned by the log; append is atomic. Consumers read from offset. |
| P1 (sync propagation) | Change event stream + per-device subscription + push notification | FileChangeLog events are published; devices subscribe; server pushes or client polls from stored offset. |
| P2 (conflict detection) | Version vector per (file_id, device_id); divergence detection on reconnect | Each device tracks a version vector. On reconnect, server compares the device’s vector against the server’s vector. Divergence → Conflict object created. |
| P3 (deletion propagation) | Deletion event in FileChangeLog + device processes event | Deletion is an event appended to the log. All subscribed devices receive it and mark the local copy as deleted. Soft-delete on server (retain block_list for restore); hard-delete after TTL or explicit purge. |
| AC1 (revocation immediacy) | Cache invalidation event on revocation + TTL backstop | On revoke, publish invalidation to the permission cache. TTL ≤ 60s ensures stale cache expires even if the invalidation is lost. |
| A2 (block reference integrity) | Commit blocks first, then commit file metadata atomically | Block must exist before File references it. Write blocks first; write file metadata only after all blocks are confirmed durable. This is a sequencing invariant, not a 2PC problem. |

Step 6 — Mechanism Selection #

Goal: Mechanical bridge from invariant + axis → concrete implementation mechanism. Apply the full derivation table for four key paths.

6.1 Invariant Type → Mechanism Family #

| Invariant Type | Mechanism Family |
| --- | --- |
| Eligibility | Guard + atomic state check |
| Ordering | Sequence number + CAS |
| Accounting | Pessimistic counter + async reconciliation |
| Uniqueness/Idempotency | Content hash as key + existence check |
| Propagation | Event stream + subscription |
| Access-control | Permission store + TTL cache + invalidation |

6.2 Ownership × Evolution → Concurrency Mechanism #

| Object | Ownership | Evolution | Table Lookup Result |
| --- | --- | --- | --- |
| File | Multi-writer, one winner | Overwrite + state machine | CAS on (file_id, version_number) |
| Block | Single writer | Append-only | No concurrency mechanism needed; hash prevents collision by design |
| FileVersion | System-only | Append-only | Idempotency key on (file_id, version_number) |
| Conflict | Multi-writer across lifecycle phases | State machine | CAS on (conflict_id, status) |
| SharePermission | Single writer | State machine | CAS on (permission_id, status) |

6.3 Mechanical Derivation — Four Key Paths #


Path A: File Write Conflict Detection #

Invariant driving this: P2 (divergence must be detected) + O1 (version monotonicity).

Ownership × Evolution: File is multi-writer (multiple devices of same user may write while offline), overwrite + state machine.

Table lookup: Multi-writer + overwrite → CAS on version.

Q1 (scope): Divergence spans multiple devices (cross-service/cross-process). Not within a single service. But it is also not cross-region in the primary case. Scope = cross-device within one user’s account. The conflict is detected at the server on reconnect. Mechanism: CAS on server-side version vector, not distributed 2PC (devices do not coordinate with each other).

Q2 (failure): What if the device crashes mid-upload? → Idempotency Key. What if the network partitions during upload? → CAS detects stale version on retry; client must re-read current state.

Q3 (data): Version vectors are not commutative in the way CRDT requires (concurrent overwrites of a file are not mergeable without user intent). Content-addressed blocks help detect what changed but not resolve which change wins. → CAS + version vector.

Q4 (access): Reads » Writes (most devices read the synced state; conflicts are rare). Current state is read on every sync. → CQRS-lite: write path updates the version vector; read path serves from a read replica.

Q5 (coupling): The Conflict object must be created when divergence is detected — this is an async-guaranteed propagation from the sync service to the conflict service. → Outbox + Relay pattern. The sync service writes a “conflict_detected” event to its outbox table in the same transaction as updating the file’s version vector; a relay publishes it to the conflict service.

Required combination: CAS + Idempotency Key always (Step 6.4 rule).

Resulting mechanism: Version vector per (file_id, device_id) stored in FileChangeLog. On upload from device D, server computes current version vector. If device D’s local version vector and the server’s vector are concurrent (neither dominates), a Conflict is created via Outbox + Relay. If device D’s vector dominates (no offline divergence), the upload proceeds with CAS on version_number.

Conflict detection algorithm sketch:

function detect_conflict(device_id, file_id, device_vv):
    server_vv = read_version_vector(file_id)
    if dominates(device_vv, server_vv):
        # device is ahead — fast-forward server
        return ACCEPT
    elif dominates(server_vv, device_vv):
        # server is ahead — device needs to pull
        return STALE
    else:
        # concurrent — conflict
        return CONFLICT
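
The `dominates` helper the sketch relies on can be made concrete. A minimal version-vector comparison, with vectors represented as dicts of device → counter (function names are illustrative):

```python
def dominates(a, b):
    """True if vector a has seen everything b has (a >= b componentwise;
    devices absent from a vector count as 0)."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def compare(device_vv, server_vv):
    if dominates(device_vv, server_vv):
        return "ACCEPT"    # device is ahead: fast-forward server
    if dominates(server_vv, device_vv):
        return "STALE"     # server is ahead: device must pull
    return "CONFLICT"      # concurrent: neither dominates
```

Note that equal vectors satisfy `dominates` in both directions; the first branch wins and the fast-forward is a no-op.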

Path B: Block Dedup #

Invariant driving this: U1 (global dedup — at most one block per hash) + U2 (idempotency of block upload).

Ownership × Evolution: Block is single writer, append-only. But it is a global dedup — across all users.

Q1 (scope): Global dedup is cross-user. The block store is a single global namespace keyed by content_hash. Scope: global, single-writer-per-key (two clients uploading the same block race, but both should succeed idempotently).

Q2 (failure): If the uploading client crashes mid-block-upload, the block is incomplete. The client must retry. → Idempotency Key = content_hash. The server checks: does a block with this hash exist? If yes, skip upload. If no (or partial), upload resumes from offset (resumable upload protocol).

Q3 (data): Blocks are content-addressed. Content-addressed → hash = natural idempotency key (Q3 rule from framework). This eliminates the need for a separate idempotency key store for block uploads. The hash IS the key.

Q4 (access): Block uploads (writes) are less frequent than block downloads (reads), but block existence checks happen on every upload attempt for every block in the manifest. → Content-addressed object store with O(1) existence check (HTTP HEAD on S3 key = hash).

Q5 (coupling): Block existence must be confirmed before File metadata references it (A2). This is an in-transaction-sequencing coupling, not a saga. Write block first; commit file metadata only after block is durable. No async coupling needed here.

Resulting mechanism: S3 (or equivalent) with key = SHA-256 hex of block content. Upload protocol:

  1. Client computes SHA-256(block_content) locally.
  2. Client sends manifest (list of hashes) to server.
  3. Server performs bulk existence check (S3 HEAD per hash, or batched lookup in a block index table).
  4. Server returns list of missing hashes.
  5. Client uploads only missing blocks (PUT to S3 pre-signed URL per hash).
  6. Server confirms receipt (S3 HEAD again or use S3 event notification).
  7. Server commits File metadata with new block_list.

The dedup ratio is high for common file types. A 10MB PDF split into 4MB blocks: if block 1 is identical to a block already stored by another user, it is not re-uploaded. Cross-user dedup is a storage multiplier.
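
The content-addressing argument (hash = identity = idempotency key) can be demonstrated in a few lines. A sketch with a dict standing in for S3; `BlockStore`, `put`, and `missing` are illustrative names:

```python
import hashlib

class BlockStore:
    """Content-addressed block store sketch (U1/U2): key = SHA-256 of the
    bytes, so storing identical content twice keeps exactly one copy."""
    def __init__(self):
        self._blocks = {}

    def put(self, content: bytes) -> str:
        h = hashlib.sha256(content).hexdigest()
        self._blocks.setdefault(h, content)  # idempotent: second put is a no-op
        return h

    def missing(self, hashes):
        # server-side existence check behind the "upload only missing" step
        return [h for h in hashes if h not in self._blocks]
```

Two users uploading the same block get the same hash back, and the store holds the bytes once: exactly the cross-user dedup multiplier described above.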


Path C: Sync Notification #

Invariant driving this: P1 (sync completeness) + P3 (deletion propagation).

Ownership × Evolution: FileChangeLog is system-only, append-only.

Q1 (scope): Propagation from the server to many devices. This is a fan-out problem: one file change → N device notifications. Scope: cross-service (sync service → push notification service → devices).

Q2 (failure): Device is offline when event is published. → The device must catch up from its stored offset (SyncCursor = offset in FileChangeLog). The event log is durable; the device resumes from its last-processed offset on reconnect. This is the “async-guaranteed” pattern. Missing push notification is survivable because the device polls on reconnect from its offset.

Q3 (data): Events are not commutative (they must be replayed in order). → Not CRDT. The log is totally ordered per file_id (O2).

Q4 (access): Write (device uploads change) triggers fan-out read to many devices. → Fan-out on Write: when a change is committed, publish to a per-user change channel. Devices subscribed to the channel receive the notification. Devices not connected store their offset and catch up later.

Q5 (coupling): Sync notification must be async-guaranteed: the upload service must not block on device notification delivery. → Outbox + Relay: upload service writes FileChangeLog event to its own DB in the same transaction as committing the new FileVersion; a relay publishes to the notification channel (Redis pub/sub or Kafka topic).

Resulting mechanism:

  • FileChangeLog stored in Cassandra (high-write append, per-file_id partition key).
  • On commit, relay publishes event to Kafka topic file-changes (partitioned by user_id for ordering).
  • Notification service consumes Kafka, pushes to device via WebSocket or long-poll.
  • Device stores sync_cursor = last_processed_sequence_number locally and in Device table.
  • On reconnect, device sends its cursor; server streams all events since that cursor from FileChangeLog.
  • Reconnect storm mitigation: exponential backoff + full jitter on reconnect attempt after outage (per AWS Architecture Blog recommendation).
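
The backoff-with-full-jitter policy in the last bullet is a one-liner. A sketch (parameter defaults are assumptions, not Dropbox's actual values):

```python
import random

def backoff_with_full_jitter(attempt, base=0.5, cap=60.0):
    """Reconnect delay: exponential backoff capped at `cap` seconds, with
    full jitter (uniform over [0, ceiling]) to decorrelate reconnecting
    devices after an outage."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```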

Path D: Share Permission Enforcement #

Invariant driving this: E2, E3, E4 (access eligibility) + AC1 (revocation immediacy) + AC2 (no escalation).

Ownership × Evolution: SharePermission is single writer per record, state machine (active → revoked).

Q1 (scope): Permission check is within-service (the file service checks permissions before serving content). Cross-service for access from different tools (e.g., a linked app). Within the core service: permission check on every read/write request.

Q2 (failure): What if the permission service is slow or unavailable? → Circuit Breaker: if the permission service is down, fail closed (deny access), not open (allow all). This preserves AC1 and AC2 under degradation.

Q3 (data): Permissions are read » written (many reads per grant/revoke event). → Cache-Aside: cache SharePermission records in Redis keyed by (grantee_id, resource_id). TTL ≤ 60s for strong mode. On revocation, publish a cache invalidation event (via Kafka or Redis pub/sub) to all file service instances.

Q4 (access): read » write → Cache-Aside is correct. The cache entry is a materialized permission check result: {can_read: true, can_write: false}.

Q5 (coupling): Revocation must propagate with bounded delay (≤ 60s, per AC1). This is async-guaranteed propagation. → Outbox + Relay for revocation events: SharePermission service writes revocation event to outbox; relay publishes to cache invalidation channel; all file service instances invalidate their local cache entry for that (grantee_id, resource_id) pair.

CAS + Lease requirement (Step 6.4):

  • CAS on SharePermission.status: the revocation is a CAS from active to revoked. Prevents double-revocation or race between grant and revoke.
  • No lease needed here because the single-writer invariant holds (only the owner writes).

Resulting mechanism:

  • PostgreSQL share_permissions table with status column.
  • Revocation: UPDATE share_permissions SET status='revoked', revoked_at=NOW() WHERE permission_id=? AND status='active' (CAS-equivalent via conditional update).
  • Cache: Redis hash perm:{grantee_id}:{resource_id} → {can_read, can_write, ttl}.
  • Cache invalidation: on revocation, write to outbox; relay publishes perm.invalidate:{grantee_id}:{resource_id} to Redis pub/sub; all file service instances subscribe and delete their cache entry.
  • Circuit Breaker: if Redis is unavailable, fall back to direct DB query. If DB is unavailable, fail closed.
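
The cache-aside lookup with TTL backstop and fail-closed degradation can be sketched as follows. `PermissionChecker` and `db_lookup` are hypothetical names; `db_lookup` stands in for the Postgres query and raises when the DB is unreachable:

```python
import time

class PermissionChecker:
    """Cache-aside permission check: serve from cache within TTL (AC1
    backstop), fall through to the DB on miss, and fail closed (deny)
    when the source of truth is unreachable."""
    def __init__(self, db_lookup, ttl=60.0, clock=time.monotonic):
        self.db_lookup = db_lookup
        self.ttl = ttl
        self.clock = clock
        self._cache = {}  # (grantee, resource) -> (expires_at, perms)

    def can_read(self, grantee, resource):
        key = (grantee, resource)
        entry = self._cache.get(key)
        if entry and entry[0] > self.clock():
            return entry[1].get("can_read", False)
        try:
            perms = self.db_lookup(grantee, resource)
        except Exception:
            return False  # fail closed: deny rather than allow (AC1/AC2)
        self._cache[key] = (self.clock() + self.ttl, perms)
        return perms.get("can_read", False)

    def invalidate(self, grantee, resource):
        # revocation event handler: drop the cached entry immediately
        self._cache.pop((grantee, resource), None)
```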

6.4 Required Combinations Applied #

| Path | CAS Required? | Idempotency Key Required? | Lease Required? | Fencing Token? |
| --- | --- | --- | --- | --- |
| File write conflict | YES — CAS on version_number | YES — upload idempotency key | NO (no crash-holder scenario) | NO |
| Block dedup | NO (hash is the identity; existence check is idempotent by design) | YES — content_hash IS the idempotency key | NO | NO |
| Sync notification | NO | YES — sequence_number prevents duplicate processing | NO | NO |
| Share permission | YES — conditional update on status | YES — idempotency key on grant/revoke request | NO | NO |

Step 7 — Axiomatic Validation #

Goal: Source-of-truth table. No dual truth.

| State Question | Source of Truth | Notes |
| --- | --- | --- |
| What bytes does this file contain? | Block store (S3), keyed by SHA-256 hash | Block content is immutable; hash is the identity |
| What is the current block_list of a file? | files table, block_list column (Postgres) | Updated on every new version via CAS |
| What versions has this file had? | file_versions table (Cassandra or Postgres append-only) | Immutable append log; never updated in place |
| What events have happened to this file? | file_change_log table (Cassandra, per file_id partition) | Immutable event log; source of truth for sync |
| What has device D seen? | devices table, sync_cursor column (Postgres or Redis) | This is a bookmark, not a duplicate of the log. The log is the source of truth; the cursor is a pointer. NO DUAL TRUTH. |
| Does user U have access to file F? | share_permissions table (Postgres) | Redis cache is a read cache, NOT a source of truth. Invalidated on revocation. |
| How much storage has user U used? | users table, quota_used_bytes column | Cached counter; asynchronously reconciled against actual block references |
| Is there a conflict on file F? | conflicts table (Postgres) | Conflict object is primary state with its own lifecycle |

Dual truth check:

The SyncCursor is the canonical example of dual-truth risk. If both the FileChangeLog and a separate SyncCursor table claim to be the source of truth for “what device D has seen,” updates to both must be kept in sync — which is exactly the dual-truth problem. The correct design is: FileChangeLog is the source of truth for what happened; SyncCursor is a bookmark (offset) stored in Device or a separate cursor table, pointing into the log. If the cursor is lost, it can be reset to 0 (full resync) or to a known checkpoint. The log is never lost.
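
The log-plus-bookmark relationship is easy to make concrete. A sketch, assuming an in-memory list as the append-only log (`ChangeLog` and its methods are illustrative):

```python
class ChangeLog:
    """Append-only log with cursor-as-bookmark semantics: the log is the
    source of truth; a lost cursor just means replaying from 0 (full resync)."""
    def __init__(self):
        self.events = []  # index in this list = sequence number

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1

    def read_since(self, cursor):
        # everything the device has not yet seen, in order
        return self.events[cursor:]
```

A device that processes a batch simply advances its cursor by the batch length; nothing about the log itself changes.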

Validation result: No dual truth found. Each state question has exactly one source of truth.


Step 8 — Algorithm Design #

Goal: Pseudocode for every write path and state machines.

8.1 Block Upload (Delta Sync) Algorithm #

// Client-side: compute local manifest
function compute_manifest(file_path) -> List[BlockManifestEntry]:
    blocks = split_file_into_blocks(file_path, block_size=4MB)
    manifest = []
    for block in blocks:
        hash = sha256(block.content)
        manifest.append({hash: hash, offset: block.offset, size: block.size})
    return manifest

// Client sends manifest to server
// Server checks which blocks are missing
function check_missing_blocks(file_id, manifest) -> List[hash]:
    missing = []
    for entry in manifest:
        if not block_store.exists(entry.hash):  // S3 HEAD request
            missing.append(entry.hash)
    return missing

// Client uploads only missing blocks
function upload_missing_blocks(missing_hashes, manifest):
    for hash in missing_hashes:
        block = find_block_in_manifest(manifest, hash)
        presigned_url = get_presigned_upload_url(hash)
        http_put(presigned_url, block.content)
        // Retry with exponential backoff on failure
        // Hash = idempotency key; re-uploading same hash is safe

// Server commits new file version
function commit_file_version(file_id, user_id, device_id, manifest, idempotency_key):
    // Idempotency check
    if idempotency_store.exists(idempotency_key):
        return idempotency_store.get_result(idempotency_key)

    // CAS on version_number
    current = db.select("SELECT version_number FROM files WHERE file_id=?", file_id)
    new_version = current.version_number + 1

    block_list = [entry.hash for entry in manifest]

    // Verify all blocks exist before committing
    for hash in block_list:
        assert block_store.exists(hash), "Block not durable: " + hash

    // Atomic commit: new FileVersion + update File.block_list + append to FileChangeLog
    db.transaction():
        db.execute("""
            UPDATE files
            SET block_list=?, version_number=?, updated_at=NOW()
            WHERE file_id=? AND version_number=?
        """, block_list, new_version, file_id, current.version_number)
        // CAS: if version_number changed since read, this UPDATE affects 0 rows → retry

        if db.rows_affected == 0:
            raise ConcurrentWriteError("Retry with fresh version")

        db.execute("""
            INSERT INTO file_versions (file_id, version_number, block_list, created_at, created_by_device)
            VALUES (?, ?, ?, NOW(), ?)
        """, file_id, new_version, block_list, device_id)

        db.execute("""
            INSERT INTO file_change_log (file_id, event_type, version_number, created_at)
            VALUES (?, 'upload', ?, NOW())
        """, file_id, new_version)

        // Outbox entry for sync notification relay
        db.execute("""
            INSERT INTO outbox (event_type, payload, created_at)
            VALUES ('file_changed', ?, NOW())
        """, json({file_id, new_version, user_id}))

    idempotency_store.put(idempotency_key, {version_number: new_version})
    return {version_number: new_version}
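
The CAS-by-conditional-UPDATE trick at the heart of the pseudocode above is worth seeing run. A minimal demonstration against an in-memory SQLite table (schema and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_id TEXT PRIMARY KEY, version_number INTEGER)")
conn.execute("INSERT INTO files VALUES ('f1', 3)")

def commit_version(conn, file_id, expected_version):
    cur = conn.execute(
        "UPDATE files SET version_number=? WHERE file_id=? AND version_number=?",
        (expected_version + 1, file_id, expected_version),
    )
    # rowcount == 0 means another writer advanced the version first: retry
    return cur.rowcount == 1

ok = commit_version(conn, "f1", 3)    # succeeds: version is 3
lost = commit_version(conn, "f1", 3)  # fails: version is now 4, stale read
```

The losing writer gets `False` rather than silently overwriting, which is exactly the `ConcurrentWriteError` path in the pseudocode.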

8.2 Conflict Detection State Machine #

States: detected → open → resolved

Transitions:
  NONE → detected    : trigger = sync service detects concurrent version vectors
  detected → open    : trigger = Conflict object created in DB, user notified
  open → resolved    : trigger = user chooses a winner (or auto-resolver picks)
  resolved           : terminal (new FileVersion created, Conflict closed)

function on_device_reconnect(device_id, user_id, file_id, device_vv, device_manifest):
    server_vv = read_version_vector(file_id)

    relation = compare_version_vectors(device_vv, server_vv)

    if relation == DOMINATES:
        // device is ahead of server — accept device's version as the new server state
        commit_file_version(file_id, user_id, device_id, device_manifest,
                            idempotency_key=(device_id, device_vv))

    elif relation == DOMINATED:
        // server is ahead — tell device to pull
        return {action: "pull", server_version: server_vv}

    elif relation == CONCURRENT:
        // Conflict: neither dominates
        conflict_id = create_conflict(file_id, device_id, device_vv, server_vv)
        // Outbox entry → notify user
        return {action: "conflict", conflict_id: conflict_id}
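`compare_version_vectors` is referenced above but not spelled out. A common implementation treats a version vector as a map from actor (device) id to a counter; one vector dominates another when it is ahead on at least one entry and behind on none. A sketch in Python (names are assumptions; an `EQUAL` outcome, omitted from the reconnect handler, simply means no action is needed):

```python
DOMINATES, DOMINATED, EQUAL, CONCURRENT = "dominates", "dominated", "equal", "concurrent"

def compare_version_vectors(a, b):
    """Compare version vectors `a` and `b`: dicts of actor_id -> counter.

    Missing entries are treated as 0. Returns the relation of `a` to `b`.
    """
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return CONCURRENT   # neither dominates -> conflict
    if a_ahead:
        return DOMINATES    # a strictly ahead
    if b_ahead:
        return DOMINATED    # b strictly ahead
    return EQUAL
```

Note the counters are server-assigned sequence numbers, not wall-clock timestamps, per the clock-skew mitigation in Step 14.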

function resolve_conflict(conflict_id, resolution_choice):
    // CAS on conflict status
    conflict = db.select("SELECT * FROM conflicts WHERE conflict_id=? AND status='open'", conflict_id)
    if conflict is None:
        return ALREADY_RESOLVED

    if resolution_choice == KEEP_SERVER:
        winning_block_list = conflict.server_block_list
    elif resolution_choice == KEEP_DEVICE:
        winning_block_list = conflict.device_block_list
    elif resolution_choice == KEEP_BOTH:
        // Create a copy of the device version with a suffix (e.g., "file (conflicted copy).txt")
        create_conflict_copy(conflict.file_id, conflict.device_block_list)
        winning_block_list = conflict.server_block_list

    db.transaction():
        db.execute("UPDATE conflicts SET status='resolved' WHERE conflict_id=? AND status='open'", conflict_id)
        commit_file_version(conflict.file_id, winning_block_list, idempotency_key=conflict_id+"_resolve")

8.3 Sync Pull Algorithm (Device on Reconnect) #

function sync_on_reconnect(device_id, file_id):
    // Read stored cursor (last processed sequence number)
    cursor = db.select("SELECT sync_cursor FROM devices WHERE device_id=?", device_id)
    last_seq = cursor.sync_cursor ?? 0

    // Read all events since last_seq
    events = db.select("""
        SELECT * FROM file_change_log
        WHERE file_id=? AND sequence_number > ?
        ORDER BY sequence_number ASC
    """, file_id, last_seq)

    for event in events:
        apply_event_to_local_state(device_id, event)

        // Update cursor atomically after each event
        db.execute("""
            UPDATE devices SET sync_cursor=? WHERE device_id=? AND sync_cursor=?
        """, event.sequence_number, device_id, last_seq)
        last_seq = event.sequence_number

function apply_event_to_local_state(device_id, event):
    if event.type == 'upload':
        delta = compute_block_delta(local_block_list, event.block_list)
        download_missing_blocks(delta.missing_blocks)
        update_local_file(event.block_list)
    elif event.type == 'delete':
        mark_local_file_deleted(event.file_id)
    elif event.type == 'restore':
        delta = compute_block_delta(local_block_list, event.block_list)
        download_missing_blocks(delta.missing_blocks)
        update_local_file(event.block_list)
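`compute_block_delta` above can be a plain set difference over the two ordered hash lists. A sketch in Python (the return shape is an assumption, chosen to match how `download_missing_blocks` is used):

```python
def compute_block_delta(local_block_list, remote_block_list):
    """Diff two ordered lists of block hashes.

    Returns the hashes the device must download (present remotely,
    absent locally) and the ones it can drop. The remote ordering is
    preserved so blocks can be fetched and assembled in order.
    """
    local = set(local_block_list)
    remote = set(remote_block_list)
    return {
        "missing_blocks": [h for h in remote_block_list if h not in local],
        "stale_blocks": [h for h in local_block_list if h not in remote],
    }
```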

8.4 Quota Enforcement Algorithm #

function check_quota_before_upload(user_id, new_file_bytes):
    // Pessimistic check against cached counter
    current_usage = redis.get("quota:" + user_id) ?? db.select("SELECT quota_used_bytes FROM users WHERE user_id=?", user_id)
    quota_limit = db.select("SELECT quota_limit_bytes FROM users WHERE user_id=?", user_id)

    if current_usage + new_file_bytes > quota_limit:
        raise QuotaExceededError()

    // Reserve space optimistically
    redis.incrby("quota:" + user_id, new_file_bytes)
    // On upload failure, release reservation
    // On upload success, no-op (counter already incremented)
    // Async reconciliation job computes exact usage periodically
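The reserve/release discipline above can be sketched with an in-memory stand-in for the Redis counter (class and method names are illustrative; in production `reserve` maps to INCRBY and `release` to DECRBY):

```python
class QuotaExceededError(Exception):
    pass

class QuotaCounter:
    """In-memory stand-in for the per-user Redis quota counter."""

    def __init__(self, limit_bytes, used_bytes=0):
        self.limit = limit_bytes
        self.used = used_bytes

    def reserve(self, nbytes):
        # Pessimistic check, then optimistic reservation.
        if self.used + nbytes > self.limit:
            raise QuotaExceededError()
        self.used += nbytes

    def release(self, nbytes):
        # Called when the upload fails after the reservation was taken.
        self.used = max(0, self.used - nbytes)
```

The async reconciliation job then overwrites `used` with the exact sum computed from file records, bounding any drift.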

Step 9 — Logical Data Model #

Goal: Schema with partition keys derived from invariant scope.

Tables #

files #

CREATE TABLE files (
    file_id         UUID PRIMARY KEY,
    owner_user_id   UUID NOT NULL REFERENCES users(user_id),
    parent_folder_id UUID REFERENCES folders(folder_id),
    name            TEXT NOT NULL,
    block_list      TEXT[] NOT NULL,         -- ordered list of SHA-256 hashes
    version_number  BIGINT NOT NULL DEFAULT 1,
    status          TEXT NOT NULL DEFAULT 'active',  -- active | deleted
    size_bytes      BIGINT NOT NULL,
    content_hash    TEXT,                    -- hash of full file (optional, for quick comparison)
    created_at      TIMESTAMPTZ NOT NULL,
    updated_at      TIMESTAMPTZ NOT NULL,
    CONSTRAINT files_status_check CHECK (status IN ('active', 'deleted'))
);
-- Partition key for access: owner_user_id (all files by user)
-- CAS key: (file_id, version_number) for optimistic concurrency
CREATE INDEX files_owner_idx ON files(owner_user_id, status);
CREATE INDEX files_folder_idx ON files(parent_folder_id, status);

file_versions #

-- Append-only; never updated. Cassandra or Postgres.
CREATE TABLE file_versions (
    file_id         UUID NOT NULL,
    version_number  BIGINT NOT NULL,
    block_list      TEXT[] NOT NULL,
    size_bytes      BIGINT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    created_by_device UUID,
    created_by_user UUID NOT NULL,
    PRIMARY KEY (file_id, version_number)
);
-- Partition key: file_id (all versions of one file co-located)
-- Ordering: version_number ASC within file_id

file_change_log #

-- Append-only event log. Cassandra preferred for high-write throughput.
CREATE TABLE file_change_log (
    file_id         UUID NOT NULL,
    sequence_number BIGINT NOT NULL,         -- monotonically increasing within file_id
    event_type      TEXT NOT NULL,           -- upload | delete | restore | conflict_resolved | shared | unshared
    version_number  BIGINT,                  -- which FileVersion this event references (if applicable)
    actor_device_id UUID,
    actor_user_id   UUID NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    metadata        JSONB,
    PRIMARY KEY (file_id, sequence_number)
);
-- Partition key: file_id → all events for one file co-located, ordered by sequence_number

blocks (index table — not the block content itself) #

-- Block content stored in S3. This table is an index for existence checks and metadata.
CREATE TABLE blocks (
    content_hash    TEXT PRIMARY KEY,        -- SHA-256 hex
    size_bytes      BIGINT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    reference_count BIGINT NOT NULL DEFAULT 1  -- for GC: when 0, eligible for deletion
);

devices #

CREATE TABLE devices (
    device_id       UUID PRIMARY KEY,
    user_id         UUID NOT NULL REFERENCES users(user_id),
    device_name     TEXT,
    platform        TEXT,                    -- mac | windows | linux | ios | android | web
    sync_cursor     BIGINT NOT NULL DEFAULT 0,  -- offset into file_change_log per user
    last_seen_at    TIMESTAMPTZ,
    registered_at   TIMESTAMPTZ NOT NULL
);
CREATE INDEX devices_user_idx ON devices(user_id);

Note on SyncCursor: The sync_cursor in the devices table is a bookmark into the FileChangeLog, not a duplicate of the log. It is the device’s “read position.” The log is the source of truth; this cursor is how the device knows where to resume. If lost, resync from 0.

conflicts #

CREATE TABLE conflicts (
    conflict_id         UUID PRIMARY KEY,
    file_id             UUID NOT NULL REFERENCES files(file_id),
    status              TEXT NOT NULL DEFAULT 'open',  -- open | resolved
    server_version_number BIGINT NOT NULL,
    device_id           UUID NOT NULL,
    device_block_list   TEXT[] NOT NULL,
    server_block_list   TEXT[] NOT NULL,
    resolution          TEXT,               -- keep_server | keep_device | keep_both
    resolved_at         TIMESTAMPTZ,
    created_at          TIMESTAMPTZ NOT NULL,
    CONSTRAINT conflicts_status_check CHECK (status IN ('open', 'resolved'))
);
CREATE INDEX conflicts_file_idx ON conflicts(file_id, status);

share_permissions #

CREATE TABLE share_permissions (
    permission_id   UUID PRIMARY KEY,
    resource_id     UUID NOT NULL,           -- file_id or folder_id
    resource_type   TEXT NOT NULL,           -- file | folder
    grantor_user_id UUID NOT NULL REFERENCES users(user_id),
    grantee_user_id UUID NOT NULL REFERENCES users(user_id),
    access_level    TEXT NOT NULL,           -- read | write | admin
    status          TEXT NOT NULL DEFAULT 'active',  -- active | revoked
    created_at      TIMESTAMPTZ NOT NULL,
    revoked_at      TIMESTAMPTZ,
    CONSTRAINT sp_status_check CHECK (status IN ('active', 'revoked')),
    CONSTRAINT sp_access_check CHECK (access_level IN ('read', 'write', 'admin'))
);
CREATE INDEX sp_grantee_idx ON share_permissions(grantee_user_id, status);
CREATE INDEX sp_resource_idx ON share_permissions(resource_id, status);
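The permission-check path described in Steps 10 and 12 (Redis cache hit = eventual, miss = strong read from this table) is the classic cache-aside pattern. A sketch in Python, with `db_lookup` standing in for the Postgres query and a 60-second TTL per Step 16 (all names are illustrative):

```python
import time

class PermissionChecker:
    """TTL cache in front of the share_permissions source of truth."""

    def __init__(self, db_lookup, ttl=60.0, clock=time.monotonic):
        self.db_lookup = db_lookup
        self.ttl = ttl
        self.clock = clock
        self.cache = {}  # (grantee_id, resource_id) -> (decision, expires_at)

    def can_access(self, grantee_id, resource_id):
        key = (grantee_id, resource_id)
        entry = self.cache.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                                 # cache hit: eventual
        allowed = self.db_lookup(grantee_id, resource_id)   # miss: strong read
        self.cache[key] = (allowed, self.clock() + self.ttl)
        return allowed

    def invalidate(self, grantee_id, resource_id):
        # Driven by the pub/sub invalidation event on revocation.
        self.cache.pop((grantee_id, resource_id), None)
```

The TTL is the backstop: even if the invalidation event is lost, a revoked grantee loses access within one TTL window.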

idempotency_keys #

CREATE TABLE idempotency_keys (
    idempotency_key TEXT PRIMARY KEY,
    result          JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    expires_at      TIMESTAMPTZ NOT NULL     -- TTL 24h
);

outbox #

CREATE TABLE outbox (
    outbox_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type      TEXT NOT NULL,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    published_at    TIMESTAMPTZ,
    status          TEXT NOT NULL DEFAULT 'pending'  -- pending | published
);
CREATE INDEX outbox_pending_idx ON outbox(status, created_at) WHERE status='pending';
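The outbox relay (Step 10) drains this table in order: fetch pending rows, publish each to Kafka, then mark it published. Publishing before marking yields at-least-once delivery, which is why consumers deduplicate. One polling pass, sketched in Python with the storage and Kafka calls injected (names are illustrative):

```python
def relay_outbox(fetch_pending, publish, mark_published, batch_size=100):
    """One polling pass of the outbox relay.

    fetch_pending(n) -> up to n pending rows, oldest first
    publish          -> sends one event to Kafka (may raise)
    mark_published   -> flips the row's status to 'published'

    A crash between publish and mark_published re-publishes the event
    on the next pass; consumers deduplicate downstream.
    """
    relayed = 0
    for row in fetch_pending(batch_size):
        publish(row["event_type"], row["payload"])
        mark_published(row["outbox_id"])
        relayed += 1
    return relayed
```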

users #

CREATE TABLE users (
    user_id             UUID PRIMARY KEY,
    email               TEXT UNIQUE NOT NULL,
    plan                TEXT NOT NULL DEFAULT 'free',  -- free | plus | professional | business
    quota_limit_bytes   BIGINT NOT NULL DEFAULT 2147483648,  -- 2 GB default
    quota_used_bytes    BIGINT NOT NULL DEFAULT 0,           -- cached counter; reconciled async
    created_at          TIMESTAMPTZ NOT NULL
);

Step 10 — Technology Landscape #

Goal: Capability → shape → product mapping.

| Capability Needed | Shape Required | Product Selected | Reason |
|---|---|---|---|
| Block storage (immutable, content-addressed, globally durable) | Object store: key = hash, value = bytes; S3-compatible; high throughput PUT/GET | AWS S3 | Industry standard; 11-nines durability; CDN-compatible; multi-region replication; pre-signed URLs eliminate proxy hop |
| File metadata (current state, version, status) | Relational: ACID transactions, CAS via optimistic locking, foreign keys | PostgreSQL | Strong consistency; supports CAS via conditional UPDATE; row-level locking; JSONB for block_list |
| File change log + version history (high-write append, large volume, time-series) | Wide-column store: append-only, partition by file_id, time-ordered | Apache Cassandra | Optimized for append-heavy workloads; tunable consistency; partition key = file_id for co-location |
| Sync notification (real-time push to connected devices) | Pub/sub with persistence; consumer groups for multiple devices | Apache Kafka | Ordered per partition (user_id); consumer offset = device sync cursor; replay from any offset; durable |
| Permission cache (low-latency read, TTL, pub/sub invalidation) | In-memory cache with TTL and pub/sub | Redis | O(1) GET/SET; TTL support; Redis pub/sub for invalidation events; Sentinel/Cluster for HA |
| Quota counter (high-throughput increment/decrement) | In-memory atomic counter with persistence | Redis | INCRBY/DECRBY atomic; persist to Postgres on write; reconcile async |
| Conflict management (process object with state machine, strong consistency) | Relational: ACID, CAS on status | PostgreSQL | Same cluster as files; conflict references file; transactional integrity |
| Idempotency key store (short-lived, key-value, TTL 24h) | Key-value with TTL | Redis (or Postgres with TTL index) | Redis for speed; Postgres if durability is required across restarts |
| CDN (block download acceleration for frequently accessed files) | Edge cache with origin pull | Cloudflare CDN | Global PoPs; origin = S3; cache key = block hash (content-addressed = perfect cache hit rate for identical blocks) |
| Outbox relay | Background job consuming outbox table | Custom relay + Kafka producer | Polls outbox table; publishes events to Kafka; marks published |

Step 11 — Deployment Topology #

Goal: Service boundaries and failure domains.

Services #

┌────────────────────────────────────────────────────────────────────┐
│                         Client Layer                               │
│  Desktop (Mac/Win/Linux)  Mobile (iOS/Android)  Web (Browser)     │
└────────────────────────────┬───────────────────────────────────────┘
                             │ HTTPS + WebSocket
┌────────────────────────────▼───────────────────────────────────────┐
│                         API Gateway                                │
│  Rate limiting, auth, TLS termination, routing                    │
└──┬──────────┬──────────┬──────────┬───────────────────────────────┘
   │          │          │          │
   ▼          ▼          ▼          ▼
┌──────┐  ┌──────┐  ┌──────┐  ┌────────────────┐
│Upload│  │Sync  │  │Share │  │Notification    │
│Svc   │  │Svc   │  │Svc   │  │Svc (WebSocket) │
└──┬───┘  └──┬───┘  └──┬───┘  └──────┬─────────┘
   │          │          │             │
   ▼          ▼          ▼             ▼
┌─────────────────────────────────────────────────┐
│                  Data Layer                      │
│                                                  │
│  PostgreSQL (files, conflicts, shares, users)    │
│  Cassandra (file_change_log, file_versions)      │
│  Redis (permission cache, quota counters,        │
│          pub/sub, idempotency keys)              │
│  Kafka (file-changes topic, sync events)         │
│  S3 (block content)                              │
│  Cloudflare CDN (block download acceleration)    │
└─────────────────────────────────────────────────┘

Service Responsibilities #

| Service | Responsibility | Failure Domain |
|---|---|---|
| Upload Service | Manifest validation, block existence check, block upload coordination, commit FileVersion, append FileChangeLog, write outbox | Stateless; horizontally scaled; failure = upload fails, client retries |
| Sync Service | Reconnect handling, version vector comparison, conflict detection, streaming events from FileChangeLog to device | Stateless; horizontally scaled; failure = device retries from stored cursor |
| Share Service | Grant/revoke permissions, check eligibility, publish invalidation events | Stateless; failure = permission operation fails, client retries |
| Notification Service | WebSocket connections to devices, receive Kafka events, push to connected devices | Stateful (WebSocket connections); failure = device falls back to poll |
| Outbox Relay | Poll outbox table, publish to Kafka, mark published | Background job; failure = events delayed, not lost |
| Conflict Resolver | Receive conflict events from Kafka, present to user, apply resolution | Stateless; failure = conflict remains open until user next opens app |

Failure Domains #

  • US-EAST-1 failure: Failover to US-WEST-2 (standby). PostgreSQL primary in US-EAST-1; read replicas in US-WEST-2. S3 cross-region replication. Cassandra multi-region. Kafka MirrorMaker.
  • Single service failure: Other services unaffected. Clients experience partial degradation (e.g., no push notifications if Notification Service is down, but sync still works on reconnect via Sync Service).
  • Redis failure: Fall back to direct Postgres permission check. Quota counter reconciled from Postgres.
  • Kafka failure: Outbox relay accumulates pending events. Devices fall back to poll-based sync.

Step 12 — Consistency Model #

Goal: Per-path classification of consistency guarantees.

| Operation Path | Consistency Model | Justification |
|---|---|---|
| Block upload (PUT to S3) | Strong within region (read-after-write since 2020); eventual across regions | S3 offers strong read-after-write consistency for PUT; replication to other regions is eventual |
| File metadata commit (CAS on version_number) | Strong (serializable via Postgres transaction) | CAS requires seeing the latest version; Postgres serializable transaction ensures this |
| FileChangeLog append | Strong per partition (Cassandra quorum write + quorum read) | Quorum read guarantees seeing the latest quorum write; events are ordered by sequence_number within a partition |
| Permission check (cache hit) | Eventual (bounded by TTL ≤ 60s) | Cache may be up to 60s stale after revocation; acceptable per AC1 SLO |
| Permission check (cache miss → DB) | Strong (Postgres primary read) | Cache miss goes to the source of truth |
| Sync event delivery | Eventual (Kafka at-least-once delivery) | Events may be delivered more than once; consumers deduplicate by sequence_number |
| Quota check (Redis counter) | Eventual (cached counter; reconciled async) | May over-allow up to the cache staleness window; over-quota bounded by the reconciliation job |
| Conflict detection | Strong (version vector comparison at server; Postgres transaction on conflict creation) | A conflict must not be silently dropped; strong consistency required |

Step 13 — Scaling Model #

Goal: Scale type per dimension, hotspots, and strategy.

Scaling Dimensions #

| Dimension | Scale Type | Hotspot Risk | Strategy |
|---|---|---|---|
| Block storage (bytes) | Horizontal partition by hash prefix | None (content-addressed; uniform distribution) | S3 auto-scales; no action needed |
| File metadata (rows) | Horizontal sharding by user_id | Popular users (celebrity accounts in B2B context) | Shard Postgres by user_id; per-shard PostgreSQL instance |
| FileChangeLog (events/sec) | Write-scale by file_id partition in Cassandra | Hot file (shared file edited frequently) | Cassandra partition by file_id; consistent hashing; increase replication factor for hot partitions |
| Upload throughput (Gbps) | Horizontal: more Upload Service instances; S3 transfer acceleration | None (direct-to-S3 bypasses proxy) | Pre-signed URLs: client uploads directly to S3, bypassing the service tier entirely; service only handles manifest and commit |
| Sync notification fanout | Write fan-out to N devices | User with thousands of devices (enterprise) | Cap devices per user; batch notifications; Kafka partition by user_id; Notification Service shards by user_id |
| Permission reads | Read-heavy; cache-dominated | None after cache warm-up | Redis cluster; permission cache hit rate > 99% after warm-up |
| Reconnect storm (after outage) | Thundering herd | All devices reconnect simultaneously after region recovery | Exponential backoff + full jitter on client; rate limiting at API Gateway; Sync Service queue with backpressure |

Block-Level Dedup Savings #

Global dedup across users reduces storage by an estimated 30-70% for typical enterprise workloads (common file types such as PDFs, Office documents, and images share identical blocks). The dedup ratio depends on block size: 4MB blocks capture document-level dedup; 64KB blocks also capture within-document dedup, but at the cost of larger manifests.
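The trade-off can be made concrete with a toy dedup calculator: fixed-size chunking plus content addressing, reporting logical versus physical bytes. A sketch in Python (function name and return shape are illustrative):

```python
import hashlib

def dedup_stats(files, block_size=4 * 1024 * 1024):
    """Chunk each file into fixed-size blocks and content-address them.

    Returns logical bytes (what users see), physical bytes (unique
    blocks actually stored), and the resulting savings ratio.
    """
    unique = {}          # content hash -> block size
    logical = 0
    for data in files:
        logical += len(data)
        for off in range(0, len(data), block_size):
            block = data[off:off + block_size]
            unique[hashlib.sha256(block).hexdigest()] = len(block)
    physical = sum(unique.values())
    return {"logical": logical, "physical": physical,
            "savings": 1 - physical / logical if logical else 0.0}
```

Two files that share a common prefix of whole blocks pay for those blocks once; shrinking `block_size` exposes more shared blocks but multiplies manifest entries.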

Partition Strategy for FileChangeLog #

The FileChangeLog in Cassandra is partitioned by file_id. This means all events for one file are co-located, and reads for sync (streaming events from an offset) are efficient. The risk is a hot partition for a file with very high event rate (e.g., a frequently edited shared document with many collaborators). Mitigation: rate-limit writes per file_id at the Upload Service level.
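The per-file_id write limit can be a token bucket keyed by file_id. A sketch in Python (class name, rate, and burst values are illustrative, not from the design):

```python
import time

class TokenBucket:
    """Per-file_id write rate limiter for the hot-partition mitigation."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate        # tokens refilled per second
        self.burst = burst      # bucket capacity
        self.clock = clock
        self.buckets = {}       # file_id -> (tokens, last_refill_ts)

    def allow(self, file_id):
        now = self.clock()
        tokens, last = self.buckets.get(file_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[file_id] = (tokens - 1, now)
            return True
        self.buckets[file_id] = (tokens, now)
        return False
```

Rejected writes surface to the client as a retryable throttling error, which the existing backoff-with-jitter retry path absorbs.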


Step 14 — Failure Model #

Goal: Per failure type enumeration with detection, impact, and mitigation.

| Failure | Detection | Impact | Mitigation |
|---|---|---|---|
| Block upload fails mid-stream (network drop) | Client receives error or timeout | Partial block upload; file commit not attempted | Resumable upload: client stores upload offset locally; resumes from offset on retry; block hash = idempotency key ensures no corruption |
| File commit fails after blocks uploaded (DB crash) | Client receives 500 or timeout | Blocks uploaded to S3 but file metadata not committed; orphaned blocks | Idempotency key: client retries commit; server checks idempotency key store; if not found, re-runs commit; orphaned blocks cleaned by GC job after TTL |
| Sync service crash during sync | Client receives disconnect | Device's sync_cursor not advanced past the crash point | Client stores cursor locally; on reconnect, resumes from last local cursor; server replays from that point |
| Kafka partition leader failure | Kafka replication detects; leader re-elected | Brief delay in sync notification delivery | Kafka auto-elects new leader; producers retry; at-least-once delivery guaranteed; consumers deduplicate |
| Redis cache failure | Health check + circuit breaker | Permission cache misses; fallback to Postgres | Circuit breaker opens; all permission reads go to Postgres primary; higher latency but correct; Redis recovers and cache warms |
| Reconnect storm after regional outage | Spike in sync request rate; API Gateway metrics | Upload Service and Sync Service CPU spike; latency increase; possible retry cascade | Exponential backoff + full jitter on client; API Gateway rate limiting per user; Sync Service autoscaling with pre-provisioned capacity |
| Ghost file in device cache (file deleted on server, still cached locally) | Deletion event in FileChangeLog not yet processed by device | Device shows deleted file as available | Deletion event is durable in FileChangeLog; device processes on reconnect; TTL on local cache entries (24h backstop); device validates file existence on open |
| Version vector clock skew (device clock wrong) | Conflict detection produces false positive | Spurious conflict object created | Use server-assigned sequence numbers for ordering; do not trust device clocks for version comparison; version vectors use sequence numbers, not wall clock |
| S3 regional outage | S3 health checks; upload failures | Uploads fail; downloads may serve from CDN cache for recently accessed blocks | CDN caches recently downloaded blocks; uploads queue locally on device and retry when S3 recovers; read-only mode for cached files |
| Database primary failure (Postgres) | pg_stat_replication lag monitoring; health check timeout | All writes fail; reads fall to replica (stale) | Automatic failover to replica (Patroni or RDS Multi-AZ); failover time ≈ 30s; uploads fail-fast during failover; clients retry |
| Conflict resolution failure (resolver crashes) | Conflict stuck in 'open' state | User cannot resolve conflict | Conflict remains in 'open' state; user can retry resolution; no data loss (both versions preserved) |
| Quota counter corruption (Redis restart) | Quota reconciliation job detects mismatch | Users may upload beyond quota limit until reconciliation | Reconciliation job runs every 15 minutes; computes exact usage from file records; corrects Redis counter and Postgres field |

Step 15 — SLOs #

| SLO | Target | Measurement |
|---|---|---|
| Upload availability | 99.9% (43.8 min/month downtime) | HTTP success rate for upload commits, measured per 5-min window |
| Upload P99 latency (small file <1MB) | < 500ms | Time from first HTTP byte to commit response |
| Upload P99 latency (large file 1GB, delta = 0 new bytes) | < 2s | Manifest check + commit without block uploads |
| Download P99 latency (cache hit at CDN) | < 50ms | Time from request to first byte of first block |
| Download P99 latency (cache miss, S3 origin) | < 500ms | Time from request to first byte, S3 origin |
| Sync propagation latency (device online, file changed on another device) | < 5s end-to-end (P99) | Time from commit on device A to event received on device B |
| Permission revocation propagation | < 60s (P99) | Time from revocation commit to all file service instances invalidating cache |
| Conflict detection latency | < 2s from reconnect to conflict object created | Time from device reconnect to conflict notification received |
| Version history availability | 99.9% | HTTP success rate for version list API |
| Data durability (uploaded blocks) | 11-nines (≥ 99.999999999%) | Inherited from S3; supplemented by cross-region replication |

Step 16 — Operational Parameters #

| Parameter | Default Value | Tunable? | Notes |
|---|---|---|---|
| Block size | 4 MB | Yes (1MB–64MB) | Smaller blocks = finer-grained delta and better dedup but larger manifests; larger blocks = lower manifest overhead; 4MB is a common sweet spot |
| Block hash algorithm | SHA-256 | No | Cryptographically secure; collision probability negligible at any realistic block count |
| Idempotency key TTL | 24 hours | Yes | After 24h, a retry of an old upload will create a new version |
| Permission cache TTL | 60 seconds (strong mode); 300 seconds (eventual mode) | Yes | Trade-off: lower TTL = faster revocation propagation; higher TTL = lower permission check latency |
| FileChangeLog retention | 180 days | Yes | After 180 days, old events are archived to cold storage (S3 Glacier) |
| Version history retention | Per plan (Free: 30 days; Plus: 180 days; Professional: 365 days; Business: unlimited) | Per plan | Older versions archived or deleted per plan |
| Reconnect backoff base | 1 second | Yes | Exponential backoff with full jitter; cap at 60 seconds |
| Reconnect backoff cap | 60 seconds | Yes | Prevents thundering herd from collapsing to a fixed interval |
| Upload max file size | 50 GB (Business plan) | Yes | Enforced at API Gateway |
| Max devices per user | 100 | Yes | Caps notification fan-out overhead |
| Quota reconciliation interval | 15 minutes | Yes | How often the background job recomputes exact usage from DB |
| Block GC delay | 7 days after last reference | Yes | After all files referencing a block are deleted, the block is eligible for S3 deletion after 7 days |
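The reconnect backoff parameters above (base 1s, cap 60s, full jitter) combine into a single delay function:

```python
import random

def reconnect_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter.

    Returns a delay sampled uniformly from [0, min(cap, base * 2^attempt)],
    so a fleet of reconnecting devices spreads out instead of retrying
    in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

Sampling from the full [0, ceiling] range (rather than adding small jitter to a fixed delay) is what prevents the synchronized retry waves described in the reconnect-storm runbook.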

Step 17 — Runbooks #

Runbook 1: High Upload Failure Rate #

Trigger: Upload success rate drops below 99% for 5 consecutive minutes.

Diagnosis sequence:

  1. Check API Gateway error logs: distinguish 4xx (client errors, e.g., quota exceeded) from 5xx (service errors).
  2. If 5xx: check Upload Service logs for DB connection errors, S3 timeout errors, or CAS contention errors.
  3. If DB CAS contention high: check pg_stat_activity for long-running transactions blocking the files table.
  4. If S3 errors: check AWS S3 Service Health Dashboard; check S3 error rate in CloudWatch.
  5. If Upload Service crash: check pod restart count in Kubernetes; check OOM events.

Mitigation:

  • DB contention: kill long-running queries; identify and fix the offending query.
  • S3 outage: activate read-only mode (downloads continue from CDN; uploads queue locally on device).
  • Upload Service crash: Kubernetes auto-restarts; clients retry with exponential backoff.

Runbook 2: Sync Propagation Latency Spike #

Trigger: P99 sync latency exceeds 10 seconds for 5 minutes.

Diagnosis sequence:

  1. Check Kafka consumer group lag for file-changes topic. High lag = notification delivery is behind.
  2. Check Notification Service throughput and connection count. Dropped WebSocket connections?
  3. Check Sync Service queue depth. Reconnect storm?
  4. Check FileChangeLog read latency in Cassandra. Hot partition?

Mitigation:

  • Kafka consumer lag: scale up Notification Service replicas; increase Kafka partition count.
  • Reconnect storm: check if a regional incident just recovered; engage rate limiting at API Gateway.
  • Cassandra hot partition: identify the hot file_id; rate-limit uploads to that file_id at Upload Service.

Runbook 3: Permission Revocation Not Propagating #

Trigger: Alert from permission audit job: grantee can still access file 5 minutes after revocation.

Diagnosis sequence:

  1. Verify revocation is committed in Postgres share_permissions table (status = 'revoked').
  2. Check Redis cache entry for (grantee_id, resource_id): is TTL still active?
  3. Check Outbox table: is the revocation event still in pending status (Outbox relay stalled)?
  4. Check Kafka: did the perm.invalidate message reach the file service instances?

Mitigation:

  • Cache entry stale: force TTL expiry via Redis DEL on the cache key (manual intervention).
  • Outbox relay stalled: restart outbox relay service; it will republish pending events.
  • Kafka consumer stalled: restart notification service consumer group.

Runbook 4: Reconnect Storm After Outage #

Trigger: Sync request rate spikes to > 10x normal immediately after service recovery.

Actions (in order):

  1. Verify API Gateway rate limiting is active: 100 sync requests/user/minute.
  2. Check that clients are using exponential backoff + jitter (not fixed-interval retry).
  3. If Sync Service CPU > 80%: scale up horizontally (Kubernetes HPA should trigger; verify).
  4. If Cassandra read latency spikes due to backlog: temporarily increase read timeout; increase Cassandra read replica count.
  5. Monitor until request rate normalizes (typically 5-10 minutes with proper backoff).

Runbook 5: Orphaned Block Cleanup #

Trigger: Block GC job reports orphaned blocks (blocks in S3 with reference_count = 0 for > 7 days).

Actions:

  1. Verify reference_count calculation: run reconciliation query across files.block_list and blocks table.
  2. If reference_count is correct (truly unreferenced): schedule S3 deletion via lifecycle policy.
  3. Log block hashes before deletion for audit trail.
  4. Do NOT delete blocks with reference_count > 0 (even if files are in deleted state — blocks may be referenced by FileVersions for restore purposes).

Step 18 — Observability #

Metrics #

| Metric | Type | Labels | Alert Threshold |
|---|---|---|---|
| upload.requests.total | Counter | status (success/failure), error_type | — |
| upload.latency.seconds | Histogram | percentile | P99 > 2s for 5min |
| upload.bytes.total | Counter | — | — |
| block.dedup.ratio | Gauge | — | < 0.3 (unexpectedly low dedup) |
| sync.propagation.latency.seconds | Histogram | — | P99 > 10s for 5min |
| kafka.consumer.lag | Gauge | consumer_group, topic | > 10000 messages |
| permission.cache.hit_rate | Gauge | — | < 0.95 for 5min |
| conflict.open.count | Gauge | — | > 1000 open conflicts |
| quota.check.failures.total | Counter | — | > 100/min |
| db.connection_pool.waiting | Gauge | db_name | > 10 waiting |
| s3.error.rate | Gauge | operation | > 0.01 (1%) |
| device.reconnect.rate | Gauge | — | > 10x baseline |

Traces #

  • Every upload request carries a trace ID from client through Upload Service → S3 → DB commit → Outbox.
  • Every sync event carries the FileChangeLog sequence_number as a trace tag.
  • Conflict detection traces: version vector comparison and conflict creation are traced as a single span.

Logs #

| Log Event | Fields | Purpose |
|---|---|---|
| upload.committed | file_id, version_number, user_id, device_id, block_count, new_block_count, duration_ms | Audit + dedup stats |
| conflict.detected | file_id, device_id, device_vv, server_vv, conflict_id | Conflict rate monitoring |
| conflict.resolved | conflict_id, resolution, resolved_by | Resolution tracking |
| permission.granted | permission_id, grantor, grantee, resource_id, access_level | Access audit |
| permission.revoked | permission_id, grantor, grantee, resource_id, revoked_at | Access audit |
| quota.exceeded | user_id, current_bytes, limit_bytes, file_size_bytes | Quota enforcement |
| sync.resumed | device_id, cursor_from, cursor_to, events_replayed | Sync health |

Dashboards #

  1. Upload health: request rate, success rate, P50/P95/P99 latency, block upload rate, dedup ratio.
  2. Sync health: propagation latency, Kafka consumer lag, reconnect rate, conflict rate.
  3. Storage: total blocks stored, bytes stored, dedup savings (bytes not uploaded due to dedup), orphaned blocks.
  4. Access control: permission grant/revoke rate, cache hit rate, revocation propagation latency.
  5. Infrastructure: DB connection pool, Cassandra read/write latency, Redis memory usage, S3 error rate.

Step 19 — Cost Model #

Cost Drivers #

| Component | Unit Cost (approximate) | Volume (hypothetical 10M users) | Monthly Cost Estimate |
|---|---|---|---|
| S3 block storage | $0.023/GB/month | 10M users × 2GB average = 20PB; with 50% dedup = 10PB | $230,000 |
| S3 PUT requests (block uploads) | $0.005 per 1000 PUT | 10M uploads/day × 20 blocks avg = 200M PUT/day = 6B/month | $30,000 |
| S3 GET requests (block downloads) | $0.0004 per 1000 GET | 50M downloads/day × 20 blocks avg = 1B GET/day = 30B/month | $12,000 |
| Cloudflare CDN | ~$0.01/GB (egress) | 30B GET/month; CDN hit rate 80%; 20% from S3 = 6B × 20KB avg = 120TB egress from S3 | $1,200 (S3 egress) + Cloudflare plan |
| PostgreSQL (RDS Multi-AZ db.r6g.4xlarge) | ~$1,200/month per instance | 2 instances (primary + replica) | $2,400 |
| Cassandra (3 nodes × i3.4xlarge per region) | ~$1,200/month per node | 3 nodes in US-EAST-1; 3 in US-WEST-2 | $7,200 |
| Redis (ElastiCache r6g.xlarge) | ~$250/month | 2 instances (primary + replica) | $500 |
| Kafka (MSK 3 brokers × kafka.m5.2xlarge) | ~$400/month per broker | 3 brokers | $1,200 |
| Compute (ECS/Kubernetes, Upload + Sync + Share + Notification services) | ~$0.05/vCPU-hour | 50 vCPU average | $1,800 |
| Data transfer (S3 to EC2, same region) | Free | | $0 |
| Total (estimated) | | | ~$286,000/month |
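
The roll-up in the table can be sanity-checked with simple arithmetic. The sketch below mirrors the table row by row, assuming decimal units (1PB = 10^6 GB), a 720-hour month, and 6 Cassandra nodes total (3 per region):

```python
# Back-of-envelope monthly cost roll-up mirroring the table above.
GB = 1
PB = 1_000_000 * GB  # decimal units, as in the table's estimates

monthly_costs = {
    "s3_storage": 10 * PB * 0.023,                  # 10PB stored after 50% dedup
    "s3_put":     6_000_000_000 / 1_000 * 0.005,    # 6B PUT requests/month
    "s3_get":     30_000_000_000 / 1_000 * 0.0004,  # 30B GET requests/month
    "s3_egress":  120_000 * GB * 0.01,              # 120TB CDN-miss egress
    "postgres":   2 * 1_200,                        # primary + replica
    "cassandra":  6 * 1_200,                        # 3 nodes x 2 regions
    "redis":      2 * 250,
    "kafka":      3 * 400,
    "compute":    50 * 0.05 * 720,                  # 50 vCPU avg, 720h/month
}
total = sum(monthly_costs.values())  # ~$286,000/month
```

Storage dominates at roughly 80% of the bill, which is why the first optimization lever below targets the dedup ratio.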

Cost Optimization Levers #

  1. Dedup ratio is the biggest lever: Global cross-user block dedup at 4MB block size typically achieves 40-60% storage reduction for enterprise document workloads. Each additional percentage point saves roughly $2,300/month at this scale (1% of the $230,000 storage line).
  2. CDN cache hit rate: Blocks are content-addressed and immutable, so cached entries never go stale (never invalidate; TTL can be infinite). A high cache hit rate dramatically reduces S3 GET requests.
  3. S3 Intelligent-Tiering: Blocks not accessed for 30 days automatically move to cheaper storage tiers ($0.0125/GB vs. $0.023/GB).
  4. Cassandra compression: LZ4 compression on FileChangeLog reduces storage by ~3x and is native to Cassandra (no extra tooling).
  5. Right-size Cassandra: At lower scales, Cassandra can be replaced with Postgres (fewer moving parts; lower operational cost).
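
The dedup lever comes directly from content addressing: a block whose hash is already in the store is never uploaded or stored again. A minimal in-memory sketch (class and field names hypothetical):

```python
import hashlib

class BlockStore:
    """Content-addressed store: identical blocks are stored once (global dedup sketch)."""

    def __init__(self):
        self.blocks = {}          # content hash -> block bytes
        self.bytes_stored = 0     # bytes actually uploaded and stored
        self.bytes_logical = 0    # bytes users logically stored

    def put(self, block: bytes) -> str:
        h = hashlib.sha256(block).hexdigest()
        self.bytes_logical += len(block)
        if h not in self.blocks:  # dedup check: skip upload if hash is known
            self.blocks[h] = block
            self.bytes_stored += len(block)
        return h

    def dedup_savings(self) -> float:
        """Fraction of logical bytes that were never stored."""
        return 1 - self.bytes_stored / self.bytes_logical
```

Two users uploading the same 4MB block produce one stored copy and 50% savings on that content; `dedup_savings()` is exactly the ratio the cost model's "with 50% dedup" row assumes.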

Step 20 — Evolution Stages #

Stage 1: MVP (0 → 100K users) #

What to build:

  • Single-region deployment (US-EAST-1).
  • Upload Service + Sync Service as a monolith.
  • PostgreSQL for everything (files, file_versions, file_change_log, share_permissions).
  • S3 for block storage.
  • No real-time push: device polls every 30 seconds.
  • Basic conflict detection: last-write-wins (simpler; document caveat to users).
  • No global block dedup: per-user dedup only.

What to defer:

  • Cassandra (Postgres handles the write load at this scale).
  • Kafka (polling is sufficient).
  • Global dedup (per-user dedup is simpler and sufficient).
  • Redis (permissions checked directly from Postgres on every request).
  • Multi-region.

Graduation criteria: 100K users, upload P99 < 1s, sync delay < 30s (polling interval).


Stage 2: Growth (100K → 1M users) #

Adds:

  • Separate Upload, Sync, Share, and Notification services (decompose monolith by service boundary).
  • Redis for permission cache (reduce Postgres read load).
  • WebSocket-based push notifications (reduce sync latency from 30s to < 5s).
  • Global block dedup (same content hash = skip upload globally).
  • Idempotency key store for upload retries.
  • Outbox + Relay for async event propagation.
  • Full version vector conflict detection (replace last-write-wins).
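
The version-vector comparison that replaces last-write-wins is small enough to sketch. Two vectors conflict exactly when each is ahead of the other on at least one device counter (function name and dict representation are illustrative):

```python
def compare(vv_a: dict, vv_b: dict) -> str:
    """Compare two version vectors mapping device_id -> counter.

    Returns 'equal', 'a_dominates', 'b_dominates', or 'conflict'.
    Missing devices count as 0.
    """
    keys = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(k, 0) > vv_b.get(k, 0) for k in keys)
    b_ahead = any(vv_b.get(k, 0) > vv_a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"            # concurrent offline edits -> Conflict object
    if a_ahead:
        return "a_dominates"         # b can fast-forward to a
    if b_ahead:
        return "b_dominates"
    return "equal"
```

Only the "conflict" outcome creates a Conflict object; the two dominance cases are ordinary fast-forward sync.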

Graduation criteria: 1M users, upload P99 < 500ms, sync latency P99 < 5s, conflict detection working.


Stage 3: Scale (1M → 50M users) #

Adds:

  • Cassandra for FileChangeLog and FileVersions (write throughput beyond Postgres single-shard capacity).
  • Kafka for all async event propagation (replace polling-based Outbox with Kafka consumers).
  • Postgres sharding by user_id (horizontal scale for file metadata).
  • Multi-region active-active (US-EAST-1 + EU-WEST-1; data sovereignty for EU users).
  • Reconnect storm protection (exponential backoff enforcement + API Gateway rate limiting).
  • S3 Intelligent-Tiering for cost optimization.
  • CDN integration for block downloads.
  • Quota reconciliation background job (async exact computation).

Graduation criteria: 50M users, upload availability 99.9%, sync propagation P99 < 5s globally.


Stage 4: Maturity (50M+ users) #

Adds:

  • Differential sync at sub-block level (rsync-style rolling checksum for large files with small edits — reduce block upload count further).
  • CRDT-based collaborative editing for specific file types (Google Docs-style; Conflict object becomes less necessary for text files).
  • Tiered storage with user-controlled cold archive (files not accessed for 1 year move to Glacier with restore-on-demand).
  • Zero-knowledge encryption option (client-side encryption; server stores ciphertext; server cannot read blocks; dedup becomes per-user only, not global, since ciphertext of same plaintext differs per user key).
  • Federated deployment for enterprise (on-premises Dropbox Business with same protocol; client handles sync identically).
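
The differential-sync item relies on an rsync-style weak rolling checksum: the client slides a window over the local file and updates the checksum in O(1) per byte, so it can cheaply find regions matching server-known blocks. A minimal Adler-32-style sketch (constants and function names illustrative):

```python
MOD = 65521  # largest prime below 2**16, as in Adler-32

def weak_checksum(window: bytes) -> tuple[int, int]:
    """Weak checksum (a, b) of a window, computed from scratch."""
    a = sum(window) % MOD
    b = sum((len(window) - i) * x for i, x in enumerate(window)) % MOD
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, size: int) -> tuple[int, int]:
    """Slide the window one byte right: drop out_byte, add in_byte, in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - size * out_byte + a) % MOD
    return a, b
```

A weak-checksum hit is then confirmed with a strong hash of the candidate window before the region is treated as a match, so only genuinely changed byte ranges are uploaded.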

Trade-off noted for zero-knowledge encryption: Global block dedup (Step 5’s U1 invariant) is incompatible with per-user client-side encryption keys. If user A and user B both encrypt the same file, the resulting blocks differ (different keys → different ciphertext → different hashes). Global dedup collapses to zero. The system must choose: dedup (store plaintext on server) or zero-knowledge (sacrifice global dedup). This is a product decision, not a technical limitation.
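
The incompatibility is easy to demonstrate. The toy XOR-keystream cipher below is for illustration only (not real cryptography): the same plaintext block under two user keys yields two ciphertexts with different content hashes, so a content-addressed store sees no duplicate:

```python
import hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Toy XOR-keystream cipher, illustration only. XOR twice to decrypt."""
    stream = b""
    counter = 0
    while len(stream) < len(data):  # derive keystream blocks from key + counter
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ s for p, s in zip(data, stream))

block = b"identical file content " * 100
ct_a = toy_encrypt(b"user-a-key", block)
ct_b = toy_encrypt(b"user-b-key", block)
# Same plaintext, different keys -> different ciphertext -> different block
# hashes -> zero cross-user dedup.
```

With plaintext blocks the two uploads would hash identically and dedup to one stored copy; with per-user keys the server can only dedup within a single user's own uploads.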
