Skip to main content
  1. System Design Components/

Distributed Key-Value Store Analysis Note

Distributed Key-Value Store Analysis Note #

This note captures the full step-by-step analysis from the infra interview recipe for a distributed key-value store, including the concrete request-flow layer that makes the design operationally clear.

Step 1 — Normalize #

Assume the baseline prompt is:

  • design a distributed key-value store
  • clients can Put, Get, and Delete
  • store should scale across nodes
  • data should survive node failure

Normalize into state-affecting paths.

RequirementActorOperationState touchedPriority
Client writes value for keyClientoverwrite stateS1
update target
KeyValueEntry
C1
Client reads value for keyClientread sourceS1
read source target
KeyValueEntry
R1
Client deletes keyClientstate transitionS1
update target
KeyValueEntry
C1
System replicates key update to replicasSystemasync processS1
hidden write target
ReplicaState
C1
System routes key to partition ownerSystemread sourceS1
read source target
PartitionMap
C1
System reassigns partition after node failureSystemstate transitionS1
update target
PartitionOwnership
C1
Client reads metadata about cluster/partitionsClientread projectionS1
read projection target
ClusterStatusView
R2

Notes on normalization:

  • Put is overwrite state
    • current value for a key is the main truth
  • Get is read source
    • unless later we allow stale/cache reads
  • Delete is state transition
    • because a key usually moves from present to tombstoned/deleted state
  • replication is async process
    • internal propagation path
  • partition reassignment is state transition
    • ownership changes over time

Likely C1 paths:

  • write key
  • delete key
  • replication
  • partition ownership changes

Likely R1:

  • get key

Step 2 — Critical Path Selection #

Now select the correctness-critical paths.

RequirementPriority classWhy
Write value for keyC1current value truth must not be lost or corrupted
Read value for keyR1core serving path
Delete keyC1delete/tombstone semantics affect truth
Replicate key updateC1durability and replica convergence depend on it
Route key to partition ownerC1wrong routing can send writes to wrong authority
Reassign partition after node failureC1failover correctness depends on valid ownership transfer
Read cluster/partition statusR2operational, not core product path

Baseline critical paths:

  • P1 put key
  • P2 delete key
  • P3 replicate mutation
  • P4 route to partition owner
  • P5 reassign partition ownership

Main R1 path:

  • P6 get key

This is not just a CRUD prompt. The real infra correctness comes from:

  • authoritative ownership of a key/partition
  • current value semantics per key
  • replication and failover correctness

Step 3 — Primary State Extraction #

For a distributed key-value store, the minimal primary state is the current key state plus partition ownership and replication state.

Candidate object labelCandidate sourceCandidate needed for C1/R1?Candidate decomposition actionClassPrimary?OwnerEvolutionScope kindScope value
KeyValueEntrydirect nounYeskeep as candidateentityYesserviceoverwriteinstancekey
Tombstonelifecycle objectYessplit candidateeventNoserviceappend-onlyinstancekey
ReplicaStatehidden write targetYeskeep as candidateprocessYesservicestate machinerelationpartition_id + replica_id
PartitionOwnershiphidden write targetYeskeep as candidateprocessYesservicestate machineinstancepartition_id
PartitionMaphidden write targetYeskeep as candidateentityYesserviceoverwritecollectionkeyspace partitions
ClusterStatusViewderived read modelNoreject as UI artifactprojectionNoderivedoverwritecollectioncluster
ClientSessionhidden write targetNoreject as UI artifactprocessNoderivedoverwriteinstancesession_id

Important modeling choices:

KeyValueEntry #

This is obviously primary:

  • current value for a key
  • possibly fields:
    • value
    • version
    • deleted
    • last_updated

Tombstone #

I would usually fold delete semantics into KeyValueEntry rather than keep Tombstone as a separate primary object in the minimal design.

So the cleaner minimal model is:

  • KeyValueEntry with:
    • present / deleted state
    • version/timestamp

That means:

  • Tombstone is not necessary as a separate primary object in the baseline

ReplicaState #

Needed because:

  • replicas may be follower, catching up, active, stale, failed
  • replication and failover correctness depend on replica lifecycle

PartitionOwnership #

Needed because:

  • one node or replica set must be authoritative for a partition at a time

PartitionMap #

This is also primary because:

  • routing depends on it
  • wrong map means wrong authority

Minimal strict primary set:

  • KeyValueEntry
  • PartitionOwnership
  • PartitionMap
  • ReplicaState

Step 4 — Hard Invariants #

For a distributed key-value store, the hard invariants are about authoritative ownership, valid overwrite/delete semantics, and replica convergence.

PathTierTypeInvariant templateInvariant statement
P1
write path
Put key
HARDeligibilityeligibility templateAction put_key is valid only if request is routed to current authoritative partition owner and version precondition, if any, holds on key at decision time.
P1
write path
Put key
HARDorderingordering templateInstances writes for key are ordered by authoritative commit order within key.
P2
write path
Delete key
HARDeligibilityeligibility templateAction delete_key is valid only if request is routed to current authoritative partition owner and current key state allows deletion on key at decision time.
P2
write path
Delete key
HARDorderingordering templateInstances delete and overwrite mutations for key are ordered by authoritative commit order within key.
P3
write path
Replicate mutation
HARDaccountingaccounting templateReplica state for partition_id equals authoritative committed mutation stream modulo bounded replication lag.
P4
write path
Route to partition owner
HARDuniquenessuniqueness templateKey partition_id maps to at most one logical outcome current authoritative owner within partition_id.
P5
write path
Reassign partition
HARDeligibilityeligibility templateAction reassign_partition is valid only if current owner is failed or relinquished and new owner is eligible and sufficiently caught up holds on partition_id at decision time.
P6
read path
Get key
HARD or SOFTfreshnessfreshness templateRead path get_key reflects authoritative key state within configured consistency bound.

Important note on the last invariant:

Get can be:

  • HARD freshness if the store promises strong reads
  • SOFT freshness if the store allows bounded-stale follower reads

Since the prompt didn’t specify yet, keep this open until execution context.

What matters most:

1. Single authoritative owner per partition #

This prevents split-brain writes.

2. Total authoritative order per key #

Even if the store is distributed, each key must have one authoritative mutation order.

3. Replication correctness #

Replicas must converge to the committed authoritative stream.

4. Reassignment safety #

Failover cannot promote a stale replica that would lose committed writes.

Step 5 — Execution Context #

We now declare the baseline assumptions that constrain valid mechanisms.

FieldValueWhy
Topologysingle service distributedone logical KV service spread across many nodes
Write coordination scopeper object scopecorrectness is per key and per partition ownership scope
Read consistency targetstrong onlysafest baseline unless prompt explicitly allows stale reads
Holder modelnodepartition ownership is held by nodes or replica leaders
Compensation acceptable?Nolost or conflicting writes cannot be repaired later by compensation

Derived implications:

  • holder_may_crash = true

    • nodes can fail while owning partitions
  • cross_service_write = false

    • baseline keeps storage, replication, and routing within one logical system
  • bounded_staleness_allowed = false

    • for baseline Get, unless interviewer later asks for follower reads
  • cross_service_atomicity_required = false

    • we do not need multi-service transactions in the baseline
  • exclusive_claim_required = true

    • partition ownership must be exclusive
  • guarded_by_current_state = true

    • deletes, reassignment, and versioned puts depend on current state

This pushes us toward:

  • one authoritative writer per partition
  • guarded overwrite/delete for key state
  • exclusive claim for partition ownership
  • ordered replication from authoritative owner to replicas

Step 6 — Deterministic Mechanism Selection #

6A. Write Shape #

PathWhyWrite shape
P1 put keycurrent key value is replaced, often with version preconditionoverwrite current value
P2 delete keydelete/tombstone changes current key state under current-state rulesguarded state transition
P3 replicate mutationreplicas consume committed mutation streamappend-only event
P4 route to partition ownerpartition owner is unique current authorityexclusive claim
P5 reassign partitionvalid only if current ownership/failover state allows itguarded state transition

6B. Base Mechanism #

PathWrite shapeBase mechanismRequired companions
P1 put keyoverwrite current valuesingle writer per partitionversion, commit index
P2 delete keyguarded state transitionCAS on (state, version) or leader-applied guarded transitiontombstone/version
P3 replicate mutationappend-only eventappend logcommit index, replica ack/progress
P4 route to partition ownerexclusive claimleasefencing token, heartbeat
P5 reassign partitionguarded state transitionCAS on (state, version)fencing token, replica catch-up check

Why these fit:

Put key #

The clean infra realization is:

  • leader for partition is the single writer
  • so put is not a distributed CAS across all replicas
  • leader serializes overwrite order for that partition

Delete key #

Delete is safer modeled as a guarded transition because:

  • delete vs overwrite ordering matters
  • tombstone/version semantics matter
  • stale deletes must not silently win

Replication #

Replication is naturally:

  • append committed mutations to the partition’s ordered stream
  • replicas follow that stream

Partition ownership #

This is the classic exclusive claim problem:

  • only one node may be authoritative leader for a partition

Reassignment #

Promotion/failover is valid only if:

  • current owner is gone or relinquished
  • candidate replica is eligible/caught up

So it is a guarded transition.

Canonical substrate implied:

  • partitioned leader-based KV store
  • one leader per partition
  • append-only mutation log per partition
  • replicas follow the leader’s commit stream
  • key updates applied by partition leader
  • partition leadership managed with lease/fencing semantics

Step 7 — Read Model / Source of Truth #

For a distributed KV store, truth is mostly direct source state plus replication progress state.

ConceptTruthRead pathRebuild path
C1
source concept
Current value for key
KeyValueEntry on authoritative partition ownerread source directlyauthoritative partition state / replay from committed log
C2
source concept
Partition ownership
PartitionOwnershipread source directlyauthoritative ownership store
C3
source concept
Partition map
PartitionMapread source directlyauthoritative routing metadata
C4
status concept
Replica catch-up state
ReplicaStateread source directlyrecompute from commit index and replica progress
C5
projection concept
Cluster status view
derived from ownership and replica statematerialized viewrecompute from primary metadata

Important distinction:

For the correctness-critical baseline:

  • Get(key) should read from authoritative source state
  • routing should read from authoritative partition map
  • failover should read authoritative replica progress / ownership state

Only operational views like:

  • cluster health
  • status dashboards
  • capacity views

should be treated as projections.

Optional variant:

If interviewer later allows follower reads:

  • Get(key) can become:
    • read source directly from leader for strong reads
    • or bounded-stale read from replica for relaxed consistency

But baseline source-of-truth remains:

  • leader or authoritative partition state

Step 8 — Failure Handling #

PathRetryCompeting writersCrash after commitPublish failureStale holder
P1 put keyretry with request id or version preconditionpartition leader serializes competing writes; stale precondition losescommitted write survives leader crash if replicated past commit pointreplication to followers may lag but committed truth is preservedold leader blocked by fencing token
P2 delete keyretry with version/tombstone preconditionstale delete loses guarded transitioncommitted tombstone survives crash if replicated past commit pointfollower lag acceptable until caught upstale leader blocked by fencing token
P3 replicate mutationfollower retries from last known offset/indexreplicas independently catch up from committed streamleader crash stops new replication until failover; committed log remains authoritativemissing replica append retried from commit indexstale replica cannot become authority if not caught up
P4 route to partition ownerretry after refreshing partition maponly one valid owner/leader should existif owner changed, refreshed map points to new leadern/astale owner rejected by lease epoch/fencing
P5 reassign partitionretry failover transition safelyonly one reassignment wins current ownership transitionpromoted leader crash triggers later reassignmentn/aprevious leader fenced and must not accept writes

What matters most:

1. Stale leader protection #

This is the main safety issue.

Bad case:

  • old leader still accepts writes after failover
  • new leader also accepts writes
  • split brain

Solution:

  • lease epoch / fencing token
  • only current owner epoch may commit writes

2. Commit point vs replica lag #

A leader crash is safe only if:

  • committed mutations are known and preserved
  • failover chooses a replica that is sufficiently caught up

3. Retry semantics #

Client retries should not create ambiguity:

  • either use request ids
  • or require version preconditions / last-write rules
  • or document that repeated puts can produce repeated overwrites safely

4. Delete semantics #

Delete often needs tombstones/versioning so that:

  • stale replicas or stale writes do not resurrect deleted keys incorrectly

Step 9 — Scale Adjustments #

HotspotTypeFirst response
hot partition leadercontention hotspotincrease partition count or rebalance hot key ranges
hot keyswrite throughput hotspotisolate hot keys, use finer partitioning, or special-case caching if semantics allow
replication lagwrite throughput hotspottune replication pipeline, batch log shipping, and control follower catch-up
partition-map churn during failurescontention hotspotslow reassignment, use stable hashing/partitioning, and separate membership from hot data path
strong-read load on leadersread hotspotadd follower reads only for relaxed-consistency paths or add more partitions
storage growth from logs/tombstonesstorage growth hotspotcompaction, snapshotting, and tombstone cleanup after safe retention window

What scales well:

A leader-based distributed KV store scales by:

  • partitioning keyspace
  • making each partition have one authoritative writer
  • replicating per partition independently

So throughput grows primarily with:

  • number of partitions
  • number of healthy leaders
  • replication pipeline capacity

What fails first if designed badly:

  • one global leader for all keys
  • no partitioning
  • failover to stale replicas
  • unlimited tombstone/log growth
  • hot keys sharing the same coarse partition forever

Canonical design conclusion:

  • primary state:
    • KeyValueEntry
    • PartitionOwnership
    • PartitionMap
    • ReplicaState
  • critical invariants:
    • one authoritative owner per partition
    • one authoritative mutation order per key within partition
    • replicas converge from committed mutation stream
    • failover only promotes sufficiently up-to-date replicas
  • mechanisms:
    • single writer per partition
    • append log
    • lease
    • guarded transition for delete/failover
  • reads:
    • source reads from authoritative owner by default
    • operational views are projections

Polished interview answer:

“I’d design this as a partitioned, leader-based distributed KV store. Each partition has one authoritative owner that serializes writes for keys in that partition. Key state is the source of truth, partition ownership is lease-backed and fenced, and replication happens through an ordered append stream from the leader to replicas. Put is an overwrite on current key state applied by the partition leader, Delete is a guarded transition with tombstone/version semantics, and failover is a guarded ownership transition that only promotes a sufficiently caught-up replica. By default, reads come from authoritative source state, and the main scaling levers are more partitions, balanced leadership, and efficient replication.”

Concrete Substrate #

I’ll choose a leader-based replicated KV store as the concrete baseline, because that matches the mechanical derivation we already reached:

  • single writer per partition
  • append log
  • lease
  • guarded failover

Concrete substrate:

  • keyspace partitioned by hash/range into partitions
  • each partition is a small replicated group
  • one leader per partition
  • followers replicate the partition’s ordered mutation log
  • a metadata/config service tracks:
    • partition map
    • current leader per partition
    • replica membership

Concrete tech family:

  • service in Go or Java
  • per-partition replicated log using Raft
  • local durable storage with RocksDB or LSM-backed engine
  • cluster membership / metadata either:
    • built into the same Raft control plane
    • or separate small metadata service like etcd

That is a definite substrate.

Operation Layer #

1. Put key #

API

  • Put(key, value, expected_version?)

Initiator

  • client

Entry point

  • client-facing gateway or any storage node

Authoritative decider

  • current leader of partition(key)

Precondition

  • request routed to current leader
  • if expected_version is present, current version must match

Transition

  • leader appends PUT(key, value, new_version) to partition log
  • once committed, leader updates KeyValueEntry

Response

  • {version, commit_index}

Physical responder

  • leader, or front node after leader response

Failure cases

  • stale routing -> redirect or retry with newer partition map
  • leader unavailable -> retry
  • version mismatch -> conflict

Sequence

  1. client sends Put
  2. entry node resolves partition(key)
  3. request forwarded to partition leader
  4. leader validates version precondition
  5. leader appends mutation to Raft log
  6. majority commits
  7. leader applies write to RocksDB
  8. leader responds success

2. Get key #

API

  • Get(key, consistency = strong | stale-ok)

Initiator

  • client

Entry point

  • gateway or storage node

Authoritative decider

  • leader for strong reads
  • selected replica for stale-ok reads

Precondition

  • none

Transition

  • none

Response

  • {value, version, deleted?, commit_index}

Failure cases

  • stale map -> reroute
  • replica may reject strong read if not leader

Sequence

  1. client sends Get
  2. entry node resolves partition
  3. for strong read, route to leader
  4. leader serves committed state from RocksDB/memtable/cache
  5. return value/version

3. Delete key #

API

  • Delete(key, expected_version?)

Initiator

  • client

Entry point

  • gateway or any node

Authoritative decider

  • partition leader

Precondition

  • request routed to current leader
  • version/state precondition if supplied

Transition

  • leader appends DELETE(key, tombstone_version) to log
  • committed state becomes deleted/tombstoned

Response

  • {version, commit_index}

Failure cases

  • stale leader
  • version mismatch
  • tombstone cleanup must not race with lagging replicas

Sequence

  1. client sends Delete
  2. route to leader
  3. leader validates precondition
  4. leader appends tombstone mutation
  5. majority commits
  6. leader applies tombstone state
  7. reply success

4. Replicate mutation #

API

  • internal Raft replication RPC, e.g. AppendEntries

Initiator

  • partition leader

Entry point

  • follower replicas

Authoritative decider

  • leader for proposed log entries
  • quorum for commit

Precondition

  • follower log matches previous index/term at append boundary

Transition

  • follower appends missing log entries
  • commit index advances after quorum success

Response

  • follower ack / next index hint

Failure cases

  • follower behind -> backfill
  • follower divergence -> log repair by Raft
  • leader crash -> new election

Sequence

  1. leader appends local entry
  2. leader sends AppendEntries to followers
  3. followers validate prev index/term
  4. followers append
  5. quorum ack
  6. leader marks committed
  7. followers later apply committed entries to local state

5. Reassign partition / leader failover #

API

  • internal election + reassignment flow, for example:
    • StartElection(partition_id)
    • TransferLeadership(partition_id) or Raft election RPCs

Initiator

  • system

Entry point

  • partition replicas / control plane

Authoritative decider

  • quorum of partition replica group

Precondition

  • current leader failed or leadership transfer initiated
  • candidate is eligible and sufficiently up to date

Transition

  • new leader epoch/term established
  • PartitionOwnership updated
  • old leader fenced

Response

  • new leader advertised in partition map / metadata

Failure cases

  • split vote
  • stale node tries to keep serving
  • promotion of lagging replica avoided by quorum protocol

Sequence

  1. leader fails or loses lease
  2. replicas detect timeout
  3. candidate starts election
  4. quorum elects new leader
  5. new leader begins serving writes
  6. routing metadata updates
  7. old leader, if it returns, is fenced by higher term

Entry Point vs Decider vs Responder #

This is the distinction that usually needs to be made explicit.

PathEntry pointAuthoritative deciderPhysical responderLogical responder
Putgateway / any nodepartition leaderleader or front nodeKV store
Get stronggateway / any nodepartition leaderleader or front nodeKV store
Get stale-okgateway / any nodechosen replicacontacted replicaKV store
Deletegateway / any nodepartition leaderleader or front nodeKV store
replication RPCleaderfollower + quorum commit logicfollower ackpartition replica group
failoverreplica/control planepartition quorumnewly elected leader/control planeKV store

Concrete HLD #

Main components:

  • client gateway / router
    • resolves key -> partition
    • forwards to current leader or replica
  • partition leader
    • authoritative writer for that partition
    • validates versioned writes
    • appends to replicated log
  • partition followers
    • replicate log
    • may serve stale reads if allowed
  • local storage engine
    • RocksDB/LSM-backed key state + WAL
  • partition metadata / config service
    • partition map
    • replica membership
    • leader identity

Concrete Technologies #

A strong concrete interview answer:

  • Go or Java service
  • Raft per partition group for replication and leader election
  • RocksDB on each replica for local durable key state
  • metadata/config managed either:
    • by a separate small etcd control plane, or
    • by a dedicated metadata Raft group

Short version you can say:

“I’d implement this as a partitioned leader-based KV store. Each partition is a small Raft group. The leader is the single writer for keys in that partition, writes are appended to the partition log and committed on quorum before being applied to RocksDB, deletes are tombstoned, and failover is handled by Raft leader election with term-based fencing.”