Distributed Key-Value Store Analysis Note
Table of Contents
Distributed Key-Value Store Analysis Note #
This note captures the full step-by-step analysis from the infra interview recipe for a distributed key-value store, including the concrete request-flow layer that makes the design operationally clear.
Step 1 — Normalize #
Assume the baseline prompt is:
- design a distributed key-value store
- clients can
Put,Get, andDelete - store should scale across nodes
- data should survive node failure
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client writes value for key | Client | overwrite state | S1update targetKeyValueEntry | C1 |
| Client reads value for key | Client | read source | S1read source targetKeyValueEntry | R1 |
| Client deletes key | Client | state transition | S1update targetKeyValueEntry | C1 |
| System replicates key update to replicas | System | async process | S1hidden write targetReplicaState | C1 |
| System routes key to partition owner | System | read source | S1read source targetPartitionMap | C1 |
| System reassigns partition after node failure | System | state transition | S1update targetPartitionOwnership | C1 |
| Client reads metadata about cluster/partitions | Client | read projection | S1read projection targetClusterStatusView | R2 |
Notes on normalization:
Putisoverwrite state- current value for a key is the main truth
Getisread source- unless later we allow stale/cache reads
Deleteisstate transition- because a key usually moves from present to tombstoned/deleted state
- replication is
async process- internal propagation path
- partition reassignment is
state transition- ownership changes over time
Likely C1 paths:
- write key
- delete key
- replication
- partition ownership changes
Likely R1:
- get key
Step 2 — Critical Path Selection #
Now select the correctness-critical paths.
| Requirement | Priority class | Why |
|---|---|---|
| Write value for key | C1 | current value truth must not be lost or corrupted |
| Read value for key | R1 | core serving path |
| Delete key | C1 | delete/tombstone semantics affect truth |
| Replicate key update | C1 | durability and replica convergence depend on it |
| Route key to partition owner | C1 | wrong routing can send writes to wrong authority |
| Reassign partition after node failure | C1 | failover correctness depends on valid ownership transfer |
| Read cluster/partition status | R2 | operational, not core product path |
Baseline critical paths:
P1put keyP2delete keyP3replicate mutationP4route to partition ownerP5reassign partition ownership
Main R1 path:
P6get key
This is not just a CRUD prompt. The real infra correctness comes from:
- authoritative ownership of a key/partition
- current value semantics per key
- replication and failover correctness
Step 3 — Primary State Extraction #
For a distributed key-value store, the minimal primary state is the current key state plus partition ownership and replication state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| KeyValueEntry | direct noun | Yes | keep as candidate | entity | Yes | service | overwrite | instance | key |
| Tombstone | lifecycle object | Yes | split candidate | event | No | service | append-only | instance | key |
| ReplicaState | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | relation | partition_id + replica_id |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | partition_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | keyspace partitions |
| ClusterStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | cluster |
| ClientSession | hidden write target | No | reject as UI artifact | process | No | derived | overwrite | instance | session_id |
Important modeling choices:
KeyValueEntry #
This is obviously primary:
- current value for a key
- possibly fields:
valueversiondeletedlast_updated
Tombstone #
I would usually fold delete semantics into KeyValueEntry rather than keep Tombstone as a separate primary object in the minimal design.
So the cleaner minimal model is:
KeyValueEntrywith:- present / deleted state
- version/timestamp
That means:
Tombstoneis not necessary as a separate primary object in the baseline
ReplicaState #
Needed because:
- replicas may be follower, catching up, active, stale, failed
- replication and failover correctness depend on replica lifecycle
PartitionOwnership #
Needed because:
- one node or replica set must be authoritative for a partition at a time
PartitionMap #
This is also primary because:
- routing depends on it
- wrong map means wrong authority
Minimal strict primary set:
KeyValueEntryPartitionOwnershipPartitionMapReplicaState
Step 4 — Hard Invariants #
For a distributed key-value store, the hard invariants are about authoritative ownership, valid overwrite/delete semantics, and replica convergence.
| Path | Tier | Type | Invariant template | Invariant statement |
|---|---|---|---|---|
P1write pathPut key | HARD | eligibility | eligibility template | Action put_key is valid only if request is routed to current authoritative partition owner and version precondition, if any, holds on key at decision time. |
P1write pathPut key | HARD | ordering | ordering template | Instances writes for key are ordered by authoritative commit order within key. |
P2write pathDelete key | HARD | eligibility | eligibility template | Action delete_key is valid only if request is routed to current authoritative partition owner and current key state allows deletion on key at decision time. |
P2write pathDelete key | HARD | ordering | ordering template | Instances delete and overwrite mutations for key are ordered by authoritative commit order within key. |
P3write pathReplicate mutation | HARD | accounting | accounting template | Replica state for partition_id equals authoritative committed mutation stream modulo bounded replication lag. |
P4write pathRoute to partition owner | HARD | uniqueness | uniqueness template | Key partition_id maps to at most one logical outcome current authoritative owner within partition_id. |
P5write pathReassign partition | HARD | eligibility | eligibility template | Action reassign_partition is valid only if current owner is failed or relinquished and new owner is eligible and sufficiently caught up holds on partition_id at decision time. |
P6read pathGet key | HARD or SOFT | freshness | freshness template | Read path get_key reflects authoritative key state within configured consistency bound. |
Important note on the last invariant:
Get can be:
HARD freshnessif the store promises strong readsSOFT freshnessif the store allows bounded-stale follower reads
Since the prompt didn’t specify yet, keep this open until execution context.
What matters most:
1. Single authoritative owner per partition #
This prevents split-brain writes.
2. Total authoritative order per key #
Even if the store is distributed, each key must have one authoritative mutation order.
3. Replication correctness #
Replicas must converge to the committed authoritative stream.
4. Reassignment safety #
Failover cannot promote a stale replica that would lose committed writes.
Step 5 — Execution Context #
We now declare the baseline assumptions that constrain valid mechanisms.
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical KV service spread across many nodes |
| Write coordination scope | per object scope | correctness is per key and per partition ownership scope |
| Read consistency target | strong only | safest baseline unless prompt explicitly allows stale reads |
| Holder model | node | partition ownership is held by nodes or replica leaders |
| Compensation acceptable? | No | lost or conflicting writes cannot be repaired later by compensation |
Derived implications:
holder_may_crash = true- nodes can fail while owning partitions
cross_service_write = false- baseline keeps storage, replication, and routing within one logical system
bounded_staleness_allowed = false- for baseline
Get, unless interviewer later asks for follower reads
- for baseline
cross_service_atomicity_required = false- we do not need multi-service transactions in the baseline
exclusive_claim_required = true- partition ownership must be exclusive
guarded_by_current_state = true- deletes, reassignment, and versioned puts depend on current state
This pushes us toward:
- one authoritative writer per partition
- guarded overwrite/delete for key state
- exclusive claim for partition ownership
- ordered replication from authoritative owner to replicas
Step 6 — Deterministic Mechanism Selection #
6A. Write Shape #
| Path | Why | Write shape |
|---|---|---|
P1 put key | current key value is replaced, often with version precondition | overwrite current value |
P2 delete key | delete/tombstone changes current key state under current-state rules | guarded state transition |
P3 replicate mutation | replicas consume committed mutation stream | append-only event |
P4 route to partition owner | partition owner is unique current authority | exclusive claim |
P5 reassign partition | valid only if current ownership/failover state allows it | guarded state transition |
6B. Base Mechanism #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
P1 put key | overwrite current value | single writer per partition | version, commit index |
P2 delete key | guarded state transition | CAS on (state, version) or leader-applied guarded transition | tombstone/version |
P3 replicate mutation | append-only event | append log | commit index, replica ack/progress |
P4 route to partition owner | exclusive claim | lease | fencing token, heartbeat |
P5 reassign partition | guarded state transition | CAS on (state, version) | fencing token, replica catch-up check |
Why these fit:
Put key #
The clean infra realization is:
- leader for partition is the single writer
- so put is not a distributed CAS across all replicas
- leader serializes overwrite order for that partition
Delete key #
Delete is safer modeled as a guarded transition because:
- delete vs overwrite ordering matters
- tombstone/version semantics matter
- stale deletes must not silently win
Replication #
Replication is naturally:
- append committed mutations to the partition’s ordered stream
- replicas follow that stream
Partition ownership #
This is the classic exclusive claim problem:
- only one node may be authoritative leader for a partition
Reassignment #
Promotion/failover is valid only if:
- current owner is gone or relinquished
- candidate replica is eligible/caught up
So it is a guarded transition.
Canonical substrate implied:
- partitioned leader-based KV store
- one leader per partition
- append-only mutation log per partition
- replicas follow the leader’s commit stream
- key updates applied by partition leader
- partition leadership managed with lease/fencing semantics
Step 7 — Read Model / Source of Truth #
For a distributed KV store, truth is mostly direct source state plus replication progress state.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
C1source conceptCurrent value for key | KeyValueEntry on authoritative partition owner | read source directly | authoritative partition state / replay from committed log |
C2source conceptPartition ownership | PartitionOwnership | read source directly | authoritative ownership store |
C3source conceptPartition map | PartitionMap | read source directly | authoritative routing metadata |
C4status conceptReplica catch-up state | ReplicaState | read source directly | recompute from commit index and replica progress |
C5projection conceptCluster status view | derived from ownership and replica state | materialized view | recompute from primary metadata |
Important distinction:
For the correctness-critical baseline:
Get(key)should read from authoritative source state- routing should read from authoritative partition map
- failover should read authoritative replica progress / ownership state
Only operational views like:
- cluster health
- status dashboards
- capacity views
should be treated as projections.
Optional variant:
If interviewer later allows follower reads:
Get(key)can become:read source directlyfrom leader for strong reads- or bounded-stale read from replica for relaxed consistency
But baseline source-of-truth remains:
- leader or authoritative partition state
Step 8 — Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
P1 put key | retry with request id or version precondition | partition leader serializes competing writes; stale precondition loses | committed write survives leader crash if replicated past commit point | replication to followers may lag but committed truth is preserved | old leader blocked by fencing token |
P2 delete key | retry with version/tombstone precondition | stale delete loses guarded transition | committed tombstone survives crash if replicated past commit point | follower lag acceptable until caught up | stale leader blocked by fencing token |
P3 replicate mutation | follower retries from last known offset/index | replicas independently catch up from committed stream | leader crash stops new replication until failover; committed log remains authoritative | missing replica append retried from commit index | stale replica cannot become authority if not caught up |
P4 route to partition owner | retry after refreshing partition map | only one valid owner/leader should exist | if owner changed, refreshed map points to new leader | n/a | stale owner rejected by lease epoch/fencing |
P5 reassign partition | retry failover transition safely | only one reassignment wins current ownership transition | promoted leader crash triggers later reassignment | n/a | previous leader fenced and must not accept writes |
What matters most:
1. Stale leader protection #
This is the main safety issue.
Bad case:
- old leader still accepts writes after failover
- new leader also accepts writes
- split brain
Solution:
- lease epoch / fencing token
- only current owner epoch may commit writes
2. Commit point vs replica lag #
A leader crash is safe only if:
- committed mutations are known and preserved
- failover chooses a replica that is sufficiently caught up
3. Retry semantics #
Client retries should not create ambiguity:
- either use request ids
- or require version preconditions / last-write rules
- or document that repeated puts can produce repeated overwrites safely
4. Delete semantics #
Delete often needs tombstones/versioning so that:
- stale replicas or stale writes do not resurrect deleted keys incorrectly
Step 9 — Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| hot partition leader | contention hotspot | increase partition count or rebalance hot key ranges |
| hot keys | write throughput hotspot | isolate hot keys, use finer partitioning, or special-case caching if semantics allow |
| replication lag | write throughput hotspot | tune replication pipeline, batch log shipping, and control follower catch-up |
| partition-map churn during failures | contention hotspot | slow reassignment, use stable hashing/partitioning, and separate membership from hot data path |
| strong-read load on leaders | read hotspot | add follower reads only for relaxed-consistency paths or add more partitions |
| storage growth from logs/tombstones | storage growth hotspot | compaction, snapshotting, and tombstone cleanup after safe retention window |
What scales well:
A leader-based distributed KV store scales by:
- partitioning keyspace
- making each partition have one authoritative writer
- replicating per partition independently
So throughput grows primarily with:
- number of partitions
- number of healthy leaders
- replication pipeline capacity
What fails first if designed badly:
- one global leader for all keys
- no partitioning
- failover to stale replicas
- unlimited tombstone/log growth
- hot keys sharing the same coarse partition forever
Canonical design conclusion:
- primary state:
KeyValueEntryPartitionOwnershipPartitionMapReplicaState
- critical invariants:
- one authoritative owner per partition
- one authoritative mutation order per key within partition
- replicas converge from committed mutation stream
- failover only promotes sufficiently up-to-date replicas
- mechanisms:
single writer per partitionappend loglease- guarded transition for delete/failover
- reads:
- source reads from authoritative owner by default
- operational views are projections
Polished interview answer:
“I’d design this as a partitioned, leader-based distributed KV store. Each partition has one authoritative owner that serializes writes for keys in that partition. Key state is the source of truth, partition ownership is lease-backed and fenced, and replication happens through an ordered append stream from the leader to replicas. Put is an overwrite on current key state applied by the partition leader, Delete is a guarded transition with tombstone/version semantics, and failover is a guarded ownership transition that only promotes a sufficiently caught-up replica. By default, reads come from authoritative source state, and the main scaling levers are more partitions, balanced leadership, and efficient replication.”
Concrete Substrate #
I’ll choose a leader-based replicated KV store as the concrete baseline, because that matches the mechanical derivation we already reached:
single writer per partitionappend loglease- guarded failover
Concrete substrate:
- keyspace partitioned by hash/range into
partitions - each partition is a small replicated group
- one leader per partition
- followers replicate the partition’s ordered mutation log
- a metadata/config service tracks:
- partition map
- current leader per partition
- replica membership
Concrete tech family:
- service in Go or Java
- per-partition replicated log using Raft
- local durable storage with RocksDB or LSM-backed engine
- cluster membership / metadata either:
- built into the same Raft control plane
- or separate small metadata service like
etcd
That is a definite substrate.
Operation Layer #
1. Put key #
API
Put(key, value, expected_version?)
Initiator
- client
Entry point
- client-facing gateway or any storage node
Authoritative decider
- current leader of
partition(key)
Precondition
- request routed to current leader
- if
expected_versionis present, current version must match
Transition
- leader appends
PUT(key, value, new_version)to partition log - once committed, leader updates
KeyValueEntry
Response
{version, commit_index}
Physical responder
- leader, or front node after leader response
Failure cases
- stale routing -> redirect or retry with newer partition map
- leader unavailable -> retry
- version mismatch -> conflict
Sequence
- client sends
Put - entry node resolves
partition(key) - request forwarded to partition leader
- leader validates version precondition
- leader appends mutation to Raft log
- majority commits
- leader applies write to RocksDB
- leader responds success
2. Get key #
API
Get(key, consistency = strong | stale-ok)
Initiator
- client
Entry point
- gateway or storage node
Authoritative decider
- leader for strong reads
- selected replica for stale-ok reads
Precondition
- none
Transition
- none
Response
{value, version, deleted?, commit_index}
Failure cases
- stale map -> reroute
- replica may reject strong read if not leader
Sequence
- client sends
Get - entry node resolves partition
- for strong read, route to leader
- leader serves committed state from RocksDB/memtable/cache
- return value/version
3. Delete key #
API
Delete(key, expected_version?)
Initiator
- client
Entry point
- gateway or any node
Authoritative decider
- partition leader
Precondition
- request routed to current leader
- version/state precondition if supplied
Transition
- leader appends
DELETE(key, tombstone_version)to log - committed state becomes deleted/tombstoned
Response
{version, commit_index}
Failure cases
- stale leader
- version mismatch
- tombstone cleanup must not race with lagging replicas
Sequence
- client sends
Delete - route to leader
- leader validates precondition
- leader appends tombstone mutation
- majority commits
- leader applies tombstone state
- reply success
4. Replicate mutation #
API
- internal Raft replication RPC, e.g.
AppendEntries
Initiator
- partition leader
Entry point
- follower replicas
Authoritative decider
- leader for proposed log entries
- quorum for commit
Precondition
- follower log matches previous index/term at append boundary
Transition
- follower appends missing log entries
- commit index advances after quorum success
Response
- follower ack / next index hint
Failure cases
- follower behind -> backfill
- follower divergence -> log repair by Raft
- leader crash -> new election
Sequence
- leader appends local entry
- leader sends
AppendEntriesto followers - followers validate prev index/term
- followers append
- quorum ack
- leader marks committed
- followers later apply committed entries to local state
5. Reassign partition / leader failover #
API
- internal election + reassignment flow, for example:
StartElection(partition_id)TransferLeadership(partition_id)or Raft election RPCs
Initiator
- system
Entry point
- partition replicas / control plane
Authoritative decider
- quorum of partition replica group
Precondition
- current leader failed or leadership transfer initiated
- candidate is eligible and sufficiently up to date
Transition
- new leader epoch/term established
PartitionOwnershipupdated- old leader fenced
Response
- new leader advertised in partition map / metadata
Failure cases
- split vote
- stale node tries to keep serving
- promotion of lagging replica avoided by quorum protocol
Sequence
- leader fails or loses lease
- replicas detect timeout
- candidate starts election
- quorum elects new leader
- new leader begins serving writes
- routing metadata updates
- old leader, if it returns, is fenced by higher term
Entry Point vs Decider vs Responder #
This is the distinction that usually needs to be made explicit.
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
Put | gateway / any node | partition leader | leader or front node | KV store |
Get strong | gateway / any node | partition leader | leader or front node | KV store |
Get stale-ok | gateway / any node | chosen replica | contacted replica | KV store |
Delete | gateway / any node | partition leader | leader or front node | KV store |
| replication RPC | leader | follower + quorum commit logic | follower ack | partition replica group |
| failover | replica/control plane | partition quorum | newly elected leader/control plane | KV store |
Concrete HLD #
Main components:
- client gateway / router
- resolves
key -> partition - forwards to current leader or replica
- resolves
- partition leader
- authoritative writer for that partition
- validates versioned writes
- appends to replicated log
- partition followers
- replicate log
- may serve stale reads if allowed
- local storage engine
- RocksDB/LSM-backed key state + WAL
- partition metadata / config service
- partition map
- replica membership
- leader identity
Concrete Technologies #
A strong concrete interview answer:
- Go or Java service
- Raft per partition group for replication and leader election
- RocksDB on each replica for local durable key state
- metadata/config managed either:
- by a separate small etcd control plane, or
- by a dedicated metadata Raft group
Short version you can say:
“I’d implement this as a partitioned leader-based KV store. Each partition is a small Raft group. The leader is the single writer for keys in that partition, writes are appended to the partition log and committed on quorum before being applied to RocksDB, deletes are tombstoned, and failover is handled by Raft leader election with term-based fencing.”