Table of Contents

Distributed Key-Value Store Analysis Note #

This note captures the full step-by-step analysis from the infra interview recipe for a distributed key-value store, including the concrete request-flow layer that makes the design operationally clear.

Step 1 — Normalize #

Assume the baseline prompt is:

design a distributed key-value store
clients can Put, Get, and Delete
store should scale across nodes
data should survive node failure

Normalize into state-affecting paths.

Requirement	Actor	Operation	State touched	Priority
Client writes value for key	Client	overwrite state	`S1` `update target` `KeyValueEntry`	C1
Client reads value for key	Client	read source	`S1` `read source target` `KeyValueEntry`	R1
Client deletes key	Client	state transition	`S1` `update target` `KeyValueEntry`	C1
System replicates key update to replicas	System	async process	`S1` `hidden write target` `ReplicaState`	C1
System routes key to partition owner	System	read source	`S1` `read source target` `PartitionMap`	C1
System reassigns partition after node failure	System	state transition	`S1` `update target` `PartitionOwnership`	C1
Client reads metadata about cluster/partitions	Client	read projection	`S1` `read projection target` `ClusterStatusView`	R2

Notes on normalization:

Put is overwrite state
- current value for a key is the main truth
Get is read source
- unless later we allow stale/cache reads
Delete is state transition
- because a key usually moves from present to tombstoned/deleted state
replication is async process
- internal propagation path
partition reassignment is state transition
- ownership changes over time

Likely C1 paths:

write key
delete key
replication
partition ownership changes

Likely R1:

get key

Step 2 — Critical Path Selection #

Now select the correctness-critical paths.

Requirement	Priority class	Why
Write value for key	C1	current value truth must not be lost or corrupted
Read value for key	R1	core serving path
Delete key	C1	delete/tombstone semantics affect truth
Replicate key update	C1	durability and replica convergence depend on it
Route key to partition owner	C1	wrong routing can send writes to wrong authority
Reassign partition after node failure	C1	failover correctness depends on valid ownership transfer
Read cluster/partition status	R2	operational, not core product path

Baseline critical paths:

P1 put key
P2 delete key
P3 replicate mutation
P4 route to partition owner
P5 reassign partition ownership

Main R1 path:

P6 get key

This is not just a CRUD prompt. The real infra correctness comes from:

authoritative ownership of a key/partition
current value semantics per key
replication and failover correctness

Step 3 — Primary State Extraction #

For a distributed key-value store, the minimal primary state is the current key state plus partition ownership and replication state.

Candidate object label	Candidate source	Candidate needed for C1/R1?	Candidate decomposition action	Class	Primary?	Owner	Evolution	Scope kind	Scope value
KeyValueEntry	direct noun	Yes	keep as candidate	entity	Yes	service	overwrite	instance	key
Tombstone	lifecycle object	Yes	split candidate	event	No	service	append-only	instance	key
ReplicaState	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	relation	partition_id + replica_id
PartitionOwnership	hidden write target	Yes	keep as candidate	process	Yes	service	state machine	instance	partition_id
PartitionMap	hidden write target	Yes	keep as candidate	entity	Yes	service	overwrite	collection	keyspace partitions
ClusterStatusView	derived read model	No	reject as UI artifact	projection	No	derived	overwrite	collection	cluster
ClientSession	hidden write target	No	reject as UI artifact	process	No	derived	overwrite	instance	session_id

Important modeling choices:

`KeyValueEntry` #

This is obviously primary:

current value for a key
possibly fields:
- value
- version
- deleted
- last_updated

`Tombstone` #

I would usually fold delete semantics into KeyValueEntry rather than keep Tombstone as a separate primary object in the minimal design.

So the cleaner minimal model is:

KeyValueEntry with:
- present / deleted state
- version/timestamp

That means:

Tombstone is not necessary as a separate primary object in the baseline

`ReplicaState` #

Needed because:

replicas may be follower, catching up, active, stale, failed
replication and failover correctness depend on replica lifecycle

`PartitionOwnership` #

Needed because:

one node or replica set must be authoritative for a partition at a time

`PartitionMap` #

This is also primary because:

routing depends on it
wrong map means wrong authority

Minimal strict primary set:

KeyValueEntry
PartitionOwnership
PartitionMap
ReplicaState

Step 4 — Hard Invariants #

For a distributed key-value store, the hard invariants are about authoritative ownership, valid overwrite/delete semantics, and replica convergence.

Path	Tier	Type	Invariant template	Invariant statement
`P1` `write path` `Put key`	HARD	eligibility	eligibility template	Action `put_key` is valid only if `request is routed to current authoritative partition owner and version precondition, if any, holds` on `key` at decision time.
`P1` `write path` `Put key`	HARD	ordering	ordering template	Instances `writes for key` are ordered by `authoritative commit order` within `key`.
`P2` `write path` `Delete key`	HARD	eligibility	eligibility template	Action `delete_key` is valid only if `request is routed to current authoritative partition owner and current key state allows deletion` on `key` at decision time.
`P2` `write path` `Delete key`	HARD	ordering	ordering template	Instances `delete and overwrite mutations for key` are ordered by `authoritative commit order` within `key`.
`P3` `write path` `Replicate mutation`	HARD	accounting	accounting template	Replica state for `partition_id` equals `authoritative committed mutation stream` modulo bounded replication lag.
`P4` `write path` `Route to partition owner`	HARD	uniqueness	uniqueness template	Key `partition_id` maps to at most one logical outcome `current authoritative owner` within `partition_id`.
`P5` `write path` `Reassign partition`	HARD	eligibility	eligibility template	Action `reassign_partition` is valid only if `current owner is failed or relinquished and new owner is eligible and sufficiently caught up` holds on `partition_id` at decision time.
`P6` `read path` `Get key`	HARD or SOFT	freshness	freshness template	Read path `get_key` reflects authoritative key state within `configured consistency bound`.

Important note on the last invariant:

Get can be:

HARD freshness if the store promises strong reads
SOFT freshness if the store allows bounded-stale follower reads

Since the prompt didn’t specify yet, keep this open until execution context.

What matters most:

1. Single authoritative owner per partition #

This prevents split-brain writes.

2. Total authoritative order per key #

Even if the store is distributed, each key must have one authoritative mutation order.

3. Replication correctness #

Replicas must converge to the committed authoritative stream.

4. Reassignment safety #

Failover cannot promote a stale replica that would lose committed writes.

Step 5 — Execution Context #

We now declare the baseline assumptions that constrain valid mechanisms.

Field	Value	Why
Topology	single service distributed	one logical KV service spread across many nodes
Write coordination scope	per object scope	correctness is per key and per partition ownership scope
Read consistency target	strong only	safest baseline unless prompt explicitly allows stale reads
Holder model	node	partition ownership is held by nodes or replica leaders
Compensation acceptable?	No	lost or conflicting writes cannot be repaired later by compensation

Derived implications:

holder_may_crash = true
- nodes can fail while owning partitions
cross_service_write = false
- baseline keeps storage, replication, and routing within one logical system
bounded_staleness_allowed = false
- for baseline Get, unless interviewer later asks for follower reads
cross_service_atomicity_required = false
- we do not need multi-service transactions in the baseline
exclusive_claim_required = true
- partition ownership must be exclusive
guarded_by_current_state = true
- deletes, reassignment, and versioned puts depend on current state

This pushes us toward:

one authoritative writer per partition
guarded overwrite/delete for key state
exclusive claim for partition ownership
ordered replication from authoritative owner to replicas

Step 6 — Deterministic Mechanism Selection #

6A. Write Shape #

Path	Why	Write shape
`P1` put key	current key value is replaced, often with version precondition	`overwrite current value`
`P2` delete key	delete/tombstone changes current key state under current-state rules	`guarded state transition`
`P3` replicate mutation	replicas consume committed mutation stream	`append-only event`
`P4` route to partition owner	partition owner is unique current authority	`exclusive claim`
`P5` reassign partition	valid only if current ownership/failover state allows it	`guarded state transition`

6B. Base Mechanism #

Path	Write shape	Base mechanism	Required companions
`P1` put key	`overwrite current value`	`single writer per partition`	version, commit index
`P2` delete key	`guarded state transition`	`CAS on (state, version)` or leader-applied guarded transition	tombstone/version
`P3` replicate mutation	`append-only event`	`append log`	commit index, replica ack/progress
`P4` route to partition owner	`exclusive claim`	`lease`	fencing token, heartbeat
`P5` reassign partition	`guarded state transition`	`CAS on (state, version)`	fencing token, replica catch-up check

Why these fit:

Put key #

The clean infra realization is:

leader for partition is the single writer
so put is not a distributed CAS across all replicas
leader serializes overwrite order for that partition

Delete key #

Delete is safer modeled as a guarded transition because:

delete vs overwrite ordering matters
tombstone/version semantics matter
stale deletes must not silently win

Replication #

Replication is naturally:

append committed mutations to the partition’s ordered stream
replicas follow that stream

Partition ownership #

This is the classic exclusive claim problem:

only one node may be authoritative leader for a partition

Reassignment #

Promotion/failover is valid only if:

current owner is gone or relinquished
candidate replica is eligible/caught up

So it is a guarded transition.

Canonical substrate implied:

partitioned leader-based KV store
one leader per partition
append-only mutation log per partition
replicas follow the leader’s commit stream
key updates applied by partition leader
partition leadership managed with lease/fencing semantics

Step 7 — Read Model / Source of Truth #

For a distributed KV store, truth is mostly direct source state plus replication progress state.

Concept	Truth	Read path	Rebuild path
`C1` `source concept` `Current value for key`	`KeyValueEntry` on authoritative partition owner	`read source directly`	authoritative partition state / replay from committed log
`C2` `source concept` `Partition ownership`	`PartitionOwnership`	`read source directly`	authoritative ownership store
`C3` `source concept` `Partition map`	`PartitionMap`	`read source directly`	authoritative routing metadata
`C4` `status concept` `Replica catch-up state`	`ReplicaState`	`read source directly`	recompute from commit index and replica progress
`C5` `projection concept` `Cluster status view`	derived from ownership and replica state	`materialized view`	recompute from primary metadata

Important distinction:

For the correctness-critical baseline:

Get(key) should read from authoritative source state
routing should read from authoritative partition map
failover should read authoritative replica progress / ownership state

Only operational views like:

cluster health
status dashboards
capacity views

should be treated as projections.

Optional variant:

If interviewer later allows follower reads:

Get(key) can become:
- read source directly from leader for strong reads
- or bounded-stale read from replica for relaxed consistency

But baseline source-of-truth remains:

leader or authoritative partition state

Step 8 — Failure Handling #

Path	Retry	Competing writers	Crash after commit	Publish failure	Stale holder
`P1` put key	retry with request id or version precondition	partition leader serializes competing writes; stale precondition loses	committed write survives leader crash if replicated past commit point	replication to followers may lag but committed truth is preserved	old leader blocked by fencing token
`P2` delete key	retry with version/tombstone precondition	stale delete loses guarded transition	committed tombstone survives crash if replicated past commit point	follower lag acceptable until caught up	stale leader blocked by fencing token
`P3` replicate mutation	follower retries from last known offset/index	replicas independently catch up from committed stream	leader crash stops new replication until failover; committed log remains authoritative	missing replica append retried from commit index	stale replica cannot become authority if not caught up
`P4` route to partition owner	retry after refreshing partition map	only one valid owner/leader should exist	if owner changed, refreshed map points to new leader	n/a	stale owner rejected by lease epoch/fencing
`P5` reassign partition	retry failover transition safely	only one reassignment wins current ownership transition	promoted leader crash triggers later reassignment	n/a	previous leader fenced and must not accept writes

What matters most:

1. Stale leader protection #

This is the main safety issue.

Bad case:

old leader still accepts writes after failover
new leader also accepts writes
split brain

Solution:

lease epoch / fencing token
only current owner epoch may commit writes

2. Commit point vs replica lag #

A leader crash is safe only if:

committed mutations are known and preserved
failover chooses a replica that is sufficiently caught up

3. Retry semantics #

Client retries should not create ambiguity:

either use request ids
or require version preconditions / last-write rules
or document that repeated puts can produce repeated overwrites safely

4. Delete semantics #

Delete often needs tombstones/versioning so that:

stale replicas or stale writes do not resurrect deleted keys incorrectly

Step 9 — Scale Adjustments #

Hotspot	Type	First response
hot partition leader	contention hotspot	increase partition count or rebalance hot key ranges
hot keys	write throughput hotspot	isolate hot keys, use finer partitioning, or special-case caching if semantics allow
replication lag	write throughput hotspot	tune replication pipeline, batch log shipping, and control follower catch-up
partition-map churn during failures	contention hotspot	slow reassignment, use stable hashing/partitioning, and separate membership from hot data path
strong-read load on leaders	read hotspot	add follower reads only for relaxed-consistency paths or add more partitions
storage growth from logs/tombstones	storage growth hotspot	compaction, snapshotting, and tombstone cleanup after safe retention window

What scales well:

A leader-based distributed KV store scales by:

partitioning keyspace
making each partition have one authoritative writer
replicating per partition independently

So throughput grows primarily with:

number of partitions
number of healthy leaders
replication pipeline capacity

What fails first if designed badly:

one global leader for all keys
no partitioning
failover to stale replicas
unlimited tombstone/log growth
hot keys sharing the same coarse partition forever

Canonical design conclusion:

primary state:
- KeyValueEntry
- PartitionOwnership
- PartitionMap
- ReplicaState
critical invariants:
- one authoritative owner per partition
- one authoritative mutation order per key within partition
- replicas converge from committed mutation stream
- failover only promotes sufficiently up-to-date replicas
mechanisms:
- single writer per partition
- append log
- lease
- guarded transition for delete/failover
reads:
- source reads from authoritative owner by default
- operational views are projections

Polished interview answer:

“I’d design this as a partitioned, leader-based distributed KV store. Each partition has one authoritative owner that serializes writes for keys in that partition. Key state is the source of truth, partition ownership is lease-backed and fenced, and replication happens through an ordered append stream from the leader to replicas. Put is an overwrite on current key state applied by the partition leader, Delete is a guarded transition with tombstone/version semantics, and failover is a guarded ownership transition that only promotes a sufficiently caught-up replica. By default, reads come from authoritative source state, and the main scaling levers are more partitions, balanced leadership, and efficient replication.”

Concrete Substrate #

I’ll choose a leader-based replicated KV store as the concrete baseline, because that matches the mechanical derivation we already reached:

single writer per partition
append log
lease
guarded failover

Concrete substrate:

keyspace partitioned by hash/range into partitions
each partition is a small replicated group
one leader per partition
followers replicate the partition’s ordered mutation log
a metadata/config service tracks:
- partition map
- current leader per partition
- replica membership

Concrete tech family:

service in Go or Java
per-partition replicated log using Raft
local durable storage with RocksDB or LSM-backed engine
cluster membership / metadata either:
- built into the same Raft control plane
- or separate small metadata service like etcd

That is a definite substrate.

Operation Layer #

1. Put key #

API

Put(key, value, expected_version?)

Initiator

client

Entry point

client-facing gateway or any storage node

Authoritative decider

current leader of partition(key)

Precondition

request routed to current leader
if expected_version is present, current version must match

Transition

leader appends PUT(key, value, new_version) to partition log
once committed, leader updates KeyValueEntry

Response

{version, commit_index}

Physical responder

leader, or front node after leader response

Failure cases

stale routing -> redirect or retry with newer partition map
leader unavailable -> retry
version mismatch -> conflict

Sequence

client sends Put
entry node resolves partition(key)
request forwarded to partition leader
leader validates version precondition
leader appends mutation to Raft log
majority commits
leader applies write to RocksDB
leader responds success

2. Get key #

API

Get(key, consistency = strong | stale-ok)

Initiator

client

Entry point

gateway or storage node

Authoritative decider

leader for strong reads
selected replica for stale-ok reads

Precondition

none

Transition

none

Response

{value, version, deleted?, commit_index}

Failure cases

stale map -> reroute
replica may reject strong read if not leader

Sequence

client sends Get
entry node resolves partition
for strong read, route to leader
leader serves committed state from RocksDB/memtable/cache
return value/version

3. Delete key #

API

Delete(key, expected_version?)

Initiator

client

Entry point

gateway or any node

Authoritative decider

partition leader

Precondition

request routed to current leader
version/state precondition if supplied

Transition

leader appends DELETE(key, tombstone_version) to log
committed state becomes deleted/tombstoned

Response

{version, commit_index}

Failure cases

stale leader
version mismatch
tombstone cleanup must not race with lagging replicas

Sequence

client sends Delete
route to leader
leader validates precondition
leader appends tombstone mutation
majority commits
leader applies tombstone state
reply success

4. Replicate mutation #

API

internal Raft replication RPC, e.g. AppendEntries

Initiator

partition leader

Entry point

follower replicas

Authoritative decider

leader for proposed log entries
quorum for commit

Precondition

follower log matches previous index/term at append boundary

Transition

follower appends missing log entries
commit index advances after quorum success

Response

follower ack / next index hint

Failure cases

follower behind -> backfill
follower divergence -> log repair by Raft
leader crash -> new election

Sequence

leader appends local entry
leader sends AppendEntries to followers
followers validate prev index/term
followers append
quorum ack
leader marks committed
followers later apply committed entries to local state

5. Reassign partition / leader failover #

API

internal election + reassignment flow, for example:
- StartElection(partition_id)
- TransferLeadership(partition_id) or Raft election RPCs

Initiator

system

Entry point

partition replicas / control plane

Authoritative decider

quorum of partition replica group

Precondition

current leader failed or leadership transfer initiated
candidate is eligible and sufficiently up to date

Transition

new leader epoch/term established
PartitionOwnership updated
old leader fenced

Response

new leader advertised in partition map / metadata

Failure cases

split vote
stale node tries to keep serving
promotion of lagging replica avoided by quorum protocol

Sequence

leader fails or loses lease
replicas detect timeout
candidate starts election
quorum elects new leader
new leader begins serving writes
routing metadata updates
old leader, if it returns, is fenced by higher term

Entry Point vs Decider vs Responder #

This is the distinction that usually needs to be made explicit.

Path	Entry point	Authoritative decider	Physical responder	Logical responder
`Put`	gateway / any node	partition leader	leader or front node	KV store
`Get strong`	gateway / any node	partition leader	leader or front node	KV store
`Get stale-ok`	gateway / any node	chosen replica	contacted replica	KV store
`Delete`	gateway / any node	partition leader	leader or front node	KV store
replication RPC	leader	follower + quorum commit logic	follower ack	partition replica group
failover	replica/control plane	partition quorum	newly elected leader/control plane	KV store

Concrete HLD #

Main components:

client gateway / router
- resolves key -> partition
- forwards to current leader or replica
partition leader
- authoritative writer for that partition
- validates versioned writes
- appends to replicated log
partition followers
- replicate log
- may serve stale reads if allowed
local storage engine
- RocksDB/LSM-backed key state + WAL
partition metadata / config service
- partition map
- replica membership
- leader identity

Concrete Technologies #

A strong concrete interview answer:

Go or Java service
Raft per partition group for replication and leader election
RocksDB on each replica for local durable key state
metadata/config managed either:
- by a separate small etcd control plane, or
- by a dedicated metadata Raft group

Short version you can say:

“I’d implement this as a partitioned leader-based KV store. Each partition is a small Raft group. The leader is the single writer for keys in that partition, writes are appended to the partition log and committed on quorum before being applied to RocksDB, deletes are tombstoned, and failover is handled by Raft leader election with term-based fencing.”

Distributed Key-Value Store Analysis Note #

Step 1 — Normalize #

Step 2 — Critical Path Selection #

Step 3 — Primary State Extraction #

KeyValueEntry #

Tombstone #

ReplicaState #

PartitionOwnership #

PartitionMap #

Step 4 — Hard Invariants #

1. Single authoritative owner per partition #

2. Total authoritative order per key #

3. Replication correctness #

4. Reassignment safety #

Step 5 — Execution Context #

Step 6 — Deterministic Mechanism Selection #

6A. Write Shape #

6B. Base Mechanism #

Put key #

Delete key #

Replication #

Partition ownership #

Reassignment #

Step 7 — Read Model / Source of Truth #

Step 8 — Failure Handling #

1. Stale leader protection #

2. Commit point vs replica lag #

3. Retry semantics #

4. Delete semantics #

Step 9 — Scale Adjustments #

Concrete Substrate #

Operation Layer #

1. Put key #

2. Get key #

3. Delete key #

4. Replicate mutation #

5. Reassign partition / leader failover #

Entry Point vs Decider vs Responder #

Concrete HLD #

Concrete Technologies #

`KeyValueEntry` #

`Tombstone` #

`ReplicaState` #

`PartitionOwnership` #

`PartitionMap` #