Distributed Lock Service (ZooKeeper / etcd-class)
Distributed Lock Service (ZooKeeper / etcd-class) #
This note models a distributed lock service where clients acquire, renew, release, and observe locks safely across many nodes, with lease semantics, fencing tokens, and crash recovery.
Step 1 - Normalize #
Assume the baseline prompt is:
- design a distributed lock service
- clients acquire and release named locks
- locks should auto-expire if holder crashes
- clients may watch lock ownership changes
- system scales across many locks and clients
Normalize into state-affecting paths.
| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client acquires lock | Client | state transition | S1update targetLockState | C1 |
| Client renews lock lease | Client | state transition | S1update targetLockState | C1 |
| Client releases lock | Client | state transition | S1update targetLockState | C1 |
| System expires stale lock | System | async process | S1hidden write targetLockState | C1 |
| Client reads current lock holder | Client | read source | S1read source targetLockState | R1 |
| Client registers watch on lock | Client | append event | S1create targetLockWatchRegistration | R1 |
| System emits lock-change watch event | System | async process | S1hidden write targetLockWatchEvent | R1 |
| System routes lock key to current owner/leader | System | read source | S1read source targetPartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1update targetPartitionOwnership | C1 |
Notes on normalization #
Important choices:
- acquire/renew/release are
state transition- lock lifecycle changes over time
- stale expiry is explicit
- crash recovery is central correctness logic
- current lock holder read is a source read
- watch registration and watch events are distinct from lock truth
This system is fundamentally:
exclusive claim + lease + fencing
not:
- generic config storage
- queue delivery
Step 2 - Critical Path Selection #
| Requirement | Priority class | Why |
|---|---|---|
| Acquire lock | C1 | duplicate holders break correctness immediately |
| Renew lock lease | C1 | stale or lost renewal changes holder validity |
| Release lock | C1 | lock handoff depends on correct release |
| Expire stale lock | C1 | crash recovery and stale-holder cleanup are core correctness |
| Read current lock holder | R1 | important serving path for clients |
| Register / deliver watch | R1 | useful but downstream of correct lock truth |
| Route lock key to shard owner | C1 | wrong routing can split lock truth |
| Reassign shard ownership | C1 | failover must preserve exclusivity and fencing |
Baseline critical paths #
Main C1 paths:
P1acquire lockP2renew lockP3release lockP4expire stale lockP5route to shard ownerP6reassign shard ownership
Main R1 paths:
P7read lock stateP8watch registration and delivery
This design is driven by:
- one valid holder at a time
- lease expiry on crash
- fenced transitions so stale holders cannot act
Step 3 - Primary State Extraction #
For a distributed lock service, the minimal primary state is the current lock lifecycle, client/session lifecycle, and routing/ownership state.
| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| LockState | direct noun | Yes | keep as candidate | process | Yes | service | state machine | instance | lock_key |
| ClientSession | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | session_id |
| LockWatchRegistration | direct noun | Yes | keep as candidate | relationship | Yes | service | append-only | relation | client_id + lock_key |
| LockWatchEvent | hidden write target | No | keep as candidate | event | No | derived | append-only | collection | lock_key |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | lock shards |
| LockStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or shard |
Important modeling choices #
LockState #
This is the central correctness object.
Likely fields:
lock_keyholder_session_idholder_client_idepochor fencing tokenexpirystate
States:
FREEHELDEXPIRED
ClientSession #
Primary because:
- real lock services usually tie lock validity to a session/lease lifecycle
- if the session dies, held locks must eventually expire
LockWatchRegistration #
Kept explicit because:
- watches are part of the product surface
- registration may be tied to current session
Minimal strict primary set #
The strongest minimal set is:
LockStateClientSessionPartitionOwnershipPartitionMap
With:
LockWatchRegistrationas an optional but useful explicit primary object
Step 4 - Hard Invariants #
For a ZooKeeper/etcd-class lock service, the hard invariants are about one valid holder per lock, valid renew/release only by the current holder, and safe expiry plus fencing.
| Path | Tier | Type | Invariant statement |
|---|---|---|---|
P1 acquire lock | HARD | uniqueness | Key lock_key maps to at most one logical outcome current valid lock holder within lock scope. |
P1 acquire lock | HARD | eligibility | Action acquire_lock is valid only if current LockState(lock_key) is acquirable and current session is active at decision time. |
P2 renew lock | HARD | eligibility | Action renew_lock is valid only if LockState(lock_key) is currently held by the same session and epoch/token matches at decision time. |
P3 release lock | HARD | eligibility | Action release_lock is valid only if LockState(lock_key) is currently held by the same session and epoch/token matches at decision time. |
P4 expire stale lock | HARD | eligibility | Action expire_lock is valid only if current LockState(lock_key) is still held, expiry has passed, and epoch/token is unchanged at decision time. |
P5 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one logical outcome current authoritative owner within shard_id. |
P6 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
P7 read lock state | HARD | freshness | Read path reflects authoritative lock and session state within configured consistency bound. |
P8 watch delivery | SOFT | freshness | Watch stream reflects authoritative lock-state changes within watch propagation bound. |
What matters most #
1. One valid holder per lock #
This is the primary correctness rule.
2. Renew/release are fenced #
Only the current holder with the current epoch/token may extend or release the lock.
3. Expiry must be revalidated #
A timeout worker cannot blindly expire a lock without checking that the lock state is still unchanged.
4. Watch delivery is secondary to lock truth #
Watches are important, but source truth is the lock/session state.
Step 5 - Execution Context #
For the strict baseline distributed lock service:
| Field | Value | Why |
|---|---|---|
| Topology | single service distributed | one logical lock service spread across many nodes |
| Write coordination scope | per object scope | correctness is per lock_key and per shard ownership scope |
| Read consistency target | strong only | stale lock reads can break mutual exclusion |
| Holder model | client | clients temporarily hold locks |
| Compensation acceptable? | No | duplicate lock holders cannot be repaired later safely |
Derived implications #
holder_may_crash = true- clients can fail while holding locks
cross_service_write = false- baseline keeps lock, session, and ownership state within one logical service
bounded_staleness_allowed = false- lock acquisition and validation need authoritative current state
cross_service_atomicity_required = false- no multi-service transaction required in baseline
exclusive_claim_required = true- mutual exclusion is the core product
guarded_by_current_state = true- acquire, renew, release, and expiry all depend on current state
What this implies #
This pushes us toward:
- one authoritative writer per lock shard
- lease-backed lock ownership
- session-linked expiry
- fencing token/epoch for stale-holder protection
Step 6 - Deterministic Mechanism Selection #
| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
P1 acquire lock | exclusive claim | lease | epoch/fencing token, session heartbeat |
P2 renew lock | guarded state transition | CAS on (state, version) | epoch/fencing token |
P3 release lock | guarded state transition | CAS on (state, version) | epoch/fencing token |
P4 expire stale lock | guarded state transition | leader-applied guarded transition | epoch/fencing token, timeout scan |
P5 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
P6 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
Why these fit #
Acquire #
This is the canonical exclusive-claim path:
- one holder wins
- ownership is temporary
- expiry matters
Renew/release #
These are not blind writes. They must verify:
- current holder matches
- epoch/token matches
- session is still active
So they are guarded transitions.
Expiry #
Expiry is also guarded:
- only expired, unchanged lock state may be transitioned back to free
Canonical substrate implied #
The baseline now points to:
- sharded lock service
- one authoritative owner per shard
- lease-like lock records
- current session state
- fenced renew/release/expiry transitions
Step 7 - Read Model / Source of Truth #
For a distributed lock service, truth is direct source state. Watches are derived.
| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
C1 current lock holder and lease | LockState | read source directly | authoritative lock-state store |
C2 current client/session validity | ClientSession | read source directly | authoritative session store |
C3 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
C4 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
C5 lock watch stream | lock/session state changes | materialized view | rebuild from authoritative lock-state transitions |
C6 status dashboards | derived from lock and session state | materialized view | recompute from authoritative state |
Important point #
For the core semantics:
- acquire/renew/release/read all use authoritative
LockState - session validity is authoritative source truth
- watch notifications are derived from committed state changes
Step 8 - Failure Handling #
| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
P1 acquire lock | retry safe; loser gets failure or later success | only one claimant should win current lock epoch | if acquire committed and client crashes, lock stays held until expiry | watch delivery may lag | stale holder fenced by epoch/token |
P2 renew lock | retry with current epoch/token | stale renew loses guarded transition | committed renewal survives crash if persisted | watch delivery may lag | old epoch rejected |
P3 release lock | retry with current epoch/token | stale release loses guarded transition | committed release survives crash if persisted | watch delivery may lag | old epoch rejected |
P4 expire stale lock | timeout scan retry safe | only one expiry transition should win for current expired state | scanner crash delays cleanup; next scan retries | watch delivery may lag | prior holder blocked once epoch/version advanced |
P5 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
P6 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |
What matters most #
1. Fencing token / epoch #
This is the core stale-holder defense.
Bad case:
- client A acquires lock
- A pauses or partitions
- lock expires
- client B acquires lock
- A resumes and continues acting
Without fencing:
- both can act as holder
So every external use of the lock should be coupled with the current epoch/token when possible.
2. Expiry is necessary but not sufficient #
Lease timeout reclaims the lock, but fencing is what prevents stale post-expiry actions.
3. Watch lag must not affect correctness #
Clients should treat watches as hints, not as the sole source of lock truth.
Step 9 - Scale Adjustments #
| Hotspot | Type | First response |
|---|---|---|
| hot lock keys | contention hotspot | isolate hot locks, shard by lock key, or push users toward coarser-grained coordination alternatives |
| renewal traffic | write throughput hotspot | lengthen lease duration within acceptable failover bounds and batch/sessionize renewals |
| watch fanout | fan-out hotspot | derive watch delivery from committed state stream and decouple it from lock truth |
| strong reads on hot locks | read hotspot | keep reads on authoritative owner; avoid stale replica reads for correctness paths |
| failover churn | contention hotspot | stabilize shard leadership and avoid aggressive reassignment |
| session reconnect storms | contention hotspot | stagger reconnects and rate-limit watch/session restoration |
What scales well #
A lock service scales only for relatively small coordination data.
It scales by:
- sharding lock keys
- keeping one authoritative owner per shard
- making lock records small
- treating watch delivery as secondary
What fails first #
Usually:
- one or a few very hot locks
- renewal storms
- watch fanout spikes
- clients relying on watches instead of source truth
Canonical design conclusion #
The mechanical outcome is:
- primary state:
LockStateClientSessionPartitionOwnershipPartitionMap
- critical invariants:
- one valid holder per lock
- renew/release valid only for current holder epoch/token
- stale locks expire safely
- exclusive shard ownership for lock truth
- mechanisms:
lease- guarded renew/release/expiry transitions
exclusive claimfor lock acquire- fenced shard ownership
- reads:
- direct authoritative reads for lock/session truth
- watches as derived notifications
Polished interview answer #
I’d design the lock service as a sharded strongly consistent system with one authoritative owner per lock shard. Acquiring a lock is an exclusive claim that creates a lease-backed
LockStaterecord with an epoch or fencing token. Renew, release, and timeout expiry are guarded transitions that only succeed if the current holder and epoch still match. Client sessions are tracked explicitly so crashed clients eventually lose their locks, and stale holders are prevented from acting by the fencing token even after they resume. Watches are derived from committed lock-state transitions, but correctness-critical reads come directly from authoritative lock and session state.
Concrete Substrate #
I’ll choose a sharded strongly consistent metadata service with lease-backed lock records as the concrete baseline, because it matches the mechanics we derived:
- exclusive claim on acquire
- guarded renew/release/expiry
- session-linked lease validity
- one owner per shard
Concrete tech family:
- lock service in
GoorJava - authoritative state in a replicated metadata store or service-owned Raft state machine
- metadata/control:
- built-in Raft consensus per shard or a small etcd-like control layer
Each shard leader stores:
LockState(lock_key)ClientSession(session_id)- watch registrations
- timeout/expiry index
This is effectively the same substrate family as ZooKeeper/etcd-style coordination stores, with a narrower product surface focused on lock semantics.
Operation Layer #
1. Acquire lock #
API
AcquireLock(lock_key, session_id, ttl, expected_free=true)
Initiator
- client
Entry point
- gateway or any lock-service node
Authoritative decider
- current shard leader for
lock_key
Precondition
- session is active
- lock currently acquirable
Transition
LockState(lock_key) = HELD(holder_session_id, epoch, expiry)
Response
{acquired: true|false, epoch, expiry}
Failure cases
- competing claimant loses
- stale routing -> retry with updated shard map
2. Renew lock #
API
RenewLock(lock_key, session_id, epoch, ttl)
Initiator
- client
Entry point
- gateway or any node
Authoritative decider
- shard leader
Precondition
- lock is currently held by
session_id - epoch matches current state
Transition
- extend expiry
Response
{renewed: true|false, expiry}
3. Release lock #
API
ReleaseLock(lock_key, session_id, epoch)
Initiator
- client
Entry point
- gateway or any node
Authoritative decider
- shard leader
Precondition
- lock is currently held by
session_id - epoch matches current state
Transition
HELD -> FREE
Response
{released: true|false}
4. Expire stale lock #
API
- internal background process
Initiator
- system
Entry point
- shard leader
Authoritative decider
- shard leader
Precondition
- current time > expiry
- lock state and epoch unchanged
Transition
HELD -> FREEwith epoch advancement as needed
5. Register watch #
API
WatchLock(lock_key, from_version?)
Initiator
- client
Entry point
- gateway or watch endpoint
Authoritative decider
- watch subsystem on committed lock-state stream
Precondition
- session active
Transition
- create transient or session-bound watch registration
Response
- watch id / stream handle
Entry Point vs Decider vs Responder #
| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
AcquireLock | gateway / any node | shard leader | leader or front node | lock service |
RenewLock | gateway / any node | shard leader | leader or front node | lock service |
ReleaseLock | gateway / any node | shard leader | leader or front node | lock service |
| expiry | shard leader | shard leader | internal | lock service |
| watch | watch endpoint | committed state stream / shard leader | watch-serving node | lock service |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | lock service |
Concrete HLD #
Main components:
- client gateway
- routes lock operations to current shard leader
- shard leaders
- authoritative owners of lock and session state
- maintain expiry index
- shard followers
- replicate committed lock/session state
- watch service
- emits lock-change notifications from committed state transitions
- metadata/control service
- tracks shard ownership and routing
Short Interview Version #
I’d build the distributed lock service as a sharded strongly consistent system with one authoritative owner per lock shard. Acquiring a lock is an exclusive claim that creates a lease-backed
LockStaterecord with an epoch or fencing token. Renew, release, and timeout expiry are guarded transitions that only succeed if the current holder and epoch still match. Client sessions are tracked explicitly so crashed clients eventually lose their locks, and stale holders are prevented from acting by the fencing token even after they resume. Watches are derived from committed lock-state transitions, but correctness-critical reads come directly from authoritative lock and session state.