Distributed Lock Service (ZooKeeper / etcd-class) #

This note models a distributed lock service where clients acquire, renew, release, and observe locks safely across many nodes, with lease semantics, fencing tokens, and crash recovery.


Step 1 - Normalize #

Assume the baseline prompt is:

  • design a distributed lock service
  • clients acquire and release named locks
  • locks should auto-expire if holder crashes
  • clients may watch lock ownership changes
  • system scales across many locks and clients

Normalize into state-affecting paths.

| Requirement | Actor | Operation | State touched | Priority |
|---|---|---|---|---|
| Client acquires lock | Client | state transition | S1 / update target / LockState | C1 |
| Client renews lock lease | Client | state transition | S1 / update target / LockState | C1 |
| Client releases lock | Client | state transition | S1 / update target / LockState | C1 |
| System expires stale lock | System | async process | S1 / hidden write target / LockState | C1 |
| Client reads current lock holder | Client | read source | S1 / read source target / LockState | R1 |
| Client registers watch on lock | Client | append event | S1 / create target / LockWatchRegistration | R1 |
| System emits lock-change watch event | System | async process | S1 / hidden write target / LockWatchEvent | R1 |
| System routes lock key to current owner/leader | System | read source | S1 / read source target / PartitionMap | C1 |
| System reassigns shard ownership after node failure | System | state transition | S1 / update target / PartitionOwnership | C1 |

Notes on normalization #

Important choices:

  • acquire/renew/release are state transitions
    • lock lifecycle changes over time
  • stale expiry is explicit
    • crash recovery is central correctness logic
  • current lock holder read is a source read
  • watch registration and watch events are distinct from lock truth

This system is fundamentally:

  • exclusive claim + lease + fencing

not:

  • generic config storage
  • queue delivery

Step 2 - Critical Path Selection #

| Requirement | Priority class | Why |
|---|---|---|
| Acquire lock | C1 | duplicate holders break correctness immediately |
| Renew lock lease | C1 | stale or lost renewal changes holder validity |
| Release lock | C1 | lock handoff depends on correct release |
| Expire stale lock | C1 | crash recovery and stale-holder cleanup are core correctness |
| Read current lock holder | R1 | important serving path for clients |
| Register / deliver watch | R1 | useful but downstream of correct lock truth |
| Route lock key to shard owner | C1 | wrong routing can split lock truth |
| Reassign shard ownership | C1 | failover must preserve exclusivity and fencing |

Baseline critical paths #

Main C1 paths:

  • P1 acquire lock
  • P2 renew lock
  • P3 release lock
  • P4 expire stale lock
  • P5 route to shard owner
  • P6 reassign shard ownership

Main R1 paths:

  • P7 read lock state
  • P8 watch registration and delivery

This design is driven by:

  • one valid holder at a time
  • lease expiry on crash
  • fenced transitions so stale holders cannot act

Step 3 - Primary State Extraction #

For a distributed lock service, the minimal primary state is the current lock lifecycle, client/session lifecycle, and routing/ownership state.

| Candidate object label | Candidate source | Candidate needed for C1/R1? | Candidate decomposition action | Class | Primary? | Owner | Evolution | Scope kind | Scope value |
|---|---|---|---|---|---|---|---|---|---|
| LockState | direct noun | Yes | keep as candidate | process | Yes | service | state machine | instance | lock_key |
| ClientSession | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | session_id |
| LockWatchRegistration | direct noun | Yes | keep as candidate | relationship | Yes | service | append-only | relation | client_id + lock_key |
| LockWatchEvent | hidden write target | No | keep as candidate | event | No | derived | append-only | collection | lock_key |
| PartitionOwnership | hidden write target | Yes | keep as candidate | process | Yes | service | state machine | instance | shard_id |
| PartitionMap | hidden write target | Yes | keep as candidate | entity | Yes | service | overwrite | collection | lock shards |
| LockStatusView | derived read model | No | reject as UI artifact | projection | No | derived | overwrite | collection | tenant or shard |

Important modeling choices #

LockState #

This is the central correctness object.

Likely fields:

  • lock_key
  • holder_session_id
  • holder_client_id
  • epoch or fencing token
  • expiry
  • state

States:

  • FREE
  • HELD
  • EXPIRED
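
As a rough illustration, the field list and states above could be written down as a small record type. This is only a sketch: the class names, the default values, and the `is_acquirable` helper are assumptions made for the example, and treating a `HELD` lock with a lapsed lease as acquirable is one possible policy, not something the note prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LockStatus(Enum):
    FREE = "FREE"
    HELD = "HELD"
    EXPIRED = "EXPIRED"


@dataclass(frozen=True)
class LockState:
    lock_key: str
    state: LockStatus = LockStatus.FREE
    holder_session_id: Optional[str] = None
    holder_client_id: Optional[str] = None
    epoch: int = 0                  # fencing token; assumed to only ever increase
    expiry: Optional[float] = None  # absolute deadline on the shard leader's clock

    def is_acquirable(self, now: float) -> bool:
        """Assumed policy: FREE/EXPIRED are acquirable, HELD only after its lease lapses."""
        if self.state is not LockStatus.HELD:
            return True
        return self.expiry is not None and now > self.expiry
```

Keeping the record immutable (`frozen=True`) fits the idea that every change is an explicit transition that produces a new version of the state.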

ClientSession #

Primary because:

  • real lock services usually tie lock validity to a session/lease lifecycle
  • if the session dies, held locks must eventually expire

LockWatchRegistration #

Kept explicit because:

  • watches are part of the product surface
  • registration may be tied to current session

Minimal strict primary set #

The strongest minimal set is:

  • LockState
  • ClientSession
  • PartitionOwnership
  • PartitionMap

With:

  • LockWatchRegistration as an optional but useful explicit primary object

Step 4 - Hard Invariants #

For a ZooKeeper/etcd-class lock service, the hard invariants are about one valid holder per lock, valid renew/release only by the current holder, and safe expiry plus fencing.

| Path | Tier | Type | Invariant statement |
|---|---|---|---|
| P1 acquire lock | HARD | uniqueness | Key lock_key maps to at most one current valid lock holder within lock scope. |
| P1 acquire lock | HARD | eligibility | Action acquire_lock is valid only if current LockState(lock_key) is acquirable and current session is active at decision time. |
| P2 renew lock | HARD | eligibility | Action renew_lock is valid only if LockState(lock_key) is currently held by the same session and epoch/token matches at decision time. |
| P3 release lock | HARD | eligibility | Action release_lock is valid only if LockState(lock_key) is currently held by the same session and epoch/token matches at decision time. |
| P4 expire stale lock | HARD | eligibility | Action expire_lock is valid only if current LockState(lock_key) is still held, expiry has passed, and epoch/token is unchanged at decision time. |
| P5 route to shard owner | HARD | uniqueness | Key shard_id maps to at most one current authoritative owner. |
| P6 reassign shard ownership | HARD | eligibility | Action reassign_shard is valid only if current owner is failed or relinquished and candidate owner is eligible and sufficiently current on shard_id at decision time. |
| P7 read lock state | HARD | freshness | Read path reflects authoritative lock and session state within configured consistency bound. |
| P8 watch delivery | SOFT | freshness | Watch stream reflects authoritative lock-state changes within watch propagation bound. |

What matters most #

1. One valid holder per lock #

This is the primary correctness rule.

2. Renew/release are fenced #

Only the current holder with the current epoch/token may extend or release the lock.

3. Expiry must be revalidated #

A timeout worker cannot blindly expire a lock without checking that the lock state is still unchanged.

4. Watch delivery is secondary to lock truth #

Watches are important, but source truth is the lock/session state.


Step 5 - Execution Context #

For the strict baseline distributed lock service:

| Field | Value | Why |
|---|---|---|
| Topology | single service, distributed | one logical lock service spread across many nodes |
| Write coordination scope | per object scope | correctness is per lock_key and per shard ownership scope |
| Read consistency target | strong only | stale lock reads can break mutual exclusion |
| Holder model | client | clients temporarily hold locks |
| Compensation acceptable? | No | duplicate lock holders cannot be repaired later safely |

Derived implications #

  • holder_may_crash = true

    • clients can fail while holding locks
  • cross_service_write = false

    • baseline keeps lock, session, and ownership state within one logical service
  • bounded_staleness_allowed = false

    • lock acquisition and validation need authoritative current state
  • cross_service_atomicity_required = false

    • no multi-service transaction required in baseline
  • exclusive_claim_required = true

    • mutual exclusion is the core product
  • guarded_by_current_state = true

    • acquire, renew, release, and expiry all depend on current state

What this implies #

This pushes us toward:

  • one authoritative writer per lock shard
  • lease-backed lock ownership
  • session-linked expiry
  • fencing token/epoch for stale-holder protection

Step 6 - Deterministic Mechanism Selection #

| Path | Write shape | Base mechanism | Required companions |
|---|---|---|---|
| P1 acquire lock | exclusive claim | lease | epoch/fencing token, session heartbeat |
| P2 renew lock | guarded state transition | CAS on (state, version) | epoch/fencing token |
| P3 release lock | guarded state transition | CAS on (state, version) | epoch/fencing token |
| P4 expire stale lock | guarded state transition | leader-applied guarded transition | epoch/fencing token, timeout scan |
| P5 route to shard owner | exclusive claim | lease | fencing token, heartbeat |
| P6 reassign shard ownership | guarded state transition | CAS on (state, version) | fencing token, shard catch-up check |
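
To make "CAS on (state, version)" for P6 a little more concrete, here is a minimal in-memory sketch. `OwnershipStore`, `Ownership`, and `compare_and_set` are hypothetical names introduced for the example. A real deployment would run this comparison inside a replicated store, and would also apply the failed-owner and shard catch-up checks, which are left out here.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Ownership:
    owner: Optional[str]
    version: int  # bumped on every change; can double as the shard fencing token


class OwnershipStore:
    """In-memory stand-in for the authoritative ownership store (assumption)."""

    def __init__(self) -> None:
        self._mu = threading.Lock()
        self._rows: Dict[str, Ownership] = {}

    def get(self, shard_id: str) -> Ownership:
        with self._mu:
            return self._rows.get(shard_id, Ownership(None, 0))

    def compare_and_set(self, shard_id: str, expected: Ownership, new_owner: str) -> bool:
        """Reassign only if the row is still exactly what the caller observed."""
        with self._mu:
            current = self._rows.get(shard_id, Ownership(None, 0))
            if current != expected:
                return False  # someone else already changed ownership
            self._rows[shard_id] = Ownership(new_owner, current.version + 1)
            return True
```

If two candidates read the same `Ownership` value and both attempt a reassignment, only the first `compare_and_set` succeeds; the second sees a newer version and has to re-read before deciding again.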

Why these fit #

Acquire #

This is the canonical exclusive-claim path:

  • one holder wins
  • ownership is temporary
  • expiry matters

Renew/release #

These are not blind writes. They must verify:

  • current holder matches
  • epoch/token matches
  • session is still active

So they are guarded transitions.
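
The three checks above can be sketched as a single guard shared by renew and release. The `Lock` record, the function names, and the `active_sessions` set are illustrative assumptions; in the real service the guard would run on the shard leader as part of a replicated state-machine step rather than on local objects.

```python
from dataclasses import dataclass, replace
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Lock:
    held: bool = False
    holder_session_id: Optional[str] = None
    epoch: int = 0
    expiry: float = 0.0


def guard_ok(lock: Lock, session_id: str, epoch: int, active_sessions: Set[str]) -> bool:
    """Holder matches, epoch/token matches, and the session is still active."""
    return (
        lock.held
        and lock.holder_session_id == session_id
        and lock.epoch == epoch
        and session_id in active_sessions
    )


def renew(lock: Lock, session_id: str, epoch: int, ttl: float, now: float,
          active_sessions: Set[str]) -> Tuple[Lock, bool]:
    if not guard_ok(lock, session_id, epoch, active_sessions):
        return lock, False  # state is left untouched on a failed guard
    return replace(lock, expiry=now + ttl), True


def release(lock: Lock, session_id: str, epoch: int,
            active_sessions: Set[str]) -> Tuple[Lock, bool]:
    if not guard_ok(lock, session_id, epoch, active_sessions):
        return lock, False
    return replace(lock, held=False, holder_session_id=None), True
```

A renew or release that carries an old epoch, the wrong session, or a dead session simply returns the unchanged state and `False`, which is what makes a retry from a stale holder harmless.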

Expiry #

Expiry is also guarded:

  • only expired, unchanged lock state may be transitioned back to free

Canonical substrate implied #

The baseline now points to:

  • sharded lock service
  • one authoritative owner per shard
  • lease-like lock records
  • current session state
  • fenced renew/release/expiry transitions

Step 7 - Read Model / Source of Truth #

For a distributed lock service, truth is direct source state. Watches are derived.

| Concept | Truth | Read path | Rebuild path |
|---|---|---|---|
| C1 current lock holder and lease | LockState | read source directly | authoritative lock-state store |
| C2 current client/session validity | ClientSession | read source directly | authoritative session store |
| C3 shard ownership | PartitionOwnership | read source directly | authoritative ownership store |
| C4 shard routing map | PartitionMap | read source directly | authoritative routing metadata |
| C5 lock watch stream | lock/session state changes | materialized view | rebuild from authoritative lock-state transitions |
| C6 status dashboards | derived from lock and session state | materialized view | recompute from authoritative state |

Important point #

For the core semantics:

  • acquire/renew/release/read all use authoritative LockState
  • session validity is authoritative source truth
  • watch notifications are derived from committed state changes

Step 8 - Failure Handling #

| Path | Retry | Competing writers | Crash after commit | Publish failure | Stale holder |
|---|---|---|---|---|---|
| P1 acquire lock | retry safe; loser gets failure or later success | only one claimant should win current lock epoch | if acquire committed and client crashes, lock stays held until expiry | watch delivery may lag | stale holder fenced by epoch/token |
| P2 renew lock | retry with current epoch/token | stale renew loses guarded transition | committed renewal survives crash if persisted | watch delivery may lag | old epoch rejected |
| P3 release lock | retry with current epoch/token | stale release loses guarded transition | committed release survives crash if persisted | watch delivery may lag | old epoch rejected |
| P4 expire stale lock | timeout scan retry safe | only one expiry transition should win for current expired state | scanner crash delays cleanup; next scan retries | watch delivery may lag | prior holder blocked once epoch/version advanced |
| P5 route to shard owner | retry after refreshing shard map | only one valid owner should exist | if owner changed, refreshed map points to new owner | n/a | stale owner rejected by fencing token |
| P6 reassign shard ownership | retry failover transition safely | only one reassignment wins current ownership state | promoted owner crash triggers later reassignment | n/a | old owner fenced and must not continue serving |

What matters most #

1. Fencing token / epoch #

This is the core stale-holder defense.

Bad case:

  • client A acquires lock
  • A pauses or partitions
  • lock expires
  • client B acquires lock
  • A resumes and continues acting

Without fencing:

  • both can act as holder

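So every external action taken under the lock should carry the current epoch/token when possible, and the protected resource should reject any token older than the newest one it has already seen.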
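
The bad case above can be replayed against a resource that checks fencing tokens. This is a minimal sketch: `FencedResource` and its `write` method are hypothetical names, and it assumes the downstream resource is able to remember the highest epoch it has accepted.

```python
from typing import List, Tuple


class FencedResource:
    """Toy downstream resource that refuses writes carrying a stale epoch."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.writes: List[Tuple[int, str]] = []

    def write(self, epoch: int, payload: str) -> bool:
        if epoch < self.highest_epoch:
            return False  # a newer holder has already written; caller is stale
        self.highest_epoch = epoch
        self.writes.append((epoch, payload))
        return True
```

With this check, client A (epoch 1) can write, client B (epoch 2) can write after taking over, and a late write from A with epoch 1 is refused even though A still believes it holds the lock.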

2. Expiry is necessary but not sufficient #

Lease timeout reclaims the lock, but fencing is what prevents stale post-expiry actions.

3. Watch lag must not affect correctness #

Clients should treat watches as hints, not as the sole source of lock truth.
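
One way to read "watches as hints" is that a watch event only triggers a fresh authoritative read, and the decision is made from that read. The sketch below assumes events and lock reads are plain dictionaries and that `read_lock` is some strongly consistent read supplied by the caller; both are illustrative choices, not part of the design above.

```python
from typing import Callable, Dict


def should_attempt_acquire(event: Dict, read_lock: Callable[[str], Dict]) -> bool:
    """Decide from the authoritative state, never from the event payload itself."""
    # The event may be delayed, duplicated, or reordered, so it is used only
    # to learn which lock_key is worth re-reading.
    current = read_lock(event["lock_key"])
    return current["state"] == "FREE"
```

A delayed "lock became FREE" event therefore cannot trick a client into acting, because the follow-up read will show the lock already held by someone else.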


Step 9 - Scale Adjustments #

| Hotspot | Type | First response |
|---|---|---|
| hot lock keys | contention hotspot | isolate hot locks, shard by lock key, or push users toward coarser-grained coordination alternatives |
| renewal traffic | write throughput hotspot | lengthen lease duration within acceptable failover bounds and batch/sessionize renewals |
| watch fanout | fan-out hotspot | derive watch delivery from committed state stream and decouple it from lock truth |
| strong reads on hot locks | read hotspot | keep reads on authoritative owner; avoid stale replica reads for correctness paths |
| failover churn | contention hotspot | stabilize shard leadership and avoid aggressive reassignment |
| session reconnect storms | contention hotspot | stagger reconnects and rate-limit watch/session restoration |

What scales well #

A lock service scales well only while the coordination data it holds stays relatively small.

It scales by:

  • sharding lock keys
  • keeping one authoritative owner per shard
  • making lock records small
  • treating watch delivery as secondary
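
Sharding by lock key can be sketched as a stable hash from `lock_key` to a shard index, followed by a lookup in the routing map. The fixed shard count, the modulo scheme, and the list-shaped `partition_map` are simplifying assumptions; range partitioning or consistent hashing would be equally plausible choices.

```python
import hashlib
from typing import List


def shard_for(lock_key: str, num_shards: int) -> int:
    """Map a lock key to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(lock_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def leader_for(lock_key: str, partition_map: List[str]) -> str:
    """partition_map[i] is assumed to name the current leader of shard i."""
    return partition_map[shard_for(lock_key, len(partition_map))]
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would send the same key to different shards on different gateways.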

What fails first #

Usually:

  • one or a few very hot locks
  • renewal storms
  • watch fanout spikes
  • clients relying on watches instead of source truth

Canonical design conclusion #

The mechanical outcome is:

  • primary state:
    • LockState
    • ClientSession
    • PartitionOwnership
    • PartitionMap
  • critical invariants:
    • one valid holder per lock
    • renew/release valid only for current holder epoch/token
    • stale locks expire safely
    • exclusive shard ownership for lock truth
  • mechanisms:
    • lease
    • guarded renew/release/expiry transitions
    • exclusive claim for lock acquire
    • fenced shard ownership
  • reads:
    • direct authoritative reads for lock/session truth
    • watches as derived notifications

Polished interview answer #

I’d design the lock service as a sharded strongly consistent system with one authoritative owner per lock shard. Acquiring a lock is an exclusive claim that creates a lease-backed LockState record with an epoch or fencing token. Renew, release, and timeout expiry are guarded transitions that only succeed if the current holder and epoch still match. Client sessions are tracked explicitly so crashed clients eventually lose their locks, and stale holders are prevented from acting by the fencing token even after they resume. Watches are derived from committed lock-state transitions, but correctness-critical reads come directly from authoritative lock and session state.


Concrete Substrate #

I’ll choose a sharded strongly consistent metadata service with lease-backed lock records as the concrete baseline, because it matches the mechanics we derived:

  • exclusive claim on acquire
  • guarded renew/release/expiry
  • session-linked lease validity
  • one owner per shard

Concrete tech family:

  • lock service in Go or Java
  • authoritative state in a replicated metadata store or service-owned Raft state machine
  • metadata/control:
    • built-in Raft consensus per shard or a small etcd-like control layer

Each shard leader stores:

  • LockState(lock_key)
  • ClientSession(session_id)
  • watch registrations
  • timeout/expiry index

This is effectively the same substrate family as ZooKeeper/etcd-style coordination stores, with a narrower product surface focused on lock semantics.


Operation Layer #

1. Acquire lock #

API

  • AcquireLock(lock_key, session_id, ttl, expected_free=true)

Initiator

  • client

Entry point

  • gateway or any lock-service node

Authoritative decider

  • current shard leader for lock_key

Precondition

  • session is active
  • lock currently acquirable

Transition

  • LockState(lock_key) = HELD(holder_session_id, epoch, expiry)

Response

  • {acquired: true|false, epoch, expiry}

Failure cases

  • competing claimant loses
  • stale routing -> retry with updated shard map
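
The acquire path could look roughly like the following on a shard leader. This is a single-process sketch under several assumptions: `ShardLeader` and `LockRecord` are hypothetical names, the clock is passed in as `now`, a held lock whose lease has lapsed is treated as acquirable, and replication of the transition through consensus is omitted entirely.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class LockRecord:
    held: bool = False
    holder_session_id: Optional[str] = None
    epoch: int = 0
    expiry: float = 0.0


class ShardLeader:
    def __init__(self) -> None:
        self.locks: Dict[str, LockRecord] = {}
        self.active_sessions: Set[str] = set()

    def acquire_lock(self, lock_key: str, session_id: str, ttl: float, now: float) -> dict:
        rec = self.locks.setdefault(lock_key, LockRecord())
        acquirable = (not rec.held) or now > rec.expiry
        # Preconditions: session is active and the lock is currently acquirable.
        if session_id not in self.active_sessions or not acquirable:
            return {"acquired": False, "epoch": rec.epoch, "expiry": rec.expiry}
        rec.held = True
        rec.holder_session_id = session_id
        rec.epoch += 1  # every successful claim gets a fresh fencing token
        rec.expiry = now + ttl
        return {"acquired": True, "epoch": rec.epoch, "expiry": rec.expiry}
```

Because the epoch advances on every successful claim, the response hands the new holder a token that is strictly greater than any token issued to earlier holders of the same key.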

2. Renew lock #

API

  • RenewLock(lock_key, session_id, epoch, ttl)

Initiator

  • client

Entry point

  • gateway or any node

Authoritative decider

  • shard leader

Precondition

  • lock is currently held by session_id
  • epoch matches current state

Transition

  • extend expiry

Response

  • {renewed: true|false, expiry}

3. Release lock #

API

  • ReleaseLock(lock_key, session_id, epoch)

Initiator

  • client

Entry point

  • gateway or any node

Authoritative decider

  • shard leader

Precondition

  • lock is currently held by session_id
  • epoch matches current state

Transition

  • HELD -> FREE

Response

  • {released: true|false}

4. Expire stale lock #

API

  • internal background process

Initiator

  • system

Entry point

  • shard leader

Authoritative decider

  • shard leader

Precondition

  • current time > expiry
  • lock state and epoch unchanged

Transition

  • HELD -> FREE with epoch advancement as needed
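
A timeout scan over the expiry index might be sketched as below. It assumes the index is a min-heap of `(expiry, lock_key, epoch)` entries pushed on every acquire and renew, so older entries for a renewed lock become stale and must be skipped. `LockRecord` and `expire_due` are names invented for this example, and advancing the epoch on expiry is one reading of "epoch advancement as needed".

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LockRecord:
    held: bool = False
    holder_session_id: Optional[str] = None
    epoch: int = 0
    expiry: float = 0.0


def expire_due(locks: Dict[str, LockRecord],
               expiry_index: List[Tuple[float, str, int]],
               now: float) -> List[str]:
    """Pop entries whose deadline has passed and free only locks that are unchanged."""
    freed = []
    while expiry_index and expiry_index[0][0] < now:
        expiry, key, epoch = heapq.heappop(expiry_index)
        rec = locks.get(key)
        # Revalidate: a renew changes expiry, a re-acquire changes epoch.
        if rec is None or not rec.held or rec.epoch != epoch or rec.expiry != expiry:
            continue
        rec.held = False
        rec.holder_session_id = None
        rec.epoch += 1  # the prior holder's token is now stale
        freed.append(key)
    return freed
```

If the scanner crashes midway, unprocessed entries can be rebuilt from the lock records and the next scan retries; entries that were already applied fail revalidation and are dropped.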

5. Register watch #

API

  • WatchLock(lock_key, from_version?)

Initiator

  • client

Entry point

  • gateway or watch endpoint

Authoritative decider

  • watch subsystem on committed lock-state stream

Precondition

  • session active

Transition

  • create transient or session-bound watch registration

Response

  • watch id / stream handle
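
Watch registration over a committed state stream could be sketched like this. `WatchHub`, `publish_committed`, and the event tuple shape are assumptions for illustration; `from_version` is interpreted here as "replay committed events at or after this version, then stream new ones", which is only one possible meaning. The session-active precondition is not modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[int, str, str]  # (version, lock_key, new_state)


class WatchHub:
    def __init__(self) -> None:
        self.log: List[Event] = []  # committed lock-state transitions, in order
        self.watchers: Dict[int, Tuple[str, Callable[[Event], None]]] = {}
        self._next_id = 1

    def watch_lock(self, lock_key: str, callback: Callable[[Event], None],
                   from_version: Optional[int] = None) -> int:
        watch_id = self._next_id
        self._next_id += 1
        if from_version is not None:
            # Replay history first so the watcher does not miss earlier changes.
            for ev in self.log:
                if ev[1] == lock_key and ev[0] >= from_version:
                    callback(ev)
        self.watchers[watch_id] = (lock_key, callback)
        return watch_id

    def publish_committed(self, lock_key: str, new_state: str) -> None:
        """Called only after a transition has been committed by the shard leader."""
        ev = (len(self.log) + 1, lock_key, new_state)
        self.log.append(ev)
        for key, cb in list(self.watchers.values()):
            if key == lock_key:
                cb(ev)
```

Since events are produced only from committed transitions, a slow or failed watcher delays notifications but cannot change which client holds the lock.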

Entry Point vs Decider vs Responder #

| Path | Entry point | Authoritative decider | Physical responder | Logical responder |
|---|---|---|---|---|
| AcquireLock | gateway / any node | shard leader | leader or front node | lock service |
| RenewLock | gateway / any node | shard leader | leader or front node | lock service |
| ReleaseLock | gateway / any node | shard leader | leader or front node | lock service |
| expiry | shard leader | shard leader | internal | lock service |
| watch | watch endpoint | committed state stream / shard leader | watch-serving node | lock service |
| shard failover | follower / coordination layer | shard quorum / lease store | new leader / control plane | lock service |

Concrete HLD #

Main components:

  • client gateway
    • routes lock operations to current shard leader
  • shard leaders
    • authoritative owners of lock and session state
    • maintain expiry index
  • shard followers
    • replicate committed lock/session state
  • watch service
    • emits lock-change notifications from committed state transitions
  • metadata/control service
    • tracks shard ownership and routing

Short Interview Version #

I’d build the distributed lock service as a sharded strongly consistent system with one authoritative owner per lock shard. Acquiring a lock is an exclusive claim that creates a lease-backed LockState record with an epoch or fencing token. Renew, release, and timeout expiry are guarded transitions that only succeed if the current holder and epoch still match. Client sessions are tracked explicitly so crashed clients eventually lose their locks, and stale holders are prevented from acting by the fencing token even after they resume. Watches are derived from committed lock-state transitions, but correctness-critical reads come directly from authoritative lock and session state.