Lease / Fencing #

lease   = time-bounded authority        "who may act, for now?"
fencing = proof that stale authority
          can no longer mutate          "how is the old owner stopped?"

Role in the catalog: the keystone. This is the ownership protocol that nine files have been importing — queue’s claim, scheduler’s binding, state machine’s guards, checkpoint’s generation stamp, policy’s short expiry, cache’s coherence rung, GC’s epoch fence, capacity’s reservation, boundary’s crossing credential. One mechanism, defined here once; every consumer thins to a reference. In catalog-VSA terms this block is Lease/Fence* — essential to nine consumers, one known recipe (monotonic token, checked downstream), and until this file, owned nowhere.

Central tension:

availability and automatic recovery  vs  stale-owner safety
(expire fast and risk two owners; expire slow and stall on every failure)

Design Axes (the core module) #

Axis 1 — Validity Substrate (the structural cleave — the fifth strength ladder) #

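What is the lease anchored to? Ordered by strength, weakest first; issuer-grant is listed last but sits beside the ladder, since its strength is whatever the issuer's is: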

wall-clock time:    weakest. clock skew between grantor and holder — and
                    worse, the holder's own pauses make its local expiry
                    check worthless (see bottleneck*)
                    (Redis TTL locks, DNS TTL)
heartbeat/session:  liveness proxy. session alive ≠ work progressing —
                    a renewing zombie is still a zombie
                    (ZooKeeper sessions, etcd leases, group membership)
quorum/consensus:   authority is a fact agreed by a majority, not a timer.
                    strongest: the log itself arbitrates
                    (Raft term; leader validity = still winning quorum)
issuer-grant:       expiry set by an authority service; validity is the
                    issuer's promise (OAuth/SVID — policy.md's territory)

The deep-lesson row “time vs consensus” is this ladder’s bottom and top rung.
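To make the top rung concrete, here is a minimal sketch of "leader validity = still winning quorum". The `Reply` shape and the `still_leader` function are hypothetical names for illustration, not any library's API; the only rules assumed are the usual Raft-style ones (strict majority, and a higher term always wins).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Reply:
    """A peer's response to a leader heartbeat (hypothetical shape)."""
    peer: str
    term: int        # highest term this peer has seen
    accepted: bool   # True if the peer still accepts the leader at that term


def still_leader(my_term: int, cluster_size: int, replies: Sequence[Reply]) -> bool:
    """Return True only if a strict majority still accepts `my_term`.

    The leader counts itself as one vote. Any reply carrying a higher term
    means a newer leader may exist, so authority is gone regardless of count.
    """
    if any(r.term > my_term for r in replies):
        return False
    # Count distinct peers so a duplicated reply cannot inflate the vote.
    supporters = {r.peer for r in replies if r.accepted and r.term == my_term}
    return 1 + len(supporters) > cluster_size // 2
```

No timer appears anywhere in the check: validity is decided by what the majority says now, which is why this rung does not depend on the holder's clock.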

Interrogation:

What does validity actually rest on — a timer, a heartbeat, a quorum, an issuer?
Who observes the anchor, and can their view diverge from the holder's?
What does the bottom rung's failure cost here, and does that price the climb?

Axis 2 — Authority Payload (the export table) #

One mechanism, many payloads — each consumed by a component block:

exclusion:      lock lease            (Chubby, ZK ephemeral, Redis-with-caution)
coordination:   leader lease          (Raft leader, k8s Lease election)
a work item:    delivery lease        → queue.md's claim (SQS visibility,
                                        Pub/Sub ack deadline)
a shard:        ownership lease       → scheduler.md's leased+fenced binding
                                        (Kinesis lease table, Kafka generation)
liveness:       session lease         → state_machine.md's Suspect machinery
                                        (ZK session + ephemerals)
capacity:       reservation lease     → capacity.md's committed-entry hold
                                        (inventory hold, payment auth)
read validity:  cache lease           → cache.md's coherence rung (NFS leases)
permission:     credential lease      → policy.md's short-expiry recipe
                                        (OAuth token, SPIFFE SVID)

Interrogation:

What EXACTLY is the leased authority — enumerate what the holder may touch
Which consumer block's semantics apply? (its axis vector rides along)
Does the payload outlive the lease? (checkpoint written under a lease must
  carry its generation — checkpoint_replay.md's binding coordinate*)

Axis 3 — Enforcement Locus (who stops the stale owner) #

self-policing:      holder checks its own expiry before acting.
                    WORTHLESS as a safety guarantee: a pause can land
                    between check and act.
grantor-enforced:   lease service rejects stale renewals/acks
                    (SQS receipt handles). protects the GRANTOR'S state only,
                    not anything the holder touches elsewhere.
downstream-enforced: every mutation target validates the current token.
                    the only complete answer.

This is the Kleppmann–Redlock argument as an axis: a distributed lock whose enforcement stops at the lock service protects none of the resources the lock was taken to protect. The deep-lesson row “ownership vs downstream enforcement,” promoted to structure.
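The difference between a target that trusts the holder and one that checks the token can be sketched in a few lines. Both classes below are hypothetical, single-threaded illustrations; a real target would have to make the comparison and the write atomic.

```python
class StaleToken(Exception):
    """Raised when a write presents a token older than one already seen."""


class TrustingTarget:
    """Accepts any write and relies on the holder to police itself."""

    def __init__(self) -> None:
        self.value = None

    def write(self, token: int, value: str) -> None:
        self.value = value  # token ignored: this is the hole in the fence


class FencedTarget:
    """Remembers the highest token seen and rejects anything older."""

    def __init__(self) -> None:
        self.value = None
        self.highest_token = 0

    def write(self, token: int, value: str) -> None:
        if token < self.highest_token:
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.value = value
```

If the new owner writes with token 2 and the old owner then writes with token 1, `TrustingTarget` ends up holding the old owner's value while `FencedTarget` raises `StaleToken` and keeps the new one. A holder that mutates both targets is only as safe as the trusting one.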

Interrogation:

List every system the holder mutates under this lease.
For each: does IT check the token, or does it trust the holder's claim?
A fence enforced by only some targets is a fence with a hole in it.

Axis 4 — Token Discipline #

the fence = a monotonic epoch/generation/term, with three requirements:

monotonic:    survives rollback and restart of the ISSUER —
              a reused token after issuer recovery re-arms the zombie
universal:    checked at every mutation target (axis 3's completeness)
comparable:   total order; "newer" must be decidable everywhere

Canonical instances: Raft term, Kafka producer epoch, Kinesis lease counter, k8s resourceVersion-as-guard, storage primary epoch.
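A minimal sketch of the "monotonic" requirement: the issuer persists each token before handing it out, so a restart resumes above every token ever issued. `DurableTokenIssuer` is a hypothetical name; the sketch assumes a single issuer process and a local file, omits the directory fsync a production version would want, and does nothing against a restored backup, which would rewind the counter.

```python
import os
import tempfile


class DurableTokenIssuer:
    """Issues strictly increasing tokens, persisting each before returning it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _read(self) -> int:
        try:
            with open(self.path, "r", encoding="ascii") as f:
                return int(f.read().strip())
        except FileNotFoundError:
            return 0  # first start: nothing issued yet

    def next_token(self) -> int:
        token = self._read() + 1
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="ascii") as f:
            f.write(str(token))
            f.flush()
            os.fsync(f.fileno())    # data reaches disk before the rename
        os.replace(tmp, self.path)  # atomic swap: old value or new, never partial
        return token                # returned only after it is durable
```

The ordering matters: if the token were returned before it was persisted, a crash in between could let the restarted issuer hand out the same number again.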

Interrogation:

Where does the counter live, and what makes it survive its own crash?
Is the check compare-and-reject, or merely logged?
Can two tokens ever be incomparable? (split issuers, restored backups)

Axis 5 — Lifecycle Phases #

acquire → renew → handoff → expire/release

Renewal’s traps:

renewal proves liveness, not progress — a stuck worker renewing forever
  is starvation wearing a heartbeat (deep lesson: heartbeat ≠ real progress)
renewal storms: synchronized renewals that swamp the grantor
renewal past the logical end: a renewal that still succeeds after
  ownership should have ended
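One way to make renewal prove more than a pulse is to require a progress marker that advances. The sketch below is an assumption about how such a gate could look; `ProgressLease` and its `max_idle_renewals` knob are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class ProgressLease:
    """Lease that keeps renewing only while the holder reports new progress."""

    def __init__(self, ttl: float, max_idle_renewals: int, now: float) -> None:
        self.ttl = ttl
        self.max_idle = max_idle_renewals  # renewals tolerated without progress
        self.expiry = now + ttl
        self.last_progress = -1
        self.idle = 0

    def renew(self, progress: int, now: float) -> bool:
        if now >= self.expiry:
            return False                   # already expired: must re-acquire
        if progress > self.last_progress:
            self.last_progress = progress  # real progress resets the idle count
            self.idle = 0
        else:
            self.idle += 1
            if self.idle > self.max_idle:
                return False               # pulse without progress: refuse
        self.expiry = now + self.ttl
        return True
```

A refused renewal leaves `expiry` untouched, so a stuck worker's lease lapses on schedule instead of being extended forever.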

Handoff’s dial — the block’s finest tension:

no-overlap guarantee:  stop-the-world transfer; pay availability
                       (Kafka eager rebalance)
no-gap guarantee:      cooperative/incremental transfer; pay a window of
                       possible double processing, fenced by generation
                       (Kafka cooperative-sticky)
you choose which failure you prefer; you cannot avoid both.

Interrogation:

Does renewal require evidence of progress, or only a pulse?
Handoff: gap or overlap — which was chosen, and is it written down?
Old owner's uncommitted progress: reaped, replayed, or lost?
New owner's starting checkpoint: stamped with whose generation?

Technical Bottleneck: The Paused Holder* #

a holder stalls — GC pause, VM migration, packet loss — past expiry,
then resumes BELIEVING IT STILL OWNS.
it cannot know it is stale from the inside:
this is state_machine.md's ignorance*, turned reflexive —
the holder is in Unknown about its own authority,
and no local check closes the gap,
because the pause can land between the check and the act.
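The check-then-act gap can be reproduced deterministically with a manually advanced clock. Everything here (`FakeClock`, `Holder`, `act_with_pause`) is a hypothetical illustration of the failure, not a recommended pattern.

```python
class FakeClock:
    """Manually advanced clock so the pause is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


class Holder:
    def __init__(self, name: str, expiry: float, clock: FakeClock) -> None:
        self.name, self.expiry, self.clock = name, expiry, clock

    def believes_valid(self) -> bool:
        return self.clock.now < self.expiry  # the local, self-policing check


def act_with_pause(holder: Holder, pause: float, log: list) -> None:
    """Check, stall, then act: the order a real pause can impose."""
    if holder.believes_valid():                      # check passes
        holder.clock.advance(pause)                  # GC pause or VM migration lands here
        log.append((holder.name, holder.clock.now))  # act happens anyway
```

With an expiry of 10, a check at time 9, and a pause of 5, the action is logged at time 14. The holder's own check was correct when it ran and useless by the time it mattered.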

Essential; no general solution. Known recipes:

downstream fencing        the only complete recipe: the world rejects the
                          zombie, since the zombie cannot detect itself
quorum-anchored authority the mutation IS the validity check — writes go
                          through the log, and a stale leader's entries
                          simply don't commit (the top rung dissolves the
                          problem instead of guarding it)
check-at-effect           fold the fence into the final write: CAS on
                          generation at the storage layer, atomically with
                          the mutation
bounded-pause assumption  honest and explicit, but violated sooner or later
                          by any runtime with stop-the-world pauses (JVM GC
                          is the usual example); acceptable only when the
                          blast radius of violation is priced
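The check-at-effect recipe can be sketched with the standard-library `sqlite3` module. The table layout and the helper names (`open_store`, `fenced_write`, `take_over`) are assumptions made for this illustration; the point is only that the generation comparison and the mutation are a single statement.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one resource at generation 1."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE resource (id TEXT PRIMARY KEY, generation INTEGER, value TEXT)"
    )
    db.execute("INSERT INTO resource VALUES ('shard-0', 1, 'init')")
    db.commit()
    return db


def fenced_write(db: sqlite3.Connection, rid: str, generation: int, value: str) -> bool:
    """Apply the write only if `generation` is still current.

    The comparison and the mutation are one UPDATE, so no pause can land
    between check and act.
    """
    cur = db.execute(
        "UPDATE resource SET value = ? WHERE id = ? AND generation = ?",
        (value, rid, generation),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows touched means the token was stale


def take_over(db: sqlite3.Connection, rid: str, old_generation: int) -> bool:
    """Successor bumps the generation, itself a compare-and-set."""
    cur = db.execute(
        "UPDATE resource SET generation = generation + 1 "
        "WHERE id = ? AND generation = ?",
        (rid, old_generation),
    )
    db.commit()
    return cur.rowcount == 1
```

After `take_over(db, "shard-0", 1)` succeeds, a late `fenced_write` at generation 1 returns `False` and changes nothing, while a write at generation 2 goes through.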

A strong design says explicitly:

what authority is leased and what exactly it covers,
what validity is anchored to (named rung),
how long it lasts and what renewal proves,
which token fences it and where the token is checked —
and the list of mutation targets that reject the stale owner,
because the stale owner will not stop itself.

Lease/Fencing As Protocol (the crossing-point spec — keep) #

General:

request authority
grant lease {expiry, token}
holder acts, presenting token
holder renews before expiry
mutation targets validate current token
lease expires or is released
new holder receives HIGHER token
old holder's later mutations are rejected downstream

Delivery-lease instantiation (queue.md’s claim):

worker receives message {receipt ID, deadline}
message hidden until deadline
worker may extend (ChangeMessageVisibility / ModifyAckDeadline)
worker acks with CURRENT receipt — stale receipt fails or is ignored
deadline expiry re-exposes the message; attempt count rises
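The five steps above can be sketched as an in-memory queue. `LeasedQueue` is a hypothetical class written for this note and does not mirror the API of SQS or Pub/Sub; time is passed in explicitly so the behavior is deterministic.

```python
import itertools


class LeasedQueue:
    """In-memory sketch of a delivery lease with per-delivery receipts."""

    def __init__(self, visibility: float) -> None:
        self.visibility = visibility
        self._receipts = itertools.count(1)
        self._msgs = {}  # msg_id -> {"deadline", "receipt", "attempts"}

    def send(self, msg_id: str) -> None:
        self._msgs[msg_id] = {"deadline": 0.0, "receipt": None, "attempts": 0}

    def receive(self, now: float):
        """Return (msg_id, receipt, deadline) for one visible message, or None."""
        for msg_id, m in self._msgs.items():
            if now >= m["deadline"]:                 # never leased, or lease expired
                m["receipt"] = next(self._receipts)  # fresh receipt per delivery
                m["deadline"] = now + self.visibility
                m["attempts"] += 1
                return msg_id, m["receipt"], m["deadline"]
        return None

    def extend(self, msg_id: str, receipt: int, now: float) -> bool:
        """Push the deadline out, but only for the current, unexpired receipt."""
        m = self._msgs.get(msg_id)
        if m is None or m["receipt"] != receipt or now >= m["deadline"]:
            return False
        m["deadline"] = now + self.visibility
        return True

    def ack(self, msg_id: str, receipt: int) -> bool:
        """Delete only if `receipt` is the current one; stale acks are ignored."""
        m = self._msgs.get(msg_id)
        if m is None or m["receipt"] != receipt:
            return False
        del self._msgs[msg_id]
        return True

    def attempts(self, msg_id: str) -> int:
        return self._msgs[msg_id]["attempts"]
```

Note what the receipt does and does not protect: a stale ack cannot delete the message, but nothing here stops the first worker from having already applied the message's side effects elsewhere. That is the grantor-enforced locus of axis 3.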

Shard-ownership instantiation (scheduler.md’s binding):

worker acquires shard lease at generation G
processes; CHECKPOINTS CARRY G (binding coordinate*)
renews; on failure, lease expires
successor acquires at G+1
checkpoint writes require current generation —
the zombie's late checkpoint at G is rejected
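The shard lifecycle above, from acquisition at G through the rejected late checkpoint, can be sketched as a single-process lease table. `ShardLeaseTable` is a hypothetical class for one shard; a real lease table would live in shared storage and perform each step as a conditional update.

```python
class ShardLeaseTable:
    """Lease record for one shard, plus generation-checked checkpoints."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.owner = None
        self.generation = 0
        self.expiry = 0.0
        self.checkpoint = None  # (generation, position)

    def acquire(self, worker: str, now: float):
        """Grant the lease at generation + 1 if free or expired; else None."""
        if self.owner is not None and now < self.expiry:
            return None
        self.generation += 1    # every new grant gets a HIGHER token
        self.owner, self.expiry = worker, now + self.ttl
        return self.generation

    def renew(self, worker: str, generation: int, now: float) -> bool:
        """Extend only for the current owner, generation, and unexpired lease."""
        if worker != self.owner or generation != self.generation or now >= self.expiry:
            return False
        self.expiry = now + self.ttl
        return True

    def write_checkpoint(self, generation: int, position: int) -> bool:
        """Accept the checkpoint only from the current generation."""
        if generation != self.generation:
            return False        # the zombie's late checkpoint lands here
        self.checkpoint = (generation, position)
        return True
```

The stored checkpoint carries its generation, so a successor can see whose progress it is resuming from.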

Named Configurations (lookup table) #

Vector = {substrate, payload, enforcement, token, handoff style}.

| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Lock lease | time/session, exclusion, often grantor-only(!), sometimes none, release | Chubby; ZK ephemeral; Redis TTL (cautionary) | pause-past-expiry resume; wrong-holder release; no downstream fence |
| Leader lease | quorum (or unsafely, time), coordination, log-enforced, term, election | Raft term + leader lease | split brain; old leader serves stale reads; clock-based lease under partition |
| Delivery lease | time, work item, grantor (receipt), attempt count, expiry-redeliver | SQS visibility timeout | complete-after-expiry → duplicate; visibility mistuned; stale ack |
| Shard lease | heartbeat+counter, shard, downstream via checkpoint-generation, lease counter, rebalance | Kinesis lease table; Kafka generation | zombie checkpoint; rebalance thrash; handoff gap/overlap unchosen |
| Session lease | heartbeat, liveness+ephemerals, grantor, session ID, expiry-cleanup | ZK session + ephemerals | network pause kills session; old session believes it lives; duplicate identity on reconnect |
| Reservation lease | time, capacity, grantor, hold ID, commit/cancel | inventory hold; payment auth | leak; expired hold honored; overbooking; stranding (capacity.md) |
| Cache lease | time+invalidation, read validity, client-cooperative, version, renew | NFS coherence leases | missed invalidation; renewal race; server forgets client (cache.md) |
| Credential lease | issuer, permission, verifier-enforced, expiry+audience, rotation | OAuth token; SPIFFE SVID | revoked-but-unexpired (revocation*, policy.md); long TTL blast radius |
| Fencing token | any, the fence itself, downstream, epoch/term/generation, n/a | Kafka producer epoch; Raft term | unchecked at some target; non-monotonic after issuer rollback; incomparable tokens |
| Renewal | any, extension, grantor, same token, n/a | etcd KeepAlive; k8s Lease renewTime | liveness without progress; renewal storm; renew past logical end |
| Handoff | any, transfer, generation-mediated, G→G+1, gap-vs-overlap dial | Kafka cooperative-sticky; volume detach/attach | ownership gap; ownership overlap; uncommitted progress lost; stale starting checkpoint |

Vocabulary #

lease  holder  owner  grantor  expiry  TTL
renewal  heartbeat  session  ephemeral
generation  epoch  term  fencing token  monotonic  comparable
receipt  visibility timeout  ack deadline
handoff  rebalance  drain  gap  overlap
stale owner  zombie  split brain  paused holder
clock skew  bounded-pause assumption
check-at-effect  quorum anchor  downstream enforcement

Deep Lesson #

Lease/fencing bugs come from confusing pairs on different axes:

lease               vs  lock forever          (authority is RENTED — the block's premise)
expiry              vs  safe stop             (bottleneck*: the holder doesn't feel expiry)
heartbeat           vs  real progress         (axis 5: a pulse is not a checkpoint)
session liveness    vs  authority correctness (axis 1: proxy vs fact)
time                vs  consensus             (axis 1: bottom rung vs top rung)
ownership           vs  downstream enforcement (axis 3: a fence with holes)
new owner started   vs  old owner stopped     (axis 5: gap/overlap — you chose one,
                                               and the token handles the other)

Design procedure: name the payload and its consumer block, pick the validity rung and price the bottom rung’s failure, enumerate every mutation target and put the token check IN each one, make the token survive its issuer’s own crashes, choose gap or overlap out loud — and never write a line that assumes the old owner knows it is old.