Lease / Fencing #
lease = time-bounded authority: "who may act, for now?"
fencing = proof that stale authority can no longer mutate: "how is the old owner stopped?"
Role in the catalog: the keystone. This is the ownership protocol that nine files have been importing — queue’s claim, scheduler’s binding, state machine’s guards, checkpoint’s generation stamp, policy’s short expiry, cache’s coherence rung, GC’s epoch fence, capacity’s reservation, boundary’s crossing credential. One mechanism, defined here once; every consumer thins to a reference. In catalog-VSA terms this block is Lease/Fence* — essential to nine consumers, one known recipe (monotonic token, checked downstream), and until this file, owned nowhere.
Central tension:
availability and automatic recovery vs stale-owner safety
(expire fast and risk two owners; expire slow and stall on every failure)
Design Axes (the core module) #
Axis 1 — Validity Substrate (the structural cleave — the fifth strength ladder) #
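What is the lease anchored to? The first three rungs run weakest to strongest; issuer-grant delegates validity to an outside authority and sits beside the ladder: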
wall-clock time: weakest. clock skew between grantor and holder — and
worse, the holder's own pauses make its local expiry
check worthless (see bottleneck*)
(Redis TTL locks, DNS TTL)
heartbeat/session: liveness proxy. session alive ≠ work progressing —
a renewing zombie is still a zombie
(ZooKeeper sessions, etcd leases, group membership)
quorum/consensus: authority is a fact agreed by a majority, not a timer.
strongest: the log itself arbitrates
(Raft term; leader validity = still winning quorum)
issuer-grant: expiry set by an authority service; validity is the
issuer's promise (OAuth/SVID — policy.md's territory)
The deep-lesson row “time vs consensus” is this ladder’s bottom and top rung.
Interrogation:
What does validity actually rest on — a timer, a heartbeat, a quorum, an issuer?
Who observes the anchor, and can their view diverge from the holder's?
What does the bottom rung's failure cost here, and does that price the climb?
Axis 2 — Authority Payload (the export table) #
One mechanism, many payloads — each consumed by a component block:
exclusion: lock lease (Chubby, ZK ephemeral, Redis-with-caution)
coordination: leader lease (Raft leader, k8s Lease election)
a work item: delivery lease → queue.md's claim (SQS visibility,
Pub/Sub ack deadline)
a shard: ownership lease → scheduler.md's leased+fenced binding
(Kinesis lease table, Kafka generation)
liveness: session lease → state_machine.md's Suspect machinery
(ZK session + ephemerals)
capacity: reservation lease → capacity.md's committed-entry hold
(inventory hold, payment auth)
read validity: cache lease → cache.md's coherence rung (NFS leases)
permission: credential lease → policy.md's short-expiry recipe
(OAuth token, SPIFFE SVID)
Interrogation:
What EXACTLY is the leased authority? Enumerate what the holder may touch.
Which consumer block's semantics apply? (its axis vector rides along)
Does the payload outlive the lease? (checkpoint written under a lease must
carry its generation — checkpoint_replay.md's binding coordinate*)
Axis 3 — Enforcement Locus (who stops the stale owner) #
self-policing: holder checks its own expiry before acting.
WORTHLESS: the pause lands between check and act.
grantor-enforced: lease service rejects stale renewals/acks
(SQS receipt handles). protects the GRANTOR'S state only,
not anything the holder touches elsewhere.
downstream-enforced: every mutation target validates the current token.
the only complete answer.
This is the Kleppmann–Redlock argument as an axis: a distributed lock whose enforcement stops at the lock service protects none of the resources the lock was taken to protect. The deep-lesson row “ownership vs downstream enforcement,” promoted to structure.
Interrogation:
List every system the holder mutates under this lease.
For each: does IT check the token, or does it trust the holder's claim?
A fence enforced by only some targets is a fence with a hole in it.
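The difference between self-policing and downstream enforcement can be made concrete with a small simulation. This is a minimal sketch, not a real lock client: `FakeClock`, `UnfencedStore`, `FencedStore` and `self_policed_act` are hypothetical names invented for illustration, and the pause is injected by hand so that it lands between the holder's expiry check and its write.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class StaleToken(Exception):
    """Raised by a mutation target that has already seen a newer token."""


@dataclass
class Lease:
    token: int
    expires_at: float


class FakeClock:
    """Manually advanced clock, so the pause is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


class UnfencedStore:
    """Trusts the caller's claim: enforcement stops at the holder."""

    def __init__(self) -> None:
        self.value = ""

    def write(self, value: str) -> None:
        self.value = value


class FencedStore:
    """Downstream enforcement: rejects any token older than the newest seen."""

    def __init__(self) -> None:
        self.value = ""
        self.highest_token = 0

    def write(self, value: str, token: int) -> None:
        if token < self.highest_token:
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.value = value


def self_policed_act(clock: FakeClock, lease: Lease,
                     pause: Callable[[], None], act: Callable[[], None]) -> bool:
    """Holder checks its own expiry, then acts; the pause lands in between."""
    if clock.now >= lease.expires_at:
        return False
    pause()   # GC pause or VM migration happens here
    act()     # holder still believes it owns the lease
    return True


def demo() -> Tuple[str, str]:
    clock = FakeClock()
    old, new = Lease(token=1, expires_at=10.0), Lease(token=2, expires_at=50.0)
    unfenced, fenced = UnfencedStore(), FencedStore()

    def pause() -> None:
        clock.advance(30.0)                    # old lease lapses mid-pause
        unfenced.write("new-owner")            # successor takes over
        fenced.write("new-owner", new.token)

    def act() -> None:
        unfenced.write("zombie")               # accepted: nobody checks
        try:
            fenced.write("zombie", old.token)  # rejected downstream
        except StaleToken:
            pass

    self_policed_act(clock, old, pause, act)
    return unfenced.value, fenced.value
```

In this sketch `demo()` returns `("zombie", "new-owner")`: the holder's local check passed, yet only the target that validated the token kept the successor's write.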
Axis 4 — Token Discipline #
the fence = a monotonic epoch/generation/term, with three requirements:
monotonic: survives rollback and restart of the ISSUER —
a reused token after issuer recovery re-arms the zombie
universal: checked at every mutation target (axis 3's completeness)
comparable: total order; "newer" must be decidable everywhere
Canonical instances: Raft term, Kafka producer epoch, Kinesis lease counter, k8s resourceVersion-as-guard, storage primary epoch.
Interrogation:
Where does the counter live, and what makes it survive its own crash?
Is the check compare-and-reject, or merely logged?
Can two tokens ever be incomparable? (split issuers, restored backups)
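One way to meet the "monotonic" requirement is to persist the counter before a token is ever handed out. The following is a sketch under strong assumptions: a single issuer process, a local file as the durable store, and no concurrent writers. `DurableTokenIssuer` and its file layout are hypothetical. It does nothing about the restored-backup case from the interrogation above; restoring an older copy of the file would re-issue tokens.

```python
import os
import tempfile


class DurableTokenIssuer:
    """Issues strictly increasing tokens, persisting each one BEFORE returning it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _read(self) -> int:
        try:
            with open(self.path, "r", encoding="ascii") as f:
                return int(f.read().strip())
        except FileNotFoundError:
            return 0  # first start: no token issued yet

    def next_token(self) -> int:
        token = self._read() + 1
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "w", encoding="ascii") as f:
                f.write(str(token))
                f.flush()
                os.fsync(f.fileno())      # force the new value to disk
            os.replace(tmp, self.path)    # atomic swap of old and new counter
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
        return token
```

Because the counter is re-read from disk on every call, a fresh `DurableTokenIssuer` pointed at the same path continues from the last persisted value instead of restarting at 1.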
Axis 5 — Lifecycle Phases #
acquire → renew → handoff → expire/release
Renewal’s traps:
renewal proves liveness, not progress — a stuck worker renewing forever
is starvation wearing a heartbeat (deep lesson: heartbeat ≠ real progress)
renewal storms; renewals that still succeed after logical ownership should have ended
Handoff’s dial — the block’s finest tension:
no-overlap guarantee: stop-the-world transfer; pay availability
(Kafka eager rebalance)
no-gap guarantee: cooperative/incremental transfer; pay a window of
possible double processing, fenced by generation
(Kafka cooperative-sticky)
you choose which failure you prefer; you do not get neither.
Interrogation:
Does renewal require evidence of progress, or only a pulse?
Handoff: gap or overlap — which was chosen, and is it written down?
Old owner's uncommitted progress: reaped, replayed, or lost?
New owner's starting checkpoint: stamped with whose generation?
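The first interrogation question, whether renewal requires evidence of progress, can be sketched as a grantor that extends the lease only while a progress marker keeps advancing. This is an illustrative toy, not any real service's behaviour: `ProgressGatedGrantor`, `max_idle_renewals` and the integer progress marker (for example a checkpointed offset) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseRecord:
    holder: str
    expires_at: float
    last_progress: int      # e.g. highest checkpointed offset
    idle_renewals: int = 0  # consecutive renewals with no new progress


class ProgressGatedGrantor:
    def __init__(self, ttl: float, max_idle_renewals: int) -> None:
        self.ttl = ttl
        self.max_idle_renewals = max_idle_renewals
        self.lease: Optional[LeaseRecord] = None

    def acquire(self, holder: str, now: float, progress: int) -> bool:
        if self.lease is not None and now < self.lease.expires_at:
            return False  # someone else still holds an unexpired lease
        self.lease = LeaseRecord(holder, now + self.ttl, progress)
        return True

    def renew(self, holder: str, now: float, progress: int) -> bool:
        lease = self.lease
        if lease is None or lease.holder != holder or now >= lease.expires_at:
            return False
        if progress > lease.last_progress:
            lease.last_progress = progress
            lease.idle_renewals = 0
        else:
            lease.idle_renewals += 1
            if lease.idle_renewals > self.max_idle_renewals:
                return False  # a pulse without progress: let the lease lapse
        lease.expires_at = now + self.ttl
        return True
```

A refused renewal does not revoke anything here; the lease simply runs out at its existing expiry, after which another worker can acquire it. The stuck worker may still act after that point, so this gate complements downstream fencing and does not replace it.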
Technical Bottleneck: The Paused Holder* #
a holder stalls — GC pause, VM migration, packet loss — past expiry,
then resumes BELIEVING IT STILL OWNS.
it cannot know it is stale from the inside:
this is state_machine.md's ignorance*, turned reflexive —
the holder is in Unknown about its own authority,
and no local check closes the gap,
because the pause can land between the check and the act.
Essential; no general solution. Known recipes:
downstream fencing: the only complete recipe. the world rejects the
zombie, since the zombie cannot detect itself
quorum-anchored authority: the mutation IS the validity check. writes go
through the log, and a stale leader's entries
simply don't commit (the top rung dissolves the
problem instead of guarding it)
check-at-effect: fold the fence into the final write. CAS on
generation at the storage layer, atomically with
the mutation
bounded-pause assumption: honest and explicit, and violated by every JVM
eventually; acceptable only when the blast
radius of violation is priced
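The check-at-effect recipe can be sketched with a single conditional `UPDATE`, so the generation comparison and the mutation happen in one atomic statement. This assumes a SQL store (SQLite from Python's standard library is used here purely for illustration); the `resource` table, the `shard-7` row and the `fenced_write` helper are hypothetical.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one fenced resource at generation 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource ("
        "id TEXT PRIMARY KEY, generation INTEGER NOT NULL, value TEXT)"
    )
    conn.execute("INSERT INTO resource VALUES ('shard-7', 0, '')")
    conn.commit()
    return conn


def fenced_write(conn: sqlite3.Connection, resource_id: str,
                 generation: int, value: str) -> bool:
    """Apply the write only if `generation` is not older than the stored one.

    The WHERE clause is the fence: the comparison and the mutation are the
    same statement, so no pause can land between check and act.
    """
    cur = conn.execute(
        "UPDATE resource SET generation = ?, value = ? "
        "WHERE id = ? AND generation <= ?",
        (generation, value, resource_id, generation),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows touched means the writer was stale


def read_value(conn: sqlite3.Connection, resource_id: str) -> str:
    row = conn.execute(
        "SELECT value FROM resource WHERE id = ?", (resource_id,)
    ).fetchone()
    return row[0]
```

A writer at generation 2 followed by a late writer at generation 1 leaves the generation-2 value in place, and the late writer gets `False` back. Note that the holder never consults its own clock in this path.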
A strong design says explicitly:
what authority is leased and what exactly it covers,
what validity is anchored to (named rung),
how long it lasts and what renewal proves,
which token fences it and where the token is checked —
and the list of mutation targets that reject the stale owner,
because the stale owner will not stop itself.
Lease/Fencing As Protocol (the crossing-point spec — keep) #
General:
request authority
grant lease {expiry, token}
holder acts, presenting token
holder renews before expiry
mutation targets validate current token
lease expires or is released
new holder receives HIGHER token
old holder's later mutations are rejected downstream
Delivery-lease instantiation (queue.md’s claim):
worker receives message {receipt handle or ack ID, deadline}
message hidden until deadline
worker may extend (ChangeMessageVisibility / ModifyAckDeadline)
worker acks with CURRENT receipt — stale receipt fails or is ignored
deadline expiry re-exposes the message; attempt count rises
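The delivery-lease steps above can be traced with an in-memory toy queue. It is a sketch of the general shape only and does not reproduce the exact semantics of SQS or Pub/Sub; `LeasedQueue`, its method names and the `r1`, `r2` receipt format are invented for the example, and time is passed in explicitly to keep it deterministic.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class _Entry:
    body: str
    visible_at: float = 0.0
    receipt: Optional[str] = None
    attempts: int = 0
    done: bool = False


class LeasedQueue:
    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._entries: Dict[int, _Entry] = {}
        self._ids = itertools.count(1)
        self._receipts = itertools.count(1)

    def send(self, body: str) -> int:
        msg_id = next(self._ids)
        self._entries[msg_id] = _Entry(body)
        return msg_id

    def receive(self, now: float) -> Optional[Tuple[int, str, str]]:
        for msg_id, e in self._entries.items():
            if not e.done and now >= e.visible_at:
                e.attempts += 1
                e.receipt = f"r{next(self._receipts)}"  # fresh receipt per delivery
                e.visible_at = now + self.visibility_timeout  # hide until deadline
                return msg_id, e.body, e.receipt
        return None

    def extend(self, msg_id: int, receipt: str, now: float, extra: float) -> bool:
        e = self._entries[msg_id]
        if e.done or e.receipt != receipt or now >= e.visible_at:
            return False
        e.visible_at = now + extra
        return True

    def ack(self, msg_id: int, receipt: str) -> bool:
        e = self._entries[msg_id]
        if e.done or e.receipt != receipt:
            return False  # stale receipt: a newer delivery superseded it
        e.done = True
        return True
```

The stale-ack rejection here is grantor-enforced in the sense of axis 3: it keeps the queue's own state consistent, but any side effect the first worker performed elsewhere before its late ack has already happened.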
Shard-ownership instantiation (scheduler.md’s binding):
worker acquires shard lease at generation G
processes; CHECKPOINTS CARRY G (binding coordinate*)
renews; on failure, lease expires
successor acquires at G+1
checkpoint writes require current generation —
the zombie's late checkpoint at G is rejected
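The shard-ownership sequence can be sketched as a one-row lease table in which every acquisition bumps the generation and every checkpoint must carry the current one. `ShardLeaseTable` and its fields are hypothetical, and the sketch assumes the lease row and the checkpoint live in the same store so the generation check is atomic with the checkpoint write.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShardLease:
    owner: Optional[str] = None
    generation: int = 0
    expires_at: float = 0.0
    checkpoint: int = 0
    checkpoint_generation: int = 0  # which owner generation wrote the checkpoint


class ShardLeaseTable:
    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.row = ShardLease()

    def acquire(self, worker: str, now: float) -> Optional[int]:
        row = self.row
        if row.owner is not None and now < row.expires_at:
            return None                 # current lease still valid
        row.owner = worker
        row.generation += 1             # successor always gets G+1
        row.expires_at = now + self.ttl
        return row.generation

    def renew(self, worker: str, generation: int, now: float) -> bool:
        row = self.row
        if (row.owner != worker or row.generation != generation
                or now >= row.expires_at):
            return False
        row.expires_at = now + self.ttl
        return True

    def write_checkpoint(self, generation: int, position: int) -> bool:
        row = self.row
        if generation != row.generation:
            return False                # zombie's checkpoint at an old G
        row.checkpoint = position
        row.checkpoint_generation = generation
        return True
```

After a successor acquires at generation 2, a late `write_checkpoint(1, ...)` from the paused first owner returns `False` and the stored position is untouched, so the new owner's starting point cannot be overwritten by the old one.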
Named Configurations (lookup table) #
Vector = {substrate, payload, enforcement, token, handoff style}.
| Name | Vector | Canonical study object | Signature failure |
|---|---|---|---|
| Lock lease | time/session, exclusion, often grantor-only(!), sometimes none, release | Chubby; ZK ephemeral; Redis TTL (cautionary) | pause-past-expiry resume; wrong-holder release; no downstream fence |
| Leader lease | quorum (or unsafely, time), coordination, log-enforced, term, election | Raft term + leader lease | split brain; old leader serves stale reads; clock-based lease under partition |
| Delivery lease | time, work item, grantor (receipt), attempt count, expiry-redeliver | SQS visibility timeout | complete-after-expiry → duplicate; visibility mistuned; stale ack |
| Shard lease | heartbeat+counter, shard, downstream via checkpoint-generation, lease counter, rebalance | Kinesis lease table; Kafka generation | zombie checkpoint; rebalance thrash; handoff gap/overlap unchosen |
| Session lease | heartbeat, liveness+ephemerals, grantor, session ID, expiry-cleanup | ZK session + ephemerals | network pause kills session; old session believes it lives; duplicate identity on reconnect |
| Reservation lease | time, capacity, grantor, hold ID, commit/cancel | inventory hold; payment auth | leak; expired hold honored; overbooking; stranding (capacity.md) |
| Cache lease | time+invalidation, read validity, client-cooperative, version, renew | NFS coherence leases | missed invalidation; renewal race; server forgets client (cache.md) |
| Credential lease | issuer, permission, verifier-enforced, expiry+audience, rotation | OAuth token; SPIFFE SVID | revoked-but-unexpired (revocation*, policy.md); long TTL blast radius |
| Fencing token | any, the fence itself, downstream, epoch/term/generation, — | Kafka producer epoch; Raft term | unchecked at some target; non-monotonic after issuer rollback; incomparable tokens |
| Renewal | any, extension, grantor, same token, — | etcd KeepAlive; k8s Lease renewTime | liveness without progress; renewal storm; renew past logical end |
| Handoff | any, transfer, generation-mediated, G→G+1, gap-vs-overlap dial | Kafka cooperative-sticky; volume detach/attach | ownership gap; ownership overlap; uncommitted progress lost; stale starting checkpoint |
Vocabulary #
lease holder owner grantor expiry TTL
renewal heartbeat session ephemeral
generation epoch term fencing token monotonic comparable
receipt visibility timeout ack deadline
handoff rebalance drain gap overlap
stale owner zombie split brain paused holder
clock skew bounded-pause assumption
check-at-effect quorum anchor downstream enforcement
Deep Lesson #
Lease/fencing bugs come from confusing pairs on different axes:
lease vs lock forever (authority is RENTED — the block's premise)
expiry vs safe stop (bottleneck*: the holder doesn't feel expiry)
heartbeat vs real progress (axis 5: a pulse is not a checkpoint)
session liveness vs authority correctness (axis 1: proxy vs fact)
time vs consensus (axis 1: bottom rung vs top rung)
ownership vs downstream enforcement (axis 3: a fence with holes)
new owner started vs old owner stopped (axis 5: gap/overlap — you chose one,
and the token handles the other)
Design procedure: name the payload and its consumer block, pick the validity rung and price the bottom rung’s failure, enumerate every mutation target and put the token check IN each one, make the token survive its issuer’s own crashes, choose gap or overlap out loud — and never write a line that assumes the old owner knows it is old.