Failure Mode Interview Verbiage From Taxonomy
Use this as a speaking companion to:
This note is not a new failure framework. It is just the interview language for the failure modes that already exist in the taxonomy.
The speaking shape is:
- what: what broke?
- so what: what property is now at risk?
- prevention: what normally stops this?
- repair: what restores the invariant if it still happens?
- close: why this failure is survivable if the design is correct
This note is deliberately grounded in the archetype mechanics:
- guarded transitions
- durable checkpoints
- replayable logs
- fencing tokens
- reconciliation loops
- rebuildable derived state
Core Verbiage #
These are the reusable sentence stems.
what #
- “The failure here is not total system collapse; it is the system losing track of which transition is actually authoritative.”
- “This is a stale-actor failure: an actor that used to be valid is still trying to commit.”
- “This is a progress-accounting failure: the system advanced some marker before the underlying effect was durably true.”
- “This is a drift failure: a derived or cached view no longer matches its source of truth.”
- “This is an ownership failure: more than one actor may believe it owns the same work or resource.”
so what #
- “The risk is not just temporary unavailability; it is violating the core invariant of the archetype.”
- “If we get this wrong, we can skip work, apply work twice, or let stale state leak into serving.”
- “The user-visible symptom may be latency or availability, but the deeper issue is loss of authoritative sequencing.”
- “This is dangerous because the system can look healthy while its durable truth has drifted.”
prevention #
- “The normal prevention move is to make the transition guarded against authoritative truth.”
- “The prevention move is fencing: a stale actor may continue running, but it must not be allowed to commit.”
- “The prevention move is to advance progress only after durable evidence exists.”
- “The prevention move is to keep enough ordered history so replay or rebuild remains possible.”
- “The prevention move is to version snapshots and reject out-of-order apply.”
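As a concrete anchor for the first stem, a guarded transition can be sketched as a version-checked state change. This is a minimal sketch, not taken from the taxonomy: `Record`, `ALLOWED`, `guarded_transition`, and `StaleTransition` are hypothetical names, and a real store would perform the check and the write atomically.

```python
from dataclasses import dataclass


class StaleTransition(Exception):
    """Raised when the caller's view of the record is out of date."""


@dataclass
class Record:
    state: str
    version: int


# Hypothetical allowed edges for a work item.
ALLOWED = {("pending", "claimed"), ("claimed", "done"), ("claimed", "pending")}


def guarded_transition(record: Record, expected_version: int,
                       from_state: str, to_state: str) -> Record:
    """Apply the transition only if it still matches authoritative truth."""
    if record.version != expected_version or record.state != from_state:
        raise StaleTransition(
            f"expected {from_state}@v{expected_version}, "
            f"found {record.state}@v{record.version}")
    if (from_state, to_state) not in ALLOWED:
        raise ValueError(f"illegal edge {from_state} -> {to_state}")
    return Record(state=to_state, version=record.version + 1)


item = Record(state="pending", version=1)
item = guarded_transition(item, expected_version=1,
                          from_state="pending", to_state="claimed")

try:
    # A second actor still holding version 1 loses the race.
    guarded_transition(item, expected_version=1,
                       from_state="pending", to_state="claimed")
    second_claim_rejected = False
except StaleTransition:
    second_claim_rejected = True
```

The point to make in an interview is that the loser is rejected by the guard itself, not by timing or luck.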
repair #
- “The repair move is to replay from the last safe point.”
- “The repair move is to expire, reclaim, or fence the stale actor and requeue the work.”
- “The repair move is to rebuild the derived view from the source of truth.”
- “The repair move is to reconcile ambiguity against the external or durable system of record.”
- “The repair move is to resync from a fresh snapshot and continue from a known-good revision.”
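The last repair stem (resync from a fresh snapshot) is easier to defend with a small model in mind. The sketch below assumes a hypothetical in-memory `RevisionedStore` with a compaction floor and a `Watcher` that keeps its last applied revision as a cursor; none of these names come from a real library.

```python
class CompactedError(Exception):
    """The requested history has been compacted away."""


class RevisionedStore:
    def __init__(self):
        self.revision = 0
        self.state = {}
        self.log = []              # (revision, key, value) in order
        self.compaction_floor = 0  # events with revision <= floor are gone

    def put(self, key, value):
        self.revision += 1
        self.state[key] = value
        self.log.append((self.revision, key, value))

    def compact(self, floor):
        self.log = [e for e in self.log if e[0] > floor]
        self.compaction_floor = floor

    def events_after(self, revision):
        # Resuming below the floor would silently skip history, so refuse.
        if revision < self.compaction_floor:
            raise CompactedError(revision)
        return [e for e in self.log if e[0] > revision]

    def snapshot(self):
        return dict(self.state), self.revision


class Watcher:
    def __init__(self, store):
        self.store = store
        self.view, self.cursor = store.snapshot()
        self.resyncs = 0

    def catch_up(self):
        try:
            events = self.store.events_after(self.cursor)
        except CompactedError:
            # Repair: reload a fresh snapshot, continue from its revision.
            self.view, self.cursor = self.store.snapshot()
            self.resyncs += 1
            return
        for revision, key, value in events:
            self.view[key] = value
            self.cursor = revision


store = RevisionedStore()
store.put("a", 1)
watcher = Watcher(store)   # cursor is revision 1
store.put("a", 2)
store.put("b", 3)
store.compact(2)           # the watcher's missing history is now gone
watcher.catch_up()         # falls back to snapshot at revision 3
store.put("c", 4)
watcher.catch_up()         # normal incremental replay
```

The design choice worth naming is that the store refuses an unsafe resume instead of returning a partial history.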
close #
- “So this failure is survivable if the design keeps the right truth durable and replayable.”
- “The main goal is not to pretend the failure cannot happen; it is to make the failure detectable, bounded, and repairable.”
- “This is acceptable if the system degrades into replay or reconciliation rather than silent corruption.”
Per-Archetype Verbiage #
Use one row as the default interview wording, then expand only if the interviewer pushes.
| Archetype | Failure | What | So what | Prevention | Repair |
|---|---|---|---|---|---|
| I01 | split brain | “Two leaders may both believe they are current.” | “That breaks the authority invariant and can fork metadata truth.” | “Guard leadership by quorum, monotonic term, and fencing.” | “Revoke the stale leader, elect again, and resync followers and watchers from authoritative revisioned state.” |
| I01 | watch gap on reconnect | “A watcher disconnected and its exact missing history may no longer be replayable.” | “The danger is silent state drift, not just a dropped connection.” | “Persist the last applied revision cursor and reject resume below the compaction floor.” | “Reload a fresh snapshot and resume watch from a new known-good revision.” |
| I02 | duplicate claim | “Two actors may think they own the same item.” | “That risks duplicate side effects or conflicting mutation.” | “Guard claim transition, keep one active owner, and fence with lease plus epoch.” | “Expire or reap the stale claim, reject stale epochs downstream, and reconcile any duplicate work.” |
| I02 | stale holder acting after expiry | “An actor whose lease expired is still trying to commit.” | “The real danger is zombie writes after authority has moved on.” | “Check the fencing token or epoch on every downstream commit.” | “Reject the stale commit, reassign ownership, and replay or reconcile partial effects.” |
| I03 | due item materialized twice | “The same logical run was released into runnable state more than once.” | “That turns one scheduled action into duplicate execution pressure.” | “Checkpoint only after durable materialization and key release idempotently by logical run.” | “Deduplicate runnable records and reconcile duplicate attempts.” |
| I03 | due item never materialized | “A scheduled item became due but never entered runnable state.” | “This is a missed-work failure rather than a transient queue delay.” | “Use overdue reconciliation sweeps and monotonic scan checkpoints.” | “Rescan overdue buckets and rebuild runnable state from schedule truth.” |
| I04 | frontier advanced too far | “The system marked progress beyond what was durably covered.” | “That risks silent skip of uncovered work.” | “Advance the frontier only after durable success and checkpoint discipline.” | “Rescan from the last safe checkpoint and rebuild the uncovered set.” |
| I04 | uncovered work skipped | “Some eligible range was neither claimed nor completed.” | “The frontier may look advanced while real work was lost.” | “Maintain coverage invariants and resumable scan discipline.” | “Run anti-entropy scan and replay unfinished partitions.” |
| I05 | offset advanced before effect committed | “Consumer progress moved ahead of the durable sink effect.” | “The system can now skip data on replay.” | “Keep effect-before-offset discipline and use idempotent or transactional sinks.” | “Replay from the last safe offset and reconcile sink ambiguity.” |
| I05 | duplicate consumption | “The same record was processed more than once.” | “That is usually survivable, but only if the consumer or sink is idempotent.” | “Use idempotent consumers, dedup state, and fenced progress ownership.” | “Replay with dedup or downstream reconciliation.” |
| I06 | stale projection | “The read model has fallen behind the source of truth.” | “Users now see old derived state even though source writes succeeded.” | “Use ordered checkpoints, replayable source history, and monotonic apply.” | “Replay from source or rebuild the projection from scratch.” |
| I06 | missing entry / tombstone not propagated | “The derived index missed a write or delete.” | “This creates incorrect search results or ghost records in the read path.” | “Handle tombstones explicitly and run completeness or backfill checks.” | “Reindex the affected range or do a full rebuild.” |
| I07 | stale cache | “The cache no longer reflects the current origin truth.” | “Serving becomes fast but incorrect.” | “Use TTL/versioned purge discipline and origin-version checks on fill.” | “Purge, refresh, or force origin reads until the cache is correct again.” |
| I07 | cache stampede | “Many readers all miss and refill the same entry at once.” | “The cache stops protecting the origin and can amplify overload.” | “Use single-flight fills, request coalescing, and refresh-ahead.” | “Temporarily shed or protect origin load and backfill cache gradually.” |
| I08 | over-admit under race | “Budget or concurrency was overspent under concurrent decisions.” | “The gate failed open and let overload through.” | “Use atomic budget updates, bounded-drift local fast paths, and hard concurrency caps.” | “Shed or defer excess load and reset counters from authoritative truth.” |
| I08 | stale policy apply | “An evaluator is making decisions from an old policy snapshot.” | “Requests are admitted or denied according to the wrong rule set.” | “Version policy snapshots and apply monotonically.” | “Republish the policy, invalidate local state, and reconcile if bad decisions matter.” |
| I09 | duplicate IDs | “Two generators issued the same identifier.” | “This is a correctness failure that downstream systems may not be able to repair cheaply.” | “Use unique worker identity or leased ranges with monotonic epoch and guarded local sequence.” | “Fence the bad generator, rotate worker identity, and repair duplicates only where downstream semantics allow it.” |
| I09 | non-monotonic IDs | “The generator moved backward in ordering.” | “Ordering assumptions above the ID layer may now break.” | “Use monotonic clock discipline or fall back to logical sequence.” | “Bump the epoch and restart that generation lane.” |
| I10 | false death | “A live member was judged dead.” | “The system may reroute or reassign unnecessarily, causing churn and possible duplicate work.” | “Use suspicion, heartbeat grace, and incarnation/version rules before eviction.” | “Rejoin with a higher incarnation and anti-entropy-sync membership state.” |
| I10 | ghost member | “A dead member still appears live in the registry.” | “Traffic or assignments can continue going to a nonexistent target.” | “Sweep expired heartbeats and keep watch/version semantics clear.” | “Purge explicitly or rebuild the registry from active heartbeats.” |
| I11 | out-of-order snapshot apply | “An agent applied an older config after a newer one.” | “This is a stale-control failure: the fleet can regress even though control truth advanced.” | “Require monotonic version checks and ACK/NACK discipline.” | “Refetch a full snapshot and replay from the current version.” |
| I11 | partial rollout | “Only some targets received or applied the new control state.” | “The fleet is now inconsistent by version, which can create latent correctness or availability issues.” | “Track staged rollout status and report applied versions explicitly.” | “Rollback or republish until the lagging targets converge.” |
| I12 | crash after transition but before side effect | “Workflow truth advanced, but the external action may not have happened.” | “The system has an ambiguity gap between internal state and external reality.” | “Use a transactional outbox and idempotency keys.” | “Replay the outbox and reconcile with the provider.” |
| I12 | retry ambiguity against provider | “A previous attempt may or may not have succeeded, and retrying could duplicate it.” | “This is the classic external-effect ambiguity problem.” | “Use idempotent provider APIs or provider-side dedup keys.” | “Poll, reconcile, or apply manual correction if necessary.” |
| I13 | out-of-order apply | “Operations arrived or were applied in the wrong order.” | “Subject state can diverge even though every individual op looks valid.” | “Use per-subject sequencing or causality checks.” | “Replay from the op log and rebuild the subject snapshot.” |
| I13 | divergence | “Different replicas or sessions now disagree on subject state.” | “Users can observe conflicting shared state.” | “Keep authoritative merge or sequence discipline and versioned sync.” | “Resync from snapshot plus missing ops.” |
| I14 | content uploaded but namespace not advanced | “The blob exists, but the published head still points elsewhere.” | “The system has orphaned content or incomplete publication.” | “Publish content first, then move the head.” | “Retry the head advance or garbage-collect the orphaned content.” |
| I14 | namespace CAS race | “Two publishers tried to advance the same head concurrently.” | “Without CAS, one publish could silently clobber another.” | “Use compare-and-swap on the head and immutable manifests.” | “Retry against the current head or materialize the conflict as a new version.” |
| I15 | duplicate placement | “The same runnable work was placed twice.” | “That can create double execution or stale completion races.” | “Guard placement and reserve capacity before dispatch.” | “Cancel the duplicate attempt and fence stale completion paths.” |
| I15 | capacity leak after worker crash | “A slot stayed reserved after the worker died.” | “The fleet appears full before hardware is actually full.” | “Expire leases and ensure completion paths release slots.” | “Reclaim the slot and requeue stranded work.” |
| I15 | preempted or evicted work continues acting | “A task lost authority but still tries to complete or affect state.” | “This is an attempt-scoped stale actor failure.” | “Fence post-preemption side effects with lease/token checks and monotonic attempt version.” | “Cancel the stale attempt, reclaim capacity, and requeue if policy allows.” |
| I16 | stale read after failover | “A read hit a replica that had not caught up to the required version.” | “Clients may observe old state even though the write already succeeded elsewhere.” | “Use leader or quorum read policy and check replica versions.” | “Read repair and anti-entropy replication restore convergence.” |
| I16 | ghost expired key | “A key that should be expired is still visible.” | “The system violates TTL semantics and may serve data past its validity window.” | “Define authoritative TTL semantics and expiry visibility rules.” | “Sweep expiry and propagate tombstones.” |
| I17 | routing to dead backend | “The mediation layer chose a target that should have been out of service.” | “The user sees failure even though healthy backends may still exist.” | “Use health-aware selection, active/passive health, and outlier ejection.” | “Retry or reroute, drain the bad backend, and refresh health state.” |
| I17 | stale policy enforcement | “The gateway or proxy enforced an old policy snapshot.” | “Traffic is being allowed or blocked under the wrong control state.” | “Version policy snapshots and apply config monotonically.” | “Republish config and invalidate stale local state.” |
| I18 | dropped sample / late sample skew | “Samples were lost or arrived too late for the intended evaluation window.” | “Dashboards and alerts are now wrong or incomplete.” | “WAL before ack, bounded lateness policy, and clear clock or ordering discipline.” | “Replay the WAL and backfill missing windows where possible.” |
| I18 | alert flapping | “The alert state is oscillating because the evaluation is too sensitive to noise.” | “Users lose trust in the alerting system and paging quality collapses.” | “Use hysteresis, stable windows, and dedup across alerting tiers.” | “Suppress noisy state and recompute from a recent stable window.” |
| I19 | metadata/data divergence | “Placement or metadata says one thing, but the replicas on disk say another.” | “The namespace and the actual bytes are no longer aligned.” | “Commit placement updates carefully and require writer lease on the mutable path.” | “Scrub and reconcile metadata against real replicas, then rebuild the placement map.” |
| I19 | replica under-count after node loss | “After failures, the object no longer has enough healthy replicas.” | “Durability has degraded, and another fault could now cause real loss.” | “Track replica count continuously and schedule repair promptly.” | “Copy from surviving replicas and rebalance placement.” |
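
If the interviewer pushes on one row, it helps to have a tiny model ready. The sketch below illustrates the I14 rows (publish content first, then move the head with compare-and-swap). `Namespace`, `put_manifest`, and `cas_head` are hypothetical names, and the single-threaded check stands in for what a real store would do atomically.

```python
import hashlib
import json


class HeadConflict(Exception):
    """The head moved since the publisher last read it."""


class Namespace:
    def __init__(self):
        self.blobs = {}   # content-addressed, immutable manifests
        self.head = None  # digest of the currently published manifest

    def put_manifest(self, manifest: dict) -> str:
        data = json.dumps(manifest, sort_keys=True).encode()
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # never overwritten
        return digest

    def cas_head(self, expected, new):
        if new not in self.blobs:
            raise KeyError("publish content before moving the head")
        if self.head != expected:
            raise HeadConflict(self.head)
        self.head = new


ns = Namespace()
base = ns.head  # both publishers read the same (empty) head

a = ns.put_manifest({"files": ["a.txt"]})
b = ns.put_manifest({"files": ["b.txt"]})

ns.cas_head(base, a)          # publisher A wins
try:
    ns.cas_head(base, b)      # publisher B would clobber A without CAS
    conflict = False
except HeadConflict:
    conflict = True

# Repair: B retries against the current head with a merged manifest.
merged = ns.put_manifest({"files": ["a.txt", "b.txt"]})
ns.cas_head(a, merged)
```

Manifest `b` stays in `blobs` as orphaned content, which is the "garbage-collect the orphaned content" half of the repair column.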
Reusable Failure Families #
When you do not want to speak in archetype numbers, use the family wording instead.
stale actor #
- what: “An actor that used to be valid is still trying to commit.”
- so what: “If not fenced, stale execution can corrupt current truth.”
- prevention: “Epoch, lease, token, or attempt checks on commit.”
- repair: “Reject stale commit, reclaim authority, and replay safely.”
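
A minimal sketch of the token check on commit, assuming a hypothetical `LeaseService` that hands out monotonically increasing tokens and a `FencedStore` that remembers the highest token it has seen:

```python
class StaleToken(Exception):
    """The commit carries a token older than one already accepted."""


class LeaseService:
    def __init__(self):
        self.epoch = 0

    def grant(self) -> int:
        # Each new grant supersedes every earlier one.
        self.epoch += 1
        return self.epoch


class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def commit(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            raise StaleToken(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


leases = LeaseService()
store = FencedStore()

token_a = leases.grant()
store.commit(token_a, "job", "started by A")

# A pauses (GC, partition); its lease expires and B takes over.
token_b = leases.grant()
store.commit(token_b, "job", "taken over by B")

try:
    store.commit(token_a, "job", "finished by A")  # zombie write
    zombie_rejected = False
except StaleToken:
    zombie_rejected = True
```

Actor A is never stopped here. It simply cannot commit, which is the distinction the prevention line is making.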
lost or ambiguous progress #
- what: “The system no longer knows whether work was durably completed.”
- so what: “You either skip work or repeat it unsafely.”
- prevention: “Advance progress only after durable evidence.”
- repair: “Replay from the last safe point and reconcile ambiguity.”
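
One way to make "repeat it unsafely" concrete is the lost-response case against an external provider. The sketch assumes a hypothetical `Provider` that deduplicates on an idempotency key; the key format and the `flaky_call` helper are illustrative only.

```python
class Provider:
    """Hypothetical external provider that dedups on idempotency key."""

    def __init__(self):
        self.receipts = {}  # idempotency key -> receipt
        self.charges = []   # the real side effects

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self.receipts:
            # Same logical attempt: return the original outcome.
            return self.receipts[idempotency_key]
        self.charges.append((idempotency_key, amount))
        receipt = f"receipt-{len(self.charges)}"
        self.receipts[idempotency_key] = receipt
        return receipt


def flaky_call(provider: Provider, key: str, amount: int) -> str:
    """Simulate a call whose response is lost after the provider commits."""
    provider.charge(key, amount)
    raise TimeoutError("response lost")


provider = Provider()
key = "order-42"  # hypothetical key, stable across retries of one action

try:
    receipt = flaky_call(provider, key, 500)
except TimeoutError:
    # Ambiguous outcome: retry with the SAME key instead of guessing.
    receipt = provider.charge(key, 500)
```

The retry resolves the ambiguity without creating a second charge, because the key identifies the logical action, not the attempt.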
drifted derived state #
- what: “A projection, cache, index, or replica is no longer aligned with its source.”
- so what: “Fast reads are now wrong.”
- prevention: “Monotonic apply, replayable source history, explicit tombstones or version checks.”
- repair: “Rebuild or replay from source truth.”
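
The repair line can be sketched as a full rebuild from an ordered source log. This is an illustration under assumed names (`TOMBSTONE`, `rebuild_projection`) and an assumed log shape of `(sequence, key, value)` tuples.

```python
TOMBSTONE = object()  # explicit delete marker in the source log


def rebuild_projection(log):
    """Rebuild a derived view from the ordered source log."""
    view = {}
    applied = 0
    for seq, key, value in log:
        if seq <= applied:
            continue  # monotonic apply: skip duplicates and older entries
        if value is TOMBSTONE:
            view.pop(key, None)
        else:
            view[key] = value
        applied = seq
    return view, applied


source_log = [
    (1, "a", "x"),
    (2, "b", "y"),
    (2, "b", "y"),        # redelivered entry
    (3, "a", TOMBSTONE),  # delete that the old projection missed
]

drifted = {"a": "x", "b": "y"}  # ghost record "a" still visible
rebuilt, checkpoint = rebuild_projection(source_log)
```

Because the view is derived, discarding `drifted` and keeping `rebuilt` loses nothing; the log remains the source of truth.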
ownership split #
- what: “Two actors may believe they currently own the same work or resource.”
- so what: “This risks duplicate side effects or split authority.”
- prevention: “Guarded claim, lease, fencing token, monotonic epoch.”
- repair: “Expire or fence the loser, reconcile duplicates.”
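
A guarded claim with lease and epoch might look like the sketch below. `WorkItem` and `try_claim` are hypothetical, time is passed in explicitly to keep the example deterministic, and a real system would run the check and update as one atomic operation in the store.

```python
class WorkItem:
    def __init__(self):
        self.owner = None
        self.epoch = 0
        self.lease_expires_at = 0.0


def try_claim(item: WorkItem, actor: str, now: float, ttl: float = 10.0):
    """Return the new epoch if the claim succeeds, else None."""
    if item.owner is not None and now < item.lease_expires_at:
        return None  # an unexpired lease exists: one active owner only
    item.owner = actor
    item.epoch += 1  # monotonic epoch doubles as the fencing token
    item.lease_expires_at = now + ttl
    return item.epoch


item = WorkItem()
epoch_a = try_claim(item, "A", now=0.0)        # A owns until t=10
epoch_b_early = try_claim(item, "B", now=5.0)  # rejected, lease still live
epoch_b = try_claim(item, "B", now=11.0)       # lease expired, B reclaims
```

After the reclaim, anything A tries to commit with `epoch_a` is older than `item.epoch` and can be rejected downstream.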
control-state regression #
- what: “An older config or snapshot was applied after a newer one.”
- so what: “The fleet can regress even though the control plane advanced.”
- prevention: “Versioned snapshots and monotonic apply.”
- repair: “Reload fresh snapshot and republish or rollback deliberately.”
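
Monotonic apply is small enough to sketch in full. The `Agent` class and the `"ACK"`/`"NACK"` return strings are assumptions made for illustration.

```python
class Agent:
    def __init__(self):
        self.version = 0
        self.config = {}

    def apply(self, version: int, config: dict) -> str:
        """Return "ACK" if applied, "NACK" if the snapshot is not newer."""
        if version <= self.version:
            return "NACK"  # older or duplicate snapshot: refuse to regress
        self.version = version
        self.config = dict(config)
        return "ACK"


agent = Agent()
first = agent.apply(2, {"limit": 200})   # newer snapshot arrives first
second = agent.apply(1, {"limit": 100})  # delayed older snapshot
```

The NACK also gives the control plane an explicit signal that a stale snapshot is still in flight.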
Minimal Interview Script #
For any failure mode, you can usually say:
- what: “The failure here is ....”
- so what: “That matters because ....”
- prevention: “Normally we prevent it by ....”
- repair: “If it still happens, we repair by ....”
- close: “So the system stays safe because ....”
Example:
- “The failure here is offset advancing before the sink effect is durably committed.”
- “That matters because replay can now skip work rather than merely repeat it.”
- “Normally we prevent it with effect-before-offset discipline and an idempotent sink.”
- “If it still happens, we replay from the last safe offset and reconcile sink ambiguity.”
- “So the failure is survivable as long as offsets are not treated as truth before the effect is durable.”
That is the whole point of this note:
- make the failure concrete
- tie it to the violated invariant
- state both the prevention mechanism and the repair loop
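
The offset example from the script can be backed by a small sketch of effect-before-offset discipline. It assumes a hypothetical `IdempotentSink` that upserts by record id and a `consume` loop where the committed value is the next offset to read; the crash is simulated by returning before the offset advances.

```python
class IdempotentSink:
    def __init__(self):
        self.rows = {}
        self.writes = 0

    def write(self, record_id: str, payload: str) -> None:
        self.rows[record_id] = payload  # upsert keyed by record id
        self.writes += 1


def consume(records, sink, committed, crash_before_commit_at=None):
    """Process records from `committed` onward; return the new offset."""
    for offset in range(committed, len(records)):
        record_id, payload = records[offset]
        sink.write(record_id, payload)    # 1. durable effect first
        if offset == crash_before_commit_at:
            return committed              # simulated crash, offset unchanged
        committed = offset + 1            # 2. only then advance progress
    return committed


records = [("r1", "a"), ("r2", "b"), ("r3", "c")]
sink = IdempotentSink()

after_crash = consume(records, sink, 0, crash_before_commit_at=1)
final = consume(records, sink, after_crash)  # replay from last safe offset
```

Record `r2` is written twice, but the sink ends with exactly three rows: the failure degrades into a harmless repeat instead of a skip.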