Failure Mode Interview Verbiage From Taxonomy #

Use this as a speaking companion to the failure-mode taxonomy.

This note is not a new failure framework. It is just the interview language for the failure modes that already exist in the taxonomy.

The speaking shape is:

  • what
    • what broke?
  • so what
    • what property is now at risk?
  • prevention
    • what normally stops this?
  • repair
    • what restores the invariant if it still happens?
  • close
    • why is this failure survivable if the design is correct?

This note is deliberately grounded in the archetype mechanics:

  • guarded transitions
  • durable checkpoints
  • replayable logs
  • fencing tokens
  • reconciliation loops
  • rebuildable derived state

Core Verbiage #

These are the reusable sentence stems.

what #

  • “The failure here is not total system collapse; it is the system losing track of which transition is actually authoritative.”
  • “This is a stale-actor failure: an actor that used to be valid is still trying to commit.”
  • “This is a progress-accounting failure: the system advanced some marker before the underlying effect was durably true.”
  • “This is a drift failure: a derived or cached view no longer matches its source of truth.”
  • “This is an ownership failure: more than one actor may believe it owns the same work or resource.”

so what #

  • “The risk is not just temporary unavailability; it is violating the core invariant of the archetype.”
  • “If we get this wrong, we can skip work, apply work twice, or let stale state leak into serving.”
  • “The user-visible symptom may be latency or availability, but the deeper issue is loss of authoritative sequencing.”
  • “This is dangerous because the system can look healthy while its durable truth has drifted.”

prevention #

  • “The normal prevention move is to make the transition guarded against authoritative truth.”
  • “The prevention move is fencing: a stale actor may continue running, but it must not be allowed to commit.”
  • “The prevention move is to advance progress only after durable evidence exists.”
  • “The prevention move is to keep enough ordered history so replay or rebuild remains possible.”
  • “The prevention move is to version snapshots and reject out-of-order apply.”

repair #

  • “The repair move is to replay from the last safe point.”
  • “The repair move is to expire, reclaim, or fence the stale actor and requeue the work.”
  • “The repair move is to rebuild the derived view from the source of truth.”
  • “The repair move is to reconcile ambiguity against the external or durable system of record.”
  • “The repair move is to resync from a fresh snapshot and continue from a known-good revision.”

close #

  • “So this failure is survivable if the design keeps the right truth durable and replayable.”
  • “The main goal is not to pretend the failure cannot happen; it is to make the failure detectable, bounded, and repairable.”
  • “This is acceptable if the system degrades into replay or reconciliation rather than silent corruption.”

Per-Archetype Verbiage #

Use one row as the default interview wording, then expand only if the interviewer pushes.

| Archetype | Failure | What | So what | Prevention | Repair |
| --- | --- | --- | --- | --- | --- |
| I01 | split brain | “Two leaders may both believe they are current.” | “That breaks the authority invariant and can fork metadata truth.” | “Guard leadership by quorum, monotonic term, and fencing.” | “Revoke the stale leader, elect again, and resync followers and watchers from authoritative revisioned state.” |
| I01 | watch gap on reconnect | “A watcher disconnected and its exact missing history may no longer be replayable.” | “The danger is silent state drift, not just a dropped connection.” | “Persist the last applied revision cursor and reject resume below the compaction floor.” | “Reload a fresh snapshot and resume watch from a new known-good revision.” |
| I02 | duplicate claim | “Two actors may think they own the same item.” | “That risks duplicate side effects or conflicting mutation.” | “Guard claim transition, keep one active owner, and fence with lease plus epoch.” | “Expire or reap the stale claim, reject stale epochs downstream, and reconcile any duplicate work.” |
| I02 | stale holder acting after expiry | “An actor whose lease expired is still trying to commit.” | “The real danger is zombie writes after authority has moved on.” | “Check the fencing token or epoch on every downstream commit.” | “Reject the stale commit, reassign ownership, and replay or reconcile partial effects.” |
| I03 | due item materialized twice | “The same logical run was released into runnable state more than once.” | “That turns one scheduled action into duplicate execution pressure.” | “Checkpoint only after durable materialization and key release idempotently by logical run.” | “Deduplicate runnable records and reconcile duplicate attempts.” |
| I03 | due item never materialized | “A scheduled item became due but never entered runnable state.” | “This is a missed-work failure rather than a transient queue delay.” | “Use overdue reconciliation sweeps and monotonic scan checkpoints.” | “Rescan overdue buckets and rebuild runnable state from schedule truth.” |
| I04 | frontier advanced too far | “The system marked progress beyond what was durably covered.” | “That risks silent skip of uncovered work.” | “Advance the frontier only after durable success and checkpoint discipline.” | “Rescan from the last safe checkpoint and rebuild the uncovered set.” |
| I04 | uncovered work skipped | “Some eligible range was neither claimed nor completed.” | “The frontier may look advanced while real work was lost.” | “Maintain coverage invariants and resumable scan discipline.” | “Run anti-entropy scan and replay unfinished partitions.” |
| I05 | offset advanced before effect committed | “Consumer progress moved ahead of the durable sink effect.” | “The system can now skip data on replay.” | “Keep effect-before-offset discipline and use idempotent or transactional sinks.” | “Replay from the last safe offset and reconcile sink ambiguity.” |
| I05 | duplicate consumption | “The same record was processed more than once.” | “That is usually survivable, but only if the consumer or sink is idempotent.” | “Use idempotent consumers, dedup state, and fenced progress ownership.” | “Replay with dedup or downstream reconciliation.” |
| I06 | stale projection | “The read model has fallen behind the source of truth.” | “Users now see old derived state even though source writes succeeded.” | “Use ordered checkpoints, replayable source history, and monotonic apply.” | “Replay from source or rebuild the projection from scratch.” |
| I06 | missing entry / tombstone not propagated | “The derived index missed a write or delete.” | “This creates incorrect search results or ghost records in the read path.” | “Handle tombstones explicitly and run completeness or backfill checks.” | “Reindex the affected range or do a full rebuild.” |
| I07 | stale cache | “The cache no longer reflects the current origin truth.” | “Serving becomes fast but incorrect.” | “Use TTL/versioned purge discipline and origin-version checks on fill.” | “Purge, refresh, or force origin reads until the cache is correct again.” |
| I07 | cache stampede | “Many readers all miss and refill the same entry at once.” | “The cache stops protecting the origin and can amplify overload.” | “Use single-flight fills, request coalescing, and refresh-ahead.” | “Temporarily shed or protect origin load and backfill cache gradually.” |
| I08 | over-admit under race | “Budget or concurrency was overspent under concurrent decisions.” | “The gate failed open and let overload through.” | “Use atomic budget updates, bounded-drift local fast paths, and hard concurrency caps.” | “Shed or defer excess load and reset counters from authoritative truth.” |
| I08 | stale policy apply | “An evaluator is making decisions from an old policy snapshot.” | “Requests are admitted or denied according to the wrong rule set.” | “Version policy snapshots and apply monotonically.” | “Republish the policy, invalidate local state, and reconcile if bad decisions matter.” |
| I09 | duplicate IDs | “Two generators issued the same identifier.” | “This is a correctness failure that downstream systems may not be able to repair cheaply.” | “Use unique worker identity or leased ranges with monotonic epoch and guarded local sequence.” | “Fence the bad generator, rotate worker identity, and repair duplicates only where downstream semantics allow it.” |
| I09 | non-monotonic IDs | “The generator moved backward in ordering.” | “Ordering assumptions above the ID layer may now break.” | “Use monotonic clock discipline or fall back to logical sequence.” | “Bump the epoch and restart that generation lane.” |
| I10 | false death | “A live member was judged dead.” | “The system may reroute or reassign unnecessarily, causing churn and possible duplicate work.” | “Use suspicion, heartbeat grace, and incarnation/version rules before eviction.” | “Rejoin with a higher incarnation and anti-entropy-sync membership state.” |
| I10 | ghost member | “A dead member still appears live in the registry.” | “Traffic or assignments can continue going to a nonexistent target.” | “Sweep expired heartbeats and keep watch/version semantics clear.” | “Purge explicitly or rebuild the registry from active heartbeats.” |
| I11 | out-of-order snapshot apply | “An agent applied an older config after a newer one.” | “This is a stale-control failure: the fleet can regress even though control truth advanced.” | “Require monotonic version checks and ACK/NACK discipline.” | “Refetch a full snapshot and replay from the current version.” |
| I11 | partial rollout | “Only some targets received or applied the new control state.” | “The fleet is now inconsistent by version, which can create latent correctness or availability issues.” | “Track staged rollout status and report applied versions explicitly.” | “Rollback or republish until the lagging targets converge.” |
| I12 | crash after transition but before side effect | “Workflow truth advanced, but the external action may not have happened.” | “The system has an ambiguity gap between internal state and external reality.” | “Use a transactional outbox and idempotency keys.” | “Replay the outbox and reconcile with the provider.” |
| I12 | retry ambiguity against provider | “A previous attempt may or may not have succeeded, and retrying could duplicate it.” | “This is the classic external-effect ambiguity problem.” | “Use idempotent provider APIs or provider-side dedup keys.” | “Poll, reconcile, or apply manual correction if necessary.” |
| I13 | out-of-order apply | “Operations arrived or were applied in the wrong order.” | “Subject state can diverge even though every individual op looks valid.” | “Use per-subject sequencing or causality checks.” | “Replay from the op log and rebuild the subject snapshot.” |
| I13 | divergence | “Different replicas or sessions now disagree on subject state.” | “Users can observe conflicting shared state.” | “Keep authoritative merge or sequence discipline and versioned sync.” | “Resync from snapshot plus missing ops.” |
| I14 | content uploaded but namespace not advanced | “The blob exists, but the published head still points elsewhere.” | “The system has orphaned content or incomplete publication.” | “Publish content first, then move the head.” | “Retry the head advance or garbage-collect the orphaned content.” |
| I14 | namespace CAS race | “Two publishers tried to advance the same head concurrently.” | “Without CAS, one publish could silently clobber another.” | “Use compare-and-swap on the head and immutable manifests.” | “Retry against the current head or materialize the conflict as a new version.” |
| I15 | duplicate placement | “The same runnable work was placed twice.” | “That can create double execution or stale completion races.” | “Guard placement and reserve capacity before dispatch.” | “Cancel the duplicate attempt and fence stale completion paths.” |
| I15 | capacity leak after worker crash | “A slot stayed reserved after the worker died.” | “The fleet appears full before hardware is actually full.” | “Expire leases and ensure completion paths release slots.” | “Reclaim the slot and requeue stranded work.” |
| I15 | preempted or evicted work continues acting | “A task lost authority but still tries to complete or affect state.” | “This is an attempt-scoped stale actor failure.” | “Fence post-preemption side effects with lease/token checks and monotonic attempt version.” | “Cancel the stale attempt, reclaim capacity, and requeue if policy allows.” |
| I16 | stale read after failover | “A read hit a replica that had not caught up to the required version.” | “Clients may observe old state even though the write already succeeded elsewhere.” | “Use leader or quorum read policy and check replica versions.” | “Read repair and anti-entropy replication restore convergence.” |
| I16 | ghost expired key | “A key that should be expired is still visible.” | “The system violates TTL semantics and may serve data past its validity window.” | “Define authoritative TTL semantics and expiry visibility rules.” | “Sweep expiry and propagate tombstones.” |
| I17 | routing to dead backend | “The mediation layer chose a target that should have been out of service.” | “The user sees failure even though healthy backends may still exist.” | “Use health-aware selection, active/passive health, and outlier ejection.” | “Retry or reroute, drain the bad backend, and refresh health state.” |
| I17 | stale policy enforcement | “The gateway or proxy enforced an old policy snapshot.” | “Traffic is being allowed or blocked under the wrong control state.” | “Version policy snapshots and apply config monotonically.” | “Republish config and invalidate stale local state.” |
| I18 | dropped sample / late sample skew | “Samples were lost or arrived too late for the intended evaluation window.” | “Dashboards and alerts are now wrong or incomplete.” | “WAL before ack, bounded lateness policy, and clear clock or ordering discipline.” | “Replay the WAL and backfill missing windows where possible.” |
| I18 | alert flapping | “The alert state is oscillating because the evaluation is too sensitive to noise.” | “Users lose trust in the alerting system and paging quality collapses.” | “Use hysteresis, stable windows, and dedup across alerting tiers.” | “Suppress noisy state and recompute from a recent stable window.” |
| I19 | metadata/data divergence | “Placement or metadata says one thing, but the replicas on disk say another.” | “The namespace and the actual bytes are no longer aligned.” | “Commit placement updates carefully and require writer lease on the mutable path.” | “Scrub and reconcile metadata against real replicas, then rebuild the placement map.” |
| I19 | replica under-count after node loss | “After failures, the object no longer has enough healthy replicas.” | “Durability has degraded, and another fault could now cause real loss.” | “Track replica count continuously and schedule repair promptly.” | “Copy from surviving replicas and rebalance placement.” |
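
If the interviewer pushes on one row, it helps to have a tiny mechanism in mind. As an illustration of the I14 namespace CAS race row, here is a minimal in-memory sketch of compare-and-swap on a published head. The `Namespace` class, its method names, and the manifest strings are hypothetical, and the lock only stands in for whatever atomic primitive a real metadata store would provide.

```python
import threading
from typing import Optional


class Namespace:
    """Hypothetical in-memory namespace mapping a name to its current head."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._heads: dict[str, str] = {}

    def head(self, name: str) -> Optional[str]:
        with self._lock:
            return self._heads.get(name)

    def compare_and_swap(self, name: str, expected: Optional[str], new: str) -> bool:
        # Advance the head only if it still equals what the publisher last saw.
        with self._lock:
            if self._heads.get(name) != expected:
                return False
            self._heads[name] = new
            return True


ns = Namespace()
seen = ns.head("release")  # both publishers read the same (empty) head

first = ns.compare_and_swap("release", seen, "manifest-a")   # wins the race
second = ns.compare_and_swap("release", seen, "manifest-b")  # loses, nothing clobbered

# Repair: the loser retries against the current head.
retried = ns.compare_and_swap("release", ns.head("release"), "manifest-b")
```

The losing publisher gets an explicit `False` instead of silently overwriting, which is what makes the race detectable and the retry safe.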

Reusable Failure Families #

When you do not want to speak in archetype numbers, use the family wording instead.

stale actor #

  • what: “An actor that used to be valid is still trying to commit.”
  • so what: “If not fenced, stale execution can corrupt current truth.”
  • prevention: “Epoch, lease, token, or attempt checks on commit.”
  • repair: “Reject stale commit, reclaim authority, and replay safely.”
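
A minimal sketch of the commit-time check described above, assuming a single store that remembers the highest fencing token it has seen. `FencedStore`, `StaleTokenError`, and the worker names are hypothetical and exist only for illustration.

```python
class StaleTokenError(Exception):
    """Raised when a commit carries a fencing token older than the newest seen."""


class FencedStore:
    """Hypothetical store that accepts commits only from the newest token holder."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: dict[str, str] = {}

    def commit(self, token: int, key: str, value: str) -> None:
        # The stale actor may keep running, but its commit is rejected here.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.commit(token=1, key="job-7", value="started by worker A")
store.commit(token=2, key="job-7", value="taken over by worker B")

try:
    store.commit(token=1, key="job-7", value="late write from worker A")
    stale_rejected = False
except StaleTokenError:
    stale_rejected = True
```

The check lives in the store rather than in the worker, because a stale worker cannot be trusted to know it is stale.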

lost or ambiguous progress #

  • what: “The system no longer knows whether work was durably completed.”
  • so what: “You either skip work or repeat it unsafely.”
  • prevention: “Advance progress only after durable evidence.”
  • repair: “Replay from the last safe point and reconcile ambiguity.”
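
The prevention and repair lines above can be sketched as a consumer loop that writes to the sink before it advances its offset. This is a toy model under stated assumptions: `IdempotentSink`, `consume`, and the `crash_at` parameter (which simulates a crash between the effect and the offset save) are all hypothetical.

```python
from typing import Optional


class IdempotentSink:
    """Hypothetical sink keyed by record offset, so re-applying is harmless."""

    def __init__(self) -> None:
        self.rows: dict[int, str] = {}

    def write(self, offset: int, payload: str) -> None:
        self.rows[offset] = payload  # same offset overwrites, never duplicates


def consume(log: list[str], sink: IdempotentSink, committed: int,
            crash_at: Optional[int] = None) -> int:
    """Process log[committed:] and return the new committed offset.

    The offset advances only after the sink write returns, so a crash
    between the two steps causes a repeat on replay, never a skip.
    """
    for offset in range(committed, len(log)):
        sink.write(offset, log[offset])
        if crash_at == offset:
            return committed  # simulated crash: effect done, offset not saved
        committed = offset + 1
    return committed


log = ["a", "b", "c"]
sink = IdempotentSink()

committed = consume(log, sink, committed=0, crash_at=1)  # crashes after writing "b"
after_crash = committed                                   # still 1, not 2

committed = consume(log, sink, committed=committed)       # replay from the safe point
```

Record "b" is written twice, but because the sink is keyed by offset the replay is a no-op rather than a duplicate.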

drifted derived state #

  • what: “A projection, cache, index, or replica is no longer aligned with its source.”
  • so what: “Fast reads are now wrong.”
  • prevention: “Monotonic apply, replayable source history, explicit tombstones or version checks.”
  • repair: “Rebuild or replay from source truth.”
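
As a sketch of what "monotonic apply, explicit tombstones, rebuild from source" could look like, the toy projection below replays an ordered change log. `Change`, `Projection`, and `rebuild` are hypothetical names, and a `None` value is assumed to mean a delete. The sketch does not detect gaps in the sequence, which a real projection would also need to handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    seq: int
    key: str
    value: Optional[str]  # None is an explicit tombstone (delete)


class Projection:
    """Hypothetical derived view built by applying changes in sequence order."""

    def __init__(self) -> None:
        self.applied_seq = 0
        self.view: dict[str, str] = {}

    def apply(self, change: Change) -> bool:
        # Monotonic apply: ignore anything at or below the applied sequence.
        if change.seq <= self.applied_seq:
            return False
        if change.value is None:
            self.view.pop(change.key, None)
        else:
            self.view[change.key] = change.value
        self.applied_seq = change.seq
        return True


def rebuild(source_log: list[Change]) -> Projection:
    """Repair move: discard the drifted view and replay the source history."""
    fresh = Projection()
    for change in sorted(source_log, key=lambda c: c.seq):
        fresh.apply(change)
    return fresh


source_log = [Change(1, "a", "x"), Change(2, "b", "y"), Change(3, "a", None)]

drifted = Projection()
drifted.apply(source_log[0])
drifted.apply(source_log[1])                 # the tombstone for "a" never arrived
duplicate_applied = drifted.apply(source_log[0])

repaired = rebuild(source_log)
```

The drifted view still serves the ghost record for "a"; the rebuilt one does not, because the tombstone is part of the replayed history.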

ownership split #

  • what: “Two actors may believe they currently own the same work or resource.”
  • so what: “This risks duplicate side effects or split authority.”
  • prevention: “Guarded claim, lease, fencing token, monotonic epoch.”
  • repair: “Expire or fence the loser, reconcile duplicates.”
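
A minimal sketch of a guarded claim with a lease and a monotonic epoch, assuming a single in-memory table and a clock passed in explicitly as `now`. `ClaimTable`, `Lease`, and the owner names are hypothetical; a real system would need the claim step to be atomic in a shared store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: str
    epoch: int
    expires_at: float


class ClaimTable:
    """Hypothetical claim table: at most one unexpired lease per item."""

    def __init__(self) -> None:
        self.leases: dict[str, Lease] = {}

    def claim(self, item: str, owner: str, now: float, ttl: float) -> Optional[int]:
        current = self.leases.get(item)
        if current is not None and current.expires_at > now:
            return None  # guarded transition: a live owner already exists
        # Each successful claim bumps the epoch, so older holders can be fenced.
        epoch = current.epoch + 1 if current is not None else 1
        self.leases[item] = Lease(owner, epoch, now + ttl)
        return epoch

    def is_current(self, item: str, epoch: int) -> bool:
        lease = self.leases.get(item)
        return lease is not None and lease.epoch == epoch


table = ClaimTable()
epoch_a = table.claim("item-1", "A", now=0.0, ttl=10.0)   # A owns the item
blocked = table.claim("item-1", "B", now=5.0, ttl=10.0)   # lease still live
epoch_b = table.claim("item-1", "B", now=11.0, ttl=10.0)  # lease expired, B takes over
```

After the takeover, anything downstream can call `is_current` with A's old epoch and reject it, which is the "fence the loser" half of the repair.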

control-state regression #

  • what: “An older config or snapshot was applied after a newer one.”
  • so what: “The fleet can regress even though the control plane advanced.”
  • prevention: “Versioned snapshots and monotonic apply.”
  • repair: “Reload a fresh snapshot, then republish or roll back deliberately.”
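
One way to picture versioned snapshots with monotonic apply is an agent that refuses anything older than what it already holds. The `Agent` class, the `"ACK"`/`"NACK"` return strings, and the config keys are assumptions made for this sketch, not a description of any particular control plane.

```python
class Agent:
    """Hypothetical agent that applies config snapshots in version order only."""

    def __init__(self) -> None:
        self.version = 0
        self.config: dict[str, str] = {}

    def apply_snapshot(self, version: int, config: dict[str, str]) -> str:
        if version < self.version:
            return "NACK"  # older than current state: refuse to regress
        if version == self.version:
            return "ACK"   # redelivery of the current version: no-op
        self.config = dict(config)
        self.version = version
        return "ACK"


agent = Agent()
newer = agent.apply_snapshot(2, {"limit": "100"})
older = agent.apply_snapshot(1, {"limit": "50"})   # arrives late, out of order
```

The explicit NACK matters as much as the rejection itself, because it lets the control plane see which targets are being sent stale snapshots.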

Minimal Interview Script #

For any failure mode, you can usually say:

  1. what
    • “The failure here is ....”
  2. so what
    • “That matters because ....”
  3. prevention
    • “Normally we prevent it by ....”
  4. repair
    • “If it still happens, we repair by ....”
  5. close
    • “So the system stays safe because ....”

Example:

  • “The failure here is offset advancing before the sink effect is durably committed.”
  • “That matters because replay can now skip work rather than merely repeat it.”
  • “Normally we prevent it with effect-before-offset discipline and an idempotent sink.”
  • “If it still happens, we replay from the last safe offset and reconcile sink ambiguity.”
  • “So the failure is survivable as long as offsets are not treated as truth before the effect is durable.”

That is the whole point of this note:

  • make the failure concrete
  • tie it to the violated invariant
  • state both the prevention mechanism and the repair loop