Skip to main content
  1. System Design Components/

I15 Failure Workbook #

Baseline:

Bug scenarios:

Main insights #

1. Placement consumes capacity, not just metadata #

  • bug: the same runnable unit is placed on more than one node
  • catches: AtMostOnePlacement
  • lesson: scheduling has to mutate authoritative capacity truth, not just annotate the task

2. Completion must be fenced by the current attempt epoch #

  • bug: terminal completion is accepted from a stale generation
  • catches: CompletedUsesCurrentEpoch
  • lesson: execution fleets need the same epoch discipline as lease and claim systems

3. Terminal transitions must release worker capacity #

  • bug: the task becomes terminal, but the node slot stays consumed
  • catches: NoBusyAfterTerminal
  • lesson: completion is both a task-state transition and a capacity-accounting transition

4. Reconciliation is part of correctness, not just cleanup #

  • bug: an expired assigned/running attempt remains stranded forever
  • catches: NoExpiredAssignedOrRunning
  • lesson: in a tiny bounded model this appears as a safety-style violation, but it stands in for the real liveness requirement that the fleet eventually reclaims dead attempts