I15 Failure Workbook #
Baseline:
Bug scenarios:
- I15DuplicatePlacementBug.tla
- I15StaleCompletionBug.tla
- I15LeakedSlotBug.tla
- I15ExpiredAttemptNotReclaimedBug.tla
Main insights #
1. Placement consumes capacity, not just metadata #
- bug: the same runnable unit is placed on more than one node
- catches:
AtMostOnePlacement - lesson: scheduling has to mutate authoritative capacity truth, not just annotate the task
2. Completion must be fenced by the current attempt epoch #
- bug: terminal completion is accepted from a stale generation
- catches:
CompletedUsesCurrentEpoch - lesson: execution fleets need the same epoch discipline as lease and claim systems
3. Terminal transitions must release worker capacity #
- bug: the task becomes terminal, but the node slot stays consumed
- catches:
NoBusyAfterTerminal - lesson: completion is both a task-state transition and a capacity-accounting transition
4. Reconciliation is part of correctness, not just cleanup #
- bug: an expired assigned/running attempt remains stranded forever
- catches:
NoExpiredAssignedOrRunning - lesson: in a tiny bounded model this appears as a safety-style violation, but it stands in for the real liveness requirement that the fleet eventually reclaims dead attempts