
Your LLM App Has a Mutation Problem


There is a failure mode that shows up in almost every LLM-powered application that accumulates state. It does not announce itself. It creeps in as a UX quirk, then a data integrity issue, then a fundamental trust problem.

The failure is this: freeform chat behaves like a mutation protocol.

The user thinks out loud. The AI helpfully extracts, summarizes, decides. The canonical artifact — the document, the record, the knowledge graph — absorbs half-formed thoughts as if they were decisions. The user comes back the next session and finds their brainstorm committed as fact.

If you have built an LLM app where users build something over time — a document, a codebase, a research brief, a knowledge base — you have either hit this problem or you will.

The Standard Responses Don’t Work #

The usual fixes are:

Better prompting. Tell the model to only extract when asked. This works until a slightly different phrasing of the user’s request triggers extraction anyway. Prompt brittleness is not a solution to an architectural problem.

Confirmation dialogs. Ask the user to approve changes before committing. This moves the problem from silent mutation to annoying interruption. Users click through confirmations. The underlying issue — that the system doesn’t know the difference between exploration and decision — is unresolved.

Separate modes. Have a “chat” mode and an “edit” mode. This helps, but creates friction at exactly the moment when the user is most engaged: the transition from brainstorming to deciding. The mode boundary is arbitrary from the user’s perspective.

None of these address the root cause.

The Root Cause #

The root cause is that the LLM has been given the wrong role in the system.

In most LLM app architectures, the model is implicitly the control plane — it decides what happens next, what state changes, what gets written. The app is the thin wrapper that executes whatever the model outputs.

This works for single-turn tasks. It breaks badly for multi-turn, stateful, creative, or analytical workflows where:

  • the user’s intent evolves across turns
  • not all output is meant to be committed
  • the cost of a wrong state change is high
  • the user needs to feel safe thinking out loud

The model cannot reliably distinguish exploration from decision. That distinction requires understanding the user’s current intent, the session’s history, and the application’s commit policy. The model has partial access to the first, unreliable access to the second, and no access to the third.

The fix is architectural, not prompt-based: the application must own state transitions, and the model must become an effectful helper.

What Game Engines Figured Out in 1980 #

Interactive fiction and narrative game engines faced this exact problem decades before LLMs existed.

The problem statement is identical: how do you let a user (player) have freeform expression while the world (canonical state) only changes through deliberate, authored transitions?

The game engine solution was never “let the player mutate world state directly.” It was always:

Author-defined state machines. Player-triggered transitions. World state owned by the engine.

In Inform 7, every player action passes through a check / carry out / report lifecycle. The action is validated against current world state before it executes. Invalid actions are rejected with a message, not silently corrupted.

In Ink (the narrative scripting language behind 80 Days and Heaven’s Vault), the author defines knots — named story states — and diverts — explicit transitions between them. The player cannot jump to an arbitrary knot. They can only follow authored paths.

The Drama Manager in Michael Mateas and Andrew Stern’s Façade (2005) went furthest. It introduced a layer that watched what was happening in the scene, evaluated which beats — discrete authored dramatic units — were eligible given current state, and selected the highest-priority eligible beat to fire. The player’s freeform input was classified and routed. The Drama Manager decided what happened next. The player could not bypass it.

These systems share a common insight: the author and the player are different roles with different privileges. The author defines what is possible. The player navigates within that possibility space.

For LLM apps, the mapping is:

Game Engine       LLM App
Author            You (the app developer)
Player            The user
World state       Canonical artifact
NPC dialogue      LLM output
Drama Manager     Application runtime

The LLM is an NPC. It has no author privileges.

The Deliberative Agent Loop #

Working from these principles, a coherent architecture emerges. The core idea is that knowledge work with an AI assistant involves two distinct activities that must be kept structurally separate:

  1. Deliberation — thinking out loud, exploring options, brainstorming, refining
  2. Commitment — deciding that something is true, correct, or complete enough to persist

Most LLM apps conflate these. The Deliberative Agent Loop separates them architecturally.

Three Memory Tiers #

The first structural element is explicit memory separation:

Interaction State    — what is happening right now in this turn
Working Memory       — what this session cares about (transient)
Canonical State      — what has been committed and persists

These are not implementation details. They are load-bearing architectural boundaries.

Interaction state is ephemeral — the current topic, current candidate, current mode. It resets or evolves with each turn.

Working memory is session-scoped — rejected ideas, open questions, pinned decisions, thread continuity. It supports the current deliberation without polluting canonical state.

Canonical state is the artifact the user is building. It changes only through explicit, typed mutations routed through a mutation gateway. Never through chat. Never through model output. Never implicitly.
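The three tiers can be sketched as plain TypeScript types. This is a minimal illustration, not a prescribed schema; every field name here is an assumption:

```typescript
// Hypothetical shapes for the three memory tiers; field names are illustrative.

// Tier 1: interaction state. Resets or evolves every turn.
interface InteractionState {
  mode: "Brainstorming" | "Converging" | "Committed";
  currentTopic: string | null;
  currentCandidate: string | null;
}

// Tier 2: working memory. Lives for the session, then is discarded.
interface WorkingMemory {
  rejectedIdeas: string[];
  openQuestions: string[];
  pinnedDecisions: string[];
}

// Tier 3: canonical state. Only the mutation gateway may write to it.
interface CanonicalState {
  sections: { heading: string; body: string }[];
}

// The tiers travel together but are mutated under different rules.
interface Model {
  interaction: InteractionState;
  working: WorkingMemory;
  canonical: CanonicalState;
}
```

Keeping the tiers as separate fields of one model, rather than one flat bag of state, is what lets the rest of the architecture enforce different write rules per tier.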

The Typed Action Space #

The second structural element is a typed, app-defined action space that serves as the sole mutation boundary:

ConfirmCurrent
AlternativeCurrent
ExpandCurrent
AddCurrentToArtifact
DeriveEntityFromCurrent

These are not UI buttons. They are the complete set of ways the application can change canonical state. The labels shown to the user are contextual — “Use this setting”, “Try another angle”, “Add to document” — but the underlying types are fixed and authored.

The load-bearing rule: the agent can reason and generate freely. It can only act through typed, app-defined actions.

Every increase in agent capability shows up as a new typed action — not as freeform model output mutating state directly.
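A typed action space maps naturally onto a discriminated union, with one gateway function as the sole writer of canonical state. The action names come from the list above; the payload shapes and state shape are assumptions for the sketch:

```typescript
// The app-defined action space as a discriminated union.
// Action names are from the post; payloads are illustrative assumptions.
type TypedAction =
  | { kind: "ConfirmCurrent" }
  | { kind: "AlternativeCurrent" }
  | { kind: "ExpandCurrent" }
  | { kind: "AddCurrentToArtifact"; candidate: string }
  | { kind: "DeriveEntityFromCurrent"; entityName: string };

interface CanonicalState {
  sections: string[];
  entities: string[];
}

// The mutation gateway: the single function allowed to return a new
// canonical state. Raw chat text and model output never reach it;
// only typed actions do.
function applyAction(state: CanonicalState, action: TypedAction): CanonicalState {
  switch (action.kind) {
    case "AddCurrentToArtifact":
      return { ...state, sections: [...state.sections, action.candidate] };
    case "DeriveEntityFromCurrent":
      return { ...state, entities: [...state.entities, action.entityName] };
    default:
      // Deliberation actions shape the conversation, not the artifact.
      return state;
  }
}
```

Adding a new agent capability means adding a new variant to `TypedAction` and a new case in the gateway, which keeps every possible state change enumerable and reviewable.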

The Perception Layer #

Between raw user language and application routing sits a perception layer: a dialogue act classifier that produces structured turn interpretation.

TurnInterpretation {
  act_type   : Brainstorm | Refine | Commit | Question | Tangent
  target     : CurrentCandidate | NewTopic | Artifact
  confidence : Float
}

The application routes on this structured interpretation, not on raw text. User language stays natural. Application logic stays typed.

This is the thin crossing point between freeform expression and structured state management. Without it, every routing decision becomes a fragile string match or a model judgment call.
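Concretely, the routing side of the perception layer might look like the sketch below. The `TurnInterpretation` record mirrors the one above; the route names, the classifier stub, and the 0.6 confidence threshold are assumptions:

```typescript
// The TurnInterpretation record, plus routing on it. The classifier that
// produces this record would be an LLM call; it is out of scope here.
type ActType = "Brainstorm" | "Refine" | "Commit" | "Question" | "Tangent";
type Target = "CurrentCandidate" | "NewTopic" | "Artifact";

interface TurnInterpretation {
  actType: ActType;
  target: Target;
  confidence: number;
}

type Route = "deliberate" | "surfaceCommitActions" | "answer" | "noteTangent";

// Routing happens on the typed interpretation, never on raw text.
// A low-confidence classification falls back to the safe path:
// deliberation, which cannot mutate anything.
function route(t: TurnInterpretation): Route {
  if (t.confidence < 0.6) return "deliberate";
  switch (t.actType) {
    case "Commit":
      return "surfaceCommitActions";
    case "Question":
      return "answer";
    case "Tangent":
      return "noteTangent";
    default:
      return "deliberate";
  }
}
```

Note the asymmetry: even a confident `Commit` classification does not commit anything; it only surfaces commit actions for the user to take.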

The Evaluator #

The third structural element is a deterministic evaluator — the LLM-app translation of the Drama Manager — that computes the right suggestion surface from current interaction state:

evaluate(InteractionState) -> EvaluationResult {
  mode     : NarrativeMode
  actions  : TypedAction list
  nudge    : String option
}

The evaluator is pure and app-owned. It reads typed interaction state. It produces a small set of contextually appropriate actions. It does not call the model. It does not read raw chat. It does not touch canonical state.

The evaluator is implemented as a small beat library — named, priority-ordered rules with pure boolean preconditions:

Beat: CandidateReady
  Precondition: mode=Brainstorming AND turn_count > 2 AND candidate_exists
  Priority: 10
  Effect: Surface [ConfirmCurrent, AlternativeCurrent, ExpandCurrent]

Beat: DriftDetected
  Precondition: thread_status=Drifting AND turn_count > 5
  Priority: 8
  Effect: Nudge "Want to return to [current topic]?"

Beat: ReadyToCommit
  Precondition: mode=Converging AND candidate_confidence > threshold
  Priority: 9
  Effect: Surface [AddCurrentToArtifact, ExpandCurrent]

Every suggestion the user sees has a logged reason. Every reason is a beat that fired. Every beat fired because its precondition evaluated to true against typed interaction state.

This is the observability property that most LLM apps lack entirely. When the suggestion surface feels wrong, you check the beat log — not the model output.
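The three beats above can be encoded as data plus one pure selection function. This is a minimal sketch; the state fields and the 0.8 confidence threshold stand in for the `threshold` placeholder and are assumptions:

```typescript
// A minimal beat library: pure preconditions over typed interaction state,
// priority-ordered selection, and a logged reason for every suggestion.
interface InteractionState {
  mode: "Brainstorming" | "Converging";
  turnCount: number;
  candidateExists: boolean;
  candidateConfidence: number;
  threadStatus: "OnTopic" | "Drifting";
  currentTopic: string;
}

interface Beat {
  name: string;
  priority: number;
  precondition: (s: InteractionState) => boolean;
  actions: string[];
  nudge?: (s: InteractionState) => string;
}

const beats: Beat[] = [
  {
    name: "CandidateReady",
    priority: 10,
    precondition: s => s.mode === "Brainstorming" && s.turnCount > 2 && s.candidateExists,
    actions: ["ConfirmCurrent", "AlternativeCurrent", "ExpandCurrent"],
  },
  {
    name: "ReadyToCommit",
    priority: 9,
    precondition: s => s.mode === "Converging" && s.candidateConfidence > 0.8,
    actions: ["AddCurrentToArtifact", "ExpandCurrent"],
  },
  {
    name: "DriftDetected",
    priority: 8,
    precondition: s => s.threadStatus === "Drifting" && s.turnCount > 5,
    actions: [],
    nudge: s => `Want to return to ${s.currentTopic}?`,
  },
];

// Pure and app-owned: no model call, no raw chat, no canonical state.
// The firedBeat field is the logged reason for whatever the user sees.
function evaluate(s: InteractionState) {
  const eligible = beats
    .filter(b => b.precondition(s))
    .sort((a, b) => b.priority - a.priority);
  if (eligible.length === 0) return null;
  const beat = eligible[0];
  return { firedBeat: beat.name, actions: beat.actions, nudge: beat.nudge?.(s) };
}
```

Because `evaluate` is pure, the beat log is trivially reproducible: replay the interaction state and you get the same fired beat every time.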

The Full Loop #

User turn
  → DialogueActClassifier (LLM Cmd)
  → TurnInterpretation
  → Update (deterministic, app-owned)
  → InteractionState updated
  → Evaluator.evaluate(InteractionState)
  → EvaluationResult { mode, actions, nudge }
  → Update decides which Cmd to issue
  → LLM executes the selected Cmd
  → AssistantReturnedReply
  → Update incorporates reply
  → Evaluator.evaluate again
  → View updated

The model appears twice: once as a classifier (perception), once as an executor (generation). It appears nowhere as a decider.
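The deterministic Update step at the center of the loop can be sketched as a reducer: message in, new state plus at most one command out. The message and command shapes below are assumptions chosen to match the loop diagram:

```typescript
// A sketch of the app-owned Update step. The LLM appears only behind Cmds:
// once as classifier, once as executor. It never chooses the next Cmd.
type Msg =
  | { kind: "UserTurn"; text: string }
  | { kind: "TurnInterpreted"; actType: "Brainstorm" | "Commit" }
  | { kind: "AssistantReturnedReply"; reply: string };

type Cmd =
  | { kind: "ClassifyTurn"; text: string }    // LLM as perception
  | { kind: "GenerateReply"; prompt: string } // LLM as generation
  | { kind: "None" };

interface State {
  lastActType: string | null;
  lastReply: string | null;
}

function update(state: State, msg: Msg): [State, Cmd] {
  switch (msg.kind) {
    case "UserTurn":
      // Raw text goes straight to the classifier, never to routing logic.
      return [state, { kind: "ClassifyTurn", text: msg.text }];
    case "TurnInterpreted":
      // The app decides, on typed input, which generation Cmd to issue.
      return [{ ...state, lastActType: msg.actType }, { kind: "GenerateReply", prompt: msg.actType }];
    case "AssistantReturnedReply":
      // The reply is incorporated as data; it triggers no further Cmd here.
      return [{ ...state, lastReply: msg.reply }, { kind: "None" }];
  }
}
```

In a full implementation the `TurnInterpreted` branch would also call the evaluator before choosing the Cmd; that step is elided here to keep the reducer shape visible.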

The Elm Architecture Connection #

This loop is structurally identical to the Elm architecture:

Model -> Update -> Cmd -> View

Where:

  • Model is the three-tier memory state
  • Update is the deterministic app-owned reducer
  • Cmd is the LLM or tool call
  • View is the UI projection of current model state

The Elm architecture was designed to make state management in interactive UIs predictable and debuggable. It turns out to be equally well-suited to making LLM-powered stateful applications predictable and debuggable. The insight transfers because the problem is the same: freeform user input should not directly mutate application state.

Why This Is Not Just Good Architecture #

Most discussions of AI agent architecture focus on capability — how do you make the agent do more, reason better, use more tools. This pattern focuses on control — how do you ensure that increased agent capability does not increase the risk of unintended state changes.

These are in tension. The standard response to that tension is to accept some mutation risk as the price of capability. But this is a false tradeoff.

By separating deliberation from commitment structurally — not by prompt, not by mode, but by architecture — you can increase agent capability freely within the deliberation layer without increasing mutation risk. The mutation boundary is a typed interface. Capability changes inside that boundary do not affect what crosses it.

This is the same insight that makes microservice boundaries valuable: you can change the internals freely as long as the interface contract holds. The same logic applies to the AI agent boundary.

Where This Pattern Applies #

Strip away the domain specifics and the failure mode is general:

A user builds something over time, thinks out loud toward decisions, and an AI assistant that can act prematurely is the main risk.

That is not a writing app problem. It appears in:

Legal research — lawyer explores a theory across turns; brief should only update on explicit “add to brief” action. The model’s helpful tendency to synthesize is a liability when the synthesis is premature.

Medical clinical decision support — physician works through a differential diagnosis; patient record should only update when the physician explicitly commits. A half-formed hypothesis in the record is worse than no record.

Software architecture tooling — engineer deliberates over design options; ADRs (Architecture Decision Records) should only be written when the engineer explicitly decides. Brainstorming contaminating the decision record is a real problem in team settings.

Investment research — analyst builds a thesis across sessions; investment memo should only incorporate claims the analyst has explicitly validated.

In each case, the three-tier memory model, typed action space, perception layer, and deterministic evaluator apply directly. The domain instantiation changes the vocabulary — the canonical artifact, the action enum, the beat library. The structure does not change.

Starting Point #

The full pattern as described is more than a first implementation needs. The essential minimum is two things:

1. Lock the mutation boundary first.

Plain chat is non-mutating. Only explicit typed actions change canonical state. This is a policy decision, not an implementation challenge. Make it a rule and enforce it before anything else.

2. Keep typed interaction state.

Replace routing on raw chat text with routing on a small typed state surface — current mode, current candidate, last classified intent. This is what makes everything else — the evaluator, the beat library, the observability — possible.
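The two commitments together fit in a very small skeleton. Everything here is illustrative, not a prescribed API; the point is the shape, where the chat path cannot reach the artifact by construction:

```typescript
// Minimal skeleton of the two starting commitments. Names are assumptions.
interface TypedInteractionState {
  mode: "Brainstorming" | "Converging";
  currentCandidate: string | null;
  lastIntent: "Brainstorm" | "Commit" | null;
}

interface App {
  interaction: TypedInteractionState;
  canonical: string[]; // the artifact
}

// Commitment 1: plain chat is non-mutating. A classified turn updates
// only the typed interaction state; canonical state is copied untouched.
function onChat(app: App, intent: "Brainstorm" | "Commit", candidate: string): App {
  return {
    ...app,
    interaction: { ...app.interaction, lastIntent: intent, currentCandidate: candidate },
  };
}

// Commitment 2: only an explicit typed action writes to the artifact.
function onAction(app: App, action: "AddCurrentToArtifact"): App {
  if (action === "AddCurrentToArtifact" && app.interaction.currentCandidate) {
    return { ...app, canonical: [...app.canonical, app.interaction.currentCandidate] };
  }
  return app;
}
```

Everything else in the pattern grows out of these two functions: the evaluator reads the same typed interaction state, and the action enum in `onAction` grows into the full typed action space.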

Everything else in the pattern is an elaboration of these two commitments. The beat library earns its complexity when you have enough interaction patterns to need priority-ordered selection. Start small. The architecture scales with your actual interaction complexity, not ahead of it.

The model will suggest. The user will decide. The application will commit.

That is the correct division of labor. Build the architecture that enforces it.