Useful memories become faulty
when continuously updated by LLMs

When an LLM agent rewrites its own experience into textual lessons, more updates don't always make the memory more useful. In several settings we looked at, the agent ended up performing worse than the same model with no memory at all — sometimes even on problems it had previously solved.

Dylan Zhang
University of Illinois Urbana-Champaign
TL;DR
  • The popular recipe of distill experience → store as text → rewrite later is not a reliable engine of self-improvement.
  • After streaming ground-truth solutions through a consolidation loop, GPT-5.4's accuracy drops from 100 % to 54 % on ARC-AGI problems it had previously solved with zero memory.
  • The same trajectories yield different memories under different schedules — the failure is in the rewrite step, not the data.
  • An episodic-only agent — one that selectively retains and deletes raw rollouts, with abstraction disabled — matches or beats every consolidator we tested. The point is curated raw evidence, not an unfiltered firehose.

The paradigm

A line of recent work gives an LLM agent a notebook. After it solves a problem, the agent distills the trajectory into a textual lesson, drops it in persistent memory, and the next time something similar shows up, retrieves and reuses it [refs].

The pitch is irresistible: continual self-improvement without parameter updates. The agent's "weights" are just text it can read and edit. Memory grows, lessons compound, accuracy goes up.

We ran this loop end-to-end on five agent benchmarks (ALFWorld, ScienceWorld, WebShop, AppWorld, Mind2Web) and a controlled stream we built on top of ARC-AGI. The story didn't hold.

Headline result

The agent regresses on tasks it had already solved.

Take 19 ARC-AGI problems that GPT-5.4 solves at 100 % accuracy with no memory. Stream those exact problems through the consolidation loop, with ground-truth solutions available at every step.

ARC-AGI Stream: GPT-5.4 accuracy collapses from 100% to 54% as memory consolidates from ground truth.
Solving the same problem twice, the second time worse. Without memory: 100 %. After consolidating from the ground-truth solutions of those very problems, GPT-5.4 drops to 54 %. The trajectories are perfect; the rewrite step is what breaks.

"Faulty memory" is not a euphemism for "noisy data." The data is clean. The agent saw the right answer. The act of compressing those right answers into a re-usable lesson is what made it forget how to solve them.

The shape of the decline

Memory utility is non-monotonic in updates.

ScienceWorld: score peaks early then declines below the no-memory baseline.
ScienceWorld. Score climbs for the first ~20 update steps, then declines through step 100. Eventually it slips below the no-memory baseline.
WebShop AWM: 0.64 at 8 examples down to 0.20 at 128.
WebShop (AWM). 0.64 at 8 examples → 0.20 at 128. The "no-memory" baseline sits at 0.20. Scaling the memory erases its own benefit.

And this isn't a failure of bad starting points. We seeded an ALFWorld memory with the strongest model we tested (GPT-5.4) on the cleanest "Static-Group" schedule, then continued updating it with smaller models on the same trajectory pool. Three different solvers (Qwen3.5-{27B, 9B, 4B}), same shape:

ALFWorld utility decay across solvers under continued consolidation.
A strong memory is not a fixed point. Continued consolidation on the same trajectory pool drags utility down across all three solvers, sometimes catastrophically between consecutive steps.

It's the rewrite, not the data

The same trajectories produce different memories
depending on how you serve them.

Hold the trajectory pool fixed. Vary only the consolidation schedule. The output memory changes qualitatively, and so does downstream score.

Static-Group is the best of the three. When the consolidator sees a clean batch of one task family at a time, it actually has a chance to extract the latent structure. This is the cleanest possible offline setting.
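In sketch form, the three schedules differ only in how they batch the same pool. The `(family, trajectory)` pairing below is an assumption for illustration, not the benchmark's actual format:

```python
from itertools import groupby

def static_group(pool):
    """One clean batch per task family."""
    by_family = sorted(pool, key=lambda t: t[0])
    for _, batch in groupby(by_family, key=lambda t: t[0]):
        yield list(batch)

def static_all(pool):
    """The whole pool in a single consolidation call."""
    yield list(pool)

def stream(pool):
    """One trajectory at a time, in arrival order: the schedule
    a continually-deployed agent actually gets."""
    for trajectory in pool:
        yield [trajectory]
```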

Static-Group, Static-All, and Stream schedules diverge despite using identical trajectory pools.
Same trajectories, three schedules, three different memories. Streaming — the schedule a continually-deployed agent actually has — is the worst.

Why this matters

The trajectory pool is identical across these three runs. Whatever's wrong with the resulting memory cannot be blamed on the data the agent collected. It has to be in the consolidation step itself.

Three failure modes

Why does the rewrite go wrong?

We isolate three mechanisms. Each one turns the consolidation loop from accumulation into lossy rewriting.

01

Misgrouping

Before abstracting, the consolidator decides which episodes belong together. When forced to consolidate every step, it pools episodes that share little underlying structure.

Under forced consolidation on ARC-AGI Stream, the model frequently combines memory entries across distinct problem classes. When given autonomy, it eventually converges to a clean episodic store covering each of the 6 problem types — but only after 568 examples have elapsed. The capacity to segment is there. The forced rewrite overrides it.

Verbatim memory entry GPT-5.4 · forced consolidation · ARC-AGI Stream

When to use: A large hollow rectangular frame encloses some objects while other objects lie outside it … In the kept interior objects, a single distinguished cell is changed based on a relation to a matching object outside the frame, often when an outside object has the same shape as an inside object.

Strategy: … (5) For each interior object, look for an exterior object with the same shape signature… (6) If an interior object has such a matching exterior counterpart, mark the center cell of the interior object's bounding box with the exterior object's color.

The highlighted spans are foreign-family injections: a shape-signature lookup belongs to the group-by-shape family, the marker color-write belongs to key-marker. Neither is part of the inside-frame source task. The consolidator stitched together a composite no actual family prescribes.
Misclassification count rises sharply under forced consolidation.
Misclassification count under Force: episodes from different families merged into one entry.

02

Interference

Each abstraction pass smooths existing entries. When the chunks are imprecisely bounded, the rewrite strips the applicability conditions: a lesson that was true for Pick&Place reads as broadly relevant and misleads the agent on Pick-Clean-Place.

On a 15-task ScienceWorld switch sequence, distilling memories only on the current task ("Fresh") beats jointly consolidating across all prior tasks ("Cumulative") by +203 points. An LLM judge labels each entry: Cumulative accumulates over-generalized memories at ~5× Fresh's rate, and outright garbage at ~20×.
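In sketch form, the two scopes differ only in what the consolidation call conditions on. `distill` stands in for the LLM rewrite; its signature is illustrative:

```python
def fresh(tasks, distill):
    """Distill each task's memory from that task's raw trajectories only."""
    return [distill(trajectories, prior=None) for trajectories in tasks]

def cumulative(tasks, distill):
    """At each task switch, re-consolidate jointly over everything so far."""
    memory, seen = None, []
    for trajectories in tasks:
        seen.extend(trajectories)
        memory = distill(seen, prior=memory)  # conditions on its own past output
    return memory
```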

Verbatim memory entry ScienceWorld · over-generalized

Using a lighter, fire source, or oven MAY BE NECESSARY to change the state of a food or substance in state-change tasks.

Reads as broadly applicable. But many state-change tasks need cooling, freezing, or melting. The applicability conditions have been stripped: the lesson now biases the agent toward heat sources for tasks where heat is irrelevant or harmful.
Cumulative consolidation falls 203 points behind Fresh on ScienceWorld 15-task sequence.
Fresh vs Cumulative: identical trajectories, different consolidation scope, +203 point gap.

03

Overfit

When the input distribution narrows instead of widening, abstraction overfits to surface regularities of the seen instances rather than the underlying strategy. The memory recognizes exact repetitions and fails on close variants of the same family.

We feed the agent tasks drawn from a single ARC-AGI strategy family across consolidation cycles. Performance stays stable on exact repeats but collapses on small variations within the same family. The "lesson" turned into a description of the example.

Same lineage, 49 rewrites apart GPT-5-mini · "recolor the largest object"

Round 1

identify and extract structured elements from input → compute a global metric (e.g., max size) → iterate elements and selectively apply targeted edits

Round 50

Find the maximum value of a derived per-object numeric attribute and apply a uniform transformation to every object whose attribute equals that maximum.

Round 1 names the actual selector, "max size" — a property a solver can compute. Round 50, after 49 rewrites of the same lineage on the same task, has erased it: the entry no longer records which attribute to maximize.
ARC overfit: accuracy on exact repeats stays stable while accuracy on close variants collapses.
Narrow streams produce memories that recognize seen cases and fail on neighbors.

An aside, for the cognitive-science minded

This is exactly what dual-system memory was built to prevent.

Complementary Learning Systems theory [refs] says the brain keeps a fast episodic store and a slow schema-forming store architecturally distinct, with consolidation gated by schema fit rather than triggered on every event. Collapse the two into one mandatory rewrite loop and you get exactly the interference catastrophe the dual system was designed to avoid.

Today's agentic-memory designs collapse the two: the same LLM that solves the task also rewrites its own memory of that task at every turn, with no gating. Our findings are what the predicted interference looks like in practice.

Memory zoo

Each one is a real entry from a real run.

No charts here, just artifacts: verbatim entries, each with a one-line note on what's broken about it.

GPT-5-mini · ARC-AGI · 200 tasks · entry 1 of memory

"Make a working copy of the input grid (list of row lists) before mutating, perform all modifications on the copy, and return the copy to avoid mutating the original input."

What's wrong. A defensive Python idiom. It mentions no color, shape, or rule that distinguishes the six task families. The model wrote a coding tip and called it a strategy.

The deeper problem

Every consolidation step is a generation. The agent is hallucinating its own past.

The failure is not "the LLM is bad at summarizing." It's structural. We're building a system whose stable long-term knowledge is the fixed point of a generative loop — and there is no fixed point.

Each consolidation pass works like this:

  1. Read the current memory and a fresh trajectory.
  2. Generate what the new memory entry "should" be. This is an LLM forward pass. It produces fluent, plausibly-structured text. It is not a faithful summary of the input — it is a sample from a distribution conditioned on the input.
  3. Write the sample back as if it were ground truth. The next consolidation step reads this sample and conditions on it.

Now stack 200 of these. Step k+1's context is a sample drawn conditioned on step k's sample, which was drawn conditioned on step k−1's, and so on. Plausible-looking text accumulates. Specific facts (which color, which receptacle, which selector) are most likely to drop out at each step because they're the most surprising tokens conditional on the running summary. The memory drifts toward the LLM's prior over what a good lesson looks like, not toward the truth of the trajectories.
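To make the loop concrete, here's a minimal sketch of the recursion. `llm` stands in for any text-generation call; the function names are illustrative, not any particular system's API:

```python
def consolidate(llm, memory: str, trajectory: str) -> str:
    """One consolidation pass: sample the 'new' memory entry.

    The return value is a draw from p(text | memory, trajectory),
    not a summary with any fidelity guarantee.
    """
    prompt = (
        f"Current memory:\n{memory}\n\n"
        f"New trajectory:\n{trajectory}\n\n"
        "Rewrite the memory to incorporate the lesson."
    )
    return llm(prompt)


def run_stream(llm, trajectories) -> str:
    """Stack the passes: step k+1 conditions on step k's sample."""
    memory = ""
    for trajectory in trajectories:
        # The write-back is treated as ground truth by the next pass.
        # Nothing anchors it to the raw trajectories already consumed.
        memory = consolidate(llm, memory, trajectory)
    return memory
```

The only persistent state is the sample itself; once a trajectory is consumed, the loop never re-reads it.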

The reframe

Continuously updated textual memory is an iterated generative loop with no anchor. The "memory" is not a record. It is a sample — fluent, confident, and increasingly disconnected from what actually happened. We saw vacuous abstractions, phantom rules distilled from bugs, byte-identical duplicates, 99-vote tautologies, 50 items collapsing into one. These are not bugs. They are what samples from the consolidator's prior look like after enough iterations.

Why the experiments line up.

Three specific results we showed earlier follow directly from this framing:

  • Stream < Static-All < Static-Group. The more times the sample is fed back as context, the more the entry drifts toward the prior. Static-Group resamples once per family; Stream resamples thousands of times.
  • Cumulative < Fresh by 203 points. Cumulative consolidates over a growing prefix of past summaries; Fresh consolidates from raw trajectories of one task. Cumulative sits deeper in the loop.
  • Episodic-only matches abstraction. Raw episodes are outside the loop. They are not samples. They are records. Of course they hold up better.

The implication is uncomfortable.

The dominant agentic-memory paradigm — "after each task, distill the trajectory into a textual lesson and store it" — is not a way of accumulating experience. It is a way of replacing experience with a slowly-drifting LLM prior over what experience looks like. Until the consolidator is grounded in something it cannot itself overwrite, scaling the experience scales the drift.

A surprisingly strong fix

Don't force abstraction. Just keep the episodes.

ARC-AGI Stream lets us put the agent in charge of its own memory. At each step it can Retain, Delete, or Consolidate. We compare three regimes:

Force

Must consolidate every round. Episodic entries don't persist between rounds. The default in most existing systems.

Auto

Agent chooses: retain raw, delete, or consolidate. Both episodic and abstract stores are available at retrieval.

Episodic Only

Retain or delete raw episodes. Abstraction is disabled entirely.
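In sketch form, the three regimes differ only in which operations are available per step. The store and operation names below are ours, chosen for illustration; `choose_op` stands in for the agent's own (LLM) decision and `distill` for the consolidation rewrite:

```python
from enum import Enum

class MemOp(Enum):
    RETAIN = "retain"            # keep the raw episode verbatim
    DELETE = "delete"            # discard it
    CONSOLIDATE = "consolidate"  # rewrite it into the abstract store

def memory_step(episode, episodic, abstract, regime, choose_op, distill):
    if regime == "force":
        # Mandatory consolidation; raw episodes do not persist.
        abstract.append(distill(episode, abstract))
    elif regime == "episodic_only":
        # Abstraction disabled: retain or delete, nothing else.
        if choose_op(episode) == MemOp.RETAIN:
            episodic.append(episode)
    else:  # "auto": the agent picks among all three operations
        op = choose_op(episode)
        if op == MemOp.RETAIN:
            episodic.append(episode)
        elif op == MemOp.CONSOLIDATE:
            abstract.append(distill(episode, abstract))
        # MemOp.DELETE: drop the episode.
```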

ARC-AGI training curves: Auto and Episodic Only beat Force across 400 steps.
Across 400 training steps and two backbones, Auto — which keeps episodes by default and uses abstraction sparingly — outperforms Force. Whatever Force gains from compression, it loses more by overwriting evidence.
Episodic Management Only matches Auto; abstract-only collapses to no-memory baseline.
Where the gain actually lives. Removing episodic evidence and reading only abstract lessons collapses accuracy back to the no-memory baseline. Episodic Management Only — raw episodes that the agent has selectively retained or deleted, with abstraction disabled — matches or exceeds the full Auto mode. The useful information was sitting in the curated raw episodes the whole time.

ARC-AGI GT Stream: 400 steps, ground-truth solutions, all four management policies.

The cleanest test of the gating prediction is the GT regime, where the agent receives ground-truth solutions at every step. There is no "the trajectories were noisy" excuse here. Whatever happens at the consolidation step is what happens.

ARC-AGI GT Stream over 400 steps: Auto+Episodic beats Force; the gap widens with training.
ARC-AGI GT Stream, 400 training steps. Force lags from step ~50 onward. The Auto+Episodic curve climbs and stays climbing; Force plateaus and is overtaken. Same model, same trajectories, same ground-truth solutions — just a different rule about whether abstraction is mandatory.

To isolate where the Auto+Episodic gain comes from, we re-evaluated four checkpoints from the same run with each memory source restricted in turn: Abstract Only reads just the distilled lessons, Episodic Only reads just the raw episodic store, and Auto reads both.
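The read-path restriction itself is trivial; in sketch form (store names illustrative):

```python
def readable_memory(episodic, abstract, mode):
    """What the solver may retrieve from under each ablation."""
    if mode == "abstract_only":
        return abstract               # distilled lessons only
    if mode == "episodic_only":
        return episodic               # curated raw episodes only
    return episodic + abstract        # "auto": both stores
```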

GT Stream component ablation: Episodic Only recovers nearly all of the Auto gain; Abstract Only never beats no-memory.
The abstract store is doing none of the work. Reading only distilled lessons (Abstract Only) never improves on the no-memory baseline at any of the four checkpoints. Reading only raw episodes (Episodic Only) recovers nearly the entire Auto gain. The combined Auto reading is, at best, marginally better than Episodic Only alone — meaning the consolidator's distillations are contributing roughly zero on top of the raw episodes the agent already chose to keep.

And the agent itself agrees, when given the choice. It saturates the episodic buffer quickly at every budget level and keeps the abstract store sparse:

Buffer composition under Auto: episodic dominates, abstract stays sparse.
Auto-mode buffer composition. The agent's own management policy is episodic-first when the architecture permits it.

The principle

Episodic and schema-forming roles should not be collapsed into a single rewrite loop. Raw episodes are first-class evidence, not material to be compressed away. Abstraction, when it happens, should be opt-in and gated by the agent — not forced on every trajectory.

An uncomfortable baseline

An episodic-only memory is competitive with every consolidator we tested.

On WebShop, ALFWorld, and AppWorld, an "episodic-only" memory — just append raw trajectory rollouts to context, no cross-trajectory rewriting — is competitive with ACE, AWM, and Dynamic Cheatsheet. Same trajectories. No distillation step. The solver's in-context learning extracts the relevant signal directly from preserved instances.

We're not saying abstraction is useless. We're saying: a memory method whose value depends on distillation should be tested against the unabstracted rollouts it distills. Currently, very few are.
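The baseline costs a few lines to build. Here's a minimal sketch using bag-of-words cosine similarity for retrieval; the scoring choice and function names are ours, and any retriever would do:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(task: str, rollouts: list[str], k: int = 3) -> list[str]:
    """Return the k raw rollouts most similar to the new task."""
    query = Counter(task.lower().split())
    return sorted(
        rollouts,
        key=lambda r: cosine(query, Counter(r.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(task: str, rollouts: list[str], k: int = 3) -> str:
    """Episodic-only baseline: raw rollouts pasted in as in-context demos."""
    demos = "\n\n".join(retrieve(task, rollouts, k))
    return f"Relevant past rollouts:\n{demos}\n\nNew task:\n{task}"
```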

Takeaways

So — what should you build?

  1. Treat raw episodes as first-class evidence. Don't compress them away by default. Today's solvers can already use them via in-context learning.
  2. Make abstraction selective and gated. Not every trajectory needs to become a "lesson". Most should not.
  3. Decouple the episodic and schema-forming roles. A fast episodic buffer + a slow, gated abstract store dominates a single mandatory rewrite loop.
  4. Stress-test against scale. A memory system that's good at 8 examples and bad at 128 is not a memory system. It's a prompt with a leak.
  5. Always include an episodic-only baseline. If your distilled memory can't beat raw rollouts retrieved as in-context demos, the distillation isn't earning its keep.

Continually rewritten memory is fragile.

Persistent textual memory promised a path for LLM agents to improve after deployment without weight updates. Our results say: not yet. Continuously updated textual memory should be viewed not as a reliable engine of self-improvement, but as a fragile mechanism that can make more experience lead to worse memory.

Long-horizon agents will need both episodic and schematic memory. But until LLMs can decide when and how to consolidate, the safer default is to keep the evidence and abstract sparingly — or not at all.