Less memory, more context
The agent loop forgot, so we built memory systems. Memory streams, OS-inspired paging, self-organizing notes, bi-temporal graphs, six-type controllers.
Some of it works. Most of it is solving a problem context engineering has already dissolved.
The cognitive-science vocabulary (sensory, short-term, long-term, episodic, semantic, procedural) gave the field a respectable-looking blueprint for memory.
The blueprint did not survive contact with production. Deployed agents are not running anything that maps cleanly onto Tulving’s 1972 distinctions.
What they are running is procedural memory as code, externalized state on disk, bi-temporal stores for facts that drift, and a compaction policy.
Four primitives. Everything else is scaffolding around a particular generation of the loop primitive, and the loop primitive keeps widening.
Five cognitive boxes that don’t map
Almost every agent-memory paper from 2023 opens the same way. A diagram. Sensory at the top. Short-term in the middle. Long-term at the bottom, split into episodic, semantic, and procedural.
It is the same diagram cognitive scientists drew for biological brains in the 1970s. And it became a build spec for systems with nothing in common with a brain.
Lilian Weng’s 2023 essay lifted the stack: “Agent = LLM + Planning + Memory + Tool Use.” Short-term equals the context window. Long-term equals an external vector store.
CoALA formalized it later that year. LangGraph, LangMem, LlamaIndex, AutoGen all run the same categories now.
The trouble is the diagram describes how human recall feels from the inside. It does not describe a useful set of storage primitives for a system whose physical constraints are nothing like a brain’s.
The “Memory in the Age of AI Agents” survey (December 2025) names this directly and proposes carving by form, function, and dynamics instead.
The carve matters less than the move. Stop deriving your storage tiers from a 1972 psychology paper.
Most failures you would attribute to “lack of memory” are not failures Tulving’s taxonomy predicts. They are context engineering failures in memory-system clothing.
Treat any framework whose primary distinction is “episodic vs. semantic” as a hint it is solving the wrong problem.
Picking what matters from a growing log
Run an agent for a weekend. By Monday the observation log has 14,000 timestamped entries and the context window holds 200. The model is making decisions on the last forty minutes because everything older fell out.
The first instinct is recency. Keep the last N. It misses anything that mattered an hour ago. The second instinct is similarity. Embed everything and retrieve top-k. It misses anything you cannot phrase as a similar query.
The 2023 paper that popularized the obvious synthesis is Park et al.’s Smallville simulation (Generative Agents). Sum three normalized signals at retrieval. Equal weights.
Exponential recency decay. LLM-rated importance from write time. Cosine similarity to the query.
def score(memory, query, now):
    # Three signals, each normalized to roughly [0, 1], weighted equally.
    recency = exp_decay(now - memory.last_accessed)
    importance = memory.importance / 10  # LLM-rated at write
    relevance = cosine(embed(query), memory.embedding)
    return (recency + importance + relevance) / 3
A reflection layer sits on top. When recent importance crosses a threshold, the agent asks itself three high-level questions and writes synthesized insights back as new memories, linked to evidence. Hierarchy over the raw stream.
Strip out reflection in the ablation and the human-rated believability of the characters drops measurably. The shape was right.
The load-bearing details were wrong. Equal weighting of three orthogonal signals is unprincipled and task-dependent. LLM-rated importance is noisy and biased toward whatever felt narratively significant at write.
The stream grows unbounded. Reflection produces lossy summaries you cannot undo. Nothing in the stack tells you what to forget, only what to keep.
Ship the shape, not the paper’s defaults. Calibrate weights for your task. Budget reflection. Decide what gets forgotten before you decide what gets kept.
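A minimal sketch of that advice, with the weights exposed as parameters. The `Memory` record, the hour-based clock, the toy `cosine`, and the default weights and decay rate are all assumptions made for illustration, to be calibrated per task rather than copied.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]
    importance: float     # 1-10, rated by an LLM at write time
    last_accessed: float  # hours on some shared clock (assumed unit)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(memory, query_embedding, now, weights=(1.0, 1.0, 1.0), decay=0.995):
    """Weighted mean of recency, importance, and relevance.

    Weights must be positive. Each signal is kept in [0, 1].
    """
    w_rec, w_imp, w_rel = weights
    recency = decay ** max(0.0, now - memory.last_accessed)
    importance = memory.importance / 10
    relevance = max(0.0, cosine(query_embedding, memory.embedding))
    total = w_rec * recency + w_imp * importance + w_rel * relevance
    return total / (w_rec + w_imp + w_rel)


def retrieve(memories, query_embedding, now, k=3, **kwargs):
    """Return the k highest-scoring memories."""
    ranked = sorted(
        memories,
        key=lambda m: score(m, query_embedding, now, **kwargs),
        reverse=True,
    )
    return ranked[:k]
```

With equal weights an old but on-topic memory can outrank a fresh off-topic one. Push the recency weight up and the ranking flips, which is why the weights belong in your evaluation loop and not in a constant.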
Paging the context window
Even with smart retrieval the window is a hard wall. Push past it and the FIFO queue drops state off the back. Summarize inline and you pay tokens out of the budget you are trying to extend.
The instinct is to treat the LLM like a CPU. MemGPT (Packer et al., October 2023, now Letta) reaches for it explicitly. The context window is RAM. Everything else is disk.
The agent moves data between them with function calls: core_memory_append, archival_memory_insert, archival_memory_search. When the FIFO queue nears its token limit, recursive summarization compresses it.
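class Agent:
    def step(self, user_msg):
        self.fifo.append(user_msg)
        # Evict and summarize the oldest messages once the queue nears its budget.
        if self.fifo.token_count() > self.threshold:
            summary = self.llm.summarize(self.fifo.drain_oldest())
            self.recall.insert(summary)
        # A memory tool edits state and hands control back to the model.
        # Only a non-memory action ends the step and returns to the caller.
        while True:
            action = self.llm.act(self.main_context())
            if not action.is_memory_tool():
                return action.execute()
            action.execute(self.core, self.recall, self.archival)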
The OS metaphor leaks where you would expect. The agent has to decide what to page in and out. That decision uses tokens, and it is itself fallible.
Summarization is lossy and irreversible. The gains over a fixed-context baseline shrink fast when modern long-context windows just hold the whole conversation.
What pays off is the explicit memory blocks. Agents that can read and write a named region of their own state. The follow-on sleep-time compute work is where this architecture actually earns out (later in the post).
Reach for MemGPT when you need agent-writable memory blocks and shared state between agents. Do not reach for it because you are afraid of the context window. The window is bigger than your instinct from 2023 thinks it is.
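A named, agent-writable block can be very small. The sketch below is not Letta's API: `MemoryBlock`, `append`, `replace`, `render`, and the character limit are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    label: str
    value: str = ""
    limit: int = 2000  # character budget, an arbitrary placeholder

    def _set(self, candidate: str) -> None:
        # Reject writes that would blow the budget instead of truncating silently.
        if len(candidate) > self.limit:
            raise ValueError(f"block '{self.label}' would exceed {self.limit} chars")
        self.value = candidate

    def append(self, text: str) -> None:
        self._set(f"{self.value}\n{text}".strip())

    def replace(self, old: str, new: str) -> None:
        if old not in self.value:
            raise ValueError(f"{old!r} not found in block '{self.label}'")
        self._set(self.value.replace(old, new))


def render(blocks: list[MemoryBlock]) -> str:
    """Compile blocks into the text pinned at the top of the prompt."""
    return "\n".join(f"<{b.label}>\n{b.value}\n</{b.label}>" for b in blocks)
```

Exposing `append` and `replace` as tools gives the agent an editable region of its own prompt. The hard limit is the design choice that matters: it forces the agent to rewrite rather than accumulate.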
Letting memories rewrite each other
A flat append-only store gets crowded fast. A memory written on turn 4 sits next to a memory from turn 400. The embedding and tags from write time no longer capture what that memory means in the current context.
The naive fix is to never edit, just retrieve harder. That saves write cost by paying read cost, and the read cost compounds every turn.
A-Mem (Xu et al., NeurIPS 2025) borrows from Zettelkasten. Memories are notes that link to other notes, and the network rewrites itself as new notes arrive.
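def write(content, store):
    # Structure the raw content into a note before storing it.
    note = llm.structure(content)
    candidates = store.knn(note.embedding, k=8)
    # The LLM picks which neighbors the new note should link to.
    links = llm.select_links(note, candidates)
    note.links = links
    # Linked neighbors may have their description and tags rewritten.
    for c in candidates:
        if c in links:
            c.description = llm.maybe_rewrite(c.description, note)
            c.tags = llm.maybe_rewrite(c.tags, note)
    store.insert(note)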
A-Mem reports SOTA on long-conversation benchmarks across six foundation models. The design is striking and the wins are real.
What breaks is write cost and monotonicity. Every memory write fires several LM calls.
The rewrites are not monotonic: an old note rewritten by progressively shakier context degrades, and there is no audit trail to spot it.
If you ship this, ship a write log so you can replay, an undo path, and a freeze policy for notes linked from enough places. Without those, the self-revising property becomes self-poisoning.
The note you cite three months from now might not be the note you wrote.
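The three safeguards named above (write log, undo path, freeze policy) fit in a few dozen lines. This is a sketch under assumptions, not part of A-Mem: `Note`, `NoteStore`, `freeze_after`, and the log tuple layout are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    id: str
    description: str
    inbound_links: int = 0


@dataclass
class NoteStore:
    freeze_after: int = 3  # hypothetical threshold for "linked from enough places"
    notes: dict = field(default_factory=dict)
    # Append-only log of (note_id, old, new, cause), kept for replay and audit.
    log: list = field(default_factory=list)

    def add(self, note: Note) -> None:
        self.notes[note.id] = note

    def rewrite(self, note_id: str, new_description: str, cause: str) -> bool:
        note = self.notes[note_id]
        if note.inbound_links >= self.freeze_after:
            return False  # frozen: widely cited notes stay as written
        self.log.append((note_id, note.description, new_description, cause))
        note.description = new_description
        return True

    def undo(self, note_id: str) -> bool:
        """Revert the latest logged change to a note. The revert is logged too,
        so calling undo twice restores the change."""
        entries = [e for e in self.log if e[0] == note_id]
        if not entries:
            return False
        _, old, _, _ = entries[-1]
        note = self.notes[note_id]
        self.log.append((note_id, note.description, old, "undo"))
        note.description = old
        return True
```

Recording the `cause` (the id of the note that triggered the rewrite) is what lets you trace a degraded note back to the context that degraded it.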
Memory as runnable code
Teach an agent how to do something on Tuesday. By Friday the model has to reconstruct the procedure from snippets, and the parts it gets wrong are the parts you spent Tuesday correcting.
The naive fix is to stuff the procedure into the system prompt. That works for one procedure. It does not scale to a thousand, and you cannot compose them.
The 2023 paper everyone misclassified as a robotics curiosity is Voyager (Wang et al., May 2023). An open-ended Minecraft agent built around three pieces.
An automatic curriculum that proposes goals. A skill library of executable JavaScript with natural-language descriptions. An iterative prompting loop with environment feedback.
The skill library is the memory.
// Description: Mine three iron ore blocks using an iron pickaxe.
async function mineThreeIronOre(bot) {
  await equipBestPickaxe(bot);
  const ores = await findNearestBlocks(bot, "iron_ore", 3, 32);
  for (const ore of ores) {
    await mineBlock(bot, ore);
  }
}
Every verified skill is a function with an embedded description. Retrieval is embedding similarity over descriptions. New skills compose old skills as calls.
The agent’s “memory of how to do things” is a code repository that grows.
The 2023 numbers were absurd: 3.3x more unique items, 2.3x longer distances traveled, and tech-tree milestones unlocked far faster (15.3x for wooden tools, 6.4x for iron). Only system at the time to reach diamond. Skills transferred zero-shot to new Minecraft worlds.
The numbers are not the point. The point is that procedural memory has a runnable, testable, composable representation. Code.
You do not “remember” a skill. You call the function. Retrieval matches a description. Verification is environment feedback.
A skill written for GPT-4 in 2023 still runs unchanged on Opus 4.7 today. The model is not recalling anything. It is picking and composing.
That lineage runs everywhere now. Anthropic Agent Skills ship SKILL.md files with frontmatter, instructions, and bundled scripts.
Progressive disclosure means Claude reads the table of contents at session start and only drills into a skill when triggered. Procedural memory becomes effectively unbounded, paid for at name-and-description level until used.
Compound engineering writes lessons to AGENTS.md or learnings directories after each task. The repo itself becomes episodic memory that compounds across PRs.
This is the pivot in the whole field. Every memory system before Voyager encoded memory as data the model retrieves and re-reads. Voyager encoded memory as functions the model picks and runs.
The first move is brittle to model changes. The second one is not.
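The pick-and-compose pattern is easy to reproduce outside Minecraft. A minimal Python sketch, assuming a hypothetical `SkillLibrary` in which word overlap stands in for embedding similarity and a plain dict stands in for the game inventory:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    run: Callable


class SkillLibrary:
    def __init__(self):
        self._skills = {}

    def register(self, description):
        """Decorator: store a verified function under its description."""
        def decorator(fn):
            self._skills[fn.__name__] = Skill(fn.__name__, description, fn)
            return fn
        return decorator

    def retrieve(self, query, k=1):
        # Word overlap is a stand-in for embedding similarity over descriptions.
        q = set(query.lower().split())

        def overlap(skill):
            return len(q & set(skill.description.lower().split()))

        return sorted(self._skills.values(), key=overlap, reverse=True)[:k]


library = SkillLibrary()


@library.register("craft a wooden pickaxe from planks and sticks")
def craft_wooden_pickaxe(inventory):
    inventory["planks"] -= 3
    inventory["sticks"] -= 2
    inventory["wooden_pickaxe"] = inventory.get("wooden_pickaxe", 0) + 1
    return inventory


@library.register("mine stone blocks using a pickaxe")
def mine_stone(inventory, count=3):
    if not inventory.get("wooden_pickaxe"):
        craft_wooden_pickaxe(inventory)  # new skills compose old skills as calls
    inventory["stone"] = inventory.get("stone", 0) + count
    return inventory
```

Calling `library.retrieve("how do I mine some stone")[0].run(inventory)` picks `mine_stone` by description and runs it. Nothing about the call depends on which model did the picking.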
Vector retrieval is fuzzy recall
The default move when something does not fit in context is to embed it. Chunk text, store in an ANN index, retrieve top-k, stuff into context. Practitioners reach for it first, every time.
It also fails in two ways the surrounding papers rarely surface.
First, top-k accuracy on production corpora often lands under 60% even after pipeline tuning. Chunk boundaries cut across meaning. Standard 10 to 20% overlap does not close the gap.
Second, retrieving the right chunk does not mean the model uses it. Liu et al. 2023 showed a U-shaped recall curve over context position. Over 20 absolute points of degradation when the gold passage sits in the middle.
Vector retrieval is fuzzy recall over text. It is excellent at “find me documents about this topic.” It is mediocre at “recall this specific fact and act on it.”
For specific-fact recall, structured stores win. That is what motivates the next generation: knowledge graphs, bi-temporal stores, entity-relation extractions.
Treat vector retrieval as topical discovery, not memory.
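The chunk-boundary failure is mechanical, and a toy example shows it. The sketch below assumes fixed-size chunking over a list of tokens; `chunk` and `held_whole` are hypothetical helpers, and the sizes are illustrative.

```python
def chunk(tokens, size=10, overlap=2):
    """Fixed-size chunks; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]


def held_whole(chunks, span):
    """True if some single chunk contains `span` as a contiguous run."""
    n = len(span)
    return any(
        c[i:i + n] == span
        for c in chunks
        for i in range(len(c) - n + 1)
    )


tokens = [f"t{i}" for i in range(20)]
chunks = chunk(tokens, size=10, overlap=2)  # 20% overlap

short_fact = tokens[8:10]  # 2 tokens, fits inside the overlap
long_fact = tokens[7:12]   # 5 tokens straddling the first boundary
```

Here `held_whole(chunks, short_fact)` is true and `held_whole(chunks, long_fact)` is false. Overlap only rescues facts no longer than the overlap itself. Anything longer that straddles a boundary is split across two chunks, and no single retrieved chunk carries it.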
A million tokens is still a small room
Frontier models in 2025-2026 ship 1M+ token windows. The natural instinct is to stuff everything in and stop worrying about retrieval.
The evidence says no. The context-engineering post walks through the receipts.
Every model degrades with length. The curve is non-uniform across positions. A million tokens of inference costs 30 to 60x a tuned retrieval pipeline.
The model card’s headline window is the size of the room. The speaker still has the same vocal range inside it.
You cannot solve “remember conversations from last month” by extending context. The agent’s ability to use what is in front of it falls off well before the advertised limit.
Cache cost climbs. You still need a way to scope retrieval by user, tenant, or time.
What works in production is hybrid. Long context for sensemaking and summarization, where the whole document matters. Retrieval (vector for fuzzy topics, structured stores for facts, graphs for relations) for everything else.
The right question is never “can I fit it.” It is “do I want the model trying to use all of it at once.”
Consolidate while the user is gone
Consolidation costs tokens. Do it during the user’s turn and they wait. Skip it and tomorrow’s session opens to a flat log the model has to re-read in full.
End-of-session summarization bunches the cost at a moment when the user has already left, but locks the result in until the next session.
The move is to decouple the loop. Letta’s sleep-time compute runs a background agent alongside the primary, user-facing one.
The primary answers in real time. The sleep-time agent reviews history offline and rewrites the shared blocks the primary uses.
def primary_turn(msg, memory):
    # User-facing path: answer in real time from the shared blocks.
    return primary_agent.respond(msg, memory=memory)


def sleep_time_loop(memory, history):
    # Background path: rewrite the shared blocks whenever the user is idle.
    while True:
        wait_for_idle_window()
        for block in memory.blocks:
            new_block = sleep_agent.rethink(block, history, recent_facts())
            memory.write(block.id, new_block, source="sleep")
On stateful GSM-Symbolic and stateful AIME: 5x lower test-time compute for the same accuracy, with up to +13 absolute points on GSM-Symbolic and +18 on AIME when sleep-time compute scales up.
LightMem builds the same shape into a three-stage Atkinson-Shiffrin pipeline. Same lesson, different vocabulary: heavy work in the background, fast retrieval online.
This pays off when the system actually rests. For continuously-engaged agents (coding sessions, real-time assistants), the offline windows shrink and dual-agent shared memory becomes the dominant tax.
Find your idle time. Where it exists, claim it. Where it does not, do not pretend the architecture is free.
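Claiming idle time starts with measuring it. A single-process sketch, assuming a hypothetical `IdleGate` and a `rethink` callable that stands in for the sleep-time agent; none of these names come from Letta.

```python
import time


class IdleGate:
    """Tracks the last user activity and reports when an idle window is open."""

    def __init__(self, idle_seconds, clock=time.monotonic):
        self.idle_seconds = idle_seconds
        self.clock = clock
        self.last_activity = clock()

    def touch(self):
        # Call on every user message.
        self.last_activity = self.clock()

    def is_idle(self):
        return self.clock() - self.last_activity >= self.idle_seconds


def consolidate_if_idle(gate, blocks, history, rethink):
    """Rewrite shared blocks only inside an idle window.

    Returns the number of blocks that changed.
    """
    if not gate.is_idle():
        return 0
    changed = 0
    for label, text in list(blocks.items()):
        new_text = rethink(label, text, history)
        if new_text != text:
            blocks[label] = new_text
            changed += 1
    return changed
```

Logging how often `consolidate_if_idle` returns without doing work is a cheap way to find out whether your agent has any idle time to claim.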
Six frameworks, four shapes
Open the docs for six production memory frameworks and they describe themselves with the same four nouns: episodic, semantic, procedural, working. Different vendors, identical taxonomy.
Stand them up next to each other and the differences live in the data structures, not the diagrams. The field already clustered around four shapes.
Salient-fact extraction. Mem0 extracts salient facts and consolidates against existing memory with explicit add / update / delete / no-op actions. Vector index underneath. Mem0g adds a graph layer.
On LOCOMO: a 26% relative gain over OpenAI’s built-in ChatGPT memory, with p95 latency 91% lower and roughly 90% token savings against a full-context baseline. Works when conversational user-facts are the dominant shape. Temporal handling is lite: facts are rewritten in place, and the old fact is gone.
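The add / update / delete / no-op step can be sketched with a keyed dict. This is a rule-based stand-in under assumptions: Mem0 has an LLM choose the action, and `Op`, `decide`, and `consolidate` are hypothetical names.

```python
from enum import Enum


class Op(Enum):
    ADD = "add"
    UPDATE = "update"
    DELETE = "delete"
    NOOP = "no-op"


def decide(store, key, value):
    """Pick one action for a candidate fact. `value=None` means a retraction."""
    if value is None:
        return Op.DELETE if key in store else Op.NOOP
    if key not in store:
        return Op.ADD
    return Op.NOOP if store[key] == value else Op.UPDATE


def consolidate(store, key, value):
    """Apply the chosen action to the store and report which one ran."""
    op = decide(store, key, value)
    if op in (Op.ADD, Op.UPDATE):
        store[key] = value  # UPDATE overwrites: the old value is not kept
    elif op is Op.DELETE:
        del store[key]
    return op
```

The `UPDATE` branch is where history is lost. After `consolidate(facts, "diet", "vegan")` nothing records that the user was ever vegetarian.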
Bi-temporal graphs. Zep / Graphiti is the cleanest version. Each edge has both an event time (when the fact was true) and an ingestion time (when the system learned it).
def write_fact(graph, subj, pred, obj, event_time, now):
    existing = graph.edges_matching(subj, pred)
    for e in existing:
        if conflicts(e, obj):
            # Invalidate rather than delete: keep the edge for historical queries.
            e.invalidated_at = now
            e.invalidated_by_event_time = event_time
    graph.add_edge(subj, pred, obj, valid_from=event_time, ingested_at=now)
New facts that contradict old ones do not delete the old edge. They invalidate it. Queries can ask “what is true now?” or “what was true on date D?” without recomputation.
On DMR: 94.8% versus MemGPT’s 93.4%. If your memory has any temporal axis, this is the only shape that respects it.
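The time-scoped query is the half that pays for the extra columns. A self-contained sketch, not Graphiti's API: it assumes single-valued predicates, facts arriving in event-time order, and a hypothetical `Edge` record held in a plain list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Edge:
    subj: str
    pred: str
    obj: str
    valid_from: date                       # event time: when the fact became true
    ingested_at: date                      # when the system learned it
    valid_to: Optional[date] = None        # event time at which it stopped being true
    invalidated_at: Optional[date] = None  # when the system learned it had stopped


def write_fact(edges, subj, pred, obj, event_time, now):
    for e in edges:
        if (e.subj, e.pred) != (subj, pred) or e.valid_to is not None:
            continue
        if e.obj == obj:
            return e  # already known and still open
        # Contradiction: close the old edge instead of deleting it.
        e.valid_to = event_time
        e.invalidated_at = now
    edge = Edge(subj, pred, obj, valid_from=event_time, ingested_at=now)
    edges.append(edge)
    return edge


def as_of(edges, subj, pred, when):
    """What was true at event time `when`, given everything known so far."""
    return [
        e.obj
        for e in edges
        if (e.subj, e.pred) == (subj, pred)
        and e.valid_from <= when
        and (e.valid_to is None or when < e.valid_to)
    ]
```

After writing "Ada works at Acme" effective 2023 and "Ada works at Globex" effective mid-2024, `as_of` answers both "where does she work now" and "where did she work last December" from the same two edges.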
Multimodal personal recall. MIRIX defines six explicit memory types under a multi-agent controller. Multimodal handling for screenshots.
ScreenshotVQA: +35% over a RAG baseline with 99.9% storage reduction. LOCOMO: 85.4%. Works when the user is recalling lived experience across formats.
Sensemaking corpora. Microsoft GraphRAG extracts entities and relations at index time, builds a graph, detects communities with Leiden, generates hierarchical community summaries.
Global queries map over communities. Local queries expand from entity neighborhoods. Expensive to build and keep fresh. Unlocks queries vanilla RAG cannot answer.
LangMem wraps LangGraph’s BaseStore with the cognitive types. Cognee exposes graph + vector indices via MCP across Claude Code, Codex, LangGraph, CrewAI.
Each framework pays off on one of the four shapes. MIRIX, MemGPT, and A-Mem try to cover all four at once, which is also why their gains over a focused stack narrow as you actually deploy them.
Pick the framework whose dominant shape matches your dominant problem. Avoid any framework whose model of memory is the cognitive taxonomy itself.
Four primitives survive
Strip everything to what carries weight in production and four primitives are left.
Procedural memory as code or markdown. Voyager-style skill libraries, Anthropic Skills, AGENTS.md, CLAUDE.md, learnings directories. Composable, runnable, testable, retrievable by description.
The skill is the artifact. The model picks and composes. This is the primitive that survives generation changes.
Externalized state in a filesystem. Intermediate results, large tool outputs, scratchpads, plan files. Whatever does not have to be in context goes to disk.
Paths or URLs stay in the conversation. The agent reads what it needs. The filesystem is the memory store with no token budget. The operational craft lives in the context-engineering post.
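Keeping the path in the conversation and the payload on disk is a small wrapper around tool output. A sketch using only the standard library; `externalize`, the size thresholds, and the file-naming scheme are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def externalize(output: str, workdir: Path,
                max_inline_chars: int = 2000, preview_chars: int = 200) -> str:
    """Return the text that goes into context.

    Small outputs pass through unchanged. Large ones are written to disk and
    replaced by a path plus a short preview.
    """
    if len(output) <= max_inline_chars:
        return output
    # Content-addressed name: identical outputs map to the same file.
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
    path = workdir / f"tool_output_{digest}.txt"
    path.write_text(output, encoding="utf-8")
    return f"[{len(output)} chars saved to {path}]\n{output[:preview_chars]}..."
```

The agent sees the path and a preview, and reads the file with its ordinary file tools only if the preview suggests it should.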
Bi-temporal stores for facts that drift. Zep / Graphiti-style graphs for anything that changes: user preferences, system state, schema, who-owns-what.
New facts do not delete old facts. They invalidate them. Queries can be time-scoped and stale assumptions can be detected.
Without a bi-temporal model, the system silently builds plans on facts that were true last quarter.
Compaction policy. Not a system, a policy. When to summarize, what to drop, what to externalize, what to preserve as raw evidence (especially errors).
Every long-running agent needs one explicitly. Getting it wrong is the most common production failure that gets blamed on “the model forgot.”
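Written down, a policy is a function you can read and test. The sketch below assumes one particular policy (keep the most recent turns raw, keep every error raw, clear old tool output, summarize the rest). `Turn`, `compact`, and the default summarizer are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user", "assistant", or "tool"
    content: str
    is_error: bool = False


def compact(turns, keep_recent=4, summarize=None):
    """Apply one explicit compaction policy to a conversation.

    - the last `keep_recent` turns stay raw
    - older errors stay raw, as evidence
    - older non-error tool output is dropped
    - older user and assistant turns collapse into one summary turn
    """
    summarize = summarize or (lambda ts: f"[summary of {len(ts)} earlier turns]")
    split = max(len(turns) - keep_recent, 0)
    old, recent = turns[:split], turns[split:]

    errors = [t for t in old if t.is_error]
    to_summarize = [t for t in old if not t.is_error and t.role != "tool"]

    compacted = []
    if to_summarize:
        compacted.append(Turn("assistant", summarize(to_summarize)))
    compacted.extend(errors)
    compacted.extend(recent)
    return compacted
```

In a real harness `summarize` would be a model call. The value of the function is that "what we drop" is now a reviewable line of code instead of an emergent behavior.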
Everything else is downstream. The working-memory abstractions, the elaborate retrieval scoring, the episodic-memory streams. They pull their weight by reducing context rot, or they add to it.
Drew Breunig’s four failure modes (poisoning, distraction, confusion, clash) are the diagnostic. The agent gets worse, not better, every time you add the wrong memory.
Decision rubric
| If your memory need is… | Reach for |
|---|---|
| Reusable skills the agent calls again | Anthropic Skills, Voyager-style code library, AGENTS.md |
| Intermediate results, large outputs, dereferenceable artifacts | Filesystem + paths in context |
| User facts and preferences that change | Mem0, or a bi-temporal graph (Zep / Graphiti) |
| Conversation history that must persist across sessions | MemGPT / Letta archival with consolidation |
| Sensemaking over a corpus larger than context | GraphRAG community summaries + retrieval |
| Multimodal personal recall (screenshots, files, devices) | MIRIX |
| Just enough to avoid context rot in a long-running session | Compaction + tool-result clearing, no separate memory store |
| Lessons learned from failures | Reflexion-style episodic buffer, plus a curation policy |
| Avoiding repeated tool calls in a single trajectory | Results already in context, kept cheap by the KV cache (no memory system at all) |
Two defaults cover most production cases.
Long-running coding agents: filesystem-as-memory plus compaction plus skills. No conversational memory store needed beyond what the harness already does.
Long-running assistants over user data: a bi-temporal graph for facts that drift, Mem0-style salient-fact extraction for the rest, and a compaction policy on the conversation itself.
If you cannot articulate which of these primitives you are reaching for and why, you do not need a memory framework yet. You need to fix your context engineering.
The harness keeps shrinking
What survives is procedural memory as code, externalized state, bi-temporal facts, and a compaction policy.
Four primitives. None of them came from the cognitive taxonomy that defined the field.
Pick the one that matches your problem. Delete the rest as the model catches up.
Related writing
- The agent stack I actually ship with
  How I direct the agent loop instead of letting it drive. Six tools, a dozen human gates, and one rule I never break: nothing ships unattended. Requirements, brainstorm, grill, plan, test, implement, review, verify, design, compound.
- Multi-agent is where context goes to die
  More agents was supposed to mean more capability. Production says otherwise: every handoff between agents loses information, every parallel decision conflicts, and the system fails where context crosses. The number of agents is rarely the question worth asking.
- Context engineering is the job now
  A practitioner playbook for engineering an agent's context window: token-aware design, just-in-time retrieval, compaction vs offloading, sub-agents, KV-cache discipline, and the four moves every working agent stack uses.