Writing
36 min read

Context engineering is the job now

A practitioner playbook for engineering an agent's context window: token-aware design, just-in-time retrieval, compaction vs offloading, sub-agents, KV-cache discipline, and the four moves every working agent stack uses.

On this page

Part one was about the action primitive: what the model emits at each step. This post is about the other half of the loop. What the model reads at each step, and the engineering job that turns out to be.

When an agent fails in production, the post-mortem almost never lands on “the model was not smart enough.” It lands one level up. The wrong information was in the window. The right information was somewhere else. A tool list ate the prompt before the user typed.

A summary lost the one detail the next step needed. A sub-agent came back with prose that wouldn’t compose with what the orchestrator had already decided. The model was running on bad inputs the whole time. That class of failure has a name now.

Andrej Karpathy called it the delicate art and science of filling the context window with just the right information for the next step. Tobi Lütke coined the term a week earlier, arguing it described the actual job better than “prompt engineering” ever had.

Anthropic’s working definition is sharper: “the set of strategies for curating and maintaining the optimal set of tokens during LLM inference.”

Their objective function is the one I keep coming back to. The smallest set of high-signal tokens that maximize the probability of the next correct decision.

This post is a practitioner playbook for that objective function. Not what context engineering is, the field has settled that.

What it actually looks like when you’re shipping: the four moves a working stack makes, the numbers that justify each one, and the patterns that have stabilized across Anthropic, Manus, Cognition, Cursor, and Claude Code.

The action primitive and the context window are inseparable in practice. A wider primitive (CodeAct, sandboxed tool execution) is partly an argument about context: code in, observation out, less in the window than a JSON tool-call trace would have spent.

The MCP series (MCP is not the problem, how to build an MCP server agents will actually use) was about one tributary of the same stream: the server is part of the prompt. The whole window is the prompt. Engineering it is the job.

Context is not free, even when the model says it is

The first thing to internalize is that the headline context-window number on a model card is closer to a structural maximum than a usable one.

Models degrade as the window grows. They degrade differently from each other, but they all degrade, and they start degrading well below the advertised maximum.

The cleanest current demonstration is OpenAI’s MRCR v2 benchmark (Multi-Round Coreference Resolution, 8-needle variant).

It’s the successor to the original needle-in-a-haystack test, designed specifically to break the substring-matching shortcut that made standard NIAH look so flattering.

MRCR plants multiple related needles across the haystack and forces the model to disambiguate between them. The benchmark is run at 128K, 256K, 512K, and 1M tokens, which is most of the working range for frontier production agents in 2026.

MRCR v2 8-needle · 2026 frontier

Long context support is not a license to fill the window

0%25%50%75%100%ACCURACY128K256K512K1MCONTEXT LENGTH
  • Opus 4.693.078.314.7pp
  • GPT-5.588.074.014.0pp
  • Opus 4.765.032.232.8pp
  • Gemini 3 Pro77.026.350.7pp
  • Sonnet 4.562.018.543.5pp

MRCR v2 8-needle scores compiled from the llm-stats leaderboard, OpenAI's own evaluation harness, and the published Opus 4.6-vs-4.7 regression analysis. Multi-Round Coreference Resolution is OpenAI's needle-in-a-haystack successor, harder than the original NIAH and now the standard benchmark for frontier long-context retrieval. Endpoints at the lengths each source measured directly (256K and 1M for the Opus models, 128K and 1M for Gemini 3 Pro); intermediate values interpolated where multi-point data wasn't published.

The shape is the point. Every model degrades. The story of the May 2026 frontier reads as four very different curves on the same benchmark:

Claude Opus 4.6 is the long-context leader. 91.9% at 256K, 78.3% at 1M. A 13-point drop across the working range, but a graceful one.

GPT-5.5 is the most resilient runner-up. 87.5% at 256K, 74.0% at 1M. Trails Opus 4.6 by about 4 points throughout, also degrading gracefully.

Claude Opus 4.7 is the surprise: a sharp regression from 4.6. 59.2% at 256K, 32.2% at 1M. Anthropic shipped a stronger reasoner with measurably weaker long-context recall, and the regression is now part of the public record.

Gemini 3 Pro holds 77% at 128K but collapses to 26.3% at 1M, a 50-point drop. Its long-context curve is the steepest of the cohort.

Claude Sonnet 4.5 sits at 18.5% at 1M, shown for comparison. Nothing about that scale was usable in the first place.

Two patterns are worth flagging directly. First, the frontier-versus-frontier spread at 1M is wider than the average score.

Opus 4.6 at 78.3% sits 46 points above Opus 4.7 at 32.2%. The “best long-context model” question now has a real answer that depends on which release you grab.

Second, even the leader loses measurable recall by 1M. The “1M context” headline on the spec sheet is the size of the room, not the volume of the speaker.

There are several supporting studies. The NoLiMa benchmark (Modarressi et al., 2025) showed the same shape one generation earlier across GPT-4o, GPT-4.1, Llama 4 Maverick, and Gemini 2.5 Flash. Ten of twelve models dropped below half their short-context baseline by 32K.

Lost in the Middle (Liu et al., TACL 2024) measured a 20-point accuracy swing across multi-document QA based on where in the context the answer document was placed. Position 1 and position 20 scored 72-75%; the middle fell to 55%. The model has the information; the model can’t find it.

Chroma’s “Context Rot” research ran 18 frontier models through extended NIAH variants and found something more unsettling than monotonic decline.

Coherent haystacks underperformed shuffled ones. Distractors caused jagged drops at specific lengths. Refusal rates climbed above certain word counts.

Anthropic’s own framing names the mechanism. “As the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases.” They call it a performance gradient rather than a hard cliff. That is the optimistic framing.

The pessimistic framing is the one production traces tell you. Agent contexts routinely run past 100K tokens. A coding session with five files, a CLAUDE.md, two MCP servers, and twenty turns of conversation passes 50K before lunch.

Whatever degradation the curve above shows at 256K is already inside the distribution your agent is running in. And you are not getting there with a single planted needle.

You are getting there with sixty turns of conversation, four file reads, a stale tool result the agent forgot to discard, and a system prompt that grew an extra section last week.

Every one of those tokens is competing for the model’s attention with the one fact the next decision actually needs. This is the empirical case for treating context as a budget.

Not because tokens cost money, though they do. Because the model’s ability to use what’s in front of it is non-linear in length, and the curve bends the wrong way.

The four failure modes you will watch for

Once you accept the window is a budget, the failure modes are what tell you the budget is misallocated. Drew Breunig’s taxonomy is the cleanest naming in the field. Four shapes, each with a tell, each diagnosable from a trace.

Context poisoning. A hallucination or error makes it into the window and gets repeatedly referenced. The canonical example: DeepMind’s Gemini 2.5 playing Pokémon hallucinated a game state into its goal description and pursued unreachable objectives for hours.

The tell is the agent confidently referring to facts it never observed. The damage compounds because every subsequent turn reads the poisoned token as ground truth.

Context distraction. The window grows so long that the model over-attends to recent history and stops drawing on its training. Past about 100K tokens in the Pokémon traces, the agent began repeating prior actions rather than synthesizing new plans.

Databricks measured the inflection earlier for smaller models, around 32K for Llama 3.1 405B. The tell is the agent doing the same thing over and over.

Context confusion. Superfluous content (most often, extra tools) gets used to generate a worse answer. The Berkeley function-calling leaderboard shows every model degrading as the tool count grows.

Breunig cites a quantized Llama 3.1 8B that failed on a benchmark at 46 tools and succeeded at 19, the prompts otherwise identical. The tell is the agent picking the wrong tool from a surface that looks coherent on paper.

Context clash. Information accrued during the trace conflicts with information already in the window. Microsoft and Salesforce’s sharded-prompts study found a 39% average accuracy drop when the same information arrived in fragments rather than upfront.

OpenAI o3 fell from 98.1% to 64.1%. The tell is the agent making decisions consistent with one part of context while ignoring the part that contradicts it.

These are the targets. The four moves below are the responses.

Poisoning wants Write (checkpoint to disk so you can roll back) and Compress (re-summarize from clean state). Distraction wants Compress or Isolate. Confusion wants Select. Clash wants Write (canonical source) and Select (read it instead of regenerating).

What actually fills a production context window

Before getting to the moves, look at where the tokens go. A 200K Claude window sounds enormous in a paper. In a production coding agent with a few servers wired up, it isn’t.

200K Claude window · what eats it

Where the tokens go before the user types

A real coding session by mid-afternoon. One or two MCP servers, ten or so file reads, twenty turns of conversation.

Window utilization99K / 200K (50%)
  • System prompt4.2K tok(2.1%)
  • Auto-memory + CLAUDE.md1.2K tok(0.6%)
  • MCP tool definitions35K tok(17.5%)
  • Skill index700 tok(0.35%)
  • Conversation history40K tok(20.0%)
  • Tool observations18K tok(9.0%)
  • Free for the user's question and the agent's response101K tok

Startup-overhead figures (system prompt, memory, env info, MCP listing, skill index) from Claude Code's published context-window doc. Heavy MCP scenario uses the 143K-token measurement from the MCP context-tax writeup. Conversation history and observation totals are typical mid-session figures.

The “heavy” scenario isn’t pathological. It is what an engineer running Claude Code with the GitHub server, the Playwright server, and a long-running session looks like by mid-afternoon.

The MCP context tax writeup measured 143K of 200K consumed by tool definitions alone, before the user had typed a character.

Cloudflare’s Code Mode post measured the naive exposure of their 2,500-endpoint API as 1.17 million tokens, six times an Opus 4.5 window. Mario Zechner clocked Playwright at 13.7K and Chrome DevTools at 18K, eating 7-9% of a 200K window each.

The quieter number is what fills the rest. Claude Code’s own context window doc lists the startup overhead before the user types: system prompt 4.2K, auto-memory 0.7K, environment info 0.3K, MCP listing 0.1K, skill index 0.5K. About 6K of background hum, fixed.

Conversation history grows from there. Tool observations stack up unless the agent or the runtime trims them. A cat on a 2,000-line file is 12K tokens. Five of those is 60K. None of it is the user’s question. All of it is in the prompt at every turn.

The four moves below are reactions to this picture, not abstract patterns. Each one names one direction the tokens can be pushed when the window starts to fill.

The four moves

The framework everyone has converged on is Lance Martin’s. Write, select, compress, isolate. Every working agent stack composes all four. They are not exclusive and not ordered, but they answer different questions and address different failures.

MoveWhat it doesWhen to reach for itFights
WritePush state out of the window. Filesystems, scratchpads, todo.md, long-term memory.State has to persist beyond a single turn or beyond the window.Poisoning, clash
SelectBring state in on demand. RAG, just-in-time file reads, tool search, lazy schema expansion.The information is large or branching, only a slice is relevant per turn.Confusion
CompressReplace state with a smaller representation. Summarization, auto-compact, response-format modes.The information is unavoidable and stale, you can afford to lose fidelity.Distraction
IsolateRun a separate context. Sub-agents, sandboxed tool execution, hidden state in the orchestrator.The work is independent and the orchestrator only needs the summary.Distraction, clash

The decision rule, with the moves spread out as options:

Symptom picker

Which move does your symptom want?

Pick every symptom you see in your stack. Most agents have more than one.

Primary move (0 symptoms selected)

Pick at least one symptom above to see the primary move and tactics.

The rest of this post is each of those moves in production detail, plus the cross-cutting concerns (KV cache, errors) that none of the four owns cleanly on its own.

Write: the filesystem is a context window with no token budget

Everything that doesn’t have to live in the context window shouldn’t.

The cleanest argument is from the Manus writeup, the densest practitioner document in this space. Every Manus task runs in an E2B sandbox with a real filesystem, and the filesystem is treated as primary context.

A 50,000-token web page gets compressed to a 500-token summary plus the URL. The agent reads the URL again when it needs the rest. The compression is lossless because the original is one tool call away.

The same pattern shows up across every serious agent system. Anthropic’s multi-agent research system has the researcher save plans to disk before working, “since if the context window exceeds 200,000 tokens it will be truncated.”

Claude Code reads files lazily and trims old reads from the window when the next compaction comes. Cursor and Windsurf both auto-generate long-term memory files.

The unifying instinct: the window holds active working state; the filesystem holds referenceable state; the two are bridged by paths and identifiers.

The patterns worth naming explicitly:

Identifiers over inlined data. When a tool returns a 12K-token document, the next turn should see a path, an ID, or a URL, not the document. Compaction is one-shot; offloading is reversible.

The agent can always fetch the page again. It cannot un-summarize a summary that lost the field it now needs.

Scratchpads as durable working memory. Plans, intermediate calculations, partial outputs, all written to disk between turns. The window holds the index. The disk holds the substance.

Long-term memory only for things that genuinely outlive the session. User preferences, project conventions, repeated facts about a codebase. Not “things the agent might want to read again later in this conversation,” which is a scratchpad concern.

The thing Write does that the other three moves cannot is preserve full fidelity. Compression always loses information. Selection rebuilds context from a corpus that itself had to be authored. Isolation hands a summary to a parent.

Write is the only move where the original is still there, at full resolution, exactly where the agent left it.

Recitation: writing as goal anchor

The Manus pattern with the most leverage relative to its cost: maintain a todo.md file with the current plan. Between turns, rewrite the file, check off completed items, refine what’s next. Re-read at the top of the next turn.

Functionally, this puts the agent’s current understanding of the goal at the end of the context, where attention weighs it most. The reason it works is Lost in the Middle in reverse. The model attends most strongly to the start and end of the window.

Long-running agents lose the plan not because they forget it but because it slid into the middle of a long trace and stopped getting attention. Recitation keeps it at the end, freshly written, every turn.

The same instinct lives elsewhere under different names. Anthropic’s researcher saves its plan to memory at the top of each task. Claude Code rewrites its task list during long sessions. The pattern is everywhere once you start looking; it just doesn’t have one canonical name.

The practical version is two lines of agent logic. Ask the model to rewrite todo.md reflecting current state after each turn. Re-read the file at the top of the next turn.

Cost: a few hundred tokens per turn. Benefit: the agent stops drifting into context-distraction failures around turn 30.

Select: just-in-time over just-in-case

Selection has the biggest published wins because it directly fights the cost the previous section showed. Tool definitions and stale state filling the window before any work has started.

The Anthropic Effective context engineering post names the principle: agents should “maintain lightweight identifiers (file paths, stored queries, web links) and use these references to dynamically load data into context at runtime.” Eager loading is the default that ships broken.

The case is strongest for tools, which is where the numbers are densest. Anthropic’s Tool Search Tool moves definitions out of the prompt by default. An internal example collapsed from 77K tokens to 8.7K, an 85% reduction.

MCP eval accuracy on Opus 4.5 went from 79.5% to 88.1%. On Opus 4, 49% to 74%. None of this required the model to get better. It required the model to read fewer tool descriptions.

Cloudflare Code Mode replaces 2,500 endpoints with two tools (search, execute) and an OpenAPI catalog the agent browses. Their reported context cost is roughly 1,000 tokens. The 1.17M-token alternative would never have fit in any production window at all.

Anthropic’s Code Execution with MCP measures a Drive-to-Salesforce sync collapsing from 150K to 2K, a 98.7% reduction, by exposing MCP servers as filesystem modules the agent imports rather than tools it has descriptions of.

The pattern beneath all three is the same. Make the tool surface searchable instead of loaded. Pull in only the schemas the agent decides it needs. Treat the catalog as data the agent navigates, not as part of the prompt it reads every turn.

The same principle applies to non-tool context. Retrieval-augmented generation is selection over a document corpus.

Anthropic’s Contextual Retrieval reports a 49% reduction in top-20-chunk retrieval failure rate by prepending a chunk-specific context line before embedding. 67% with reranking added.

The mechanism is selection at higher precision, not higher recall. The chunks the agent reads are denser in signal, so fewer of them need to land in the window for the answer to be reachable.

The failure mode just-in-time fights is context confusion. The Berkeley function-calling data is the clean evidence. Every model degraded as the tool count grew. Breunig’s line is the one to remember: “if you put something in the context the model has to pay attention to it.”

Compress: lossy by construction

Compression is the move people reach for first, which is exactly why it should usually be reached for last.

The standard tool is summarization. Claude Code’s auto-compact kicks in around 83.5% utilization, clears older tool outputs first, then summarizes the conversation. Cognition uses a separate fine-tuned model whose only job is boundary summarization at sub-agent handoffs.

Cursor and Windsurf do silent rolling summarization in long sessions. All of these are useful. All of them are lossy, and the loss is structural.

The Cognition framing in “Don’t Build Multi-Agents” is the one to internalize:

Share context, and share full agent traces, not just individual messages. Actions carry implicit decisions, and conflicting decisions carry bad results.

The example they use is two sub-agents asked to build pieces of a Flappy Bird clone. One produces a Mario-styled background; the other produces a stylistically mismatched bird.

The merge fails because the summary the parent had was the description of what each sub-agent was supposed to do, not the trace of what each one actually did.

The summary lost the implicit decisions. The implicit decisions were the entire game.

The same risk applies inside a single agent. Every compaction step replaces a long history of operations with a shorter description of what was done. The next turn sees the description, not the operations.

If the description omits the field the next decision needs, the agent will confidently make the wrong call against context it cannot un-summarize. There is no recovery path. The original is gone.

So the discipline:

Compress only when you cannot offload. A 50K-token web page is a Write move (URL plus summary), not a Compress move. A 10-turn conversation history that has to stay in the window because the next decision depends on it might be a Compress move.

Preserve tool traces where you can. The decision the agent made matters less than the operation it ran and the observation it got back. Trace lines compress poorly into prose; prose summaries lose the structure the next agent needs.

Use a different model for compaction. Anthropic’s recommendation, Cognition’s, and the pattern Aider’s architect/editor split leans on. A summarizer prompted for fidelity to source loses less than the main model trying to write its own summary mid-loop.

Treat compaction as a one-way door. Once a section of history has been summarized, it is gone. Schedule it deliberately. Watch what the agent does in the turn immediately after; that turn is where compression failures show up.

The honest empirical situation is that nobody has published a rigorous before/after benchmark on compaction quality. The discourse is qualitative because the failure modes are.

What everyone agrees on is the direction: compress as little as possible, as late as possible, with as much context preserved structurally as you can manage.

Isolate: when sub-agents earn their cost

Sub-agents are a context-engineering tool, not a free intelligence multiplier. The math determines when they earn their cost.

The Anthropic multi-agent research system is the strongest empirical case. Claude Opus 4 orchestrating Claude Sonnet 4 sub-agents outperformed single-agent Opus 4 by 90.2% on their internal research benchmark.

Each sub-agent receives a narrow objective and a clear task boundary, runs in its own context window, and returns a 1,000-2,000-token distilled summary to the orchestrator. The orchestrator never sees the sub-agent’s full trace.

That’s what makes the math work. Many parallel windows, each focused, each summary much smaller than the work that produced it.

The cost is real. Anthropic measures multi-agent systems using “about 15x more tokens than chats.” Single agents already use 4x. Token usage explained 80% of the variance in their browsing agent performance. The 15x is not free.

The win above is on a workload (parallel research) where the work was independent and the orchestrator’s job was synthesis, not coordination.

Cognition’s “Don’t build multi-agents” is the canonical foil, and it isn’t wrong; it is right for a different workload. The Flappy Bird example fails because the work was not independent.

The bird’s style depended on the background’s style, and the orchestrator’s summary didn’t carry enough of that decision to keep the two consistent.

For coding tasks, where every artifact has to compose with every other, single-threaded usually wins. For research tasks, where each sub-claim verifies independently, parallelism usually wins.

The decision rule that survives both writeups:

Workload propertySub-agents workStay single-threaded
Work shapeGenuinely parallelizableSequentially dependent
Sub-result compositionAdditive (synthesis)Multiplicative (consistency)
Failure recoveryIndependent failureCoupled failure
Context size per subLarger than orchestrator can holdFits in one window
LatencyWall-clock matters more than tokensToken cost dominates

The other isolation pattern, sandboxed tool execution, has fewer arguments around it. Anthropic’s Code execution with MCP and HuggingFace’s CodeAgent both run the agent’s code in a sandbox where intermediate results never enter the LLM’s context unless the code explicitly prints them.

That’s isolation in the small. It’s where the 150K-to-2K reduction comes from: the spreadsheet got filtered inside the sandbox, and only the five rows that mattered crossed the boundary into the prompt.

Sub-agents are isolation in the large. Sandboxed code is isolation in the small. Both buy the same thing: the orchestrator’s context only sees the answer, not the work.

The next post in this series goes much deeper into the orchestration patterns themselves; this is just the context-engineering view of why isolation earns its cost when it does.

The KV cache is the production tier-1 metric

Of all the cross-cutting concerns, the KV cache is the one that gets least discussion relative to its actual production weight. Manus’s writeup puts it bluntly:

KV-cache hit rate is the single most important metric for a production-stage AI agent.

The economics. Cached input on Claude Sonnet costs $0.30 per million tokens. Uncached costs $3 per million tokens. That is a 10x gap on the largest line item in any agent’s spend.

Anthropic’s prompt caching pricing is similar across the family: cache read at 0.1x base, 5-minute cache write at 1.25x, 1-hour write at 2x. Break-even on a 5-minute cache is one read; on a 1-hour cache, two reads.

In an agent loop with 50 tool calls per task and a 100:1 input-to-output ratio (Manus’s measured numbers), the cache is the difference between economics that work and economics that don’t.

The mechanism is unromantic. Models hash the prefix of every request and reuse the prefill computation when the same prefix appears again. Any byte that changes inside that prefix forces a re-prefill of everything after it. Which means the failure modes are also mechanical.

Dynamic timestamps in system prompts. f"The current time is {datetime.now()}" interpolated per request invalidates the cache on every call. The fix is to capture a session-start timestamp once and pass it through, or to omit it entirely. Most agents do not need it.

Non-deterministic JSON serialization. Tool descriptions or session metadata serialized with hash-randomized dict order will produce a different prefix on every restart. Use a sort_keys-equivalent serializer everywhere the prompt is constructed.

Adding or removing tools mid-session. Manus’s specific finding: “even a single-token difference can invalidate the cache.” Tools should be present-but-masked (logit suppression), not dynamically inserted or removed.

Inserting a new tool definition mid-conversation reorders the prefix and torches the cache.

Session-start metadata that changes between sessions. User IDs, workspace names, conversation IDs woven into the system prompt rather than passed as variables.

The diagnostic that matters is the cache_read_input_tokens to cache_creation_input_tokens ratio. Anthropic exposes both in API responses. A sustained drop in the read ratio is a regression, and it almost never has a model-side cause. It is a prompt-construction bug.

Claude Code’s engineering team is reported to declare SEVs when cache hit rates drop, which is the appropriate level of seriousness for a metric this load-bearing.

The cache is also the constraint that makes some of the patterns above viable at all.

Just-in-time tool loading depends on a stable prefix in front of the tool-search machinery. Multi-turn agent loops depend on the prefix being stable across turns. File-system offloading depends on the reference (path, URL, ID) being stable so the cache survives across reads.

KV cache discipline is not a separate concern from context engineering; it is the lower-level mechanic that decides whether the upper-level moves pay for themselves.

Keep errors in context

This one is short and counterintuitive enough to be its own beat.

The instinct, when an agent makes a tool call that fails, is to remove the failure from the trace before the next turn. Cleaner history, less noise. Every agent framework ships with the option to do exactly this, and most teams turn it on by default.

Manus’s argument against:

In multi-step tasks, failure is not the exception; it’s part of the loop. When the model sees a failed action followed by the resulting observation or stack trace, it implicitly updates its internal beliefs. This shifts its prior away from similar actions and reduces the chance of repeating the same mistake.

The observation is empirical. Agents that see their failures recover from them. Agents whose failures are silently scrubbed keep retrying the same wrong move. The failed action and its observation are the training data for the rest of the run. Wiping them is throwing the gradient away.

The MCP series argued that errors are recovery instructions for the agent. The tool’s error message should explain what failed, why, and what to try next.

This is the same rule, viewed from the caller’s side. Keep the errors. Trim the duplicates. Don’t pretend the failed call never happened.

The only error class worth scrubbing is the kind that contaminates rather than informs.

Malformed tool outputs that leak into the model’s parsing. Retry loops producing no new information after the first try. Known-spurious environmental noise (transient network errors that resolved on retry). Everything else is signal.

Build the loop

The list above is a menu, not a recipe. The recipe is the same one the MCP post landed on, restated for context: prototype, evaluate, collaborate with the agent on improvements.

Instrument first. The dashboard for context engineering has four numbers:

  • KV cache hit rate, as cache_read_input_tokens / (cache_read + cache_creation). Alert on any sustained drop. This is the one regression with no model-side explanation.
  • Tokens per task, broken down into system prompt, tool definitions, conversation history, tool observations. Per-component. If the breakdown surprises you, the surprise is the bug.
  • Context utilization, as percent of window used at end of a typical task. If this climbs over weeks without a corresponding feature increase, something is leaking.
  • Completion accuracy per task class, as your existing eval. The other three are levers. This is the outcome.

Evaluate with real tasks. Five to ten realistic prompts per workflow, run programmatically, captured into a comparable table. Same shape as the MCP-post advice, same reason: the alternative is tuning by vibes.

Hand the transcripts back. Feed full traces and the current prompt to a frontier model alongside the metrics. Ask what it would change about the prompt, the tool definitions, the summarization step, the file-system patterns.

The agent-rewrite loop is consistently better at spotting redundant tools, ambiguous tool descriptions, and over-eager loading than any single human pass. Anthropic reports this is where their largest gains on the SWE-bench numbers people quote at them came from.

The split is the same the MCP post drew. Tests catch the regressions you already know about. The agent-driven rewrite finds the ones you don’t. They are complementary. Build both.

The window is the prompt

The MCP post landed on the tool list as a prompt. This post lands on the same shape one level out: the whole context window is the prompt, and engineering it is the job.

That framing collapses most of the surface debates. Long-context support versus RAG isn’t a real argument once you accept the window is a budget. Both are selection moves, traded against each other depending on corpus shape.

Multi-agent versus single-threaded isn’t a real argument once you accept isolation is a tool with a token cost. Both are right for different work shapes. Memory versus context isn’t even a real debate. Durable memory is just write that outlives the session.

What remains, once the surface arguments are cleared, is the operational job. Decide what goes in the window. Decide what stays out.

Decide what gets compressed, offloaded, or pulled in on demand. Hold the KV cache stable while you do it. Keep the errors. Recite the goal. Measure the result.

None of this is dramatic. None of it is novel. All of it is what separates an agent that works in a demo from one that works in production, and the gap between the two is wider than the model card admits.

The next post in this series, less memory, more context, takes the same logic into the memory layer.

The cognitive taxonomy (episodic, semantic, procedural) made bolting on whole memory systems feel principled, but most of what those systems were solving has been dissolved by the moves above. What survives is narrower and stranger than the taxonomy implied.

The post after that, multi-agent is where context goes to die, runs the same argument one more level out.

The orchestration patterns that ship are mostly single-threaded with disciplined sub-agent isolation, and “multi-agent” turns out to be a context-budget tool more than a parallelism tool.

The MCP companions, MCP is not the problem and how to build an MCP server agents will actually use, are the tributary on one slice of the window. The tool server is part of the prompt. The whole series is pieces of the same problem.

Frequently asked questions

What is context engineering?

Context engineering is the discipline of curating the tokens an LLM sees during inference: system prompt, tool definitions, retrieved documents, conversation history, tool observations, memory, scratchpads, and any other state that lands in the window. Anthropic frames it as the search for “the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome.” It replaces “prompt engineering” because production agents are not single prompts but multi-turn loops where almost every interesting decision is about what to include or exclude from the window over time.

How is context engineering different from prompt engineering?

Prompt engineering is about a single static prompt. Context engineering is about the window’s contents across a multi-turn loop, where most of what the model reads was not authored by a human but accumulated by the agent itself: tool outputs, retrieved chunks, prior turns, summaries. The vocabulary shift reflects the workload shift. The actual job is now allocation across an evolving window, not careful prose authoring against a single one.

Why do long-context models still struggle past their advertised window?

Because attention is non-linear in length. MRCR v2 benchmark data from May 2026 shows Claude Opus 4.6 (the long-context leader) falling from 91.9% at 256K to 78.3% at 1M, Claude Opus 4.7 collapsing from 59.2% to 32.2%, and Gemini 3 Pro from 77% at 128K to 26.3% at 1M. Lost in the Middle shows a 20-point accuracy swing based on where the answer sits in the context. The advertised window is a structural maximum, not a usable one.

What are the four moves of context engineering?

Lance Martin’s framework: write (push state out to filesystem, scratchpads, memory), select (pull state in on demand via retrieval, tool search, lazy loading), compress (replace with smaller representation via summarization), isolate (run separate contexts via sub-agents or sandboxed execution). Every working agent stack composes all four. They answer different questions and address different failures.

What are the four failure modes of long context?

Drew Breunig’s taxonomy. Context poisoning (a hallucination gets repeatedly referenced, like DeepMind’s Gemini 2.5 hallucinating Pokémon goals). Context distraction (the window grows so long the model over-attends recent history and stops drawing on training; appears past about 100K). Context confusion (extra tools generate worse answers; degradation correlates with tool count). Context clash (new information conflicts with information already present; the sharded-prompts study measured a 39% accuracy drop).

When should I use sub-agents?

When the work is genuinely parallelizable and additive. Anthropic’s multi-agent research system outperformed single-agent Opus 4 by 90.2% on research tasks, at 15x token cost. Cognition’s “Don’t build multi-agents” argues against them for coding, where artifacts must compose with each other and a summary loses the decisions that made them consistent. The decision is shape, not preference.

What is the KV cache and why does it matter?

The KV cache stores prefilled attention computations for the prompt prefix; when the same prefix appears in the next request, the model reuses the computation. Cached input is 10x cheaper than uncached on Claude Sonnet ($0.30/MTok vs $3/MTok). Manus calls cache hit rate “the single most important metric for a production-stage AI agent.” Common ways to poison the cache: dynamic timestamps in system prompts, non-deterministic JSON serialization, adding or removing tools mid-session.

How should I paginate or trim tool responses to fit the context window?

Cut at a token threshold, not a record count, and return references rather than payloads where possible. A 50K-token web page should compress to a 500-token summary plus the URL; the agent fetches the URL again if it needs the rest. For large datasets, filter inside a sandbox and surface only the rows that matter; Anthropic measured a Drive-to-Salesforce sync collapsing from 150K to 2K tokens using this pattern. Always include has_more and a cursor on paginated responses.

Should I scrub failed tool calls from the agent’s history?

No, almost never. Manus’s empirical finding: agents that see their failures recover from them, and agents whose failures get silently scrubbed keep retrying the same wrong move. The failed action and its observation are training data for the rest of the run. The only failures worth removing are the kind that contaminate rather than inform: malformed tool outputs, transient noise that resolved on retry, retry storms producing no new information.

How do I evaluate whether my context engineering is working?

Instrument four numbers: KV cache hit rate, tokens-per-task broken down by component, context utilization at end of typical tasks, and completion accuracy per task class. The first three are levers; the last is the outcome. Run five to ten realistic prompts programmatically. Then feed full transcripts back to a frontier model alongside the current prompt and ask what it would change. The agent-driven rewrite consistently finds redundant tools, ambiguous descriptions, and over-eager loading that hand passes miss.

What’s the relationship between context engineering and memory systems?

Durable memory (long-term files, vector stores of past sessions, profile data) is just the Write move applied to information that has to outlive a single session. Most of “agent memory” as a category is solving problems that better context engineering, longer cached contexts, and offloading to a filesystem have already dissolved. Reach for a dedicated memory system when you need state across sessions, not as a substitute for in-session context discipline.