Writing
17 min read

Multi-agent is where context goes to die

More agents was supposed to mean more capability. Production says otherwise: every handoff between agents loses information, every parallel decision conflicts, and the system fails where context crosses. The number of agents is rarely the question worth asking.

On this page

More agents was supposed to mean more capability. Role-play frameworks. AI-company-in-a-chatroom simulations. Autonomous orchestrator-worker pipelines. A society of models outperforming a single one.

A narrow band survived contact with production. The rest turned out to be expensive scaffolding around a model that did the job alone for a tenth the cost as soon as it improved.

Multi-agent is a context-isolation pattern, not a parallelism pattern. You reach for multiple agents when one agent’s context window is the bottleneck, not when one agent’s reasoning is. Where context crosses, the system fails.

The number of agents is the wrong question. The right question is who owns what context.

Name the pattern before you build it

Someone says “let’s build a multi-agent system” and you can’t tell what they want. A pipeline of three sequential calls? A classifier routing to specialists? Two LLMs critiquing each other?

Each has wildly different stakes, and only one of those is actually multi-agent. The naive move is to pick a framework before naming the pattern, and then spend a sprint fighting an orchestration runtime when what you needed was one prompt with retrieval.

The vocabulary is the fix. Call it a workflow when LLMs run through predefined code paths. Call it an agent when the LLM directs its own process and tool use. Workflows are predictable. Agents are open-ended and necessary only when you cannot predict the steps in advance.

There are five workflow patterns worth knowing (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer) plus the open-ended agent loop. The taxonomy is Anthropic’s (Building Effective Agents); the names are now lingua franca.

Run any vague “multi-agent” request through that list and most of them collapse to a single-LLM workflow. Routing is not multi-agent. Parallelization is not multi-agent. Evaluator-optimizer is not multi-agent.

The genuinely multi-agent patterns are orchestrator-workers and full agents talking to each other, and both are expensive and narrow. Disambiguate before you architect. One sentence of vocabulary saves a quarter of misbuilt runtime.

The role-play trap

You hand the same model three personas (CEO, CTO, programmer) and let them converse about a feature. The transcript looks impressive.

Then the code doesn’t compile, the CEO contradicts the CTO, and you can’t tell whose decision was load-bearing. You ship anyway because the demo looked good. Two weeks later, production is calling.

The whole 2023 wave of role-play frameworks bet on the same hypothesis: role specialization beats one model trying to be every role. Different stacks (CAMEL, ChatDev, MetaGPT, AutoGen) tightened the protocol over time. Free-form chat became structured artifacts. Structured artifacts became typed events. The demos kept getting better.

The production take-up was thin. Microsoft now routes new projects away from AutoGen to Agent Framework, which is the receipt: the team that built the largest first-gen framework moved on.

The bottleneck was never role discipline. Each handoff between roles lost information. Each parallel decision encoded an implicit conflict that surfaced as a bug later.

A society of personas is a stand-in for not having a real division of work. If you’re about to instantiate a CEO, a CTO, and a programmer to solve a problem one Sonnet call already solves, stop.

Orchestrator-worker is for breadth, nothing else

A research question needs facts from forty sources. One agent reading them in sequence runs out of window before it can synthesize anything. The reasoning is fine; the context is the bottleneck.

Bigger windows don’t fix this. A million tokens of raw search results is a haystack, not a research artifact, and context rot kicks in well before you hit the limit.

What works is to decompose, fan out, synthesize. A lead model plans scoped subtasks. Worker copies run in parallel with isolated contexts. They return summaries, not transcripts.

def orchestrate(task):
    plan = orchestrator.plan(task)
    subagent_results = parallel_map(
        lambda subtask: subagent.run(subtask, context=None),
        plan.subtasks,
    )
    summaries = [r.summary for r in subagent_results]
    if orchestrator.satisfied(summaries):
        return orchestrator.synthesize(task, summaries)
    return orchestrate(orchestrator.refine(task, summaries))

The lift is real on the tasks this fits. A lead model with three to five subagents beats a solo model by roughly 90% on breadth-first research, at 15x the token spend (Anthropic’s writeup has the numbers). The lift tracks the spend almost linearly.

The obvious extension breaks. Writing code is sequential: later decisions depend on earlier ones, and isolated subagents make conflicting choices. Anything with shared state has the same shape, and the same failure.

Reach for orchestrator-worker when the task is breadth-first, parallelizable, and the answer is worth a 15x bill. Do not reach for it because more agents sounds like more capability.

Subagents are a context budget, not a colleague

You ask your coding agent to find every reference to a function. It runs grep 18 times, opens 40 files, follows three false leads. Half your main thread is now search noise, and you still need to do the actual edit.

The naive fix is to keep going and let the context fill. It works once. The third exploration in the same session collapses, because the prefix is already burned and the model has lost the thread.

The fix is to spawn a subagent. It gets its own window, its own restricted tool list, its own permissions. It does the verbose work and returns a paragraph. The main thread sees the paragraph, not the trajectory.

That’s the whole point. The subagent isn’t smarter than the main agent. It isn’t even a different model. It’s a context budget you’ve quarantined from your hot prefix.

Every shipping coding agent does some version of this. Claude Code’s subagents are the canonical pattern. Cline quarantines read mode from write mode. Cursor 2.0 runs speculative branches in git-worktree sandboxes. Aider pairs a prose-reasoning architect with a diff-writing editor.

Different costumes, one job: protect a context budget. None of them is “two minds thinking about the problem.” Two instances of the same model with two roles produce worse results than one instance with both contexts.

Spawn a subagent to keep your main thread clean, not to summon a colleague.

Debate works because criticism is asymmetric

You ask a model to check its own answer. Sometimes it catches the bug. Often it doubles down on the wrong one, more confidently than before.

Self-Refine is fragile, and on reasoning tasks it frequently degrades performance. Asking harder doesn’t help. “Are you sure?” produces more text and the same error, because the biases that produced the wrong answer are the same biases producing the critique.

The trick is to make the criticism asymmetric. Run two or three copies of the same model in parallel. Have each answer independently. Then let each read the others’ answers and revise. Do that for a few rounds. Vote, or take the longest-argued.

Gains of 5 to 20 absolute points on hard reasoning benchmarks (Du et al. ran the experiment cleanly; the technique is the trick). Real, robust, replicated.

The mechanism isn’t “society of minds.” Reading someone else’s answer (even a copy of your own) is enough perspective shift to break the introspective loop. The shift is the whole point. Committee voting is a sideshow.

Debate earns its cost when verification is harder than generation but easier than re-generation. Outside reasoning benchmarks it loses to cheaper alternatives. Don’t generalize it.

Single-threaded writes survive production

A multi-agent system is asked to build Flappy Bird. One subagent renders a Mario-style background. Another draws a bird in a different art style.

Locally each agent did something reasonable. Together you have an incoherent artifact, and you cannot tell at synthesis time whose decision was load-bearing.

More coordination doesn’t fix this. Pass summaries, share state, add a critic. Each fix moves the conflict to a later stage. The bug is structural. Two minds writing into the same artifact without seeing each other’s writes will diverge.

The rule that works: build single-threaded, sequential agents. Subagents answer questions (read operations). They do not make changes (write operations). One mind holds the load-bearing decisions. (Cognition’s “Don’t Build Multi-Agents” is the longer version of this argument, and worth reading.)

The apparent tension with the breadth-first research pattern is the whole point. Breadth-first research is reading with no shared writes. Coding is writing with deep shared state.

Both pictures are right about their own task shape. Ask which shape your task is before reaching for multi-agent. “Looks complex enough to deserve multi-agent” is not an answer.

Pick a framework by who owns the context, not by stars

You need to ship next month. There are forty agent frameworks on GitHub. Sort by stars and you get whatever was viral last week. That has zero correlation with whether the framework’s primary assumption matches your task.

The sort that predicts pain is by what the framework does to context ownership. Three bins, and the bin matters more than the brand.

Orchestration runtimes own context that outlives the request. The LLM acts; the runtime remembers which agent owns what. Reach for one when the work spans hours or days, when you need durable retries, or when human-in-the-loop pause-and-resume is load-bearing. Credible options: LangGraph, LlamaIndex Workflows, Inngest AgentKit. Pick by language and operational story, not feature count.

Agent-loop libraries own nothing the agent didn’t already have. They hand you the loop and stay out of the way. Reach for one when the agent loop is your hot path and you want ergonomics, not architecture. Credible options: Claude Agent SDK, OpenAI Agents SDK, Vercel AI SDK, Pydantic AI, Mastra. Pick the one whose SDK matches your provider strategy.

Compilers and role-play kits replace the model’s choices with theirs. DSPy does this honestly: it compiles signatures and modules against a metric. smolagents replaces JSON tool calls with executable Python. CrewAI does it through role-played Crews, which is the pattern the rest of this post argues against.

The framework that doesn’t appear above and probably should: none. A lot of production agents are a few hundred lines calling the provider SDK directly. Frameworks earn their weight when the harness gets non-trivial. Below that threshold they cost more than they save.

The SWE-bench climb came from harness, not agents

SWE-bench Verified climbed from 1.96% to 87.6% in two years. The natural story is “multi-agent architecture got better.” That story is wrong.

Look at the top of the leaderboard and there’s almost no elaborate orchestration. The shape is one primary agent plus disciplined subagent isolation, top to bottom. The work that moved scores was harness, not agent count.

Three moves carried most of the gain.

Build the agent its own interface. Don’t make it use the raw shell. Purpose-built commands and observation formats are worth roughly 12 points on SWE-bench alone (the SWE-agent paper made this point cleanly). Agents are a new class of user; design tools for them.

Sandbox at the VM, snapshot at the hypervisor. The two longest infrastructure projects at Cognition were both runtime, not model (Devin postmortem). Per-session isolation matters; idle-agent snapshotting matters more than you think.

Treat the cache as a first-class concern. Cached prefix runs at a tenth the cost of uncached. CLAUDE.md is kept short on purpose. /clear and /compact are used aggressively. Boris Cherny’s rule is “Cache Rules Everything Around Me”, and the rule deserves the all-caps.

The minimalist counter-position works too. Pi ships with four tools (Read, Write, Edit, Bash), a system prompt under a page, and an agent that extends itself when it needs more. Still single-primary-agent.

If your scores are losing to the leaderboard, your bottleneck is your harness or your interface. Adding roles will not save you.

88% of failures are harness, not model

You ship an agent. Demos pass. Production traffic kills it within a week.

The traces don’t point at the model. They point at silent tool failures, half-written state, and subtle context drift that nobody can reproduce locally.

Swap the model and the system still falls over. Bigger LLM, same harness, same failure. The model was never the constraint.

Pull the production failure distribution and the picture is consistent. Context Blindness 31.6%, Rogue Actions 30.3%, Silent Degradation 24.9%, Memory Corruption 8.1%, Runaway Execution 5.1% (Latitude’s playbook has the breakdown).

88% of classifiable failures are infrastructure or harness, not model quality. Silent tool-call failures are the most expensive class because no trace signal fires until much later, by which point the damage is downstream.

The structural math is unforgiving. A 1% per-step error rate becomes an 87% chance of failure by step 200. Compounding error is the whole reason long-horizon agents are hard. There’s a separate axis worth measuring: the longest task duration at which a model is 50% reliable, doubling every four to seven months (METR’s time horizons tracks it).

Long-horizon reliability is its own capability, separate from short-task accuracy. Llama 3.3 70B beats Qwen3 30B by 20pp at very-long horizon despite identical short-task scores.

Long context isn’t free either. Performance starts degrading well before the model hits the limit; a 1M window can rot at 50k (Chroma’s context rot study measured this across 18 frontier models).

Tool descriptions are an attack surface. Malicious instructions hidden inside an MCP tool description can persist across sessions and exfiltrate mcp.json credentials or SSH keys (Invariant Labs PoC). Audit what you load.

And there’s one architectural pattern that is unwinnable no matter how hardened your prompts get: private data + untrusted content + external comms. Cut a leg of that trifecta. Hardening will not save you (Simon Willison calls this the lethal trifecta).

Instrumentation, sandboxing, isolated execution, explicit forget policies, asymmetric verification. That stack is most of what production agent engineering actually is. Spend there. The model will keep improving on its own.

Build evals before the agent

You change a prompt and the demo still passes. Two weeks later, four real-customer flows have regressed in different ways. You can’t tell which change broke which flow.

Without evals, every improvement is a coin flip, and you find out which side it landed on in your incident channel.

Public benchmarks are leaderboard signal, not your eval suite. SWE-bench Pro, tau³-bench, AppWorld, METR all tell you whether the field is moving. They don’t tell you whether your agent works on your traffic. WebArena and OSWorld have been gamed to near-perfect scores; treat any leaderboard number as a starting point, not a verdict.

The suite that matters is the one you build from real failures in your real traffic. The highest-ROI improvement source isn’t a new architecture, it’s error analysis on those traces. Domain experts write prompts directly. Process rewards beat outcome rewards by roughly 18x in data efficiency. Asymmetric verification (different model, different scaffold, or programmatic check) catches what same-model self-critique cannot.

The most useful failure mode an eval suite surfaces isn’t capability, it’s unpredictability. The honest read of any agent shipping today is that you can’t tell in advance which tasks will work. Answer.AI’s Devin study landed 3 of 20; the headline was that the 17 failures were unpredictable, not that the capability was missing.

That’s what evals exist to fix. Build them before the agent, not after the regressions.

The rules that survive

Twenty rules distilled across the four posts in this series. The ones that survive contact with production.

  1. An LLM agent runs tools in a loop to achieve a goal. That’s the unit. Everything else is harness.
  2. Workflows for predictable, agents for unpredictable. Most “agent” requests are workflows in disguise.
  3. Pick the smallest action primitive that expresses your task. ReAct for small tool surfaces. CodeAct when tools start composing.
  4. Cut the harness as the model improves. Components survive that the model can’t do. Components die that it can.
  5. Multi-agent only when value justifies 15x cost AND the task is breadth-first parallelizable AND no shared writes.
  6. Subagents isolate context. Return summaries, not transcripts.
  7. KV cache hit rate is the number one production metric. Stable, append-only prefix. Mask tools, don’t swap them.
  8. Five to fifteen tools per server. If a human engineer can’t decide which tool fits, neither can the agent.
  9. Don’t expose 30 MCP tools. Expose code. Skills win over deferred-load MCP.
  10. CLI > MCP for most coding contexts.
  11. Filesystem is unlimited memory. Externalize anything that isn’t load-bearing in context.
  12. Compaction policy is mandatory for any long-running agent.
  13. Separate the generator from the evaluator. Two prompts, two models, two contexts. Same-model self-critique degrades reasoning.
  14. Preserve errors. Don’t sanitize failures out of context. The agent needs evidence to adapt.
  15. Cut a leg of the lethal trifecta. Private data, untrusted content, external comms. Any two is fine.
  16. VM-level isolation per session for any agent running untrusted code. Containers share a kernel.
  17. Evals first, everything else second. Without them you have no flywheel.
  18. Partial autonomy with sliders (Karpathy). Tesla-style FSD staging, not “fully autonomous and hope.”
  19. Treat the agent like a junior engineer. Code review is the relevant skill, not prompt engineering.
  20. The agents you don’t build are the ones that don’t fail. Default to no agent. Reach for one only when nothing simpler works.

One rule

The number of agents is the wrong question. The right question is who owns what context.

Every successful pattern in this post is a different answer to that question. Every failed pattern got the answer wrong.