The action primitive: from chain of thought to code

Every agent paradigm is one answer to the same question. What does the model emit at each turn? Get clear on the action and the harness almost designs itself.

Open two agent papers from the last three years and put their loops side by side. The thoughts are different. The tools are different. The framings rhyme but never quite line up. One thing genuinely changes between them, and once you see it you’ll stop being surprised every six months: what the model writes at each turn.

Call that the action primitive. Token stream out, environment reaction in, repeat. The shape of those tokens decides what the rest of the harness has to do for a living. Widen the primitive and components fall away. Narrow it and you bolt new ones on to compensate.

This post walks the lineage in that order, asking one question at each step. What is the model emitting now, and what is the harness now off the hook for?

Start with the prompt: chain of thought

The cleanest place to feel the action primitive is the very first move that widened it, which wasn’t an action at all. It was permission to write more.

Hand a 100B-param model a math word problem with the answer slot at the end. Watch it skip ahead and guess. The same autoregressive structure that makes the model fluent makes it confident before it has earned the answer. Each token conditions the next, and once it starts saying the answer, every subsequent token is committed to defending it.

Now change one thing. Prepend a worked example where the answer is preceded by visible arithmetic.

prompt = """
Q: Roger has 5 tennis balls. He buys 2 cans, each with 3. How many?
A: Roger started with 5. Two cans of 3 is 6. 5 + 6 = 11. Answer: 11.

Q: {user_question}
A:"""

That’s it. No fine-tuning. No new tools. Wei et al. (2022) called this chain of thought and showed it sharply lifts multi-step reasoning above ~100B params, and often hurts below. Why? Because writing the work first means the answer is being computed against a much richer state. The scratchpad isn’t a UX feature for the reader; it’s compute the model is doing on itself.

The bigger move was conceptual. Reasoning became a behavior you elicit, not a capability the model has or doesn’t. Once you see that, every later paradigm becomes a different way to elicit a different behavior. That reframe is the whole map.

Greedy decoding is where this first breaks. Pick the most likely token at every step and a single early slip kills the chain. The fix is older than the technique it fixes. Sample many chains at non-zero temperature, take the majority vote. Wang et al. (2022) called it Self-Consistency and posted +17.9 on GSM8K. The interesting thing isn’t the number, it’s what the number means. Inference compute is now a knob you can turn against accuracy. Sample more, get more right. That door, once opened, never closes.
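The sample-then-vote idea fits in a few lines. This is a minimal sketch, not the paper's implementation: `fake_sampler` is a hypothetical stand-in for a model call at non-zero temperature, and the `Answer: <number>` format is assumed from the prompt above.

```python
import random
import re
from collections import Counter
from typing import Callable, Optional

def extract_answer(chain: str) -> Optional[str]:
    """Pull the final 'Answer: <number>' out of one sampled chain."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None

def self_consistency(sample_chain: Callable[[str], str], question: str,
                     n_samples: int = 10) -> Optional[str]:
    """Sample n chains and return the most common final answer."""
    answers = [extract_answer(sample_chain(question)) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical sampler: stands in for an LLM call, right about 70% of the time.
_rng = random.Random(0)

def fake_sampler(question: str) -> str:
    if _rng.random() < 0.7:
        return "Roger started with 5. Two cans of 3 is 6. 5 + 6 = 11. Answer: 11."
    return "Roger has 5 and buys 3. 5 + 3 = 8. Answer: 8."

print(self_consistency(fake_sampler, "Roger has 5 tennis balls..."))
```

Raising `n_samples` is the knob: each extra chain costs one more call and makes the vote harder to flip.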

Add the world: ReAct

Ask a CoT-only agent to do arithmetic without a calculator. Watch it think for two paragraphs, assert that 17 × 23 = 408 (it is 391), and explain why with great confidence. The reasoning trace and the world are in different rooms. The model has no way to check its work because its work isn’t grounded in anything.

The fix is the most obvious thing in retrospect. Let the model alternate. Write a thought. Take an action. Read what the world says back. Loop.

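state = initial_prompt
done = False
while not done:
    # Ask for the thought first, then the action conditioned on that thought.
    thought = llm.complete(state + "\nThought:")
    action = llm.complete(state + f"\nThought: {thought}\nAction:")
    done = is_terminal(action)
    # A terminal action carries the answer, so it needs no tool call.
    observation = "" if done else tool.run(action)
    state += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"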

That loop is ReAct. It is also, almost word for word, what most production tool-use agents in 2026 still run under the hood. The original paper posted +34 on ALFWorld over imitation+RL baselines and +10 on WebShop, no task-specific training, and those gaps held up in practice.

What’s actually doing the work? The thoughts give the model a place to plan and track state. The actions ground every plan in a real signal. CoT-only agents hallucinate because they have no way to be wrong on purpose. Action-only agents are myopic because they can’t pause to think. Interleaving gives you both.

There are two costs and you’ll meet both. The context grows linearly with steps, and because every step replays the whole trace, the total tokens you pay for grow roughly quadratically. A ten-step trajectory becomes tens of thousands of tokens. And every step is sequential because each action waits on a round trip.
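A quick back-of-envelope makes the token cost concrete. The numbers here are assumptions picked for illustration (a 1,000-token prompt and 800 tokens of thought, action, and observation per step), not measurements from any benchmark.

```python
from typing import Tuple

def trace_token_cost(steps: int, prompt_tokens: int, tokens_per_step: int) -> Tuple[int, int]:
    """Return (final context size, total input tokens read) for a replayed trace."""
    total_input = 0
    context = prompt_tokens
    for _ in range(steps):
        total_input += context      # the whole trace so far is re-read this step
        context += tokens_per_step  # thought + action + observation get appended
    return context, total_input

final_context, total_input = trace_token_cost(steps=10, prompt_tokens=1_000, tokens_per_step=800)
print(final_context, total_input)  # prints: 9000 46000
```

The context itself only reaches 9,000 tokens, but the model has read 46,000 by the time it gets there. Prompt caching can discount the repeated prefix, though the replay is still part of the bill.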

There is also one production failure that dominates the others. The model invents tools that don’t exist. Analysis of one 200-task benchmark found that 90.8% of retried calls didn’t match any real tool. Hallucinated names. Misspelled arguments. Schemas the model imagined into being. The model is best-guessing against every API it has ever seen, and that set is bigger than your tool list.

The structural fix is constrained decoding or function-calling APIs that make malformed actions literally impossible to emit. Every serious tool-use stack you’ve ever called wraps a ReAct loop in a JSON schema validator, and now you know why.
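Constrained decoding lives inside the sampler, so it is hard to show in a few lines. The cheaper cousin is validating every action against the tool list before running it, sketched below. The `TOOLS` registry and the `{"tool": ..., "args": ...}` action shape are assumptions for illustration, not any vendor's function-calling format.

```python
import json
from typing import Tuple

# Hypothetical registry: tool name -> required argument names and their types.
TOOLS = {
    "search": {"query": str},
    "calculator": {"expression": str},
}

def validate_action(raw: str) -> Tuple[bool, str]:
    """Check a model-emitted JSON action against the registry before running it."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return False, "action must be a JSON object"
    name = action.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        return False, f"unknown tool {name!r}; available: {sorted(TOOLS)}"
    args = action.get("args", {})
    if not isinstance(args, dict):
        return False, "args must be a JSON object"
    schema = TOOLS[name]
    missing, extra = schema.keys() - args.keys(), args.keys() - schema.keys()
    if missing or extra:
        return False, f"bad arguments: missing={sorted(missing)}, unexpected={sorted(extra)}"
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return False, f"argument {key!r} must be {expected_type.__name__}"
    return True, "ok"

print(validate_action('{"tool": "calculator", "args": {"expression": "17 * 23"}}'))
print(validate_action('{"tool": "wolfram_alpha", "args": {"q": "17 * 23"}}'))
```

The error string is worth as much as the boolean: feeding it back as the observation gives the model a chance to correct the call instead of retrying the same hallucinated name.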

Carry lessons across trials: Reflexion

ReAct gives the agent a scratchpad inside one run. What does it have between runs? Nothing. Fail the same task three times in a row and you fail the same way three times in a row.

The move is again obvious once you’ve seen the pattern. After a failed run, ask the model what went wrong. Get a paragraph back. Prepend that paragraph to the next attempt. The weights don’t change, but the prompt does.

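def reflexion(task, max_trials, threshold):
    memory = []  # verbal lessons carried across trials
    best_reward, best_trajectory = float("-inf"), None
    for trial in range(max_trials):
        trajectory = react_agent.run(task, memory=memory)
        reward = evaluator(trajectory, task)
        if reward > best_reward:
            best_reward, best_trajectory = reward, trajectory
        if reward >= threshold:
            return trajectory
        lesson = reflect_model.complete(trajectory, reward, memory)
        memory.append(lesson)
    return best_trajectory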

Shinn et al. (2023) called this Reflexion and framed it as verbal reinforcement learning. The framing is correct. Trial outcomes carry information; language is a way to encode that information without retraining. HumanEval pass@1 went from 80 to 91 with GPT-4. ALFWorld +22, HotpotQA +20.

Here is the part you need to internalize before you ship this. Reflexion works because of a triangle: Actor (the policy), Evaluator (the signal), Self-Reflection (the lesson writer). Pull the Evaluator out and replace it with another LLM grading the first, and the whole thing collapses. One fallible model judging another fallible model produces lessons that compound errors instead of correcting them.

The original paper used programmatic rewards: unit tests, environment-state checks, reference answers. Those are the conditions under which the numbers above hold. Reach for Reflexion only when you have a real signal. If your evaluator is just another prompt, you’ve built a hall of mirrors.
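A "real signal" can be as plain as a handful of input/output checks. The sketch below assumes the task is writing a Python function and that the reward is the fraction of cases passed; `buggy_abs` is a made-up candidate for illustration.

```python
from typing import Any, Callable, Sequence, Tuple

def unit_test_reward(candidate: Callable[..., Any],
                     cases: Sequence[Tuple[tuple, Any]]) -> float:
    """Fraction of (args, expected) cases the candidate passes; exceptions count as failures."""
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash is just a failed case
    return passed / len(cases)

# Hypothetical candidate produced by the actor: forgets to negate negatives.
def buggy_abs(x):
    return x

cases = [((-2,), 2), ((3,), 3), ((0,), 0)]
print(unit_test_reward(buggy_abs, cases))
print(unit_test_reward(abs, cases))
```

Nothing in this evaluator is a prompt, which is the point: the reflection step gets a number it cannot argue with.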

The other failure is what Drew Breunig later named context poisoning. A bad lesson, once written into memory, gets prepended forever and steers every subsequent run further off course. Sliding-window memory, or pruning by evaluator confidence, is how you live with this in production.
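Both mitigations fit in one small class. This is a sketch under two assumptions: the evaluator can attach a confidence in [0, 1] to each lesson, and a fixed-size window is an acceptable forgetting policy. The class and field names are illustrative.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Lesson:
    text: str
    confidence: float  # assumed to come from the evaluator, in [0, 1]

class LessonMemory:
    """Keeps only the most recent `window` lessons that cleared a confidence bar."""

    def __init__(self, window: int = 3, min_confidence: float = 0.5):
        self.lessons = deque(maxlen=window)  # old lessons fall off the left end
        self.min_confidence = min_confidence

    def add(self, text: str, confidence: float) -> bool:
        """Store the lesson if it is trusted enough; report whether it was kept."""
        if confidence < self.min_confidence:
            return False
        self.lessons.append(Lesson(text, confidence))
        return True

    def as_prompt(self) -> str:
        return "\n".join(f"- {lesson.text}" for lesson in self.lessons)

memory = LessonMemory(window=2)
memory.add("Check the unit of the answer before finishing.", 0.9)
memory.add("Never use the search tool.", 0.2)  # low confidence, dropped
memory.add("Re-read the task after each failed test.", 0.8)
memory.add("Prefer the calculator for arithmetic.", 0.7)
print(memory.as_prompt())
```

A window bounds how long a bad lesson can steer later runs, and the confidence bar keeps the weakest ones out of memory in the first place.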

The trap of self-critique

It is very tempting to drop the Evaluator and the separation entirely. One model. Generate. Critique. Refine. Loop.

def self_refine(task, max_iters):
    output = llm.generate(task)
    for _ in range(max_iters):
        feedback = llm.critique(task, output)  # the same model grades its own output
        if feedback.is_satisfied:
            break
        output = llm.refine(task, output, feedback)
    return output

Self-Refine made this clean and reported ~20 absolute points of improvement across seven tasks. For a moment the technique looked free.

Then Huang et al. titled their reply “Large Language Models Cannot Self-Correct Reasoning Yet” and the field had to reckon with what was actually happening. Without an external oracle, intrinsic self-correction degrades reasoning. Tyen et al. pinned down why. Models can fix a reasoning error when shown where it is. They cannot reliably find the error themselves. Valmeekam et al. traced apparent gains in earlier papers to the correct answer leaking into evaluation top-k.

Single-model self-critique is the most-cited and least-reliable pattern in this entire literature. Hold that carefully before you build on top of it.

The production version of this insight is everywhere once you look. Anthropic’s long-running coding harness puts planner, generator, and evaluator in three different contexts. Aider’s architect-plus-editor split puts a reasoning model in charge of describing the change in prose and a separate editor model in charge of turning that prose into diffs. That pairing scored 85% on Aider’s code editing benchmark with an o1-preview architect plus a DeepSeek editor, beating any solo configuration. Asymmetric criticism catches errors. Symmetric introspection mostly reproduces them.

Decide up front: plan-then-execute

ReAct decides one step at a time. Great when the task is exploratory. Wasteful when the task has a known shape, because the observation log compounds and the agent could have planned the whole thing in one shot.

The smallest move in this direction is almost embarrassing. Prepend “Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step.” That prompt alone, in Plan-and-Solve, closed most of the zero-shot to few-shot gap on math reasoning. One reframing competitive with eight carefully chosen exemplars. If you remember nothing else from this section, remember that prompts that name the meta-strategy can be as valuable as prompts that demonstrate it.

The full version requires more structure. BabyAGI and AutoGPT tried it. A planner LLM emits subtasks. An executor runs the next one. Memory stores results. Replan. They both became the canonical illustrations of how this pattern fails. The agents wandered, hallucinated subtasks, exhausted budgets, chased dead ends.

The lesson isn’t that planning is bad. The lesson is that everything compounded. Fallible planner. Fallible executor. Fuzzy memory retrieval. No reliable success signal anywhere on the path. Plan-first amplifies whatever errors live in your pipeline.

ReWOO is the version that survives contact with reality. The move is small and clever. Make the planner’s plan not depend on tool observations.

plan = planner.generate(task)
# #E1 = Wikipedia[Albert Einstein]
# #E2 = Wikipedia[#E1.birthplace]

evidence = {}
for step in plan:
    args = substitute(step.args, evidence)
    evidence[step.var] = tool[step.name](args)

answer = solver.generate(task, plan, evidence)

The Planner writes everything up front using placeholders. The Worker fills placeholders. The Solver assembles the final answer from plan plus evidence. The Planner never sees a single tool output, which means the model isn’t paying tokens for the observation log on every step. 5x token efficiency and +4 accuracy on HotpotQA over ReAct.
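The Worker's job is mostly string substitution. Here is a minimal sketch of it, assuming plan steps arrive as `(variable, tool name, argument string)` tuples and that placeholders look like `#E1`. The two toy tools are hypothetical, and the attribute-style `#E1.birthplace` syntax from the plan above is left out to keep the example short.

```python
import re
from typing import Callable, Dict, List, Tuple

PLACEHOLDER = re.compile(r"#E\d+")

def substitute(args: str, evidence: Dict[str, str]) -> str:
    """Replace each #E<n> placeholder with the evidence already gathered for it."""
    return PLACEHOLDER.sub(lambda m: str(evidence[m.group(0)]), args)

def run_plan(plan: List[Tuple[str, str, str]],
             tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Execute steps in order; the planner is never called again."""
    evidence: Dict[str, str] = {}
    for var, tool_name, args in plan:
        evidence[var] = tools[tool_name](substitute(args, evidence))
    return evidence

# Hypothetical toy tools standing in for real lookups.
tools = {
    "Lookup": lambda q: {"Albert Einstein birthplace": "Ulm"}.get(q, "unknown"),
    "Describe": lambda q: f"facts about {q}",
}
plan = [
    ("#E1", "Lookup", "Albert Einstein birthplace"),
    ("#E2", "Describe", "#E1"),
]
print(run_plan(plan, tools))
```

A step that references a placeholder no earlier step produced raises `KeyError`, which is usually what you want: a malformed plan should fail loudly before any tool runs on garbage.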

The cost is real and you should feel it. If step one’s observation should change the rest of the plan, ReWOO will execute the wrong plan to the end. Sometimes the Solver catches it. Often it doesn’t. Pick this primitive only when the task’s shape is known up front and observations rarely invalidate later steps.

LLMCompiler pushes the same idea to a directed graph. The Planner emits a DAG of function calls with explicit dependencies. The Task Fetching Unit dispatches every call whose dependencies are met, in parallel. The paper reports up to 3.7x latency speedup and up to 6.7x cost savings over ReAct, with up to about +9 accuracy. Notably, it is also up to 1.35x faster than the same model using OpenAI’s native parallel function calling at similar accuracy, because the explicit dependency graph dispatches more aggressively than the model’s implicit one.
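Dependency-driven dispatch can be sketched with a thread pool. This is a simplified wave-based version, not LLMCompiler's Task Fetching Unit: it waits for a whole wave to finish before launching the next one, where the real thing dispatches each call the moment its own dependencies resolve. The task format is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple

Task = Tuple[Callable[..., Any], List[str]]  # (function, names of tasks it depends on)

def run_dag(tasks: Dict[str, Task], max_workers: int = 4) -> Dict[str, Any]:
    """Run every task whose dependencies are satisfied, one parallel wave at a time."""
    results: Dict[str, Any] = {}
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = [name for name, (_, deps) in pending.items()
                     if all(dep in results for dep in deps)]
            if not ready:
                raise ValueError(f"cycle or missing dependency among: {sorted(pending)}")
            # Results of dependencies are passed as positional arguments, in order.
            futures = {name: pool.submit(pending[name][0],
                                         *[results[dep] for dep in pending[name][1]])
                       for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
                del pending[name]
    return results

tasks = {
    "a": (lambda: 2, []),
    "b": (lambda: 3, []),
    "c": (lambda a, b: a * b, ["a", "b"]),  # waits for both a and b
}
print(run_dag(tasks))
```

Here `a` and `b` run side by side and `c` starts once both are done. A ReAct loop would have taken three sequential round trips for the same work.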

Plan-first earns its place exactly when the task decomposes into independent or partially independent moves. If the task is genuinely iterative, the DAG collapses to a chain and you’re paying for machinery you didn’t use.

Widen the action: CodeAct

Stop and stack what we have. The harness has grown: CoT prompts, tool schemas, JSON validators, observation logs, optional plans, optional reflections. Every one of those components is doing work the action primitive cannot do on its own.

Now widen the primitive to Python.

state = initial_prompt
sandbox = PersistentPythonSandbox()
done = False
while not done:
    code = llm.complete(state + "\nCode:")
    stdout, stderr, value = sandbox.run(code)
    state += f"\nCode:\n{code}\nObservation:\n{stdout}\n{stderr}\nValue: {value}"
    done = is_terminal(code)  # e.g. the model called a final-answer helper

Watch components fall away.

The agent can import libraries, define helpers, loop, branch, store intermediate values across turns, and recover from a traceback the way a human would. Tool composition that used to take three sequential ReAct turns becomes one Python block. The JSON schema validator is now redundant because the language itself imposes structure. Many plans become unnecessary because the model can encode the plan in control flow.
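The "store intermediate values across turns" and "recover from a traceback" parts come down to one shared namespace plus captured output. The sketch below shows only that mechanism. It is emphatically not a sandbox: `exec` in-process gives the code full access to your machine, so a real deployment would put this behind a container or VM. The class name is illustrative, and it returns only stdout and stderr.

```python
import contextlib
import io
import traceback
from typing import Tuple

class PersistentNamespace:
    """Runs code blocks against one shared namespace. NOT a security boundary."""

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> Tuple[str, str]:
        """Execute one block; return (stdout, stderr), with any traceback in stderr."""
        stdout, stderr = io.StringIO(), io.StringIO()
        with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
            try:
                exec(code, self.namespace)
            except Exception:
                # The traceback becomes the observation the model reads next turn.
                stderr.write(traceback.format_exc())
        return stdout.getvalue(), stderr.getvalue()

box = PersistentNamespace()
box.run("total = 17 * 23")            # turn 1: define a value
out, err = box.run("print(total)")    # turn 2: the value is still there
print(out.strip())
out, err = box.run("print(totl)")     # turn 3: a typo comes back as a traceback
print(err.strip().splitlines()[-1])
```

The traceback in turn 3 is the useful part. It gives the model the exact line and exception, which is a richer error signal than a failed tool call usually returns.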

Wang et al. (2024) introduced this as CodeAct and the numbers are direct. Up to 20 absolute points higher success rate across 17 LLMs on API-Bank and M3ToolEval, with up to 30% fewer turns. The gains are largest on composition tasks, exactly where ReAct burned the most tokens.

Why does it work? Training distribution. The model has seen millions of lines of real Python and a few thousand synthetic tool-call schemas. “Import a library, call its function, handle the exception” is internalized at a structural level. “Emit a JSON object with these specific keys for this specific server” is something the model learned in the last fine-tune pass. For the same operation, the Python version is more reliable because the model has had more practice with it.

Look at one concrete turn:

# Two API calls and a transformation, one model step.
import requests, pandas as pd
events = requests.get("https://api.example.com/calendar?range=week").json()
slots = pd.DataFrame(events).query("status == 'free' and duration_min >= 60")
slots = slots[["start", "attendees"]].sort_values("start").head(5)
print(slots.to_markdown(index=False))

A ReAct trajectory would have taken three or four turns and dragged the full observation log forward on each one. Feel the compression.

The production evidence is now overwhelming. Hugging Face smolagents CodeAgent ranked #1 on GAIA validation at 44.2% with GPT-4o and no fine-tuning, beating the same model’s ToolCallingAgent. Anthropic’s “Code execution with MCP” writeup describes a Google Drive to Salesforce workflow that went from 150,000 tokens to 2,000, a 98.7% reduction, by exposing MCP servers as TypeScript modules instead of tool calls. Cloudflare Code Mode reports about a thousand tokens of context cost to expose 2,500 MCP endpoints as a TypeScript API. Manus runs every task inside an isolated VM and treats Bash, the filesystem, and Python as first-class actions. Claude Code, Cursor Composer, Mario Zechner’s Pi all converged on the same shape. Read, Write, Edit, Bash. Sometimes a system prompt under a page.

One thing keeps tugging at the back of the room. A wrong step kills a chain. Why aren’t we keeping alternatives alive?

The honest answer is that you can, and people did. Tree of Thoughts proposes K thoughts per step, scores each with a value prompt, expands the best b, backtracks on dead ends. LATS unifies ToT, ReAct, and Reflexion under MCTS. RAP runs MCTS with the model acting as its own world model, and lets LLaMA-33B with search beat GPT-4 with raw CoT on plan generation. Agent Q closes the loop by training: guided MCTS generates preference data, DPO fine-tunes, and LLaMA-3 70B went from 18.6 to 81.7 on real-world booking after a day of data collection.
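The propose, score, keep-the-best-b loop is easier to see on a toy problem. The sketch below is the breadth-first flavor of the idea: there is no explicit backtracking, and alternatives survive only by staying in the beam. `propose`, `value`, and `is_goal` would be LLM prompts in practice; here they are hypothetical arithmetic stand-ins.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # the path of numbers visited so far

def tree_of_thoughts(root: State,
                     propose: Callable[[State, int], List[State]],
                     value: Callable[[State], float],
                     is_goal: Callable[[State], bool],
                     k: int = 3, beam: int = 2, max_depth: int = 4) -> Optional[State]:
    """Propose k thoughts per state, keep the best `beam` by value, repeat."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state, k)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        if not candidates:
            return None
        frontier = sorted(candidates, key=value, reverse=True)[:beam]
    return None

TARGET = 10

def propose(path: State, k: int) -> List[State]:
    n = path[-1]
    return [path + (m,) for m in [n + 1, n * 2, n + 3][:k]]  # toy "thoughts"

def value(path: State) -> float:
    return -abs(TARGET - path[-1])  # toy value function: closer to the target is better

def is_goal(path: State) -> bool:
    return path[-1] == TARGET

print(tree_of_thoughts((1,), propose, value, is_goal))  # prints: (1, 4, 7, 10)
```

The path that wins goes through 7, which was the second-best state in its layer. A greedy chain would have committed to 8 and missed the target on the next move.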

The reason you almost never see these in production is the same reason they work in papers. Search calls the LM five to twenty times per step. Economically unviable at scale. And the entire technique is gated on a calibrated evaluator. If your value prompt is noisy, search amplifies the noise.

Where this lineage went is more interesting than where it stayed. The o-series, R1, and the rest of the 2024-to-2026 reasoning model wave internalized search at training time. Their inference is one long sampled chain, not a tree expansion. The model learned to do the backtracking inside its own chain of thought, supervised by reward. The next section is about that.

When the prompt fades: o-series, R1, RLVR

Everything above is an inference-time trick. Late 2024 onward, the action moved into training.

Process Reward Models showed step-level supervision beats outcome-level supervision for math. PRM800K (800,000 step labels) gave the field a reusable signal for grading partial chains. Vote weighted by per-step verifier scores, not just by final answer. That’s the bridge between Self-Consistency and what o1 does now.
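Verifier-weighted voting differs from plain majority voting by one line. In this sketch each chain's weight is the product of its per-step scores, which is one common aggregation choice. The scores themselves are made-up numbers standing in for a PRM's output.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Chain = Tuple[str, List[float]]  # (final answer, per-step verifier scores in (0, 1])

def majority_vote(chains: List[Chain]) -> str:
    return Counter(answer for answer, _ in chains).most_common(1)[0][0]

def prm_weighted_vote(chains: List[Chain]) -> str:
    """Each chain votes for its answer with weight = product of its step scores."""
    weights: Dict[str, float] = defaultdict(float)
    for answer, step_scores in chains:
        weights[answer] += math.prod(step_scores)
    return max(weights, key=weights.get)

# Hypothetical scores: two chains agree on "8" but each contains a dubious step.
chains = [
    ("11", [0.9, 0.9, 0.9]),
    ("8", [0.9, 0.2, 0.9]),
    ("8", [0.8, 0.3, 0.9]),
]
print(majority_vote(chains))      # prints: 8
print(prm_weighted_vote(chains))  # prints: 11
```

Two chains with a shaky middle step outvote one clean chain on headcount, but not once each vote is discounted by how much the verifier trusts its steps.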

STaR and Quiet-STaR showed you can fold the chains back into the weights. Sample chains. Keep the ones with correct final answers. Fine-tune. Repeat. Quiet-STaR generalizes to per-token rationales during continued pretraining and lifted GSM8K zero-shot from 5.9 to 10.9 on generic web data, no math-specific corpus.
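The sample-and-keep step is a simple filter. The sketch below shows one round of data collection under stated assumptions: `sample_chain` returns a `(rationale, answer)` pair, correctness is exact match on the final answer, and the fine-tuning call itself is out of scope. STaR's rationalization step, which retries failed questions with the answer as a hint, is omitted.

```python
from typing import Callable, Dict, List, Tuple

def star_round(sample_chain: Callable[[str], Tuple[str, str]],
               dataset: List[Tuple[str, str]],
               samples_per_question: int = 4) -> List[Dict[str, str]]:
    """Collect one fine-tuning example per question whose sampled answer matches gold."""
    keep = []
    for question, gold in dataset:
        for _ in range(samples_per_question):
            rationale, answer = sample_chain(question)
            if answer == gold:
                keep.append({"prompt": question, "completion": rationale})
                break  # one correct rationale per question is enough for this round
    return keep

# Hypothetical scripted sampler standing in for the current model.
scripted = {
    "2+2": iter([("2+2 is 5", "5"), ("2+2 is 4", "4")]),
    "3+3": iter([("3+3 is 7", "7")] * 4),
}

def fake_sampler(question: str) -> Tuple[str, str]:
    return next(scripted[question])

print(star_round(fake_sampler, [("2+2", "4"), ("3+3", "6")]))
```

Only the rationale that reached the right answer survives into the training set. The question the model never solved contributes nothing this round, which is the gap rationalization was designed to close.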

OpenAI’s o1 and o3 made this the dominant pattern of 2024-2025. The model is reinforcement-learned to produce extended hidden chains, optimized against verifiable rewards. Inference budget now scales with the chain length the model chooses. You no longer prompt “think step by step.” The model decides how much to think and gets RL credit for thinking correctly.

On ARC-AGI, o3 at high compute hit 87.5% versus roughly 5% for GPT-4o, using 172x the compute of its own low-compute setting. That is above the 85% human baseline. Cost at low compute: $17-20 per puzzle, and about 33 million tokens across the 100-task semi-private set. Billions of tokens at high.

DeepSeek-R1-Zero showed it works without supervised fine-tuning at all. Pure RL on verifiable rewards, applied to a base model, elicits long reasoning chains and unprompted self-reflection. The R1 release that followed added a small cold-start fine-tuning stage, shipped open weights, and was replicated across the open ecosystem within weeks.

Now turn the question around. What was each harness component compensating for?

CoT prompts assume the model won’t reason without being told. The o-series produces CoT by default, often longer than your prompted version. Delete the prompt.

ReAct’s explicit thought-action interleave assumes the model won’t decompose without scaffolding. A capable reasoning model decomposes inside its own CoT. The interleave is often redundant now.

Self-Refine assumes the model won’t critique unless prompted. R1-style models do verbal reflection unprompted. Delete the critique step.

Reflexion buffers assume the model won’t remember last trial’s lesson. Persistent memory tools are starting to do this through the action surface instead of through prompt engineering.

Search assumes the model can’t explore alternatives internally. It can, now, when trained to.

Picking, in practice

Two defaults will carry you further than any taxonomy.

Reach for ReAct when the tool surface is small, the trajectory is short, and each observation genuinely shapes the next move. Wrap it in constrained decoding so the model can’t emit malformed actions. This is most production tool-use you’ll ever write.

Reach for CodeAct the moment tools start composing or the surface gets large enough that loading every schema hurts context. Pay the sandboxing tax up front, before you have agents touching production data. You will not retrofit your way out of it.

Stack signals where you can. Anywhere you can write a unit test, add a Reflexion-style outer loop with the test as evaluator. Anywhere the task is subjective (tone, style, readability), Self-Refine is genuinely useful. Anywhere the task has ground truth, give the critic a different scaffold or a different model.

Stop reaching for plan-first machinery on exploratory tasks. ReWOO is for known-shape multi-hop. LLMCompiler is for parallel, partially-independent calls. Anywhere else, plan-first amplifies whatever errors live in your pipeline.

Don’t add prompting machinery to a reasoning-bottlenecked task. Upgrade the model. The training-time paradigm has eaten most of the inference-time tricks, and new tricks have shorter and shorter half-lives.

What stays

If you take one thing from this lineage, take this. Build for today’s primitive. Stay willing to delete.

CodeAct with a current reasoning model collapses CoT prompting, explicit planning, and single-model self-critique into one loop because the model handles each internally. What survives in the harness is what the model can’t do for itself: sandboxes, file systems, persistent skills, evaluators with real signals. What dies is what the model now does on its own: prompted “think step by step,” explicit plans, symmetric self-critique.

That’s the whole game. Pick the smallest primitive that expresses your task. Watch your model release notes. Delete components as the assumptions they encoded stop being true.

The harness shrinks. That is the only architectural decision that compounds.