The agent stack I actually ship with

How I direct the agent loop instead of letting it drive. Six tools, a dozen human gates, and one rule I never break: nothing ships unattended. Requirements, brainstorm, grill, plan, test, implement, review, verify, design, compound.

Most “how I use Claude Code” posts make it sound like the agent does the work. Mine doesn’t. The agent is the amplifier. I’m the conductor. The loop never runs unattended.

What I’ve built is a relay, not a self-driving system. I write the requirements. The agent expands them. I read the expansion. The agent grills me on what I missed. I answer every question. The agent drafts a plan. I read the plan line by line. A second model argues with the plan. I pick the winners. The agent writes the tests. I read the tests line by line. The agent implements. Two models review the diff. I read both reviews. The agent drives the browser. I supply the credentials and watch it run. I drive the feature myself. A designer eyeballs the result. I iterate.

Six tools. A dozen gates. One rule: I am in the room for every gate.

The rest of this post is what each tool does, what each gate catches, and how the flow bends to the kind of work in front of me.

Where one-shot agents fall down

Before naming the tools, name the holes. The reason this stack converged is that each piece corresponds to a failure mode I kept hitting, not to a feature I wanted to try.

Under-specified requirements. Hand the model a feature description and it will write code against the first interpretation that sounds plausible. The decision tree your prompt implied has fifty branches; the model picks one per branch without telling you which. By the time you see the diff, the wrong branches are already load-bearing.

The echo chamber. Ask Claude to write code, then ask Claude to review the code, and you have approximately one opinion. Huang et al. (2023) titled their paper “Large Language Models Cannot Self-Correct Reasoning Yet” for a reason. Symmetric introspection mostly reproduces the original error in a more confident voice.

Stale framework knowledge. Training cutoffs are the boring version of this. The sharper version is that frameworks now ship breaking changes every quarter and the model learned the previous quarter’s API. The code looks right. The import path moved. The hook was deprecated. You find out at runtime.

Claimed-done features. The agent writes the change, runs the type checker, watches a green CI, and tells you it is done. The feature it claimed to ship has a button that does nothing. The model never opened a browser. The model cannot tell the difference between “passes the tests it wrote” and “works.”

No compounding. Every new session starts roughly where the last one started. Whatever you learned the hard way last week is in a closed terminal somewhere. The agent works like a new contractor every morning.

Unattended drift. Even when the loop has all of the above, if I leave the room the agent will quietly walk it off a cliff: invent a requirement, paper over a failing test, mark a feature done after a green type check, ship code I never read. The most expensive failure mode isn’t bad output. It’s output I didn’t see.

Six failure modes. Six tools, plus a rule about the chair I do not get up from.

The host: Claude Code

I run everything inside Claude Code. Not because it is the best model under the hood any given week (it isn’t always) but because the harness is where the leverage is. Skills, hooks, MCP servers, sub-agents, slash commands, persistent CLAUDE.md, all in one tree under ~/.claude/. Every other piece of this stack is something I bolted into that tree.

The decision that matters here is plugins-over-prompts. The temptation when you start with Claude Code is to write the perfect system prompt and call it done. That works for a single task. It does not survive composition. A system prompt is one big string the model reads once; a plugin is a set of skills, agents, and commands the model reaches for when the situation calls for it. The first is a setting. The second is a vocabulary.

The rest of this post is the vocabulary I ended up with, and the points in the flow where I stop typing and start reading.

Requirements come from me

The loop starts before any agent runs. I write a short doc, by hand, that names what we are building, who it is for, and what we are explicitly not doing. Not a PRD with sections. A page, sometimes a paragraph, in plain prose. The shape I am after is: I can hand this to a sharp colleague and they would know what to build without asking me twenty questions.

This sounds obvious. It is the step everyone skips. The seductive thing about coding agents is that they will start typing the moment you say “build me X.” If I let them, the thing they build is the thing they inferred, not the thing I wanted. A two-paragraph requirement doc takes me ten minutes and saves an hour of throwing away the wrong feature.

The agent’s job for the next several steps is to expand this doc, stress-test it, and turn it into something executable. The agent’s job is not to decide what the doc says.

Brainstorm, then grill

Once the requirement exists, I run /ce:brainstorm from the compound engineering plugin. The brainstorm step is a dialogue, not a generation. Claude proposes a frame for the feature, I push back, Claude proposes a different frame, I keep pushing. The output is a requirements doc that is sharper than the one I started with, written collaboratively.

What this step is not: a place where the agent makes scope decisions on its own. Every “should we also do X” the agent surfaces is a question for me. The plugin treats it that way structurally; the doc that lands on disk only contains scope I agreed to.

Once the requirements doc is stable, I run Matt Pocock’s grill-me skill against it. It is, by file size, the smallest tool in this stack. Three sentences telling Claude to interview the user relentlessly, walk down each branch of the design tree, and resolve dependencies between decisions one at a time.

In practice it asks me between sixteen and fifty questions per session before letting me write a plan. The questions are not friendly. “Should this run on the same Lambda as X or a different one, and what’s the latency budget? If different, how do they share Y? If same, what happens to the cold start?” Most of those questions I would have skipped. Half of them I would have skipped without noticing I was skipping them.

The variant /grill-with-docs adds a CONTEXT.md file that locks vocabulary across the session. Once I have agreed that “the inbox” means the unread-message queue and not the user’s email account, the model stops asking and starts writing.

The gate at the end of this step is: every grill-me question has an answer in the doc. If a question is open, the plan that follows it will pick a default and the default will be wrong about a third of the time.
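A minimal sketch of that gate as a script. It assumes a hypothetical doc convention in which each grilled question is a line starting with `Q:` and its answer is the next non-blank line starting with `A:`. Neither grill-me nor the compound engineering plugin prescribes this format, and the function name is made up for illustration.

```python
def open_questions(doc_text: str) -> list[str]:
    """Return every 'Q:' line that lacks a non-empty 'A:' line right after it.

    Assumed convention (hypothetical, not defined by grill-me):
    'Q: ...' followed by 'A: ...' on the next non-blank line.
    """
    lines = [ln.strip() for ln in doc_text.splitlines() if ln.strip()]
    unanswered = []
    for i, line in enumerate(lines):
        if not line.startswith("Q:"):
            continue
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        answer = nxt[2:].strip() if nxt.startswith("A:") else ""
        if not answer:
            unanswered.append(line[2:].strip())
    return unanswered


doc = """
Q: Same Lambda as X or a different one?
A: A different one; they share Y through the queue.
Q: What is the latency budget?
A:
Q: What happens on cold start?
"""

# Any question listed here means the plan step should not start yet.
print(open_questions(doc))
```

An empty list is the only passing state. A script like this can say whether an answer exists, not whether it is any good; that part stays a reading job.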

The plan, read line by line

/ce:plan reads the requirements doc, the existing codebase, and the directory of past learnings, and produces a tech plan: file paths, the sequence of changes, the interfaces that need to move, the tests that need to exist. Markdown, on disk, readable in five minutes.

I read it line by line. This is non-negotiable. If I can’t read it in five minutes, the plan is the bug, not the implementation. If I read it and disagree with a decision, I rewrite that section of the plan before any code touches my repo. If I read it and don’t understand a decision, I push back until the plan explains itself or changes.

The reason I read it line by line and not just skim is that this is the cheapest place to catch a bad decision. A wrong line in the plan costs me a sentence to fix. The same wrong decision after /ce:work costs me a diff to revert, tests to rewrite, and learnings to unwind. The plan is the document where my judgment lands the hardest.

A second model argues with the plan

After I’ve read and edited the plan, I hand it to a different model in a different scaffold and ask that model to disagree with it. OpenAI’s Codex plugin for Claude Code is the cleanest way to do that without leaving the terminal. The plugin installs as a Claude Code plugin; the relevant commands are /codex:review, /codex:adversarial-review, and /codex:rescue.

/codex:adversarial-review is the one I use on plans. It hands the plan to Codex with instructions to question implementation decisions, surface tradeoffs the plan glossed over, and explicitly disagree where it would have made a different call. The output is not a rubber stamp. The output is a marked-up version of my plan with the parts that are actually weak underlined.

The reason this works is the same reason Aider’s architect-plus-editor split works: asymmetric criticism beats symmetric introspection. The plan was written by Claude under my direction. Asking Claude whether the plan is good gives me Claude’s prior, restated. Asking a different model in a different scaffold, with a system prompt that says “disagree if you would have done this differently,” gives me a second prior. The places where the two priors disagree are usually the places where the spec was genuinely ambiguous.

The output is not the gate. I am the gate. Codex flags ten things; six are reasonable, three are wrong, one is a real find I missed. My job is to triage. I don’t ask Claude to “accept the codex feedback” or auto-apply. The whole point of the second model is that I get two opinions; auto-merging them defeats the purpose.

Ground: Context7 for live framework knowledge

Threaded through every step above is the framework-knowledge problem. The model thinks pages/ is still where Next.js routes live. The model thinks useEffect is still how you fetch data, even where the framework now wants that fetch in a server component. The model is two major versions behind on whichever framework you actually shipped against. The code it writes will look right and break on import.

Context7, shipped by Upstash, is the MCP server I keep coming back to. Two tools: resolve-library-id turns “Next.js 15” into a Context7 ID, and query-docs pulls version-pinned documentation and examples into the model’s window. The catalog is curated from official sources. The fetch is on demand, not eager.

Both properties matter. The on-demand part means I am not paying tens of thousands of tokens to load docs the agent might not need. The version-pinned part means I can say query-docs for astro@5.x and get the migration guide for the version I actually run, not the one the model imagined.

The behavioral shift, once Context7 is wired up, is that the model stops guessing and starts looking things up the way I would. “Let me check the current API for X” became a thing the agent says before it writes the call. I built my own MCP heuristics post partly on what Context7 gets right: minimal verbs, stateful catalog, externally consumed, public data. It is the cleanest reference implementation in the protocol.

The gate here is implicit and worth naming: when the agent surfaces a doc snippet, I read it too, especially when the snippet contradicts something I thought I knew. Context7 isn’t a way for me to opt out of knowing the framework. It is a way for both of us to stop guessing.

Tests are an artifact I review before any code lands

Here is the step that I see almost no one talk about. After the plan exists and has been argued over, I have the agent write a test suite against the plan before any production code gets written.

The test suite is its own artifact: file names, the cases each file covers, the data each case needs, the assertions it makes. I read it line by line, the same way I read the plan. Tests I disagree with are signals that the plan is wrong, not signals that the tests are wrong. If a test enforces a behavior I didn’t intend, the spec was ambiguous; back to the plan.

Two things this catches that nothing else does.

The first is over-fitted tests. Left to itself, an agent will write tests that lock in incidental implementation details (the exact shape of a returned object, the precise wording of an error message). Reading the tests before implementation surfaces those tests as tests of the wrong thing, before they become a constraint that traps a future refactor.
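A small illustration of the difference, using a toy validator. `UploadError`, `reject_empty`, and the error code are hypothetical names invented for this sketch, not part of any tool mentioned here.

```python
class UploadError(Exception):
    """Hypothetical error type, used only for this illustration."""

    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code


def reject_empty(data: bytes) -> None:
    # Toy stand-in for real validation logic.
    if len(data) == 0:
        raise UploadError("empty_file", "The file you uploaded is empty.")


def overfitted_check() -> bool:
    # Pins the exact wording: breaks the day someone edits the copy.
    try:
        reject_empty(b"")
    except UploadError as err:
        return str(err) == "The file you uploaded is empty."
    return False


def behavioral_check() -> bool:
    # Pins the contract: an empty upload is rejected with a stable code.
    try:
        reject_empty(b"")
    except UploadError as err:
        return err.code == "empty_file"
    return False
```

Both checks pass today. Only the second one survives a copy edit to the error message, which is the property worth looking for when reading a generated suite.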

The second is under-tested edges. The plan says “support file uploads.” The agent’s first test pass covers happy-path upload. Reading the test suite, I notice there is no case for the zero-byte file, the file that fails partway through, the file with a Unicode name. Those gaps were invisible in the plan. They are obvious the moment they are a list of test files.
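Written out as test cases, those three gaps might look like the sketch below. `store_upload` is a toy handler invented for the example, and its behavior (rejecting empty and truncated uploads, normalizing names to NFC) is an assumption about what a spec could ask for, not a description of any real upload code.

```python
import unicodedata


def store_upload(name: str, data: bytes, expected_size: int) -> str:
    """Toy upload handler (hypothetical): returns the stored file name."""
    if expected_size == 0 or len(data) == 0:
        raise ValueError("empty upload")
    if len(data) != expected_size:
        raise OSError("upload ended early")
    # Normalize so visually identical Unicode names map to one stored name.
    return unicodedata.normalize("NFC", name)


def test_zero_byte_file_is_rejected():
    try:
        store_upload("empty.txt", b"", expected_size=0)
    except ValueError:
        return
    raise AssertionError("zero-byte upload was accepted")


def test_partial_upload_is_rejected():
    try:
        store_upload("big.bin", b"abc", expected_size=10)
    except OSError:
        return
    raise AssertionError("truncated upload was accepted")


def test_unicode_name_is_normalized():
    decomposed = "re\u0301sume\u0301.pdf"  # 'e' followed by a combining acute accent
    composed = "r\u00e9sum\u00e9.pdf"      # precomposed accented 'e'
    assert store_upload(decomposed, b"x", expected_size=1) == composed
```

The function names double as the checklist: if a case from the review is not a name in the file, it is not covered.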

/ce:work then implements against the plan and the tests, in that order. The tests are not added as an afterthought to make CI green. They are the executable form of the spec, written and reviewed before any code was allowed to satisfy them.

Implement, then two reviews

/ce:work is the only step where the agent gets to type a lot without me reading every line in real time. Even here, “unattended” overstates it. I have the diff open in another pane. I read sections as they land. I stop the agent and rewind when a file is going somewhere I didn’t authorize. The agent does the implementation; I do the supervision.

When the work is done, two reviews run before I sign off.

/ce:review is the in-Claude review of the diff against the plan and the tests. It catches the obvious class of bug: a file the plan said would change that didn’t, a function the plan said would exist that doesn’t, a test that was relaxed instead of made to pass.

Then /codex:review runs against the implementation, not just the plan. Same pattern as the adversarial plan review: a different model in a different scaffold, with permission to disagree. The implementation review catches a different class of bug than the plan review: incorrect API usage, race conditions in the new code, security holes the plan didn’t anticipate.

I read both reviews. They overlap maybe a third of the time. The overlap is usually the strongest signal something is actually wrong; the non-overlap is where each model brought a different prior. Codex finding something Claude missed (or vice versa) happens often enough that I would not trust the implementation without both passes.
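A rough sketch of that overlap triage, assuming the findings from each review have been copied into a simple structure by hand. Neither /ce:review nor /codex:review is known to emit this shape, and treating "same file and line" as agreement is a deliberately crude assumption.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    path: str  # file the reviewer flagged
    line: int  # line number in the diff
    note: str  # reviewer's comment


def triage(claude: list[Finding], codex: list[Finding]):
    """Split findings into those both reviewers flagged and those only one did.

    Two findings overlap when they point at the same (path, line);
    the wording of the note is ignored.
    """
    def key(f: Finding):
        return (f.path, f.line)

    both = {key(f) for f in claude} & {key(f) for f in codex}
    agreed = [f for f in claude if key(f) in both]
    claude_only = [f for f in claude if key(f) not in both]
    codex_only = [f for f in codex if key(f) not in both]
    return agreed, claude_only, codex_only


claude_findings = [
    Finding("api.py", 10, "missing await"),
    Finding("api.py", 42, "unused import"),
]
codex_findings = [
    Finding("api.py", 10, "race on concurrent write"),
    Finding("db.py", 7, "query built by string concatenation"),
]

agreed, claude_only, codex_only = triage(claude_findings, codex_findings)
```

The `agreed` bucket is the one to read first. The two single-reviewer buckets still need a human decision each; the function sorts, it does not decide.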

The reviews don’t gate the diff automatically. I gate the diff. The reviews give me a triaged list of things to look at; I decide which to address, which to defer, which to override. The same way I would treat a human reviewer’s comments.

Verify: the Vercel agent browser, with my credentials

The diff passes both reviews. The tests pass. The type checker is green. This is the moment a one-shot agent declares victory. Mine does not.

Vercel Labs’ agent-browser is the CLI that gets the feature in front of a real browser. It runs a browser optimized for agents rather than humans. The DOM is exposed as ref-based interactions instead of brittle CSS selectors. The model gets the page minus the parts that don’t matter (one writeup measured a 93% reduction in irrelevant context compared to a naive browser bridge). The whole thing is a deterministic shell command Claude Code can run end to end.

What this earns me is the “ralph-wiggum loop” that Pulumi’s writeup named (and which is the funniest framing I have seen for self-verifying agents). After Claude makes a change, the agent browser drives the feature it just shipped. If the button works, the agent observes the success state. If it doesn’t, the agent sees what it sees, writes a follow-up, and iterates.

Two things the agent browser does not do on its own, that I do.

I supply the credentials. Anything behind a login is a flow the agent needs an account to drive. I create the test account, hand the credentials to the session under explicit scope, and watch which routes it touches. I don’t let the agent harvest credentials from the environment or recycle a real user’s session. Credentials in are explicit; credentials out are not a thing.
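One generic way to make "explicit in, nothing inherited" concrete is to launch the child process with an environment built from an allowlist instead of the full shell environment. This is a sketch of that general technique only: it does not show how agent-browser itself receives credentials, and `TEST_USER` and `REAL_USER_TOKEN` are hypothetical variable names.

```python
import os
import subprocess
import sys

# Variables a child process typically needs just to start; everything else is dropped.
BASELINE = ("PATH", "SYSTEMROOT", "LD_LIBRARY_PATH")


def scoped_env(credentials: dict[str, str], inherit=BASELINE) -> dict[str, str]:
    """Build an environment from an allowlist plus explicitly passed credentials."""
    env = {k: os.environ[k] for k in inherit if k in os.environ}
    env.update(credentials)
    return env


def run_scoped(cmd: list[str], credentials: dict[str, str]) -> str:
    """Run cmd so it sees only the allowlisted variables and the given credentials."""
    result = subprocess.run(
        cmd, env=scoped_env(credentials), capture_output=True, text=True, check=True
    )
    return result.stdout


# Demo: a token exported in the parent never reaches the child process.
os.environ["REAL_USER_TOKEN"] = "do-not-leak"
probe = "import os; print(os.environ.get('REAL_USER_TOKEN'), os.environ.get('TEST_USER'))"
out = run_scoped([sys.executable, "-c", probe], {"TEST_USER": "qa-bot"})
```

The design choice is the allowlist: anything not named is absent, so a forgotten export in the shell cannot become a credential the session quietly uses.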

I watch the browser run. The agent driving itself is not a fire-and-forget. I keep the browser window visible. I am looking for high-level responsiveness: a flow that takes ten seconds instead of one, a click that fires twice, layout that breaks on mobile. The agent doesn’t notice those things reliably yet. I do, in about ten seconds of watching.

Two things changed about my reviews once this was wired in. The diffs got smaller (the model was less inclined to add speculative code if it had to run the feature to prove it worked) and the bugs that survived to PR review moved up the stack (typography, UX writing, edge cases involving real data). The “this doesn’t run at all” class of failure mostly disappeared from the top of the funnel.

Then I drive the feature myself

The agent browser is necessary, not sufficient. After it runs clean, I drive the feature myself.

This is the most boring step in the loop and the one I refuse to skip. I open the same browser, log in with my own account, and use the feature the way a real user would. Not because I distrust the agent browser run. Because there is a class of things the agent has not yet learned to look for: a label that reads wrong, a button that is technically functional but visually broken, an error message that is correct but unhelpful, a flow that works but is exhausting.

Most of what I catch here is not a bug in the diff. It is a gap in the spec that became visible only when I touched the running thing. Those gaps go straight back to the requirements doc, and the next loop iteration is shorter because of it.

Designer in the loop

When UX matters, a human designer reviews the result. Not at the diff. At the live feature, in a real browser, on real data.

I learned this the hard way. The two-model adversarial review catches a lot. The agent browser catches more. Neither catches the design choices that decide whether the feature feels considered or feels generated. Spacing that is technically valid but visually noisy. Color usage that is on-brand but loud. Empty states that are functional but charmless. A flow that works but assumes the user has already decided what they want.

The designer’s feedback comes back as a list. I triage the list, fold it into the requirements doc and the plan, and iterate. Sometimes the iteration is small (a copy change, a spacing tweak). Sometimes it surfaces a structural rework. Either way, the designer is a gate, not a polish pass. Their feedback can send us back to the plan step. It often does.

This is the part of the loop where I am most clearly relying on a human collaborator other than myself. The agent does not yet have taste in the senses that matter. The designer does. The loop is built around making it cheap to incorporate that taste.

Compound: the part that pays for everything else

After all of the above, /ce:compound writes the learning to disk. The learning is short. It names the thing that surprised me (a constraint in the framework, an API quirk, a pattern that turned out to be wrong, a gap that the test review caught) and the rule I want the next session to follow.

The next session starts with that file already loaded. The session after that has the rule still in CLAUDE.md or in a skill the agent reaches for. Over six months, the agent gets noticeably less wrong about my codebase, not because the underlying model improved (although it did) but because the cumulative writeup of what I learned is now in the agent’s working memory by default.

The mental model is the one from Klaassen’s writeup: each unit of engineering work should make the next unit easier, not harder. If a session ends with no learning written down, the next session will hit the same wall. If a session ends with a five-line learnings/foo.md and a one-line skill entry, the next session walks past that wall without noticing.
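To make the size of a five-line entry concrete, here is a sketch that writes one. The file layout, the `Surprise:` and `Rule:` labels, and the function name are assumptions for illustration; the actual format /ce:compound writes is not shown in this post.

```python
import re
from pathlib import Path


def write_learning(root: Path, title: str, surprise: str, rule: str) -> Path:
    """Write a five-line learning file under root/learnings/ and return its path.

    Hypothetical format: a heading, what surprised me, and the rule
    the next session should follow.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    path = root / "learnings" / f"{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"# {title}\n\nSurprise: {surprise}\n\nRule: {rule}\n",
        encoding="utf-8",
    )
    return path
```

For example, `write_learning(Path("."), "Upload retries", "The client retries on timeout.", "Make upload handlers idempotent.")` would create `learnings/upload-retries.md`. The constraint worth keeping is the shape: one surprise, one rule, short enough to be read in full at the start of the next session.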

I write the compound entry myself, often. The agent drafts; I edit. The reason is the same as the rest of the loop: the compound entry is the lesson I will be reading in three months. The agent’s draft is a starting point, not a final answer.

The loop, end to end

The thing it took me a while to see is that the loop is a relay, not a pipeline. Every other step is a human gate. The agent never has the ball for two steps in a row without me touching it.

Figure: the relay. Amber capsules are mine. White capsules are the agent’s.

Numbered below because the order matters; bolded because each is a place I show up.

  1. I write the requirements doc. One page, plain prose.
  2. Brainstorm. /ce:brainstorm refines the doc with me in the loop.
  3. Grill. /grill-me walks the design tree. I answer every question.
  4. Plan. /ce:plan produces a tech plan. I read it line by line.
  5. Adversarial plan review. /codex:adversarial-review hands the plan to a second model. I triage what comes back.
  6. Test design. The agent writes the test suite against the plan. I read every test case.
  7. Implement. /ce:work writes the code against the plan and tests. Context7 grounds framework calls.
  8. Code review (Claude). /ce:review against the diff, plan, and tests.
  9. Code review (Codex). /codex:review runs a second pass on the implementation. I read both reviews.
  10. Browser verification. Vercel’s agent browser drives the feature. I supply credentials and watch responsiveness.
  11. Manual review. I drive the feature myself.
  12. Designer pass. A human designer reviews UX. I iterate.
  13. Compound. /ce:compound writes the learnings.

The whole thing fits inside one or two Claude Code sessions for most features. The pieces that don’t (long migrations, multi-day features) split across more sessions, each one starting from the artifacts the previous one left on disk. None of those artifacts is a prompt I have to remember to write.

The flow bends to the task

Thirteen steps is not the right number for every task. The right number for a one-line CSS fix is one. The right number for “rip out the old auth middleware and replace it” is more like sixteen, because the plan review needs another pass and the designer is in the loop earlier.

How I scale the loop, roughly:

For a one-file change with no UX surface, I skip /ce:brainstorm, run a tight /grill-me (or skip it), produce a small plan, skip the adversarial review, write the change myself or with /ce:work, run the tests, and call it done. The full stack is overhead the task does not deserve.

For a feature with a real UX surface but no boundary risk (a new view over existing data, say), I run brainstorm through manual review, lean on the designer pass, and skip the adversarial plan review. Codex still runs on the implementation.

For anything that touches a boundary (auth, payments, migrations, an external API, a new data model), I run every step, twice if the plan changes substantially mid-flight. The cost of being wrong is high; the cost of the full loop is small relative to that.

For something fully novel where I do not yet know the shape (a new product surface, a research spike), I run a stripped-down loop on the first prototype to learn, then a full loop on the second pass to ship.

The judgment about which version of the loop to run is mine. The agent does not pick which steps to skip; I do, before the session starts. Picking the wrong shape of loop is the single most expensive way to waste time in this stack.
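The scaling rules above can be written down as data, which makes the choice explicit before a session starts. This is a sketch under assumptions: the step names mirror the thirteen-step list, but the task labels and the exact skip sets are one reading of the prose, not a configuration any of the plugins ships. The "fully novel" case is left out because it is two loops in sequence rather than a subset.

```python
STEPS = [
    "requirements",
    "brainstorm",
    "grill",
    "plan",
    "adversarial plan review",
    "test design",
    "implement",
    "code review (Claude)",
    "code review (Codex)",
    "browser verification",
    "manual review",
    "designer pass",
    "compound",
]

# Steps dropped per kind of task (hypothetical labels, interpretive sets).
SKIP = {
    # Small change, no UX surface: short description, tight grill, small plan,
    # implement. Running the existing tests is assumed to be part of "implement".
    "one-file change": set(STEPS) - {"requirements", "grill", "plan", "implement"},
    # Real UX surface, no boundary risk: only the adversarial plan review goes.
    "ux feature": {"adversarial plan review"},
    # Auth, payments, migrations, external APIs, new data models: run everything.
    "boundary": set(),
}


def steps_for(kind: str) -> list[str]:
    """Return the steps to run for a task kind, in relay order."""
    skipped = SKIP[kind]  # KeyError on an unknown kind: pick a shape first
    return [step for step in STEPS if step not in skipped]
```

Calling `steps_for("boundary")` returns the full list, and an unknown label raises instead of silently defaulting, which matches the rule that the shape of the loop is chosen by a person and chosen up front.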

Never unattended

The thing that holds the whole stack together is the one rule I do not break: I am in the room for every gate. The agent never decides what we are building. The agent never approves its own plan. The agent never lands code I have not read. The agent never declares a feature done; I do, after I have driven it myself and a designer has signed off.

This is not a religious thing. It is the honest read on where the current models are. They are spectacular at expanding a well-formed input. They are not yet reliable at deciding what counts as well-formed, what counts as done, or what counts as good. Those decisions are mine. The loop’s job is to make my decisions cheaper, not to replace them.

The temptation, once the loop is running, is to let it run longer between checks. The agent is producing artifacts. The artifacts mostly pass review. Each individual gate starts to feel like overhead. Skip one gate and the agent’s outputs drift slightly. Skip the next and they drift more. By the time the drift is visible, the loop has produced a day’s worth of work that points in the wrong direction.

The discipline is to keep showing up. Not because each gate catches something every time. Because the gates I skip are the ones that turn this stack from a tool into a runaway.

Where the wiring is brittle

This is the honest part. The stack is not friction-free. Three places still hurt.

Plugin state across sessions is fragile. Skills, slash commands, and MCP servers each have their own configuration surface. A reinstall, a Claude Code version bump, or a typo in ~/.claude/settings.json will silently disable a piece of the loop. The symptom is usually “the agent stopped using grill-me” or “Context7 didn’t fire on a framework call it would normally fire on.” There is no central health check for “all my tools are working.”
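A partial health check is easy to sketch, with the caveat that it only covers the cheap failures: a settings file that no longer parses and files that have gone missing. It assumes settings.json is plain JSON, and the list of expected paths is whatever you choose to pass in; the function name and any skill paths are hypothetical. It cannot tell whether a tool actually fires when it should.

```python
import json
from pathlib import Path


def check_wiring(settings_file: Path, expected_paths: list[Path]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    try:
        json.loads(settings_file.read_text(encoding="utf-8"))
    except FileNotFoundError:
        problems.append(f"missing: {settings_file}")
    except json.JSONDecodeError as err:
        # A stray comma or quote is enough to land here.
        problems.append(f"invalid JSON in {settings_file}: line {err.lineno}")
    for path in expected_paths:
        if not path.exists():
            problems.append(f"missing: {path}")
    return problems
```

A typical call would be `check_wiring(Path.home() / ".claude" / "settings.json", [...])` with the second argument listing the skill or plugin files you expect to exist on your machine. Run before a session, it turns the silent typo into a loud one.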

Cross-model handoffs leak context. When the Codex plugin reviews a plan, it reads the plan from disk; it doesn’t get the conversation history that produced the plan. Most of the time that’s fine because the plan is the artifact and the conversation is noise. Some of the time the conversation contained the reason a decision was made and Codex re-litigates the decision in its review. The fix is writing the reasoning into the plan, which is also a discipline I am still learning.

The agent browser is shakier behind a login. Even with credentials handed in explicitly, anything behind a real auth wall is a long detour: cookie injection, headless-friendly auth flows, environment-specific test accounts. Vercel Labs is iterating on this, but my verification loop is most reliable on unauthenticated routes and shakier behind the login wall. I cover the rest with my own manual smoke tests, which is also why the manual review step does not get cut.

None of these are reasons to delete a piece. All of them are reasons to keep the stack small enough that I can keep the wiring in my head.

What this stack will look like in a year

The lesson I keep landing on, in the action primitive post and again here, is that every component in the loop is compensating for something the model can’t do on its own yet. Each component will keep earning its place until the model catches up. When the model catches up, the component starts adding latency and bugs and not much else.

/grill-me exists because the model won’t interrogate a spec on its own. The day a model is trained to refuse to plan against an under-specified spec, that skill becomes redundant. We are not there. We are noticeably closer than we were a year ago.

/codex:adversarial-review exists because one model can’t reliably critique its own plan. The day a reasoning model can run a meaningful adversarial pass over its own output, that step becomes a setting on the same model rather than a separate plugin. I am not confident this one collapses on the same timeline as the others; asymmetric criticism is a structural argument, not a capability one.

Context7 exists because models don’t know what’s current. The day model training has a one-week cutoff, that need shrinks, but the catalog part (curated, version-pinned, externally consumed) still earns its keep.

The Vercel agent browser exists because the model can’t see the page. That one is purely capability-bound; the moment a coding agent reliably drives a real browser as part of its loop, the separate CLI is just an implementation detail.

The manual review and the designer pass are the ones I expect to last the longest. Taste is the part of this stack that is the slowest to automate, and the part I am least eager to automate.

The shape that will outlast the specific tools is the shape itself. Requirements, brainstorm, grill, plan, review, test, implement, review, verify, manually verify, design, compound. Each step plugs a failure mode. As the failure modes shrink, so will the steps. The work is to delete components as the assumptions they encoded stop being true, and to add components only when a new assumption shows up that the model genuinely can’t handle.

Frequently asked questions

Why use Claude Code instead of Cursor or Windsurf?

Less because of the model and more because the harness exposes the right surfaces. Skills, MCP servers, plugins, hooks, sub-agents, and persistent CLAUDE.md are all first-class. Most of the rest of this stack is a Claude Code plugin, skill, or slash command. The plugin ecosystem is also where compound engineering, the Codex plugin, grill-me, and Context7 all live. The wiring story would look different on another host; the underlying loop would be roughly the same.

Why do you read every plan and every test case line by line? Isn’t that the agent’s job?

The agent’s job is to produce the plan and the tests. My job is to decide whether they encode what I asked for. Those are different jobs. The cheapest place to catch a bad decision is in the plan, the second cheapest is in the tests, and the most expensive is in production. Reading both end to end takes maybe ten minutes per loop. It saves multiples of that, multiple times a week.

What does the compound engineering plugin actually ship?

A set of agents, slash commands, and skills implementing the brainstorm-plan-work-assess-compound loop. Notable pieces: /ce:brainstorm, /ce:plan, /ce:work, /ce:review-beta, /ce:compound, plus a learnings-researcher subagent that reads past learnings before new plans. The point of the plugin is that each step writes an artifact the next step reads, so the loop survives the session boundary.

How is grill-me different from just asking Claude to “ask clarifying questions”?

The instruction is structurally different. “Ask clarifying questions” tends to produce three or four polite questions and an offer to start coding. Matt Pocock’s grill-me tells the agent to walk down each branch of the design tree until upstream decisions resolve, and to explore the codebase whenever a question can be answered there. The session feel changes: sixteen to fifty questions, most of them about decisions you would have skipped without realizing.

Why write the tests before the implementation, and why review the tests?

Writing tests before the implementation turns the test suite into an executable spec. Reviewing the tests before any code lands catches two specific failures: over-fitted tests that lock in incidental implementation details, and under-tested edges where the agent’s first instinct misses the cases that actually break in production. The test review is also a second pass over the spec; tests I disagree with are signals the plan was wrong, not signals the tests were wrong.

Why run two code reviews after implementation?

Same reason I run a second model against the plan: asymmetric criticism beats symmetric introspection. /ce:review reads the diff against the plan and tests, in Claude. /codex:review reads the same diff in a different model and scaffold. The overlap is the strongest signal. The non-overlap is where each model brought a different prior. Auto-merging the two defeats the purpose; I read both and decide.

Does Context7 replace RAG?

No, and it isn’t meant to. Context7 is curated framework and library documentation. RAG over your own codebase or your own docs is a separate concern. The two compose; Context7 keeps the framework knowledge fresh while internal RAG keeps the project knowledge accessible. Both load on demand, not eagerly.

Why supply credentials to the agent browser yourself? Why not let it harvest from the environment?

Two reasons. First, scope: I want to know exactly which routes the agent is touching with which account, and I don’t want a session to recycle a real user’s credentials by accident. Second, blast radius: if the agent goes off the rails, the worst case is bounded by the permissions of the test account I created for the session. Credentials are an explicit handoff, not an inherited one.

How is the Vercel agent browser different from Playwright?

Playwright is built for humans to write scripts that drive a browser; the agent browser is built for an agent to interact with a browser conversationally. The DOM surface is ref-based rather than CSS-selector-based, the model only sees actionable elements, and the CLI shape is designed to fit inside an agent loop without a sandbox of its own. Playwright is still right when you are writing fixed test suites; agent-browser is right when the model is deciding the next click.

Why bring a designer into a flow that already has a manual review?

The manual review catches functional gaps: copy that reads wrong, edge cases I’d only see in real data, flows that are slower than they should be. The designer catches taste: visual hierarchy, spacing, color, empty states, the difference between a feature that works and a feature that feels considered. The two reviews overlap less than you would think. Skipping the designer pass on anything user-facing is the single fastest way to ship a feature that “works” and feels generated.

Is this overkill for small features?

Yes. The full loop is overhead. For a one-file change I jump straight to /ce:work against a one-paragraph description and skip most of the rest. The flow scales with the cost of being wrong. Pick the shape of loop that matches the task; running the full thirteen-step relay on a CSS tweak is its own kind of mistake.

How do you handle the cost of running two models?

The Codex passes only fire when the work justifies them, which in practice is several times a week, not every session. Context7’s tool descriptions are tiny and its docs load only on demand. The agent browser runs locally. The biggest cost line is still the long-form coding sessions inside Claude Code, and that cost was there before the rest of the stack. The other tools add maybe 5 to 10 percent to total spend in exchange for catching a class of bug that would have otherwise eaten an afternoon.