How to build an MCP server agents will actually use
What makes a great MCP server: workflow-shaped design, five to fifteen tools, descriptions treated as system prompt fragments, and responses budgeted in tokens.
A great MCP server doesn’t look like a REST API. It looks like a small set of moves an agent would actually want to make, shaped around what the user is trying to accomplish rather than around the endpoints you happen to have lying around.
That distinction shows up in every concrete decision the design forces on you. Five to fifteen tools per server beats thirty, because every tool description gets reloaded into every session and silently taxes the prompt. Descriptions get written as prompt fragments and not as reference docs, because that’s the shape they’re going to live in once the model is reading them. Responses get budgeted in tokens, not rows, because every byte the tool returns lands in the conversation history and changes the next decision the agent makes.
The rules are not subtle once you’ve spent any time staring at agent transcripts. Most published servers still miss them. The rest of this post is what each one looks like in practice, with the measurements and migrations people have published along the way.
Yesterday I argued that MCP is not the problem. The protocol is fine. The interesting question is when to reach for it, and what to put inside when you do.
If you have not decided whether to build one at all, start there. The previous post has a decision tree for MCP, a CLI, a direct API call, or a Skill.
This post assumes you already answered yes.
Once you have, the next question is what good looks like inside the server. Good MCP design has little to do with the protocol and almost everything to do with the question, “what would an agent actually want to call?”
Almost every team that has shipped a serious server has arrived at roughly the same answers. They disagree on details. The shape is the same.
Below is that shape, with sources cited where someone measured a number I am borrowing.
Design from the workflow, not the API
This is the single biggest distinction between MCP servers that work and the ones agents quietly route around.
A REST API is designed for a developer who can compose calls, hold intermediate state in their head, and accept that the endpoints are atomic on purpose. The user is the one doing the composition. An MCP server has a different consumer. The agent is reading the entire tool list every turn, paying for each round trip in both latency and conversation tokens, and trying to make a plan out of names and parameters it has never seen before. The two surfaces look similar on the wire and serve completely different jobs.
Take a Google Calendar MCP. The naive first cut exposes the raw API surface: list_calendars, list_calendar_events, retrieve_timezone, retrieve_calendar_free_busy_slots. Functionally complete, satisfying to look at, easy to test. To answer something like “find a one-hour slot where Alice, Bob and Carol are all free this week,” the agent now has to make four to six chained calls, juggle timezones explicitly, and hold the intermediate results in conversation history while it reasons over them.
The redesigned version replaces all of that with one tool: query_database(sql), backed by a DuckDB view of the calendar with a free_slots() macro built in. Four to six tool calls collapse into a single SQL query. The agent’s prompt gets smaller, the plan gets shorter, and the failure modes shift from “lost track of which call returned what” to “wrote a bad query,” which is far easier for the model to recover from on its own.
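To make the single-tool shape concrete, here is a minimal sketch. It swaps in Python's built-in sqlite3 for DuckDB so it runs with no dependencies, and the events table, its columns, and make_calendar_db are made up for illustration; none of this is Block's actual implementation, and the free_slots() macro is not reproduced.

```python
import sqlite3


def make_calendar_db() -> sqlite3.Connection:
    """Build a tiny in-memory calendar. The schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (person TEXT, start_iso TEXT, end_iso TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("alice", "2025-01-06T09:00", "2025-01-06T10:00"),
            ("bob", "2025-01-06T10:00", "2025-01-06T11:30"),
        ],
    )
    conn.commit()
    conn.execute("PRAGMA query_only = ON")  # defence in depth: reject writes from here on
    return conn


def query_database(conn: sqlite3.Connection, sql: str, max_rows: int = 50) -> str:
    """The one tool the agent sees: run a read-only query, return compact CSV-like text."""
    schema_hint = "Tables: events(person, start_iso, end_iso)."
    if not sql.lstrip().lower().startswith(("select", "with")):
        return f"Error: only SELECT queries are allowed. {schema_hint}"
    try:
        cursor = conn.execute(sql)
        rows = cursor.fetchmany(max_rows)
    except sqlite3.Error as exc:
        # A bad query comes back as text the model can act on, not as a crash.
        return f"Error: {exc}. {schema_hint}"
    header = ",".join(column[0] for column in cursor.description)
    lines = [",".join(str(value) for value in row) for row in rows]
    return "\n".join([header, *lines])
```

The agent writes one query instead of chaining list calls, and a malformed query returns an error that names the table it should have used.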
This is the migration Block walks through for both their internal Calendar and Linear servers. Linear in particular went through three iterations:
| Version | Surface | What “list bob@example.com’s issues” cost |
|---|---|---|
| v1 | 30+ tools mirroring GraphQL endpoints (get_issue, get_issue_labels, get_issue_comments, …) | 4–6 tool calls |
| v2 | get_issue_info(issue_id, info_category) with a category enum | 2–3 tool calls |
| v3 | execute_readonly_query(query, variables) + execute_mutation_query() accepting raw GraphQL, schema in instructions | 1 call |
The table summarises it; the cost is easier to feel than to read. Same query, three Linear servers:
Figure: Linear MCP, same query, three implementations (what ruthless curation actually buys you).

Query: List issues assigned to bob@example.com, with labels and current state.

v1 trace (30+ tools mirroring GraphQL endpoints; tool definitions: 3,000 tok):

| Round trip | Request | Response | Tokens (request + response) |
|---|---|---|---|
| 1 | get_user_by_email("bob@example.com") | { id: "u_4f2a" } | 24 + 12 |
| 2 | get_user_issues("u_4f2a") | [{ id, title }, …×12] | 18 + 312 |
| 3 | get_issue_labels("i_1f2b") | [{ name: "bug" }, { name: "p1" }] | 22 + 38 |
| 4 | get_issue_state("i_1f2b") | { state: "in_progress" } | 22 + 24 |
| 5 | get_issue_assignees("i_1f2b") | [{ name: "Bob" }] | 24 + 24 |

Round trips: 5. Round-trip cost: 520 tok. With tool definitions: 3,520 tok total.

Total session cost, scaled to v1:

| Version | Total session cost | Share of v1 |
|---|---|---|
| v1 | 3,520 tok | 100% |
| v2 | 1,180 tok | 34% |
| v3 | 668 tok | 19% |

Token counts are illustrative; ratios match the per-call collapse Block reports across the three iterations. The dominant cost in v1 is the standing prompt of 30+ tool descriptions, not the calls themselves.
Jeremiah Lowin makes the same point in Stop Converting Your REST APIs to MCP, and once you’ve internalised the shift it becomes hard to unsee. “Sophisticated for a human” means composable and atomic, because the human is the one doing the composing. “Sophisticated for an agent” means curated and minimalist, because the agent is paying for every composition step in both tokens and turns.
The mental move is to build tools around agent affordances, not around the call-by-call composition a human developer would do. schedule_event, which finds the availability and books the slot in one shot, is the tool you actually want. list_users, list_events, and create_event are the API surface you happen to be wrapping. It’s tempting to expose the second set because it feels more “complete,” but the first is the one that gets used.
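Here is a rough sketch of the availability half of a schedule_event-style tool, to show how much composition moves out of the conversation and into the server. The function name, its parameters, and the shape of the busy data are all hypothetical; the booking step is left out.

```python
from datetime import datetime, timedelta
from typing import Optional

Interval = tuple[datetime, datetime]


def find_common_free_slot(
    busy_by_person: dict[str, list[Interval]],
    window_start: datetime,
    window_end: datetime,
    duration_minutes: int = 60,
) -> Optional[Interval]:
    """Return the earliest slot of the requested length in which nobody is busy."""
    duration = timedelta(minutes=duration_minutes)
    # Flatten everyone's busy intervals into one list sorted by start time.
    busy = sorted(i for intervals in busy_by_person.values() for i in intervals)
    cursor = window_start
    for start, end in busy:
        if start - cursor >= duration:
            break  # the gap before this meeting is long enough
        cursor = max(cursor, end)  # otherwise skip past the meeting
    if cursor + duration <= window_end:
        return cursor, cursor + duration
    return None
```

With this behind a single tool, the "Alice, Bob and Carol" question is one call and one small response, and the timezone and interval arithmetic never touches the transcript.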
Five tools beat thirty
Once you start designing from workflows, the tool count drops fast, and the practitioner consensus has converged on five to fifteen tools per server, with one server doing one job. The numbers aren’t arbitrary. In Anthropic’s evaluations on internal Slack and Asana servers, Claude-optimized tool sets consistently beat the human-written ones. The human versions lost the same way every time: they kept multiplying tools as the underlying API grew, and the agent’s selection accuracy fell off a cliff past a certain point. Cursor caps configurations at forty tools for exactly this reason.
The mechanism is unglamorous. The agent reloads every tool description into the prompt at the start of every session, so each tool you add is a permanent tax on every conversation that loads the server. That tax compounds in two directions at once: more tokens spent before the user has typed anything, and a noisier selection problem when the model has to choose between thirty similarly named tools instead of seven distinct ones.
The heuristic I lean on: when you find yourself adding a third tool that fetches a different facet of the same resource, you don’t want three tools. You want one tool with a category parameter. Linear’s v2 did exactly that, collapsing seven get_issue_* tools into one get_issue_info(issue_id, info_category) with an enum. The v3 collapse to a single GraphQL passthrough is the same instinct taken further.
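A minimal sketch of the category-parameter collapse, assuming four facets and a fake in-memory backend. The category values and the issue data are invented for illustration; only the get_issue_info(issue_id, info_category) signature comes from the Linear v2 description above.

```python
from typing import Literal, get_args

InfoCategory = Literal["labels", "comments", "state", "assignees"]

# Stand-in for the real backend; the issue data here is made up.
_FAKE_ISSUES = {
    "i_1f2b": {
        "labels": ["bug", "p1"],
        "comments": [],
        "state": "in_progress",
        "assignees": ["Bob"],
    }
}


def get_issue_info(issue_id: str, info_category: InfoCategory) -> dict:
    """Fetch one facet of an issue; one tool instead of one tool per facet."""
    valid = get_args(InfoCategory)
    if info_category not in valid:
        raise ValueError(
            f"Unknown info_category {info_category!r}. Valid values: {', '.join(valid)}."
        )
    if issue_id not in _FAKE_ISSUES:
        raise ValueError(f"No issue with id {issue_id!r}.")
    return {"issue_id": issue_id, info_category: _FAKE_ISSUES[issue_id][info_category]}
```

The enum carries the information that used to be spread across several tool names, so the agent picks a value rather than a tool.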
Deletions matter as much as additions. If a tool hasn’t been called in a week of real dogfooding, take it out. The cost of leaving it in is paid by every session that loads the server, every day, forever. It’s the cheapest perf optimisation you’ll ever ship.
Names are part of the prompt
Tool names are the agent’s first signal about what each tool does, and they get read before any description. The model treats them as keywords more than as identifiers, and weighs them when deciding which tool to call. Treat them accordingly.
The shape the field has converged on:
- Service-prefixed, so the agent can tell which server it is reaching into: slack_send_message, linear_list_issues, sentry_get_error_details.
- Snake case, which both Claude and GPT-4o tokenize cleanly.
- Verb-led, with the noun specific enough that the action is unambiguous: schedule_event, not event.
Prefix versus suffix naming isn’t a stylistic question. It produces measurable differences in tool-selection accuracy on Anthropic’s evals, and the win is not subtle. Tools clustered under a common prefix get retrieved together, which is what you want when the agent is searching for “the Slack things.” Tools that share a generic suffix collide, which is what you don’t.
Consistency within a server matters more than picking the optimal scheme. If you’ve decided on service_resource_action, hold the line across every tool you ship. The agent’s mental model of your server is downstream of how regular your names are, and the cost of switching conventions halfway through is paid by every selection the model makes after.
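One cheap way to hold the line is a lint that runs in CI. The sketch below assumes a service_verb_noun convention like the examples above; the regex, the verb list, and the check_tool_names helper are my own illustration, not part of any SDK.

```python
import re

# Lowercase snake_case with at least three segments: service, verb, noun...
TOOL_NAME = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+){2,}$")


def check_tool_names(names: list[str], service: str, verbs: set[str]) -> list[str]:
    """Return one human-readable problem per name that breaks the convention."""
    problems = []
    for name in names:
        parts = name.split("_")
        if not TOOL_NAME.match(name):
            problems.append(f"{name}: expected snake_case service_verb_noun")
        elif parts[0] != service:
            problems.append(f"{name}: missing '{service}_' prefix")
        elif parts[1] not in verbs:
            problems.append(f"{name}: second segment should be one of {sorted(verbs)}")
    return problems
```

Run it over the registered tool names and fail the build on any non-empty result; that way a new tool cannot quietly introduce a second convention.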
Descriptions are prompt, not docs
This is where most servers leave the most performance on the floor, and the framing mistake is usually the same: people write tool descriptions like reference docs, intended to be read once by a developer integrating against the API. They aren’t read like that. The agent loads every description into every session, and the descriptions collectively steer behaviour for the rest of the conversation. They are a system prompt fragment that runs every time, whether you wrote them like one or not.
Anthropic reports that careful refinements to tool descriptions moved Claude Sonnet 3.5 to state of the art on SWE-bench Verified. The lift came from text the model read on every turn, not from changing the model. Every piece of text the agent sees becomes part of the context that decides what it does next, and tool descriptions are usually the longest piece of that text by far.
So write them the way you’d onboard a new teammate who has fifteen minutes before they have to start using the tool. Be explicit about when to reach for it and when not to. Be explicit about what each parameter means and what the valid values look like. Describe the shape of the return value. Describe the failure modes you expect. Each of those lines is one less round trip the agent has to spend learning your tool by failure.
The unambiguous-parameter rule deserves its own beat. Name parameters as if the agent will only ever see the name, because half the time, that’s exactly what happens. user_id is better than user. start_iso_timestamp is better than start. team_slug is better than team. The agent doesn’t always have the schema in front of it when it’s writing the call. It has the name.
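As a sketch of what that looks like on the page, here is a tool definition written as a prompt fragment. The name, description, and inputSchema keys follow the MCP tool shape; the shop_search_orders tool, the shop_get_order tool it mentions, and the four "beats" checked by missing_beats are hypothetical conventions I am using for illustration.

```python
SEARCH_ORDERS_TOOL = {
    "name": "shop_search_orders",
    "description": (
        "Search a customer's orders by email and status.\n"
        "Use when: the user asks about the state or history of their orders.\n"
        "Do not use when: you already have an order_id; call shop_get_order instead.\n"
        "Returns: up to `limit` rows of order_id, status, placed_iso_timestamp.\n"
        "Errors: an unknown status lists the valid values so you can retry."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_email": {
                "type": "string",
                "description": "Exact email on the account, e.g. bob@example.com.",
            },
            "status": {
                "type": "string",
                "enum": ["pending", "shipped", "delivered"],
                "description": "Current fulfilment state to filter on.",
            },
            "limit": {
                "type": "integer",
                "default": 10,
                "description": "Maximum rows to return.",
            },
        },
        "required": ["customer_email", "status"],
    },
}

REQUIRED_BEATS = ("Use when:", "Do not use when:", "Returns:", "Errors:")


def missing_beats(tool: dict) -> list[str]:
    """List the onboarding beats a description forgot to cover."""
    return [beat for beat in REQUIRED_BEATS if beat not in tool["description"]]
```

The check is crude on purpose. It will not tell you whether a description is good, only whether it skipped one of the things a new teammate would have asked about.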
Nested arguments are a trap
Nested config objects look elegant in a REST design. In an MCP server, they are a trap.
Avoid:
def search_orders(filters: dict): ...
Prefer:
from typing import Literal

def search_orders(
    email: str,
    status: Literal["pending", "shipped", "delivered"],
    limit: int = 10,
): ...
The flat version gives the agent everything it needs from the schema alone. The literal values for status. The default for limit. The type for email. The model can construct a valid call on the first attempt, because the schema told it everything it needed to know up front.
The nested version forces the agent to guess at what filters accepts, watch its first call fail, parse the error, and try again. Same data, two extra round trips, and a transcript polluted by a failed call that the agent will keep referring back to for the rest of the session.
The recipe in practice is unremarkable: Pydantic models serialised to JSONSchema, every enum hard-constrained with Literal, every field carrying its own description string. Those field-level descriptions land in the prompt the same way the tool description does, so write them the same way.
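To see what the agent actually receives, here is a stdlib-only stand-in for the Pydantic step: a small function that derives a JSON Schema from a flat signature. input_schema is a hypothetical helper that handles only the handful of types shown, and it assumes every parameter is annotated and every Literal holds strings.

```python
import inspect
from typing import Literal, get_args, get_origin, get_type_hints

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean", dict: "object"}


def input_schema(func) -> dict:
    """Derive a minimal JSON Schema from a function's annotated parameters."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        hint = hints[name]
        if get_origin(hint) is Literal:
            # Literal values become a hard enum the model can read up front.
            prop = {"type": "string", "enum": list(get_args(hint))}
        else:
            prop = {"type": _JSON_TYPES[hint]}
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            prop["default"] = param.default
        properties[name] = prop
    return {"type": "object", "properties": properties, "required": required}


def search_orders(
    email: str,
    status: Literal["pending", "shipped", "delivered"],
    limit: int = 10,
): ...


def search_orders_nested(filters: dict): ...
```

Calling input_schema(search_orders) yields the enum for status and the default for limit, while input_schema(search_orders_nested) yields nothing more than an object-typed filters property. That gap is everything the agent would otherwise have to learn by failing.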
There’s a second reason to flatten that has nothing to do with ergonomics. Deeply nested arguments are a known weak spot for LLMs in strict JSON mode. They miss commas. They drop quotes. They invent keys to fit the structure they think they’re producing. The flatter the shape, the fewer retry loops you’ll watch in production.
Responses are part of the prompt
Every byte of tool output goes into the conversation history and gets read by the next decision the agent makes. That feedback loop is where the cost of a sloppy response compounds silently. You don’t see it as a single bad invoice; you see it as a context window that filled up too fast and an agent that lost the plot four turns in.
A handful of tactics show up across every published writeup on this:
Pick the right format for the data. JSON is the default for historical reasons, not for token-efficiency reasons, and you can usually beat it. Datadog measured roughly 50% token savings moving tabular data from JSON to CSV, and 20% moving nested data to YAML. For narrative content, Markdown or XML beats verbose JSON. The AXI project pushes this further with TOON, claimed at about 40% smaller than JSON on typical payloads. None of these formats are exotic; the agent reads all of them fluently. The default is just not free.
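You can get a feel for the format gap on your own payloads with a few lines. The sketch below uses character count as a rough proxy for tokens, which is an assumption; the rows are made up, and the savings you see will depend on your data and tokenizer rather than matching Datadog's figure.

```python
import csv
import io
import json

# Made-up tabular payload: the same three keys repeated on every row.
rows = [
    {"id": i, "status": "shipped", "email": f"user{i}@example.com"}
    for i in range(20)
]


def to_json(records: list[dict]) -> str:
    return json.dumps(records)


def to_csv(records: list[dict]) -> str:
    """Write the keys once as a header instead of once per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


json_chars = len(to_json(rows))
csv_chars = len(to_csv(rows))
print(f"JSON: {json_chars} chars, CSV: {csv_chars} chars")
```

The saving comes almost entirely from not repeating the key names and punctuation on every row, which is why it shows up on tabular data and mostly disappears on deeply nested data.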
Offer concise and detailed response modes. Anthropic’s Slack tool showed the most striking version of this. The detailed response used 206 tokens. The same payload at response_format="concise" used 72 tokens. That’s roughly a third of the cost, on a tool the agent calls many times per session, and most flows never actually needed the detailed payload. Adding a single response_format: Literal["concise", "detailed"] parameter is one of the highest-leverage changes you can make to a working server.
Trim defaults aggressively. Datadog measured a 5× improvement in records-per-token by removing unused fields from default responses and switching format at the same time. The trimmed fields are still available; the agent has to ask for them or pass an explicit flag. Most of the time, it doesn’t need to. The cost of optionally returning a field on request is much smaller than the cost of unconditionally returning it in every response.
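The two tactics above, response modes and trimmed defaults, can share one parameter. A minimal sketch, with a made-up message record and a hypothetical slack_get_message tool; this is not Anthropic's Slack tool, only the same idea in a few lines.

```python
from typing import Literal

# Made-up record standing in for whatever the backing API returns.
_RAW_MESSAGE = {
    "ts": "1736150400.000200",
    "channel_id": "C024BE91L",
    "user_id": "U012AB3CD",
    "user_name": "bob",
    "text": "Deploy is green.",
    "thread_ts": None,
    "reactions": [],
    "client_msg_id": "a1b2c3",
}

# The fields most flows need; everything else is opt-in.
_CONCISE_FIELDS = ("user_name", "text")


def slack_get_message(
    message_ts: str,
    response_format: Literal["concise", "detailed"] = "concise",
) -> dict:
    record = _RAW_MESSAGE  # a real server would look this up by message_ts
    if response_format == "detailed":
        return dict(record)  # includes the IDs needed for follow-up calls
    return {field: record[field] for field in _CONCISE_FIELDS}
```

Defaulting to concise matters as much as offering it. The agent only pays for the IDs on the turns where it is about to use them.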
Prefer semantic names over technical ones. name, image_url, file_type beats uuid, 256px_image_url, mime_type. Same data, fewer tokens, less translation for the model to do before it can reason about the field. Reserve the technical names for the cases where the agent genuinely has to round-trip them back to your API.
Production agent clients truncate oversized tool responses for a reason. Claude Code caps at 25,000 tokens by default; others draw similar lines. By the time the client is truncating, you’ve already lost; the agent is now making decisions on a payload it can’t verify is complete. Don’t make the client intervene on your behalf. Budget the response yourself.
Paginate by tokens, not by rows
Pagination by record count is the most common bug in production MCP servers, and it’s a deceptively friendly one. It works fine in development, where every record is small. It works fine in light dogfooding, where the corpus is well behaved. Then in production, one of your records turns out to be huge, like a transcript, a stack trace, or a deeply nested order, and a page of fifty rows blows past the context budget the agent had planned for. The agent doesn’t recover from that gracefully. It just keeps making increasingly confused decisions on a truncated payload.
The fix is to cut at a token threshold and return a cursor. Records get paginated by how much room they take, not by how many of them exist. The agent gets predictable page sizes. The server gets to bound the worst case. Both sides win.
Pair this with explicit metadata on every paginated response: has_more, next_offset or next_cursor, and ideally total_count if it’s cheap to compute. The agent should never be in a position where it has to guess whether to keep paging.
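A sketch of what token-budgeted pagination can look like. The four-characters-per-token estimate is a crude assumption; swap in a real tokenizer if you have one. The function names and the response keys other than has_more, next_cursor, and total_count are hypothetical.

```python
import json


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def paginate_by_tokens(records: list[dict], cursor: int = 0, token_budget: int = 2000) -> dict:
    """Fill a page until the next record would exceed the token budget."""
    page, used, index = [], 0, cursor
    while index < len(records):
        cost = estimate_tokens(json.dumps(records[index]))
        # Always return at least one record so an oversized one cannot stall
        # paging; a real server would also truncate or summarise that record.
        if page and used + cost > token_budget:
            break
        page.append(records[index])
        used += cost
        index += 1
    has_more = index < len(records)
    return {
        "items": page,
        "has_more": has_more,
        "next_cursor": index if has_more else None,
        "total_count": len(records),
    }
```

The agent keeps calling with next_cursor until has_more is false. Page length varies with record size, but the cost of each page stays bounded.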
Goose’s file reader has my favorite version of the bounded-output pattern. Files over 400KB raise a tool error whose message is a literal recovery script:
File is 1.2MB, which exceeds the 400KB threshold. Try head -n 100 /path/to/file, tail -n 100 /path/to/file, or sed -n '500,600p' /path/to/file to read parts of it.
The agent reads the error, generates the right narrower call, and recovers without ever escalating to the user. The cost of designing the error this way was a few minutes of writing. The benefit gets paid back on every oversized file the user happens to throw at the agent. Which brings us to the next rule.
Errors are recovery instructions
The single biggest unforced error in MCP server design is returning "invalid query" to an agent and expecting it to figure out what to try next. Agents are not great at debugging errors with no signal. They retry, slightly randomly, until they either luck into a valid call or give up and surface the failure to the user. Both outcomes are bad.
The rule is to prefer actionable error messages that enable recovery over generic prompt-time instructions about what not to do. The prompt is where you tell the agent how to use the tool. The error is where you tell it how to fix the specific call it just made. Those are different jobs, and most servers conflate them.
When the tool can guess what the agent meant, say so. "unknown field 'staus', did you mean 'status'?" is a much better error than "invalid field". The agent reads the suggestion and recovers on the next call. Without the suggestion, it shotguns through whatever fields it can think of and burns turns doing it.
The pattern that works is to treat every error path as a one-line micro-prompt with three beats:
- What failed.
- Why it failed, specifically.
- What the agent should try next, with a concrete example if possible.
Agents recover gracefully from errors that tell them how to recover. They spiral on errors that don’t. The cost of making your errors recovery-shaped is mostly paid once, in design. The cost of leaving them vague is paid every time the agent fails in production.
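The three beats fit comfortably in one helper. A minimal sketch using difflib from the standard library for the "did you mean" step; the field list, the search_orders example in the message, and the unknown_field_error name are hypothetical.

```python
import difflib

VALID_FIELDS = ["status", "email", "created_at", "limit"]


def unknown_field_error(field: str) -> str:
    """Build an error that says what failed, why, and what to try next."""
    matches = difflib.get_close_matches(field, VALID_FIELDS, n=1)
    if matches:
        hint = f"Did you mean {matches[0]!r}?"
    else:
        hint = f"Valid fields: {', '.join(VALID_FIELDS)}."
    return (
        f"Unknown field {field!r}: it is not in the search_orders schema. "
        f"{hint} Example: search_orders(status='shipped')."
    )
```

When there is a close match the agent gets a single correction to apply; when there is not, it gets the full list instead of a guess.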
One risk level per tool
This rule matters more than it sounds, and it’s the one most servers get carelessly wrong because the bundling instinct cuts the other way.
Each tool should operate at a single risk level. Read-only or write/delete, not both. The reason is downstream of how clients actually treat approvals: they hand off approval decisions to the user based on tool annotations like readOnlyHint and destructiveHint. A tool that “lists issues and optionally archives them based on a flag” looks convenient in code. In the client, it forces every call, including the perfectly safe list, through the same approval threshold as the archive. The user approves the read-only call. Then they approve the next read-only call. Then they start approving without reading. Friction compounds, and so does habituation, and now the destructive call slips through on momentum.
The corollary is that you can and should bundle related read-only operations into one tool. get_issue_info consolidating label, comment, subscriber and parent lookups is exactly the right kind of bundling. It’s still one risk level; the agent is still only ever reading. Adding delete_issue to that bundle would not be.
Mark every tool’s readOnlyHint, destructiveHint, and idempotentHint honestly, and assume someone is reading them downstream. Clients that support smart approvals (Claude Code, Cursor, Claude Desktop) lean on the annotations to decide what to prompt the user about. A truthful set of hints is part of the contract you’re shipping.
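Here is a small sketch of both sides of that contract. The annotation names are the ones from the MCP spec; the two tools, the approval policy, and the contradiction check are hypothetical and far simpler than what a real client does.

```python
TOOLS = {
    "linear_get_issue_info": {
        "readOnlyHint": True,
        "destructiveHint": False,
        "idempotentHint": True,
    },
    "linear_archive_issue": {
        "readOnlyHint": False,
        "destructiveHint": True,
        "idempotentHint": True,
    },
}


def needs_user_approval(annotations: dict) -> bool:
    """Hypothetical client policy: auto-run only tools that declare themselves read-only."""
    return not annotations.get("readOnlyHint", False)


def contradictory_hints(tools: dict[str, dict]) -> list[str]:
    """Server-side lint: a tool cannot be both read-only and destructive."""
    return [
        name
        for name, annotations in tools.items()
        if annotations.get("readOnlyHint") and annotations.get("destructiveHint")
    ]
```

Under a policy like this, a tool that mixes listing and archiving has to set readOnlyHint to false, and every harmless list call then lands in front of the user.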
Use OAuth, not API keys
Use OAuth 2.1 with the Authorization Code Grant flow, triggered on first use of the extension, and request only the minimum scopes the server actually needs. Direct API-key entry is the fallback for the cases where OAuth genuinely isn’t available, not the default.
The MCP spec’s authorization section bakes OAuth into the protocol for exactly this reason: api-key-in-config flows are how secrets end up in shell history, in dotfiles, in screenshots, and eventually in incident reports. You can’t reasonably ask users to remember which configs hold credentials and which don’t. OAuth makes the question moot.
Never write secrets to disk
Use the platform keyring: macOS Keychain, Windows Credential Locker, or libsecret on Linux, with Python’s keyring library as a single interface over all three if you are writing in Python. Never write tokens to disk in plaintext, not even temporarily, not even in a hidden file under the user’s home directory. Anything you write in plaintext will end up in a find output, a backup, a tar of someone’s home directory, or the next IDE that indexes their workspace.
Invalidate the credentials when the user uninstalls the extension. Don’t make them go hunting for stale tokens later; the ones that get forgotten are the ones that leak.
Don’t poison the prompt cache
This one is subtle, and it’s responsible for a surprising amount of “why did my server suddenly feel slower” mystery. Prompt caches (Anthropic, OpenAI, others) hash the prefix of the conversation, including your tool descriptions, to decide whether they can reuse the previous prefill. Any dynamic value in your server’s instructions, like a current timestamp embedded in the description, a session ID, or anything else that varies per request, invalidates that cache and forces a fresh re-read of every tool description on the next turn.
The practical version is: use a session-start timestamp captured once if you need a “current time” in the instructions, not datetime.now() interpolated per request. Monitor cache_read_input_tokens versus cache_creation_input_tokens in your traces and treat any sustained drop in the read ratio as a regression worth investigating. The cache is doing real work for you; you just have to not poison it.
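A minimal sketch of both halves: instructions that stay byte-identical for the life of a session, and a read ratio computed from the two usage counters. The ServerInstructions class, its wording, and cache_read_ratio are hypothetical; only the two counter names come from the text above.

```python
from datetime import datetime, timezone
from typing import Optional


class ServerInstructions:
    def __init__(self, now: Optional[datetime] = None):
        # Captured once per session, then frozen, so the prompt prefix never changes.
        self._session_start = (now or datetime.now(timezone.utc)).replace(microsecond=0)

    def render(self) -> str:
        return (
            "Calendar server. "
            f"Session started at {self._session_start.isoformat()} (UTC)."
        )


def cache_read_ratio(usage: dict) -> float:
    """Share of cacheable prompt tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    total = read + created
    return read / total if total else 0.0
```

Calling render() twice returns the same string, which is the whole point. A version that interpolated the current time on each call would produce a different prefix on every request.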
Curate first, defer-load second
If your server still has more than thirty tools after ruthless curation, opt into the defer-loading patterns from the previous post: Anthropic’s Tool Search (defer_loading: true), a search-and-execute pair the way Cloudflare’s Code Mode does, or lazy schema expansion at the protocol level. Pick whichever one your SDK and runtime actually support.
The order matters, though. Defer-loading is a mitigation, not a substitute for curation. A poorly designed tool surface with deferred loading is still a poorly designed tool surface; the agent just pays a slightly different cost to walk it, and you’ve added a layer of indirection between the user and the failure mode.
Get the design right first. Then reach for the heavier patterns once you actually have the scale to justify them.
Build the loop
Prototype locally first. Stand the server up with a handful of candidate tools, wire it into an actual agent client you use daily (Claude Desktop, Claude Code, ChatGPT, Cursor, an IDE plugin), and use it for real for a day. Not for a synthetic happy-path demo: for whatever you’d genuinely want the agent to do with it.
Evaluate with real tasks. Write five to ten realistic prompts you’d actually issue and run them programmatically, with extended thinking turned on. Capture, per task: completion accuracy, total tool calls, total tokens consumed, error count, and wall-clock time. Those are your dashboard. They give you a baseline you can compare against after every change.
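The dashboard does not need to be elaborate. Below is a sketch of the per-task record and the summary I would start from; running the agent is out of scope here, and the TaskResult fields and summarize helper are my own naming for the metrics listed above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    prompt: str
    completed: bool
    tool_calls: int
    tokens: int
    errors: int
    seconds: float


def summarize(results: list[TaskResult]) -> dict:
    """Collapse one eval run into the numbers you compare after every change."""
    return {
        "tasks": len(results),
        "completion_rate": mean(1.0 if r.completed else 0.0 for r in results),
        "avg_tool_calls": mean(r.tool_calls for r in results),
        "total_tokens": sum(r.tokens for r in results),
        "total_errors": sum(r.errors for r in results),
        "avg_seconds": mean(r.seconds for r in results),
    }
```

Store one summary per commit and diff them. A description tweak that raises completion rate but doubles total tokens is a trade you want to see, not one you want to discover later.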
Collaborate with the agent on improvements. Feed the full transcript back to a frontier model alongside the current tool definitions and ask what it would change. Anthropic reports this loop produces bigger gains than any single hand optimization, and that the agent is particularly good at spotting redundant tools and ambiguous descriptions, the kinds of issues that are hard to see when you wrote the tools yourself.
The split is worth holding clearly in your head. Tests catch regressions in things you already know matter. The agent-driven rewrite finds the gains you didn’t know you were missing. They’re complementary, not interchangeable.
If you build nothing else from this list, build evals. Once you have them, every other improvement compounds. Without them, you’re tuning by vibes and shipping by hope.
The tool list is the prompt
Every name, every description, every parameter, every response shape, every error message in your server is part of the prompt the agent reads before it decides what to do. Once you see the tool list as a prompt, the rest of the rules in this post stop feeling like a checklist and start feeling like the obvious consequences of treating it as one.
You can run that prompt past every reviewer on your team and tune it the way you’d tune any other prompt. You can A/B it against real tasks and measure the result. You can hand it to a frontier model and ask for an edit pass and get back a sharper version than you’d write alone.
What you can’t do is treat it like a REST surface that just happens to be served over JSON-RPC and hope an LLM figures it out. That’s the failure mode most published servers ship into, and it’s also the failure mode responsible for most of the “MCP doesn’t work” essays I’ve read.
A great MCP server is mostly the absence of the obvious mistakes. Not one tool per endpoint. Not a wall of options. Not a JSON blob where a sentence would do. Not a silent, contextless error. What remains, after the mistakes are taken out, is actually fairly small: five to fifteen well-named tools, described as if they were system prompt fragments, returning the smallest payload that carries the data, failing with recovery hints, sitting behind OAuth, at one risk level apiece.
That’s the bar. It’s much lower than the discourse suggests, and much higher than most published servers actually clear.
If the question of whether you should be building one at all is still open, the companion post is MCP is not the problem. Decide there. Build here.
Frequently asked questions
When should I build an MCP server versus a CLI, a direct API call, or a Skill?
Build an MCP server when the consumer is a host you do not control (Claude Desktop, ChatGPT, Cursor, an IDE plugin) and the surface is stateful.
Use a CLI when you write both ends of the loop; the protocol is friction with no upside. Use the API directly when the surface is stable and stateless and the agent can hold the spec.
Use a Skill when you are packaging procedural knowledge rather than a capability. The companion post MCP is not the problem has an interactive decision picker.
How many tools should an MCP server expose?
Five to fifteen. One server, one job. Cursor caps configurations at forty tools for a reason.
Every tool description is loaded into context on every session, so each tool you add taxes the prompt.
Anthropic’s evaluations on internal Slack and Asana servers showed Claude-optimised small tool sets consistently beat large human-written ones.
How should I name MCP tools?
Service-prefixed, snake_case, verb-led, with a noun specific enough that the action is unambiguous: slack_send_message, linear_list_issues, sentry_get_error_details.
Prefix versus suffix naming produces measurable differences in tool-selection accuracy on Anthropic’s evals. Consistency within a server matters more than the exact convention you pick.
What format should MCP tool responses use?
Pick the lightest format that carries the data. Datadog reports roughly 50% token savings moving tabular data from JSON to CSV and roughly 20% moving nested data from JSON to YAML.
For narrative content, Markdown or XML beats verbose JSON. Offer concise and detailed response modes via a response_format parameter.
Anthropic’s Slack tool dropped a 206-token response to 72 tokens just by adding response_format="concise".
How should I paginate MCP tool results?
By token budget, not by record count. Pagination by row count breaks when individual records are large.
Cut at a token threshold, return a cursor, and include explicit metadata (has_more, next_offset or next_cursor, and total_count if it is cheap to compute) on every paginated response.
This is also how Datadog redesigned its pagination after row-count pagination kept blowing the agent’s context budget on oversized records.
Should an MCP tool be read-only or read-write?
Each tool should operate at a single risk level: read-only or write/delete, not both.
Clients hand off approval decisions to the user based on tool annotations like readOnlyHint and destructiveHint. Bundling a destructive path into an otherwise-safe tool forces every call through the higher approval threshold.
Bundle related read-only operations together; never mix in a destructive one. Mark readOnlyHint, destructiveHint, and idempotentHint honestly.
How should I write MCP tool error messages?
Treat every error message as a one-line micro-prompt for the agent. Include what failed, why specifically, and what the agent should try next, with a concrete example if possible.
Goose’s file reader is the canonical example: when a file exceeds 400KB, it returns a one-line error suggesting head, tail, or sed commands the agent can use to read parts of it.
Vague errors like invalid query cause the agent to spiral. Recovery-shaped errors get it to the right call on the next try.
Does my MCP server need OAuth?
Yes, unless the data is trivially public. Use OAuth 2.1 with the Authorization Code Grant flow, triggered on first use of the extension, requesting the minimum scopes the server actually needs.
Store tokens in the platform keyring (macOS Keychain, Windows Credential Locker, libsecret on Linux, Python’s keyring library on top), never in plaintext files.
Invalidate on uninstall. The MCP spec’s authorization section bakes OAuth into the protocol for exactly this reason.
How do I evaluate whether my MCP server is well-designed?
Build evals first, optimize after. Anthropic’s recommended loop: prototype locally, then run five to ten realistic prompts through the server programmatically with extended thinking on.
Capture completion accuracy, total tool calls, total tokens consumed, and error count per task. Then feed the full transcripts back to a frontier model alongside your tool definitions and ask what it would change.
The agent-driven rewrite finds bigger gains than any single optimization done by hand, and is particularly good at spotting redundant tools and ambiguous descriptions.