Why MCP Integrations Burn Tokens
Teams running MCP servers in production tend to notice the same anomaly on the bill: input token counts are large even when the user has barely typed anything. The reason is mechanical — MCP tool definitions (name, description, inputSchema, annotations) ship at the front of every prompt, ahead of the user message.
Independent reports put the cost into numbers: five MCP servers and 58 tools consume 55,000+ tokens before the user types, and adding integrations like Jira pushes that past 100,000 tokens of context window absorbed by tool definitions alone ✅ (we audited this claim in "Are 'MCP Tool Definitions Eat 40-50% of Context' True?"). That is the invisible floor under your agent ops bill.
The key insight is that tool definitions are a stable prefix. What changes per request is the user message and a few context variables — not the tool block. So most of the MCP input cost is just full-price billing for tokens that should have been cache reads. Anthropic's prompt caching is the official mechanism that closes that gap, by up to 90% on the cached portion ✅.
Anthropic's Spec — The cache_write / cache_read Economics
From Anthropic's official documentation ✅:
- cache_read (tokens served from cache) = 0.1x the base input price — a 90% discount on the cached portion.
- cache_write (first time a region is stored): 1.25x base input at 5-minute TTL, 2.0x at 1-hour TTL.
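- The write surcharge pays back on subsequent reads. Counting reads after the initial write, the 5-min TTL (1.25x) is already ahead after the first read (1.25 + 0.1 = 1.35x vs 2.0x uncached), and the 1-hour TTL (2.0x) is ahead after the second read (2.0 + 0.2 = 2.2x vs 3.0x uncached) ✅.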
- Minimum cacheable region per model: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku). Anything smaller is ignored and billed at full input rate ✅.
A concrete example. Suppose an Opus agent ships an 80K-token fixed prefix on every request. Without caching, you pay full input price on those 80K every time. Add 5-min TTL caching and the first request costs 80K × 1.25 = 100K-equivalent input; every subsequent hit costs 80K × 0.1 = 8K. From the second request on, the cached input drops to 10% of its original price — that is the "up to 90% reduction" headline ✅.
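The arithmetic above can be checked with a short sketch. The helper names (`cachedPrefixCost`, `uncachedPrefixCost`, `breakEvenReads`) are hypothetical and not part of any SDK; the multipliers are the ones quoted in this section, and costs are expressed in token-equivalents at base input price.

```typescript
// Multipliers quoted in this section: cache read 0.1x, write 1.25x (5-min) or 2.0x (1-hour).
const CACHE_READ_MULTIPLIER = 0.1;

type Ttl = "5m" | "1h";
const WRITE_MULTIPLIER: Record<Ttl, number> = { "5m": 1.25, "1h": 2.0 };

/** Prefix cost of `requests` calls when the first call writes the cache and the rest hit it. */
function cachedPrefixCost(prefixTokens: number, requests: number, ttl: Ttl): number {
  if (requests <= 0) return 0;
  const write = prefixTokens * WRITE_MULTIPLIER[ttl];
  const reads = prefixTokens * CACHE_READ_MULTIPLIER * (requests - 1);
  return write + reads;
}

/** Prefix cost of the same calls with no caching at all. */
function uncachedPrefixCost(prefixTokens: number, requests: number): number {
  return prefixTokens * Math.max(requests, 0);
}

/** Smallest number of cache reads (after the initial write) at which caching is cheaper. */
function breakEvenReads(ttl: Ttl): number {
  let reads = 0;
  while (cachedPrefixCost(1, reads + 1, ttl) >= uncachedPrefixCost(1, reads + 1)) {
    reads += 1;
  }
  return reads;
}
```

Under these multipliers, `breakEvenReads("5m")` evaluates to 1 and `breakEvenReads("1h")` to 2, counting reads after the initial write. For the 80K example, `cachedPrefixCost(80_000, 1, "5m")` is the 100K-equivalent first request, and each further request adds roughly 8K.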
[Figure: cache_read vs cache_write economics]
Where cache_control Goes — tools → system → messages
Anthropic builds the cache key by walking the prompt in a fixed order: tools → system → messages ✅. That order is decisive — the key is constructed from the front; the moment a dynamic token appears, every byte after it becomes a cache miss.
For MCP integrations, attach cache_control: {type: "ephemeral"} to the last tool definition. That marks the entire tools block as one cacheable region:

    // Anthropic Messages API: cache_control on the last tool
    import Anthropic from "@anthropic-ai/sdk";

    const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

    const response = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 4096,
      tools: [
        { name: "tool_a", description: "...", input_schema: {...} },
        { name: "tool_b", description: "...", input_schema: {...} },
        // ... 58 tools total ...
        {
          name: "tool_last",
          description: "...",
          input_schema: {...},
          cache_control: { type: "ephemeral" } // ← caches the whole tools block
        }
      ],
      system: [
        {
          type: "text",
          text: "Long instruction prompt...",
          cache_control: { type: "ephemeral" } // ← also cache the system block
        }
      ],
      messages: [{ role: "user", content: "..." }]
    });

    // Check the hit rate in the response
    console.log(response.usage);
    // {
    //   input_tokens: 320,                  ← non-cached input
    //   cache_creation_input_tokens: 55200, ← first write
    //   cache_read_input_tokens: 0,         ← flips on subsequent calls
    //   output_tokens: 250
    // }

The principle: "Mark the stable region from the back inward." Cache keys build front-to-back, so cache_control on the last item of a stable region captures everything from the prompt start up to that mark. For MCP integrations, the tools block is the biggest and most stable region, so cache_control at the tail of tools is the standard placement. If you also ship long system instructions, mark the end of system as well.
Dynamic values (timestamps, session IDs, usernames) belong in messages, not in the prefix. A timestamp at the top of system invalidates every cached byte after it.
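One way to keep the prefix byte-identical is to build the system block once with no interpolation and push per-request values into the user message. The sketch below is a minimal illustration under that assumption; `RequestContext`, `STATIC_SYSTEM`, and `buildUserMessage` are hypothetical names, and the header layout is arbitrary.

```typescript
// Hypothetical per-request values that would otherwise leak into the prefix.
interface RequestContext {
  now: Date;
  sessionId: string;
  username: string;
}

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface UserMessage {
  role: "user";
  content: string;
}

// Static instructions: no template variables, so the bytes never change between requests.
const STATIC_SYSTEM: TextBlock[] = [
  { type: "text", text: "Long instruction prompt...", cache_control: { type: "ephemeral" } },
];

// Dynamic values travel in the user message, after the cached prefix.
function buildUserMessage(userText: string, ctx: RequestContext): UserMessage {
  const header = [
    `Current time: ${ctx.now.toISOString()}`,
    `Session: ${ctx.sessionId}`,
    `User: ${ctx.username}`,
  ].join("\n");
  return { role: "user", content: `${header}\n\n${userText}` };
}
```

With this split, two requests made a day apart send the same `tools` and `system` bytes, and only `messages` differs.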
Four Ways Prompt Caching Backfires
"Up to 90%" is the best case. In real workloads, these four patterns flip the math and you pay more, not less.
1. The cached prefix is written but never read again. 5-min TTL writes cost 1.25x base input and 1-hour TTL writes cost 2x. If no later request hits the cache, you pay that surcharge for nothing; at 1-hour TTL even a single read (2.0 + 0.1 = 2.1x vs 2.0x uncached) still leaves you slightly behind. For short, single-shot sessions (a cold-start user fires one request and disconnects), the write surcharge never gets amortized. Caching pays off on agent loops where the same prefix is hit repeatedly.
2. The region is below the minimum cacheable size. Sonnet/Opus require 1,024 tokens; Haiku requires 2,048 ✅. Below that, cache_control is silently ignored and the region is billed at full input rate. For small in-house MCP servers (a handful of tools), place the breakpoint at the end of system instead, so that the combined tools + system prefix clears the threshold.
3. Dynamic tokens sit inside the prefix. Got "Today is {date}" or "User: {username}" at the top of system? The cache key is order-sensitive, built front-to-back. A changing value invalidates everything behind it, and if your only breakpoint is at the end of system, the whole prefix, 80K tokens of tool definitions included, is rewritten as a fresh cache_creation each time the value changes. Keep all dynamic values in messages; tools and system stay stable.
4. The tool array order is unstable. If MCP tools are fetched in non-deterministic order, or a config change shuffles the array, the cache key changes and you pay for a fresh write. Stable-sort the tool array (by name, alphabetical, or any deterministic key). It is an easy-to-miss prerequisite.
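The ordering fix can be sketched as a small normalization step that runs before every request. `ToolDefinition` and `prepareTools` are hypothetical names; the sketch assumes tool names are unique and that each tool's `input_schema` is itself serialized with a stable key order, which this helper does not enforce.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: { type: "ephemeral" };
}

/**
 * Returns a new array sorted by name, with a single cache breakpoint on the last tool.
 * The input array and its objects are left untouched.
 */
function prepareTools(tools: readonly ToolDefinition[]): ToolDefinition[] {
  const sorted: ToolDefinition[] = tools
    // Drop any breakpoint left over from an earlier ordering.
    .map(({ cache_control: _stale, ...rest }) => ({ ...rest }))
    // Plain code-unit comparison: deterministic and independent of the host locale.
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));

  if (sorted.length > 0) {
    const last = sorted.length - 1;
    sorted[last] = { ...sorted[last], cache_control: { type: "ephemeral" } };
  }
  return sorted;
}
```

Whatever order the MCP servers return their tools in, the array passed to the API is then the same, so the serialized `tools` block stays identical across requests.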
Measuring Hit Rate and the Optimization Loop
Caching is not "set and forget." You need to measure the hit rate continuously and trace misses back to their cause.
The Anthropic API response includes a usage object that splits writes from reads ✅: cache_creation_input_tokens (writes) and cache_read_input_tokens (reads). Hit rate is straightforward:

    hit_rate = cache_read_input_tokens
               / (cache_read_input_tokens + input_tokens + cache_creation_input_tokens)

    // e.g. { input_tokens: 320, cache_creation: 0, cache_read: 55200, output: 250 }
    //   → hit_rate = 55200 / (55200 + 320 + 0) = 99.4% (ideal)
    // e.g. { input_tokens: 320, cache_creation: 55200, cache_read: 0, output: 250 }
    //   → hit_rate = 0% (the prefix is changing every request)

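The same formula as a small helper, for logging per request or aggregating over a window. The `Usage` shape mirrors the fields named above; `cacheHitRate` and `aggregateHitRate` are hypothetical helper names, and treating missing cache fields as zero is an assumption of this sketch.

```typescript
// Field names follow the usage object described above.
interface Usage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
  output_tokens: number;
}

/** Share of input tokens served from cache for a single response (0 to 1). */
function cacheHitRate(usage: Usage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const created = usage.cache_creation_input_tokens ?? 0;
  const total = read + created + usage.input_tokens;
  return total === 0 ? 0 : read / total;
}

/** Token-weighted hit rate over many responses, e.g. one hour of traffic. */
function aggregateHitRate(usages: readonly Usage[]): number {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    const r = u.cache_read_input_tokens ?? 0;
    read += r;
    total += r + (u.cache_creation_input_tokens ?? 0) + u.input_tokens;
  }
  return total === 0 ? 0 : read / total;
}
```

The aggregate is weighted by tokens rather than averaged per request, so one large cache miss is not hidden behind many small hits.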
| Hit Rate | Diagnosis | Common Cause | Action |
|---|---|---|---|
| 0-20% | Critical miss | Dynamic token in the prefix | Remove timestamps / IDs from system/tools |
| 20-60% | Partial | Unstable tool order / wrong TTL | Stable sort tools / reassess 5m vs 1h TTL |
| 60-80% | Room to improve | cache_control only on tools | Mark end of system / add 2nd breakpoint |
| 80%+ | Healthy | — | Maintain; move to output-side optimization |
Published case studies put healthy agent workloads at 74-84% hit rate ✅. If you are below that, work the table top-down.
1. Stable-sort the tools array by name.
2. Put cache_control: ephemeral on the last tool definition.
3. Add a second cache block at the end of the system prompt.
4. Move all dynamic values (timestamps, IDs, user info) into messages.
Most MCP integrations move from the 0-20% band to 80%+ with these four steps alone.
A Realistic Savings Simulation
Take an Opus 4.7 agent at 10,000 requests/day, with a fixed prefix (MCP tool definitions + system) of 80,000 tokens, a user message of 500 tokens, and 1,000 output tokens per call. We compare three scenarios — no cache, 5-min TTL, 1-hour TTL — using Opus pricing from Anthropic's official docs ✅.
| Scenario | Daily input cost | vs no cache | Notes |
|---|---|---|---|
| No cache | 100% (baseline) | — | 80K × 10,000 = 800M input tokens/day, all at full price |
| 5-min TTL (80% hit assumed) | ~29% | -71% | 0.5% cache write (1.25x) + 79.5% read (0.1x) + 20% miss (1x): 0.006 + 0.080 + 0.200 ≈ 0.29 |
| 1-hour TTL (90% hit assumed) | ~19% | -81% | Longer sessions; 2x write absorbed by 90% hits |
This is input-side only — output tokens are unaffected. But agent ops is an input-heavy workload, and the larger the fixed prefix, the harder caching works for you.
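To rerun the comparison with your own traffic mix, the relative input cost can be sketched as a weighted sum. `Scenario` and `relativeInputCost` are hypothetical names; the 5-minute shares are the ones listed in the table notes, while the 1-hour split (90% reads, 10% misses, negligible write share) is an assumption made here for illustration. Misses are billed at 1x in this simplified model.

```typescript
interface Scenario {
  writeShare: number;      // fraction of prefix tokens billed as cache writes
  readShare: number;       // fraction billed as cache reads (0.1x)
  missShare: number;       // fraction billed at full input price (1x)
  writeMultiplier: number; // 1.25 for 5-min TTL, 2.0 for 1-hour TTL
}

const READ_MULTIPLIER = 0.1;

/** Input cost relative to the no-cache baseline (1.0 = same cost as no caching). */
function relativeInputCost(s: Scenario): number {
  const totalShare = s.writeShare + s.readShare + s.missShare;
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error(`shares must sum to 1, got ${totalShare}`);
  }
  return s.writeShare * s.writeMultiplier + s.readShare * READ_MULTIPLIER + s.missShare;
}

// Shares from the table notes (5-min) and an assumed split for the 1-hour row.
const fiveMinute: Scenario = { writeShare: 0.005, readShare: 0.795, missShare: 0.2, writeMultiplier: 1.25 };
const oneHour: Scenario = { writeShare: 0, readShare: 0.9, missShare: 0.1, writeMultiplier: 2.0 };

// Daily prefix volume from the scenario: 80,000 tokens x 10,000 requests.
const dailyPrefixTokens = 80_000 * 10_000;
const fiveMinuteTokenEquivalents = dailyPrefixTokens * relativeInputCost(fiveMinute);
const oneHourTokenEquivalents = dailyPrefixTokens * relativeInputCost(oneHour);
```

Multiplying the token-equivalents by the current per-token input price from the official pricing page turns the ratios into a daily figure.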
FAQ
Does Anthropic's prompt caching really cut costs by 90%?
"Up to 90%" — on the cached portion of input. cache_read is 0.1x base input ✅. Total savings depend on hit rate; stable agent workloads see 74-84% in published case studies. Output tokens are not discounted.
Why do MCP integrations burn so many input tokens?
MCP tool definitions ship at the front of every prompt, before the user message. Five MCP servers and 58 tools = 55,000+ tokens before the user types; add Jira and you cross 100,000 ✅. Since these definitions barely change between requests, they are textbook caching targets.
Where should cache_control go?
Anthropic walks the prefix tools → system → messages ✅. Attach cache_control: {type: "ephemeral"} to the LAST tool — that caches the entire tools block as one region. If you have a long system prompt, mark the end of system too. Rule: "Mark the stable region from the back inward."
When does prompt caching actually backfire?
Four cases: (1) the cached region is written but never read again (1.25x or 2x write surcharge with no payback), (2) the region is below the minimum cacheable size (1,024 for Sonnet/Opus, 2,048 for Haiku), (3) dynamic tokens (timestamps, IDs) are inside the prefix, (4) the tool array is in unstable order across requests.
How do I measure the cache hit rate?
The Anthropic Messages API response includes a usage object that splits writes from reads: cache_creation_input_tokens vs cache_read_input_tokens ✅. Hit rate = cache_read / (cache_read + input + cache_creation). Healthy is 80%+. Observability tools (Datadog, Helicone, Langfuse) surface the breakdown.
Anthropic spec figures cited here (cache_read 0.1x, cache_write 1.25x/2x, minimum cacheable tokens 1,024 for Sonnet/Opus and 2,048 for Haiku, the tools → system → messages cache key order) come from Anthropic's official Prompt Caching documentation and pricing page ✅ (as of 2026-05-20). The "5 MCP servers / 58 tools / 55,000+ tokens" and "100,000+ tokens with Jira" figures come from independent observability vendors (Helicone, ChatForest et al.) and MCP community write-ups ✅. The "74-84% hit rate" and "80%/90% scenario assumptions" reflect representative figures from public case studies and KanseiLink estimates — actual values vary by workload. The 10K req/day simulation is illustrative; pricing follows Anthropic's docs at this point in time and may change — always confirm with the official pricing page before basing financial decisions on it.