Cut Agent Costs by Up to 90% With MCP Prompt Caching 2026 — Anthropic's Official Spec and the Pitfalls That Make It Backfire

Q: Does Anthropic's prompt caching really cut costs by 90%?

Per the spec, yes, but only on the cached portion of input. Anthropic's official docs price cache-read tokens at 0.1x the base input rate, a 90% discount on those tokens. The discount does not apply to output tokens or to uncached input. There is also a write surcharge: 5-minute cache writes cost 1.25x and 1-hour cache writes cost 2x base input. Real workload-level savings depend on your hit rate; published case studies report hit rates of 74-84% for stable agent loops.

Q: Where should cache_control go?

Anthropic caches the prefix strictly in this order: tools → system → messages. For MCP integrations, attach cache_control: {type: 'ephemeral'} to the LAST tool definition — that marks the entire tool block for caching. If you also have a long system prompt, mark the end of system as well so the tools+system region is cached together. The rule is 'mark from the back of the stable region inward.' Dynamic tokens (timestamps, session IDs) must live in messages, not in the prefix — putting them inside tools or system pushes every byte after them into cache miss.

Q: When does prompt caching actually backfire?

Four cases. (1) The cached prefix is used by only one request: the write costs 1.25x (5-min TTL) or 2x (1-hour TTL) base input, and with no later read nothing offsets the surcharge. Counting the request that writes the cache, the 5-min TTL comes out ahead from the 2nd request and the 1-hour TTL from the 3rd. (2) The cached region is below the per-model minimum: 1,024 tokens for Sonnet/Opus, 2,048 for Haiku. Anything smaller is ignored and billed at full input rate. (3) Dynamic tokens are inside the prefix (a timestamp at the top of system, a username inside a tool description), so every byte after that point is a miss. (4) Your tool array is in unstable order across requests. The cache key is order-sensitive, so a shuffled array means a fresh write every time.

Q: How do I measure the cache hit rate?

The Anthropic Messages API response includes a usage object that splits cache_creation_input_tokens (writes) from cache_read_input_tokens (reads). Hit rate is cache_read / (cache_read + input_tokens + cache_creation). In practice, a hit rate below 60% signals dynamic tokens in the prefix; 80% or higher is the right operating zone. Published case studies put healthy agent workloads at 74-84%. Observability tools (Datadog, Helicone, Langfuse) surface this breakdown automatically.

Why MCP Integrations Burn Tokens
Anthropic's Spec — The cache_write / cache_read Economics
Where cache_control Goes — tools → system → messages
Four Ways Prompt Caching Backfires
Measuring Hit Rate and the Optimization Loop
A Realistic Savings Simulation
FAQ

Why MCP Integrations Burn Tokens

Teams running MCP servers in production tend to notice the same anomaly on the bill: input token counts are large even when the user has barely typed anything. The reason is mechanical — MCP tool definitions (name, description, inputSchema, annotations) ship at the front of every prompt, ahead of the user message.

Independent reports put the cost into numbers: five MCP servers and 58 tools consume 55,000+ tokens before the user types, and adding integrations like Jira pushes that past 100,000 tokens of context window absorbed by tool definitions alone ✅ (we audited this claim in "Are 'MCP Tool Definitions Eat 40-50% of Context' True?"). That is the invisible floor under your agent ops bill.

Editorial take, May 2026

The key insight is that tool definitions are a stable prefix. What changes per request is the user message and a few context variables — not the tool block. So most of the MCP input cost is just full-price billing for tokens that should have been cache reads. Anthropic's prompt caching is the official mechanism that closes that gap, by up to 90% on the cached portion ✅.

Anthropic's Spec — The cache_write / cache_read Economics

From Anthropic's official documentation ✅:

cache_read (tokens served from cache) = 0.1x the base input price — a 90% discount on the cached portion.
cache_write (first time a region is stored): 1.25x base input at 5-minute TTL, 2.0x at 1-hour TTL.
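The write surcharge pays back on subsequent reads. Counting the request that writes the cache, 5-min TTL (1.25x) is cheaper than no caching from the 2nd request (1.25 + 0.1 = 1.35 vs 2.0), and 1-hour TTL (2.0x) from the 3rd request (2.0 + 0.1 + 0.1 = 2.2 vs 3.0) ✅.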
Minimum cacheable region per model: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku). Anything smaller is ignored and billed at full input rate ✅.

A concrete example. Suppose an Opus agent ships an 80K-token fixed prefix on every request. Without caching, you pay full input price on those 80K every time. Add 5-min TTL caching and the first request costs 80K × 1.25 = 100K-equivalent input; every subsequent hit costs 80K × 0.1 = 8K. From the second request on, the cached input drops to 10% of its original price — that is the "up to 90% reduction" headline ✅.
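The arithmetic above can be sketched as a small calculator. This is an illustrative sketch, not Anthropic code: the function names are hypothetical, costs are expressed in units of "one uncached send of the prefix", and it assumes the first request writes the cache and every later request hits it before the TTL expires.

```javascript
// Multipliers as quoted above from Anthropic's documentation.
const READ_MULTIPLIER = 0.1;
const WRITE_MULTIPLIER = { "5m": 1.25, "1h": 2.0 };

// Cost of sending the same prefix `requests` times with caching on.
// Assumption: request 1 is a cache write, requests 2..n are cache reads.
function cachedCost(requests, ttl) {
  if (requests < 1) return 0;
  return WRITE_MULTIPLIER[ttl] + (requests - 1) * READ_MULTIPLIER;
}

// Cost of sending the same prefix `requests` times with no caching.
function uncachedCost(requests) {
  return requests;
}

// Smallest request count at which caching is strictly cheaper.
function breakEvenRequests(ttl) {
  let n = 1;
  while (cachedCost(n, ttl) >= uncachedCost(n)) n += 1;
  return n;
}

console.log(breakEvenRequests("5m")); // 2
console.log(breakEvenRequests("1h")); // 3
```

Multiplying a result by the prefix size gives token-equivalents: for the 80K prefix, a write-then-read pair at 5-min TTL costs 1.35 × 80K = 108K versus 160K uncached.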

cache_read vs cache_write Economics

Multiplier	Applies to
0.1×	cache_read (10% of base input)
1.25×	cache_write at 5-min TTL
2.0×	cache_write at 1-hour TTL

Where cache_control Goes — tools → system → messages

Anthropic builds the cache key by walking the prompt in a fixed order: tools → system → messages ✅. That order is decisive — the key is constructed from the front; the moment a dynamic token appears, every byte after it becomes a cache miss.

For MCP integrations, attach cache_control: {type: "ephemeral"} to the last tool definition. That marks the entire tools block as one cacheable region:

// Anthropic Messages API: cache_control on the last tool
// (ES module, so top-level await is available)
import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 4096,
  tools: [
    { name: "tool_a", description: "...", input_schema: { type: "object" /* ... */ } },
    { name: "tool_b", description: "...", input_schema: { type: "object" /* ... */ } },
    // ... 58 tools total ...
    {
      name: "tool_last",
      description: "...",
      input_schema: { type: "object" /* ... */ },
      cache_control: { type: "ephemeral" }  // ← caches the whole tools block
    }
  ],
  system: [
    {
      type: "text",
      text: "Long instruction prompt...",
      cache_control: { type: "ephemeral" }  // ← also cache the system block
    }
  ],
  messages: [{ role: "user", content: "..." }]
});

// Check the hit rate in the response
console.log(response.usage);
// {
//   input_tokens: 320,                      ← non-cached input
//   cache_creation_input_tokens: 55200,     ← first write
//   cache_read_input_tokens: 0,             ← flips on subsequent calls
//   output_tokens: 250
// }

The principle: "Mark the stable region from the back inward." Cache keys build front-to-back, so cache_control on the last item of a stable region captures everything from the prompt start up to that mark. For MCP integrations, the tools block is the biggest and most stable region, so cache_control at the tail of tools is the standard placement. If you also ship long system instructions, mark the end of system as well.

Dynamic values (timestamps, session IDs, usernames) belong in messages, not in the prefix. A timestamp at the top of system invalidates every cached byte after it.
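One way to enforce that separation is to route every per-request value through a single request builder. The sketch below is illustrative only: buildRequest, STABLE_SYSTEM_TEXT, and the "[context]" line format are hypothetical names chosen for this example, and the model ID is copied from the snippet above.

```javascript
// Hypothetical constant: the system text must be byte-identical across requests.
const STABLE_SYSTEM_TEXT = "Long instruction prompt...";

// Builds Messages API parameters with a stable prefix (tools + system)
// and all per-request values rendered into the user turn.
function buildRequest(tools, userText, context) {
  // Dynamic values go here, after the cached prefix, never into system/tools.
  const contextLine = `[context] date=${context.date} user=${context.username}`;
  return {
    model: "claude-opus-4-7",
    max_tokens: 4096,
    tools,
    system: [
      {
        type: "text",
        text: STABLE_SYSTEM_TEXT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: `${contextLine}\n\n${userText}` }],
  };
}

// Two requests on different days from different users share the same prefix.
const monday = buildRequest([], "List open tickets", { date: "2026-05-18", username: "ana" });
const tuesday = buildRequest([], "List open tickets", { date: "2026-05-19", username: "bo" });
console.log(
  JSON.stringify([monday.tools, monday.system]) ===
    JSON.stringify([tuesday.tools, tuesday.system])
); // true
```

The returned object can be passed to anthropic.messages.create as-is; the comparison at the end is a cheap regression check that nothing dynamic has leaked into the prefix.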

Four Ways Prompt Caching Backfires

"Up to 90%" is the best case. In real workloads, these four patterns flip the math and you pay more, not less.

⚠️ Pitfall 1: Caching a prefix that's only used once

5-min TTL writes are 1.25x; 1-hour TTL writes are 2x. If the request that writes the prefix is also the only one that ever uses it, you pay 1.25x or 2x base input for that prefix and never collect a discounted read. For short, single-shot sessions (a cold-start user fires one request and disconnects), the write surcharge never gets amortized. Caching pays off on agent loops where the same prefix is hit repeatedly within the TTL.

⚠️ Pitfall 2: Caching below the minimum cacheable size

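Sonnet/Opus require 1,024 tokens; Haiku requires 2,048 ✅. Below that, cache_control is silently ignored and the region is billed at full input rate. For small in-house MCP servers (a handful of tools), the tools block alone may not reach the threshold. Because the cached region runs from the start of the prompt to the breakpoint, placing the breakpoint at the end of system lets tools and system count toward the minimum together.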

⚠️ Pitfall 3: Dynamic tokens slipped into the prefix

Got "Today is {date}" or "User: {username}" at the top of system? The cache key is order-sensitive, built front-to-back. A daily-changing date invalidates the entire region behind it — every day, your 80K tokens of tool definitions get rewritten as a fresh cache_creation. Keep all dynamic values in messages; tools and system stay stable.

⚠️ Pitfall 4: Unstable tool ordering across requests

If MCP tools are fetched in non-deterministic order, or a config change shuffles the array, the cache key changes and you pay for a fresh write. Sort the tool array deterministically before every request (alphabetically by name, or by any other fixed key). It is an easy prerequisite to miss.
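A minimal sketch of that normalization step, assuming each tool object carries a unique name field. prepareTools is a hypothetical helper, not part of any SDK; it sorts by plain code-unit comparison so the result does not depend on the machine's locale.

```javascript
// Returns a new tools array in a deterministic order, with a single
// cache breakpoint on the last entry.
function prepareTools(tools) {
  const sorted = tools
    // Drop any breakpoint left over from a previous ordering.
    .map(({ cache_control, ...rest }) => rest)
    // Locale-independent comparison on the tool name.
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));

  if (sorted.length > 0) {
    const lastIndex = sorted.length - 1;
    sorted[lastIndex] = {
      ...sorted[lastIndex],
      cache_control: { type: "ephemeral" },
    };
  }
  return sorted;
}

// The same tools arriving in a different order produce the same array.
const runA = prepareTools([{ name: "search_docs" }, { name: "create_issue" }]);
const runB = prepareTools([{ name: "create_issue" }, { name: "search_docs" }]);
console.log(JSON.stringify(runA) === JSON.stringify(runB)); // true
console.log(runA.map((tool) => tool.name)); // [ 'create_issue', 'search_docs' ]
```

Stripping an existing cache_control before sorting is a design choice of this sketch: it keeps the breakpoint on whichever tool ends up last, rather than wherever a previous run left it.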

Measuring Hit Rate and the Optimization Loop

Caching is not "set and forget." You need to measure the hit rate continuously and trace misses back to their cause.

The Anthropic API response includes a usage object that splits writes from reads ✅: cache_creation_input_tokens (writes) and cache_read_input_tokens (reads). Hit rate is straightforward:

hit_rate = cache_read_input_tokens
         / (cache_read_input_tokens + input_tokens + cache_creation_input_tokens)

// e.g. { input_tokens: 320, cache_creation: 0, cache_read: 55200, output: 250 }
//   → hit_rate = 55200 / (55200 + 320 + 0) = 99.4% (ideal)

// e.g. { input_tokens: 320, cache_creation: 55200, cache_read: 0, output: 250 }
//   → hit_rate = 0% (the prefix is changing every request)
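The formula translates directly into a helper that takes the usage object from a response. This is a sketch: cacheHitRate is a hypothetical name, and missing fields are treated as zero, which is an assumption made here for robustness.

```javascript
// Token-weighted cache hit rate for a single Messages API response.
function cacheHitRate(usage) {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const uncached = usage.input_tokens ?? 0;
  const total = read + written + uncached;
  // Guard against division by zero on an empty usage object.
  return total === 0 ? 0 : read / total;
}

// The two examples above:
const warm = { input_tokens: 320, cache_creation_input_tokens: 0, cache_read_input_tokens: 55200 };
const cold = { input_tokens: 320, cache_creation_input_tokens: 55200, cache_read_input_tokens: 0 };

console.log((cacheHitRate(warm) * 100).toFixed(1)); // 99.4
console.log((cacheHitRate(cold) * 100).toFixed(1)); // 0.0
```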

Hit Rate	Diagnosis	Common Cause	Action
0-20%	Critical miss	Dynamic token in the prefix	Remove timestamps / IDs from system/tools
20-60%	Partial	Unstable tool order / wrong TTL	Stable sort tools / reassess 5m vs 1h TTL
60-80%	Room to improve	cache_control only on tools	Mark end of system / add 2nd breakpoint
80%+	Healthy	—	Maintain; move to output-side optimization

Published case studies put healthy agent workloads at 74-84% hit rate ✅. If you are below that, work the table top-down.
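A single response says little, so in practice the rate is usually computed over a window of requests. The sketch below sums token counts across a list of usage objects and maps the result onto the bands in the table above; aggregateHitRate and diagnose are hypothetical names, and the band edges are taken from the table.

```javascript
// Token-weighted hit rate across many responses' usage objects.
function aggregateHitRate(usages) {
  let read = 0;
  let total = 0;
  for (const u of usages) {
    const r = u.cache_read_input_tokens ?? 0;
    read += r;
    total += r + (u.cache_creation_input_tokens ?? 0) + (u.input_tokens ?? 0);
  }
  return total === 0 ? 0 : read / total;
}

// Maps a hit rate (0..1) to the diagnosis column of the table.
function diagnose(rate) {
  if (rate < 0.2) return "Critical miss";
  if (rate < 0.6) return "Partial";
  if (rate < 0.8) return "Room to improve";
  return "Healthy";
}

// One cache write followed by nine reads of the same 55,200-token prefix.
const write = { input_tokens: 320, cache_creation_input_tokens: 55200, cache_read_input_tokens: 0 };
const hit = { input_tokens: 320, cache_creation_input_tokens: 0, cache_read_input_tokens: 55200 };
const sessionUsages = [write, ...Array(9).fill(hit)];

console.log(diagnose(aggregateHitRate(sessionUsages))); // Healthy
```

Note that the write request counts against the rate: with only four reads after the write, the same prefix lands at roughly 79.5%, just under the 80% line.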

✅ Four steps to push hit rate above 80%

(1) Sort the tools array deterministically by name. (2) Put cache_control: {type: "ephemeral"} on the last tool definition. (3) Add a second cache breakpoint at the end of the system prompt. (4) Move all dynamic values (timestamps, IDs, user info) into messages. Most MCP integrations move from the 0-20% band to 80%+ with these four steps alone.

A Realistic Savings Simulation

Take an Opus 4.7 agent at 10,000 requests/day, with a fixed prefix (MCP tool definitions + system) of 80,000 tokens, a user message of 500 tokens, and 1,000 output tokens per call. We compare three scenarios — no cache, 5-min TTL, 1-hour TTL — using Opus pricing from Anthropic's official docs ✅.

Scenario	Daily input cost	vs no cache	Notes
No cache	100% (baseline)	—	80K × 10,000 = 800M input tokens/day, all at full price
5-min TTL (80% hit assumed)	~29%	-71%	0.5% cache write (1.25x) + 79.5% read (0.1x) + 20% miss (1.0x)
1-hour TTL (90% hit assumed)	~19%	-81%	Longer sessions; 2x write absorbed by 90% hits

This is input-side only — output tokens are unaffected. But agent ops is an input-heavy workload, and the larger the fixed prefix, the harder caching works for you.
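To rerun the comparison with your own numbers, the blended cost of the prefix can be computed from the share of prefix tokens billed at each rate. This is a sketch under stated assumptions: relativeInputCost is a hypothetical helper, the multipliers are the ones quoted earlier, and the example treats the 10% of prefix tokens that miss in the 1-hour scenario as billed at the base rate, which is an assumption of this example rather than something the table spells out.

```javascript
// Blended prefix cost relative to the no-cache baseline (1.0 = full price).
// Each share is the fraction of prefix tokens billed at that rate.
function relativeInputCost({ readShare, writeShare, fullPriceShare, writeMultiplier }) {
  const sum = readShare + writeShare + fullPriceShare;
  if (Math.abs(sum - 1) > 1e-9) {
    throw new Error("shares must sum to 1");
  }
  return (
    readShare * 0.1 +              // cache reads at 0.1x
    writeShare * writeMultiplier + // cache writes at 1.25x or 2x
    fullPriceShare * 1.0           // uncached tokens at base price
  );
}

// 1-hour TTL row: 90% of prefix tokens read from cache, 10% at base price.
const oneHour = relativeInputCost({
  readShare: 0.9,
  writeShare: 0,
  fullPriceShare: 0.1,
  writeMultiplier: 2.0,
});
console.log(oneHour.toFixed(2)); // 0.19
```

Multiplying the result by the baseline (80K × 10,000 = 800M tokens/day) converts it to full-price token-equivalents per day.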

FAQ

Does Anthropic's prompt caching really cut costs by 90%?

"Up to 90%" — on the cached portion of input. cache_read is 0.1x base input ✅. Total savings depend on hit rate; stable agent workloads see 74-84% in published case studies. Output tokens are not discounted.

Why do MCP integrations burn so many input tokens?

MCP tool definitions ship at the front of every prompt, before the user message. Five MCP servers and 58 tools = 55,000+ tokens before the user types; add Jira and you cross 100,000 ✅. Since these definitions barely change between requests, they are textbook caching targets.

Where should cache_control go?

Anthropic walks the prefix tools → system → messages ✅. Attach cache_control: {type: "ephemeral"} to the LAST tool — that caches the entire tools block as one region. If you have a long system prompt, mark the end of system too. Rule: "Mark the stable region from the back inward."

When does prompt caching actually backfire?

Four cases: (1) the cached region is used by only one request (1.25x or 2x write surcharge with no later read to pay it back), (2) the region is below the minimum cacheable size (1,024 for Sonnet/Opus, 2,048 for Haiku), (3) dynamic tokens (timestamps, IDs) are inside the prefix, (4) the tool array is in unstable order across requests.

How do I measure the cache hit rate?

The Anthropic Messages API response includes a usage object that splits writes from reads: cache_creation_input_tokens vs cache_read_input_tokens ✅. Hit rate = cache_read / (cache_read + input + cache_creation). Healthy is 80%+. Observability tools (Datadog, Helicone, Langfuse) surface the breakdown.

Data Disclosure & Disclaimer

Anthropic spec figures cited here (cache_read 0.1x, cache_write 1.25x/2x, minimum cacheable tokens 1,024 for Sonnet/Opus and 2,048 for Haiku, the tools → system → messages cache key order) come from Anthropic's official Prompt Caching documentation and pricing page ✅ (as of 2026-05-20). The "5 MCP servers / 58 tools / 55,000+ tokens" and "100,000+ tokens with Jira" figures come from independent observability vendors (Helicone, ChatForest et al.) and MCP community write-ups ✅. The "74-84% hit rate" and "80%/90% scenario assumptions" reflect representative figures from public case studies and KanseiLink estimates — actual values vary by workload. The 10K req/day simulation is illustrative; pricing follows Anthropic's docs at this point in time and may change — always confirm with the official pricing page before basing financial decisions on it.

Related Articles

Are "MCP Tool Definitions Eat 40-50% of Context" and "MCP Is Dead" Really True? — Fact-Checking the Perplexity Defection 2026

Claude Haiku vs Sonnet vs Opus: Task-Based Cost Optimization Guide for Japanese SaaS 2026

The Economics of the Retry Tax 2026 — How Low-Success MCP Servers Quietly Inflate Your API Bill

MCP Tool Schema Design Guide 2026 — 7 Principles for name, description, inputSchema, and annotations
