Table of Contents

  1. Why MCP Integrations Burn Tokens
  2. Anthropic's Spec — The cache_write / cache_read Economics
  3. Where cache_control Goes — tools → system → messages
  4. Four Ways Prompt Caching Backfires
  5. Measuring Hit Rate and the Optimization Loop
  6. A Realistic Savings Simulation
  7. FAQ

Why MCP Integrations Burn Tokens

Teams running MCP servers in production tend to notice the same anomaly on the bill: input token counts are large even when the user has barely typed anything. The reason is mechanical — MCP tool definitions (name, description, inputSchema, annotations) ship at the front of every prompt, ahead of the user message.

Independent reports put the cost into numbers: five MCP servers and 58 tools consume 55,000+ tokens before the user types, and adding integrations like Jira pushes that past 100,000 tokens of context window absorbed by tool definitions alone ✅ (we audited this claim in "Are 'MCP Tool Definitions Eat 40-50% of Context' True?"). That is the invisible floor under your agent ops bill.

Editorial take, May 2026

The key insight is that tool definitions are a stable prefix. What changes per request is the user message and a few context variables — not the tool block. So most of the MCP input cost is just full-price billing for tokens that should have been cache reads. Anthropic's prompt caching is the official mechanism that closes that gap, by up to 90% on the cached portion ✅.

Anthropic's Spec — The cache_write / cache_read Economics

From Anthropic's official documentation ✅, the cache_read vs cache_write economics:

  cache_read: 0.1× base input (10% of the base price)
  cache_write at 5-min TTL: 1.25× base input
  cache_write at 1-hour TTL: 2.0× base input

A concrete example. Suppose an Opus agent ships an 80K-token fixed prefix on every request. Without caching, you pay full input price on those 80K every time. Add 5-min TTL caching and the first request costs 80K × 1.25 = 100K-equivalent input; every subsequent hit costs 80K × 0.1 = 8K. From the second request on, the cached input drops to 10% of its original price, which is the "up to 90% reduction" headline ✅.
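
To make the arithmetic concrete, here is a minimal sketch that totals the input-equivalent tokens billed for a fixed prefix over a run of requests. The function name prefixCostInTokens is hypothetical, and the model assumes the first call writes the cache and every later call is a hit inside the TTL; the multipliers are the ones quoted in this section.

```typescript
// Multipliers quoted in this section, relative to the base input price.
const CACHE_READ = 0.1;
const CACHE_WRITE = { "5m": 1.25, "1h": 2.0 } as const;

type Ttl = keyof typeof CACHE_WRITE;

// Input-equivalent tokens billed for the fixed prefix across `requests` calls.
// Assumption: the first call writes the cache and every later call is a hit.
function prefixCostInTokens(prefixTokens: number, requests: number, ttl?: Ttl): number {
  if (requests <= 0) return 0;
  if (ttl === undefined) return prefixTokens * requests; // no caching: full price each time
  const write = prefixTokens * CACHE_WRITE[ttl];
  const reads = prefixTokens * CACHE_READ * (requests - 1);
  return Math.round(write + reads);
}

// Ten requests with the 80K prefix from the example.
const tenUncached = prefixCostInTokens(80000, 10);     // 800000
const tenCached = prefixCostInTokens(80000, 10, "5m"); // 100000 + 9 × 8000 = 172000
console.log({ tenUncached, tenCached });
```

Real traffic will include TTL expiries and misses, so treat the result as a lower bound on the cached cost rather than a forecast.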

Where cache_control Goes — tools → system → messages

Anthropic builds the cache key by walking the prompt in a fixed order: tools → system → messages ✅. That order is decisive — the key is constructed from the front; the moment a dynamic token appears, every byte after it becomes a cache miss.

For MCP integrations, attach cache_control: {type: "ephemeral"} to the last tool definition. That marks the entire tools block as one cacheable region:

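// Anthropic Messages API: cache_control on the last tool
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({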
  model: "claude-opus-4-7",
  max_tokens: 4096,
  tools: [
    { name: "tool_a", description: "...", input_schema: {...} },
    { name: "tool_b", description: "...", input_schema: {...} },
    // ... 58 tools total ...
    {
      name: "tool_last",
      description: "...",
      input_schema: {...},
      cache_control: { type: "ephemeral" }  // ← caches the whole tools block
    }
  ],
  system: [
    {
      type: "text",
      text: "Long instruction prompt...",
      cache_control: { type: "ephemeral" }  // ← also cache the system block
    }
  ],
  messages: [{ role: "user", content: "..." }]
});

// Check the hit rate in the response
console.log(response.usage);
// {
//   input_tokens: 320,                      ← non-cached input
//   cache_creation_input_tokens: 55200,     ← first write
//   cache_read_input_tokens: 0,             ← flips on subsequent calls
//   output_tokens: 250
// }

The principle: "Mark the stable region from the back inward." Cache keys build front-to-back, so cache_control on the last item of a stable region captures everything from the prompt start up to that mark. For MCP integrations, the tools block is the biggest and most stable region, so cache_control at the tail of tools is the standard placement. If you also ship long system instructions, mark the end of system as well.

Dynamic values (timestamps, session IDs, usernames) belong in messages, not in the prefix. A timestamp at the top of system invalidates every cached byte after it.
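
One way to enforce that split is to build the stable prefix and the per-request values in separate functions. The sketch below is illustrative only: TextBlock, RequestContext, buildSystem, and buildUserMessage are hypothetical names, and the shapes simply mirror the request fields shown earlier.

```typescript
// Hypothetical shapes for illustration; they mirror the request fields used above.
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface RequestContext {
  username: string;
  sessionId: string;
  now: Date;
}

// Stable across requests: no timestamps, IDs, or usernames in here.
const SYSTEM_PROMPT = "Long instruction prompt...";

function buildSystem(): TextBlock[] {
  return [{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }];
}

// Dynamic values travel with the user turn, after the cached prefix.
function buildUserMessage(ctx: RequestContext, userText: string) {
  const header = [
    `Current time: ${ctx.now.toISOString()}`,
    `User: ${ctx.username}`,
    `Session: ${ctx.sessionId}`,
  ].join("\n");
  return { role: "user" as const, content: `${header}\n\n${userText}` };
}
```

Because buildSystem takes no arguments, there is no path by which a per-request value can leak into the cached region.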

Four Ways Prompt Caching Backfires

"Up to 90%" is the best case. In real workloads, these four patterns flip the math and you pay more, not less.

⚠️ Pitfall 1: Caching a prefix that's only read once

5-min TTL writes are 1.25x; 1-hour TTL writes are 2x. If the prefix is written to the cache and never read back, you pay 1.25x (5-min) or 2x (1-hour) the base input price and get nothing for it. For short, single-shot sessions (a cold-start user fires one request and disconnects), the write surcharge never gets amortized. At 5-min TTL a single hit already recovers it (1.25 + 0.1 = 1.35 vs 2.0 uncached); at 1-hour TTL it takes two hits (2.0 + 0.2 = 2.2 vs 3.0). Caching pays off on agent loops where the same prefix is hit repeatedly.
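
The break-even point can be checked with a few lines. This is a sketch under a simplified model: one cache write followed by reads that all land inside the TTL, compared against sending the same prefix uncached each time. The function name readsToBreakEven is hypothetical.

```typescript
const READ_MULTIPLIER = 0.1; // cache_read price relative to base input

// Smallest number of cache reads after the initial write at which the cached
// path is strictly cheaper than sending the prefix uncached on every request.
function readsToBreakEven(writeMultiplier: number): number {
  let reads = 0;
  // cached: one write plus `reads` hits; uncached: (1 + reads) full-price sends
  while (writeMultiplier + READ_MULTIPLIER * reads >= 1 + reads) {
    reads += 1;
  }
  return reads;
}

console.log(readsToBreakEven(1.25)); // 1
console.log(readsToBreakEven(2.0));  // 2
```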

⚠️ Pitfall 2: Caching below the minimum cacheable size

Sonnet/Opus require 1,024 tokens; Haiku requires 2,048 ✅. Below that, cache_control is silently ignored and the region is billed at the full input rate. The minimum applies to the whole prefix up to the breakpoint, so for small in-house MCP servers (a handful of tools), move the breakpoint to the end of the system prompt so that tools plus system together clear the threshold.
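
A small guard can pick the earliest breakpoint that clears the minimum. The sketch below assumes you already have token counts for the tools block and the system prompt; the names MIN_CACHEABLE_TOKENS and chooseBreakpoint are hypothetical, and the thresholds are the figures quoted in this article.

```typescript
// Minimums as quoted in this article; confirm against the current docs for your model.
const MIN_CACHEABLE_TOKENS = { sonnet: 1024, opus: 1024, haiku: 2048 } as const;

type ModelFamily = keyof typeof MIN_CACHEABLE_TOKENS;
type Breakpoint = "end_of_tools" | "end_of_system" | "none";

// Returns the earliest breakpoint whose cumulative prefix reaches the minimum.
function chooseBreakpoint(family: ModelFamily, toolTokens: number, systemTokens: number): Breakpoint {
  const min = MIN_CACHEABLE_TOKENS[family];
  if (toolTokens >= min) return "end_of_tools";
  if (toolTokens + systemTokens >= min) return "end_of_system";
  return "none"; // prefix too short: cache_control would be ignored
}

console.log(chooseBreakpoint("haiku", 900, 1500)); // end_of_system
```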

⚠️ Pitfall 3: Dynamic tokens slipped into the prefix

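Got "Today is {date}" or "User: {username}" at the top of system? The cache key is order-sensitive, built front-to-back. A daily-changing date invalidates everything from that point on, every day. With a breakpoint on the last tool, the tools block ahead of it survives, but the system block is rewritten; if your only breakpoint sits at the end of system, the whole 80K prefix, tool definitions included, is billed as a fresh cache_creation. Keep all dynamic values in messages; tools and system stay stable.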

⚠️ Pitfall 4: Unstable tool ordering across requests

If MCP tools are fetched in non-deterministic order, or a config change shuffles the array, the cache key changes and you pay for a fresh write. Sort the tool array by a deterministic key (the tool name works well) before every request. It is an easy prerequisite to miss.
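
A minimal sketch of that normalization step follows. ToolDefinition and prepareTools are hypothetical names; the function sorts by name using plain code-point comparison (so the result does not depend on locale), rebuilds each object with a fixed key order, and marks only the last tool.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: { type: "ephemeral" };
}

// Returns a new array sorted by name, with cache_control on the last tool only.
function prepareTools(tools: ToolDefinition[]): ToolDefinition[] {
  const sorted: ToolDefinition[] = tools
    // Rebuild each tool with a fixed key order and without any stale cache marker.
    .map((t) => ({ name: t.name, description: t.description, input_schema: t.input_schema }))
    // Code-point comparison is deterministic across machines and locales.
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));

  if (sorted.length > 0) {
    const last = sorted[sorted.length - 1];
    sorted[sorted.length - 1] = { ...last, cache_control: { type: "ephemeral" } };
  }
  return sorted;
}
```

The input array is left untouched, so the function is safe to call on a shared tool registry before each request.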

Measuring Hit Rate and the Optimization Loop

Caching is not "set and forget." You need to measure the hit rate continuously and trace misses back to their cause.

The Anthropic API response includes a usage object that splits writes from reads ✅: cache_creation_input_tokens (writes) and cache_read_input_tokens (reads). Hit rate is straightforward:

hit_rate = cache_read_input_tokens
         / (cache_read_input_tokens + input_tokens + cache_creation_input_tokens)

// e.g. { input_tokens: 320, cache_creation: 0, cache_read: 55200, output: 250 }
//   → hit_rate = 55200 / (55200 + 320 + 0) = 99.4% (ideal)

// e.g. { input_tokens: 320, cache_creation: 55200, cache_read: 0, output: 250 }
//   → hit_rate = 0% (the prefix is changing every request)

Hit Rate | Diagnosis       | Common Cause                    | Action
0-20%    | Critical miss   | Dynamic token in the prefix     | Remove timestamps / IDs from system/tools
20-60%   | Partial         | Unstable tool order / wrong TTL | Sort tools deterministically / reassess 5m vs 1h TTL
60-80%   | Room to improve | cache_control only on tools     | Mark end of system / add 2nd breakpoint
80%+     | Healthy         | -                               | Maintain; move to output-side optimization
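
The formula and the diagnosis bands translate directly into code. The sketch below is illustrative: CacheUsage, cacheHitRate, and diagnose are hypothetical names, and the cache fields are treated as possibly missing or null and counted as zero in that case.

```typescript
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
  output_tokens?: number;
}

// hit_rate = cache_read / (cache_read + input + cache_creation)
function cacheHitRate(usage: CacheUsage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const totalInput = read + usage.input_tokens + written;
  return totalInput === 0 ? 0 : read / totalInput;
}

// Bands from the diagnosis table.
function diagnose(rate: number): string {
  if (rate < 0.2) return "critical miss";
  if (rate < 0.6) return "partial";
  if (rate < 0.8) return "room to improve";
  return "healthy";
}

const warm = cacheHitRate({ input_tokens: 320, cache_creation_input_tokens: 0, cache_read_input_tokens: 55200 });
console.log(`${(warm * 100).toFixed(1)}% -> ${diagnose(warm)}`); // 99.4% -> healthy
```

For a dashboard, sum the three token fields across all responses in a window before dividing, so a single cold request does not dominate the figure.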

Published case studies put healthy agent workloads at 74-84% hit rate ✅. If you are below that, work the table top-down.
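
When the hit rate sits in the lower bands, it helps to find which part of the prefix changed between two consecutive requests. The sketch below is one possible approach, not an official tool: PrefixSnapshot and firstPrefixDrift are hypothetical names, and it compares the serialized tools and system blocks in the same order the cache prefix is built.

```typescript
interface PrefixSnapshot {
  tools: unknown[];  // tool definitions in the order they were sent
  system: unknown[]; // system blocks in the order they were sent
}

// Returns a label for the first prefix element that differs, or null if identical.
function firstPrefixDrift(prev: PrefixSnapshot, next: PrefixSnapshot): string | null {
  const sections: Array<keyof PrefixSnapshot> = ["tools", "system"]; // cache prefix order
  for (const section of sections) {
    const a = prev[section];
    const b = next[section];
    const length = Math.max(a.length, b.length);
    for (let i = 0; i < length; i++) {
      // JSON.stringify is order-sensitive, so reordered keys also count as drift.
      if (JSON.stringify(a[i]) !== JSON.stringify(b[i])) return `${section}[${i}]`;
    }
  }
  return null;
}
```

A result such as tools[0] points at unstable ordering, while system[0] usually means a dynamic value has crept into the system prompt.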

✅ Four steps to push hit rate above 80%

(1) Sort the tools array deterministically, for example by name. (2) Put cache_control: {type: "ephemeral"} on the last tool definition. (3) Add a second cache breakpoint at the end of the system prompt. (4) Move all dynamic values (timestamps, IDs, user info) into messages. Most MCP integrations move from the 0-20% band to 80%+ with these four steps alone.

A Realistic Savings Simulation

Take an Opus 4.7 agent at 10,000 requests/day, with a fixed prefix (MCP tool definitions + system) of 80,000 tokens, a user message of 500 tokens, and 1,000 output tokens per call. We compare three scenarios — no cache, 5-min TTL, 1-hour TTL — using Opus pricing from Anthropic's official docs ✅.

Scenario                     | Daily input cost | vs no cache | Notes
No cache                     | 100% (baseline)  | -           | 80K × 10,000 = 800M input tokens/day, all at full price
5-min TTL (80% hit assumed)  | ~29%             | -71%        | 0.5% cache write (1.25x) + 79.5% read (0.1x) + 20% miss (1.0x) ≈ 28.6%
1-hour TTL (90% hit assumed) | ~19%             | -81%        | Longer sessions; 2x write absorbed by 90% hits

This is input-side only — output tokens are unaffected. But agent ops is an input-heavy workload, and the larger the fixed prefix, the harder caching works for you.
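
To rerun the simulation with your own traffic mix, a blended-cost function is enough. This is a sketch under stated assumptions: TrafficMix and relativePrefixCost are hypothetical names, the shares describe how the fixed-prefix tokens are billed, and the example reproduces the 1-hour row by treating the write share as negligible and billing the remaining 10% at the base rate.

```typescript
interface TrafficMix {
  readShare: number;  // fraction of prefix tokens billed as cache reads (0.1x)
  writeShare: number; // fraction billed as cache writes (1.25x or 2x)
  missShare: number;  // fraction billed at the base input rate (1.0x)
}

// Prefix input cost relative to the no-cache baseline (1.0 = no savings).
function relativePrefixCost(mix: TrafficMix, writeMultiplier: number): number {
  const total = mix.readShare + mix.writeShare + mix.missShare;
  if (Math.abs(total - 1) > 1e-9) throw new Error("shares must sum to 1");
  return mix.readShare * 0.1 + mix.writeShare * writeMultiplier + mix.missShare * 1.0;
}

// 1-hour TTL scenario: 90% reads, the rest at full price.
const oneHour = relativePrefixCost({ readShare: 0.9, writeShare: 0, missShare: 0.1 }, 2.0);
console.log(`${(oneHour * 100).toFixed(1)}% of the no-cache input cost`); // 19.0%
```

Multiplying the result by the 800M-token daily baseline gives the input-equivalent tokens billed per day for the prefix, about 152M in this scenario.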

FAQ

Does Anthropic's prompt caching really cut costs by 90%?

"Up to 90%" — on the cached portion of input. cache_read is 0.1x base input ✅. Total savings depend on hit rate; stable agent workloads see 74-84% in published case studies. Output tokens are not discounted.

Why do MCP integrations burn so many input tokens?

MCP tool definitions ship at the front of every prompt, before the user message. Five MCP servers and 58 tools = 55,000+ tokens before the user types; add Jira and you cross 100,000 ✅. Since these definitions barely change between requests, they are textbook caching targets.

Where should cache_control go?

Anthropic walks the prefix tools → system → messages ✅. Attach cache_control: {type: "ephemeral"} to the LAST tool — that caches the entire tools block as one region. If you have a long system prompt, mark the end of system too. Rule: "Mark the stable region from the back inward."

When does prompt caching actually backfire?

Four cases: (1) the cached region is written but never read back (1.25x or 2x write surcharge with no payback), (2) the region is below the minimum cacheable size (1,024 for Sonnet/Opus, 2,048 for Haiku), (3) dynamic tokens (timestamps, IDs) are inside the prefix, (4) the tool array is in unstable order across requests.

How do I measure the cache hit rate?

The Anthropic Messages API response includes a usage object that splits writes from reads: cache_creation_input_tokens vs cache_read_input_tokens ✅. Hit rate = cache_read / (cache_read + input + cache_creation). Healthy is 80%+. Observability tools (Datadog, Helicone, Langfuse) surface the breakdown.

Data Disclosure & Disclaimer

Anthropic spec figures cited here (cache_read 0.1x, cache_write 1.25x/2x, minimum cacheable tokens 1,024 for Sonnet/Opus and 2,048 for Haiku, the tools → system → messages cache key order) come from Anthropic's official Prompt Caching documentation and pricing page ✅ (as of 2026-05-20). The "5 MCP servers / 58 tools / 55,000+ tokens" and "100,000+ tokens with Jira" figures come from independent observability vendors (Helicone, ChatForest et al.) and MCP community write-ups ✅. The "74-84% hit rate" and "80%/90% scenario assumptions" reflect representative figures from public case studies and KanseiLink estimates — actual values vary by workload. The 10K req/day simulation is illustrative; pricing follows Anthropic's docs at this point in time and may change — always confirm with the official pricing page before basing financial decisions on it.