The Economics of the "Retry Tax" 2026 — How Low-Success MCP Servers Quietly Inflate Your API Bill

Q: What is the "retry tax"?

The retry tax is the umbrella term for the extra token consumption, latency, and LLM calls caused by a low success rate on an MCP server or API. When a tool call fails, the agent retries — and each failure adds LLM turns for parsing the error, deciding whether to retry, and re-executing. The expected number of attempts to reach one success is approximated by 1/p for success rate p: at 20% success that is about 5 attempts, at 91% about 1.1. So a 20%-success service burns roughly 4.5x the tokens of a 91% service for the same outcome.

Q: How does prompt caching affect the retry tax?

Prompt caching lowers the base of the retry tax. With Claude API prompt caching, a cache read costs 0.1x the base input rate (a 90% discount) as of May 2026 (the 5-minute cache write is 1.25x, the 1-hour cache is 2x). If you place the system prompt, tool definitions, and the stable part of the conversation history into the cache, then even when a retry occurs most of the input tokens are processed at the 90%-off rate. It does not eliminate the retry tax, but it substantially lowers the tax rate.

Q: What is the most effective way to lower the retry tax?

The biggest lever is switching to a higher-success alternative service. KanseiLink's audit_cost proposes optimizations across 4 layers (model selection, service substitution, architecture, infrastructure), and the service-substitution layer produces order-of-magnitude improvements such as 20% to 91% success, so its token savings are the largest. Next come (1) lowering the base with prompt caching, (2) cutting the call count itself with batch APIs or bulk endpoints (such as kintone's /records.json), and (3) baking in known workarounds (such as using SmartHR's v2 endpoint) to raise the first-attempt success rate.

Q: Where can I check success rates?

The KanseiLink MCP get_insights tool returns success_rate, avg_latency_ms, common_errors (with known workarounds), and confidence_score for a given service. search_services also returns each service's success_rate. These values aggregate agent outcome reports (1,404 cumulative as of May 2026) across 225+ Japanese SaaS services; measured success-rate values are still accumulating (observing). Connect with: npx -y @kansei-link/mcp-server.

Why read "success rate" as a financial metric
The retry-tax formula — expected attempts is 1/p
Model cases — estimated savings from four switches
The retry tax compounds multiplicatively
Four layers to lower the retry tax
FAQ

Why read "success rate" as a financial metric

Cost discussions in agent operations have long skewed toward "which model do we use" and "where do we host." But in 2026, a different realization is spreading: the success rate of the MCP servers and APIs an agent actually calls is itself a cost driver.

The reason is simple. An agent does not stop when a tool call fails. It reads the error, reasons about the cause, and decides whether to retry or take an alternative path. One failure spawns one or more additional LLM turns. And as the conversation history grows, the input tokens of every subsequent turn grow with it. A low-success service pushes up the bill the user sees — quietly, but reliably.

This article calls that extra cost the "Retry Tax." It can be estimated with a simple model parameterized by an assumed success rate p. The backdrop is the set of agent outcome reports that KanseiLink aggregates across 225+ Japanese SaaS services (1,404 cumulative as of May 2026).

Editorial view, May 2026

Many teams read "80% success rate" as "decent enough." Re-read financially, an 80% success rate means the same outcome needs an average of 1.25 attempts: 25% more calls than a service that never fails, and about 14% more than the 91% baseline used in this article. Success rate is not just a number for a QA dashboard; it should be treated as a coefficient in your cost accounting.

The retry-tax formula — expected attempts is 1/p

If we approximate each attempt as independent with success probability p, the expected number of attempts to reach one success is 1/p (the expected value of a geometric distribution). This is the core formula of the retry tax.
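
For readers who want the step behind that formula, the derivation is short. Here N denotes the number of attempts up to and including the first success (a symbol introduced only for this sketch), under the same assumption of independent attempts with a constant p:

```latex
\begin{aligned}
P(N = k) &= (1-p)^{k-1}\,p, \qquad k = 1, 2, \dots \\
E[N] &= \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p
      = p \cdot \frac{1}{\bigl(1-(1-p)\bigr)^{2}}
      = \frac{1}{p}
\end{aligned}
```

The middle step uses the identity \sum_{k \ge 1} k x^{k-1} = 1/(1-x)^2 for |x| < 1, with x = 1 - p.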

Success rate p	Expected attempts (1/p)	Retry tax vs. 91% baseline	Financial reading
91%	~1.10	baseline (1.0x)	nearly tax-free
80%	~1.25	~1.14x	+14% extra tokens
66%	~1.52	~1.38x	+38% extra tokens
50%	~2.00	~1.82x	call cost ~1.8x
39%	~2.56	~2.33x	call cost ~2.3x
20%	~5.00	~4.55x	call cost ~4.5x

Note the shape of the curve. Dropping from 91% to 80% only adds a 14% tax, but the curve rises steeply once you fall below 50%. A 20%-success service needs about 4.5x the tool calls to produce the same outcome as a 91% service. That is not "slightly inconvenient" — it is a 4x-plus difference in unit cost.
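
The table can be reproduced in a few lines. The sketch below assumes nothing beyond the 1/p formula and the 91% baseline; the function names are illustrative and not part of any KanseiLink tool.

```python
def expected_attempts(p: float) -> float:
    """Expected number of attempts until the first success (geometric model)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("success rate p must be in (0, 1]")
    return 1.0 / p


def retry_tax(p: float, baseline: float = 0.91) -> float:
    """Call-count multiplier of a service with success rate p vs. the baseline."""
    return expected_attempts(p) / expected_attempts(baseline)


if __name__ == "__main__":
    # Same success rates as the rows of the table above.
    for p in (0.91, 0.80, 0.66, 0.50, 0.39, 0.20):
        print(f"{p:.0%}\t~{expected_attempts(p):.2f}\t~{retry_tax(p):.2f}x")
```

Because retry_tax reduces to baseline / p, halving the success rate always doubles the multiplier, which is why the curve steepens as p approaches zero.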

⚠️ This formula is a lower bound

1/p is only the expected count of tool calls. The real retry tax is heavier. Each retry carries (1) the full error response pulled into context, (2) an LLM turn for cause analysis, and (3) subsequent turns burdened by an inflated conversation history. As shown below, the retry tax compounds across three axes — token volume, LLM call count, and wall-clock time.

Model cases — estimated savings from four switches

KanseiLink's audit_cost tool analyzes an agent's API spend across 4 layers (model selection, service substitution, architecture, infrastructure) and proposes optimizations. Running the "service substitution" layer through four model cases with assumed success rates p makes the real magnitude of the retry tax visible (the success rates below are assumptions; measured per-service success rates are still accumulating at KanseiLink — observing).

Current service (hypothetical)	Assumed success rate	Switch target (hypothetical)	Assumed post-switch rate	Est. token reduction
A business-chat SaaS	20%	A high-success alternative MCP	91%	~71%
A recruiting SaaS	35%	An alternative in the same category	66%	~31%
An HR SaaS	39%	An alternative in the same category	66%	~27%
A chat MCP	66%	A high-success alternative MCP	91%	~25%

The most extreme is the first row. At an assumed 20% success rate, the expected number of attempts is about 5; at an assumed 91%, about 1.1. For the same "send a message" task, audit_cost estimates that this gap translates into roughly a 71% token reduction.
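
As a cross-check, the reduction in tool calls implied by 1/p alone can be computed directly. This is a sketch under the geometric model only: it counts calls, not tokens, so it will not match the audit_cost token estimates in the table, which come from the tool rather than from this formula. The case labels are the hypothetical ones from the table.

```python
def call_count_reduction(p_old: float, p_new: float) -> float:
    """Fractional drop in expected tool calls when moving from p_old to p_new.

    Expected calls are 1/p, so the reduction is 1 - (1/p_new) / (1/p_old)
    = 1 - p_old / p_new.
    """
    if not (0.0 < p_old <= 1.0 and 0.0 < p_new <= 1.0):
        raise ValueError("success rates must be in (0, 1]")
    return 1.0 - p_old / p_new


# (label, assumed current rate, assumed post-switch rate), as in the table.
CASES = [
    ("business-chat SaaS", 0.20, 0.91),
    ("recruiting SaaS", 0.35, 0.66),
    ("HR SaaS", 0.39, 0.66),
    ("chat MCP", 0.66, 0.91),
]

if __name__ == "__main__":
    for label, p_old, p_new in CASES:
        print(f"{label}: {call_count_reduction(p_old, p_new):.0%} fewer expected calls")
```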

One more thing worth internalizing: market recognition and "how easy it is for an agent to call" are different things. Even a well-known SaaS can show a notable share of error reports in early agent-connection data (for example, one major HR SaaS shows 36 api_error, 10 auth_expired, and 7 search_miss across 92 reports). The retry tax does not care about brand recognition.

The retry tax in real numbers

4.5x: Call cost of a 20%-success service (vs. 91% baseline)
71%: Est. token reduction, assumed 20% to 91% switch
2.3x: Call cost at an assumed 39% success rate

The retry tax compounds multiplicatively

As noted, the retry tax does not end at "the tool gets called N times more." One failure chains into several costs.

Token axis: the full error response enters context, an LLM turn is added for the retry decision, and subsequent turns run with an inflated history. The later a failure occurs, the larger the history being carried and the heavier the input tokens.
LLM-call axis: each failure adds at least one, often two to three, extra turns (error analysis, retry execution, result check). The model's unit price rides along on every one.
Latency axis: failure-then-retry stacks wall-clock time. Probe measurements show latency gaps of 2x or more between services (e.g., 337 ms vs. 163 ms). A slow service that also has to be called two or more times penalizes the user twice: once through the slower call, and again through each retry.
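
The token axis can be made concrete with a deliberately simple model. The sketch below assumes one LLM turn per attempt, that every turn re-reads the whole history, and that each failure grows the history by a fixed number of tokens (error response plus reasoning). All names and numbers are illustrative assumptions, not measurements.

```python
def expected_input_tokens(p: float, base_history: int, growth_per_failure: int) -> float:
    """Expected input tokens summed over all turns until the first success.

    Turn j (1-based) re-reads base_history + (j - 1) * growth_per_failure tokens.
    With N geometric, E[N] = 1/p and E[N(N-1)] = 2(1-p)/p^2, which gives
    E[total] = base_history / p + growth_per_failure * (1 - p) / p^2.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("success rate p must be in (0, 1]")
    return base_history / p + growth_per_failure * (1.0 - p) / p**2


def truncated_sum(p: float, base_history: int, growth_per_failure: int,
                  max_attempts: int = 2000) -> float:
    """Same quantity by direct summation, as a sanity check on the closed form."""
    total = 0.0
    for k in range(1, max_attempts + 1):
        prob = (1.0 - p) ** (k - 1) * p  # first success on attempt k
        tokens = k * base_history + growth_per_failure * k * (k - 1) / 2
        total += prob * tokens
    return total


if __name__ == "__main__":
    low = expected_input_tokens(0.20, base_history=2000, growth_per_failure=500)
    high = expected_input_tokens(0.91, base_history=2000, growth_per_failure=500)
    print(f"20%: {low:.0f} tokens, 91%: {high:.0f} tokens, ratio {low / high:.1f}x")
```

The second term scales with 1/p squared, so under these assumptions the token ratio between a 20% and a 91% service exceeds the 4.55x call-count ratio whenever failures add anything to the history.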

That is why retry-tax mitigation is worth thinking about on two fronts — not only "raise the success rate" but also "lower the unit cost of a single failure." The biggest lever on the latter is prompt caching.

Prompt caching lowers the "tax rate" of the retry tax

With Claude API prompt caching, a cache read costs 0.1x the base input rate (a 90% discount) as of May 2026 (the 5-minute cache write is 1.25x, the 1-hour cache is 2x). Concretely, for Claude Sonnet 4.6 the input price drops from $3 to $0.30 per million tokens on a cache read, and for Claude Opus 4.7 from $5 to $0.50.

What gets re-sent on every retry is the system prompt, the tool definitions, and the stable part of the conversation history. Place those in the cache and, even when a retry occurs, most of the input tokens are processed at the 90%-off rate. It does not erase the retry tax, but it substantially lowers the rate. The more unavoidable retries a workload has, the bigger the caching payoff.

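# Claude API (Python SDK): mark the stable prefix as cacheable.
# SYSTEM_PROMPT, TOOL_DEFS (a non-empty list of tool dicts), and conversation
# are placeholders that the caller defines.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}}  # retries read this at the 0.1x rate
    ],
    # cache_control goes on the last tool definition, not on a separate list entry
    tools=[*TOOL_DEFS[:-1],
           {**TOOL_DEFS[-1], "cache_control": {"type": "ephemeral"}}],
    messages=conversation,
)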
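
To see what the multipliers mean for a retry-heavy task, here is a rough input-cost comparison. It assumes the Sonnet base input price quoted above, that every turn falls inside the 5-minute cache window, that the first turn pays the cache write, and that the per-turn fresh tokens are never cached. The function and parameter names are hypothetical.

```python
BASE_INPUT_USD_PER_MTOK = 3.00  # base input price quoted in this article
CACHE_READ_MULT = 0.10          # cache read, relative to base input
CACHE_WRITE_5M_MULT = 1.25      # 5-minute cache write, relative to base input


def input_cost_usd(stable_tokens: int, fresh_tokens_per_turn: int,
                   turns: int, cached: bool) -> float:
    """Input-side cost of `turns` LLM turns that all resend the same stable prefix."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    usd_per_token = BASE_INPUT_USD_PER_MTOK / 1_000_000
    fresh = fresh_tokens_per_turn * turns * usd_per_token
    if not cached:
        return stable_tokens * turns * usd_per_token + fresh
    write = stable_tokens * CACHE_WRITE_5M_MULT * usd_per_token  # first turn
    reads = stable_tokens * (turns - 1) * CACHE_READ_MULT * usd_per_token
    return write + reads + fresh


if __name__ == "__main__":
    for turns in (1, 2, 5):
        plain = input_cost_usd(10_000, 500, turns, cached=False)
        cached = input_cost_usd(10_000, 500, turns, cached=True)
        print(f"{turns} turn(s): uncached ${plain:.4f}, cached ${cached:.4f}")
```

Under these assumptions a single turn is slightly more expensive with caching, because of the 1.25x write, and the saving appears from the second turn onward. That is consistent with the point above: the more unavoidable retries a workload has, the bigger the payoff.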

Four layers to lower the retry tax

Following the 4-layer frame of KanseiLink's audit_cost, here are the retry-tax reductions ordered by impact.

Service substitution (biggest impact) — order-of-magnitude improvements like 20% to 91% success can only be found here. Check alternatives' success rates with search_services or get_insights and shift toward high-success services as far as the task allows.
Raise the first-attempt success rate — bake in known workarounds. Examples: use SmartHR's v2 endpoint instead of v1 (avoids auth_expired); use kintone's /records.json bulk API (up to 50x fewer individual calls); POST to Chatwork as application/x-www-form-urlencoded (sending JSON returns a 400). All of these are retrievable in advance via get_service_tips.
Lower the unit cost of failure (prompt caching) — cache the system prompt, tool definitions, and the stable part of the conversation history. Lowers the tax rate on workloads where retries are unavoidable.
Infrastructure layer — Vercel to Cloudflare Workers migration (up to 85% savings on high-traffic apps); AWS App Runner users should evaluate moving to ECS Express Mode and similar (App Runner stops accepting new customers from April 30, 2026 and moves to maintenance mode). A different axis from the retry tax, but worth auditing together if you are looking at total operating cost.
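
As an illustration of "baking in a known workaround," the Chatwork tip from the list above can be encoded once in a request builder so the agent never sends JSON by mistake. This is a sketch only: the endpoint path, the X-ChatWorkToken header, and the body field follow Chatwork's public API v2 as understood here and should be verified against the official reference. The function name is hypothetical, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_chatwork_message_request(room_id: int, token: str, text: str) -> Request:
    """Build (but do not send) a form-encoded POST for a Chatwork room message."""
    # Form-encode the payload; the workaround is to avoid a JSON body here.
    data = urlencode({"body": text}).encode("utf-8")
    return Request(
        url=f"https://api.chatwork.com/v2/rooms/{room_id}/messages",
        data=data,
        method="POST",
        headers={
            "X-ChatWorkToken": token,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


if __name__ == "__main__":
    req = build_chatwork_message_request(123, "dummy-token", "hello world")
    print(req.get_method(), req.full_url)
    print(req.data)
```

Sending the request would be a separate urllib.request.urlopen(req) call with a real token; keeping construction and sending apart makes the workaround easy to unit-test.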

✅ Practical priority

First, check your current service's success rate with get_insights. If it is below 50%, that is not "inconvenient" but a sign that your unit cost is roughly 1.8x or more relative to a 91%-success service. Look at audit_cost's service-substitution proposals and shift to a higher-success service where the task allows. At the same time, enable prompt caching to lower the rate on unavoidable retries. Those two moves alone visibly cut token consumption for most agents.
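
The 50% rule of thumb is easy to automate once success rates are in hand. The sketch below assumes you have already collected success_rate values as fractions keyed by service name; how they are fetched from get_insights is left out, and the response shape, function name, and service names are assumptions for illustration.

```python
def triage(success_rates: dict[str, float], baseline: float = 0.91,
           threshold: float = 0.50) -> list[tuple[str, float]]:
    """Return (service, call-cost multiplier vs. baseline) for services below
    the threshold, worst first. A 0% service gets an infinite multiplier."""
    flagged = []
    for name, p in success_rates.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name}: success rate must be in [0, 1]")
        if p < threshold:
            multiplier = float("inf") if p == 0.0 else baseline / p
            flagged.append((name, multiplier))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical services and rates, not measured values.
    rates = {"service-a": 0.20, "service-b": 0.91, "service-c": 0.39}
    for name, mult in triage(rates):
        print(f"{name}: ~{mult:.1f}x the calls of a 91% service")
```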

FAQ

What is the "retry tax"?

It is the umbrella term for the extra token consumption, latency, and LLM calls caused by a low success rate on an MCP server or API. The expected number of attempts to reach one success is approximated by 1/p: at 20% success, about 5 attempts; at 91%, about 1.1. A 20%-success service burns roughly 4.5x the tokens of a 91% service for the same outcome.

Why does a low success rate make cost grow multiplicatively?

Because the cost of one failure is not just the input tokens of the tool call. On failure, the agent pulls in the full error, reasons about the cause, and runs an extra LLM turn for the retry decision — and the longer the history grows, the heavier the input tokens of subsequent turns. Tokens, LLM call count, and wall-clock time all degrade at once.

How does prompt caching affect the retry tax?

It lowers the "tax rate." A Claude API cache read costs 0.1x the base input rate (a 90% discount) as of May 2026. If you put the system prompt and tool definitions re-sent on retries into the cache, most input tokens are processed at the 90%-off rate even when a retry occurs.

What is the most effective way to lower the retry tax?

Switching to a higher-success alternative service. Order-of-magnitude improvements like 20% to 91% are only available through service substitution, so its token savings are the largest. Next come prompt caching, batch/bulk APIs, and baking in known workarounds.

Where can I check success rates?

The KanseiLink MCP get_insights tool returns success_rate, avg_latency_ms, common_errors (with known workarounds), and confidence_score for a given service. search_services also returns each service's success_rate. Connect with npx -y @kansei-link/mcp-server.
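
For MCP clients that read an mcpServers block from a JSON config file (Claude Desktop is one such client), the npx command above maps onto an entry like the one generated below. The "kansei-link" key is an arbitrary label chosen for this sketch, and the right config file location depends on your client.

```python
import json


def kansei_link_mcp_config(server_label: str = "kansei-link") -> dict:
    """Build an mcpServers entry that launches the server via npx."""
    return {
        "mcpServers": {
            server_label: {
                "command": "npx",
                "args": ["-y", "@kansei-link/mcp-server"],
            }
        }
    }


if __name__ == "__main__":
    # Print JSON that can be merged into the client's existing config file.
    print(json.dumps(kansei_link_mcp_config(), indent=2))
```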

Data Disclosure & Disclaimer

The calculation examples and switch estimates in this article are model cases based on assumed success rates p. KanseiLink aggregates agent outcome reports (1,404 cumulative as of May 15, 2026), but measured per-service success-rate values are still accumulating (observing), and the success rates in this article are not asserted measured values for any specific named service. Token-reduction percentages for switches are audit_cost estimates (confidence: medium) and vary with the actual workload and task mix. "Expected attempts 1/p" is a lower bound based on the approximation that each attempt is independent with a constant success probability. Prompt caching pricing (cache read 0.1x, 5-minute write 1.25x, 1-hour write 2x) is based on the Claude API official documentation as of May 2026 (platform.claude.com/docs/en/build-with-claude/prompt-caching). AWS App Runner's halt on new customers (April 30, 2026) and move to maintenance mode is based on AWS's official announcement. Service success rates and pricing change over time; verify the latest get_insights values and each vendor's official information before production decisions.
