Why read "success rate" as a financial metric
Cost discussions in agent operations have long skewed toward "which model do we use" and "where do we host." But in 2026, a different realization is spreading: the success rate of the MCP servers and APIs an agent actually calls is itself a cost driver.
The reason is simple. An agent does not stop when a tool call fails. It reads the error, reasons about the cause, and decides whether to retry or take an alternative path. One failure spawns one or more additional LLM turns. And as the conversation history grows, the input tokens of every subsequent turn grow with it. A low-success service pushes up the bill the user sees — quietly, but reliably.
This article calls that extra cost the "Retry Tax." It can be estimated quantitatively from the measured agent outcome reports that KanseiLink aggregates across 225+ Japanese SaaS services (1,404 cumulative reports as of May 2026).
Many teams read "80% success rate" as "decent enough." Re-read financially, an 80% success rate means the same outcome needs an average of 1.25 attempts, which is a 14% retry tax relative to a 91%-success baseline (about 1.10 attempts). Success rate is not just a number for a QA dashboard; it should be treated as a coefficient in your cost accounting.
The retry-tax formula — expected attempts is 1/p
If we approximate each attempt as independent with success probability p, the expected number of attempts to reach one success is 1/p (the expected value of a geometric distribution). This is the core formula of the retry tax.
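For readers who want the step behind 1/p, here is a short derivation under the same independence assumption. The symbol N (the number of attempts up to and including the first success) is introduced here for illustration only.

```latex
\begin{aligned}
P(N = k) &= (1-p)^{k-1}\,p, \qquad k = 1, 2, \dots \\
\mathbb{E}[N] &= \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p
  = p \cdot \frac{1}{\bigl(1-(1-p)\bigr)^{2}}
  = \frac{1}{p}
\end{aligned}
```

The middle step uses the power-series identity $\sum_{k \ge 1} k\,x^{k-1} = 1/(1-x)^2$ for $|x| < 1$ with $x = 1-p$. Plugging in p = 0.8 gives 1/0.8 = 1.25 attempts.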
| Success rate p | Expected attempts (1/p) | Retry tax vs. 91% baseline | Financial reading |
|---|---|---|---|
| 91% | ~1.10 | baseline (1.0x) | nearly tax-free |
| 80% | ~1.25 | ~1.14x | +14% extra tokens |
| 66% | ~1.52 | ~1.38x | +38% extra tokens |
| 50% | ~2.00 | ~1.82x | call cost ~1.8x |
| 39% | ~2.56 | ~2.33x | call cost ~2.3x |
| 20% | ~5.00 | ~4.55x | call cost ~4.5x |
Note the shape of the curve. Dropping from 91% to 80% only adds a 14% tax, but the curve rises steeply once you fall below 50%. A 20%-success service needs about 4.5x the tool calls to produce the same outcome as a 91% service. That is not "slightly inconvenient" — it is a 4x-plus difference in unit cost.
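The table can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the 91% baseline is simply the one the table uses.

```python
def expected_attempts(p: float) -> float:
    """Expected attempts until the first success (geometric distribution)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("success rate must be in (0, 1]")
    return 1.0 / p


def retry_tax(p: float, baseline: float = 0.91) -> float:
    """Expected attempts at rate p, relative to the baseline success rate."""
    return expected_attempts(p) / expected_attempts(baseline)


if __name__ == "__main__":
    # Success rates taken from the table above
    for rate in (0.91, 0.80, 0.66, 0.50, 0.39, 0.20):
        print(f"{rate:.0%}: {expected_attempts(rate):.2f} attempts, "
              f"{retry_tax(rate):.2f}x vs. 91%")
```

Because the multiplier is just baseline / p, swapping in a different baseline (for example, the best service you actually have available) rescales the whole column.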
1/p is only the expected count of tool calls. The real retry tax is heavier. Each retry carries (1) the full error response pulled into context, (2) an LLM turn for cause analysis, and (3) subsequent turns burdened by an inflated conversation history. As shown below, the retry tax compounds across three axes — token volume, LLM call count, and wall-clock time.
KanseiLink measured data — savings from four switches
KanseiLink's audit_cost tool analyzes an agent's API spend across 4 layers (model selection, service substitution, architecture, infrastructure) and proposes optimizations. The "service substitution" proposals from the May 15, 2026 analysis (based on 1,404 cumulative outcome reports) make the real magnitude of the retry tax visible.
| Current service | Success rate | Recommended switch | Post-switch rate | Est. token reduction |
|---|---|---|---|---|
| LINE WORKS | 20% | Slack MCP | 91% | ~71% |
| Talentio | 35% | KING OF TIME | 66% | ~31% |
| SmartHR | 39% | KING OF TIME | 66% | ~27% |
| Chatwork MCP | 66% | Slack MCP | 91% | ~25% |
The most extreme is LINE WORKS → Slack MCP: 20% to 91% success. Pull LINE WORKS via get_insights and you see that of its 5 outcome reports, only 1 succeeded — the rest are search_miss (the agent never reaches the intended resource). Slack MCP, by contrast, has 113 reports at 91% success, with the main error being 9 minor api_error events. For the same "send a message" task, that gap shows up as roughly a 71% token reduction.
Watch the SmartHR case carefully. SmartHR is the market leader among Japanese HR SaaS, yet KanseiLink measures its success rate at 39% (92 reports). Its common_errors break down into 36 api_error, 10 auth_expired, and 7 search_miss. Market recognition and "how easy it is for an agent to call" are different things — and the retry tax does not care about brand recognition.
The retry tax in real numbers
The retry tax compounds multiplicatively
As noted, the retry tax does not end at "the tool gets called N times more." One failure chains into several costs.
- Token axis: the full error response enters context, an LLM turn is added for the retry decision, and subsequent turns run with an inflated history. The later a failure occurs, the larger the history being carried and the heavier the input tokens.
- LLM-call axis: each failure adds at least one, often two to three, extra turns (error analysis, retry execution, result check). The model's unit price rides along on every one.
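- Latency axis: failure-then-retry stacks wall-clock time. SmartHR's average latency is 337ms; Slack MCP's is 163ms. At 39% success, SmartHR needs an expected ~2.56 calls per successful outcome (about 2.3x the 91% baseline), so the slower calls and the extra calls both weigh on the user experience.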
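A rough way to see the compounding on the token and latency axes is sketched below. It rests on simplifying assumptions that do not come from KanseiLink's data: every attempt is one LLM turn whose input is the whole context so far, and each failure appends a fixed number of tokens (error response plus the model's analysis). The function names and the token counts are hypothetical; only the success rates and latencies are the article's figures.

```python
def expected_input_tokens(p: float, base_tokens: float, growth_per_failure: float) -> float:
    """Expected total input tokens until the first success.

    Assumes attempt i (0-based) sends base_tokens + i * growth_per_failure
    input tokens and that attempts succeed independently with probability p.
    With k failures the total is (k+1)*base + growth*k*(k+1)/2; taking the
    expectation over a geometric k gives the closed form returned here.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("success rate must be in (0, 1]")
    q = 1.0 - p
    return base_tokens / p + growth_per_failure * q / p**2


def expected_tool_latency_ms(p: float, latency_ms: float) -> float:
    """Expected tool-call wall-clock time per success (LLM turns excluded)."""
    return latency_ms / p


if __name__ == "__main__":
    BASE, GROWTH = 8_000, 600  # hypothetical token counts, for illustration only
    for name, p in (("91% service", 0.91), ("39% service", 0.39), ("20% service", 0.20)):
        tokens = expected_input_tokens(p, BASE, GROWTH)
        print(f"{name}: ~{tokens:,.0f} input tokens per success")
    # Latency figures quoted in the list above: SmartHR 337 ms, Slack MCP 163 ms
    print(f"SmartHR: ~{expected_tool_latency_ms(0.39, 337):.0f} ms of tool time per success")
    print(f"Slack MCP: ~{expected_tool_latency_ms(0.91, 163):.0f} ms of tool time per success")
```

In this model the history term scales with (1 - p) / p², so input tokens rise faster than the 1/p call count as the success rate drops.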
That is why retry-tax mitigation is worth thinking about on two fronts — not only "raise the success rate" but also "lower the unit cost of a single failure." The biggest lever on the latter is prompt caching.
Prompt caching lowers the "tax rate" of the retry tax
With Claude API prompt caching, a cache read costs 0.1x the base input rate (a 90% discount) as of May 2026 (the 5-minute cache write is 1.25x, the 1-hour cache write is 2x). Concretely, Claude Sonnet 4.6 input drops from $3 to $0.30 per million tokens on a cache read, and Claude Opus 4.7 input from $5 to $0.50.
What gets re-sent on every retry is the system prompt, the tool definitions, and the stable part of the conversation history. Place those in the cache and, even when a retry occurs, most of the input tokens are processed at the 90%-off rate. It does not erase the retry tax, but it substantially lowers the rate. The more unavoidable retries a workload has, the bigger the caching payoff.
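
    # Claude API: mark the stable prefix as cacheable. SYSTEM_PROMPT, TOOL_DEFS,
    # and conversation are placeholders that the caller defines.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    # cache_control belongs on the last tool definition, not in a separate entry
    tools = [*TOOL_DEFS[:-1], {**TOOL_DEFS[-1], "cache_control": {"type": "ephemeral"}}]

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        system=[
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}  # cache read (0.1x) on retries too
        ],
        tools=tools,
        messages=conversation,
    )
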
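To see how the cache-read rate changes the input bill when retries occur, the following sketch compares cached and uncached runs. It assumes the stable prefix is written to the 5-minute cache on the first attempt and read on every later attempt, that all attempts fall within the cache lifetime, and that the uncached tail stays the same size. The function name and token counts are hypothetical; the multipliers are the ones quoted above.

```python
CACHE_WRITE_5M = 1.25  # 5-minute cache write multiplier quoted above
CACHE_READ = 0.10      # cache read multiplier quoted above


def input_cost_units(attempts: int, prefix_tokens: int, tail_tokens: int, cached: bool) -> float:
    """Input cost, in base-rate token units, for a run of attempts.

    prefix_tokens: stable part (system prompt, tool definitions, old history).
    tail_tokens: per-attempt part that is not cached (held constant here).
    Assumes every attempt falls within the cache lifetime.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    if not cached:
        return float(attempts * (prefix_tokens + tail_tokens))
    first = prefix_tokens * CACHE_WRITE_5M + tail_tokens          # cache write
    rest = (attempts - 1) * (prefix_tokens * CACHE_READ + tail_tokens)  # cache reads
    return first + rest


if __name__ == "__main__":
    PREFIX, TAIL = 10_000, 500  # hypothetical token counts
    for n in (1, 2, 5):
        plain = input_cost_units(n, PREFIX, TAIL, cached=False)
        cached = input_cost_units(n, PREFIX, TAIL, cached=True)
        print(f"{n} attempt(s): {plain:,.0f} uncached vs. {cached:,.0f} cached")
```

Under these assumptions a single attempt costs slightly more with caching because of the write premium, and the saving grows with every additional attempt that reads the cached prefix.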
Four layers to lower the retry tax
Following the 4-layer frame of KanseiLink's audit_cost, here are the retry-tax reductions ordered by impact.
- Service substitution (biggest impact): order-of-magnitude improvements like 20% to 91% success can only be found here. Check alternatives' success rates with search_services or get_insights and shift toward high-success services as far as the task allows.
- Raise the first-attempt success rate: bake in known workarounds. Examples: use SmartHR's v2 endpoint instead of v1 (avoids auth_expired); use kintone's /records.json bulk API (up to 50x fewer individual calls); POST to Chatwork as application/x-www-form-urlencoded (sending JSON returns a 400). All of these are retrievable in advance via get_service_tips.
- Lower the unit cost of failure (prompt caching): cache the system prompt, tool definitions, and the stable part of the conversation history. Lowers the tax rate on workloads where retries are unavoidable.
- Infrastructure layer: Vercel to Cloudflare Workers migration (up to 85% savings on high-traffic apps); AWS App Runner users should evaluate moving to ECS Express Mode and similar (App Runner stops accepting new customers from April 30, 2026 and moves to maintenance mode). A different axis from the retry tax, but worth auditing together if you are looking at total operating cost.
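As an illustration of the Chatwork workaround in the list above, the sketch below builds a form-encoded POST with Python's standard library. The endpoint path and the X-ChatWorkToken header name are assumptions based on the Chatwork API v2 and should be checked against Chatwork's own documentation. The function name, room ID, and token are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_chatwork_message_request(room_id: str, token: str, text: str) -> Request:
    """Build (but do not send) a form-encoded POST for a Chatwork message."""
    url = f"https://api.chatwork.com/v2/rooms/{room_id}/messages"  # assumed endpoint
    data = urlencode({"body": text}).encode("utf-8")  # form-encoded, not JSON
    return Request(
        url,
        data=data,
        method="POST",
        headers={
            "X-ChatWorkToken": token,  # assumed auth header name
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


if __name__ == "__main__":
    req = build_chatwork_message_request("123456", "YOUR_API_TOKEN", "deploy finished")
    print(req.get_method(), req.full_url)
    print(req.data)
    # Send with urllib.request.urlopen(req) once the placeholders are replaced.
```

Building the request separately from sending it keeps the encoding choice easy to inspect and test before any call reaches the service.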
First, check your current service's success rate with get_insights. If it is below 50%, that is not "inconvenient" — it is a sign that your unit cost is roughly 1.8x or more. Look at audit_cost's service-substitution proposals and shift to a higher-success service where the task allows. At the same time, enable prompt caching to lower the rate on unavoidable retries. Those two moves alone visibly cut token consumption for most agents.
FAQ
What is the "retry tax"?
It is the umbrella term for the extra token consumption, latency, and LLM calls caused by a low success rate on an MCP server or API. The expected number of attempts to reach one success is approximated by 1/p: at 20% success, about 5 attempts; at 91%, about 1.1. A 20%-success service burns roughly 4.5x the tokens of a 91% service for the same outcome.
Why does a low success rate make cost grow multiplicatively?
Because the cost of one failure is not just the input tokens of the tool call. On failure, the agent pulls in the full error, reasons about the cause, and runs an extra LLM turn for the retry decision — and the longer the history grows, the heavier the input tokens of subsequent turns. Tokens, LLM call count, and wall-clock time all degrade at once.
How does prompt caching affect the retry tax?
It lowers the "tax rate." A Claude API cache read costs 0.1x the base input rate (a 90% discount) as of May 2026. If you put the system prompt and tool definitions re-sent on retries into the cache, most input tokens are processed at the 90%-off rate even when a retry occurs.
What is the most effective way to lower the retry tax?
Switching to a higher-success alternative service. Order-of-magnitude improvements like 20% to 91% are only available through service substitution, so its token savings are the largest. Next come prompt caching, batch/bulk APIs, and baking in known workarounds.
Where can I check success rates?
The KanseiLink MCP get_insights tool returns success_rate, avg_latency_ms, common_errors (with known workarounds), and confidence_score for a given service. search_services also returns each service's success_rate. Connect with npx -y @kansei-link/mcp-server.
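Many MCP clients register servers through a JSON configuration containing an mcpServers map. A minimal sketch for the command above, assuming such a client: the kansei-link key is an arbitrary label chosen here, and the file name and location depend on the client you use.

```json
{
  "mcpServers": {
    "kansei-link": {
      "command": "npx",
      "args": ["-y", "@kansei-link/mcp-server"]
    }
  }
}
```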
The success rates, latency figures, and token-reduction percentages in this article aggregate measured agent outcome reports collected by KanseiLink (1,404 cumulative as of May 15, 2026). Per-service success rates are get_insights measured values: LINE WORKS 20% (5 reports), Slack MCP 91% (113 reports), SmartHR 39% (92 reports), Backlog 90% (91 reports). Token-reduction percentages for switches are audit_cost estimates (confidence: medium) and vary with the actual workload and task mix. "Expected attempts 1/p" is a lower bound based on the approximation that each attempt is independent with a constant success probability. Prompt caching pricing (cache read 0.1x, 5-minute write 1.25x, 1-hour write 2x) is based on the Claude API official documentation as of May 2026 (platform.claude.com/docs/en/build-with-claude/prompt-caching). AWS App Runner's halt on new customers (April 30, 2026) and move to maintenance mode is based on AWS's official announcement. Service success rates and pricing change over time; verify the latest get_insights values and each vendor's official information before production decisions.