Contents

  1. Why read "success rate" as a financial metric
  2. The retry-tax formula — expected attempts is 1/p
  3. KanseiLink measured data — savings from four switches
  4. The retry tax compounds multiplicatively
  5. Four layers to lower the retry tax
  6. FAQ

Why read "success rate" as a financial metric

Cost discussions in agent operations have long skewed toward "which model do we use" and "where do we host." But in 2026, a different realization is spreading: the success rate of the MCP servers and APIs an agent actually calls is itself a cost driver.

The reason is simple. An agent does not stop when a tool call fails. It reads the error, reasons about the cause, and decides whether to retry or take an alternative path. One failure spawns one or more additional LLM turns. And as the conversation history grows, the input tokens of every subsequent turn grow with it. A low-success service pushes up the bill the user sees — quietly, but reliably.

This article calls that extra cost the "Retry Tax." It can be estimated quantitatively from the measured agent outcome reports that KanseiLink aggregates across 225+ Japanese SaaS services (1,404 reports cumulative as of May 2026).

Editorial view, May 2026

Many teams read "80% success rate" as "decent enough." Re-read financially, an 80% success rate means the same outcome needs an average of 1.25 attempts, which is a 14% retry tax relative to a 91% baseline (about 1.10 attempts). Success rate is not a number for a QA dashboard; it should be treated as a coefficient in your cost accounting.

The retry-tax formula — expected attempts is 1/p

If we approximate each attempt as independent with success probability p, the expected number of attempts to reach one success is 1/p (the expected value of a geometric distribution). This is the core formula of the retry tax.
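For readers who want the step behind 1/p, here is a short derivation under the same independence assumption. The symbol N (the number of attempts up to and including the first success) is introduced here for illustration and does not appear elsewhere in the article.

```latex
\[
\Pr(N = k) = (1-p)^{k-1}\,p, \qquad k = 1, 2, \dots
\]
\[
\mathbb{E}[N] = \sum_{k=1}^{\infty} k\,(1-p)^{k-1}\,p
             = \frac{p}{\bigl(1-(1-p)\bigr)^{2}}
             = \frac{1}{p}
\]
```

The middle step uses the standard series \(\sum_{k \ge 1} k x^{k-1} = 1/(1-x)^2\) for \(|x| < 1\), with \(x = 1 - p\). For p = 0.8 this gives 1.25 attempts.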

Success rate p | Expected attempts (1/p) | Retry tax vs. 91% baseline | Financial reading
91% | ~1.10 | baseline (1.0x) | nearly tax-free
80% | ~1.25 | ~1.14x | +14% extra tokens
66% | ~1.52 | ~1.38x | +38% extra tokens
50% | ~2.00 | ~1.82x | call cost ~1.8x
39% | ~2.56 | ~2.33x | call cost ~2.3x
20% | ~5.00 | ~4.55x | call cost ~4.5x

Note the shape of the curve. Dropping from 91% to 80% only adds a 14% tax, but the curve rises steeply once you fall below 50%. A 20%-success service needs about 4.5x the tool calls to produce the same outcome as a 91% service. That is not "slightly inconvenient" — it is a 4x-plus difference in unit cost.
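The table can be reproduced with a few lines of Python. This is a minimal sketch of the 1/p approximation only; the function names and the BASELINE_SUCCESS_RATE constant are chosen here for illustration and are not part of any KanseiLink tool.

```python
BASELINE_SUCCESS_RATE = 0.91  # the article's reference point


def expected_attempts(p: float) -> float:
    """Expected attempts until the first success, assuming independent tries."""
    if not 0.0 < p <= 1.0:
        raise ValueError("success rate must be in (0, 1]")
    return 1.0 / p


def retry_tax_multiplier(p: float, baseline: float = BASELINE_SUCCESS_RATE) -> float:
    """Expected attempts at rate p, relative to the baseline success rate."""
    return expected_attempts(p) / expected_attempts(baseline)


if __name__ == "__main__":
    for p in (0.91, 0.80, 0.66, 0.50, 0.39, 0.20):
        print(f"{p:.0%}  attempts ~{expected_attempts(p):.2f}  "
              f"tax ~{retry_tax_multiplier(p):.2f}x")
```

Because the multiplier is just baseline / p, halving the success rate doubles the tax, which is why the curve steepens so quickly at the low end.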

⚠️ This formula is a lower bound

1/p is only the expected count of tool calls. The real retry tax is heavier. Each retry carries (1) the full error response pulled into context, (2) an LLM turn for cause analysis, and (3) subsequent turns burdened by an inflated conversation history. As shown below, the retry tax compounds across three axes — token volume, LLM call count, and wall-clock time.
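To see why 1/p understates the bill, it helps to model a single run in which every failed attempt leaves an error message and a diagnosis in the history. The sketch below is a toy model, not a measurement: the function name and all token counts are made-up assumptions, and it counts input tokens only.

```python
def input_tokens_for_run(failures: int, prefix_tokens: int, task_tokens: int,
                         error_tokens: int, analysis_tokens: int) -> int:
    """Total input tokens over all LLM turns when `failures` attempts fail
    before one succeeds.

    Every turn re-sends the stable prefix (system prompt, tool definitions)
    plus the conversation history; each failure grows the history by the
    error response and the model's cause analysis.
    """
    if failures < 0:
        raise ValueError("failures must be >= 0")
    total = 0
    history = task_tokens
    for turn in range(failures + 1):
        total += prefix_tokens + history
        if turn < failures:  # this turn failed, so the history grows
            history += error_tokens + analysis_tokens
    return total


if __name__ == "__main__":
    # Hypothetical sizes: 2,000-token prefix, 500-token task,
    # 300-token error response, 200-token diagnosis.
    clean = input_tokens_for_run(0, 2000, 500, 300, 200)
    rough = input_tokens_for_run(4, 2000, 500, 300, 200)
    print(clean, rough, rough / clean)
```

With these made-up numbers, four failures cost 17,500 input tokens against 2,500 for a clean run: 7x, not the 5x that the call count alone suggests.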

KanseiLink measured data — savings from four switches

KanseiLink's audit_cost tool analyzes an agent's API spend across 4 layers (model selection, service substitution, architecture, infrastructure) and proposes optimizations. The "service substitution" proposals from the May 15, 2026 analysis (based on 1,404 cumulative outcome reports) make the real magnitude of the retry tax visible.

Current service | Success rate | Recommended switch | Post-switch rate | Est. token reduction
LINE WORKS | 20% | Slack MCP | 91% | ~71%
Talentio | 35% | KING OF TIME | 66% | ~31%
SmartHR | 39% | KING OF TIME | 66% | ~27%
Chatwork MCP | 66% | Slack MCP | 91% | ~25%

The most extreme is LINE WORKS → Slack MCP: 20% to 91% success. Pull LINE WORKS via get_insights and you see that of its 5 outcome reports, only 1 succeeded — the rest are search_miss (the agent never reaches the intended resource). Slack MCP, by contrast, has 113 reports at 91% success, with the main error being 9 minor api_error events. For the same "send a message" task, that gap shows up as roughly a 71% token reduction.

Watch the SmartHR case carefully. SmartHR is the market leader among Japanese HR SaaS, yet KanseiLink measures its success rate at 39% (92 reports). Its common_errors break down into 36 api_error, 10 auth_expired, and 7 search_miss. Market recognition and "how easy it is for an agent to call" are different things — and the retry tax does not care about brand recognition.

The retry tax in real numbers

4.5x: call cost of a 20%-success service (vs. 91% baseline)
71%: est. token reduction, LINE WORKS to Slack switch
2.3x: call cost at 39% success (SmartHR measured)

The retry tax compounds multiplicatively

As noted, the retry tax does not end at "the tool gets called N times more." One failure chains into several costs.

That is why retry-tax mitigation is worth thinking about on two fronts — not only "raise the success rate" but also "lower the unit cost of a single failure." The biggest lever on the latter is prompt caching.

Prompt caching lowers the "tax rate" of the retry tax

With Claude API prompt caching, a cache read costs 0.1x the base input rate (a 90% discount) as of May 2026; a 5-minute cache write costs 1.25x and a 1-hour cache write 2x. Concretely, for Claude Sonnet 4.6 the input price falls from $3 to $0.30 per million tokens on a cache read, and for Claude Opus 4.7 from $5 to $0.50.

What gets re-sent on every retry is the system prompt, the tool definitions, and the stable part of the conversation history. Place those in the cache and, even when a retry occurs, most of the input tokens are processed at the 90%-off rate. It does not erase the retry tax, but it substantially lowers the rate. The more unavoidable retries a workload has, the bigger the caching payoff.

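# Claude API (Python SDK): put the stable parts into the cache.
# SYSTEM_PROMPT, TOOL_DEFS (a non-empty list of tool dicts) and conversation
# are assumed to be defined by your application.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A breakpoint on the last tool caches all tool definitions up to that point.
tools = [*TOOL_DEFS[:-1], {**TOOL_DEFS[-1], "cache_control": {"type": "ephemeral"}}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}}  # cache read (0.1x) on retries too
    ],
    tools=tools,
    messages=conversation,
)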
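The effect on the bill can be estimated with simple arithmetic. The sketch below uses the multipliers and the Sonnet base rate quoted above. It assumes that all turns fall within the 5-minute cache lifetime, that the first turn pays the cache write, and that non-cached tokens are a fixed amount per turn; the function name and token counts are illustrative assumptions.

```python
BASE_INPUT_USD_PER_MTOK = 3.00    # Claude Sonnet 4.6 base input rate quoted above
CACHE_READ_MULTIPLIER = 0.10      # cache read: 0.1x base input
CACHE_WRITE_5M_MULTIPLIER = 1.25  # 5-minute cache write: 1.25x base input


def run_input_cost(turns: int, prefix_tokens: int, fresh_tokens_per_turn: int,
                   cached: bool,
                   base_rate: float = BASE_INPUT_USD_PER_MTOK) -> float:
    """Input cost in USD for `turns` LLM turns that re-send the same prefix."""
    if turns < 1:
        raise ValueError("turns must be >= 1")
    per_token = base_rate / 1_000_000
    fresh_cost = turns * fresh_tokens_per_turn * per_token
    if not cached:
        return turns * prefix_tokens * per_token + fresh_cost
    # First turn writes the prefix to the cache; later turns read it.
    prefix_multiplier = CACHE_WRITE_5M_MULTIPLIER + (turns - 1) * CACHE_READ_MULTIPLIER
    return prefix_tokens * per_token * prefix_multiplier + fresh_cost


if __name__ == "__main__":
    # Hypothetical run: 1 success after 4 failures, 10,000-token prefix,
    # 1,000 non-cached tokens per turn.
    print(run_input_cost(5, 10_000, 1_000, cached=False))
    print(run_input_cost(5, 10_000, 1_000, cached=True))
```

Under these assumptions the five-turn run costs about $0.165 uncached and about $0.0645 cached. A single-turn run is slightly more expensive with caching because of the 1.25x write, so the payoff starts with the second turn.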

Four layers to lower the retry tax

Following the 4-layer frame of KanseiLink's audit_cost, here are the retry-tax reductions ordered by impact.

✅ Practical priority

First, check your current service's success rate with get_insights. If it is below 50%, that is not "inconvenient"; it is a sign that your unit cost is roughly 1.8x or more of what a 91%-success service would cost. Look at audit_cost's service-substitution proposals and shift to a higher-success service where the task allows. At the same time, enable prompt caching to lower the rate on unavoidable retries. Those two moves alone visibly cut token consumption for most agents.
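The priority rule above can be written down as a small triage function. This is a sketch under stated assumptions: ServiceInsight and recommend are hypothetical names, the success rates are entered by hand from get_insights output rather than fetched, and deciding whether a candidate can actually perform the same task remains a human judgment.

```python
from dataclasses import dataclass


@dataclass
class ServiceInsight:
    name: str
    success_rate: float  # fraction in [0, 1], e.g. 0.39 for 39%


def recommend(current: ServiceInsight, candidates: list[ServiceInsight],
              threshold: float = 0.5) -> str:
    """Apply the article's rule of thumb: below the threshold, look for a
    higher-success substitute; in every case, enable prompt caching."""
    if current.success_rate >= threshold:
        return f"{current.name}: keep, and enable prompt caching"
    better = [c for c in candidates if c.success_rate > current.success_rate]
    if not better:
        return f"{current.name}: no higher-success candidate, enable prompt caching"
    best = max(better, key=lambda c: c.success_rate)
    return f"{current.name}: consider switching to {best.name}, and enable prompt caching"


if __name__ == "__main__":
    smarthr = ServiceInsight("SmartHR", 0.39)
    king_of_time = ServiceInsight("KING OF TIME", 0.66)
    print(recommend(smarthr, [king_of_time]))
```

The 50% threshold is only the cutoff named in the text; a service above it can still be worth switching, as the Chatwork MCP row in the measured data shows.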

How much retry tax is your agent paying?

KanseiLink provides measured data — success rates, latency, error types, and known workarounds — for 225+ Japanese SaaS services via MCP. Pinpoint the sources of your retry tax, service by service.


FAQ

What is the "retry tax"?

It is the umbrella term for the extra token consumption, latency, and LLM calls caused by a low success rate on an MCP server or API. The expected number of attempts to reach one success is approximated by 1/p: at 20% success, about 5 attempts; at 91%, about 1.1. A 20%-success service therefore needs roughly 4.5x the tool calls of a 91% service for the same outcome, and because 1/p is a lower bound, the token bill grows at least as fast.

Why does a low success rate make cost grow multiplicatively?

Because the cost of one failure is not just the input tokens of the tool call. On failure, the agent pulls in the full error, reasons about the cause, and runs an extra LLM turn for the retry decision — and the longer the history grows, the heavier the input tokens of subsequent turns. Tokens, LLM call count, and wall-clock time all degrade at once.

How does prompt caching affect the retry tax?

It lowers the "tax rate." A Claude API cache read costs 0.1x the base input rate (a 90% discount) as of May 2026. If you put the system prompt and tool definitions re-sent on retries into the cache, most input tokens are processed at the 90%-off rate even when a retry occurs.

What is the most effective way to lower the retry tax?

Switching to a higher-success alternative service. Jumps as large as 20% to 91% are only available through service substitution, so its token savings are the largest. Next come prompt caching, batch/bulk APIs, and baking in known workarounds.

Where can I check success rates?

The KanseiLink MCP get_insights tool returns success_rate, avg_latency_ms, common_errors (with known workarounds), and confidence_score for a given service. search_services also returns each service's success_rate. Connect with npx -y @kansei-link/mcp-server.
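Many MCP clients register a stdio server through a JSON entry of roughly the following shape, which simply wraps the npx command above. The key name kansei-link is an arbitrary label chosen here, and the file location and exact schema depend on the client, so treat this as an assumption to check against your client's documentation.

```json
{
  "mcpServers": {
    "kansei-link": {
      "command": "npx",
      "args": ["-y", "@kansei-link/mcp-server"]
    }
  }
}
```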

Data Disclosure & Disclaimer

The success rates, latency figures, and token-reduction percentages in this article aggregate measured agent outcome reports collected by KanseiLink (1,404 cumulative as of May 15, 2026). Per-service success rates are get_insights measured values: LINE WORKS 20% (5 reports), Slack MCP 91% (113 reports), SmartHR 39% (92 reports), Backlog 90% (91 reports). Token-reduction percentages for switches are audit_cost estimates (confidence: medium) and vary with the actual workload and task mix. "Expected attempts 1/p" is a lower bound based on the approximation that each attempt is independent with a constant success probability. Prompt caching pricing (cache read 0.1x, 5-minute write 1.25x, 1-hour write 2x) is based on the Claude API official documentation as of May 2026 (platform.claude.com/docs/en/build-with-claude/prompt-caching). AWS App Runner's halt on new customers (April 30, 2026) and move to maintenance mode is based on AWS's official announcement. Service success rates and pricing change over time; verify the latest get_insights values and each vendor's official information before production decisions.