What the viral claims actually say
In 2026 the "CLI vs MCP" debate boiled over in the AI agent community. One spark was a live benchmark from Scalekit. It compared the gh CLI with GitHub Copilot MCP (43 tools) on identical tasks, using the same model (Claude Sonnet 4) across 75 runs, and CLI won decisively on both reliability and cost.
YC CEO Garry Tan piled on ("MCP eats too much context, auth is broken; I built a CLI replacement in 30 minutes"), and Perplexity CTO Denis Yarats said "we're deprioritizing MCP internally" (which we verified in a separate article). Together these shifted the mood on social media to "MCP isn't needed anymore." The claims sort into three groups. We verify each by tracing it back to its source.
| Metric | CLI (gh) | MCP (GitHub Copilot MCP) |
|---|---|---|
| Reliability (75-run bench) | 100% | 72% (7 of 25 failed) |
| Tokens, simple query | 1,365 | 44,026 (~32x) |
| Cost at 10k monthly ops | ~$3.20 | ~$55.20 |
| Main failure cause | — | TCP timeouts (remote server connection) |
Claim ① "CLI is up to 32x cheaper"
CLI wins on token efficiency. But "32x" is not fixed — it stems from bloated tool definitions, which are reducible.
In Scalekit's measurement, the simple query "what language is this repo written in?" consumed 1,365 tokens for CLI and 44,026 for MCP — about 32x. At 10,000 monthly operations that's roughly $3.20 vs $55.20. CLI is indeed cheaper.
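The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers quoted above from Scalekit's benchmark; the variable and function names are ours, chosen for illustration.

```python
# Figures as quoted in this article from Scalekit's benchmark.
CLI_TOKENS = 1_365        # simple query via gh CLI
MCP_TOKENS = 44_026       # same query via GitHub Copilot MCP
CLI_MONTHLY_USD = 3.20    # estimated cost at 10k monthly operations
MCP_MONTHLY_USD = 55.20   # estimated cost at 10k monthly operations
MONTHLY_OPS = 10_000


def ratio(larger: float, smaller: float) -> float:
    """How many times larger the first value is than the second."""
    return larger / smaller


token_ratio = ratio(MCP_TOKENS, CLI_TOKENS)
cost_ratio = ratio(MCP_MONTHLY_USD, CLI_MONTHLY_USD)
cli_cost_per_op = CLI_MONTHLY_USD / MONTHLY_OPS
mcp_cost_per_op = MCP_MONTHLY_USD / MONTHLY_OPS

print(f"token ratio (simple query): {token_ratio:.1f}x")   # 32.3x
print(f"monthly cost ratio:         {cost_ratio:.2f}x")    # 17.25x
print(f"cost per op: CLI ${cli_cost_per_op:.5f}, MCP ${mcp_cost_per_op:.5f}")
```

The single-query token ratio comes out near 32x, while the monthly cost figures imply a ratio of about 17x. The figures quoted here do not break down the monthly estimate, so the two are best read as separate measurements rather than one fixed multiplier.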
But most of that gap comes from the context consumed by tool definitions. GitHub Copilot MCP has 43 tools, and their definitions fill the context before the prompt is even sent. As verified in our token bloat analysis, that overhead can be cut by an order of magnitude through server selection, tool curation, compact mode, and code mode (calling tools via code). Anthropic's code-mode guidance reports up to a 98.7% overhead reduction.
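To make the effect of tool curation concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the 43-tool catalogue is synthetic (not GitHub's real definitions), the allowlist is made up, and the "4 characters per token" estimate is a rough heuristic rather than a real tokenizer.

```python
import json


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token). Not a real tokenizer."""
    return max(1, len(text) // 4)


def definition_overhead(tools: list[dict]) -> int:
    """Estimated tokens spent on tool definitions before any prompt is sent."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


def curate(tools: list[dict], allowed: set[str]) -> list[dict]:
    """Keep only the tools the agent actually needs for this workload."""
    return [tool for tool in tools if tool["name"] in allowed]


# Synthetic catalogue of 43 placeholder tools (hypothetical names and schemas).
ALL_TOOLS = [
    {
        "name": f"tool_{i}",
        "description": "Illustrative placeholder description. " * 5,
        "inputSchema": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
        },
    }
    for i in range(43)
]
ALLOWED = {"tool_0", "tool_1", "tool_2"}  # hypothetical allowlist

full = definition_overhead(ALL_TOOLS)
trimmed = definition_overhead(curate(ALL_TOOLS, ALLOWED))
print(f"all 43 tools: ~{full} tokens, curated to 3: ~{trimmed} tokens")
```

The absolute numbers mean nothing because the catalogue is synthetic. The point is only that definition overhead scales with the number and verbosity of the tools connected, so it shrinks when the list is trimmed.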
"CLI is cheaper" is true, but "MCP is structurally 32x more expensive" is a misread. The main driver is uncurated tool definitions, not a fixed protocol gap. Connect 43 tools all-in and it's expensive; trim and it shrinks — the same applies to CLI (feed an agent a giant CLI help text and you hit the same problem).
Claim ② "MCP is only 72% reliable"
72% isn't a flaw in the MCP protocol; it came from TCP timeouts to GitHub Copilot MCP. Other verified MCPs exceed 90%.
This is the most misunderstood point. In Scalekit's bench, MCP failed 7 of 25 runs to land at 72%. But the breakdown shows most of those failures were TCP timeouts connecting to the GitHub Copilot MCP server. So 72% measures not "the reliability of the MCP protocol" but "a specific remote MCP server plus the stability of its transport."
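One way to keep the two readings apart is to tag each run with its failure cause before computing a success rate. The sketch below is an assumption-laden illustration: the outcome labels are ours, and because the per-cause split of the 7 failures is not given here, the benchmark example uses only the totals. The second log is invented purely to show the calculation.

```python
from collections import Counter


def summarize(outcomes: list[str]) -> dict[str, float]:
    """Success rate overall, and with transport timeouts set aside.

    Labels are hypothetical: "ok", "transport_timeout", or any other
    string for a non-transport failure.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    ok = counts["ok"]
    non_transport_runs = total - counts["transport_timeout"]
    return {
        "overall": ok / total,
        "excluding_transport": (
            ok / non_transport_runs if non_transport_runs else float("nan")
        ),
    }


# Benchmark totals as quoted: 25 MCP runs, 7 failed (cause split not filled in).
benchmark = summarize(["ok"] * 18 + ["failed"] * 7)
print(f"overall: {benchmark['overall']:.0%}")

# Invented log, NOT benchmark data: shows how timeouts change the picture.
made_up = summarize(["ok"] * 8 + ["transport_timeout"] * 2)
print(f"overall: {made_up['overall']:.0%}, "
      f"excluding transport: {made_up['excluding_transport']:.0%}")
```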
Reliability is decided by implementation, not protocol. The same MCP, served by a host with stable transport and robust error handling, produces entirely different numbers. KanseiLink's live data shows exactly that.
"72%" vs KanseiLink live verified MCP
Slack MCP 91%, freee MCP 90%, Backlog MCP 90% — all far above 72%, all based on agents' live outcome reports. MCP isn't structurally 72%; well-built MCP exceeds 90%, and poorly built servers or unstable transport drag it down to 72%. That's all there is to it.
Claim ③ "Rip out MCP, migrate fully to CLI"
Production agents use both CLI and MCP. The binary is a false premise.
The conclusion "rip out MCP for CLI" has been rebutted by many practitioners as "the wrong fight." Here are the facts.
- Production agents use both: Claude Code, Cursor, Gemini CLI, and other major agent environments combine CLI and MCP. None is designed to discard one entirely.
- Where CLI wins: commands the model already knows intimately (git, gh, aws). They appear abundantly in training data, so the agent uses them correctly without reading tool definitions.
- Where MCP wins: enterprise integrations needing centralized OAuth, role-based access control (RBAC), standardized audit logging and telemetry, and dynamic tool discovery. MCP over HTTP is strong here.
- The axis isn't CLI/MCP: the real decision criteria are "which balances structure and cost for this task" and "is the interface well-built."
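The criteria in this list can be written down as a small routing heuristic. This is a sketch under our own assumptions, not how any of the named agent environments decide: the `Task` fields, the return labels, and the rule order are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# The well-known commands named in the text above.
WELL_KNOWN_CLIS = {"git", "gh", "aws"}


@dataclass(frozen=True)
class Task:
    """Hypothetical description of what an agent task requires."""
    command: Optional[str] = None
    needs_central_oauth: bool = False
    needs_rbac: bool = False
    needs_audit_log: bool = False
    needs_tool_discovery: bool = False


def suggest_interface(task: Task) -> str:
    """Return "mcp", "cli", or "evaluate both" for a task."""
    needs_governance = (
        task.needs_central_oauth
        or task.needs_rbac
        or task.needs_audit_log
        or task.needs_tool_discovery
    )
    if needs_governance:
        return "mcp"   # enterprise integration requirements dominate
    if task.command in WELL_KNOWN_CLIS:
        return "cli"   # the model already knows the command well
    return "evaluate both"  # decide on interface quality and cost


print(suggest_interface(Task(command="gh")))
print(suggest_interface(Task(command="gh", needs_rbac=True)))
print(suggest_interface(Task(command="internal-tool")))
```

Governance requirements are checked first on the assumption that they are hard constraints, while familiarity with a command is only a cost advantage. A real deployment would weigh these differently.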
Perplexity's exit, too, was a one-company, one-use-case judgment (API + CLI was more economical for a search product) — distinct from "MCP is structurally finished."
KanseiLink live data — verified MCP isn't 72%
KanseiLink aggregates success rates from agents' live outcome reports across 225+ Japanese SaaS services, marking those at ≥80% as "verified (🟢)." It shows how poorly the "72%" cited in the CLI-vs-MCP debate generalizes.
| Service / interface | Success rate | Avg. latency | Tier |
|---|---|---|---|
| Slack MCP | 91% | 163ms | verified 🟢 |
| freee MCP | 90% | 216ms | verified 🟢 |
| Backlog MCP | 90% | 128ms | verified 🟢 |
| kintone MCP | 79% | 199ms | connectable 🟡 |
| (ref) Scalekit's GitHub Copilot MCP | 72% | — | unstable transport |
| (ref) SmartHR (direct API, no MCP) | 39% | 337ms | info_only ⚪ |
Note the last row. SmartHR has no MCP server and is accessed via direct API (close to CLI-style access), yet its success rate is 39%. The simple framing "CLI/API = high reliability, MCP = low reliability" does not hold. What determines reliability is not the interface format but implementation quality, transport stability, and discoverability.
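The spread within each interface type can be read straight off the table. The sketch below hard-codes the table's success rates and groups them by interface; the grouping labels and function names are ours, and the only threshold used is the ≥80% "verified" bar stated above. The Scalekit row comes from a different benchmark, so it is included only as the reference point the table itself uses.

```python
from collections import defaultdict

VERIFIED_THRESHOLD = 0.80  # stated bar for "verified"

# (service, interface, success rate) copied from the table above.
ROWS = [
    ("Slack MCP", "mcp", 0.91),
    ("freee MCP", "mcp", 0.90),
    ("Backlog MCP", "mcp", 0.90),
    ("kintone MCP", "mcp", 0.79),
    ("GitHub Copilot MCP (Scalekit)", "mcp", 0.72),
    ("SmartHR", "direct_api", 0.39),
]


def range_by_interface(rows):
    """Lowest and highest success rate observed for each interface type."""
    grouped = defaultdict(list)
    for _name, interface, rate in rows:
        grouped[interface].append(rate)
    return {kind: (min(rates), max(rates)) for kind, rates in grouped.items()}


def meets_verified_bar(rate: float) -> bool:
    return rate >= VERIFIED_THRESHOLD


ranges = range_by_interface(ROWS)
verified = [name for name, _kind, rate in ROWS if meets_verified_bar(rate)]

for kind, (low, high) in ranges.items():
    print(f"{kind}: {low:.0%} to {high:.0%}")
print("meets the 80% bar:", ", ".join(verified))
```

MCP rows span 72% to 91%, and the single direct-API row sits below all of them, which is why the interface label alone says little about reliability.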
"CLI 100% vs MCP 72%" is a snapshot of one server and one transport configuration, not an evaluation of the MCP protocol. On the same footing (KanseiLink live data), verified MCP exceeds 90% while a no-MCP direct API sits at 39% — an inversion. The format debate (CLI vs MCP) should be reframed as a quality debate (is it well-built?).
Implications for Japanese SaaS
The lesson for Japanese SaaS vendors isn't to panic-pivot over "CLI or MCP." What agents ultimately choose isn't a format but the interface that's cheap, reliable, and findable. The work distills into three things.
- Token efficiency: if you offer MCP, don't bloat tool definitions. Trim to needed tools and write compact descriptions. Avoid 43-tools-all-in.
- Transport stability: the "72%" was really TCP timeouts. Timeout design, retries, and connection robustness directly drive success rate.
- Discoverability (AEO): if the agent can't find you by intent keywords in the first place, neither CLI nor MCP will be used.
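For the transport-stability item, a common pattern is to retry transient connection failures with exponential backoff and jitter. The following is a minimal sketch using only the Python standard library. It is not taken from any MCP SDK: `call_with_retries`, its parameters, and the simulated `flaky_call` are hypothetical, and real clients would also set an explicit per-request timeout.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on timeouts and connection errors.

    Waits base_delay * 2**attempt (capped at max_delay) plus random
    jitter between tries, and re-raises the last error when exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the transport error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


# Simulated remote call that times out twice, then succeeds.
state = {"calls": 0}


def flaky_call() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated TCP timeout")
    return "ok"


# A no-op sleep keeps the demo instant; drop the argument for real waits.
result = call_with_retries(flaky_call, attempts=3, sleep=lambda _s: None)
print(result, "after", state["calls"], "calls")
```

Retries only help with transient failures. Only idempotent operations should be retried blindly, and a server that is consistently slow needs a timeout budget or a different host rather than more attempts.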
FAQ
Is the "CLI 100% vs MCP 72% reliability" benchmark accurate?
Yes for Scalekit's 75-run bench (gh CLI vs GitHub Copilot MCP, Claude Sonnet 4). But most MCP failures were TCP timeouts to GitHub Copilot MCP, not a protocol flaw. In KanseiLink's data, verified MCP (Slack 91%, freee 90%, Backlog 90%) far exceeds 72%.
Is CLI really 4–32x cheaper than MCP?
✅ Largely true. A simple query was 1,365 (CLI) vs 44,026 (MCP) tokens — ~32x. But the main driver is bloated tool definitions, reducible via curation and code mode. It's configuration-dependent, not a fixed gap.
Is "rip out MCP, migrate fully to CLI" correct?
❌ Inaccurate. Production agents (Claude Code, Cursor, Gemini CLI) use both. CLI wins for known commands; MCP wins for enterprise integrations needing centralized OAuth, RBAC, and audit. The binary is a false premise.
Why was GitHub Copilot MCP 72% while other MCPs exceed 90%?
Reliability is set by implementation and transport, not the protocol. The 72% stemmed from TCP timeouts. Verified MCP servers (≥80% success) with stable transport and robust error handling land around 90%.
How should Japanese SaaS vendors respond?
Don't be whipsawed by the format debate. Polish (1) token efficiency (avoid bloated tool definitions), (2) transport stability, (3) discoverability (AEO). Agents choose the interface that's cheap, reliable, and findable — not a format.
External claim sources: Scalekit's benchmark (75 runs, gh CLI vs GitHub Copilot MCP with 43 tools, Claude Sonnet 4; CLI 100% / MCP 72%, 7 of 25 failed, mainly TCP timeouts; simple query 1,365 vs 44,026 tokens ≈ 32x; ~$3.20 vs ~$55.20 at 10k monthly ops); public statements by Garry Tan (YC CEO) and Perplexity CTO Denis Yarats; various MCP-vs-CLI comparisons (Scalekit / Firecrawl / Smithery "MCP vs CLI is the wrong fight" / DEV / IBM "MCP is not dead"). The up-to-98.7% code-mode reduction is from Anthropic's official guidance. KanseiLink figures (Slack MCP 91%/163ms, freee MCP 90%/216ms, Backlog MCP 90%/128ms, kintone MCP 79%/199ms, SmartHR 39%/337ms) are aggregated via get_insights from live outcome reports (snapshots as of each service's last_updated in April 2026) and vary with agent activity. The "quality over format" conclusion is an analytical interpretation of observed data and does not guarantee any product's superiority. Each vendor benchmark depends on configuration, model, and task; reproduce in your own environment.