Why MCP testing is harder than Web API testing
"An MCP server is just an API — write some unit tests and an E2E suite and you're done." Most teams that ship on this assumption notice agent regressions within weeks of release. The reason is three asymmetries a typical API test pyramid cannot model.
Asymmetry 1: the caller is an LLM (non-deterministic)
A Web API test assumes "same input, same output." MCP turns that on its head: the caller is Claude or GPT, and the same user intent produces different argument shapes each time. "Create this month's invoice" sometimes arrives as {date: "2026-05"} and sometimes as {from: "2026-05-01", to: "2026-05-31"}. If you only test the shape you expect, production will fail intermittently in ways that look like flakes.
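One way to absorb both shapes is to normalize them at the tool boundary and test each shape explicitly. The sketch below is only an illustration under stated assumptions: `normalize_period` is a hypothetical helper, and the accepted keys (`date`, `from`, `to`) are taken from the two payloads quoted above.

```python
import calendar
from datetime import date


def normalize_period(args: dict) -> tuple[date, date]:
    """Return (first_day, last_day) for either argument shape.

    Hypothetical helper: accepts {"date": "YYYY-MM"} or
    {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"}.
    """
    if "date" in args:
        year, month = (int(part) for part in args["date"].split("-")[:2])
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if "from" in args and "to" in args:
        start = date.fromisoformat(args["from"])
        end = date.fromisoformat(args["to"])
        if start > end:
            raise ValueError("'from' must not be later than 'to'")
        return start, end
    # Name the accepted fields so the calling model can correct itself
    raise ValueError(
        "Provide either 'date' (YYYY-MM) or both 'from' and 'to' (YYYY-MM-DD)"
    )


# Both shapes an LLM might send for the same intent resolve identically.
month_shape = normalize_period({"date": "2026-05"})
range_shape = normalize_period({"from": "2026-05-01", "to": "2026-05-31"})
assert month_shape == range_shape == (date(2026, 5, 1), date(2026, 5, 31))
```

Whether to normalize or to reject the unexpected shape with a clear error is a design choice; either way, both shapes belong in the unit suite.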
Asymmetry 2: the tool schema is also a prompt
An MCP tool's description isn't documentation — it's a signal the LLM uses to decide which tool to call. Editing "Send a message to Slack" into "Post to Slack channel" can shift selection rates noticeably across model families. Your code is unchanged; your behaviour isn't. A description-only refactor can ship a regression, and that is uniquely an MCP problem.
Asymmetry 3: the protocol itself keeps moving
MCP has evolved continuously between 2024 and 2026 — Streamable HTTP transport, expanded metadata, OAuth 2.1-style auth flows. SDKs and clients track the latest spec; if your server stays frozen, newer agents simply cannot talk to it. CI needs an explicit way to verify ongoing spec compliance, not a one-time audit.
"All unit tests pass, so we're safe." Not in MCP. Whether the LLM picks the right tool, fills it with sensible arguments, and recovers gracefully from errors is a property of the system, not of any single function — and no amount of unit-test coverage will surface those failures.
The MCP test pyramid — four layers
An MCP server that survives production in 2026 runs four layers of tests in CI/CD. The lower the layer, the cheaper it is and the more often it runs; the higher the layer, the more expensive and selective.
| Layer | What it verifies | Recommended tool | When it runs |
|---|---|---|---|
| 1. Unit | Per-tool implementation logic | FastMCP Client (in-memory) | Every PR / push |
| 2. Contract | MCP spec compliance / protocol shape | mcp-testing-framework | Every PR |
| 3. Schema drift | Tool-definition snapshot diff | Custom snapshot + git diff | Every PR |
| 4. E2E (real LLM) | Tool selection, argument synthesis, recovery | Anthropic SDK + eval set | Nightly / release tag |
Layer 1: Unit tests with FastMCP Client
FastMCP (a popular Python MCP framework) ships a Client that talks to a server in-memory, with no process or network overhead. No latency, no auth handshake, freely parallel — each test finishes in tens of milliseconds. It's the obvious first choice for the unit layer of CI.
# tests/test_create_invoice.py
# Requires the pytest-asyncio plugin for the @pytest.mark.asyncio marker.
import pytest
from fastmcp import Client

# Import the production server instance directly
from my_mcp_server import mcp


@pytest.mark.asyncio
async def test_create_invoice_success():
    async with Client(mcp) as client:
        result = await client.call_tool(
            "create_invoice",
            {"amount": 10000, "client": "Acme Inc"},
        )
        assert result.is_error is False
        assert "invoice_id" in result.content[0].text


@pytest.mark.asyncio
async def test_create_invoice_invalid_amount():
    async with Client(mcp) as client:
        # Recent FastMCP clients raise ToolError on a failed tool call by default;
        # raise_on_error=False returns the error result so it can be inspected.
        # Check this against the FastMCP version you pin.
        result = await client.call_tool(
            "create_invoice",
            {"amount": -100, "client": "Acme Inc"},
            raise_on_error=False,
        )
        assert result.is_error is True
        # Does the error message tell the LLM which field failed?
        assert "amount" in result.content[0].text.lower()
The point of the second test is subtle but important: verify that errors are useful to an LLM. An API that returns only "Internal error" leaves the agent with nowhere to go and frequently triggers infinite retry loops. Encode the requirement that error messages include field names and remediation hints into the test suite, not just the docs.
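What "useful to an LLM" can look like on the server side is sketched below. This is an assumption-laden illustration, not the article's implementation: `validate_invoice_args` is a hypothetical validator for the `create_invoice` arguments used in the tests above, and each message names the field, the problem, and a concrete fix.

```python
def validate_invoice_args(args: dict) -> list[str]:
    """Return error strings for the model; an empty list means valid input.

    Hypothetical validator: every message starts with the field name and
    ends with a remediation hint the calling model can act on.
    """
    errors = []
    amount = args.get("amount")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount: expected a number, e.g. 10000")
    elif amount <= 0:
        errors.append(
            f"amount: got {amount}, must be greater than 0; "
            "retry with a positive value"
        )
    client = args.get("client")
    if not isinstance(client, str) or not client.strip():
        errors.append("client: required non-empty string, e.g. 'Acme Inc'")
    return errors


assert validate_invoice_args({"amount": 10000, "client": "Acme Inc"}) == []
problems = validate_invoice_args({"amount": -100, "client": "Acme Inc"})
assert len(problems) == 1 and problems[0].startswith("amount:")
```

The tool handler would join these strings into its error response, which is what the `"amount" in ...` assertion in the unit test checks for.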
Layer 2: Contract tests with mcp-testing-framework
Even if every individual tool works, your server still needs to be MCP-spec-compliant as a whole. Does initialize return the right capabilities? Does tools/list use the right JSON Schema? Do error responses follow JSON-RPC 2.0? That's what the contract layer covers, on every PR.
mcp-testing-framework on PyPI launches the server binary and drives the real protocol against it. The same tests run locally and in CI.
# pip install mcp-testing-framework
# tests/test_mcp_contract.py
from mcp_testing_framework import ContractTester


def test_mcp_protocol_compliance():
    tester = ContractTester(
        command=["python", "-m", "my_mcp_server"],
        transport="stdio",
    )
    report = tester.run_full_compliance_check()
    assert report.passed, f"Compliance failures: {report.failures}"


def test_tool_schemas_valid():
    tester = ContractTester(command=["python", "-m", "my_mcp_server"])
    tools = tester.list_tools()
    for tool in tools:
        # JSON Schema Draft 2020-12 compliant?
        assert tester.validate_input_schema(tool), \
            f"Invalid inputSchema for tool: {tool.name}"
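Independent of any framework, the JSON-RPC 2.0 error shape mentioned above can be checked with a few lines of standard-library Python. The function name `jsonrpc_error_problems` is hypothetical, and the sketch covers only the basic response-object rules of JSON-RPC 2.0, not MCP-specific semantics.

```python
def jsonrpc_error_problems(response: dict) -> list[str]:
    """List ways a response violates the JSON-RPC 2.0 error-response shape."""
    problems = []
    if response.get("jsonrpc") != "2.0":
        problems.append('"jsonrpc" must be exactly "2.0"')
    if "id" not in response:
        problems.append('"id" is required (null when the request id is unknown)')
    if "result" in response:
        problems.append('"result" must not appear alongside "error"')
    error = response.get("error")
    if not isinstance(error, dict):
        problems.append('"error" must be an object')
        return problems
    code = error.get("code")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(code, bool) or not isinstance(code, int):
        problems.append('"error.code" must be an integer')
    if not isinstance(error.get("message"), str):
        problems.append('"error.message" must be a string')
    return problems


well_formed = {
    "jsonrpc": "2.0",
    "id": 1,
    "error": {"code": -32602, "message": "Invalid params"},
}
assert jsonrpc_error_problems(well_formed) == []

malformed = {"id": 1, "error": "Internal error"}
assert len(jsonrpc_error_problems(malformed)) == 2
```

A check like this is a fallback for servers where a full contract runner is not yet wired in; it does not replace one.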
Layer 3: Schema drift detection
The most overlooked layer. Its job is to make sure that "trivial" changes to the tool schema don't quietly change production agent behaviour — and to surface that risk during code review, not after release.
The snapshot pattern
Save the server's tools/list response as JSON in the repo and fail any PR with a diff. Simple, reliable, and it forces a conversation about every change.
# scripts/snapshot_tools.py
import asyncio
import json

from fastmcp import Client

from my_mcp_server import mcp


async def main():
    async with Client(mcp) as client:
        tools = await client.list_tools()
        # Sort by tool name so the snapshot is stable across runs
        normalized = sorted(
            [t.model_dump(mode="json") for t in tools],
            key=lambda x: x["name"],
        )
        with open("tests/snapshots/tools.json", "w", encoding="utf-8") as f:
            json.dump(normalized, f, indent=2, ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    asyncio.run(main())

# In CI: python scripts/snapshot_tools.py && git diff --exit-code tests/snapshots/
If git diff --exit-code returns non-zero the PR fails, and reviewers must explicitly answer the question: "how does this description change shift tool selection?" That single forcing function is one of the cheapest reliability wins available.
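To make that review question easier to answer, the raw diff can be condensed into a per-tool summary. The following is a minimal sketch, assuming the snapshot is a list of tool objects with `name`, `description`, and `inputSchema` keys; `summarize_tool_drift` is a hypothetical name.

```python
def summarize_tool_drift(old: list[dict], new: list[dict]) -> dict:
    """Classify differences between two tools/list snapshots by tool name."""
    old_by_name = {tool["name"]: tool for tool in old}
    new_by_name = {tool["name"]: tool for tool in new}
    shared = old_by_name.keys() & new_by_name.keys()
    return {
        "added": sorted(new_by_name.keys() - old_by_name.keys()),
        "removed": sorted(old_by_name.keys() - new_by_name.keys()),
        # Description-only edits change what the LLM reads, not the code
        "description_changed": sorted(
            name for name in shared
            if old_by_name[name].get("description")
            != new_by_name[name].get("description")
        ),
        "schema_changed": sorted(
            name for name in shared
            if old_by_name[name].get("inputSchema")
            != new_by_name[name].get("inputSchema")
        ),
    }


before = [{"name": "send_slack", "description": "Send a message to Slack",
           "inputSchema": {"type": "object"}}]
after = [{"name": "send_slack", "description": "Post to Slack channel",
          "inputSchema": {"type": "object"}}]
drift = summarize_tool_drift(before, after)
assert drift["description_changed"] == ["send_slack"]
assert drift["schema_changed"] == []
```

Printing this summary in the failing CI step tells the reviewer at a glance whether the change touches what the model reads, the argument contract, or both.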
Selection-rate regression (advanced)
Going further: build a small set of representative queries (10–20) and assert that Claude / GPT / Gemini all pick the right tool above some threshold (say 90%). Any drop blocks the PR. This is exactly the kind of cross-model stability that KanseiLink's Agent Voice dataset has shown to correlate strongly with production reliability.
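The gating logic for such a suite can be sketched without committing to any vendor SDK. Everything here is an assumption for illustration: each model is represented by a hypothetical `pick_tool(query) -> tool_name` callable (in a real suite it would wrap an API call), and the stub pickers and two-query case list stand in for the 10–20 queries suggested above.

```python
from typing import Callable

Picker = Callable[[str], str]


def selection_rate(pick_tool: Picker, cases: list[tuple[str, str]]) -> float:
    """Fraction of (query, expected_tool) cases where the right tool is picked."""
    if not cases:
        raise ValueError("cases must not be empty")
    hits = sum(1 for query, expected in cases if pick_tool(query) == expected)
    return hits / len(cases)


def failing_models(
    pickers: dict[str, Picker],
    cases: list[tuple[str, str]],
    threshold: float = 0.9,
) -> dict[str, float]:
    """Return {model_name: rate} for every model below the threshold."""
    rates = {name: selection_rate(pick, cases) for name, pick in pickers.items()}
    return {name: rate for name, rate in rates.items() if rate < threshold}


CASES = [
    ("Show me all of this month's invoices", "list_invoices"),
    ("Create a 10,000 yen invoice for Acme", "create_invoice"),
]


# Stub pickers standing in for real model calls
def stable_model(query: str) -> str:
    return "create_invoice" if query.startswith("Create") else "list_invoices"


def regressed_model(query: str) -> str:
    return "list_invoices"


failures = failing_models(
    {"model-a": stable_model, "model-b": regressed_model}, CASES
)
assert failures == {"model-b": 0.5}
```

In CI, a non-empty result from `failing_models` would fail the job and name the model whose selection rate dropped.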
Layer 4: End-to-end with a real LLM
The last line of defense is an E2E test that uses a real LLM. With a small model like Claude Haiku 4.5, you can verify on a budget that the agent, given a representative user intent, picks the right tool, fills it with the right arguments, interprets the response, and replies sensibly.
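import json
import os

import pytest
from anthropic import Anthropic

# Hypothetical module path: supply your own helper that returns the server's
# tools as Anthropic tool definitions (name, description, input_schema).
from tests.e2e.helpers import load_mcp_tools_as_anthropic_format

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

EVAL_CASES = [
    {
        "user": "Show me all of this month's invoices",
        "expected_tool": "list_invoices",
        # Hard-coded for a May 2026 run; derive the month at runtime in a real suite
        "expected_args_contain": ["2026-05"],
    },
    {
        "user": "Create a 10,000 yen invoice for Acme",
        "expected_tool": "create_invoice",
        "expected_args_contain": ["10000", "Acme"],
    },
]


# One test per eval case, so failures are reported individually
@pytest.mark.parametrize("case", EVAL_CASES, ids=lambda c: c["expected_tool"])
def test_eval_case(case):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        tools=load_mcp_tools_as_anthropic_format(),
        messages=[{"role": "user", "content": case["user"]}],
    )
    # The first tool_use block, if any, is the model's chosen tool call
    tool_use = next(
        (b for b in response.content if b.type == "tool_use"),
        None,
    )
    assert tool_use is not None, "No tool was called"
    assert tool_use.name == case["expected_tool"]
    args_str = json.dumps(tool_use.input)
    for needle in case["expected_args_contain"]:
        assert needle in args_str, f"Missing {needle} in {args_str}"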
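The test relies on `load_mcp_tools_as_anthropic_format()`, which is not defined in the sample. One plausible core for it is sketched here, assuming the tool definitions are available as dicts with the MCP field names `name`, `description`, and `inputSchema` (for example, loaded from the checked-in snapshot); the function name `to_anthropic_tools` is hypothetical.

```python
def to_anthropic_tools(mcp_tools: list[dict]) -> list[dict]:
    """Map MCP tool definitions to the Anthropic Messages API tool format."""
    return [
        {
            "name": tool["name"],
            # MCP descriptions are optional; fall back to an empty string
            "description": tool.get("description") or "",
            # MCP uses camelCase inputSchema; Anthropic expects input_schema
            "input_schema": tool["inputSchema"],
        }
        for tool in mcp_tools
    ]


mcp_tools = [
    {
        "name": "create_invoice",
        "description": "Create an invoice for a client",
        "inputSchema": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "client": {"type": "string"},
            },
            "required": ["amount", "client"],
        },
    }
]
converted = to_anthropic_tools(mcp_tools)
assert converted[0]["name"] == "create_invoice"
assert converted[0]["input_schema"]["required"] == ["amount", "client"]
```

Feeding the E2E test from the same snapshot that the drift check guards keeps the two layers looking at identical tool definitions.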
20 representative queries × ~2,000 tokens (input plus output) × 30 days = roughly 1.2M tokens/month. At Claude Haiku 4.5's pricing tier that lands on the order of a few hundred to a few thousand yen per month, orders of magnitude cheaper than a single production incident with downtime and customer impact. Pricing changes; verify before forecasting.
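The same estimate as a small calculator, so it can be recomputed when prices or the eval set change. The input/output split (1,600 + 400 tokens), the per-million-token prices, and the exchange rate below are placeholder assumptions for illustration, not quoted figures; substitute current official values.

```python
def monthly_tokens(queries: int, tokens_in: int, tokens_out: int, runs: int) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) consumed per month."""
    return queries * tokens_in * runs, queries * tokens_out * runs


def monthly_cost_yen(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
    yen_per_usd: float,
) -> float:
    """Convert monthly token usage into yen at the given prices."""
    usd = (
        input_tokens / 1_000_000 * usd_per_mtok_in
        + output_tokens / 1_000_000 * usd_per_mtok_out
    )
    return usd * yen_per_usd


# 20 queries, ~2,000 tokens each (assumed split: 1,600 in + 400 out), 30 nightly runs
tokens_in, tokens_out = monthly_tokens(queries=20, tokens_in=1600, tokens_out=400, runs=30)
assert tokens_in + tokens_out == 1_200_000

# Placeholder prices and exchange rate: replace with current official values
estimate = monthly_cost_yen(
    tokens_in, tokens_out, usd_per_mtok_in=1.0, usd_per_mtok_out=5.0, yen_per_usd=150.0
)
print(f"Estimated monthly cost: about {estimate:.0f} yen")
```

With these placeholder inputs the result falls in the low hundreds of yen; retries, multi-turn conversations, and larger tool lists push it toward the upper end of the range.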
A working GitHub Actions workflow
A minimal four-layer workflow. Unit, contract and snapshot tests run on every PR. E2E only triggers on a nightly schedule and on release tags.
# .github/workflows/mcp-test.yml
name: MCP Server Tests

on:
  push:
    branches: [main]
    # Without a tags filter, tag pushes never trigger this workflow.
    # Adjust the pattern to your release tag scheme.
    tags: ["v*"]
  pull_request:
  schedule:
    - cron: "0 16 * * *"  # 01:00 JST daily (16:00 UTC)

jobs:
  fast-tests:
    name: Unit + Contract + Schema Drift
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt
      - name: Unit tests (FastMCP Client)
        run: pytest tests/unit -v
      - name: Contract tests (mcp-testing-framework)
        run: pytest tests/contract -v
      - name: Snapshot drift check
        run: |
          python scripts/snapshot_tools.py
          git diff --exit-code tests/snapshots/ \
            || (echo "::error::Tool schema drift detected. Review and commit snapshot." && exit 1)

  e2e-tests:
    name: E2E with Real LLM
    if: github.event_name == 'schedule' || startsWith(github.ref, 'refs/tags/')
    runs-on: ubuntu-latest
    environment: production  # gated environment for secrets
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt
      - name: Run E2E evaluation
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # --json-report is provided by the pytest-json-report plugin
        run: pytest tests/e2e -v --json-report --json-report-file=e2e-report.json
      - name: Upload artifact on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-failure-traces
          # A non-hidden path: upload-artifact v4 skips hidden paths
          # such as .pytest_cache/ by default
          path: e2e-report.json
(1) Isolate secrets: keep API keys in a gated environment that fork-based PR builds cannot read. (2) Tame flakiness: every LLM call needs exponential backoff with jitter and a hard retry cap. (3) Save artifacts: upload failing requests/responses with upload-artifact for repro. (4) Watch parallelism: many SaaS APIs serialize OAuth refresh-token rotation, so parallel test workers fight each other — give each worker its own account or serialize the refresh step.
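Tip (2) can be sketched as a small wrapper around any LLM call. This is a minimal illustration using the "full jitter" variant of exponential backoff; `call_with_backoff` and its parameters are hypothetical names, and in a real suite `retry_on` should be narrowed to the SDK's rate-limit and transient-error exceptions.

```python
import random
import time


def call_with_backoff(
    fn,
    *,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (Exception,),
    sleep=time.sleep,
):
    """Call fn(), retrying with exponential backoff and jitter up to a hard cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # hard retry cap reached: surface the last error
            # Full jitter: wait a random time up to the exponential ceiling
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Demo with a stub that fails twice, then succeeds; sleep is replaced
# by a recorder so the example runs instantly.
attempts = {"count": 0}
delays = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


result = call_with_backoff(flaky_call, retry_on=(TimeoutError,), sleep=delays.append)
assert result == "ok"
assert attempts["count"] == 3 and len(delays) == 2
```

Injecting `sleep` keeps the wrapper itself unit-testable, so the retry policy can live in the fast test layer even though the calls it protects run only in E2E.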
Production checklist
The minimum bar before putting your MCP server on a CI/CD pipeline:
- Unit tests written with FastMCP Client (or equivalent in-memory runner)
- Contract tests with mcp-testing-framework or equivalent assert MCP spec compliance
- tools/list snapshot is checked in and surfaces in PR diffs
- Tool errors return LLM-consumable structure (field names, remediation hints)
- Real-LLM E2E tests trigger on nightly schedule or release tags
- Failing requests/responses are saved as workflow artifacts
- Secrets live in a gated environment that fork PRs cannot read
- Test OAuth accounts are isolated from production
- Parallel tests do not collide on refresh-token rotation
- Optional: selection-rate regression suite watches for tool-sprawl symptoms
FAQ
Q1. Why is MCP testing harder than Web API testing?
Three asymmetries: (1) the caller is a non-deterministic LLM; (2) the tool schema doubles as a prompt — a single-word edit can shift selection; (3) the MCP spec keeps moving. A standard request/response test pyramid simply doesn't cover this surface.
Q2. FastMCP Client vs mcp-testing-framework?
Different jobs. FastMCP Client is in-memory unit testing — fast and free. mcp-testing-framework is real-protocol contract testing — verifies spec compliance. In production you almost always want both.
Q3. Can CI catch a description-only PR?
Yes — snapshot the tools/list response and fail on diff. Going further, run a 10–20 query selection-rate regression suite that asserts the right tool is chosen above a threshold.
Q4. Doesn't real-LLM E2E get expensive?
Not if you trigger it surgically. Run unit/contract/snapshot tests on every PR, and E2E only nightly, on release tags, and optionally on PRs that change the tool schema. On Claude Haiku 4.5 a nightly run is typically a few hundred to a few thousand yen per month, far cheaper than a single production incident. Verify pricing before forecasting.
Q5. What are the GitHub Actions traps?
(1) Secrets in a gated environment — fork PRs must not read them. (2) Backoff + retry cap on every LLM call. (3) Upload failing traces as artifacts. (4) Serialize or partition tests that share an OAuth refresh token.
Q6. How do I provision SaaS test accounts?
Keep them strictly separate from production. Most major SaaS (freee, Slack, kintone, etc.) offer sandbox environments — put CI accounts there. For services without sandboxes, run a dedicated low-tier account and a weekly reset script.
The technical content reflects publicly available information and official documentation as of May 2026. FastMCP Client (github.com/jlowin/fastmcp) and mcp-testing-framework (pypi.org/project/mcp-testing-framework/) are both publicly available on their official repositories / PyPI as of May 2026. Code samples are illustrative pseudocode; in production, validate against the latest API of the MCP SDK you use. Anthropic Claude Haiku 4.5 pricing changes; recompute the cost estimate using current official pricing before relying on it. The "description changes shift selection" effect is widely observed in the industry, but the magnitude varies by model and tool surface. Pricing and specifications may change without notice.