Contents

  1. Why MCP testing is harder than Web API testing
  2. The MCP test pyramid — four layers
  3. Layer 1: Unit tests with FastMCP Client
  4. Layer 2: Contract tests with mcp-testing-framework
  5. Layer 3: Schema drift detection
  6. Layer 4: End-to-end with a real LLM
  7. A working GitHub Actions workflow
  8. Production checklist
  9. FAQ

Why MCP testing is harder than Web API testing

"An MCP server is just an API — write some unit tests and an E2E suite and you're done." Most teams that ship on this assumption notice agent regressions within weeks of release. The reason is three asymmetries a typical API test pyramid cannot model.

Asymmetry 1: the caller is an LLM (non-deterministic)

A Web API test assumes "same input, same output." MCP turns that on its head: the caller is Claude or GPT, and the same user intent produces different argument shapes each time. "Create this month's invoice" sometimes arrives as {date: "2026-05"} and sometimes as {from: "2026-05-01", to: "2026-05-31"}. If you only test the shape you expect, production will fail intermittently in ways that look like flakes.
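One way to absorb this variability is to normalize every argument shape you are willing to accept into a single internal form, then test each shape against that normalizer. The sketch below is only an illustration: `normalize_period` and the `date` / `from` / `to` keys are assumptions taken from the invoice example above, not part of any MCP SDK.

```python
import calendar
from datetime import date


def normalize_period(args: dict) -> tuple[date, date]:
    """Map either accepted argument shape to an inclusive (start, end) range."""
    if "date" in args:
        # Shape 1: {"date": "YYYY-MM"} means the whole calendar month.
        year, month = (int(part) for part in args["date"].split("-")[:2])
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if "from" in args and "to" in args:
        # Shape 2: {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"}.
        start = date.fromisoformat(args["from"])
        end = date.fromisoformat(args["to"])
        if start > end:
            raise ValueError("'from' must not be later than 'to'")
        return start, end
    raise ValueError(
        "expected either 'date' (YYYY-MM) or both 'from' and 'to' (YYYY-MM-DD)"
    )


# The two shapes from the example above.
SHAPES = [
    {"date": "2026-05"},
    {"from": "2026-05-01", "to": "2026-05-31"},
]
```

A parametrized unit test can then loop over `SHAPES` and assert that every variant resolves to the same range, so a new shape observed in production becomes one more list entry rather than a new flake.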

Asymmetry 2: the tool schema is also a prompt

An MCP tool's description isn't documentation — it's a signal the LLM uses to decide which tool to call. Editing "Send a message to Slack" into "Post to Slack channel" can shift selection rates noticeably across model families. Your code is unchanged; your behaviour isn't. A description-only refactor can ship a regression, and that is uniquely an MCP problem.

Asymmetry 3: the protocol itself keeps moving

MCP has evolved continuously between 2024 and 2026 — Streamable HTTP transport, expanded metadata, OAuth 2.1-style auth flows. SDKs and clients track the latest spec; if your server stays frozen, newer agents simply cannot talk to it. CI needs an explicit way to verify ongoing spec compliance, not a one-time audit.

⚠️ A common misconception

"All unit tests pass, so we're safe." Not in MCP. Whether the LLM picks the right tool, fills it with sensible arguments, and recovers gracefully from errors is a property of the system, not of any single function — and no amount of unit-test coverage will surface those failures.

The MCP test pyramid — four layers

An MCP server that survives production in 2026 runs four test layers in CI/CD. The lower the layer, the cheaper it is and the more frequently it runs; the higher the layer, the more expensive and selective it becomes.

| Layer | What it verifies | Recommended tool | When it runs |
|---|---|---|---|
| 1. Unit | Per-tool implementation logic | FastMCP Client (in-memory) | Every PR / push |
| 2. Contract | MCP spec compliance / protocol shape | mcp-testing-framework | Every PR |
| 3. Schema drift | Tool-definition snapshot diff | Custom snapshot + git diff | Every PR |
| 4. E2E (real LLM) | Tool selection, argument synthesis, recovery | Anthropic SDK + eval set | Nightly / release tag |

Layer 1: Unit tests with FastMCP Client

FastMCP (a popular Python MCP framework) ships a Client that talks to a server in-memory, with no process or network overhead. No latency, no auth handshake, freely parallel — each test finishes in tens of milliseconds. It's the obvious first choice for the unit layer of CI.

# tests/test_create_invoice.py
# Requires the pytest-asyncio plugin for the @pytest.mark.asyncio marker.
import pytest
from fastmcp import Client

# Import the production server instance directly
from my_mcp_server import mcp

@pytest.mark.asyncio
async def test_create_invoice_success():
    async with Client(mcp) as client:
        result = await client.call_tool(
            "create_invoice",
            {"amount": 10000, "client": "Acme Inc"}
        )
        assert result.is_error is False
        assert "invoice_id" in result.content[0].text

@pytest.mark.asyncio
async def test_create_invoice_invalid_amount():
    async with Client(mcp) as client:
        # Recent FastMCP releases raise ToolError on a failed call by default;
        # raise_on_error=False returns the error result so it can be inspected.
        # Check this against the FastMCP version you have pinned.
        result = await client.call_tool(
            "create_invoice",
            {"amount": -100, "client": "Acme Inc"},
            raise_on_error=False,
        )
        assert result.is_error is True
        # Does the error message tell the LLM which field failed?
        assert "amount" in result.content[0].text.lower()

The point of the second test is subtle but important: verify that errors are useful to an LLM. An API that returns only "Internal error" leaves the agent with nowhere to go and frequently triggers infinite retry loops. Encode the requirement that error messages include field names and remediation hints into the test suite, not just the docs.
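On the server side, that requirement could look something like the sketch below. It is only an illustration: `validate_invoice_args` and `format_tool_error` are hypothetical helpers, and the `amount` / `client` fields are taken from the test above rather than from any real API.

```python
def validate_invoice_args(args: dict) -> list[str]:
    """Return LLM-readable problems; an empty list means the arguments are valid."""
    problems = []

    amount = args.get("amount")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount: must be a number, e.g. 10000")
    elif amount <= 0:
        problems.append(
            f"amount: must be greater than 0 (got {amount}); retry with a positive value"
        )

    client = args.get("client")
    if not isinstance(client, str) or not client.strip():
        problems.append("client: must be a non-empty string, e.g. 'Acme Inc'")

    return problems


def format_tool_error(problems: list[str]) -> str:
    """Join problems into one message that names each field and how to fix it."""
    return "Invalid arguments. " + " | ".join(problems)
```

Each problem string leads with the field name and ends with a concrete fix, which gives the agent something to act on instead of retrying the same call.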

Layer 2: Contract tests with mcp-testing-framework

Even if every individual tool works, your server still needs to be MCP-spec-compliant as a whole. Does initialize return the right capabilities? Does tools/list use the right JSON Schema? Do error responses follow JSON-RPC 2.0? That's what the contract layer covers, on every PR.

mcp-testing-framework on PyPI launches the server binary and drives the real protocol against it. The same tests run locally and in CI.

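# pip install mcp-testing-framework
# tests/test_mcp_contract.py
# Illustrative only: ContractTester and its methods sketch the intended shape
# of the tests; check the package documentation for the actual API.
from mcp_testing_framework import ContractTester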

def test_mcp_protocol_compliance():
    tester = ContractTester(
        command=["python", "-m", "my_mcp_server"],
        transport="stdio",
    )
    report = tester.run_full_compliance_check()
    assert report.passed, f"Compliance failures: {report.failures}"

def test_tool_schemas_valid():
    tester = ContractTester(command=["python", "-m", "my_mcp_server"])
    tools = tester.list_tools()
    for tool in tools:
        # JSON Schema Draft 2020-12 compliant?
        assert tester.validate_input_schema(tool), \
            f"Invalid inputSchema for tool: {tool.name}"
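To make the "do error responses follow JSON-RPC 2.0?" question concrete, here is a minimal, framework-independent sketch of a shape check on an error response. `jsonrpc_error_problems` is a hypothetical helper that covers only the basic structure (version string, `id`, and the `error` object's `code` and `message`), not full spec compliance.

```python
def jsonrpc_error_problems(response: dict) -> list[str]:
    """List structural problems in a JSON-RPC 2.0 error response (empty = OK)."""
    problems = []

    if response.get("jsonrpc") != "2.0":
        problems.append('"jsonrpc" must be exactly "2.0"')
    if "id" not in response:
        problems.append('"id" is required (null if the request id is unknown)')
    if "result" in response:
        problems.append('an error response must not also contain "result"')

    error = response.get("error")
    if not isinstance(error, dict):
        problems.append('"error" must be an object')
        return problems

    code = error.get("code")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(code, bool) or not isinstance(code, int):
        problems.append('"error.code" must be an integer')
    if not isinstance(error.get("message"), str):
        problems.append('"error.message" must be a string')

    return problems
```

In a contract test you would send a deliberately malformed request over the real transport, parse the reply, and assert that this function returns an empty list.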

Layer 3: Schema drift detection

The most overlooked layer. Its job is to make sure that "trivial" changes to the tool schema don't quietly change production agent behaviour — and to surface that risk during code review, not after release.

The snapshot pattern

Save the server's tools/list response as JSON in the repo and fail any PR with a diff. Simple, reliable, and it forces a conversation about every change.

# scripts/snapshot_tools.py
import asyncio
import json

from fastmcp import Client
from my_mcp_server import mcp

async def main():
    async with Client(mcp) as client:
        tools = await client.list_tools()
        # Sort by name so the snapshot does not change with registration order
        normalized = sorted(
            [t.model_dump() for t in tools],
            key=lambda x: x["name"],
        )
    with open("tests/snapshots/tools.json", "w") as f:
        json.dump(normalized, f, indent=2, ensure_ascii=False, sort_keys=True)

if __name__ == "__main__":
    # Without this call the coroutine is defined but never executed
    asyncio.run(main())

# In CI: python scripts/snapshot_tools.py && git diff --exit-code tests/snapshots/

If git diff --exit-code returns non-zero the PR fails, and reviewers must explicitly answer the question: "how does this description change shift tool selection?" That single forcing function is one of the cheapest reliability wins available.
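A raw diff shows reviewers that something changed, but not what kind of change it is. As a rough sketch, the snapshot comparison can be classified so the CI message separates description edits from schema edits. `classify_drift` is a hypothetical helper, and it assumes each snapshot entry carries the `name`, `description`, and `inputSchema` keys of an MCP tool definition.

```python
def classify_drift(old_tools: list[dict], new_tools: list[dict]) -> dict[str, list[str]]:
    """Compare two tools/list snapshots and group the changed tool names by kind."""
    old = {tool["name"]: tool for tool in old_tools}
    new = {tool["name"]: tool for tool in new_tools}

    report = {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "description_changed": [],
        "schema_changed": [],
    }
    for name in sorted(old.keys() & new.keys()):
        if old[name].get("description") != new[name].get("description"):
            report["description_changed"].append(name)
        if old[name].get("inputSchema") != new[name].get("inputSchema"):
            report["schema_changed"].append(name)
    return report
```

A description-only change is the case most likely to slip through review unnoticed, so it may deserve its own label or a required reviewer.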

Selection-rate regression (advanced)

Going further: build a small set of representative queries (10–20) and assert that Claude / GPT / Gemini all pick the right tool above some threshold (say 90%). Any drop blocks the PR. This is exactly the kind of cross-model stability that KanseiLink's Agent Voice dataset has shown to correlate strongly with production reliability.
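The threshold check itself is simple arithmetic once the model calls have been recorded. The sketch below assumes you have already collected one record per (model, query) pair; the record keys and the `selection_rates` / `models_below_threshold` helpers are hypothetical names, not part of any SDK.

```python
from collections import defaultdict


def selection_rates(records: list[dict]) -> dict[str, float]:
    """Fraction of queries per model where the picked tool matched the expected one.

    Each record is assumed to look like:
    {"model": "model-a", "expected_tool": "list_invoices", "picked_tool": "list_invoices"}
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record["model"]] += 1
        if record["picked_tool"] == record["expected_tool"]:
            hits[record["model"]] += 1
    return {model: hits[model] / totals[model] for model in totals}


def models_below_threshold(rates: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Names of models whose selection rate falls under the threshold."""
    return sorted(model for model, rate in rates.items() if rate < threshold)
```

With 10 to 20 queries a single miss moves the rate by 5 to 10 points, so treat the threshold as a coarse alarm and grow the query set before tightening it.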

Layer 4: End-to-end with a real LLM

The last line of defense is an E2E test that uses a real LLM. With a small model like Claude Haiku 4.5, you can verify on a budget that the agent, given a representative user intent, picks the right tool, fills it with the right arguments, interprets the response, and replies sensibly.

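import json
import os

import pytest
from anthropic import Anthropic

# Hypothetical helper module (not defined in this article): it should return the
# server's tools as a list of {"name", "description", "input_schema"} dicts.
from helpers import load_mcp_tools_as_anthropic_format

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

EVAL_CASES = [
    {
        "user": "Show me all of this month's invoices",
        "expected_tool": "list_invoices",
        # Assumes the run date falls in May 2026; derive it from the clock in a real suite
        "expected_args_contain": ["2026-05"],
    },
    {
        "user": "Create a 10,000 yen invoice for Acme",
        "expected_tool": "create_invoice",
        "expected_args_contain": ["10000", "Acme"],
    },
]

# Run the test once per eval case; without parametrize, pytest cannot supply `case`
@pytest.mark.parametrize("case", EVAL_CASES, ids=lambda c: c["expected_tool"])
def test_eval_case(case):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        tools=load_mcp_tools_as_anthropic_format(),
        messages=[{"role": "user", "content": case["user"]}],
    )
    # Take the first tool_use block, if the model called a tool at all
    tool_use = next(
        (b for b in response.content if b.type == "tool_use"),
        None,
    )
    assert tool_use is not None, "No tool was called"
    assert tool_use.name == case["expected_tool"]
    args_str = json.dumps(tool_use.input)
    for needle in case["expected_args_contain"]:
        assert needle in args_str, f"Missing {needle} in {args_str}"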
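The test calls `load_mcp_tools_as_anthropic_format()` without showing it. One possible core for such a helper is sketched below, assuming the MCP tool definitions are available as plain dicts (for example, loaded from the tools/list snapshot). The function name `mcp_tools_to_anthropic_format` is hypothetical. The conversion is mostly a key rename, because MCP uses `inputSchema` while the Anthropic Messages API expects `input_schema`.

```python
def mcp_tools_to_anthropic_format(mcp_tools: list[dict]) -> list[dict]:
    """Convert MCP tool definitions (as dicts) into Anthropic `tools` entries."""
    converted = []
    for tool in mcp_tools:
        converted.append(
            {
                "name": tool["name"],
                # A missing description becomes an empty string
                "description": tool.get("description") or "",
                # Fall back to an empty object schema for tools with no inputs
                "input_schema": tool.get("inputSchema")
                or {"type": "object", "properties": {}},
            }
        )
    return converted
```

Feeding this from the committed snapshot keeps the E2E suite and the drift check looking at the same tool definitions.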

Cost estimate: nightly E2E on Claude Haiku 4.5

20 representative queries × ~2,000 tokens (input plus output) per query × 30 nightly runs = roughly 1.2M tokens/month. At Claude Haiku 4.5's pricing tier that lands on the order of a few hundred to a few thousand yen per month, far cheaper than a single production incident with downtime and customer impact. Pricing changes; verify before forecasting.
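The arithmetic is easy to keep in a small script so the estimate can be recomputed whenever prices or the eval set change. In this sketch the per-million-token prices and the input/output split are parameters you supply. The values in the usage comment are placeholders, not quoted prices.

```python
def monthly_tokens(queries_per_run: int, tokens_per_query: int, runs_per_month: int) -> int:
    """Total tokens consumed per month by the nightly eval."""
    return queries_per_run * tokens_per_query * runs_per_month


def monthly_cost(
    tokens: int,
    input_share: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost in the currency of the given prices (prices are per million tokens)."""
    input_tokens = tokens * input_share
    output_tokens = tokens - input_tokens
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Usage with placeholder prices; substitute the current official pricing:
# tokens = monthly_tokens(20, 2_000, 30)
# cost = monthly_cost(tokens, input_share=0.5,
#                     input_price_per_mtok=1.0, output_price_per_mtok=5.0)
```

Output tokens are usually priced higher than input tokens, so the split matters more than the total when the eval prompts are short.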

A working GitHub Actions workflow

A minimal four-layer workflow. Unit, contract and snapshot tests run on every PR. E2E only triggers on a nightly schedule and on release tags.

# .github/workflows/mcp-test.yml
name: MCP Server Tests

on:
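  push:
    branches: [main]
    tags: ["v*"]  # assumed release-tag pattern; without a tags filter, tag pushes never trigger the e2e job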
  pull_request:
  schedule:
    - cron: "0 16 * * *"  # 01:00 JST daily (16:00 UTC)

jobs:
  fast-tests:
    name: Unit + Contract + Schema Drift
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt
      - name: Unit tests (FastMCP Client)
        run: pytest tests/unit -v
      - name: Contract tests (mcp-testing-framework)
        run: pytest tests/contract -v
      - name: Snapshot drift check
        run: |
          python scripts/snapshot_tools.py
          git diff --exit-code tests/snapshots/ \
            || (echo "::error::Tool schema drift detected. Review and commit snapshot." && exit 1)

  e2e-tests:
    name: E2E with Real LLM
    if: github.event_name == 'schedule' || startsWith(github.ref, 'refs/tags/')
    runs-on: ubuntu-latest
    environment: production  # gated environment for secrets
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt
      - name: Run E2E evaluation
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
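        # --json-report comes from the pytest-json-report plugin (add it to requirements-dev.txt)
        run: pytest tests/e2e -v --json-report --json-report-file=e2e-report.json
      - name: Upload artifact on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-failure-traces
          path: e2e-report.json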

✅ Four practical tips

(1) Isolate secrets: keep API keys in a gated environment that fork-based PR builds cannot read.
(2) Tame flakiness: every LLM call needs exponential backoff with jitter and a hard retry cap.
(3) Save artifacts: upload failing requests/responses with upload-artifact for repro.
(4) Watch parallelism: many SaaS APIs serialize OAuth refresh-token rotation, so parallel test workers fight each other; give each worker its own account or serialize the refresh step.
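For tip (2), here is a minimal sketch of exponential backoff with full jitter and a hard retry cap. `call_with_backoff` is a hypothetical wrapper. The `sleep` and `rng` parameters exist only so tests can run instantly and deterministically, and in practice you would narrow `retry_on` to the SDK's transient error types.

```python
import random
import time


def call_with_backoff(
    fn,
    *,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (Exception,),
    sleep=time.sleep,
    rng=random.random,
):
    """Call fn(), retrying on failure with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # hard cap reached: surface the last error
            # The ceiling doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the ceiling
            sleep(ceiling * rng())
```

Wrapping the `client.messages.create(...)` call in a lambda and passing it as `fn` keeps the retry policy in one place instead of scattering it across test cases.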

Production checklist

The minimum bar before putting your MCP server on a CI/CD pipeline:

FAQ

Q1. Why is MCP testing harder than Web API testing?

Three asymmetries: (1) the caller is a non-deterministic LLM; (2) the tool schema doubles as a prompt — a single-word edit can shift selection; (3) the MCP spec keeps moving. A standard request/response test pyramid simply doesn't cover this surface.

Q2. FastMCP Client vs mcp-testing-framework?

Different jobs. FastMCP Client is in-memory unit testing — fast and free. mcp-testing-framework is real-protocol contract testing — verifies spec compliance. In production you almost always want both.

Q3. Can CI catch a description-only PR?

Yes — snapshot the tools/list response and fail on diff. Going further, run a 10–20 query selection-rate regression suite that asserts the right tool is chosen above a threshold.

Q4. Doesn't real-LLM E2E get expensive?

Not if you trigger it surgically. Run unit/contract/snapshot tests on every PR, and E2E only on nightly + release tags + schema-changing PRs. On Claude Haiku 4.5 a nightly run is typically a few hundred to a few thousand yen per month — far cheaper than a single production incident. Verify pricing before forecasting.

Q5. What are the GitHub Actions traps?

(1) Secrets in a gated environment — fork PRs must not read them. (2) Backoff + retry cap on every LLM call. (3) Upload failing traces as artifacts. (4) Serialize or partition tests that share an OAuth refresh token.

Q6. How do I provision SaaS test accounts?

Keep them strictly separate from production. Most major SaaS (freee, Slack, kintone, etc.) offer sandbox environments — put CI accounts there. For services without sandboxes, run a dedicated low-tier account and a weekly reset script.

Disclosures and notes

The technical content reflects publicly available information and official documentation as of May 2026. FastMCP Client (github.com/jlowin/fastmcp) and mcp-testing-framework (pypi.org/project/mcp-testing-framework/) are both publicly available on their official repositories / PyPI as of May 2026. Code samples are illustrative pseudocode; in production, validate against the latest API of the MCP SDK you use. Anthropic Claude Haiku 4.5 pricing changes; recompute the cost estimate using current official pricing before relying on it. The "description changes shift selection" effect is widely observed in the industry, but the magnitude varies by model and tool surface. Pricing and specifications may change without notice.