Contents

  1. Why MCP needs rate limiting in 2026
  2. Error format — 429 and JSON-RPC parity
  3. Server pattern 1: Token bucket
  4. Server pattern 2: Sliding window
  5. Agent side: Full-jitter exponential backoff
  6. Pairing with circuit breakers
  7. Observability — metrics that drive tuning
  8. Production checklist
  9. FAQ

Why MCP needs rate limiting in 2026

MCP servers in 2025 were mostly thin API wrappers. In 2026 they have become shared infrastructure that several agents call in parallel. It's no longer unusual for Claude, GPT, Gemini, and an in-house agent to share a single MCP instance. One agent's runaway loop now poisons everyone else's experience.

Industry reports have documented cases where a single agent stuck in a loop fires over 1,000 requests per minute. An MCP server without rate limiting (1) burns through upstream SaaS quotas in seconds, (2) blows past cloud-cost forecasts by several multiples, and (3) earns a temporary ban from the upstream API, taking down every connected agent at once.

MCP's new role — May 2026

The MCP server has become the membrane that absorbs agent overruns. Rate limits, timeouts, and circuit breakers are not optional — they are baseline requirements for shared infrastructure. ✅ Splunk MCP Server v1.1.0 (in beta as of April 2026) already implements both global and per-tool rate limiting.

Error format — 429 and JSON-RPC parity

Before implementing the limit itself, decide how to surface it. MCP servers run over Streamable HTTP, stdio, and in some setups WebSocket, so the error format branches per transport.

HTTP transport: 429 + Retry-After

The industry standard is "HTTP 429 Too Many Requests" plus a "Retry-After" header (integer seconds or HTTP-date). The agent reads that value and bases its backoff on it.

HTTP/1.1 429 Too Many Requests
Retry-After: 12
Content-Type: application/json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Per-tool rate limit hit on 'create_invoice'.",
    "scope": "tool",
    "tool": "create_invoice",
    "retry_after_seconds": 12,
    "limit": "10 requests / 60s"
  }
}
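
The 429 body above can be produced by a thin layer in front of the MCP endpoint itself. A minimal sketch, assuming an Express-style wrapper around a Streamable HTTP MCP server and any limiter object exposing tryConsume() (for example the token bucket shown later); the route path and body shape simply mirror the example above.

// Sketch only: Express middleware returning 429 + Retry-After before the MCP handler runs.
import express from "express";

// Assumed shape: any limiter that can answer "allowed or not, and how long to wait".
declare const limiter: { tryConsume(): { ok: boolean; retryAfter?: number } };

const app = express();

app.post("/mcp", (req, res, next) => {
  const verdict = limiter.tryConsume();
  if (!verdict.ok) {
    const retryAfter = verdict.retryAfter ?? 1;
    res
      .status(429)
      .set("Retry-After", String(retryAfter))
      .json({
        error: {
          code: "rate_limit_exceeded",
          message: "Rate limit hit, retry later.",
          retry_after_seconds: retryAfter,
        },
      });
    return;
  }
  next(); // hand off to the actual MCP request handler
});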

stdio / non-HTTP transport: JSON-RPC error

On stdio MCP, embed retry_after_seconds in the data of the JSON-RPC error object. Agent-side libraries (the MCP SDKs) read it and feed it into the auto-retry path.

{
  "jsonrpc": "2.0",
  "id": 42,
  "error": {
    "code": -32029,
    "message": "Rate limit exceeded",
    "data": {
      "scope": "global",
      "retry_after_seconds": 8,
      "limit": "100 requests / 60s",
      "current_usage": 100
    }
  }
}

The key field is "scope". With a global limit, switching tools won't help. With a per-tool limit, the agent can usefully fall back to a different tool while waiting.
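
As a concrete illustration of that branching, here is a minimal agent-side sketch; the field names follow the error examples above, and the return shape is an assumption, not part of any SDK.

// Sketch: decide what the agent may do while the limit is active.
type RateLimitError = {
  scope: "global" | "tool";
  retry_after_seconds: number;
  tool?: string;
};

function planRetry(err: RateLimitError) {
  if (err.scope === "tool") {
    // Per-tool limit: only this tool is throttled, other tools remain usable.
    return { waitMs: err.retry_after_seconds * 1000, mayUseOtherTools: true, blockedTool: err.tool };
  }
  // Global limit: every tool on this server is throttled, so just wait.
  return { waitMs: err.retry_after_seconds * 1000, mayUseOtherTools: false };
}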

Server pattern 1: Token bucket

The default choice when you want "usually relaxed but okay with bursts within bounds". Bucket capacity encodes burst tolerance, refill rate encodes the average rate. Matches the natural usage pattern of an agent (a flurry of calls, then idle).

// TypeScript / Node.js (simplified, in-memory; production needs shared, atomic state in Redis)
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,    // peak burst size
    private refillRate: number,  // tokens added per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryConsume(): { ok: boolean; retryAfter?: number } {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { ok: true };
    }
    const needed = 1 - this.tokens;
    const retryAfter = Math.ceil(needed / this.refillRate);
    return { ok: false, retryAfter };
  }
}

// Example: 100 requests/min average, peak burst 20
const bucket = new TokenBucket(20, 100 / 60);

For multi-instance production, keep the token count in a single Redis key and update refill and consume atomically, typically via a Lua script rather than separate SET NX PX and INCRBY calls (two round trips leave a race between instances). On Cloudflare Workers, use Durable Objects; on AWS, ElastiCache with a Lua script.
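
A minimal sketch of the Redis variant, assuming ioredis; the key layout, TTL, and script are illustrative, not a fixed convention.

// Distributed token bucket (sketch): refill + consume run inside one Lua script,
// so concurrent MCP instances cannot interleave between the read and the write.
import Redis from "ioredis";

const redis = new Redis();

// KEYS[1] = bucket key, ARGV[1] = capacity, ARGV[2] = refill rate (tokens/s), ARGV[3] = now (ms)
const TOKEN_BUCKET_LUA = `
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(data[1]) or capacity
local ts = tonumber(data[2]) or now
tokens = math.min(capacity, tokens + ((now - ts) / 1000) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('PEXPIRE', KEYS[1], 120000)
return allowed
`;

async function tryConsumeDistributed(key: string, capacity: number, refillRate: number) {
  const allowed = await redis.eval(TOKEN_BUCKET_LUA, 1, key, capacity, refillRate, Date.now());
  return allowed === 1;
}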

Server pattern 2: Sliding window

Use this when "no more than M calls in the last N seconds" must hold strictly — typically because the upstream API has a hard limit you must not cross (invoicing, money movement).

// Redis Sorted Set based (pseudo-code)
async function checkSlidingWindow(
  key: string,
  windowSeconds: number,
  maxRequests: number,
): Promise<{ ok: boolean; retryAfter?: number }> {
  const now = Date.now();
  const windowStart = now - windowSeconds * 1000;

  // 1. Remove old entries
  await redis.zremrangebyscore(key, 0, windowStart);
  // 2. Count current entries
  const count = await redis.zcard(key);

  if (count >= maxRequests) {
    // Compute when the oldest entry exits the window
    const oldest = await redis.zrange(key, 0, 0, "WITHSCORES");
    const retryAfter = Math.ceil(
      (parseInt(oldest[1]) + windowSeconds * 1000 - now) / 1000
    );
    return { ok: false, retryAfter };
  }

  // 3. Add the current request
  await redis.zadd(key, now, `${now}-${crypto.randomUUID()}`);
  await redis.expire(key, windowSeconds);
  return { ok: true };
}

The 2026 default for MCP is a hybrid: token bucket per tool, sliding window globally. The first gives you burst flexibility; the second guarantees you never exceed the upstream's hard ceiling.
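
For concreteness, a sketch of that hybrid, reusing the TokenBucket class and checkSlidingWindow function from above; the per-tool and global numbers are illustrative, not recommendations.

// Hybrid limiter (sketch): per-tool token bucket first, then the global sliding window.
const perToolBuckets = new Map<string, TokenBucket>();

async function checkLimits(
  tool: string
): Promise<{ ok: boolean; retryAfter?: number; scope?: "tool" | "global" }> {
  let bucket = perToolBuckets.get(tool);
  if (!bucket) {
    bucket = new TokenBucket(20, 100 / 60);  // per-tool: burst 20, ~100 req/min
    perToolBuckets.set(tool, bucket);
  }
  const perToolCheck = bucket.tryConsume();
  if (!perToolCheck.ok) return { ...perToolCheck, scope: "tool" };

  // Global hard ceiling shared by every tool on this server
  const globalCheck = await checkSlidingWindow("mcp:rate:global", 60, 300);
  if (!globalCheck.ok) return { ...globalCheck, scope: "global" };

  return { ok: true };
}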

Agent side: Full-jitter exponential backoff

Returning 429 is meaningless if the agent immediately retries in tight loops. The agent side must implement exponential backoff with jitter.

Why plain exponential backoff is not enough

Imagine 10 agents hitting 429 at the same instant. With plain exponential backoff (1s, 2s, 4s, 8s...) all 10 retry on the same schedule and the spike re-synchronizes. This is the thundering herd problem.

Full Jitter — the AWS recommendation

AWS Architecture Blog has long recommended sleep = random(0, base * 2^attempt). Random delay, scattered across time.

// TypeScript (the same pattern as in the Anthropic SDK and others)
async function callWithBackoff<T>(
  fn: () => Promise<T>,
  options: { maxRetries?: number; baseMs?: number; capMs?: number } = {}
): Promise<T> {
  const maxRetries = options.maxRetries ?? 5;
  const baseMs = options.baseMs ?? 500;
  const capMs = options.capMs ?? 30_000;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Retry only on 429, 5xx, or errors with no HTTP status (e.g. network timeouts)
      const status = err.status ?? err.response?.status;
      if (status !== undefined && status !== 429 && (status < 500 || status > 599)) {
        throw err;  // any other 4xx fails immediately
      }
      if (attempt === maxRetries) throw err;

      // Honor a server-supplied Retry-After first (integer seconds;
      // a non-numeric HTTP-date value falls through to the jitter path)
      const retryAfter = err.headers?.["retry-after"];
      const retryAfterSec = retryAfter ? parseInt(retryAfter, 10) : NaN;
      let delay: number;
      if (!Number.isNaN(retryAfterSec)) {
        delay = retryAfterSec * 1000;
      } else {
        // Full Jitter: random(0, base * 2^attempt), capped
        const exp = Math.min(capMs, baseMs * Math.pow(2, attempt));
        delay = Math.floor(Math.random() * exp);
      }
      await new Promise(r => setTimeout(r, delay));
    }
  }
  throw new Error("unreachable");
}
✅ Important rule

Fail immediately on 4xx other than 429. If the server says "your request itself is wrong", retrying will not make it work. Retries only make sense for 429, 5xx, and network timeouts.

Pairing with circuit breakers

Rate limiting and circuit breakers protect different things.

Mechanism       | Protects                         | Triggers when                          | Primary purpose
Rate limiting   | Yourself (server / upstream API) | Requests/unit-time exceeds budget      | Don't shoot too fast
Circuit breaker | The upstream API                 | Consecutive failures cross a threshold | Don't pile on a service that's already down

A typical circuit breaker has three states: closed / open / half-open. A reasonable production starting point for MCP: open after 5 consecutive 5xx, transition to half-open after 30 seconds, send 1–2 probe requests; if those succeed, close again.

class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold: number = 5,
    private resetMs: number = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt > this.resetMs) {
        this.state = "half-open";
      } else {
        throw new Error("circuit_open");
      }
    }
    try {
      const r = await fn();
      this.failures = 0;
      this.state = "closed";
      return r;
    } catch (err: any) {
      const status = err.status ?? err.response?.status;
      if (status >= 500 && status <= 599) {
        this.failures += 1;
        if (this.failures >= this.threshold) {
          this.state = "open";
          this.openedAt = Date.now();
        }
      }
      throw err;
    }
  }
}
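
How the two layers compose, assuming the callWithBackoff helper above and a hypothetical upstream call (callUpstreamApi is a placeholder, not a real API).

// Wiring sketch: the breaker wraps the raw upstream call, backoff wraps the breaker.
declare function callUpstreamApi(args: unknown): Promise<unknown>;

const breaker = new CircuitBreaker(5, 30_000);

async function callToolSafely(args: unknown) {
  return callWithBackoff(() => breaker.call(() => callUpstreamApi(args)));
}

With this ordering, backoff also retries circuit_open errors after a delay instead of failing instantly; flip the nesting if you prefer to fail fast while the circuit is open.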

Observability — metrics that drive tuning

If you can't see what is being rejected, you can't tune the limit. At minimum, track request volume, rate-limit rejections (429s / JSON-RPC rate-limit errors), and the Retry-After values you hand out.

Prometheus + Grafana, Cloudflare Analytics, or AWS CloudWatch all work, but the operational must-have is being able to slice by tool × agent ID. Only when you can see that create_invoice on the freee MCP server is being throttled specifically for agent X can you tell whether the cause is a misconfigured limit or a buggy agent.
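
A sketch of the minimum series, assuming prom-client on the server; metric and label names here are illustrative, not a standard.

// Sketch: count rejections and record latency, both sliceable by tool and agent ID.
import { Counter, Histogram } from "prom-client";

const rateLimitRejections = new Counter({
  name: "mcp_rate_limit_rejections_total",
  help: "Requests rejected by the rate limiter",
  labelNames: ["tool", "agent_id", "scope"],
});

const toolCallDuration = new Histogram({
  name: "mcp_tool_call_duration_seconds",
  help: "Tool call latency as observed by the MCP server",
  labelNames: ["tool", "agent_id"],
});

// On every rejection:
rateLimitRejections.inc({ tool: "create_invoice", agent_id: "agent-x", scope: "tool" });

// Around every tool call:
const endTimer = toolCallDuration.startTimer({ tool: "create_invoice", agent_id: "agent-x" });
// ... run the tool ...
endTimer();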

Production checklist

Before shipping rate limiting to production, walk through this minimum list (every item is covered in the sections above):

  1. HTTP transports return 429 + Retry-After; stdio returns a JSON-RPC error with retry_after_seconds and scope.
  2. Limiter chosen deliberately: token bucket per tool, sliding window for hard global ceilings (or the hybrid of both).
  3. Limiter state moved out of process memory (Redis, Durable Objects) before running multiple instances.
  4. Agents use full-jitter exponential backoff, honor Retry-After, and fail fast on 4xx other than 429.
  5. A circuit breaker opens on consecutive 5xx from the upstream.
  6. Rejection metrics are exported sliced by tool × agent ID so the limits can be tuned.

Get rate-limit data for 225+ services

KanseiLink exposes measured rate limits, success rates, and timeout behavior for Japanese SaaS and major global APIs via MCP. Find out which services frequently return 429 and which surface Retry-After correctly — based on real agent behavior.


FAQ

Q1. Why does an MCP server need rate limiting?

By 2026 MCP servers are shared infrastructure for multiple agents. A single agent stuck in a loop has been documented to fire over 1,000 requests per minute. Without limiting, you get (1) collapsed experience for other agents, (2) cost spikes, and (3) bans from the upstream API.

Q2. How should rate-limit errors be returned?

HTTP transport: HTTP 429 + Retry-After header. stdio / non-HTTP: JSON-RPC error.data with retry_after_seconds and scope (global vs tool). The agent uses these values to decide both timing and strategy of the next retry.

Q3. Why is jitter required?

Without jitter, exponential backoff causes the thundering herd: many agents synchronized on (1s → 2s → 4s) re-spike together. AWS-recommended Full Jitter — sleep = random(0, base * 2^attempt) — scatters retries across time.

Q4. Token bucket vs sliding window — which?

Token bucket for "usually relaxed, but tolerate bursts". Sliding window for "must never exceed M in the last N seconds" — typical when matching an upstream's hard cap. The 2026 default in MCP servers is a hybrid: token bucket per tool, sliding window globally.

Q5. What's the difference vs a circuit breaker?

Different protection target. Rate limiting is "don't fire too fast"; circuit breakers are "don't pile on an upstream that's already down". Use both. A reasonable starting point: open after 5 consecutive 5xx, half-open after 30 s.

Q6. Is in-memory token bucket enough for production?

Single-instance dev or PoC, sure. As soon as you scale to multiple instances, atomic operations on shared state (Redis, Durable Objects, ElastiCache) are required. With in-memory state each instance counts independently, so an intended "N rps cap" effectively becomes 3× or 5× the limit once you run 3 or 5 instances.

Data Disclosures & Caveats

The "1,000+ requests per minute from a looping agent" figure originates from MintMCP Blog (mintmcp.com/blog/rate-limiting-with-mcp). MCP rate-limit implementation patterns (token bucket, sliding window, 429 + Retry-After, JSON-RPC retry-after) are based on Fast.io (fast.io/resources/mcp-server-rate-limiting/), WebScraping.AI FAQ, and the Splunk MCP Server v1.1.0 docs (beta feature, April 2026). Full Jitter is documented in the AWS Architecture Blog post "Exponential Backoff And Jitter". Code samples are illustrative pseudo-code; production adoption requires multi-instance consistency, time-zone correctness, and integrated monitoring. Pricing and specifications can change without notice — verify against the official docs before production use.