
Claude in Production: MCP Server Patterns

Master MCP server patterns for Claude in production. Architecture, failure scenarios, code patterns, and deployment strategies for enterprise AI systems.

The PADISO Team ·2026-06-13


Table of Contents

  1. Why MCP Server Patterns Matter
  2. Understanding the Model Context Protocol
  3. Core MCP Server Architecture
  4. Production-Ready Server Patterns
  5. Failure Scenarios and Prevention
  6. Implementation Examples and Code Patterns
  7. Deployment and Scaling Considerations
  8. Security and Compliance in MCP Deployments
  9. Monitoring and Observability
  10. Next Steps and Getting Started

Why MCP Server Patterns Matter

Running Claude in production is fundamentally different from using it in a sandbox. The stakes are higher, the failure modes are more complex, and the cost of downtime or data leaks is measured in revenue and reputation. If you’re shipping Claude-powered features to customers, you need patterns that work at scale.

MCP—the Model Context Protocol—is Anthropic’s answer to a critical problem: how do you give Claude reliable, controlled access to your systems, data, and tools without building a bespoke integration for every use case? Instead of writing custom code to bolt Claude onto your database, your APIs, your document store, or your workflow engine, you define that connection once, through a standardised protocol, and Claude can use it consistently.

But “standardised protocol” doesn’t mean “one right way to deploy it.” In practice, production Claude deployments require thoughtful patterns around server lifecycle, error handling, resource management, and observability. Get these wrong, and you’ll ship something that works in testing but fails under load, or worse, fails silently.

This guide covers the patterns that prevent those failures. We’ll walk through architecture, code, failure scenarios, and the specific decisions you need to make before your MCP servers handle real traffic.


Understanding the Model Context Protocol

What MCP Is and Why It Exists

Anthropic’s official announcement of the Model Context Protocol introduced MCP as a standardised way for Claude to interact with external systems. The protocol solves a practical problem: Claude needs access to your data and tools, but you need control, auditability, and security. MCP sits between Claude and your systems, translating requests and responses in a predictable, auditable way.

Think of MCP as a contract. Claude makes a request—“fetch the user’s account balance”—and the MCP server either fulfils it or rejects it. Claude doesn’t know or care how the server works internally. It just knows that if it sends a valid request, it gets a valid response (or a documented error).

The official MCP documentation provides the canonical overview of the protocol, its concepts, and how to get started. The key concepts are:

  • Resources: Data that Claude can read (files, database records, API responses)
  • Tools: Actions Claude can take (create a record, send an email, run a calculation)
  • Prompts: Templates that guide Claude’s reasoning for specific tasks
  • Sampling: The server’s ability to request a model completion from the client, so a tool can ask Claude for intermediate reasoning without holding its own API key

In production, you’ll use all of these, but the patterns around them differ significantly from development.

MCP vs. Direct API Integration

You could skip MCP entirely and have Claude call your APIs directly via function calling. Why don’t most production teams do that?

Control and auditability. With MCP, every interaction between Claude and your systems flows through a single, versioned server. You can log it, rate-limit it, and audit it in one place. With direct API calls, you’re managing permissions and logging at the API level, which is messier and more error-prone.

Consistency. If you have multiple Claude instances (one for customer support, one for internal operations, one for analytics), they all use the same MCP server. Changes to the server propagate everywhere at once. Direct API calls mean managing multiple integrations.

Failure isolation. If your MCP server fails, you know exactly what broke. If Claude is calling your APIs directly and one of them times out, the failure is distributed across your system.

Cost control. MCP servers can cache data, batch requests, and filter what Claude sees. Direct API calls don’t have those levers.

For production systems, especially those handling sensitive data or high traffic, MCP is the right abstraction.


Core MCP Server Architecture

The Basic Server Structure

An MCP server is fundamentally a long-running process that listens for requests and sends responses. It can be written in Python, Node.js, Go, Rust, or any language with JSON-RPC support. Here’s the conceptual flow:

Claude Client
      ↓  (JSON-RPC over stdio, HTTP, or WebSocket)
MCP Server
      ↓  (Internal request handling)
Your Systems (databases, APIs, files, services)

The MCP server receives a JSON-RPC request, validates it, executes the corresponding action (reading from a database, calling an API, processing a file), and returns a response. If something goes wrong, it returns an error.

In production, you’re adding layers:

  • Authentication: Who is Claude, and is it allowed to make this request?
  • Rate limiting: How many requests per second or minute?
  • Caching: Should we cache responses to reduce backend load?
  • Logging and tracing: What happened, when, and why?
  • Graceful degradation: If a backend system is down, what do we do?
  • Resource limits: Memory, CPU, timeout for each request?

These aren’t optional. They’re the difference between a working prototype and a system that survives contact with production.

Request-Response Flow with Error Handling

A practical step-by-step guide to building an MCP server walks through the basic flow, but production adds complexity. Here’s a realistic request-response cycle:

  1. Claude sends a request: {"jsonrpc": "2.0", "id": 1, "method": "tools/call", "params": {...}}
  2. MCP server receives it on stdin, via HTTP, or via WebSocket.
  3. Server parses the JSON and validates the request structure.
  4. Server checks authentication (is this request signed correctly?).
  5. Server checks rate limits (has Claude exceeded its quota?).
  6. Server checks authorisation (is Claude allowed to call this tool?).
  7. Server executes the tool: calls a database, an API, or internal logic.
  8. If execution succeeds, server returns the result.
  9. If execution fails, server returns an error with a code and message.
  10. Claude receives the response and continues reasoning.

Each of these steps is a place where things can go wrong. Production patterns handle each one.
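
The steps above can be sketched as an ordered chain of checks that stops at the first rejection. This is an illustration under stated assumptions, not the MCP SDK's API: `run_pipeline`, the check factories, the `_meta.token` field, and the error codes in the -32000 range are hypothetical names chosen for the example. Rate limiting would slot in as one more check.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A check returns None to continue, or (code, message) to reject the request.
Check = Callable[[Dict], Optional[Tuple[int, str]]]

def validate_structure(request: Dict) -> Optional[Tuple[int, str]]:
    if request.get("jsonrpc") != "2.0" or "method" not in request:
        return (-32600, "Invalid Request")
    return None

def make_auth_check(valid_tokens: set) -> Check:
    def check(request: Dict) -> Optional[Tuple[int, str]]:
        # Assumption: the credential travels in a `_meta.token` field.
        token = request.get("_meta", {}).get("token")
        return None if token in valid_tokens else (-32001, "Unauthenticated")
    return check

def make_authz_check(allowed_tools: set) -> Check:
    def check(request: Dict) -> Optional[Tuple[int, str]]:
        if request.get("method") != "tools/call":
            return None
        name = request.get("params", {}).get("name")
        return None if name in allowed_tools else (-32002, "Tool not allowed")
    return check

def run_pipeline(request: Dict, checks: List[Check],
                 execute: Callable[[Dict], Dict]) -> Dict:
    """Run every check in order, then execute the request if none rejected it."""
    for check in checks:
        rejection = check(request)
        if rejection is not None:
            code, message = rejection
            return {"jsonrpc": "2.0", "id": request.get("id"),
                    "error": {"code": code, "message": message}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": execute(request)}
```

Keeping each step as a separate function makes it possible to test and log every rejection reason on its own.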


Production-Ready Server Patterns

Pattern 1: The Adapter Server (Isolation and Abstraction)

The adapter pattern isolates your MCP server from your core systems. Instead of having Claude call your production database directly, the MCP server acts as a thin adapter: it receives requests, translates them, calls your backend, and translates responses back.

Why? Resilience. If your MCP server crashes, your backend keeps running. If your backend is under load, the MCP server can queue requests or return cached data. If you need to change how Claude accesses your systems, you change the adapter, not your core code.

Claude → MCP Server (adapter) → Your Backend

              (caching, rate limiting, auth)

In this pattern:

  • The MCP server defines what Claude can see and do. It’s a whitelist, not a passthrough.
  • The server caches frequently accessed data (user profiles, configuration, reference data) to reduce backend load.
  • The server validates all inputs before passing them to the backend.
  • The server handles backend failures gracefully: timeouts, 5xx errors, connection issues.

This pattern works well for read-heavy workloads and for systems where Claude needs access to multiple backends. It’s how PADISO’s platform development teams in Sydney architect multi-system integrations for scale-ups and enterprises.
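
As a rough sketch of the whitelist idea, the adapter below validates input and exposes only named fields from the backend record. `UserAdapter`, `EXPOSED_FIELDS`, and the ID pattern are hypothetical; the backend is passed in as a plain callable so the example stays self-contained.

```python
import re
from typing import Callable, Dict

USER_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")
EXPOSED_FIELDS = ("id", "name", "plan")  # whitelist: every other field is dropped

class UserAdapter:
    def __init__(self, backend_lookup: Callable[[str], Dict]) -> None:
        self._lookup = backend_lookup

    def fetch_user(self, user_id: str) -> Dict:
        # Validate before the value ever reaches the backend.
        if not isinstance(user_id, str) or not USER_ID_PATTERN.fullmatch(user_id):
            raise ValueError("user_id has an unexpected format")
        record = self._lookup(user_id)
        # Copy only whitelisted fields into the response.
        return {k: record[k] for k in EXPOSED_FIELDS if k in record}
```

If the backend later gains a sensitive column, it stays invisible to Claude until someone adds it to the whitelist on purpose.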

Pattern 2: The Event-Driven Server (Asynchronous Processing)

If Claude is triggering long-running operations—generating reports, processing files, sending notifications—a synchronous request-response pattern breaks down. Claude times out waiting for the response, or the server runs out of memory holding the result.

The event-driven pattern decouples the request from the response:

Claude requests: "Generate a report for Q4"

MCP server: "I'll do that. Here's your job ID: 12345"

Server enqueues the job (to a queue: RabbitMQ, SQS, Kafka)

Worker processes the job asynchronously

Claude can poll for status: "Is job 12345 done?"

When done, Claude retrieves the result

This pattern requires:

  • A job queue (RabbitMQ, AWS SQS, Google Cloud Tasks, or similar)
  • A way for Claude to poll for status (a tool that returns job status)
  • A way for Claude to retrieve results (another tool, or a callback)
  • Idempotency: if Claude retries a request, it shouldn’t create duplicate jobs

Use this pattern whenever a tool might take more than a few seconds. The trade-off is complexity: you’re managing job lifecycle, error handling, and retries. But you avoid timeouts and resource exhaustion.
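
A minimal in-memory sketch of the submit, poll, and retrieve cycle with idempotency might look like the following. `JobStore` is a hypothetical stand-in for a real queue plus a status table, and the idempotency key is assumed to be supplied by the caller.

```python
import uuid
from typing import Dict, Optional

class JobStore:
    """In-memory stand-in for a job queue and its status table."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Dict] = {}
        self._by_key: Dict[str, str] = {}

    def submit(self, idempotency_key: str, payload: Dict) -> str:
        # A retried request with the same key gets the original job ID back.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"status": "queued", "payload": payload, "result": None}
        self._by_key[idempotency_key] = job_id
        return job_id

    def complete(self, job_id: str, result: Dict) -> None:
        # Called by the worker once the job has finished.
        self._jobs[job_id].update(status="done", result=result)

    def status(self, job_id: str) -> Optional[Dict]:
        # Backs the polling tool; returns None for unknown job IDs.
        job = self._jobs.get(job_id)
        if job is None:
            return None
        return {"status": job["status"], "result": job["result"]}
```

In a real deployment `submit` would also push the payload onto the queue, and both maps would live in durable storage so that a server restart does not lose jobs.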

Pattern 3: The Multi-Tenant Server (Isolation and Resource Control)

If your MCP server handles requests from multiple customers or teams, you need isolation. One team’s runaway query shouldn’t starve another team’s requests.

The multi-tenant pattern adds:

  • Tenant identification: Every request is tagged with a tenant ID (from authentication).
  • Per-tenant rate limiting: Each tenant has a quota (e.g., 100 requests/minute).
  • Per-tenant resource limits: Each tenant’s queries are capped (e.g., 1 GB of memory and a 30-second timeout per query).
  • Per-tenant logging: All logs are tagged with the tenant, so you can trace issues to a specific customer.
  • Per-tenant caching: Tenants’ cached data is isolated (one tenant can’t see another’s cache).

Request arrives with tenant_id = "acme-corp"

Server checks: Does acme-corp have quota left?

Server checks: Are all acme-corp's requests within resource limits?

Server executes with acme-corp's context

Server logs: tenant_id=acme-corp, action=tool_call, tool=fetch_users, duration=145ms

This pattern is essential if you’re building a SaaS product with Claude at its core, or if you’re running Claude as a shared service across multiple teams or business units.
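
Two of these pieces, cache isolation and tenant-tagged logging, are small enough to sketch directly. `TenantCache` and `format_log` are hypothetical helpers; the log layout simply mirrors the example line above.

```python
from typing import Any, Dict, Optional, Tuple

class TenantCache:
    """Cache whose keys always include the tenant, so entries cannot cross tenants."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], Any] = {}

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        return self._data.get((tenant_id, key))

    def set(self, tenant_id: str, key: str, value: Any) -> None:
        self._data[(tenant_id, key)] = value

def format_log(tenant_id: str, tool: str, duration_ms: int) -> str:
    # Every log line carries the tenant so issues can be traced to a customer.
    return (f"tenant_id={tenant_id}, action=tool_call, "
            f"tool={tool}, duration={duration_ms}ms")
```

Because the tenant is part of the key rather than a convention callers must remember, forgetting it is a type error instead of a data leak.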

Pattern 4: The Caching Server (Performance and Cost)

Claude API calls cost money. Every token you send costs a fraction of a cent, and if your MCP server is fetching the same data repeatedly, you’re wasting money and slowing down responses.

The caching pattern uses multiple layers:

  • In-process cache (a dictionary or LRU map in the server’s own memory): Fastest, but limited by a single server’s memory and not shared between instances.
  • Distributed cache (Redis cluster, DynamoDB): Shared across multiple MCP server instances.
  • Backend cache: Leverage your backend’s caching (HTTP ETags, conditional requests).

Claude requests: "Get user 12345's profile"

MCP server checks in-process cache: Hit! Return cached data.

(Or miss, check distributed cache)

(Or miss, call backend, cache the result, return it)

Caching strategies:

  • Time-based (TTL): Cache data for 5 minutes, then refresh.
  • Event-based: Invalidate cache when a user updates their profile.
  • Dependency-based: If user A’s data depends on team B’s settings, invalidate A’s cache when B changes.

In production, you’re usually combining these. A user profile might be cached for 5 minutes, but if the user updates their email, you invalidate the cache immediately.
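
The layered lookup combined with TTL and event-based invalidation can be sketched as follows. `TTLCache` and `read_through` are hypothetical names; both layers are in-memory here, whereas the shared layer would normally be Redis or similar. The clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

class TTLCache:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # time-based expiry
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (value, self._clock() + self._ttl)

    def invalidate(self, key: str) -> None:
        # Event-based invalidation: call this when the underlying data changes.
        self._data.pop(key, None)

def read_through(key: str, local: TTLCache, shared: TTLCache,
                 load: Callable[[str], Any]) -> Any:
    """Check the local layer, then the shared layer, then the backend."""
    value = local.get(key)
    if value is not None:
        return value
    value = shared.get(key)
    if value is None:
        value = load(key)
        shared.set(key, value)
    local.set(key, value)
    return value
```

Giving the local layer a shorter TTL than the shared layer limits how long a single instance can serve stale data after an invalidation it did not see.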

Pattern 5: The Graceful Degradation Server (Resilience)

Your backend systems will fail. Network calls will time out. Databases will become unavailable. In production, the question isn’t if this happens, but when and how you handle it.

The graceful degradation pattern has layers:

  1. Timeouts: Every external call (database, API, file system) has a timeout. If it doesn’t respond in time, fail fast.
  2. Retries with backoff: Retry failed requests, but wait longer between retries (exponential backoff). Don’t hammer a failing system.
  3. Circuit breakers: If a backend is failing repeatedly, stop calling it for a while. Return a cached response or a default value instead.
  4. Fallbacks: If the primary data source is unavailable, use a secondary source (cache, default value, degraded response).
  5. Bulkheads: If one backend is slow, don’t let it starve other requests. Use separate thread pools or async contexts for each backend.

Claude requests user data

MCP server calls database with 5-second timeout

Database is slow (8 seconds to respond)

Timeout fires, server returns cached data (or error)

Server increments failure counter for that database

After 5 consecutive failures, circuit breaker opens

For the next 60 seconds, server doesn't call that database

Server returns cached data or default value instead

After 60 seconds, circuit breaker tries again (half-open)

This pattern keeps your system running even when parts of it fail. Claude gets a response (even if it’s degraded), and you have time to fix the underlying issue.
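
The walkthrough above maps onto a small circuit breaker. The sketch below is one possible shape, assuming a single-threaded caller; `CircuitBreaker` and its parameter names are hypothetical, and the defaults match the 5 failures and 60 seconds used in the example.

```python
import time
from typing import Any, Callable, Optional

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: skip the backend entirely
            # Cooldown elapsed: half-open, let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A server would typically keep one breaker per backend, so a failing database does not block calls to a healthy API.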


Failure Scenarios and Prevention

Scenario 1: Timeout and Resource Exhaustion

What happens: Claude requests a report that takes 30 seconds to generate. The MCP server waits for the result. Meanwhile, another request comes in, then another. Soon, the server has 50 pending requests, each holding a database connection, each consuming memory. The server runs out of connections or memory and crashes.

Prevention:

  • Set a timeout on every external call (5-30 seconds, depending on your system).
  • Use the event-driven pattern (Pattern 2) for long-running operations.
  • Limit concurrent requests: if you have 20 database connections, don’t allow more than 20 simultaneous operations.
  • Monitor resource usage: CPU, memory, open connections, queue depth. Alert when they approach limits.
  • Use async/await or thread pools to handle concurrency efficiently.
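
Two of these measures, a concurrency cap and a per-call timeout, combine naturally with `asyncio`. In this sketch `run_bounded` is a hypothetical helper; the semaphore size would normally match your connection pool size.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def run_bounded(operation: Callable[[], Awaitable[T]],
                      limiter: asyncio.Semaphore,
                      timeout_s: float) -> T:
    """Run `operation` under a shared concurrency cap and a hard timeout."""
    async with limiter:
        # wait_for cancels the operation and raises asyncio.TimeoutError
        # if it does not finish within timeout_s.
        return await asyncio.wait_for(operation(), timeout=timeout_s)
```

Create the semaphore once inside the running event loop, for example `limiter = asyncio.Semaphore(20)`, and pass it to every call. Requests beyond the cap wait for a slot instead of each grabbing a connection.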

Scenario 2: Cascading Failures

What happens: Your database is slow due to a long-running query. MCP servers start timing out waiting for database responses. Claude retries, causing more load on the database. The database slows down further. Eventually, it becomes completely unresponsive.

Prevention:

  • Use circuit breakers (Pattern 5). If the database is slow, stop calling it.
  • Implement exponential backoff for retries. Don’t retry immediately; wait 100ms, then 200ms, then 400ms.
  • Monitor database health independently. If queries are slow, alert before timeouts start happening.
  • Use read replicas or caches to reduce load on the primary database.
  • Have a manual override: operators should be able to disable Claude’s access to a system if it’s causing problems.
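
The backoff schedule mentioned above (100ms, 200ms, 400ms) can be sketched as a small retry helper. `retry` and its parameters are hypothetical. Jitter is added as an extra assumption, so that many clients retrying at once do not hit the database in lockstep.

```python
import random
import time
from typing import Any, Callable, Tuple, Type

def retry(func: Callable[[], Any],
          attempts: int = 4,
          base: float = 0.1,
          max_delay: float = 5.0,
          retriable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
          sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random) -> Any:
    """Call func, retrying retriable errors with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(base * (2 ** attempt), max_delay)
            # Jitter: sleep between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))
```

Only errors listed in `retriable` trigger a retry; validation or permission errors propagate immediately, since repeating them would only add load.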

Scenario 3: Data Leaks and Privilege Escalation

What happens: Your MCP server is supposed to show Claude only user data for the current user. But you forget to check the user ID in one tool. Claude can now see all users’ data. Or, Claude can call a tool that’s supposed to be admin-only, but you didn’t check permissions.

Prevention:

  • Whitelist, not blacklist. Define exactly what Claude can see and do. Don’t try to list what it can’t do.
  • Check permissions on every request. Even if a tool should only be callable by admins, verify it on every call.
  • Use a middleware layer. All requests pass through a permission check before reaching the tool.
  • Test with real data. In staging, use realistic data and verify that Claude can’t see data it shouldn’t.
  • Audit logs. Log every tool call with the user, the tool, the parameters, and the result. Review logs regularly.
  • Least privilege. Give Claude only the permissions it needs. If it doesn’t need to delete records, don’t give it that permission.
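
One way to make the middleware idea concrete is a decorator that every tool must pass through, paired with a row-level check inside the tool. The role table, `require_permission`, and the `caller` dictionary shape are all hypothetical.

```python
from functools import wraps
from typing import Callable, Dict, Set

# Hypothetical whitelist: role -> tools that role may call.
ROLE_TOOLS: Dict[str, Set[str]] = {
    "support": {"fetch_user"},
    "admin": {"fetch_user", "delete_user"},
}

def require_permission(tool_name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(caller: Dict, *args, **kwargs):
            # Unknown roles get an empty set, so the default is deny.
            allowed = ROLE_TOOLS.get(caller.get("role"), set())
            if tool_name not in allowed:
                raise PermissionError(f"{caller.get('role')} may not call {tool_name}")
            return func(caller, *args, **kwargs)
        return wrapper
    return decorator

@require_permission("fetch_user")
def fetch_user(caller: Dict, user_id: str) -> Dict:
    # Row-level check: non-admins may only read their own record.
    if caller["role"] != "admin" and caller["user_id"] != user_id:
        raise PermissionError("cannot read another user's record")
    return {"id": user_id}

@require_permission("delete_user")
def delete_user(caller: Dict, user_id: str) -> Dict:
    return {"deleted": user_id}
```

The decorator covers "may this role call this tool at all", while the check inside `fetch_user` covers "may this caller see this particular record". The scenario above is what happens when the second check is forgotten.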

PADISO’s security audit services for SOC 2 and ISO 27001 compliance help teams identify these gaps before they become breaches. The patterns matter, but so does the process.

Scenario 4: Silent Failures

What happens: Your MCP server returns an error, but Claude doesn’t notice. It proceeds with its reasoning based on incomplete or incorrect data. The user gets a wrong answer, and you don’t know why.

Prevention:

  • Explicit error handling. If a tool fails, return a clear error message. Don’t return an empty result or a default value without saying so.
  • Logging. Log every error with context: what was the request, what went wrong, what was the error code.
  • Monitoring. Track error rates by tool and by tenant. Alert if error rates spike.
  • Testing. Test error paths, not just happy paths. What happens if the database is down? If the API times out? If the data is malformed?
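
MCP tool results can carry an `isError` flag alongside their content, which gives Claude an explicit signal that a call failed. The helper names below (`tool_result`, `tool_error`, `safe_call`) are hypothetical; the result shape follows the protocol's tool result structure as we understand it.

```python
import json
from typing import Any, Callable, Dict

def tool_result(data: Any) -> Dict:
    return {"content": [{"type": "text", "text": json.dumps(data)}],
            "isError": False}

def tool_error(message: str) -> Dict:
    # The failure is stated in the content, not hidden behind an empty result.
    return {"content": [{"type": "text", "text": message}],
            "isError": True}

def safe_call(tool: Callable[..., Any], arguments: Dict) -> Dict:
    """Run a tool and always return an explicit success or failure result."""
    try:
        return tool_result(tool(**arguments))
    except Exception as exc:
        return tool_error(f"{type(exc).__name__}: {exc}")
```

In production you would likely log the full exception and return a sanitised message, since exception text can expose internal details to the model.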

Scenario 5: Cost Overruns

What happens: Claude is calling an expensive tool repeatedly. Maybe it’s fetching the same data multiple times, or it’s calling a tool that triggers an expensive operation. Your API bill doubles.

Prevention:

  • Caching (Pattern 4). Cache responses to reduce repeated calls.
  • Rate limiting. Limit how many times Claude can call expensive tools.
  • Batching. If Claude needs multiple records, give it a tool to fetch them all at once, not one-by-one.
  • Monitoring. Track API calls and costs by tool. Alert if usage spikes.
  • Cost attribution. Charge back costs to the team or customer that triggered them, so there’s accountability.
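
Per-tool limits and cost attribution can share one counter. The sketch below is a simplified assumption: `ToolBudget` is a hypothetical class, limits are lifetime counts rather than windowed, and state is in memory.

```python
from collections import Counter
from typing import Dict

class ToolBudget:
    def __init__(self, limits: Dict[str, int]) -> None:
        self._limits = limits            # tool name -> max calls per tenant
        self._used: Counter = Counter()  # (tenant_id, tool) -> calls made

    def charge(self, tenant_id: str, tool: str) -> None:
        """Record one call, or raise if the tenant has used up this tool's budget."""
        key = (tenant_id, tool)
        limit = self._limits.get(tool)   # tools without a limit are unmetered
        if limit is not None and self._used[key] >= limit:
            raise RuntimeError(f"{tool} budget exhausted for {tenant_id}")
        self._used[key] += 1

    def usage(self, tenant_id: str) -> Dict[str, int]:
        # Per-tenant totals, suitable for charge-back reports.
        return {tool: n for (tenant, tool), n in self._used.items()
                if tenant == tenant_id}
```

Calling `charge` before executing an expensive tool turns a surprise bill into an explicit error that both the operator and Claude can see.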

Implementation Examples and Code Patterns

Example 1: Basic Python MCP Server with Error Handling

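A hands-on tutorial for building MCP servers covers the basics. Here’s a skeleton that shows where the production concerns (logging, caching, permission checks, error handling) plug in. It hand-rolls the JSON-RPC loop for clarity and omits the initialize handshake that MCP clients perform before listing or calling tools, so treat it as a teaching scaffold rather than a drop-in server; the official SDKs handle those protocol details for you: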

import json
import logging
import sys
from typing import Any, Dict, Optional
from datetime import datetime, timedelta
import hashlib

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    stream=sys.stderr
)
logger = logging.getLogger(__name__)

class MCPServer:
    def __init__(self, cache_ttl: int = 300):
        self.cache: Dict[str, tuple] = {}  # (value, expiry_time)
        self.cache_ttl = cache_ttl
        self.request_count = 0
        self.error_count = 0
        self.tenant_id: Optional[str] = None

    def get_cache_key(self, tool: str, params: Dict) -> str:
        """Generate a cache key from tenant, tool name, and parameters."""
        param_str = json.dumps(params, sort_keys=True)
        # Include the tenant so one tenant never receives another's cached data
        raw = f"{self.tenant_id}:{tool}:{param_str}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_cached(self, key: str) -> Optional[Any]:
        """Retrieve cached value if it exists and hasn't expired."""
        if key in self.cache:
            value, expiry = self.cache[key]
            if datetime.now() < expiry:
                logger.info(f"Cache hit for {key}")
                return value
            else:
                del self.cache[key]
                logger.info(f"Cache expired for {key}")
        return None

    def set_cache(self, key: str, value: Any) -> None:
        """Store a value in cache with TTL."""
        expiry = datetime.now() + timedelta(seconds=self.cache_ttl)
        self.cache[key] = (value, expiry)
        logger.info(f"Cached {key}, expires at {expiry}")

    def handle_request(self, request: Dict) -> Dict:
        """Handle a single JSON-RPC request."""
        self.request_count += 1
        request_id = request.get('id')
        method = request.get('method')
        params = request.get('params', {})

        # Extract tenant ID from params (or auth header in real systems)
        self.tenant_id = params.get('tenant_id', 'default')

        logger.info(
            f"Request {request_id}: method={method}, tenant={self.tenant_id}, "
            f"params={json.dumps(params)}"
        )

        try:
            # Route to handler
            if method == 'tools/list':
                result = self.list_tools()
            elif method == 'tools/call':
                result = self.call_tool(
                    params.get('name'),
                    params.get('arguments', {})
                )
            elif method == 'resources/list':
                result = self.list_resources()
            elif method == 'resources/read':
                result = self.read_resource(params.get('uri'))
            else:
                return self.error_response(
                    request_id,
                    -32601,
                    f"Unknown method: {method}"
                )

            return {
                'jsonrpc': '2.0',
                'id': request_id,
                'result': result
            }

        except Exception as e:
            self.error_count += 1
            logger.error(
                f"Error handling request {request_id}: {str(e)}",
                exc_info=True
            )
            return self.error_response(
                request_id,
                -32603,
                f"Internal error: {str(e)}"
            )

    def list_tools(self) -> list:
        """Return available tools."""
        return [
            {
                'name': 'fetch_user',
                'description': 'Fetch user data by ID',
                'inputSchema': {
                    'type': 'object',
                    'properties': {
                        'user_id': {'type': 'string', 'description': 'User ID'}
                    },
                    'required': ['user_id']
                }
            },
            {
                'name': 'create_ticket',
                'description': 'Create a support ticket',
                'inputSchema': {
                    'type': 'object',
                    'properties': {
                        'title': {'type': 'string'},
                        'description': {'type': 'string'}
                    },
                    'required': ['title', 'description']
                }
            }
        ]

    # Only read-only tools are safe to cache. Caching create_ticket would
    # return the first ticket again instead of creating a new one.
    CACHEABLE_TOOLS = frozenset({'fetch_user'})

    def call_tool(self, tool_name: str, arguments: Dict) -> Any:
        """Execute a tool."""
        logger.info(f"Calling tool {tool_name} with args {arguments}")

        # Check permissions
        if not self.check_permission(tool_name):
            raise PermissionError(f"Not allowed to call {tool_name}")

        # Try cache (read-only tools only)
        cacheable = tool_name in self.CACHEABLE_TOOLS
        cache_key = self.get_cache_key(tool_name, arguments)
        if cacheable:
            cached_result = self.get_cached(cache_key)
            if cached_result is not None:
                return cached_result

        # Execute tool
        if tool_name == 'fetch_user':
            result = self.fetch_user(arguments.get('user_id'))
        elif tool_name == 'create_ticket':
            result = self.create_ticket(
                arguments.get('title'),
                arguments.get('description')
            )
        else:
            raise ValueError(f"Unknown tool: {tool_name}")

        # Cache result
        if cacheable:
            self.set_cache(cache_key, result)
        return result

    def check_permission(self, tool_name: str) -> bool:
        """Check if the current tenant can call this tool."""
        # In production, check against a permissions database
        # For now, allow all tools for all tenants
        return True

    def fetch_user(self, user_id: str) -> Dict:
        """Fetch user data (mock implementation)."""
        # In production, call your database or API
        return {
            'id': user_id,
            'name': 'John Doe',
            'email': 'john@example.com',
            'tenant': self.tenant_id
        }

    def create_ticket(self, title: str, description: str) -> Dict:
        """Create a support ticket (mock implementation)."""
        # In production, call your ticketing system
        return {
            'id': 'TICKET-12345',
            'title': title,
            'description': description,
            'status': 'open',
            'tenant': self.tenant_id
        }

    def list_resources(self) -> list:
        """Return available resources."""
        return []

    def read_resource(self, uri: str) -> str:
        """Read a resource."""
        raise NotImplementedError()

    def error_response(self, request_id: Optional[int], code: int, message: str) -> Dict:
        """Return a JSON-RPC error response."""
        return {
            'jsonrpc': '2.0',
            'id': request_id,
            'error': {'code': code, 'message': message}
        }

    def run(self) -> None:
        """Read requests from stdin and write responses to stdout."""
        logger.info("MCP Server starting")
        try:
            for line in sys.stdin:
                line = line.strip()
                if not line:
                    continue
                try:
                    request = json.loads(line)
                    response = self.handle_request(request)
                    print(json.dumps(response))
                    sys.stdout.flush()
                except json.JSONDecodeError as e:
                    logger.error(f"Invalid JSON: {line}")
                    print(json.dumps({
                        'jsonrpc': '2.0',
                        'id': None,
                        'error': {'code': -32700, 'message': 'Parse error'}
                    }))
                    sys.stdout.flush()
        except KeyboardInterrupt:
            logger.info("MCP Server shutting down")
        finally:
            logger.info(
                f"Server stats: requests={self.request_count}, "
                f"errors={self.error_count}"
            )

if __name__ == '__main__':
    server = MCPServer(cache_ttl=300)
    server.run()

This skeleton includes:

  • Logging: Every request and error is logged with context.
  • Caching: Responses are cached with TTL.
  • Error handling: Exceptions are caught and returned as JSON-RPC errors.
  • Tenant isolation: The tenant ID is extracted from params (in production, from auth headers).
  • Permission checks: A stub for checking if the tenant can call a tool.
  • Metrics: Request and error counts.

Example 2: Node.js MCP Server with Rate Limiting

const readline = require('readline');
const EventEmitter = require('events');

class RateLimiter {
  constructor(maxRequests = 100, windowMs = 60000) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.requests = new Map(); // tenant_id -> [timestamps]
  }

  isAllowed(tenantId) {
    const now = Date.now();
    const windowStart = now - this.windowMs;

    if (!this.requests.has(tenantId)) {
      this.requests.set(tenantId, []);
    }

    const timestamps = this.requests.get(tenantId);
    // Remove old requests outside the window
    const recentRequests = timestamps.filter(t => t > windowStart);
    this.requests.set(tenantId, recentRequests);

    if (recentRequests.length < this.maxRequests) {
      recentRequests.push(now);
      return true;
    }
    return false;
  }
}

class MCPServer extends EventEmitter {
  constructor() {
    super();
    this.rateLimiter = new RateLimiter(100, 60000); // 100 req/min per tenant
    this.cache = new Map();
    this.requestCount = 0;
    this.errorCount = 0;
  }

  handleRequest(request) {
    this.requestCount++;
    const { id, method, params = {} } = request;
    const tenantId = params.tenant_id || 'default';

    console.error(`Request ${id}: method=${method}, tenant=${tenantId}`);

    try {
      // Check rate limit
      if (!this.rateLimiter.isAllowed(tenantId)) {
        return this.errorResponse(
          id,
          -32000,
          `Rate limit exceeded for tenant ${tenantId}`
        );
      }

      // Route to handler
      let result;
      if (method === 'tools/list') {
        result = this.listTools();
      } else if (method === 'tools/call') {
        result = this.callTool(params.name, params.arguments || {});
      } else {
        return this.errorResponse(id, -32601, `Unknown method: ${method}`);
      }

      return {
        jsonrpc: '2.0',
        id,
        result
      };
    } catch (error) {
      this.errorCount++;
      console.error(`Error handling request ${id}:`, error.message);
      return this.errorResponse(id, -32603, `Internal error: ${error.message}`);
    }
  }

  listTools() {
    return [
      {
        name: 'fetch_user',
        description: 'Fetch user by ID',
        inputSchema: {
          type: 'object',
          properties: {
            user_id: { type: 'string' }
          },
          required: ['user_id']
        }
      }
    ];
  }

  callTool(toolName, args) {
    if (toolName === 'fetch_user') {
      return { id: args.user_id, name: 'John Doe', email: 'john@example.com' };
    }
    throw new Error(`Unknown tool: ${toolName}`);
  }

  errorResponse(id, code, message) {
    return {
      jsonrpc: '2.0',
      id,
      error: { code, message }
    };
  }

  run() {
    const rl = readline.createInterface({
      input: process.stdin,
      output: process.stdout,
      terminal: false
    });

    rl.on('line', (line) => {
      if (!line.trim()) return;
      try {
        const request = JSON.parse(line);
        const response = this.handleRequest(request);
        console.log(JSON.stringify(response));
      } catch (error) {
        console.error(`Parse error: ${error.message}`);
        console.log(JSON.stringify({
          jsonrpc: '2.0',
          id: null,
          error: { code: -32700, message: 'Parse error' }
        }));
      }
    });

    rl.on('close', () => {
      console.error(
        `Server shutdown. Stats: requests=${this.requestCount}, ` +
        `errors=${this.errorCount}`
      );
    });
  }
}

const server = new MCPServer();
server.run();

This Node.js example adds rate limiting per tenant, a critical pattern for multi-tenant systems.


Deployment and Scaling Considerations

Deployment Models

You have several options for deploying MCP servers:

1. Stdio (Local Development)

Claude Desktop runs the server as a subprocess and communicates via stdin/stdout. This works for development and local testing, but not for production.

2. HTTP Server

The MCP server listens on an HTTP port. Claude (or your application) makes HTTP requests to it. This works well for cloud deployments.

Claude → HTTP POST /mcp → MCP Server (listening on port 8000)

Pros: Easy to deploy, can be load-balanced, can be scaled horizontally. Cons: Network latency, need to manage authentication and encryption.

3. WebSocket Server

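The MCP server maintains a persistent WebSocket connection with the client. This reduces latency for request-response pairs. Note that the MCP specification defines stdio and HTTP-based transports; WebSocket is a custom transport, so both the client and the server must implement it.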

Pros: Low latency, persistent connection. Cons: More complex to manage, harder to scale horizontally.

4. Embedded Server

The MCP server runs in the same process as the application that calls Claude. No network calls, lowest latency.

Pros: Fastest, simplest. Cons: Can’t scale the server independently, must be in the same language as the host application.

For production systems, HTTP is usually the right choice. It’s simple, scalable, and well-understood.

Scaling Patterns

Horizontal Scaling: Run multiple MCP server instances behind a load balancer. Each instance handles a subset of requests.

Claude → Load Balancer → MCP Server 1
                      → MCP Server 2
                      → MCP Server 3

For this to work:

  • Servers must be stateless (or share state via a cache).
  • Requests must be idempotent (calling twice should be safe).
  • Use a distributed cache (Redis, Memcached) instead of in-process caching.

Vertical Scaling: Run a single MCP server with more CPU, memory, and connections. Simpler, but has limits.

Hybrid: Run a few MCP server instances, each handling multiple tenants or tools. Add instances as load grows.

For PADISO’s platform development teams across Sydney, New York, and San Francisco, the choice depends on the workload. Read-heavy systems can scale horizontally easily. Write-heavy systems need careful coordination.

Database Connections and Connection Pooling

If your MCP server talks to a database, connection pooling is critical. Each server instance should maintain a pool of connections (e.g., 10-20 connections), not create a new connection for every request.

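from contextlib import contextmanager

from psycopg2 import pool

class MCPServer:
    def __init__(self, db_host, db_name, db_user, db_password):
        # ThreadedConnectionPool is safe to share between threads;
        # SimpleConnectionPool is only meant for single-threaded use.
        self.pool = pool.ThreadedConnectionPool(
            1,  # minconn
            20, # maxconn
            host=db_host,
            database=db_name,
            user=db_user,
            password=db_password
        )

    def get_connection(self):
        return self.pool.getconn()

    def return_connection(self, conn):
        self.pool.putconn(conn)

    @contextmanager
    def connection(self):
        """Borrow a connection and always return it, even if the query raises."""
        conn = self.get_connection()
        try:
            yield conn
        finally:
            self.return_connection(conn)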

Without pooling, each request creates a new connection, which is slow and expensive. With pooling, connections are reused.


Security and Compliance in MCP Deployments

Authentication and Authorization

Every MCP server needs to verify that the request is authentic (really from Claude, not an attacker) and authorized (Claude is allowed to do what it’s asking).

Authentication: Typically via API keys or signed requests. The client (Claude or your application) includes a secret in the request. The server verifies the secret.

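import hashlib
import hmac
import json


class AuthError(Exception):
    """Raised when a request fails authentication."""


def verify_request(request, secret_key):
    """Verify that the request is signed with the correct key."""
    signature = request.headers.get('X-Signature')
    if not signature:
        raise AuthError('Missing signature')

    # Reconstruct the signature. The client must serialise the payload the
    # same way (sorted keys, default separators) or the digests will differ.
    payload = json.dumps(request.json(), sort_keys=True)
    expected_signature = hmac.new(
        secret_key.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()

    # compare_digest runs in constant time, which avoids timing attacks
    if not hmac.compare_digest(signature, expected_signature):
        raise AuthError('Invalid signature')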
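
For the check above to pass, the client has to compute the signature over exactly the same bytes. Here is a minimal sketch of the client side, assuming the same scheme (HMAC-SHA256 over JSON serialised with sorted keys, hex digest in the `X-Signature` header). `sign_payload` and `build_headers` are hypothetical helper names.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret_key: str) -> str:
    """Return the hex HMAC-SHA256 of the payload, serialised with sorted keys."""
    body = json.dumps(payload, sort_keys=True)
    return hmac.new(secret_key.encode(), body.encode(), hashlib.sha256).hexdigest()


def build_headers(payload: dict, secret_key: str) -> dict:
    """Headers a client would attach to the signed request."""
    return {
        "Content-Type": "application/json",
        "X-Signature": sign_payload(payload, secret_key),
    }
```

This scheme signs only the payload, so a captured request could be replayed. A common mitigation is to include a timestamp or nonce in the signed data and reject stale values on the server.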

Authorization: After authenticating, check if the client is allowed to call the requested tool or access the requested resource.

def check_authorization(tenant_id, tool_name):
    """Check if the tenant can call this tool."""
    permissions = get_tenant_permissions(tenant_id)
    if tool_name not in permissions:
        raise PermissionError(f"Tenant {tenant_id} cannot call {tool_name}")

Data Encryption

Data in transit should be encrypted (TLS/HTTPS). Data at rest should be encrypted if it’s sensitive.

  • TLS: Always use HTTPS for MCP servers. Never send requests over plain HTTP.
  • Database encryption: If the MCP server stores sensitive data, encrypt it in the database.
  • Cache encryption: If caching sensitive data, encrypt it in the cache.
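
If the MCP server terminates TLS itself rather than behind a load balancer, the standard library’s `ssl` module can enforce a minimum protocol version. The sketch below is one way to do that. `build_tls_context` is a hypothetical helper, and the certificate paths are left optional so the example runs without real certificate files.

```python
import ssl
from typing import Optional


def build_tls_context(certfile: Optional[str] = None,
                      keyfile: Optional[str] = None) -> ssl.SSLContext:
    """Server-side TLS context that refuses anything older than TLS 1.2."""
    # Purpose.CLIENT_AUTH yields a context configured for the server side.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if certfile:
        # Paths are deployment-specific; a real server always needs a certificate.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context
```

With a standard-library HTTP server, the context would be applied with `server.socket = context.wrap_socket(server.socket, server_side=True)` before the server starts handling requests.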

Audit Logging

Every request should be logged with:

  • Timestamp
  • Tenant ID (who made the request)
  • Tool name (what was called)
  • Parameters (what data was used)
  • Result (success or error)
  • Duration (how long it took)

These logs are essential for compliance audits and for debugging issues.

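import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger('mcp.audit')  # logger name is an assumption


def log_request(tenant_id, tool_name, params, result, duration_ms, error=None):
    log_entry = {
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'tenant_id': tenant_id,
        'tool_name': tool_name,
        'params': params,  # Redact sensitive values before passing params in
        'result': 'success' if error is None else 'error',
        # Exceptions are not JSON-serialisable, so store the message
        'error': str(error) if error is not None else None,
        'duration_ms': duration_ms
    }
    # Send to audit log (CloudWatch, Datadog, Splunk, etc.)
    audit_logger.info(json.dumps(log_entry))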
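
The comment in `log_request` warns against logging sensitive parameters. One way to enforce that is to mask values before they reach the logger. The sketch below assumes a key-name deny-list. `SENSITIVE_KEYS` and `redact_params` are hypothetical names, and the key list would need tuning to the parameters your tools actually accept.

```python
from typing import Any, Dict, Iterable

# Hypothetical default list; extend it for your own tools.
SENSITIVE_KEYS = frozenset({"password", "token", "api_key", "ssn", "card_number"})


def redact_params(params: Dict[str, Any],
                  sensitive_keys: Iterable[str] = SENSITIVE_KEYS) -> Dict[str, Any]:
    """Return a copy of params with sensitive values masked.

    Recurses into nested dicts; lists of dicts are not handled in this sketch.
    """
    blocked = {key.lower() for key in sensitive_keys}
    redacted: Dict[str, Any] = {}
    for key, value in params.items():
        if str(key).lower() in blocked:
            redacted[key] = "[REDACTED]"
        elif isinstance(value, dict):
            redacted[key] = redact_params(value, blocked)
        else:
            redacted[key] = value
    return redacted
```

Calling `redact_params(params)` at the point where the log entry is built keeps the original arguments intact for the tool itself, while the audit trail only ever sees the masked copy.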

Compliance Frameworks

If your system handles sensitive data, you may need to comply with:

  • SOC 2: Controls around security, availability, processing integrity, confidentiality, and privacy.
  • ISO 27001: Information security management.
  • HIPAA (healthcare), PCI-DSS (payments), GDPR (EU data), etc.

MCP servers are part of your data flow, so they need to be designed with compliance in mind. PADISO’s security audit services help teams build SOC 2 and ISO 27001-ready systems. The key patterns are:

  • Encryption in transit and at rest.
  • Audit logging.
  • Access controls (authentication and authorization).
  • Data minimisation (only access data Claude needs).
  • Incident response (what happens if there’s a breach?).
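
Data minimisation is the pattern in this list that is easiest to skip, so here is a small illustration. It assumes a per-tool allow-list of fields that may be returned to the model. `TOOL_FIELD_ALLOWLIST`, `minimise` and the `get_customer` tool name are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical allow-list: fields each tool is permitted to return to the model.
TOOL_FIELD_ALLOWLIST = {
    "get_customer": ("id", "name", "plan"),
}


def minimise(tool_name: str, rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only allow-listed fields; tools without an entry return no fields."""
    allowed = TOOL_FIELD_ALLOWLIST.get(tool_name, ())
    return [{field: row[field] for field in allowed if field in row} for row in rows]
```

Defaulting to an empty allow-list means a newly added tool exposes nothing until someone explicitly decides which fields it needs.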

Monitoring and Observability

Key Metrics to Track

  1. Request rate: Requests per second, per tenant, per tool.
  2. Error rate: Percentage of requests that fail.
  3. Latency: p50, p95, p99 response times.
  4. Cache hit rate: Percentage of requests served from cache.
  5. Resource usage: CPU, memory, database connections.
  6. Cost: API calls, database queries, external service calls.

import time
from prometheus_client import Counter, Histogram, Gauge

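# Prometheus metrics
# Note: each distinct tenant_id label value creates its own time series.
# With many tenants, consider dropping or bucketing that label.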
request_count = Counter(
    'mcp_requests_total',
    'Total requests',
    ['tenant_id', 'tool_name', 'status']
)

request_duration = Histogram(
    'mcp_request_duration_seconds',
    'Request duration',
    ['tenant_id', 'tool_name'],
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 5.0)
)

cache_hits = Counter(
    'mcp_cache_hits_total',
    'Cache hits',
    ['tenant_id']
)

active_connections = Gauge(
    'mcp_active_connections',
    'Active database connections'
)

# Usage
def call_tool(tenant_id, tool_name, args):
    # perf_counter is monotonic, so clock adjustments cannot skew durations
    start = time.perf_counter()
    status = 'success'
    try:
        return execute_tool(tool_name, args)
    except Exception:
        status = 'error'
        raise
    finally:
        # Record duration and outcome once, for both success and failure
        request_duration.labels(
            tenant_id=tenant_id,
            tool_name=tool_name
        ).observe(time.perf_counter() - start)
        request_count.labels(
            tenant_id=tenant_id,
            tool_name=tool_name,
            status=status
        ).inc()

Alerting

Set up alerts for:

  • Error rate > 5% (something is broken)
  • Latency p99 > 5 seconds (system is slow)
  • Cache hit rate < 50% (cache isn’t working)
  • Memory usage > 80% (running out of memory)
  • Database connection pool exhausted (can’t connect to database)
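
In practice these rules usually live in your monitoring system’s alerting configuration, but the logic is simple enough to sketch in Python. The thresholds mirror the list above. The metric key names and the `evaluate_alerts` helper are assumptions about how a metrics snapshot might be structured.

```python
from typing import Callable, Dict, List, Tuple

# (alert name, metric key, predicate). Ratios are expressed as 0.0-1.0.
ALERT_RULES: List[Tuple[str, str, Callable[[float], bool]]] = [
    ("high_error_rate", "error_rate", lambda v: v > 0.05),
    ("slow_p99", "latency_p99_seconds", lambda v: v > 5.0),
    ("low_cache_hit_rate", "cache_hit_rate", lambda v: v < 0.50),
    ("high_memory", "memory_usage", lambda v: v > 0.80),
    ("db_pool_exhausted", "db_pool_available", lambda v: v <= 0),
]


def evaluate_alerts(metrics: Dict[str, float]) -> List[str]:
    """Return the names of rules that fire; metrics missing from the snapshot are skipped."""
    fired = []
    for name, key, breached in ALERT_RULES:
        if key in metrics and breached(metrics[key]):
            fired.append(name)
    return fired
```

Keeping the rules in a table makes the thresholds easy to review and adjust as you learn what normal looks like for your workload.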

Tracing

For complex systems, distributed tracing helps you understand request flows. Use OpenTelemetry or a similar tool to trace a request from Claude through your MCP server and into your backend systems.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def call_tool(tenant_id, tool_name, args):
    with tracer.start_as_current_span("call_tool") as span:
        span.set_attribute("tenant_id", tenant_id)
        span.set_attribute("tool_name", tool_name)
        
        with tracer.start_as_current_span("fetch_from_db"):
            result = fetch_from_database(args)
        
        return result

Next Steps and Getting Started

Step 1: Choose Your Use Case

Start small. Don’t try to build a multi-tenant, globally-distributed MCP server on day one. Choose one use case:

  • Claude needs to read customer data (adapter pattern)
  • Claude needs to trigger a report (event-driven pattern)
  • Claude needs access to multiple internal systems (multi-tenant pattern)

Build for that use case first.

Step 2: Build a Minimal Server

Start with the code skeletons above. Add one tool, make it work, test it thoroughly.

The official MCP documentation and the design patterns article are good references as you build.

Step 3: Add Production Patterns

As you move from prototype to production, add:

  1. Error handling (timeouts, retries, graceful degradation)
  2. Logging and monitoring
  3. Caching
  4. Rate limiting
  5. Authentication and authorization
  6. Audit logging

Don’t add everything at once. Add patterns as you hit the problems they solve.

Step 4: Test Failure Scenarios

Before going live, test:

  • What happens if your database is down? (Does the server fail gracefully?)
  • What happens if an API times out? (Does the server retry?)
  • What happens if Claude sends invalid parameters? (Does the server return a clear error?)
  • What happens if you get a traffic spike? (Does the server rate-limit or degrade gracefully?)
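
Two of these scenarios, a database outage and invalid parameters, can be exercised without real infrastructure by mocking the backend. The sketch below uses `unittest.mock` from the standard library. `get_customer`, `ToolError` and the `fetch_customer` method are hypothetical stand-ins for your own tool and data layer.

```python
from unittest import mock


class ToolError(Exception):
    """Error returned to the client with a message that is safe to show the model."""


def get_customer(db, params):
    """Hypothetical tool: validates input and converts backend failures into ToolError."""
    customer_id = params.get("customer_id")
    if not isinstance(customer_id, int):
        raise ToolError("customer_id must be an integer")
    try:
        return db.fetch_customer(customer_id)
    except ConnectionError as exc:
        raise ToolError("customer database is unavailable, try again later") from exc


def test_database_down():
    db = mock.Mock()
    db.fetch_customer.side_effect = ConnectionError("connection refused")
    try:
        get_customer(db, {"customer_id": 7})
    except ToolError as err:
        assert "unavailable" in str(err)
    else:
        raise AssertionError("expected ToolError")


def test_invalid_parameters():
    db = mock.Mock()
    try:
        get_customer(db, {"customer_id": "seven"})
    except ToolError as err:
        assert "integer" in str(err)
    else:
        raise AssertionError("expected ToolError")
    # Validation should fail before the database is touched.
    db.fetch_customer.assert_not_called()
```

Both functions follow the `test_` naming convention, so a runner such as pytest would pick them up. Timeouts and traffic spikes need load or fault-injection tooling and are out of scope for this sketch.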

Step 5: Monitor in Production

Once you’re live, monitor:

  • Error rates
  • Latency
  • Cache hit rates
  • Resource usage
  • Cost

Set up alerts so you know immediately if something breaks.

Getting Help

If you’re building a production Claude system and need guidance on architecture, deployment, or compliance, PADISO’s AI advisory services help Sydney-based and Australian scale-ups ship production-ready AI systems. We’ve worked with founders and CEOs on venture studio co-builds, with operators modernising with agentic AI, and with security leads pursuing SOC 2 compliance.

Our fractional CTO service includes architecture reviews, vendor evaluation, and technical hiring support. We also offer platform development services for building production-grade data and AI systems.

View our case studies to see how we’ve helped other companies ship production Claude systems.


Summary

MCP servers are the bridge between Claude and your systems. Building them correctly—with error handling, caching, rate limiting, logging, and monitoring—is the difference between a prototype that works and a production system that scales.

The patterns in this guide are battle-tested across dozens of production deployments. Start with the adapter pattern (isolation and abstraction), add event-driven processing for long-running operations, and layer on caching and rate limiting as you grow.

Test failure scenarios before going live. Monitor relentlessly in production. And don’t hesitate to reach out for help—production AI systems are complex, and getting the architecture right from the start saves weeks of debugging later.

Your next step: pick one use case, build a minimal MCP server using the code skeletons above, and test it thoroughly. Then add production patterns as you move toward scaling.
