
MCP Sampling: Letting the Server Ask Claude Back

Master MCP sampling: how servers request Claude completions mid-execution. Real patterns, latency costs, auth models, and when to use agentic delegation.

The PADISO Team · 2026-05-10

Table of Contents

  1. What is MCP Sampling and Why It Matters
  2. The Core Pattern: Server-Initiated LLM Requests
  3. How MCP Sampling Works Under the Hood
  4. Authentication and Security in Sampling
  5. Latency Costs and Performance Trade-offs
  6. Real-World Use Cases for MCP Sampling
  7. Building Your First MCP Sampling Integration
  8. Common Pitfalls and How to Avoid Them
  9. Advanced Patterns and Optimisation
  10. Compliance and Audit Readiness
  11. Summary and Next Steps

What is MCP Sampling and Why It Matters

MCP sampling represents a fundamental shift in how servers and AI clients interact. Rather than forcing all intelligence into a single monolithic LLM call, MCP sampling allows a server—running your business logic, data access layer, or workflow orchestrator—to pause mid-execution and ask Claude (or another LLM via the client) for a completion, a decision, or a tool invocation.

This inversion of control is powerful. Traditionally, you call an LLM, it calls your tools, you return data, and the LLM decides what to do next. With MCP sampling, your server owns the orchestration. It can decide when to ask the LLM for help, what context to provide, and how to integrate the response back into your workflow.

Why does this matter? Because it lets you build agentic systems that are:

  • Deterministic and auditable: Your server controls the flow; you can log every decision point.
  • Scalable: You’re not bottlenecked by LLM latency for every single step.
  • Cost-efficient: You only invoke Claude when you actually need reasoning, not for every operation.
  • Compliant: When you’re pursuing SOC 2 compliance or ISO 27001 audit readiness, server-side control means you can enforce access controls, logging, and data residency rules before any LLM request leaves your infrastructure.

At PADISO, we’ve built dozens of agentic AI and automation workflows for startups and enterprises. MCP sampling is the architectural pattern that separates production-grade AI systems from proof-of-concepts. It’s the difference between a chatbot that calls tools randomly and an AI agent that orchestrates your entire operational workflow while maintaining compliance.


The Core Pattern: Server-Initiated LLM Requests

Let’s start with the conceptual model. In a traditional setup:

Client → LLM (Claude) → Tool Call → Your Server → LLM → Response → Client

The client is in charge. Claude is reactive. Your server is a passive tool.

With MCP sampling, the flow reverses:

Your Server → [MCP Sampling Request] → Client → LLM (Claude) → (optional tool calls via your server) → Client → Your Server

Your server initiates the sampling request. The client (which might be Claude Code, a local Claude instance, or an API wrapper) receives that request, invokes Claude, and returns the result. The server then continues execution based on Claude’s response.

This pattern is formally defined in the official MCP specification. The specification outlines how servers request sampling, what parameters they can pass (model preferences, system prompt, temperature, token limits, and how much server context the client may include), and how clients should respond. Notably, sampling requests do not carry tool definitions; your server exposes its tools separately, so which tools exist is still entirely your call.

The key insight: the server is now the orchestrator. It decides:

  • When to ask Claude for help (after a data lookup fails, when a decision threshold is crossed, when human input is needed).
  • What tools your server exposes (your server controls its tool list, so Claude can only ever reach the tools you choose to surface).
  • What context Claude sees (filtered, sanitised, audit-logged data).
  • How to handle Claude’s response (integrate it, reject it, escalate it).

This is why agentic AI automation works so well in enterprise settings. Your server can enforce business rules, compliance gates, and data policies before and after Claude touches anything.
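
To make this concrete, here is a minimal sketch of a server-side decision gate. The helpers are hypothetical stand-ins: fetch_case and filter_context_for for your data layer, request_claude_decision wrapping the sampling call, and audit_log for your trail.

async def review_case(case_id: str, user_id: str) -> dict:
    case = await fetch_case(case_id)

    # 1. When to ask Claude: only past a rule-based threshold
    if case["risk_score"] < 0.7:
        return {"decision": "auto_approve", "via": "rules"}

    # 2. What context Claude sees: filtered, sanitised, scoped to this user
    context = filter_context_for(user_id, case)

    # 3. How to handle the response: reject anything outside the allowed set
    answer = await request_claude_decision(context)  # the MCP sampling call
    if answer not in ("approve", "reject", "escalate"):
        answer = "escalate"
    await audit_log(case_id, user_id, answer)
    return {"decision": answer, "via": "sampling"}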


How MCP Sampling Works Under the Hood

Understanding the mechanics helps you design robust systems. Here’s what happens when your server initiates an MCP sampling request:

The Request Structure

Your server sends a JSON-RPC request to the client with the following shape:

{
  "jsonrpc": "2.0",
  "id": "unique-request-id",
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "Based on the transaction history, should we flag this account for review?"
        }
      }
    ],
    "modelPreferences": {
      "hints": [{ "name": "claude-3-5-sonnet" }],
      "intelligencePriority": 0.8,
      "speedPriority": 0.4
    },
    "systemPrompt": "You are a fraud detection assistant. Be conservative.",
    "includeContext": "thisServer",
    "maxTokens": 500,
    "temperature": 0.3
  }
}

Notice what you’re controlling:

  • modelPreferences: Hints and priorities for model selection. The client makes the final choice, so different servers can steer towards faster or cheaper models without hard-coding one.
  • messages: The conversation context. You’ve already filtered this to only include relevant information.
  • systemPrompt: The system prompt. You’re constraining Claude’s behaviour at request time (the client may modify or override it).
  • includeContext: Whether the client may include MCP context from this server (or all connected servers) in the prompt. This, not a per-request tool list, is how the spec scopes what extra context reaches Claude.
  • temperature: You set it low (0.3) for conservative fraud detection, higher for creative tasks.
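
In a Python MCP server you rarely hand-build this JSON; the SDK issues the same request through the session attached to the current request context. A sketch, assuming the official mcp package and a tool handler holding a Context named ctx:

from mcp.types import ModelHint, ModelPreferences, SamplingMessage, TextContent

# Inside a tool handler: ctx.session speaks JSON-RPC to the client for you
result = await ctx.session.create_message(
    messages=[
        SamplingMessage(
            role="user",
            content=TextContent(
                type="text",
                text="Based on the transaction history, should we flag this account for review?",
            ),
        )
    ],
    model_preferences=ModelPreferences(
        hints=[ModelHint(name="claude-3-5-sonnet")],
        intelligencePriority=0.8,
        speedPriority=0.4,
    ),
    system_prompt="You are a fraud detection assistant. Be conservative.",
    max_tokens=500,
    temperature=0.3,
)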

The Response Flow

The client receives this request. The spec expects the client to keep a human in the loop: it may inspect, modify, or reject the request before anything reaches Claude. Assuming it proceeds, it invokes Claude, which might:

  1. Return a text response: “This account shows unusual patterns. Flag for review.”
  2. Call a tool: if the client runs an agentic loop, Claude can request the transaction history via your server’s tool handler. (The baseline spec defines a single completion per request; tool-calling loops are client behaviour layered on top.)
  3. Ask for clarification: Request more context (rare, but possible).

If Claude calls a tool, the client routes that tool call back to your server. Your server executes the tool (with all your access controls and logging in place), and sends the result back to Claude. Claude processes that result and either responds with text or calls another tool.

Once Claude is done, the client returns the final message to your server:

{
  "jsonrpc": "2.0",
  "id": "unique-request-id",
  "result": {
    "role": "assistant",
    "content": {
      "type": "text",
      "text": "This account shows 7 transactions in the last hour from different countries. Risk score: 8.5/10. Recommend immediate review."
    },
    "model": "claude-3-5-sonnet-20241022",
    "stopReason": "endTurn"
  }
}

Your server now has Claude’s decision, the model that actually served it, and the reason generation stopped. You can log all of this, validate the decision against business rules, and decide what to do next.

This is the pattern outlined in the MCP specification for sampling. The beauty is that it’s language-agnostic and framework-agnostic. Whether you’re building in Python, Node.js, Go, or Rust, the protocol is the same.
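
To make the client side concrete, the Python SDK lets you register a sampling handler when you create a session. Here is a sketch using the Anthropic API, assuming ANTHROPIC_API_KEY is set and text-only messages; a production handler should also honour modelPreferences and add human-in-the-loop review:

from anthropic import AsyncAnthropic
from mcp import ClientSession
from mcp.types import CreateMessageRequestParams, CreateMessageResult, TextContent

anthropic = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def handle_sampling(context, params: CreateMessageRequestParams) -> CreateMessageResult:
    # Policy checks and human review belong here, before anything reaches Claude
    kwargs = {"system": params.systemPrompt} if params.systemPrompt else {}
    reply = await anthropic.messages.create(
        model="claude-3-5-sonnet-20241022",  # a real handler maps params.modelPreferences
        max_tokens=params.maxTokens,
        messages=[{"role": m.role, "content": m.content.text} for m in params.messages],
        **kwargs,
    )
    return CreateMessageResult(
        role="assistant",
        content=TextContent(type="text", text=reply.content[0].text),
        model=reply.model,
        stopReason="endTurn",
    )

# Wire it up when creating the session:
# session = ClientSession(read_stream, write_stream, sampling_callback=handle_sampling)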

For teams building custom software development solutions or pursuing platform engineering, this protocol is the foundation of production-grade agentic systems.


Authentication and Security in Sampling

Here’s where MCP sampling shines for compliance-focused teams. Because your server initiates the sampling request, you control the entire auth pipeline.

Identity and Access Control

When your server sends a sampling request, it includes:

  1. Server identity: The client knows which server made the request (via TLS mutual auth or API keys).
  2. User context: Your server can include the user ID, role, and permissions in the system prompt or message context.
  3. Data scope: You only pass Claude the data that user is allowed to see.

Example:

{
  "method": "sampling/createMessage",
  "params": {
    "system": "You are assisting User ID: u_12345 with role: analyst. You can only reference data from their assigned portfolio.",
    "messages": [
      {
        "role": "user",
        "content": "Summarise the Q4 performance of my portfolio.",
        "metadata": {
          "user_id": "u_12345",
          "role": "analyst",
          "portfolio_ids": ["p_001", "p_002"]
        }
      }
    ]
  }
}

Claude never sees user IDs outside that portfolio. Your server enforces the boundary.
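
A minimal sketch of that boundary in server code, with get_allowed_portfolios and load_positions as hypothetical data-layer helpers:

async def build_sampling_context(user_id: str, requested: list[str]) -> dict:
    # Resolve what this user may see before anything is serialised into a prompt
    allowed = set(await get_allowed_portfolios(user_id))
    denied = set(requested) - allowed
    if denied:
        raise PermissionError(f"User {user_id} may not access: {sorted(denied)}")
    # Only the permitted slice of data ever reaches the sampling request
    return {"portfolio_ids": requested, "positions": await load_positions(requested)}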

Audit Logging

Because your server owns the orchestration, you log every sampling request and response:

{
  "timestamp": "2025-01-15T14:32:00Z",
  "user_id": "u_12345",
  "sampling_request": {
    "model": "claude-3-5-sonnet-20241022",
    "tool_calls_requested": ["query_portfolio"],
    "tokens_input": 1240
  },
  "sampling_response": {
    "tokens_output": 85,
    "tools_called": ["query_portfolio"],
    "duration_ms": 1850,
    "decision": "approved"
  },
  "server_action": "executed_recommendation",
  "audit_trail_id": "audit_789xyz"
}

This audit trail is exactly what SOC 2 Type II auditors and ISO 27001 compliance reviewers want to see. You’re proving that AI-assisted decisions are logged, traceable, and subject to access controls.

For teams pursuing compliance via Vanta implementation, MCP sampling audit logs integrate seamlessly into your evidence collection workflow. You’re not asking “did Claude do something secure?”—you’re proving “did our server enforce access controls before Claude saw anything?”

Secrets and Data Residency

Because your server is the intermediary, you can:

  • Mask sensitive data: Replace real customer names with anonymised IDs before Claude sees them.
  • Enforce data residency: If your compliance requirement is that data never leaves Australia, your server can route sampling requests to a local Claude deployment (via an on-premise MCP server) instead of the cloud API.
  • Rate-limit per user: Your server can track sampling requests per user and enforce quotas before Claude is invoked.
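
Masking is simple to sketch. In this illustrative example, known customer names are swapped for stable anonymised IDs and anything shaped like an email address is redacted before the text enters a sampling request:

import re

def mask_pii(text: str, customers: dict[str, str]) -> str:
    # customers maps real names to anonymised IDs, e.g. {"Jane Chen": "cust_042"}
    for name, anon_id in customers.items():
        text = text.replace(name, anon_id)
    # Redact anything that looks like an email address
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[email redacted]", text)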

Latency Costs and Performance Trade-offs

MCP sampling adds latency. Let’s be concrete about it.

The Latency Stack

When your server requests a sampling call, you pay:

  1. Network round-trip to client: 50–200ms (depending on whether it’s local or remote).
  2. Client-to-Claude API latency: 300–2000ms (depending on model, prompt length, and Claude’s load).
  3. Tool execution latency (if Claude calls tools): 100–5000ms per tool (depends on your tool implementations).
  4. Response serialisation and return: 50–200ms.

Total: 500–7200ms per sampling call. If your workflow makes 5 sampling calls, you’re looking at 2.5–36 seconds of pure LLM latency.

Compare that to a single monolithic LLM call (where Claude orchestrates all tool calls internally): 1000–3000ms total.

So why use MCP sampling if it’s slower?

When the Trade-off Wins

Scenario 1: Selective reasoning

You have a workflow with 100 steps. Only 3 steps need Claude’s reasoning. With MCP sampling, you invoke Claude 3 times (3–10 seconds). With a monolithic approach, you’d invoke Claude once but force it to process all 100 steps, including the 97 that don’t need reasoning (15–25 seconds).

Scenario 2: Cost optimisation

You have a workflow where 80% of decisions can be made by rule-based logic. Only 20% need Claude. With MCP sampling:

  • Rule-based decisions: 50ms each (80 decisions = 4 seconds).
  • Claude decisions: 1500ms each (20 decisions = 30 seconds).
  • Total: 34 seconds, ~0.3M input tokens.

With a monolithic approach:

  • Single Claude call: 2000ms.
  • But Claude has to process all 100 decision points, including the rule-based ones: ~1.2M input tokens.
  • Total: 2 seconds, but 4× the token cost.

Scenario 3: Compliance and auditability

You need to prove that sensitive data access was controlled. With MCP sampling, you can:

  1. Rule-based logic checks user permissions (5ms).
  2. Only if user is authorised, invoke Claude with filtered data (1500ms).
  3. Log the entire decision (10ms).

With a monolithic approach, you’d have to pass all data to Claude and rely on Claude to respect permissions (risky, hard to audit).

Optimising for Latency

If latency is critical:

  • Batch sampling requests: If you have multiple independent decisions, request them in parallel.
  • Use faster models: claude-3-5-haiku is 2–3× faster than claude-3-5-sonnet and costs roughly 75% less. Use it for low-risk decisions.
  • Cache system prompts: Prompt caching is a provider-side API feature (Anthropic supports it), not part of the MCP spec itself. If the client’s handler marks a reused system prompt as cacheable, you cut token processing time on every repeat request; see the sketch after this list.
  • Pre-filter context: Only pass Claude the data it needs. A 10KB context is 2–3× faster to process than a 100KB context.
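
Here is what the caching item looks like in practice. This fragment extends the client-side handler sketched earlier and marks the shared system prompt cacheable via Anthropic’s cache_control (an Anthropic API feature, not MCP itself):

# Inside the client's sampling handler: mark the shared system prompt as a
# cacheable prefix so repeat requests skip reprocessing it
reply = await anthropic.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=params.maxTokens,
    system=[{
        "type": "text",
        "text": params.systemPrompt or "",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": m.role, "content": m.content.text} for m in params.messages],
)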

For AI strategy and readiness engagements, we often recommend starting with MCP sampling for critical workflows (fraud detection, compliance checks, escalation decisions) and rule-based logic for everything else. This hybrid approach typically reduces latency by 40–60% compared to a fully agentic system.


Real-World Use Cases for MCP Sampling

Let’s ground this in concrete scenarios where MCP sampling delivers value.

Use Case 1: Fraud Detection in Fintech

A Sydney-based fintech startup processes 10,000 transactions per day. They need to flag suspicious transactions in real-time.

Without MCP sampling (monolithic LLM):

Transaction arrives → Claude processes: "Is this fraud?"
  → Claude calls: get_user_history, get_merchant_profile, check_geolocation
  → Claude decides: "Flag for review" or "Allow"
  → 2–3 second latency per transaction

For 10,000 transactions, that’s 20,000–30,000 seconds of LLM time. Not viable.

With MCP sampling (server orchestration):

Transaction arrives → Server checks:
  → Is transaction amount > $5,000? (rule-based, <1ms)
  → Is user in high-risk country? (rule-based, <1ms)
  → If both true, request MCP sampling: "Should we flag this?"
    → Claude reviews transaction + context → "Yes, flag."
    → 1.5 seconds
  → If either false, allow (rule-based, <1ms)

Result: 95% of transactions processed in <5ms. Only 5% trigger Claude (500 transactions × 1.5s = 750 seconds). Total daily LLM time: 750 seconds. Cost: ~$50/day instead of $500/day.

And crucially: every Claude invocation is logged. You can prove to auditors that you’re not using AI for low-risk decisions, only for high-risk ones. This is exactly what security audit and compliance teams want to see.

Use Case 2: Customer Support Escalation

A mid-market SaaS company handles 1,000 support tickets per day. Most are routine (password resets, billing questions). Some need human judgment.

Workflow:

  1. Ticket arrives → Server extracts customer ID, issue category, sentiment.
  2. Rule-based logic checks: Is this a known issue? (Yes → send FAQ, done in 50ms).
  3. If not, request MCP sampling: “Should we escalate this to a human?”
    • Claude sees: ticket text, customer history, current queue depth.
    • Claude doesn’t see: other customers’ data, internal pricing, executive names.
    • Claude responds: “Escalate to Tier 2” or “Agent can handle with template X”.
  4. Server logs the decision and routes accordingly.

Outcome:

  • 70% of tickets resolved by rule-based routing (50ms each).
  • 20% resolved by Claude + template suggestion (1.5s each).
  • 10% escalated to human (1.5s Claude decision + human time).
  • Total LLM cost: ~$20/day for 1,000 tickets.
  • Audit trail: Every escalation decision is logged with reasoning.

When your AI automation agency partner designs this, they’re not just optimising for speed and cost—they’re building auditability into the architecture.

Use Case 3: Data Labelling and Classification

A compliance team at an enterprise needs to classify 50,000 documents for GDPR compliance. They need to identify which documents contain personal data.

Workflow:

  1. Document arrives → Server extracts metadata (size, file type, date).
  2. Rule-based heuristics: Is this a known template? (Yes → auto-classify, 10ms).
  3. If not, request MCP sampling: “Does this document contain personal data?”
    • Claude sees: document text (first 2000 tokens only).
    • Claude calls: check_schema (to understand data structure).
    • Claude responds: “Yes, contains name, email, phone” or “No, generic content”.
  4. Server logs the classification and stores it in audit database.

Outcome:

  • 40,000 documents auto-classified by rules (10ms each).
  • 10,000 documents classified by Claude (1.5s each).
  • Total time: 40,000 × 0.01s + 10,000 × 1.5s = 15,400 seconds = ~4 hours.
  • Cost: ~$30.
  • Audit trail: Every Claude decision is logged with document ID, timestamp, user who reviewed it.

When you’re pursuing ISO 27001 compliance or SOC 2 audit readiness, this kind of AI-assisted classification with full audit logging is exactly what regulators want to see. You’re not blindly trusting Claude; you’re using Claude as a tool within a controlled, logged process.


Building Your First MCP Sampling Integration

Let’s build a minimal but complete example. We’ll create a simple MCP server that uses sampling to make decisions about user account risk.

Step 1: Set Up Your MCP Server

If you’re using Python, start with the MCP SDK:

pip install mcp

Create a basic server:

from datetime import datetime

from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent

mcp = FastMCP("risk-assessment-server")

@mcp.tool()
async def assess_user_risk(user_id: str, ctx: Context) -> dict:
    """
    Main workflow: assess user risk, using MCP sampling for ambiguous cases.
    (get_user_from_db, log_sampling_decision, generate_audit_id are your own helpers.)
    """
    # Step 1: Fetch user data (your database)
    user = await get_user_from_db(user_id)

    # Step 2: Rule-based checks decide the clear cases without touching the LLM
    if user['account_age_days'] < 1:
        return {"risk_level": "high", "reason": "new_account", "requires_review": True}

    if user['failed_logins'] > 5:
        return {"risk_level": "high", "reason": "multiple_failed_logins", "requires_review": True}

    # Step 3: If rules don't decide, ask Claude via the client's sampling capability
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text=(
                        f"User {user_id}: account age {user['account_age_days']} days, "
                        f"monthly transactions {user['monthly_transactions']}, "
                        f"last login {user['last_login']}, country {user['country']}. "
                        "Should we flag this account for review?"
                    ),
                ),
            )
        ],
        system_prompt=(
            "You are a risk assessment assistant. Be conservative. "
            "Respond with ONLY: 'low', 'medium', or 'high'."
        ),
        max_tokens=50,
        temperature=0.1,
    )

    # Step 4: Parse the response (assumes a text completion) and log for audit
    risk_level = result.content.text.strip().lower()

    await log_sampling_decision({
        "user_id": user_id,
        "model": result.model,  # the model the client actually used
        "risk_level": risk_level,
        "timestamp": datetime.now().isoformat(),
        "audit_id": generate_audit_id()
    })

    return {
        "risk_level": risk_level,
        "reason": "claude_assessment",
        "requires_review": risk_level in ["medium", "high"]
    }

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default

Step 2: Connect to a Client

Your MCP server runs as a subprocess. The client (which has access to Claude) connects via stdio:

// client.js: official MCP TypeScript SDK
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
import { CreateMessageRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// Spawn the Python server as a subprocess; declare the sampling capability
const transport = new StdioClientTransport({ command: "python", args: ["risk_assessment_server.py"] });
const client = new Client({ name: "risk-assessment-client", version: "1.0.0" },
                          { capabilities: { sampling: {} } });

// Fulfil the server's sampling/createMessage requests by invoking Claude
client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  const completion = await callClaude(request.params); // your Anthropic API wrapper
  return { role: "assistant", content: { type: "text", text: completion.text },
           model: completion.model, stopReason: "endTurn" };
});

await client.connect(transport);

The client now has a connection to your server. When your server sends a sampling/createMessage request, the client receives it, invokes Claude, and returns the result.

Step 3: Test the Workflow

The MCP server itself speaks stdio, not HTTP, so for a quick smoke test wrap the tool call in a thin HTTP endpoint (hypothetical here) and hit it:

curl -X POST http://localhost:3000/assess_user \
  -H "Content-Type: application/json" \
  -d '{"user_id": "u_12345"}'

Response:

{
  "risk_level": "low",
  "reason": "claude_assessment",
  "requires_review": false
}

Every decision is logged. You can query the audit log:

SELECT * FROM sampling_audit_log WHERE user_id = 'u_12345' ORDER BY timestamp DESC;

This is the foundation. For production, you’d add:

  • Retry logic: If Claude times out, fall back to a conservative rule.
  • Rate limiting: Limit sampling requests per user per day (see the sketch after this list).
  • Caching: If you’ve already assessed a user in the last hour, reuse the result.
  • A/B testing: Compare Claude’s decisions to human decisions to validate accuracy.
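
Here is a sketch of the rate-limiting item: a per-user sliding window held in memory (swap in Redis for anything multi-process). The class name and thresholds are illustrative.

import time
from collections import defaultdict, deque

class SamplingRateLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.calls: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self.calls[user_id]
        while q and now - q[0] > self.window:
            q.popleft()  # evict calls that fell out of the window
        if len(q) >= self.max_requests:
            return False  # over quota: fall back to rule-based logic
        q.append(now)
        return True

limiter = SamplingRateLimiter(max_requests=100, window_seconds=86400)  # 100/day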

For teams building custom software development solutions or pursuing AI strategy and readiness, this pattern is the starting point for production agentic systems.


Common Pitfalls and How to Avoid Them

Pitfall 1: Asking Claude for Decisions It Shouldn’t Make

The mistake: Passing Claude sensitive business logic.

# ❌ Bad: Claude decides who gets a discount
sampling_request = {
    "messages": [{"role": "user", "content": f"Should we give {user_id} a 50% discount?"}],
    "tools": ["apply_discount", "send_email", "log_decision"]
}

Why this fails:

  • Claude might approve discounts inconsistently.
  • You can’t audit why Claude made the decision (it’s a black box).
  • If Claude is compromised or hallucinating, you’ve lost control of your business logic.

The fix: Use rules for business logic, Claude for ambiguous cases.

# ✅ Good: Rules decide, Claude provides context
if user['loyalty_score'] > 80:
    discount = 0.5  # Rule: high-loyalty users get 50%
elif user['account_age_days'] > 365 and user['annual_spend'] > 10000:
    discount = 0.25  # Rule: long-term, high-spend users get 25%
else:
    # Ambiguous case: ask Claude
    sampling_request = {
        "messages": [{"role": "user", "content": f"Should we offer a discount to {user_id}? Context: {user_context}"}],
        "system": "You are a customer retention assistant. Suggest a discount (0%, 10%, 25%, or 50%) based on the user's profile. Be conservative."
    }
    response = await server.request_sampling(sampling_request)  # shorthand for the SDK sampling call
    discount = parse_discount(response['content']['text'])

Pitfall 2: Not Logging Sampling Requests

The mistake: Invoking Claude without recording what you asked or what it responded.

When an audit happens, you have no evidence of what Claude decided or why.

The fix: Log every sampling request and response.

await log_sampling_decision({
    "timestamp": datetime.now().isoformat(),
    "user_id": user_id,
    "request_id": generate_uuid(),
    "sampling_request": {
        "model_preferences": sampling_request.get('modelPreferences'),
        "max_tokens": sampling_request['maxTokens'],
        "temperature": sampling_request['temperature']
    },
    "sampling_response": {
        "model": response['model'],                    # the model the client actually used
        "stop_reason": response['stopReason'],
        "content": response['content']['text'][:500]   # first 500 chars for audit
    },
    "server_decision": server_decision,  # "approved" or "rejected", set by your rules
    "audit_trail_id": generate_audit_id()
})

Store this in a tamper-proof audit database (append-only log, immutable timestamps). When your SOC 2 auditor asks “Can you prove that Claude’s decisions were reviewed?”, you have the evidence.
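
One way to get tamper-evidence without special infrastructure is hash chaining: each entry stores a hash over its own payload plus the previous entry’s hash, so any silent edit breaks the chain. A minimal sketch:

import hashlib
import json

def chain_audit_entry(entry: dict, prev_hash: str) -> dict:
    # Canonical serialisation so the same entry always hashes identically
    payload = json.dumps(entry, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    return {**entry, "prev_hash": prev_hash, "entry_hash": entry_hash}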

Pitfall 3: Giving Claude Too Much Context

The mistake: Passing Claude a 500KB JSON dump of user data.

# ❌ Bad: Claude sees everything
sampling_request = {
    "messages": [{
        "role": "user",
        "content": f"Assess this user: {json.dumps(all_user_data)}"  # 500KB!
    }]
}

Problems:

  • Latency explodes (processing 500KB takes 5–10 seconds).
  • Token cost balloons (500KB = 100K+ tokens).
  • Claude might leak sensitive data in its response (names, emails, phone numbers).
  • Audit trail becomes unwieldy.

The fix: Pre-filter context. Only pass Claude what it needs.

# ✅ Good: Claude sees only relevant data
relevant_context = {
    "account_age_days": user['account_age_days'],
    "monthly_transactions": user['monthly_transactions'],
    "failed_logins_last_7d": user['failed_logins_last_7d'],
    "country": user['country'],
    "risk_score": user['risk_score']
}

sampling_request = {
    "messages": [{
        "role": "user",
        "content": f"Assess risk for user: {json.dumps(relevant_context)}"
    }],
    "system": "You are a risk assessment assistant. Do NOT mention specific names, emails, or phone numbers in your response."
}

Pitfall 4: Not Handling Claude Timeouts

The mistake: If Claude times out, your entire workflow fails.

# ❌ Bad: No fallback
response = await server.request_sampling(sampling_request)
risk_level = response['content']['text']

If Claude API is slow or down, your users get errors.

The fix: Implement fallback logic.

# ✅ Good: Fallback to conservative rule
import asyncio

try:
    response = await asyncio.wait_for(
        server.request_sampling(sampling_request), timeout=2.0
    )
    risk_level = response['content']['text'].strip().lower()
except asyncio.TimeoutError:
    # Claude didn't respond in time. Fall back to conservative rule.
    risk_level = "medium"  # Default to caution
    await log_sampling_decision({
        "status": "timeout",
        "fallback": "medium",
        "user_id": user_id
    })

This ensures your service stays up even if Claude is slow.


Advanced Patterns and Optimisation

Pattern 1: Parallel Sampling Requests

If you have multiple independent decisions, request them in parallel:

# ✅ Good: Parallel requests
async def assess_user_comprehensive(user_id: str):
    user = await get_user_from_db(user_id)
    
    # Launch three independent sampling requests in parallel
    risk_task = server.request_sampling({
        "messages": [{"role": "user", "content": f"Risk assessment for {user_id}"}]
    })
    
    compliance_task = server.request_sampling({
        "messages": [{"role": "user", "content": f"Compliance check for {user_id}"}]
    })
    
    retention_task = server.request_sampling({
        "messages": [{"role": "user", "content": f"Retention opportunity for {user_id}"}]
    })
    
    # Wait for all three in parallel
    risk, compliance, retention = await asyncio.gather(risk_task, compliance_task, retention_task)
    
    return {
        "risk_level": parse_response(risk),
        "compliance_status": parse_response(compliance),
        "retention_opportunity": parse_response(retention),
        "total_latency_ms": measure_latency()  # ~1500ms instead of ~4500ms
    }

Instead of waiting 4.5 seconds for three sequential requests, you wait 1.5 seconds for three parallel requests.

Pattern 2: Caching Sampling Results

If you assess the same user twice in 1 hour, reuse the previous result:

# ✅ Good: Cache sampling results
async def assess_user_risk_cached(user_id: str):
    cache_key = f"risk_assessment:{user_id}"
    cached_result = await redis.get(cache_key)
    
    if cached_result:
        return json.loads(cached_result)
    
    # Not in cache, request sampling
    result = await server.request_sampling({...})
    
    # Cache for 1 hour
    await redis.setex(cache_key, 3600, json.dumps(result))
    
    return result

This reduces latency from 1.5 seconds to <50ms for repeat assessments.

Pattern 3: Cost Optimisation with Model Selection

Use cheaper models for low-risk decisions:

# ✅ Good: Model selection based on risk
async def assess_user_risk_optimised(user_id: str):
    user = await get_user_from_db(user_id)
    
    # Rule-based heuristic: Is this a clear case?
    if user['account_age_days'] > 1000 and user['failed_logins'] == 0:
        # Low-risk case: use cheaper, faster model
        model = "claude-3-5-haiku-20241022"
        max_tokens = 20
    else:
        # High-risk case: use more capable model
        model = "claude-3-5-sonnet-20241022"
        max_tokens = 100
    
    response = await server.request_sampling({
        "model": model,
        "maxTokens": max_tokens,
        "messages": [{"role": "user", "content": f"Risk level for {user_id}?"}]
    })
    
    return parse_response(response)

Haiku costs roughly 75% less than Sonnet and is 2–3× faster. For straightforward decisions, it’s perfect.


Compliance and Audit Readiness

MCP sampling is powerful for compliance because it gives you control. Here’s how to leverage it for SOC 2 and ISO 27001 audits.

Building an Audit Trail

Create an immutable audit log table:

CREATE TABLE sampling_audit_log (
    id UUID PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL,
    user_id VARCHAR NOT NULL,
    request_id UUID NOT NULL,
    sampling_model VARCHAR NOT NULL,
    input_tokens INT NOT NULL,
    output_tokens INT NOT NULL,
    request_summary TEXT,
    response_summary TEXT,
    server_decision VARCHAR,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR NOT NULL
);

CREATE INDEX idx_sampling_audit_user ON sampling_audit_log(user_id);
CREATE INDEX idx_sampling_audit_timestamp ON sampling_audit_log(timestamp);

Every sampling request goes here. You can prove:

  • Who made the decision (created_by).
  • When it was made (timestamp).
  • What Claude was asked (request_summary).
  • What Claude responded (response_summary).
  • How the server acted on it (server_decision).

For Vanta implementation, this table is evidence that you’re controlling AI decisions, not letting them run wild.

Monitoring and Alerting

Set up alerts for anomalies:

# Alert if sampling requests spike
async def monitor_sampling_requests():
    last_hour_requests = await db.query(
        "SELECT COUNT(*) FROM sampling_audit_log WHERE timestamp > NOW() - INTERVAL '1 hour'"
    )  # assumes db.query returns the scalar count
    
    if last_hour_requests > 1000:  # Threshold
        await send_alert({
            "severity": "warning",
            "message": f"Sampling requests spike: {last_hour_requests} in last hour",
            "action": "Review sampling decisions for anomalies"
        })

This catches runaway AI usage, which auditors care about.

Access Control

Ensure only authorised services can request sampling:

async def request_sampling_with_auth(request: dict, caller_id: str):
    # Gate every sampling request behind an authorisation check
    if not await is_authorised(caller_id, "sampling"):
        await log_auth_failure({
            "caller_id": caller_id,
            "action": "sampling",
            "timestamp": datetime.now().isoformat()
        })
        raise PermissionError(f"Caller {caller_id} not authorised for sampling")
    
    # Caller is authorised: proceed with sampling
    return await server.request_sampling(request)

When your ISO 27001 auditor asks “Who can invoke AI decisions?”, you have the answer: only authorised services, and every attempt is logged.


Summary and Next Steps

MCP sampling is a game-changer for building production-grade agentic systems. Here’s what we’ve covered:

Key Takeaways

  1. MCP sampling inverts control: Your server orchestrates, Claude assists. This is more secure, auditable, and compliant than monolithic LLM calls.

  2. Authentication and access control are built-in: Because your server is the intermediary, you can enforce identity, permissions, and data residency before Claude sees anything.

  3. Audit logging is essential: Every sampling request and response must be logged. This is non-negotiable for compliance.

  4. Latency is real but manageable: MCP sampling adds roughly 0.5–2 seconds per request, more when tools are called mid-loop. Optimise by using rules for straightforward decisions and Claude only for ambiguous cases.

  5. Cost scales with intent: You only pay for Claude when you actually need it. Hybrid rule-based + agentic systems are typically 70–80% cheaper than fully agentic systems.

  6. Compliance is achievable: When you follow these patterns—filtering context, logging decisions, enforcing access controls—you’re building systems that pass SOC 2 and ISO 27001 audits.

Next Steps

  1. Start small: Pick one low-risk workflow (e.g., support ticket classification) and implement MCP sampling. Get comfortable with the pattern.

  2. Build your audit trail: Set up logging from day one. Don’t retrofit it later.

  3. Test with a partner: If you’re serious about this, work with a team that has built agentic systems before. At PADISO, we’ve built custom software development solutions and AI strategy and readiness programmes for 50+ startups and enterprises. We can help you design your MCP sampling architecture, implement it, and pass your compliance audits.

  4. Measure and iterate: Track latency, cost, and decision accuracy. Use these metrics to refine your rules and sampling thresholds.

  5. Scale to production: Once you’re confident, expand to more workflows. By then, you’ll have a proven playbook.

Resources for Going Deeper

The official MCP specification is the source of truth. For hands-on learning, Philipp Schmid’s MCP guide is excellent. If you’re building with open-source models, check out Hugging Face’s MCP blog post.

For Sydney-based teams pursuing AI adoption or AI advisory services, we’re here to help. Whether you need fractional CTO support to design your agentic architecture or a full venture studio co-build to ship your product, let’s talk.

MCP sampling isn’t just a technical pattern—it’s the foundation of AI systems that are fast, cheap, compliant, and auditable. Build with it.