MCP Observability: Tracing Tool Calls Across an Agent Loop
Master MCP observability with OpenTelemetry. Trace tool calls end-to-end, detect runaway loops, and ship production-ready agentic AI with PostHog dashboards.
Table of Contents
- Why MCP Observability Matters in Production
- What Is MCP Observability?
- OpenTelemetry and MCP: The Foundation
- Wiring OpenTelemetry Through MCP
- Tracing Tool Calls End-to-End
- Building PostHog Dashboards for Agent Visibility
- Detecting and Debugging Runaway Loops
- Real-World Implementation Patterns
- Common Pitfalls and How to Avoid Them
- Next Steps: Moving to Production
Why MCP Observability Matters in Production
Agentic AI systems are fundamentally different from traditional software. They don’t follow predetermined code paths. Instead, they reason, decide which tools to call, and iterate until they solve a problem. This autonomy is powerful—but it’s also opaque.
Without proper observability, you’re flying blind. You don’t know:
- Which tools your agent actually called and in what order
- How long each tool invocation took and whether it succeeded or failed
- What data flowed between the agent and your backend systems
- Whether your agent is stuck in an infinite loop, burning through your LLM budget
- Which user requests triggered unexpected behaviour
We’ve seen this firsthand. One of our clients deployed an agentic AI system to automate customer support escalations. Within 48 hours, the agent had entered a loop where it kept calling the same database query tool, each time with slightly different parameters, unable to satisfy its own success criteria. The bill? $47,000 in LLM costs. The visibility? Zero.
That’s where MCP observability comes in. When you wire OpenTelemetry through your Model Context Protocol infrastructure, every tool call becomes traceable. You can see the exact path your agent took, spot the loop before it costs you five figures, and debug production issues in minutes instead of days.
This guide will show you exactly how to build that visibility into your agentic AI systems—from wiring OTEL instrumentation through MCP to building PostHog dashboards that your team actually uses.
What Is MCP Observability?
MCP observability is the practice of instrumenting your Model Context Protocol servers and agent orchestration layers so that every interaction—every tool call, every LLM decision, every data flow—is captured, traced, and made queryable.
It’s not just logging. Logging tells you that something happened. Observability tells you why it happened, how long it took, and what the impact was.
The Three Pillars of MCP Observability
Traces capture the full execution path of a single agent request. A trace shows every tool call, every LLM invocation, every step in the reasoning loop. Each span within a trace represents a discrete operation—a database query, an API call, a prompt evaluation. Traces let you see the entire journey from user input to final output.
Metrics aggregate data across many traces. Instead of looking at one agent run, you look at patterns: average tool-call latency, error rates per tool, distribution of loop iterations, cost per request. Metrics are what power your dashboards and alerts.
Logs provide context and detail. When a span fails, logs explain why. When a tool returns unexpected data, logs show what that data was. Logs are your source of truth for debugging.
MCP observability ties all three together. When you query your dashboard and see that tool X has a 15% error rate, you can drill down into traces to see which specific requests failed, then look at logs to understand why. That’s the power of integrated observability.
Why MCP Specifically?
The Model Context Protocol is becoming the standard for connecting LLMs to tools and data sources. It’s used by Claude, by many open-source agent frameworks, and by enterprise orchestration platforms. But MCP servers are often black boxes. You send a tool call, you get a result back, but you don’t see what happened in between.
Wiring observability through MCP means you’re not just seeing what your agent did—you’re seeing what your entire system did, from the LLM’s perspective down to the database row that was queried. That end-to-end visibility is critical for production agentic AI.
OpenTelemetry and MCP: The Foundation
OpenTelemetry (OTEL) is an open standard for collecting traces, metrics, and logs from applications. It’s maintained by the Cloud Native Computing Foundation and is rapidly becoming the industry standard for observability instrumentation.
Why OpenTelemetry? Because it’s:
- Vendor-agnostic. You instrument your code once, and you can send traces to Jaeger, Grafana Tempo, Datadog, New Relic, or any OTEL-compatible backend.
- Language-agnostic. Whether you’re writing agents in Python, Node.js, Go, or Rust, there’s an OTEL SDK.
- Designed for distributed systems. OTEL propagates trace context across service boundaries, so you can follow a request from your agent orchestrator through your MCP server and into your database.
How OpenTelemetry Works
At its core, OpenTelemetry uses three concepts:
- Tracer: An object that creates spans.
- Span: A unit of work. A span has a name, a start time, an end time, attributes, and events.
- Exporter: A component that sends spans to a backend (Jaeger, Tempo, PostHog, etc.).
When your agent calls a tool, you create a span. When that tool queries a database, you create a child span. When the database returns, you close the child span. When the tool returns, you close the parent span. The result is a tree of operations that shows exactly what happened.
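Here is a minimal sketch of those three concepts working together, using the same tracer setup we wire up later in this guide:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("tool.call") as parent:
    parent.set_attribute("tool.name", "query_database")
    with tracer.start_as_current_span("database.query") as child:
        child.set_attribute("db.statement", "SELECT 1")
        # work happens here; exiting each block ends that span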
Trace Context Propagation
One of OTEL’s superpowers is trace context propagation. When your agent calls an MCP server, it includes trace context in the request (usually in HTTP headers). The MCP server reads that context and creates child spans under the same trace. This way, a single trace can span from your agent orchestration layer all the way down to your database.
Without trace context propagation, you’d have separate traces for the agent, the MCP server, and the database. You’d have to manually correlate them. With propagation, they’re automatically linked into a single, coherent story.
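Concretely, propagation is just metadata on the request. With OTEL's default W3C propagator, inject() writes a traceparent header that encodes the trace ID and parent span ID; the value shown below is illustrative:
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # mutates the dict in place when a span is active
# headers now looks something like:
# {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
# format: version-<trace id>-<parent span id>-<trace flags>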
Wiring OpenTelemetry Through MCP
Now let’s get concrete. Here’s how to instrument your MCP servers and agent orchestration with OpenTelemetry.
Step 1: Install OTEL Dependencies
For Python (the most common language for agentic AI):
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
If you’re using a specific MCP framework, there may be pre-built instrumentation. For example, if you’re using the traceloop/opentelemetry-mcp-server package, it provides automatic instrumentation for MCP servers with support for exporting to Jaeger, Grafana Tempo, and other OTEL backends.
Step 2: Initialize the OTEL SDK
In your agent orchestration code, initialize the tracer:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Create a tracer provider
tracer_provider = TracerProvider()
# Set up OTLP exporter (sends traces to your backend)
otlp_exporter = OTLPSpanExporter(
endpoint="localhost:4317", # or your observability backend
)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
# Set the global tracer provider
trace.set_tracer_provider(tracer_provider)
# Get a tracer
tracer = trace.get_tracer(__name__)
This initializes OTEL to export traces via the OTLP protocol (OpenTelemetry Protocol). You can point this at a local Jaeger instance for development, or at a managed service like Grafana Cloud for production.
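For quick local debugging, you can also print spans straight to stdout before any backend is involved. A handy sanity check (development only):
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print every span to stdout as it ends; far too noisy for production
tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))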
Step 3: Instrument Your Agent Loop
Now, wrap your agent’s main loop in a span:
def run_agent(user_query: str) -> str:
with tracer.start_as_current_span("agent.run") as span:
span.set_attribute("user.query", user_query)
span.set_attribute("agent.model", "claude-3-5-sonnet")
messages = [{"role": "user", "content": user_query}]
while True:
# Call the LLM
with tracer.start_as_current_span("llm.invoke") as llm_span:
response = client.messages.create(
model="claude-3-5-sonnet",
max_tokens=1024,
tools=tools,
messages=messages,
)
llm_span.set_attribute("llm.stop_reason", response.stop_reason)
llm_span.set_attribute("llm.input_tokens", response.usage.input_tokens)
llm_span.set_attribute("llm.output_tokens", response.usage.output_tokens)
# Check if the agent wants to call a tool
if response.stop_reason == "end_turn":
return response.content[0].text
            # Process tool calls: record the assistant turn once, then run each tool
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for content_block in response.content:
                if content_block.type == "tool_use":
                    tool_name = content_block.name
                    tool_input = content_block.input
                    with tracer.start_as_current_span("tool.call") as tool_span:
                        tool_span.set_attribute("tool.name", tool_name)
                        tool_span.set_attribute("tool.input", str(tool_input))
                        # Call the tool (via MCP)
                        result = call_tool(tool_name, tool_input)
                        tool_span.set_attribute("tool.result", str(result))
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": content_block.id,
                        "content": str(result),
                    })
            # Return all tool results to the model in a single user turn
            messages.append({"role": "user", "content": tool_results})
Notice how we’re creating nested spans: an outer agent.run span, and within it, llm.invoke and tool.call spans. Each span has attributes that describe what happened.
Step 4: Instrument Your MCP Server
On the MCP server side, you want to capture what happens inside each tool. Here’s a simple example:
from mcp.server import Server
from mcp.types import Tool
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
server = Server("my-mcp-server")
@server.list_tools()
async def list_tools() -> list[Tool]:
return [
Tool(
name="query_database",
description="Query the customer database",
inputSchema={
"type": "object",
"properties": {
"query": {"type": "string"},
},
},
),
]
@server.call_tool()
async def call_tool(name: str, arguments: dict) -> str:
with tracer.start_as_current_span(f"mcp.tool.{name}") as span:
span.set_attribute("mcp.tool.name", name)
span.set_attribute("mcp.tool.arguments", str(arguments))
if name == "query_database":
with tracer.start_as_current_span("database.query") as db_span:
query = arguments["query"]
db_span.set_attribute("db.statement", query)
# Execute the query
result = execute_query(query)
db_span.set_attribute("db.rows_returned", len(result))
return str(result)
else:
raise ValueError(f"Unknown tool: {name}")
Now, when your agent calls the query_database tool via MCP, the entire chain is traced: agent → LLM → tool call → MCP server → database query. Each step is a span, and they’re all linked together.
Step 5: Propagate Trace Context
For this to work end-to-end, you need to propagate trace context from your agent to your MCP server. If you’re using HTTP to call the MCP server, this is straightforward:
from opentelemetry.propagate import inject
import httpx
def call_tool(tool_name: str, tool_input: dict) -> str:
# Prepare headers with trace context
headers = {}
inject(headers)
# Call the MCP server
response = httpx.post(
"http://mcp-server:8000/call_tool",
json={"name": tool_name, "arguments": tool_input},
headers=headers,
)
return response.json()["result"]
On the MCP server side, extract the trace context:
from opentelemetry import context as otel_context
from opentelemetry.propagate import extract
from fastapi import Request

@app.post("/call_tool")
async def call_tool_endpoint(request: Request, body: dict):
    # Extract trace context from the incoming headers and attach it,
    # so any spans created here become children of the caller's trace
    ctx = extract(request.headers)
    token = otel_context.attach(ctx)
    try:
        return await call_tool(body["name"], body["arguments"])
    finally:
        otel_context.detach(token)
This ensures that the trace context flows from your agent, through your HTTP request, and into your MCP server. All spans are part of the same trace.
Tracing Tool Calls End-to-End
Let’s zoom out and look at what end-to-end tracing actually gives you.
Anatomy of a Complete Trace
When a user submits a query to your agent, here’s what a complete trace looks like:
agent.run (1200ms)
├── llm.invoke (450ms)
│ └── api_call: claude.anthropic.com (450ms)
├── tool.call: query_database (600ms)
│ └── mcp.tool.query_database (590ms)
│ └── database.query (580ms)
│ └── postgres.execute (575ms)
├── llm.invoke (100ms)
│ └── api_call: claude.anthropic.com (100ms)
└── tool.call: format_result (50ms)
└── mcp.tool.format_result (45ms)
Each span has:
- A name (what operation is this?)
- A duration (how long did it take?)
- Attributes (metadata about the operation)
- Events (discrete things that happened during the span)
- Status (success, error, or unknown)
When you look at this trace in your observability backend, you can immediately see:
- The agent took 1.2 seconds total
- The LLM was called twice (450ms + 100ms = 550ms of LLM time)
- The database query took 580ms, which was the bottleneck
- The tool calls succeeded (no error status)
Attributes That Matter
When instrumenting tool calls, capture these attributes:
- Tool name and version: tool.name, tool.version
- Input and output: tool.input, tool.output (be careful with PII)
- Latency: automatically captured by OTEL (span duration)
- Success or failure: tool.status, tool.error_message
- Resource usage: tool.tokens_used (for LLM tools), tool.rows_returned (for database tools)
- User context: user.id, user.request_id (for correlating logs and traces)
Here’s a more complete example:
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("tool.call") as span:
    span.set_attribute("tool.name", tool_name)
    span.set_attribute("tool.version", "1.0.0")
    span.set_attribute("user.id", user_id)
    span.set_attribute("request.id", request_id)
    try:
        result = call_tool(tool_name, tool_input)
        span.set_attribute("tool.status", "success")
        span.set_attribute("tool.output_length", len(str(result)))
    except Exception as e:
        span.set_attribute("tool.status", "error")
        span.set_attribute("tool.error_type", type(e).__name__)
        span.set_attribute("tool.error_message", str(e))
        span.record_exception(e)
        # Mark the span itself as failed so backends can filter on span status
        span.set_status(Status(StatusCode.ERROR, str(e)))
        raise
Querying Traces
Once your traces are flowing to your observability backend (Jaeger, Grafana Tempo, Datadog, etc.), you can query them:
- Find all traces for user X: attributes.user.id = "X"
- Find all traces where a tool failed: attributes.tool.status = "error"
- Find all traces where the database query took > 1 second: mcp.tool.query_database.duration > 1000ms
- Find all traces where the LLM was called more than 3 times: count(llm.invoke) > 3
This is where observability becomes a superpower. You’re not just looking at logs or metrics in isolation—you’re reconstructing the exact execution path of a single request.
Building PostHog Dashboards for Agent Visibility
Traces are great for debugging individual requests, but you also need visibility into patterns. That’s where metrics and dashboards come in.
PostHog is an open-source product analytics platform that integrates well with OpenTelemetry. Here’s how to build dashboards that your team will actually use.
Setting Up OTEL → PostHog
PostHog can ingest OpenTelemetry data. Point your OTLP exporter at PostHog; the endpoint and auth header below are illustrative, so confirm them against PostHog's current OTEL documentation:
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

otlp_exporter = OTLPSpanExporter(
    endpoint="https://app.posthog.com/batch",  # illustrative; check PostHog's docs
    headers={
        "Authorization": f"Bearer {POSTHOG_API_KEY}",
    },
)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
PostHog will ingest your traces and automatically create a queryable trace database.
Dashboard 1: Agent Health
Your first dashboard should answer: “Is my agent working?”
Metrics to track:
- Requests per minute: Count of agent.run spans
- Error rate: Percentage of agent.run spans with status = error
- Average loop iterations: Count of llm.invoke spans per agent.run span
- P95 latency: 95th percentile duration of agent.run spans
Query in PostHog:
select
count(*) as requests,
sum(case when status = 'error' then 1 else 0 end) * 100.0 / count(*) as error_rate_pct,
avg(duration) as avg_latency_ms,
percentile(duration, 0.95) as p95_latency_ms
from spans
where span_name = 'agent.run'
and timestamp > now() - interval '1 hour'
Dashboard 2: Tool Performance
Which tools are slow? Which are failing?
Metrics to track:
- Tool call count: Breakdown by tool name
- Tool error rate: Percentage of tool.call spans with status = error, by tool
- Tool latency: P50, P95, P99 duration by tool
- Tool usage trend: Tool call count over time
Query in PostHog:
select
attributes['tool.name'] as tool_name,
count(*) as calls,
sum(case when status = 'error' then 1 else 0 end) * 100.0 / count(*) as error_rate_pct,
avg(duration) as avg_latency_ms,
percentile(duration, 0.95) as p95_latency_ms
from spans
where span_name = 'tool.call'
and timestamp > now() - interval '24 hours'
group by attributes['tool.name']
order by calls desc
Dashboard 3: Cost Tracking
For agentic AI, LLM costs can spiral. Track them in real time.
Metrics to track:
- Tokens per request: Input + output tokens per agent.run span
- Cost per request: Calculated from token counts and model pricing
- Cost trend: Total cost over time
- Cost by user: Identify power users or cost anomalies
Query in PostHog:
select
date_trunc('hour', timestamp) as hour,
sum(attributes['llm.input_tokens']::int + attributes['llm.output_tokens']::int) as total_tokens,
sum((attributes['llm.input_tokens']::int * 3.0 + attributes['llm.output_tokens']::int * 15.0) / 1000000) as cost_usd
from spans
where span_name = 'llm.invoke'
and timestamp > now() - interval '7 days'
group by date_trunc('hour', timestamp)
order by hour desc
(Prices are for Claude 3.5 Sonnet; adjust for your model.)
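To keep that arithmetic out of your SQL, you can centralise pricing in code and attach the computed cost as a span attribute instead. A sketch, with illustrative per-million-token prices (verify against your provider's current pricing):
# USD per million tokens; illustrative values, check current pricing before relying on them
MODEL_PRICES = {
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = MODEL_PRICES[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000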
Dashboard 4: Loop Detection
This is the critical one. You want to catch runaway loops before they cost you money.
Metrics to track:
- Loop iterations per request: Count of llm.invoke spans per agent.run span
- Max iterations: Highest loop count in the last hour
- Requests with > N iterations: Alert if any request loops more than your threshold
Query in PostHog:
select
trace_id,
count(*) as llm_invocations,
max(duration) as max_span_duration_ms
from spans
where span_name = 'llm.invoke'
and timestamp > now() - interval '1 hour'
group by trace_id
having count(*) > 5 -- Alert if more than 5 LLM calls per trace
order by llm_invocations desc
When this query returns results, you’ve got a problem. Investigate immediately.
Setting Up Alerts
PostHog supports alerts on metric queries. Set these up:
- Error rate > 5% on agent.run spans
- P95 latency > 30 seconds on agent.run spans
- Tool error rate > 10% for any tool
These alerts should go to Slack or PagerDuty so your team gets notified immediately when something goes wrong.
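If a case falls outside your backend's built-in alerting, a small scheduled check can fill the gap. A minimal sketch, assuming a hypothetical run_spans_query() helper that executes the loop-detection query above, plus a Slack incoming-webhook URL:
import httpx

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook

def check_for_loops() -> None:
    # run_spans_query is assumed to return the rows from the loop-detection query
    for row in run_spans_query("loop_detection.sql"):
        httpx.post(SLACK_WEBHOOK_URL, json={
            "text": f"Possible agent loop: trace {row['trace_id']} made "
                    f"{row['llm_invocations']} LLM calls in the last hour",
        })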
Detecting and Debugging Runaway Loops
Runaway loops are the most dangerous failure mode for agentic AI. Your agent gets stuck calling the same tool over and over, each time burning LLM tokens and potentially calling expensive APIs.
With proper observability, you can detect and debug these in minutes instead of after they’ve cost you thousands.
Detecting Loops in Real Time
Your loop detection dashboard (from above) is the first line of defence. But you can be more proactive.
Add a check in your agent loop:
def run_agent(user_query: str) -> str:
with tracer.start_as_current_span("agent.run") as span:
span.set_attribute("user.query", user_query)
messages = [{"role": "user", "content": user_query}]
iterations = 0
max_iterations = 10 # Safety limit
while iterations < max_iterations:
iterations += 1
span.set_attribute("agent.iterations", iterations)
# Call the LLM
with tracer.start_as_current_span("llm.invoke") as llm_span:
llm_span.set_attribute("iteration", iterations)
response = client.messages.create(...)
if response.stop_reason == "end_turn":
span.set_attribute("agent.status", "success")
return response.content[0].text
# Process tool calls...
# If we hit max_iterations, we're in a loop
span.set_attribute("agent.status", "loop_detected")
span.add_event("Max iterations reached")
raise RuntimeError(f"Agent loop detected after {max_iterations} iterations")
Now, when an agent hits the iteration limit, it immediately shows up in your traces with agent.status = "loop_detected". Your dashboard can alert on this.
Debugging: Trace Analysis
When you detect a loop, here’s how to debug it:
- Find the trace in your observability backend (by trace ID, user ID, or timestamp)
- Look at the tool calls: Which tool is being called repeatedly?
- Look at the tool inputs: Are they the same each time, or slightly different?
- Look at the tool outputs: Are they consistent, or does the tool return different results?
- Look at the LLM prompts: What is the agent asking for?
Common loop patterns:
- Same tool, same input, same output: The agent is stuck in a decision loop. It’s calling a tool, getting the same result, and not understanding that it already has the answer. Fix: Improve the tool’s output format or the agent’s instructions.
- Same tool, slightly different inputs: The agent is trying different variations to get a different result. Fix: The tool is broken or the agent’s expectation is wrong. Add error handling in the tool or clarify the agent’s instructions.
- Different tools, same overall pattern: The agent is trying multiple tools to accomplish the same goal and failing with each one. Fix: The agent doesn’t have the right tool for the job, or the tools are broken. Add a tool or fix the existing ones.
Here’s how to capture enough detail in your traces to debug these:
with tracer.start_as_current_span("tool.call") as tool_span:
tool_span.set_attribute("tool.name", tool_name)
tool_span.set_attribute("tool.input", json.dumps(tool_input))
tool_span.set_attribute("tool.input_hash", hashlib.md5(json.dumps(tool_input).encode()).hexdigest())
result = call_tool(tool_name, tool_input)
tool_span.set_attribute("tool.output", json.dumps(result))
tool_span.set_attribute("tool.output_hash", hashlib.md5(json.dumps(result).encode()).hexdigest())
By capturing input and output hashes, you can quickly spot when the same input is being called repeatedly (same hash) or when the agent is trying variations (different hashes).
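The same idea works inside the loop itself, so the agent fails fast instead of waiting for a dashboard: if it issues an identical tool call twice in one run, something is wrong. A minimal sketch using the same hashing:
import hashlib
import json

def make_repeat_checker():
    # Create one checker per agent run so state resets between requests
    seen_calls: set[str] = set()

    def check(tool_name: str, tool_input: dict) -> None:
        key = tool_name + ":" + hashlib.md5(
            json.dumps(tool_input, sort_keys=True).encode()
        ).hexdigest()
        if key in seen_calls:
            raise RuntimeError(f"Repeated identical call to {tool_name}; likely a loop")
        seen_calls.add(key)

    return check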
Cost Control
While you’re debugging, you want to prevent further damage. Implement cost controls:
def run_agent(user_query: str, max_cost_usd: float = 1.0) -> str:
with tracer.start_as_current_span("agent.run") as span:
span.set_attribute("user.query", user_query)
span.set_attribute("max_cost_usd", max_cost_usd)
total_cost = 0.0
messages = [{"role": "user", "content": user_query}]
while True:
# Check cost before calling LLM
if total_cost > max_cost_usd:
span.set_attribute("agent.status", "cost_limit_exceeded")
raise RuntimeError(f"Cost limit exceeded: ${total_cost:.2f} > ${max_cost_usd:.2f}")
# Call the LLM
with tracer.start_as_current_span("llm.invoke") as llm_span:
response = client.messages.create(...)
input_tokens = response.usage.input_tokens
output_tokens = response.usage.output_tokens
                # Calculate cost (Claude 3.5 Sonnet pricing: $3/M input, $15/M output tokens)
                cost = (input_tokens * 3 + output_tokens * 15) / 1_000_000
total_cost += cost
llm_span.set_attribute("llm.cost_usd", cost)
llm_span.set_attribute("agent.total_cost_usd", total_cost)
if response.stop_reason == "end_turn":
span.set_attribute("agent.status", "success")
span.set_attribute("agent.total_cost_usd", total_cost)
return response.content[0].text
# Process tool calls...
Now, if an agent enters a loop and starts burning through your budget, it’ll hit the cost limit and stop. You can set different limits for different users or request types.
Real-World Implementation Patterns
Let’s look at how this works in practice, with patterns we’ve seen at PADISO clients.
Pattern 1: Multi-Agent Orchestration
When you have multiple agents working together, trace context propagation becomes critical. Each agent should create a span, and those spans should be linked.
def orchestrate_agents(user_query: str) -> str:
with tracer.start_as_current_span("orchestration.run") as orch_span:
orch_span.set_attribute("user.query", user_query)
# Route to the appropriate agent
if "sales" in user_query:
with tracer.start_as_current_span("agent.sales") as agent_span:
result = sales_agent.run(user_query)
agent_span.set_attribute("agent.result", result)
elif "support" in user_query:
with tracer.start_as_current_span("agent.support") as agent_span:
result = support_agent.run(user_query)
agent_span.set_attribute("agent.result", result)
else:
with tracer.start_as_current_span("agent.general") as agent_span:
result = general_agent.run(user_query)
agent_span.set_attribute("agent.result", result)
return result
Now you can see which agent handled which request, and how long each took.
Pattern 2: Tool Caching
If your agent calls the same tool multiple times with the same input, you can cache the result. But you want to track this in your traces.
tool_cache = {} # In production, use Redis or similar
def call_tool(tool_name: str, tool_input: dict) -> str:
input_key = hashlib.md5(json.dumps(tool_input, sort_keys=True).encode()).hexdigest()
cache_key = f"{tool_name}:{input_key}"
with tracer.start_as_current_span("tool.call") as tool_span:
tool_span.set_attribute("tool.name", tool_name)
tool_span.set_attribute("tool.input_hash", input_key)
if cache_key in tool_cache:
tool_span.set_attribute("tool.cache_hit", True)
result = tool_cache[cache_key]
else:
tool_span.set_attribute("tool.cache_hit", False)
result = execute_tool(tool_name, tool_input)
tool_cache[cache_key] = result
tool_span.set_attribute("tool.output", str(result))
return result
Now your dashboard can show cache hit rate, which helps you understand whether your agent is repeating itself.
Pattern 3: Async Tool Execution
If you have tools that take a long time (API calls, database queries), you might want to execute them asynchronously. Trace context propagation is important here too.
import asyncio
from opentelemetry import context as otel_context

async def call_tool_async(tool_name: str, tool_input: dict) -> str:
    # Capture the current trace context before handing off to a background task
    ctx = otel_context.get_current()
    # Execute the tool in a background task
    task = asyncio.create_task(execute_tool_in_background(tool_name, tool_input, ctx))
    # Wait for the result
    return await task

async def execute_tool_in_background(tool_name: str, tool_input: dict, ctx):
    # Attach the captured context so spans here join the caller's trace
    token = otel_context.attach(ctx)
    try:
        with tracer.start_as_current_span(f"tool.{tool_name}"):
            return await execute_tool(tool_name, tool_input)
    finally:
        otel_context.detach(token)
This ensures that background tasks are part of the same trace as the main agent loop.
Pattern 4: Structured Logging with Trace Context
Logs are still valuable. Include trace context in every log so you can correlate logs with traces.
import logging
from opentelemetry.trace import get_current_span
class TraceContextFilter(logging.Filter):
    def filter(self, record):
        ctx = get_current_span().get_span_context()
        # Format the IDs as hex so they match what your tracing backend displays
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True
logger = logging.getLogger(__name__)
logger.addFilter(TraceContextFilter())
# Now every log includes trace_id and span_id
logger.info("Tool execution started", extra={
"tool_name": tool_name,
"tool_input": tool_input,
})
When you’re debugging an issue, you can grep logs by trace ID and see everything that happened in that request.
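For the IDs to actually appear in your log lines, the formatter has to reference them. A minimal sketch that pairs with the filter above:
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"
))
logger.addHandler(handler)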
Common Pitfalls and How to Avoid Them
Pitfall 1: Not Propagating Trace Context
Problem: You have traces from your agent, and separate traces from your MCP server, but they’re not linked.
Solution: Always propagate trace context when making HTTP requests. Use the inject() and extract() functions from OpenTelemetry.
Pitfall 2: Logging PII in Traces
Problem: Your tool inputs or outputs contain personally identifiable information (PII), and you’re logging them in traces. Now you’ve got PII in your observability backend.
Solution: Sanitize sensitive data before logging. Use hashing or truncation.
import hashlib
def sanitize_email(email: str) -> str:
return hashlib.sha256(email.encode()).hexdigest()[:8]
tool_span.set_attribute("user.email_hash", sanitize_email(tool_input["email"]))
Pitfall 3: Not Setting a Max Iteration Limit
Problem: Your agent enters a loop and burns through your entire budget before you notice.
Solution: Always set a max iteration limit in your agent loop, and a max cost limit. Make them configurable per request type.
Pitfall 4: Ignoring Tool Errors
Problem: A tool fails silently, returning an error message as a string. Your agent doesn’t understand that the tool failed, and keeps calling it.
Solution: Make tool errors explicit. Use exceptions, not error strings.
# Bad
result = execute_query(query)
if "error" in result:
return result # Agent doesn't know this is an error
# Good
try:
result = execute_query(query)
except DatabaseError as e:
span.record_exception(e)
raise
Pitfall 5: Not Instrumenting Downstream Systems
Problem: You have traces from your agent and your MCP server, but not from your database. You know a query took 5 seconds, but you don’t know why.
Solution: Instrument your entire stack. If you’re using PostgreSQL, use an OTEL instrumentation library. Same for Redis, Elasticsearch, etc.
For PostgreSQL with Python:
pip install opentelemetry-instrumentation-psycopg2
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
Psycopg2Instrumentor().instrument()
Now every database query is automatically traced.
Next Steps: Moving to Production
You’ve built observability into your agentic AI system. Now, how do you deploy it safely?
Step 1: Start with a Single Agent
Don’t try to instrument your entire system at once. Pick one agent, one MCP server, one tool. Get observability working end-to-end. Once you’re confident, scale.
Step 2: Set Up Your Observability Backend
You have options:
- Local development: Use Jaeger in Docker. It’s free, open-source, and runs locally.
- Staging: Use Grafana Cloud or Datadog. You want a managed service so you don’t have to operate it.
- Production: Use a managed service. PostHog, Datadog, New Relic, Grafana Cloud—all support OTEL and are production-ready.
For many of our clients at PADISO, PostHog is the sweet spot: it’s open-source, it has a generous free tier, and it integrates seamlessly with OTEL.
Step 3: Build Your Dashboards
Start with the four dashboards we outlined: Agent Health, Tool Performance, Cost Tracking, and Loop Detection. Build them incrementally. Don’t try to build the perfect dashboard on day one.
Step 4: Set Up Alerts
Once your dashboards are working, add alerts. Start conservative (high thresholds) and tighten them as you understand your system’s normal behaviour.
Step 5: Document Your Observability Strategy
Write down:
- What you’re tracing and why
- What your dashboards show
- How to debug common issues
- Who gets alerted when things go wrong
This documentation is invaluable when you’re on-call at 3 AM.
Step 6: Monitor Your Monitoring
Your observability system itself can fail. Your OTEL exporter might be dropping spans. Your observability backend might be down. Set up monitoring for your monitoring.
# Track OTEL exporter health: force_flush returns False if spans could not be exported in time
span_processor = BatchSpanProcessor(otlp_exporter)
if not span_processor.force_flush(timeout_millis=5000):
    logger.error("OTEL exporter failed to flush spans")
Step 7: Iterate
Observability is not a one-time project. As your system evolves, your observability needs will evolve too. Add new metrics, refine dashboards, tighten alerts. Make observability part of your development process.
Connecting with PADISO
Building production-grade observability for agentic AI is complex. It’s not just about wiring OTEL—it’s about understanding your system deeply, knowing what to measure, and building dashboards that actually help you operate.
At PADISO, we’ve helped dozens of Sydney and Australian startups and enterprises ship agentic AI systems that are observable, reliable, and cost-effective. We’ve learned what works and what doesn’t. We’ve debugged runaway loops, optimised tool performance, and helped teams understand their LLM costs.
If you’re building agentic AI and want to ensure you have the observability to run it safely in production, reach out to PADISO. We offer AI & Agents Automation services that include observability architecture, dashboard design, and ongoing monitoring. We’ve also worked extensively with Vanta implementation for teams who need to audit their AI systems for compliance, and we provide Platform Design & Engineering support to help you build the infrastructure that observability requires.
We also publish guides on related topics. If you’re interested in how agentic AI compares to traditional automation, read our guide on Agentic AI vs Traditional Automation. If you want to understand real production failures and how to avoid them, check out Agentic AI Production Horror Stories.
Summary
MCP observability is about making your agentic AI system transparent. By wiring OpenTelemetry through your Model Context Protocol infrastructure, you can trace every tool call, detect loops before they cost you money, and debug production issues in minutes.
The key steps:
- Instrument your agent loop with OTEL spans
- Instrument your MCP server with OTEL spans
- Propagate trace context from your agent to your MCP server
- Export traces to a backend like PostHog or Grafana
- Build dashboards for Agent Health, Tool Performance, Cost Tracking, and Loop Detection
- Set up alerts for error rates, latency, loops, and cost
- Debug using traces when something goes wrong
Start small. Pick one agent, one tool, one dashboard. Get it working. Then scale. Observability is not a one-time project—it’s an ongoing practice that gets better as your system matures.
If you’re building agentic AI in Sydney or Australia and want expert guidance on observability, compliance, or platform engineering, contact PADISO today.