MCP Observability: Tracing Tool Calls Across an Agent Loop
Master MCP observability with OpenTelemetry. Trace tool calls end-to-end, detect runaway loops, and ship production-ready agentic AI with PostHog dashboards.
Table of Contents
- Why MCP Observability Matters in Production
- What Is MCP Observability?
- OpenTelemetry and MCP: The Foundation
- Wiring OpenTelemetry Through MCP
- Tracing Tool Calls End-to-End
- Building PostHog Dashboards for Agent Visibility
- Detecting and Debugging Runaway Loops
- Real-World Implementation Patterns
- Common Pitfalls and How to Avoid Them
- Next Steps: Moving to Production
Why MCP Observability Matters in Production
Agentic AI systems are fundamentally different from traditional software. They don’t follow predetermined code paths. Instead, they reason, decide which tools to call, and iterate until they solve a problem. This autonomy is powerful—but it’s also opaque.
Without proper observability, you’re flying blind. You don’t know:
- Which tools your agent actually called and in what order
- How long each tool invocation took and whether it succeeded or failed
- What data flowed between the agent and your backend systems
- Whether your agent is stuck in an infinite loop, burning through your LLM budget
- Which user requests triggered unexpected behaviour
We’ve seen this firsthand. One of our clients deployed an agentic AI system to automate customer support escalations. Within 48 hours, the agent had entered a loop where it kept calling the same database query tool, each time with slightly different parameters, unable to satisfy its own success criteria. The bill? $47,000 in LLM costs. The visibility? Zero.
That’s where MCP observability comes in. When you wire OpenTelemetry through your Model Context Protocol infrastructure, every tool call becomes traceable. You can see the exact path your agent took, spot the loop before it costs you five figures, and debug production issues in minutes instead of days.
This guide will show you exactly how to build that visibility into your agentic AI systems—from wiring OTEL instrumentation through MCP to building PostHog dashboards that your team actually uses.
What Is MCP Observability?
MCP observability is the practice of instrumenting your Model Context Protocol servers and agent orchestration layers so that every interaction—every tool call, every LLM decision, every data flow—is captured, traced, and made queryable.
It’s not just logging. Logging tells you that something happened. Observability tells you why it happened, how long it took, and what the impact was.
The Three Pillars of MCP Observability
Traces capture the full execution path of a single agent request. A trace shows every tool call, every LLM invocation, every step in the reasoning loop. Each span within a trace represents a discrete operation—a database query, an API call, a prompt evaluation. Traces let you see the entire journey from user input to final output.
Metrics aggregate data across many traces. Instead of looking at one agent run, you look at patterns: average tool-call latency, error rates per tool, distribution of loop iterations, cost per request. Metrics are what power your dashboards and alerts.
Logs provide context and detail. When a span fails, logs explain why. When a tool returns unexpected data, logs show what that data was. Logs are your source of truth for debugging.
MCP observability ties all three together. When you query your dashboard and see that tool X has a 15% error rate, you can drill down into traces to see which specific requests failed, then look at logs to understand why. That’s the power of integrated observability.
Why MCP Specifically?
The Model Context Protocol is becoming the standard for connecting LLMs to tools and data sources. It’s used by Claude, by many open-source agent frameworks, and by enterprise orchestration platforms. But MCP servers are often black boxes. You send a tool call, you get a result back, but you don’t see what happened in between.
Wiring observability through MCP means you’re not just seeing what your agent did—you’re seeing what your entire system did, from the LLM’s perspective down to the database row that was queried. That end-to-end visibility is critical for production agentic AI.
OpenTelemetry and MCP: The Foundation
OpenTelemetry (OTEL) is an open standard for collecting traces, metrics, and logs from applications. It’s maintained by the Cloud Native Computing Foundation and is rapidly becoming the industry standard for observability instrumentation.
Why OpenTelemetry? Because it’s:
- Vendor-agnostic. You instrument your code once, and you can send traces to Jaeger, Grafana Tempo, Datadog, New Relic, or any OTEL-compatible backend.
- Language-agnostic. Whether you’re writing agents in Python, Node.js, Go, or Rust, there’s an OTEL SDK.
- Designed for distributed systems. OTEL propagates trace context across service boundaries, so you can follow a request from your agent orchestrator through your MCP server and into your database.
How OpenTelemetry Works
At its core, OpenTelemetry uses three concepts:
- Tracer: An object that creates spans.
- Span: A unit of work. A span has a name, a start time, an end time, attributes, and events.
- Exporter: A component that sends spans to a backend (Jaeger, Tempo, PostHog, etc.).
When your agent calls a tool, you create a span. When that tool queries a database, you create a child span. When the database returns, you close the child span. When the tool returns, you close the parent span. The result is a tree of operations that shows exactly what happened.
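Here is a minimal sketch of those three concepts working together, using the same tracer setup we wire up later in this guide:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("tool.call") as parent:
    parent.set_attribute("tool.name", "query_database")
    with tracer.start_as_current_span("database.query") as child:
        child.set_attribute("db.statement", "SELECT 1")
        # work happens here; exiting each block ends that span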
Trace Context Propagation
One of OTEL’s superpowers is trace context propagation. When your agent calls an MCP server, it includes trace context in the request (usually in HTTP headers). The MCP server reads that context and creates child spans under the same trace. This way, a single trace can span from your agent orchestration layer all the way down to your database.
Without trace context propagation, you’d have separate traces for the agent, the MCP server, and the database. You’d have to manually correlate them. With propagation, they’re automatically linked into a single, coherent story.
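Concretely, propagation is just metadata on the request. With OTEL's default W3C propagator, inject() writes a traceparent header that encodes the trace ID and parent span ID; the value shown below is illustrative:
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # mutates the dict in place when a span is active
# headers now looks something like:
# {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
# format: version-<trace id>-<parent span id>-<trace flags>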
Wiring OpenTelemetry Through MCP
Now let’s get concrete. Here’s how to instrument your MCP servers and agent orchestration with OpenTelemetry.
Step 1: Install OTEL Dependencies
For Python (the most common language for agentic AI):
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
If you’re using a specific MCP framework, there may be pre-built instrumentation. For example, if you’re using the traceloop/opentelemetry-mcp-server package, it provides automatic instrumentation for MCP servers with support for exporting to Jaeger, Grafana Tempo, and other OTEL backends.
Step 2: Initialize the OTEL SDK
In your agent orchestration code, initialize the tracer:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Create a tracer provider
tracer_provider = TracerProvider()
# Set up OTLP exporter (sends traces to your backend)
otlp_exporter = OTLPSpanExporter(
endpoint="localhost:4317", # or your observability backend
)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
# Set the global tracer provider
trace.set_tracer_provider(tracer_provider)
# Get a tracer
tracer = trace.get_tracer(__name__)
This initializes OTEL to export traces via the OTLP protocol (OpenTelemetry Protocol). You can point this at a local Jaeger instance for development, or at a managed service like Grafana Cloud for production.
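For quick local debugging, you can also print spans straight to stdout before any backend is involved. A handy sanity check (development only):
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print every span to stdout as it ends; far too noisy for production
tracer_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))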
Step 3: Instrument Your Agent Loop
Now, wrap your agent’s main loop in a span:
def run_agent(user_query: str) -> str:
with tracer.start_as_current_span("agent.run") as span:
span.set_attribute("user.query", user_query)
span.set_attribute("agent.model", "claude-3-5-sonnet")
messages = [{"role": "user", "content": user_query}]
while True:
# Call the LLM
with tracer.start_as_current_span("llm.invoke") as llm_span:
response = client.messages.create(
model="claude-3-5-sonnet",
max_tokens=1024,
tools=tools,
messages=messages,
)
llm_span.set_attribute("llm.stop_reason", response.stop_reason)
llm_span.set_attribute("llm.input_tokens", response.usage.input_tokens)
llm_span.set_attribute("llm.output_tokens", response.usage.output_tokens)
# Check if the agent wants to call a tool
if response.stop_reason == "end_turn":
return response.content[0].text
            # Process tool calls: record the assistant turn once, then run each tool
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for content_block in response.content:
                if content_block.type == "tool_use":
                    tool_name = content_block.name
                    tool_input = content_block.input
                    with tracer.start_as_current_span("tool.call") as tool_span:
                        tool_span.set_attribute("tool.name", tool_name)
                        tool_span.set_attribute("tool.input", str(tool_input))
                        # Call the tool (via MCP)
                        result = call_tool(tool_name, tool_input)
                        tool_span.set_attribute("tool.result", str(result))
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": content_block.id,
                        "content": str(result),
                    })
            # Return all tool results to the model in a single user turn
            messages.append({"role": "user", "content": tool_results})
Notice how we’re creating nested spans: an outer agent.run span, and within it, llm.invoke and tool.call spans. Each span has attributes that describe what happened.
Step 4: Instrument Your MCP Server
On the MCP server side, you want to capture what happens inside each tool. Here’s a simple example:
from mcp.server import Server
from mcp.types import Tool
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
server = Server("my-mcp-server")
@server.list_tools()
async def list_tools() -> list[Tool]:
return [
Tool(
name="query_database",
description="Query the customer database",
inputSchema={
"type": "object",
"properties": {
"query": {"type": "string"},
},
},
),
]
@server.call_tool()
async def call_tool(name: str, arguments: dict) -> str:
with tracer.start_as_current_span(f"mcp.tool.{name}") as span:
span.set_attribute("mcp.tool.name", name)
span.set_attribute("mcp.tool.arguments", str(arguments))
if name == "query_database":
with tracer.start_as_current_span("database.query") as db_span:
query = arguments["query"]
db_span.set_attribute("db.statement", query)
# Execute the query
result = execute_query(query)
db_span.set_attribute("db.rows_returned", len(result))
return str(result)
else:
raise ValueError(f"Unknown tool: {name}")
Now, when your agent calls the query_database tool via MCP, the entire chain is traced: agent → LLM → tool call → MCP server → database query. Each step is a span, and they’re all linked together.
Step 5: Propagate Trace Context
For this to work end-to-end, you need to propagate trace context from your agent to your MCP server. If you’re using HTTP to call the MCP server, this is straightforward:
from opentelemetry.propagate import inject
import httpx
def call_tool(tool_name: str, tool_input: dict) -> str:
# Prepare headers with trace context
headers = {}
inject(headers)
# Call the MCP server
response = httpx.post(
"http://mcp-server:8000/call_tool",
json={"name": tool_name, "arguments": tool_input},
headers=headers,
)
return response.json()["result"]
On the MCP server side, extract the trace context:
from opentelemetry import context as otel_context
from opentelemetry.propagate import extract
from fastapi import Request

@app.post("/call_tool")
async def call_tool_endpoint(request: Request, body: dict):
    # Extract trace context from the incoming headers and attach it,
    # so any spans created here become children of the caller's trace
    ctx = extract(request.headers)
    token = otel_context.attach(ctx)
    try:
        return await call_tool(body["name"], body["arguments"])
    finally:
        otel_context.detach(token)
This ensures that the trace context flows from your agent, through your HTTP request, and into your MCP server. All spans are part of the same trace.
Tracing Tool Calls End-to-End
Let’s zoom out and look at what end-to-end tracing actually gives you.
Anatomy of a Complete Trace
When a user submits a query to your agent, here’s what a complete trace looks like:
agent.run (1200ms)
├── llm.invoke (450ms)
│ └── api_call: claude.anthropic.com (450ms)
├── tool.call: query_database (600ms)
│ └── mcp.tool.query_database (590ms)
│ └── database.query (580ms)
│ └── postgres.execute (575ms)
├── llm.invoke (100ms)
│ └── api_call: claude.anthropic.com (100ms)
└── tool.call: format_result (50ms)
└── mcp.tool.format_result (45ms)
Each span has:
- A name (what operation is this?)
- A duration (how long did it take?)
- Attributes (metadata about the operation)
- Events (discrete things that happened during the span)
- Status (success, error, or unknown)
When you look at this trace in your observability backend, you can immediately see:
- The agent took 1.2 seconds total
- The LLM was called twice (450ms + 100ms = 550ms of LLM time)
- The database query took 580ms, which was the bottleneck
- The tool calls succeeded (no error status)
Attributes That Matter
When instrumenting tool calls, capture these attributes:
- Tool name and version: tool.name, tool.version
- Input and output: tool.input, tool.output (be careful with PII)
- Latency: automatically captured by OTEL (span duration)
- Success or failure: tool.status, tool.error_message
- Resource usage: tool.tokens_used (for LLM tools), tool.rows_returned (for database tools)
- User context: user.id, user.request_id (for correlating logs and traces)
Here’s a more complete example:
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("tool.call") as span:
    span.set_attribute("tool.name", tool_name)
    span.set_attribute("tool.version", "1.0.0")
    span.set_attribute("user.id", user_id)
    span.set_attribute("request.id", request_id)
    try:
        result = call_tool(tool_name, tool_input)
        span.set_attribute("tool.status", "success")
        span.set_attribute("tool.output_length", len(str(result)))
    except Exception as e:
        span.set_attribute("tool.status", "error")
        span.set_attribute("tool.error_type", type(e).__name__)
        span.set_attribute("tool.error_message", str(e))
        span.record_exception(e)
        # Mark the span itself as failed so backends can filter on span status
        span.set_status(Status(StatusCode.ERROR, str(e)))
        raise
Querying Traces
Once your traces are flowing to your observability backend (Jaeger, Grafana Tempo, Datadog, etc.), you can query them:
- Find all traces for user X: attributes.user.id = "X"
- Find all traces where a tool failed: attributes.tool.status = "error"
- Find all traces where the database query took > 1 second: mcp.tool.query_database.duration > 1000ms
- Find all traces where the LLM was called more than 3 times: count(llm.invoke) > 3
This is where observability becomes a superpower. You’re not just looking at logs or metrics in isolation—you’re reconstructing the exact execution path of a single request.
Building PostHog Dashboards for Agent Visibility
Traces are great for debugging individual requests, but you also need visibility into patterns. That’s where metrics and dashboards come in.
PostHog is an open-source product analytics platform that integrates well with OpenTelemetry. Here’s how to build dashboards that your team will actually use.
Setting Up OTEL → PostHog
PostHog can ingest OpenTelemetry data. Point your OTLP exporter at PostHog; the endpoint and auth header below are illustrative, so confirm them against PostHog's current OTEL documentation:
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

otlp_exporter = OTLPSpanExporter(
    endpoint="https://app.posthog.com/batch",  # illustrative; check PostHog's docs
    headers={
        "Authorization": f"Bearer {POSTHOG_API_KEY}",
    },
)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
PostHog will ingest your traces and automatically create a queryable trace database.
Dashboard 1: Agent Health
Your first dashboard should answer: “Is my agent working?”
Metrics to track:
- Requests per minute: Count of agent.run spans
- Error rate: Percentage of agent.run spans with status = error
- Average loop iterations: Count of llm.invoke spans per agent.run span
- P95 latency: 95th percentile duration of agent.run spans
Query in PostHog:
select
count(*) as requests,
sum(case when status = 'error' then 1 else 0 end) * 100.0 / count(*) as error_rate_pct,
avg(duration) as avg_latency_ms,
percentile(duration, 0.95) as p95_latency_ms
from spans
where span_name = 'agent.run'
and timestamp > now() - interval '1 hour'
Dashboard 2: Tool Performance
Which tools are slow? Which are failing?
Metrics to track:
- Tool call count: Breakdown by tool name
- Tool error rate: Percentage of tool.call spans with status = error, by tool
- Tool latency: P50, P95, P99 duration by tool
- Tool usage trend: Tool call count over time
Query in PostHog:
select
attributes['tool.name'] as tool_name,
count(*) as calls,
sum(case when status = 'error' then 1 else 0 end) * 100.0 / count(*) as error_rate_pct,
avg(duration) as avg_latency_ms,
percentile(duration, 0.95) as p95_latency_ms
from spans
where span_name = 'tool.call'
and timestamp > now() - interval '24 hours'
group by attributes['tool.name']
order by calls desc
Dashboard 3: Cost Tracking
For agentic AI, LLM costs can spiral. Track them in real time.
Metrics to track:
- Tokens per request: Input + output tokens per agent.run span
- Cost per request: Calculated from token counts and model pricing
- Cost trend: Total cost over time
- Cost by user: Identify power users or cost anomalies
Query in PostHog:
select
date_trunc('hour', timestamp) as hour,
sum(attributes['llm.input_tokens']::int + attributes['llm.output_tokens']::int) as total_tokens,
sum((attributes['llm.input_tokens']::int * 3.0 + attributes['llm.output_tokens']::int * 15.0) / 1000000) as cost_usd
from spans
where span_name = 'llm.invoke'
and timestamp > now() - interval '7 days'
group by date_trunc('hour', timestamp)
order by hour desc
(Prices are for Claude 3.5 Sonnet; adjust for your model.)
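To keep that arithmetic out of your SQL, you can centralise pricing in code and attach the computed cost as a span attribute instead. A sketch, with illustrative per-million-token prices (verify against your provider's current pricing):
# USD per million tokens; illustrative values, check current pricing before relying on them
MODEL_PRICES = {
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = MODEL_PRICES[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000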
Dashboard 4: Loop Detection
This is the critical one. You want to catch runaway loops before they cost you money.
Metrics to track:
- Loop iterations per request: Count of llm.invoke spans per agent.run span
- Max iterations: Highest loop count in the last hour
- Requests with > N iterations: Alert if any request loops more than your threshold
Query in PostHog:
select
trace_id,
count(*) as llm_invocations,
max(duration) as max_span_duration_ms
from spans
where span_name = 'llm.invoke'
and timestamp > now() - interval '1 hour'
group by trace_id
having count(*) > 5 -- Alert if more than 5 LLM calls per trace
order by llm_invocations desc
When this query returns results, you’ve got a problem. Investigate immediately.
Setting Up Alerts
PostHog supports alerts on metric queries. Set these up:
- Error rate > 5% on agent.run spans
- P95 latency > 30 seconds on agent.run spans
- Tool error rate > 10% for any tool
These alerts should go to Slack or PagerDuty so your team gets notified immediately when something goes wrong.
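If a case falls outside your backend's built-in alerting, a small scheduled check can fill the gap. A minimal sketch, assuming a hypothetical run_spans_query() helper that executes the loop-detection query above, plus a Slack incoming-webhook URL:
import httpx

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook

def check_for_loops() -> None:
    # run_spans_query is assumed to return the rows from the loop-detection query
    for row in run_spans_query("loop_detection.sql"):
        httpx.post(SLACK_WEBHOOK_URL, json={
            "text": f"Possible agent loop: trace {row['trace_id']} made "
                    f"{row['llm_invocations']} LLM calls in the last hour",
        })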
Detecting and Debugging Runaway Loops
Runaway loops are the most dangerous failure mode for agentic AI. Your agent gets stuck calling the same tool over and over, each time burning LLM tokens and potentially calling expensive APIs.
With proper observability, you can detect and debug these in minutes instead of after they’ve cost you thousands.
Detecting Loops in Real Time
Your loop detection dashboard (from above) is the first line of defence. But you can be more proactive.
Add a check in your agent loop:
def run_agent(user_query: str) -> str:
with tracer.start_as_current_span("agent.run") as span:
span.set_attribute("user.query", user_query)
messages = [{"role": "user", "content": user_query}]
iterations = 0
max_iterations = 10 # Safety limit
while iterations < max_iterations:
iterations += 1
span.set_attribute("agent.iterations", iterations)
# Call the LLM
with tracer.start_as_current_span("llm.invoke") as llm_span:
llm_span.set_attribute("iteration", iterations)
response = client.messages.create(...)
if response.stop_reason == "end_turn":
span.set_attribute("agent.status", "success")
return response.content[0].text
# Process tool calls...
# If we hit max_iterations, we're in a loop
span.set_attribute("agent.status", "loop_detected")
span.add_event("Max iterations reached")
raise RuntimeError(f"Agent loop detected after {max_iterations} iterations")
Now, when an agent hits the iteration limit, it immediately shows up in your traces with agent.status = "loop_detected". Your dashboard can alert on this.
Debugging: Trace Analysis
When you detect a loop, here’s how to debug it:
- Find the trace in your observability backend (by trace ID, user ID, or timestamp)
- Look at the tool calls: Which tool is being called repeatedly?
- Look at the tool inputs: Are they the same each time, or slightly different?
- Look at the tool outputs: Are they consistent, or does the tool return different results?
- Look at the LLM prompts: What is the agent asking for?
Common loop patterns:
- Same tool, same input, same output: The agent is stuck in a decision loop. It’s calling a tool, getting the same result, and not understanding that it already has the answer. Fix: Improve the tool’s output format or the agent’s instructions.
- Same tool, slightly different inputs: The agent is trying different variations to get a different result. Fix: The tool is broken or the agent’s expectation is wrong. Add error handling in the tool or clarify the agent’s instructions.
- Different tools, same overall pattern: The agent is trying multiple tools to accomplish the same goal and failing with each one. Fix: The agent doesn’t have the right tool for the job, or the tools are broken. Add a tool or fix the existing ones.
Here’s how to capture enough detail in your traces to debug these:
with tracer.start_as_current_span("tool.call") as tool_span:
tool_span.set_attribute("tool.name", tool_name)
tool_span.set_attribute("tool.input", json.dumps(tool_input))
tool_span.set_attribute("tool.input_hash", hashlib.md5(json.dumps(tool_input).encode()).hexdigest())
result = call_tool(tool_name, tool_input)
tool_span.set_attribute("tool.output", json.dumps(result))
tool_span.set_attribute("tool.output_hash", hashlib.md5(json.dumps(result).encode()).hexdigest())
By capturing input and output hashes, you can quickly spot when the same input is being called repeatedly (same hash) or when the agent is trying variations (different hashes).
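The same idea works inside the loop itself, so the agent fails fast instead of waiting for a dashboard: if it issues an identical tool call twice in one run, something is wrong. A minimal sketch using the same hashing:
import hashlib
import json

def make_repeat_checker():
    # Create one checker per agent run so state resets between requests
    seen_calls: set[str] = set()

    def check(tool_name: str, tool_input: dict) -> None:
        key = tool_name + ":" + hashlib.md5(
            json.dumps(tool_input, sort_keys=True).encode()
        ).hexdigest()
        if key in seen_calls:
            raise RuntimeError(f"Repeated identical call to {tool_name}; likely a loop")
        seen_calls.add(key)

    return check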
Cost Control
While you’re debugging, you want to prevent further damage. Implement cost controls:
def run_agent(user_query: str, max_cost_usd: float = 1.0) -> str:
with tracer.start_as_current_span("agent.run") as span:
span.set_attribute("user.query", user_query)
span.set_attribute("max_cost_usd", max_cost_usd)
total_cost = 0.0
messages = [{"role": "user", "content": user_query}]
while True:
# Check cost before calling LLM
if total_cost > max_cost_usd:
span.set_attribute("agent.status", "cost_limit_exceeded")
raise RuntimeError(f"Cost limit exceeded: ${total_cost:.2f} > ${max_cost_usd:.2f}")
# Call the LLM
with tracer.start_as_current_span("llm.invoke") as llm_span:
response = client.messages.create(...)
input_tokens = response.usage.input_tokens
output_tokens = response.usage.output_tokens
                # Calculate cost (Claude 3.5 Sonnet pricing: $3/M input, $15/M output tokens)
                cost = (input_tokens * 3 + output_tokens * 15) / 1_000_000
total_cost += cost
llm_span.set_attribute("llm.cost_usd", cost)
llm_span.set_attribute("agent.total_cost_usd", total_cost)
if response.stop_reason == "end_turn":
span.set_attribute("agent.status", "success")
span.set_attribute("agent.total_cost_usd", total_cost)
return response.content[0].text
# Process tool calls...
Now, if an agent enters a loop and starts burning through your budget, it’ll hit the cost limit and stop. You can set different limits for different users or request types.
Real-World Implementation Patterns
Let’s look at how this works in practice, with patterns we’ve seen at PADISO clients.
Pattern 1: Multi-Agent Orchestration
When you have multiple agents working together, trace context propagation becomes critical. Each agent should create a span, and those spans should be linked.
def orchestrate_agents(user_query: str) -> str:
with tracer.start_as_current_span("orchestration.run") as orch_span:
orch_span.set_attribute("user.query", user_query)
# Route to the appropriate agent
if "sales" in user_query:
with tracer.start_as_current_span("agent.sales") as agent_span:
result = sales_agent.run(user_query)
agent_span.set_attribute("agent.result", result)
elif "support" in user_query:
with tracer.start_as_current_span("agent.support") as agent_span:
result = support_agent.run(user_query)
agent_span.set_attribute("agent.result", result)
else:
with tracer.start_as_current_span("agent.general") as agent_span:
result = general_agent.run(user_query)
agent_span.set_attribute("agent.result", result)
return result
Now you can see which agent handled which request, and how long each took.
Pattern 2: Tool Caching
If your agent calls the same tool multiple times with the same input, you can cache the result. But you want to track this in your traces.
tool_cache = {} # In production, use Redis or similar
def call_tool(tool_name: str, tool_input: dict) -> str:
input_key = hashlib.md5(json.dumps(tool_input, sort_keys=True).encode()).hexdigest()
cache_key = f"{tool_name}:{input_key}"
with tracer.start_as_current_span("tool.call") as tool_span:
tool_span.set_attribute("tool.name", tool_name)
tool_span.set_attribute("tool.input_hash", input_key)
if cache_key in tool_cache:
tool_span.set_attribute("tool.cache_hit", True)
result = tool_cache[cache_key]
else:
tool_span.set_attribute("tool.cache_hit", False)
result = execute_tool(tool_name, tool_input)
tool_cache[cache_key] = result
tool_span.set_attribute("tool.output", str(result))
return result
Now your dashboard can show cache hit rate, which helps you understand whether your agent is repeating itself.
Pattern 3: Async Tool Execution
If you have tools that take a long time (API calls, database queries), you might want to execute them asynchronously. Trace context propagation is important here too.
import asyncio
from opentelemetry import context as otel_context

async def call_tool_async(tool_name: str, tool_input: dict) -> str:
    # Capture the current trace context before handing off to a background task
    ctx = otel_context.get_current()
    # Execute the tool in a background task
    task = asyncio.create_task(execute_tool_in_background(tool_name, tool_input, ctx))
    # Wait for the result
    return await task

async def execute_tool_in_background(tool_name: str, tool_input: dict, ctx):
    # Attach the captured context so spans here join the caller's trace
    token = otel_context.attach(ctx)
    try:
        with tracer.start_as_current_span(f"tool.{tool_name}"):
            return await execute_tool(tool_name, tool_input)
    finally:
        otel_context.detach(token)
This ensures that background tasks are part of the same trace as the main agent loop.
Pattern 4: Structured Logging with Trace Context
Logs are still valuable. Include trace context in every log so you can correlate logs with traces.
import logging
from opentelemetry.trace import get_current_span
class TraceContextFilter(logging.Filter):
    def filter(self, record):
        ctx = get_current_span().get_span_context()
        # Format the IDs as hex so they match what your tracing backend displays
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True
logger = logging.getLogger(__name__)
logger.addFilter(TraceContextFilter())
# Now every log includes trace_id and span_id
logger.info("Tool execution started", extra={
"tool_name": tool_name,
"tool_input": tool_input,
})
When you’re debugging an issue, you can grep logs by trace ID and see everything that happened in that request.
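For the IDs to actually appear in your log lines, the formatter has to reference them. A minimal sketch that pairs with the filter above:
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"
))
logger.addHandler(handler)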
Common Pitfalls and How to Avoid Them
Pitfall 1: Not Propagating Trace Context
Problem: You have traces from your agent, and separate traces from your MCP server, but they’re not linked.
Solution: Always propagate trace context when making HTTP requests. Use the inject() and extract() functions from OpenTelemetry.
Pitfall 2: Logging PII in Traces
Problem: Your tool inputs or outputs contain personally identifiable information (PII), and you’re logging them in traces. Now you’ve got PII in your observability backend.
Solution: Sanitize sensitive data before logging. Use hashing or truncation.
import hashlib
def sanitize_email(email: str) -> str:
return hashlib.sha256(email.encode()).hexdigest()[:8]
tool_span.set_attribute("user.email_hash", sanitize_email(tool_input["email"]))
Pitfall 3: Not Setting a Max Iteration Limit
Problem: Your agent enters a loop and burns through your entire budget before you notice.
Solution: Always set a max iteration limit in your agent loop, and a max cost limit. Make them configurable per request type.
Pitfall 4: Ignoring Tool Errors
Problem: A tool fails silently, returning an error message as a string. Your agent doesn’t understand that the tool failed, and keeps calling it.
Solution: Make tool errors explicit. Use exceptions, not error strings.
# Bad
result = execute_query(query)
if "error" in result:
return result # Agent doesn't know this is an error
# Good
try:
result = execute_query(query)
except DatabaseError as e:
span.record_exception(e)
raise
Pitfall 5: Not Instrumenting Downstream Systems
Problem: You have traces from your agent and your MCP server, but not from your database. You know a query took 5 seconds, but you don’t know why.
Solution: Instrument your entire stack. If you’re using PostgreSQL, use an OTEL instrumentation library. Same for Redis, Elasticsearch, etc.
For PostgreSQL with Python:
pip install opentelemetry-instrumentation-psycopg2
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
Psycopg2Instrumentor().instrument()
Now every database query is automatically traced.
Next Steps: Moving to Production
You’ve built observability into your agentic AI system. Now, how do you deploy it safely?
Step 1: Start with a Single Agent
Don’t try to instrument your entire system at once. Pick one agent, one MCP server, one tool. Get observability working end-to-end. Once you’re confident, scale.
Step 2: Set Up Your Observability Backend
You have options:
- Local development: Use Jaeger in Docker. It’s free, open-source, and runs locally.
- Staging: Use Grafana Cloud or Datadog. You want a managed service so you don’t have to operate it.
- Production: Use a managed service. PostHog, Datadog, New Relic, Grafana Cloud—all support OTEL and are production-ready.
For many of our clients at PADISO, PostHog is the sweet spot: it’s open-source, it has a generous free tier, and it integrates seamlessly with OTEL.
Step 3: Build Your Dashboards
Start with the four dashboards we outlined: Agent Health, Tool Performance, Cost Tracking, and Loop Detection. Build them incrementally. Don’t try to build the perfect dashboard on day one.
Step 4: Set Up Alerts
Once your dashboards are working, add alerts. Start conservative (high thresholds) and tighten them as you understand your system’s normal behaviour.
Step 5: Document Your Observability Strategy
Write down:
- What you’re tracing and why
- What your dashboards show
- How to debug common issues
- Who gets alerted when things go wrong
This documentation is invaluable when you’re on-call at 3 AM.
Step 6: Monitor Your Monitoring
Your observability system itself can fail. Your OTEL exporter might be dropping spans. Your observability backend might be down. Set up monitoring for your monitoring.
# Track OTEL exporter health: force_flush returns False if spans could not be exported in time
span_processor = BatchSpanProcessor(otlp_exporter)
if not span_processor.force_flush(timeout_millis=5000):
    logger.error("OTEL exporter failed to flush spans")
Step 7: Iterate
Observability is not a one-time project. As your system evolves, your observability needs will evolve too. Add new metrics, refine dashboards, tighten alerts. Make observability part of your development process.
Connecting with PADISO
Building production-grade observability for agentic AI is complex. It’s not just about wiring OTEL—it’s about understanding your system deeply, knowing what to measure, and building dashboards that actually help you operate.
At PADISO, we’ve helped dozens of Sydney and Australian startups and enterprises ship agentic AI systems that are observable, reliable, and cost-effective. We’ve learned what works and what doesn’t. We’ve debugged runaway loops, optimised tool performance, and helped teams understand their LLM costs.
If you’re building agentic AI and want to ensure you have the observability to run it safely in production, reach out to PADISO. We offer AI & Agents Automation services that include observability architecture, dashboard design, and ongoing monitoring. We’ve also worked extensively with Vanta implementation for teams who need to audit their AI systems for compliance, and we provide Platform Design & Engineering support to help you build the infrastructure that observability requires.
We also publish guides on related topics. If you’re interested in how agentic AI compares to traditional automation, read our guide on Agentic AI vs Traditional Automation. If you want to understand real production failures and how to avoid them, check out Agentic AI Production Horror Stories.
Summary
MCP observability is about making your agentic AI system transparent. By wiring OpenTelemetry through your Model Context Protocol infrastructure, you can trace every tool call, detect loops before they cost you money, and debug production issues in minutes.
The key steps:
- Instrument your agent loop with OTEL spans
- Instrument your MCP server with OTEL spans
- Propagate trace context from your agent to your MCP server
- Export traces to a backend like PostHog or Grafana
- Build dashboards for Agent Health, Tool Performance, Cost Tracking, and Loop Detection
- Set up alerts for error rates, latency, loops, and cost
- Debug using traces when something goes wrong
Start small. Pick one agent, one tool, one dashboard. Get it working. Then scale. Observability is not a one-time project—it’s an ongoing practice that gets better as your system matures.
If you’re building agentic AI in Sydney or Australia and want expert guidance on observability, compliance, or platform engineering, contact PADISO today.