
Claude in Production: Observability

Production observability patterns for Claude deployments. Covers architecture, instrumentation, failure scenarios, and cost control with code examples.

The PADISO Team ·2026-06-04

Deploying Claude to production requires more than just an API key and a prompt. You need visibility into what’s happening—latency, cost, errors, and token usage—so you can debug failures, optimise spend, and prove reliability to your stakeholders.

This guide walks you through production observability patterns for Claude deployments, from instrumentation to failure detection to cost control. We’ll cover reference architectures, code examples, and the specific failure scenarios observability prevents.


Why Observability Matters for Claude in Production {#why-observability-matters}

Claude is a black box from your application’s perspective. You send a prompt, tokens are consumed, and a response comes back. But in production, you need answers to concrete questions:

  • How long did that request take? Was it the API latency, or your prompt engineering?
  • How much did that cost? Input tokens, output tokens, and cache hits all affect billing.
  • Why did that fail? Rate limits, context window overflows, or a bad prompt?
  • Is Claude behaving consistently? Are outputs degrading over time?
  • What’s the cost per transaction? Can you afford to run this at scale?

Without observability, you’re flying blind. Your users hit a timeout, and you have no idea whether Claude is slow, your network is slow, or your application is stuck in a loop. Your finance team asks about LLM spend, and you can’t tell them. A feature works in staging but fails in production, and you have no traces to debug it.

Observability isn’t optional for production Claude deployments. It’s the foundation of reliability, cost control, and operational confidence.

The Cost Problem

Claude’s pricing is transparent but complex. Input tokens cost less than output tokens. Prompt caching reduces costs for repeated prompts. Different models have different rates. Without instrumentation, you can’t see where your budget is going or whether a feature is economically viable.

Consider a customer support agent that uses Claude to draft responses. If you’re not tracking tokens per request, you won’t notice when a poorly-tuned prompt starts generating 10,000 output tokens per response instead of 500. By the time you see the bill, you’ve spent thousands on a broken feature.
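
To make the scale concrete, here is a rough back-of-the-envelope sketch. The output rate is the Claude 3.5 Sonnet figure used later in this guide ($0.015 per 1,000 output tokens). The volume of 10,000 requests per day is an assumption chosen purely for illustration.

```python
# Rough daily spend on output tokens alone.
# OUTPUT_RATE_PER_1K is the Claude 3.5 Sonnet output rate used elsewhere in
# this guide; REQUESTS_PER_DAY is an assumed volume, not a measured one.
OUTPUT_RATE_PER_1K = 0.015
REQUESTS_PER_DAY = 10_000

def daily_output_cost(tokens_per_response: int) -> float:
    """Estimated USD per day spent on output tokens."""
    return tokens_per_response / 1000 * OUTPUT_RATE_PER_1K * REQUESTS_PER_DAY

healthy = daily_output_cost(500)
runaway = daily_output_cost(10_000)
print(f"healthy: ${healthy:,.2f}/day, runaway: ${runaway:,.2f}/day")
```

Under these assumptions the broken prompt costs about $1,500 a day instead of $75, a 20x increase that only per-request token tracking would surface early.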

The Reliability Problem

Claude requests fail for reasons you need to catch and handle:

  • Rate limiting: You’ve hit API quotas and need to back off.
  • Context window overflow: Your prompt + context is too large.
  • API errors: Transient or permanent issues on Anthropic’s side.
  • Network timeouts: Your application can’t reach the API.
  • Invalid requests: Your prompt or parameters are malformed.

If you’re not logging and tracing these failures, you’ll spend hours debugging production incidents. If you’re not alerting on them, your users will find out about problems before you do.

The Compliance Problem

If you’re pursuing SOC 2 compliance via Vanta or ISO 27001 certification, you’ll need audit trails for all external API calls. Which requests went to Claude? What were the inputs? When did they fail? Observability isn’t just operational—it’s a compliance requirement.
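
As a minimal sketch of what one audit entry could look like, the snippet below records who called which model, when, and with what outcome, while storing the prompt only as a digest. The field names and the `build_audit_record` helper are assumptions for illustration, not a schema mandated by SOC 2 or ISO 27001.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def build_audit_record(user_id: str, model: str, prompt: str, outcome: str) -> dict:
    """Build one audit entry; the prompt is kept only as a SHA-256 digest."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_length": len(prompt),
        "outcome": outcome,  # e.g. "success", "rate_limit", "timeout"
    }

record = build_audit_record(
    "user-123", "claude-3-5-sonnet-20241022", "Summarise this ticket", "success"
)
print(json.dumps(record))
```

A plain hash of a short or predictable prompt can be guessed by brute force, so a keyed hash (HMAC) may be a better fit where prompts are sensitive.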


Observability Architecture and Patterns {#observability-architecture}

Production observability for Claude follows a standard architecture: instrumentation → collection → storage → querying.

Reference Architecture

┌─────────────────────────────────────────────────────────┐
│  Your Application                                       │
│  ┌──────────────────────────────────────────────────┐  │
│  │ Claude API Calls                                 │  │
│  │ (instrumented with OpenTelemetry)               │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────┬────────────────────────────────────┘

                     │ Traces, logs, metrics

         ┌───────────▼────────────────┐
         │  Collector (e.g., Otel     │
         │  Collector, Datadog Agent) │
         └───────────┬────────────────┘

         ┌───────────▼────────────────────────────────────┐
         │  Backend (Datadog, Honeycomb, New Relic, etc.) │
         │  - Trace storage                               │
         │  - Metrics aggregation                         │
         │  - Log indexing                                │
         │  - Alerting                                    │
         └───────────┬────────────────────────────────────┘

         ┌───────────▼────────────────┐
         │  Dashboards & Alerts       │
         │  - Latency percentiles     │
         │  - Error rates             │
         │  - Token usage & cost      │
         │  - SLO tracking            │
         └────────────────────────────┘

The key principle: instrument at the boundary where you call Claude. Capture the request, response, latency, tokens, and any errors. Send that data to a backend that can aggregate it, alert on it, and let you query it.

Why OpenTelemetry?

OpenTelemetry is the industry standard for observability instrumentation. It’s vendor-neutral, well-maintained, and supported by every major observability platform. Using OpenTelemetry means you can switch backends (Datadog to Honeycomb, for example) without rewriting your instrumentation code.

Anthropic provides official guidance on monitoring Claude applications, and the recommended pattern is to use OpenTelemetry for tracing and metrics.


Instrumentation with OpenTelemetry {#instrumentation-opentelemetry}

Let’s build a concrete example. We’ll instrument a Python application that calls Claude, capture traces and metrics, and send them to a backend.

Setup

First, install the required packages:

pip install anthropic opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

Basic Instrumentation

Here’s a minimal example that wraps Claude calls with OpenTelemetry tracing:

import os
from anthropic import Anthropic
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
import time

# Configure tracing
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(trace_provider)

# Configure metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317")
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

# Get tracer and meter
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create metrics
token_counter = meter.create_counter(
    name="claude.tokens.used",
    description="Total tokens consumed (input + output)",
    unit="1",
)
request_duration = meter.create_histogram(
    name="claude.request.duration",
    description="Duration of Claude API requests in seconds",
    unit="s",
)
error_counter = meter.create_counter(
    name="claude.errors",
    description="Count of Claude API errors",
    unit="1",
)

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def call_claude(prompt: str, model: str = "claude-3-5-sonnet-20241022") -> str:
    """Call Claude with full observability instrumentation."""
    
    with tracer.start_as_current_span("claude.call") as span:
        span.set_attribute("model", model)
        span.set_attribute("prompt_length", len(prompt))
        
        start_time = time.time()
        
        try:
            response = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            
            duration = time.time() - start_time
            
            # Record metrics
            input_tokens = response.usage.input_tokens
            output_tokens = response.usage.output_tokens
            
            # Record input and output as separate series; sum over "type" for the
            # total. A third "total" series would double-count when aggregated.
            token_counter.add(
                input_tokens,
                attributes={
                    "model": model,
                    "type": "input"
                }
            )
            token_counter.add(
                output_tokens,
                attributes={
                    "model": model,
                    "type": "output"
                }
            )
            
            request_duration.record(
                duration,
                attributes={"model": model, "status": "success"}
            )
            
            # Set span attributes for tracing
            span.set_attribute("input_tokens", input_tokens)
            span.set_attribute("output_tokens", output_tokens)
            span.set_attribute("duration_seconds", duration)
            span.set_attribute("status", "success")
            
            return response.content[0].text
            
        except Exception as e:
            duration = time.time() - start_time
            
            # Record error metrics
            error_counter.add(
                1,
                attributes={
                    "model": model,
                    "error_type": type(e).__name__
                }
            )
            
            request_duration.record(
                duration,
                attributes={"model": model, "status": "error"}
            )
            
            # Set error attributes on span
            span.set_attribute("status", "error")
            span.set_attribute("error_type", type(e).__name__)
            span.set_attribute("error_message", str(e))
            span.set_attribute("duration_seconds", duration)
            
            raise

# Example usage
if __name__ == "__main__":
    try:
        response = call_claude("What is observability in production systems?")
        print(response)
    except Exception as e:
        print(f"Error: {e}")

This example:

  • Creates a tracer and meter using OpenTelemetry.
  • Defines metrics for tokens, latency, and errors.
  • Wraps Claude calls in a span that captures request details.
  • Records token counts broken down by input/output.
  • Measures latency and records it with success/error status.
  • Captures errors with type and message.
  • Sends data to an OTLP collector (default: localhost:4317).

Production Deployment

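In production, you’ll typically use a managed observability backend. For Datadog, one option is to send OTLP to a Datadog Agent that has OTLP ingestion enabled (the older opentelemetry-exporter-datadog package is deprecated):

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Service metadata travels as resource attributes
resource = Resource.create({
    "service.name": "my-claude-app",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

# Assumes a local Datadog Agent with OTLP gRPC ingestion enabled on port 4317
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")

trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(trace_provider)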

Or for Honeycomb:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

otlp_exporter = OTLPSpanExporter(
    endpoint="https://api.honeycomb.io:443",
    headers=(
        ("x-honeycomb-team", os.environ["HONEYCOMB_API_KEY"]),
    ),
)

trace_provider = TracerProvider()
trace_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(trace_provider)

The key is using OpenTelemetry to abstract away the backend choice.


Logging, Tracing, and Metrics {#logging-tracing-metrics}

Observability has three pillars: logs, traces, and metrics. For Claude deployments, you need all three.

Logs

Logs are the most basic form of observability. They’re human-readable records of events. For Claude calls, log:

  • Request: model, prompt (or hash of it for privacy), parameters.
  • Response: status, tokens, latency.
  • Errors: type, message, stack trace.

Example:

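import logging
import time
import uuid

# Reuses `client` from the instrumentation example and `calculate_cost`
# from the Metrics section.
logger = logging.getLogger(__name__)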

def call_claude_with_logging(prompt: str, model: str = "claude-3-5-sonnet-20241022") -> str:
    """Call Claude with structured logging."""
    
    request_id = str(uuid.uuid4())
    
    logger.info(
        "Claude request started",
        extra={
            "request_id": request_id,
            "model": model,
            "prompt_length": len(prompt),
        }
    )
    
    start_time = time.time()
    
    try:
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        
        duration = time.time() - start_time
        
        logger.info(
            "Claude request succeeded",
            extra={
                "request_id": request_id,
                "model": model,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "duration_seconds": duration,
                "cost_usd": calculate_cost(
                    response.usage.input_tokens,
                    response.usage.output_tokens,
                    model
                ),
            }
        )
        
        return response.content[0].text
        
    except Exception as e:
        duration = time.time() - start_time
        
        logger.error(
            "Claude request failed",
            extra={
                "request_id": request_id,
                "model": model,
                "error_type": type(e).__name__,
                "error_message": str(e),
                "duration_seconds": duration,
            },
            exc_info=True,
        )
        
        raise

Use structured logging (JSON) so your observability backend can parse and aggregate logs. Avoid logging the full prompt or response if it contains sensitive data—hash it or log only metadata.
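
One caveat: Python's standard `logging` module does not print fields passed through `extra=` unless the formatter includes them. The sketch below shows one way to emit each record as a JSON line with those fields attached. The `JsonFormatter` class and the `claude.demo` logger name are illustrative assumptions; libraries such as python-json-logger offer a maintained alternative.

```python
import json
import logging

# Attributes present on every LogRecord; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy the fields that were passed via `extra=`.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps non-serialisable values from raising.
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("claude.demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.info(
    "Claude request succeeded",
    extra={"request_id": "abc", "input_tokens": 12},
)
```

With a formatter like this attached, the `extra` dictionaries in the example above become top-level JSON keys that a log backend can index.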

Traces

Traces show the full flow of a request through your system. For Claude deployments, a trace captures:

  1. User request enters your application.
  2. Prompt construction (if your app builds the prompt dynamically).
  3. Claude API call (with sub-spans for retries, cache checks, etc.).
  4. Response processing (parsing, validation, etc.).
  5. Return to user.

Each span in the trace has a start time, duration, attributes, and optional events. Here’s a more complex example:

def generate_customer_response(customer_id: str, question: str) -> str:
    """Generate a response for a customer, with full tracing."""
    
    with tracer.start_as_current_span("generate_response") as span:
        span.set_attribute("customer_id", customer_id)
        # Record the length rather than the raw customer text, which may be sensitive
        span.set_attribute("question_length", len(question))
        
        # Fetch customer context
        with tracer.start_as_current_span("fetch_customer_context") as ctx_span:
            context = fetch_customer_data(customer_id)
            ctx_span.set_attribute("context_size_bytes", len(str(context)))
        
        # Build prompt
        with tracer.start_as_current_span("build_prompt") as prompt_span:
            system_prompt = build_system_prompt(context)
            full_prompt = f"{system_prompt}\n\nCustomer question: {question}"
            prompt_span.set_attribute("system_prompt_length", len(system_prompt))
            prompt_span.set_attribute("total_prompt_length", len(full_prompt))
        
        # Call Claude
        with tracer.start_as_current_span("claude_api_call") as api_span:
            try:
                response = client.messages.create(
                    model="claude-3-5-sonnet-20241022",
                    max_tokens=1024,
                    system=system_prompt,
                    messages=[{"role": "user", "content": question}]
                )
                
                api_span.set_attribute("status", "success")
                api_span.set_attribute("input_tokens", response.usage.input_tokens)
                api_span.set_attribute("output_tokens", response.usage.output_tokens)
                
            except Exception as e:
                api_span.set_attribute("status", "error")
                api_span.set_attribute("error_type", type(e).__name__)
                api_span.record_exception(e)
                raise
        
        # Process response
        with tracer.start_as_current_span("process_response") as proc_span:
            processed = validate_and_format_response(response.content[0].text)
            proc_span.set_attribute("output_length", len(processed))
        
        span.set_attribute("final_status", "success")
        return processed

Traces let you see the full picture: where time is spent, which operations fail, and how requests flow through your system. When a customer reports a slow response, you can look at the trace and see whether it’s Claude latency, database queries, or your own code.

Metrics

Metrics are aggregated, time-series data. For Claude deployments, track:

  • Token usage: input, output, total (by model, by endpoint).
  • Latency: p50, p95, p99 (by model, by status).
  • Error rate: errors per minute, by error type.
  • Cost: estimated cost per hour, per day, per month.
  • Cache hit rate: if using prompt caching.

Example:

from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View

# Define buckets for latency histogram (in seconds)
latency_buckets = ExplicitBucketHistogramAggregation(
    boundaries=[
        0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0
    ]
)

# Pass views when the MeterProvider is first created; the global provider
# can only be set once per process.
meter_provider = MeterProvider(
    metric_readers=[metric_reader],
    views=[
        View(
            instrument_name="claude.request.duration",
            aggregation=latency_buckets,
        )
    ]
)
metrics.set_meter_provider(meter_provider)

meter = metrics.get_meter(__name__)

# Cost tracking
cost_counter = meter.create_counter(
    name="claude.cost.usd",
    description="Estimated cost of Claude API calls in USD",
    unit="$",
)

def calculate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Calculate cost in USD based on token counts and model."""
    # Pricing as of 2024, in USD per token (check Claude docs for current rates)
    pricing = {
        "claude-3-5-sonnet-20241022": {"input": 0.003 / 1000, "output": 0.015 / 1000},
        "claude-3-opus-20240229": {"input": 0.015 / 1000, "output": 0.075 / 1000},
    }
    
    if model not in pricing:
        return 0.0
    
    rates = pricing[model]
    return (input_tokens * rates["input"]) + (output_tokens * rates["output"])

# In your Claude call:
cost = calculate_cost(response.usage.input_tokens, response.usage.output_tokens, model)
cost_counter.add(cost, attributes={"model": model})

Metrics are ideal for dashboards and alerting. You can alert on “cost exceeded $100/day” or “error rate > 5%” without needing to query individual traces.


Cost Monitoring and Token Accounting {#cost-monitoring}

Claude’s pricing is per-token, and costs scale quickly. Without instrumentation, you won’t know whether a feature is economically viable until you see the bill.

Token Accounting

Every Claude call consumes tokens:

  • Input tokens: your prompt + system message.
  • Output tokens: Claude’s response.
  • Cache tokens (optional): with prompt caching, input written to or read from the cache is counted separately.

You’re billed for all three. If you’re using prompt caching, you’ll see cache_creation_input_tokens and cache_read_input_tokens in the response. Cache writes cost about 25% more than regular input tokens, while cache reads cost about 90% less.

Example:

def track_token_cost(response, model: str):
    """Extract and track all token types from a response."""
    
    usage = response.usage
    
    # Standard tokens
    input_tokens = usage.input_tokens
    output_tokens = usage.output_tokens
    
    # Cache tokens (if using prompt caching); treat a missing or None field as 0
    cache_creation_input_tokens = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_read_input_tokens = getattr(usage, "cache_read_input_tokens", 0) or 0
    
    # Pricing (check current rates)
    pricing = {
        "claude-3-5-sonnet-20241022": {
            "input": 0.003 / 1000,
            "output": 0.015 / 1000,
            "cache_creation_input": 0.00375 / 1000,  # 25% more than input
            "cache_read_input": 0.0003 / 1000,  # 90% discount
        },
    }
    
    rates = pricing.get(model, {})
    
    cost = (
        (input_tokens * rates.get("input", 0)) +
        (output_tokens * rates.get("output", 0)) +
        (cache_creation_input_tokens * rates.get("cache_creation_input", 0)) +
        (cache_read_input_tokens * rates.get("cache_read_input", 0))
    )
    
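    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_creation_input_tokens": cache_creation_input_tokens,
        "cache_read_input_tokens": cache_read_input_tokens,
        # input_tokens excludes cached tokens, so add them for the true total
        "total_tokens": (
            input_tokens + output_tokens
            + cache_creation_input_tokens + cache_read_input_tokens
        ),
        "cost_usd": cost,
    }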
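
To see what those rates mean in practice, here is a small worked example using the Claude 3.5 Sonnet figures from the pricing table in track_token_cost. The 10,000-token shared prefix and the 100 calls are assumptions, and the sketch assumes every call after the first hits the cache (no expiry in between).

```python
# Per-token rates, copied from the pricing table in track_token_cost.
INPUT_RATE = 0.003 / 1000
CACHE_WRITE_RATE = 0.00375 / 1000
CACHE_READ_RATE = 0.0003 / 1000

def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """USD spent on a shared prompt prefix across `calls` requests."""
    if not cached or calls == 0:
        return prefix_tokens * INPUT_RATE * calls
    # The first call writes the cache; the remaining calls read from it.
    return prefix_tokens * (CACHE_WRITE_RATE + CACHE_READ_RATE * (calls - 1))

uncached = prefix_cost(10_000, 100, cached=False)
cached = prefix_cost(10_000, 100, cached=True)
print(f"uncached ${uncached:.4f} vs cached ${cached:.4f}")
```

Under these assumptions the prefix costs about $3.00 uncached and about $0.33 cached. Real savings depend on how often the cache is actually hit, which is exactly what the cache token counters let you measure.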

Cost Alerting

Set up alerts to catch runaway costs:

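from opentelemetry.metrics import CallbackOptions, Observation

def get_daily_cost(options: CallbackOptions):
    """Report today's estimated cost, fetched from your database."""
    # This would query your metrics backend or database
    yield Observation(150.00)  # Example: $150 today

# Export the daily cost as a gauge so the backend can alert on it.
# The callback must be defined before the gauge that references it.
daily_cost_gauge = meter.create_observable_gauge(
    name="claude.cost.daily_usd",
    description="Estimated daily cost of Claude API calls",
    unit="$",
    callbacks=[get_daily_cost],
)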

# In your alerting rules (Datadog, Honeycomb, etc.):
# Alert if claude.cost.daily_usd > 200
# Alert if claude.cost.monthly_usd > 5000
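
Backend alerts can lag by minutes, so some teams also keep a coarse in-process guard. The sketch below is one way to do that. The `DailyBudget` class is hypothetical, it lives in memory only (it resets on restart and is not shared across workers), and it complements a backend alert rather than replacing it.

```python
from collections import defaultdict
from datetime import date
from typing import Optional

class DailyBudget:
    """Accumulate estimated spend per calendar day and flag budget breaches."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self._spend = defaultdict(float)  # date -> USD spent that day

    def record(self, cost_usd: float, day: Optional[date] = None) -> bool:
        """Add one request's cost; return True once the day's limit is exceeded."""
        day = day or date.today()
        self._spend[day] += cost_usd
        return self._spend[day] > self.limit_usd

    def spent(self, day: Optional[date] = None) -> float:
        """Total recorded spend for the given day (today by default)."""
        return self._spend[day or date.today()]

budget = DailyBudget(limit_usd=200.0)
over_budget = budget.record(0.012)  # cost of one request, e.g. from calculate_cost
print(over_budget)
```

When `record` returns True, the application could log a warning, emit a metric, or switch to a cheaper fallback path, depending on how strict the budget is meant to be.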

Cost Optimization

Observability reveals optimization opportunities:

  1. Prompt caching: If you’re making repeated calls with the same system prompt, enable caching. Cached reads of that prefix cost about 90% less than regular input tokens, although the first write costs about 25% more.
  2. Model selection: Use claude-3-5-sonnet-20241022 (cheaper) for simple tasks, claude-3-opus-20240229 (more expensive) only when needed.
  3. Output length: Set max_tokens based on what you actually need, not the maximum.
  4. Batch processing: If possible, batch multiple requests to reduce overhead.
  5. Context size: Keep your context (customer data, documents, etc.) minimal.

With observability, you can measure the impact of each optimization.


Failure Scenarios and Detection {#failure-scenarios}

Production Claude deployments fail in predictable ways. Observability helps you detect and respond to each.

Rate Limiting

Scenario: Your application hits Claude’s rate limits and receives HTTP 429 errors.

Detection:

def handle_rate_limit(error, span):
    """Detect and handle rate limiting."""
    
    if getattr(error, "status_code", None) == 429:
        # anthropic.APIStatusError exposes the HTTP response, including headers
        retry_after = error.response.headers.get("retry-after")
        
        span.set_attribute("error_type", "rate_limit")
        span.set_attribute("retry_after_seconds", retry_after or "unknown")
        
        # Alert operations team
        logger.warning(
            "Claude rate limit hit",
            extra={
                "retry_after": retry_after,
                "timestamp": time.time(),
            }
        )
        
        # Honour the server's retry-after hint, defaulting to 1 second
        try:
            wait_seconds = float(retry_after)
        except (TypeError, ValueError):
            wait_seconds = 1.0
        time.sleep(wait_seconds)
        return True  # Retry
    
    return False

Alerting: If rate-limit errors exceed 10% of requests in a 5-minute window, page on-call.
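
Most observability backends can evaluate a rule like this directly. For cases where the check has to run inside the application, the following is a minimal sketch of a sliding-window rate. The `WindowedRate` class is hypothetical and keeps its state in memory for a single process.

```python
import time
from collections import deque
from typing import Optional

class WindowedRate:
    """Fraction of flagged events (e.g. 429s) over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, flagged) pairs, oldest first

    def record(self, flagged: bool, now: Optional[float] = None) -> None:
        """Record one request; `flagged` marks it as rate-limited."""
        now = time.monotonic() if now is None else now
        self._events.append((now, flagged))
        self._evict(now)

    def rate(self, now: Optional[float] = None) -> float:
        """Share of flagged requests still inside the window (0.0 if empty)."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self._events:
            return 0.0
        flagged = sum(1 for _, is_flagged in self._events if is_flagged)
        return flagged / len(self._events)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

rate_limits = WindowedRate(window_seconds=300)
rate_limits.record(flagged=False)
print(rate_limits.rate())
```

Calling `rate_limits.record(flagged=True)` on each 429 and paging when `rate_limits.rate()` exceeds 0.10 mirrors the rule above.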

Context Window Overflow

Scenario: Your prompt + context exceeds the model’s context window (200k tokens for Claude 3.5 Sonnet).

Detection:

def validate_context_size(prompt: str, model: str, span) -> bool:
    """Estimate token count and validate against model limits."""
    
    # Use Claude's token counting API
    token_count = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    ).input_tokens
    
    limits = {
        "claude-3-5-sonnet-20241022": 200000,
        "claude-3-opus-20240229": 200000,
    }
    
    limit = limits.get(model, 100000)
    
    span.set_attribute("estimated_input_tokens", token_count)
    span.set_attribute("context_window_limit", limit)
    span.set_attribute("context_utilization_percent", (token_count / limit) * 100)
    
    if token_count > limit:
        span.set_attribute("status", "context_overflow")
        logger.error(
            "Context window overflow",
            extra={
                "token_count": token_count,
                "limit": limit,
                "model": model,
            }
        )
        return False
    
    return True

Alerting: Alert if context utilisation exceeds 80% for any request.

API Errors

Scenario: Anthropic’s API returns a 500 error or is temporarily unavailable.

Detection:

def handle_api_error(error, span, attempt: int = 1, max_retries: int = 3):
    """Detect and handle transient API errors."""
    
    if hasattr(error, "status_code"):
        if error.status_code >= 500:
            # Server error, likely transient
            span.set_attribute("error_type", "api_server_error")
            span.set_attribute("status_code", error.status_code)
            span.set_attribute("attempt", attempt)
            
            if attempt < max_retries:
                # Exponential backoff: wait doubles each attempt (1s, 2s, 4s, ...)
                wait_time = 2 ** (attempt - 1)
                logger.warning(
                    "Claude API error, retrying",
                    extra={
                        "status_code": error.status_code,
                        "attempt": attempt,
                        "wait_seconds": wait_time,
                    }
                )
                time.sleep(wait_time)
                return True  # Retry
        
        elif error.status_code == 401:
            # Auth error, don't retry
            span.set_attribute("error_type", "auth_error")
            logger.error("Invalid API key")
            return False
    
    return False

Alerting: Alert if API error rate (4xx and 5xx) exceeds 1% for 5 minutes.
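
The handlers above return True or False to signal whether a retry makes sense, but they need a loop to drive them. The sketch below shows one possible loop, using exponential backoff with full jitter. `TransientError`, `backoff_delay`, and `with_retries` are hypothetical names, and `TransientError` stands in for whatever your code classifies as retryable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (429, 5xx, timeout)."""

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2 ** (attempt - 1))]."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))

def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Jitter spreads retries from many clients so they do not all hit the API at the same instant. The official Python SDK also retries some errors on its own (see its `max_retries` client option), so it is worth checking that the two layers do not multiply the number of attempts.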

Timeout Errors

Scenario: Your application can’t reach Claude’s API due to network issues.

Detection:

import anthropic

def call_claude_with_timeout(prompt: str, timeout_seconds: int = 30):
    """Call Claude with explicit timeout handling."""
    
    with tracer.start_as_current_span("claude_call") as span:
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
                timeout=timeout_seconds,
            )
            return response
        
        # The SDK raises its own APITimeoutError, not the built-in TimeoutError
        except anthropic.APITimeoutError:
            span.set_attribute("error_type", "timeout")
            span.set_attribute("timeout_seconds", timeout_seconds)
            logger.error(
                "Claude API timeout",
                extra={"timeout_seconds": timeout_seconds},
            )
            raise

Alerting: Alert if timeout rate exceeds 5% for 2 minutes.

Degraded Output Quality

Scenario: Claude’s responses are shorter, less detailed, or lower quality than expected.

Detection:

def detect_output_quality_degradation(response, span):
    """Monitor for signs of degraded output quality."""
    
    output = response.content[0].text
    output_tokens = response.usage.output_tokens
    
    # Check for suspiciously short responses
    if output_tokens < 50:  # Adjust based on your use case
        span.set_attribute("output_quality_flag", "unusually_short")
        logger.warning(
            "Unusually short Claude response",
            extra={"output_tokens": output_tokens},
        )
    
    # Check for refusals
    if "I can't" in output or "I'm not able to" in output:
        span.set_attribute("output_quality_flag", "potential_refusal")
        logger.warning("Claude may have refused the request")
    
    # Check for repeated tokens (sign of model degradation)
    words = output.split()
    if len(words) > 0:
        unique_ratio = len(set(words)) / len(words)
        if unique_ratio < 0.6:  # Less than 60% unique words
            span.set_attribute("output_quality_flag", "low_uniqueness")
            logger.warning(
                "Claude response has low token uniqueness",
                extra={"unique_ratio": unique_ratio},
            )

Alerting: Alert if “unusual” responses exceed 10% of requests in an hour.
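
One further signal worth considering, beyond the heuristics above, is the `stop_reason` field on the response: a value of `"max_tokens"` means the output was cut off at the token cap rather than finishing naturally. The sketch below uses stand-in objects instead of real API responses, and the `truncation_flag` helper and its flag name are assumptions.

```python
from types import SimpleNamespace
from typing import Optional

def truncation_flag(response) -> Optional[str]:
    """Return a quality flag when the response stopped at the max_tokens cap."""
    if getattr(response, "stop_reason", None) == "max_tokens":
        return "truncated_at_max_tokens"
    return None

# Stand-in objects shaped like a Messages API response, for illustration only.
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")
print(truncation_flag(complete), truncation_flag(cut_off))
```

A rising share of truncated responses usually points at a `max_tokens` value that is too low for the task, or at a prompt that has started producing much longer answers.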


Deployment Checklist {#deployment-checklist}

Before deploying Claude to production, ensure you have:

Observability Foundation

  • Tracing: Every Claude call is wrapped in an OpenTelemetry span.
  • Logging: Structured logs (JSON) for all requests, responses, and errors.
  • Metrics: Token counts, latency, error rate, and cost tracked as metrics.
  • Backend: Traces, logs, and metrics exported to a production observability platform (Datadog, Honeycomb, etc.).

Failure Detection

  • Rate limit handling: Exponential backoff and alerting for 429 errors.
  • Context overflow detection: Validate prompt size before sending to Claude.
  • API error handling: Distinguish transient (retry) from permanent (fail fast) errors.
  • Timeout handling: Explicit timeout configuration and detection.
  • Output quality checks: Detect suspiciously short or degraded responses.

Cost Control

  • Token accounting: Track input, output, and cache tokens separately.
  • Cost estimation: Calculate cost per request and aggregate daily/monthly.
  • Cost alerts: Alert if daily or monthly spend exceeds budget.
  • Cost optimization: Evaluate prompt caching, model selection, and context size.

Compliance and Security

  • Audit trail: All Claude calls logged with request ID, timestamp, user, and outcome.
  • Data privacy: Sensitive data not logged or sent to Claude without explicit handling.
  • API key rotation: Keys stored in secrets manager, rotated regularly.
  • Network security: Claude API calls use HTTPS, TLS 1.2+.
  • Access control: Only authorised users/services can call Claude.

Dashboards and Alerting

  • SLO dashboard: Latency (p50, p95, p99), error rate, availability.
  • Cost dashboard: Daily spend, cost per request, cost by model/endpoint.
  • Error dashboard: Error rate by type, error trends, top failing requests.
  • Alert rules: Rate limits, timeouts, error rate spikes, cost overruns.
  • On-call runbook: How to respond to each alert.

Testing

  • Load testing: Test observability at expected peak load.
  • Failure injection: Simulate rate limits, timeouts, and API errors.
  • Cost testing: Verify cost calculation against actual bills.
  • Tracing validation: Spot-check traces in production to ensure completeness.
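
For the cost-testing item, a simple reconciliation check can compare your own estimate with the invoiced amount for the same period. This is a sketch only: the `reconcile` helper is hypothetical and the 5% tolerance is an assumed threshold, not a recommendation from Anthropic.

```python
def reconcile(estimated_usd: float, billed_usd: float, tolerance: float = 0.05) -> dict:
    """Compare estimated spend with the invoiced amount for the same period."""
    if billed_usd == 0:
        # Nothing billed: any non-zero estimate is an unbounded drift.
        drift = 0.0 if estimated_usd == 0 else float("inf")
    else:
        drift = (estimated_usd - billed_usd) / billed_usd
    return {"drift": drift, "within_tolerance": abs(drift) <= tolerance}

print(reconcile(estimated_usd=98.0, billed_usd=100.0))
```

A persistent negative drift suggests the estimate is missing something, for example cache token charges or a model absent from the pricing table.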

Next Steps {#next-steps}

Observability is not a one-time setup—it’s an ongoing practice. Here’s how to mature your observability:

Week 1: Foundation

  1. Instrument all Claude calls with OpenTelemetry tracing.
  2. Export traces to a production backend (Datadog, Honeycomb, etc.).
  3. Create a basic dashboard showing latency, error rate, and token usage.
  4. Set up alerts for error rate > 5% and cost > daily budget.

Week 2–4: Depth

  1. Add structured logging to all Claude calls.
  2. Implement failure detection for rate limits, timeouts, and API errors.
  3. Track cost per request and per endpoint.
  4. Add context-overflow detection and validation.
  5. Create runbooks for common failure scenarios.

Month 2+: Optimization

  1. Analyse traces to identify slow requests and optimise prompts.
  2. Evaluate prompt caching to reduce costs.
  3. Benchmark different models (Sonnet vs. Opus) for your use cases.
  4. Establish SLOs (e.g., p99 latency < 5 seconds, error rate < 1%).
  5. Use observability to drive product decisions (e.g., “this feature costs $X per user—is it worth it?”).
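
The example SLOs in the list above can be checked with a few lines of standard-library Python. This is a minimal sketch over raw durations held in memory. In practice your observability backend would compute percentiles from histogram data, and the `latency_percentiles` and `meets_slo` helpers are hypothetical.

```python
import statistics

def latency_percentiles(durations_s: list) -> dict:
    """p50/p95/p99 from raw request durations (needs at least two samples)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def meets_slo(
    durations_s: list,
    errors: int,
    total: int,
    p99_limit_s: float = 5.0,
    max_error_rate: float = 0.01,
) -> bool:
    """Check the example SLOs: p99 latency < 5 seconds and error rate < 1%."""
    p99 = latency_percentiles(durations_s)["p99"]
    error_rate = errors / total if total else 0.0
    return p99 < p99_limit_s and error_rate < max_error_rate

# Illustrative sample: 100 requests taking 0.01 s to 1.00 s, with no errors.
sample = [i / 100 for i in range(1, 101)]
print(latency_percentiles(sample), meets_slo(sample, errors=0, total=100))
```

Tracking how often `meets_slo` fails over a month gives a rough error budget, which can then inform decisions such as switching models or tightening `max_tokens`.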

Production Readiness

For comprehensive production guidance, refer to Anthropic’s official production readiness guide, which covers reliability, scaling, and operational considerations beyond observability.

If you’re building a multi-agent research system, Anthropic’s engineering write-up on their multi-agent research system provides concrete lessons on coordination, tool use, and operational design that complement this observability guide.

For enterprise deployments, especially if you’re pursuing SOC 2 or ISO 27001 compliance, observability is a critical component. PADISO’s Security Audit service can help you validate that your observability infrastructure meets compliance requirements.

Getting Help

If you’re building production AI systems and need guidance on architecture, observability, or compliance, PADISO offers fractional CTO leadership and platform engineering services. We’ve helped startups and enterprises deploy Claude and other AI systems at scale, with full observability and compliance.

For platform engineering in your region, we have teams in Sydney, San Francisco, Los Angeles, Chicago, Boston, Seattle, Austin, Dallas, Houston, Atlanta, and Denver. If you need help with observability, cost control, or compliance for Claude deployments, book a call with our team.


Summary

Production observability for Claude deployments is non-negotiable. Without it, you can’t debug failures, control costs, or prove reliability. The patterns in this guide—instrumentation with OpenTelemetry, structured logging, metrics collection, and failure detection—are battle-tested across hundreds of production systems.

Start simple: instrument your Claude calls, export traces, and create a basic dashboard. Then iterate: add logging, implement failure detection, track costs, and optimise based on what you learn.

Observability isn’t a feature you ship once—it’s a practice you refine continuously. The more you observe, the more you’ll optimise, and the more reliable and cost-effective your Claude deployments will be.
