Table of Contents
- Why Webhooks Matter for Claude in Production
- Webhook Architecture Fundamentals
- Security and Signature Verification
- Retry Logic and Delivery Guarantees
- Reference Diagram and Flow
- Code Patterns and Implementation
- Failure Scenarios and How to Prevent Them
- Monitoring and Observability
- Scaling Webhook Infrastructure
- Next Steps and Production Readiness
Why Webhooks Matter for Claude in Production {#why-webhooks-matter}
When you deploy Claude at scale—whether you’re running agentic AI workflows, automating customer support triage, or orchestrating multi-step business processes—you need a way for Claude to notify your systems of state changes, completions, and errors in real time. Webhooks are the production-grade pattern for this.
Unlike polling (where you repeatedly ask “Is the work done yet?”), webhooks let Claude push events to your infrastructure. This means lower latency, reduced database load, and the ability to react to events as they happen. For teams building AI-powered products, this translates directly into faster user feedback loops and more responsive automation.
At PADISO, we’ve deployed Claude webhooks across dozens of production systems—from fintech platforms automating compliance workflows to e-commerce engines generating product descriptions at scale. The pattern we’ve standardised here reflects real production constraints: verification must be cryptographically sound, retries must be idempotent, and failure modes must be observable.
This guide covers the complete architecture: how to subscribe to Claude webhooks, verify signatures, handle retries, prevent failure cascades, and scale reliably. We’ll walk through code, diagrams, and the specific failure scenarios that catch teams in production.
Webhook Architecture Fundamentals {#webhook-fundamentals}
What Is a Webhook?
A webhook is an HTTP callback—a way for one system (Claude) to notify another system (your application) when something happens. Instead of your code asking Claude “Is the message processed?” every second, Claude pushes a message to a URL you control when processing is complete.
For Claude deployments, webhooks typically fire when:
- A managed agent completes a task
- An API request reaches a terminal state (success, error, timeout)
- A long-running inference finishes
- A streaming session closes
The webhook event is a JSON payload sent via HTTP POST to your endpoint. Your endpoint receives it, processes it, and returns a 2xx status code to confirm receipt.
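As a minimal sketch of the receive step, the snippet below parses a raw POST body into a small typed envelope. The field names id, type, and data follow the handlers shown later in this guide; they are assumptions about the payload shape rather than a documented schema, and WebhookEvent and parse_event are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class WebhookEvent:
    """Hypothetical envelope; field names are assumed, not a documented schema."""
    id: str
    type: str
    data: Dict[str, Any] = field(default_factory=dict)


def parse_event(raw_body: bytes) -> WebhookEvent:
    """Parse a raw POST body, raising ValueError on malformed input."""
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    if not isinstance(body, dict) or "id" not in body or "type" not in body:
        raise ValueError("event must be an object with 'id' and 'type'")
    return WebhookEvent(id=str(body["id"]), type=str(body["type"]), data=body.get("data") or {})


event = parse_event(
    b'{"id": "evt_xyz789", "type": "agent.task.completed", "data": {"task_id": "abc123"}}'
)
print(event.type, event.data["task_id"])
```

Rejecting malformed bodies with a single exception type makes it straightforward to map them to a 400 response at the HTTP layer.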
Webhook vs. Polling: Why It Matters
In a polling architecture, your code runs on a schedule—every 5 seconds, every minute—asking “Is the work done?” This works for low-volume workloads but breaks at scale:
- Database load: Every poll is a query. 1,000 concurrent tasks polling every 5 seconds = 200 queries per second.
- Latency: You don’t know work is done until the next poll interval. If polls run every 60 seconds, you wait up to 60 seconds to react.
- Cost: API calls and database queries add up fast.
Webhooks flip this: Claude calls you when work is done. No polling. No wasted queries. Latency drops to milliseconds.
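The polling arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch; the function names are illustrative, and the average-wait figure assumes completions are spread evenly across the poll interval.

```python
def polling_queries_per_second(concurrent_tasks: int, poll_interval_s: float) -> float:
    """Steady-state query rate when every task polls once per interval."""
    return concurrent_tasks / poll_interval_s


def average_polling_wait_s(poll_interval_s: float) -> float:
    """Mean delay before a poll notices completion (assumes uniform arrival)."""
    return poll_interval_s / 2


print(polling_queries_per_second(1_000, 5))  # 1,000 tasks polling every 5 seconds
print(average_polling_wait_s(60))            # 60-second polls
```

With 1,000 tasks and a 5-second interval this gives 200 queries per second, matching the figure in the list above.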
For teams running AI automation at PADISO’s platform development centres, whether in Sydney, San Francisco, or across multiple regions, this is the difference between a system that scales smoothly and one that becomes a bottleneck.
Event-Driven Architecture
Webhooks are the foundation of event-driven architecture. When Claude fires a webhook, your system can:
- Update a database record
- Trigger downstream workflows
- Send notifications to users
- Log metrics and analytics
- Enqueue background jobs
This decoupling means Claude doesn’t need to wait for your entire response chain to complete. Claude fires the webhook and moves on. Your system processes asynchronously.
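One way to picture this fan-out is a small in-process registry that maps an event type to the handlers subscribed to it. The sketch below is illustrative only: on, dispatch, and the two handlers are hypothetical names, and the list appends stand in for real database writes and notifications.

```python
from typing import Callable, Dict, List

# Maps an event type to the handlers subscribed to it.
_handlers: Dict[str, List[Callable[[dict], None]]] = {}


def on(event_type: str):
    """Decorator that registers a handler for one event type."""
    def register(fn):
        _handlers.setdefault(event_type, []).append(fn)
        return fn
    return register


def dispatch(event: dict) -> int:
    """Call every handler registered for the event's type; return how many ran."""
    handlers = _handlers.get(event.get("type"), [])
    for handler in handlers:
        handler(event)
    return len(handlers)


updated, notified = [], []


@on("agent.task.completed")
def update_record(event: dict) -> None:
    updated.append(event["id"])    # stand-in for a database update


@on("agent.task.completed")
def notify_user(event: dict) -> None:
    notified.append(event["id"])   # stand-in for sending a notification


print(dispatch({"id": "evt_1", "type": "agent.task.completed"}))
```

Adding a new downstream action then means registering one more handler, without touching the code that receives the webhook.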
Security and Signature Verification {#security-verification}
Why Verification Is Non-Negotiable
Any HTTP endpoint on the internet can be called by anyone. Without verification, an attacker could:
- Forge webhook events to trigger false alerts
- Mark completed work as failed to trigger retries
- Inject malicious payloads into your processing pipeline
Production webhook architectures must verify that events actually came from Claude. This is done via cryptographic signatures.
How Claude Signs Webhooks
When you subscribe to Claude webhooks (see Claude API Docs, Subscribe to webhooks), you receive a signing key. Claude uses this key to generate an HMAC-SHA256 signature of the webhook payload and includes it in the request headers.
Your endpoint must:
- Extract the signature from the request header
- Compute the HMAC-SHA256 of the raw request body using your signing key
- Compare the two signatures (using constant-time comparison to prevent timing attacks)
- Reject the request if signatures don’t match
This is the same pattern used by Stripe Docs — Webhooks, GitHub Docs — Webhooks, and other production systems. The pattern is battle-tested.
Implementation: Signature Verification in Python
import hashlib
import hmac
import json
import os
from flask import Flask, request
app = Flask(__name__)
# Your Claude webhook signing key, read from an environment variable (never hard-code it)
CLAUDE_WEBHOOK_KEY = os.environ['CLAUDE_WEBHOOK_KEY']
def verify_webhook_signature(payload: bytes, signature: str, key: str) -> bool:
    """
    Verify a Claude webhook signature using constant-time comparison.
    Args:
        payload: Raw request body as bytes
        signature: Signature from X-Claude-Signature header
        key: Your webhook signing key
    Returns:
        True if signature is valid, False otherwise
    """
    # Compute expected signature
    expected_signature = hmac.new(
        key.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    # Use constant-time comparison to prevent timing attacks.
    # Comparing bytes avoids a TypeError if the header contains non-ASCII characters.
    return hmac.compare_digest(signature.encode(), expected_signature.encode())
@app.route('/webhooks/claude', methods=['POST'])
def handle_claude_webhook():
    """
    Handle incoming Claude webhook.
    """
    # Get raw body (before Flask parses it)
    raw_body = request.get_data()
    signature = request.headers.get('X-Claude-Signature')
    # Verify signature
    if not signature or not verify_webhook_signature(raw_body, signature, CLAUDE_WEBHOOK_KEY):
        return {'error': 'Unauthorized'}, 401
    # Parse and process event
    event = json.loads(raw_body)
    process_claude_event(event)
    return {'status': 'received'}, 200
def process_claude_event(event: dict):
    """
    Process a Claude webhook event.
    """
    event_type = event.get('type')
    event_id = event.get('id')
    # handle_task_completed, handle_task_failed and handle_task_timeout
    # are your application's own handlers
    if event_type == 'agent.task.completed':
        handle_task_completed(event)
    elif event_type == 'agent.task.failed':
        handle_task_failed(event)
    elif event_type == 'agent.task.timeout':
        handle_task_timeout(event)
HTTPS Requirement
Claude only sends webhooks to HTTPS endpoints. This is non-negotiable. Your webhook endpoint must:
- Use a valid TLS certificate (self-signed certificates are rejected)
- Have a certificate chain that validates against standard root CAs
- Support TLS 1.2 or higher
For local development, use tools like ngrok or localtunnel to expose a local HTTP server via a public HTTPS URL.
Retry Logic and Delivery Guarantees {#retry-logic}
Understanding Claude’s Retry Behavior
Claude uses exponential backoff with jitter to retry failed webhook deliveries. According to the Claude API Docs — Subscribe to webhooks, if your endpoint returns a non-2xx status code, Claude will retry with increasing delays.
This is good news and bad news:
Good: You get multiple chances to receive the event. If your service is temporarily down, Claude will keep trying.
Bad: You might receive the same event multiple times. Your code must be idempotent.
Idempotency: The Core Requirement
Idempotency means processing the same webhook multiple times produces the same result as processing it once. This is critical because:
- Network failures can cause duplicate deliveries
- Your endpoint might crash after processing but before returning a 2xx response
- Clock skew or retry logic bugs can cause re-deliveries
To achieve idempotency:
- Use event IDs: Every webhook has a unique id. Store processed event IDs in a database.
- Check before processing: Before processing an event, check if its ID has been seen before.
- Use database transactions: Ensure the check and update are atomic.
Implementation: Idempotent Webhook Handler
import json
from datetime import datetime, timedelta
from sqlalchemy import Column, String, DateTime, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
Base = declarative_base()
engine = create_engine('postgresql://...')
Session = sessionmaker(bind=engine)
class ProcessedWebhookEvent(Base):
    """Track processed webhook events to prevent duplicates."""
    __tablename__ = 'processed_webhook_events'
    event_id = Column(String, primary_key=True)
    event_type = Column(String)
    received_at = Column(DateTime, default=datetime.utcnow)
    processed_at = Column(DateTime)
def handle_claude_webhook_idempotent(event: dict):
    """
    Process a webhook event idempotently.
    """
    event_id = event.get('id')
    event_type = event.get('type')
    session = Session()
    try:
        # Check if event has been processed
        existing = session.query(ProcessedWebhookEvent).filter(
            ProcessedWebhookEvent.event_id == event_id
        ).first()
        if existing:
            # Already processed; return success without reprocessing
            return {'status': 'already_processed'}
        # Process the event
        if event_type == 'agent.task.completed':
            result = handle_task_completed(event)
        elif event_type == 'agent.task.failed':
            result = handle_task_failed(event)
        else:
            result = None
        # Record that we've processed this event
        processed = ProcessedWebhookEvent(
            event_id=event_id,
            event_type=event_type,
            processed_at=datetime.utcnow()
        )
        session.add(processed)
        session.commit()
        return {'status': 'processed', 'result': result}
    except IntegrityError:
        # Another worker inserted the same event_id between our check and commit
        session.rollback()
        return {'status': 'already_processed'}
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
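A handler that first checks for an existing row and then inserts can race when two workers receive the same delivery at once. One way to close that gap is to let the database's unique constraint make the decision in a single statement. The sketch below uses the standard-library sqlite3 module as a stand-in for your production database; claim_event is a hypothetical helper name.

```python
import sqlite3


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Atomically record event_id; return True only for the first caller."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_webhook_events (event_id) VALUES (?)",
            (event_id,),
        )
    # rowcount is 1 when the row was inserted, 0 when the key already existed
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_webhook_events (event_id TEXT PRIMARY KEY)")

print(claim_event(conn, "evt_xyz789"))  # first delivery
print(claim_event(conn, "evt_xyz789"))  # retry of the same event
```

In PostgreSQL the equivalent statement is INSERT ... ON CONFLICT DO NOTHING. Note that claiming before processing means a crash mid-processing leaves the event marked as handled; running the claim and the side effects in the same transaction avoids that.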
Cleanup and Retention
Processed event records accumulate over time. Implement a cleanup job to remove old records (e.g., older than 90 days) to prevent the table from growing unbounded:
def cleanup_old_webhook_events(days_to_keep: int = 90):
    """
    Remove processed webhook events older than the retention period.
    """
    session = Session()
    try:
        cutoff = datetime.utcnow() - timedelta(days=days_to_keep)
        deleted = session.query(ProcessedWebhookEvent).filter(
            ProcessedWebhookEvent.processed_at < cutoff
        ).delete()
        session.commit()
        return deleted
    finally:
        # Close the session even if the delete or commit fails
        session.close()
Comparison with Other Platforms
The idempotency pattern we’ve outlined mirrors best practices from OpenAI Platform Docs — Webhooks guide and Stripe Docs — Webhooks. All production webhook systems require idempotent handlers.
Reference Diagram and Flow {#reference-diagram}
High-Level Architecture
┌──────────────────────────────────────────────────────────────┐
│ Your Application │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ API Client │ │ Webhook Endpoint │ │
│ │ (initiate task)│ │ (receive events) │ │
│ └────────┬────────┘ └──────────┬───────────┘ │
│ │ ▲ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────────────┐ ┌───────┴─────────────┐ │
│ │ Claude API │ │ Event Processing │ │
│ │ (task_id: abc123) │ │ - Verify signature │ │
│ └──────────────────────────┘ │ - Check idempotency│ │
│ │ │ - Update database │ │
│ │ │ - Trigger workflows│ │
│ │ └─────────────────────┘ │
│ │ │
│ └──────────────────────────────────────────────────┤
│ (HTTPS POST to │
│ /webhooks/claude) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Database │ │
│ │ - Tasks table (status, result) │ │
│ │ - Processed events (deduplication) │ │
│ │ - Audit logs │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
▲
│
HTTPS (TLS 1.2+)
│
┌──────────────────────────────────────────────────────────────┐
│ Claude (Anthropic) │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ │
│ │ Task Execution │ │
│ │ - Process request │ │
│ │ - Generate response │ │
│ │ - Determine outcome │ │
│ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Webhook Event Generation │ │
│ │ - Build event payload │ │
│ │ - Sign with HMAC-SHA256 │ │
│ │ - Prepare headers │ │
│ └──────────┬───────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Retry Queue │ │
│ │ - Initial attempt │ │
│ │ - Exponential backoff on failure │ │
│ │ - Max 10 retries (configurable) │ │
│ └──────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
Event Flow Sequence
1. Client initiates task
→ POST /api/v1/agents/tasks
→ Returns {task_id: "abc123"}
2. Claude processes task
→ Inference runs
→ Result determined
3. Claude generates webhook event
→ Event ID: "evt_xyz789"
→ Type: "agent.task.completed"
→ Payload includes task_id, result, metadata
4. Claude signs webhook
→ Computes HMAC-SHA256 of payload
→ Includes signature in X-Claude-Signature header
5. Claude sends HTTPS POST
→ Target: https://yourapp.com/webhooks/claude
→ Headers: X-Claude-Signature, Content-Type
→ Body: JSON event payload
6. Your endpoint receives request
→ Verifies signature (constant-time comparison)
→ Checks event ID against processed_webhook_events table
→ If new, processes event
→ Returns 200 OK
7. If your endpoint returns non-2xx
→ Claude retries with exponential backoff
→ Retry delays: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s
→ After 10 retries, event is abandoned
8. If your endpoint returns 200 OK
→ Claude marks delivery as successful
→ No further retries
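The retry delays listed in step 7 follow a doubling schedule, which can be reproduced in a few lines. This is an illustrative sketch of exponential backoff in general, not Claude's actual implementation: the function names are hypothetical, and the "full jitter" strategy shown is one common choice assumed here for illustration.

```python
import random
from typing import List


def backoff_delays(base_s: float = 1.0, factor: float = 2.0, retries: int = 10) -> List[float]:
    """Deterministic exponential schedule: base, base*factor, base*factor^2, ..."""
    return [base_s * factor ** attempt for attempt in range(retries)]


def with_full_jitter(delay_s: float, rng: random.Random) -> float:
    """Full jitter: pick a uniform delay in [0, delay_s] to spread out retries."""
    return rng.uniform(0, delay_s)


delays = backoff_delays()
print(delays)       # 1.0, 2.0, 4.0, ... up to 512.0
print(sum(delays))  # total seconds between the first attempt and the last retry
```

If the schedule is as listed, the last retry lands about 17 minutes (1,023 seconds) after the first attempt, so an outage longer than that would need a separate reconciliation path to recover missed events.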
Code Patterns and Implementation {#code-patterns}
Complete Flask Implementation
Here’s a production-ready Flask application that handles Claude webhooks end-to-end:
import os
import json
import hmac
import hashlib
import logging
from datetime import datetime
from flask import Flask, request, jsonify
from sqlalchemy import Column, String, DateTime, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import IntegrityError
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Flask app
app = Flask(__name__)
# Signing key; raises KeyError at startup if the variable is missing
WEBHOOK_KEY = os.environ['CLAUDE_WEBHOOK_KEY']
# Database setup
Base = declarative_base()
engine = create_engine(os.environ['DATABASE_URL'])
Session = sessionmaker(bind=engine)
class ProcessedWebhookEvent(Base):
    __tablename__ = 'processed_webhook_events'
    event_id = Column(String, primary_key=True)
    event_type = Column(String)
    payload = Column(Text)
    received_at = Column(DateTime, default=datetime.utcnow)
    processed_at = Column(DateTime, default=datetime.utcnow)
class TaskResult(Base):
    __tablename__ = 'task_results'
    task_id = Column(String, primary_key=True)
    status = Column(String)  # 'pending', 'completed', 'failed', 'timeout'
    result = Column(Text)
    error = Column(Text)
    completed_at = Column(DateTime)
Base.metadata.create_all(engine)
def verify_webhook_signature(payload: bytes, signature: str, key: str) -> bool:
    """
    Verify Claude webhook signature using constant-time comparison.
    """
    expected = hmac.new(key.encode(), payload, hashlib.sha256).hexdigest()
    # Compare bytes so a non-ASCII header value cannot raise TypeError
    return hmac.compare_digest(signature.encode(), expected.encode())
@app.route('/webhooks/claude', methods=['POST'])
def handle_webhook():
    """
    Main webhook handler for Claude events.
    """
    try:
        # Get raw body and signature
        raw_body = request.get_data()
        signature = request.headers.get('X-Claude-Signature')
        # Verify signature
        if not signature or not verify_webhook_signature(raw_body, signature, WEBHOOK_KEY):
            logger.warning('Invalid webhook signature')
            return jsonify({'error': 'Unauthorized'}), 401
        # Parse event
        event = json.loads(raw_body)
        event_id = event.get('id')
        event_type = event.get('type')
        logger.info(f'Received webhook event: {event_id} ({event_type})')
        # Check for duplicate
        session = Session()
        try:
            existing = session.query(ProcessedWebhookEvent).filter(
                ProcessedWebhookEvent.event_id == event_id
            ).first()
            if existing:
                logger.info(f'Event {event_id} already processed')
                return jsonify({'status': 'already_processed'}), 200
            # Process based on event type
            if event_type == 'agent.task.completed':
                process_task_completed(event, session)
            elif event_type == 'agent.task.failed':
                process_task_failed(event, session)
            elif event_type == 'agent.task.timeout':
                process_task_timeout(event, session)
            else:
                logger.warning(f'Unknown event type: {event_type}')
            # Record processed event
            processed = ProcessedWebhookEvent(
                event_id=event_id,
                event_type=event_type,
                payload=raw_body.decode('utf-8')
            )
            session.add(processed)
            session.commit()
            logger.info(f'Successfully processed event {event_id}')
            return jsonify({'status': 'processed'}), 200
        except IntegrityError:
            session.rollback()
            logger.info(f'Event {event_id} was processed by another worker')
            return jsonify({'status': 'processed'}), 200
        except Exception:
            session.rollback()
            # logger.exception records the full stack trace
            logger.exception(f'Error processing event {event_id}')
            # Return 5xx to trigger Claude retry
            return jsonify({'error': 'Processing failed'}), 500
        finally:
            session.close()
    except json.JSONDecodeError:
        logger.error('Invalid JSON in webhook body')
        return jsonify({'error': 'Invalid JSON'}), 400
    except Exception:
        logger.exception('Unexpected error')
        return jsonify({'error': 'Internal error'}), 500
def process_task_completed(event: dict, session):
    """
    Handle task completion event.
    """
    task_id = event.get('data', {}).get('task_id')
    result = event.get('data', {}).get('result')
    if not task_id:
        logger.warning('No task_id in completed event')
        return
    # Update database
    task = session.query(TaskResult).filter(
        TaskResult.task_id == task_id
    ).first()
    if not task:
        task = TaskResult(task_id=task_id)
        session.add(task)
    task.status = 'completed'
    task.result = json.dumps(result) if result else None
    task.completed_at = datetime.utcnow()
    logger.info(f'Task {task_id} completed')
def process_task_failed(event: dict, session):
    """
    Handle task failure event.
    """
    task_id = event.get('data', {}).get('task_id')
    error = event.get('data', {}).get('error')
    if not task_id:
        logger.warning('No task_id in failed event')
        return
    task = session.query(TaskResult).filter(
        TaskResult.task_id == task_id
    ).first()
    if not task:
        task = TaskResult(task_id=task_id)
        session.add(task)
    task.status = 'failed'
    task.error = error
    task.completed_at = datetime.utcnow()
    logger.error(f'Task {task_id} failed: {error}')
def process_task_timeout(event: dict, session):
    """
    Handle task timeout event.
    """
    task_id = event.get('data', {}).get('task_id')
    if not task_id:
        logger.warning('No task_id in timeout event')
        return
    task = session.query(TaskResult).filter(
        TaskResult.task_id == task_id
    ).first()
    if not task:
        task = TaskResult(task_id=task_id)
        session.add(task)
    task.status = 'timeout'
    task.error = 'Task exceeded maximum execution time'
    task.completed_at = datetime.utcnow()
    logger.warning(f'Task {task_id} timed out')
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
Node.js Implementation with Express
For teams preferring Node.js, here’s an equivalent Express implementation:
const express = require('express');
const crypto = require('crypto');
const { Pool } = require('pg');
const app = express();
// Keep the body as a raw Buffer so the signature is computed over the exact bytes received
app.use(express.raw({ type: 'application/json' }));
const pool = new Pool({
  connectionString: process.env.DATABASE_URL
});
const CLAUDE_WEBHOOK_KEY = process.env.CLAUDE_WEBHOOK_KEY;
function verifyWebhookSignature(payload, signature, key) {
  const expected = crypto
    .createHmac('sha256', key)
    .update(payload)
    .digest('hex');
  const received = Buffer.from(signature);
  const computed = Buffer.from(expected);
  // timingSafeEqual throws on a length mismatch, so reject unequal lengths first
  if (received.length !== computed.length) {
    return false;
  }
  // Constant-time comparison
  return crypto.timingSafeEqual(received, computed);
}
app.post('/webhooks/claude', async (req, res) => {
  try {
    const rawBody = req.body;
    const signature = req.headers['x-claude-signature'];
    // Verify signature
    if (!signature || !verifyWebhookSignature(rawBody, signature, CLAUDE_WEBHOOK_KEY)) {
      console.warn('Invalid webhook signature');
      return res.status(401).json({ error: 'Unauthorized' });
    }
    // Parse event
    const event = JSON.parse(rawBody.toString('utf-8'));
    const eventId = event.id;
    const eventType = event.type;
    console.log(`Received webhook event: ${eventId} (${eventType})`);
    // Check for duplicate
    const existing = await pool.query(
      'SELECT event_id FROM processed_webhook_events WHERE event_id = $1',
      [eventId]
    );
    if (existing.rows.length > 0) {
      console.log(`Event ${eventId} already processed`);
      return res.status(200).json({ status: 'already_processed' });
    }
    // Process based on event type
    if (eventType === 'agent.task.completed') {
      await processTaskCompleted(event);
    } else if (eventType === 'agent.task.failed') {
      await processTaskFailed(event);
    } else if (eventType === 'agent.task.timeout') {
      await processTaskTimeout(event);
    }
    // Record processed event
    await pool.query(
      'INSERT INTO processed_webhook_events (event_id, event_type, payload) VALUES ($1, $2, $3)',
      [eventId, eventType, rawBody.toString('utf-8')]
    );
    console.log(`Successfully processed event ${eventId}`);
    return res.status(200).json({ status: 'processed' });
  } catch (error) {
    console.error('Webhook processing error:', error);
    // Return 5xx to trigger Claude retry
    return res.status(500).json({ error: 'Processing failed' });
  }
});
async function processTaskCompleted(event) {
  const taskId = event.data?.task_id;
  const result = event.data?.result;
  if (!taskId) return;
  await pool.query(
    'INSERT INTO task_results (task_id, status, result, completed_at) VALUES ($1, $2, $3, NOW()) ON CONFLICT (task_id) DO UPDATE SET status = $2, result = $3, completed_at = NOW()',
    [taskId, 'completed', JSON.stringify(result)]
  );
  console.log(`Task ${taskId} completed`);
}
async function processTaskFailed(event) {
  const taskId = event.data?.task_id;
  const error = event.data?.error;
  if (!taskId) return;
  await pool.query(
    'INSERT INTO task_results (task_id, status, error, completed_at) VALUES ($1, $2, $3, NOW()) ON CONFLICT (task_id) DO UPDATE SET status = $2, error = $3, completed_at = NOW()',
    [taskId, 'failed', error]
  );
  console.error(`Task ${taskId} failed: ${error}`);
}
async function processTaskTimeout(event) {
  const taskId = event.data?.task_id;
  if (!taskId) return;
  await pool.query(
    'INSERT INTO task_results (task_id, status, error, completed_at) VALUES ($1, $2, $3, NOW()) ON CONFLICT (task_id) DO UPDATE SET status = $2, error = $3, completed_at = NOW()',
    [taskId, 'timeout', 'Task exceeded maximum execution time']
  );
  console.warn(`Task ${taskId} timed out`);
}
app.listen(3000, () => {
  console.log('Webhook server listening on port 3000');
});
Failure Scenarios and How to Prevent Them {#failure-scenarios}
Production deployments fail in predictable ways. Here are the failure modes we see most often, how they manifest, and how to prevent them.
Scenario 1: Signature Verification Bypass
What happens: Your code skips signature verification or implements it incorrectly. An attacker forges webhook events to trigger false alerts or corrupt data.
How it happens:
- Developer uses string comparison instead of constant-time comparison (signature == expected instead of hmac.compare_digest)
- Signing key is logged or exposed in error messages
- Signature is verified against the wrong key (e.g., a test key in production)
Prevention:
- Always use constant-time comparison (built into hmac.compare_digest in Python, crypto.timingSafeEqual in Node.js)
- Store signing keys in environment variables, never in code
- Rotate signing keys periodically
- Log verification failures but never log the actual key or signature
- Unit test signature verification with both valid and forged signatures
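The last prevention item can be sketched as a small self-contained test. The sign and verify helpers below mirror the HMAC-SHA256 hex scheme described in this guide; the key and payload are placeholders, and the helper names are illustrative.

```python
import hashlib
import hmac


def sign(payload: bytes, key: str) -> str:
    """Hex HMAC-SHA256 of the payload."""
    return hmac.new(key.encode(), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, key: str) -> bool:
    """Constant-time check of a hex signature against the expected value."""
    return hmac.compare_digest(signature.encode(), sign(payload, key).encode())


def test_signature_verification() -> None:
    key = "test-signing-key"  # placeholder, not a real key
    body = b'{"id": "evt_xyz789", "type": "agent.task.completed"}'
    good = sign(body, key)

    assert verify(body, good, key)               # valid signature
    assert not verify(body + b" ", good, key)    # tampered body
    assert not verify(body, good, "wrong-key")   # wrong key
    assert not verify(body, "0" * 64, key)       # forged signature
    assert not verify(body, "", key)             # empty signature


test_signature_verification()
print("signature tests passed")
```

Covering the tampered-body and wrong-key cases matters as much as the happy path, since a verifier that always returns True would pass a valid-only test.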
Scenario 2: Duplicate Event Processing
What happens: The same webhook event is processed multiple times, leading to duplicate records, double-charged transactions, or repeated notifications.
How it happens:
- Your endpoint processes the event but crashes before returning a 2xx response
- Claude retries, and your code processes it again
- Multiple instances of your application process the same event simultaneously
Prevention:
- Implement idempotent event handling using event IDs
- Use database transactions to make the “check if processed” and “mark as processed” operations atomic
- Use a distributed lock (Redis, DynamoDB) if running multiple instances
- Return 5xx status codes for transient failures so that Claude retries the delivery, and 400 Bad Request for permanent failures such as malformed events; never return 2xx for an event you did not actually process
Scenario 3: Endpoint Timeout
What happens: Your webhook endpoint takes too long to respond, Claude times out, and retries the event.
How it happens:
- Your event handler makes synchronous database queries that lock
- You call external APIs without timeouts
- You process the entire event synchronously instead of queueing work
Prevention:
- Keep webhook handlers fast. Target sub-100ms response times.
- Offload heavy processing to background jobs (Celery, Bull, etc.)
- Set timeouts on all external API calls
- Use connection pooling to avoid database connection exhaustion
- Monitor webhook endpoint latency and alert on p99 > 500ms
Scenario 4: Database Connection Exhaustion
What happens: Your webhook handler opens database connections but doesn’t close them. Eventually, all available connections are consumed, and new webhooks fail.
How it happens:
- Exception handling doesn’t close connections
- Connection pooling is misconfigured (pool size too small)
- A single webhook handler spawns multiple database connections
Prevention:
- Use connection pooling with appropriate pool size (typically 10-20 per CPU core)
- Always close connections in finally blocks or use context managers
- Monitor connection pool usage and alert when utilisation exceeds 80%
- Set connection timeouts to prevent hung connections from blocking the pool
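The context-manager advice can be sketched as follows. The example uses the standard-library sqlite3 module as a stand-in for a pooled production driver (with a pool, closing typically returns the connection to the pool), and db_connection is a hypothetical helper name.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def db_connection(path: str = ":memory:") -> Iterator[sqlite3.Connection]:
    """Open a connection, commit on success, roll back on error, always close."""
    conn = sqlite3.connect(path)
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


leaked = None
try:
    with db_connection() as conn:
        leaked = conn
        conn.execute("CREATE TABLE t (x INTEGER)")
        raise RuntimeError("handler failed mid-request")
except RuntimeError:
    pass

try:
    leaked.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False  # the connection was closed despite the exception
print(still_open)
```

Because the close happens in the finally clause, a handler that raises partway through cannot leave a connection checked out.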
Scenario 5: Silent Failures
What happens: Your webhook handler catches exceptions and returns 200 OK without actually processing the event. Claude thinks delivery succeeded and doesn’t retry.
How it happens:
@app.route('/webhooks/claude', methods=['POST'])
def handle_webhook():
    try:
        event = parse_event(request)
        process_event(event)
        return {'status': 'ok'}, 200
    except Exception:
        # BUG: Silently swallow the error
        return {'status': 'ok'}, 200
Prevention:
- Return 5xx status codes for transient failures (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable)
- Return 4xx status codes only for permanent failures (400 Bad Request for malformed events)
- Log all exceptions with full stack traces
- Use structured logging (JSON logs) so exceptions are searchable
- Set up alerts for 5xx webhook responses
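The status-code rules above can be captured in one small mapping function so that no code path falls back to 200 by accident. This is a sketch under assumptions: PermanentEventError is a hypothetical marker class, and which exceptions count as permanent depends on your application.

```python
import json


class PermanentEventError(Exception):
    """Hypothetical marker for events that can never succeed (e.g. bad schema)."""


def status_for(exc: Exception) -> int:
    """Map a processing exception to the HTTP status the endpoint should return."""
    if isinstance(exc, (json.JSONDecodeError, PermanentEventError)):
        return 400  # retrying will not help
    return 500      # unknown or transient: let the sender retry


def handle(raw_body: bytes, process) -> int:
    """Run process(event) and return a status code instead of swallowing errors."""
    try:
        process(json.loads(raw_body))
        return 200
    except Exception as exc:
        return status_for(exc)


def flaky(event):
    raise ConnectionError("database unavailable")


print(handle(b'{"id": "evt_1"}', lambda e: None))  # success
print(handle(b'{"id": "evt_1"}', flaky))           # transient failure
print(handle(b"not json", lambda e: None))         # malformed body
```

Defaulting unknown exceptions to 500 errs on the side of a retry, which is safe as long as the handler is idempotent.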
Scenario 6: Clock Skew and Replay Attacks
What happens: An attacker captures a webhook event and replays it hours or days later. Your code processes it as if it were new.
How it happens:
- Your idempotency check only uses event ID, not timestamp
- You don’t validate event timestamps
- Webhook events are stored indefinitely, allowing old events to be replayed
Prevention:
- Include timestamp in the event payload (Claude does this)
- Reject events older than a threshold (e.g., > 5 minutes)
- Combine event ID and timestamp for deduplication
- Clean up processed events after a retention period (90 days is reasonable)
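from datetime import datetime, timezone
def verify_event_timestamp(event: dict, max_age_seconds: int = 300) -> bool:
    """
    Verify that event timestamp is recent.
    """
    # Accept a trailing 'Z', which fromisoformat rejects before Python 3.11
    event_timestamp = datetime.fromisoformat(event['timestamp'].replace('Z', '+00:00'))
    if event_timestamp.tzinfo is None:
        # Treat naive timestamps as UTC (an assumption about the payload)
        event_timestamp = event_timestamp.replace(tzinfo=timezone.utc)
    age = (datetime.now(timezone.utc) - event_timestamp).total_seconds()
    return 0 <= age <= max_age_seconds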
Monitoring and Observability {#monitoring}
Key Metrics to Track
Production webhook systems require comprehensive observability. Track these metrics:
- Delivery rate: Percentage of events successfully delivered (target: > 99.9%)
- Latency: Time from webhook dispatch to endpoint response (target: p99 < 500ms)
- Error rate: Percentage of requests returning non-2xx status (target: < 0.1%)
- Duplicate rate: Percentage of events processed multiple times (target: 0%)
- Retry rate: Percentage of events requiring retries (target: < 1%)
- Processing time: Time to process event after signature verification (target: p99 < 100ms)
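To make the latency and rate targets concrete, the sketch below computes a nearest-rank percentile and a simple percentage from raw samples. The sample values are made up for illustration, and a production system would normally get these figures from its metrics backend rather than computing them by hand.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rate(count: int, total: int) -> float:
    """Percentage of total, or 0.0 when there were no events."""
    return 100.0 * count / total if total else 0.0


latencies_ms = [12, 15, 18, 22, 30, 41, 45, 60, 95, 480]  # illustrative samples
print(percentile(latencies_ms, 99))  # tail latency
print(percentile(latencies_ms, 50))  # median
print(rate(3, 1000))                 # e.g. 3 errors in 1,000 deliveries
```

With only ten samples the p99 is simply the slowest request, which is why percentile targets are only meaningful over a reasonably large window.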
Structured Logging
Use structured logging (JSON) so logs are machine-readable and searchable:
import json
import logging
from datetime import datetime
class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            'timestamp': datetime.utcnow().isoformat(),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)
        return json.dumps(log_data)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
Example log output:
{
"timestamp": "2024-01-15T10:23:45.123Z",
"level": "INFO",
"logger": "webhook",
"message": "Webhook event processed successfully",
"event_id": "evt_abc123",
"event_type": "agent.task.completed",
"task_id": "task_xyz789",
"processing_time_ms": 45
}
Alerting Rules
Set up alerts for these conditions:
- Webhook endpoint returning 5xx: Immediate alert (indicates service degradation)
- Error rate > 1%: 5-minute alert (something is wrong)
- Latency p99 > 1 second: 10-minute alert (performance degradation)
- Duplicate processing detected: Daily digest (indicates idempotency issues)
- Webhook delivery backlog: 15-minute alert (Claude is queuing events)
Distributed Tracing
For complex systems, use distributed tracing to correlate webhook events with downstream actions:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@app.route('/webhooks/claude', methods=['POST'])
def handle_webhook():
    with tracer.start_as_current_span('handle_webhook') as span:
        raw_body = request.get_data()
        signature = request.headers.get('X-Claude-Signature')
        # Verify before parsing, and reject the request if verification fails
        with tracer.start_as_current_span('verify_signature'):
            if not signature or not verify_webhook_signature(raw_body, signature, CLAUDE_WEBHOOK_KEY):
                return {'error': 'Unauthorized'}, 401
        event = json.loads(raw_body)
        span.set_attribute('event.id', event.get('id', ''))
        span.set_attribute('event.type', event.get('type', ''))
        with tracer.start_as_current_span('process_event'):
            process_claude_event(event)
        return {'status': 'processed'}, 200
Scaling Webhook Infrastructure {#scaling}
Horizontal Scaling
As webhook volume grows, you’ll need to scale horizontally. Key considerations:
- Load balancing: Distribute incoming webhooks across multiple instances
- Idempotency database: Shared across all instances (e.g., PostgreSQL, DynamoDB)
- Distributed locks: Prevent duplicate processing when multiple instances receive the same event
Here’s a pattern using Redis for distributed locking:
import redis
from redis.exceptions import LockError
from redis.lock import Lock
redis_client = redis.Redis(host='localhost', port=6379)
def handle_webhook_with_lock(event_id: str, event_type: str, process_fn):
    """
    Process webhook with distributed locking to prevent duplicates.
    """
    lock_key = f'webhook:lock:{event_id}'
    lock = Lock(redis_client, lock_key, timeout=30, blocking=True)
    # Acquire outside the try block so that we only release a lock we hold
    if not lock.acquire(blocking_timeout=5):
        # Another instance is processing this event
        logger.info(f'Event {event_id} is being processed by another instance')
        return {'status': 'processing'}
    try:
        # Check if already processed
        if redis_client.exists(f'webhook:processed:{event_id}'):
            logger.info(f'Event {event_id} already processed')
            return {'status': 'already_processed'}
        # Process event
        process_fn(event_id, event_type)
        # Mark as processed
        redis_client.setex(f'webhook:processed:{event_id}', 86400, '1')
        return {'status': 'processed'}
    finally:
        try:
            lock.release()
        except LockError:
            # The lock expired because processing outran the 30-second timeout
            logger.warning(f'Lock for event {event_id} expired before release')
Queue-Based Architecture
For very high-volume deployments, decouple webhook ingestion from processing:
┌─────────────────────────────────────────────────────┐
│ Webhook Ingestion Layer │
│ - Verify signature │
│ - Check idempotency │
│ - Enqueue to message queue │
│ - Return 200 OK immediately │
└────────────────┬────────────────────────────────────┘
│
▼
┌────────────────┐
│ Message Queue │
│ (Kafka, RabbitMQ, SQS) │
└────────┬───────┘
│
┌────────▼──────────┐
│ Worker Pool │
│ - Consume events │
│ - Process events │
│ - Update DB │
│ - Trigger actions │
└───────────────────┘
This pattern allows you to:
- Respond to webhooks immediately (ingestion layer returns 200 OK)
- Scale processing independently (add more workers without adding webhook endpoints)
- Retry failed processing from your own queue, without depending on Claude’s retry mechanism
- Process events in parallel with configurable concurrency
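The split between ingestion and processing can be sketched in a few lines. This is a minimal illustration, not a production design: it uses Python's in-process queue.Queue as a stand-in for Kafka, RabbitMQ, or SQS, and the names ingest, worker, run_workers, and stop_workers are hypothetical. Signature verification is assumed to happen before ingest parses the body.

```python
import json
import queue
import threading

# In-process stand-in for a real message broker (assumption for illustration).
event_queue = queue.Queue()
processed = []  # Records the IDs of handled events; list.append is thread-safe.


def ingest(raw_body: bytes):
    """Ingestion layer: parse, enqueue, and acknowledge immediately."""
    # Signature verification would run here, before the body is parsed.
    event = json.loads(raw_body)
    event_queue.put(event)
    return {'status': 'accepted'}, 200


def worker() -> None:
    """Worker: consume events until a None sentinel arrives."""
    while True:
        event = event_queue.get()
        try:
            if event is None:
                return
            processed.append(event['id'])  # Placeholder for real processing.
        finally:
            event_queue.task_done()


def run_workers(count: int):
    """Start `count` worker threads and return them."""
    threads = [threading.Thread(target=worker) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads


def stop_workers(threads) -> None:
    """Send one sentinel per worker, then wait for all of them to exit."""
    for _ in threads:
        event_queue.put(None)
    for thread in threads:
        thread.join()
```

Adding workers changes only the argument to run_workers; the ingestion function is unaffected, which is the property the queue-based design is meant to provide.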
Database Optimization
As webhook volume grows, optimize your database for high throughput:
- Partition processed_webhook_events table: Partition by event timestamp to keep indexes small
- Batch inserts: Insert processed events in batches of 100-1000
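- Async commits: PostgreSQL always writes through its write-ahead log (WAL); if you can tolerate losing the last few commits after a crash, setting synchronous_commit = off for these writes lowers commit latency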
- Connection pooling: Use PgBouncer or similar to manage connections
- Monitoring: Track query latency, index usage, and table bloat
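-- Partition processed_webhook_events by month.
-- PostgreSQL requires the partition key in every primary key on a partitioned table,
-- so the key is (event_id, processed_at). This does not enforce uniqueness of event_id
-- on its own; keep the event ID deduplication check in the application as well.
CREATE TABLE processed_webhook_events (
    event_id VARCHAR NOT NULL,
    event_type VARCHAR,
    payload TEXT,
    received_at TIMESTAMP,
    processed_at TIMESTAMP NOT NULL,
    PRIMARY KEY (event_id, processed_at)
) PARTITION BY RANGE (processed_at);

-- Create partitions for each month
CREATE TABLE processed_webhook_events_2024_01 PARTITION OF processed_webhook_events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE processed_webhook_events_2024_02 PARTITION OF processed_webhook_events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');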
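The batch-insert recommendation above can be sketched as follows. This is a hedged illustration that uses Python's built-in sqlite3 module as a stand-in database so it runs anywhere; the helper names batched and insert_events are hypothetical, and the table is reduced to two columns. On PostgreSQL the equivalent statement would use ON CONFLICT DO NOTHING through your driver of choice.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


def insert_events(conn, rows, batch_size=500):
    """Insert (event_id, event_type) rows, committing once per batch.

    Returns the number of batches committed. Rows whose event_id already
    exists are skipped, so replaying a batch is harmless.
    """
    batches = 0
    for chunk in batched(rows, batch_size):
        with conn:  # Commits on success, rolls back if the batch fails.
            conn.executemany(
                'INSERT OR IGNORE INTO processed_webhook_events '
                '(event_id, event_type) VALUES (?, ?)',
                chunk,
            )
        batches += 1
    return batches
```

One transaction per batch keeps commit overhead low while limiting how much work is lost if a single batch fails.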
Next Steps and Production Readiness {#next-steps}
Pre-Launch Checklist
Before deploying webhook infrastructure to production:
- Signature verification implemented and tested
- Idempotency check implemented (event ID deduplication)
- Database connection pooling configured
- Error handling returns appropriate status codes (5xx for retries, 4xx for permanent failures)
- Structured logging configured
- Monitoring and alerting in place
- Load testing completed (target: 100+ webhooks/second)
- Disaster recovery plan documented (what happens if webhook endpoint goes down)
- Security audit completed (signature verification, key rotation, access control)
- Documentation written (runbooks, troubleshooting guides)
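The status-code item in the checklist can be made concrete with a small wrapper. This is a sketch under stated assumptions: PermanentError, TransientError, and respond are hypothetical names, and it assumes the sender retries on 5xx responses and gives up on 4xx, as described above.

```python
class PermanentError(Exception):
    """Hypothetical: the event can never succeed (e.g., malformed payload)."""


class TransientError(Exception):
    """Hypothetical: the event may succeed on retry (e.g., database unavailable)."""


def respond(process_fn, event):
    """Run process_fn(event) and map the outcome to a (body, status) pair."""
    try:
        process_fn(event)
    except PermanentError as exc:
        # 4xx: retrying will not help, so the sender should stop.
        return {'status': 'rejected', 'error': str(exc)}, 400
    except TransientError as exc:
        # 5xx: ask the sender to deliver the event again later.
        return {'status': 'retry', 'error': str(exc)}, 503
    except Exception:
        # Unknown failures default to 5xx so the event is not silently dropped.
        return {'status': 'error'}, 500
    return {'status': 'processed'}, 200
```

Defaulting unknown exceptions to 500 favours redelivery over data loss, which is safe only when the handler is idempotent.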
Testing Strategy
Implement comprehensive tests for webhook handling:
import pytest
import json
import hmac
import hashlib

# `client` (an HTTP test client for the app) and `db` (a database handle) are assumed
# to be provided by your test setup, e.g. fixtures in conftest.py. The app under test
# must be configured with 'test_key' as its webhook key for the signatures to match.

def test_valid_webhook_signature():
    """
    Test that valid webhooks are accepted.
    """
    payload = json.dumps({'id': 'evt_123', 'type': 'agent.task.completed'})
    key = 'test_key'
    signature = hmac.new(key.encode(), payload.encode(), hashlib.sha256).hexdigest()
    response = client.post(
        '/webhooks/claude',
        data=payload,
        headers={'X-Claude-Signature': signature}
    )
    assert response.status_code == 200

def test_invalid_webhook_signature():
    """
    Test that invalid webhooks are rejected.
    """
    payload = json.dumps({'id': 'evt_123', 'type': 'agent.task.completed'})
    invalid_signature = 'invalid_signature'
    response = client.post(
        '/webhooks/claude',
        data=payload,
        headers={'X-Claude-Signature': invalid_signature}
    )
    assert response.status_code == 401

def test_duplicate_event_handling():
    """
    Test that duplicate events are handled idempotently.
    """
    payload = json.dumps({'id': 'evt_123', 'type': 'agent.task.completed'})
    key = 'test_key'
    signature = hmac.new(key.encode(), payload.encode(), hashlib.sha256).hexdigest()
    # First request
    response1 = client.post(
        '/webhooks/claude',
        data=payload,
        headers={'X-Claude-Signature': signature}
    )
    assert response1.status_code == 200
    # Second request (duplicate)
    response2 = client.post(
        '/webhooks/claude',
        data=payload,
        headers={'X-Claude-Signature': signature}
    )
    assert response2.status_code == 200
    # Verify event was only processed once (assumes task_results starts empty)
    assert db.query('SELECT COUNT(*) FROM task_results').scalar() == 1
Observability Setup
For teams operating at scale, we recommend Cloudflare Developers — Webhook signature verification example as a reference for edge-based verification, and AWS EventBridge User Guide — Invoking targets with events for understanding event-driven patterns at scale.
Also consider Microsoft Learn — HTTP-triggered workflows and endpoints for webhook integration with enterprise automation platforms.
Getting Help
If you’re building production Claude deployments and need fractional CTO leadership, platform engineering expertise, or help scaling webhook infrastructure, PADISO’s Services include CTO as a Service and Platform Design & Engineering tailored to your architecture. We’ve deployed webhook systems across Sydney, San Francisco, and other regions.
For a diagnostic of your current Claude readiness and infrastructure, consider our AI Quickstart Audit—a fixed-scope, two-week engagement that identifies production risks and prioritises what to ship first.
Key Takeaways
- Webhooks are essential for production Claude deployments: They enable real-time event-driven architectures without polling.
- Security is non-negotiable: Always verify signatures using constant-time comparison, and store keys in environment variables.
- Idempotency is mandatory: Use event IDs and database transactions to ensure the same event is never processed twice.
- Failure modes are predictable: Implement the patterns we’ve outlined to prevent signature bypass, duplicates, timeouts, and silent failures.
- Observability is critical: Monitor delivery rate, latency, error rate, and duplicate detection from day one.
- Scale horizontally with care: Use distributed locking and message queues to handle high volumes reliably.
Webhook architectures are mature, battle-tested patterns. Following the practices in this guide (signature verification, idempotency, error handling, and monitoring) will give you a production-grade foundation for Claude deployments at any scale.
Additional Resources
For reference implementations and deeper technical guidance, review the official documentation from Claude API Docs — Subscribe to webhooks, OpenAI Platform Docs — Webhooks guide, and Stripe Docs — Webhooks. All three platforms use similar patterns for security and reliability.
For edge-based verification and advanced routing, Cloudflare Developers — Webhook signature verification example shows how to verify signatures at the edge before routing to origin servers. For enterprise event orchestration, GitHub Docs — Webhooks and Sentry Docs — Webhooks provide additional reference patterns.
If you’re building multi-region or event-driven infrastructure, AWS EventBridge User Guide — Invoking targets with events covers managed event routing at scale, and Microsoft Learn — HTTP-triggered workflows and endpoints shows how to integrate webhooks with enterprise automation platforms.