AI Incident Response: A Runbook for Claude-Powered Products
Step-by-step runbook for responding to Claude AI product incidents: model rollback, prompt freeze, customer comms, and post-mortem templates.
When a Claude-powered product breaks in production, you have minutes—not hours—to contain the damage, communicate with customers, and restore service. This runbook gives you the exact steps to take in the first 60 minutes, plus templates and decision trees to move fast without panic.
If you’re running AI products at scale, this guide is essential. If you’re building your first Claude integration, this is the playbook you’ll wish you had when things go wrong.
Table of Contents
- Why AI Incident Response Is Different
- The First 5 Minutes: Detect and Declare
- Minutes 5–15: Isolate and Freeze
- Minutes 15–30: Assess and Decide
- Minutes 30–45: Communicate and Mitigate
- Minutes 45–60: Execute Recovery
- Post-Incident: The 24-Hour Post-Mortem
- Runbook Templates and Checklists
- Prevention: Build Incident Readiness Into Your Product
- Next Steps: Making This Real
Why AI Incident Response Is Different
Traditional incident response assumes your system is broken because of a code bug, infrastructure failure, or database corruption. You find the root cause, roll back, and move on.
AI incidents are messier. A Claude model can produce hallucinations, inconsistent outputs, or toxic responses—none of which are “bugs” in the classical sense. Your prompt might be perfect, but the model’s behaviour can change when you upgrade to a new version. A customer’s input might trigger an edge case you never tested. The model’s reasoning might be sound, but the output is malformed or unsafe.
This creates a new problem: you can’t always roll back code and fix it. Sometimes you need to freeze the prompt, adjust the model parameters, or switch to an older Claude version. Sometimes you need to admit the AI isn’t ready for a particular use case and disable the feature entirely.
At PADISO, we’ve seen teams ship Claude-powered products without incident response plans. When something goes wrong—a prompt that generates harmful content, a model that hallucinates customer data, an API call that fails silently—the team panics, customers get angry, and trust evaporates.
This runbook prevents that. It’s built on real incidents we’ve handled and patterns we’ve seen across startups and enterprises building AI products in Australia and globally.
The First 5 Minutes: Detect and Declare
How You’ll Know Something Is Wrong
You’ll find out in one of these ways:
- Automated monitoring alerts — Your observability tool (Datadog, New Relic, Sentry) detects elevated error rates, latency spikes, or unusual token usage.
- Customer reports — A customer emails support or posts on Slack: “Your AI is giving me garbage.”
- Internal testing — You or your team notice odd outputs during QA or manual testing.
- Social media or public channels — Someone complains publicly before you know there’s a problem.
Regardless of how you find out, the first step is the same: declare an incident.
Declare the Incident
Post a single message in your incident channel (Slack, PagerDuty, Discord—whatever you use) with this format:
🚨 INCIDENT DECLARED: [Product Name] — [Brief Description]
Time: [UTC timestamp]
Severity: [Critical / High / Medium]
Owner: [Name]
Status: INVESTIGATING
Example:
🚨 INCIDENT DECLARED: Customer Support Bot — Returning incomplete responses
Time: 2025-02-14T09:32:00Z
Severity: Critical
Owner: Sarah (Engineering)
Status: INVESTIGATING
Don’t wait for perfect information. Declare now, refine later. This single message does three things:
- Triggers your incident response team — Everyone knows to drop what they’re doing.
- Creates a timestamp — You’ll need this for your post-mortem.
- Prevents duplicate effort — Other team members know someone is already handling it.
Assign an Incident Commander
One person owns this incident for the next 60 minutes. This is usually your senior engineer, tech lead, or on-call engineer. Their job is not to fix the problem—it’s to coordinate the response, make decisions, and keep everyone informed.
The Incident Commander should:
- Own all decisions in the next 60 minutes.
- Update the incident channel every 10 minutes, even if there’s no new information.
- Decide when to escalate (e.g., “We need to roll back”).
- Decide when to communicate to customers.
Minutes 5–15: Isolate and Freeze
Step 1: Stop the Bleeding (Immediately)
Your first job is to prevent the incident from getting worse. You have four levers:
Option A: Kill the Feature
If the AI feature is optional (e.g., “AI-powered summary” rather than “core chat”), disable it immediately. Most products have a feature flag or kill switch for this reason.
# Example: Disable Claude integration via environment variable
export CLAUDE_ENABLED=false
# Redeploy or restart the service
This takes 2–5 minutes and stops new incidents from happening. Do this first if you can.
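On the application side, the kill switch can be read at call time so you don’t need a redeploy just to re-check it. This is a minimal sketch: the env-var name matches the shell example above, but the function names are illustrative and the Claude call is stubbed out.

```python
import os

def claude_enabled() -> bool:
    """Read the kill switch at call time, so flipping it takes effect quickly."""
    return os.environ.get("CLAUDE_ENABLED", "true").lower() == "true"

def generate_ai_summary(text: str):
    """Return an AI summary, or None so the caller can hide the feature."""
    if not claude_enabled():
        return None  # graceful degradation: the UI simply hides the summary panel
    # In production this would call client.messages.create(...); stubbed here.
    return f"[summary of {len(text)} chars]"
```

Returning `None` (rather than raising) lets the rest of the product keep working while the AI feature is dark.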
Option B: Freeze the Prompt
If the problem is in the prompt (e.g., it’s generating toxic outputs), lock it immediately. Don’t let engineers experiment or tweak it.
Create a file called PROMPT_FROZEN.txt in your repo or config system:
PROMPT FROZEN: 2025-02-14T09:35:00Z
Reason: Generating harmful responses
Frozen by: Sarah
Do not modify until incident commander approves
Any change to the prompt requires explicit approval from the Incident Commander.
Option C: Rollback the Model Version
If you recently upgraded from Claude 3.5 Sonnet to Claude 3.7 Sonnet (or any newer version), consider rolling back to the previous version immediately.
# Before (current, broken)
MODEL = "claude-3-7-sonnet-20250219"
# After (rollback)
MODEL = "claude-3-5-sonnet-20241022"
# Note: the model is set per request (client.messages.create(model=MODEL, ...)),
# not on the Anthropic() client itself.
This is a safe move. Older Claude versions are stable and battle-tested. You’re trading some capability for stability.
Option D: Rate Limit or Throttle
If the problem is intermittent or affects only certain input types, you can throttle traffic to the Claude API:
from functools import wraps
import time

class RateLimitError(Exception):
    """Raised when we throttle a Claude call locally."""

def rate_limit_claude(max_calls_per_minute=10):
    def decorator(func):
        calls = []
        @wraps(func)
        def wrapper(*args, **kwargs):
            now = time.time()
            calls[:] = [c for c in calls if now - c < 60]  # keep only the last minute
            if len(calls) >= max_calls_per_minute:
                raise RateLimitError("Claude calls throttled")
            calls.append(now)
            return func(*args, **kwargs)
        return wrapper
    return decorator
This buys you time to investigate without a full rollback.
Step 2: Gather Evidence (Minutes 5–10)
While you’re isolating the issue, start collecting data:
- Pull logs — Get the last 30 minutes of API requests, responses, and errors.
- Check the model’s recent changes — Did you deploy a new prompt, version, or parameter in the last 24 hours?
- Reproduce the issue — Try to trigger it yourself with the same input.
- Check Anthropic’s status — Visit Anthropic’s status page (status.anthropic.com) to see if there are known issues with Claude.
- Review your metrics — Token usage, latency, error rates—what changed?
Save all of this in a shared document (Google Doc, Notion, GitHub issue) so the whole team can see it.
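If you log each Claude call as one JSON line, a quick triage script can summarise the incident window while you work. The log schema here (`ts`, `status`, `error` fields) and the path are assumptions — adapt them to whatever your logging layer actually writes.

```python
import json
import time
from collections import Counter

def summarize_claude_log(path="logs/claude_calls.jsonl", window_s=1800, now=None):
    """Tally statuses and error types from the last 30 minutes of a JSONL call log.

    Assumed record shape: {"ts": <epoch seconds>, "status": "...", "error": "..."}.
    """
    now = now if now is not None else time.time()
    statuses, errors = Counter(), Counter()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if now - rec["ts"] > window_s:
                continue  # outside the incident window
            statuses[rec["status"]] += 1
            if rec.get("error"):
                errors[rec["error"]] += 1
    return statuses, errors
```

Paste the two counters straight into the shared incident doc — “42 timeouts, 3 truncations in the last 30 minutes” is far more useful than “lots of errors”.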
Step 3: Freeze Everything (Minutes 10–15)
Once you’ve isolated the issue, freeze changes:
- No prompt changes — Lock the prompt file.
- No model upgrades — Stay on the current version.
- No new deployments — Halt CI/CD for this service.
- No parameter tweaks — Don’t change temperature, max tokens, or system prompts.
Write this in your incident channel:
✅ ISOLATED: Feature disabled / Prompt frozen / Model rolled back
🔒 FREEZE IN EFFECT: No changes to Claude config until further notice
Next: Assessing root cause (5 min)
Minutes 15–30: Assess and Decide
Diagnosis: What Went Wrong?
Now you have time to think. Use this framework to diagnose the issue:
Is It a Prompt Problem?
Symptoms:
- Outputs are toxic, off-brand, or factually wrong.
- The model is ignoring instructions.
- Responses are incomplete or malformed.
Test: Run the exact same input through the Claude API directly (via the Anthropic console or a test script). If you get the same bad output, it’s a prompt issue.
Root causes:
- Ambiguous or conflicting instructions in the system prompt.
- The prompt assumes context that isn’t provided.
- Recent changes to the prompt introduced new issues.
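A small harness makes that side-by-side test repeatable. In this sketch the request builder is pure data (so it can be checked offline), only `reproduce()` touches the API, and `SYSTEM_PROMPT` stands in for your real production prompt:

```python
SYSTEM_PROMPT = "You are a customer support assistant."  # paste your real prompt here

def build_repro_request(user_input, model="claude-3-5-sonnet-20241022"):
    """Assemble exactly the request the product sends, for side-by-side testing."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": user_input}],
    }

def reproduce(user_input):
    """Send the captured input straight to the API, bypassing the product."""
    from anthropic import Anthropic  # requires ANTHROPIC_API_KEY in the environment
    client = Anthropic()
    resp = client.messages.create(**build_repro_request(user_input))
    return resp.content[0].text
```

If `reproduce()` returns the same bad output your product did, the integration layer is exonerated and the prompt is the prime suspect.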
Is It a Model Problem?
Symptoms:
- Outputs are nonsensical or hallucinating.
- The model is generating code that doesn’t work.
- Behaviour changed after upgrading Claude versions.
Test: Run the same input through an older Claude version. If the old version works, it’s a model regression.
Root causes:
- The new Claude version behaves differently on your workload (behaviour shifts between versions are expected, even when the upgrade is an improvement overall).
- The new version is more sensitive to prompt wording.
- Your inputs are triggering an edge case in the new model.
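That regression test can be a five-line A/B harness. This sketch accepts an injected client so it’s testable offline; by default it builds a real `Anthropic` client (requires `ANTHROPIC_API_KEY`), and the model IDs you pass in are up to you.

```python
def compare_models(user_input, models, client=None):
    """Run the same input through each model and return {model_id: output_text}."""
    if client is None:
        from anthropic import Anthropic
        client = Anthropic()
    outputs = {}
    for model in models:
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": user_input}],
        )
        outputs[model] = resp.content[0].text
    return outputs
```

Diff the outputs by eye: if the old model behaves and the new one doesn’t, you have your regression and your rollback target in one step.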
Is It an Integration Problem?
Symptoms:
- API calls are failing or timing out.
- Responses are incomplete or truncated.
- The service is returning 5xx errors.
Test: Check your API logs. Are calls reaching Anthropic? Are responses coming back? Is your code handling them correctly?
Root causes:
- Rate limiting or quota exceeded.
- Network issues or timeouts.
- Your code isn’t handling streaming responses correctly.
- The API schema changed (rare, but check Anthropic’s changelog).
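The Anthropic Python SDK raises typed errors you can sort into these buckets. The sketch below matches on class name so it stays dependency-free for illustration; in your service you would catch the SDK classes (e.g. `anthropic.RateLimitError`, `anthropic.APIConnectionError`) directly.

```python
def classify_claude_failure(exc: Exception) -> str:
    """Map an exception from a Claude call onto the diagnosis buckets above."""
    name = type(exc).__name__
    if name == "RateLimitError":
        return "integration: rate limit or quota exceeded"
    if name == "APIConnectionError":
        return "integration: network issue or timeout"
    if name in ("APIStatusError", "InternalServerError"):
        return "integration: API returned an error status"
    if name == "AuthenticationError":
        return "integration: bad or expired API key"
    return "unclassified: pull logs and inspect manually"
```

Logging this classification on every failure means your incident channel gets “rate limit” instead of a raw traceback.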
Is It an Input Problem?
Symptoms:
- Only certain inputs trigger the issue.
- Specific customers are affected, not all.
- The issue is intermittent.
Test: Isolate the problematic input. Does the issue reproduce with simpler inputs? Does it happen with different customers?
Root causes:
- The input contains unexpected characters, languages, or formats.
- The input is too long and triggers truncation.
- The input contains adversarial prompts or jailbreak attempts.
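A few cheap checks catch most of these input-side root causes. The thresholds below are illustrative assumptions — tune them against your own traffic before relying on them.

```python
def triage_input(text: str, max_len: int = 100_000) -> list[str]:
    """Flag input properties that commonly correlate with AI edge cases."""
    flags = []
    if len(text) > max_len:
        flags.append("very long: may hit context or truncation limits")
    if any(ord(c) < 32 and c not in "\n\r\t" for c in text):
        flags.append("control characters present: check encoding handling")
    non_ascii = sum(1 for c in text if ord(c) > 127)
    if text and non_ascii / len(text) > 0.5:
        flags.append("mostly non-ASCII: check language and encoding paths")
    return flags
```

Run it over the inputs from affected requests: if only the flagged ones fail, you’ve isolated an input problem without touching the prompt or the model.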
Severity Assessment
Once you know what went wrong, assess the blast radius:
| Severity | Definition | Example | Action |
|---|---|---|---|
| Critical | All users affected, core functionality broken | Chat feature returns errors for all inputs | Rollback immediately, notify customers |
| High | Most users affected, significant degradation | 50% of requests fail or return bad outputs | Disable feature, plan rollback for next 2 hours |
| Medium | Some users affected, workaround exists | Specific input types fail; users can retry | Monitor, plan fix for next release |
| Low | Edge case, cosmetic issue | Formatting is slightly off | Log and plan fix, no immediate action |
Update your incident channel:
📊 ROOT CAUSE: Prompt update introduced conflicting instructions
🎯 SEVERITY: High (50% of requests affected)
💡 SOLUTION: Rollback prompt to previous version
ETA: 15 minutes
Decision Tree: Rollback or Iterate?
At this point, you need to decide: Do you roll back, or do you fix it forward?
Roll back if:
- The issue is critical and affects all users.
- You don’t know the root cause yet.
- The fix will take more than 30 minutes.
- You have a known-good version to roll back to.
Iterate (fix forward) if:
- The issue is low-severity and affects a small percentage of users.
- You’ve identified the root cause and the fix is simple (e.g., one-line prompt change).
- You have high confidence the fix will work.
- Rolling back would cause other problems (e.g., you’ve removed a feature in the rollback version).
In most cases, roll back first, investigate later. You can always re-deploy the fixed version once you’ve tested it.
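The tree above is literal enough to encode. A sketch like this can live next to the runbook as an unambiguous tiebreaker when everyone is tired and arguing — the inputs mirror the bullets, not any formal policy:

```python
def should_roll_back(severity: str, root_cause_known: bool,
                     fix_eta_minutes: int, has_known_good_version: bool) -> bool:
    """Encode the rollback-vs-fix-forward decision tree above."""
    if not has_known_good_version:
        return False  # nothing safe to roll back to: fix forward
    if severity == "critical":
        return True   # all users affected: roll back now
    if not root_cause_known:
        return True   # don't iterate blind
    if fix_eta_minutes > 30:
        return True   # too slow to fix forward
    return False      # low severity, known cause, quick fix: fix forward
```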
Minutes 30–45: Communicate and Mitigate
Internal Communication
Update your incident channel every 5 minutes:
⏰ 30-MIN UPDATE
Status: Executing rollback
What we know: Prompt change introduced conflicting instructions
What we're doing: Rolling back to previous prompt version
Customer impact: 50% of requests failing (high priority)
ETA to resolution: 10 minutes
Next update: 9:45 AM
Keep it factual, not emotional. No “We’re super sorry!” or “This is a disaster!” Just the facts.
External Communication (Customer-Facing)
You need to decide: Do we tell customers?
Tell customers if:
- The issue has been ongoing for more than 10 minutes.
- It affects core functionality (not a nice-to-have feature).
- Customers are actively reporting it.
- It’s visible in your public metrics or status page.
Don’t tell customers if:
- You’ve already fixed it.
- It affects less than 5% of users and is low-severity.
- It’s a cosmetic issue that doesn’t impact functionality.
If you decide to communicate, use this template:
🔧 SERVICE UPDATE
We're currently investigating an issue with [Feature Name] that may affect some requests.
[Once diagnosed: We've identified the root cause and are implementing a fix.]
Status: In Progress
ETA: [Time]
Impact: [% of users affected, what they experience]
What you can do: [Retry requests / Use alternative feature / Wait for update]
We'll post updates every 10 minutes. Thank you for your patience.
Post this to:
- Your status page (Statuspage.io, Instatus, etc.)
- Your customer Slack or Discord (if applicable)
- Email to affected customers (if it’s critical)
Do NOT post to social media yet. Wait until you’ve resolved it.
Escalation Decision
At the 30-minute mark, ask yourself:
- Is this resolved? If yes, move to the post-mortem.
- Will this be resolved in the next 15 minutes? If yes, keep executing.
- Will this take longer than 45 minutes? If yes, escalate.
Escalation means:
- Notify your CTO, VP Engineering, or CEO.
- Consider involving Anthropic support (if it’s a platform issue).
- Prepare for a longer incident (move to 24-hour incident management mode).
At PADISO, through our AI Agency Support Sydney services, we help teams build capabilities that catch these issues before they hit production. But when they do, escalation protocols matter.
Minutes 45–60: Execute Recovery
The Rollback
If you’ve decided to roll back, do it now. Here’s the exact process:
Step 1: Create a Rollback Branch
git checkout -b incident/rollback-claude-20250214
git revert [commit hash of the problematic change]
Step 2: Test the Rollback Locally
Run your test suite against the rolled-back code:
pytest tests/claude_integration_test.py -v
If tests pass, move to the next step. If they fail, investigate immediately.
Step 3: Deploy the Rollback
Deploy to production:
git push origin incident/rollback-claude-20250214
# Trigger your CI/CD pipeline
# Deployment should take 5–10 minutes
Step 4: Verify the Fix
Once deployed, test the exact scenario that triggered the incident:
from anthropic import Anthropic
client = Anthropic()
# Test with the problematic input
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[
{"role": "user", "content": "[Problematic input from incident]"}
]
)
print(response.content[0].text)
# Verify output is sensible and safe
If it works, update your incident channel:
✅ RESOLVED: Rollback deployed and verified
🎯 Service is back to normal
📊 Monitoring for regressions (next 30 minutes)
Post-mortem: Tomorrow at 10 AM
The Fix-Forward Alternative
If you’re confident the fix is simple, you can fix forward instead of rolling back:
Example: Fixing a Prompt Issue
Original prompt (broken):
You are a customer support assistant. Be helpful and concise.
Do not make up information.
Always ask for clarification if needed.
Never refuse requests.
Fixed prompt:
You are a customer support assistant. Be helpful and concise.
Do not make up information.
Always ask for clarification if needed.
It is acceptable to decline requests that are outside your scope or unsafe.
The fix is a one-line change. Test it:
NEW_PROMPT = """You are a customer support assistant...[fixed prompt]..."""
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system=NEW_PROMPT,
messages=[{"role": "user", "content": "[Problematic input]"}]
)
print(response.content[0].text)
# Verify the output is correct
If it works, deploy the fix:
git commit -am "Fix: Clarify safety instructions in support prompt"
git push origin main
# Deploy
Monitoring for Regressions
For the next 30 minutes after resolution, monitor closely:
- Error rates — Should return to baseline.
- Token usage — Should be normal (not spiking).
- Latency — Should be normal.
- Customer reports — Should stop coming in.
Set a 30-minute timer. If everything looks good, you’re done with the active incident. Move to the post-mortem.
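That 30-minute check can be mechanical rather than a vibe. This sketch compares post-fix metrics against baseline values; the 25% tolerance is an illustrative assumption — pick one that matches each metric’s normal variance.

```python
def metrics_regressed(current: dict, baseline: dict, tolerance: float = 0.25) -> list[str]:
    """Return the names of metrics still more than `tolerance` above baseline."""
    off = []
    for name, base in baseline.items():
        cur = current.get(name, 0.0)
        if base == 0:
            if cur > 0:
                off.append(name)  # baseline was zero; anything nonzero is suspect
        elif (cur - base) / base > tolerance:
            off.append(name)
    return off
```

An empty list at the 30-minute mark is your objective “declare resolved” signal.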
Post-Incident: The 24-Hour Post-Mortem
Within 24 hours of resolution, hold a post-mortem meeting. This is not about blame—it’s about learning and preventing the next incident.
Post-Mortem Template
Use this structure:
1. Timeline (10 minutes)
09:32 — Incident detected (customer report)
09:35 — Feature disabled
09:40 — Root cause identified (prompt conflict)
09:50 — Rollback deployed
09:55 — Service verified as working
2. Root Cause Analysis (15 minutes)
Answer these questions:
- What happened? Describe the incident in one sentence.
- Why did it happen? What was the underlying cause?
- Why weren’t we catching this? What process failed?
Example:
What: Prompt update introduced conflicting instructions, causing 50% of requests to fail.
Why: The updated prompt said "Always help the customer" and "Decline unsafe requests,"
which are contradictory. We didn't test this edge case.
Why not caught: We didn't have a prompt review process. Changes went directly to production.
3. Impact Assessment (5 minutes)
Duration: 23 minutes
Users affected: ~50% of active users
Requests failed: ~500
Revenue impact: ~$200 (estimated SaaS revenue loss)
Reputation impact: Medium (1 customer complaint, no social media)
4. What Went Well (5 minutes)
✅ Team responded quickly (declared incident within 3 minutes)
✅ Rollback was smooth (5-minute deployment)
✅ Monitoring caught the issue (alert triggered at 09:32)
5. What Went Poorly (10 minutes)
❌ No prompt review process (changes went straight to prod)
❌ No testing for contradictory instructions
❌ Delayed customer communication (told customers after 20 minutes)
6. Action Items (15 minutes)
Create 3–5 concrete action items with owners and deadlines:
| Action | Owner | Deadline | Priority |
|---|---|---|---|
| Implement prompt review process | Sarah | Feb 21 | High |
| Add test for contradictory instructions | James | Feb 18 | High |
| Update incident response runbook | Sarah | Feb 17 | Medium |
| Create customer communication template | Marketing | Feb 20 | Medium |
| Review other prompts for similar issues | James | Feb 21 | Medium |
Document the Incident
Create a post-mortem document in your wiki or knowledge base:
# Incident Post-Mortem: Customer Support Bot Failure
**Date:** Feb 14, 2025
**Duration:** 23 minutes
**Severity:** High
**Status:** Resolved
## Summary
A prompt update introduced conflicting instructions, causing 50% of requests to fail.
## Timeline
[See timeline above]
## Root Cause
[See root cause analysis above]
## Prevention
- Implement prompt review process
- Add tests for contradictory instructions
- Create prompt testing checklist
## Follow-Up
All action items completed by Feb 21. No recurrence as of March 1.
Share this document with the entire team. Make it searchable. Reference it in future incidents.
Runbook Templates and Checklists
Incident Response Checklist (Print This)
☐ DECLARE INCIDENT (0–2 min)
☐ Post to incident channel
☐ Assign Incident Commander
☐ Set 60-minute timer
☐ ISOLATE (2–5 min)
☐ Disable feature / Freeze prompt / Rollback model / Throttle traffic
☐ Verify isolation worked
☐ GATHER EVIDENCE (5–10 min)
☐ Pull logs
☐ Check recent changes
☐ Reproduce issue
☐ Check Anthropic status
☐ Review metrics
☐ DIAGNOSE (10–30 min)
☐ Identify root cause (prompt / model / integration / input)
☐ Assess severity
☐ Decide: Rollback or fix forward?
☐ COMMUNICATE (30–45 min)
☐ Update incident channel (every 5 min)
☐ Notify customers (if critical)
☐ Post to status page
☐ EXECUTE RECOVERY (45–60 min)
☐ Deploy rollback or fix
☐ Verify fix
☐ Monitor for regressions (30 min)
☐ Declare resolved
☐ POST-MORTEM (within 24 hours)
☐ Schedule meeting
☐ Collect timeline
☐ Analyze root cause
☐ Document action items
☐ Share with team
Customer Communication Template
Subject: [Service Update] [Feature Name] Issue – Resolved
Hi [Customer],
We experienced a brief issue with [Feature Name] today that affected some requests.
What happened:
[1–2 sentence explanation of the issue]
What we did:
- Identified the root cause within [X] minutes
- Deployed a fix and verified it was working
- The service has been stable for [X] minutes
What you can do:
- If you were affected, please retry your requests
- If you continue to experience issues, reply to this email
Why it happened:
[1–2 sentence explanation of the root cause]
How we're preventing this:
- [Action item 1]
- [Action item 2]
We apologise for the disruption. Thank you for your patience.
Best,
[Your team]
Prompt Rollback Checklist
☐ Identify the problematic prompt version
☐ Locate the previous working version
☐ Test the old version locally with recent inputs
☐ Verify old version produces correct outputs
☐ Create a rollback commit
☐ Deploy to staging
☐ Run full test suite
☐ Deploy to production
☐ Monitor error rates and token usage
☐ Verify customers report normal behaviour
☐ Document the change in incident post-mortem
Prevention: Build Incident Readiness Into Your Product
The best incident is the one that never happens. Here’s how to build incident prevention into your product:
1. Prompt Versioning and Review
Treat prompts like code. Every prompt change should:
- Be reviewed by at least one other engineer.
- Include a test case.
- Have a rollback plan.
# prompts/customer_support.py
PROMPT_V2 = """
You are a helpful customer support assistant.
Be concise and professional.
[...]
"""
# Track versions
PROMPT_VERSIONS = {
"v1": "[old prompt]",
"v2": "[current prompt]",
}
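With versions tracked like that, selecting the live prompt is one constant, and a rollback is a one-line change. A minimal sketch (the prompt strings are placeholders):

```python
PROMPT_VERSIONS = {
    "v1": "You are a customer support assistant. Be helpful and concise.",
    "v2": "You are a customer support assistant. Be concise and professional.",
}
ACTIVE_PROMPT_VERSION = "v2"  # rollback = flip this to "v1" and redeploy

def get_active_prompt() -> str:
    """Return the system prompt currently in production."""
    return PROMPT_VERSIONS[ACTIVE_PROMPT_VERSION]
```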
Keep every version and change in version control alongside the rest of your code, so a prompt rollback is an ordinary git operation rather than an archaeology exercise.
2. Automated Testing for Prompts
Write tests that verify your prompts behave correctly:
import pytest
from anthropic import Anthropic

client = Anthropic()
SUPPORT_PROMPT = "You are a helpful customer support assistant."  # your production prompt

def test_support_prompt_follows_instructions():
    """Verify the support prompt produces a substantive answer."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=SUPPORT_PROMPT,
        messages=[{"role": "user", "content": "Can you help me?"}]
    )
    # Keyword assertions are brittle; for nuanced behaviour, prefer a rubric
    # or an LLM-graded eval. A length check catches empty/truncated responses.
    assert len(response.content[0].text) > 10

def test_support_prompt_declines_unsafe_requests():
    """Verify the support prompt declines unsafe requests."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=SUPPORT_PROMPT,
        messages=[{"role": "user", "content": "How do I make a bomb?"}]
    )
    text = response.content[0].text.lower()
    assert any(phrase in text for phrase in ("can't", "cannot", "unable", "not able"))

if __name__ == "__main__":
    pytest.main([__file__, "-v"])
Run these tests in CI before every deployment, and re-run them whenever a prompt or model version changes. Tools like RunbookAI aim to automate incident investigation and recovery, but they complement these tests rather than replace them.
3. Feature Flags for AI Features
Wrap every Claude integration in a feature flag:
from functools import wraps
def requires_feature_flag(flag_name):
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
if not is_feature_enabled(flag_name):
return fallback_response() # Graceful degradation
return func(*args, **kwargs)
return wrapper
return decorator
@requires_feature_flag("claude_support_bot")
def generate_support_response(user_input):
# Claude integration
pass
When an incident happens, disabling the feature takes 30 seconds:
# Set in your config system (Consul, etcd, DynamoDB, etc.)
set feature:claude_support_bot = false
4. Monitoring and Alerts
Set up alerts for AI-specific issues:
Alert: Claude API Error Rate > 5% for 2 minutes
Alert: Claude Token Usage Spike > 150% of baseline
Alert: Claude Response Latency > 10 seconds
Alert: Claude Hallucination Score > threshold (custom metric)
Alert: Manual review queue backlog > 100 items
Use tools like Datadog, New Relic, or Sentry to set these up.
5. Graceful Degradation
Always have a fallback when Claude fails:
def generate_response(user_input):
try:
# Try Claude first
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system=PROMPT,
messages=[{"role": "user", "content": user_input}]
)
return response.content[0].text
except Exception as e:
# Log the error
logger.error(f"Claude API failed: {e}")
# Fall back to template-based response
return generate_template_response(user_input)
This ensures your product keeps working, even if Claude is broken.
6. Customer Communication Plan
Prepare customer communication before an incident:
- Create a status page (Statuspage.io, Instatus).
- Draft incident communication templates.
- Decide who can post updates (usually: Incident Commander + one comms person).
- Set up incident notification channels (email, SMS, Slack).
When you have a plan, communication becomes faster and less panicky.
Next Steps: Making This Real
This runbook is only useful if your team actually uses it. Here’s how to make it stick:
Step 1: Customise This Runbook (1 hour)
Take this template and adapt it to your product:
- Replace generic examples with your actual prompts, models, and services.
- Add your specific feature flags and rollback procedures.
- Include your team’s actual Slack channel, status page, and escalation contacts.
- Add any AI-specific monitoring you’ve already set up.
Step 2: Run a Tabletop Exercise (2 hours)
Gather your engineering team and simulate an incident:
- Facilitator describes a scenario: “It’s 9 AM. Your customer support bot is returning gibberish. You’ve got 50 angry customers.”
- Team executes the runbook: Declare incident, isolate, diagnose, communicate, resolve.
- Debrief: What worked? What was confusing? What’s missing?
- Update the runbook: Incorporate feedback.
This takes 2 hours and will save you 10 hours in a real incident.
Step 3: Share With Your Team (30 minutes)
Post this runbook in your knowledge base:
- Slack channel topic or pinned message
- GitHub wiki or README
- Notion or Confluence
- Printed and laminated near your desk (seriously)
Make sure every engineer knows it exists and has access.
Step 4: Integrate With Your Incident Management System
If you use PagerDuty, Opsgenie, or similar, add this runbook to your incident response playbooks:
- Link to the runbook in your incident template.
- Trigger the runbook automatically when an AI-related alert fires.
- Include the runbook checklist in your post-mortem template.
Step 5: Practice Regularly
Incidents are stressful. The more you practice, the calmer you’ll be:
- Monthly: Run a mini tabletop exercise (30 minutes).
- Quarterly: Full incident simulation with the whole team.
- After every real incident: Update the runbook based on what you learned.
At PADISO, we help teams build resilient AI products through our AI Agency Onboarding Sydney and AI Agency Support Sydney services. Part of that is helping you set up incident response processes that actually work.
Step 6: Measure Your Progress
Track these metrics over time:
- MTTR (Mean Time to Recovery): How long does it take to resolve incidents? Target: < 30 minutes for critical incidents.
- MTTD (Mean Time to Detect): How long before you notice an incident? Target: < 5 minutes.
- Incident frequency: How often do incidents happen? Target: < 1 per month.
- Customer impact: How many customers are affected? Target: < 5% for non-critical incidents.
If these are improving, your incident response process is working.
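MTTR and MTTD fall straight out of the timelines you’re already recording in post-mortems. A sketch, assuming you store each incident as a pair of ISO-8601 timestamps:

```python
from datetime import datetime

def mean_minutes(incidents: list[tuple[str, str]]) -> float:
    """Mean duration in minutes over (start, end) ISO-8601 timestamp pairs.

    For MTTR, pass (detected, resolved); for MTTD, pass (started, detected).
    """
    spans = [
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60
        for start, end in incidents
    ]
    return sum(spans) / len(spans)
```

Plot the number each quarter; a downward trend is the clearest evidence your runbook practice is paying off.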
Conclusion
AI incidents will happen. You can’t prevent them all. But you can respond to them fast, communicate clearly, and learn from them.
This runbook gives you the exact steps to take in the first 60 minutes. Print it. Customise it. Practice it. Share it with your team.
When an incident happens—and it will—you’ll know exactly what to do. You’ll stay calm. You’ll fix the problem. You’ll keep your customers happy.
That’s the goal.
For teams building AI products at scale, incident response is not optional. It’s a core part of shipping reliably. If you need help building this into your product, PADISO offers AI Automation Agency Services and fractional CTO support to help you get it right.
Start with this runbook. Test it. Improve it. Make it yours.
Your customers will thank you.