APRA CPS 230 + Claude Agents: A Compliance Mapping
Map APRA CPS 230 operational risk requirements to Claude agent deployments. Auditor expectations, evidence frameworks, and runbooks for AU insurers and banks.
Table of Contents
- What Is APRA CPS 230 and Why It Matters for AI
- The Operational Risk Landscape for Claude Agent Deployments
- Third-Party Risk and Service Provider Oversight
- Critical Operations Identification and Recovery Objectives
- Claude Agents as Operational Risk: Evidence Frameworks
- Building Compliant Runbooks for Agent Deployments
- Incident Management and Escalation Protocols
- Testing, Monitoring, and Audit Readiness
- Common Pitfalls and How to Avoid Them
- Next Steps: Getting Audit-Ready
What Is APRA CPS 230 and Why It Matters for AI
APRA CPS 230 (Prudential Standard CPS 230 Operational Risk Management) is the Australian Prudential Regulation Authority’s framework for managing operational risks in authorised deposit-taking institutions (ADIs), insurance companies, and other regulated entities. Effective from 1 July 2025, CPS 230 requires financial institutions to identify, measure, monitor, and control operational risks—including those arising from third-party service providers, technology failures, and system dependencies.
When you deploy Claude agents into your operations—whether for claims processing, risk assessment, customer service automation, or workflow orchestration—you’re introducing a new operational risk vector. The agent becomes a material service provider in APRA’s eyes. Your auditors will want to see evidence that you understand this risk, have mapped dependencies, documented recovery procedures, and can demonstrate resilience during failures.
The official APRA CPS 230 Prudential Standard sets out explicit requirements for operational risk governance, business continuity, and third-party oversight. The CPS 230 Prudential Handbook provides detailed guidance on what auditors expect to see in your evidence base. For Australian financial institutions deploying AI agents, this is not optional compliance theatre—it’s a material audit requirement.
Claude agents introduce three distinct operational risk categories:
Model Risk: Claude’s outputs are probabilistic, not deterministic. An agent that miscalculates a customer’s insurance premium or misclassifies a fraud signal is an operational failure, not a software bug.
Third-Party Dependency: Anthropic (Claude’s provider) becomes a critical service provider. Your institution is now dependent on their uptime, security posture, and API stability.
Data and Governance Risk: Agents process sensitive customer and transaction data. Loss of control over that data, or misuse by an agent, is an operational incident.
APRA’s revised CPS 230 framework explicitly addresses third-party risks. The Gatekeeper compliance guide highlights mandatory clauses for material service provider contracts, including exit strategies, data protection, and incident notification. The UpGuard summary emphasises business continuity, incident management, and dependency mapping. For Claude agent deployments, this means you need contractual clarity with Anthropic, fallback procedures when the API is unavailable, and documented recovery time objectives (RTOs) and recovery point objectives (RPOs).
The Operational Risk Landscape for Claude Agent Deployments
Operational risk, under CPS 230, is defined as the risk of loss resulting from inadequate or failed internal processes, people, systems, or external events. For Claude agents, this breaks down into six key risk domains:
Technology and System Risk
Claude agents depend on API availability, latency, and model consistency. If the Anthropic API goes down, your agent stops processing. If the model’s output distribution shifts (due to retraining), your agent’s behaviour may drift. If latency spikes, your workflow automation stalls.
APRA expects you to map these dependencies explicitly. The Dynatrace blog on CPS 230 notes that the revised standard requires institutions to identify critical operations and define recovery objectives. For a claims processing agent, your critical operation is “process and approve claims within 24 hours.” Your recovery objective is “if the Claude API is unavailable, revert to manual claims triage and escalate within 4 hours.” You need runbooks that spell this out.
Model Risk and Output Validation
Claude agents are not perfect. They hallucinate, misinterpret context, and occasionally produce incorrect outputs. In financial services, an incorrect output is an operational loss. A fraud detection agent that misses a $500k transaction is a direct loss. A loan origination agent that approves a high-risk applicant is a credit loss.
APRA’s framework requires you to validate agent outputs before they flow into critical systems. This means implementing guardrails: rule-based checks that catch obviously wrong outputs, human-in-the-loop approval for high-stakes decisions, and continuous monitoring of agent accuracy against a ground truth dataset.
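As a concrete sketch of that guardrail layer, the checks might look like this — the function names, fields, and the $10,000 auto-approval limit are illustrative assumptions, not values prescribed by CPS 230:

```python
from dataclasses import dataclass

@dataclass
class AgentDecision:
    claim_id: str
    approved: bool
    amount: float

# Assumed policy limit; a real deployment would source this from the policy engine.
MAX_AUTO_APPROVAL = 10_000.00

def validate_decision(decision: AgentDecision, known_claims: set) -> str:
    """Return 'accept', 'manual_review', or 'reject' for an agent output."""
    if decision.claim_id not in known_claims:
        return "reject"          # agent referenced a claim that does not exist
    if decision.amount < 0:
        return "reject"          # obviously malformed output
    if decision.approved and decision.amount > MAX_AUTO_APPROVAL:
        return "manual_review"   # high-stakes: require human-in-the-loop sign-off
    return "accept"
```

The point of the sketch is that the agent's output never flows straight into a core system: it passes a deterministic gate first, and anything the gate cannot accept lands in a human queue.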
Data Governance and Privacy Risk
Claude agents process sensitive data: customer names, account numbers, transaction histories, health information (for insurance). If an agent leaks this data in its outputs, or if Anthropic’s systems are compromised, you have a data breach. APRA’s CPS 230 framework explicitly requires controls over third-party data handling.
You need to document what data flows to Claude, why, and what controls prevent misuse. If you’re processing customer PII, you need to understand Anthropic’s data retention policies, their security posture (do they have SOC 2 Type II certification?), and contractual commitments around data deletion. For Australian institutions, you may also need to consider the Privacy Act 1988 (Cth) and whether Claude’s processing constitutes an overseas disclosure of personal information.
Integration and Workflow Risk
Claude agents rarely operate in isolation. They integrate with your core systems: claims management platforms, customer relationship management (CRM) tools, fraud detection engines, and data warehouses. If the agent writes incorrect data to your core system, or if the integration fails silently, you have an operational incident.
APRA expects you to test these integrations, document failure modes, and implement monitoring. An agent that writes a claim decision to your system should include audit logs showing who authorised the decision, what data the agent processed, and what rules it applied.
Vendor Lock-In and Exit Risk
Once you’ve built agents on Claude, you’re dependent on Anthropic. If Anthropic changes its API pricing, deprecates a model, or exits the market, you need a plan. APRA’s CPS 230 framework requires you to assess exit risk for material service providers and have contingency plans.
This doesn’t mean you need a complete replacement agent on a different LLM (though that’s an option). It means documenting the cost and timeline to migrate, understanding what data you’d need to export, and having a decision tree: “If Anthropic raises prices by 50%, we’ll [migrate to Claude via AWS Bedrock / rebuild on GPT-4 or an open-source model / revert to manual processes].” Your auditors want to see that you’ve thought this through.
Change and Configuration Risk
Claude agents are software. Software changes. You’ll update prompts, adjust guardrails, integrate new data sources, and refine workflows. Each change is an operational risk. If you push a new agent prompt to production without testing, and it starts approving fraudulent claims, you have an incident.
APRA expects you to implement change management: version control for agent prompts, testing in staging environments, approval workflows for production changes, and rollback procedures. You need to track what changed, when, and who approved it.
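A minimal change-management record for prompt versions might look like the following — the schema and the `can_deploy` gate are illustrative assumptions, not an Anthropic or APRA artefact:

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PromptChange:
    """One auditable change to a production agent prompt (illustrative schema)."""
    prompt_text: str
    author: str
    approver: str
    tested_in_staging: bool
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def version_hash(self) -> str:
        # A content hash lets you prove exactly which prompt ran in production.
        return hashlib.sha256(self.prompt_text.encode()).hexdigest()[:12]

def can_deploy(change: PromptChange) -> bool:
    """Gate production deployment on staging tests and a second-person approval."""
    return change.tested_in_staging and change.approver != change.author
```

Storing records like this in version control gives you exactly what auditors ask for: what changed, when, and who approved it.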
Third-Party Risk and Service Provider Oversight
Under CPS 230, Anthropic is a material service provider. This is not negotiable. Your auditors will want to see evidence of oversight.
Contractual Requirements
Your service agreement with Anthropic (or AWS if you’re using Bedrock) should include:
Service Level Agreements (SLAs): Define uptime commitments, response times, and incident notification procedures. Anthropic’s standard terms may not be sufficient for regulated financial services. You may need to negotiate a custom agreement that includes APRA-specific requirements.
Data Protection and Privacy: Explicit commitments around data retention, encryption, access controls, and compliance with Australian privacy laws. If Anthropic processes Australian customer data, they may be a “service provider” under the Privacy Act.
Security and Audit Rights: Rights to audit Anthropic’s security controls, request SOC 2 or ISO 27001 certifications, and be notified of security incidents. The SAI360 compliance overview emphasises that APRA expects institutions to verify third-party security postures.
Business Continuity and Disaster Recovery: Commitments around redundancy, failover, and recovery time objectives. What happens if Anthropic’s primary data centre fails? How quickly will the API recover?
Exit and Transition: Clear procedures for exiting the relationship, including data export, transition support, and notice periods. The Megaport preparation guide highlights dependency mapping as a critical CPS 230 requirement—knowing how to exit a vendor relationship is part of that.
Ongoing Monitoring and Compliance
Once you’ve signed the agreement, you need to monitor Anthropic’s performance:
Uptime and Availability: Track API availability against the SLA. If uptime falls below the agreed threshold, document the breach and follow your escalation procedures.
Security Posture: Request annual SOC 2 Type II reports. If Anthropic suffers a security incident, ensure they notify you within the contractual timeframe (typically 24-72 hours).
Regulatory Changes: Monitor for changes to Anthropic’s terms of service, data handling policies, or geographic restrictions. If Anthropic announces it’s exiting Australia or changing its data residency, you need to know immediately.
Cost and Pricing: Track API costs against your budget. If Anthropic raises prices, assess the impact on your business case and decide whether to continue, migrate, or scale back.
APRA expects you to document this monitoring in your operational risk register. You should have a quarterly review of third-party risks, with a documented assessment of whether Anthropic remains a suitable service provider.
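The uptime check behind that review can be sketched in a few lines — the 99.9% figure is an assumed example SLA, not Anthropic's actual commitment:

```python
def sla_breached(minutes_down: float, sla_uptime_pct: float = 99.9,
                 period_minutes: float = 30 * 24 * 60) -> bool:
    """Compare measured monthly downtime against an assumed SLA commitment.

    99.9% uptime over a 30-day month allows roughly 43 minutes of downtime.
    """
    allowed_downtime = period_minutes * (1 - sla_uptime_pct / 100)
    return minutes_down > allowed_downtime
```

If this check trips, the runbook step is to document the breach and follow the escalation procedure in your service agreement.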
Critical Operations Identification and Recovery Objectives
APRA CPS 230 requires you to identify which operations are “critical” and define recovery objectives for each. For Claude agent deployments, this is where theory meets practice.
Defining Critical Operations
A critical operation is one where failure would materially impact your institution’s financial condition, reputation, or ability to serve customers. For a claims processing agent:
Critical: Processing and approving insurance claims. If the agent stops working, claims pile up, customers get angry, and you face regulatory scrutiny.
Non-Critical: Generating summary reports of claims trends. If the agent stops working, you lose visibility into trends, but claims still get processed (just manually).
For a fraud detection agent:
Critical: Flagging potentially fraudulent transactions in real-time. If the agent fails, fraudulent transactions slip through, and you face financial losses and regulatory action.
Non-Critical: Generating post-hoc fraud reports for compliance teams. If the agent fails, compliance teams have less visibility, but fraud prevention still occurs through other controls.
APRA expects you to document this classification. You need a register that lists each agent, identifies which operations it supports, and classifies those operations as critical or non-critical. The Empowered Systems comprehensive guide emphasises that critical operations identification is the foundation of CPS 230 compliance.
Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs)
Once you’ve identified critical operations, you need to define how quickly you can recover if the agent fails.
Recovery Time Objective (RTO): The maximum acceptable downtime. For a claims processing agent, your RTO might be 4 hours. If the agent is down longer than 4 hours, you escalate to manual claims processing.
Recovery Point Objective (RPO): The maximum acceptable data loss. For a fraud detection agent, your RPO might be zero—you can’t afford to lose any transaction data. For a reporting agent, your RPO might be 24 hours—you’re willing to lose one day’s worth of reports.
These objectives should be based on business impact, not technical convenience. If your claims processing agent has an RTO of 24 hours, your auditors will ask: “Why can’t you recover in 4 hours? What’s the business impact of a 24-hour outage?” You need a documented answer.
For Claude agent deployments, your RTOs and RPOs should account for:
API Availability: Anthropic’s API is generally highly available, but 100% uptime is not guaranteed. If you need 99.99% uptime (about 52 minutes of downtime per year), you need a fallback plan that doesn’t depend on Claude.
Model Changes: Anthropic occasionally updates its models. If a model update changes the agent’s behaviour, you may need to retrain or adjust prompts. This takes time.
Data Freshness: If your agent relies on real-time data from your core systems, and that data becomes stale, the agent’s outputs may become unreliable. You need to define acceptable staleness (e.g., “data must be no more than 1 hour old”).
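The staleness rule from the example above can be sketched as a simple pre-check before invoking the agent — the 1-hour threshold is the assumed policy from the text, not a fixed requirement:

```python
from datetime import datetime, timedelta, timezone

# Assumed policy from the example: data must be no more than 1 hour old.
MAX_STALENESS = timedelta(hours=1)

def data_is_fresh(last_updated: datetime, now: datetime = None) -> bool:
    """Refuse to feed stale upstream data to the agent."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_updated) <= MAX_STALENESS
```

A failed freshness check is a data-quality incident, handled by the data quality degradation runbook rather than by letting the agent produce unreliable outputs.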
Claude Agents as Operational Risk: Evidence Frameworks
APRA auditors don’t ask for theoretical compliance. They ask for evidence. “Show me your operational risk register. Show me your testing results. Show me your incident log.” Here’s what you need to prepare.
Operational Risk Register
Your operational risk register should include a row for each Claude agent deployment:
| Agent | Operation | Critical? | RTO | RPO | Risk Level | Mitigation |
|---|---|---|---|---|---|---|
| Claims Processor | Claims approval | Yes | 4h | 0 | High | Fallback to manual; API SLA; monitoring |
| Fraud Detector | Transaction flagging | Yes | 1h | 0 | High | Redundant detection rules; API SLA; alerts |
| Customer Service Bot | Inquiry resolution | No | 8h | 24h | Medium | Queue to human agents; escalation runbook |
For each agent, you need to document:
Risk Description: What could go wrong? (API outage, model drift, data quality issues, integration failures, security breaches)
Risk Likelihood: How often do you expect this to happen? (Daily, monthly, yearly, once per 10 years)
Risk Impact: What’s the financial or reputational impact? (Dollar amount, customer impact, regulatory consequences)
Current Controls: What are you doing to prevent or mitigate this risk? (Monitoring, guardrails, testing, fallback procedures)
Residual Risk: After controls, what’s the remaining risk? (Acceptable, monitor, escalate)
APRA expects you to review this register quarterly and update it based on incidents, changes to the agent, or changes to your business.
Testing and Validation Evidence
Your auditors will ask: “How do you know this agent works correctly?” You need a testing strategy that covers:
Functional Testing: Does the agent do what it’s supposed to do? For a claims processor, does it correctly extract claim details, apply business rules, and generate approval decisions? You should have a test suite with 100+ test cases covering normal scenarios, edge cases, and error conditions.
Accuracy Testing: How often does the agent produce correct outputs? You should measure accuracy against a ground truth dataset (claims that were manually processed and verified). For a fraud detector, you might measure precision (of the transactions the agent flags as fraudulent, how many actually are?) and recall (of all fraudulent transactions, how many does the agent catch?).
Robustness Testing: How does the agent behave when given unusual or malicious inputs? For a claims processor, what happens if you give it a claim with missing data, or with data in an unexpected format? You should document expected behaviour and verify the agent handles edge cases gracefully.
Integration Testing: Does the agent correctly integrate with your core systems? If the agent writes a claim decision to your system, does it appear correctly? Are audit logs generated? You should have test cases that verify end-to-end workflows.
Regression Testing: When you update the agent (new prompt, new guardrails, new data sources), do you still get correct outputs? You should have a regression test suite that runs automatically before each production deployment.
APRA expects to see evidence of this testing: test plans, test results, test coverage metrics, and sign-offs from your testing team. If you’ve deployed an agent to production without testing, your auditors will flag this as a control gap.
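A regression harness of the kind described above can be sketched like this — `call_agent` is a stub standing in for the real Claude invocation, and the golden cases are invented for illustration:

```python
# Golden dataset: (input, expected_decision) pairs drawn from manually
# verified historical claims. These examples are invented.
GOLDEN_CASES = [
    ({"claim_id": "CLM-1", "amount": 200.0}, "approve"),
    ({"claim_id": "CLM-2", "amount": 95_000.0}, "refer"),
]

def call_agent(claim: dict) -> str:
    """Stand-in for the production agent call (hypothetical rule for illustration)."""
    return "approve" if claim["amount"] < 10_000 else "refer"

def run_regression(cases=GOLDEN_CASES) -> dict:
    """Replay the golden dataset and report any divergence from approved outputs."""
    failures = [case for case, expected in cases if call_agent(case) != expected]
    return {"total": len(cases), "failed": len(failures), "failures": failures}
```

Wiring a harness like this into your deployment pipeline, and blocking release when `failed > 0`, is the evidence trail auditors look for.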
Monitoring and Alerting Evidence
Once the agent is in production, you need to monitor its behaviour continuously. Your evidence base should include:
Performance Metrics: API latency, error rates, request volume. You should have dashboards showing these metrics in real-time, with alerts if they deviate from baselines.
Output Quality Metrics: Accuracy, precision, recall. You should have a process for continuously sampling agent outputs and comparing them against ground truth. If accuracy drops below a threshold (e.g., 95%), you should alert your operations team.
Dependency Metrics: Anthropic API uptime, latency, error rates. You should track these separately from your own system metrics, so you can distinguish between your own failures and third-party failures.
Audit Logs: Every agent invocation should be logged: timestamp, input data, output, user who triggered it, approval status. You should retain these logs for at least 7 years (or per your regulatory requirements).
APRA expects to see evidence of this monitoring: dashboards, alert configurations, log retention policies, and incident reports showing how you’ve responded to alerts.
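The audit-log requirement above might be implemented as structured JSON log lines — the field names here are illustrative, not a mandated schema:

```python
import json
from datetime import datetime, timezone

def audit_record(agent: str, input_data: dict, output: str,
                 triggered_by: str, approval_status: str) -> str:
    """Serialise one agent invocation as a JSON log line (illustrative fields)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "input": input_data,
        "output": output,
        "triggered_by": triggered_by,
        "approval_status": approval_status,
    }
    return json.dumps(record, sort_keys=True)
```

One line per invocation, shipped to write-once storage with your retention policy applied, gives you a replayable record of every decision the agent made.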
Building Compliant Runbooks for Agent Deployments
A runbook is a step-by-step procedure for responding to a specific situation. For Claude agent deployments, you need runbooks covering:
Normal Operations Runbook
Objective: Deploy and operate a Claude agent in production.
Prerequisites:
- Agent has passed functional, accuracy, robustness, and integration testing
- Agent has been approved by risk and compliance teams
- Monitoring and alerting are configured
- Fallback procedures are documented
- Team members have been trained
Steps:
- Pre-Deployment Checklist: Verify that all prerequisites are met. Check that the agent’s prompt, guardrails, and integration logic match what was tested and approved. Confirm that monitoring dashboards are active and alerting is enabled.
- Canary Deployment: Deploy the agent to a small subset of users or transactions (e.g., 5% of claims). Monitor accuracy and error rates for 24 hours. If accuracy is within acceptable bounds, proceed to full deployment.
- Full Deployment: Roll out the agent to 100% of traffic. Continue monitoring accuracy and error rates.
- Ongoing Monitoring: Review performance metrics daily for the first week, then weekly thereafter. If accuracy drops below threshold, investigate the root cause and decide whether to roll back, adjust the agent, or escalate to manual processing.
- Quarterly Review: Review the agent’s performance, cost, and business impact. Update the operational risk register. Assess whether the agent is still meeting its objectives.
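The canary split in the deployment step is easiest to make deterministic by hashing a stable identifier, so the same claim always routes to the same variant — a sketch, with the function name ours:

```python
import hashlib

def route_to_canary(entity_id: str, canary_percent: int) -> bool:
    """Deterministically assign a stable slice of traffic to the canary agent.

    Hashing the entity ID (rather than random sampling) keeps routing
    reproducible, which matters when you need to audit which variant
    processed a given claim.
    """
    bucket = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_percent
```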
Degraded Performance Runbook
Trigger: Agent accuracy drops below 95%, or error rate exceeds 1%.
Steps:
- Alert and Notification: Operations team receives alert. Alert is escalated to the agent owner (product manager) and risk team.
- Root Cause Investigation: Is the issue with the agent (e.g., prompt drift, model update), the data (e.g., data quality degradation), or the integration (e.g., API latency)? Check monitoring dashboards, recent changes, and recent incidents.
- Mitigation Decision: Based on root cause, decide on next steps:
  - If the issue is temporary (e.g., API latency spike), wait and monitor.
  - If the issue is persistent, roll back to a previous agent version or prompt.
  - If rollback isn’t possible, escalate to manual processing.
- Communication: Notify stakeholders (compliance, risk, business teams) of the issue and the mitigation. If manual escalation is required, notify operations teams that they need to handle the workload manually.
- Post-Incident Review: Once the issue is resolved, conduct a post-incident review. Document what went wrong, why it wasn’t caught by monitoring, and what changes you’ll make to prevent recurrence.
API Outage Runbook
Trigger: Anthropic API becomes unavailable or returns errors for >5% of requests.
Steps:
- Detect and Notify: Monitoring system detects API errors. Alert is sent to operations team. Team checks Anthropic’s status page and external monitoring services to confirm the outage.
- Activate Fallback: Depending on the agent’s role:
  - Claims Processing: Switch to manual claims triage. Operations team reviews all incoming claims and makes approval decisions manually. Escalate complex claims to senior claims managers.
  - Fraud Detection: Activate rule-based fraud detection (if available). Flag all transactions above a certain threshold for manual review. Accept higher false-positive rates temporarily.
  - Customer Service: Queue all incoming inquiries. Notify customers of delays. Escalate urgent inquiries to human agents.
- Estimate Recovery Time: Check Anthropic’s status page and incident communications. Estimate when the API will recover. Communicate this to stakeholders.
- Monitor Recovery: Once Anthropic indicates the API is recovering, gradually shift traffic back to the agent. Start with 10% of traffic, monitor for errors, and increase gradually.
- Post-Outage Analysis: Once fully recovered, document the outage: duration, impact, root cause (if Anthropic disclosed it), and lessons learned. Update your RTO and RPO if necessary.
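The fallback activation step can be sketched as a wrapper around the agent call — `call_agent` is a stand-in for the real Claude invocation, and a production version would distinguish more error types and record the incident:

```python
def process_with_fallback(claim: dict, call_agent, manual_queue: list,
                          max_retries: int = 2) -> str:
    """Try the agent; on repeated API failure, queue the claim for manual triage.

    `call_agent` stands in for the real API call and is assumed to raise
    ConnectionError on outage. Returns the agent's decision, or
    'queued_for_manual' once the fallback is activated.
    """
    for _ in range(max_retries + 1):
        try:
            return call_agent(claim)
        except ConnectionError:
            continue  # transient failure: retry within budget
    manual_queue.append(claim)  # RTO clock is now running: escalate per runbook
    return "queued_for_manual"
```

The design choice worth noting: the wrapper never drops a claim on the floor. Every input either produces a decision or lands in a queue a human will see.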
Data Quality Degradation Runbook
Trigger: Agent receives data from upstream systems that is incomplete, inaccurate, or stale.
Steps:
- Detect: Agent outputs are flagged as incorrect by downstream validation or by auditors. Investigate whether the issue is with the agent or with the input data.
- Isolate: If the issue is with input data, work with the upstream system owner to identify the root cause. Is the data extraction process broken? Is the data source unavailable?
- Mitigate: Depending on severity:
  - If the issue affects <5% of transactions, flag those transactions for manual review and allow the agent to continue processing others.
  - If the issue affects >5% of transactions, stop the agent and escalate to manual processing until the data issue is resolved.
- Communicate: Notify stakeholders of the data quality issue and the mitigation. If manual escalation is required, notify operations teams.
- Resolve: Work with the upstream system owner to fix the data quality issue. Test the fix with the agent. Resume agent processing once the issue is resolved.
These runbooks should be documented in a wiki, shared drive, or runbook management system. They should be reviewed annually and updated whenever your agent’s configuration or dependencies change. Your auditors will want to see evidence that your team has actually used these runbooks—incident reports that reference them, post-incident reviews that mention them.
Incident Management and Escalation Protocols
When something goes wrong with your Claude agent, you need a clear process for handling it. APRA expects to see evidence of incident management.
Incident Classification
Not all incidents are equal. You need a classification system:
Severity 1 (Critical): Agent failure affecting critical operations. Example: Claims processing agent is down for >1 hour. Impact: Claims are not being processed, customers are not being served, revenue is at risk.
Severity 2 (High): Agent failure affecting important operations, or degraded performance affecting critical operations. Example: Claims processing agent accuracy drops to 90%. Impact: Some claims are being incorrectly approved or rejected, but the agent is still processing claims.
Severity 3 (Medium): Agent degradation affecting non-critical operations. Example: Customer service agent response time increases to 30 seconds. Impact: Customer experience is slightly degraded, but service is still available.
Severity 4 (Low): Minor issues with no customer impact. Example: Agent logs are growing faster than expected. Impact: No immediate impact, but should be addressed before logs fill up storage.
Escalation Procedures
For each severity level, you need to define who gets notified and how quickly they need to respond:
Severity 1: Notify operations lead immediately (phone call, not email). If not resolved within 30 minutes, escalate to manager. If not resolved within 1 hour, escalate to VP. Target resolution time: 4 hours (or your RTO, whichever is shorter).
Severity 2: Notify operations lead within 15 minutes (email + Slack). Target resolution time: 8 hours.
Severity 3: Notify operations lead within 1 hour (email). Target resolution time: 24 hours.
Severity 4: Log in incident tracking system. Review in weekly operations meeting. Target resolution time: 1 week.
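The escalation matrix above can be expressed as data, so alerting code applies it consistently — a simplified sketch of the rules in this section, not a complete classifier:

```python
# Escalation matrix from the text, keyed by severity level.
# Severity 4's "1 week" target is expressed as 168 hours.
ESCALATION = {
    1: {"channel": "phone",       "notify_within_min": 0,    "target_resolution_h": 4},
    2: {"channel": "email+slack", "notify_within_min": 15,   "target_resolution_h": 8},
    3: {"channel": "email",       "notify_within_min": 60,   "target_resolution_h": 24},
    4: {"channel": "ticket",      "notify_within_min": None, "target_resolution_h": 168},
}

def classify_incident(critical_op_down: bool, accuracy: float = None) -> int:
    """Map incident symptoms to a severity level (simplified from the narrative)."""
    if critical_op_down:
        return 1   # critical operation is not running
    if accuracy is not None and accuracy < 0.95:
        return 2   # degraded performance on a critical operation
    return 3       # degradation without critical impact
```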
Incident Documentation
Every incident should be documented:
Incident ID: Unique identifier (e.g., INC-2025-001)
Date/Time: When was the incident detected? When was it resolved?
Severity: Classification (1-4)
Description: What happened? What was the impact?
Root Cause: Why did it happen? (If determined)
Mitigation: What did you do to resolve it?
Prevention: What will you do to prevent it in future?
Lessons Learned: What did you learn from this incident?
APRA expects to see your incident log. They’ll review it to understand what’s gone wrong, how you’ve responded, and whether you’re learning from incidents. If you have a pattern of incidents (e.g., the agent crashes every Monday), your auditors will flag this as a control gap and ask what you’re doing to address it.
For Claude agent deployments, you should maintain a separate incident log focused on agent-related incidents. This makes it easier to spot patterns and trends.
Testing, Monitoring, and Audit Readiness
APRA audits are comprehensive. They’ll review your documentation, test your systems, and interview your team. Here’s how to prepare.
Pre-Audit Preparation
Before your auditors arrive, conduct an internal audit:
- Documentation Review: Gather all documentation related to your Claude agent deployments: requirements documents, design documents, testing plans, test results, runbooks, incident logs, monitoring dashboards, SLAs with Anthropic.
- Completeness Check: For each Claude agent, verify that you have:
  - A documented business case (why are we using this agent?)
  - A documented design (how does it work?)
  - Evidence of testing (test plans, test results, accuracy metrics)
  - Evidence of approval (risk and compliance sign-off)
  - Operational runbooks
  - Monitoring and alerting configuration
  - Incident logs (if any incidents have occurred)
  - Training records (who has been trained on this agent?)
- Gap Identification: For any missing documentation, create it now. If you’ve deployed an agent without a documented business case, create one. If you haven’t tested the agent, run tests now and document the results.
- Team Preparation: Brief your team on what auditors will ask. Practice answering questions like:
  - “Walk me through how this agent works.”
  - “What happens if the agent produces an incorrect output?”
  - “How do you know the agent is working correctly?”
  - “What would you do if the Anthropic API went down?”
  - “How do you manage changes to the agent?”
  - “What incidents have you had with this agent, and how did you respond?”
During the Audit
When auditors are on-site, they’ll want to see:
System Demonstrations: “Show me the agent in action. Process a sample claim / flag a sample transaction.” Have someone ready to do this. Show them the monitoring dashboard. Show them the logs.
Documentation Review: “Walk me through your testing plan.” Go through it section by section. Explain the test cases, the expected results, and the actual results. If a test failed, explain why and what you did about it.
Operational Readiness: “What happens if the agent fails?” Walk them through your runbooks. Show them the fallback procedures. Demonstrate that your team knows how to activate them.
Risk Management: “How do you manage the risk of this agent?” Show them your operational risk register. Explain how you’ve identified risks, assessed their likelihood and impact, and implemented controls.
Third-Party Oversight: “How do you manage the risk of Anthropic?” Show them your SLA with Anthropic. Show them your monitoring of their uptime and performance. Show them your exit strategy.
Post-Audit Follow-Up
After the audit, you’ll receive findings. Some will be observations (things you’re doing well), some will be recommendations (things you could do better), and some will be findings (things you need to fix).
For each finding, you need to:
- Understand It: Make sure you understand what the auditor is saying. Ask clarifying questions if necessary.
- Develop a Remediation Plan: What will you do to address the finding? By when? Who is responsible?
- Implement It: Do the work. Document what you’ve done.
- Verify It: Test that the remediation actually addresses the finding. Get sign-off from risk and compliance.
- Communicate It: Send the remediation plan and evidence to the auditors. Confirm that they’re satisfied.
Common findings for Claude agent deployments include:
- Missing SLA with Anthropic: You haven’t documented uptime, response time, or incident notification commitments. Remediation: Negotiate a service agreement with Anthropic (or review AWS Bedrock terms if you’re using that) and document the key commitments.
- Insufficient Testing: You haven’t tested the agent thoroughly before deploying it. Remediation: Run a comprehensive test suite, document the results, and re-deploy with evidence of testing.
- Missing Monitoring: You’re not monitoring the agent’s performance in production. Remediation: Set up dashboards and alerting for key metrics (accuracy, error rate, latency, API availability).
- Inadequate Runbooks: Your runbooks are vague or incomplete. Remediation: Rewrite them with specific steps, responsible parties, and timelines.
- Weak Change Management: You’re not controlling changes to the agent. Remediation: Implement version control for prompts, require testing before production changes, and document approvals.
Common Pitfalls and How to Avoid Them
Based on our experience working with Australian financial institutions deploying AI agents, here are the most common mistakes—and how to avoid them.
Pitfall 1: Treating Claude as a Deterministic System
The Mistake: Assuming the agent will always produce the same output for the same input. Deploying it to production without guardrails because “it worked in testing.”
Why It Fails: Claude is a probabilistic model. Its outputs vary based on temperature settings, model updates, and even minor changes in input formatting. In production, you’ll inevitably encounter edge cases you didn’t test.
How to Avoid It: Implement guardrails and validation logic downstream of the agent. For a claims processor, don’t let the agent’s approval decision flow directly to your system—validate it first. Check that the claim ID exists, that the approval amount is within policy limits, that the decision is consistent with previous decisions for the same customer. If validation fails, flag the claim for manual review.
Pitfall 2: Ignoring Model Updates
The Mistake: Deploying an agent on Claude 3.5 Sonnet, then Anthropic releases Claude 4 (hypothetically). You don’t test the agent on the new model before deploying it. The agent’s behaviour changes, and you don’t notice until customers complain.
Why It Fails: Model updates can change output distributions, introduce new capabilities (or remove old ones), and affect performance. What worked on Claude 3.5 might not work on Claude 4.
How to Avoid It: When Anthropic releases a new model version, test your agent on it in a staging environment before deploying. Compare outputs on a test dataset. If accuracy changes, investigate why. If it’s an improvement, great—deploy it. If it’s a regression, stick with the old model or adjust your prompts.
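That staged comparison can be sketched as a simple regression gate. Here `call_model` is a placeholder for your actual Anthropic API call, and the 2% tolerance is an assumption you would tune to your own risk appetite:

```python
# Sketch of a model-upgrade regression check: run both model versions over
# the same labelled test set and compare accuracy before promoting the new one.

def call_model(model: str, case: dict) -> str:
    # Placeholder: in practice this calls the Anthropic API with `model`.
    return case["expected"] if model == "new-model" else case["old_output"]

def accuracy(model: str, test_set: list[dict]) -> float:
    correct = sum(1 for case in test_set if call_model(model, case) == case["expected"])
    return correct / len(test_set)

def compare_models(old: str, new: str, test_set: list[dict], tolerance: float = 0.02) -> str:
    old_acc, new_acc = accuracy(old, test_set), accuracy(new, test_set)
    if new_acc >= old_acc:
        return "deploy"       # improvement (or parity): promote the new model
    if old_acc - new_acc <= tolerance:
        return "investigate"  # small regression: dig into why before deciding
    return "hold"             # material regression: stay on the old model
```

The three-way outcome mirrors the advice above: deploy improvements, investigate marginal regressions, and hold on material ones.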
Pitfall 3: Deploying Without Fallback
The Mistake: Making the agent a critical dependency without a fallback. If the agent fails, your entire operation grinds to a halt.
Why It Fails: APIs go down. Models change. Things break. If you don’t have a fallback, you’re exposed to operational risk.
How to Avoid It: For any critical operation, define a fallback. For a claims processor: if the agent is unavailable, escalate to manual processing. For a fraud detector: if the agent is unavailable, activate rule-based detection. For a customer service agent: if the agent is unavailable, queue inquiries and escalate to human agents. Document the fallback in your runbooks and test it regularly.
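The fallback pattern for the claims example can be sketched as a thin wrapper, assuming a hypothetical `agent_call` and an in-memory stand-in for your manual-processing queue:

```python
# Sketch of a fallback wrapper: try the agent, and on any API failure route
# the work to the documented fallback path. MANUAL_QUEUE is an illustrative
# stand-in for your real work-allocation system.

MANUAL_QUEUE: list[dict] = []

def process_claim(claim: dict, agent_call) -> str:
    """Run the claim through the agent; on failure, queue for manual processing."""
    try:
        return agent_call(claim)    # normal path: the agent decides
    except Exception:
        MANUAL_QUEUE.append(claim)  # fallback path: escalate to humans
        return "queued_for_manual"
```

Because the fallback is code, it can be exercised in your regular testing cycle, which is exactly the evidence an auditor will ask for.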
Pitfall 4: Insufficient Monitoring
The Mistake: Deploying the agent to production and assuming it works. Not monitoring accuracy, error rates, or API availability. Only finding out about problems when customers complain or auditors ask questions.
Why It Fails: Problems compound over time. If you’re not monitoring, you won’t detect degradation until it’s severe.
How to Avoid It: Set up monitoring from day one. Track accuracy, error rates, latency, and API availability. Set alerts for anomalies. Review metrics daily for the first week, then weekly. Use these metrics to drive continuous improvement.
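One lightweight way to express those alerts is a threshold table over the day’s metrics. The metric names and limits below are illustrative assumptions; in production this logic would live in your existing monitoring stack rather than a script:

```python
# Sketch of a simple threshold alert over daily agent metrics.
# Thresholds are illustrative, not recommended values.

THRESHOLDS = {
    "accuracy": ("min", 0.95),        # alert if accuracy drops below 95%
    "error_rate": ("max", 0.02),      # alert if more than 2% of calls error
    "p95_latency_ms": ("max", 4000),  # alert on slow responses
    "api_availability": ("min", 0.995),
}

def check_metrics(metrics: dict) -> list[str]:
    """Return the names of any metrics breaching their threshold."""
    alerts = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # missing metric: handled by a separate staleness alert
        if (direction == "min" and value < limit) or (direction == "max" and value > limit):
            alerts.append(name)
    return alerts
```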
Pitfall 5: Weak Data Governance
The Mistake: Not controlling what data flows to Claude. Sending customer PII, transaction data, or sensitive business information to the API without understanding Anthropic’s data handling policies.
Why It Fails: If Anthropic’s systems are compromised, or if they retain data longer than you expect, you could face a data breach. This is a regulatory issue (Privacy Act, GDPR if you have EU customers) and a reputational issue.
How to Avoid It: Understand what data you’re sending to Claude and why. Review Anthropic’s data retention and privacy policies. Consider whether you need to anonymise or redact data before sending it. For highly sensitive data, consider accessing Claude through AWS Bedrock or Google Cloud Vertex AI, which keep traffic within your chosen cloud environment and region, rather than calling the public API.
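A minimal redaction pass before a prompt leaves your environment might look like the following. The regexes are illustrative only (email, AU mobile, TFN-like digit runs) and are no substitute for a proper PII detection service:

```python
# Sketch of pattern-based PII redaction applied to prompt text before it is
# sent to an external API. Patterns are deliberately simple and illustrative.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),          # email addresses
    (re.compile(r"\b(?:\+?61|0)4\d{2}\s?\d{3}\s?\d{3}\b"), "[PHONE]"),  # AU mobiles
    (re.compile(r"\b\d{3}\s?\d{3}\s?\d{3}\b"), "[TFN]"),          # TFN-like digits
]

def redact(text: str) -> str:
    """Replace matched PII patterns with placeholder tokens."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text
```

Order matters here: the phone pattern runs before the generic nine-digit pattern so mobiles are not mislabelled as TFNs.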
Pitfall 6: Neglecting Third-Party Risk
The Mistake: Not treating Anthropic as a material service provider. No SLA, no monitoring of their uptime, no contingency plan if they go out of business.
Why It Fails: You’re now dependent on Anthropic. If they have an outage, you’re down. If they raise prices 10x, you need to decide quickly whether to absorb the cost or migrate. If they go out of business, you’re in trouble.
How to Avoid It: Treat Anthropic like any other critical vendor. Negotiate an SLA. Monitor their uptime and performance. Have a contingency plan (migrate to another provider’s model, switch to accessing Claude via AWS Bedrock or Vertex AI, or revert to manual processes). Document all of this in your operational risk register.
Next Steps: Getting Audit-Ready
If you’re running Claude agents in an Australian financial institution, here’s your action plan:
Immediate (This Month)
- Map Your Agents: List all Claude agents currently in production or in development. For each, document:
  - What operation does it support?
  - Is that operation critical?
  - What data does it process?
  - Who depends on it?
- Review Your Documentation: For each agent, gather all documentation (requirements, design, testing, runbooks). Identify gaps.
- Check Your SLA: Review your agreement with Anthropic or AWS. Do you have documented uptime, response time, and incident notification commitments? If not, flag this as a gap.
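The agent map above can be captured as structured data so it feeds directly into your risk register and audit evidence pack. The example agents and field names here are hypothetical:

```python
# Sketch of a minimal agent inventory mirroring the mapping questions above.
# Agents, fields, and values are illustrative, not a prescribed schema.

AGENT_INVENTORY = [
    {
        "name": "claims-triage-agent",
        "operation": "motor claims triage",
        "critical_operation": True,
        "data_processed": ["claim details", "policy data"],
        "dependents": ["claims ops team", "customer portal"],
    },
    {
        "name": "kb-search-assistant",
        "operation": "internal knowledge search",
        "critical_operation": False,
        "data_processed": ["internal documentation"],
        "dependents": ["service desk"],
    },
]

def critical_agents(inventory: list[dict]) -> list[str]:
    """Agents supporting critical operations warrant the deepest audit scrutiny."""
    return [a["name"] for a in inventory if a["critical_operation"]]
```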
Short-Term (Next Quarter)
- Build Your Risk Register: Create an operational risk register with rows for each agent. For each agent, document risks, likelihood, impact, controls, and residual risk.
- Implement Monitoring: Set up dashboards and alerting for each agent. Track accuracy, error rates, latency, and API availability. Integrate with your existing monitoring infrastructure.
- Document Runbooks: Write runbooks for normal operations, degraded performance, API outages, and data quality issues. Get sign-off from operations and risk teams.
- Conduct Testing: If you haven’t tested your agents thoroughly, do it now. Run functional, accuracy, robustness, and integration tests. Document the results.
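A risk register row can be sketched in the same structured style. The 1-to-5 scoring scale, the control-effectiveness factor, and the example rows are illustrative assumptions, not an APRA-prescribed methodology:

```python
# Sketch of per-agent risk register rows with a simple residual-risk score.
# Scales, scores, and example risks are illustrative only.

RISK_REGISTER = [
    {
        "agent": "claims-triage-agent",
        "risk": "incorrect approval decision",
        "likelihood": 3,               # 1 (rare) to 5 (almost certain)
        "impact": 4,                   # 1 (minor) to 5 (severe)
        "controls": ["downstream validation", "manual review queue"],
        "control_effectiveness": 0.6,  # fraction of inherent risk mitigated
    },
    {
        "agent": "claims-triage-agent",
        "risk": "API outage halts triage",
        "likelihood": 2,
        "impact": 3,
        "controls": ["manual fallback process"],
        "control_effectiveness": 0.5,
    },
]

def residual_score(row: dict) -> float:
    """Inherent score (likelihood x impact) reduced by control effectiveness."""
    return row["likelihood"] * row["impact"] * (1 - row["control_effectiveness"])
```

Keeping the register as data rather than a static document makes it trivial to re-score after each control change and to export for auditors.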
When you’re ready to engage with PADISO for support, our AI & Agents Automation service can help you map these requirements, build compliant runbooks, and prepare for audits. We’ve worked with Australian financial institutions on similar deployments and understand what auditors expect to see. We can also help with AI Strategy & Readiness assessments to ensure your AI deployments align with APRA requirements.
Medium-Term (Next 6 Months)
- Engage with Auditors: Proactively brief your auditors on your Claude agent deployments. Share your risk register, testing results, and runbooks. Ask for feedback on what you might be missing.
- Implement Recommendations: Based on auditor feedback, implement any recommendations or remediate any findings.
- Build a Centre of Excellence: If you have multiple Claude agent deployments, consider establishing a centre of excellence to standardise testing, monitoring, and runbook development. This makes it easier to scale and maintain consistency across agents.
For complex deployments or if you’re planning major AI transformations, consider engaging a partner like PADISO who can provide CTO as a Service support, including fractional CTO leadership and co-build support. We can help you navigate the regulatory landscape, architect compliant solutions, and ensure your AI deployments are audit-ready from day one.
If you’re also considering broader platform engineering or security audit work, our Security Audit (SOC 2 / ISO 27001) service via Vanta can help you build the foundational security controls that support APRA compliance.
Long-Term (Ongoing)
- Monitor Regulatory Changes: APRA continues to evolve its guidance on AI and operational risk. Subscribe to APRA’s updates and adjust your controls as needed.
- Continuous Improvement: Review your agent deployments quarterly. Are they meeting their business objectives? Are there new risks you haven’t considered? Are there new tools or techniques that could improve your control posture?
- Scale Thoughtfully: If you’re planning to deploy more Claude agents, use the lessons from your first deployment to improve your second, third, and subsequent deployments. Build repeatable processes.
Conclusion
APRA CPS 230 is not optional, and Claude agents are not exempt from it. When you deploy an AI agent into a regulated financial institution, you’re introducing operational risk that APRA expects you to manage.
The good news: you don’t need to be perfect. Auditors understand that AI is new and that perfect control is impossible. What they want to see is that you’ve thought about the risks, implemented reasonable controls, monitored your systems, and learned from incidents.
Start with the basics: document your agents, identify risks, implement monitoring, and write runbooks. Get these right, and you’ll pass your audit. As you mature, add more sophisticated controls: automated testing, continuous monitoring, predictive analytics, and proactive risk management.
The institutions that will succeed in the AI era are those that treat AI deployments like any other critical system: with rigorous governance, continuous monitoring, and a culture of learning from failures. APRA CPS 230 is a framework for exactly that.
If you need help navigating this landscape, PADISO is here. We work with Australian financial institutions on AI deployments, compliance, and operational resilience. We understand what auditors expect and how to build systems that are both innovative and compliant. Reach out if you’d like to discuss your specific situation.