Evaluations for Claude Agents: Beyond Vibe Checks
Build eval suites that catch regressions before users do. Golden datasets, LLM-as-judge patterns, and eval cadence for production Claude agents.
Table of Contents
- Why Vibe Checks Fail (And Cost You Money)
- The Three Pillars of Agent Evaluation
- Building Your Golden Dataset
- LLM-as-Judge: The Pattern That Actually Works
- Deterministic Evals for Deterministic Tasks
- Regression Testing and Eval Cadence
- Instrumentation and Observability
- Real-World Eval Patterns from Production
- Common Pitfalls and How to Avoid Them
- Shipping with Confidence: Your Eval Checklist
Why Vibe Checks Fail (And Cost You Money)
You’ve shipped a Claude agent into production. Your team ran it through a handful of test cases, watched it handle a few examples, and declared it “looks good.” Then a user reports it’s hallucinating dates. Another says it’s returning inconsistent formatting. A third discovers it missed an edge case that costs them an hour of manual cleanup.
This is the vibe check era. It’s fast, it feels thorough, and it is almost entirely unreliable.
Vibe checks fail because they:
- Lack scale. You tested 10 scenarios. Your agent will encounter 10,000 variations.
- Miss edge cases. You tested the happy path. Real users find the weird ones.
- Don’t measure consistency. An agent might handle the same request differently on Tuesday than it did on Monday.
- Provide no regression signal. You updated your prompt. Did performance improve or degrade? You won’t know until users complain.
- Don’t isolate failure modes. When something breaks, you can’t pinpoint whether it’s the retrieval, the reasoning, the tool call, or the output formatting.
At PADISO, we’ve shipped evaluations for dozens of Claude agents across startups and enterprises. The teams that built proper eval suites shipped faster, caught regressions before users did, and confidently iterated on their agents. The teams that relied on vibe checks shipped slower, debugged in production, and spent weeks chasing phantom issues.
The cost difference is real. A single undetected regression in production can mean:
- Support tickets that eat your team’s week.
- User churn as customers discover unreliability.
- Delayed feature launches while you debug.
- Loss of confidence in your AI system.
Proper evaluations cost upfront time. They pay dividends immediately.
The Three Pillars of Agent Evaluation
Agent evaluation sits on three pillars. Miss one, and your suite is incomplete.
Pillar 1: Correctness
Does the agent produce the right output? This is the most obvious pillar and the one most teams focus on—often exclusively.
Correctness evals answer:
- Did the agent retrieve the right information?
- Did it reason over that information correctly?
- Is the final output factually accurate?
- Does it match the expected format?
Correctness is necessary but not sufficient. An agent can be correct 95% of the time and still be unreliable in production.
Pillar 2: Consistency
Does the agent produce the same output given the same input? This is where most teams stumble.
LLMs are non-deterministic by default. The same prompt, run twice, may produce different outputs. For some use cases (creative writing, brainstorming), this is fine. For agent workflows, it’s catastrophic.
Consistency evals answer:
- Does the agent return the same answer when asked the same question twice?
- Does it format output consistently across runs?
- Does it make the same tool calls for the same inputs?
- Are there unexplained variations in reasoning?
Consistency is often the first thing to degrade when you update a prompt or model. It’s also the hardest to catch without structured evals.
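A minimal consistency check simply runs the same input several times and compares the outputs. This sketch assumes a `run_agent(input_text)` function standing in for your own agent harness:

```python
from collections import Counter

def eval_consistency(run_agent, input_text, n_runs=5):
    """Run the same input n_runs times and measure output agreement."""
    outputs = [run_agent(input_text) for _ in range(n_runs)]
    # Fraction of runs that match the single most common output
    agreement = Counter(outputs).most_common(1)[0][1] / n_runs
    return agreement == 1.0, f"Agreement: {agreement:.0%} across {n_runs} runs"
```

Exact string matching is a strict bar; for free-form outputs, swap in the semantic-similarity check described later in this post.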
Pillar 3: Cost and Latency
Can you afford to run this agent at scale? Will users tolerate the response time?
This pillar is often forgotten because it’s not about correctness. But it’s critical. An agent that’s 99% correct but takes 30 seconds to respond isn’t useful. An agent that costs $5 per request isn’t viable at scale.
Cost and latency evals answer:
- What’s the average token count per request?
- How many API calls does the agent make?
- What’s the P95 latency?
- What’s the cost per request at scale?
- How does performance degrade under load?
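A minimal sketch of answering these from logged runs, assuming each run is a dict with `latency_ms` and `cost_usd` fields (the logging example later in this post produces records in this shape):

```python
import statistics

def summarise_runs(runs):
    """Compute latency percentiles and average cost over logged agent runs."""
    latencies = [run["latency_ms"] for run in runs]
    costs = [run["cost_usd"] for run in runs]
    return {
        "p50_latency_ms": statistics.median(latencies),
        # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
        "p95_latency_ms": statistics.quantiles(latencies, n=20)[18],
        "avg_cost_usd": statistics.mean(costs),
    }
```

Percentiles matter more than averages here: the P95 surfaces exactly the handful of slow runs that an average hides.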
These three pillars must be evaluated together. A high-correctness, high-latency, high-cost agent is a research project, not a product.
Building Your Golden Dataset
Evaluations are only as good as the data you evaluate against. A golden dataset is a curated set of test cases that represent real-world usage patterns, edge cases, and failure modes.
What Goes Into a Golden Dataset
A golden dataset for a Claude agent should include:
Canonical examples (60% of dataset)
These are the cases your agent should handle perfectly. They represent the core use case. If your agent is a customer support responder, canonical examples are straightforward support questions with clear answers.
Example:
- Input: “How do I reset my password?”
- Expected output: Step-by-step reset instructions, with link to help docs.
- Expected tool calls: Retrieval from help centre, no external APIs.
Edge cases (25% of dataset)
These are the weird ones. They’re less common but real. They test whether your agent can handle variations, ambiguity, and boundary conditions.
Examples:
- Requests in different languages (if you claim multilingual support).
- Requests that are technically outside scope (“Can you help me with my competitor’s product?”).
- Requests that combine multiple intents (“Reset my password AND change my email AND update my profile”).
- Requests with missing or conflicting information.
Failure modes (15% of dataset)
These are cases where the agent should degrade gracefully. It should not hallucinate, invent information, or break.
Examples:
- Requests for information the agent doesn’t have access to.
- Requests that would require tool calls that fail.
- Adversarial inputs designed to trip up the model.
- Cases where the agent should defer to a human.
How to Build It
Step 1: Gather real usage. If you have an existing system (human support, previous version of the agent, user feedback), extract 50+ real requests. This is your foundation.
Step 2: Synthesise variations. For each real request, generate 3–5 variations. Change the wording, add context, remove context, combine intents. Use Claude to help generate these—it’s fast and creative (see the sketch after these steps).
Step 3: Define expected outputs. For each test case, document:
- What the agent should output.
- What tool calls it should make (or not make).
- What information it should retrieve.
- Any constraints on format or tone.
Step 4: Validate with humans. Have someone (ideally a domain expert) review the expected outputs. Catch disagreements now, not in production.
Step 5: Version and maintain it. Your golden dataset is a living artefact. As you discover new edge cases or failure modes in production, add them. Aim for 100–500 test cases. More is better, but diminishing returns kick in after ~300.
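As flagged in Step 2, Claude can generate the variations for you. A minimal sketch using the Anthropic Python SDK (the model name is an illustrative choice; review the output before it enters your dataset):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_variations(request, n=4):
    """Ask Claude to rephrase a real user request n different ways."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this user request {n} different ways. Vary the wording, "
                f"add or remove context, but keep the intent. One per line.\n\n"
                f"Request: {request}"
            ),
        }],
    )
    return response.content[0].text.strip().split("\n")
```

Synthetic cases still need the human validation described in Step 4.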
Storing and Managing Your Dataset
Use a simple format: JSON, CSV, or a database. Each row is a test case, with fields like:
```json
{
  "id": "test_001",
  "category": "canonical",
  "input": "How do I reset my password?",
  "expected_output": "Step-by-step reset instructions...",
  "expected_tools": ["retrieve_help_docs"],
  "metadata": {
    "language": "en",
    "intent": "password_reset",
    "complexity": "low"
  }
}
```
Version control it. Treat it like code. When you update your agent, you’ll want to compare eval results across versions.
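A loader can double as a sanity check. A minimal sketch, assuming one test case per line in a JSON Lines file (the filename is hypothetical):

```python
import json

REQUIRED_FIELDS = {"id", "category", "input", "expected_output"}

def load_golden_dataset(path="golden_dataset.jsonl"):
    """Load test cases and fail loudly on malformed entries."""
    cases = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            case = json.loads(line)
            missing = REQUIRED_FIELDS - case.keys()
            if missing:
                raise ValueError(f"Line {line_no}: missing fields {missing}")
            cases.append(case)
    return cases
```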
LLM-as-Judge: The Pattern That Actually Works
Manual evaluation doesn’t scale. You can’t review 300 test cases by hand every time you update your agent. Enter: LLM-as-judge.
The idea is simple: use another LLM (often Claude itself, or a different model) to evaluate whether your agent’s output meets the expected criteria. This is automated, repeatable, and fast.
As discussed in Beyond Vibe Checks: A PM’s Complete Guide to Evals, LLM-based evals are one of the most practical techniques for scaling evaluation. The key is designing the judge prompt carefully.
The Judge Prompt Pattern
Your judge prompt should:
- Define the criteria clearly. What makes an output “good”? Be specific.
- Provide examples. Show the judge what good and bad outputs look like.
- Ask for structured output. Scores, binary pass/fail, or detailed reasoning—whatever you need.
- Avoid ambiguity. The judge should interpret the criteria the same way every time.
Example judge prompt:
```
You are evaluating whether an AI agent's response correctly answers a customer support question.

Criteria for a PASS:
1. The response directly answers the question asked.
2. The response is factually accurate (does not hallucinate or invent information).
3. The response is formatted clearly (uses bullet points or numbered lists if applicable).
4. The response does not include irrelevant information.

Criteria for a FAIL:
1. The response does not answer the question.
2. The response includes false or hallucinated information.
3. The response is confusing or poorly formatted.
4. The response defers to a human when it shouldn't (or vice versa).

Question: {question}
Expected Answer: {expected_answer}
Agent Response: {agent_response}

Evaluate the agent's response. Respond with:
- PASS or FAIL
- Reasoning (1-2 sentences)
- Confidence (0-100)
```
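A minimal sketch of wiring this up with the Anthropic Python SDK, assuming the prompt above is stored as a Python template string (the model name is an illustrative choice of judge):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(question, expected_answer, agent_response, judge_prompt_template):
    """Run the judge prompt above and parse the PASS/FAIL verdict."""
    prompt = judge_prompt_template.format(
        question=question,
        expected_answer=expected_answer,
        agent_response=agent_response,
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative choice of judge model
        max_tokens=300,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text.strip()
    return verdict.upper().startswith("PASS"), verdict
```

Parsing the verdict from free text works, but asking the judge for JSON output makes parsing more robust.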
Multi-Dimensional Scoring
For more nuance, score across multiple dimensions:
```
Score the agent's response on each dimension (0-10):
1. Correctness: Does it answer the question accurately?
2. Completeness: Does it cover all relevant aspects?
3. Clarity: Is it easy to understand?
4. Relevance: Is all information relevant to the question?
5. Tone: Is the tone appropriate for customer support?

Provide a single overall score (0-10) and reasoning.
```
This gives you granular insight into where your agent is strong and where it’s weak.
Validation and Calibration
Before you run evals at scale, validate your judge:
- Manually evaluate 20–30 test cases. Have a human expert score them using the same criteria.
- Run the LLM judge on the same cases. Compare results.
- Calculate agreement. Aim for 80%+ agreement on binary pass/fail, or high correlation on numeric scores.
- Refine the judge prompt. If agreement is low, iterate on the prompt until it’s consistent.
This calibration step is crucial. A poorly calibrated judge is worse than no judge at all.
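For binary pass/fail, agreement is a short calculation; a minimal sketch:

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of calibration cases where the judge matches the human expert."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# e.g. 26 matching verdicts out of 30 cases gives 0.87, above the 80% bar
```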
Handling Edge Cases in Judging
Some cases are hard to judge automatically. Examples:
- Subjective quality. “Is this response well-written?” is subjective.
- Domain expertise required. “Is this medical advice accurate?” requires domain knowledge.
- Context-dependent correctness. The right answer depends on user intent, which may be implicit.
For these cases:
- Use human-in-the-loop evaluation. Flag edge cases for manual review.
- Use multiple judges. Run three different LLM judges and take the majority vote.
- Use domain-specific judges. Fine-tune a model on expert-labelled examples, then use it as your judge.
At PADISO, when building eval suites for our clients’ agents, we often combine LLM judges with spot-check human review. We automate 80% of evaluation and reserve human time for the 20% that requires judgment.
Deterministic Evals for Deterministic Tasks
Not everything requires an LLM judge. Some tasks are deterministic, and you can write simple rules to evaluate them.
Pattern Matching and Format Validation
If your agent outputs structured data, validate the structure:
```python
import json
import re

def eval_json_format(output):
    """Check if output is valid JSON."""
    try:
        json.loads(output)
        return True, "Valid JSON"
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

def eval_date_format(output):
    """Check if output contains dates in YYYY-MM-DD format."""
    pattern = r'\d{4}-\d{2}-\d{2}'
    matches = re.findall(pattern, output)
    if matches:
        return True, f"Found {len(matches)} dates in correct format"
    return False, "No dates in YYYY-MM-DD format found"
```
These are fast, reliable, and require zero LLM calls.
Retrieval Evaluation
If your agent retrieves information, evaluate whether it retrieved the right documents:
```python
def eval_retrieval(retrieved_docs, expected_doc_ids):
    """Check if agent retrieved the expected documents."""
    retrieved_ids = {doc['id'] for doc in retrieved_docs}
    expected_ids = set(expected_doc_ids)
    if retrieved_ids == expected_ids:
        return True, "Retrieved all expected documents"
    missing = expected_ids - retrieved_ids
    extra = retrieved_ids - expected_ids
    return False, f"Missing: {missing}, Extra: {extra}"
```
This isolates retrieval failures from reasoning failures. If retrieval is wrong, you know to fix your retrieval system, not your prompt.
Tool Call Validation
If your agent uses tools, validate that it calls the right tools in the right order:
```python
def eval_tool_calls(agent_calls, expected_calls):
    """Check if the agent made the expected tool calls, in order."""
    agent_call_names = [call['name'] for call in agent_calls]
    if agent_call_names == expected_calls:
        return True, "Correct tool call sequence"
    return False, f"Expected {expected_calls}, got {agent_call_names}"
```
This catches cases where your agent is reasoning correctly but calling tools in the wrong order or missing steps.
Semantic Similarity
For free-form text, use embedding-based similarity to check if the agent’s output is semantically similar to the expected output:
```python
from sklearn.metrics.pairwise import cosine_similarity

def eval_semantic_similarity(agent_output, expected_output, embeddings_fn):
    """Check semantic similarity between outputs."""
    agent_embedding = embeddings_fn(agent_output)
    expected_embedding = embeddings_fn(expected_output)
    similarity = cosine_similarity(
        [agent_embedding],
        [expected_embedding]
    )[0][0]
    if similarity > 0.85:  # threshold
        return True, f"High semantic similarity: {similarity:.2f}"
    return False, f"Low semantic similarity: {similarity:.2f}"
```
This is useful when there are multiple correct ways to phrase an answer. As discussed in the Prompt Engineering Guide, semantic understanding is crucial for evaluating agent outputs beyond exact string matching.
Regression Testing and Eval Cadence
Evaluations are only useful if you run them regularly. Define an eval cadence and stick to it.
The Eval Pipeline
Your eval pipeline should be automated and run on every change:
- Pre-commit evals (local, fast). Run a subset of evals (20–50 test cases) before you commit code. This catches obvious breaks.
- CI evals (full suite). Run the full eval suite on every commit to main. This takes longer but catches regressions.
- Nightly evals (extended suite). Run extended evals overnight, including stress tests and edge cases.
- Weekly manual review. Review eval results, spot-check failures, and update the golden dataset.
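A sketch of the fast pre-commit tier, reusing the hypothetical `load_golden_dataset` loader from earlier plus a hypothetical `run_eval(case)` that returns `(passed, reason)`:

```python
import random

def run_fast_evals(cases, run_eval, sample_size=30, seed=42):
    """Run a fixed random subset of the golden dataset and report failures."""
    rng = random.Random(seed)  # a fixed seed keeps the subset stable across runs
    subset = rng.sample(cases, min(sample_size, len(cases)))
    failures = []
    for case in subset:
        passed, reason = run_eval(case)
        if not passed:
            failures.append((case["id"], reason))
    print(f"{len(subset) - len(failures)}/{len(subset)} passed")
    return failures
```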
Regression Detection
Track eval scores over time. A regression is a drop in performance:
```python
def detect_regression(current_score, previous_score, threshold=0.05):
    """Detect if performance has regressed."""
    if current_score < previous_score * (1 - threshold):
        return True, f"Regression detected: {previous_score} -> {current_score}"
    return False, "No regression"
```
Set thresholds based on your use case. For critical systems, even a 2% drop might be a regression. For less critical systems, 5% might be acceptable.
Tracking and Visualisation
Store eval results in a database or spreadsheet. Track:
- Timestamp
- Agent version
- Eval version
- Scores (pass rate, average correctness, latency, cost)
- Failures (which test cases failed, why)
Visualise trends over time:
```
Pass Rate Over Time

100% |      ___
     |     /   \
 95% | ___/     \___
     |              \
 90% |               \___
     +---+---+---+---+---+
       v1  v2  v3  v4  v5
```
When you see a dip, investigate immediately. Did you introduce a bug? Did the model behave differently? Did the golden dataset change?
Balancing Iteration Speed and Stability
There’s a tension: you want to iterate fast, but you also want to catch regressions. Here’s how to balance:
- For exploratory work (new features, new prompts). Run fast evals (20–50 test cases) frequently. You’re iterating quickly and expect some failures.
- Before shipping to production. Run the full eval suite. You’re confident in the change.
- In production. Run evals on every change, even small ones. You’re protecting your users.
At PADISO, our clients typically run full evals before shipping and quick evals during development. This catches regressions without slowing down iteration.
Instrumentation and Observability
Evals are only the start. You also need observability in production.
Logging Agent Execution
Log every agent execution:
```python
import json
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

def run_agent(input_text):
    # Run the agent (assumes an `agent` harness exposing these attributes)
    output = agent.execute(input_text)
    # Log a structured record of the execution
    logger.info(json.dumps({
        "timestamp": datetime.now().isoformat(),
        "input": input_text,
        "output": output,
        "tool_calls": agent.tool_calls,
        "latency_ms": agent.latency_ms,
        "cost_usd": agent.cost_usd,
        "model": agent.model,
        "agent_version": agent.version,
    }))
    return output
```
This gives you a complete record of what the agent did. When something goes wrong, you can replay it and debug.
User Feedback Loops
Add mechanisms for users to flag bad outputs:
```python
# After the agent returns output
print(f"Output: {output}")
print("Was this helpful? [Y/N/Report]")
user_feedback = input()
if user_feedback.lower() == "report":
    logger.warning(json.dumps({
        "type": "user_reported_failure",
        "input": input_text,
        "output": output,
        "user_feedback": input("Please describe the issue: "),
    }))
```
User feedback is gold. It tells you what’s actually failing in the wild, not just in your test suite.
Automated Monitoring
Set up alerts for:
- High error rate. If >5% of requests fail, page the team.
- High latency. If P95 latency exceeds threshold, investigate.
- High cost. If cost per request spikes, something’s wrong.
- Low user feedback score. If users are reporting failures, fix it.
These alerts catch problems before they become disasters.
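A sketch of turning those thresholds into a periodic check over a window of recent logged runs. The field names follow the logging example above; the `failed` flag and the limits are assumptions to tune for your system:

```python
import statistics

def check_alerts(recent_runs, max_error_rate=0.05, max_p95_ms=5000, max_cost_usd=0.10):
    """Return alert messages for a non-empty window of recent logged runs."""
    alerts = []
    error_rate = sum(1 for r in recent_runs if r["failed"]) / len(recent_runs)
    if error_rate > max_error_rate:
        alerts.append(f"Error rate {error_rate:.1%} exceeds {max_error_rate:.0%}")
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    p95 = statistics.quantiles([r["latency_ms"] for r in recent_runs], n=20)[18]
    if p95 > max_p95_ms:
        alerts.append(f"P95 latency {p95:.0f}ms exceeds {max_p95_ms}ms")
    avg_cost = statistics.mean(r["cost_usd"] for r in recent_runs)
    if avg_cost > max_cost_usd:
        alerts.append(f"Average cost ${avg_cost:.2f} exceeds ${max_cost_usd:.2f}")
    return alerts
```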
Real-World Eval Patterns from Production
Theory is useful, but real-world patterns are more useful. Here are patterns we’ve seen work at PADISO and with our clients.
Pattern 1: Tiered Eval Suites
Not all test cases are equally important. Create tiers:
Tier 1: Critical path (20 test cases)
The core use case. If these fail, the agent is broken. Run these on every commit.
Tier 2: Extended functionality (100 test cases)
Common variations and edge cases. Run these before shipping to production.
Tier 3: Stress and adversarial (200+ test cases)
Edge cases, adversarial inputs, and stress tests. Run these nightly.
This approach lets you iterate fast (Tier 1) while still catching problems (Tiers 2 and 3).
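One lightweight way to wire this up is to tag each test case with a tier and map each trigger point to the tiers it runs; a sketch (the `tier` field and names are illustrative):

```python
# Which tiers run at each trigger point
TIERS_BY_TRIGGER = {
    "pre_commit": {"tier1"},
    "ci": {"tier1", "tier2"},
    "nightly": {"tier1", "tier2", "tier3"},
}

def select_cases(cases, trigger):
    """Pick the test cases whose tier runs at this trigger point."""
    active = TIERS_BY_TRIGGER[trigger]
    return [case for case in cases if case.get("tier", "tier2") in active]
```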
Pattern 2: Eval-Driven Development
Write evals before you write the agent. This is test-driven development, but for AI:
- Define the golden dataset and evals.
- Implement the agent.
- Run evals. If they fail, iterate on the agent.
- Ship when evals pass.
This ensures your agent meets requirements from day one. It also gives you a regression test suite for free.
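In practice this can ride on your existing test runner. A pytest sketch, reusing the hypothetical `load_golden_dataset` and `run_eval` helpers from earlier (the `evals` module name is an assumption):

```python
import pytest

from evals import load_golden_dataset, run_eval  # hypothetical helpers from earlier

CASES = load_golden_dataset()

@pytest.mark.parametrize("case", CASES, ids=lambda case: case["id"])
def test_golden_case(case):
    """Each golden test case becomes an individual test."""
    passed, reason = run_eval(case)
    assert passed, f"{case['id']} failed: {reason}"
```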
Pattern 3: Continuous Eval Improvement
Your evals will be wrong at first. That’s OK. Improve them over time:
- Run evals in production (log outputs, don’t block).
- Collect user feedback and production failures.
- Add failing cases to the golden dataset.
- Re-run evals. You’ll see a dip in pass rate—that’s expected.
- Iterate on the agent to fix the new failures.
- Your eval suite is now more representative of real usage.
This feedback loop is how you build a truly robust agent.
Pattern 4: Multi-Model Evaluation
Don’t just evaluate with Claude. Test with other models too:
- Claude 3.5 Sonnet for your production agent.
- Claude 3.5 Haiku for cost-sensitive variants.
- GPT-4o for comparison and robustness.
If your agent passes evals with multiple models, you’re more confident in its robustness. As discussed in Claude Code Agents 101: Build Your First AI Agent from Scratch, understanding the strengths and weaknesses of different models is crucial for building reliable agents.
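A sketch of such a sweep, assuming a hypothetical `run_eval_with_model(case, model)` that runs one test case against a given model and returns `(passed, reason)` (the model list is illustrative):

```python
MODELS = [
    "claude-3-5-sonnet-latest",  # production candidate
    "claude-3-5-haiku-latest",   # cost-sensitive variant
]

def sweep_models(cases, run_eval_with_model):
    """Run the full eval suite once per model and report pass rates."""
    results = {}
    for model in MODELS:
        passed = sum(1 for case in cases if run_eval_with_model(case, model)[0])
        results[model] = passed / len(cases)
    return results
```

Large gaps between models on the same suite usually point at prompt brittleness rather than model quality.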
Pattern 5: Eval Suites as Documentation
Your golden dataset and evals are documentation. They tell future developers (and your future self) what the agent is supposed to do. Keep them well-organised and commented:
```json
{
  "id": "test_password_reset_001",
  "category": "canonical",
  "description": "User asks how to reset password. Agent should provide step-by-step instructions.",
  "input": "How do I reset my password?",
  "expected_output": "1. Click 'Forgot Password' on the login page\n2. Enter your email...",
  "notes": "This is a core use case. Should always pass."
}
```
When a new engineer joins, they can read the eval suite and understand the agent’s behaviour in minutes.
Common Pitfalls and How to Avoid Them
Pitfall 1: Overfitting to Your Eval Suite
You optimise your agent to pass evals, but it fails in production. This happens when:
- Your eval suite is too small (< 50 test cases).
- Your eval suite doesn’t represent real usage.
- You optimise the prompt specifically for your test cases.
Avoid it by:
- Using a large, diverse golden dataset.
- Regularly adding production failures to your eval suite.
- Evaluating on held-out test cases (don’t use the same cases for development and final evaluation).
Pitfall 2: Eval Suite Drift
Your eval suite becomes stale. It doesn’t represent current usage anymore. You’re optimising for the wrong thing.
Avoid it by:
- Reviewing and updating your eval suite quarterly.
- Adding new test cases based on production failures and user feedback.
- Removing test cases that are no longer relevant.
Pitfall 3: Judge Bias
Your LLM judge is biased. It consistently rates certain types of outputs higher or lower, regardless of quality.
Avoid it by:
- Calibrating your judge on human-labelled examples.
- Using multiple judges and taking the majority vote.
- Regularly spot-checking judge decisions against human judgment.
Pitfall 4: Ignoring Cost and Latency
You optimise for correctness but ignore cost and latency. Your agent is correct but unusable.
Avoid it by:
- Including cost and latency in your eval metrics from day one.
- Setting budgets (e.g., “must cost < $0.10 per request”) and enforcing them.
- Testing latency under load, not just in isolation.
Pitfall 5: Not Automating Evals
You run evals manually. They’re slow, error-prone, and rarely run. You miss regressions.
Avoid it by:
- Automating your eval pipeline from day one.
- Running evals on every commit (at least a subset).
- Setting up alerts for regressions.
Shipping with Confidence: Your Eval Checklist
Before you ship your Claude agent to production, use this checklist.
Golden Dataset
- You have 100+ test cases covering canonical examples, edge cases, and failure modes.
- Test cases are documented with expected outputs and reasoning.
- Test cases are version-controlled and reviewed by domain experts.
- You have a process for adding new test cases based on production failures.
Correctness Evaluation
- You have LLM-as-judge evals for free-form outputs.
- You have deterministic evals for structured outputs (JSON, dates, tool calls).
- Your judge is calibrated against human judgment (80%+ agreement).
- You’re tracking correctness scores over time and detecting regressions.
Consistency Evaluation
- You’re running the same test case multiple times and checking for consistency.
- You’re tracking consistency scores and flagging degradation.
- You’re using temperature=0 (or low temperature) for production agents where consistency matters.
Cost and Latency Evaluation
- You’re measuring token count per request and tracking trends.
- You’re measuring latency (P50, P95, P99) and ensuring it’s acceptable.
- You’re measuring cost per request and ensuring it’s viable at scale.
- You have budgets and are enforcing them (e.g., “must cost < $0.10 per request”).
Instrumentation and Observability
- You’re logging every agent execution (input, output, tool calls, latency, cost).
- You have a mechanism for users to report failures.
- You have alerts set up for high error rates, high latency, and high cost.
- You’re monitoring user feedback and production failures.
Eval Pipeline
- You have a CI/CD pipeline that runs evals on every commit.
- You have fast evals (< 5 minutes) for development and full evals for production.
- You’re tracking eval results over time and detecting regressions.
- You have a process for investigating and fixing regressions.
Documentation
- Your golden dataset is well-documented and serves as documentation for the agent’s behaviour.
- You have a README explaining your eval suite and how to run it.
- You have runbooks for common failure modes and how to debug them.
Moving Beyond Vibe Checks
Vibe checks are comfortable. They’re fast, they feel thorough, and they let you ship quickly. But they’re also a trap. They give you confidence right up until they fail, and then you’re debugging in production.
Proper evaluations take upfront time. You’ll spend time building your golden dataset, calibrating your judge, and automating your eval pipeline. But that time pays dividends immediately:
- You ship faster because you’re confident in your changes.
- You catch regressions before users do.
- You can iterate on your agent without fear.
- You build a feedback loop that makes your agent better over time.
At PADISO, we’ve helped startups and enterprises build eval suites for their Claude agents. The pattern is consistent: teams that invest in evals ship better products faster. Teams that rely on vibe checks ship slower and debug more.
If you’re building a Claude agent for production, build evals. Start with a golden dataset. Add LLM-as-judge and deterministic evals. Automate your pipeline. Monitor in production. Iterate.
Your users will thank you. Your team will thank you. Your sanity will thank you.
Next Steps
- Start with your golden dataset. Gather 50+ real examples from your use case. Define expected outputs.
- Build your judge. Write a judge prompt that clearly defines what “good” looks like. Calibrate it against human judgment.
- Automate your pipeline. Set up CI/CD to run evals on every commit. Start with a subset, expand over time.
- Ship with confidence. Use the checklist above. When you’re ready, ship to production with monitoring and feedback loops in place.
- Iterate continuously. Add production failures to your eval suite. Improve your judge. Expand your golden dataset.
If you’re building agents at scale—or if you need help designing eval suites for your Claude agents—PADISO specialises in AI & Agents Automation and can help you build production-grade evaluation systems. We’ve shipped evals for dozens of agents across startups and enterprises, and we know what works. Whether you need help with AI Strategy & Readiness, custom Platform Design & Engineering, or Venture Studio & Co-Build support, we’re here to help you ship reliable AI products.
The future of AI products is built on solid evals. Start building yours today.
Additional Resources
For deeper dives into specific topics:
- Learn how to orchestrate multiple agents in production with patterns from How We Ship Production Code with 200 Autonomous Agents, which covers testing and iteration strategies at scale.
- Understand the broader context of agent evaluation by reading Evaluating AI Models on End-to-End Web Application Development, an academic paper on benchmarking code generation and web agents.
- Get hands-on with Build a Bot with Claude, Anthropic’s official guide to building conversational agents.
- Dive into the Python SDK with Anthropic Python SDK, which includes examples of agent and tool use.
- Explore advanced prompt engineering techniques in the Prompt Engineering Guide, essential for optimising agent performance.
- Check out the Massive Text Embedding Benchmark Leaderboard to understand embedding model performance, relevant for retrieval-based agents.
For Sydney-based teams looking to build production AI systems, explore how AI Agency Sydney and AI Automation Agency Sydney partnerships can accelerate your development. Learn about AI Agency Methodology Sydney and how to measure success with AI Agency Metrics Sydney, AI Agency KPIs Sydney, and AI Agency ROI Sydney. If you’re comparing approaches, understand the differences between Agentic AI vs Traditional Automation and how agentic patterns deliver better ROI. For practical integration examples, see how Agentic AI + Apache Superset enables non-technical users to interact with data via Claude agents.