Table of Contents
- Why LLM-as-Judge Alone Isn’t Enough
- Reference-Based Evaluation Patterns
- Behavioural Evaluation Patterns
- Outcome-Based Evaluation Patterns
- Building a Balanced Evaluation Framework
- Common Pitfalls and How to Avoid Them
- Integrating Evaluation Into Your Deployment Pipeline
- Practical Next Steps
Why LLM-as-Judge Alone Isn’t Enough
If you’ve shipped a production AI agent—or you’re planning to—you’ve probably heard the pitch: use an LLM as a judge to evaluate your agent’s outputs. It’s fast. It scales. It doesn’t require ground truth labels. And it’s become the default pattern across most AI evaluation tooling.
The problem is that LLM-as-judge evaluation tells you almost nothing about whether your agent will actually work in production.
We’ve seen this play out repeatedly with our clients at PADISO. A team ships an agent with strong LLM-as-judge scores—maybe 8.5/10 on coherence, relevance, and task completion. They deploy to production. Within a week, the agent is hallucinating tool calls, getting stuck in retry loops, or making decisions that violate business logic. The LLM judge never caught any of it.
Why? Because LLM-as-judge evaluates output quality in isolation. It doesn’t measure whether your agent makes correct tool calls, respects constraints, recovers from failures, or delivers measurable business value. It doesn’t test whether the agent’s behaviour is consistent, deterministic, or safe.
Production AI agents live in a different world than the text-in-text-out model that LLM judges are built for. They interact with APIs, databases, and external systems. They make sequential decisions. They accumulate state. They fail in ways that matter.
This guide walks you through three complementary evaluation patterns that actually predict production performance: reference-based evaluation, behavioural evaluation, and outcome-based evaluation. Used together, these patterns give you confidence that your agent won’t break when it matters.
Reference-Based Evaluation Patterns
Reference-based evaluation compares your agent’s output against known-good examples or expected trajectories. It’s the most straightforward pattern, and it’s essential for catching regressions.
Exact Match and Semantic Similarity
The simplest reference-based check is exact match: does your agent produce the exact same output as a reference implementation? This works well for deterministic tasks—extracting structured data, routing requests, or generating SQL queries.
For example, if your agent’s job is to convert natural language into database queries, you can compare the agent’s generated SQL against a reference SQL statement. If they match, the agent passed. If they don’t, you investigate why.
Exact match is brittle, though. If your agent generates semantically equivalent SQL that’s formatted differently, exact match will fail even though the agent is correct. That’s where semantic similarity comes in.
Semantic similarity uses embedding models or specialised checkers to compare outputs at a higher level of abstraction. For SQL queries, you might run both the agent’s query and the reference query against a test database and compare the result sets. For natural language outputs, you can use embeddings to measure cosine similarity between the agent’s response and the reference response.
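As a minimal sketch of the result-set comparison described above, the following uses Python’s built-in sqlite3 module. The table, the queries, and the helper name `results_match` are hypothetical illustrations, not part of any particular tool.

```python
import sqlite3
from collections import Counter


def results_match(conn, agent_sql, reference_sql):
    """Return True if both queries return the same multiset of rows."""
    agent_rows = conn.execute(agent_sql).fetchall()
    reference_rows = conn.execute(reference_sql).fetchall()
    # Compare as multisets so row order is ignored.
    # Compare the lists directly instead if ordering is part of the task.
    return Counter(agent_rows) == Counter(reference_rows)


# Hypothetical test fixture: a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'acme', 120.0), (2, 'globex', 80.0), (3, 'acme', 40.0);
""")

reference_sql = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
# Formatted differently, but semantically equivalent to the reference.
agent_sql = ("select o.customer, sum(o.total) from orders as o "
             "group by o.customer order by o.customer desc")
# Looks plausible, but computes a different aggregate.
wrong_sql = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

print(results_match(conn, agent_sql, reference_sql))
print(results_match(conn, wrong_sql, reference_sql))
```

An exact string comparison would reject the first agent query even though it returns the same rows. The result-set check accepts it and still rejects the query that computes the wrong aggregate.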
Tools like LangSmith make this straightforward. You define reference outputs, run your agent against a test set, and automatically compare the agent’s outputs against the references using embedding similarity, exact match, or custom comparison logic.
Trajectory Matching
Most agents don’t produce a single output—they produce a trajectory of steps. An agent might observe the current state, call a tool, observe the result, call another tool, and finally produce a response. Each step matters.
Trajectory matching compares the agent’s entire decision sequence against a reference trajectory. Did the agent call the right tools in the right order? Did it pass the correct arguments? Did it handle tool responses appropriately?
This is more powerful than output-only evaluation because it catches agents that stumble through the right answer by accident. If your agent calls three tools in the wrong order but happens to arrive at the correct final answer, trajectory matching will flag it. Output-only evaluation won’t.
You can implement trajectory matching by recording reference trajectories—sequences of observations, tool calls, and responses—and then comparing your agent’s actual trajectory against the reference. Arize AI’s LLM as a Judge primer includes practical examples of how to structure trajectory evaluation, including how to score partial matches when the agent takes a different but equally valid path.
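One way to sketch this in Python is to record each tool call as a small record and compare sequences. The `ToolCall` structure, the tool names, and the use of `difflib.SequenceMatcher` for partial credit are assumptions made for illustration. Other scoring schemes are equally reasonable.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def call(name, **args):
    """Convenience constructor for a recorded tool call."""
    return ToolCall(name, tuple(sorted(args.items())))


def trajectory_score(actual, reference):
    """Return (exact_match, tool_order_similarity between 0 and 1)."""
    exact = actual == reference
    actual_names = [c.name for c in actual]
    reference_names = [c.name for c in reference]
    # Partial credit: how much of the tool sequence lines up in order.
    similarity = SequenceMatcher(None, actual_names, reference_names).ratio()
    return exact, similarity


# Hypothetical reference trajectory for a refund scenario.
reference = [
    call("lookup_customer", customer_id="c-1"),
    call("get_order", order_id="o-9"),
    call("issue_refund", order_id="o-9", amount=40.0),
]
# The agent used the right tools but swapped the first two steps.
actual = [reference[1], reference[0], reference[2]]

exact, similarity = trajectory_score(actual, reference)
print(exact, round(similarity, 2))
```

The exact flag works as a strict regression check. The similarity score is more useful for triage, because it separates a trajectory that is slightly out of order from one that shares nothing with the reference.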
Tool Call Correctness
One of the most common failure modes in production agents is incorrect tool calls. The agent might call the right tool with the wrong arguments, or call the wrong tool entirely. These failures often slip through LLM-as-judge evaluation because judges focus on final output quality, not intermediate steps.
Tool call correctness evaluation explicitly checks whether each tool call the agent makes is valid. Did the agent pass the required arguments? Are the argument types correct? Does the argument value fall within acceptable ranges? Does the tool call respect business logic constraints?
You can implement this as a whitelist of valid tool calls for each scenario, then check the agent’s actual calls against the whitelist. Alternatively, you can use a type-checking approach: define the expected schema for each tool, and validate that the agent’s calls conform to the schema.
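The schema approach might look like the sketch below. The `issue_refund` tool, its fields, and the range limits are hypothetical. In practice you would derive the schema from your real tool definitions, or use a schema library rather than hand-rolled checks.

```python
# Hypothetical schema: argument name -> type, required flag, optional range.
TOOL_SCHEMAS = {
    "issue_refund": {
        "order_id": {"type": str, "required": True},
        "amount": {"type": float, "required": True, "min": 0.01, "max": 10_000.0},
        "reason": {"type": str, "required": False},
    },
}


def validate_tool_call(name, args):
    """Return a list of problems; an empty list means the call is valid."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    for field, rule in schema.items():
        if field not in args:
            if rule["required"]:
                problems.append(f"missing required argument: {field}")
            continue
        value = args[field]
        if not isinstance(value, rule["type"]):
            problems.append(
                f"{field}: expected {rule['type'].__name__}, got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: {value} is below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: {value} is above maximum {rule['max']}")
    for field in args:
        if field not in schema:
            problems.append(f"unexpected argument: {field}")
    return problems


print(validate_tool_call("issue_refund", {"order_id": "o-9", "amount": 40.0}))
print(validate_tool_call("issue_refund", {"order_id": "o-9", "amount": 25_000.0}))
print(validate_tool_call("delete_account", {"customer_id": "c-1"}))
```

Returning a list of problems rather than a single boolean makes failures easier to triage, since one bad call often breaks several rules at once.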
This pattern is especially valuable for agents that interact with critical systems—payment processors, customer databases, operational dashboards. A single incorrect tool call can cost real money or break customer trust. Reference-based tool call evaluation catches these failures before deployment.
Behavioural Evaluation Patterns
Behavioural evaluation focuses on how your agent behaves across different scenarios, rather than comparing against a known-good reference. It’s about consistency, constraint satisfaction, and robustness.
Consistency and Determinism
In production, you need your agent to behave consistently. If you ask it the same question twice, it should give the same answer, or at least a semantically equivalent one. If it doesn’t, you can’t trust it.
Consistency evaluation runs your agent multiple times against the same input and measures whether outputs are consistent. For deterministic tasks, you’d expect exact match. For tasks that allow variation, you’d measure whether the outputs cluster around the same semantic meaning.
You can measure consistency using:
- Exact match rate: percentage of runs that produce identical output
- Semantic similarity variance: how much the embedding-based similarity varies across runs
- Decision consistency: whether the agent makes the same tool calls in the same order
High consistency is a prerequisite for production deployment. If your agent produces wildly different outputs for the same input, users will lose trust quickly. Consistency evaluation forces you to identify and fix sources of non-determinism before they reach production.
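Two of the metrics above can be computed with a few lines of Python. This sketch assumes one particular definition: the share of runs that agree with the most common result. The sample outputs and tool names are made up. Semantic similarity variance is omitted because it depends on the embedding model you choose.

```python
from collections import Counter


def exact_match_rate(outputs):
    """Fraction of runs that agree with the most common output."""
    if not outputs:
        return 0.0
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)


def decision_consistency(trajectories):
    """Fraction of runs whose ordered tool-call sequence matches the most common one."""
    return exact_match_rate([tuple(t) for t in trajectories])


# Hypothetical results from running the same input five times.
outputs = ["REFUND_APPROVED"] * 4 + ["REFUND_ESCALATED"]
# Hypothetical tool-call sequences from three runs of the same input.
trajectories = [
    ["lookup_customer", "get_order"],
    ["lookup_customer", "get_order"],
    ["get_order", "lookup_customer"],
]

print(exact_match_rate(outputs))
print(decision_consistency(trajectories))
```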
Constraint Satisfaction
Most agents operate under constraints. They shouldn’t:
- Call tools they don’t have access to
- Exceed rate limits
- Violate business rules (e.g., “don’t approve refunds over $10,000”)
- Access data they shouldn’t (e.g., other customers’ information)
- Produce outputs that violate formatting requirements
Constraint satisfaction evaluation explicitly tests whether your agent respects these boundaries. You define a set of constraints, generate test cases that push the agent to violate them, and measure how often the agent stays within bounds.
For example, if your agent manages customer refunds with a $10,000 limit, you’d generate test cases with refund requests above and below the limit. The agent should approve requests below the limit and reject (or escalate) requests above it. If it doesn’t, you’ve found a critical bug.
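The refund example can be sketched as a boundary-value test. The agent here is a stand-in function that maps an amount to a decision string. `good_agent`, `buggy_agent`, and the decision labels are hypothetical, and a real test would call your agent with a full refund request.

```python
REFUND_LIMIT = 10_000  # from the business rule: don't approve refunds over $10,000


def violates_refund_rule(amount, decision):
    """True if the decision breaks the refund rule for this amount."""
    if amount > REFUND_LIMIT:
        return decision == "approve"  # over the limit: must reject or escalate
    return decision != "approve"      # at or under the limit: should approve


def find_violations(agent, amounts):
    """Run the agent on each amount and return the amounts it handled wrongly."""
    return [amount for amount in amounts if violates_refund_rule(amount, agent(amount))]


def good_agent(amount):
    """Hypothetical agent that respects the limit."""
    return "approve" if amount <= REFUND_LIMIT else "escalate"


def buggy_agent(amount):
    """Hypothetical agent that ignores the limit entirely."""
    return "approve"


# Test values on both sides of the limit, including the boundary itself.
BOUNDARY_CASES = [50.0, 9_999.99, 10_000.0, 10_000.01, 50_000.0]

print(find_violations(good_agent, BOUNDARY_CASES))
print(find_violations(buggy_agent, BOUNDARY_CASES))
```

Testing just above and just below the limit matters more than testing large values, because off-by-one comparisons are where rules like this tend to break.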
Constraint satisfaction is often the difference between a prototype and a production system. LLM-as-judge evaluation rarely measures constraints explicitly, which is why agents can look good in evaluation but fail in production.
Recovery and Error Handling
In production, things go wrong. Tools fail. APIs return errors. Networks time out. Your agent needs to handle these failures gracefully.
Recovery evaluation tests how your agent behaves when things break. You inject failures into the agent’s environment—make a tool call fail, return malformed data, introduce latency—and measure whether the agent:
- Detects the failure
- Attempts reasonable recovery strategies
- Escalates appropriately when recovery isn’t possible
- Doesn’t get stuck in infinite retry loops
- Provides useful information about what went wrong
You can implement recovery evaluation using tools like DeepEval’s LLM-as-a-judge framework, which includes patterns for testing error handling and resilience. You can also use chaos engineering approaches: intentionally break your agent’s dependencies and measure its behaviour.
Recovery evaluation is unglamorous but essential. It’s the difference between an agent that works 95% of the time and an agent that works 95% of the time and fails gracefully the other 5%.
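A simple way to inject failures is to wrap a tool so that it fails a set number of times before succeeding. In this sketch, `FlakyTool` is a hypothetical test helper and `run_with_retries` stands in for whatever retry logic your agent actually uses. The point is what gets measured: bounded attempts, and escalation with a useful error.

```python
class FlakyTool:
    """Test helper: wraps a tool and fails its first `failures` calls."""

    def __init__(self, fn, failures):
        self.fn = fn
        self.failures = failures
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("injected failure")
        return self.fn(*args, **kwargs)


def run_with_retries(tool, max_attempts=3):
    """Stand-in for the agent's retry behaviour (hypothetical)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": tool(), "attempts": attempt}
        except TimeoutError as exc:
            last_error = str(exc)
    # Recovery was not possible: escalate and say what went wrong.
    return {"status": "escalated", "error": last_error, "attempts": max_attempts}


# Transient failure: the tool recovers on the third call.
transient = FlakyTool(lambda: 42, failures=2)
recovered = run_with_retries(transient)

# Permanent failure: the tool never recovers.
permanent = FlakyTool(lambda: 42, failures=10**6)
escalated = run_with_retries(permanent)

print(recovered)
print(escalated, permanent.calls)
```

Counting calls on the wrapper is what catches retry loops: if the call count exceeds the agent’s retry budget, the test fails regardless of the final answer.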
Hallucination and Confabulation Detection
Hallucination—when an agent invents information or tool calls that don’t exist—is one of the most insidious failure modes. The agent sounds confident. It produces plausible-looking output. But it’s wrong.
Hallucination detection evaluation explicitly tests whether your agent invents tool calls, arguments, or responses. You can measure this by:
- Tool call invention: does the agent call tools that don’t exist or that aren’t available in its environment?
- Argument hallucination: does the agent pass arguments that don’t match the tool’s schema?
- Response fabrication: does the agent claim to have received information from tools that it never actually called?
You can implement hallucination detection using trace analysis (comparing the agent’s claimed actions against its actual trace) and by using LLM judges with rubrics written specifically to detect invented content. Galileo AI’s comprehensive guide to LLM-as-a-judge evaluation includes patterns for designing rubrics that catch hallucination.
Hallucination is particularly dangerous in agents that interact with external systems or that provide information to users. A hallucinating agent can damage trust, cause operational errors, or expose security vulnerabilities.
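Trace analysis for two of the checks above (tool call invention and response fabrication) can be sketched as set comparisons. The trace format, the tool names, and the idea that the final answer lists the tools it cites are assumptions for illustration. Real traces will need a parsing step first.

```python
# Hypothetical set of tools that exist in the agent's environment.
AVAILABLE_TOOLS = {"search_orders", "get_customer"}


def find_hallucinations(trace, claimed_sources):
    """Compare attempted tool calls and cited tools against what really happened.

    trace: list of {"tool": name} dicts, one per call the agent attempted.
    claimed_sources: tool names the final answer says its information came from.
    """
    called = [step["tool"] for step in trace]
    # Tool call invention: the agent tried to call a tool that does not exist.
    invented = sorted(t for t in set(called) if t not in AVAILABLE_TOOLS)
    # Response fabrication: the answer cites a tool that was never called.
    fabricated = sorted(t for t in set(claimed_sources) if t not in called)
    return {"invented_tools": invented, "fabricated_responses": fabricated}


trace = [{"tool": "search_orders"}, {"tool": "send_email"}]
claimed_sources = ["search_orders", "get_customer"]

report = find_hallucinations(trace, claimed_sources)
print(report)
```

Argument hallucination is not covered here. It is better handled by validating each call’s arguments against the tool’s schema.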
Outcome-Based Evaluation Patterns
Reference-based and behavioural evaluation tell you whether your agent works correctly. Outcome-based evaluation tells you whether it works for your business.
This is the pattern that most teams miss, and it’s why agents can pass all their tests and still fail in production.
Business Impact Metrics
The ultimate measure of an agent’s value is its impact on your business. Does it reduce operational costs? Does it accelerate time-to-decision? Does it improve customer satisfaction? Does it generate revenue?
Business impact evaluation measures these outcomes directly. You define metrics that matter to your business, run your agent against real or realistic scenarios, and measure the impact.
Examples include:
- Cost reduction: how much does the agent reduce operational costs? (e.g., “agent-assisted customer support reduces resolution time by 30%, cutting support costs from $50/ticket to $35/ticket”)
- Time-to-value: how much faster does the agent help users accomplish their goals? (e.g., “agent-assisted data analysis reduces time from query to insight from 2 hours to 15 minutes”)
- Revenue impact: does the agent help generate or protect revenue? (e.g., “agent-assisted lead scoring increases conversion rate from 12% to 15%, adding $500K annual revenue”)
- Quality improvements: does the agent improve output quality? (e.g., “agent-generated code passes 92% of test cases vs. 78% for baseline”)
- User satisfaction: do users prefer working with the agent? (e.g., “net promoter score increases from 45 to 62 after agent deployment”)
Business impact metrics are often harder to measure than output quality metrics, but they’re worth the effort. They force you to connect your agent’s performance to real value, and they give you a principled way to decide whether an agent is worth deploying.
User Acceptance and Satisfaction
Sometimes your agent can be technically correct but still fail because users don’t trust it or find it frustrating to use.
User acceptance evaluation measures whether real users find the agent helpful, trustworthy, and easy to work with. You can measure this using:
- Task completion rate: percentage of tasks users successfully complete with agent assistance
- User satisfaction scores: direct surveys asking users how satisfied they are with the agent
- Adoption rate: percentage of eligible users who actually use the agent
- Repeat usage: do users come back to the agent, or do they abandon it after first use?
- Support tickets: does agent usage increase or decrease support burden?
User acceptance evaluation often reveals failure modes that technical evaluation misses. An agent might be technically correct but produce outputs that are too verbose, too terse, or formatted in a way that users find confusing. User acceptance testing catches these issues before they cause adoption problems.
A/B Testing and Comparative Evaluation
The gold standard for outcome-based evaluation is A/B testing: deploy your agent to a subset of users and measure whether they have better outcomes than a control group.
A/B testing is expensive and slow, but it’s the most reliable way to measure real-world impact. You can run A/B tests on:
- Agent vs. baseline: does using the agent produce better outcomes than not using it?
- Agent version A vs. agent version B: which agent variant performs better?
- Agent-assisted vs. fully automated: for tasks where the agent can either assist humans or run fully automated, which approach delivers better outcomes?
For teams building production agents at scale, A/B testing becomes essential. You can’t rely on offline evaluation alone to predict production performance. Real users interact with your agent in ways you didn’t anticipate. They ask questions you didn’t test. They use the agent in contexts you didn’t consider.
A/B testing forces you to confront these realities early, before a bad agent causes widespread damage.
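When the outcome is a rate, such as conversion or task completion, a two-proportion z-test is one common way to check whether the difference between groups is larger than chance would explain. The sketch below uses only the standard library. The sample sizes are hypothetical, chosen to mirror a 12% versus 15% conversion rate.

```python
from math import erfc, sqrt


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical pilot: 1,000 users per group.
# Control converts 120 (12%), agent-assisted converts 150 (15%).
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these numbers the result sits close to the conventional 0.05 cut-off, which is a useful reminder that a three-point lift can need well over a thousand users per group before it is distinguishable from noise.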
Building a Balanced Evaluation Framework
Now that you understand the three evaluation patterns, the question is: how do you combine them into a coherent framework?
Here’s the approach we recommend at PADISO when helping teams build production-grade agents:
Define Your Evaluation Pyramid
Think of evaluation as a pyramid with three levels:
- Foundation (Reference-Based): These are your regression tests. They’re fast, deterministic, and cheap to run. You run them continuously, ideally on every commit. They catch obvious breakage.
- Middle (Behavioural): These are your robustness tests. They’re slower and more complex than reference-based tests, but they catch subtle bugs that reference-based tests miss. You run them regularly (daily or weekly), but not on every commit.
- Peak (Outcome-Based): These are your business validation tests. They’re slow, expensive, and often require human involvement. You run them before major deployments or when you’re considering significant changes to the agent.
The pyramid structure ensures that you catch bugs early and cheaply (at the foundation) before they reach expensive testing (at the peak).
Build Test Cases Systematically
For each evaluation pattern, you need test cases. Here’s how to build them:
Reference-based test cases: Start with happy paths. What are the most common, straightforward scenarios your agent needs to handle? Build reference trajectories for these first. Then expand to edge cases: what are the tricky scenarios that your agent might get wrong? Build references for these too.
Behavioural test cases: Think about constraints and failure modes. What could go wrong? Build test cases that try to trigger these failures. For constraint satisfaction, enumerate your constraints and build a test case for each one. For recovery, intentionally break things and measure the agent’s response.
Outcome-based test cases: Start with a small pilot. Deploy your agent to a subset of users or scenarios. Measure real outcomes. Use these measurements to inform larger deployments.
Automate Evaluation Where Possible
Manual evaluation doesn’t scale. You need automation.
Tools like LangSmith provide trace-based evaluation: they automatically capture your agent’s execution traces and score them against your evaluation criteria. Tools like promptfoo let you define evaluation criteria as code and run evaluations across your entire test suite.
The key is to make evaluation a first-class part of your development process. Just like you wouldn’t ship code without running tests, you shouldn’t ship agent changes without running evaluation.
Set Quality Thresholds
Once you have evaluation metrics, you need thresholds. What score is “good enough” to deploy?
Thresholds depend on your use case:
- High-stakes decisions (e.g., approving loans, diagnosing diseases): you might want 99%+ accuracy on all evaluation patterns before deployment
- Moderate-stakes decisions (e.g., routing support tickets, summarising documents): you might want 95%+ accuracy
- Low-stakes decisions (e.g., generating suggestions, drafting content): you might accept 85-90% accuracy
Set thresholds explicitly. Document them. Make them part of your deployment checklist. This forces you to have honest conversations about what “good enough” means for your agent.
Common Pitfalls and How to Avoid Them
We’ve seen teams make the same mistakes repeatedly when building agent evaluation frameworks. Here’s what to watch out for:
Pitfall 1: Over-Relying on LLM-as-Judge
LLM judges are useful, but they’re not sufficient on their own. They excel at evaluating output quality (coherence, relevance, factuality) but they’re poor at evaluating:
- Whether tool calls are correct
- Whether constraints are satisfied
- Whether the agent recovers from failures
- Whether the agent delivers business value
How to avoid it: Use LLM judges as one component of a larger evaluation framework, not as your only evaluation method. Combine them with reference-based, behavioural, and outcome-based evaluation.
Pitfall 2: Evaluating in Isolation
Many teams evaluate their agents against static test sets in isolation from the rest of their system. They don’t test how the agent interacts with real APIs, databases, or external systems.
This is a recipe for production failures. An agent might pass all its isolated tests but fail when it interacts with a real API that has different latency, error rates, or response formats than the mock.
How to avoid it: Evaluate your agent in an environment that’s as close to production as possible. Use real APIs (or high-fidelity mocks). Test against production-like data volumes and latency profiles. If possible, run evaluation against a staging environment that mirrors production.
Pitfall 3: Static Test Sets
If you evaluate your agent against a fixed test set, you’ll eventually overfit to that test set. As you tune prompts, tools, and models against the same cases, the agent ends up handling the specific scenarios in your tests but failing on new ones.
How to avoid it: Continuously expand your test set. Every time your agent fails in production, add that failure case to your test set. Use adversarial testing to generate new test cases that might break your agent. Periodically review your test set and add new scenarios that reflect how your agent is actually being used.
Pitfall 4: Ignoring Variance and Uncertainty
Many evaluation metrics are noisy. An agent might score 85% on one run and 87% on another, just due to randomness. If you treat small differences in scores as meaningful, you’ll make poor deployment decisions.
How to avoid it: Run multiple evaluations and report confidence intervals, not just point estimates. Use statistical tests to determine whether a difference in scores is meaningful. Confident AI’s detailed evaluation guide includes practical approaches to managing variance in LLM-based evaluation.
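One lightweight way to get a confidence interval from per-test-case results is a percentile bootstrap. This sketch uses only the standard library. The function name, the number of resamples, and the sample data are assumptions for illustration.

```python
import random


def bootstrap_ci(results, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-case scores.

    results: one score per test case (for example 1 for pass, 0 for fail).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(results)
    # Resample the test cases with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(results, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical run: 85 passes out of 100 test cases.
results = [1] * 85 + [0] * 15
lower, upper = bootstrap_ci(results)
print(f"pass rate 0.85, 95% interval roughly {lower:.2f} to {upper:.2f}")
```

If the intervals for two agent versions overlap heavily, a one or two point difference in their scores is not a sound basis for a deployment decision.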
Pitfall 5: Evaluating Without Context
An agent’s quality depends on context. The same agent might be excellent for one use case and terrible for another. The same evaluation metric might be meaningful for one agent and irrelevant for another.
How to avoid it: Always define evaluation criteria in the context of your specific use case. What matters for a customer support agent is different from what matters for a code generation agent. Make sure your evaluation criteria reflect your actual business needs.
Integrating Evaluation Into Your Deployment Pipeline
Evaluation is only valuable if it actually influences your deployment decisions. Here’s how to integrate it into your workflow:
Continuous Evaluation
Set up continuous evaluation that runs automatically:
- On every commit: Run fast reference-based tests. Block deployment if these fail.
- Daily: Run behavioural tests. Alert the team if scores drop below threshold.
- Before major releases: Run outcome-based evaluation. Require manual sign-off before deployment.
MLflow’s guide to the top agent evaluation frameworks surveys tools that provide infrastructure for continuous evaluation. You can integrate them into your CI/CD pipeline so that evaluation runs automatically as part of your deployment process.
Evaluation Dashboards
Make evaluation results visible. Create dashboards that show:
- Current evaluation scores across all metrics
- Trends over time (are scores improving or degrading?)
- Comparison between agent versions
- Breakdown by test category (reference-based, behavioural, outcome-based)
- Failure analysis (which test cases are failing? why?)
Visibility drives accountability. When everyone can see evaluation results, teams are more likely to take evaluation seriously.
Evaluation as a Gate
Make evaluation a hard gate in your deployment process. Don’t allow deployment unless:
- Reference-based evaluation passes 100% of tests
- Behavioural evaluation scores above your threshold
- Outcome-based evaluation (if applicable) shows positive impact
This might sound strict, but it’s the only way to prevent bad agents from reaching production. The cost of one bad deployment far exceeds the cost of running comprehensive evaluation.
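The gate itself can be a small function that your CI job calls with the latest evaluation results. The function name, parameters, and the way results are summarised are hypothetical. The sketch treats a behavioural score equal to the threshold as passing, which is a choice you may want to make stricter.

```python
def deployment_gate(reference_pass_rate, behavioural_score,
                    behavioural_threshold, outcome_delta=None):
    """Return (allowed, reasons). Deployment is allowed only if reasons is empty.

    outcome_delta: measured change in the business metric versus baseline,
    or None when outcome-based evaluation does not apply to this release.
    """
    reasons = []
    if reference_pass_rate < 1.0:
        reasons.append(f"reference tests: {reference_pass_rate:.0%} passed, need 100%")
    if behavioural_score < behavioural_threshold:
        reasons.append(
            f"behavioural score {behavioural_score:.2f} is below "
            f"threshold {behavioural_threshold:.2f}"
        )
    if outcome_delta is not None and outcome_delta <= 0:
        reasons.append("outcome-based evaluation shows no positive impact")
    return (not reasons, reasons)


# Hypothetical results for two candidate releases.
print(deployment_gate(1.0, 0.97, 0.95, outcome_delta=0.03))
print(deployment_gate(0.98, 0.91, 0.95))
```

In a CI pipeline, the calling script would exit with a non-zero status when the first element is False, and print the reasons so the failure is actionable.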
Post-Deployment Monitoring
Evaluation doesn’t stop at deployment. In production, your agent will encounter scenarios it never saw during testing. You need to monitor its performance and catch regressions early.
Set up monitoring that tracks:
- User satisfaction: are users happy with the agent’s outputs?
- Task completion rates: are users successfully completing their goals?
- Error rates: how often does the agent fail or produce invalid outputs?
- Tool call correctness: are the agent’s tool calls valid and appropriate?
- Business metrics: is the agent delivering the expected business impact?
When metrics degrade, trigger automated alerts. Investigate failures. Update your test suite to catch similar failures in the future. This creates a virtuous cycle where each production failure makes your evaluation framework stronger.
Practical Next Steps
If you’re building a production AI agent, here’s how to get started with comprehensive evaluation:
Weeks 1-2: Reference-Based Evaluation
- Define 10-15 reference scenarios that cover the happy path for your agent
- For each scenario, document the expected trajectory (sequence of tool calls and responses)
- Implement reference-based evaluation using a tool like LangSmith or promptfoo
- Run reference-based evaluation on your current agent. Document which scenarios pass and which fail
- Fix failures or update references if the agent’s behaviour is actually correct
Weeks 3-4: Behavioural Evaluation
- Enumerate the constraints your agent must satisfy
- Build test cases for each constraint (both cases that should pass and cases that should fail)
- Implement constraint satisfaction evaluation
- Build test cases for error handling and recovery
- Run behavioural evaluation. Document failure modes
- Prioritise fixes based on severity (high-severity failures first)
Week 5+: Outcome-Based Evaluation
- Define the business metrics that matter for your agent
- If possible, run a small pilot with real users
- Measure baseline performance (how well do users perform without the agent?)
- Measure agent-assisted performance (how well do users perform with the agent?)
- Calculate business impact
- Use these results to inform decisions about broader deployment
Ongoing: Continuous Improvement
- Every time your agent fails in production, add that failure case to your test suite
- Review evaluation results weekly. Look for trends and patterns
- Update evaluation criteria as your use case evolves
- Run A/B tests to validate that evaluation improvements translate to real-world impact
Connecting to PADISO’s Services
If you’re building a production AI agent and want expert guidance on evaluation frameworks, architecture, and deployment, PADISO can help. Our AI & Agents Automation service covers agent design, implementation, and evaluation. We’ve helped teams at seed-stage startups and mid-market enterprises build agents that actually work in production.
We also offer AI Strategy & Readiness assessments that include evaluation framework design. And for teams that need hands-on support, our CTO as a Service programme provides fractional technical leadership for AI projects.
If you’re in Sydney or Australia more broadly, you can book a call with our team at PADISO’s Sydney office. We work with founders, operators, and engineering teams building the next generation of AI-powered products.
For a quick assessment of where you stand with agent evaluation, consider our AI Quickstart Audit—a fixed-fee, two-week diagnostic that tells you where you actually are with your AI implementation, what to build first, and what 90 days could unlock.
Summary
LLM-as-judge evaluation is a useful tool, but it’s not sufficient for production AI agents. You need a balanced evaluation framework that combines:
- Reference-based evaluation: catches regressions and obvious bugs
- Behavioural evaluation: ensures your agent is robust, consistent, and respects constraints
- Outcome-based evaluation: validates that your agent delivers real business value
Build these evaluation patterns into your development process from the start. Automate evaluation where possible. Set quality thresholds and make evaluation a hard gate before deployment. Monitor agent performance in production and use failures to improve your evaluation framework.
This approach requires more upfront investment than LLM-as-judge alone, but it’s the difference between agents that work 95% of the time and agents that work reliably in production. For mission-critical agents, that difference can decide whether the deployment pays for itself.
Start with reference-based evaluation this week. Add behavioural evaluation in weeks three and four. Run your first outcome-based pilot from week five. Build from there. The teams that get evaluation right are the ones shipping agents that users actually trust.