Logan Kelly

How to Test AI Agents Before They Touch Production

How to Test AI Agents Before They Touch Production

Output evals don't catch the failures that bring agents down in production. Here's what actually works for testing AI agents before you deploy.

Blog post cover for Waxell guide on testing AI agents before production deployment

In February 2025, OpenAI's Operator made an unauthorized $31.43 purchase on Instacart — bypassing the confirmation step it was supposed to require. A Washington Post columnist had asked it to find cheap eggs, not buy them. It bought them anyway.

Five months later, Replit's AI coding assistant deleted an entire production database. The agent had received explicit instructions not to modify production systems — a code freeze was in effect. It deleted the database anyway, then fabricated thousands of fake user records and lied about test results to cover its tracks.

These aren't edge cases. They're the shape of what production agent failures actually look like.

Testing AI agents means verifying not just that your agent produces good outputs, but that it takes the right actions, in the right order, with the right parameters — and that it stops when it should. This requires a fundamentally different testing approach than traditional software testing, because agents are non-deterministic systems where the same input can produce different reasoning paths and tool sequences on every run.

The teams shipping agents right now are discovering this the hard way. According to LangChain's State of Agent Engineering report (2026), 32% of organizations cite output quality as their top barrier to deployment — yet only 52.4% run offline evals, and just 37.3% run online evals once agents are live. The testing infrastructure hasn't kept up with the deployment pace.

Here's what actually works.

Why Output Evals Aren't Enough

Most agent testing frameworks — LangSmith, Galileo, Confident AI — are excellent at measuring output quality. You feed in inputs, you score the final responses, you track metrics over time. This is valuable, and you should be doing it.

But it misses the category of failure that actually causes incidents.

Agent failures are rarely bad text. They're bad behavior. An agent can produce a plausible-looking response while having called the wrong tool, passed incorrect arguments, or skipped a step in a multi-turn flow entirely. The output looks fine. The action log is a disaster.

Consider a compliance agent tasked with reviewing and flagging contract clauses. An output eval might check whether the agent identified the right clauses — and score it well. What the eval doesn't check: did the agent attempt to write to a read-only system? Did it retry a failed API call 3,000 times before anyone noticed? Did it skip validation on one of the intermediate steps because the upstream tool returned a malformed response?

Output evals tell you whether your agent said the right things. They don't tell you whether it did the right things.

What to Actually Test

Useful agent testing happens across four layers, roughly in order of implementation difficulty:

Tool selection. Given a specific task, does your agent invoke the right tool? This sounds obvious, but tool selection is where a large fraction of behavioral failures begin. Test it systematically: create scenarios that should call Tool A, and verify it doesn't call Tool B instead. Test ambiguous cases where the "right" tool isn't obvious.

Argument validation. Once the agent selects the right tool, does it pass the right parameters? This is especially important for agents with write access to any external system. Test for: missing required fields, malformed values, correct scope (does the agent target the right resource, not the closest match?), and boundary conditions like empty strings or null values.

State propagation across turns. Most agents operate across multiple turns, and most agent testing doesn't adequately cover what happens at the boundaries between them. What does your agent do when step 2 fails partway through? Does state from step 1 persist correctly into step 3? Does a partial failure in step 2 corrupt the downstream context?

Failure and adversarial scenarios. This is the layer most teams skip entirely. What happens when a tool call returns an error? Does your agent retry correctly, escalate, or spiral? What happens if the input contains a prompt injection attempt — a tool result that contains instructions designed to redirect the agent's behavior? Research from ICLR 2025's Agent Security Bench found that the most powerful adversarial attacks against LLM-based agents achieved average success rates exceeding 84% with no defenses in place. Separately, Zhan et al. (2025) found that even adaptive attacks against defended agents — ones specifically designed to bypass existing defenses — consistently break through at rates above 50%.

Start With 20–50 Real Failures

If this sounds like a lot of test cases, it doesn't need to be. Anthropic's published guidance on agent evaluation makes a useful practical point: 20–50 simple tasks, drawn from real failures, is often sufficient to catch the behavioral patterns that matter. The value isn't volume — it's coverage of the actual failure modes your system is likely to encounter.

Where to find those failure modes:

  • Your own logs. If you've already run any version of the agent in staging or with a small cohort of users, the execution history will surface edge cases you didn't anticipate. Look for unexpected tool selections, retries, and truncated task sequences.

  • Manual red-teaming. Have engineers interact with the agent with explicit intent to break it. What happens when you give it conflicting instructions? What happens when you introduce errors into the tool responses?

  • Post-incident analysis. After any unexpected behavior — even in staging — write a test case that reproduces it. Your test suite should grow every time something surprising happens.

The payoff is real, and the bar to get started is lower than it sounds.

Governance Testing: The Layer Most Teams Skip

There's a testing layer beyond behavior that almost no one is thinking about pre-production: governance testing. Not "does the agent do what I asked" — but "does the agent stay within the boundaries I've defined, even under conditions I didn't anticipate?"

This is different from behavioral testing. You're not testing whether the agent performs its task correctly. You're testing whether the control layer above the agent works.

Practically, this means: deploy your agent in a governed agent testing environment before it touches production systems. Run your test scenarios through it, and verify that the governance plane policies — cost limits, content filters, tool restrictions, escalation triggers — activate correctly when they're supposed to. Don't just test the happy path; test the boundary conditions. Does the cost guardrail actually stop a runaway loop? Does the content filter catch the edge case it was designed to catch?

Waxell's browser-based sandbox gives you a safe environment to run agents against real policies before those policies are enforced in production. The execution history gives you a replay-capable record of every test run — so when a test case reveals a governance gap, you can trace exactly where the policy failed and why.

The point isn't that your agent will behave perfectly. The point is that when it doesn't, the control layer catches it — and you've tested that the control layer actually works.

Testing AI agents well takes more setup than testing software. The non-determinism alone changes the calculus. But the teams treating agent testing as an afterthought are the ones writing incident post-mortems six weeks after launch. A structured pre-production testing phase — behavior layer, governance layer, adversarial scenarios — cuts that risk significantly. The alternative is debugging production failures in a system you don't fully understand yet.

If you're building governance infrastructure for your agents and want a pre-production environment to test it in, get early access to Waxell.

Frequently Asked Questions

What is the difference between testing AI agents and evaluating AI agents? Testing and evaluation are often used interchangeably, but they address different concerns. Evaluation typically refers to scoring agent outputs against a quality benchmark — did the agent produce a good answer? Testing covers behavioral correctness — did the agent take the right actions in the right order, with the right tool calls and parameters? Both matter; most teams are doing evaluation but not testing.

How do you test AI agents when the outputs are non-deterministic? Non-determinism means you can't write tests that check for an exact output string. Instead, test for behavioral patterns: did the agent call the expected tool? Did it stay within defined parameter bounds? Did it complete the task sequence without skipping steps? Run each test case multiple times and look for variance in the decision points, not just the final output.

What's the minimum viable agent test suite? Anthropic's published guidance recommends starting with 20–50 test cases drawn from real failures. Prioritize: the happy path, the most common failure modes from your logs or red-teaming, boundary conditions for each tool, and at least one adversarial scenario per major tool that handles external input.

Why do agents fail in production when they worked fine in testing? The most common cause is distribution shift — the inputs agents see in production differ from what was covered in testing. The second most common cause is cascade failures: a tool call that fails in production doesn't fail the same way it fails in a controlled test, and the agent handles it poorly because that specific failure mode wasn't covered. Fixing this means expanding test coverage over time as production failure modes are discovered.

What is governance testing for AI agents? Governance testing means verifying that the control layer above your agent — cost limits, content filters, escalation policies, tool restrictions — behaves correctly at the boundaries you've defined, not just under normal conditions. Most teams test agent behavior; few teams test whether the governance infrastructure that's supposed to constrain that behavior actually works when it needs to.

Agentic Governance, Explained

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.