Logan Kelly
Uptime checks and error rates won't catch AI agent failures. Here's what production health monitoring for agents actually requires — and why your current stack is flying blind.

Here is a scenario AI monitoring researchers documented in early 2026: an agent spends three months learning that database utilization drops 40% on weekends. Then, on a month-end processing weekend, it applies that lesson and autonomously scales down the production cluster. The APM shows green the whole time. The agent is running, responding, returning 200s. It is also wrong (the production database is degraded), and the problem takes hours to diagnose because every system that was supposed to catch it says everything is fine.
This is the canonical AI agent monitoring failure: not a crash, not a timeout, not an error rate spike. A confident, technically successful execution of the wrong thing.
Standard APM was built for deterministic systems — where the same input reliably produces the same output, where "healthy" means "running," and where failure looks like a non-200 response. AI agents break all three assumptions. An agent can be running, responding correctly at the network layer, and completely failing the user's intent — and your monitoring infrastructure has no visibility into any of it.
AI agent health monitoring is the practice of instrumenting and alerting on behavioral metrics — goal completion rate, tool call success rate by individual tool, cost-per-task deviation, session retry depth, and behavioral drift — that reveal whether an agent is working, not just whether it is running. It is distinct from infrastructure monitoring (which detects crashes and latency spikes) and from AI observability (which records execution traces after the fact). Health monitoring closes the gap between "the agent is up" and "the agent is doing what it's supposed to do." Most teams operating production agents have the first. Very few have the second.
Why do AI agents fail silently in production?
Infrastructure monitoring catches infrastructure failures: the process crashed, the API timed out, memory exhausted. For web services and APIs, this covers most failure modes. If the service is up and responding under 200ms, it's healthy.
AI agents have a failure surface that infrastructure monitoring can't reach.
Behavioral failure. An agent can return a valid, well-formed response that is wrong. There's no exception, the request completes with a 200, and nothing in your error monitoring triggers. The agent hallucinated a customer name, misread a date, or applied a learned pattern at exactly the wrong moment. Error monitoring catches exceptions. It has no concept of "this output is incorrect."
Silent tool call failure. Tool calls fail in ways invisible to surface-level monitoring. An API returns a successful response with stale data. A schema changed three weeks ago and the agent has been silently misreading field names ever since. Authentication credentials rotated and the agent is now working against a cached session that returns partial results. All of these register as 200s. None register as errors.
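A minimal response validator makes this failure mode concrete: check the shape and freshness of what came back, not just the status code. The tool name, field set, and staleness cutoff below are illustrative assumptions, not a real integration:

```python
# Sketch: flag "successful" tool responses that no longer look right.
# EXPECTED_FIELDS and the cutoff date are hypothetical examples.
EXPECTED_FIELDS = {"customer_id", "name", "updated_at"}

def check_tool_response(tool_name: str, payload: dict,
                        stale_before: str = "2026-01-01") -> list[str]:
    """Return a list of anomalies; an empty list means the response looks sane."""
    anomalies = []
    missing = EXPECTED_FIELDS - payload.keys()
    if missing:
        # A renamed or dropped field is how a silent schema change shows up.
        anomalies.append(f"{tool_name}: missing fields {sorted(missing)}")
    if "updated_at" in payload and payload["updated_at"] < stale_before:
        # Stale data rides in on a 200; compare freshness, not status codes.
        anomalies.append(f"{tool_name}: stale record ({payload['updated_at']})")
    return anomalies
```

Run on every tool response, this turns "the API returned 200" into "the API returned 200 and the payload still matches what the agent expects."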
Retry loops. An agent encountering a failure it can't resolve will retry. Without enforced limits, it retries until something external stops it: the session timeout or the token budget, whichever gives out first. OneUptime's March 2026 analysis of production agent failures documented one case where an agent retried a failed API call 847 times, accumulating $2,000 in token costs before anyone was paged — because every individual request succeeded at the HTTP level. Zero error alerts fired.
Behavioral drift. This is the slow failure. An agent's outputs shift gradually over sessions due to model updates, prompt injection accumulating in memory, or distribution shift in input data. No single session looks wrong. The aggregate trend is a problem that only becomes visible if you're tracking behavioral metrics over time. Uptime monitoring cannot surface it.
The uncomfortable implication: the monitoring stack most teams have for their agents tells them almost nothing about whether those agents are working.
What metrics actually tell you an agent is healthy?
Your APM gives you uptime, HTTP error rate, P50/P95 latency, and resource utilization. These are worth tracking — but they're necessary, not sufficient. An agent can score perfectly on all of them while failing behaviorally.
The metrics that actually indicate agent health are different.
Goal completion rate. Did the agent accomplish what it was asked to do? This requires defining what "done" means for each task type and instrumenting the outcome, not just the response. Goal completion rate is the closest thing to a user-facing health metric that an agent has. A drop here is a real signal even when nothing else looks wrong.
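Instrumenting this starts with a per-task-type definition of "done" and a rolling record of outcomes. The task types and predicates below are hypothetical examples of what that definition could look like:

```python
from collections import deque

# Sketch: per-task-type "done" predicates plus a rolling completion rate.
# The task types and predicate logic here are illustrative assumptions.
DONE_CHECKS = {
    "support_ticket": lambda out: out.get("ticket_status") == "resolved",
    "data_export":    lambda out: out.get("rows_written", 0) > 0,
}

class GoalTracker:
    def __init__(self, window: int = 200):
        self.outcomes = deque(maxlen=window)  # rolling window of booleans

    def record(self, task_type: str, output: dict) -> None:
        # Unknown task types count as not-done: an unmeasured task is not a success.
        check = DONE_CHECKS.get(task_type, lambda _: False)
        self.outcomes.append(bool(check(output)))

    def completion_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0
```

The predicate is the hard part: it encodes what "accomplished the task" means, which is exactly the judgment standard APM never asks you to make.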
Tool call success rate by tool. Aggregate tool success rate is a trailing indicator. Per-tool success rate tells you which integration is breaking. When the CRM connector's success rate drops from 99% to 87%, you know exactly where to look. When aggregate rate dips 2%, you're investigating everything.
Cost-per-task deviation. If your agent normally consumes 8,000 tokens to complete a support ticket and it's now consuming 24,000, something changed — input complexity, model behavior, or a looping condition. Cost-per-task as a rolling metric detects runaway behavior before it hits billing, which is too late.
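A rolling-baseline detector for this is small. The window size and the 3x factor below are illustrative thresholds, not recommendations:

```python
from collections import deque

class CostMonitor:
    """Flag tasks whose token cost deviates sharply from a rolling baseline."""

    def __init__(self, window: int = 100, factor: float = 3.0):
        self.costs = deque(maxlen=window)  # recent per-task token costs
        self.factor = factor               # 3x is an illustrative threshold

    def observe(self, tokens: int) -> bool:
        """Record a completed task's token cost; return True if it is anomalous."""
        baseline = sum(self.costs) / len(self.costs) if self.costs else None
        self.costs.append(tokens)
        return baseline is not None and tokens > self.factor * baseline
```

Because the baseline is rolling, gradual cost growth raises the baseline with it; a sudden looping session stands out immediately.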
Session retry depth. How many attempts does the agent make before completing or failing? An agent that normally resolves tasks in one or two steps and is now averaging five is signaling a problem, even if each individual step succeeds.
Behavioral consistency score. For agents doing similar tasks repeatedly, output distribution should be stable. Tracking whether outputs are shifting in ways that correlate with changing inputs — versus drifting independently — is early warning for model updates and prompt injection effects that no infrastructure metric will surface.
None of these come from standard APM. They require instrumenting the full execution graph — every tool call, every step, every cost increment — and computing behavioral metrics over sessions and rolling time windows, not just individual requests.
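As a sketch of what that computation looks like, the per-tool and per-session metrics above can be derived from a flat log of execution events. The event schema here is an assumption about your telemetry, not a spec:

```python
from collections import defaultdict

# Sketch: derive per-tool success rate and session retry depth from a flat
# list of execution events. The "kind"/"tool"/"ok" fields are assumptions.
def summarize(events: list[dict]) -> dict:
    tool_calls = defaultdict(lambda: [0, 0])   # tool -> [successes, total]
    retries = defaultdict(int)                 # session_id -> retry count
    for e in events:
        if e["kind"] == "tool_call":
            stats = tool_calls[e["tool"]]
            stats[1] += 1
            stats[0] += e["ok"]                # bools add as 0/1
        elif e["kind"] == "retry":
            retries[e["session_id"]] += 1
    return {
        "tool_success": {t: s / n for t, (s, n) in tool_calls.items()},
        "max_retry_depth": max(retries.values(), default=0),
    }
```

The point is structural: these metrics fall out of a complete execution log in a few lines, but only if every step was captured in the first place.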
What should your on-call runbook actually say?
The 3 AM call for a web service is usually clear: something crashed, find the bad deploy. The 3 AM call for an AI agent is different, because the system can be up while the agent is failing.
Your on-call runbook for AI agents needs to answer questions your web service runbook never had to address.
Is the agent running, or is the agent working? Separate infrastructure health from behavioral health immediately. If the infrastructure is healthy but behavioral metrics are degraded, the investigation path is completely different — and faster to close when you know which path you're on.
What changed? Behavioral degradation has three common causes: a model update (did the underlying model update without announcement?), a tool-layer change (check authentication status and API response schemas for every tool the agent touches), or input distribution shift (is the character of today's requests different from baseline?). Your runbook should have a specific check sequence for each.
What's the blast radius? Unlike a crashed service, a misbehaving agent may have already written to production systems — databases, external APIs, downstream workflows — during the degraded period. Before you fix the agent, assess what it may have done while wrong.
What triggers a page vs. what goes to the queue? Pages should fire when goal completion rate drops below threshold, when cost-per-task exceeds 3× the rolling baseline, when a critical tool's success rate drops below its floor, or when any active session exceeds retry depth limits. These are active, compounding problems. Gradual behavioral drift under threshold, non-critical tool degradation trending slowly — those belong in the queue, not the pager.
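The page-vs-queue split above can be written down as declarative rules rather than buried in dashboard config. The metric names and the specific floors (0.90, 0.95, a retry depth of 5) are placeholder assumptions you would replace with your own:

```python
# Sketch: paging conditions as declarative rules. Metric names and the
# specific thresholds are illustrative placeholders.
PAGE_RULES = [
    ("goal_completion_rate",  lambda v, base: v < 0.90),
    ("cost_per_task",         lambda v, base: v > 3 * base),
    ("critical_tool_success", lambda v, base: v < 0.95),
    ("session_retry_depth",   lambda v, base: v > 5),
]

def evaluate(metrics: dict, baselines: dict) -> list[str]:
    """Return the names of rules that should page right now."""
    return [name for name, breached in PAGE_RULES
            if breached(metrics.get(name, 0), baselines.get(name, 0))]
```

Anything not in this list routes to the queue by default, which keeps the pager reserved for active, compounding problems.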
Most teams don't have this runbook. They have a web service runbook applied to agents, which means the first time an agent behaves badly without crashing, the on-call rotation has no protocol for it.
How Waxell handles this
The foundation of production agent health monitoring is complete execution tracing — not just LLM call logging, but every step the agent takes. Waxell Observe instruments agents across any framework with execution tracing that makes behavioral health metrics computable: every tool call, every external request, every token cost, every session captured in one data model. Production telemetry surfaces those behavioral metrics in real time — cost-per-task, tool success rates by individual tool, session depth — the signals your APM can't produce.
On top of observability, Waxell's governance plane adds operational circuit breakers that function as proactive health enforcement: a cost policy terminates a runaway session before it burns thousands in tokens; a retry-depth policy stops the agent before its eight-hundredth failed call; an operational policy triggers human escalation when goal completion falls below threshold. Your APM tells you the agent is up. Waxell's policies enforce the conditions under which it's allowed to keep running.
If you want to see what behavioral agent health monitoring looks like in practice, get early access.
Frequently Asked Questions
What metrics should I use to monitor AI agents in production?
The core behavioral health metrics for production AI agents are: goal completion rate (did the agent accomplish what it was asked?), tool call success rate by individual tool, cost-per-task over a rolling window, session retry depth, and behavioral consistency over time. These complement infrastructure metrics like latency and error rate but are more diagnostic for agent-specific failures. Most agent failures show up in behavioral metrics first — sometimes days before anything appears in error rate.
Why doesn't standard APM work for AI agent monitoring?
APM was built for deterministic systems where failure means an exception or a non-200 response. AI agents fail behaviorally: an agent can return HTTP 200 with a confidently wrong output, complete a tool call against stale data, or apply a learned pattern at exactly the wrong moment — none of which trigger error monitoring. APM tells you the agent is running. It cannot tell you whether the agent is working.
What does an AI agent health check look like?
A production AI agent health check should verify: that the agent is reachable (infrastructure layer), that recent goal completion rate is above threshold (behavioral layer), that critical tool success rates haven't degraded (integration layer), that cost-per-task is within normal range (cost layer), and that no active session has exceeded retry depth limits (operational layer). The first check is what most teams have. The rest require instrumenting the full execution graph and computing metrics over sessions.
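As a sketch under assumed metric names, the five layers can be checked in one pass; the thresholds here are illustrative, not prescriptive:

```python
# Sketch of the five-layer health check described above. The telemetry
# field names and all thresholds are hypothetical assumptions.
def agent_health(telemetry: dict) -> dict:
    """Return pass/fail per layer; any False names the layer to investigate."""
    return {
        "infrastructure": bool(telemetry["reachable"]),
        "behavioral":     telemetry["goal_completion_rate"] >= 0.90,
        "integration":    all(r >= 0.95 for r in telemetry["tool_success"].values()),
        "cost":           telemetry["cost_per_task"] <= 3 * telemetry["cost_baseline"],
        "operational":    telemetry["max_retry_depth"] <= 5,
    }
```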
How do I detect behavioral drift in a production AI agent?
Behavioral drift requires tracking output distribution over time — not individual request quality, but whether the pattern of outputs across sessions is shifting. Practical approaches: measure semantic similarity between outputs for similar inputs over rolling windows, track task complexity versus token consumption ratios over time, and monitor per-tool success rates for gradual degradation. Single-request evaluation misses drift entirely.
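A minimal version of the rolling-similarity approach can be sketched with a toy bag-of-words cosine score standing in for a real embedding model, which production use would swap in:

```python
import math
from collections import Counter, deque

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class DriftDetector:
    """Rolling similarity of new outputs against a baseline set of outputs.

    Bag-of-words cosine is a toy stand-in for a real embedding model."""

    def __init__(self, baseline_outputs: list[str], window: int = 50):
        self.baseline = Counter(" ".join(baseline_outputs).split())
        self.scores = deque(maxlen=window)

    def observe(self, output: str) -> float:
        self.scores.append(_cosine(Counter(output.split()), self.baseline))
        return sum(self.scores) / len(self.scores)  # rolling mean similarity
```

A falling rolling mean is the drift signal: no single output is flagged, but the aggregate is moving away from the baseline distribution.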
What should trigger an on-call alert for an AI agent?
Page when goal completion rate drops below a defined threshold, when cost-per-task exceeds 3× the rolling baseline, when a critical tool's success rate drops below its floor, or when any active session exceeds retry depth limits. These are conditions where something is wrong now and impact may be compounding. Gradual drift signals — cost trending up over days, non-critical tool degradation — belong in a queue, not a page.
Sources
OneUptime, Monitoring AI Agents in Production: The Observability Gap Nobody's Talking About (March 2026) — https://oneuptime.com/blog/post/2026-03-14-monitoring-ai-agents-in-production/view
OneUptime, Your AI Agents Are Running Blind (March 2026) — https://oneuptime.com/blog/post/2026-03-09-ai-agents-observability-crisis/view
Braintrust, AI observability tools: A buyer's guide to monitoring AI agents in production (2026) — https://www.braintrust.dev/articles/best-ai-observability-tools-2026
UptimeRobot, AI Agent Monitoring: Best Practices, Tools & Metrics for 2026 — https://uptimerobot.com/knowledge-hub/monitoring/ai-agent-monitoring-best-practices-tools-and-metrics/
Zylos Research, Process Supervision and Health Monitoring for Long-Running AI Agents (February 2026) — https://zylos.ai/research/2026-02-20-process-supervision-health-monitoring-ai-agents
Agentic Governance, Explained