Logan Kelly
Feb 13, 2026
Shipping an agent and running one are different disciplines. Here's what actually breaks in production, what operational questions you need to answer, and what reliability looks like for AI agents.

Shipping an agent is an act of optimism. Running it is an act of engineering discipline.
These are different skills. The skills that get you from idea to working demo — prompt engineering, tool design, context management, iteration speed — are not the same skills you need when the thing is live and real users are depending on it and something is going wrong at 11pm and you need to figure out what.
Most engineering teams have learned this the hard way with microservices. The same lesson is being learned right now, repeatedly, with agents. The architecture is different, but the core insight is the same: building something is not the same as operating something.
Here's what changes when you shift from "we shipped an agent" to "we're running an agent in production."
Running AI agents in production means operating them with the same engineering discipline you'd bring to any critical service: defined SLAs, cost controls enforced in real time, incident response procedures, and behavioral governance policies. The skills that get you to v1 — prompt engineering, tool design, iteration speed — are not the skills that keep the agent reliable, cost-controlled, and compliant under real-world load. Most teams discover this after their first production incident rather than before it. (See also: Why AI agent costs spiral → · What is agentic governance →)
What Breaks First
Not what breaks catastrophically — what breaks first, subtly, in the ways you don't catch until they've been broken for a while.
Latency. Your p95 latency looked fine in testing. In production, you have tail cases that didn't appear in your test set — long context windows, tool calls that take longer under load, retry sequences. Your p99 latency is significantly worse than your median, and users in that tail are having a bad experience. You find out from support tickets, not monitoring.
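The gap between median and tail is easy to see once you compute both. A toy sketch with made-up session latencies (the numbers and the nearest-rank percentile helper are illustrative, not from any real deployment):

```python
import statistics

# Hypothetical session latencies (seconds). The median looks healthy;
# the tail does not -- and averages hide it.
latencies = [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 1.4, 1.3, 9.8, 14.2]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

p50 = statistics.median(latencies)   # ~1.35s: looks fine in a report
p99 = percentile(latencies, 99)      # 14.2s: the users filing tickets
```

The point of tracking both continuously is that the p99 regression shows up in your dashboard before it shows up in your support queue.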
Behavioral drift after upstream changes. The model provider releases a new version. A tool your agent depends on changes its response schema. Your system prompt is modified for a product reason and the downstream effects on agent behavior weren't fully mapped. None of these show up as errors. The agent still runs. Its behavior just shifts, in ways that may or may not be acceptable, in ways you may or may not detect quickly.
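One way to catch this class of silent shift is to score a fixed sample of outputs continuously and compare the window after any upstream change against a baseline window. A minimal sketch, assuming some external evaluation harness produces the scores and with the threshold chosen arbitrarily:

```python
def drift_detected(baseline_scores, current_scores, max_drop=0.05):
    """Flag when mean output quality after a change falls more than
    max_drop below the baseline window. Scores are assumed to come
    from an evaluation harness (rubric, task checks, LLM judge)."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    return (baseline - current) > max_drop

# No errors anywhere -- the agent still "works" -- but behavior shifted.
before_change = [0.91, 0.88, 0.93, 0.90]
after_change = [0.84, 0.79, 0.82, 0.80]
```

Mean-difference is the crudest possible comparison; the design point is that drift detection needs a scored baseline to compare against, which only exists if you built it before the upstream change happened.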
Context window edge cases. Your testing probably covered typical sessions. Production has long sessions, confused users who restart mid-conversation, inputs that contain unexpected content, tool responses that are much longer than anticipated. These edge cases stress your context management in ways testing didn't reveal.
Concurrent session behavior. Tested sequentially, your agent works great. Under concurrent load, you discover resource contention, rate limits on downstream tools, session isolation issues. Problems that only appear when multiple things are happening at once.
Cost variance. Your average case cost is well within budget. Your variance is not under control. Long sessions, retry chains, and aggressive tool use by some user segments are running up bills that your average-case estimates didn't capture.
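The mean hides this; the tail carries the spend. An illustration with invented per-session costs:

```python
import statistics

# Hypothetical per-session costs (USD). The average looks budgetable;
# two outlier sessions carry most of the actual bill.
costs = [0.04, 0.05, 0.03, 0.05, 0.04, 0.06, 0.05, 0.04, 1.80, 2.40]

mean_cost = statistics.mean(costs)        # what the estimate was built on
tail = sorted(costs, reverse=True)[:2]    # the top 20% of sessions
tail_share = sum(tail) / sum(costs)       # their share of total spend
```

Here two sessions out of ten account for over 90% of the spend, which is why per-session cost distributions, not averages, are what you need to monitor.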
The Operational Questions You Need to Be Able to Answer
The test of whether you're actually running an agent — versus just having deployed one — is whether you can answer the operational questions on demand. Not after a two-hour investigation. On demand.
Latency questions. What's my current median session latency? What's my p99? Which sessions today took the longest, and why — which steps were the bottleneck?
Reliability questions. What fraction of sessions in the last 24 hours completed successfully? What fraction hit an error? What were the most common failure modes?
Behavior questions. Did the agent's behavior change after the system prompt update yesterday? Are there query categories where the agent consistently produces outputs below quality threshold?
Cost questions. What's my average cost per session today vs. last week? Which sessions were in the top 1% of cost, and what made them expensive?
Data questions. What PII has entered agent context in the last 7 days, how was it handled, and where did it end up?
If answering any of these requires a manual investigation rather than a query, you have observability data. You don't have operational capability.
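"Answerable on demand" means, in practice, that each question reduces to a query over structured session records. A minimal sketch using SQLite with an invented `sessions` schema (a real deployment would also need per-step records to answer the "why" questions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sessions (
    id TEXT, day TEXT, latency_s REAL,
    cost_usd REAL, status TEXT, contains_pii INTEGER)""")
conn.executemany("INSERT INTO sessions VALUES (?,?,?,?,?,?)", [
    ("s1", "2026-02-13", 1.2, 0.04, "ok", 0),
    ("s2", "2026-02-13", 9.8, 1.90, "ok", 1),
    ("s3", "2026-02-13", 1.4, 0.05, "error", 0),
])

# Reliability: fraction of sessions that completed successfully.
ok_rate = conn.execute(
    "SELECT AVG(status = 'ok') FROM sessions").fetchone()[0]

# Cost: the most expensive session (explaining *why* it was expensive
# requires drilling into its step records, not modeled here).
worst = conn.execute(
    "SELECT id, cost_usd FROM sessions ORDER BY cost_usd DESC LIMIT 1"
).fetchone()

# Data: which sessions saw PII enter agent context.
pii_sessions = [r[0] for r in conn.execute(
    "SELECT id FROM sessions WHERE contains_pii = 1")]
```

The schema is the decision that matters: if sessions aren't logged as structured rows with latency, cost, status, and data-handling flags, none of these questions are queries, and every one of them becomes an investigation.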
Building an SLA for Your Agent
This is the question that triggers the most discomfort in teams that haven't thought about it: what does reliability mean for an AI agent?
For a traditional API, reliability is relatively straightforward. Uptime percentage. Error rate. Latency percentiles. Defined, measurable, contractable.
For an agent, reliability is more complex because it has a behavioral dimension. An agent that returns a response within your latency target but gives a confidently wrong answer isn't reliable in the sense that matters. You need to define reliability along multiple axes.
Availability. The agent responds. This is table stakes but worth defining: what error rate is acceptable, and what latency is acceptable at which percentiles?
Behavioral consistency. The agent responds in ways that are consistent with its intended behavior, as evaluated against some defined quality criteria. This requires having defined what "correct" or "acceptable" behavior looks like, which is a product problem before it's an engineering problem.
Governance compliance. The agent operates within its defined policy envelope. It doesn't violate spend budgets, it handles PII according to policy, it doesn't invoke tools it shouldn't invoke. This is measurable if you have governance infrastructure; it's invisible if you don't.
Degradation behavior. When the agent can't handle a request (edge case, tool failure, ambiguous input), what does it do? A well-defined SLA specifies acceptable fallback behavior, not just acceptable success behavior.
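Written down, the four axes can be as simple as a structured set of targets plus a check against measured values. A sketch with invented thresholds (every number here is an assumption you would negotiate with stakeholders):

```python
from dataclasses import dataclass

@dataclass
class AgentSLA:
    max_error_rate: float         # availability
    p99_latency_s: float          # availability
    min_quality_score: float      # behavioral consistency
    max_cost_per_session: float   # governance compliance
    fallback: str                 # degradation behavior when out of scope

def within_sla(sla, error_rate, p99_latency, quality, cost):
    """Check measured values for a window against the SLA targets."""
    return (error_rate <= sla.max_error_rate
            and p99_latency <= sla.p99_latency_s
            and quality >= sla.min_quality_score
            and cost <= sla.max_cost_per_session)

sla = AgentSLA(0.01, 8.0, 0.85, 0.50, "escalate to human support")
```

The hard part isn't the check; it's that `min_quality_score` presupposes a working evaluation of behavioral quality, which is the product decision the section above describes.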
Starting an explicit SLA conversation with your stakeholders forces the right questions. It makes "the agent sometimes does weird things" a quantifiable problem rather than a vague worry.
Incident Response for AI Behavior
Traditional software incident response has a clear shape: something is broken, you find the root cause, you fix it, you deploy the fix. The incident is bounded in time and scope.
Agent behavior incidents are different in important ways:
The incident may have been happening for a while before you knew. Behavioral issues don't always surface as errors. They can look like slightly worse user retention, slightly higher support volume, slightly more frequent escalations. By the time it's clearly an incident, you need to reconstruct what happened over an extended period.
The fix may not be a code deploy. If the problem is a system prompt change that had unintended consequences, reverting it might fix it. If the problem is a model behavior shift from an upstream provider, the fix is adapting your prompting or governance layer. If the problem is a PII handling gap, the fix is a policy update. The intervention options are different from traditional software.
Rollback has a different meaning. Rolling back your agent deployment is not the same as rolling back a traditional service because the agent's "behavior" is distributed across the model, the prompt, the tools, and the governance policies. You need to know which layer the problem is at before you know what to revert.
Impact may not be fully reversible. If the agent produced bad outputs that users acted on, or that got logged in downstream systems, those effects may persist after the agent is fixed. Your incident response needs to account for data cleanup and user communication, not just the technical fix.
Having thought through these questions before an incident — ideally having documented your response procedures and defined what "resolved" looks like — is the difference between a managed incident and a chaotic one.
How Do You Move from Reactive to Governed Agent Operations?
The teams that operate agents well long-term aren't the ones that get good at firefighting. They're the ones that systematically reduce the number of fires. That means moving from reactive operations — find it when it breaks — to governed operations — define what acceptable looks like, enforce it continuously, and know immediately when it's violated.
That transition requires:
A policy layer that makes "acceptable behavior" explicit, not implicit in your hope that the model will do the right thing.
An enforcement mechanism that applies those policies in real time, not after-the-fact.
Instrumentation that makes the operational questions answerable without investigation.
Incident playbooks that treat agent behavior incidents as a distinct category with their own response procedures.
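The "real time, not after-the-fact" part is the key design choice: the policy check sits in front of each model or tool call, not in a nightly report. A minimal sketch of one such policy, a per-session spend cap (the cap value and interface are illustrative):

```python
class SessionBudget:
    """Enforce a per-session spend cap before each model/tool call.
    A denial routes the agent to its defined degradation path
    instead of letting the spend happen and reporting it later."""

    def __init__(self, cap_usd):
        self.cap = cap_usd
        self.spent = 0.0

    def authorize(self, estimated_cost_usd):
        """Return True and record the spend, or False to deny the step."""
        if self.spent + estimated_cost_usd > self.cap:
            return False   # enforce now, not in next month's invoice
        self.spent += estimated_cost_usd
        return True
```

The same shape (check before act, record what was allowed) generalizes to PII handling and tool-invocation constraints; what changes is the predicate, not the enforcement point.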
None of this is particularly exotic. It's engineering discipline applied to a new category of system. The teams that treat their agent deployments with the same operational rigor they'd bring to a database or a microservice find that agents are perfectly operable. The teams that treat agents as something categorically different — as software that's too intelligent to need real ops — are the ones with the war stories.
How Waxell handles this: Waxell provides the governance and operational layer that makes "running agents" different from "having deployed agents" — real-time cost tracking, behavioral policy enforcement, PII controls, and a queryable audit trail. The operational questions (latency, cost per session, behavioral drift, data handling) become answerable on demand without engineering investigation. No rewrites. Deploy over whatever you've already built. Start free →
Frequently Asked Questions
What breaks first when you run AI agents in production? In order of typical appearance: latency tail cases (p99 latency significantly worse than median, discovered through support tickets not monitoring), behavioral drift after upstream changes (model updates, tool schema changes, prompt modifications with unmapped downstream effects), context window edge cases (long sessions, unexpected tool response lengths), and cost variance (average-case costs in budget, but outlier sessions running up the tail).
How do you build an SLA for an AI agent? An agent SLA needs to cover four dimensions: availability (error rate, latency at defined percentiles), behavioral consistency (responses meet defined quality criteria, evaluated against some benchmark), governance compliance (agent operates within its policy envelope — spend, PII, tool constraints), and degradation behavior (defined fallback when the agent can't handle a request). Each dimension requires having defined what "acceptable" looks like before you can measure it, which is a product decision before it's an engineering one.
What does AI agent incident response look like? Agent behavior incidents differ from traditional software incidents in four ways: the problem may have been happening for days before detection; the fix may not be a code deploy (it might be a policy update, a prompt revert, or a governance layer change); rollback has a different meaning because behavior is distributed across model, prompt, tools, and policies; and impact may not be fully reversible if bad outputs were acted on or logged downstream. Response procedures need to account for all of this before an incident, not during one.
How is operating AI agents different from traditional software operations? Traditional systems are deterministic — you control the code, the code executes predictably. Agents are probabilistic — the same inputs can produce different outputs, and behavior is distributed across model, prompt, tools, and governance layer. This means traditional on-call runbooks don't translate directly. Agent operations requires understanding which layer a problem is at (model behavior? prompt? tools? policies?), what "rollback" means for each layer, and how to measure behavioral compliance, not just technical availability.
What operational questions should you be able to answer about your AI agents? On demand, without investigation: current p50 and p99 session latency; fraction of sessions in the last 24 hours that completed successfully vs. hit errors; average cost per session today vs. last week; top 1% of sessions by cost and what made them expensive; any PII that entered context in the last 7 days and how it was handled; whether agent behavior shifted after the last system prompt change. If any of these require a manual investigation rather than a dashboard query, you have observability but not operational capability.