Logan Kelly
MLOps taught us that shipping models to production is a different discipline than training them. AgentOps is the same lesson, one layer up — and it's harder.

Three weeks after your first agent ships to production, something changes. The agent starts behaving slightly differently than it did in staging. Maybe it's choosing different tools. Maybe it's taking an extra step it didn't before. Maybe it's occasionally producing responses that don't match the patterns you validated in evaluation. You go to debug it and realize: you don't know which version of the system prompt is running. You don't have a trace of what the agent actually did during the problematic session. You don't know if a tool API changed its response schema. You can't reproduce the failure.
That moment — the gap between "it worked in the demo" and "we can explain what it's doing in production" — is where AgentOps begins.
MLOps went through the same reckoning starting around 2015. Teams had gotten reasonably good at training machine learning models. What they discovered, after deploying those models to production at scale, was that training and running were entirely different disciplines. Monitoring for drift, managing data pipelines, versioning model artifacts, building retraining cycles, debugging live failures — none of that had mature tooling or established practice. It took the industry roughly five years to develop what became MLOps: a coherent set of engineering practices for production ML.
AgentOps is that same lesson arriving one layer up. The gap between building an agent and operating one in production is turning out to be just as large — and in some ways larger, because agents fail differently than models do. According to LangChain's State of Agent Engineering report (2026), only 37.3% of engineering teams run any form of online evaluation once agents are deployed to production. More than six in ten teams shipping production agents have no systematic way to know if those agents are behaving correctly.
AgentOps is the emerging discipline of running AI agents reliably in production: versioning agent configurations, monitoring behavior across deployments, managing tool dependencies, enforcing governance policies, and debugging failures in non-deterministic systems. It draws from MLOps practices but requires different tooling and different mental models, because agents fail in ways that models don't.
What MLOps got right — and what doesn't transfer
MLOps established several practices that are directly relevant to agentic systems and worth carrying forward.
Versioning artifacts beyond code. MLOps taught us that a deployed model is not just its weights — it's the training data version, the preprocessing pipeline, the serving configuration. The same logic applies to agents: a deployed agent is not just its code. It's the model pinned to a specific version, the system prompt, the tool list, the policy configuration. Treating the agent as a coherent, versioned artifact is a direct inheritance from MLOps.
Continuous evaluation. MLOps moved eval from a one-time pre-deployment gate to an ongoing process: evals run in production, not just in development. AgentOps requires the same shift — not as a nice-to-have, but as the baseline. Agents that pass every pre-deployment test can still behave incorrectly once they're exposed to real user inputs and real tool responses.
Domain-specific observability. MLOps recognized that standard software monitoring — uptime, latency, error rates — doesn't capture the relevant failure modes for ML systems. You need to monitor the thing the model produces: prediction distributions, drift metrics, calibration. AgentOps makes the same observation one level up: you need to monitor what the agent does, meaning its actions and tool calls, not just its outputs.
Where the transfer breaks down is in the nature of what's failing.
A machine learning model that drifts produces worse predictions. The damage is gradual and statistical. You detect it through distribution shift in outputs, escalating error rates in business metrics. A production agent that drifts does worse things. Each bad action has immediate, discrete consequences: a wrong API call, an incorrect database write, an email sent to the wrong recipient. The failure mode is individual and behavioral, not aggregate and statistical. The monitoring needs to match.
The specific ways agents fail in production that MLOps doesn't cover
Version creep without versioning. A production agent is a composition of several things that can each change independently: the underlying model (which the provider can update without notice), the system prompt, the tools and their current API behavior, the policy configuration, the runtime version. In a typical software deployment, "what version is running in production" has a clear answer. For agents, it often doesn't. When behavior changes, the first debugging question — what changed? — has no reliable answer, because no one versioned the full configuration as a deployable artifact.
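One way to make "what version is running?" answerable is to treat the full configuration as a single hashed artifact. The sketch below is a minimal illustration, not a prescribed implementation; the field names and example values are assumptions:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical agent configuration treated as one deployable artifact."""
    model: str              # model identifier pinned to a specific version
    system_prompt: str
    tools: tuple            # tool names with pinned schema versions
    policy_version: str
    runtime_version: str

    def fingerprint(self) -> str:
        """Deterministic hash over every field that can change behavior."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

# Illustrative values only.
cfg = AgentConfig(
    model="example-model-2026-01",
    system_prompt="You are a billing support agent...",
    tools=("crm_lookup@v2", "send_email@v1"),
    policy_version="policy-7",
    runtime_version="runtime-1.4.2",
)
print(cfg.fingerprint())  # attach this fingerprint to every production run
```

Any change to any field — a silent model update, a prompt tweak, a tool version bump — produces a different fingerprint, so logs from two deployments can never be silently conflated.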
Tool API drift. ML models have a clear dependency boundary: inputs and outputs. Agents have a more complex dependency graph — they depend on every tool they call, and tools change. An external API updates its response schema. A search tool changes its ranking behavior. A database tool returns results in a different order. The agent's behavior changes downstream of these changes without any modification to the agent itself. This is a dependency management problem that MLOps never had to contend with: your model didn't fail because a third-party API changed.
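Tool drift can be caught before it reaches the agent by pinning, at deploy time, the response shape each tool was evaluated against and checking live responses against it. A minimal sketch, with an invented tool schema for illustration:

```python
def detect_tool_drift(response: dict, expected: dict) -> list[str]:
    """Return a list of drift findings; an empty list means the response
    still matches the schema the agent was evaluated against."""
    findings = []
    for key, typ in expected.items():
        if key not in response:
            findings.append(f"missing field: {key}")
        elif not isinstance(response[key], typ):
            findings.append(f"type change: {key}")
    for key in response:
        if key not in expected:
            findings.append(f"new field: {key}")
    return findings

# Pinned at deploy time (tool name and fields are illustrative, not a real API).
SEARCH_SCHEMA = {"results": list, "total": int}

drifted = {"results": [], "total_count": 3}  # provider renamed "total"
print(detect_tool_drift(drifted, SEARCH_SCHEMA))
```

A shallow structural check like this won't catch semantic drift (a search tool changing its ranking, say), but it turns the most common case — a renamed or retyped field — into an alert instead of a mystery.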
Behavioral drift without model updates. MLOps monitors for model drift caused by input distribution shift: production inputs diverge from training inputs, and model performance degrades. Agents experience an analogous but subtler phenomenon: the context they receive in production — real user inputs, real tool responses, accumulated conversation history — differs from the context in development in ways that meaningfully change behavior, without any change to the model, the prompt, or the tools. The same agent, given production-realistic context, behaves differently than it did in staging. This isn't model drift. It's context drift, and it requires different instrumentation to detect.
Cascade failure in multi-agent systems. When a model call fails, the failure is contained. When an agent fails inside a multi-agent pipeline, the failure propagates. Downstream agents receive incorrect context and make decisions based on it. Tracing the root cause requires reconstructing the execution graph across multiple agents, identifying where in the chain the bad output originated, and understanding how it was amplified downstream. Standard observability tools — which show request-response pairs, not agent execution graphs — don't support this analysis.
Rollback without a rollback procedure. Software deployments have rollback: revert to the last known-good artifact, redeploy, done. For agents, rollback means restoring a previous configuration of model version plus system prompt plus tool bindings plus policy version — a set of artifacts that spans code, runtime config, and external dependencies. Most teams building agents today have no defined rollback procedure. When a new agent version causes problems in production, "roll it back" often means manually editing configs under pressure.
What AgentOps actually requires
The gap between MLOps and AgentOps isn't just more monitoring. It's different monitoring for a different category of artifact.
Agent versioning as a first-class practice. The agent configuration — model pinned to a specific version, system prompt hashed, tool list fixed, policy set numbered — needs to be treated as a deployable artifact with a version number. Every production deployment should be traceable to a specific, reproducible configuration. "What version of the agent was running when this incident happened?" should always have an answer.
Behavioral monitoring, not output quality scoring. The right thing to monitor in production isn't whether the agent produced a good response. It's whether the agent took the expected sequence of actions with the correct parameters. Tool selection monitoring catches behavioral drift early. Parameter validation monitoring catches the class of failures where the agent chooses the right tool but passes wrong arguments — a failure mode that looks fine at the output level but has bad downstream consequences.
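Tool selection monitoring can be as simple as comparing production tool-call rates against a baseline recorded during evaluation. This is a sketch under assumed tool names and an assumed baseline, not a production alerting system:

```python
from collections import Counter

# Tool-selection rates observed during evaluation (illustrative baseline).
BASELINE = {"crm_lookup": 0.55, "send_email": 0.30, "escalate": 0.15}

def selection_drift(recent_calls: list[str], baseline: dict,
                    tolerance: float = 0.15) -> dict:
    """Flag tools whose production selection rate deviates from the
    evaluation baseline by more than `tolerance`."""
    counts = Counter(recent_calls)
    total = len(recent_calls) or 1
    flagged = {}
    for tool, expected_rate in baseline.items():
        rate = counts[tool] / total
        if abs(rate - expected_rate) > tolerance:
            flagged[tool] = round(rate, 2)
    return flagged
```

A sudden spike in one tool's selection rate, or a tool that stops being called entirely, is exactly the kind of behavioral drift that output quality scores miss.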
Execution tracing as the foundation of debugging. Every production run should produce a replay-capable record: what was in the context window at each step, what tool calls were made in what order, what parameters were passed, what responses came back. This is the equivalent of what logging is for traditional software, but richer — it needs to capture not just the final output but the full execution graph. Without this, post-incident analysis is speculation. With it, you can reconstruct exactly what happened and why.
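The shape of such a record can be sketched in a few lines. The structure and field names here are assumptions chosen for illustration; real tracing systems capture the same information with more machinery:

```python
import json
import time
import uuid

def start_trace(agent_fingerprint: str) -> dict:
    """Open a trace tied to the exact agent configuration that ran."""
    return {"trace_id": uuid.uuid4().hex, "agent": agent_fingerprint, "steps": []}

def record_step(trace: dict, tool: str, params: dict,
                response: dict, context_tokens: int) -> None:
    """Append one tool call: exact arguments, exact response, context size."""
    trace["steps"].append({
        "ts": time.time(),
        "tool": tool,
        "params": params,                   # replayable arguments
        "response": response,               # exact tool output
        "context_tokens": context_tokens,   # context-window size at this step
    })

def persist(trace: dict) -> str:
    """Serialize the full execution graph as one durable record."""
    return json.dumps(trace, sort_keys=True)
```

The key design choice is that the trace links back to the agent fingerprint: a persisted record answers both "what did it do?" and "which configuration did it?" in one lookup.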
Registry management across environments. A production agent registry tracks which agent version is deployed in which environment, what tools it's bound to, and what policy version is enforced. Promotion workflows — test in staging, gate on behavioral evals, deploy to production with monitoring — give you the same confidence in agent deployments that CI/CD gives you in software deployments.
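A registry also makes rollback a pointer move rather than a config edit under pressure. A minimal sketch, assuming versions are the named configurations described above:

```python
class AgentRegistry:
    """Hypothetical minimal registry: maps each environment to its current
    pinned agent version and keeps deployment history for rollback."""

    def __init__(self):
        self.environments: dict[str, str] = {}   # env -> current version
        self.history: dict[str, list[str]] = {}  # env -> prior versions

    def deploy(self, env: str, version: str) -> None:
        """Promote a version into an environment, archiving the old one."""
        if env in self.environments:
            self.history.setdefault(env, []).append(self.environments[env])
        self.environments[env] = version

    def rollback(self, env: str) -> str:
        """Restore the last known-good version for this environment."""
        prior = self.history.get(env, [])
        if not prior:
            raise RuntimeError(f"no known-good version to roll back to in {env}")
        self.environments[env] = prior.pop()
        return self.environments[env]
```

Because each version names a complete configuration (model, prompt, tools, policies), rolling back restores all of them atomically instead of reverting components one at a time.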
Policy consistency across versions. As agents are updated, the governance policies enforcing their behavior need to be explicitly versioned alongside the agent configuration. A policy that was tested against version 3 of the agent may not cover the behavior introduced in version 4. Treating policies as first-class versioned artifacts — not as settings that get adjusted ad hoc — is the governance equivalent of versioning model weights.
How Waxell handles this: Waxell's agent registry gives you a first-class deployment artifact for agents — model version, system prompt, tool bindings, and policy layer versioned together as a named configuration. The execution history captures every production run as a replay-capable trace, making "what version was running and what did it actually do?" answerable for any incident. The governance plane enforces policies consistently across agent versions, so governance doesn't drift as the agent evolves.
MLOps matured from a collection of individual practices into a recognized engineering discipline because enough teams hit the same production failures and developed shared vocabulary for solving them. AgentOps is in that earlier phase — the phase where teams are discovering independently that operating agents in production is a serious engineering problem, without yet having shared tools or shared language for it.
The MLOps parallel suggests two things. First, this discipline will mature, probably faster than MLOps did because the industry is moving faster and the failures are more visible. Second, the teams that invest in AgentOps practices early — versioning, behavioral monitoring, execution tracing, registry management — build compounding advantage. Every agent they ship is easier to operate than the last, and every incident leaves them better equipped to handle the next one.
The alternative is the common path: ship agents quickly, operate them poorly, and discover the gap the hard way — three weeks after go-live, staring at production behavior you can't explain.
If you're ready to treat agent operations as a serious engineering concern, get early access to Waxell.
Frequently Asked Questions
What is AgentOps? AgentOps is the set of engineering practices for running AI agents reliably in production. It covers agent versioning, behavioral monitoring, execution tracing, tool dependency management, policy enforcement, and deployment workflows. The term is modeled on MLOps — the discipline that emerged when teams realized that training ML models and operating them in production required fundamentally different practices and tooling. AgentOps addresses the same gap for agentic systems.
How is AgentOps different from MLOps? MLOps was built for systems whose primary failure mode is statistical degradation: a model's predictions drift over time, and you detect this through distribution shift in outputs and error rates in business metrics. AgentOps is built for systems whose primary failure mode is behavioral: an agent takes a wrong action, calls a wrong tool, passes wrong parameters, or fails to complete a task sequence. The monitoring needs to cover actions, not just outputs. The versioning needs to cover the full agent configuration — model, prompt, tools, policies — not just model weights. The debugging requires execution traces, not just prediction logs.
What should I monitor for AI agents in production? The most important things to monitor are tool selection (is the agent choosing the right tool for each task?), argument validity (is it passing correct parameters?), task completion rate (is it successfully completing multi-step sequences, or getting stuck?), and behavioral anomalies (are action patterns diverging from what was observed in evaluation?). Output quality scoring is useful but insufficient — agents can produce plausible-looking outputs while having taken incorrect actions that the final response obscures.
How do I version AI agents? Treat the agent configuration as a single deployable artifact containing: the model identifier pinned to a specific version, the system prompt stored as a hashed string, the tool list with pinned versions of each tool's schema, and the policy configuration version. Every production deployment should be traceable to a specific, immutable combination of these components. When something goes wrong, "what changed between the last known-good deployment and this one?" should have a clear, deterministic answer based on configuration diff — not on memory of what someone changed when.
What does an early AgentOps stack look like? At minimum: a way to version and deploy agent configurations as coherent artifacts, execution tracing that captures the full tool-call graph for every production run, behavioral monitoring that alerts on anomalies in action sequences, and a defined rollback procedure that restores a previous versioned configuration. Beyond the minimum: a staging environment with behavioral eval gates before production promotion, a registry that tracks which agent version is running in which environment, and policy enforcement at the tool-call layer that's versioned alongside the agent. Most teams start with execution tracing and versioning — the two practices that make every other debugging task tractable.
Sources
LangChain, State of Agent Engineering (2026) — https://www.langchain.com/state-of-agent-engineering
Sculley, D. et al., Hidden Technical Debt in Machine Learning Systems (NeurIPS 2015) — https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
Google, MLOps: Continuous delivery and automation pipelines in machine learning — https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
Shankar, S. et al., Operationalizing Machine Learning: An Interview Study (2022) — https://arxiv.org/abs/2209.09125