Logan Kelly

Your CI/CD Pipeline Doesn't Know Your Agent Changed

Shipping a new agent version isn't like shipping a code change. Here's what your delivery pipeline is missing — and what a production-ready agent pipeline actually needs.

You can roll back code in 90 seconds with a git revert. Rolling back an AI agent requires three separate decisions: which version of the prompt, which model checkpoint, and what to do about the tool configurations that changed since the last known-good state. Most teams discover this for the first time mid-incident, on a Tuesday night.

The instinct to treat agent deployment like software deployment is understandable. You have a codebase. You have a CI pipeline. You have staging. It looks the same. It isn't.

Software ships code. Agents ship code plus a prompt, a model version, a set of tool definitions, and — if you're building seriously — a set of governance policies that constrain what the agent can do. A change to any one of those five layers changes agent behavior in production. Your current CI/CD pipeline almost certainly versions only one of them.

CI/CD for AI agents is the practice of shipping, validating, and rolling back changes to autonomous AI systems where a "change" can mean updated code, a new prompt version, a different model, modified tool configurations, or updated governance policies — each of which can independently alter agent behavior in production. Because agents are non-deterministic and stateful, traditional pass/fail testing and code versioning don't cover the full change surface. Production-safe agent delivery requires behavioral validation, multi-layer versioning, and governance policy testing run before deployment — not after the incident that reveals their absence.

Why Shipping a New Agent Version Isn't a Deployment

When you ship a web service, the artifact is deterministic: the code does what the code does, every time, for every input. You can test it against known inputs and assert exact outputs. When the tests pass, you ship.

An agent's behavior is a function of its model, its prompt, the tools it can access, and the execution context. Change the system prompt by two sentences, and you've changed agent behavior — possibly substantially — without touching a single line of code. Update the underlying model version — which can happen when you're using an unpinned model alias like gpt-4 or claude-3-5-sonnet that redirects to the provider's latest default — and behavior can shift in ways that code review would never surface.
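One way to guard against silent model drift is to refuse unpinned aliases at deploy time. The sketch below is a heuristic check, not a provider API: it treats a model ID as pinned only if it ends in an explicit snapshot suffix (a four-digit or eight-digit date, as in the dated IDs providers publish); the helper names are illustrative.

```python
import re

# Assumption: pinned snapshots end in a dated suffix, e.g. "gpt-4-0613"
# or "claude-3-5-sonnet-20241022". Bare aliases like "gpt-4" float to
# whatever the provider's latest default is.
PINNED_PATTERN = re.compile(r"-(\d{4}|\d{8})$")

def is_pinned(model_id: str) -> bool:
    """True if the model ID names a dated snapshot rather than an alias."""
    return bool(PINNED_PATTERN.search(model_id))

def assert_pinned(model_id: str) -> str:
    """CI gate: reject unpinned aliases so every deploy has a rollback target."""
    if not is_pinned(model_id):
        raise ValueError(
            f"model {model_id!r} is an unpinned alias; "
            "pin a dated snapshot so rollback has a concrete target"
        )
    return model_id
```

A check like this belongs in the same pipeline stage that lints your deployment config: it turns "the provider silently swapped the model under us" from a production surprise into a failed build.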

This is what practitioners mean when they say agents are non-deterministic: not that they're random, but that their outputs aren't fully determined by code alone. A test suite that validates code correctness tells you nothing about whether the updated prompt makes the agent more likely to escalate customer complaints inappropriately, or whether the new model version changes how it interprets tool call instructions.

The structural consequence: you need a versioning system that tracks all five layers of an agent's definition together, and a testing system that validates behavior across those layers. Most teams have neither. What they have is a code pipeline that passes, and an assumption that behavior stayed roughly the same.

The New Stack captured this gap in a headline: Your CI/CD Pipeline Is Not Ready To Ship AI Agents. That's not a provocation — it's where most production teams find themselves when the first significant agent update doesn't behave as expected.

Why Traditional Pass/Fail Testing Doesn't Work

Software tests pass or fail. An assertion holds or it doesn't. You can write 2,000 unit tests and know with precision what your code handles.

Agent behavior doesn't lend itself to assertions. Ask an agent to summarize a customer complaint, and you'll get a different response each time — not because something is broken, but because that's how LLMs work. Testing with exact string matching is useless. Testing with fuzzy semantic evaluation requires a separate evaluation model, a defined rubric, and a judgment call about what "close enough" means.

The pattern that's established itself in production teams is LLM-as-a-judge: run the agent against a representative dataset, use a separate model to score outputs against defined criteria, and fail the pipeline when scores fall below a threshold. LangSmith's CI/CD integration, for example, uses GitHub Actions to run behavioral evaluation before promoting a new agent version to production. This works well as far as it goes.
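The judge pattern reduces to a small amount of pipeline glue. In this sketch, `run_agent` and `judge_score` are stubs standing in for real model calls (they are assumptions, not any platform's API): in a real pipeline the first invokes the agent under test and the second asks a separate evaluator model to score the output against a rubric, returning a value between 0 and 1.

```python
from statistics import mean

THRESHOLD = 0.8  # the judgment call about what "close enough" means

def run_agent(case: dict) -> str:
    # Stub: a real implementation invokes the agent version under test.
    return f"summary of {case['input']}"

def judge_score(case: dict, output: str) -> float:
    # Stub: a real implementation sends `output` plus the case's rubric
    # to a separate evaluator model and parses a 0..1 score from it.
    return 1.0 if case["input"] in output else 0.0

def behavioral_gate(dataset: list[dict]) -> float:
    """Run the agent across the dataset; fail the build below threshold."""
    scores = [judge_score(case, run_agent(case)) for case in dataset]
    avg = mean(scores)
    if avg < THRESHOLD:
        raise SystemExit(f"behavioral eval failed: {avg:.2f} < {THRESHOLD}")
    return avg
```

Wired into CI, `behavioral_gate` runs after unit tests and blocks promotion when the aggregate score drops, which is the same shape LangSmith's GitHub Actions integration implements.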

It has two limitations worth naming. First: behavioral evaluation at scale is expensive. Running the full evaluation dataset on every PR adds real cost and latency at model API prices. Most teams end up with a small smoke-test suite for PRs and a larger nightly run. Neither gives complete behavioral coverage.

Second — and this is the gap that observability platforms haven't addressed — behavioral evaluation only tests what the agent produces. It doesn't test what the agent is allowed to produce. An agent that scores well on your quality rubric can still route PII to an external API, burn past its token budget, or invoke a tool it shouldn't have access to in a given context. Governance policy validation is a separate testing requirement, and no standard evaluation platform covers it.

What "Rollback" Actually Means for an Agent in Production

In software, rollback means reverting to a previous commit. In agents, rollback means answering: which previous commit of what?

Here are the five layers that need to align in a proper agent rollback.

Code. The orchestration logic, tool definitions in code, framework configuration. This is the only layer most teams version rigorously. It's necessary but not sufficient.

Prompt. If the system prompt changed since the last known-good state, rollback means reverting to the previous version — which only works if prompt versions were tracked at all. Many teams edit prompts in-place, in a shared config file, with no version history.

Model. If the underlying model updated between now and the last good state, behavioral changes may not be reversible by reverting code or prompt alone. You need to have pinned the model version and documented it as a deployment artifact, or you have no model version to roll back to.

Tool configurations. The set of tools the agent can access, and under what conditions, directly shapes behavior. If a tool was added between the working state and the broken state, rollback means removing it — and auditing what it did during the window it was available.

Governance policies. If policies changed alongside the code change, a code revert without a policy revert leaves the agent running under a governance posture that doesn't match the version you're rolling back to.

The teams that handle rollbacks reliably maintain a deployment manifest that versions all five layers together: code commit hash, prompt version, model pin, tool config snapshot, governance policy version. Any change to any layer triggers a full deployment with the complete validation suite attached. Any team that hasn't built this discovers why they need it when they're trying to isolate the cause of an incident while the clock is running.
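The manifest described above can be made concrete in a few lines. The field names below are illustrative, not a standard; the point is that any change to any layer produces a new fingerprint, which is what makes "which version of what" answerable during an incident.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class DeploymentManifest:
    """All five layers of an agent's definition, versioned together."""
    code_commit: str       # git SHA of orchestration code and tool code
    prompt_version: str    # versioned prompt artifact, not an in-place file
    model_pin: str         # dated snapshot, never a floating alias
    tool_config_hash: str  # hash of the tool definition snapshot
    policy_version: str    # governance policy set in force

    def fingerprint(self) -> str:
        # Stable ID for "the agent as deployed": changing any one layer
        # changes the fingerprint, so it doubles as a rollback target.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]
```

Rollback then means redeploying a previous fingerprint in full, rather than reverting one layer and hoping the other four still line up.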

Where Governance Fits in the CI/CD Pipeline

The typical agent CI/CD pipeline: code change → unit tests → behavioral eval → deploy.

What's missing: governance validation. Before a new agent version reaches production, you need to confirm that the governance policies covering that agent still enforce correctly against the new version's behavior — and that no edge cases the change introduces slip past policy coverage.

Three concrete examples of where this matters.

A prompt change that makes the agent more verbose in clarifying questions creates new conversational flows where users share personal information. Your content policy needs to be tested against those flows before the prompt hits production — not after the first PII exposure surfaces in the logs.

A model update that changes how the agent interprets tool call priority might increase the frequency of write API calls in contexts where read access is the intent. Your access policy needs to be tested against the new model's behavior pattern before the update reaches users.

A tool configuration change that adds an external API endpoint introduces new data that flows into the agent's context. Your content policies need to evaluate what that endpoint can return and whether it creates a data handling gap.

Governance policy testing isn't a post-deployment audit. It's a pre-deployment gate — the same posture that puts security scanning before the deploy rather than after the breach. The governance plane sits at the infrastructure layer, evaluating each agent action before it executes. That's exactly why the policy layer needs to be validated before the new behavioral version of your agent reaches it.
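The gate ordering argued for here can be sketched as a pipeline that halts at the first failed stage, so a governance failure blocks the deploy stage entirely. The gate functions are placeholders (assumptions for illustration, not any CI system's API); in practice each would shell out to the corresponding tool.

```python
from typing import Callable

def run_pipeline(gates: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run gates in order; the first failure aborts everything after it."""
    passed = []
    for name, gate in gates:
        if not gate():
            raise SystemExit(f"gate failed: {name}; deploy blocked")
        passed.append(name)
    return passed

# Governance validation sits before deploy, not after it.
GATES = [
    ("unit_tests",       lambda: True),  # code correctness
    ("behavioral_eval",  lambda: True),  # LLM-as-a-judge scoring
    ("governance_check", lambda: True),  # policies vs. simulated execution
    ("deploy",           lambda: True),  # reached only if every gate passes
]
```

Because `deploy` is just another gate at the end of the list, there is no code path that reaches it without `governance_check` having passed first.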

How Waxell Handles This

Waxell's agent CLI gives you governance policy management from the command line — so the policy state covering your agents is accessible and manageable as part of your standard deployment workflow rather than a separate concern tracked outside the pipeline.

Before deployment, Waxell runs governance policies against simulated agent execution in the testing environment. A prompt change, model update, or tool configuration change gets evaluated against the current policy set before production exposure. The test surfaces whether any part of the change creates a policy violation or an edge case that existing policies don't cover.

Execution logs capture which policy versions were active during every production session. When something goes wrong and you need to roll back, the enforcement record shows exactly which policy state applied to each session in the incident window — so you know what to revert to, not just what to revert from.

Runtime telemetry gives you the full execution graph of every agent session: LLM calls, tool invocations, external API requests, costs, and policy evaluation events in a unified trace. Behavioral regressions become traceable — you can see not just what the agent did differently after the change, but which policies evaluated and what they allowed or blocked across the version boundary.

Governance policies sit above the agent code and update on their own version track. A policy doesn't change when you push a new prompt. That independence is what makes the governance layer a reliable control point in the pipeline: it can be validated, versioned, and rolled back separately from the agent code that runs beneath it.

If you're building the delivery pipeline for your agent fleet, request early access to test Waxell's governance layer in your own CI/CD workflow.

Frequently Asked Questions

What is CI/CD for AI agents?
CI/CD for AI agents is the practice of shipping, validating, and rolling back changes to autonomous AI systems using delivery pipelines adapted for non-deterministic behavior and multi-layer versioning. Unlike software CI/CD, agent CI/CD must version and validate code, prompts, model versions, tool configurations, and governance policies together — because a change to any one of these can independently alter agent behavior in production.

How do you test AI agents in a CI/CD pipeline?
The current standard is behavioral evaluation using an LLM-as-a-judge pattern: run the agent against a representative dataset, score outputs against a defined rubric using a separate evaluation model, and fail the pipeline when behavioral scores fall below a threshold. This covers output quality. Governance compliance — whether policies still enforce correctly after a change — requires a separate pre-deployment step that simulates agent execution against the policy layer before the new version reaches production.

How do you roll back an AI agent in production?
A reliable rollback addresses all five layers of an agent's definition: code version, prompt version, model version, tool configuration snapshot, and governance policy version. Reverting only the code while leaving prompts, model, or policies in a newer state creates a hybrid that may behave unpredictably. Teams that handle rollbacks reliably maintain a deployment manifest that versions all five layers together and treat a change to any layer as a full deployment requiring validation.

What makes CI/CD for AI agents different from CI/CD for software?
Software CI/CD assumes deterministic behavior: the same input produces the same output. Agent behavior is a function of code, prompt, model, tools, and execution context — making traditional pass/fail testing insufficient and rollback significantly more complex. Agent CI/CD also requires behavioral evaluation (did outputs change in acceptable ways?) and governance validation (do policies still enforce correctly?), neither of which has a direct equivalent in standard software delivery pipelines.

How do you version AI agent behavior?
Behavior versioning for agents means treating the prompt, model pin, tool configuration, and governance policy state as versioned artifacts alongside code — not as separate concerns tracked in different systems. Each deployment should be tagged with the state of all five layers, so you can reconstruct exactly what the agent's full behavioral definition was at any point in time. Without this, isolating a behavioral regression becomes guesswork and rollbacks are unreliable.

Does CI/CD for AI agents require different tooling?
Yes. Behavioral evaluation requires LLM-as-a-judge frameworks or dedicated evaluation platforms. Prompt versioning requires tooling separate from code version control. Model pinning and governance policy versioning require capabilities that most standard CI/CD stacks don't include today. The ecosystem is actively building toward this — Docker's cagent tool (2026) addresses the deterministic replay problem for agent testing, recording real provider interactions and replaying them in CI without live API calls — but teams that need the full picture are still assembling it from components rather than getting it from a single integrated system.


Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.
