Logan Kelly
"Guardrails" means different things in different contexts. Here's how policy enforcement for AI agents actually works — pre-execution blocking, mid-execution interception, and post-execution audit.

AI agent policy enforcement refers to infrastructure-level mechanisms that govern what AI agents can do, acting independently of the model's reasoning process. Unlike prompt-based guardrails — which empirical research confirms can be bypassed with up to 100% evasion success against major systems including Azure Prompt Shield and Meta Prompt Guard — policy enforcement operates at three enforceable points: pre-execution (blocking actions before they fire), mid-execution (intercepting the action stream in real time), and post-execution (auditing and triggering remediation workflows). This is distinct from system prompt instructions, which are suggestions to a probabilistic model, not controls.
"Guardrails" is the word that does the most work in the AI safety conversation while doing the least amount of specification.
Ask ten engineers what they mean by guardrails and you'll get ten different answers. Some mean output filtering — checking what the model says before it goes to the user. Some mean system prompt instructions that tell the model to behave well. Some mean topic restrictions that prevent the model from engaging with certain domains. These are all real things. None of them is what I mean by policy enforcement, and none of them constitutes a governance strategy.
This post is about what policy enforcement for AI agents actually looks like when it needs to be reliable, auditable, and effective in production — not just plausible on a demo.
AI agent policy enforcement is the practice of implementing rules that govern agent behavior through technical mechanisms outside the agent's own reasoning — not through system prompt instructions, but through enforcement layers that act regardless of what the model decides. Effective enforcement operates at three temporal moments: pre-execution (blocking a proposed action before it fires), mid-execution (intercepting the action stream as it's happening), and post-execution (auditing completed actions and triggering remediation). Each moment catches different risks; together they constitute a complete governance plane. (See also: What is agentic governance → · The governance gap →)
The Fundamental Problem with Prompt-Based Guardrails
The most common approach to agent behavior control is the system prompt. You include instructions like "do not share confidential information," "always ask for confirmation before sending emails," "do not call the payment API without explicit user approval." These feel authoritative. They're not.
In May 2026, a Security Boulevard analysis of encoded prompt injection made this concrete: according to the report, an attacker drained approximately $175,000 from Grok's AI-controlled wallet — using Bankrbot, an automated finance agent connected to xAI's Grok through a tool-calling layer, to execute the transfer — via a tweet written in Morse code. No output filter caught the instruction. No system prompt stopped the transfer. The attack was encoded in a format the model's guardrail layer was never designed to handle, and the wallet executed the transfer without any model-layer check flagging it as anomalous. A 2025 empirical study (arXiv:2504.11168) that tested six major protection systems — including Microsoft's Azure Prompt Shield and Meta's Prompt Guard — found that character injection and imperceptible adversarial attacks achieved up to 100% evasion success in some instances while maintaining adversarial utility. These are not exotic attacks. They represent the current baseline.
The problem isn't the model. It's where the enforcement lives. System prompt instructions are suggestions to a probabilistic system. LLMs follow them most of the time. They don't follow them all of the time. Under adversarial conditions — prompt injection, unusual input formats, carefully constructed edge cases — compliance rates with system prompt constraints drop significantly. Under distribution shift (inputs that don't match your training or testing distribution), they drop unpredictably.
This doesn't mean you shouldn't use system prompts thoughtfully. You should. But system prompts are not a governance layer. They're part of the user experience design. Governance requires enforcement mechanisms that exist outside the model's reasoning process — mechanisms that act regardless of what the model decides. The failure mode explored in depth in prompt injection via tool call results shows this precisely: the attack surface isn't the input you control, it's the tool response that feeds back into the agent's context.
Those three enforcement moments — pre-execution, mid-execution, and post-execution — each catch different risks. Together they constitute a complete governance layer.
Pre-Execution: Block Before It Happens
Pre-execution enforcement intercepts a proposed action before it's executed. The agent has decided it wants to do something. The policy layer evaluates whether it's allowed to. If not, the action is blocked, and the agent receives a response indicating the block and why.
This is the most powerful enforcement position because it prevents consequences before they occur. No data is transmitted. No tool is called. No cost is incurred. The bad action simply doesn't happen.
According to the Cloud Security Alliance (May 2026), 53% of organizations report that AI agents exceed their intended permissions occasionally or regularly — and only 8% say agents never exceed permissions. Pre-execution action authorization is the control that changes this number. The scope violation problem examined in AI agent scope violations and permission enforcement is almost entirely a pre-execution failure: the agent was never blocked from taking the action in the first place.
What pre-execution enforcement covers:
Input inspection. Before the agent's input is processed by the LLM, it can be scanned for content that violates policy — PII that shouldn't enter the context, injection patterns, content categories you've flagged as restricted. If the input fails inspection, you can sanitize it, reject it, or route it differently before it ever reaches the model.
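As a sketch of what such an input gate can look like, assuming a hypothetical sanitize/reject/allow response scheme (the patterns here are illustrative toys, not a complete detector — production systems use dedicated PII and injection classifiers):

```python
import re

# Illustrative patterns only: stand-ins for real PII/injection classifiers.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]

def inspect_input(text: str) -> dict:
    """Scan raw input before it ever reaches the model's context."""
    if any(pat.search(text) for pat in INJECTION_PATTERNS):
        return {"action": "reject", "reason": "injection_pattern"}
    findings = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    if findings:
        # Sanitize rather than reject: redact the PII, let the request proceed.
        redacted = text
        for name in findings:
            redacted = PII_PATTERNS[name].sub(f"[{name.upper()} REDACTED]", redacted)
        return {"action": "sanitize", "text": redacted, "pii": findings}
    return {"action": "allow", "text": text}
```

The key property is that the decision happens before the model sees the input, so a cleverly phrased payload cannot talk its way past the gate.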
Action authorization. Before a tool call is executed — before an API is hit, a file is written, a database is updated, an email is sent — a policy check determines whether this action is permitted. The authorization decision can be based on the action type, the parameters of the action, the session context, the user's permission level, or any combination. This is where you enforce "do not call the payment API without explicit user approval" in a way that actually works — not through a prompt instruction, but through an enforcement gate that the model cannot reason its way around.
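A minimal sketch of such a gate, assuming a hypothetical rule table keyed by action type (the rule bodies, role names, and dollar threshold are invented for illustration; the important property is default-deny for any action with no defined policy):

```python
from dataclasses import dataclass

@dataclass
class Session:
    user_role: str
    user_approved_payment: bool = False

# Hypothetical rule table: action type -> predicate over (params, session).
AUTHORIZATION_RULES = {
    "send_email": lambda params, s: s.user_role in ("operator", "admin"),
    "payment_api": lambda params, s: s.user_approved_payment
                   and params.get("amount_usd", 0) <= 500,
}

def authorize(action_type: str, params: dict, session: Session):
    """Gate every tool call before it executes. Unknown actions default to deny."""
    rule = AUTHORIZATION_RULES.get(action_type)
    if rule is None:
        return False, f"no policy defined for '{action_type}': default deny"
    if rule(params, session):
        return True, "permitted"
    return False, f"blocked by policy on '{action_type}'"
```

Wrapping tool dispatch in a function like this makes "requires explicit user approval" a hard gate rather than a request: the model's output proposes the call, but only the policy layer can execute it.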
Spend pre-authorization. Before initiating an operation that will incur cost — a long context call, an expensive tool invocation — a budget check determines whether the session has remaining allocation. If not, the operation is blocked before cost is incurred. See AI agent token budget enforcement for how this plays out in multi-step agentic workflows where costs compound rapidly across loops.
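A sketch of a spend guard with placeholder dollar figures (a real system would estimate cost from token counts and model pricing before each call):

```python
class BudgetExceeded(Exception):
    pass

class SpendGuard:
    """Pre-authorize estimated cost before an operation runs."""

    def __init__(self, session_budget_usd: float):
        self.budget = session_budget_usd
        self.spent = 0.0

    def preauthorize(self, estimated_cost_usd: float) -> None:
        """Raise before the operation fires if it would exceed the budget."""
        if self.spent + estimated_cost_usd > self.budget:
            raise BudgetExceeded(
                f"estimated ${estimated_cost_usd:.2f} exceeds "
                f"remaining ${self.budget - self.spent:.2f}"
            )

    def record(self, actual_cost_usd: float) -> None:
        """Record actual spend after the operation completes."""
        self.spent += actual_cost_usd
```

Because the check happens before the operation, a blocked call costs nothing — which is the whole point of pre-execution enforcement.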
Mid-Execution: Intercept In-Flight
Pre-execution enforcement assumes you can predict what actions an agent will want to take. For simple, well-defined agents, you can. For more complex agents with multi-step reasoning and dynamic tool selection, there will be cases where an action sequence you didn't fully anticipate emerges.
Mid-execution enforcement intercepts the agent's action stream as it's happening and applies policies in real time, including policies based on accumulated context that wasn't available at the start of the session.
What mid-execution enforcement covers:
Tool result inspection. A tool call was made and permitted. The result comes back. Before that result is appended to the agent's context, it's inspected — for PII that shouldn't enter context, for injection patterns, for content policy violations, for schema anomalies that indicate something unexpected happened. If the result fails inspection, it can be redacted, replaced with a sanitized version, or flagged. The Bankrbot incident is a mid-execution failure: the Morse-encoded tweet arrived as a fetched result, and inspection at the tool-result layer — rather than the model layer — is where encoding anomalies of this kind are catchable before they reach the context.
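A sketch of a tool-result gate along these lines. The patterns are crude illustrative stand-ins; the Morse-like check is only a toy example of flagging content in an unexpected encoding rather than a real detector:

```python
import re

SUSPICIOUS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    # Crude stand-in for "the payload is in an encoding the model layer
    # was never designed to flag": a long run of Morse-like symbols.
    re.compile(r"^[.\-/ ]{20,}$", re.MULTILINE),
]

def inspect_tool_result(result: str) -> str:
    """Sanitize a tool result before it is appended to the agent's context."""
    for pat in SUSPICIOUS:
        if pat.search(result):
            return "[TOOL RESULT WITHHELD: failed content inspection]"
    return result
```

The enforcement point matters more than the patterns: inspection sits between the tool and the context window, so a poisoned result never becomes part of what the model reasons over.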
Sequence-level policy evaluation. Some policies only make sense at the sequence level, not the individual action level. If an agent has made five different external API calls in a single session, that pattern may be a policy violation even if each individual call was permitted. Mid-execution monitoring can track patterns across a session and trigger policy responses based on accumulated behavior.
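One way to sketch this, using the hypothetical "too many distinct external APIs in one session" rule described above (the limit and rule name are invented for illustration):

```python
from collections import Counter

class SequenceMonitor:
    """Track per-session action patterns and flag sequence-level violations."""

    def __init__(self, max_distinct_external_apis: int = 4):
        self.limit = max_distinct_external_apis
        self.external_apis = set()
        self.action_counts = Counter()

    def observe(self, action_type: str, target: str = "") -> list:
        """Record one permitted action; return any sequence-level violations."""
        self.action_counts[action_type] += 1
        violations = []
        if action_type == "external_api_call" and target:
            self.external_apis.add(target)
            if len(self.external_apis) > self.limit:
                violations.append("too_many_distinct_external_apis")
        return violations
```

Each individual call passes its own authorization check; the monitor fires only when the accumulated pattern crosses the line.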
Budget enforcement. As token spend accumulates within a session, mid-execution monitoring tracks against the budget ceiling and triggers predefined responses — compression, warning, capping — as thresholds are approached and crossed.
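As a sketch, with illustrative thresholds (70% warn, 90% compress, 100% cap — the percentages and response names are placeholders for whatever your policy defines):

```python
def budget_response(spent_tokens: int, ceiling_tokens: int) -> str:
    """Map accumulated session spend to a predefined response."""
    ratio = spent_tokens / ceiling_tokens
    if ratio >= 1.0:
        return "cap"        # hard stop: no further model calls this session
    if ratio >= 0.9:
        return "compress"   # e.g. summarize context to reclaim headroom
    if ratio >= 0.7:
        return "warn"       # surface to the operator, annotate the session
    return "continue"
```

The response escalates as thresholds are crossed, so the session degrades gracefully instead of failing at the ceiling.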
Post-Execution: Audit and Remediate
Post-execution isn't enforcement in the sense of preventing actions — the action has already occurred. It's the foundation of your audit trail and the trigger for remediation workflows.
What post-execution covers:
Audit record creation. Every action the agent took, every policy evaluation that was performed, every enforcement decision that was made — logged with full context. The audit record should be sufficient to reconstruct what happened and why, including what the agent's context was at the time a decision was made. "The call was made" is not sufficient. "The call was made, here is the full context at that moment, here is the policy that was evaluated, here is the outcome" is sufficient.
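A sketch of what such a record might contain. The field names are illustrative; here the full context snapshot is assumed to be retained elsewhere under your retention policy, with the record keeping a hash so it stays compact but verifiable:

```python
import hashlib
import json
from datetime import datetime, timezone

def make_audit_record(action: dict, policy_id: str, decision: str,
                      context_snapshot: str) -> dict:
    """Build a reconstructable audit record for one enforcement decision."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,            # what the agent attempted
        "policy_id": policy_id,      # which versioned policy evaluated it
        "decision": decision,        # allow / block / sanitize / flag
        "context_sha256": hashlib.sha256(
            context_snapshot.encode()).hexdigest(),
    }

record = make_audit_record(
    {"type": "send_email", "to": "ops@example.com"},
    policy_id="email-auth-v3",
    decision="allow",
    context_snapshot="...full agent context at decision time...",
)
print(json.dumps(record, indent=2))
```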
Violation flagging. Actions that completed but should be reviewed — either because they barely passed policy or because they fit a pattern that warrants attention — get flagged for human review. This creates a workflow for operationalizing governance, not just logging it.
Retrospective detection. For behavioral patterns that are only apparent in aggregate — a class of queries where the agent consistently underperforms, a tool call pattern that's technically within policy but warrants investigation, a cost distribution that's shifted in a concerning direction — post-execution analysis surfaces signals that weren't visible at the individual event level.
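A sketch of one such aggregate signal — a per-tool cost-distribution shift — computed over audit records. The record shape, window size, and threshold multiplier are all illustrative assumptions:

```python
from collections import defaultdict

def retrospective_cost_shift(audit_records: list, window: int = 100,
                             threshold: float = 1.5) -> list:
    """Flag tools whose mean cost in the recent window exceeds their
    long-run baseline by `threshold`x. Assumed record shape:
    {"tool": str, "cost_usd": float}, in chronological order.
    """
    by_tool = defaultdict(list)
    for rec in audit_records:
        by_tool[rec["tool"]].append(rec["cost_usd"])
    flagged = []
    for tool, costs in by_tool.items():
        if len(costs) <= window:
            continue  # not enough history to form a baseline
        recent = costs[-window:]
        baseline = costs[:-window]
        if sum(recent) / len(recent) > threshold * (sum(baseline) / len(baseline)):
            flagged.append(tool)
    return flagged
```

No single record in the recent window looks wrong on its own; the signal only exists in aggregate, which is exactly why it belongs in post-execution analysis.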
Policy Definition: The Work Before the Enforcement
None of this enforcement machinery matters if you haven't done the harder work of defining what your policies actually are.
Policy definition requires answering questions that feel abstract but have concrete implications:
What are the hard constraints — the things the agent must never do, regardless of context? These become blocking rules at the pre-execution layer.
What are the conditional permissions — things the agent may do under certain conditions? These become conditional authorization rules with context-dependent evaluation logic.
What are the budget parameters — token allocation per session, per user, per day; cost ceilings at various granularities? These become spend guardrails with defined response actions at threshold crossings.
What needs to be auditable — which actions, which data flows, which policy evaluations? These determine your audit log schema and retention requirements.
Policies need to be explicit, versioned, testable, and documented. An implicit policy — "the agent shouldn't do X" based on a system prompt instruction — is not a policy in the governance sense. An explicit policy — "action type Y with parameter P matching pattern Z is blocked for sessions with context flag Q" — is. Waxell Runtime ships with 45 policy categories out of the box, covering input inspection, action authorization, spend pre-authorization, tool result inspection, and sequence-level behavioral rules — so you're enforcing documented, versioned policies from day one, not writing enforcement logic from scratch.
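That explicit rule shape — action type Y, parameter P, pattern Z, context flag Q — can be expressed directly as versioned data. A sketch, with a hypothetical policy instance (the IDs, pattern, and flag name are invented):

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """An explicit, versioned policy: testable data, not a prompt instruction."""
    policy_id: str
    version: int
    action_type: str      # "Y": which tool call this governs
    param: str            # "P": which parameter to examine
    pattern: str          # "Z": blocked if the parameter matches
    context_flag: str     # "Q": only applies when this session flag is set

    def blocks(self, action_type: str, params: dict, session_flags: set) -> bool:
        return (action_type == self.action_type
                and self.context_flag in session_flags
                and bool(re.search(self.pattern, str(params.get(self.param, "")))))

# Hypothetical instance: block uploads to non-internal hosts for
# sessions that have ingested untrusted input.
no_external_upload = Policy(
    policy_id="upload-block", version=2,
    action_type="file_upload", param="destination",
    pattern=r"^https?://(?!internal\.example\.com)",
    context_flag="untrusted_input",
)
```

Because the policy is data, it can be versioned in source control, diffed between deployments, and exercised directly by the test suite described below.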
Testing Your Policies
A policy layer that only gets exercised in production incidents is not sufficient. Your policies need to be tested against known scenarios before they're deployed, and they need to be validated against your real production traffic on an ongoing basis.
This means having a test suite for your governance layer — not just for your agent's core behavior. Tests that verify:
Known bad inputs are blocked at the right layer
Known good inputs pass without unnecessary friction
Budget guardrails trigger at the right thresholds with the right responses
PII detection catches the patterns you care about with acceptable false positive rates
Audit records are created for the events that need records
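A sketch of what such a suite can look like, pytest-style. The `evaluate` function here is a placeholder for your real policy-layer entry point, and the rules it encodes are illustrative:

```python
def evaluate(kind: str, payload: str) -> str:
    """Stand-in for a real policy-layer entry point. Replace with yours."""
    if "ignore previous instructions" in payload.lower():
        return "block"
    if kind == "spend" and float(payload) > 1.0:
        return "block"
    return "allow"

def test_known_bad_input_blocked():
    assert evaluate("input", "Please ignore previous instructions") == "block"

def test_known_good_input_passes():
    assert evaluate("input", "What is our refund policy?") == "allow"

def test_budget_guardrail_triggers_at_threshold():
    assert evaluate("spend", "1.5") == "block"
    assert evaluate("spend", "0.5") == "allow"

if __name__ == "__main__":
    test_known_bad_input_blocked()
    test_known_good_input_passes()
    test_budget_guardrail_triggers_at_threshold()
    print("all governance tests passed")
```

The suite tests the enforcement layer itself, independently of the agent's behavior, so a policy regression fails a build rather than surfacing in an incident.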
The teams that have done this work have a dramatically different experience during incidents than the teams that haven't. When something goes wrong, the question isn't "do our policies work?" — it's "which policy handled this, and did it handle it correctly?"
That's a much better problem to have.
How Waxell handles this: Waxell Runtime implements all three enforcement layers — pre-execution blocking (input inspection, action authorization, spend pre-authorization), mid-execution interception (tool result inspection, sequence-level policy evaluation, budget enforcement), and post-execution audit (full decision context, violation flagging, retrospective analysis) — as a deployable governance plane over your existing agents. You define policies; Waxell Runtime enforces them across 45 policy categories. No rebuilds required. Get Waxell access →
Frequently Asked Questions
What is AI agent policy enforcement?
AI agent policy enforcement is implementing governance rules through technical mechanisms that act outside an agent's reasoning process, regardless of what the model decides. It's distinct from system prompt guardrails, which are suggestions to a probabilistic model and fail under adversarial conditions. Policy enforcement requires an external layer — typically at the infrastructure level — that evaluates and acts on agent behavior independently of the model.
How is AI agent policy enforcement different from system prompt guardrails?
System prompt instructions are suggestions to a probabilistic model. LLMs follow them most of the time — not all of the time. Under adversarial conditions (prompt injection, unusual inputs), compliance drops significantly — a 2025 empirical study (arXiv:2504.11168) found that adversarial evasion techniques achieved up to 100% success in some instances against six major guardrail systems including Azure Prompt Shield and Meta Prompt Guard. Policy enforcement uses mechanisms outside the model's reasoning: pre-execution gates that block actions before they fire, mid-execution interceptors that apply rules regardless of what the model decided, and post-execution audit that documents every governance decision. The model cannot reason its way around these.
What are the three layers of AI agent policy enforcement?
Pre-execution enforcement blocks proposed actions before they execute — input inspection, action authorization, spend pre-authorization. This is the most powerful position because it prevents consequences before they occur. Mid-execution enforcement intercepts the agent's action stream in real time — tool result inspection, sequence-level policy evaluation, budget tracking. Post-execution enforcement creates the audit trail and triggers remediation — logging every governance decision with full context, flagging violations for human review, enabling retrospective behavioral analysis.
What is pre-execution enforcement for AI agents?
Pre-execution enforcement intercepts a proposed action before it executes. The agent has decided it wants to take an action; the policy layer evaluates whether it's permitted to. If not, the action is blocked and the agent receives a response explaining the block. This is the strongest enforcement position — no data transmitted, no tool called, no cost incurred, no consequences to remediate. Pre-execution covers input inspection (PII, injection patterns), action authorization (is this tool call permitted under current conditions?), and spend pre-authorization (does this session have remaining budget?).
How do you test AI agent policies?
Policies need a test suite separate from your agent's core behavior tests. The suite should verify: known bad inputs are blocked at the right layer; known good inputs pass without friction; budget guardrails trigger at the correct thresholds with the correct responses; PII detection catches targeted patterns at acceptable false positive rates; and audit records are created for all events that require them. Running this suite before every deployment that changes policies is the difference between finding out a policy broke in CI and finding out in production.
Can LLM guardrails be bypassed in practice?
Yes — and the research is unambiguous. A 2025 empirical study (arXiv:2504.11168) tested six major guardrail systems including Microsoft Azure Prompt Shield and Meta Prompt Guard. Using character injection and imperceptible adversarial evasion, researchers achieved up to 100% evasion success in some instances while maintaining adversarial utility. Real-world incidents corroborate this: according to a May 2026 Security Boulevard report, an attacker drained approximately $175,000 from an AI-controlled finance agent using a Morse-code-encoded tweet — bypassing all model-layer checks. This is why enforcement needs to live at the infrastructure layer, not inside the model's reasoning process.
Sources
Cloud Security Alliance, AI Agent Security Starts with Scope Control (May 2026) — https://cloudsecurityalliance.org/blog/2026/05/12/ai-agent-security-starts-with-scope-control
Security Boulevard, Encoded Prompt Injection: Why LLM Guardrails Are at the Wrong Layer (May 2026) — https://securityboulevard.com/2026/05/encoded-prompt-injection-why-llm-guardrails-are-at-the-wrong-layer/
Perez et al., Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems (April 2025) — https://arxiv.org/abs/2504.11168
VentureBeat / Gravitee, The enforcement gap: 88% of enterprises reported AI agent security incidents (2026) — https://venturebeat.com/security/most-enterprises-cant-stop-stage-three-ai-agent-threats-venturebeat-survey-finds
Atlan, AI Agent Risks & Guardrails: 2026 Enterprise Security Guide (2026) — https://atlan.com/know/ai-agent-risks-guardrails/