Logan Kelly

Waxell vs. Arize Phoenix: The Iteration Tool vs. the Production Control Plane

Arize Phoenix helps you iterate and optimize AI agents before shipping. Waxell governs what they can do after you ship. Here's when you need each.

There's a gap that opens up between testing and production that most ML teams discover the hard way.

The scenario: an engineering team spent three weeks tuning their agent in Arize Phoenix. Prompt versions, model comparisons, LLM-based scoring on 200 historical examples. The eval results were clean — 91st percentile quality scores, no regressions, consistent behavior across edge cases. They shipped.

In production, the agent had write access to the customer database. The eval suite had measured output quality thoroughly; it hadn't covered what happens when a tool-orchestration bug causes the agent to issue an UPDATE query without a WHERE clause. Phoenix showed them exactly what happened, span by span. But there was no layer that required approval before the destructive query ran.

This isn't a failure of Arize Phoenix. It's a failure to recognize that Phoenix answers a different question than the one production safety requires.

Arize Phoenix is an open-source ML observability and experimentation platform: version prompts, run experiments, evaluate output quality, detect regressions — the research and iteration workflow before you ship. Waxell is a production control plane: enforce runtime policies, control tool access, filter outputs, and govern what agents can do once you ship. Phoenix answers "was that output good?" Waxell answers "is that action allowed?" These are different questions. Both matter. Neither replaces the other.

What is Arize Phoenix built for?

Phoenix is designed for the iteration phase of AI development — the work that happens between "we have an agent" and "we have an agent we trust enough to ship."

The core workflow is experiment tracking. You take a version of your prompt, model, and parameters; run them against a dataset of representative examples; score outputs with custom metrics or LLM-based evaluation; and compare the results to a baseline. When you swap models or change your system prompt, Phoenix shows you whether quality went up or down and where specifically it regressed. The prompt playground lets you iterate visually rather than in code.
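The shape of that loop can be sketched in plain Python. This is an illustrative stand-in for the workflow, not Phoenix's actual API — `run_agent`, `score_output`, and the dataset are hypothetical:

```python
# Illustrative experiment loop: run two prompt versions over the same
# dataset, score each output, and compare aggregate quality to a baseline.
# All names here are hypothetical stand-ins, not the Phoenix API.

def run_agent(prompt_template: str, example: dict) -> str:
    # Stand-in for a real LLM call: just renders the template.
    return prompt_template.format(**example)

def score_output(output: str, expected: str) -> float:
    # Stand-in for an LLM-based or heuristic evaluator; returns 0.0-1.0.
    return 1.0 if expected.lower() in output.lower() else 0.0

def evaluate(prompt_template: str, dataset: list[dict]) -> float:
    # Mean quality score for one prompt version across the dataset.
    scores = [
        score_output(run_agent(prompt_template, ex), ex["expected"])
        for ex in dataset
    ]
    return sum(scores) / len(scores)

dataset = [
    {"question": "What is our refund window?", "expected": "30 days"},
    {"question": "Do you ship internationally?", "expected": "yes"},
]

baseline = evaluate("Answer briefly: {question} (policy: 30 days, yes)", dataset)
candidate = evaluate("Reply tersely: {question} (policy: 30 days, yes)", dataset)
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
```

Phoenix automates exactly this comparison — versioned prompts, tracked runs, side-by-side scores — so you don't maintain the harness yourself.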

Phoenix is fully open-source with no feature gates. Self-hosted, free to run, no per-trace fees. For ML teams that evaluate constantly, this is a meaningful cost advantage — especially compared to managed platforms that charge at volume. The Arize AX managed offering adds enterprise features and support for teams that want to avoid the infrastructure overhead.

The thesis is sound: ship agents that you've actually measured. Know what changed between versions. Catch quality regressions before users do.

Where does Arize Phoenix stop?

The scope decision Phoenix makes is to focus entirely on the evaluation and iteration phase. Once an agent is in production, Phoenix gives you visibility into what happened. It does not give you control over what happens next.

No runtime policy enforcement. Phoenix can tell you that an agent made a bad call. It cannot intercept that call before it executes. The database query runs, the external API gets hit, the PII leaves your system — and Phoenix records it accurately. That's not a flaw; it's the architecture of an evaluation tool.

Limited agent instrumentation. Phoenix is built around LLM traces — the prompt in, the completion out. Full agent workflows involving multi-step orchestration, tool calls, and external system interactions require deeper instrumentation than Phoenix was designed for. You get the LLM call visibility; you don't get the full execution graph of what your agent did at each step.

No tool access control. There's no mechanism in Phoenix to restrict which tools an agent can invoke, under what conditions, or with what parameters. An agent that scored 91st percentile in your eval suite can still call any tool it has access to in production.

Complex setup relative to its scope. Phoenix requires real infrastructure even in its self-hosted form. For teams that want to go from agent code to production-governed deployment quickly, the infrastructure overhead is a friction point.

These aren't criticisms — they're scope. Phoenix is built for one phase of the workflow. The problem emerges when teams treat evaluation coverage as a substitute for production controls.

What Waxell adds

Waxell picks up where evaluation leaves off. Three lines of SDK code instrument your agent — any framework, any architecture — with execution tracing that captures the full graph: LLM calls, tool invocations, external API calls, timing, token usage, costs. That's the observability layer.

On top of it, runtime governance policies operate before each tool call and output. A policy can require approval before any destructive database operation. A policy can block outbound requests containing PII. A policy can terminate a session that exceeds a per-session cost threshold. A policy can escalate to a human reviewer when confidence falls below a defined threshold.
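Mechanically, a policy layer amounts to a check that runs between the agent's decision and the tool's execution. A minimal sketch of the idea — the rules, thresholds, and names here are illustrative, not Waxell's actual SDK:

```python
# Sketch of pre-execution policy checks: every tool call passes through
# enforce() before it runs. Rules and names are illustrative only.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict
    session_cost: float = 0.0  # accumulated spend for this session

class PolicyViolation(Exception):
    pass

MAX_SESSION_COST = 5.00  # dollars; example threshold
WRITE_TOOLS = {"db_write", "db_update", "db_delete"}

def enforce(call: ToolCall) -> None:
    """Raise PolicyViolation (or demand approval) before the call executes."""
    if call.session_cost > MAX_SESSION_COST:
        raise PolicyViolation("session cost threshold exceeded")
    if call.tool in WRITE_TOOLS and not call.args.get("approved"):
        raise PolicyViolation("write operation requires human approval")

def run_tool(call: ToolCall) -> str:
    enforce(call)                   # policy gate: runs before execution
    return f"executed {call.tool}"  # stand-in for the real tool
```

Because `enforce` runs in infrastructure code rather than in the prompt, it applies no matter what the model decided to do.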

The architectural point that matters: policies are infrastructure, not instructions. They don't live in the prompt, they don't depend on the model following directions, and they can be updated without a deployment. An agent that was 91st percentile in Phoenix evals and then hits a novel edge case in production is still subject to the governance policies — because those policies operate at the execution layer, not the reasoning layer.
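"Updated without a deployment" follows from treating policies as data rather than code: the running service loads a rule set at runtime, so changing a rule is a configuration change, not a redeploy. A rough sketch of that pattern (the JSON schema and function names are illustrative):

```python
# Sketch: policies as data, not code. Swapping the active rule set is a
# runtime config update, not a redeploy. Schema and names are illustrative.
import json

POLICY_JSON = """
{
  "blocked_tools": ["db_delete"],
  "require_approval": ["db_update", "db_write"]
}
"""

def load_policies(raw: str) -> dict:
    # In a real system this might be fetched from a control-plane server.
    return json.loads(raw)

def decide(policies: dict, tool: str, approved: bool = False) -> str:
    """Return the enforcement decision for one tool call."""
    if tool in policies["blocked_tools"]:
        return "block"
    if tool in policies["require_approval"] and not approved:
        return "escalate"
    return "allow"

policies = load_policies(POLICY_JSON)
print(decide(policies, "db_delete"))    # -> block
print(decide(policies, "db_update"))    # -> escalate
print(decide(policies, "search_docs"))  # -> allow
```

Re-running `load_policies` with a new document changes enforcement immediately, with no change to agent code.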

For the database scenario: a tool access policy requiring human approval before any write operation would have intercepted the UPDATE query before it ran. The agent's eval score would have been unchanged. The production incident wouldn't have happened.
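That specific failure is also mechanically detectable before execution. A rough heuristic sketch — the regexes here are illustrative; a production check would parse the statement with a real SQL parser:

```python
# Sketch: flag destructive SQL before it runs. Regex heuristics like these
# are illustrative only; production checks should parse SQL properly.
import re

DESTRUCTIVE = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)
HAS_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)

def needs_approval(sql: str) -> bool:
    """True if the statement writes data and should be gated on a human."""
    return bool(DESTRUCTIVE.match(sql))

def is_unbounded_write(sql: str) -> bool:
    """True for UPDATE/DELETE statements with no WHERE clause at all."""
    return bool(DESTRUCTIVE.match(sql)) and not HAS_WHERE.search(sql)

print(is_unbounded_write("UPDATE customers SET tier = 'free'"))               # True
print(is_unbounded_write("UPDATE customers SET tier = 'free' WHERE id = 7"))  # False
```

A policy gate that calls a check like this before the database tool executes converts the incident into a blocked call and an escalation.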

Feature comparison

| Capability | Waxell | Arize Phoenix |
| --- | --- | --- |
| Observability | | |
| Execution tracing | ✅ (full graph) | ⚠️ |
| LLM call logging | ✅ | ✅ (core) |
| Full agent workflow tracing | ✅ | ⚠️ LLM calls only |
| Tool call tracing | ✅ | ✅ (span-level, since v8.25) |
| Experimentation & Iteration | | |
| Prompt versioning | ❌ | ✅ (core) |
| Experiment tracking | ❌ | ✅ (excellent) |
| Model comparison | ⚠️ Via traces | ✅ (excellent) |
| Prompt playground | ❌ | ✅ |
| Evaluation | | |
| LLM-based output scoring | ⚠️ | ✅ (built-in) |
| Regression detection | ⚠️ Via alerts | ✅ (automated) |
| Dataset versioning | ❌ | ✅ |
| Governance & Runtime Control | | |
| Runtime policy enforcement | ✅ Core | ❌ |
| Tool access control | ✅ | ❌ |
| Output filtering / content controls | ✅ | ❌ |
| Rate limiting (per session) | ✅ | ❌ |
| Human-in-the-loop escalation | ✅ | ❌ |
| Compliance audit trail | ✅ | ❌ |
| Framework & Protocol Support | | |
| Framework-agnostic | ✅ | ✅ |
| MCP-native | ✅ Only | ❌ |
| Deployment | | |
| Cloud SaaS | ✅ | ✅ (Arize AX) |
| Self-hosted | | ✅ (fully open-source) |
| Setup complexity | Low (3 lines) | Moderate (infrastructure) |
| Pricing | | |
| Free tier | | ✅ Fully free (self-hosted) |
| Managed cloud | Flexible | $50/month (Pro); Enterprise custom (Arize AX) |

Two workflows, one agent lifecycle

The cleanest way to think about Phoenix and Waxell is as sequential layers of the same agent deployment lifecycle.

Phase 1 — Iteration (Phoenix): You're building and improving. You want to know if your prompt changes improved quality. You want to catch regressions before they reach users. You want to compare models and track which configuration performs best. Phoenix does this well. It's the tool for the question: "Should I ship this?"

Phase 2 — Production (Waxell): You've shipped. Your agents are running against real users, real data, and real external systems. You need to know what they're doing, stop them from doing things they shouldn't, and produce documentation that they operated within defined constraints. Waxell does this. It's the tool for the question: "Can I prove this is safe?"

Most teams need both phases covered. The gap the database scenario exposed isn't unusual — it's what happens when teams invest heavily in Phase 1 tooling and skip Phase 2 tooling.

When to use Arize Phoenix

Phoenix is the right choice when prompt engineering and model optimization are your primary workflow. If you're iterating on configurations, comparing models, or tracking quality regressions across versions, Phoenix's experiment and evaluation capabilities are among the best available — and the fully open-source, self-hosted model makes it accessible at any budget.

For ML-ops and research teams that evaluate constantly, Phoenix's free tier with no per-trace fees is a significant cost advantage. It's also the more complete tool for teams whose agents are still in the pre-production iteration phase, where the primary question is output quality rather than production safety.

When to use Waxell

Waxell is the right choice once agents are in production and interacting with real systems. Policy enforcement, tool access control, compliance audit trails, and human-in-the-loop gates aren't capabilities Phoenix approximates — they're out of scope for an evaluation platform by design.

For teams in regulated industries, teams deploying agents with broad tool access, or teams that need to demonstrate to auditors that their agents operated within defined constraints, Waxell provides what evaluation tooling cannot: evidence of enforcement, not just evidence of behavior.

How Waxell handles this: Waxell's runtime governance policies sit at the infrastructure layer — above agent code, evaluated before each tool call and output. A policy requiring human approval before write operations would have intercepted the database query before execution. Waxell's execution tracing captures the full agent workflow graph — not just LLM calls, but every tool invocation, external request, and session event — alongside the enforcement records showing what governance policies evaluated and what they allowed or blocked. Three lines of SDK to instrument; policies updated without a deployment.

Frequently Asked Questions

What is the difference between Waxell and Arize Phoenix? Arize Phoenix is an open-source ML observability and experimentation platform designed for the development and iteration phase: prompt versioning, experiment tracking, output evaluation, and regression detection. Waxell is a production control plane designed for the deployment phase: runtime policy enforcement, tool access control, compliance audit trails, and governance of agent behavior in production. Phoenix helps you decide whether to ship an agent; Waxell governs what that agent can do once you ship it.

Can Arize Phoenix enforce runtime policies for AI agents? No. Arize Phoenix is built for evaluation and observability — it records and analyzes what agents did, but it doesn't intercept or control agent behavior at runtime. It has no mechanism to block tool calls, filter outputs, enforce cost limits, or require human approval before actions execute. These are production governance capabilities that require a separate control-plane layer.

Is Arize Phoenix free? Arize Phoenix is fully open-source and free to self-host with no feature gates and no per-trace fees. The managed Arize AX platform (Pro tier) starts at $50/month; Enterprise pricing is custom. For teams comfortable managing their own infrastructure, self-hosted Phoenix has no cost ceiling.

Should I use Arize Phoenix and Waxell together? For teams moving agents from development into production, yes — they serve different phases. Phoenix covers the iteration and evaluation phase (should I ship this?). Waxell covers the production governance phase (can I prove this is safe?). They don't duplicate each other's capabilities, and using both gives you quality measurement in development and safety enforcement in production.

Does Waxell support MCP (Model Context Protocol)? Yes — Waxell is MCP-native, with built-in support for MCP server configurations, tool definitions, and policy enforcement at the MCP layer. Arize Phoenix does not have native MCP support. As MCP adoption grows across agent frameworks and tool ecosystems, MCP-native governance becomes increasingly important for teams that standardize on MCP for tool definitions.

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.