Logan Kelly
Braintrust optimizes AI agent quality in development. Waxell governs what agents can do in production. Here's when you need each — and when you need both.

Consider a team running a tight eval suite. Every Friday, they run 500 real production transcripts through Braintrust scorers, iterate on prompts with Loop, and ship only when quality hits above 8.5/10. Their evals are genuinely good — not the performative kind.
Then one of their agents starts routing customer support tickets through an external summarization API. PII goes with them. The eval score? Still 8.7/10. The summarization is excellent. The governance isn't.
The problem wasn't Braintrust. Braintrust was doing exactly what it's designed to do: measure and optimize quality. The problem was that "quality" and "safe to run in production" are different questions, and the team was using one tool to answer both.
Braintrust is a developer-centric evaluation and experiment platform: score outputs, tune prompts, track quality regressions, and use AI-powered optimization to improve agent behavior before you ship. Waxell is a runtime governance control plane: enforce policies at execution time, control tool access, filter outputs, and produce compliance audit trails for what agents do in production. Braintrust answers "is this agent producing good outputs?" Waxell answers "is this agent allowed to do what it's doing right now?" Those are different questions, and a production team needs both answered.
What is Braintrust built for?
Braintrust is organized around the evaluation workflow: you have agents producing outputs, you want to know whether those outputs are good, and you want a systematic way to improve them. The whole product is built around that loop.
The core capability is browser-based evaluation. You write scorers — functions that assess output quality on whatever dimensions matter to you — run them against versioned datasets of examples, and see results immediately. No infrastructure to manage, no Python scripts to maintain. If you're iterating on prompts or model choices, Braintrust's Loop feature takes this further: it uses AI to suggest prompt improvements, tests them against your dataset, and iterates based on your annotated results. For teams that evaluate heavily, it's the most efficient workflow in the category.
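At its simplest, a scorer is just a function that maps an output to a number. The sketch below is generic Python for illustration — it does not use Braintrust's actual SDK, and the function names are invented — but it shows the shape of the idea: small, deterministic functions scoring whatever dimensions matter to you.

```python
# Illustrative only: generic scorer functions, not Braintrust's actual SDK.
# A scorer maps an output (plus optional context) to a score in [0, 1].

def conciseness_scorer(output: str, max_words: int = 120) -> float:
    """Penalize outputs that run past a target word budget."""
    words = len(output.split())
    if words <= max_words:
        return 1.0
    # Linear falloff: twice over budget scores 0.
    return max(0.0, 1.0 - (words - max_words) / max_words)

def keyword_coverage_scorer(output: str, required: list[str]) -> float:
    """Fraction of required terms that appear in the output."""
    text = output.lower()
    hits = sum(1 for term in required if term.lower() in text)
    return hits / len(required) if required else 1.0

scores = {
    "conciseness": conciseness_scorer("Refund issued per policy section 4."),
    "coverage": keyword_coverage_scorer(
        "Refund issued per policy section 4.", ["refund", "policy"]
    ),
}
print(scores)
```

An eval platform's job is to run functions like these against a versioned dataset and aggregate the results; the scorers themselves stay this simple.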
Braintrust also connects dev to production. Quality metrics from production traces flow back into the eval workflow, so you can see whether the configuration that tested at 8.5/10 is actually delivering 8.5/10 in the wild. Dataset versioning lets you track performance across prompt versions. The free tier — 1 GB processed data/month and 10K evaluation scores — is generous enough for serious development use, and the $249/month Pro plan includes 5 GB processed data/month with overage pricing beyond that.
It's a well-designed tool for a specific job: help development teams ship higher-quality agents.
Where does Braintrust stop?
The limits of Braintrust aren't flaws — they're scope decisions. The platform is built for the development phase. When you move to production with agents that touch sensitive systems, access external APIs, or operate in regulated environments, you're in territory Braintrust doesn't address.
No runtime policy enforcement. Braintrust can tell you that your agent produced a low-quality output. It cannot prevent the agent from making that API call, accessing that file, or routing that PII before the output ever gets evaluated. Observation happens after the fact; enforcement has to happen before it.
No tool access control. If you're running agents with MCP tools, database access, or external API integrations, Braintrust has no mechanism to restrict which tools an agent can invoke, under what conditions, or with what parameters. An agent that passes your eval suite can still call any tool it has access to in production.
No compliance audit trail. Braintrust produces observability data. It doesn't produce enforcement records — documentation showing that specific policies were evaluated before specific actions, that certain behaviors were blocked, that a session was terminated when it exceeded defined thresholds. For a compliance audit, the distinction matters.
No rate limiting or cost controls at the session level. A session that loops unexpectedly in production will run until it hits an external limit or you intervene manually. Braintrust will show you it happened; it won't stop it.
None of this is a knock on Braintrust. These are production governance problems, and Braintrust is an evaluation tool. The issue arises when teams assume that passing evals is equivalent to production readiness — which the PII scenario above illustrates clearly.
What Waxell adds
Waxell instruments agent executions across any framework — LangChain, CrewAI, LlamaIndex, custom Python — with execution tracing that captures spans, tool calls, token usage, and timing. That's the observability layer.
On top of it, runtime governance policies evaluate before each tool call and output: what tools is this agent allowed to use? What data is it allowed to handle? What does it cost per session? What output content is blocked? If a policy trips, the action is intercepted before execution — not logged after.
The important architectural detail: policies sit above the agent code. They're not baked into prompt instructions, not embedded in the agent's logic, not dependent on the agent "behaving" correctly. They operate at the infrastructure layer, which means they enforce even when the agent's reasoning would lead somewhere else, and they can be updated independently of the agent without a deployment.
For the PII scenario: a content policy checking outbound requests for sensitive data fields would have caught the routing before the external API call completed. The eval score wouldn't have caught it, because the summarization was high quality. The governance layer would have caught it, because the data handling violated policy.
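To make the mechanics concrete, here is a minimal sketch of what an outbound content policy could look like. Every name here is hypothetical — this is not Waxell's actual API — but the structural point holds: the check runs before the external call, and a match blocks the action rather than logging it afterward.

```python
import re

# Hypothetical sketch of an outbound content policy -- not Waxell's real API.
# The key property: the check runs BEFORE the external call, and a match
# blocks the action instead of merely recording it.

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

class PolicyViolation(Exception):
    pass

def enforce_content_policy(payload: str) -> None:
    """Raise before the tool call if the payload matches a PII rule."""
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(payload):
            raise PolicyViolation(f"outbound payload matched PII rule: {name}")

def call_summarization_api(payload: str) -> str:
    enforce_content_policy(payload)  # gate: evaluated before the request
    # ... only reached if the policy passes; the real HTTP call would go here
    return "summary"

try:
    call_summarization_api("Patient SSN 123-45-6789, history attached")
except PolicyViolation as e:
    print(f"blocked: {e}")  # the external call never executed
```

Note that nothing here depends on the agent's prompt or reasoning — the gate sits in the call path itself, which is what "infrastructure layer" means in practice.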
Feature comparison
| Capability | Waxell | Braintrust |
|---|---|---|
| **Observability** | | |
| Execution tracing | ✅ | ✅ |
| Tool call logging | ✅ | ✅ |
| Production monitoring | ✅ | ✅ (excellent) |
| **Evaluation & Quality** | | |
| Browser-based eval suite | ❌ | ✅ (excellent) |
| AI-powered prompt optimization (Loop) | ❌ | ✅ (unique) |
| Dataset versioning | ⚠️ Manual | ✅ Built-in |
| Regression detection | ⚠️ Via alerts | ✅ |
| Custom scorers | ✅ | ✅ |
| **Governance & Runtime Control** | | |
| Runtime policy enforcement | ✅ Core | ❌ |
| Tool access control | ✅ | ❌ |
| Output filtering / content guardrails | ✅ | ❌ |
| Rate limiting (per session) | ✅ | ❌ |
| Compliance audit trail | ✅ | ⚠️ Limited |
| Human-in-the-loop escalation gates | ✅ | ❌ |
| Data residency enforcement | ✅ | ❌ |
| **Framework & Protocol Support** | | |
| Framework-agnostic | ✅ | ✅ |
| MCP-native | ✅ | ❌ |
| **Deployment** | | |
| Cloud SaaS | ✅ | ✅ |
| Self-hosted | ✅ | ✅ Enterprise |
| **Pricing** | | |
| Free tier | ✅ | ✅ 1 GB/mo + 10K scores |
| Pro plan | Flexible | $249/month (5 GB/mo, overage pricing) |
Three production scenarios where evaluation isn't sufficient
Scenario 1: The PII leak. Your customer service agent has a 9/10 eval score. In production, a ticket contains a patient's medical history. The agent routes it to an external summarization API as part of its workflow. Your eval scored the summary quality — not the data handling. A Waxell content policy on outbound requests would intercept before the external call. Braintrust shows you the trace afterward.
Scenario 2: The loop. An agent encounters an edge case your eval dataset didn't cover, enters a retry loop, and burns 200x its normal session cost before timing out. Your Braintrust dashboard shows a quality anomaly. A Waxell cost policy on per-session token budgets terminates the session automatically when it exceeds threshold. Other sessions are unaffected.
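A per-session budget policy is simple enough to sketch in a few lines. As before, the names are hypothetical and this is not Waxell's real API — the point is the mechanism: every call debits the session's budget, and crossing the threshold terminates the session immediately rather than after the fact.

```python
# Hypothetical sketch of a per-session token budget policy -- illustrative,
# not Waxell's real API. Each model or tool call debits the budget; the
# session is terminated the moment it exceeds threshold, so a runaway retry
# loop cannot consume unbounded spend.

class SessionTerminated(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.max_tokens:
            raise SessionTerminated(
                f"session exceeded budget: {self.used}/{self.max_tokens} tokens"
            )

budget = TokenBudget(max_tokens=10_000)
for attempt in range(1000):          # a retry loop the eval dataset never saw
    try:
        budget.charge(tokens=1_500)  # cost of one model call (illustrative)
    except SessionTerminated as e:
        print(f"terminated at attempt {attempt}: {e}")
        break
```

Because the budget is scoped to the session, the kill switch fires without touching any other session — the same isolation property described above.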
Scenario 3: The compliance audit. Your security team needs to demonstrate that your agents operated within defined data handling constraints during Q1. Braintrust produces logs showing what your agents did. Waxell produces enforcement records: each policy evaluation, what triggered, what action was taken. The difference between evidence of behavior and evidence of controls.
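The distinction between the two kinds of records is easiest to see as data. The field sets below are hypothetical — not Waxell's actual schema — but they illustrate the gap: a trace entry describes what the agent did, while an enforcement record describes which policy was evaluated, what it decided, and why.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Illustrative field sets only -- hypothetical, not Waxell's actual schema.

@dataclass
class TraceEntry:              # evidence of behavior
    timestamp: str
    agent_id: str
    tool_called: str
    outcome: str

@dataclass
class EnforcementRecord:       # evidence of controls
    timestamp: str
    agent_id: str
    policy_id: str
    action_evaluated: str
    decision: str              # "allow" | "block" | "terminate"
    trigger: str               # which rule fired, if any

record = EnforcementRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    agent_id="support-agent-1",
    policy_id="pii-outbound-v3",
    action_evaluated="call:summarization_api",
    decision="block",
    trigger="payload matched rule 'ssn'",
)
print(asdict(record))
```

An auditor asking "what controls were in place during Q1?" is asking for rows of the second type; observability tooling only produces rows of the first.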
When to use Braintrust
Braintrust is the right choice when evaluation and quality optimization are the primary workflow. If you're iterating on prompts, comparing model performance, tracking quality regressions over time, or using Loop to automate prompt tuning — Braintrust is the best tool in the category for this. The browser-based eval workflow and the free tier's generosity make it accessible to teams at any scale.
If your agents are internal-only, don't touch sensitive systems, and don't have compliance requirements, Braintrust may be sufficient on its own for production as well. But the moment tool access, external APIs, PII, or regulatory requirements enter the picture, you're asking Braintrust to answer a question it wasn't built to answer.
When to use Waxell
Waxell is the right choice when production governance is required: policy enforcement, tool access control, compliance audit trails, human-in-the-loop gates. These aren't capabilities Braintrust approximates — they're simply not in scope for an evaluation platform.
For teams in regulated industries — healthcare, financial services, legal — Waxell provides the documented enforcement record that compliance audits require. For teams deploying agents with broad tool access, Waxell provides the control layer that evals can't substitute for.
The stack most production teams end up running
The honest answer is that these tools aren't really competitors — they address different phases of the same workflow:
Development: Braintrust for evaluation and prompt optimization. Run evals constantly, use Loop to improve quality, track regressions before they ship.
Production: Waxell for runtime governance. Enforce policies at execution time, control tool access, produce audit trails, handle the edge cases your evals didn't cover.
The PII scenario above isn't an argument against using Braintrust — it's an argument for adding Waxell to the production layer. Both tools running together give you quality optimization in development and safety enforcement in production. Running only one of them leaves either your quality or your governance unaddressed.
How Waxell handles this: Waxell's runtime governance policies operate at the infrastructure layer — above agent code, evaluated before each tool call and output. A content policy catches PII before it leaves your system. A cost policy terminates runaway sessions automatically. A compliance policy produces enforcement records alongside execution traces, giving you the audit documentation that observability data alone doesn't provide. Three lines of SDK code to instrument; policies defined once and enforced across every agent regardless of framework.
Frequently Asked Questions
What is the difference between Braintrust and Waxell? Braintrust is an evaluation and quality optimization platform for AI agents. It helps teams score outputs, run evaluations against datasets, detect regressions, and use AI-powered prompt tuning (Loop) to improve agent quality during development. Waxell is a runtime governance control plane for production agents. It enforces policies before tool calls execute, controls what agents are allowed to do, and produces compliance audit trails. Braintrust tells you whether your agent is producing good outputs; Waxell controls what your agent is allowed to do while producing them.
Can Braintrust replace Waxell for production governance? No. Braintrust monitors and evaluates quality — it doesn't enforce runtime policies, control tool access, block output content, or produce compliance documentation. These are governance capabilities that require infrastructure-layer enforcement rather than post-hoc evaluation. For production agents with compliance requirements, tool access, or sensitive data handling, Braintrust and Waxell serve different and complementary functions.
Should I use Braintrust and Waxell together? For most production teams, yes. The typical stack is Braintrust in the development and evaluation phase (evals, Loop optimization, regression tracking) and Waxell in the production governance phase (policy enforcement, audit trail, runtime controls). They're designed for different parts of the agent lifecycle and don't duplicate each other's core capabilities.
Is Waxell a Braintrust alternative? Only partially. For observability and production monitoring, Waxell is a full replacement. For Braintrust's evaluation workflow — browser-based evals, AI-powered prompt optimization with Loop, dataset versioning — Waxell doesn't replicate those features. Teams that replace Braintrust with Waxell typically add a separate evaluation tool or maintain Braintrust alongside Waxell for the dev/eval phase.
What is Braintrust's Loop feature, and does Waxell have something equivalent? Braintrust Loop is an AI-powered prompt optimization feature: it analyzes your evaluation results, suggests prompt improvements based on annotated outputs, and iterates to find configurations that improve your quality scores. Waxell doesn't have an equivalent feature — Loop is one of Braintrust's most distinctive capabilities and has no direct counterpart in the governance category. If AI-powered prompt optimization is a core requirement, Braintrust is the right tool for it.
Sources
Braintrust, Documentation (2026) — https://www.braintrust.dev/docs
Braintrust, Pricing (2026) — https://www.braintrust.dev/pricing
LangChain, State of Agent Engineering (2026) — https://www.langchain.com/state-of-agent-engineering
NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0) (2023) — https://doi.org/10.6028/NIST.AI.100-1