Waxell

Product

Assurance

About

Blog

Docs

Early Access

Waxell

Logan Kelly

Mar 5, 2026

The MCP Rug Pull Attack: The Threat That Changes Your Tools After You've Approved Them

An MCP rug pull attack silently changes a tool after you've approved it. Here's how it works, real incidents from 2025, and how to defend against it.

You installed an MCP server. You reviewed the tool descriptions. Everything looked legitimate. You approved it.

Six weeks later, the server pushed a silent update. The tool your agent has been calling — the one you approved — now contains instructions your agent can't ignore and you can't see.

This is the MCP rug pull attack, and it's not theoretical. It happened to thousands of teams in 2025.

What Is an MCP Rug Pull Attack?

An MCP rug pull attack is a supply chain attack in which a malicious or compromised MCP server silently alters a tool's definition or behavior after a developer or system has already approved it. Because most MCP clients verify tools at install time but don't re-alert when definitions change, the agent keeps calling what it believes is a trusted tool — while executing a version that's been quietly weaponized. The approval happened once. The attack runs every time the agent executes.

This exploits something fundamental about how MCP trust works today: approval is an event, not a continuous state. Once a tool is marked trusted, it stays trusted — no matter what the server behind it has become. That's the gap the rug pull lives in.

How Does an MCP Rug Pull Attack Work?

There are four phases, and the patience required for phase two is part of what makes it effective.

Step 1: Publish something genuinely useful. The attacker ships an MCP server with a tool that actually works — a database connector, a web search integration, a code execution environment. The tool descriptions are clean. Nothing raises flags. A developer reviews it, integrates it, and the agents start calling it.

Step 2: Wait. This step doesn't get enough attention. The tool needs to accumulate routine use — the kind of implicit trust that comes from calling something five hundred times and having it return the right thing. Agents that call a tool daily build a pattern. Alerting systems learn that pattern as normal. By the time the payload drops, the tool is load-bearing.

Step 3: Push the update. The attacker modifies the tool definition — either through a direct server update or by compromising the server itself. The new version embeds instructions in the tool metadata that the AI model will process as part of its context. A human reviewing the tool's public name would see nothing different. The model sees everything.

Step 4: Every session from here is compromised. The agent calls the tool just like it always has. From the model's perspective, this is the same trusted tool it's been using for weeks. The embedded instructions land in context and the model follows them — because that's what models do with instructions from trusted sources. The user sees no change. The agent sees new instructions. The gap between them is where the attack lives.

One detail that's easy to miss: in most documented cases, the poisoned tool doesn't take the malicious action directly. It instructs the model to use other legitimate tools to complete the attack — reading files, calling APIs, sending messages. The damage happens through tools the organization already trusts, invoked in a sequence nobody ever authorized.

Why Is This Worse Than Prompt Injection?

Prompt injection is session-scoped. A user crafts a malicious message, the agent processes it, the session ends. Blast radius: one conversation.

An MCP rug pull is persistent. Once a tool definition is silently updated, every agent session that calls that tool is running the poisoned version — not just the session where the update landed. A team with fifty agents using a compromised tool is running fifty simultaneous infections from a single supply-chain event. Clean up one session and you've done nothing; the tool is still serving the bad definition on the next call.

The attack success rates from actual research should make this uncomfortable reading. The MCPTox benchmark tested 20 prominent AI agents against real-world tool poisoning attacks across 45 MCP servers and 353 authentic tools. Attack success rate against o1-mini: 72.8%. Claude 3.7-Sonnet had the highest refusal rate of any model tested — and still refused these attacks less than 3% of the time.

That's not a model safety failure. Models are designed to follow instructions from trusted context sources. A poisoned tool definition is exactly that — trusted, by definition, because it passed approval. You're not fighting model behavior here. You're fighting an architecture that treats approval as permanent.

What Did Real Incidents Look Like?

The attack class moved from theoretical to documented in the second half of 2025, fast.

September 2025 — the postmark-mcp backdoor. A backdoored NPM package serving as an MCP connector for the Postmark email API was pushed to developers who had installed and approved the original. The malicious update propagated silently. Nothing in the standard install or approval flow flagged the change.

October 2025 — the Smithery supply chain attack. Smithery is one of the most widely used hosted MCP server registries — which is exactly what made it worth targeting. Attackers exploited a path-traversal vulnerability in the build configuration system to execute arbitrary code during builds. API tokens and credentials were exfiltrated from over 3,000 hosted applications before the breach was contained. The scale was a direct function of how deeply Smithery had been trusted and integrated.

CVE-2025-54136 (MCPoison). A demonstrated rug pull applied to MCP config files in a major development environment. An attacker commits a benign config file to a project, it's reviewed and approved once, then the payload gets swapped for a malicious version. The tool trusted the key name that had been approved — not the command content that had changed — so the malicious version executed silently on every subsequent project open.

The thread across all three isn't sophisticated tradecraft. It's the same structural bet: that trust, once granted, won't be re-examined.

Why Isn't "Just Pin Your Versions" Enough?

Version pinning is the correct instinct, and it does help. But it has two real failure modes and one operational trap.

The first failure mode: many agent frameworks resolve tool dependencies at runtime. A version pinned in your config may not be what's actually executing in production if your runtime resolves it dynamically. Knowing what version you've declared isn't the same as knowing what version is running.

The second failure mode is the account compromise scenario. When a legitimate maintainer's credentials get stolen and the malicious update gets pushed from the original account, version pinning history looks clean — because the update came from a legitimate publisher. The Smithery attack followed this pattern. Standard version vetting had nothing to flag.

The operational trap: if you commit to never updating, you accumulate unpatched vulnerabilities in tools you depend on. So teams update periodically, which reopens the window pinning was meant to close.

Pinning is a static snapshot. What's actually needed is runtime verification — something that compares what a tool currently claims to be against what was approved, at execution time, on every run. That's a governance problem, not a package management problem.

What Does a Real Defense Look Like?

The attack is a trust continuity problem: the system trusts a tool's current state because it once trusted that tool's past state. There's no mechanism that re-examines that trust continuously. A defense has to break that propagation.

Start with registered, versioned tool identity. Every tool agents are permitted to call should exist in a governance-controlled registry with an explicit versioned definition. Execution validates against that registered version — not against whatever the MCP server happens to be serving at call time. When a tool's definition changes on the server side, the mismatch with the registered version surfaces at the registry layer. That's when you find out, not after your agents have been running the new version for a week.

This is what Waxell's Registry is designed to enforce: execution always refers back to a registered, governance-cleared definition. Nothing runs unless it's been registered — and if the definition the server serves doesn't match what's been cleared, the execution doesn't proceed.

Layer policy validation on top of that. Registration alone isn't enough if your policies don't validate the tool definition at execution time. The question shouldn't be "was this tool approved historically" — it should be "is this tool's current definition consistent with what governance cleared." That check needs to happen before each execution, with no silent override when it fails.

Add result inspection as a last line. Even if a compromised tool does execute, its output enters agent context before the model acts on it. Scanning tool responses for injection patterns and schema anomalies before they're appended to context intercepts the in-context phase of the attack. It won't stop a determined attacker who can generate schema-valid output, but it catches the common patterns.

Keep immutable execution records. When an incident does occur — and with enough surface area, eventually one will — your ability to identify every session that ran against the compromised tool version determines whether your response is surgical or chaotic. Records that can be altered after the fact aren't forensics. Waxell's telemetry is immutable by design: once recorded, execution history can't be changed, which means your incident record is reliable regardless of what happened after.

How Waxell handles this: Waxell's Registry anchors every tool to a registered, versioned identity — agents don't execute whatever an MCP server is currently serving, they execute against what's been registered and governance-cleared. Waxell's policy engine validates tool definitions before each execution: no adaptive interpretation, no silent override, and every blocked execution is recorded. Tool result inspection scans every MCP tool response for injection patterns and schema anomalies before they enter agent context. And because Waxell's telemetry is immutable, you have a complete forensic record of every session that ran against any given tool version — which matters when you're trying to scope an incident fast. Learn how Waxell Runtime governs MCP tool execution → · Get started →

Frequently Asked Questions

What is an MCP rug pull attack? An MCP rug pull attack is a supply chain attack where a malicious or compromised MCP server silently alters a tool's definition after it's been approved. Most MCP clients verify tools at install time but don't re-check when definitions change, so the agent keeps calling a tool it believes is trusted while actually executing a version that may contain hidden instructions. Unlike prompt injection, it's persistent: every session that calls the compromised tool is affected, not just one conversation.

How is this different from prompt injection? Prompt injection is session-scoped — it affects one conversation, via a message a user sent. A rug pull operates at the supply chain layer. The malicious payload is in the tool definition itself, which every session shares. Once a tool is poisoned, every agent that calls it is compromised until the definition is reverted or the tool is removed. The blast radius is determined by how widely the tool is used, not how many malicious messages get crafted.

Does version pinning protect against this? Partially, but not reliably. Pinning doesn't cover account compromise scenarios, where a malicious update is legitimately published from the original maintainer's account — version history looks clean because it is. It also doesn't help if your runtime resolves dependencies dynamically and what's pinned in config isn't what's actually running. And committing to never updating creates its own vulnerability backlog. Pinning reduces risk; continuous runtime integrity verification eliminates the gap pinning leaves open.

What real attacks have occurred? Three notable incidents from 2025: the postmark-mcp backdoor in September (malicious NPM package silently pushed to approved installs); the Smithery supply chain attack in October (build system exploit that exfiltrated API tokens from over 3,000 hosted applications); and CVE-2025-54136 (MCPoison), which demonstrated the pattern directly against MCP config files in a major development environment.

How successful are these attacks? More successful than you'd probably expect. The MCPTox benchmark found a 72.8% attack success rate against o1-mini. Claude 3.7-Sonnet was the most resistant model tested and still refused less than 3% of the time. More capable models tended to be more vulnerable, because the attack exploits instruction-following — the same behavior that makes powerful models powerful. This isn't a problem you solve with a better model.

What's the governance-layer defense? Four things, all of which need to be in place: a tool identity registry that versions and registers definitions so execution validates against a known-good state at runtime (not whatever the server serves at call time); pre-execution policy validation that checks whether the current definition matches what governance cleared; tool result inspection that scans responses for injection patterns before they enter agent context; and immutable execution records that give you a reliable forensic baseline when something does go wrong.

Agentic Governance, Explained

Multi-Agent Orchestration Solves Coordination. It Doesn't Solve Governance.

Orchestration frameworks tell agents what to do. Governance controls what they're allowed to do. In multi-agent systems, you need both — and most teams only have one.

Logan Kelly

Mar 4, 2026

The Hidden Cost of AI Agents: Why Token Spend Spirals and How to Control It

Engineers think in requests. Agents run in loops. Here's the math behind why agent costs explode, and four practical strategies to control token spend in production.

Logan Kelly

Mar 2, 2026

How to Keep PII Out of Your AI Agents (Without Slowing Them Down)

PII enters agent context in ways most teams haven't mapped. Here are the three vectors, how to detect them, and how to enforce data policies without killing performance.

Logan Kelly

Feb 27, 2026

What Is Agentic Governance? (And Why Your AI Team Probably Doesn't Have It)

Agentic governance is the control layer for AI agents in production. Most teams confuse it with observability. Here's what it actually means and how to get there.

Logan Kelly

Feb 26, 2026

Multi-Agent Orchestration Solves Coordination. It Doesn't Solve Governance.

Orchestration frameworks tell agents what to do. Governance controls what they're allowed to do. In multi-agent systems, you need both — and most teams only have one.

Logan Kelly

Mar 4, 2026

The Hidden Cost of AI Agents: Why Token Spend Spirals and How to Control It

Engineers think in requests. Agents run in loops. Here's the math behind why agent costs explode, and four practical strategies to control token spend in production.

Logan Kelly

Mar 2, 2026

Waxell

Waxell provides a governance and orchestration layer for building and operating autonomous agent systems in production.

Product

Company