You can’t patch your way out of prompt injection: AI agents need a different defense

Prompt injection has gone from a lab curiosity to a zero-click data breach in production. The fix isn’t a better filter; it’s a different architecture.

byRilton Franzone

May 20, 2026

5 minute read

A breach with no click

In June 2025, researchers at Aim Security disclosed a flaw in Microsoft 365 Copilot that should change how teams think about AI security. They called it EchoLeak, and it asked nothing of the victim. An attacker sent an ordinary-looking email.

When the user later asked Copilot something about their inbox, the assistant pulled that email into its context, followed instructions hidden inside it, and exfiltrated the user’s private data. The attack required no click, download, or attachment opening by the victim. Microsoft assigned it CVE-2025-32711, scored it 9.3 out of 10, and patched it (1).

EchoLeak was described by researchers as the first real-world zero-click prompt injection exploit against a production LLM system, but the technique behind it is neither new nor exotic. It is one of the central security problems of the agent era, and many teams are still trying to solve it with the wrong tools.

What is a prompt injection?

At the point of interpretation, a language model sees a single stream of tokens. Modern APIs mark off system, user, and tool roles, but the model still resolves them into one context, and those labels do not reliably stop it from following an instruction that arrived as data. Prompt injection is just text that the model treats as a command when you intended it as data.

The dangerous form is indirect. Rather than typing a malicious instruction yourself, an attacker plants it in something the agent will later read: a web page, a support ticket, a PDF, a calendar invite, even low-contrast text in a document or hidden characters in an image. When the agent ingests that content, the smuggled instruction lands in its context, where the model may treat it as actionable even though it arrived as data. Greshake and colleagues formalized this class in 2023 and demonstrated it against real LLM-integrated applications (2). EchoLeak was that paper’s warning made real at enterprise scale.

This is why OWASP has ranked prompt injection the number one risk to LLM applications for two editions running, and frames it as a weakness teams can mitigate but never fully eliminate (3).

Why doesn’t patching end it

The instinct, when something like EchoLeak surfaces, is to patch it and move on. That instinct only half works. You can fix a specific exploit, but you cannot patch the category, because the category is the model doing exactly what it was built to do: follow the instructions in its context.

You can see the limit in the patch record. Months after EchoLeak, Capsule Security disclosed ShareLeak, an indirect prompt-injection flaw in Copilot Studio that Microsoft patched as CVE-2026-21520. The patch closed that specific hole, but Capsule’s research showed how an agent with authorized actions could still be steered toward data exposure through Outlook workflows (4).

The exploit was fixed; the way the attack works was not. Every filter, classifier, and system-prompt guardrail raises the cost for an attacker without closing the category. The most common defense is to add a second model that judges whether the input looks malicious, which only hands the attacker a second model to fool. Guarding an AI with more AI inherits the weakness you started with.

The lethal trifecta

The clearest way to see the real risk comes from Simon Willison, who coined the term prompt injection. He calls the danger zone the lethal trifecta (5). An agent becomes capable of leaking your data the moment it holds three things at once: access to private data, exposure to untrusted content, and a way to communicate with the outside world.

Each can be managed on its own. Combined, they become an exfiltration engine. The untrusted content carries the attacker’s instruction, the private data is the prize, and the external channel is how the prize leaves the building. EchoLeak is the trifecta in a single screenshot: an untrusted email comes in, private inbox data is in reach, and an automatically loaded image is the way out.

Defend by design, not by filter

If the problem is structural, the defense has to be structural too. The most promising research points away from smarter filters and back toward old security principles. Google DeepMind’s CaMeL is the clearest example (6). Instead of asking a model to behave, it treats the user’s request as a trusted program and everything the model retrieves as untrusted input, then uses control-flow and data-flow rules, capability limits, and least privilege to govern what that untrusted data is allowed to touch.

In practice, the system fixes in advance what each step may read and where its output may go, so a line buried in a document cannot redirect the program into emailing your files to a stranger. The injected text can be processed, but under that policy model, it is not allowed to steer the program’s flow. None of that is an AI idea. It is the same discipline that contained untrusted input for decades before LLMs arrived.

For most teams, the everyday version is simpler than CaMeL: break the trifecta. If an agent reads untrusted documents, do not also give it standing access to your secrets and an open outbound channel. Take away one leg. Scope its data to what the task actually needs, route any outbound action through a human or a tight allowlist, and treat every byte of retrieved content as hostile until proven otherwise.

What this looks like in practice

I lead AI engineering at a legal-research company, which means we are handed the lethal trifecta by default. Our agents work with privileged client material; they read contracts, filings, and uploads we did not write, and they draft and retrieve on a user’s behalf. We cannot wish any of that away, so we design around it.

Source text is treated as data and never as instructions, and an agent’s reach is scoped to the matter in front of it. When one of our agents summarizes an uploaded filing, for example, it works with read access to that single document and no open channel to the public internet, so a hidden instruction inside the file has nowhere to send what it reads. Anything that does leave the system clears a boundary we control.

None of this eliminates prompt injection. It makes a successful injection unable to do much, which is the right bar to aim for. EchoLeak’s real lesson is not that one vendor missed a filter. It is that any system holding the trifecta can be one cleverly worded document away from a breach, and the durable answer is to build so that reading a hostile instruction is not enough to act on it.

References

EchoLeak, disclosed by Aim Labs (Aim Security), June 2025; Microsoft 365 Copilot, CVE-2025-32711 (Microsoft CNA CVSS 9.3). Technical analysis: Pavan Reddy and Aditya Sanjay Gujral, “EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System,” AAAI Fall Symposium Series, 2025, arXiv:2509.10540. https://arxiv.org/abs/2509.10540
Kai Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection,” 2023. arXiv:2302.12173. https://arxiv.org/abs/2302.12173
OWASP, “LLM01:2025 Prompt Injection,” OWASP Top 10 for LLM Applications, 2025. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Capsule Security, “ShareLeak: Taking the Wheel of Microsoft’s Copilot Studio (CVE-2026-21520),” 2026. https://www.capsulesecurity.io/blog-post/shareleak-taking-the-wheel-of-microsofts-copilot-studio-cve-2026-21520 . See also “Microsoft patched a Copilot Studio prompt injection. The data exfiltrated anyway,” VentureBeat, 2026.
Simon Willison, “The lethal trifecta for AI agents: private data, untrusted content, and external communication,” June 16, 2025. https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
“Defeating Prompt Injections by Design” (CaMeL), Google DeepMind, 2025. arXiv:2503.18813. https://arxiv.org/abs/2503.18813

The Latest

SabPaisa Partners with AccuKnox for Zero Trust AI-Powered Cloud Security to Secure Its Payments Platform

Anthropic Says Claude Models Hacked 3 Organizations During Cyber Tests

How Status Labs Helps Brands Get Cited in ChatGPT: The Data Behind AI Search Visibility

Wordfence Finds Critical Backdoor in ARVE WordPress Plugin