Why can't models just ignore injected instructions?

Because they cannot reliably distinguish trusted instructions from untrusted content — both arrive as text. That ambiguity is the core vulnerability.

What is indirect prompt injection?

When the malicious instructions are hidden in external content the model retrieves — a page, file, email or tool output — rather than typed by the user. It is often more dangerous.

Can prompt injection be fully prevented?

Not by a single measure today. You reduce risk with layered defenses: least privilege, content isolation, validation, monitoring and human approval for sensitive actions.

How does tool use raise the stakes?

Without tools, injection mostly produces bad text. With tools, an injected instruction can take real actions — send data, make changes — so permissions and approval matter more.

GovernanceUpdated 2026-06-21 · Version 1.0

What is Prompt Injection?

Prompt injection is an attack in which malicious instructions hidden in the input to a language model hijack its behavior — making it ignore its rules, leak data or misuse tools. It tops the OWASP Top 10 for LLM applications. The root cause is that models cannot reliably separate trusted instructions from untrusted content, so any text an agent reads — a web page, a document, a tool result — can carry an attack.

Evidence: Industry observationConfidence: HighSource: Industry observationSource: Paper

Machine-readable: JSON

Definition

Prompt injection is a security attack where adversarial instructions embedded in untrusted input cause a language model to deviate from its intended behavior, bypass safeguards, or perform unintended actions.

Key takeaways

Untrusted text a model reads can contain hidden instructions.
It is the #1 risk in the OWASP Top 10 for LLM applications.
Indirect injection hides payloads in documents, pages or tool outputs.
Risk grows with tool access — injection can trigger real actions.
There is no single fix; defense is layered (least privilege, isolation, human approval).

Context

Models follow instructions in natural language and cannot reliably tell trusted system instructions from untrusted user or document content. An attacker exploits this by planting instructions like 'ignore previous instructions and…' where the model will read them.

Direct injection comes from the user; indirect (and more dangerous) injection hides in content the agent retrieves — a web page, an email, a file, an MCP tool result. As agents gain tool access, a successful injection can exfiltrate data or take harmful actions.

Architecture

Defense is layered, not a single control: least-privilege tool permissions, isolating and clearly delimiting untrusted content, output and action validation, allow-lists for sensitive operations, and human-in-the-loop approval for high-impact actions.

Treat all tool and retrieval outputs as untrusted input. Monitor and log agent actions (observability) so injection attempts are detectable, and red-team the system regularly.

Components

Untrusted input boundaryLeast-privilege permissionsContent isolation / delimitingOutput & action validationHuman approval for high-impact actionsMonitoring & red teaming

Risks

Data exfiltration of sensitive context or credentials.
Unauthorized tool actions in connected systems.
Bypassed safety policies and guardrails.
Indirect attacks via documents, web pages or tool results.

Tools & technologies

Input/output guardrail librariesPermission & sandboxing layersAllow-lists for tool actionsMonitoring / observabilityRed-teaming frameworks

Examples

A web page the agent reads contains hidden text telling it to email private data.
A document instructs a summarizer to ignore its rules and output a malicious link.
A tool result tries to make an agent call another tool it should not.

FAQs

Why can't models just ignore injected instructions?: Because they cannot reliably distinguish trusted instructions from untrusted content — both arrive as text. That ambiguity is the core vulnerability.
What is indirect prompt injection?: When the malicious instructions are hidden in external content the model retrieves — a page, file, email or tool output — rather than typed by the user. It is often more dangerous.
Can prompt injection be fully prevented?: Not by a single measure today. You reduce risk with layered defenses: least privilege, content isolation, validation, monitoring and human approval for sensitive actions.
How does tool use raise the stakes?: Without tools, injection mostly produces bad text. With tools, an injected instruction can take real actions — send data, make changes — so permissions and approval matter more.