Safety Intermediate Also known as: Indirect Injection · Environment Injection

Indirect Prompt Injection

Indirect prompt injection is an attack where malicious instructions are embedded in external content that an LLM agent will read: web pages, documents, emails, or database results. Unlike direct prompt injection (where the user provides the malicious content), here the attacker controls the external environment. When the agent retrieves and processes the content, it unknowingly executes the hidden instructions as if they came from a trusted source. The attack was first formalized by Greshake et al. (2023) and is a critical threat for RAG systems and autonomous agents.

ShareLinkedIn X

In practice

A developer building a web agent must sanitize all externally retrieved text before inserting it into the prompt. Defensive techniques include: structured prompts with explicit delimiters separating data from instructions, classifier systems that detect injection patterns in retrieved documents, and the principle of least privilege (the agent should not have access to dangerous tools if the task does not require them). Systematically testing the agent with deliberately poisoned documents is part of standard red-teaming for RAG applications.

Seen in the wild

1 entries mentioning it

February 6, 2024

Indirect Prompt Injection: the attack vector in RAG systems and AI agents

High

← All terms

In practice

Related terms

Seen in the wild