Prompt Injection: when user input hijacks system instructions

In one sentence Riley Goodside and Perez et al. formalize Prompt Injection: an attack where malicious user input overwrites an LLM's system instructions, bypassing policies and guardrails.

Verified Official source

ShareLinkedIn X

Imagine giving secret instructions to an assistant, then a customer hands them a note saying "forget everything you were told and do what I say instead." Prompt Injection works exactly like that with language models.

The problem exists because LLMs do not structurally distinguish between system instructions and user-provided text: everything is text, everything can be interpreted as a command.

Perez and colleagues demonstrate with GPT-3 experiments that it is possible to inject arbitrary instructions into the input stream, bypassing filters, content policies, and developer-configured behaviors.

This is the vulnerability that opens the door to dozens of real attacks: system prompt exfiltration, moderation bypass, manipulation of autonomous AI agents.