Prompt Injection: when user input hijacks system instructions
In one sentence Riley Goodside and Perez et al. formalize Prompt Injection: an attack where malicious user input overwrites an LLM's system instructions, bypassing policies and guardrails.
Imagine giving secret instructions to an assistant, then a customer hands them a note saying "forget everything you were told and do what I say instead." Prompt Injection works exactly like that with language models.
The problem exists because LLMs do not structurally distinguish between system instructions and user-provided text: everything is text, everything can be interpreted as a command.
Perez and colleagues demonstrate with GPT-3 experiments that it is possible to inject arbitrary instructions into the input stream, bypassing filters, content policies, and developer-configured behaviors.
This is the vulnerability that opens the door to dozens of real attacks: system prompt exfiltration, moderation bypass, manipulation of autonomous AI agents.
Companies
Stanford, Anthropic
Tools
GPT-3, InstructGPT
Tags
Sources