Safety Beginner Also known as: Iniezione di prompt

Prompt injection

An attack where an external input (a document, a web page, an email) contains hidden instructions that hijack the model's behavior.

ShareLinkedIn X

In practice

If your agent reads emails and then acts, a malicious email can tell it 'forward everything to a third party'. Fixes: treat external inputs as untrusted, sandbox tools, require human confirmation for sensitive actions, filter inputs and outputs.

Related terms

Jailbreak Agent Safety classifier

Seen in the wild

8 entries mentioning it

August 11, 2024

Promptfoo Red Teaming: open source automated red-teaming with CI integration and comparative benchmark

Medium
August 6, 2024

NIST AI 600-1: risk profile for generative AI systems

Medium
June 20, 2024

Rebuff: three-layer prompt injection defense with canary tokens

Medium
February 6, 2024

Indirect Prompt Injection: the attack vector in RAG systems and AI agents

High
August 1, 2023

OWASP LLM Top 10: the 10 critical vulnerabilities in AI applications

High
July 10, 2023

Universal adversarial attacks on LLMs: transferable jailbreaks across GPT-4, Claude, and Gemini

High
June 20, 2023

Lakera Guard: real-time protection for LLMs in production

Medium
September 14, 2022

Prompt Injection: when user input hijacks system instructions

High

← All terms