Safety Beginner Also known as: Aggiramento delle protezioni

Jailbreak

A technique where a user talks the model into ignoring its own safety rules, for example by asking it to pretend to be a character with no restrictions.

ShareLinkedIn X

In practice

Different from prompt injection: here it is the user who tries. If you offer a public LLM service this means doing red teaming, logging conversations, and running a safety classifier in cascade over responses.

Related terms

Prompt injection Alignment Red teaming

Seen in the wild

8 entries mentioning it

August 11, 2024

Promptfoo Red Teaming: open source automated red-teaming with CI integration and comparative benchmark

Medium
April 17, 2024

Many-Shot Jailbreaking: safety training overridden by context length

High
March 20, 2024

HarmBench: standardized benchmark for evaluating LLM jailbreaks and defenses

High
February 28, 2024

Crescendo: the multi-turn jailbreak that bypasses guardrails through gradual escalation

Medium
January 12, 2024

Garak: the open source vulnerability scanner for LLMs

Medium
September 27, 2023

PAIR: automated LLM-vs-LLM jailbreaking

High
July 10, 2023

Universal adversarial attacks on LLMs: transferable jailbreaks across GPT-4, Claude, and Gemini

High
June 20, 2023

Lakera Guard: real-time protection for LLMs in production

Medium

← All terms