In practice
Unlike prompt injection, here it is the user who makes the attempt. If you offer a public LLM service, this means red teaming it, logging conversations, and running a safety classifier as a second stage over the model's responses.
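The logging and second-stage classifier steps can be sketched as follows. This is a minimal illustration, not a production design: `guarded_reply`, `keyword_classifier`, `Verdict`, and `REFUSAL` are hypothetical names, the keyword check is a stand-in for a trained safety classifier, and `generate` is whatever function calls your model.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("llm_service")

REFUSAL = "Sorry, I can't help with that."


@dataclass
class Verdict:
    flagged: bool
    reason: str = ""


def keyword_classifier(text: str) -> Verdict:
    """Placeholder classifier (assumption): a real service would call a trained model."""
    blocked_markers = ("step-by-step synthesis", "disable the safety")
    lowered = text.lower()
    for marker in blocked_markers:
        if marker in lowered:
            return Verdict(True, f"matched marker: {marker}")
    return Verdict(False)


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    classify: Callable[[str], Verdict] = keyword_classifier,
) -> str:
    """Generate a response, classify it, log the exchange, and refuse if flagged."""
    response = generate(prompt)
    verdict = classify(response)
    # Log every exchange so flagged conversations can be reviewed later.
    logger.info(json.dumps({
        "prompt": prompt,
        "response": response,
        "flagged": verdict.flagged,
        "reason": verdict.reason,
    }))
    return REFUSAL if verdict.flagged else response
```

The classifier runs on the response rather than the prompt, so it still applies when a jailbreak prompt looks harmless on its own; the logged records can then feed later red-teaming rounds.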
Related terms
Seen in the wild
8 entries mentioning it
- [Medium] Promptfoo Red Teaming: open source automated red-teaming with CI integration and comparative benchmark
- [High] Many-Shot Jailbreaking: safety training overridden by context length
- [High] HarmBench: standardized benchmark for evaluating LLM jailbreaks and defenses
- [Medium] Crescendo: the multi-turn jailbreak that bypasses guardrails through gradual escalation
- [Medium] Garak: the open source vulnerability scanner for LLMs
- [High] PAIR: automated LLM-vs-LLM jailbreaking
- [High] Universal adversarial attacks on LLMs: transferable jailbreaks across GPT-4, Claude, and Gemini
- [Medium] Lakera Guard: real-time protection for LLMs in production