Many-Shot Jailbreaking
Many-shot jailbreaking is an attack technique that exploits long context windows by prepending a large number of fabricated harmful question-answer pairs, typically from around a hundred to 256 or more, before the actual malicious request. The in-context examples can override safety training by inducing the model to continue the demonstrated pattern rather than follow its guardrails. Effectiveness grows with the number of examples, so models with larger context windows, which can hold more examples, are more exposed. The attack was disclosed by Anthropic in 2024 and prompted revisions to safety mechanisms for very long-context models.
In practice
From a defensive standpoint, a developer evaluating a deployed model's robustness should include many-shot tests in red-teaming: construct prompts containing 200 or more malicious Q&A examples and measure how often the model complies. To mitigate the risk in production, one can cap the context window for certain tasks, add input classifiers that detect repeated Q&A patterns on risky topics, or log and flag unusually long prompts for review.
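The input-side mitigations can be illustrated with a minimal sketch of a pre-filter that counts dialogue-style turns embedded in a single prompt and flags unusually long inputs. Everything here is an assumption made for illustration: the names `check_prompt` and `PromptCheck`, the speaker labels in the pattern, and both thresholds are hypothetical and would need tuning on real traffic. The heuristic looks only at structure, not at topic.

```python
import re
from dataclasses import dataclass

# Hypothetical thresholds, chosen for illustration only.
MAX_EMBEDDED_TURNS = 32
MAX_PROMPT_CHARS = 50_000

# Line-initial speaker labels such as "Q:", "User:", "Assistant:".
# The label list is an assumption; real prompts may use other formats.
TURN_LABEL = re.compile(
    r"^[ \t]*(?:q|question|user|human|a|answer|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class PromptCheck:
    embedded_turns: int
    prompt_chars: int
    flagged: bool
    reasons: list


def check_prompt(prompt: str) -> PromptCheck:
    """Flag prompts that embed many dialogue turns or are unusually long."""
    turns = len(TURN_LABEL.findall(prompt))
    reasons = []
    if turns > MAX_EMBEDDED_TURNS:
        reasons.append("many embedded dialogue turns")
    if len(prompt) > MAX_PROMPT_CHARS:
        reasons.append("unusually long prompt")
    return PromptCheck(turns, len(prompt), bool(reasons), reasons)


if __name__ == "__main__":
    # Benign placeholder content; only the repeated structure matters here.
    sample = "\n".join(
        f"User: question {i}\nAssistant: answer {i}" for i in range(100)
    )
    print(check_prompt(sample))
```

A structural check like this is easy to evade by changing the turn format, so it is best treated as one signal routed to logging or human review alongside a topic-aware classifier, not as a standalone defense.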