Many-Shot Jailbreaking: safety training overridden by context length
In one sentence: Anthropic published research on many-shot jailbreaking, in which placing 256 or more fake harmful Q&A pairs in the context window gradually overrides safety training. The vulnerability grows with context length. Anthropic disclosed it responsibly and worked with other providers on countermeasures.
AI models are trained to refuse harmful requests. But Anthropic discovered an unexpected way to make models disregard that training: filling the context with fake examples.
Imagine asking someone an inappropriate question. They refuse. But what if you first showed them 300 fake conversations in which the same kind of question was asked and answered? A language model shown such a transcript will, in some cases, start behaving as if answering is normal.
This is many-shot jailbreaking. The more examples (often called "shots") you put in the context, the higher the probability that the model ignores its training and answers the harmful request. The technique scales: with 256 shots, the attack succeeds significantly more often than with 64.
The problem becomes more pressing as models support increasingly long context windows, which have grown from 8,000 tokens to 100,000 and on to 1 million. The longer the context, the more shots fit inside it, and the stronger the attack. Anthropic published this research responsibly and collaborated with other providers to develop countermeasures.
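To make the link between context length and shot count concrete, here is a rough back-of-the-envelope sketch. The figures of 100 tokens per Q&A pair and 500 tokens reserved for the final question and reply are illustrative assumptions, not numbers from the research, and `max_shots` is a hypothetical helper name.

```python
def max_shots(context_tokens: int, tokens_per_shot: int, reserved_tokens: int = 0) -> int:
    """Return how many whole Q&A pairs ("shots") fit in a context window.

    reserved_tokens is the space set aside for the final question and the
    model's reply, so it is subtracted before dividing.
    """
    if tokens_per_shot <= 0:
        raise ValueError("tokens_per_shot must be positive")
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_shot


# Assumed values, chosen only for illustration.
TOKENS_PER_SHOT = 100
RESERVED_TOKENS = 500

for window in (8_000, 100_000, 1_000_000):
    shots = max_shots(window, TOKENS_PER_SHOT, RESERVED_TOKENS)
    print(f"{window:>9,} tokens -> up to {shots:,} shots")
# Prints 75, 995 and 9,995 shots for the three window sizes.
```

Under these assumptions, an 8,000-token window cannot hold 256 shots at all, while a 100,000-token window holds several times that many, which is why the attack only became practical with long contexts.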
Companies
Anthropic