PAIR: automated LLM-vs-LLM jailbreaking

In one sentence: CMU and UPenn publish PAIR, an attacker LLM that automatically refines its prompts against a target LLM and finds effective jailbreaks in under 20 queries with no human in the loop.



Jailbreaking an AI model used to take hours of manual trial and error: write a prompt, see if it works, adjust it, try again. It was like cracking a combination safe by spinning the dial at random.

Researchers at CMU and UPenn automated this process: instead of a human trying prompts by hand, a second language model acts as the "attacker." The attacker sees how the target responded (for example, "response refused" or "response given") and iteratively refines its attack prompt until it finds one that works.

The system is called PAIR (Prompt Automatic Iterative Refinement). It often finds an effective jailbreak in fewer than 20 queries to the target, even against models like GPT-3.5 and GPT-4. It requires no access to the model's weights or internal representations: it works purely by reading text responses, just as any regular user would.
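The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the separate `judge` callable, and the toy stand-ins for the attacker and target are all hypothetical. No real model is called. The toy target simply "gives in" once a prompt has been revised three times, so the control flow can be run end to end.

```python
from typing import Callable, List, Optional, Tuple

# One history entry: (attack prompt, target response, judged success).
Attempt = Tuple[str, str, bool]


def refine_until_success(
    attacker: Callable[[str, List[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], bool],
    goal: str,
    max_queries: int = 20,
) -> Optional[Tuple[str, int]]:
    """Return (successful prompt, queries used), or None if the budget runs out."""
    history: List[Attempt] = []
    for query in range(1, max_queries + 1):
        prompt = attacker(goal, history)   # attacker proposes a prompt, seeing past attempts
        response = target(prompt)          # black-box access: text in, text out
        success = judge(goal, response)    # was the response a refusal or an answer?
        if success:
            return prompt, query
        history.append((prompt, response, success))
    return None


# Toy stand-ins (hypothetical, not real models) so the sketch is runnable.
def toy_target(prompt: str) -> str:
    # Pretend the target only answers once the prompt carries three "[frame]" tokens.
    return "response given" if prompt.count("[frame]") >= 3 else "response refused"


def toy_attacker(goal: str, history: List[Attempt]) -> str:
    # Add one more "[frame]" token for every earlier refusal.
    return goal + " [frame]" * len(history)


def toy_judge(goal: str, response: str) -> bool:
    return response == "response given"


result = refine_until_success(toy_attacker, toy_target, toy_judge, goal="test goal")
print(result)
```

The only interface the loop needs from the target is "text in, text out", which is why this style of attack works without access to weights. In a real setup the three callables would wrap model API calls, and `max_queries` would cap the cost per attempt.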

The implication is direct: if a human needed hours, PAIR needs seconds. Attacks that once demanded sustained manual effort can now be run at scale. Any security system built on static filters, or on the assumption that jailbreaking requires human effort, needs rethinking.