Universal adversarial attacks on LLMs: transferable jailbreaks across GPT-4, Claude, and Gemini
In one sentence: Zou et al. (CMU) demonstrate automatically optimized suffixes that jailbreak GPT-3.5/4, Claude, and Gemini at once, the first systematic demonstration that such attacks transfer across different models.
All major language models have safety filters meant to stop them from responding to dangerous requests. Until 2023, these filters were widely assumed to be robust against systematic, automated attacks.
The CMU team discovered that appending a seemingly random string of text to the end of a malicious request can make many safety-trained LLMs ignore their filters. The string is found through automated optimization (the Greedy Coordinate Gradient, or GCG, attack) on open-source models such as Vicuna.
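To give a feel for what "automated optimization" over a discrete string means, here is a deliberately simplified toy sketch. It is not the paper's implementation: it involves no language model and no gradients, and `toy_loss`, `TARGET`, `VOCAB_SIZE`, and `SUFFIX_LEN` are hypothetical stand-ins invented for illustration. It shows only the general shape of a greedy coordinate search: propose single-token substitutions, score them, and keep the best one if it lowers the loss. As far as the paper describes it, GCG additionally uses gradients to shortlist promising substitutions before scoring them; here random proposals are swapped in for that step.

```python
import random

# All constants below are hypothetical toy values, not taken from the paper.
VOCAB_SIZE = 50                          # size of the toy "vocabulary"
SUFFIX_LEN = 8                           # number of token slots in the suffix
TARGET = [7, 3, 42, 19, 0, 25, 11, 36]   # hidden optimum of the toy objective


def toy_loss(suffix):
    """Stand-in for a model loss: count of positions that differ from TARGET."""
    return sum(1 for s, t in zip(suffix, TARGET) if s != t)


def greedy_coordinate_search(steps=2000, candidates=32, seed=0):
    """Greedy search over single-token substitutions (random proposals, no gradients)."""
    rng = random.Random(seed)
    suffix = [rng.randrange(VOCAB_SIZE) for _ in range(SUFFIX_LEN)]
    best = toy_loss(suffix)
    for _ in range(steps):
        # Propose a batch of candidates, each changing exactly one position.
        proposals = []
        for _ in range(candidates):
            pos = rng.randrange(SUFFIX_LEN)
            tok = rng.randrange(VOCAB_SIZE)
            cand = suffix[:pos] + [tok] + suffix[pos + 1:]
            proposals.append((toy_loss(cand), cand))
        # Keep the best candidate only if it strictly improves the loss.
        cand_loss, cand = min(proposals, key=lambda p: p[0])
        if cand_loss < best:
            suffix, best = cand, cand_loss
        if best == 0:
            break
    return suffix, best


if __name__ == "__main__":
    final_suffix, final_loss = greedy_coordinate_search()
    print(final_suffix, final_loss)
```

The point of the sketch is the loop structure, not the objective: in the real setting the loss comes from a model rather than a fixed target, which is why the resulting string looks random to a human reader.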
The most alarming finding is transferability: a suffix optimized on Vicuna also works on GPT-4, Claude, and Gemini, models the researchers had no access to during optimization. This suggests the attack is structural, not tied to a specific weakness of one model.
The paper prompted AI labs to reassess their assumptions about the robustness of safety training.
Companies
CMU, OpenAI, Anthropic, Google
Tools
GCG Attack