Crescendo: the multi-turn jailbreak that bypasses guardrails through gradual escalation
In one sentence: Microsoft researchers found that a sequence of innocent-looking requests, each nudging slightly past the boundary set by the previous turn, leads GPT-4 and Claude to produce output that a single direct request would never elicit.
Language model guardrails are trained to recognize and block explicitly dangerous requests. But what happens when no single request is dangerous on its own?
The Crescendo technique, developed by Microsoft researchers, builds a multi-turn conversation where every message is innocuous and plausible in the context established by previous messages. Slowly, the conversation is steered toward territory the model would never have explored if the final destination had been requested directly.
It is the conversational equivalent of the boiling frog: no single step exceeds the model's alarm threshold, but the sum of steps leads to output that would clearly violate policy if requested directly.
The attack works on GPT-4 and Claude, two models with some of the most sophisticated guardrails available, which suggests the problem is structural: safeguards that judge each turn only in its local context miss an escalation that is visible only across the conversation as a whole.
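The boiling-frog argument can be made concrete with a toy model. The sketch below is purely illustrative and is not Microsoft's method or any real guardrail: the integer "step scores" (how far each turn moves beyond the context already established), the threshold of 50, and both function names are assumptions invented for this example. It only shows why a check applied to each turn in isolation can stay silent while a check over the whole conversation does not.

```python
from itertools import accumulate

# Hypothetical alarm threshold on an arbitrary 0-100 scale (an assumption
# for this sketch, not a value from the Crescendo research).
ALARM_THRESHOLD = 50


def flagged_per_turn(step_scores, threshold=ALARM_THRESHOLD):
    """Return indices of turns whose own step score reaches the threshold.

    This models a check that looks at each turn only in local context.
    """
    return [i for i, score in enumerate(step_scores) if score >= threshold]


def first_flagged_cumulative(step_scores, threshold=ALARM_THRESHOLD):
    """Return the index of the first turn at which the running total of
    step scores reaches the threshold, or None if it never does.

    This models a check that tracks drift across the whole conversation.
    """
    for i, total in enumerate(accumulate(step_scores)):
        if total >= threshold:
            return i
    return None


# Made-up scores for a five-turn conversation where every step is small.
escalating = [10, 15, 20, 15, 20]

print(flagged_per_turn(escalating))          # [] : no single turn trips the alarm
print(first_flagged_cumulative(escalating))  # 3  : running total 10, 25, 45, 60
```

In this toy setting the per-turn check never fires, while the cumulative check fires on the fourth turn. How a real system would score turns, and whether summing is the right way to aggregate them, is left open here.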
Companies: Microsoft
Tools: Crescendo