Multi-Agent Debate: making multiple LLMs argue improves reasoning by about 20%

In one sentence: MIT and Google researchers show that having multiple LLM instances debate and critique each other's answers over N rounds yields more accurate results, about 20% better on arithmetic and reasoning benchmarks than a single agent, establishing the debate-based verification pattern used in modern agents.

If you ask ten different people the same difficult math question and then have them discuss their answers together, the group's final answer is often more accurate than any individual's. This phenomenon, known for centuries in philosophy and sociology, works with language models too.

MIT and Google researchers took multiple copies of the same LLM, had each answer the question, and then had each read the others' answers and revise its own. When this cycle is repeated for a few rounds, the answers converge toward correct ones much more often than with a single model.
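The cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the names debate and make_stub_agent are hypothetical, the agents are toy stand-ins rather than real LLM calls, and the final majority vote is one plausible way to pick an answer that the summary above does not specify.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical agent interface: (question, own previous answer, peers' answers) -> answer.
# In a real system this would wrap a call to an LLM; here it is any Python callable.
Agent = Callable[[str, Optional[str], List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run `rounds` revision rounds and return the most common final answer."""
    # Round 0: every agent answers independently, without seeing anyone else.
    answers = [agent(question, None, []) for agent in agents]

    for _ in range(rounds):
        # Each agent reads the others' latest answers and may revise its own.
        answers = [
            agent(question, answers[i], answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]

    # Assumption: the final answer is chosen by majority vote.
    return Counter(answers).most_common(1)[0][0]


def make_stub_agent(first_guess: str) -> Agent:
    """Toy stand-in for an LLM: it adopts an answer held by a strict
    majority of its peers, and otherwise keeps its own answer."""
    def agent(question: str, own: Optional[str], peers: List[str]) -> str:
        if own is None or not peers:
            return first_guess if own is None else own
        top, count = Counter(peers).most_common(1)[0]
        return top if count > len(peers) / 2 else own
    return agent


# Usage: two agents start with "42", one starts with "41".
agents = [make_stub_agent("42"), make_stub_agent("42"), make_stub_agent("41")]
print(debate("What is 6 * 7?", agents, rounds=2))  # prints 42
```

The stub agents only follow the majority, so this sketch shows the control flow (answer, read, revise, repeat) and nothing about accuracy. The reported gains come from real models actually critiquing one another's reasoning, which a stub cannot reproduce.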

The improvement is substantial: on arithmetic and logical reasoning problems, the percentage of correct answers rises by about 20%. You don't need a bigger or more expensive model: just multiple instances of the same model that critique each other.

This result has inspired many subsequent systems that use debate as an internal verification mechanism.