Diffusion Policy: robot imitation learning goes multi-modal with diffusion models

In one sentence: MIT and Columbia apply denoising diffusion models to robot imitation learning, learning multi-modal action distributions instead of deterministic policies and achieving a 46.9% average improvement on manipulation benchmarks.

When you teach someone a physical task, there are often many correct ways to do it. A robot trained with conventional imitation learning typically ends up with a deterministic policy: for each situation it commits to a single response and follows it blindly, even when adapting would be better. Diffusion Policy targets exactly this problem.

The idea comes from the world of AI image generation: diffusion models learn to turn random noise into realistic outputs by "denoising" it step by step. MIT and Columbia applied the same principle to robot actions: instead of predicting a single definitive action, the robot starts from random noise and gradually "denoises" it into an action, so it can land on any one of the plausible solutions rather than on a blend of them.
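To make the idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a toy one-dimensional task in which steering left (-1.0) and steering right (+1.0) are equally valid and steering straight (0.0) is not. In place of a trained neural network it uses a closed-form gradient for that toy distribution, and it uses annealed Langevin updates as the denoising procedure. All names (MODES, MODE_STD, NOISE_LEVELS, score, sample_actions) are hypothetical and chosen for illustration.

```python
import numpy as np

# Hypothetical toy task: steering left (-1.0) or right (+1.0) around an
# obstacle is equally good; steering straight (0.0) hits it.
MODES = np.array([-1.0, 1.0])
MODE_STD = 0.05                              # spread of demonstrations per mode
NOISE_LEVELS = np.geomspace(2.0, 0.01, 30)   # heavy noise down to almost none


def score(x, sigma):
    """Gradient of the log-density of the demonstrations blurred by noise sigma.

    Stand-in for a trained network: in this toy case the gradient is known
    in closed form, so nothing has to be learned.
    """
    var = MODE_STD**2 + sigma**2
    diff = MODES[None, :] - x[:, None]             # offset to each mode, (n, 2)
    logits = -0.5 * diff**2 / var
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # how well each mode explains x
    return (weights * diff).sum(axis=1) / var


def sample_actions(n, rng, steps_per_level=20, eps=0.1):
    """Draw n actions by denoising pure noise with annealed Langevin steps."""
    x = rng.normal(0.0, NOISE_LEVELS[0], size=n)   # start from random noise
    for sigma in NOISE_LEVELS:
        step = eps * sigma**2                      # smaller steps as noise shrinks
        for _ in range(steps_per_level):
            noise = rng.normal(size=n)
            x = x + 0.5 * step * score(x, sigma) + np.sqrt(step) * noise
    return x


rng = np.random.default_rng(0)
actions = sample_actions(2000, rng)
print("share steering right:", (actions > 0).mean())
print("average of all actions:", actions.mean())
```

In this sketch every sampled action ends up close to one of the two valid choices, and the samples split between them. Their average sits near 0.0, the one action that fails, which is roughly what a policy trained to output a single averaged answer would be pulled toward.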

The practical result is that the robot can handle ambiguous tasks where multiple approaches are valid, can change strategy mid-task, and generally performs better in situations where a traditional robot would get stuck choosing the wrong solution.

The numbers speak clearly: an average improvement of 46.9% on standard manipulation benchmarks. Diffusion Policy quickly became a standard component in modern robotic systems, including ALOHA 2 and Physical Intelligence's pi-zero.