Zephyr-7B: DPO on Mistral 7B beats Llama-2-70B-chat on MT-Bench

In one sentence: Hugging Face trains Zephyr-7B on the Mistral 7B base model with distilled supervised fine-tuning (dSFT) followed by Direct Preference Optimization, achieving a higher MT-Bench score than Llama-2-70B-chat with 10x fewer parameters.



Zephyr-7B is a 7-billion-parameter model that manages to beat a 70-billion-parameter model on conversation benchmarks. How is that possible?

The key is the alignment method: instead of classic RLHF, which requires training a separate reward model, Hugging Face used Direct Preference Optimization (DPO), a simpler algorithm that optimizes the model directly on preference data (pairs of preferred and rejected responses).
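To make the "no separate reward model" point concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, in plain Python. It is an illustration, not Zephyr's training code. The function name dpo_loss, its argument names, and the default beta=0.1 are assumptions chosen for this example. The inputs are the summed log-probabilities that the model being trained (the policy) and a frozen reference model assign to the preferred and rejected responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    All arguments are log-probabilities of a full response given the prompt.
    beta=0.1 is an assumed illustrative value, not a setting from the article.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin: positive when the policy favors the chosen
    # response more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))

# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The only models involved are the policy and a frozen copy used as the reference. The preference signal enters through the log-probability ratios, so no reward model is trained or queried.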

Zephyr shows that with the right alignment technique, a well-trained small model can be more useful than a much larger but worse-aligned one.