Zephyr-7B: DPO on Mistral 7B beats Llama-2-70B-chat on MT-Bench
In one sentence: HuggingFace trains Zephyr-7B with distilled supervised fine-tuning (dSFT) followed by Direct Preference Optimization (DPO) on the Mistral 7B base model, achieving a higher MT-Bench score than Llama-2-70B-chat with 10x fewer parameters.
Zephyr-7B is a 7-billion-parameter model that beats a 70-billion-parameter model on the MT-Bench conversation benchmark. How is that possible?
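The key is the alignment method: instead of classic RLHF, which requires training a separate reward model and then running reinforcement learning against it, HuggingFace used Direct Preference Optimization (DPO), a simpler algorithm that optimizes the model directly on pairs of preferred and rejected responses. In Zephyr's case the preference pairs come from AI feedback rather than from human annotators.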
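To make the "no separate reward model" point concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model have already been computed. The function names, argument names, and the value beta=0.1 are illustrative assumptions, not Zephyr's actual training code or settings.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the total log-probability of a response given the
    prompt. beta is an illustrative value (assumption) that scales how
    strongly the policy is pushed away from the reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # The loss is -log(sigmoid(margin)), which equals softplus(-margin).
    margin = chosen_reward - rejected_reward
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reward signal here is computed from the policy and reference log-probabilities themselves, so training reduces to ordinary gradient descent on this loss over a dataset of preference pairs.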
Zephyr shows that with the right alignment technique, a well-trained small model can be more useful than a much larger but worse-aligned one.
Companies
HuggingFace
Tools
Zephyr-7B, Mistral-7B
Tags
Sources