Mixtral 8x7B: open-source Mixture of Experts that beats GPT-3.5

In one sentence: Mistral drops Mixtral 8x7B via magnet link with no warning. It is a sparse Mixture of Experts (SMoE) with 8 experts per layer and about 13B active parameters out of about 47B total. Performance matches or exceeds GPT-3.5 on most benchmarks. Apache 2.0.



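Mistral does something that instantly becomes legend: it posts a magnet link to a torrent on Twitter with zero explanation. Inside is Mixtral 8x7B, a model built on a Mixture of Experts (MoE) architecture: each layer has eight "expert" feed-forward blocks, and a router sends each token to only two of them. The "8x7B" name overstates the size, because only the feed-forward blocks are replicated while the attention layers are shared. Result: about 47B total parameters rather than 56B, but per-token compute like a 13B model.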
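The routing step described above can be sketched in a few lines. This is a minimal illustration, not Mistral's code: the function name `top2_moe`, the toy experts that just scale their input, and the random router weights are all hypothetical. It assumes the router takes a softmax over only the two selected logits, which is how the Mixtral paper describes the gate.

```python
import math
import random

def top2_moe(x, gate_w, experts, k=2):
    """Route one token vector x through the k highest-scoring experts."""
    # Router logits: one score per expert (gate_w has one row per expert).
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_w]
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the chosen experts' outputs; the other experts never run.
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * y_j for o, y_j in zip(out, y)]
    return out, top, weights

# Toy setup (hypothetical): 8 experts, each just scales its input by (i + 1).
random.seed(0)
dim, n_experts = 4, 8
gate_w = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
calls = []  # records which experts were actually executed

def make_expert(i):
    def expert(x):
        calls.append(i)
        return [(i + 1) * v for v in x]
    return expert

experts = [make_expert(i) for i in range(n_experts)]
x = [1.0, -0.5, 0.25, 2.0]
out, chosen, weights = top2_moe(x, gate_w, experts)
print("experts used:", chosen, "weights:", weights)
```

In the real model each expert is a full feed-forward block and the routing decision is made independently at every layer, so a token may visit different experts as it moves through the network.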

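Translated: it runs about as fast as a small model but knows as much as a mid-large one, though all 47B parameters still have to sit in memory. Mistral's benchmarks show Mixtral matching or beating GPT-3.5 and LLaMA 2 70B on most tests, including MMLU, MT-Bench, and GSM8K, while activating roughly 5× fewer parameters per token than LLaMA 2 70B.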
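The gap between total and active parameters can be checked with back-of-the-envelope arithmetic. The sketch below assumes the hyperparameters from the published Mixtral 8x7B configuration (they are not stated in this article) and ignores the small normalization weights. The variable names are illustrative.

```python
# Assumed hyperparameters, taken from the published Mixtral 8x7B config.
dim, n_layers, hidden_dim = 4096, 32, 14336
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab_size, n_experts, top_k = 32000, 8, 2

# One expert is a gated feed-forward block: three dim x hidden_dim matrices.
expert = 3 * dim * hidden_dim
# Attention: query and output projections, plus smaller key/value projections
# (grouped-query attention uses fewer key/value heads).
attn = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
# Router: one score per expert.
router = dim * n_experts
# Input embedding plus output projection.
embed = 2 * vocab_size * dim

def param_count(experts_used):
    """Parameters touched when `experts_used` experts run in every layer."""
    return n_layers * (attn + router + experts_used * expert) + embed

total = param_count(n_experts)   # everything stored in memory
active = param_count(top_k)      # what a single token actually uses

print(f"total  ~{total / 1e9:.1f}B")
print(f"active ~{active / 1e9:.1f}B")
print(f"LLaMA 2 70B / active ~{70e9 / active:.1f}x")
```

Under these assumptions the count comes to roughly 46.7B total and 12.9B active, which is where the rounded "47B" and "13B" figures come from. The experts account for almost all of the total, and that is why skipping six of the eight per token saves so much compute.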

Apache 2.0 license: free for commercial use, with only the usual attribution and notice conditions. Within days of the torrent, Hugging Face hosts it officially and llama.cpp adds support, and it goes on to be downloaded millions of times. It's the first seriously capable open MoE, and it rewrites the roadmap of anyone training dense models.