DeepSeek-V2: Multi-head Latent Attention and the first highly efficient Chinese open MoE

In one sentence: DeepSeek releases V2, a 236B-total / 21B-active MoE model with Multi-head Latent Attention (MLA) that drastically cuts the KV cache, prices its API far below rivals, and ignites a price war in which Chinese providers cut API prices by up to 90%.

A Chinese startup called DeepSeek releases the weights of a large model for free. The model is built around two key ideas.

First: like Mixtral, it uses a large "Mixture of Experts" design (236 billion parameters in total) but activates only 21 billion of them per token (a token is roughly a word or word fragment).
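To make the "total versus active" distinction concrete, here is a minimal sketch of top-k expert routing in plain Python. It is only an illustration: the expert count, the scores, the value of k, and the function name `top_k_route` are all hypothetical, and renormalising the selected weights is an assumption of this sketch rather than a description of DeepSeek-V2's actual router. Only the 236B and 21B figures come from the text.

```python
import math

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Softmax over the router scores (subtract the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalise their weights to sum to 1
    # (an assumption of this sketch).
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Toy router scores for one token over 8 hypothetical experts.
scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
routes = top_k_route(scores, k=2)

# Figures from the text: 236B total parameters, 21B active per token.
active_fraction = 21 / 236

print("experts used for this token:", [i for i, _ in routes])
print(f"share of parameters active per token: {active_fraction:.1%}")
```

Only the selected experts run for a given token, so compute per token scales with the active parameters (about 9% of the total here) rather than with the full model size.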

Second: it introduces a technique called MLA (Multi-head Latent Attention) that drastically compresses the "memory" (the key-value cache) the model has to keep during long conversations. Result: the model is 5-10× cheaper to run.
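A rough way to see why compressing this memory matters is to count what has to be stored per token. The sketch below compares a standard multi-head attention cache (a full key and value for every head) with an MLA-style cache (one shared compressed latent plus a small positional key). All dimensions, the layer count, the context length, and the function names are illustrative assumptions chosen for this example; they are not taken from the text.

```python
def kv_elements_per_token_mha(n_heads, head_dim):
    # Standard multi-head attention caches a full key and a full value per head.
    return 2 * n_heads * head_dim

def kv_elements_per_token_mla(latent_dim, rope_dim):
    # MLA-style caching: one compressed latent vector shared by all heads,
    # plus a small separate key that carries positional information.
    return latent_dim + rope_dim

def cache_bytes(elements_per_token, n_layers, seq_len, bytes_per_element=2):
    # Total cache size for one sequence, assuming 16-bit (2-byte) elements.
    return elements_per_token * n_layers * seq_len * bytes_per_element

# Illustrative (assumed) sizes for this sketch.
N_HEADS, HEAD_DIM = 128, 128
LATENT_DIM, ROPE_DIM = 512, 64
N_LAYERS, SEQ_LEN = 60, 32_000

mha = kv_elements_per_token_mha(N_HEADS, HEAD_DIM)
mla = kv_elements_per_token_mla(LATENT_DIM, ROPE_DIM)
ratio = mha / mla

print("per-token, per-layer elements (MHA):", mha)
print("per-token, per-layer elements (MLA):", mla)
print(f"cache size ratio: {ratio:.1f}x")
print("MHA cache GiB:", cache_bytes(mha, N_LAYERS, SEQ_LEN) / 2**30)
print("MLA cache GiB:", cache_bytes(mla, N_LAYERS, SEQ_LEN) / 2**30)
```

A smaller cache per conversation lets more conversations, or longer ones, fit on the same hardware, which is where much of the serving-cost saving comes from.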

They also offer an API at very low prices (~14× cheaper than GPT-4-Turbo). In China this triggers a price war: Alibaba, Baidu, and ByteDance all cut their prices by up to 90%.