DeepSeek-V3: GPT-4o Quality at $0.55/M Tokens via MLA and FP8 Pipeline
In one sentence: the DeepSeek-V3 technical report describes how Multi-head Latent Attention and an FP8 mixed-precision training pipeline deliver GPT-4o-level performance at $0.55/M tokens, with a 671B-parameter MoE model trained on an H800 cluster under tight budget constraints.
When DeepSeek released its V3 model at the end of December 2024, the AI world received a shock: a Chinese model trained on a much smaller budget than its Western competitors matched the best commercial models, and cost 20 to 40 times less to use.
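The technical report explains the engineering behind this result. The first ingredient is Multi-head Latent Attention (MLA), an attention mechanism introduced in DeepSeek-V2 that compresses keys and values into a small latent vector per token. This drastically shrinks the KV cache needed during generation, which allows larger batches and lowers memory costs. The second is an FP8 mixed-precision training pipeline: most of the heavy matrix multiplications run in 8-bit floating point, half the width of the usual 16-bit formats, while numerically sensitive components stay at higher precision. This reduces memory requirements and increases training speed.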
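To make the memory argument concrete, here is a rough back-of-the-envelope sketch in Python. The dimensions (61 layers, 128 heads of width 128, a 512-wide KV latent plus a 64-wide positional key) are assumed values for illustration, as is the comparison against a plain multi-head attention cache and a 16-bit baseline for the weights. The function and class names are hypothetical and not part of any DeepSeek code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    # All defaults are illustrative assumptions, not values quoted in this article.
    n_layers: int = 61         # transformer layers
    n_heads: int = 128         # attention heads per layer
    head_dim: int = 128        # width of each head
    kv_latent_dim: int = 512   # compressed KV latent cached by MLA
    rope_key_dim: int = 64     # extra positional key cached alongside the latent


def mha_cache_elems_per_token(cfg: AttentionConfig) -> int:
    """Standard multi-head attention caches a full key and value per head."""
    return 2 * cfg.n_heads * cfg.head_dim * cfg.n_layers


def mla_cache_elems_per_token(cfg: AttentionConfig) -> int:
    """MLA caches one shared latent vector (plus a small positional key) per layer."""
    return (cfg.kv_latent_dim + cfg.rope_key_dim) * cfg.n_layers


def weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Raw storage for the weights at a given numeric width."""
    return n_params * bytes_per_param


cfg = AttentionConfig()
mha = mha_cache_elems_per_token(cfg)
mla = mla_cache_elems_per_token(cfg)
print(f"MHA cache elements per token: {mha:,}")
print(f"MLA cache elements per token: {mla:,}")
print(f"Reduction factor: {mha / mla:.1f}x")

n_params = 671_000_000_000
print(f"Weights at 16-bit: {weight_bytes(n_params, 2) / 1e9:.0f} GB")
print(f"Weights at 8-bit:  {weight_bytes(n_params, 1) / 1e9:.0f} GB")
```

Under these assumptions the per-token cache shrinks by a factor of roughly 57, which is why the same GPU memory can hold far more concurrent sequences. The weight figures only illustrate the effect of halving the byte width. A real mixed-precision setup keeps some tensors in wider formats, so the actual saving is smaller.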
The impact was enormous: it demonstrated that an arms race in training budgets is not the only possible path. With the right architectural and engineering choices, a frontier model can be built for less than 6 million dollars of compute for the final training run, at a time when competitors were spending hundreds of millions. The report immediately became required reading for any team working on AI infrastructure.
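The headline compute figure is simple arithmetic: GPU-hours multiplied by an hourly rental price. The sketch below assumes about 2.788 million H800 GPU-hours at $2 per GPU-hour. Both inputs are treated as assumptions here, and they cover only the final training run, not earlier research or ablation experiments.

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Estimate compute cost as rented GPU time multiplied by an hourly rate."""
    return gpu_hours * usd_per_gpu_hour


# Assumed inputs for illustration; adjust to match your own pricing.
GPU_HOURS = 2_788_000
USD_PER_GPU_HOUR = 2.0

cost = training_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)
print(f"Estimated training compute: ${cost / 1e6:.3f}M")
```

Because the estimate is linear in both inputs, doubling the assumed rental price doubles the total. The figure is best read as an order-of-magnitude comparison against nine-figure training budgets.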