AImpact
High AI Infrastructure · 1 min read

DeepSeek-V3: GPT-4o Quality at $0.55/M Tokens via MLA and FP8 Pipeline

In one sentence: the DeepSeek-V3 technical report describes Multi-head Latent Attention and an FP8 mixed-precision training pipeline that together achieve GPT-4o-level performance at $0.55/M tokens, training a 671B-parameter MoE model on an H800 cluster under tight budget constraints.

When DeepSeek released its V3 model at the end of 2024, the AI world received a shock: a Chinese model trained on a much smaller budget than its Western competitors delivered performance comparable to the best commercial models, at a usage cost 20-40 times lower.

The technical report lays out the engineering reasons for this result. The first is Multi-head Latent Attention (MLA), an attention mechanism first introduced in DeepSeek-V2. It compresses keys and values into a small latent vector, which drastically shrinks the KV cache needed during generation and so enables larger batches and lower memory costs. The second is an FP8 mixed-precision training pipeline: FP8 stores each value in 8 bits, half the width of the 16-bit formats that are standard for training, which cuts memory requirements and increases training speed.
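To make the two savings concrete, here is a minimal back-of-the-envelope sketch in Python. The layer count, head count, head size, latent size, and positional-key size are assumptions taken from the configuration published for DeepSeek-V3 and should be checked against the report. The class and function names are hypothetical, and the weight figures cover parameter storage only, not optimizer state or activations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    # Assumed values, as published for DeepSeek-V3; verify against the report.
    n_layers: int = 61
    n_heads: int = 128
    head_dim: int = 128
    kv_latent_dim: int = 512  # compressed KV latent that MLA caches
    rope_dim: int = 64        # small positional key cached alongside it


def mha_cache_elems_per_token(cfg: AttentionConfig) -> int:
    """Standard multi-head attention caches a full key and value per head."""
    return 2 * cfg.n_heads * cfg.head_dim * cfg.n_layers


def mla_cache_elems_per_token(cfg: AttentionConfig) -> int:
    """MLA caches one shared latent vector plus a positional key per layer."""
    return (cfg.kv_latent_dim + cfg.rope_dim) * cfg.n_layers


def weight_gigabytes(n_params: float, bytes_per_param: int) -> float:
    """Storage for the parameters alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


cfg = AttentionConfig()
mha = mha_cache_elems_per_token(cfg)
mla = mla_cache_elems_per_token(cfg)

print(f"MHA cache: {mha:,} elements per token")
print(f"MLA cache: {mla:,} elements per token")
print(f"Reduction: {mha / mla:.1f}x")
print(f"671B weights at 2 bytes (BF16): {weight_gigabytes(671e9, 2):,.0f} GB")
print(f"671B weights at 1 byte (FP8):   {weight_gigabytes(671e9, 1):,.0f} GB")
```

Under these assumptions the per-token cache shrinks by a factor of roughly 57 relative to uncompressed multi-head attention, which is what frees memory for larger batches. The weight comparison only illustrates the "half the width" point; in a mixed-precision setup some tensors stay in higher precision, so the real saving is smaller than a clean halving.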

The impact was enormous: it demonstrated that the arms race in training budgets is not the only possible path. With the right architectural and engineering choices, a frontier model can be trained for less than 6 million dollars of compute, at a time when competitors were reportedly spending hundreds of millions. The report immediately became required reading for any team working on AI infrastructure.
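The "less than 6 million dollars" figure can be reproduced with simple arithmetic. The sketch below uses the GPU-hour breakdown and the rental price of $2 per H800 GPU-hour as quoted in the technical report. The rate is the report's own assumption rather than a measured cost, the total covers only the final training run and excludes prior research and ablation experiments, and the variable and function names are hypothetical.

```python
# GPU-hour figures as quoted in the DeepSeek-V3 technical report (H800 hours).
GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}

# Rental price assumed by the report, not a measured cost.
USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours: dict[str, int], usd_per_hour: float) -> float:
    """Total compute cost: summed GPU-hours times the hourly rental rate."""
    return sum(gpu_hours.values()) * usd_per_hour


total_hours = sum(GPU_HOURS.values())
cost = training_cost_usd(GPU_HOURS, USD_PER_GPU_HOUR)

print(f"Total GPU-hours: {total_hours:,}")
print(f"Estimated cost:  ${cost:,.0f}")
```

With these inputs the total comes to 2.788 million GPU-hours and about 5.58 million dollars. Changing `USD_PER_GPU_HOUR` shows how sensitive the headline number is to the assumed rental price.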

Companies

DeepSeek

Tags

DeepSeek V3, MLA, FP8, cost-efficient, inference, MoE, H800
