FP8 Training with NVIDIA Transformer Engine: Half the Memory, Same Quality

In one sentence: NVIDIA Transformer Engine brings FP8 (E4M3/E5M2) mixed-precision training with automatic per-tensor scaling, halving memory versus BF16 with less than 0.5% quality loss and making it feasible to train 70B models on half the hardware.

Numbers representing AI model weights can be stored with different amounts of precision, similar to the difference between writing "3.14159265" and "3.1". More precision uses more memory, but too little precision degrades model quality.

For years, training has used BF16 (16 bits per number) as the standard compromise. NVIDIA has now brought this down to FP8 (only 8 bits per number), which takes exactly half the memory per stored value. The problem is that FP8 has a very limited numerical range, and values during training can easily overflow or underflow to zero if not carefully managed.
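The arithmetic behind both points can be sketched in a few lines of Python. The 70B parameter count is taken from the summary above; the function and table names are illustrative, and the memory figures cover only the raw weights, not gradients, optimizer state, or activations. The range values follow from the E4M3 and E5M2 bit layouts.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"BF16": 2, "FP8": 1}

def weight_memory_gb(num_params: int, fmt: str) -> float:
    """Memory for the raw weights only, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_VALUE[fmt] / 1e9

# (largest finite magnitude, smallest positive magnitude) per FP8 format.
FP8_RANGE = {
    "E4M3": (1.75 * 2.0**8, 2.0**-9),    # 448 and about 0.00195
    "E5M2": (1.75 * 2.0**15, 2.0**-16),  # 57344 and about 0.0000153
}

params_70b = 70_000_000_000
print(weight_memory_gb(params_70b, "BF16"))  # 140.0
print(weight_memory_gb(params_70b, "FP8"))   # 70.0

for name, (largest, smallest) in FP8_RANGE.items():
    print(name, largest, smallest)
```

Anything larger than the first number in a format's range overflows, and anything much smaller than the second is rounded to zero, which is why unscaled gradients are at risk in FP8.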

Transformer Engine solves this problem automatically. For each critical operation (the matrix multiplications in attention and feed-forward layers), it dynamically measures the magnitude of the values and applies a separate scaling factor to each tensor, keeping values within FP8's usable range. The result is all the advantages of reduced memory (larger models on the same hardware, larger batches, faster training) with only minimal code changes and an almost imperceptible quality difference.
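The idea of per-tensor scaling can be illustrated with a minimal pure-Python simulation of E4M3 rounding. This is a sketch of the concept only, not Transformer Engine's implementation: the function names and the sample gradient values are made up for illustration, and the library itself runs on GPU hardware and may derive its scale from a history of recent maxima rather than from the current tensor alone.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= E4M3_MAX:
        return sign * E4M3_MAX
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)                # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)

def quantize_per_tensor(values):
    """Scale so the largest magnitude maps to E4M3_MAX, then round to E4M3."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale

def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]

grads = [1e-5, 3e-4, -2e-4]  # hypothetical small gradient values

naive = [round_to_e4m3(g) for g in grads]   # cast without scaling
quantized, scale = quantize_per_tensor(grads)
restored = dequantize(quantized, scale)

print(naive)     # every value underflows to zero
print(restored)  # values survive, with a few percent rounding error
```

Without scaling, all three gradients fall below E4M3's smallest representable magnitude and are lost. With a scale chosen from the tensor's largest absolute value, the same numbers land in the middle of the FP8 range and come back with only rounding error.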