Skip to content
AImpact
IT EN
Infrastructure Advanced Also known as: Float8 · 8-bit Floating Point · E4M3 · E5M2

FP8

FP8 is an 8-bit floating-point numeric format available in two variants: E4M3 (4-bit exponent, 3-bit mantissa), used in the forward pass for higher precision, and E5M2 (5-bit exponent, 2-bit mantissa), used for gradients for greater dynamic range. It reduces memory usage by roughly 50% compared to BF16 with less than 0.5% quality loss when paired with per-tensor scaling via the NVIDIA Transformer Engine. H100 and H800 GPUs have native FP8 Tensor Cores. DeepSeek V3 was trained entirely in FP8, achieving GPT-4o-level quality at a fraction of the cost.

ShareLinkedInX

In practice

An ML team training a 70B LLM on an H100 cluster enables FP8 via NVIDIA's Transformer Engine (integrated into Megatron-LM and NeMo) by simply setting `fp8_format=HYBRID`. For inference, frameworks like vLLM and TensorRT-LLM support FP8 weights and activations to reduce required VRAM and increase throughput. Before deploying to production, it is good practice to run evaluations on standard benchmarks (MMLU, HumanEval) to confirm that quality degradation stays within acceptable thresholds.

Related terms

← All terms