FP8
FP8 is an 8-bit floating-point format with two variants: E4M3 (4-bit exponent, 3-bit mantissa), typically used for weights and activations in the forward pass because it offers more precision, and E5M2 (5-bit exponent, 2-bit mantissa), typically used for gradients because it offers more dynamic range. Tensors stored in FP8 take half the memory of BF16 (1 byte per value instead of 2), and reported quality loss stays under 0.5% when FP8 is paired with per-tensor scaling via the NVIDIA Transformer Engine. H100 and H800 GPUs have native FP8 Tensor Cores. DeepSeek V3 was trained with FP8 mixed precision, keeping a few sensitive components in higher precision, and reached GPT-4o-level quality at a fraction of the cost.
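To make the precision versus range trade-off concrete, the following is a minimal pure-Python sketch that decodes every FP8 bit pattern and simulates per-tensor scaling. It assumes the common conventions for the two variants (exponent bias 7 for E4M3 with no infinities and a single NaN pattern per sign, exponent bias 15 for E5M2 with IEEE-style infinities and NaNs). The helper names `decode_fp8`, `finite_values`, and `quantize_per_tensor` are hypothetical and do not come from any library; real frameworks do this on the GPU.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit pattern (sign | exponent | mantissa) into a float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1
    if ieee_specials and exp == max_exp:
        # Assumed E5M2 convention: all-ones exponent encodes inf or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == max_exp and man == max_man:
        # Assumed E4M3 convention: only the all-ones pattern is NaN, no inf.
        return math.nan
    if exp == 0:
        # Subnormal values have no implicit leading 1.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def finite_values(exp_bits, man_bits, bias, ieee_specials):
    """Sorted list of every distinct finite value the format can represent."""
    vals = {decode_fp8(b, exp_bits, man_bits, bias, ieee_specials)
            for b in range(256)}
    return sorted(v for v in vals if math.isfinite(v))

E4M3 = finite_values(4, 3, 7, ieee_specials=False)
E5M2 = finite_values(5, 2, 15, ieee_specials=True)

def quantize_per_tensor(xs, grid):
    """Simulate per-tensor scaling: map the largest magnitude onto the
    format's maximum, round each value to the nearest representable
    number, then scale back."""
    amax = max(abs(x) for x in xs)
    scale = grid[-1] / amax if amax > 0 else 1.0
    return [min(grid, key=lambda g: abs(g - x * scale)) / scale for x in xs]

if __name__ == "__main__":
    print("E4M3 max:", E4M3[-1], "distinct finite values:", len(E4M3))
    print("E5M2 max:", E5M2[-1], "distinct finite values:", len(E5M2))
    print(quantize_per_tensor([0.1, -0.5, 2.0], E4M3))
```

Under these assumptions E5M2 spans a much wider range while E4M3 places its values more densely, which is why the scale factor matters: it moves a tensor's values into the part of the grid where the format has the most resolution.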
In practice
An ML team training a 70B LLM on an H100 cluster can enable FP8 through NVIDIA's Transformer Engine (integrated into Megatron-LM and NeMo) by selecting the hybrid recipe (`fp8_format=HYBRID`), which uses E4M3 in the forward pass and E5M2 in the backward pass; the exact option name varies by framework. For inference, frameworks like vLLM and TensorRT-LLM support FP8 weights and activations, which reduces the VRAM required and increases throughput. Before deploying to production, it is good practice to run evaluations on standard benchmarks (MMLU, HumanEval) and confirm that any quality degradation stays within acceptable thresholds.
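As a rough illustration of the VRAM saving for the 70B example, the sketch below estimates the memory needed for the weights alone. It assumes every parameter is stored in a single dtype and ignores the KV cache, activations, scaling factors, and framework overhead, so real deployments will need more. The function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

def weight_memory_gb(n_params, dtype):
    """Memory for model weights only, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

n_params = 70e9  # the 70B model from the example above
bf16_gb = weight_memory_gb(n_params, "bf16")
fp8_gb = weight_memory_gb(n_params, "fp8")

if __name__ == "__main__":
    print(f"BF16 weights: {bf16_gb:.0f} GB")
    print(f"FP8 weights:  {fp8_gb:.0f} GB")
    print(f"Saving:       {1 - fp8_gb / bf16_gb:.0%}")
```

By this estimate the weights drop from about 140 GB to about 70 GB, which changes how many GPUs are needed just to hold the model.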
Seen in the wild
6 entries mentioning it
- [Medium] torchao: PyTorch-Native Quantization and Sparsity Without Custom CUDA
- [High] DeepSeek-V3: GPT-4o Quality at $0.55/M Tokens via MLA and FP8 Pipeline
- [High] KV Cache Quantization FP8/INT8: Double User Density per GPU
- [High] FP8 Training with NVIDIA Transformer Engine: Half the Memory, Same Quality
- [High] FlashAttention-3: 2.6x speedup over FA2 optimized for H100 Hopper with wgmma, TMA, and FP8
- [High] NVIDIA TensorRT-LLM: automatic LLM compilation for GPUs with FP8 and multi-GPU