FlashAttention-3: 1.5-2.0x speedup over FA2 on H100 (Hopper) using wgmma, TMA, and FP8
In one sentence: Tri Dao and NVIDIA publish FlashAttention-3, an attention kernel tuned for H100 (Hopper) that overlaps compute and memory traffic using wgmma and TMA, adds FP8 low-precision support, and runs 1.5-2.0x faster than FA2 in FP16 at up to 75% of H100 peak throughput.
Each new NVIDIA GPU generation adds hardware features that existing kernels do not automatically exploit. H100 (Hopper architecture) introduced two such features, wgmma (warpgroup-level matrix multiply-accumulate instructions) and TMA (the Tensor Memory Accelerator, a unit for asynchronous data transfer), and FA2 used neither.
FlashAttention-3 is a ground-up rewrite of FlashAttention to fully exploit H100. The main trick: overlapping compute and memory operations instead of running them in sequence. While H100 is doing matrix multiplications for one block, FA3 is already loading the next block's data into SRAM.
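The overlap idea can be illustrated with a small CPU-side analogy. This is only a sketch in NumPy, not the FA3 kernel: a background thread stands in for TMA's asynchronous copy, the matrix products stand in for wgmma, and the names `load_block` and `attention_overlapped` are hypothetical. While one key/value block is being processed with the usual online-softmax update, the next block is already being fetched.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def load_block(K, V, start, block):
    # Stand-in for an asynchronous copy from slow memory into fast memory.
    return K[start:start + block].copy(), V[start:start + block].copy()


def attention_overlapped(Q, K, V, block=64):
    """Blockwise attention that prefetches block i+1 while computing block i."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)           # running row-wise maximum of the scores
    l = np.zeros(n)                   # running softmax denominator
    acc = np.zeros((n, V.shape[1]))   # running unnormalized output
    starts = list(range(0, K.shape[0], block))

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, K, V, starts[0], block)
        for i in range(len(starts)):
            Kb, Vb = pending.result()  # wait for the block we need now
            if i + 1 < len(starts):
                # Kick off the next load before starting this block's math.
                pending = pool.submit(load_block, K, V, starts[i + 1], block)

            S = (Q @ Kb.T) * scale                 # scores for this block
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            corr = np.exp(m - m_new)               # rescale earlier partial sums
            l = l * corr + P.sum(axis=1)
            acc = acc * corr[:, None] + P @ Vb
            m = m_new

    return acc / l[:, None]


def attention_reference(Q, K, V):
    # Plain softmax attention, used only to check the blockwise result.
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V
```

On a GPU the same structure is expressed with hardware-level asynchrony rather than Python threads, so this example shows the scheduling pattern only and says nothing about real performance.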
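The result: 1.5-2.0x faster than FA2 on H100 in FP16, reaching up to 75% of the GPU's theoretical peak. It also supports FP8, H100's new low-precision format that halves memory and nearly doubles throughput compared to FP16. The 2.6x figure associated with FA3 refers to its FP8 mode having 2.6x lower numerical error than a baseline FP8 attention, not to a speedup.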
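As a rough illustration of the "halves memory" point, the arithmetic below compares the storage needed for the K and V tensors at 2 bytes per element (FP16) and 1 byte per element (FP8). The shape (sequence length 8192, 32 heads, head dimension 128) and the helper name `kv_bytes` are assumptions chosen for the example, not figures from the FA3 release.

```python
def kv_bytes(seq_len, num_heads, head_dim, bytes_per_element):
    # Two tensors (K and V), each of shape seq_len x num_heads x head_dim.
    return 2 * seq_len * num_heads * head_dim * bytes_per_element


fp16_bytes = kv_bytes(8192, 32, 128, bytes_per_element=2)
fp8_bytes = kv_bytes(8192, 32, 128, bytes_per_element=1)

print(fp16_bytes // 2**20, "MiB in FP16")  # 128 MiB
print(fp8_bytes // 2**20, "MiB in FP8")    # 64 MiB
```

The same factor of two applies to the bytes moved per block, which is one reason FP8 helps a kernel that is partly limited by memory traffic.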
Companies
Tri Dao Research, NVIDIA
Tools
FlashAttention-3, CUDA, PyTorch, cuDNN
Tags
Sources