AImpact
High AI Infrastructure · 1 min read

FlashAttention-3: 1.5-2.0x speedup over FA2 on H100 Hopper with wgmma, TMA, and FP8

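In one sentence: Tri Dao and collaborators, including NVIDIA researchers, publish FlashAttention-3, an attention kernel optimized for H100 Hopper that overlaps compute and memory traffic using wgmma and TMA, adds FP8 low-precision support, and runs 1.5-2.0x faster than FA2 in FP16 at up to 75% of H100 peak; the 2.6x figure in the paper refers to lower numerical error in FP8 relative to a baseline FP8 attention, not to speed.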

Verified Official source

Kernels written for one GPU generation do not automatically exploit the hardware features of the next. H100 (Hopper architecture) introduced two features that FA2 did not use at all: wgmma, a set of warpgroup-level matrix multiply-accumulate instructions for the Tensor Cores, and TMA (Tensor Memory Accelerator), a hardware unit for asynchronous data transfer between global and shared memory.

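FlashAttention-3 reworks FlashAttention to exploit H100. The main trick is overlapping compute and memory operations instead of running them in sequence: while the Tensor Cores are doing the matrix multiplications for one block, FA3 is already loading the next block's data into on-chip shared memory (SRAM). It also interleaves the matrix multiplications with the softmax computation, so the slower softmax steps do not leave the Tensor Cores idle.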
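To make the overlap idea concrete, here is a minimal sketch in plain Python, not the FA3 kernel itself. It computes attention for a single query block by block with an online softmax, and requests block i+1 before processing block i (double buffering). The names `load_block`, `attention_pipelined`, and `attention_reference` are hypothetical, and a background thread stands in for the asynchronous TMA copy; Python threads do not give real hardware overlap, so this only illustrates the control flow.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def load_block(K, V, start, size):
    # Hypothetical stand-in for an asynchronous copy of one K/V block
    # into fast on-chip memory.
    return K[start:start + size], V[start:start + size]


def attention_reference(q, K, V):
    # Plain softmax attention for a single query vector.
    scores = [dot(q, k) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def attention_pipelined(q, K, V, block=2):
    # Blockwise attention with online softmax. Assumes K and V are non-empty.
    starts = list(range(0, len(K), block))
    m, l = -math.inf, 0.0        # running max and running softmax denominator
    acc = [0.0] * len(V[0])      # running unnormalized output
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_block, K, V, starts[0], block)
        for i in range(len(starts)):
            k_blk, v_blk = pending.result()      # wait for the current block
            if i + 1 < len(starts):              # request the next block early
                pending = loader.submit(load_block, K, V, starts[i + 1], block)
            scores = [dot(q, k) for k in k_blk]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)          # rescale earlier partial results
            p = [math.exp(s - m_new) for s in scores]
            l = l * scale + sum(p)
            acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
    return [a / l for a in acc]


q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.7]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
out = attention_pipelined(q, K, V, block=2)
ref = attention_reference(q, K, V)
print(out, ref)
```

The blockwise result matches the reference because the online softmax rescales earlier partial sums whenever a larger score appears, which is what lets each block be processed as soon as it arrives.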

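The result: 1.5-2.0x faster than FA2 on H100 in FP16, reaching up to 75% of the GPU's theoretical peak. It also supports FP8, Hopper's low-precision format, which halves the memory per element and roughly doubles peak Tensor Core throughput compared to FP16. In FP8 the authors report close to 1.2 PFLOPS and 2.6x lower numerical error than a baseline FP8 attention implementation.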
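The memory and utilization claims reduce to simple arithmetic, sketched below. The tensor shape is hypothetical and chosen only for illustration. The 740 TFLOPS figure is the FP16 throughput reported by the FlashAttention-3 authors, and 989 TFLOPS is the dense FP16 Tensor Core peak NVIDIA lists for the H100 SXM; both come from outside this article and are labeled as assumptions in the code.

```python
def tensor_bytes(batch, heads, seq_len, head_dim, bytes_per_element):
    # Size of one attention input tensor (Q, K, or V) in bytes.
    return batch * heads * seq_len * head_dim * bytes_per_element


def utilization(achieved_tflops, peak_tflops):
    # Fraction of theoretical peak actually achieved.
    return achieved_tflops / peak_tflops


# Hypothetical shape, for illustration only.
shape = dict(batch=1, heads=32, seq_len=8192, head_dim=128)

fp16_bytes = tensor_bytes(**shape, bytes_per_element=2)  # FP16: 2 bytes per element
fp8_bytes = tensor_bytes(**shape, bytes_per_element=1)   # FP8: 1 byte per element

# Assumed figures: 740 TFLOPS reported for FA3 in FP16,
# 989 TFLOPS dense FP16 peak for H100 SXM.
fp16_util = utilization(740.0, 989.0)

print(f"FP16: {fp16_bytes / 2**20:.0f} MiB, FP8: {fp8_bytes / 2**20:.0f} MiB")
print(f"FP16 utilization: {fp16_util:.1%}")
```

Halving the bytes per element halves both the storage and the data that must move through memory for each block, which is part of why FP8 helps beyond the higher Tensor Core rate.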

Companies

Tri Dao Research, NVIDIA

Tools

FlashAttention-3, CUDA, PyTorch, cuDNN

Tags

FlashAttention-3, H100, Hopper, NVIDIA, FP8, wgmma, TMA, CUDA, Tri Dao, Attention

Sources