Inference | Advanced | Also known as: Flash Attention

FlashAttention

An algorithm that reorganizes the attention computation into tiles to minimize data movement between slow GPU high-bandwidth memory (HBM) and fast on-chip SRAM.

In practice

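It does not change the math: the output is exact attention, not an approximation. What changes is the cost. Attention runs much faster, and memory use grows linearly with sequence length instead of quadratically, because the full attention matrix is never written out. It ships by default in PyTorch (as a backend of scaled_dot_product_attention) and in major inference servers (vLLM, TGI). If you call a hosted API you never see it; if you self-host, enabling it is close to mandatory.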
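
The claim that the result is unchanged can be checked with a small sketch. The code below is illustrative only: it is pure Python for a single query vector, and the function names (naive_attention, tiled_attention) and the block_size parameter are hypothetical, not part of any library. The real FlashAttention is a fused GPU kernel that also tiles the queries and keeps each tile in SRAM. The sketch shows only the core idea, an online softmax that processes keys and values one block at a time while keeping a running max, a running denominator, and a running output.

```python
import math
import random


def naive_attention(q, k, v):
    """Reference attention for one query: softmax(q.k / sqrt(d)) applied to v.

    Stores the full list of scores, as a standard implementation would.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in k]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(v[0])
    return [sum(w * val[j] for w, val in zip(weights, v)) / total
            for j in range(dim_v)]


def tiled_attention(q, k, v, block_size=4):
    """Same result, computed one key/value block at a time (hypothetical helper).

    Only a running max, a running denominator, and a running output vector
    are kept, so the full list of scores is never stored at once.
    """
    d = len(q)
    dim_v = len(v[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim_v
    for start in range(0, len(k), block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, val in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim_v):
                acc[j] += w * val[j]
        running_max = new_max
    return [a / running_sum for a in acc]


random.seed(0)
seq_len, head_dim = 10, 8
q = [random.gauss(0, 1) for _ in range(head_dim)]
k = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
v = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]

ref = naive_attention(q, k, v)
out = tiled_attention(q, k, v, block_size=4)
max_diff = max(abs(a - b) for a, b in zip(ref, out))
print("largest difference:", max_diff)
```

The two results should agree to within floating-point rounding for any block size, which is the sense in which the math is unchanged. The speed and memory gains come from where the intermediate values live on the GPU, which this sketch does not model.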

Seen in the wild

4 entries mentioning it

May 18, 2024

FlashAttention-3: 1.5-2.0x speedup over FA2 on H100 (Hopper) using WGMMA, TMA, and FP8

High

July 28, 2023

FlashAttention-2: rewrite with 2x speedup, MQA/GQA support, and head-dim 256

High

June 6, 2023

HuggingFace TGI: production-ready Docker container for LLM serving with continuous batching

High

June 21, 2022

FlashAttention: IO-aware attention that revolutionizes transformer training

Landmark
