AImpact
Inference | Advanced | Also known as: Flash Attention

FlashAttention

An exact attention algorithm that computes attention in small tiles, minimizing data movement between fast on-chip GPU memory (SRAM) and slower off-chip GPU memory (HBM).


In practice

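It does not change the math: the output matches standard attention, up to floating-point rounding. What changes is cost. Attention runs much faster, and because the full attention matrix is never written out to GPU memory, memory use grows linearly rather than quadratically with sequence length. It ships by default in PyTorch (through the built-in scaled_dot_product_attention function, on supported GPUs) and in major inference servers (vLLM, TGI). If you use hosted APIs you never see it; if you self-host, check that it is actually enabled, because serving long contexts without it is rarely practical.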
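The claim that the math is unchanged can be checked with a small sketch. The pure-Python code below is only an illustration of the tiling idea, not the real GPU kernel: it handles a single query vector, and the function names and the `block_size` parameter are hypothetical. It visits keys and values one block at a time and keeps a running maximum and running sum (an "online" softmax), so the full list of attention scores never has to exist at once.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V for one query, all scores held at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dv)]


def tiled_attention(q, keys, values, block_size=4):
    """Same result, but keys/values are processed block by block (illustrative only)."""
    d = len(q)
    dv = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dv          # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]

        # Rescale what has been accumulated so far to the new maximum.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]

        # Fold this block into the running totals.
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dv):
                acc[j] += w * v[j]
        running_max = new_max

    return [a / running_sum for a in acc]


# Small demo on random data: the two results should agree up to rounding.
random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]
values = [[random.gauss(0, 1) for _ in range(3)] for _ in range(10)]
diff = max(abs(a - b) for a, b in zip(naive_attention(q, keys, values),
                                      tiled_attention(q, keys, values, block_size=4)))
print("max difference:", diff)
```

The real algorithm applies the same rescaling trick to tiles of queries, keys, and values held in on-chip memory, which is where the speed and memory savings come from; this sketch only shows why the result is the same.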
