Models · Beginner · Also known as: Attenzione (Italian) · Self-attention

Attention

A mechanism that lets the model weigh how relevant each word (token) in the text is to every other one, so that each word is interpreted in light of its context.
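The weighing described above can be sketched in a few lines. This is a toy illustration, not how any particular model implements it: the tokens and the two-dimensional `embeddings` are made up, and the hypothetical `self_attention` function uses the raw vectors as queries, keys, and values, whereas real transformers first apply learned projections and use many heads.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy scaled dot-product self-attention with no learned projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Relevance of every token to the current one: scaled dot product.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)  # non-negative, sums to 1
        # New representation: weighted average of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up example: "he" is given a vector close to "Marco" (an assumption).
tokens = ["Marco", "said", "he"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(embeddings)

for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence. With these assumed vectors, "he" puts more weight on "Marco" than on "said".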


In practice

It is why an LLM can tell that 'he' in a sentence refers to a person mentioned earlier. Because every token is compared with every other token, compute cost grows with the square of the context length: this is why very long contexts are expensive.
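The quadratic growth is simple arithmetic. As a rough sketch, the hypothetical helper below counts only the pairwise attention scores for a single head in a single layer, and ignores constant factors such as causal masking:

```python
def attention_score_count(context_length):
    # Standard attention compares each token with every token: n * n scores.
    return context_length * context_length

for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens -> {attention_score_count(n)} scores")
```

Doubling the context length quadruples the number of scores to compute.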

Related terms

Transformer Context window

Seen in the wild

12 entries mentioning it

July 2, 2025

vLLM v0.7: chunked prefill by default and a redesigned V1 engine

Medium
April 8, 2025

Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI

Medium
January 22, 2025

FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion

Medium
May 18, 2024

FlashAttention-3: 2.6x speedup over FA2 optimized for H100 Hopper with wgmma, TMA, and FP8

High
May 2, 2024

SGLang: 6.4x LLM throughput with RadixAttention and shared prefix caching

Medium
July 28, 2023

FlashAttention-2: rewrite with 2x speedup, MQA/GQA support, and head-dim 256

High
June 6, 2023

HuggingFace TGI: production-ready Docker container for LLM serving with continuous batching

High
February 9, 2023

vLLM: 24x LLM throughput with PagedAttention from UC Berkeley

High
June 21, 2022

FlashAttention: IO-aware attention that revolutionizes transformer training

Landmark
December 8, 2020

Big Bird at NeurIPS 2020: sparse attention for sequences up to 4096 tokens

Medium
July 22, 2020

Longformer: sliding-window attention for long documents

Medium
January 13, 2020

Reformer: the transformer that handles very long sequences

Medium
