In practice
It is how an LLM works out that 'he' in a sentence refers to a person mentioned earlier. Compute cost grows with the square of the context length, because every token is compared against every other token. This is why very long contexts are expensive.
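The quadratic growth can be made concrete with a small sketch. It assumes the entry refers to standard scaled dot-product self-attention, with no causal mask and no learned projections. The helper names (`attention_weights`, `pairwise_scores`) and the toy vectors are hypothetical, chosen only for illustration.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return an n x n matrix; row i says how much token i attends to each token j."""
    d = len(keys[0])
    weights = []
    for q in queries:
        # One dot product per (query, key) pair: n * n in total.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights


def pairwise_scores(n_tokens):
    """Number of query-key comparisons for a context of n_tokens (illustrative)."""
    return n_tokens * n_tokens


# Toy 3-token sequence with 2-dimensional vectors (made-up values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = attention_weights(vectors, vectors)

print(len(w), len(w[0]))      # 3 3
print(pairwise_scores(1000))  # 1000000
print(pairwise_scores(2000))  # 4000000
```

Doubling the context from 1000 to 2000 tokens quadruples the number of comparisons. Several of the entries listed under "Seen in the wild" target this cost, either by making the exact computation cheaper or by attending to only a subset of tokens.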
Related terms
Seen in the wild
12 entries mentioning it
- [Medium] vLLM v0.7: chunked prefill by default and a redesigned V1 engine
- [Medium] Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI
- [Medium] FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion
- [High] FlashAttention-3: 2.6x speedup over FA2 optimized for H100 Hopper with wgmma, TMA, and FP8
- [Medium] SGLang: 6.4x LLM throughput with RadixAttention and shared prefix caching
- [High] FlashAttention-2: rewrite with 2x speedup, MQA/GQA support, and head-dim 256
- [High] HuggingFace TGI: production-ready Docker container for LLM serving with continuous batching
- [High] vLLM: 24x LLM throughput with PagedAttention from UC Berkeley
- [Landmark] FlashAttention: IO-aware attention that revolutionizes transformer training
- [Medium] Big Bird at NeurIPS 2020: sparse attention for sequences up to 4096 tokens
- [Medium] Longformer: sliding-window attention for long documents
- [Medium] Reformer: the transformer that handles very long sequences