In practice
It is the foundation of nearly every modern LLM. If you build products, you do not need to implement it from scratch: you use a framework such as PyTorch or call a hosted API. Knowing that its computation parallelizes well explains why training relies on powerful GPUs.
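To make the parallelism point concrete, here is a minimal sketch, assuming the term in question is the Transformer and its self-attention step. It is written in plain Python so it runs without any framework; the helper names (`softmax`, `attend`) and the toy numbers are illustrative assumptions, and a real layer would apply learned projections and run as batched matrix multiplications in something like PyTorch.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(i, queries, keys, values):
    # Output for position i: a weighted average of all value vectors.
    # It reads only the inputs, never another position's output.
    d = len(queries[i])
    scores = [
        sum(q * k for q, k in zip(queries[i], key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]

# Toy sequence of 4 positions with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# Simplifying assumption: queries, keys and values are the raw inputs.
forward = [attend(i, x, x, x) for i in range(len(x))]
backward = {i: attend(i, x, x, x) for i in reversed(range(len(x)))}

# Evaluation order does not change any result, so every position
# could be computed at the same time on parallel hardware.
same = all(forward[i] == backward[i] for i in range(len(x)))
print(same)
```

Because no position waits on another position's output, the work for a whole sequence can be spread across many GPU cores at once, which is the property the paragraph above refers to.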
Seen in the wild
19 entries mentioning it
- [High] CrossFormer: a single transformer for 20+ robot embodiments with rigorous scaling analysis
- [Medium] bitsandbytes 0.43: QLoRA and NF4/FP4 quantization for 4-bit fine-tuning
- [Landmark] FLUX.1: the new open standard for photorealistic image generation
- [High] FP8 Training with NVIDIA Transformer Engine: Half the Memory, Same Quality
- [High] Stable Diffusion 3: Diffusion Transformer architecture and improved text
- [Landmark] Sora: OpenAI shows cinema-quality AI video
- [High] RT-2: the robot that reasons with a language model
- [High] FlashAttention-2: rewrite with 2x speedup, MQA/GQA support, and head-dim 256
- [High] DeepMind RT-1: the first Transformer trained on real robotics data
- [Landmark] FlashAttention: IO-aware attention that revolutionizes transformer training
- [High] Gato: DeepMind tries a single agent for 600+ tasks
- [Landmark] NVIDIA H100 and Hopper architecture: the foundation-model GPU
- [High] Switch Transformer: Google scales to 1.6T parameters with Mixture of Experts
- [Landmark] Vision Transformer (ViT): "An Image is Worth 16x16 Words"
- [Medium] Longformer: sliding-window attention for long documents
- [High] HuggingFace Transformers 3.0: Rust tokenizers and the Model Hub
- [Medium] Image GPT: generative pretraining for images
- [Landmark] GPT-3: the paper that opens the scaling-laws era
- [Medium] Reformer: the transformer that handles very long sequences