AImpact
Models | Beginner | Also known as: Transformer architecture (Italian: Architettura Transformer)

Transformer

A neural network architecture introduced by Google researchers in 2017 that uses the attention mechanism to process all the tokens of a sequence in parallel rather than one word at a time.

In practice

It is the foundation of nearly every modern LLM. If you build products, you do not need to implement it from scratch: you use a framework such as PyTorch or call a model through an API. Because the architecture is parallelizable, training maps well onto hardware that runs many operations at once, which is why it needs powerful GPUs.
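
To make the parallelism concrete, here is a minimal sketch of single-head self-attention in plain Python. It is an illustration under simplifying assumptions, not how a framework implements it: queries, keys, and values are all the raw input vectors (a real Transformer applies learned projection matrices first), and the names `softmax`, `self_attention`, and `tokens` are hypothetical. Note that each output row is computed from the full input without depending on any other output row, which is what lets a GPU compute all positions at once.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (assumption: no learned weights)."""
    d = len(x[0])
    out = []
    # Each iteration reads only the input x, never a previous output,
    # so all positions could be computed in parallel.
    for q in x:
        # Scaled dot-product score between this token and every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Three toy "token" vectors of dimension 2 (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

In practice the loops collapse into a few matrix multiplications over the whole sequence, which is the form frameworks such as PyTorch run on the GPU.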

Seen in the wild

19 entries mentioning it
  1. CrossFormer: a single transformer for 20+ robot embodiments with rigorous scaling analysis
    High
  2. bitsandbytes 0.43: QLoRA and NF4/FP4 quantization for 4-bit fine-tuning
    Medium
  3. FLUX.1: the new open standard for photorealistic image generation
    Landmark
  4. FP8 Training with NVIDIA Transformer Engine: Half the Memory, Same Quality
    High
  5. Stable Diffusion 3: Diffusion Transformer architecture and improved text
    High
  6. Sora: OpenAI shows cinema-quality AI video
    Landmark
  7. RT-2: the robot that reasons with a language model
    High
  8. FlashAttention-2: rewrite with 2x speedup, MQA/GQA support, and head-dim 256
    High
  9. DeepMind RT-1: the first Transformer trained on real robotics data
    High
  10. FlashAttention: IO-aware attention that revolutionizes transformer training
    Landmark
  11. Gato: DeepMind tries a single agent for 600+ tasks
    High
  12. NVIDIA H100 and Hopper architecture: the foundation-model GPU
    Landmark
  13. Switch Transformer: Google scales to 1.6T parameters with Mixture of Experts
    High
  14. Vision Transformer (ViT): "An Image is Worth 16x16 Words"
    Landmark
  15. Longformer: sliding-window attention for long documents
    Medium
  16. HuggingFace Transformers 3.0: Rust tokenizers and the Model Hub
    High
  17. Image GPT: generative pretraining for images
    Medium
  18. GPT-3: the paper that opens the scaling-laws era
    Landmark
  19. Reformer: the transformer that handles very long sequences
    Medium