In practice
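It is the architecture of GPT, Llama, Mistral, Claude, and nearly every modern generative LLM. It contrasts with encoder-only models (such as BERT, typically used for classification) and encoder-decoder models (such as T5, typically used for translation). Its simplicity, a single stack of layers trained on one next-token prediction objective, is often cited as a reason it scales so well in pretraining.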
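As a rough illustration of what sets this design apart from an encoder, the sketch below builds the causal attention mask (each position sees only itself and earlier positions) next to the bidirectional mask an encoder such as BERT would use, and then shows the token-by-token generation loop that the causal mask makes possible. The function names and the toy `next_token` callback are hypothetical, chosen for this example only; no real model or library is involved.

```python
def causal_mask(n):
    """n x n mask for a decoder-only model: row i may attend to columns 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """n x n mask for an encoder-only model: every position sees every other."""
    return [[True] * n for _ in range(n)]


def generate(next_token, prompt, max_new_tokens):
    """Autoregressive loop: append one predicted token at a time.

    `next_token` is a stand-in (assumption) for a model call that maps
    the tokens so far to the next token.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


if __name__ == "__main__":
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
    # Toy "model": the next token is the previous token plus one.
    print(generate(lambda toks: toks[-1] + 1, [1, 2], 3))
```

Because the mask is lower-triangular, the same stack of layers serves both training (predict every next token in parallel) and generation (extend the sequence one token at a time), with no separate encoder to coordinate.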
Related terms
Seen in the wild
0 entries mention it. No archive entry mentions it explicitly; it appears in broader contexts.