In practice
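It is what makes a Transformer "causal", as in decoder-only models: during training, each position may attend only to itself and to earlier positions, so the model learns to predict the next token without cheating by looking ahead. At inference time the mask is largely implicit during token-by-token generation, because future tokens do not yet exist. Without it, a GPT-style model could simply read off the token it is supposed to predict, and next-token training would be meaningless.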
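As a minimal sketch, assuming the term refers to the usual lower-triangular attention mask, the idea can be shown in a few lines of NumPy. The function names causal_mask and masked_attention_weights are hypothetical and chosen for illustration; real frameworks build the mask inside their attention layers.

```python
import numpy as np

def causal_mask(seq_len):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores):
    # scores: (seq_len, seq_len) raw attention scores, rows are queries.
    seq_len = scores.shape[-1]
    mask = causal_mask(seq_len)
    # Future positions get -inf so they receive zero weight after softmax.
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is
    # never masked, so every row has at least one finite entry.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Illustrative input: uniform scores over a 4-token sequence.
scores = np.zeros((4, 4))
weights = masked_attention_weights(scores)
```

With uniform scores, each row spreads its weight evenly over the current and earlier positions and assigns exactly zero to every later position, which is the "no looking ahead" property described above.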
Related terms
Seen in the wild
0 entries mentioning it
No archive entry mentions it explicitly. Appears in broader contexts.