In practice
With k=1 it becomes greedy decoding; with large k it is almost the full distribution again. It is used to stop the model from picking absurd words from the tail. Modern APIs often replace or combine it with top-p, which is considered more adaptive.
Related terms
Seen in the wild
0 entries mentioning itNo archive entry mentions it explicitly. Appears in broader contexts.