Data · Intermediate · Also known as: Tokenizzazione a sotto-parole (Italian)

Subword Tokenization

A family of techniques that split text into units that typically sit between whole words and single characters: frequent words stay intact, while rarer words are broken into smaller, reusable pieces.

In practice

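Subword tokenization is a compromise between word-level vocabularies (one word = one token), which grow huge and still miss new words, and character-level vocabularies (one character = one token), which are tiny but produce very long sequences. Because unseen words, typos, and text in many languages can all be broken into known pieces, it covers them without the vocabulary blowing up in size. Nearly every modern LLM uses some form of subword tokenization, such as BPE, WordPiece, or a unigram language model.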
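
To make the idea concrete, here is a minimal sketch of one way to split a word into subwords: greedy longest-match-first lookup, loosely in the spirit of WordPiece-style inference. The function name `subword_tokenize` and the tiny `TOY_VOCAB` are hypothetical and chosen only for illustration. Real tokenizers learn their vocabulary from a corpus and usually mark word boundaries, which this sketch does not do.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of `word` into pieces from `vocab`.

    Any character the vocabulary cannot cover is emitted as its own token,
    so every input can still be represented.
    """
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        # If nothing matched, end == start + 1: fall back to one character.
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical toy vocabulary (an assumption, not a learned vocabulary).
TOY_VOCAB = {"token", "ization", "un", "seen", "ize", "s"}

print(subword_tokenize("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(subword_tokenize("unseen", TOY_VOCAB))        # ['un', 'seen']
print(subword_tokenize("tokenizaton", TOY_VOCAB))   # typo: falls back to characters
```

The misspelled word still gets tokenized: the known piece `token` is reused and the remainder degrades to single characters, which is the graceful fallback that word-level vocabularies lack.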

