Data Intermediate Also known as: WordPiece · SentencePiece

WordPiece / SentencePiece

Two subword tokenization approaches often contrasted with plain BPE: WordPiece is the tokenizer used in BERT, and SentencePiece is the tokenizer used in T5 and Gemini.


In practice

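WordPiece chooses each merge by how much it increases the likelihood of the training data, rather than by raw pair frequency as BPE does. SentencePiece works directly on the raw string and treats whitespace as an ordinary symbol instead of assuming the text is already split on spaces, so it handles Chinese, Japanese, and other languages written without spaces better. Switching tokenizers changes the vocabulary and the token IDs, so it generally requires retraining the model.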
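To make the difference between the two merge criteria concrete, here is a minimal Python sketch. It is not BERT's actual training code. It assumes the commonly described WordPiece score, count(pair) / (count(first) × count(second)), and compares it with BPE-style selection by raw pair count. The toy corpus and all function names are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: word -> number of occurrences.
CORPUS = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def split_word(word):
    # WordPiece-style initial split: the first character stands alone,
    # every following character carries the "##" continuation prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]


def pair_stats(corpus):
    # Count how often each unit and each adjacent pair of units occurs,
    # weighted by word frequency.
    unit_freq = Counter()
    pair_freq = Counter()
    for word, count in corpus.items():
        units = split_word(word)
        for unit in units:
            unit_freq[unit] += count
        for a, b in zip(units, units[1:]):
            pair_freq[(a, b)] += count
    return unit_freq, pair_freq


def best_pair_by_frequency(corpus):
    # BPE-style choice: merge the most frequent adjacent pair.
    _, pair_freq = pair_stats(corpus)
    return max(pair_freq, key=pair_freq.get)


def best_pair_by_score(corpus):
    # Assumed WordPiece-style choice: normalise the pair count by the
    # counts of its parts, so pairs of individually rare units win.
    unit_freq, pair_freq = pair_stats(corpus)

    def score(pair):
        a, b = pair
        return pair_freq[pair] / (unit_freq[a] * unit_freq[b])

    return max(pair_freq, key=score)


if __name__ == "__main__":
    print("frequency picks:", best_pair_by_frequency(CORPUS))
    print("score picks:    ", best_pair_by_score(CORPUS))
```

On this toy corpus the two criteria disagree: raw frequency favours the pair containing the very common unit `##u`, while the normalised score favours `##g` + `##s`, whose parts are rarer on their own.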

Seen in the wild

0 entries mentioning it

No archive entry mentions it explicitly. Appears in broader contexts.
