## Definition
**Tokenization** is the process that converts a string of text into a sequence of integer **tokens** — the discrete units an LLM operates on. The tokenizer is paired with the model: it defines the vocabulary and the rules for breaking input into pieces.
## Dominant Schemes
- **Byte-Pair Encoding (BPE)** — start from bytes; iteratively merge the most frequent adjacent pair. Used by GPT-2 and many descendants.
- **WordPiece** — similar to BPE but uses likelihood rather than frequency. Used by BERT.
- **SentencePiece** — language-agnostic, operates on raw bytes; used by T5, Llama, Mistral, Gemma.
- **Tiktoken** — OpenAI's BPE variant.
- **Anthropic tokenizer** — Claude's tokenizer, also BPE-based; broadly similar token counts to GPT-class on English.
## Vocabulary Size
Typical modern vocabularies: 32k–256k tokens. The trade-off is **expressivity** (larger vocab = shorter sequences per text) vs. **embedding-table cost** (larger vocab = more parameters at the input/output).
## Practical Properties
- English code averages **~4 characters per token** — see [[Token]].
- Chinese, Japanese, and Korean often tokenize at ~1 character per token, sometimes worse — multilingual cost is real.
- Source code, JSON, and SQL tokenize differently from prose — measure your own corpus.
## Subword Behaviour
Out-of-vocabulary words are not lost; they decompose into multiple subword tokens. This is why a tokenizer can encode arbitrary text — at the cost of more tokens per unfamiliar string.
## Why It Matters Operationally
- **Pricing** is per token.
- **Context window** is measured in tokens — see [[Context Window]].
- **Latency** scales with tokens generated, not characters.
- **Prompt design** can deliberately reduce token count by choosing more common phrasings.
## Related
- [[Token]]
- [[Embedding]]
- [[Large Language Model]]
- [[Transformer Architecture]]