Tokenization - Albert Masoliver's learning site

## Definition

**Tokenization** is the process that converts a string of text into a sequence of integer **tokens**, the discrete units an LLM operates on. The tokenizer is paired with the model: it defines the vocabulary and the rules for breaking input into pieces.

## Dominant Schemes

- **Byte-Pair Encoding (BPE)**: start from bytes (or characters) and iteratively merge the most frequent adjacent pair into a new token until the target vocabulary size is reached. Used by GPT-2 and many descendants.
- **WordPiece**: similar to BPE, but picks the merge that most increases training-data likelihood rather than the most frequent pair. Used by BERT.
- **SentencePiece**: a language-agnostic library that operates on raw text without whitespace pre-tokenization and implements both BPE and unigram models. Used by T5, Llama (1 and 2), Mistral, and Gemma.
- **Tiktoken**: OpenAI's BPE tokenizer library.
- **Anthropic tokenizer**: Claude's tokenizer, also BPE-based; broadly similar token counts to GPT-class tokenizers on English.

## Vocabulary Size

Typical modern vocabularies: 32k–256k tokens. The trade-off is **compression** (a larger vocabulary yields shorter sequences for the same text) vs. **embedding-table cost** (a larger vocabulary adds parameters at the input and output layers).

## Practical Properties

- English prose averages **~4 characters per token**; see [[Token]].
- Chinese, Japanese, and Korean often tokenize at roughly 1 character per token, and sometimes need more than one token per character, so the same content costs more tokens in those languages.
- Source code, JSON, and SQL tokenize differently from prose, so measure your own corpus.

## Subword Behaviour

Out-of-vocabulary words are not lost; they decompose into multiple subword tokens. This is why a tokenizer can encode arbitrary text, at the cost of more tokens per unfamiliar string.

## Why It Matters Operationally

- **Pricing** is typically per token.
- **Context window** is measured in tokens; see [[Context Window]].
- **Latency** scales with tokens generated, not characters.
- **Prompt design** can deliberately reduce token count by choosing more common phrasings.

## Related

- [[Token]]
- [[Embedding]]
- [[Large Language Model]]
- [[Transformer Architecture]]
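
As a rough illustration of the BPE merge loop and the subword fallback described above, here is a minimal byte-level sketch in Python. It is a toy under stated assumptions, not the implementation used by any production tokenizer: the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, there is no pre-tokenization or special-token handling, and new token ids are simply assigned from 256 upward.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # initial tokens are byte values 0..255
    merges = {}                       # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:                 # nothing repeats, so merging gains nothing
            break
        new_id = 256 + step           # assumption: ids above 255 are merged tokens
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return ids, merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merged token into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Because every string can be written as bytes, `encode` never fails on unseen words: a string that shares no merges with the training text just stays as one token per byte, which is the "more tokens per unfamiliar string" cost in practice.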