## Definition
A **masked language model** (MLM) is trained to predict randomly masked tokens anywhere in a sequence, using context from *both* directions — the tokens before and after the mask. BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) is the canonical example:
> "My favorite \_\_ is blue." → predict **color**
Because both left and right context are available, MLMs develop rich bidirectional representations ideal for understanding tasks. Unlike [[Autoregressive Language Model]]s, MLMs are not trained to generate open-ended outputs; their native output is a probability distribution over the vocabulary *at the masked positions only*.
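To make "a probability distribution over the vocabulary at the masked position" concrete, here is a minimal sketch. The five-word vocabulary and the logit values are invented for illustration; a real model such as BERT scores tens of thousands of vocabulary entries.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for the single masked position in
# "My favorite [MASK] is blue." (all numbers are made up).
vocab = ["color", "food", "car", "blue", "is"]
mask_logits = [4.0, 1.5, 1.0, 0.5, -2.0]

mask_probs = dict(zip(vocab, softmax(mask_logits)))
prediction = max(mask_probs, key=mask_probs.get)  # highest-probability token
```

The model's native output is `mask_probs`, not a string; choosing the most probable entry is one possible way to turn that distribution into a single prediction.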
## Architecture
MLMs use an **encoder-only** transformer. The full input sequence (with mask tokens inserted) is processed in one forward pass, and the encoder produces a contextualised embedding for every position. A token-prediction head applied to the embedding at each masked position then predicts the original token over the vocabulary.
This contrasts with decoder-only autoregressive models (e.g., the GPT family) and encoder-decoder (seq2seq) models (e.g., T5).
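One way to see the encoder-only versus decoder-only difference is through the self-attention mask. The note does not discuss attention masks, so treat this as a supplementary sketch: encoder self-attention is typically unmasked, while decoder-only models apply a causal mask. The function names are hypothetical.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Rows are query positions, columns are key positions; 1 means "may attend".
encoder_view = bidirectional_mask(4)
decoder_view = causal_mask(4)
```

In `encoder_view` the row for a masked position contains ones on both sides of it, which is what lets the prediction draw on tokens before and after the mask.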
## Training Signal
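During pretraining, a fixed percentage of tokens (15% in the original BERT) is randomly selected for prediction, and the model is trained to recover the original tokens. Because `[MASK]` never appears in downstream fine-tuning inputs, always replacing the selected tokens with it would create a mismatch between pretraining and fine-tuning. To reduce that mismatch, BERT corrupts the selected tokens with a mixed strategy: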
- 80% of selected tokens → replaced with `[MASK]`
- 10% → replaced with a random token
- 10% → left unchanged
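The selection and 80/10/10 corruption steps above can be sketched as follows. This is an illustrative approximation, not BERT's reference implementation: it selects each token independently with probability 0.15 rather than picking an exact count, it does not exclude special tokens, and the names `corrupt_for_mlm`, `select_prob`, and `labels` are hypothetical.

```python
import random

MASK = "[MASK]"

def corrupt_for_mlm(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) using the 80/10/10 recipe.

    labels[i] is the original token at selected positions, None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: input unchanged, no prediction target
        labels[i] = token  # the model must recover this original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

sentence = ["my", "favorite", "color", "is", "blue"]
corrupted, labels = corrupt_for_mlm(sentence, vocab=sentence, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the corruption reproducible, which is convenient when debugging a data pipeline.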
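The loss is computed only over the selected positions, including those that were left unchanged or replaced with a random token. The objective is self-supervised because the targets are the original tokens of the text itself, so no human labels are required.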
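As a rough sketch of restricting the loss to the positions chosen for prediction, the function below averages cross-entropy over positions that carry a target and skips the rest. Using `None` to mark positions without a target is a convention assumed here for illustration, and `masked_cross_entropy` is a hypothetical name.

```python
import math

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not None.

    logits: one list of vocabulary scores per sequence position.
    labels: target vocabulary id per position, or None if not selected.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target is None:
            continue  # position was not selected: contributes nothing
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]  # equals -log softmax(scores)[target]
        count += 1
    return total / count if count else 0.0

# Three positions, vocabulary of size 2; only positions 1 and 2 have targets.
example_logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
example_labels = [None, 0, 1]
loss = masked_cross_entropy(example_logits, example_labels)
```

Changing the scores at position 0 leaves `loss` unchanged, because that position has no target.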
## Primary Use Cases
According to Huyen (2024), masked language models are most commonly applied to:
- **Text classification** — sentiment analysis, topic classification, spam detection.
- **Named entity recognition (NER)** and sequence labelling.
- **Code debugging** — where understanding both preceding and following code is necessary to identify errors.
- **Semantic embeddings** — bidirectional context makes MLM-derived embeddings highly informative for retrieval and similarity tasks. See [[Embedding]] and [[Embedding-Based Retrieval]].
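For the embedding use case, a common approach (an assumption here, not something the source specifies) is to average the encoder's per-token vectors into a single sequence vector and compare sequences by cosine similarity. The sketch below uses tiny hand-written vectors in place of real encoder outputs.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sequence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-dimensional "token embeddings" for two short sequences.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[2.0, 2.0]]
score = cosine_similarity(mean_pool(query_tokens), mean_pool(doc_tokens))
```

In a retrieval setting, documents would be ranked by this score against the query vector; other pooling choices (for example, taking a dedicated classification token's vector) are also used in practice.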
MLMs are not a natural fit for generative tasks (text generation, summarisation, translation), which call for [[Autoregressive Language Model]]s or encoder-decoder models with an autoregressive decoder.
## Contrast with Autoregressive Models
| Property | Masked LM (e.g., BERT) | Autoregressive LM (e.g., GPT) |
|---|---|---|
| Direction | Bidirectional | Left-to-right only |
| Output | Token at masked position | Next token, repeated to build a continuation |
| Use case | Classification, embeddings | Open-ended generation |
| Architecture | Encoder-only | Decoder-only |
## Notable Examples
- **BERT** (Google, 2018) — the original MLM at scale; spawned a large family of derivatives.
- **RoBERTa** (Facebook AI, now Meta, 2019) — improved BERT training recipe that drops the next-sentence prediction objective.
- **DeBERTa** (Microsoft, 2020) — disentangled attention for stronger understanding performance.
## Related
- [[Autoregressive Language Model]]
- [[Large Language Model]]
- [[Transformer Architecture]]
- [[Embedding]]
- [[Embedding-Based Retrieval]]
- [[Foundation Model]]
## Sources
- [[AI Engineering - Chip Huyen]]