Masked Language Model - Albert Masoliver's learning site

## Definition

A **masked language model** (MLM) is trained to predict randomly masked tokens anywhere in a sequence, using context from *both* directions: the tokens before and after the mask. BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) is the canonical example:

> "My favorite \_\_ is blue." → predict **color**

Because both left and right context are available, MLMs develop rich bidirectional representations well suited to understanding tasks. Unlike [[Autoregressive Language Model]]s, MLMs are not trained to generate open-ended outputs; their native output is a probability distribution over the vocabulary *at the masked positions only*.

## Architecture

MLMs use an **encoder-only** transformer. The full input sequence (with mask tokens inserted) is processed in one forward pass, and the encoder produces a contextualised embedding for every position. A classification head over the vocabulary, applied to the embedding at each `[MASK]` position, then predicts the original token. This contrasts with decoder-only autoregressive models (e.g., GPT family) and encoder-decoder models (e.g., T5, seq2seq).

## Training Signal

During pretraining, a fixed percentage of tokens (15% in the original BERT) is randomly selected for prediction, and the model is trained to recover the original tokens. Because `[MASK]` never appears in downstream inputs, replacing every selected token with it would create a mismatch between pretraining and fine-tuning, so BERT uses a mixed strategy:

- 80% of selected tokens → replaced with `[MASK]`
- 10% → replaced with a random token
- 10% → left unchanged

The loss is computed only over the selected positions. The targets are the original tokens themselves, so training is self-supervised: no human labels are required.

## Primary Use Cases

According to Huyen (2024), masked language models are most commonly applied to:

- **Text classification**: sentiment analysis, topic classification, spam detection.
- **Named entity recognition (NER)** and sequence labelling.
- **Code debugging**: understanding both preceding and following code is necessary to identify errors.
- **Semantic embeddings**: bidirectional context makes MLM-derived embeddings highly informative for retrieval and similarity tasks. See [[Embedding]] and [[Embedding-Based Retrieval]].

MLMs are not the natural fit for generative tasks (text generation, summarisation, translation), which require [[Autoregressive Language Model]]s.

## Contrast with Autoregressive Models

| Property | Masked LM (e.g., BERT) | Autoregressive LM (e.g., GPT) |
|---|---|---|
| Direction | Bidirectional | Left-to-right only |
| Output | Token at masked position | Entire continuation |
| Use case | Classification, embeddings | Open-ended generation |
| Architecture | Encoder-only | Decoder-only |

## Notable Examples

- **BERT** (Google, 2018): the original MLM at scale; spawned a large family of derivatives.
- **RoBERTa** (Meta, 2019): improved BERT training recipe; no next-sentence prediction objective.
- **DeBERTa** (Microsoft, 2020): disentangled attention for stronger understanding performance.

## Related

- [[Autoregressive Language Model]]
- [[Large Language Model]]
- [[Transformer Architecture]]
- [[Embedding]]
- [[Embedding-Based Retrieval]]
- [[Foundation Model]]

## Sources

- [[AI Engineering - Chip Huyen]]
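
## Example: BERT-style Masking

The 80/10/10 strategy described under Training Signal can be sketched in a few lines of plain Python. This is an illustrative simplification, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the choice to select each token independently with probability `select_prob` are all assumptions made for the example. Real pipelines also operate on subword IDs and skip special tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Corrupt a token list for MLM training (hypothetical helper).

    Returns (corrupted, labels). labels[i] holds the original token at
    selected positions and None elsewhere, so a loss would be computed
    only where labels[i] is not None.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Assumption: independent per-token selection at rate select_prob.
        if rng.random() >= select_prob:
            continue
        labels[i] = tok  # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

# Toy usage with a made-up vocabulary; select_prob is raised so that a
# five-token sentence is likely to have at least one selected position.
rng = random.Random(0)
vocab = ["my", "favorite", "color", "is", "blue", "red", "cat"]
tokens = ["my", "favorite", "color", "is", "blue"]
corrupted, labels = mask_tokens(tokens, vocab, rng, select_prob=0.5)
for original, seen, target in zip(tokens, corrupted, labels):
    print(f"{original:10} -> {seen:10} target={target}")
```

Positions with a `None` label contribute nothing to the loss, which matches the note's point that the model is scored only on the selected tokens. Because the labels are copied from the input itself, no human annotation is involved.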