## Definition
**Self-supervised learning (SSL)** derives supervisory signals from the *structure of unlabelled data itself*, then trains a model on those signals exactly as in [[Supervised Learning]]. It bridges the unsupervised and supervised paradigms and is the engine behind modern foundation models.
## The Core Idea
Take unlabelled data; design a *pretext task* whose label can be computed from the input alone; train on it. The model is forced to learn representations useful for the pretext task — and those representations turn out useful for many downstream tasks.
## Examples
### Language
- **Next-token prediction.** Given prefix $x_{<t}$, predict $x_t$. The objective of GPT and most LLMs. See [[Pretraining]].
- **Masked language modelling.** Replace random tokens with `[MASK]`; predict them. The BERT objective.
- **Next-sentence prediction.** Predict whether two sentences are consecutive in the source text. BERT's auxiliary objective.
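
To make "the label comes from the input" concrete, here is a minimal sketch of how training pairs for next-token prediction and masked language modelling might be built from a raw token list. The function names, the whitespace tokenisation, and the `mask_prob` and `seed` parameters are illustrative assumptions, not taken from any particular library.

```python
import random

def next_token_pairs(tokens):
    """Return (prefix, target) pairs: every position supplies its own label."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns the corrupted sequence and a dict {position: original token}
    that serves as the prediction target. Simplified on purpose: every
    selected token is replaced by the mask token.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[i] = tok  # the label is the token we just hid
        else:
            corrupted.append(tok)
    return corrupted, targets

sentence = "the cat sat on the mat".split()
pairs = next_token_pairs(sentence)            # five (prefix, target) pairs
corrupted, targets = mask_tokens(sentence, mask_prob=0.3)
```

No human annotation appears anywhere: both the inputs and the targets are computed from `sentence` alone. Production recipes are more involved; BERT, for instance, sometimes keeps or randomly swaps a selected token instead of always masking it.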
### Vision
- **Predict relative position** of image patches.
- **Predict rotation** applied to an image.
- **Reconstruct masked patches** (Masked Autoencoders / MAE).
- **Contrastive learning** (SimCLR, MoCo): pull together augmentations of the same image; push apart different images. DINO is closely related but relies on self-distillation without explicit negative pairs.
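
The rotation pretext task from the list above is perhaps the easiest to sketch: rotate each image by a random multiple of 90 degrees and use the rotation index as the class label. The example below is dependency-free and treats an image as a list of rows; the helper names `rotate90` and `rotation_example` are hypothetical.

```python
import random

def rotate90(image, k):
    """Rotate a 2-D grid (list of rows) counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image

def rotation_example(image, rng):
    """Build one self-labelled training example: (rotated image, rotation class)."""
    k = rng.randrange(4)  # class 0..3 stands for 0, 90, 180, 270 degrees
    return rotate90(image, k), k

rng = random.Random(0)
image = [[1, 2],
         [3, 4]]
rotated, label = rotation_example(image, rng)
```

A classifier trained to recover `label` from `rotated` has to pick up on which way objects are normally oriented, which is the kind of structure the pretext task is meant to expose.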
### Audio
- **Predict future audio frames.** wav2vec, which is trained to tell true future frames apart from distractor frames.
- **Contrastive predictive coding** (CPC).
### Multimodal
- **CLIP**: pair images and text from the web; pull matched image-text pairs together, push mismatched ones apart.
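
A CLIP-style objective can be sketched as a symmetric cross-entropy over a similarity matrix, where row `i` of the image batch is assumed to be paired with row `i` of the text batch. This is a simplified, dependency-free illustration rather than CLIP's actual implementation: the function names are hypothetical, and the temperature is fixed here, whereas CLIP learns it during training.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against an integer target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Each image should pick out its own caption (rows) ...
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each caption should pick out its own image (columns).
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings: aligned pairs should score a much lower loss than shuffled ones.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The "label" for each image is simply the index of its own caption in the batch, so the pairing found on the web is the only supervision involved.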
## Why It Works
The pretext task forces the model to capture meaningful structure to succeed. A model that can predict the next token must implicitly model syntax, semantics, and world knowledge. A model that can reconstruct masked image patches must understand object shapes, textures, and scene composition.
The representations transfer to downstream tasks, often matching or outperforming representations from purely supervised training, especially when downstream labels are scarce.
## Difference from Unsupervised Learning
The loss is a standard supervised one (for example cross-entropy or a contrastive loss); only the *source* of the label differs: it is computed from the data rather than annotated by people. From an algorithmic standpoint, SSL is supervised; from a data-engineering standpoint, it is unsupervised.
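
As a worked instance, the next-token objective can be written as an ordinary cross-entropy loss in which the target at each position is the next token of the same sequence. The symbols $p_\theta$ (the model's predicted distribution) and $T$ (the sequence length) are notation introduced here for illustration:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
$$

This has the same form as the loss of a supervised classifier over the vocabulary; the only change is that the target $x_t$ is read off the data instead of being supplied by an annotator.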
## Why It Took Off
Three preconditions came together around 2018–2020:
1. **Vast unlabelled corpora** (Common Crawl, ImageNet without labels, YouTube audio).
2. **Compute scale** to train large models on those corpora.
3. **Architectures** ([[Transformer Architecture]]) that exploited the data scale.
The result was the foundation-model era: train once on a massive self-supervised pretext task; adapt many times to downstream tasks.
## Related
- [[Pretraining]]
- [[Unsupervised Learning]]
- [[Supervised Learning]]
- [[Foundation Model]]
- [[Large Language Model]]