Self-Supervised Learning - Albert Masoliver's learning site

## Definition

**Self-supervised learning (SSL)** derives supervisory signals from the *structure of unlabelled data itself*, then trains a model as if it were [[Supervised Learning]]. It is the bridge between the unsupervised and supervised paradigms, and the engine behind modern foundation models.

## The Core Idea

Take unlabelled data; design a *pretext task* whose label can be computed from the input alone; train on it. The model is forced to learn representations useful for the pretext task, and those representations turn out to be useful for many downstream tasks.

## Examples

### Language

- **Next-token prediction.** Given prefix $x_{<t}$, predict $x_t$. The objective of GPT and most LLMs. See [[Pretraining]].
- **Masked language modelling.** Replace random tokens with `[MASK]`; predict them. The BERT objective.
- **Next-sentence prediction.** Predict whether two sentences are consecutive. An auxiliary objective in BERT.

### Vision

- **Predict relative position** of image patches.
- **Predict rotation** applied to an image.
- **Reconstruct masked patches** (Masked Autoencoders / MAE).
- **Contrastive learning.** SimCLR, MoCo: pull together augmentations of the same image; push apart different images. DINO applies the same multi-view idea through self-distillation, without explicit negative pairs.

### Audio

- **Predict future audio frames.** wav2vec, which learns to tell the true future frame apart from distractors.
- **Contrastive predictive coding** (CPC).

### Multimodal

- **CLIP.** Pair images and text from the web; pull matched image-text pairs together, push apart mismatched ones.

## Why It Works

The pretext task forces the model to capture meaningful structure to succeed. A model that can predict the next token must implicitly model syntax, semantics, and world knowledge. A model that can reconstruct masked image patches must understand object shapes, textures, and scene composition. The representations transfer to downstream tasks, often outperforming representations from purely supervised training, especially when downstream labels are scarce.

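To make "a label computed from the input alone" concrete, here is a minimal sketch of how the two language pretext tasks turn a raw token sequence into (input, label) pairs. The function names `next_token_pairs` and `mask_tokens` are hypothetical, whitespace splitting stands in for a real tokenizer, and the masking is simplified (BERT's actual recipe selects about 15% of tokens and does not always replace the selected token with `[MASK]`).

```python
import random

MASK = "[MASK]"


def next_token_pairs(tokens):
    """Next-token prediction: each prefix x_{<t} is an input, x_t is its label."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Masked language modelling (simplified): hide random tokens and
    keep the originals as labels, keyed by position."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[i] = tok  # the label comes from the input itself
        else:
            masked.append(tok)
    return masked, labels


# Toy "corpus": no human annotation anywhere.
tokens = "the cat sat on the mat".split()

pairs = next_token_pairs(tokens)
print(pairs[0])  # (['the'], 'cat')

masked, labels = mask_tokens(tokens, mask_prob=0.5)
print(masked, labels)
```

In both cases the training loss would then be an ordinary cross-entropy between the model's prediction and the recovered label; the only unusual part is where the label came from.
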
## Difference from Unsupervised Learning

The loss is exactly a supervised one (cross-entropy, contrastive); only the *source* of the label differs: it is computed from the input rather than supplied by an annotator. From an algorithmic standpoint, SSL is supervised; from a data-engineering standpoint, it is unsupervised.

## Why It Took Off

Three preconditions came together around 2018–2020:

1. **Vast unlabelled corpora** (Common Crawl, ImageNet without labels, YouTube audio).
2. **Compute scale** to train large models on those corpora.
3. **Architectures** ([[Transformer Architecture]]) that exploited the data scale.

The result was the foundation-model era: train once on a massive self-supervised pretext task; adapt many times to downstream tasks.

## Related

- [[Pretraining]]
- [[Unsupervised Learning]]
- [[Supervised Learning]]
- [[Foundation Model]]
- [[Large Language Model]]