Supervised Learning - Albert Masoliver's learning site

## Definition

**Supervised learning** is the paradigm of learning a function $f: X \to Y$ from a labelled dataset $\{(x_i, y_i)\}_{i=1}^n$. The "supervision" is the label $y_i$, a target the model is told to predict from the input $x_i$.

## Two Sub-Tasks

- **Regression**: the output is continuous (house prices, temperatures, sales). See [[Linear Regression]].
- **Classification**: the output is categorical (spam/not-spam, disease diagnosis, image category). See [[Logistic Regression]], [[kNN]].

## The Learning Objective

Choose a model $f_\theta$ parameterised by $\theta$ and minimise the expected loss over the data distribution $P$:

$$
\theta^* = \arg\min_\theta \mathbb{E}_{(x, y) \sim P} \left[ L(f_\theta(x), y) \right]
$$

Since $P$ is unknown, in practice we minimise the *empirical risk* on the training set, usually with a regularisation term added:

$$
\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(f_\theta(x_i), y_i) + \Omega(\theta)
$$

with $\Omega$ a regulariser ([[L1 Regularization]], [[L2 Regularization]]).

## Common Loss Functions

| Task                       | Loss                                                                            |
| -------------------------- | ------------------------------------------------------------------------------- |
| Regression                 | Squared error $(y - \hat y)^2$, absolute error $\lvert y - \hat y \rvert$       |
| Binary classification      | Cross-entropy / log loss                                                        |
| Multi-class classification | Categorical cross-entropy                                                       |
| Ranking                    | Pairwise hinge, listwise losses                                                 |
| Imbalanced classes         | Focal loss, weighted cross-entropy                                              |

See [[Loss Functions]] for a deeper treatment.

## Model Families

- **Linear**: [[Linear Regression]], [[Logistic Regression]].
- **Distance-based**: [[kNN]], [[Support Vector Machine]].
- **Tree-based**: [[Decision Trees]], [[Random Forest]], [[XGBoost]].
- **Probabilistic**: [[Naive Bayes]].
- **Neural**: see [[9 - Deep Learning Notes Hub]].

## The Generalisation Bargain

The whole game of supervised learning is to minimise training loss *while controlling* the gap between training and test performance, the **generalisation gap**. See [[Bias-Variance Tradeoff]], [[Overfitting and Underfitting]].
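
The regularised empirical-risk objective and the generalisation gap can be made concrete with a small sketch. Everything specific here is an assumption chosen for illustration, not part of the note: a one-dimensional linear model $f_\theta(x) = wx + b$, squared-error loss, an L2 penalty $\Omega(\theta) = \lambda w^2$, synthetic data drawn from a hypothetical ground truth $y = 3x + 1$ plus noise, and the helper names `fit`, `empirical_risk`, and `make_data`.

```python
import random


def predict(theta, x):
    """Linear model f_theta(x) = w * x + b."""
    w, b = theta
    return w * x + b


def empirical_risk(theta, data):
    """Mean squared error: (1/n) * sum of L(f_theta(x_i), y_i)."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)


def fit(data, lam=0.01, lr=0.05, steps=2000):
    """Gradient descent on empirical risk + lam * w**2.

    The L2 penalty is applied to the weight only, not the bias
    (an illustrative choice).
    """
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n + 2 * lam * w
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def make_data(n, rng):
    """Synthetic labelled pairs from the assumed truth y = 3x + 1 plus noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


rng = random.Random(0)
train = make_data(50, rng)   # used to choose theta
test = make_data(200, rng)   # held out, stands in for the distribution P

theta_hat = fit(train)
train_risk = empirical_risk(theta_hat, train)
test_risk = empirical_risk(theta_hat, test)
gap = test_risk - train_risk  # estimate of the generalisation gap

print(f"theta_hat={theta_hat}, train={train_risk:.4f}, test={test_risk:.4f}, gap={gap:.4f}")
```

The held-out `test` set is only a finite-sample stand-in for the expectation over $P$, so `gap` is itself a noisy estimate. Raising `lam` shrinks `w` towards zero, which typically raises training risk in exchange for lower variance.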
## Labels: The Bottleneck

In practice, the cost of labelling data often dominates the cost of training models. Strategies to reduce it:

- **Active learning**: the model selects the most informative examples for labelling.
- **Semi-supervised learning**: combine a few labels with many unlabelled examples.
- **[[Self-Supervised Learning]]**: derive labels from the structure of unlabelled data.
- **Synthetic data**: generate labelled examples programmatically.

## Related

- [[Machine Learning]]
- [[Unsupervised Learning]]
- [[Reinforcement Learning]]
- [[Loss Functions]]
- [[Bias-Variance Tradeoff]]