Loss Functions - Albert Masoliver's learning site

## Definition

A **loss function** $L(\hat y, y)$ measures the discrepancy between a model's prediction $\hat y$ and the true target $y$. Optimisation minimises the average loss over the training set. The choice of loss shapes what the model considers "correct", and therefore what it learns.

## Regression Losses

- **Squared error** (MSE): $L = (\hat y - y)^2$. Smooth; penalises large errors quadratically. The default for regression. Its minimiser is the conditional *mean* of the target.
- **Absolute error** (MAE): $L = |\hat y - y|$. Linear penalty. Robust to outliers. Its minimiser is the conditional *median*.
- **Huber loss**: quadratic for small residuals, linear for large ones. Combines MSE smoothness with MAE robustness. Used heavily in robust regression and Deep Q-Networks.
- **Log-cosh**: $L = \log(\cosh(\hat y - y))$. Behaves like $\tfrac{1}{2}(\hat y - y)^2$ for small residuals and like $|\hat y - y| - \log 2$ for large ones, so it acts as a smooth approximation of MAE.

See [[MSE MAE RMSE]] for more.

## Classification Losses

### Binary Cross-Entropy (Log Loss)

For binary labels $y \in \{0, 1\}$ and predicted probability $\hat p$:

$$
L = -[y \log \hat p + (1 - y) \log(1 - \hat p)]
$$

The standard loss for [[Logistic Regression]] and binary classification with neural networks. Minimising it is equivalent to maximising the likelihood of the observed labels under a Bernoulli model.

### Categorical Cross-Entropy

For one-hot labels $y_c$ across $K$ classes, with predicted class probabilities $\hat p_c$:

$$
L = -\sum_{c=1}^K y_c \log \hat p_c
$$

Combined with a softmax output for multi-class classification.

### Hinge Loss

For SVM: $L = \max(0, 1 - y \cdot \hat y)$ with $y \in \{-1, +1\}$ and $\hat y$ the raw score. The loss is zero for points classified correctly with a margin of at least 1 ($y \cdot \hat y \ge 1$) and grows linearly for points inside the margin or on the wrong side.

### Focal Loss

$L = -(1 - \hat p)^\gamma \log \hat p$, where $\hat p$ is the predicted probability of the true class and $\gamma \ge 0$ is a focusing parameter ($\gamma = 0$ recovers cross-entropy). Down-weights well-classified examples; focuses on hard ones. Designed for severe class imbalance (object detection).

## Properties

A good loss function:

- **Aligns with the metric you care about.** Don't train with MSE then report MAE.
- **Is differentiable** (or has a differentiable surrogate). Gradient-based optimisation needs gradients.
- **Is convex** when possible: every local minimum is then a global minimum, and strict convexity makes that minimum unique.
  Many useful losses (for example, cross-entropy on linear models) are convex in the model parameters.
- **Is bounded below**: otherwise the objective has no minimum and optimisation can diverge.

## Surrogate Losses

Some metrics (accuracy, F1, AUC) are non-differentiable: they are piecewise constant in the model outputs, so they provide no useful gradient. Training instead optimises a *surrogate* loss (such as cross-entropy) that correlates with the true metric. The gap between surrogate and metric is a real source of error and is sometimes addressed with task-specific losses (e.g., differentiable surrogates of AUC).

## Loss Engineering

In production, the loss often becomes the focal point of ML engineering:

- **Asymmetric losses** for unequal error costs.
- **Multi-task losses** combining multiple objectives.
- **Custom losses** encoding business logic (for example, a refund miss costs 10x more than a false alarm).

A model is the loss it optimises. Change the loss; change the model.

## Related

- [[Gradient Descent]]
- [[MSE MAE RMSE]]
- [[Logistic Regression]]
- [[Cross-Entropy Loss]]
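
## Worked Example

As a minimal sketch of the definitions in the sections above, the plain-Python functions below evaluate each loss for a single example. The function names, the Huber threshold `delta`, the clipping constant `eps`, and the `miss_cost` weight are assumptions chosen for illustration; they do not come from any particular library. The last lines check numerically, on a small made-up target list with one outlier, that the best constant prediction under MSE is the mean and under MAE is the median.

```python
import math
from statistics import mean, median

# --- Regression losses (y_hat: prediction, y: target) ---
def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    return abs(y_hat - y)

def huber(y_hat, y, delta=1.0):
    # delta (assumed default 1.0) is where the loss switches from quadratic to linear.
    r = abs(y_hat - y)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def log_cosh(y_hat, y):
    # Overflow-safe form of log(cosh(r)): |r| + log(1 + exp(-2|r|)) - log(2).
    r = abs(y_hat - y)
    return r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)

# --- Classification losses ---
def binary_cross_entropy(p_hat, y, eps=1e-12):
    # Clip the probability so log() never receives 0 (eps is an assumed constant).
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    return -sum(y_c * math.log(max(p_c, eps)) for p_c, y_c in zip(p_hat, y_onehot))

def hinge(score, y):
    # y is -1 or +1; score is the raw (unsquashed) model output.
    return max(0.0, 1.0 - y * score)

def focal(p_true, gamma=2.0, eps=1e-12):
    # p_true is the predicted probability of the true class.
    p = min(max(p_true, eps), 1.0)
    return -((1.0 - p) ** gamma) * math.log(p)

def weighted_bce(p_hat, y, miss_cost=10.0, eps=1e-12):
    # Hypothetical asymmetric loss: a missed positive costs miss_cost times more.
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(miss_cost * y * math.log(p) + (1 - y) * math.log(1.0 - p))

# --- Mean vs. median: best constant prediction on data with an outlier ---
def average_loss(loss, c, targets):
    return sum(loss(c, y) for y in targets) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0, 100.0]        # made-up data; 100.0 is the outlier
candidates = [i / 10 for i in range(1001)]   # grid 0.0, 0.1, ..., 100.0

best_mse = min(candidates, key=lambda c: average_loss(squared_error, c, targets))
best_mae = min(candidates, key=lambda c: average_loss(absolute_error, c, targets))

print(best_mse, mean(targets))    # 22.0 22.0 -> MSE is pulled towards the outlier
print(best_mae, median(targets))  # 3.0 3.0   -> MAE ignores it
```

The grid search is deliberately naive: it only illustrates which constant each loss prefers. In practice these losses would be minimised over model parameters with gradient-based optimisation.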