## Definition
A **loss function** $L(\hat y, y)$ measures the discrepancy between a model's prediction $\hat y$ and the true target $y$. Optimisation minimises the average loss over the training set. The choice of loss shapes what the model considers "correct" — and therefore what it learns.
## Regression Losses
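- **Squared error** (MSE): $L = (\hat y - y)^2$. Smooth, penalises large errors quadratically. The default for regression. Its minimiser is the conditional *mean* of the target.
- **Absolute error** (MAE): $L = |\hat y - y|$. Linear penalty, so it is more robust to outliers than MSE, but it is not differentiable at zero. Its minimiser is the conditional *median*.
- **Huber loss**: quadratic near zero, linear far from zero. With residual $r = \hat y - y$ and threshold $\delta > 0$, $L_\delta = \tfrac{1}{2} r^2$ if $|r| \le \delta$, else $\delta(|r| - \tfrac{1}{2}\delta)$. Combines MSE smoothness with MAE robustness. Used heavily in robust regression and Deep Q-Networks.
- **Log-cosh**: $L = \log(\cosh(\hat y - y))$. Behaves like $\tfrac{1}{2}(\hat y - y)^2$ for small errors and like $|\hat y - y| - \log 2$ for large ones, so it acts as a smooth, twice-differentiable approximation of MAE.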
See [[MSE MAE RMSE]] for more.
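The four regression losses can be sketched as plain per-example Python functions. This is a minimal illustration, not a reference implementation: the `delta` default, the helper `average_loss`, and the toy `targets` list are assumptions made for the example. The grid search at the end checks the mean/median claim on data with one outlier.

```python
import math
from statistics import mean

def squared_error(y_hat, y):
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    return abs(y_hat - y)

def huber(y_hat, y, delta=1.0):
    # delta is the (assumed) threshold where the loss switches from quadratic to linear.
    r = abs(y_hat - y)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)

def log_cosh(y_hat, y):
    # Overflow-safe rewrite of log(cosh(r)) as |r| + log(1 + exp(-2|r|)) - log(2).
    r = abs(y_hat - y)
    return r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)

def average_loss(loss, constant, targets):
    # Hypothetical helper: average loss of predicting the same constant for every target.
    return mean(loss(constant, y) for y in targets)

targets = [1.0, 2.0, 3.0, 4.0, 100.0]        # toy data with one outlier
candidates = [i / 10 for i in range(1001)]   # constant predictions 0.0, 0.1, ..., 100.0

best_mse = min(candidates, key=lambda c: average_loss(squared_error, c, targets))
best_mae = min(candidates, key=lambda c: average_loss(absolute_error, c, targets))

print(best_mse)  # 22.0, the mean of targets (pulled towards the outlier)
print(best_mae)  # 3.0, the median of targets (unaffected by the outlier)
```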
## Classification Losses
### Binary Cross-Entropy (Log Loss)
For binary labels $y \in \{0, 1\}$ and predicted probability $\hat p$:
$$
L = -[y \log \hat p + (1 - y) \log(1 - \hat p)]
$$
The standard loss for [[Logistic Regression]] and binary classification with neural networks. Minimising it is equivalent to maximising the likelihood of the observed labels under a Bernoulli model, because the loss is the negative log of that likelihood.
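A minimal sketch of the binary cross-entropy for a single example follows. The clipping constant `eps` and the helper `bernoulli_likelihood` are assumptions added for the illustration; clipping is one common way to avoid `log(0)`, not the only one.

```python
import math

def binary_cross_entropy(p_hat, y, eps=1e-12):
    # Clip the probability away from 0 and 1 so the logarithms stay finite.
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bernoulli_likelihood(p_hat, y):
    # Probability that a Bernoulli(p_hat) variable takes the observed label y.
    return p_hat if y == 1 else 1.0 - p_hat

loss = binary_cross_entropy(0.8, 1)
# exp(-loss) recovers the Bernoulli likelihood, so a lower loss means a higher likelihood.
print(math.isclose(math.exp(-loss), bernoulli_likelihood(0.8, 1)))  # True
```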
### Categorical Cross-Entropy
For one-hot labels $y_c$ across $K$ classes:
$$
L = -\sum_{c=1}^K y_c \log \hat p_c
$$
Combined with softmax output for multi-class classification.
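One way to sketch the softmax and cross-entropy pairing is to work with log-probabilities directly, which avoids overflow for large logits. The function names and the example logits below are assumptions for illustration.

```python
import math

def log_softmax(logits):
    # Subtract the maximum before exponentiating (log-sum-exp trick) for stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def categorical_cross_entropy(logits, one_hot):
    # L = -sum_c y_c * log(p_c), with p = softmax(logits).
    log_p = log_softmax(logits)
    return -sum(y_c * lp for y_c, lp in zip(one_hot, log_p))

logits = [2.0, 1.0, 0.1]   # raw scores for K = 3 classes
one_hot = [1, 0, 0]        # the true class is the first one
loss = categorical_cross_entropy(logits, one_hot)
```

With a one-hot label, only the log-probability of the true class contributes, so the loss reduces to $-\log \hat p_{\text{true}}$.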
### Hinge Loss
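Used by support vector machines: $L = \max(0, 1 - y \cdot \hat y)$ with $y \in \{-1, +1\}$ and $\hat y$ the raw, unthresholded score. The loss is zero when the example is correctly classified with a margin of at least 1 ($y \hat y \ge 1$) and grows linearly otherwise, that is, inside the margin or on the wrong side of the decision boundary.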
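A short sketch of the hinge loss makes its regions concrete. The argument name `score` stands for the raw model output $\hat y$ and is an assumption of this example.

```python
def hinge_loss(score, y):
    # y must be -1 or +1; score is the raw (unthresholded) model output.
    return max(0.0, 1.0 - y * score)

print(hinge_loss(2.0, 1))    # 0.0: correct, with y * score >= 1
print(hinge_loss(0.5, 1))    # 0.5: correct side, but inside the margin
print(hinge_loss(-1.0, 1))   # 2.0: wrong side of the boundary
```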
### Focal Loss
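$L = -(1 - \hat p)^\gamma \log \hat p$, where $\hat p$ is the predicted probability of the true class and $\gamma \ge 0$ is a focusing parameter; $\gamma = 0$ recovers ordinary cross-entropy. Down-weights well-classified examples; focuses on hard ones. Designed for severe class imbalance (object detection).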
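The down-weighting effect can be checked with a small sketch. The default `gamma=2.0` and the clipping constant `eps` are assumptions chosen for the example, and the optional class-balancing weight used in some variants is left out.

```python
import math

def focal_loss(p_true, gamma=2.0, eps=1e-12):
    # p_true is the predicted probability of the true class.
    p = min(max(p_true, eps), 1.0)
    return -((1.0 - p) ** gamma) * math.log(p)

def cross_entropy(p_true):
    return -math.log(p_true)

# Ratio of focal loss to cross-entropy equals the modulating factor (1 - p)^gamma.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)   # about 0.01: easy example, heavily down-weighted
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)   # about 0.81: hard example, almost full weight
```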
## Properties
A good loss function:
- **Aligns with the metric you care about.** Don't train with MSE then report MAE.
- **Is differentiable** (or has differentiable surrogates). Gradient-based optimisation needs gradients.
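- **Is convex** when possible: every local minimum of a convex objective is a global minimum, and strict convexity makes that minimum unique. Many useful losses (e.g. cross-entropy on linear models) are convex in the parameters.
- **Is bounded below**: otherwise the objective can be pushed towards $-\infty$ and optimisation never settles on a minimiser.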
## Surrogate Losses
Some metrics, such as accuracy, F1, and AUC, are piecewise constant in the model's parameters, so their gradient is zero almost everywhere and gradient-based training cannot use them directly. Training instead optimises a *surrogate* loss (e.g. cross-entropy) that correlates with the true metric. The gap between surrogate and metric is a real source of error and is sometimes addressed with task-specific losses (e.g., differentiable surrogates of AUC).
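A toy sketch shows why a surrogate is needed. The one-parameter model $\hat p = \sigma(w x)$, the data `xs` and `ys`, and the function names are all assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w, xs, ys):
    # Thresholding at 0.5 makes accuracy a step function of w.
    preds = [1 if sigmoid(w * x) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def mean_cross_entropy(w, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(ys)

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# Accuracy is identical at both weights, so it gives no direction to move in.
same_accuracy = accuracy(0.5, xs, ys) == accuracy(1.0, xs, ys)
# Cross-entropy still distinguishes them and prefers the more confident model.
ce_improves = mean_cross_entropy(1.0, xs, ys) < mean_cross_entropy(0.5, xs, ys)
```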
## Loss Engineering
In production, the loss often becomes the focal point for ML engineering:
- **Asymmetric losses** for unequal error costs.
- **Multi-task losses** combining multiple objectives.
- **Custom losses** encoding business logic (a refund miss costs 10x more than a false alarm).
A model is the loss it optimises. Change the loss; change the model.
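As an illustration of an asymmetric loss, the sketch below weights the two terms of binary cross-entropy differently. The mapping of "refund miss" to a false negative (label 1 predicted with low probability), the parameter names `miss_cost` and `false_alarm_cost`, and their defaults are assumptions for the example, not a prescribed design.

```python
import math

def weighted_bce(p_hat, y, miss_cost=10.0, false_alarm_cost=1.0, eps=1e-12):
    # Scale the positive-class term by miss_cost and the negative-class term
    # by false_alarm_cost, so missing a positive is penalised more heavily.
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(miss_cost * y * math.log(p)
             + false_alarm_cost * (1 - y) * math.log(1.0 - p))

miss = weighted_bce(0.2, 1)         # true positive scored at only 0.2
false_alarm = weighted_bce(0.8, 0)  # true negative scored at 0.8
# Both errors are equally "wrong" in probability terms, but the miss costs about 10 times more.
```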
## Related
- [[Gradient Descent]]
- [[MSE MAE RMSE]]
- [[Logistic Regression]]
- [[Cross-Entropy Loss]]