## Definition
**Backpropagation** is the algorithm for efficiently computing gradients of a neural network's loss with respect to all its parameters. The engine behind every neural network's training. Popularised by Rumelhart, Hinton & Williams (1986).
## The Idea
Use the **chain rule** of calculus to propagate gradients backward through the network — starting from the loss, layer by layer to the inputs.
For composition $L = f_L \circ f_{L-1} \circ \dots \circ f_1$:
$
\frac{\partial L}{\partial \theta_\ell} = \frac{\partial L}{\partial h^{(L)}} \cdot \frac{\partial h^{(L)}}{\partial h^{(L-1)}} \cdots \frac{\partial h^{(\ell)}}{\partial \theta_\ell}
$
## Two Passes
### Forward Pass
Compute and store all intermediate activations:
$
h^{(\ell)} = \sigma(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)})
$
### Backward Pass
Working from the output back to the input:
1. Compute $\delta^{(L)} = \nabla_{h^{(L)}} L$ (the loss gradient at the output).
2. For each layer $\ell = L, L-1, \dots, 1$:
- $\nabla_{W^{(\ell)}} L = \delta^{(\ell)} (h^{(\ell-1)})^\top$
- $\nabla_{b^{(\ell)}} L = \delta^{(\ell)}$
- $\delta^{(\ell-1)} = (W^{(\ell)})^\top \delta^{(\ell)} \odot \sigma'(z^{(\ell-1)})$
The whole pass is $O(\text{forward pass})$ — gradient computation costs roughly the same as inference.
## Automatic Differentiation
Modern frameworks (PyTorch, JAX, TensorFlow) implement **autograd**: build a computation graph during the forward pass, compute gradients by traversing it backward automatically. Hand-derivation is obsolete; you write the forward pass and gradients come for free.
## Why It Was Revolutionary
Before backpropagation, training multi-layer networks required arbitrary heuristics. With it, gradient-based optimisation works for arbitrary differentiable architectures — unlocking neural networks as a general modelling tool.
## Failure Modes
- **Vanishing gradients.** Repeated multiplication of small derivatives (e.g., sigmoid with saturating units) makes gradients shrink toward zero in deep networks. Mitigations: ReLU activations, residual connections, careful initialisation.
- **Exploding gradients.** Repeated multiplication of large derivatives. Mitigations: gradient clipping, normalisation.
See [[Vanishing and Exploding Gradients]].
## Memory Cost
Backprop must store activations from the forward pass to compute backward gradients. For very deep networks, this is the memory bottleneck. Mitigations:
- **Gradient checkpointing.** Re-compute some activations on the backward pass.
- **Mixed precision training.** Store activations in lower precision (fp16 / bf16).
## Beyond First-Order
Backpropagation computes first derivatives. Second-order methods need the Hessian (or approximations). For most deep learning, first-order with adaptive optimisers (Adam) wins on cost-effectiveness.
## Related
- [[Multilayer Perceptron]]
- [[Gradient Descent]]
- [[Vanishing and Exploding Gradients]]
- [[Optimizers SGD Adam]]