Backpropagation - Albert Masoliver's learning site

## Definition **Backpropagation** is the algorithm for efficiently computing gradients of a neural network's loss with respect to all its parameters. The engine behind every neural network's training. Popularised by Rumelhart, Hinton & Williams (1986). ## The Idea Use the **chain rule** of calculus to propagate gradients backward through the network — starting from the loss, layer by layer to the inputs. For composition $L = f_L \circ f_{L-1} \circ \dots \circ f_1$: $ \frac{\partial L}{\partial \theta_\ell} = \frac{\partial L}{\partial h^{(L)}} \cdot \frac{\partial h^{(L)}}{\partial h^{(L-1)}} \cdots \frac{\partial h^{(\ell)}}{\partial \theta_\ell} $ ## Two Passes ### Forward Pass Compute and store all intermediate activations: $ h^{(\ell)} = \sigma(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}) $ ### Backward Pass Working from the output back to the input: 1. Compute $\delta^{(L)} = \nabla_{h^{(L)}} L$ (the loss gradient at the output). 2. For each layer $\ell = L, L-1, \dots, 1$: - $\nabla_{W^{(\ell)}} L = \delta^{(\ell)} (h^{(\ell-1)})^\top$ - $\nabla_{b^{(\ell)}} L = \delta^{(\ell)}$ - $\delta^{(\ell-1)} = (W^{(\ell)})^\top \delta^{(\ell)} \odot \sigma'(z^{(\ell-1)})$ The whole pass is $O(\text{forward pass})$ — gradient computation costs roughly the same as inference. ## Automatic Differentiation Modern frameworks (PyTorch, JAX, TensorFlow) implement **autograd**: build a computation graph during the forward pass, compute gradients by traversing it backward automatically. Hand-derivation is obsolete; you write the forward pass and gradients come for free. ## Why It Was Revolutionary Before backpropagation, training multi-layer networks required arbitrary heuristics. With it, gradient-based optimisation works for arbitrary differentiable architectures — unlocking neural networks as a general modelling tool. ## Failure Modes - **Vanishing gradients.** Repeated multiplication of small derivatives (e.g., sigmoid with saturating units) makes gradients shrink toward zero in deep networks. Mitigations: ReLU activations, residual connections, careful initialisation. - **Exploding gradients.** Repeated multiplication of large derivatives. Mitigations: gradient clipping, normalisation. See [[Vanishing and Exploding Gradients]]. ## Memory Cost Backprop must store activations from the forward pass to compute backward gradients. For very deep networks, this is the memory bottleneck. Mitigations: - **Gradient checkpointing.** Re-compute some activations on the backward pass. - **Mixed precision training.** Store activations in lower precision (fp16 / bf16). ## Beyond First-Order Backpropagation computes first derivatives. Second-order methods need the Hessian (or approximations). For most deep learning, first-order with adaptive optimisers (Adam) wins on cost-effectiveness. ## Related - [[Multilayer Perceptron]] - [[Gradient Descent]] - [[Vanishing and Exploding Gradients]] - [[Optimizers SGD Adam]]