## Definition

**Vanishing** and **exploding gradients** are pathologies in deep neural network training where gradients shrink toward zero (vanishing) or grow uncontrollably (exploding) as they propagate backward through many layers via [[Backpropagation]]. Both break learning.

## Mechanism

Backpropagation multiplies the gradient by one Jacobian per layer. If the singular values of these Jacobians are systematically < 1, gradients shrink exponentially in depth. If they are systematically > 1, gradients grow exponentially.

For an $L$-layer network with identical layers of singular value $\sigma$:

$$
\|\text{gradient at layer 0}\| \sim \sigma^L
$$

For $\sigma = 0.9$, $L = 50$: the gradient shrinks by a factor of $0.9^{50} \approx 0.005$. For $\sigma = 1.1$, $L = 50$: the gradient grows by a factor of $1.1^{50} \approx 117$. Both ruin training.

## Symptoms

- **Vanishing:** training loss plateaus or decreases very slowly; early layers barely change weights.
- **Exploding:** training loss spikes; NaN values; weights become huge.

## Why Sigmoid/Tanh Were Replaced

Sigmoid's derivative peaks at 0.25. Tanh's peaks at 1.0, but only at zero input, and both derivatives fall toward zero as units saturate. In a deep sigmoid network, the activation derivative contributes a factor of at most 0.25 per layer, so gradients vanish fast. ReLU has derivative 0 or 1, which avoids vanishing for active units, one of the reasons it became standard.

## Mitigations

### For Vanishing Gradients

- **Better activations.** ReLU, GELU, ELU do not saturate for positive inputs.
- **Better initialisation.** [[Weight Initialization]] (He, Xavier) scales weights to preserve gradient norms.
- **Skip / residual connections.** ResNet's $h^{(\ell+1)} = h^{(\ell)} + f(h^{(\ell)})$ gives gradients a "highway" to flow back through.
- **[[Batch Normalization]]** and [[Layer Normalization]] keep activations in healthy ranges.
- **LSTM / GRU** for sequence models: explicitly designed to avoid vanishing gradients through gating.

### For Exploding Gradients

- **Gradient clipping.** Cap the gradient norm: $g \leftarrow g \cdot \min(1, c / \|g\|)$.
- **Lower learning rate.**
- **Better initialisation.**
- **Normalisation layers.**

## Connection to RNN Difficulty

Vanishing/exploding gradients are particularly severe in **recurrent** networks because the gradient is multiplied by the same weight matrix at every timestep. A sequence of length 100 means 100 such multiplications. This is why:

- **LSTM** (Hochreiter & Schmidhuber, 1997) was developed: its gated cell state is updated additively, so information is preserved without repeated multiplication by the recurrent weight matrix.
- **Transformers** largely displaced RNNs for long sequences: attention provides direct paths between any two positions, sidestepping the recurrent multiplication entirely.

## The Modern Status

With modern initialisation + ReLU-family activations + normalisation + residual connections, vanishing/exploding gradients are *mostly* a solved problem for feedforward networks. Very deep transformers (100+ layers) still need careful initialisation and gradient clipping but train reliably.

## Related

- [[Backpropagation]]
- [[Weight Initialization]]
- [[Batch Normalization]]
- [[Skip Connections]]
- [[Long Short-Term Memory]]
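
## Example: Depth Scaling

The $\sigma^L$ scaling from the Mechanism section can be checked numerically. The sketch below is illustrative only: it assumes the idealised case of identical layers that each scale the gradient norm by exactly $\sigma$, and the function names `gradient_scale` and `backprop_norms` are made up for this note.

```python
def gradient_scale(sigma: float, depth: int) -> float:
    """Factor by which the gradient norm changes across `depth` identical layers."""
    return sigma ** depth


def backprop_norms(sigma: float, depth: int, start_norm: float = 1.0) -> list:
    """Gradient norm at each layer, walking backward from the output.

    Idealised model: every layer multiplies the norm by the same `sigma`.
    """
    norms = [start_norm]
    for _ in range(depth):
        norms.append(norms[-1] * sigma)
    return norms


if __name__ == "__main__":
    for sigma in (0.9, 1.0, 1.1):
        scale = gradient_scale(sigma, 50)
        print(f"sigma={sigma}: scale over 50 layers = {scale:.4g}")
```

For $\sigma = 0.9$ and $\sigma = 1.1$ the printed factors come out near 0.005 and 117, matching the figures quoted above. Only $\sigma = 1$ leaves the norm unchanged. Real networks have layer-dependent Jacobians, so this is a rough model rather than a prediction.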
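
## Example: Gradient Clipping

The clipping rule $g \leftarrow g \cdot \min(1, c / \|g\|)$ from the Mitigations section can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the gradient is assumed to be a flat list of floats, and the names `clip_by_norm` and `max_norm` (playing the role of $c$) are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`.

    Implements g <- g * min(1, c / ||g||) with c = max_norm.
    """
    norm = math.sqrt(sum(x * x for x in grad))
    if norm == 0.0:
        # A zero gradient has nothing to clip; also avoids division by zero.
        return list(grad)
    scale = min(1.0, max_norm / norm)
    return [x * scale for x in grad]


if __name__ == "__main__":
    big = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5, gets rescaled
    small = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left alone
    print(big, small)
```

Clipping changes only the length of the gradient, never its direction, and gradients already inside the threshold pass through untouched. In practice, deep learning frameworks ship their own clipping utilities that work across all parameter tensors at once.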