## Definition
**Vanishing** and **exploding gradients** are pathologies in deep neural network training where gradients shrink toward zero (vanishing) or grow uncontrollably (exploding) as they propagate through many layers via [[Backpropagation]]. Both break learning.
## Mechanism
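Backpropagation computes the gradient at an early layer by multiplying together the Jacobians of every layer between it and the loss. If the singular values of these Jacobians are systematically below 1, the product shrinks exponentially with depth. If they are systematically above 1, it grows exponentially.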
For an $L$-layer network with identical layers of singular value $\sigma$:
$$
\|\text{gradient at layer 0}\| \sim \sigma^L
$$
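For $\sigma = 0.9$, $L = 50$: the gradient is scaled by a factor of $0.9^{50} \approx 0.005$. For $\sigma = 1.1$, $L = 50$: it is scaled by a factor of $1.1^{50} \approx 117$. In the first case the early layers receive almost no learning signal; in the second, updates become large enough to destabilise training.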
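The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the idealised model only (identical layers, one singular value); the helper name `gradient_scale` is made up for illustration.

```python
def gradient_scale(sigma: float, depth: int) -> float:
    """Factor by which a gradient is scaled after `depth` identical layers,
    each of which scales it by `sigma` (the idealised model in the text)."""
    return sigma ** depth


shrink = gradient_scale(0.9, 50)  # roughly 0.005
growth = gradient_scale(1.1, 50)  # roughly 117

print(f"sigma=0.9, L=50: {shrink:.4f}")
print(f"sigma=1.1, L=50: {growth:.1f}")
```

Real networks have a spread of singular values that differs per layer, so this should be read as an order-of-magnitude intuition rather than a prediction.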
## Symptoms
- **Vanishing:** training loss plateaus or decreases very slowly; early layers barely change weights.
- **Exploding:** training loss spikes; NaN values; weights become huge.
## Why Sigmoid/Tanh Were Replaced
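Sigmoid's derivative peaks at 0.25; tanh's peaks at 1.0 but falls off quickly once the unit saturates. In a deep sigmoid network, each layer multiplies the gradient by an activation derivative of at most 0.25, so the gradient vanishes fast unless the weights compensate. ReLU has derivative exactly 1 for active units and 0 for inactive ones, so the gradient passes through active units unattenuated. This is one of the reasons it became standard.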
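A rough sketch of the comparison, using only the activation derivatives and ignoring the weight matrices entirely (a simplifying assumption, not how a full backward pass works). The function names and the depth of 20 are chosen for illustration.

```python
import math


def sigmoid_grad(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)          # maximum value 0.25, reached at x = 0


def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2  # maximum value 1.0, reached at x = 0


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0    # exactly 1 for active units


depth = 20

# Best case for each activation: every unit sits where its derivative is largest.
sigmoid_best = sigmoid_grad(0.0) ** depth   # 0.25 ** 20
tanh_best = tanh_grad(0.0) ** depth         # 1.0 ** 20
relu_active = relu_grad(1.0) ** depth       # 1.0 ** 20

# A moderately saturated tanh unit (pre-activation 2.0) at every layer.
tanh_saturated = tanh_grad(2.0) ** depth

print(f"sigmoid, best case:   {sigmoid_best:.3e}")
print(f"tanh, best case:      {tanh_best:.3e}")
print(f"tanh, saturated:      {tanh_saturated:.3e}")
print(f"ReLU, active units:   {relu_active:.3e}")
```

Even in its best case the sigmoid product collapses after a few dozen layers, while tanh only holds up if units stay near zero.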
## Mitigations
### For Vanishing Gradients
- **Better activations.** ReLU, GELU, and ELU do not saturate for positive inputs, so their derivative stays near 1 there.
- **Better initialisation.** [[Weight Initialization]] (He, Xavier) scales weights to preserve gradient norms.
- **Skip / residual connections.** ResNet's $h^{(\ell+1)} = h^{(\ell)} + f(h^{(\ell)})$ gives gradients a "highway" to flow back through.
- **Normalisation.** [[Batch Normalization]] and [[Layer Normalization]] keep activations at a stable scale, away from the saturating regions of the activation function.
- **LSTM / GRU** for sequence models — explicitly designed to avoid vanishing gradients through gating.
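To see why the residual "highway" helps, consider a scalar toy model in which every layer's function $f$ has the same small derivative. This is an assumption made purely for illustration; `local_slope` and both function names are hypothetical. A plain layer contributes a factor $f'(h)$ to the backward product, while a residual layer contributes $1 + f'(h)$.

```python
def plain_chain_gradient(local_slope: float, depth: int) -> float:
    """d h_L / d h_0 for h_{l+1} = f(h_l), assuming f'(h) == local_slope."""
    return local_slope ** depth


def residual_chain_gradient(local_slope: float, depth: int) -> float:
    """d h_L / d h_0 for h_{l+1} = h_l + f(h_l), assuming f'(h) == local_slope.
    The identity term adds 1 to every per-layer factor."""
    return (1.0 + local_slope) ** depth


plain = plain_chain_gradient(0.01, 50)        # vanishes
residual = residual_chain_gradient(0.01, 50)  # stays of order 1

print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {residual:.3f}")
```

The same identity term can push the product above 1 when $f'$ is large, which is one reason residual networks are usually combined with careful initialisation and normalisation.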
### For Exploding Gradients
- **Gradient clipping.** Cap the gradient norm: $g \leftarrow g \cdot \min(1, c / \|g\|)$.
- **Lower learning rate.**
- **Better initialisation.**
- **Normalisation layers.**
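The clipping rule $g \leftarrow g \cdot \min(1, c / \|g\|)$ can be sketched in plain Python as follows. The function name `clip_by_norm` and the list-of-floats representation are assumptions for illustration; deep learning frameworks ship their own utilities for this (for example `torch.nn.utils.clip_grad_norm_` in PyTorch).

```python
import math


def clip_by_norm(grad: list[float], max_norm: float) -> list[float]:
    """Rescale `grad` so its L2 norm is at most `max_norm`.
    The direction is preserved; gradients already within the limit are unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)  # nothing to rescale, and avoids division by zero
    scale = min(1.0, max_norm / norm)
    return [g * scale for g in grad]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5.0, scaled down
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is

print(clipped)
print(unchanged)
```

Clipping bounds the size of each update but does not address the underlying cause, so it is typically used alongside the other mitigations rather than instead of them.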
## Connection to RNN Difficulty
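Vanishing/exploding gradients are particularly severe in **recurrent** networks because the same recurrent weight matrix is applied at every timestep, so backpropagation through time multiplies by it once per step. A sequence of length 100 means 100 multiplications by the same matrix, with no chance for different layers to balance each other out. This is why: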
- **LSTM** (Hochreiter & Schmidhuber, 1997) was developed: its gated cell state is updated additively, so information and gradients can persist across timesteps without repeated multiplication by the recurrent weight matrix.
- **Transformers** largely displaced RNNs for long sequences: attention provides direct paths between any two positions, sidestepping the recurrent multiplication entirely.
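The effect of reusing one matrix at every timestep can be sketched with a linear recurrence $h_t = W h_{t-1}$, where backpropagating through $T$ steps multiplies the gradient by $W^\top$ $T$ times. The nonlinearity is left out and the 2×2 matrices are made up for illustration; they are diagonal so that the dominant factor (0.9 or 1.1) is easy to read off.

```python
def matvec_transpose(W: list[list[float]], v: list[float]) -> list[float]:
    """Compute W^T v for a small dense matrix stored as nested lists."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * v[i] for i in range(rows)) for j in range(cols)]


def l2_norm(v: list[float]) -> float:
    return sum(x * x for x in v) ** 0.5


def backprop_through_time(W, grad_last, steps):
    """Gradient w.r.t. the state `steps` timesteps earlier, for h_t = W h_{t-1}."""
    g = list(grad_last)
    for _ in range(steps):
        g = matvec_transpose(W, g)
    return g


W_contracting = [[0.9, 0.0], [0.0, 0.5]]  # largest singular value 0.9
W_expanding = [[1.1, 0.0], [0.0, 0.5]]    # largest singular value 1.1

g_small = l2_norm(backprop_through_time(W_contracting, [1.0, 1.0], 100))
g_large = l2_norm(backprop_through_time(W_expanding, [1.0, 1.0], 100))

print(f"100 steps, sigma_max=0.9: {g_small:.3e}")
print(f"100 steps, sigma_max=1.1: {g_large:.3e}")
```

Over 100 steps, a 10% deviation from 1 in either direction changes the gradient norm by several orders of magnitude.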
## The Modern Status
With modern initialisation + ReLU-family activations + normalisation + residual connections, vanishing/exploding gradients are *mostly* a solved problem for feedforward networks. Very deep transformers (100+ layers) still need careful initialisation and gradient clipping but train reliably.
## Related
- [[Backpropagation]]
- [[Weight Initialization]]
- [[Batch Normalization]]
- [[Skip Connections]]
- [[Long Short-Term Memory]]