## Definition
**Weight initialisation** is the choice of initial values for a neural network's parameters before training. It is a surprisingly consequential decision: bad initialisation causes [[Vanishing and Exploding Gradients]], slow convergence, or outright training failure.
## Bad Defaults
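- **Zeros:** every neuron in a layer is identical → identical gradients → the neurons never learn to differentiate (the symmetry is never broken). Catastrophic.
- **Constants:** same symmetry problem.
- **Very large values:** saturating activations (tanh, sigmoid) are pushed into their flat regions, where gradients vanish; with non-saturating activations (ReLU), activations and gradients explode.
- **Very small values:** activations shrink towards zero layer by layer; gradients vanish.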
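The symmetry problem can be checked numerically. The sketch below is a minimal illustration, assuming a tiny one-hidden-layer tanh network with a squared-error loss; the helper `hidden_weight_grad` and the layer sizes are hypothetical, chosen only for this example.

```python
import numpy as np

def hidden_weight_grad(W1, W2, x, y):
    """Gradient of 0.5 * (W2 @ tanh(W1 @ x) - y)**2 with respect to W1."""
    h = np.tanh(W1 @ x)                # hidden activations, shape (n_hidden,)
    err = W2 @ h - y                   # scalar prediction error
    delta = err * W2 * (1.0 - h**2)    # backprop through tanh, shape (n_hidden,)
    return np.outer(delta, x)          # one row per hidden unit

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0

# Constant init: all four hidden units start with the same weights.
const_grad = hidden_weight_grad(np.full((4, 3), 0.5), np.full(4, 0.5), x, y)
# Zero init: the gradient reaching the hidden layer is exactly zero here.
zero_grad = hidden_weight_grad(np.zeros((4, 3)), np.zeros(4), x, y)
# Random init: each hidden unit gets its own gradient.
rand_grad = hidden_weight_grad(rng.normal(size=(4, 3)), rng.normal(size=4), x, y)

rows_identical_const = np.allclose(const_grad, const_grad[0])
rows_identical_rand = np.allclose(rand_grad, rand_grad[0])
print(rows_identical_const, rows_identical_rand, np.all(zero_grad == 0))
```

With constant weights every row of the gradient is the same, so a gradient step keeps the hidden units identical; random weights break the tie.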
## The Key Insight
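To maintain healthy activation magnitudes across layers, the *variance* of activations should be roughly preserved from layer to layer. For a linear layer with independent zero-mean weights and zero-mean inputs, $\operatorname{Var}(y) = n_{\text{in}} \operatorname{Var}(W) \operatorname{Var}(x)$, so preserving variance requires $\operatorname{Var}(W) \approx 1/n_{\text{in}}$. This drives the modern initialisation schemes.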
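A quick way to see why the weight variance matters is to push random inputs through a deep stack of layers and watch the activation scale. The sketch below assumes purely linear layers (no activation function) to isolate the variance argument; `final_activation_std`, the depth, and the width are illustrative choices, not part of any library.

```python
import numpy as np

def final_activation_std(weight_std, depth=30, width=256, batch=512, seed=0):
    """Std of the activations after `depth` linear layers with N(0, weight_std**2) weights."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))          # unit-variance inputs
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(width, width))
        h = h @ W                                # each layer scales the std by about sqrt(width) * weight_std
    return h.std()

width = 256
too_small = final_activation_std(0.01)                  # shrinks by roughly 0.16x per layer
too_large = final_activation_std(0.5)                   # grows by roughly 8x per layer
preserved = final_activation_std(1.0 / np.sqrt(width))  # variance about 1 / n_in

print(f"too small: {too_small:.3e}")
print(f"too large: {too_large:.3e}")
print(f"preserved: {preserved:.3e}")
```

Only the $1/\sqrt{n_{\text{in}}}$ scale keeps the signal at a usable magnitude after 30 layers; the other two collapse or blow up geometrically.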
## Xavier (Glorot) Initialisation
For tanh / sigmoid activations:
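$$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)
$$
where the second argument is the variance. Or uniform on $[-a, a]$ with $a = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})}$, which has the same variance.
Designed to keep the variance of activations (forward pass) and of gradients (backward pass) roughly constant across layers, assuming the activation is approximately linear around zero, as tanh is.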
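The normal and uniform variants target the same variance, since a uniform distribution on $[-a, a]$ has variance $a^2/3$. A minimal NumPy sketch to check this empirically (the function names `xavier_normal` and `xavier_uniform` are defined here for illustration and are not a library API):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Sample an (n_in, n_out) matrix from N(0, 2 / (n_in + n_out))."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) matrix from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 400, 600
target_var = 2.0 / (n_in + n_out)

W_normal = xavier_normal(n_in, n_out, rng)
W_uniform = xavier_uniform(n_in, n_out, rng)

print(f"target variance:  {target_var:.5f}")
print(f"normal variant:   {W_normal.var():.5f}")
print(f"uniform variant:  {W_uniform.var():.5f}")
```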
## He (Kaiming) Initialisation
For ReLU activations:
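$$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
$$
ReLU zeroes roughly half of the pre-activations, which halves the second moment of the signal; the factor of 2 in the variance compensates, preserving the signal scale through ReLU layers. **A common default choice for ReLU-based feedforward networks.**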
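The effect of the factor of 2 shows up clearly in a deep ReLU stack. The sketch below is a toy experiment under assumed settings (20 fully connected layers of width 512, Gaussian inputs); `relu_stack_rms` is a hypothetical helper written for this note.

```python
import numpy as np

def relu_stack_rms(var_numerator, depth=20, width=512, batch=256, seed=0):
    """Root-mean-square activation after `depth` ReLU layers whose weights
    are drawn from N(0, var_numerator / n_in)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(var_numerator / width), size=(width, width))
        h = np.maximum(h @ W, 0.0)   # ReLU
    return np.sqrt(np.mean(h**2))

rms_he = relu_stack_rms(2.0)      # He: variance 2 / n_in
rms_no_fix = relu_stack_rms(1.0)  # variance 1 / n_in, no correction for ReLU

print(f"He init (2 / n_in):  {rms_he:.4f}")
print(f"1 / n_in:            {rms_no_fix:.6f}")
```

With variance $1/n_{\text{in}}$ the mean-square activation is roughly halved at every layer, so after 20 layers the signal is about a thousand times smaller; with $2/n_{\text{in}}$ it stays of order one.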
## LeCun Initialisation
For SELU activations:
$$
W_{ij} \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}}}\right)
$$
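A rough check of how this pairs with SELU: with weights of variance $1/n_{\text{in}}$ and standardised inputs, the activations of a deep SELU stack should stay near zero mean and unit variance. The sketch assumes the standard SELU constants and a toy fully connected stack; `selu_stack_stats` is a hypothetical helper.

```python
import numpy as np

# Standard SELU constants (Klambauer et al.)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    # expm1 is only evaluated on the non-positive part to avoid overflow
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))

def selu_stack_stats(depth=20, width=512, batch=1000, seed=0):
    """Mean and variance of the activations after `depth` SELU layers
    with weights drawn from N(0, 1 / n_in)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
        h = selu(h @ W)
    return h.mean(), h.var()

mean_out, var_out = selu_stack_stats()
print(f"mean after 20 layers: {mean_out:.3f}")
print(f"var after 20 layers:  {var_out:.3f}")
```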
## Orthogonal Initialisation
Initialise weight matrices as random orthogonal matrices (via QR or SVD of a random Gaussian matrix). All singular values of an orthogonal matrix equal 1, so it preserves vector norms; this is particularly useful for RNNs, where preserving gradient norm across time matters.
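A minimal sketch of the QR route, assuming a square weight matrix. The sign correction on the columns is a common step to make the result uniformly distributed over orthogonal matrices; `orthogonal_init` is a name chosen for this example.

```python
import numpy as np

def orthogonal_init(n, rng, gain=1.0):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))   # flip column signs so the distribution is uniform
    return gain * Q

rng = np.random.default_rng(0)
W = orthogonal_init(64, rng)

# Repeated multiplication, as in an RNN unrolled over time, keeps the norm fixed.
v = rng.normal(size=64)
v_after = v.copy()
for _ in range(100):
    v_after = W @ v_after

print(np.allclose(W.T @ W, np.eye(64)))
print(np.linalg.norm(v), np.linalg.norm(v_after))
```

A Gaussian matrix with the same scale would instead stretch some directions and shrink others at every step, which compounds over long sequences.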
## In Practice
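Modern frameworks (PyTorch, TensorFlow/Keras, JAX-based libraries such as Flax) provide these schemes as built-in initialisers and apply one by default, though the default varies: for dense layers, PyTorch uses a Kaiming-uniform variant, Keras uses Glorot uniform, and Flax uses LeCun normal. You rarely override unless tuning carefully: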
```python
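import torch.nn as nn

layer = nn.Linear(256, 128)  # example layer; the sizes are arbitrary
# PyTorch: nn.Linear defaults to kaiming_uniform_(a=sqrt(5)), i.e. U(-1/sqrt(fan_in), 1/sqrt(fan_in))
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # He normal for ReLU layers
nn.init.zeros_(layer.bias)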
```
## Initialisation in Modern LLMs
For very deep transformers, additional tricks layer onto the basics:
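- **Scaling** of residual branches with depth, e.g. GPT-2 scales the weights of residual layers by $1/\sqrt{N}$ at initialisation, where $N$ is the number of residual layers; DeepNorm uses its own depth-dependent constants.
- **ReZero** initialisation: multiply each residual branch by a learnable scalar initialised at zero, so the network starts as the identity and learns the right magnitude.
- **Layer-specific** initialisation that depends on a layer's depth or role in the network.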
These details matter at scale: a 70B-parameter model with bad initialisation may not train at all.
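Why depth-dependent scaling of residual branches helps can be seen in a toy model. The sketch below is not a transformer: it assumes each residual branch is a single random linear map, and `residual_stream_growth` is a hypothetical helper. Each update `x <- x + s * (x @ W)` multiplies the mean-square of the stream by about $1 + s^2$.

```python
import numpy as np

def residual_stream_growth(num_layers, branch_scale, width=256, seed=0):
    """Ratio of final to initial mean-square of x after `num_layers` updates
    x <- x + branch_scale * (x @ W), with W ~ N(0, 1 / width)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(64, width))
    start = np.mean(x**2)
    for _ in range(num_layers):
        W = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
        x = x + branch_scale * (x @ W)
    return np.mean(x**2) / start

num_layers = 24
unscaled = residual_stream_growth(num_layers, 1.0)                      # about 2**num_layers
scaled = residual_stream_growth(num_layers, 1.0 / np.sqrt(num_layers))  # about (1 + 1/num_layers)**num_layers

print(f"unscaled growth: {unscaled:.3e}")
print(f"scaled growth:   {scaled:.3f}")
```

With the branch scaled by one over the square root of the depth, the total growth stays bounded by roughly $e$ regardless of depth; setting the scale to zero reproduces the ReZero starting point, where the stream is passed through unchanged.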
## Related
- [[Backpropagation]]
- [[Vanishing and Exploding Gradients]]
- [[Batch Normalization]]
- [[Activation function]]