Weight Initialization - Albert Masoliver's learning site

## Definition

**Weight initialisation** is the choice of initial values for a neural network's parameters before training. It is a surprisingly consequential decision: bad initialisation causes [[Vanishing and Exploding Gradients]], slow convergence, or outright training failure.

## Bad Defaults

- **Zeros:** every neuron in a layer is identical → identical gradients → neurons never learn to differentiate. Catastrophic.
- **Constants:** same symmetry problem.
- **Very large values:** saturating activations (tanh, sigmoid) flatten out and their gradients vanish; with non-saturating activations, activations and gradients explode.
- **Very small values:** activations collapse to zero; gradients vanish.

## The Key Insight

To maintain healthy activation magnitudes across layers, the *variance* of activations should be roughly preserved from layer to layer. This drives the modern initialisation schemes.

## Xavier (Glorot) Initialisation

For tanh / sigmoid activations:

$$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)
$$

Or uniform on $[-a, a]$ with $a = \sqrt{6 / (n_{\text{in}} + n_{\text{out}})}$, which has the same variance. Here $n_{\text{in}}$ and $n_{\text{out}}$ are the layer's fan-in and fan-out, and the second argument of $\mathcal{N}$ is the variance. Designed to keep the variance of activations (forward pass) and gradients (backward pass) roughly constant across layers with saturating activations.

## He (Kaiming) Initialisation

For ReLU activations:

$$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
$$

Accounts for ReLU zeroing roughly half the units; preserves variance through ReLU layers. **The default for most modern feedforward networks.**

## LeCun Initialisation

For SELU activations:

$$
W_{ij} \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}}}\right)
$$

## Orthogonal Initialisation

Initialise weight matrices as random orthogonal matrices (via QR or SVD of a random Gaussian matrix). Particularly useful for RNNs, where preserving the gradient norm across time steps matters.

## In Practice

Modern frameworks (PyTorch, TensorFlow, JAX) ship these schemes as built-in initialisers and apply a reasonable one by default.
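The variance argument behind the schemes above can be checked numerically. The following is a minimal sketch, not any framework's implementation: the helper names `init_std` and `preactivation_mean_squares`, as well as the width, depth, batch size, and seed, are illustrative assumptions, and plain Python is used so the example runs without dependencies. It pushes Gaussian inputs through a bias-free ReLU network and records the mean squared pre-activation after each layer.

```python
import math
import random


def init_std(scheme, n_in, n_out):
    """Standard deviation of the zero-mean Gaussian used by each scheme."""
    if scheme == "xavier":
        return math.sqrt(2.0 / (n_in + n_out))
    if scheme == "he":
        return math.sqrt(2.0 / n_in)
    if scheme == "lecun":
        return math.sqrt(1.0 / n_in)
    raise ValueError(f"unknown scheme: {scheme!r}")


def preactivation_mean_squares(scheme, width=128, depth=8, batch=8, seed=0):
    """Mean squared pre-activation after each layer of a bias-free ReLU MLP."""
    rng = random.Random(seed)
    std = init_std(scheme, width, width)
    # Unit-variance Gaussian inputs: one list of `width` features per sample.
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    history = []
    for _ in range(depth):
        # Fresh square weight matrix for this layer, drawn from N(0, std^2).
        weights = [[rng.gauss(0.0, std) for _ in range(width)]
                   for _ in range(width)]
        # Pre-activations z = W x for every sample in the batch.
        pre = [[sum(w * a for w, a in zip(row, x)) for row in weights]
               for x in acts]
        history.append(sum(z * z for x in pre for z in x) / (batch * width))
        acts = [[max(z, 0.0) for z in x] for x in pre]  # ReLU
    return history


if __name__ == "__main__":
    for scheme in ("he", "xavier", "lecun"):
        ms = preactivation_mean_squares(scheme)
        print(f"{scheme:>6}: first layer {ms[0]:.3g}, last layer {ms[-1]:.3g}")
```

Under these assumptions, the He run should keep the mean square at roughly the same order of magnitude from the first layer to the last, while the Xavier and LeCun runs (which coincide when fan-in equals fan-out) should lose about a factor of two per ReLU layer. Because the seed is shared, the He weights are exactly $\sqrt{2}$ times the LeCun weights, so the two runs differ by a factor of $2^l$ after $l$ layers.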
You rarely need to override these defaults unless tuning carefully:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)  # illustrative layer; the sizes are arbitrary
# PyTorch default for nn.Linear is Kaiming uniform (a=sqrt(5), nonlinearity='leaky_relu')
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```

## Initialisation in Modern LLMs

For very deep transformers, additional tricks layer onto the basics:

- **Scaling** of residual branches with depth, for example by $1/\sqrt{L}$ for an $L$-layer model (as in GPT-2); DeepNorm uses its own depth-dependent constants.
- **ReZero** initialisation: multiply each residual branch by a learnable scalar initialised to zero, so the network starts as the identity and learns the right magnitude.
- **Layer-specific** initialisation depending on a layer's position in the stack.

These details matter at scale: a 70B-parameter model with bad initialisation may not train at all.

## Related

- [[Backpropagation]]
- [[Vanishing and Exploding Gradients]]
- [[Batch Normalization]]
- [[Activation function]]