Dropout - Albert Masoliver's learning site

## Definition

**Dropout** (Srivastava et al., 2014) is a regularisation technique that randomly "drops" (sets to zero) a fraction of neurons during training. It prevents co-adaptation: neurons can't rely on specific other neurons to handle specific patterns, so they must encode more robust features.

## Algorithm

During training, for each forward pass and each neuron, drop the activation $h_i$ with probability $p$ (typical: 0.2-0.5):

$$
\tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ h_i / (1-p) & \text{with probability } 1-p \end{cases}
$$

The $1/(1-p)$ scaling preserves the expected activation, since $\mathbb{E}[\tilde{h}_i] = p \cdot 0 + (1-p) \cdot h_i / (1-p) = h_i$. This form is sometimes called *inverted dropout*.

At inference, dropout is disabled and all neurons are used.

## Why It Works

Several complementary intuitions:

- **Implicit ensembling.** Each forward pass uses a different subnetwork; over many passes, the model behaves like an ensemble of exponentially many smaller networks sharing weights.
- **Reduces co-adaptation.** Forces neurons to be useful in many contexts.
- **Acts like noise injection.** Random perturbation of activations has a generic regularising effect.

## When to Use

- **MLPs and CNNs**, typically before fully-connected output layers.
- **Pre-2017 norm.** Dropout was the dominant regulariser in the LeCun-Hinton-Bengio "deep learning explosion."

## When NOT to Use

- **Batch-normalised networks.** Dropout and BatchNorm often interact poorly. Modern recipes use one or the other, not both.
- **Most transformers.** Modern LLMs use light dropout (0.0-0.1) or none, combined with layer normalisation, weight decay, and large training data.

## Variants

- **Spatial dropout** for CNNs: drop entire feature maps.
- **DropConnect**: drop weights instead of activations.
- **Variational dropout**: one fixed mask per training sample, reused across the time steps of a recurrent network (Gal & Ghahramani 2016). Its Bayesian interpretation supports uncertainty estimates.
- **DropBlock**: drop contiguous regions in feature maps.

## Probability Setting

- **Input layer:** 0.1-0.2 (low, to preserve input information).
- **Hidden layers:** 0.3-0.5.
- **Tune** if validation performance is sensitive; otherwise 0.5 is a reasonable default.

## In Modern Deep Learning

Dropout has been partially displaced by [[Batch Normalization]] and large data scale, both of which regularise implicitly. But dropout still appears:

- **Output heads** of classifiers.
- **Attention layers** in transformers (small probability, ~0.1).
- **Tasks with limited training data** where other regularisation isn't enough.

## Related

- [[Regularization]]
- [[Batch Normalization]]
- [[Overfitting and Underfitting]]
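
## Example

The inverted dropout rule from the Algorithm section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how any particular framework implements it: the function name `dropout`, its `training` flag, and the optional `rng` argument are hypothetical names chosen for this example, and activations are represented as a plain list of floats.

```python
import random


def dropout(activations, p, training=True, rng=None):
    """Inverted dropout over a list of floats (illustrative sketch).

    During training, each activation is zeroed with probability p and
    survivors are scaled by 1 / (1 - p). At inference, values pass through.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        # Inference (or p = 0): dropout is disabled, all neurons are used.
        return list(activations)
    if rng is None:
        rng = random.Random()
    scale = 1.0 / (1.0 - p)  # keeps the expected activation unchanged
    return [0.0 if rng.random() < p else h * scale for h in activations]


if __name__ == "__main__":
    h = [1.0] * 100_000
    dropped = dropout(h, p=0.5, rng=random.Random(0))
    # The mean should stay close to the original mean of 1.0.
    print(sum(dropped) / len(dropped))
    # With training=False the activations are returned unchanged.
    print(dropout(h[:3], p=0.5, training=False))
```

Scaling at training time rather than at inference means the inference path needs no special handling, which is the usual motivation for the inverted form.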