Skip Connections - Albert Masoliver's learning site

## Definition

**Skip connections** (also called **residual connections**, popularised by ResNet, He et al. 2016) add the input of a block directly to its output, allowing gradients to bypass intermediate layers. They are the architectural innovation that made truly deep networks (50+ layers) trainable.

## The Form

A residual block:

$$ y = x + \mathcal{F}(x) $$

where $\mathcal{F}$ is some learnable function (a few conv or fully-connected layers). The output is the input *plus* the learned residual.

## Why It Works

- **Gradient highway.** Since $\partial y / \partial x = I + \partial \mathcal{F} / \partial x$, backprop always has a direct path through the identity term, bypassing the potentially vanishing gradient of $\mathcal{F}$. This mitigates [[Vanishing and Exploding Gradients]].
- **Easier optimisation landscape.** Identity is a natural starting point; the network learns small perturbations to it rather than the whole function from scratch.
- **Composition friendliness.** Stacking residual blocks composes well: an added block can always fall back to the identity (if $\mathcal{F} = 0$, output = input), so in principle extra depth need not hurt.

## Variants

### ResNet (He et al., 2016)

Identity skips over 2-3 conv layers. Enabled training networks as deep as 152 layers on ImageNet, achieving state-of-the-art results at the time.

### DenseNet (Huang et al., 2017)

Each layer is connected to *all* subsequent layers within a dense block, with feature maps concatenated rather than added. This is a denser connectivity pattern than ResNet-style skips.

### Highway Networks (Srivastava et al., 2015)

Predecessor to ResNet with gated skips: a learned gate controls how much of the input is carried through unchanged.

### Transformer Residuals

Every sublayer in a Transformer is residual: `x + Attention(LayerNorm(x))`, `x + FeedForward(LayerNorm(x))`. This is the pre-norm form; the original Transformer applied LayerNorm after the addition instead. Without these connections, deep Transformer stacks become very difficult to train.

## Implementation Notes

- **Dimension matching.** If $\mathcal{F}(x)$ has a different shape from $x$, project $x$ first (1x1 conv for CNNs, linear projection for MLPs).
- **Pre-activation vs post-activation.** ResNet v2 applies BatchNorm + ReLU before the conv layers (pre-activation); empirically this is slightly better than v1, which applies them after.

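
As a minimal sketch of the dimension-matching note above, the following pure-Python block computes $y = x + \mathcal{F}(x)$ for a small MLP-style block and uses a linear projection on the skip path only when the input and output widths differ. The names `ResidualBlock`, `linear`, `relu`, and `init_linear`, the two-layer form of $\mathcal{F}$, and the initialisation scale are illustrative assumptions, not taken from any particular library.

```python
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def init_linear(n_in, n_out, rng, scale=0.1):
    """Small random weights and zero biases (assumed initialisation)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class ResidualBlock:
    """y = skip(x) + F(x), with F = linear -> ReLU -> linear (hypothetical)."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1, self.b1 = init_linear(n_in, n_hidden, rng)
        self.w2, self.b2 = init_linear(n_hidden, n_out, rng)
        # Project the skip path only when shapes differ; otherwise use identity.
        self.proj = init_linear(n_in, n_out, rng) if n_in != n_out else None

    def residual(self, x):
        hidden = relu(linear(x, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

    def __call__(self, x):
        skip = x if self.proj is None else linear(x, *self.proj)
        return [s + f for s, f in zip(skip, self.residual(x))]


rng = random.Random(0)
block = ResidualBlock(n_in=4, n_hidden=8, n_out=4, rng=rng)
x = [1.0, -2.0, 0.5, 3.0]
y = block(x)  # same width as x, so the skip path is the plain identity

# With the last layer of F zeroed, F(x) = 0 and the block returns x unchanged.
block.w2 = [[0.0] * 8 for _ in range(4)]
assert block(x) == x
```

Zeroing the last layer of $\mathcal{F}$ turns the block into an exact identity, which is the fallback behaviour that makes stacking blocks safe. In a real model, a framework's tensor operations would replace the list arithmetic used here.
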
## The Theoretical Story

With a residual connection at every layer, each block only has to represent a perturbation of the identity function, no matter how deep the stack. Without residuals, very deep networks tend to *underfit*: training error rises with depth, so the additional capacity doesn't help. With residuals, depth almost always helps.

## When NOT to Use Them

There is almost no case against using them in deep networks. The cost is one tensor addition per block (plus a projection where shapes differ); the benefit is dramatic. Modern architectures default to residual connections everywhere.

## Related

- [[Vanishing and Exploding Gradients]]
- [[Batch Normalization]]
- [[Convolutional Neural Network]]
- [[Transformer Architecture]]