## Definition
**Skip connections** (or **residual connections**, popularised by ResNet, He et al. 2016) add the input of a block directly to its output, so activations and gradients can bypass the block's intermediate layers. They are the architectural innovation that made very deep networks (50+ layers) practical to train.
## The Form
A residual block:
$$
y = x + \mathcal{F}(x)
$$
where $\mathcal{F}$ is some learnable function (a few conv or fully-connected layers). The output is the input *plus* the learned residual.
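
As a minimal sketch of the form above, the block below implements $y = x + \mathcal{F}(x)$ in NumPy. The choice of a two-layer MLP for $\mathcal{F}$, the names `residual_block`, `W1`, `W2`, and the layer widths are all illustrative assumptions, not part of any particular architecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """Return x + F(x), where F is a small two-layer MLP (an illustrative choice)."""
    hidden = relu(x @ W1 + b1)     # first layer of F
    residual = hidden @ W2 + b2    # second layer of F, same width as x
    return x + residual            # identity path plus learned residual

rng = np.random.default_rng(0)
d, h = 4, 8                        # feature width and hidden width (arbitrary)
x = rng.normal(size=(2, d))        # batch of 2 inputs

W1, b1 = rng.normal(size=(d, h)) * 0.1, np.zeros(h)
W2, b2 = np.zeros((h, d)), np.zeros(d)   # zeroed last layer, so F(x) = 0

y = residual_block(x, W1, b1, W2, b2)
print(np.allclose(y, x))           # True: with F = 0 the block is the identity
```

Zeroing the last layer of $\mathcal{F}$ is used here only to make the identity behaviour visible; in practice the weights would be trained.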
## Why It Works
- **Gradient highway.** During backprop, the gradient has a direct path through the identity connection — bypassing the potentially vanishing $\mathcal{F}$. Mitigates [[Vanishing and Exploding Gradients]].
- **Easier optimisation landscape.** Identity is a natural starting point; the network learns small perturbations to it rather than the function from scratch.
- **Composition friendliness.** Stacking residual blocks composes well: a block with $\mathcal{F} = 0$ is exactly the identity (output = input), so in principle an added block need not make the network worse.
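
The gradient-highway point can be illustrated with a toy linearised model. By the chain rule, the Jacobian of $y = x + \mathcal{F}(x)$ is $I + \partial\mathcal{F}/\partial x$, whereas a plain layer contributes only $\partial\mathcal{F}/\partial x$. The sketch below assumes small random per-layer Jacobians and ignores nonlinearities; the width, depth, and scale are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 50
# Small random per-layer Jacobians dF/dx (illustrative values, not a trained network).
layer_jacobians = [rng.normal(scale=0.02, size=(d, d)) for _ in range(depth)]

plain = np.eye(d)      # end-to-end Jacobian of a plain stack: product of dF/dx
residual = np.eye(d)   # end-to-end Jacobian of a residual stack: product of (I + dF/dx)
for J in layer_jacobians:
    plain = J @ plain
    residual = (np.eye(d) + J) @ residual

plain_norm = np.linalg.norm(plain)        # Frobenius norm
residual_norm = np.linalg.norm(residual)
print(f"plain: {plain_norm:.3e}, residual: {residual_norm:.3e}")
```

In this setting the plain product shrinks towards zero as depth grows, while the residual product stays close to the identity in scale, which is the behaviour the bullet above describes.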
## Variants
### ResNet (He et al., 2016)
Identity skips around blocks of 2-3 conv layers. Enabled training networks up to 152 layers deep on ImageNet (and over 1,000 layers on CIFAR-10), achieving state-of-the-art results at the time.
### DenseNet (Huang et al., 2017)
Within a dense block, each layer receives the feature maps of *all* preceding layers, combined by concatenation rather than addition, so connectivity is denser than in ResNet.
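
A rough sketch of dense connectivity follows. It assumes fully-connected layers instead of convolutions to stay short, and the growth rate, widths, and random weights are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))    # batch of 2, 4 input features
growth = 3                     # new features added per layer (illustrative)

features = [x]
for _ in range(3):
    inputs = np.concatenate(features, axis=1)               # all earlier outputs, concatenated
    W = rng.normal(size=(inputs.shape[1], growth)) * 0.1    # hypothetical layer weights
    features.append(np.maximum(inputs @ W, 0.0))            # this layer's new features

dense_out = np.concatenate(features, axis=1)
print(dense_out.shape)         # (2, 13): 4 input features + 3 layers * 3 new features
```

Note the contrast with a residual block: earlier features are kept side by side and the width grows with depth, instead of being summed into a fixed-width tensor.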
### Highway Networks (Srivastava et al., 2015)
Predecessor to ResNet with gated skips.
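
For comparison with the residual form, a highway layer can be written as follows, where $H$ is the layer's transform and $T$ is a learned sigmoid gate (the symbols $H$ and $T$ follow the usual presentation of highway networks and do not appear elsewhere in this note):

$$
y = T(x) \odot H(x) + \bigl(1 - T(x)\bigr) \odot x
$$

When $T(x) \to 0$ the layer passes its input through unchanged; when $T(x) \to 1$ it behaves like a plain layer. The residual block $y = x + \mathcal{F}(x)$ can be read as the ungated version in which both paths are always open.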
### Transformer Residuals
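Every sublayer in a Transformer is wrapped in a residual connection. In the pre-LN form: `x + Attention(LayerNorm(x))`, `x + FeedForward(LayerNorm(x))`. The original Transformer (Vaswani et al., 2017) instead normalises after the addition: `LayerNorm(x + Sublayer(x))`. Without these residuals, deep Transformers are very difficult to train beyond a handful of layers.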
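
The pre-LN pattern can be sketched in NumPy as below. This is a simplified illustration: `layer_norm` omits the learned scale and shift, only a feed-forward sublayer is shown (attention is left out), and the helper names and sizes are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, W2):
    """Position-wise two-layer MLP with ReLU."""
    return np.maximum(x @ W1, 0.0) @ W2

def pre_ln_sublayer(x, sublayer):
    """Residual wrapper: x + sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))        # 5 tokens, model width 8 (arbitrary)
W1 = rng.normal(size=(8, 32)) * 0.1
W2 = rng.normal(size=(32, 8)) * 0.1

out = pre_ln_sublayer(tokens, lambda h: feed_forward(h, W1, W2))
print(out.shape)                        # (5, 8): same shape as the input
```

Because the wrapper takes the sublayer as a function, the same `pre_ln_sublayer` could wrap an attention function in place of `feed_forward`.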
## Implementation Notes
- **Dimension matching.** If $\mathcal{F}(x)$ has different shape from $x$, project $x$ first (1x1 conv for CNNs, linear projection for MLPs).
- **Pre-activation vs post-activation.** ResNet v2 applies BatchNorm and ReLU before each conv layer (pre-activation), which keeps the skip path a pure identity; empirically slightly better than v1, where the ReLU follows the addition.
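
The dimension-matching note can be sketched for the MLP case as follows. The function name `residual_block_with_projection`, the optional `W_proj` argument, and the widths are hypothetical; a CNN would use a 1x1 convolution where this sketch uses a matrix product.

```python
import numpy as np

def residual_block_with_projection(x, W1, W2, W_proj=None):
    """Return shortcut(x) + F(x); project x only when F changes the width."""
    residual = np.maximum(x @ W1, 0.0) @ W2           # F(x), may change the feature width
    shortcut = x if W_proj is None else x @ W_proj    # linear projection of the skip path
    return shortcut + residual

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 6                       # arbitrary illustrative widths
x = rng.normal(size=(2, d_in))
W1 = rng.normal(size=(d_in, d_hidden)) * 0.1
W2 = rng.normal(size=(d_hidden, d_out)) * 0.1
W_proj = rng.normal(size=(d_in, d_out)) * 0.1         # maps x from width 4 to width 6

y = residual_block_with_projection(x, W1, W2, W_proj)
print(y.shape)                                        # (2, 6)
```

Without `W_proj` the addition would fail here, since a width-4 input cannot be added to a width-6 residual.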
## The Theoretical Story
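A stack of residual blocks is parameterised as a sequence of perturbations of the identity function, so a deeper residual network can always reproduce a shallower one by driving the extra $\mathcal{F}$ towards zero. Without residuals, very deep networks tend to *underfit*: He et al. observed that deeper plain networks reached higher *training* error than shallower ones (the degradation problem), so the additional capacity did not help. With residuals, added depth usually helps.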
## When NOT to Use Them
There's almost no case for not using them in deep networks. The cost is one tensor addition per block (plus a projection where shapes differ); the benefit is that much deeper networks become trainable. Modern architectures default to residual connections everywhere.
## Related
- [[Vanishing and Exploding Gradients]]
- [[Batch Normalization]]
- [[Convolutional Neural Network]]
- [[Transformer Architecture]]