## Definition
A **Convolutional Neural Network (CNN)** is a neural architecture built on **convolutional layers** that exploit spatial structure in grid-like data (images, audio spectrograms, video). CNNs were the dominant architecture for computer vision from ~2012 to ~2020 and remain strong for many image tasks in 2026.
## The Core Operations
A standard CNN applies the following steps in order:
1. **Convolutional layer** — apply learnable filters across the input. See [[Convolution and Pooling]].
2. **Activation** — typically ReLU.
3. **Pooling** — spatial downsampling (max or average pool).
4. **Repeat** steps 1-3 in deeper stages, typically increasing the channel count as spatial resolution shrinks.
5. **Global pooling or flattening** → fully-connected head → output.
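The steps above can be sketched on a toy input. This is a minimal illustration in plain Python, not an efficient or trainable implementation: the helper names (`conv2d`, `relu`, `max_pool2d`, `global_avg_pool`), the 6x6 image, and the hand-picked vertical-edge filter are all assumptions made for the example. In a real CNN the filter weights are learned and there are many channels per layer.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_avg_pool(fmap):
    """Collapse a whole feature map to one number."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


# Toy 6x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked vertical-edge filter (hypothetical; a trained CNN learns its filters).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

features = conv2d(image, kernel)    # step 1: 4x4 feature map
activated = relu(features)          # step 2
pooled = max_pool2d(activated, 2)   # step 3: 2x2 map
summary = global_avg_pool(pooled)   # step 5: one value per channel
```

The filter responds only where the dark-to-bright edge falls inside its window, and pooling keeps that response while halving the resolution. A classifier head would take one such `summary` value per channel as its input.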
## Three Key Inductive Biases
CNNs encode three assumptions that match real-world image structure:
1. **Locality** — nearby pixels are more correlated than distant ones.
2. **Translation equivariance** — shifting an object in the image shifts the activations by the same amount, because the same filter is applied at every position and so detects the same pattern anywhere.
3. **Hierarchical compositionality** — early filters detect edges; middle layers compose them into textures; deeper layers compose them into parts and objects.
Together, these assumptions form the [[Inductive Bias]] that makes CNNs far more data-efficient than [[Multilayer Perceptron|MLP]]s on image tasks.
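Two of these points can be checked numerically. The sketch below first tests translation equivariance on a toy image, then compares weight counts for a dense layer and a convolutional layer. The helpers `conv2d` and `shift_right`, the 2x2 filter, and the layer sizes (a 224x224 RGB input, 64 output units or filters) are assumptions chosen for illustration.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(image, fill=0):
    """Translate every row one pixel to the right, filling the left edge."""
    return [[fill] + row[:-1] for row in image]


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 3, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kernel = [[1, 0],
          [0, -1]]  # arbitrary 2x2 filter, chosen only for the demo

out = conv2d(image, kernel)
out_shifted = conv2d(shift_right(image), kernel)

# Equivariance: filtering the shifted image yields the shifted feature map.
equivariant = all(
    out_shifted[i][j] == out[i][j - 1]
    for i in range(len(out))
    for j in range(1, len(out[0]))
)

# Weight sharing: weights needed to map a 224x224 RGB image to 64 outputs/filters.
mlp_weights = 224 * 224 * 3 * 64   # dense layer: one weight per pixel per unit
conv_weights = 3 * 3 * 3 * 64      # 3x3 conv: one small filter reused everywhere
```

The feature map moves with the input rather than staying fixed, which is equivariance, not invariance. The weight counts show where much of the data efficiency comes from: the convolution reuses 1,728 weights across all positions, while the dense layer needs about 9.6 million.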
## Historical Milestones
- **LeNet** (LeCun et al., 1989; LeNet-5 in 1998) — first practical CNN, applied to handwritten digit recognition.
- **AlexNet** (Krizhevsky et al., 2012) — won the ImageNet ILSVRC 2012 challenge and kicked off the deep learning revolution.
- **VGG** (2014) — 16-19 layers; simple architecture with small (3x3) convolutions.
- **GoogLeNet / Inception** (2014) — inception modules with parallel filters.
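- **ResNet** (2015) — skip connections make very deep networks trainable (up to 152 layers in the original ImageNet models).
- **EfficientNet** (2019) — compound scaling of depth, width, and resolution.
- **ConvNeXt** (2022) — a CNN modernised with transformer-era design choices, matching the performance of comparable vision transformers.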
## CNN vs Transformer for Vision
Vision transformers (ViT, 2020) showed transformers can match or beat CNNs on image classification when given enough data. Current state (2026):
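- **Transformers** dominate at the largest scales, where very large pretraining datasets and compute budgets are available.
- **CNNs** remain competitive at moderate scale and are often more compute- and data-efficient there.
- **Hybrid models** (e.g., CoAtNet, MaxViT) combine convolution and attention. The two families have also converged: Swin is a transformer with CNN-style local windows and a hierarchical layout, while ConvNeXt is a pure CNN that adopts transformer-era design choices.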
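One factor behind the efficiency point above is how per-layer cost grows with resolution. The sketch below counts multiply-accumulates (MACs) for a single 3x3 convolution and a single global self-attention layer over the same grid of positions and the same channel width. It is a rough back-of-the-envelope model under stated assumptions: stride 1, equal input and output channels, no MLP blocks, no windowing, and an illustrative width of 384. The function names and sizes are hypothetical and not taken from any specific model.

```python
def conv_macs(height, width, channels, kernel=3):
    """MACs for one kernel x kernel convolution with an output of height x width positions."""
    return height * width * kernel * kernel * channels * channels


def attention_macs(height, width, channels):
    """MACs for one global self-attention layer over height * width tokens."""
    n = height * width
    projections = 4 * n * channels * channels   # Q, K, V and output projections
    pairwise = 2 * n * n * channels             # Q @ K^T, then attention weights @ V
    return projections + pairwise


d = 384  # assumed channel width
for side in (14, 28, 56):
    print(side, conv_macs(side, side, d), attention_macs(side, side, d))
```

Under these assumptions the convolution's cost grows linearly with the number of positions, while the attention layer's pairwise term grows quadratically. Attention is the cheaper of the two on the 14x14 grid and the more expensive on the 56x56 grid. Windowed attention, as in Swin, is one way to limit that growth.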
## Modern CNN Defaults (2026)
- **3x3 convolutions** stacked deeply (rather than larger kernels); ConvNeXt-style designs are a notable exception, using 7x7 depthwise convolutions.
- **BatchNorm** after each conv.
- **ReLU or GELU** activations.
- **Residual / skip connections** for depth.
- **Global average pooling** followed by a single linear classifier, instead of large fully-connected layers on flattened features.
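Some simple arithmetic helps explain several of these defaults. The sketch below compares weights for three stacked 3x3 convolutions against one 7x7 convolution with the same receptive field, compares a flattened head against a global-average-pooled head, and shows a skip connection in its simplest form. The helper names, the channel width of 64, the 7x7x512 final feature map, and the 1000 classes are assumptions for the example; biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of one kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def residual(x, f):
    """Skip connection: add the block's input to the output of f."""
    return [xi + fi for xi, fi in zip(x, f(x))]


C = 64  # assumed channel width

# Three 3x3 layers see a 7x7 window, like one 7x7 layer, but with fewer weights.
stacked_rf = receptive_field(3)            # 7
three_3x3 = 3 * conv_params(3, C, C)       # 27 * C^2
one_7x7 = conv_params(7, C, C)             # 49 * C^2

# Classifier head on an assumed 7x7x512 feature map with 1000 classes.
flatten_head = 7 * 7 * 512 * 1000          # flatten, then dense
gap_head = 512 * 1000                      # global average pool, then dense

# If the residual branch outputs zeros, the block passes its input through unchanged.
passthrough = residual([1.0, 2.0], lambda v: [0.0 for _ in v])
```

The stacked 3x3 layers also interleave more non-linearities than a single large kernel. The pass-through behaviour of the residual block is one common intuition for why skip connections ease the training of deep networks.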
## Beyond Images
- **1D CNNs** for audio, time series, raw text.
- **3D CNNs** for video, volumetric medical imaging.
- **Graph convolutional networks** generalise convolution to graphs.
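The same idea of a small shared operation applied over local neighbourhoods carries across these settings. The sketch below shows a 1D convolution over a short signal and a heavily simplified graph aggregation step. Both helpers (`conv1d`, `graph_mean_aggregate`) and the toy data are hypothetical. The graph step only averages each node with its neighbours, leaving out the learned weight matrix and the degree normalisation used in published graph convolutional networks.

```python
def conv1d(signal, kernel):
    """Valid (no padding), stride-1 1D cross-correlation."""
    k = len(kernel)
    return [
        sum(signal[i + a] * kernel[a] for a in range(k))
        for i in range(len(signal) - k + 1)
    ]


def graph_mean_aggregate(features, neighbors):
    """Each node averages its own feature with those of its neighbours."""
    out = []
    for node, x in enumerate(features):
        group = [x] + [features[n] for n in neighbors[node]]
        out.append(sum(group) / len(group))
    return out


# 1D: a difference filter responds to the rising and falling steps of a pulse.
signal = [0, 0, 1, 1, 1, 0, 0]
steps = conv1d(signal, [-1, 1])

# Graph: a three-node path 0 - 1 - 2, with one scalar feature per node.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [3.0, 0.0, 3.0]
smoothed = graph_mean_aggregate(features, neighbors)
```

In the 1D case the neighbourhood is a fixed window along one axis. On a graph it is whatever set of nodes the edges define, which is why the aggregation must work for any number of neighbours.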
## Related
- [[Convolution and Pooling]]
- [[Neural Network Architecture]]
- [[Skip Connections]]
- [[Batch Normalization]]
- [[Transformer Architecture]]