Convolutional Neural Network - Albert Masoliver's learning site

## Definition A **Convolutional Neural Network (CNN)** is a neural architecture built on **convolutional layers** that exploit spatial structure in grid-like data (images, audio spectrograms, video). The dominant architecture for computer vision from ~2012 to ~2020; still strong for many image tasks in 2026. ## The Core Operations Standard CNN block: 1. **Convolutional layer** — apply learnable filters across the input. See [[Convolution and Pooling]]. 2. **Activation** — typically ReLU. 3. **Pooling** — spatial downsampling (max or average pool). 4. Repeat at deeper layers with more channels. 5. **Global pooling or flattening** → fully-connected head → output. ## Three Key Inductive Biases CNNs encode three assumptions that match real-world image structure: 1. **Locality** — nearby pixels are more correlated than distant ones. 2. **Translation equivariance** — moving an object in the image moves the activations correspondingly. The same filter detects the same pattern anywhere. 3. **Hierarchical compositionality** — early filters detect edges; middle layers compose them into textures; deeper layers compose them into parts and objects. This is the [[Inductive Bias]] that makes CNNs vastly more data-efficient than [[Multilayer Perceptron|MLP]]s on image tasks. ## Historical Milestones - **LeNet** (LeCun, 1989) — first practical CNN, digit recognition. - **AlexNet** (Krizhevsky et al., 2012) — won ImageNet, kicked off deep learning revolution. - **VGG** (2014) — 16-19 layers; simple architecture with small (3x3) convolutions. - **GoogLeNet / Inception** (2014) — inception modules with parallel filters. - **ResNet** (2015) — 50-152 layers via skip connections. - **EfficientNet** (2019) — principled scaling of depth, width, resolution. - **ConvNeXt** (2022) — re-engineered CNN matching transformer performance. ## CNN vs Transformer for Vision Vision transformers (ViT, 2020) showed transformers can match or beat CNNs on image classification when given enough data. Current state (2026): - **Transformers** dominate at scale (LLM-grade data and compute). - **CNNs** remain competitive at moderate scale and are computationally more efficient. - **Hybrid models** (Swin, ConvNeXt, MaxViT) combine both. ## Modern CNN Defaults (2026) - **3x3 convolutions** stacked deeply (rather than larger kernels). - **BatchNorm** after each conv. - **ReLU or GELU** activations. - **Residual / skip connections** for depth. - **Global average pooling** instead of fully-connected layers at the head. ## Beyond Images - **1D CNNs** for audio, time series, raw text. - **3D CNNs** for video, volumetric medical imaging. - **Graph convolutional networks** generalise convolution to graphs. ## Related - [[Convolution and Pooling]] - [[Neural Network Architecture]] - [[Skip Connections]] - [[Batch Normalization]] - [[Transformer Architecture]]