## Definition

**Convolution** and **pooling** are the two fundamental operations of a [[Convolutional Neural Network]]. Convolution applies learnable filters across spatial locations; pooling reduces spatial resolution while preserving important information.

## Convolution

A small **kernel** (filter) slides across the input. At each position, the kernel's elements are multiplied with the corresponding input region, and the products are summed (plus a bias $b$) to produce one output value. For an input $x$ and filter $W$ of size $k \times k$:

$$
y[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} x[i+m, j+n] \cdot W[m, n] + b
$$

The same filter is applied at every position (**weight sharing**), which makes the operation translation-equivariant by construction: shifting the input shifts the output by the same amount.

### Hyperparameters

- **Kernel size**: usually 3x3 or 5x5; modern CNNs prefer stacks of 3x3.
- **Stride**: step size between kernel positions; stride 2 halves the resolution.
- **Padding**: zeros added around the input to control output size. "Same" padding preserves spatial dimensions (at stride 1).
- **Dilation**: gaps between kernel elements; expands the receptive field without adding parameters.

### Channels

Real CNNs have multi-channel inputs (an RGB image has 3 channels) and outputs (many filters per layer give many output channels). A layer with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels has $C_{\text{in}} \cdot C_{\text{out}} \cdot k^2$ weights, plus $C_{\text{out}}$ biases.

## Pooling

Reduces spatial dimensions while preserving meaningful structure.

### Max Pooling

Take the maximum value in each region (typical: 2x2 windows, stride 2, which halves each spatial dimension).

- Preserves the strongest activations.
- Provides invariance to small translations.
- The standard default.

### Average Pooling

Take the mean over each region.

- Smoother output.
- Used at the end of the network (global average pooling) to reduce the feature maps to a single vector before classification.

### Global Average Pooling

Replaces the large fully-connected layers in many modern CNNs: average each feature map to one number. Greatly reduces the parameter count and, with it, overfitting.
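The convolution formula and the pooling operations described above can be sketched in plain Python. This is a minimal illustration and not an optimized implementation: it assumes a single channel, a square kernel, and no padding. The function names (`conv2d`, `max_pool2d`, `global_avg_pool`) and the toy `image` and `kernel` values are hypothetical, chosen only for this example.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(x: Matrix, w: Matrix, b: float = 0.0, stride: int = 1) -> Matrix:
    """Single-channel convolution without padding.

    Computes y[i][j] = sum_{m,n} x[i*stride + m][j*stride + n] * w[m][n] + b.
    """
    k = len(w)  # assumption: square k x k kernel
    out_h = (len(x) - k) // stride + 1
    out_w = (len(x[0]) - k) // stride + 1
    return [
        [
            sum(
                x[i * stride + m][j * stride + n] * w[m][n]
                for m in range(k)
                for n in range(k)
            )
            + b
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(x: Matrix, size: int = 2, stride: int = 2) -> Matrix:
    """Max pooling: keep the largest value in each size x size window."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [
        [
            max(
                x[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_avg_pool(x: Matrix) -> float:
    """Global average pooling: reduce one feature map to a single number."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


# Hypothetical 6x6 input with a vertical edge between columns 2 and 3.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
# Hypothetical 3x3 filter that responds to left-to-right intensity increases.
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

feature_map = conv2d(image, kernel)   # 6x6 input, 3x3 kernel -> 4x4 output
pooled = max_pool2d(feature_map)      # 2x2 windows, stride 2 -> 2x2 output
summary = global_avg_pool(pooled)     # one number per feature map
```

With stride 1 and no padding, a 6x6 input and a 3x3 kernel give a 4x4 feature map; the 2x2 max pool then halves each dimension, and global average pooling collapses what remains to a single value. In a real layer the same steps run once per output channel, summing over all input channels.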
## Receptive Field

The region of the input that influences a given activation. It grows with depth: a unit deep in the network sees a large area of the original image, even if individual convolutions are 3x3. Larger receptive fields enable understanding of broader context (object shape, scene layout).

## Modern Variants

- **Depthwise separable convolutions** (MobileNet): factor a convolution into depthwise + pointwise steps; dramatic parameter reduction.
- **Dilated convolutions**: expand the receptive field without losing resolution; useful for segmentation.
- **Transposed convolution**: sometimes called "deconvolution"; used to upsample in segmentation and generative models.

## Related

- [[Convolutional Neural Network]]
- [[Neural Network Architecture]]
- [[Skip Connections]]