## Definition
**Convolution** and **pooling** are the two fundamental operations of a [[Convolutional Neural Network]]. Convolution applies learnable filters across spatial locations; pooling reduces spatial resolution while preserving important information.
## Convolution
A small **kernel** (filter) slides across the input. At each position, the kernel is multiplied element-wise with the input region beneath it, and the products are summed to produce one output value.
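For a single-channel input $x$, a filter $W$ of size $k \times k$, and a bias $b$ (stride 1, no padding):
$$
y[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} x[i+m, j+n] \cdot W[m, n] + b
$$
Strictly speaking this is cross-correlation (the kernel is not flipped), which is what deep learning frameworks compute under the name convolution.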
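The same filter is applied at every position, which is called **weight sharing**. This makes the operation translation-equivariant by construction: shifting the input shifts the output by the same amount (ignoring border effects).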
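The formula above can be sketched directly as nested loops. This is a minimal, unoptimized illustration in plain Python, assuming a single-channel input stored as a list of rows, stride 1, and no padding; the function name `conv2d_single` and the sample values are made up for the example.

```python
def conv2d_single(x, w, b=0.0):
    """Stride-1, no-padding sliding-window sum of x with a k x k kernel w."""
    k = len(w)
    out_h = len(x) - k + 1
    out_w = len(x[0]) - k + 1
    y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = b
            # Same kernel w at every (i, j): this is weight sharing.
            for m in range(k):
                for n in range(k):
                    total += x[i + m][j + n] * w[m][n]
            y[i][j] = total
    return y


x = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
w = [
    [1, 0],
    [0, -1],
]
y = conv2d_single(x, w)
print(y)  # [[-4.0, -4.0, 2.0], [-4.0, -4.0, 4.0], [6.0, 6.0, 6.0]]
```

A 4x4 input with a 2x2 kernel gives a 3x3 output, since the kernel fits at $4 - 2 + 1 = 3$ positions along each axis.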
### Hyperparameters
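- **Kernel size**: usually 3x3 or 5x5; modern CNNs tend to prefer stacks of 3x3 layers, which cover the same receptive field as one larger kernel with fewer parameters.
- **Stride**: step size between kernel positions; stride 2 roughly halves each spatial dimension.
- **Padding**: zeros added around the input to control output size. "Same" padding preserves spatial dimensions at stride 1 (for odd $k$ and no dilation, pad $(k-1)/2$ on each side).
- **Dilation**: gaps between kernel elements; expands the receptive field without adding parameters.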
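How these four hyperparameters interact can be seen in the output size along one spatial axis. The sketch below uses the common floor-based formula with symmetric zero padding; the function name `conv_output_size` is hypothetical, and individual frameworks may treat padding modes differently.

```python
def conv_output_size(n, k, stride=1, padding=0, dilation=1):
    """Output length along one axis for input length n and kernel size k."""
    # A dilated kernel spans dilation * (k - 1) + 1 input positions.
    effective_k = dilation * (k - 1) + 1
    return (n + 2 * padding - effective_k) // stride + 1


print(conv_output_size(32, 3, padding=1))              # 32: "same" padding
print(conv_output_size(32, 3, stride=2, padding=1))    # 16: stride 2 halves
print(conv_output_size(32, 5))                         # 28: no padding shrinks
print(conv_output_size(32, 3, padding=2, dilation=2))  # 32: dilated, "same"
```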
### Channels
Real CNNs have multi-channel inputs (RGB images = 3 channels) and outputs (many filters per layer = many output channels). Each filter spans all input channels, so a layer with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels has $C_{\text{in}} \cdot C_{\text{out}} \cdot k^2$ weights, plus $C_{\text{out}}$ biases if a bias term is used.
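As a quick check of the parameter count, the following sketch computes it for a hypothetical first layer mapping an RGB image to 64 channels with 3x3 kernels. The helper name `conv_param_count` is an assumption for illustration only.

```python
def conv_param_count(c_in, c_out, k, bias=True):
    """Learnable parameters in a standard k x k convolution layer."""
    weights = c_in * c_out * k * k
    biases = c_out if bias else 0
    return weights + biases


print(conv_param_count(3, 64, 3))              # 1792 (1728 weights + 64 biases)
print(conv_param_count(3, 64, 3, bias=False))  # 1728
```

Note that the count does not depend on the image height or width, which is a direct consequence of weight sharing.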
## Pooling
Pooling reduces spatial dimensions by summarizing each local region with a single value. Max and average pooling have no learnable parameters.
### Max Pooling
Takes the maximum value in each region. A typical choice is a 2x2 window with stride 2, which halves each spatial dimension.
- Preserves the strongest activations.
- Provides approximate invariance to small translations.
- A common default for downsampling in classic CNNs.
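A minimal sketch of max pooling on a single feature map, assuming a list-of-rows input and no padding. The function name `max_pool2d` and the sample values are illustrative only.

```python
def max_pool2d(x, size=2, stride=2):
    """Max over each size x size window, moving by `stride` (no padding)."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [
        [
            max(
                x[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

The 4x4 map becomes 2x2, and each output keeps only the largest activation in its window, wherever inside the window it occurred.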
### Average Pooling
Takes the mean over each region.
- Smoother output than max pooling.
- Applied over the whole feature map at the end of the network (global average pooling), it reduces each feature map to a single value, giving one vector for classification.
### Global Average Pooling
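Averages each feature map to a single number. In many modern CNNs it replaces the flatten-and-fully-connected stack before the classifier, typically leaving only one final linear layer. This greatly reduces the parameter count and the risk of overfitting.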
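The sketch below illustrates both the operation and the parameter saving. The shapes (512 channels of 7x7, 1000 classes) are hypothetical numbers chosen for the comparison, not taken from any particular network, and `global_average_pool` is an illustrative name.

```python
def global_average_pool(feature_maps):
    """Reduce a list of C feature maps (each H x W) to a list of C means."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


maps = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
vector = global_average_pool(maps)
print(vector)  # [2.5, 2.0]

# Classifier head size with and without global average pooling
# (assumed shapes: 512 channels, 7 x 7 maps, 1000 classes).
c, h, w, classes = 512, 7, 7, 1000
flatten_fc_params = c * h * w * classes + classes  # flatten, then linear
gap_fc_params = c * classes + classes              # pool, then linear
print(flatten_fc_params, gap_fc_params)  # 25089000 513000
```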
## Receptive Field
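The receptive field of an activation is the region of the input that influences it. It grows with depth: each stride-1 3x3 layer adds 2 pixels per axis, and strided or pooling layers make it grow faster, so a unit deep in the network sees a large area of the original image even if individual convolutions are 3x3.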
Larger receptive fields let a unit take broader context into account (object shape, scene layout).
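Receptive field growth can be computed layer by layer with a standard recurrence: each layer adds `dilation * (k - 1)` steps, scaled by the product of all earlier strides. The sketch below assumes a plain sequential stack described by hypothetical `(kernel, stride, dilation)` tuples.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a unit after a stack of layers.

    layers: (kernel_size, stride, dilation) tuples in input-to-output order.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent units, in input pixels
    for k, s, d in layers:
        rf += d * (k - 1) * jump
        jump *= s
    return rf


print(receptive_field([(3, 1, 1)] * 2))                      # 5
print(receptive_field([(3, 1, 1)] * 3))                      # 7
print(receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)]))    # 8
```

Two stacked 3x3 layers see the same 5x5 area as a single 5x5 kernel, and placing a 2x2 stride-2 pooling layer between two convolutions makes the second convolution's contribution twice as large.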
## Modern Variants
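- **Depthwise separable convolutions** (MobileNet): factor a convolution into a per-channel (depthwise) $k \times k$ step and a 1x1 (pointwise) step that mixes channels, cutting weights from $C_{\text{in}} \cdot C_{\text{out}} \cdot k^2$ to $C_{\text{in}} \cdot k^2 + C_{\text{in}} \cdot C_{\text{out}}$.
- **Dilated convolutions**: expand the receptive field without losing resolution; useful for segmentation.
- **Transposed convolution**: often called "deconvolution", although it does not invert a convolution; used to upsample in segmentation and generative models.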
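Two of these variants can be checked with simple arithmetic. The sketch below compares weight counts (biases ignored) for a standard versus a depthwise separable layer, and computes a transposed convolution's output length using the formula given in PyTorch's `ConvTranspose2d` documentation; other frameworks may use a different convention. All function names and layer sizes are hypothetical.

```python
def standard_conv_weights(c_in, c_out, k):
    return c_in * c_out * k * k


def separable_conv_weights(c_in, c_out, k):
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


def transposed_conv_output_size(n, k, stride=1, padding=0,
                                output_padding=0, dilation=1):
    """Output length along one axis (PyTorch-style convention, assumed)."""
    return (n - 1) * stride - 2 * padding + dilation * (k - 1) + output_padding + 1


std = standard_conv_weights(64, 128, 3)
sep = separable_conv_weights(64, 128, 3)
print(std, sep)  # 73728 8768

# A 4x4 kernel with stride 2 and padding 1 doubles the resolution.
print(transposed_conv_output_size(16, 4, stride=2, padding=1))  # 32
```

For this example the separable layer uses roughly 8 times fewer weights than the standard one.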
## Related
- [[Convolutional Neural Network]]
- [[Neural Network Architecture]]
- [[Skip Connections]]