## Definition
The **kernel trick** is a technique for applying linear algorithms in implicit, often very high-dimensional, feature spaces — by replacing every inner product $x^\top x'$ with a **kernel function** $K(x, x')$ that computes the inner product in the feature space without ever materialising the feature map.
## The Setup
Suppose we have a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ with $D \gg d$ (possibly infinite). We want to apply a linear algorithm in $\mathbb{R}^D$ but computing $\phi(x)$ explicitly is impractical.
If the algorithm only uses inner products $\phi(x_i)^\top \phi(x_j)$, define:
$
K(x_i, x_j) := \phi(x_i)^\top \phi(x_j)
$
If we can compute $K$ directly (without computing $\phi$), the algorithm runs as if in $\mathbb{R}^D$ at the cost of operations in $\mathbb{R}^d$.
## Mercer's Theorem
A function $K(x, x')$ is a valid kernel iff it is **positive semi-definite** — i.e., for any finite set of points, the Gram matrix $K_{ij} = K(x_i, x_j)$ is positive semi-definite. Mercer's theorem then guarantees the existence of *some* feature map $\phi$ for which $K$ is the inner product.
## Common Kernels
- **Linear:** $K(x, x') = x^\top x'$. No mapping; identity case.
- **Polynomial:** $K(x, x') = (x^\top x' + c)^d$. Maps to space of degree-$d$ polynomial features.
- **RBF / Gaussian:** $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. Maps to an *infinite*-dimensional space. The most common non-linear default.
- **Sigmoid:** $K(x, x') = \tanh(\alpha x^\top x' + c)$. Not always positive semi-definite; less common.
- **String kernels, graph kernels** — specialised kernels for non-vector data.
## Where It Applies
Any algorithm that uses inputs only via inner products can be kernelised:
- [[Support Vector Machine]] — the canonical example.
- [[Support Vector Regression]].
- **Kernel PCA** — non-linear dimensionality reduction.
- **Kernel ridge regression.**
- **Gaussian process regression** — kernels are the heart.
## Power and Cost
- **Power:** linear algorithms become non-linear without changing the underlying math.
- **Cost:** the Gram matrix $K_{ij}$ is $n \times n$. Memory and computation scale poorly with sample size — typically prohibitive past ~10-50k samples.
## Why Neural Networks Largely Replaced Kernels
Neural networks scale better with data and learn task-appropriate features automatically. Kernel methods were dominant in the 1995-2010 era; deep networks took over for most large-scale problems. Kernels remain useful for:
- Small datasets.
- Domains with strong, hand-designed similarity functions.
- Theoretical analysis (the **Neural Tangent Kernel** connects deep networks to kernels in a specific limit).
## Related
- [[Support Vector Machine]]
- [[Support Vector Regression]]
- [[Principal Component Analysis]]
- [[Embedding]]