Kernel Trick - Albert Masoliver's learning site

## Definition The **kernel trick** is a technique for applying linear algorithms in implicit, often very high-dimensional, feature spaces — by replacing every inner product $x^\top x'$ with a **kernel function** $K(x, x')$ that computes the inner product in the feature space without ever materialising the feature map. ## The Setup Suppose we have a feature map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ with $D \gg d$ (possibly infinite). We want to apply a linear algorithm in $\mathbb{R}^D$ but computing $\phi(x)$ explicitly is impractical. If the algorithm only uses inner products $\phi(x_i)^\top \phi(x_j)$, define: $ K(x_i, x_j) := \phi(x_i)^\top \phi(x_j) $ If we can compute $K$ directly (without computing $\phi$), the algorithm runs as if in $\mathbb{R}^D$ at the cost of operations in $\mathbb{R}^d$. ## Mercer's Theorem A function $K(x, x')$ is a valid kernel iff it is **positive semi-definite** — i.e., for any finite set of points, the Gram matrix $K_{ij} = K(x_i, x_j)$ is positive semi-definite. Mercer's theorem then guarantees the existence of *some* feature map $\phi$ for which $K$ is the inner product. ## Common Kernels - **Linear:** $K(x, x') = x^\top x'$. No mapping; identity case. - **Polynomial:** $K(x, x') = (x^\top x' + c)^d$. Maps to space of degree-$d$ polynomial features. - **RBF / Gaussian:** $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. Maps to an *infinite*-dimensional space. The most common non-linear default. - **Sigmoid:** $K(x, x') = \tanh(\alpha x^\top x' + c)$. Not always positive semi-definite; less common. - **String kernels, graph kernels** — specialised kernels for non-vector data. ## Where It Applies Any algorithm that uses inputs only via inner products can be kernelised: - [[Support Vector Machine]] — the canonical example. - [[Support Vector Regression]]. - **Kernel PCA** — non-linear dimensionality reduction. - **Kernel ridge regression.** - **Gaussian process regression** — kernels are the heart. ## Power and Cost - **Power:** linear algorithms become non-linear without changing the underlying math. - **Cost:** the Gram matrix $K_{ij}$ is $n \times n$. Memory and computation scale poorly with sample size — typically prohibitive past ~10-50k samples. ## Why Neural Networks Largely Replaced Kernels Neural networks scale better with data and learn task-appropriate features automatically. Kernel methods were dominant in the 1995-2010 era; deep networks took over for most large-scale problems. Kernels remain useful for: - Small datasets. - Domains with strong, hand-designed similarity functions. - Theoretical analysis (the **Neural Tangent Kernel** connects deep networks to kernels in a specific limit). ## Related - [[Support Vector Machine]] - [[Support Vector Regression]] - [[Principal Component Analysis]] - [[Embedding]]