Support Vector Machine - Albert Masoliver's learning site

## Definition A **Support Vector Machine (SVM)** is a classifier that finds the **maximum-margin hyperplane** separating two classes in feature space. For non-linearly separable data, the [[Kernel Trick]] implicitly maps to a higher-dimensional space where separation becomes possible. ## The Margin A hyperplane $w^\top x + b = 0$ separates the data. The **margin** is twice the perpendicular distance from the hyperplane to the nearest training points. SVM maximises this margin — the intuition: a hyperplane with large margin generalises better, because small perturbations of training data don't change the decision. ## Optimisation (Soft-Margin) For non-perfectly-separable data, allow slack variables: $ \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i $ subject to $y_i (w^\top x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$. - $C$ — hyperparameter trading off margin width vs misclassification. High $C$ = harder margin. ## Support Vectors Only training points on or inside the margin (with $\xi_i > 0$ or on the boundary) contribute to the final hyperplane. These are the **support vectors**. The rest are irrelevant — a sparse property. ## Dual Formulation and Kernels The problem rewrites in dual form involving inner products $x_i^\top x_j$. Replace inner products with a kernel $K(x_i, x_j)$ and you have a non-linear classifier without explicit feature mapping. See [[Kernel Trick]]. Common kernels: - **Linear:** $K(x, x') = x^\top x'$. - **Polynomial:** $(x^\top x' + c)^d$. - **RBF (Gaussian):** $\exp(-\gamma \|x - x'\|^2)$. The most popular non-linear default. ## Strengths - **Robust to overfitting** in high-dim spaces. - **Effective in non-linear settings** via kernels. - **Sparse solution** — only support vectors matter at inference. - **Strong theoretical guarantees** (VC dim, margin theory). ## Weaknesses - **Doesn't scale.** Training is $O(n^2)$ to $O(n^3)$; impractical past ~50k samples. - **No probabilistic output natively** (need Platt scaling). - **Hyperparameter sensitive** ($C$, kernel choice, kernel parameters). - **Less interpretable** than logistic regression or trees. ## Multi-class Extensions - **One-vs-one:** train $\binom{k}{2}$ binary SVMs; vote. - **One-vs-rest:** train $k$ binary SVMs; pick highest score. ## When to Use - Small-to-medium datasets (up to ~50k samples). - High-dimensional data (text classification with TF-IDF was an SVM stronghold). - Non-linear problems where neural networks are overkill. For most tabular problems in 2026, gradient-boosted trees outperform SVMs with less tuning. ## Related - [[Kernel Trick]] - [[Support Vector Regression]] - [[Logistic Regression]]