## Definition
A **Support Vector Machine (SVM)** is a classifier that finds the **maximum-margin hyperplane** separating two classes in feature space. For non-linearly separable data, the [[Kernel Trick]] implicitly maps to a higher-dimensional space where separation becomes possible.
## The Margin
A hyperplane $w^\top x + b = 0$ separates the data. The **margin** is twice the perpendicular distance from the hyperplane to the nearest training points.
SVM maximises this margin — the intuition: a hyperplane with large margin generalises better, because small perturbations of training data don't change the decision.
## Optimisation (Soft-Margin)
For non-perfectly-separable data, allow slack variables:
$
\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
$
subject to $y_i (w^\top x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$.
- $C$ — hyperparameter trading off margin width vs misclassification. High $C$ = harder margin.
## Support Vectors
Only training points on or inside the margin (with $\xi_i > 0$ or on the boundary) contribute to the final hyperplane. These are the **support vectors**. The rest are irrelevant — a sparse property.
## Dual Formulation and Kernels
The problem rewrites in dual form involving inner products $x_i^\top x_j$. Replace inner products with a kernel $K(x_i, x_j)$ and you have a non-linear classifier without explicit feature mapping. See [[Kernel Trick]].
Common kernels:
- **Linear:** $K(x, x') = x^\top x'$.
- **Polynomial:** $(x^\top x' + c)^d$.
- **RBF (Gaussian):** $\exp(-\gamma \|x - x'\|^2)$. The most popular non-linear default.
## Strengths
- **Robust to overfitting** in high-dim spaces.
- **Effective in non-linear settings** via kernels.
- **Sparse solution** — only support vectors matter at inference.
- **Strong theoretical guarantees** (VC dim, margin theory).
## Weaknesses
- **Doesn't scale.** Training is $O(n^2)$ to $O(n^3)$; impractical past ~50k samples.
- **No probabilistic output natively** (need Platt scaling).
- **Hyperparameter sensitive** ($C$, kernel choice, kernel parameters).
- **Less interpretable** than logistic regression or trees.
## Multi-class Extensions
- **One-vs-one:** train $\binom{k}{2}$ binary SVMs; vote.
- **One-vs-rest:** train $k$ binary SVMs; pick highest score.
## When to Use
- Small-to-medium datasets (up to ~50k samples).
- High-dimensional data (text classification with TF-IDF was an SVM stronghold).
- Non-linear problems where neural networks are overkill.
For most tabular problems in 2026, gradient-boosted trees outperform SVMs with less tuning.
## Related
- [[Kernel Trick]]
- [[Support Vector Regression]]
- [[Logistic Regression]]