## Definition
A **Gaussian Mixture Model (GMM)** models the data density as a weighted sum of $k$ Gaussian densities:
$$
p(x) = \sum_{c=1}^k \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c)
$$
- $\pi_c$: mixture weights, with $\pi_c \ge 0$ and $\sum_c \pi_c = 1$.
- $\mu_c, \Sigma_c$: mean and covariance of component $c$.
Each component represents a cluster; each point has a *soft* membership across components.
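As a minimal sketch of the definition, the mixture density can be evaluated directly from the formula. The example is restricted to one dimension (so each $\Sigma_c$ is a scalar variance) to stay within the standard library; the function names and the two-component parameters are hypothetical, chosen only for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_c pi_c * N(x | mu_c, sigma_c^2)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture (illustrative values, not fitted to any data).
weights = [0.3, 0.7]
means = [-2.0, 3.0]
variances = [1.0, 0.25]

density_at_zero = mixture_pdf(0.0, weights, means, variances)
```

Because the weights sum to 1 and each component integrates to 1, `mixture_pdf` is itself a valid density.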
## Soft vs Hard Clustering
A GMM gives the posterior probability that a point belongs to each component:
$$
P(c \mid x) = \frac{\pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(x \mid \mu_{c'}, \Sigma_{c'})}
$$
Contrast this with [[K-Means Clustering]], which assigns each point to exactly one cluster: the one with the nearest centroid.
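The posterior above is Bayes' rule applied to the mixture, and a hard label can always be recovered from it by taking the most probable component. A small one-dimensional sketch, with hypothetical helper names and parameter values:

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Posterior P(c | x) for every component c."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # this is p(x), the denominator of Bayes' rule
    return [j / total for j in joint]

# Hypothetical mixture: two equally weighted unit-variance components.
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

soft = responsibilities(1.0, weights, means, variances)  # soft membership, sums to 1
hard = max(range(len(soft)), key=lambda c: soft[c])      # hard label: most probable component
```

Far from every component, the plain densities can underflow to zero; practical implementations usually do this computation in log space.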
## Training: Expectation-Maximization
Fitting a GMM uses [[Expectation-Maximization]] (EM), which alternates two steps:
**E-step:** compute responsibilities $\gamma_{ic} = P(c \mid x_i)$ for each point and component.
**M-step:** update each component's parameters using responsibility-weighted statistics, where $\Sigma_c$ is computed with the newly updated $\mu_c$:
$$
\mu_c = \frac{\sum_i \gamma_{ic} x_i}{\sum_i \gamma_{ic}}, \quad \Sigma_c = \frac{\sum_i \gamma_{ic} (x_i - \mu_c)(x_i - \mu_c)^\top}{\sum_i \gamma_{ic}}, \quad \pi_c = \frac{\sum_i \gamma_{ic}}{n}
$$
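Iterate the two steps until the log-likelihood converges. An EM iteration never decreases the log-likelihood, but the procedure is only guaranteed to reach a local optimum.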
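The E-step and M-step can be sketched in plain Python for the one-dimensional case, where each covariance is a scalar variance. This is an illustrative sketch rather than a reference implementation: the function names, the synthetic data, the initial means, and the `var_floor` safeguard against collapsing variances are all assumptions.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration. Returns updated parameters and the
    log-likelihood evaluated under the parameters passed in."""
    k, n = len(weights), len(data)
    # E-step: responsibilities gamma[i][c] = P(c | x_i).
    gamma, log_lik = [], 0.0
    for x in data:
        joint = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(k)]
        total = sum(joint)
        log_lik += math.log(total)
        gamma.append([j / total for j in joint])
    # M-step: responsibility-weighted mean, variance and weight per component.
    new_w, new_m, new_v = [], [], []
    for c in range(k):
        n_c = sum(g[c] for g in gamma)
        mu = sum(g[c] * x for g, x in zip(gamma, data)) / n_c
        var = sum(g[c] * (x - mu) ** 2 for g, x in zip(gamma, data)) / n_c
        new_w.append(n_c / n)
        new_m.append(mu)
        new_v.append(max(var, var_floor))  # assumed floor, keeps variances positive
    return new_w, new_m, new_v, log_lik

def fit_gmm(data, init_means, n_iter=200, tol=1e-8):
    """Run EM from the given initial means until the log-likelihood gain drops below tol."""
    k = len(init_means)
    weights, means, variances = [1.0 / k] * k, list(init_means), [1.0] * k
    prev = -math.inf
    for _ in range(n_iter):
        weights, means, variances, log_lik = em_step(data, weights, means, variances)
        if log_lik - prev < tol:
            break
        prev = log_lik
    return weights, means, variances, log_lik

# Synthetic data from two known Gaussians, used only to exercise the sketch.
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(700)]
weights, means, variances, log_lik = fit_gmm(data, init_means=[-1.0, 1.0])
```

On this well-separated data the fitted parameters land close to the generating ones; with overlapping clusters or a poor start, the result depends on the initial means.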
## Covariance Structures
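- **Spherical** ($\Sigma_c = \sigma_c^2 I$): each component is a sphere. With equal weights and a single fixed variance shared by all components, EM reduces to soft K-Means, and K-Means itself is the limit as that variance goes to 0.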
- **Diagonal** ($\Sigma$ diagonal): axis-aligned ellipsoids.
- **Tied** (same $\Sigma$ for all components): every component has the same shape and orientation; only the means and weights differ.
- **Full** (different $\Sigma_c$ per component): most flexible, most parameters.
More flexibility → more parameters → more data needed.
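To make the trade-off concrete, the number of free parameters can be counted for each structure. The sketch below assumes $k$ components in $d$ dimensions, $k - 1$ free weights (they sum to 1), and a spherical model in which each component has its own variance; the function name is hypothetical.

```python
def n_parameters(k, d, covariance_type):
    """Free parameters of a GMM with k components in d dimensions."""
    cov = {
        "spherical": k,                  # one variance per component
        "diagonal": k * d,               # d variances per component
        "tied": d * (d + 1) // 2,        # one shared symmetric matrix
        "full": k * d * (d + 1) // 2,    # one symmetric matrix per component
    }[covariance_type]
    return (k - 1) + k * d + cov         # weights + means + covariances

counts = {
    name: n_parameters(k=5, d=10, covariance_type=name)
    for name in ("spherical", "diagonal", "tied", "full")
}
# counts == {"spherical": 59, "diagonal": 104, "tied": 109, "full": 329}
```

The full model grows quadratically in $d$ for every component, which is why it needs the most data.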
## Choosing $k$
- **BIC / AIC**: penalise the number of parameters; fit several values of $k$ and pick the one with the lowest score.
- **Cross-validation** on held-out log-likelihood.
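- **Variational Bayesian GMM**: given an upper bound on $k$, the prior can shrink the weights of unneeded components towards zero, so the effective $k$ is inferred from the data.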
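The information criteria have simple closed forms once a model has been fitted: with maximised log-likelihood $\ln \hat{L}$, $p$ free parameters and $n$ samples, $\text{AIC} = 2p - 2 \ln \hat{L}$ and $\text{BIC} = p \ln n - 2 \ln \hat{L}$. A minimal sketch of selecting $k$ by BIC follows; the helper names are hypothetical, and the log-likelihood values are made-up placeholders, not results from any real fit.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2.0 * n_params - 2.0 * log_lik

def bic(log_lik, n_params, n_samples):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2.0 * log_lik

def select_k(fits, n_samples):
    """fits maps k -> (log_lik, n_params); returns the k with the lowest BIC."""
    return min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_samples))

# PLACEHOLDER numbers for a 1-D GMM (3k - 1 parameters), for illustration only.
fits = {1: (-2400.0, 2), 2: (-1900.0, 5), 3: (-1898.0, 8)}
best_k = select_k(fits, n_samples=1000)
```

Here the jump from $k = 2$ to $k = 3$ improves the log-likelihood too little to pay for three extra parameters, so the criterion prefers $k = 2$.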
## Strengths
- **Soft assignments** — useful for downstream probabilistic reasoning.
- **Captures elliptical clusters** of different sizes and orientations.
- **Density model** — can evaluate $p(x)$ at new points; useful for anomaly detection.
- **Generates data** — sample from $p(x)$ for synthetic data.
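Sampling from a GMM is a two-stage process: pick a component according to the weights, then draw from that component's Gaussian. A one-dimensional sketch with hypothetical names and parameters:

```python
import random

def sample_gmm(n, weights, means, variances, rng):
    """Draw n (value, component) pairs from a 1-D Gaussian mixture."""
    samples = []
    for _ in range(n):
        # Stage 1: choose a component index with probability pi_c.
        c = rng.choices(range(len(weights)), weights=weights)[0]
        # Stage 2: draw from N(mu_c, sigma_c^2); gauss() takes a standard deviation.
        x = rng.gauss(means[c], math_sqrt(variances[c]))
        samples.append((x, c))
    return samples

def math_sqrt(v):
    """Square root of a variance, kept local so the block needs only `random`."""
    return v ** 0.5

# Hypothetical mixture parameters, for illustration only.
rng = random.Random(42)
samples = sample_gmm(5000, [0.3, 0.7], [-2.0, 3.0], [1.0, 0.25], rng)
```

Keeping the component index alongside each value is optional; it is returned here so the sampled labels can be compared with the mixture weights.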
## Weaknesses
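- **Assumes Gaussian components.** If clusters are strongly non-Gaussian (skewed, heavy-tailed, curved), a single cluster may be split across several components, so components no longer map one-to-one to clusters.
- **Sensitive to initialisation.** EM only finds a local optimum; run it from several random seeds and keep the fit with the highest log-likelihood.
- **Singular covariances.** A component can collapse onto a single point, driving its covariance towards zero and the likelihood towards infinity; add a small regularisation term to the covariance diagonal.
- **More parameters than K-Means.** Needs more data to fit reliably, especially with full covariances in high dimensions.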
## When to Use
- Soft clustering.
- Density estimation.
- Anomaly detection.
- Generative model for tabular data.
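For the anomaly-detection use case, one common pattern is to score each point by its log-density under the fitted mixture and flag points that fall below a cut-off. The sketch below is one-dimensional; the function names, the mixture parameters and the threshold of `-8.0` are all hypothetical, and in practice the cut-off would be chosen from the data (for example, a low quantile of training scores).

```python
import math

def log_mixture_pdf(x, weights, means, variances):
    """log p(x) for a 1-D mixture, using log-sum-exp for numerical stability."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def flag_anomalies(points, threshold, weights, means, variances):
    """Return the points whose log-density is below the threshold."""
    return [x for x in points if log_mixture_pdf(x, weights, means, variances) < threshold]

# Hypothetical fitted parameters and cut-off, for illustration only.
weights, means, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.25]
threshold = -8.0
flagged = flag_anomalies([-2.1, 3.2, 10.0], threshold, weights, means, variances)
```

Points near either component mean score well above the cut-off, while `10.0` lies far from both components and is the only one flagged.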
## Related
- [[K-Means Clustering]]
- [[Expectation-Maximization]]
- [[Unsupervised Learning]]