## Definition

A **Gaussian Mixture Model (GMM)** models data as a weighted sum of $k$ Gaussian distributions:

$$
p(x) = \sum_{c=1}^k \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c)
$$

- $\pi_c$: mixture weights (non-negative, sum to 1).
- $\mu_c, \Sigma_c$: mean and covariance of component $c$.

Each component represents a cluster; each point has a *soft* membership across components.

## Soft vs Hard Clustering

GMM gives the probability that a point belongs to each component:

$$
P(c \mid x) = \frac{\pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(x \mid \mu_{c'}, \Sigma_{c'})}
$$

Contrast with [[K-Means Clustering]], which makes a hard assignment to the nearest centroid.

## Training: Expectation-Maximization

Fitting a GMM uses [[Expectation-Maximization]]:

**E-step:** compute responsibilities $\gamma_{ic} = P(c \mid x_i)$ for each point and component, using the current parameters.

**M-step:** update each component's parameters using responsibility-weighted statistics:

$$
\mu_c = \frac{\sum_i \gamma_{ic} x_i}{\sum_i \gamma_{ic}}, \quad \Sigma_c = \frac{\sum_i \gamma_{ic} (x_i - \mu_c)(x_i - \mu_c)^\top}{\sum_i \gamma_{ic}}, \quad \pi_c = \frac{\sum_i \gamma_{ic}}{n}
$$

Here $n$ is the number of data points, and $\Sigma_c$ uses the updated $\mu_c$. Iterate the two steps until the log-likelihood converges.

## Covariance Structures

- **Spherical** ($\Sigma_c = \sigma_c^2 I$): each component is a sphere. With equal weights and a shared variance this behaves like a soft K-Means, and K-Means is recovered in the limit $\sigma^2 \to 0$.
- **Diagonal** ($\Sigma_c$ diagonal): axis-aligned ellipsoids.
- **Tied** (same $\Sigma$ for all components): shared shape.
- **Full** (different $\Sigma_c$ per component): most flexible, most parameters.

More flexibility → more parameters → more data needed.

## Choosing $k$

- **BIC / AIC**: penalise the number of parameters; fit several values of $k$ and pick the one with the lowest score.
- **Cross-validation** on held-out log-likelihood.
- **Variational Bayesian GMM**: given an upper bound on $k$, the prior can shrink the weights of unneeded components toward zero, so the effective $k$ is inferred rather than fixed in advance.

## Strengths

- **Soft assignments**: useful for downstream probabilistic reasoning.
- **Captures elliptical clusters** of different sizes and orientations.
- **Density model**: can evaluate $p(x)$ at new points; points with low density can be flagged as anomalies.
- **Generates data**: sample from $p(x)$ to produce synthetic data.

## Weaknesses

- **Assumes Gaussian components.** A poor fit when clusters are skewed, heavy-tailed, or otherwise non-Gaussian.
- **Sensitive to initialisation.** EM only finds a local optimum; run it from multiple seeds and keep the fit with the highest log-likelihood.
- **Singular covariances.** A component can collapse onto a single point, making its covariance singular and the likelihood unbounded; use covariance regularisation (for example, adding a small constant to the diagonal).
- **More parameters than K-Means.** Needs more data.

## When to Use

- Soft clustering.
- Density estimation.
- Anomaly detection.
- Generative model for tabular data.

## Related

- [[K-Means Clustering]]
- [[Expectation-Maximization]]
- [[Unsupervised Learning]]
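
## Example: EM in One Dimension

The E-step and M-step updates from the training section can be sketched in one dimension, where each covariance $\Sigma_c$ reduces to a scalar variance $\sigma_c^2$. This is a minimal illustration using only the Python standard library, not a reference implementation. The names `normal_pdf`, `weighted_densities`, `responsibilities` and `fit_gmm_1d` are hypothetical, and the quantile-based initialisation and the `var_floor` regulariser are assumptions of this sketch rather than part of the method described above. It also works with raw densities instead of log-densities, so it is only suitable for small, well-scaled data.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def weighted_densities(x, weights, means, variances):
    """pi_c * N(x | mu_c, sigma_c^2) for every component c."""
    return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]


def responsibilities(x, weights, means, variances):
    """Soft membership P(c | x) of a single point across all components."""
    joint = weighted_densities(x, weights, means, variances)
    total = sum(joint)  # this sum is the mixture density p(x)
    return [j / total for j in joint]


def fit_gmm_1d(xs, k, n_iter=200, tol=1e-8, var_floor=1e-6):
    """Fit a 1-D GMM with EM; returns (weights, means, variances, trace)."""
    n = len(xs)
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n

    # Initialisation (assumption of this sketch): means at evenly spaced
    # quantiles, the overall variance for every component, uniform weights.
    order = sorted(xs)
    means = [order[(2 * c + 1) * n // (2 * k)] for c in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    trace = []  # log-likelihood at the start of each iteration

    for _ in range(n_iter):
        # E-step: gamma[i][c] = P(c | x_i) under the current parameters.
        gamma = []
        log_lik = 0.0
        for x in xs:
            joint = weighted_densities(x, weights, means, variances)
            total = sum(joint)
            log_lik += math.log(total)
            gamma.append([j / total for j in joint])
        trace.append(log_lik)

        # M-step: responsibility-weighted mean, variance and weight.
        for c in range(k):
            n_c = sum(g[c] for g in gamma)
            means[c] = sum(g[c] * x for g, x in zip(gamma, xs)) / n_c
            variances[c] = sum(g[c] * (x - means[c]) ** 2 for g, x in zip(gamma, xs)) / n_c
            variances[c] += var_floor  # small regulariser against collapse
            weights[c] = n_c / n

        # Stop once the log-likelihood has stopped changing.
        if len(trace) > 1 and abs(trace[-1] - trace[-2]) < tol:
            break

    return weights, means, variances, trace


# Synthetic data: two clusters with different spreads (illustrative values).
rng = random.Random(42)
xs = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

weights, means, variances, trace = fit_gmm_1d(xs, k=2)
print("weights:  ", weights)
print("means:    ", means)
print("variances:", variances)
print("P(c | x=0.5):", responsibilities(0.5, weights, means, variances))
```

The `trace` list makes the convergence criterion visible: each entry should be no smaller than the previous one, up to floating-point noise. The same `responsibilities` helper gives the soft assignment for a new point, and the sum inside it is the density $p(x)$ that an anomaly detector would threshold.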