K-Means Clustering - Albert Masoliver's learning site

## Definition **K-Means** partitions data into $k$ disjoint clusters by iteratively assigning each point to its nearest centroid and updating centroids as cluster means. The canonical clustering algorithm. ## Algorithm (Lloyd's Algorithm) ``` initialise k centroids (typically by random selection from data) repeat until convergence: Assignment step: each point → nearest centroid Update step: each centroid → mean of its assigned points ``` Converges in finite iterations to a local optimum of the sum-of-squared-distances objective: $ J = \sum_{c=1}^k \sum_{x_i \in C_c} \|x_i - \mu_c\|^2 $ ## Choosing $k$ - **Elbow method.** Plot $J$ vs $k$; pick the "elbow" where diminishing returns set in. - **Silhouette score.** Measures how well points fit their cluster vs neighbouring clusters. - **Gap statistic.** Compares $J$ to a null reference distribution. - **Domain knowledge.** Sometimes $k$ is given. ## Initialisation Matters Random initialisation can lead to bad local optima. **K-Means++** (Arthur & Vassilvitskii, 2007) seeds smartly: pick first centroid randomly; pick subsequent centroids with probability proportional to squared distance from existing centroids. Now the default in scikit-learn. Run from multiple seeds and keep the best — `n_init` in scikit-learn (default 10). ## Strengths - **Fast** — $O(n \cdot k \cdot d \cdot \text{iterations})$. - **Simple and interpretable.** - **Scales** to large data with mini-batch variants. ## Weaknesses - **$k$ must be specified.** Often the hardest decision. - **Assumes spherical, equal-size clusters.** Elongated or unequal-density clusters break it. - **Sensitive to scale.** Always standardise features. - **Hard assignments.** No notion of "this point is between two clusters" — use [[Gaussian Mixture Model]] for soft assignments. - **Local optima.** Multiple runs help. ## Variants - **Mini-batch K-Means.** Use small batches for speed; for very large data. - **K-Medoids (PAM).** Use actual data points as cluster centres; more robust to outliers. - **Bisecting K-Means.** Hierarchical via repeated 2-means. - **Kernel K-Means.** K-Means in a kernel feature space. ## When K-Means Fails For non-convex clusters (concentric rings, half-moons), K-Means produces nonsense. Use [[DBSCAN]] or [[Hierarchical Clustering]] instead. ## Related - [[Hierarchical Clustering]] - [[DBSCAN]] - [[Gaussian Mixture Model]] - [[Expectation-Maximization]] - [[Unsupervised Learning]]