DBSCAN - Albert Masoliver's learning site

## Definition **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise, Ester et al. 1996) clusters points by **density**: regions of high density become clusters; sparse regions become noise. Handles arbitrary cluster shapes that [[K-Means Clustering]] cannot. ## The Two Hyperparameters - **$\epsilon$** — neighbourhood radius. - **MinPts** — minimum points to form a dense region. Together they define a *density threshold*. ## Point Classifications - **Core point.** Has at least MinPts points within $\epsilon$ (including itself). - **Border point.** Has fewer than MinPts within $\epsilon$ but is within $\epsilon$ of a core point. - **Noise point.** Neither core nor border. ## Algorithm ``` mark all points as unvisited for each unvisited point P: mark P as visited neighbours ← points within ε of P if |neighbours| < MinPts: mark P as noise (may be reclassified later) else: start a new cluster C expand C with neighbours, recursively pulling in their ε-neighbourhoods if a neighbour is a core point, add ITS neighbours too ``` ## Strengths - **No need to specify $k$.** Number of clusters emerges. - **Arbitrary cluster shapes.** Crescents, rings, elongated blobs — all fine. - **Explicit noise category.** Points outside any cluster are labelled as such. - **Robust to outliers** by design. ## Weaknesses - **Sensitive to $\epsilon$ and MinPts.** Bad choices give nonsense. - **Varying density.** Single $\epsilon$ can't accommodate clusters of very different densities. Use HDBSCAN for this. - **High dimensions.** Distance concentrates ([[Curse of Dimensionality]]); DBSCAN degrades. - **$O(n^2)$ naïvely**, $O(n \log n)$ with spatial indexing for low dimensions. ## Choosing $\epsilon$ - Plot **k-distance graph** (distance to $k$-th nearest neighbour, sorted). Look for an elbow. - **MinPts** typically $\geq 2 \cdot d$ for $d$-dimensional data; often 4-5 for 2D. ## HDBSCAN **Hierarchical DBSCAN** (Campello et al., 2013) eliminates $\epsilon$ entirely; produces a hierarchy of clusters at all density levels and extracts the most stable ones. Often the best-of-both choice in 2026 practice. ## When to Use - Spatial data, images. - Density-based anomaly detection. - Arbitrary cluster shapes. - Datasets with clear noise / outliers. ## When Not to Use - High-dimensional data without dimension reduction first. - Datasets with hugely varying densities (use HDBSCAN). - Very large datasets without spatial indexing. ## Related - [[K-Means Clustering]] - [[Hierarchical Clustering]] - [[Unsupervised Learning]]