## Definition
**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise, Ester et al. 1996) clusters points by **density**: regions of high density become clusters; sparse regions become noise. Handles arbitrary cluster shapes that [[K-Means Clustering]] cannot.
## The Two Hyperparameters
- **$\epsilon$** — neighbourhood radius.
- **MinPts** — minimum points to form a dense region.
Together they define a *density threshold*.
## Point Classifications
- **Core point.** Has at least MinPts points within $\epsilon$ (including itself).
- **Border point.** Has fewer than MinPts within $\epsilon$ but is within $\epsilon$ of a core point.
- **Noise point.** Neither core nor border.
## Algorithm
```
mark all points as unvisited
for each unvisited point P:
mark P as visited
neighbours ← points within ε of P
if |neighbours| < MinPts:
mark P as noise (may be reclassified later)
else:
start a new cluster C
expand C with neighbours, recursively pulling in their ε-neighbourhoods
if a neighbour is a core point, add ITS neighbours too
```
## Strengths
- **No need to specify $k$.** Number of clusters emerges.
- **Arbitrary cluster shapes.** Crescents, rings, elongated blobs — all fine.
- **Explicit noise category.** Points outside any cluster are labelled as such.
- **Robust to outliers** by design.
## Weaknesses
- **Sensitive to $\epsilon$ and MinPts.** Bad choices give nonsense.
- **Varying density.** Single $\epsilon$ can't accommodate clusters of very different densities. Use HDBSCAN for this.
- **High dimensions.** Distance concentrates ([[Curse of Dimensionality]]); DBSCAN degrades.
- **$O(n^2)$ naïvely**, $O(n \log n)$ with spatial indexing for low dimensions.
## Choosing $\epsilon$
- Plot **k-distance graph** (distance to $k$-th nearest neighbour, sorted). Look for an elbow.
- **MinPts** typically $\geq 2 \cdot d$ for $d$-dimensional data; often 4-5 for 2D.
## HDBSCAN
**Hierarchical DBSCAN** (Campello et al., 2013) eliminates $\epsilon$ entirely; produces a hierarchy of clusters at all density levels and extracts the most stable ones. Often the best-of-both choice in 2026 practice.
## When to Use
- Spatial data, images.
- Density-based anomaly detection.
- Arbitrary cluster shapes.
- Datasets with clear noise / outliers.
## When Not to Use
- High-dimensional data without dimension reduction first.
- Datasets with hugely varying densities (use HDBSCAN).
- Very large datasets without spatial indexing.
## Related
- [[K-Means Clustering]]
- [[Hierarchical Clustering]]
- [[Unsupervised Learning]]