Unsupervised Learning - Albert Masoliver's learning site

## Definition

**Unsupervised learning** discovers structure in unlabelled data $\{x_i\}_{i=1}^n$. There is no target $y$ to predict: the model finds patterns, groupings, or representations from the data alone.

## Main Sub-Tasks

### Clustering

Group similar examples. Algorithms: [[K-Means Clustering]], [[Hierarchical Clustering]], [[DBSCAN]], [[Gaussian Mixture Model]].

### Dimensionality Reduction

Project high-dimensional data into a low-dimensional space while preserving structure. See [[Principal Component Analysis]], [[t-SNE]], [[UMAP]].

### Density Estimation

Learn the probability distribution $p(x)$ that generated the data. Used for anomaly detection, generative modelling, and sampling.

### Association Rule Learning

Find frequent patterns and rules among items. See [[Apriori Algorithm]]. Classic example: market-basket analysis.

### Anomaly Detection

Identify examples that don't fit the learnt structure: outliers, fraud, defects.

## Evaluation Challenge

Without labels, there's no straightforward analogue of accuracy or error. Evaluation depends on the task:

- **Clustering:** silhouette score, Davies-Bouldin index, mutual information against external labels (if available).
- **Dimensionality reduction:** reconstruction error, downstream task performance.
- **Density estimation:** held-out log-likelihood.

In practice, unsupervised learning is often evaluated by its *utility for a downstream task*: does the clustering improve a customer-segmentation business metric? Does the lower-dimensional embedding speed up a search system?

## When Unsupervised Wins

- **Exploratory data analysis.** What groups exist in the data?
- **Data preprocessing.** Reduce dimensionality before a supervised model.
- **Anomaly detection** when anomalies are too rare to label.
- **Pre-training** for downstream supervised tasks (the bridge to self-supervised learning).

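
To make the exploratory question ("what groups exist in the data?") and the silhouette score from the Evaluation Challenge section concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy 2-D points, the helper names `kmeans` and `silhouette_score`, and the farthest-first initialisation (chosen only so the run is deterministic). It is not a reference implementation; in practice a library such as scikit-learn would normally be used.

```python
import math


def kmeans(points, k, iters=10):
    """Lloyd's algorithm with farthest-first initialisation (deterministic)."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight groups, no labels given to the algorithm.
points = [
    (0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (0.7, 0.8), (0.3, 0.4),
    (5.0, 5.0), (5.4, 5.1), (5.2, 5.7), (4.8, 5.3), (5.6, 5.5),
]

labels = kmeans(points, k=2)
good = silhouette_score(points, labels)
# An arbitrary alternating labelling, for contrast.
bad = silhouette_score(points, [i % 2 for i in range(len(points))])

print(f"k-means labelling:     {good:.3f}")
print(f"alternating labelling: {bad:.3f}")
```

The silhouette uses only the points and the proposed grouping, so it can rank candidate clusterings without any ground-truth labels. A score near 1 indicates compact, well-separated clusters, and a score near 0 or below indicates overlapping or misassigned points.
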
## The Self-Supervised Bridge

Modern foundation models, including LLMs, are trained with *self-supervised* objectives derived from unlabelled data: next-token prediction, masked language modelling, contrastive learning. The distinction between "unsupervised" and "self-supervised" is technical: self-supervised methods construct a prediction target from the data itself, whereas classical unsupervised methods have no target at all. The practical effect is the same: usable models without manual labels. See [[Self-Supervised Learning]].

## Related

- [[K-Means Clustering]]
- [[Principal Component Analysis]]
- [[Gaussian Mixture Model]]
- [[Self-Supervised Learning]]
- [[Supervised Learning]]