## Introduction

The k-Nearest Neighbors (kNN) algorithm is a non-parametric, instance-based learning method used for classification and regression. It operates by finding the **k** closest data points (neighbors) to a given input and making predictions based on these neighbors.

## kNN for Classification

In classification tasks, kNN assigns a class to an input based on the majority vote of its **k** nearest neighbors. The choice of distance metric (e.g., Euclidean, Manhattan) influences which neighbors are selected and, consequently, the model's accuracy.

## kNN for Regression

While classification produces a categorical output, kNN can also be applied to regression problems, where the target variable is continuous. The prediction is typically computed as the mean of the output values of the **k** nearest neighbors:

$$y = \frac{1}{k} \sum_{i \in kNN(x)} y_i$$

where:

- $kNN(x)$ represents the **k** closest neighbors to input $x$ based on a chosen distance metric.
- $y_i$ is the output value of each neighbor.

### Weighted kNN Regression

A variation of kNN regression assigns a weight to each neighbor based on its distance to the input point. Closer neighbors receive higher weights, giving them more influence on the prediction. This is achieved using an inverse distance weighting function:

$$y = \frac{\sum_{i \in kNN(x)} w_i y_i}{\sum_{i \in kNN(x)} w_i}$$

where $w_i$ is the weight assigned to each neighbor, typically computed as:

$$w_i = \frac{1}{d(x, x_i) + \epsilon}$$

where $d(x, x_i)$ is the distance between the input $x$ and neighbor $x_i$, and $\epsilon$ is a small constant that avoids division by zero when the query coincides with a training point.

## Considerations in kNN Regression

- **Choice of k:** A smaller k produces a model that is sensitive to noise, while a larger k smooths predictions but may introduce bias.
- **Distance metric:** Common choices include Euclidean, Manhattan, and Minkowski distance.
- **Computational efficiency:** Because kNN stores the entire training set and computes distances to every stored point at prediction time, it can be expensive for large datasets.
- **Density of data points:** The performance of kNN depends heavily on the density of training data in the feature space; predictions in sparse regions rely on distant, less representative neighbors.

## Conclusion

The kNN algorithm provides a simple yet effective approach to both classification and regression. Its non-parametric nature makes it adaptable to a wide range of datasets, but careful tuning of **k** and the distance metric is required to achieve good performance.
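As a concluding illustration, the sketch below implements both prediction rules described above (the unweighted mean and inverse-distance weighting) in plain NumPy. The function `knn_regress` and its parameters are illustrative names chosen here, not part of any library API; in practice, scikit-learn's `KNeighborsRegressor` exposes the same choice through its `weights` parameter (`'uniform'` vs. `'distance'`).

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, weighted=False, eps=1e-8):
    """Predict a continuous target for x_query from its k nearest neighbors.

    A minimal sketch: brute-force Euclidean distances, no tie handling
    or input validation. Set weighted=True for inverse-distance weighting.
    """
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)

    # Indices of the k closest training points
    nn_idx = np.argsort(dists)[:k]

    if not weighted:
        # Plain kNN regression: unweighted mean of the neighbors' targets
        return y_train[nn_idx].mean()

    # Weighted kNN regression: inverse-distance weights w_i = 1 / (d_i + eps)
    w = 1.0 / (dists[nn_idx] + eps)
    return np.sum(w * y_train[nn_idx]) / np.sum(w)


if __name__ == "__main__":
    # Toy 1-D dataset: noisy samples of a sine curve (illustrative only)
    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 10, size=(200, 1))
    y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, size=200)

    x_query = np.array([3.0])
    print(knn_regress(X_train, y_train, x_query, k=5))
    print(knn_regress(X_train, y_train, x_query, k=5, weighted=True))
```

The brute-force distance computation in this sketch scales linearly with the size of the training set for every query, which is the computational cost noted in the considerations above; space-partitioning structures such as KD-trees or ball trees are the usual remedy for larger datasets.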