What is the k-nearest neighbors algorithm?
The k-nearest neighbors algorithm is a type of a lazy learning or an instance-based learning algorithm. In this algorithm, all of the learning and inference is done only at the time of the prediction and no computation is done before that.
The algorithm can be used for both regression and classification tasks.
How does the k-nearest neighbors algorithm work?
If you pick a point for which you want the prediction, we can find the Euclidean distance of that point from all the other points in the training dataset.
The Euclidean distance is the distance between two different points in space.
The next step is to pick the k-nearest points sorted by distance (hence, the name). This value of k has to be decided by the user before running the algorithm.
For a classification task, we take the majority class of these k-points and assign it to the test point. For a regression task, we take the average of the output values of the k-points and set that to be our prediction.
What are some issues with the k-nearest neighbors algorithm?
Consider a classification task in which the members of one class are far greater in number than that of the other class.
In this scenario, k-nearest neighbors will not perform well.
Why? Because the members of the majority class will have a much higher probability of being closer to the test point. Due to this, when you do a simple majority voting, the majority class will always be assigned as the output.
One way to overcome this issue is by taking a weighted contribution instead of simple majority voting.
Weighted contribution? 🤨 Let us elaborate.
The contribution of the k-nearest points is weighed by the inverse of the distance between the test point and the point belonging to the set of k-nearest points. So, the closer the point, the greater its effect. (So far, so good, yeah?)
Since k-nearest neighbors considers only the k-points which are closest to the test point, it is sensitive to the local structure of the training data. This might lead to the k-points being adversely affected by the presence of outliers. If the data is noisy, the k-nearest neighbors algorithm might not be a good choice for the task.
No computation takes place in the training phase of the algorithm as it’s a form of instance-based learning. Therefore, the inference time is considerably large. For each test point, distance from all training points is calculated and then sorted to obtain the nearest k-points. This is computationally expensive and time consuming, especially if the training dataset is huge.
When should the k-nearest neighbors algorithm be used?
The k-nearest neighbors algorithm is relatively simple when compared to some of the other machine learning algorithms.
If the training data is spread evenly, does not have extremely high dimensions and the inference time is not a factor, the k-nearest neighbors algorithm is a great choice to get quick results and establish a baseline for other complex algorithms to beat.
Think we're missing something? 🧐 Help us update this article by sending us your suggestions here. 🙏