Introduction
The concepts and techniques used in machine learning can be complex and overwhelming – but then we have a refreshingly simple technique known as K-Nearest Neighbor (KNN), which stands apart from other techniques due to its intuitive simplicity.
Although this article focuses on how K-Nearest Neighbor works for classification problems, the K-NN technique can also be used for regression.
We will assume that you know the basics of classification problems and how to measure the distance between two data points.
How K-NN Classification works
Let us take the example below, where we have a distribution of two classes of data – green and brown. Now we have a new, unseen data point, initially marked black, because we are not yet sure which class it belongs to.
We follow these steps for K-NN classification –
- We find the K neighbors that are nearest to the black point. In this example we choose K=5 neighbors around the black point.
- To find the nearest neighbors we calculate the distance between the black point and every other point. We then pick the 5 neighbors with the smallest distances to the black point.
- We find that out of the 5 nearest neighbors of the black point, 2 are brown and 3 are green. Since the majority of points around the black point are green, we assign the green label to it.
That is it!! This is K-NN classification – as simple as it gets!!
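To make these steps concrete, here is a minimal from-scratch sketch in Python. The toy points and labels are made up purely for illustration, and in practice you would more likely use a ready-made implementation such as scikit-learn's KNeighborsClassifier.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest labelled points."""
    # Euclidean distance from the new (black) point to every labelled point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical toy data: a green cluster and a brown cluster
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                    [6.0, 6.5], [6.2, 5.9], [5.8, 6.1]])
y_train = np.array(["green", "green", "green", "brown", "brown", "brown"])

# The "black" point gets the label of the majority among its 5 nearest neighbors
print(knn_predict(X_train, y_train, np.array([2.1, 2.0]), k=5))  # expected: green
```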
How to choose the value of K
Now I know you must be wondering how to choose the value of K and why we chose K=5 here. First of all, let me tell you that our choice of K=5 was arbitrary.
Secondly, different choices of K on the same data might produce different results. Let us have a look at the visualization below.
As we can see, on the same data set, two different values of K give us two different class labels. So the question still remains – how do we choose the value of K?
Well, the short answer is that there is no rule or formula to derive the value of K. One value of K may work wonders on one data set but fail on another.
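To make this concrete, here is a tiny hypothetical example using scikit-learn's KNeighborsClassifier; the points and labels below are invented for illustration, with one brown outlier sitting inside the green cluster.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: a green cluster containing one brown outlier
X = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.8, 1.9], [2.3, 2.1], [6.0, 6.5]]
y = ["green", "green", "green", "green", "brown", "brown"]
new_point = [[2.2, 2.0]]

for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k} predicts:", model.predict(new_point)[0])

# K=1 latches onto the nearby brown outlier, while K=5 votes with the
# surrounding green points. Same data, different K, different label.
```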
But we do have some guidelines and considerations to help estimate a good value of K –
- To begin with, you may choose K = the square root of the number of observations in the data set. It is also advisable to pick an odd value of K to avoid ties between the most frequent neighbor classes.
- Based on this value of K, you can run the K-NN algorithm on the test set and evaluate the predictions using one of the many metrics available in machine learning.
- You may then try increasing and decreasing the value of K until you cannot improve the prediction accuracy any further (a small sketch of this tuning loop follows below).
- Do remember that if you choose a very small value of K, any outliers present in the neighborhood of the point in question will incorrectly influence the classification result.
- On the other hand, a very large value of K defeats the whole purpose of the K-NN algorithm, as you may end up looking at data far outside the neighborhood of the point in question. This again leads to poor classifications.
- Another point to note is that a small value of K is computationally efficient, while a large value of K can become computationally expensive.
You will have realized by now that there is no magical formula for K. It is more about tuning the parameter K, and this depends on your data.
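Putting the guidelines above together, here is a hedged sketch of such a tuning loop. It assumes the scikit-learn API; the Iris data set, the 70/30 split, and the range of odd K values tried are all illustrative choices, not a prescription.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Illustrative data set; replace with your own features X and labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Starting guess: K = square root of the number of observations, nudged to an odd value
k_start = int(np.sqrt(len(X_train)))
if k_start % 2 == 0:
    k_start += 1

# Try odd values of K around the starting guess and keep the best test accuracy
best_k, best_acc = k_start, 0.0
for k in range(1, 2 * k_start + 1, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"Best K = {best_k} with accuracy {best_acc:.3f}")
```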
A few points to consider about K-NN classification
- The K-NN algorithm makes no prior assumption about how your data is distributed. This property of K-NN is known as being non-parametric.
- Generally, in supervised classification there is a training phase where a model learns from the training data and builds a classifier function from it. In K-NN, however, essentially nothing happens during the training phase. Only when some test or unseen data is fed to the K-NN model does it become active and start exploring the neighborhood. That is why it is also known as a Lazy Learner.
- The distance between data points can be calculated with any of the available measures, such as Euclidean distance, Manhattan distance, Cosine distance, etc. As a beginner or everyday practitioner you are most likely to use Euclidean distance (see the short comparison after this list).
- K-NN is a computationally expensive algorithm by nature and requires a lot of memory. This becomes more of an issue when the number of dimensions in the data is very high. High dimensionality also leads to the problem known as the Curse of Dimensionality, so K-NN is best suited for data with fewer dimensions.
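For the distance measures mentioned above, here is a short comparison using SciPy; the two points are made up, and Euclidean, Manhattan (city block), and cosine distances are the standard functions in scipy.spatial.distance.

```python
from scipy.spatial import distance

# Two hypothetical data points
a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.5]

print("Euclidean:", distance.euclidean(a, b))  # straight-line distance
print("Manhattan:", distance.cityblock(a, b))  # sum of absolute coordinate differences
print("Cosine:   ", distance.cosine(a, b))     # 1 minus the cosine similarity of the vectors
```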
In the End…
K-NN is a very simple yet powerful classification technique in machine learning if used in the right place. I hope this article was an equally simple read and gave you useful insights into K-NN classification.
Do share your feedback about this post in the comments section below. If you found this post informative, please share it and subscribe to us by clicking on the bell icon for quick notifications of new upcoming posts. And yes, don’t forget to join our new community MLK Hub and Make AI Simple together.
I leave you with the following quote –
“A good neighbor is a priceless treasure”