A very common confusion for beginners in the early days of their machine learning or data science journey is the difference between regression and classification, and what their use cases are. In this post we will go through an in-depth comparison of regression vs classification, which should give you a good understanding of the two concepts.
So let us start.
Regression vs Classification
Before discussing the differences between the two, I believe it is a good starting point to first understand the similarities between regression and classification.
- Both regression and classification belong to the category of machine learning known as supervised learning.
- Being from the supervised learning family, both regression and classification algorithms have something known as a training phase.
- During the training phase, regression and classification algorithms need both input data and output data to learn the relationship between the two.
- In the training phase, almost all regression and classification algorithms create a function f(X) which maps the input values X = x1, x2, x3, ..., xN to the output value y.
We saw how these two are similar at a very high level, but at the core regression and classification are two different techniques meant for different use cases.
- In regression, we try to predict a continuous number y based on input data X = x1, x2, x3, ..., xN.
- In classification, we try to predict the class or label based on input data X = x1, x2, x3, ..., xN.
Okay, let me put this in layman's terms:
- In regression we predict a quantity or a number.
- In classification we predict a category.
Let us understand this more clearly with the bank loan approval example in the illustration below.
- Observe that in the regression problem the output data is Loan Amount, whereas in the classification problem it is Loan Status.
- Loan Amount is the amount approved, which is a number. This can be any amount (i.e. number) within business constraints.
- Loan Status, on the other hand, is a class or label with only two possible categories: 'Rejected' and 'Approved'.
- Notice one more thing: we can convert a regression problem into a classification problem by applying some common sense. In our example, a case with Loan Amount = 0 can be considered the Rejected class, and any other case the Approved class.
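The conversion described in the last point can be sketched in a few lines of Python. This is only an illustration: the function name `loan_status` and the sample amounts are made up here, and the rule "amount 0 means rejected" is taken from the example above.

```python
def loan_status(loan_amount: float) -> str:
    """Map a loan amount (a regression-style output) to a class label."""
    # Rule from the example above: an approved amount of 0 means rejection.
    return "Rejected" if loan_amount == 0 else "Approved"


# Hypothetical loan amounts, as a regression model might output them.
predicted_amounts = [0, 25000, 0, 120000]

statuses = [loan_status(amount) for amount in predicted_amounts]
print(statuses)
```

The numeric output is kept as it is; only the final step collapses it into two categories.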
Again, just because the output column contains numbers does not mean you are dealing with a regression problem. The categories themselves may have been encoded as numbers. For example, in binary classification True/Positive is often denoted by 1 and False/Negative by 0. So when you see numbers in the output column, you have to understand the context to decide whether it is a regression or a classification problem.
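As a small sketch of this point, the snippet below encodes a label column as 0 and 1. The column values and the variable names (`status_column`, `encoding`) are invented for illustration.

```python
# Hypothetical output column for a binary classification problem.
status_column = ["Approved", "Rejected", "Approved", "Approved"]

# Encode each category as a number: Rejected -> 0, Approved -> 1.
encoding = {"Rejected": 0, "Approved": 1}
encoded = [encoding[status] for status in status_column]
print(encoded)

# The column now holds numbers, but only two distinct values occur and
# each one stands for a category, so this is still a classification target.
distinct_values = sorted(set(encoded))
print(distinct_values)
```

A quick look at the distinct values in the output column, together with what they mean in the business context, is usually enough to tell the two cases apart.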
Regression vs Classification – Graphical View
If you still have any confusion between classification and regression, this section should help further with a visual understanding.
In regression, the machine learning model comes up with a generalized function that approximately learns the trend of the data. In 2-D space, for linear regression, the generalized function is just a line and can be easily visualized in the animation below.
Notice that the regression line is a best-fit line that approximates the trend of the data. It does not pass through every data point; a function that did would also be fitting the noise in the data, which is known as overfitting.
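To make the idea of a best-fit line concrete, here is a minimal sketch of ordinary least squares for a single input variable, written in plain Python. The helper name `fit_line` and the data points are made up for illustration; in practice you would normally rely on a library implementation.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that roughly follows y = 2x, with a little noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Residuals: how far each actual point lies from the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)
print(residuals)
```

The residuals are small but not zero: the line follows the overall trend without passing through every point.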
In classification, the machine learning model comes up with a generalized function that approximately divides the data into the different classes that exist. This dividing function is known as the decision boundary.
All data on one side of the decision boundary belongs to one class, and all data on the other side belongs to the other class.
For a linear classifier the decision boundary is a hyperplane in higher dimensions, and in 2-D it can be visualized as a line. Notice in the illustration below that the decision boundary approximately divides the red and green classes into two areas.
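A linear decision boundary in 2-D can be sketched as follows. The weights and bias are hand-picked here purely for illustration (a real model would learn them during the training phase), and the function name `classify` is hypothetical; the red and green labels mirror the illustration.

```python
def classify(point, weights, bias):
    """Label a 2-D point by which side of the line w1*x1 + w2*x2 + b = 0 it is on."""
    x1, x2 = point
    score = weights[0] * x1 + weights[1] * x2 + bias
    return "green" if score > 0 else "red"


# Hypothetical decision boundary: the line x1 + x2 - 5 = 0.
weights, bias = (1.0, 1.0), -5.0

points = [(1, 1), (2, 2), (4, 4), (6, 1)]
labels = [classify(p, weights, bias) for p in points]
print(labels)
```

The sign of the score tells you which side of the line a point falls on, and therefore which class it is assigned to.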
Both classification and regression problems can be approached with various algorithms.
Which algorithm works best depends on the type of data in hand. There is no rule of thumb here.
Some algorithms work for both regression and classification with a little tweaking.
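As one example of such an algorithm, k-nearest neighbours can do both: average the neighbours' numbers for regression, or take a majority vote over their labels for classification. The sketch below is a simplified version for a single input feature; the function names and the income/loan data are invented for illustration.

```python
from collections import Counter


def nearest_targets(train, query, k):
    """Return the targets of the k training points closest to query (1-D inputs)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]


def knn_regress(train, query, k=3):
    # Regression: average the numeric targets of the neighbours.
    targets = nearest_targets(train, query, k)
    return sum(targets) / len(targets)


def knn_classify(train, query, k=3):
    # Classification: majority vote over the neighbours' labels.
    targets = nearest_targets(train, query, k)
    return Counter(targets).most_common(1)[0][0]


# Made-up data: applicant income (in thousands) paired with a loan amount
# (in thousands) for regression, and with a loan status for classification.
amount_data = [(20, 0), (30, 0), (50, 40), (60, 50), (80, 90)]
status_data = [(20, "Rejected"), (30, "Rejected"), (50, "Approved"),
               (60, "Approved"), (80, "Approved")]

predicted_amount = knn_regress(amount_data, 58)
predicted_status = knn_classify(status_data, 58)
print(predicted_amount, predicted_status)
```

The neighbour search is shared; only the last step, averaging versus voting, changes between the two tasks.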
Below is the list of commonly used algorithms for classification and regression.
If you look closely, you will see an algorithm called Logistic Regression listed under Classification. This is not a mistake. The name is misleading at first sight: the model does estimate a continuous quantity, the probability of a class, but that probability is then used to predict a class label rather than a number. There is a good discussion here on the possible reason behind this name.
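The prediction step of logistic regression can be sketched as below, for a single input feature. The weight and bias are hand-picked for illustration (a real model learns them during training), the function names are hypothetical, and the 0.5 threshold is a common default rather than a fixed rule.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    # The continuous part: a probability for the positive class.
    return sigmoid(weight * x + bias)


def predict_label(x, weight, bias, threshold=0.5):
    # The classification part: turn the probability into a 0/1 label.
    return 1 if predict_proba(x, weight, bias) >= threshold else 0


# Hypothetical, hand-picked parameters.
weight, bias = 1.5, -3.0

inputs = (0, 2, 4)
probabilities = [predict_proba(x, weight, bias) for x in inputs]
labels = [predict_label(x, weight, bias) for x in inputs]
print(probabilities)
print(labels)
```

The model produces a continuous probability internally, but what it finally reports is a class label, which is why it is used for classification.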
Let us quickly summarize the comparison of regression vs classification before wrapping up this post.
- Both regression and classification belong to the family of machine learning known as supervised learning.
- Regression predicts a continuous number, whereas classification predicts a class label.
- Regression algorithms produce a generalized function that captures the trend of the data, whereas classification algorithms produce a decision boundary that separates the data into the existing classes.
In The End…
Though it was a short post, I hope it gave you a clear picture of how regression and classification compare. If you are reading this post, you are most likely a beginner wondering about the difference between classification and regression. Wishing you good luck on your machine learning journey, and I hope to see you on our site more often.