The steps of Training, Testing and Validation in Machine Learning are essential for building a robust supervised learning model. Training alone cannot guarantee that a model will work on unseen data. We need to complement training with testing and validation to come up with a model that performs well on new, unseen data.
For beginners, however, these concepts can be confusing – especially the distinction between testing and validation. So in this post we will walk through these concepts step by step and see why each is required, with the help of a small case study.
The Need for Testing
Suppose your manager gives you a data set and asks you to build a supervised model for regression or classification. You build the model by training it on the given data set and straight away deploy it to production. A few days later your manager comes back angry and frustrated because the model failed to show any acceptable accuracy.
So what happened here?
Well, we missed a very important step – we did not test the model before releasing it to stakeholders for production use. Just like any other software, a machine learning model needs testing to verify how well it performs before we send it for production deployment.
There are many metrics available to measure the performance of classification and regression models. But the question is: what data should we use to test the model's performance? The only data we have during development is the initial data set, so somehow we will have to use this data for both training and testing the model.
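As an illustration, one of the simplest such metrics for classification is accuracy: the fraction of predictions that match the true labels. A minimal sketch in plain Python (libraries such as scikit-learn provide this ready-made as `accuracy_score`):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Three of the four predictions below are correct
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```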
So let us look at the various approaches we can take to use the given data set for both training a model and testing it.
Training and Testing on same data
(Spoiler alert – don't do it! Yes, don't do this unless you want to frustrate your manager even more.)
One possible approach is to train your model on the entire given data set. Then, to test the model, you randomly select a few samples from the training set itself. You test your model on these samples and fine-tune your model's hyperparameters accordingly.
In this process you arrive at a very impressive accuracy for your model. Happy that your testing results show good accuracy, you showcase the results to your manager, assuring him that you have done testing this time, and deploy to production. But again, in a few days the manager comes back even more angry and frustrated, because the model is still not delivering the accuracy in production that the test results promised.
So where did we go wrong this time?
Well, any good supervised learning model should work well with unseen data. In this case we tested our model on data from the training set itself and further fine-tuned the model according to the training data only (which the model had already seen during the training phase). This caused the model to adapt itself to the training data alone, resulting in overfitting. So when new unseen data came in production, the model failed to perform.
Splitting data into Training and Test set
Is it not intuitive that we can address the above shortcoming by splitting our initial data set into two parts at the very start? We train the model on one part and test it on the other. These are commonly known as the training set and the test set respectively.
In what ratio should we split the data set? A very common split ratio is 80-20: 80% of the data is kept for training and 20% is set aside for testing. The split is done randomly.
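The random 80-20 split can be sketched in plain Python as below (libraries such as scikit-learn provide a ready-made `train_test_split`; the idea is simply a shuffled partition, and the function name here is our own):

```python
import random

def split_train_test(data, test_ratio=0.2, seed=42):
    """Randomly partition a data set into training and test portions."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test_idx = set(indices[:n_test])   # first 20% of shuffled indices
    train = [row for i, row in enumerate(data) if i not in test_idx]
    test = [row for i, row in enumerate(data) if i in test_idx]
    return train, test

data = list(range(100))                # stand-in for 100 data rows
train, test = split_train_test(data)
print(len(train), len(test))           # 80 20
```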
As we can see, this split ensures that we have unseen data for testing. With this improved testing strategy you assure your manager of a better outcome this time and deploy the model to production.
But again, after a few days your manager comes back – a bit disappointed. There is definitely some improvement, but the performance accuracy you showcased in testing is not matching production; it is still worse than expected.
So why are we still not getting it right?
It is very common to train a model and then check the results on the test set. However, if the accuracy on the test set is not good enough, we tend to tune the hyperparameters and forcefully try to match the training accuracy on the test set. If you think about it, we actually end up overfitting the model to the test set indirectly. So again, when the model encounters new unseen data in production, it fails to deliver the accuracy it had promised on the test set.
To overcome this issue we can explore the next option – let us see how.
Splitting data into Training, Validation and Test set
In this approach we initially do a train-test split as before, but from the training set we again set aside a portion – this portion is known as the validation set. Depending on the volume of available data, this portion can be 10%-20% of your training data.
You now train your model on the training set and validate its performance on the unseen validation set. If required, based on the validation results, you can change your model's hyperparameters to improve accuracy while avoiding overfitting. This can be an iterative process.
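The three-way split described above can be sketched like this (a plain-Python illustration with an invented function name; the validation portion is carved out of the training portion, as described):

```python
import random

def split_train_val_test(data, test_ratio=0.2, val_ratio=0.2, seed=0):
    """Split data into train, validation and test sets.

    test_ratio is taken from the full data set; val_ratio is then
    taken from the remaining training portion.
    """
    rng = random.Random(seed)
    shuffled = data[:]                 # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    test = shuffled[:n_test]           # held out for the final check only
    remaining = shuffled[n_test:]
    n_val = int(len(remaining) * val_ratio)
    val = remaining[:n_val]            # used for hyperparameter tuning
    train = remaining[n_val:]
    return train, val, test

train, val, test = split_train_val_test(list(range(100)))
print(len(train), len(val), len(test))  # 64 16 20
```

Note that the test set is touched only once, at the very end; all iterative tuning happens against the validation set.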
Once you feel the model is properly trained, as one final check you run the model on the test set and obtain its performance accuracy. This final accuracy is indicative of the accuracy you can expect in production. So this time you have robust testing results for your manager.
After a few weeks your manager comes back – this time he is actually happy with your work. The model's accuracy is not only good, it also matches what you promised after testing.
A new problem at hand now
Now that your manager has gained confidence, he gives you a new data set for building another supervised machine learning model. You open the data set and realize it is very small – so small that if you split it into train-validation-test, or even just train-test, you will end up losing valuable information when training the model.
This problem can be addressed with the next technique. Let us see how.
K-Fold Cross Validation
In this method we shuffle the data set and then split it into K equal parts. We first reserve the first part for testing and train the model on the remaining K-1 parts. After training, we test the model on the first part that we had reserved and note the accuracy.
In the next iteration we leave out the second part for testing and train the model on the rest of the data. We then test the model on the second part and note the accuracy.
We carry out this process until we have completed K iterations and have an accuracy score for each. For K=5, for example, each of the five parts serves as the test set exactly once.
After we get the K accuracy scores, we calculate their mean. This mean gives a much better indication of how robust our model is. If the mean accuracy score is low, we change or fine-tune the model's hyperparameters.
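The fold rotation above can be sketched as index bookkeeping in plain Python (scikit-learn's `KFold` and `cross_val_score` do this for you; `score_model` below is a hypothetical stand-in for training and scoring your own model on one fold):

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # shuffle once, up front
    fold_size = n // k
    for i in range(k):
        # The i-th slice is held out for testing; the rest trains the model
        test_idx = idx[i * fold_size:(i + 1) * fold_size]
        train_idx = idx[:i * fold_size] + idx[(i + 1) * fold_size:]
        yield train_idx, test_idx

def score_model(train_idx, test_idx):
    """Hypothetical placeholder: train on train_idx, return accuracy on test_idx."""
    return 0.9

scores = [score_model(tr, te) for tr, te in k_fold_indices(100, k=5)]
mean_accuracy = sum(scores) / len(scores)
```

Across the K iterations, every data point lands in the test slice exactly once, which is why the mean score is a fair summary of the whole data set.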
Though we can take any value of K, many experiments have shown that K=10 is a good choice. This may vary from one data set to another, however.
A good thing about K-fold cross validation is that every data point is used for testing across the K iterations. So when we have a very small data set, this approach gives the most reliable picture of the model's true performance. That said, the approach can be used for larger data sets too, but the process becomes slow. So with a very large data set you may prefer the earlier options of a train-test or train-validation-test split if time is a constraint.
Even so, K-fold cross validation is a gold standard for evaluating models in many data science competitions and research works, irrespective of data size, and you should try to use it in your own work.
In the End…
We hope you are now clearer about the various strategies for Training, Testing and Validation in Machine Learning, along with the pros and cons of each. K-fold cross validation has many flavors, which we will discuss in a separate post – here our intention was just to give you a gentle introduction.
Do share your feedback about this post in the comments section below. If you found this post informative, please share it and subscribe by clicking on the bell icon for quick notifications of upcoming posts.