Cross Validation in Sklearn | Hold Out Approach | K-Fold Cross Validation | LOOCV

When we build a machine learning model, we need to evaluate two things –
  • How accurate the model is
  • How generalized the model is
When we build a model and train it on the entire dataset, we can easily calculate its accuracy on that training set. But we cannot test how the model will behave with new data that was not present in the training set, so its generalization cannot be determined.
Hence we need techniques to make use of the same data set for both training and testing of the models.
In machine learning, cross-validation is a technique for evaluating how well a model generalizes and estimating its overall accuracy. For this purpose, it randomly samples data from the dataset to create training and testing sets. There are multiple cross-validation approaches, as follows –
  1. Hold Out Approach
  2. Leave One Out Cross-Validation
  3. K-Fold Cross-Validation
  4. Stratified K-Fold Cross-Validation
  5. Repeated Random Train Test Split

1. Hold Out Approach

In the hold-out approach, the dataset is split into a train set and a test set by random sampling. The train set is used for training the model, and the test set is used to test its accuracy on unseen data. If the training and test accuracies are almost the same, the model is said to have generalized well. It is common to use 80% of the data for training and the remaining 20% for testing.

Advantages
  • It is simple and easy to implement.
  • It executes quickly.

Disadvantages
  • If the dataset itself is small, setting aside portions for testing would reduce the robustness of the model. This is because the training sample may not be representative of the entire dataset.
  • The evaluation metrics may vary due to the randomness of the split between the train and test set.
  • Although an 80-20 train-test split is widely followed, there is no rule of thumb for the split ratio, so the results can vary based on how the train-test split is done.

2. Leave One Out Cross Validation (LOOCV)

In this technique, if there are n observations in the dataset, one observation is reserved for testing and the remaining n-1 data points are used for training. This is repeated n times so that every data point is used for testing exactly once. Finally, the overall accuracy is calculated by averaging the accuracies of the n iterations.

Advantages
  • Since every data point participates in both training and testing, the overall accuracy estimate is more reliable.
  • It is very useful when the dataset is small.

Disadvantages
  • LOOCV is not practical when the number of observations n is huge. For example, for a dataset with 500,000 records, 500,000 models would need to be trained, which is not really feasible.
  • There is a huge computational and time cost associated with the LOOCV approach.
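
As a sketch, LOOCV can be run with scikit-learn's LeaveOneOut splitter. The Iris dataset and logistic regression model below are illustrative assumptions, not from the original article:

```python
# Leave-One-Out cross-validation sketch (the Iris dataset and
# logistic regression model are assumed for illustration).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One model is trained per observation: n iterations for n rows.
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("Number of iterations:", len(scores))  # equals the number of rows
print("Mean accuracy:", scores.mean())
```

Even on this tiny dataset, 150 separate models are trained, which illustrates why LOOCV becomes infeasible at scale.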

3. K-Fold Cross-Validation

In the K-Fold Cross-Validation approach, the dataset is split into K folds. In the first iteration, the first fold is reserved for testing and the model is trained on the data of the remaining K-1 folds.

In the next iteration, the second fold is reserved for testing and the remaining folds are used for training. This is continued till the K-th iteration. The accuracy obtained in each iteration is used to derive the overall average accuracy for the model.

Advantages
  • K-Fold cross-validation is useful when the dataset is small and splitting it into a single train-test set (hold-out approach) would lose useful data for training.
  • It helps to create a robust model with low variance and low bias, as every observation is used for training across the K iterations.

Disadvantages
  • The major disadvantage of K-Fold cross-validation is that the training needs to be done K times, so it consumes more time and resources.
  • It is not recommended for sequential time-series data, where the order of observations matters.
  • When the dataset is imbalanced, K-fold cross-validation may not give good results. This is because some folds may have just a few or no records for the minority class.
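
The procedure above can be sketched with scikit-learn's KFold splitter and cross_val_score. The Iris dataset and logistic regression model are illustrative assumptions:

```python
# K-Fold cross-validation sketch with K=5 (the dataset and model
# are illustrative assumptions, not from the original article).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds serves once as the test set while the
# remaining 4 folds are used for training.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print("Per-fold accuracy:", scores)
print("Average accuracy:", scores.mean())
```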

4. Stratified K-Fold Cross-Validation

Stratified K-fold cross-validation is useful when the data is imbalanced. While sampling data into K-folds it makes sure that the distribution of all classes in each fold is maintained. For example, if in the dataset 98% of data belongs to class B and 2% to class A, the stratified sampling will make sure each fold contains the two classes in the same ratio of 98% to 2%.


Stratified K-fold cross-validation is recommended when the dataset is imbalanced.

5. Repeated Random Test-Train Split

Repeated random test-train split is a hybrid of traditional train-test splitting and the k-fold cross-validation method. In this technique, we create random splits of the data into the training-test set and then repeat this process multiple times, just like the cross-validation method.
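
This can be sketched with scikit-learn's ShuffleSplit, which generates a fresh random train-test split on each iteration. The dataset, model, and choice of 10 repetitions below are illustrative assumptions:

```python
# Repeated random train-test splits via ShuffleSplit (illustrative
# sketch; the dataset, model, and 10 repetitions are assumptions).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 independent random 70/30 splits, like repeating the hold-out
# approach several times and averaging the scores.
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)

print("Mean accuracy over 10 splits:", scores.mean())
```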

Examples of Cross-Validation in Sklearn Library

The hold-out approach can be applied by using the train_test_split function of sklearn.model_selection.

In the below example, we split the dataset to create test data with a size of 30% and train data with a size of 70%. The random_state parameter ensures the split is deterministic in every run.
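
A minimal sketch of such a split follows; the 30% test size and fixed random_state come from the description above, while the Iris dataset and logistic regression model are illustrative assumptions:

```python
# Hold-out split with train_test_split (the Iris dataset and
# logistic regression model are assumed for illustration).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 30% of the data is held out for testing; random_state makes the
# split deterministic across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```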

In Sklearn, stratified K-fold cross-validation can be applied by using the StratifiedKFold class of sklearn.model_selection.

In the below example, the dataset is divided into 5 splits or folds. This yields 5 accuracy scores, from which we calculate the final mean score.
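
A sketch of this with 5 folds follows; the Iris dataset and logistic regression model are illustrative assumptions:

```python
# Stratified 5-fold cross-validation sketch (the Iris dataset and
# logistic regression model are assumed for illustration).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the class ratio of the full dataset.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("Fold accuracies:", scores)
print("Mean accuracy:", np.mean(scores))
```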

  • Veer Kumar