Random Forest Regression in Python Sklearn with Example

Introduction

In this article, we will see a tutorial on Random Forest Regression using the Python Sklearn library. We will have a brief overview of Random Forest Regression and then understand the RandomForestRegressor module of Sklearn in detail. Finally, we will see an example with the help of a small machine learning project that also includes hyperparameter tuning for RandomForestRegressor.

Quick Overview on Random Forest Regression

[Figure: overview diagram of Random Forest Regression]

Random Forest is an ensemble learning technique used for both classification and regression problems. In this technique, multiple decision trees are created and their outputs are averaged to give the final result. Random Forest Regression is known to produce robust results and is less prone to overfitting than a single decision tree.

How Random Forest Regression Works

The high-level steps for random forest regression are as follows (a minimal code sketch is shown after the list) –

  1. Decide the number of decision trees N to be created.
  2. Randomly take K data samples from the training set by using the bootstrapping method.
  3. Create a decision tree using the above K data samples.
  4. Repeat steps 2 and 3 till N decision trees are created.
  5. For new, unseen data, predict a regression result with each of the N decision trees. Then take the average of these results to arrive at the final regression output.
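To make these steps concrete, here is a minimal from-scratch sketch that bags Sklearn decision trees by hand (assuming X is a 2D NumPy array and y is a 1D array). It is illustrative only – Sklearn's actual RandomForestRegressor additionally randomizes the features considered at each split, which this sketch omits.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=10):
    # Steps 1-4: draw a bootstrap sample and fit one decision tree per sample
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # Step 2: K samples drawn with replacement
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X[idx], y[idx])           # Step 3: build a tree on the bootstrap sample
        trees.append(tree)
    return trees

def predict_forest(trees, X_new):
    # Step 5: average the per-tree predictions for the final output
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)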

Random Forest Regression in Sklearn

In Sklearn, random forest regression can be done quite easily by using the RandomForestRegressor class of the sklearn.ensemble module.

Random Forest Regressor Hyperparameters (Sklearn)

Hyperparameters are those parameters that can be fine-tuned to arrive at better accuracy of the machine learning model. Some of the main hyperparameters that the RandomForestRegressor module of Sklearn provides are listed below, followed by a short usage sketch –

  • n_estimators: It denotes the number of decision trees to be created in the random forest model. By default, it is 100.
  • criterion: This denotes the function used to measure the quality of a split in the decision trees. The supported values are ‘squared_error’ (default), ‘absolute_error’, ‘friedman_mse’, and ‘poisson’.
  • max_depth: It denotes the maximum depth of the tree. By default, it is None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
  • min_samples_split: It denotes the minimum number of samples needed to split an internal node. By default, it is 2.
  • min_samples_leaf: It denotes the minimum number of samples required to be at the leaf node. By default, it is 1.
  • max_features: It denotes the number of features to be considered when looking for the best split. It can take the values ‘sqrt’, ‘log2’, None, an int, or a float (‘auto’ was also accepted in older versions of Sklearn but is deprecated). By default, it is 1.0, i.e., all features are considered.
  • max_samples: It denotes the number (or fraction) of samples to be drawn from the training data for each bootstrap sample. By default, it is None, in which case every tree is trained on as many samples as there are in the training set.
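For illustration, here is how a regressor could be configured with several of the hyperparameters above. The values are arbitrary examples chosen for demonstration, not recommendations.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=200,            # 200 decision trees
    criterion='squared_error',   # split quality measured by squared error
    max_depth=10,                # limit how deep each tree can grow
    min_samples_split=4,         # need at least 4 samples to split a node
    min_samples_leaf=2,          # each leaf must hold at least 2 samples
    max_features='sqrt',         # consider sqrt(n_features) features per split
    max_samples=0.8,             # each bootstrap sample uses 80% of the rows
    random_state=42,
)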

Example of Random Forest Regression in Sklearn

About Dataset

In this example, we are going to use the Salary dataset, which contains two attributes – ‘YearsExperience’ and ‘Salary’. It is a simple and small dataset of only 30 records.

Importing libraries

To begin with, we import all the libraries that are going to be required in this example including RandomForestRegressor.

In [0]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor


Importing Dataset

Next, we import the dataset into a Pandas DataFrame and list its rows.

In [1]:

df = pd.read_csv('/content/salary_dataset.csv')

df
Out[1]:
index YearsExperience Salary
0 1.2 39344
1 1.4 46206
2 1.6 37732
3 2.1 43526
4 2.3 39892
5 3.0 56643
6 3.1 60151
7 3.3 54446
8 3.3 64446
9 3.8 57190
10 4.0 63219
11 4.1 55795
12 4.1 56958
13 4.2 57082
14 4.6 61112
15 5.0 67939
16 5.2 66030
17 5.4 83089
18 6.0 81364
19 6.1 93941
20 6.9 91739
21 7.2 98274
22 8.0 101303
23 8.3 113813
24 8.8 109432
25 9.1 105583
26 9.6 116970
27 9.7 112636
28 10.4 122392
29 10.6 121873

Visualizing Dataset

Let us visualize the dataset by creating a scatter plot with the matplotlib library.

In [2]:

plt.scatter(x = df['YearsExperience'], y = df['Salary'])

Out[2]:

[Figure: scatter plot of YearsExperience vs Salary]

Splitting the Dataset into Train & Test Dataset

In this section, we first create the independent variable X and the dependent variable y from the original dataframe df. Then we use the train_test_split function to randomly split them into train and test datasets in an 80:20 ratio.

In [3]:

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
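As a quick sanity check, we can print the split sizes – for this 30-row dataset, an 80:20 split should give 24 training rows and 6 test rows.

print(len(X_train), len(X_test))   # expected: 24 6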

Training the RandomForestRegressor

Now we create an object of RandomForestRegressor with n_estimators = 10, i.e., with 10 decision trees, and then fit it on X_train and y_train to train the model.

In [4]:

rf_regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
rf_regressor.fit(X_train, y_train)

Training Accuracy

Here we use the R2 score to calculate the training accuracy; it turns out to be 98.1%, which is quite impressive.
In [5]:
y_pred_train = rf_regressor.predict(X_train)
r2_score(y_train, y_pred_train)
Out[5]:
0.9815329041236582
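For reference, r2_score implements R2 = 1 - SS_res/SS_tot (residual sum of squares over total sum of squares), so computing it manually should reproduce the value above.

ss_res = np.sum((y_train - y_pred_train) ** 2)       # residual sum of squares
ss_tot = np.sum((y_train - np.mean(y_train)) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)   # should match r2_score(y_train, y_pred_train)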

Visualizing Training Accuracy

Let us visualize the training accuracy with a scatter plot. The red markers are the actual training data points and the blue ones are the corresponding predictions made during training. The two sets of markers overlap closely, which confirms that the training accuracy is indeed very good.
In[6]:
fig, ax = plt.subplots()
ax.scatter(X_train,y_train, color = "red")
ax.scatter(X_train,y_pred_train, color = "blue")

Out[6]:

[Figure: training data – actual (red) vs predicted (blue)]

Testing Accuracy

Now we use this model to carry out predictions on unseen test data and check its accuracy, which turns out to be 96.7%. This indicates slight overfitting in the model, because its training accuracy was 98.1%. We shall address this in the next section on hyperparameter tuning.

In[7]:

y_pred = rf_regressor.predict(X_test)
r2_score(y_test, y_pred)
Out[7]:
0.9675706804534532

Visualizing Testing Accuracy

Again, let us visualize the testing accuracy with a matplotlib scatter plot. The red markers are the actual data points and the blue ones are the corresponding predicted data points.

In [8]:

fig, ax = plt.subplots()
ax.scatter(X_test,y_test, color = "red")
ax.scatter(X_test,y_pred, color = "blue")
Out[8]:
[Figure: test data – actual (red) vs predicted (blue)]

Improving Results with K-Fold Cross Validation & Hyperparameter Tuning

In the above example, we noticed slight overfitting in the trained model. This is partly because the dataset is very small (30 rows), and holding out a test set leaves even less data for training. An effective remedy is to use K-Fold Cross Validation, which evaluates the model across several train/validation splits and helps produce a model less prone to overfitting.
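As a side note, Sklearn's cross_val_score gives a quick cross-validated estimate of generalization without committing to a single fixed split. A minimal sketch on this dataset (the exact scores will vary):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf_regressor, X, y, cv=5, scoring='r2')
print(scores.mean())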

Secondly, in the example we used only n_estimators = 10, but we can also experiment with different combinations and values of the other hyperparameters. We cannot evaluate so many combinations manually, so we use the GridSearchCV module of Sklearn.

Let us see the implementation below.

Using GridSearchCV & Cross Validation

Here we first create param_grid with multiple hyperparameters and their candidate values, from which we want to create and evaluate models. We then use this param_grid along with a RandomForestRegressor object to create an instance of GridSearchCV with 5-fold cross validation (cv=5) and R2 as the scoring metric.

Finally, we fit the GridSearchCV object on the training dataset. During this process, GridSearchCV trains and evaluates models for all possible combinations of the hyperparameters we provided in param_grid.

In [9]:

param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, None],
    'max_features': ["sqrt", "log2", None],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [10, 25, 50, 75, 100]
}

rf = RandomForestRegressor()
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 5, verbose =2, scoring='r2',  n_jobs = -1)
grid_search.fit(X_train, y_train)
Out[9]:
Fitting 5 folds for each of 540 candidates, totalling 2700 fits
GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'bootstrap': [True], 'max_depth': [80, 90, 100, None],
                         'max_features': ['sqrt', 'log2', None],
                         'min_samples_leaf': [1, 3, 5],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [10, 25, 50, 75, 100]},
             scoring='r2', verbose=2)

Checking for Best Hyperparameters

Let us see the best hyperparameter combination that GridSearchCV has selected for us.

In [10]:

grid_search.best_params_
Out[10]:
{'bootstrap': True,
 'max_depth': None,
 'max_features': None,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 25}
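The mean cross-validated R2 score achieved by this best combination can also be retrieved, from the best_score_ attribute.

print(grid_search.best_score_)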

Training Accuracy

Here we retrieve the best model with grid_search.best_estimator_ and use it to check the training accuracy, which comes out to be 98.4%.

In [11]:

best_grid = grid_search.best_estimator_   # best model, refit by GridSearchCV on the full training set
y_pred = best_grid.predict(X_train)
r2_score(y_train, y_pred)

Out[11]:

0.9846484854275217

Testing Accuracy

The accuracy on the unseen test data comes out to be 97.9%, i.e., roughly 98%, which is nearly equivalent to the accuracy we got on the training data. This means that by using K-Fold Cross Validation and hyperparameter tuning with GridSearchCV, we were able to reduce the overfitting.

In [12]:

y_pred = best_grid.predict(X_test)
r2_score(y_test, y_pred)

Out[12]:

0.9792332560698718

Visualizing Testing Accuracy

In this visualization, we can see that the red and blue markers, which correspond to the actual and predicted data, are much closer together than in the earlier example. This confirms that we have achieved better accuracy with K-Fold Cross Validation and GridSearchCV.

In [13]:

fig, ax = plt.subplots()
ax.scatter(X_test,y_test, color = "red")
ax.scatter(X_test,y_pred, color = "blue")
Out[13]:
[Figure: test data – actual (red) vs predicted (blue) after GridSearchCV tuning]

Reference: Sklearn Documentation
