Introduction
In this article, we walk through a tutorial on Random Forest Regression using Python's Sklearn (scikit-learn) library. We start with a brief overview of Random Forest Regression and then look at Sklearn's RandomForestRegressor class in detail. Finally, we work through an example in a small machine learning project that also includes hyperparameter tuning for RandomForestRegressor.
Quick Overview of Random Forest Regression
Random Forest is an ensemble learning technique used for both classification and regression problems. It builds multiple decision trees and, for regression, averages their outputs to give the final result. Because the errors of the individual trees tend to cancel out in the average, Random Forest Regression is usually much less prone to overfitting than a single decision tree.
How Random Forest Regression Works
The high-level steps for random forest regression are as follows:
 1. Decide the number of decision trees N to be created.
 2. Randomly draw K data samples from the training set, with replacement, using the bootstrapping method.
 3. Create a decision tree using the above K data samples.
 4. Repeat steps 2 and 3 until N decision trees are created.
 5. For new unseen data, predict the regression result using each of the N decision trees. Then take the average of these results to arrive at the final regression output.
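The steps above can be sketched by hand with plain decision trees. This is only an illustrative sketch, not how Sklearn implements its forest internally: the helper names `fit_bagged_trees` and `predict_bagged_trees` and the toy data are made up for this example, and the random feature selection that a random forest can apply at each split is left out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, n_trees=10, sample_size=None, seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    k = n_rows if sample_size is None else sample_size
    trees = []
    for _ in range(n_trees):
        # Bootstrapping: draw K row indices with replacement.
        idx = rng.integers(0, n_rows, size=k)
        # Grow one decision tree on that bootstrap sample.
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_bagged_trees(trees, X):
    """Average the predictions of all trees."""
    per_tree = np.stack([tree.predict(X) for tree in trees])
    return per_tree.mean(axis=0)


# Toy data (hypothetical): y is roughly 3 * x plus some noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(50, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=50)

trees = fit_bagged_trees(X_demo, y_demo, n_trees=10)
demo_pred = predict_bagged_trees(trees, np.array([[2.0], [5.0], [8.0]]))
print(demo_pred)
```

Each tree sees a slightly different sample, so the trees disagree a little; averaging smooths out those individual differences.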
Random Forest Regression in Sklearn
In Sklearn, random forest regression can be done quite easily by using the RandomForestRegressor class of the sklearn.ensemble module.
Random Forest Regressor Hyperparameters (Sklearn)
Hyperparameters are settings that are fixed before training and can be fine-tuned to improve the performance of a machine learning model. Some of the main hyperparameters that Sklearn's RandomForestRegressor class provides are as follows:
 n_estimators: It denotes the number of decision trees to be created in the random forest model. By default, it is 100.
 criterion: This denotes the function used to measure the quality of a split in the decision trees. The supported values are 'squared_error' (default), 'absolute_error', 'friedman_mse', and 'poisson'.
 max_depth: It denotes the maximum depth of each tree. By default, it is None, in which case nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
 min_samples_split: It denotes the minimum number of samples needed to split an internal node. By default, it is 2.
 min_samples_leaf: It denotes the minimum number of samples required to be at the leaf node. By default, it is 1.
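 max_features: It denotes the number of features to be considered when looking for the best split. It can be 'sqrt', 'log2', None, an int, or a float. By default, it is 1.0, which means all features are considered. Older Sklearn releases also accepted 'auto', which has since been removed.
 max_samples: It denotes the number of samples to be drawn from the training data for each tree in bootstrap sampling. By default, it is None, in which case as many samples as there are training rows are drawn.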
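As a quick illustration of how these hyperparameters are passed, the sketch below builds a regressor with each of them set explicitly. The values are arbitrary choices for demonstration, not tuned recommendations, and the variable names are made up for this example.

```python
from sklearn.ensemble import RandomForestRegressor

# Arbitrary illustrative values, not recommendations.
model = RandomForestRegressor(
    n_estimators=50,            # number of trees
    criterion="squared_error",  # split quality measure
    max_depth=5,                # cap on the depth of each tree
    min_samples_split=4,        # samples needed to split an internal node
    min_samples_leaf=2,         # samples required at each leaf
    max_features="sqrt",        # features considered per split
    max_samples=0.8,            # fraction of rows drawn per bootstrap sample
    random_state=0,
)

# get_params() returns the hyperparameters as a dictionary.
params = model.get_params()
for name in ("n_estimators", "criterion", "max_depth", "max_samples"):
    print(name, "=", params[name])
```

Calling `get_params()` on a model is a handy way to check which settings are actually in effect before training.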
Example of Random Forest Regression in Sklearn
About Dataset
In this example, we are going to use the Salary dataset, which contains two attributes: 'YearsExperience' and 'Salary'. It is a simple and small dataset of only 30 records.
Importing libraries
To begin with, we import all the libraries that are going to be required in this example including RandomForestRegressor.
In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
Importing Dataset
Next, we import the dataset into a Pandas DataFrame and display its rows.
In [1]:
df = pd.read_csv('/content/salary_dataset.csv')
df
index  YearsExperience  Salary 

0  1.2  39344 
1  1.4  46206 
2  1.6  37732 
3  2.1  43526 
4  2.3  39892 
5  3.0  56643 
6  3.1  60151 
7  3.3  54446 
8  3.3  64446 
9  3.8  57190 
10  4.0  63219 
11  4.1  55795 
12  4.1  56958 
13  4.2  57082 
14  4.6  61112 
15  5.0  67939 
16  5.2  66030 
17  5.4  83089 
18  6.0  81364 
19  6.1  93941 
20  6.9  91739 
21  7.2  98274 
22  8.0  101303 
23  8.3  113813 
24  8.8  109432 
25  9.1  105583 
26  9.6  116970 
27  9.7  112636 
28  10.4  122392 
29  10.6  121873 
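If the CSV file is not at hand, the same 30 rows can be rebuilt directly from the table above. The sketch below does that under the assumption that the file holds exactly these two columns; the name `df_demo` is made up here so that it does not clash with the tutorial's `df`.

```python
import pandas as pd

# Values copied from the table above (30 rows).
years = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8,
         4.0, 4.1, 4.1, 4.2, 4.6, 5.0, 5.2, 5.4, 6.0, 6.1,
         6.9, 7.2, 8.0, 8.3, 8.8, 9.1, 9.6, 9.7, 10.4, 10.6]
salary = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190,
          63219, 55795, 56958, 57082, 61112, 67939, 66030, 83089, 81364, 93941,
          91739, 98274, 101303, 113813, 109432, 105583, 116970, 112636, 122392, 121873]

df_demo = pd.DataFrame({"YearsExperience": years, "Salary": salary})

# Quick sanity checks: row/column count and summary statistics.
print(df_demo.shape)
print(df_demo.describe())
```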
Visualizing Dataset
Let us visualize the dataset by creating a scatter plot with the matplotlib library.
In [2]:
plt.scatter(x=df['YearsExperience'], y=df['Salary'])
Out[2]:
Splitting the Dataset into Train & Test Dataset
In this section, we first create the independent variable X and the dependent variable y from the original dataframe df. Then we use the train_test_split function to randomly create train and test datasets with an 80/20 split.
In [3]:
X = df.iloc[:, :1]
y = df.iloc[:, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
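To see what this split does with a dataset of this size, here is a small sketch on stand-in data with 30 rows (the arrays are placeholders made up for this example, not the salary data). It checks the resulting sizes and shows that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the salary dataset.
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(len(X_tr), len(X_te))  # sizes of the train and test parts

# Repeating the call with the same random_state selects the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print((X_te == X_te2).all())
```

With 30 rows, an 80/20 split leaves 24 rows for training and 6 for testing, which is worth keeping in mind when reading the test scores later on.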
Training the RandomForestRegressor
Now we create a RandomForestRegressor object with n_estimators = 10, i.e. with 10 decision trees, and fit it on X_train and y_train to train the model.
In [4]:
rf_regressor = RandomForestRegressor(n_estimators=10, random_state=0)
rf_regressor.fit(X_train, y_train)
Training Accuracy
y_pred_train = rf_regressor.predict(X_train)
r2_score(y_train, y_pred_train)
0.9815329041236582
Visualizing Training Accuracy
fig, ax = plt.subplots()
ax.scatter(X_train, y_train, color="red")
ax.scatter(X_train, y_pred_train, color="blue")
Out[6]:
Testing Accuracy
Now we use this model to carry out predictions on unseen test data and check its accuracy (R2 score), which turns out to be 96.8%. This indicates slight overfitting in the model, because its training accuracy was 98.2%. We shall address this in the next section on hyperparameter tuning.
In [7]:
y_pred = rf_regressor.predict(X_test)
r2_score(y_test, y_pred)
0.9675706804534532
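The "accuracy" reported here is the R2 score (coefficient of determination), which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up numbers and compares the result with r2_score; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left after prediction.
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: spread of the actual values around their mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)
print(r2_manual, r2_sklearn)
```

A score of 1.0 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean.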
Visualizing Testing Accuracy
Again, let us visualize the testing accuracy with the help of matplotlib scatter plot. The red markers are actual data points and the blue ones are corresponding predicted data points.
In [8]:
fig, ax = plt.subplots()
ax.scatter(X_test, y_test, color="red")
ax.scatter(X_test, y_pred, color="blue")
Improving Results with K-Fold Cross Validation & Hyperparameter Tuning
In the above example, we noticed slight overfitting in the trained model. One reason is that the dataset is very small (30 rows), so holding out part of it for testing leaves even less data for training, and a single split gives a noisy estimate of performance. An effective remedy is to use K-fold cross validation, which rotates the validation role across K subsets of the training data, to select a model that is less prone to overfitting.
Secondly, in the example we only set n_estimators to 10, but we can also try different values and combinations of the other hyperparameters. We cannot evaluate so many combinations manually, so we use Sklearn's GridSearchCV class.
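To make the idea of K-fold cross validation concrete, here is a minimal sketch using cross_val_score on synthetic data. The data, the choice of 25 trees, and the shuffled KFold splitter are all assumptions made for this illustration and are not taken from the tutorial's own run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 30 rows, roughly linear with noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 11, size=(30, 1))
y_demo = 9000 * X_demo[:, 0] + 25000 + rng.normal(0, 5000, size=30)

# Split into 5 folds; each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=25, random_state=0)

# One R2 score per fold, each from a model trained on the other 4 folds.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print(scores)
print(scores.mean())
```

Averaging over the folds gives a steadier estimate than a single train/validation split, which matters most when the dataset is this small.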
Let us see the implementation below.
Using GridSearchCV & Cross Validation
Here we first create param_grid with multiple hyperparameters and the candidate values with which we want to build and evaluate the model. We then pass this param_grid and a RandomForestRegressor object to GridSearchCV, with K-fold cross validation set to cv=5 and R2 as the scoring metric.
Finally, we fit the GridSearchCV object over the training dataset. During this process, GridSearchCV creates different models with all possible combinations of hyperparameters that we provided in param_grid.
In [9]:
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, None],
    'max_features': ["sqrt", "log2", None],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [10, 25, 50, 75, 100]
}
rf = RandomForestRegressor()
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5,
                           verbose=2, scoring='r2', n_jobs=1)
grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 540 candidates, totalling 2700 fits
GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=1, param_grid={'bootstrap': [True], 'max_depth': [80, 90, 100, None], 'max_features': ['sqrt', 'log2', None], 'min_samples_leaf': [1, 3, 5], 'min_samples_split': [2, 3, 4], 'n_estimators': [10, 25, 50, 75, 100]}, scoring='r2', verbose=2)
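The figure of 540 candidates in the log can be checked without training anything. The sketch below counts the combinations with Sklearn's ParameterGrid, using a copy of the grid from the cell above under the made-up name `param_grid_demo`.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid used above.
param_grid_demo = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, None],
    'max_features': ["sqrt", "log2", None],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [10, 25, 50, 75, 100],
}

# Number of combinations: 1 * 4 * 3 * 3 * 3 * 5.
n_candidates = len(ParameterGrid(param_grid_demo))
# Each candidate is trained once per fold (cv=5).
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

Because the number of fits grows multiplicatively with every value added to the grid, it is worth doing this count before launching a large search.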
Checking for Best Hyperparameters
Let us see the best hyperparameter combination that GridSearchCV has selected for us.
In [10]:
grid_search.best_params_
{'bootstrap': True, 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 25}
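Besides best_params_, a fitted GridSearchCV object also exposes best_score_ (the mean cross-validated score of the best combination) and best_estimator_ (a model refit on all the data passed to fit, using the best combination). The sketch below shows these on synthetic data with a deliberately tiny grid; the data, the grid, and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, roughly linear with noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 11, size=(40, 1))
y_demo = 9000 * X_demo[:, 0] + 25000 + rng.normal(0, 5000, size=40)

# A deliberately small grid so the search finishes quickly.
small_grid = {"n_estimators": [10, 25], "min_samples_leaf": [1, 3]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning hyperparameter combination
print(search.best_score_)   # its mean cross-validated R2

# With the default refit=True, best_estimator_ is ready for prediction.
best_model = search.best_estimator_
print(best_model.predict([[5.0]]))
```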
Training Accuracy
The training accuracy (R2 score) here comes out to be 98.5%.
In [11]:
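# best_estimator_ is the model refit on the training set with the best hyperparameters
best_grid = grid_search.best_estimator_
y_pred = best_grid.predict(X_train)
r2_score(y_train, y_pred)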
Out[11]:
0.9846484854275217
Testing Accuracy
The accuracy on the unseen test data comes out to be 97.9%, close to 98%, which is nearly the same as the accuracy we got on the training data. This suggests that K-fold cross validation and hyperparameter tuning with GridSearchCV have reduced the overfitting, although with a test set of only 6 records the estimate is rough.
In [12]:
y_pred = best_grid.predict(X_test)
r2_score(y_test, y_pred)
Out[12]:
0.9792332560698718
Visualizing Testing Accuracy
In this visualization, we can see that the red and blue markers, which correspond to actual and predicted data, are closer together than in the earlier example. This is consistent with the better test accuracy we achieved with K-fold cross validation and GridSearchCV.
In [13]:
fig, ax = plt.subplots()
ax.scatter(X_test, y_test, color="red")
ax.scatter(X_test, y_pred, color="blue")
Reference: Sklearn Documentation
