Decision Tree Regression in Python Sklearn with Example

Introduction

In this article, we will look at a Decision Tree Regression tutorial using the Python Sklearn library. We will begin with a brief overview of Decision Tree Regression before going in-depth into Sklearn’s DecisionTreeRegressor module. Finally, we will see an example of it using a small machine learning project that will also include DecisionTreeRegressor hyperparameter tuning.

Quick Overview of Decision Tree Regression

Decision tree regression predicts a continuous target by recursively splitting the data on feature thresholds so that each leaf of the resulting tree contains samples with similar target values; the prediction for a new sample is the mean of the target values in the leaf it lands in.

Decision Tree Regression in Python Sklearn
(Source)

How Decision Tree Regression Works – Step By Step

  1. Data Collection: The first step in creating a decision tree regression model is to collect a dataset containing both input features (also known as predictors) and output values (also called target variable).
  2. Test Train Data Splitting: The dataset is then divided into two parts: a training set and a testing set. The model is built using the training set, and its performance is evaluated using the testing set.
  3. Building Decision Tree: The decision tree is constructed by recursively splitting the training set into smaller subsets. The goal is to create a tree with high predictive accuracy while remaining simple to understand.
  4. Choosing the Best Split: At each stage of the tree-building process, the algorithm chooses the feature and threshold that best divide the data into the most homogeneous subsets based on a predefined criterion, such as mean squared error (a small worked sketch of this step follows the list).
  5. Pruning the Tree: Once the tree is fully grown, it is often pruned to reduce overfitting. Pruning involves removing branches of the tree that do not improve the predictive accuracy of the model.
  6. Making Predictions: Predictions are made by passing the input data down the tree, traversing the branches until it reaches a leaf node. The leaf node’s output value is then used as a prediction for the input data.
  7. Evaluating the Model: The model’s performance is evaluated using the testing set. The choice of evaluation metric depends on the problem at hand; common options include mean squared error, mean absolute error, and R-squared (the R2 score).
  8. Using the Model: Once trained and evaluated, the model can be used to make predictions on new, previously unseen data.
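
To make step 4 concrete, here is a small worked sketch (plain NumPy on made-up numbers, not Sklearn internals) of how a regression tree can score candidate thresholds on one feature by the weighted mean squared error of the two resulting subsets, and pick the threshold with the lowest score:

import numpy as np

# Hypothetical 1D toy data: a single feature x and a continuous target y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

def weighted_mse(y_left, y_right):
    # MSE of each side around its own mean, weighted by the number of samples
    n = len(y_left) + len(y_right)
    return (len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)) / n

# Candidate thresholds lie midway between adjacent (sorted) feature values
best_threshold, best_score = None, float("inf")
for t in (x[:-1] + x[1:]) / 2:
    score = weighted_mse(y[x <= t], y[x > t])
    if score < best_score:
        best_threshold, best_score = t, score

print(best_threshold)  # 3.5 -- the split that leaves both subsets most homogeneous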

Decision Tree Regression in Sklearn

In Sklearn, decision tree regression can be done quite easily by using the DecisionTreeRegressor class of the sklearn.tree module.
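
Before the full example later in this article, here is a minimal sketch of the API on a tiny made-up dataset (the numbers are arbitrary and only illustrate the fit/predict workflow):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up training data: one feature column and a continuous target
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

regressor = DecisionTreeRegressor(random_state=0)  # default hyperparameters
regressor.fit(X, y)                                # build the tree on the training data

print(regressor.predict([[2.5]]))                  # prediction comes from the matching leaf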

Decision Tree Regressor Hyperparameters (Sklearn)

Hyperparameters are parameters that can be fine-tuned to improve the accuracy of a machine learning model. Some of the main hyperparameters provided by Sklearn’s DecisionTreeRegressor module are as follows:

criterion: This refers to the criterion used to evaluate the quality of a split in the decision tree. The following values are supported: 'squared_error' (the default), 'absolute_error', 'friedman_mse', and 'poisson'.

splitter: This denotes the strategy used for splitting at each node while creating the tree. The supported strategies are 'best' (the default) and 'random'.

max_depth: It denotes the tree’s maximum depth. It accepts any int value or None (the default). If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

min_samples_split: It refers to the minimum number of samples needed to split an internal node. It supports any int or float value and the default is 2.

min_samples_leaf: It refers to the minimum number of samples required at a leaf node. It can be any int or float value; the default is 1.

max_features: It indicates the number of features to be considered when looking for the best split. It can be 'sqrt', 'log2', None (the default), an int, or a float.
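
All of these hyperparameters are passed to the DecisionTreeRegressor constructor. The snippet below is only an illustrative configuration; the specific values are arbitrary and not a recommendation for any particular dataset:

from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(
    criterion="squared_error",   # how split quality is measured
    splitter="best",             # evaluate all candidate splits at every node
    max_depth=4,                 # cap the depth of the tree to limit overfitting
    min_samples_split=4,         # a node needs at least 4 samples to be split
    min_samples_leaf=2,          # every leaf must keep at least 2 samples
    max_features=None,           # consider all features when searching for a split
    random_state=0,
)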

Example of Decision Tree Regression in Sklearn

About Dataset

In this example, we’ll use the Salary dataset, which has two attributes: ‘YearsExperience’ and ‘Salary’. It is a simple dataset with only 30 records.

Importing libraries

To begin, we import all of the libraries that will be needed in this example, including DecisionTreeRegressor.

In [0]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeRegressor

Importing Dataset

Next, the dataset is imported into a Pandas DataFrame and its rows are listed.

In [1]:

df = pd.read_csv('/content/salary_dataset.csv')

df
Out[1]:
index YearsExperience Salary
0 1.2 39344
1 1.4 46206
2 1.6 37732
3 2.1 43526
4 2.3 39892
5 3.0 56643
6 3.1 60151
7 3.3 54446
8 3.3 64446
9 3.8 57190
10 4.0 63219
11 4.1 55795
12 4.1 56958
13 4.2 57082
14 4.6 61112
15 5.0 67939
16 5.2 66030
17 5.4 83089
18 6.0 81364
19 6.1 93941
20 6.9 91739
21 7.2 98274
22 8.0 101303
23 8.3 113813
24 8.8 109432
25 9.1 105583
26 9.6 116970
27 9.7 112636
28 10.4 122392
29 10.6 121873

Visualizing Dataset

Let’s visualize the dataset using the Matplotlib library’s scatter plot.

In [2]:

plt.scatter(x = df['YearsExperience'], y = df['Salary'])

Out[2]:

Decision Tree Regression Sklearn Python Dataset

Splitting the Dataset into Train & Test Dataset

In this section, we first divide the original data frame df into the independent variable X and the dependent variable y. Then, we randomly generate train and test datasets with an 80-20% split using the train_test_split function. Moreover, the random_state is fixed at 4 so that the same split is reproduced across reruns.

In [3]:

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)

Training the DecisionTreeRegressor

Now we will create an object of DecisionTreeRegressor with random_state=0 so that the results can be reproduced across reruns. We then fit this regressor on X_train and y_train to train the model.

In [4]:

dt_regressor = DecisionTreeRegressor(random_state = 0)
dt_regressor.fit(X_train, y_train)

Training Accuracy

Now, we calculate the training accuracy using the R2 score, and the result is 99.7%, which is very impressive.
In [5]:
y_pred_train = dt_regressor.predict(X_train)
r2_score(y_train, y_pred_train)
Out[5]:
0.9971006453393606

Visualizing Training Accuracy

Let’s use a scatter plot to see the training accuracy. The blue markers represent the predicted data points during training, while the red markers represent the actual training data points. The two sets of points overlap almost completely, indicating that the training accuracy is very high and hinting at possible overfitting.
In [6]:
fig, ax = plt.subplots()
ax.scatter(X_train,y_train, color = "red")
ax.scatter(X_train,y_pred_train, color = "blue")

Out[6]:

Decision Tree Regression in Python Sklearn Example - 1

 

Testing Accuracy

Now we utilize this model to carry out predictions on unseen test data and check its accuracy, which turns out to be 98.8%. This indicates that there is slight overfitting in the model, because its training accuracy was 99.7%. This issue will be addressed in the following section on hyperparameter tuning of decision tree regression.

In [7]:

y_pred = dt_regressor.predict(X_test)
r2_score(y_test, y_pred)
Out[7]:
0.9887060585499007

Visualizing Testing Accuracy

Again, let’s use a matplotlib scatter plot to illustrate the testing accuracy. The blue markers represent the predicted data points, and the red markers represent the actual data points.

In [8]:

fig, ax = plt.subplots()
ax.scatter(X_test,y_test, color = "red")
ax.scatter(X_test,y_pred, color = "blue")
Out[8]:

Decision Tree Regression in Sklearn Overfitting - 1

Improving Results with K Cross Validation & Hyperparameter Tuning

We observed slight overfitting in the trained model in the preceding example. This is partly due to the small size of the dataset (30 rows): splitting it into train and test sets can result in information loss for training. To produce a model that is less prone to overfitting, we can use K Cross Validation rather than splitting the data into just two parts.
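
As a quick sanity check before full hyperparameter tuning, K Cross Validation on its own can be run with Sklearn’s cross_val_score. The sketch below reuses the X, y and dt_regressor objects created in the earlier cells; the choice of 5 folds is arbitrary here:

from sklearn.model_selection import cross_val_score

# R2 score of the decision tree on each of 5 folds of the data
scores = cross_val_score(dt_regressor, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())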

Moreover, in that example we did not pass any hyperparameters to the DecisionTreeRegressor object, hence all default hyperparameters were used. Instead, we can experiment with various combinations of hyperparameters with the help of Sklearn’s GridSearchCV module to arrive at better accuracy.

Let us now look at the implementation in the next section.

GridSearchCV & Cross Validation in Decision Tree Regression

First, we create a param grid with multiple hyperparameters and their possible values, which we will use to create and evaluate the model. We then use this param grid and a DecisionTreeRegressor object to create a GridSearchCV instance with K Cross Validation (cv=10) and R2 scoring.

Finally, the GridSearchCV object is fitted to the training dataset. During this process, GridSearchCV builds and evaluates a model for every combination of the hyperparameters specified in the param grid.

In [9]:

param_grid = {
    'max_depth': [80, 90, 100, None],
    'max_features': ["sqrt", "log2", None],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 3, 4],
    'criterion': ["squared_error", "friedman_mse", "absolute_error", "poisson"],
    'splitter': ["best", "random"]
}

rf = DecisionTreeRegressor()

grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 10, verbose =2, scoring='r2',  n_jobs = -1)
grid_search.fit(X_train, y_train)

Out[9]:

Fitting 10 folds for each of 864 candidates, totalling 8640 fits
GridSearchCV(cv=10, estimator=DecisionTreeRegressor(), n_jobs=-1,
             param_grid={'criterion': ['squared_error', 'friedman_mse',
                                       'absolute_error', 'poisson'],
                         'max_depth': [80, 90, 100, None],
                         'max_features': ['sqrt', 'log2', None],
                         'min_samples_leaf': [1, 3, 5],
                         'min_samples_split': [2, 3, 4],
                         'splitter': ['best', 'random']},
             scoring='r2', verbose=2)

Checking for Best Hyperparameters

Let’s take a look at the best hyperparameter combination that GridSearchCV has chosen for us.

In [10]:

grid_search.best_params_
Out[10]:
{'criterion': 'squared_error',
 'max_depth': 80,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 4}
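
The cells below make predictions with a best_grid object. One way to obtain it (assuming the GridSearchCV default refit=True) is through the best_estimator_ attribute of the fitted grid search, which is the tree refit on the full training set with the best hyperparameters; best_score_ reports the mean cross-validated R2 of that configuration:

best_grid = grid_search.best_estimator_   # regressor refit with the best hyperparameters
print(grid_search.best_score_)            # mean cross-validated R2 of the best candidate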

Training Accuracy

The training accuracy, in this case, is 97.15%.

In [11]:

y_pred = best_grid.predict(X_train)
r2_score(y_train, y_pred)

Out[11]:

0.9715589269592385

Testing Accuracy

The accuracy on unseen test data is 97.9%, which is comparable to the accuracy on the training data. This shows that K Cross Validation and hyperparameter tuning with GridSearchCV help keep overfitting in check.

In [12]:

y_pred = best_grid.predict(X_test)
r2_score(y_test, y_pred)

Out[12]:

0.9792332560698718

Visualizing Testing Accuracy

In this visualization, the red and blue markers, which correspond to the actual and predicted data respectively, again lie close to each other. Combined with the matching training and testing scores above, this confirms that K Cross Validation and GridSearchCV helped reduce the overfitting seen in the earlier model.

In [13]:

fig, ax = plt.subplots()
ax.scatter(X_test,y_test, color = "red")
ax.scatter(X_test,y_pred, color = "blue")
Out[13]:
Decision Tree Regressor Testing Example with GridSearchCV

Visualizing Regression Decision Tree with Graphviz

We can visualize the decision tree itself by using the tree module of Sklearn together with the Graphviz package, as shown below. Here we export the first tree we trained (dt_regressor). (The graphviz package can be installed with pip.)

In [14]:

from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(dt_regressor, out_file=None,
                                filled=True, rounded=True,
                                special_characters=True)

graph = graphviz.Source(dot_data)
graph
Out[14]:
Decision Tree Regressor Visualization

