What is Predictive Power Score (PPS) – Is it better than Correlation? [With Python Code]

Introduction

Every time we try to solve a data science problem, our aim is to extract as much insight as possible from the data. This often involves finding the relationship between two different attributes or features of a dataset, and traditionally the correlation matrix is widely used for this purpose. However, there are instances when the correlation matrix is not effective at conveying all the hidden insights in the data. To overcome this shortcoming, a newer method called the Predictive Power Score has started to gain popularity.

In this article, we will understand what the Predictive Power Score is and see its implementation in Python. We will also compare the Predictive Power Score with correlation and understand its pros and cons.

What is a Predictive Power Score?

Predictive Power Score, or PPS, is an asymmetric, data-type-agnostic score that helps in identifying linear or non-linear relationships between two columns of a dataset. The PPS value lies between 0 (no predictive power) and 1 (highest predictive power).

Through PPS we can figure out how useful a variable would be in predicting the values of another variable in a given dataset, and since the score is normalized between 0 and 1, it is easy to compare across different column pairs.

Generally, a PPS near 1 (e.g. 0.8) is considered good and tells us that a given column A is very likely to be useful for predicting the values of column B.

Whereas if the PPS lies on the lower side, closer to 0 (e.g. 0.3), then we may have to reconsider our comparison, as column A may not be very useful for predicting the values of column B.

Predictive Power Score vs Correlation

i) Correlation

The correlation matrix produces values between -1 and 1, from which we can easily spot strong linear relationships in both the positive and the negative direction.

Still, there are areas where the correlation matrix fails to produce desirable results. Columns that have non-linear relationships go undetected by the correlation matrix. In such scenarios the correlation can be close to 0, effectively telling us “I didn’t find anything useful here”.

Moreover, the correlation matrix can only handle columns that consist of numerical data; it simply neglects other kinds of values, namely categorical, nominal, etc.

Another drawback is that the correlation matrix is symmetric, which can give the misleading impression that the relationship of column A to B and of B to A is the same. But this is not always the case in the real world.

ii) Predictive Power Score (PPS)

  • PPS can detect non-linear relationships between columns, which cannot be obtained from correlation.
  • The correlation matrix works only with numerical variables, but PPS can also handle categorical and nominal data in addition to numerical values.
  • Unlike the correlation matrix, PPS is asymmetric. This simply means that if column A can predict column B’s values, it does not follow that column B can also predict column A’s values, and PPS will reflect this (see the sketch below).
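The asymmetry and the ability to pick up non-linear patterns can be seen in a small sketch adapted from the ppscore documentation's quadratic example (exact scores will vary with the random data):

import numpy as np
import pandas as pd
import ppscore as pps

# Synthetic data: y depends on x through a quadratic relationship plus noise
df_demo = pd.DataFrame({"x": np.random.uniform(-2, 2, 10_000)})
df_demo["y"] = df_demo["x"] ** 2 + np.random.uniform(-0.5, 0.5, 10_000)

print(df_demo["x"].corr(df_demo["y"]))          # close to 0: correlation misses the pattern
print(pps.score(df_demo, "x", "y")["ppscore"])  # well above 0: x predicts y
print(pps.score(df_demo, "y", "x")["ppscore"])  # much lower: y cannot recover the sign of x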


Disadvantages of Predictive Power Score

In spite of all the pros of PPS, there are some cons that make it harder to work with. Let’s go over these cons and understand the problems encountered.

  • Since PPS is relatively new compared to the correlation matrix, it can be tedious to compute and harder to draw inferences from.
  • PPS condenses different types of relationships into a single score, which can hide complex patterns.
  • Contrary to the correlation matrix, the Predictive Power Score can be computed with different methods, i.e. different algorithms and evaluation metrics, making the results highly dependent on these choices. This may generate undesirable results.
  • The Predictive Power Score takes considerably more time to compute than the correlation matrix.

Steps for Predictive Power Score Calculation

1. Choosing the Algorithm

By default, the Predictive Power Score method uses a Decision Tree for calculating the results. There are several reasons for choosing the Decision Tree algorithm.

  • Firstly, the Decision Tree can find any sort of non-linear bivariate relationship (illustrated in the sketch after this list).
  • The Decision Tree is applicable in numerous cases and requires very little data preprocessing.
  • Furthermore, the Decision Tree handles outliers well and, with cross-validation, is reasonably resistant to overfitting, which makes it robust.
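To see the first point in action, here is a small sketch (not part of ppscore) that cross-validates a Decision Tree and a plain linear model on a quadratic relationship; under these assumptions the tree scores far better:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic quadratic relationship between a single feature and the target
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5_000, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, size=5_000)

# Cross-validated R^2: the tree can model the curve, a straight line cannot
print(cross_val_score(DecisionTreeRegressor(), X, y, cv=4).mean())  # close to 1
print(cross_val_score(LinearRegression(), X, y, cv=4).mean())       # around 0 or negative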

2. Deciding the Implementation Method

As we know, the Predictive Power Score can work with both numerical and categorical values in the target column. We also have to choose the supervised learning method that will be used for performing the predictions.

  • When dealing with categorical values in the target column, we use the Decision Tree Classifier.
  • When dealing with numerical values in the target column, we use the Decision Tree Regressor.

If you are looking to use some other algorithm in place of the Decision Tree, you will have to implement the classifier and regressor using that particular algorithm.
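A minimal sketch of that choice, a simplification of what ppscore does internally, using a pandas dtype check (the column names in the comments refer to the placement dataset used later in this article):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def choose_model(target: pd.Series):
    """Pick a regressor for numeric targets and a classifier for everything else."""
    if pd.api.types.is_numeric_dtype(target):
        return DecisionTreeRegressor()
    return DecisionTreeClassifier()

# choose_model(df["Salary"])      -> DecisionTreeRegressor()
# choose_model(df["Job_Status"])  -> DecisionTreeClassifier() (before the column is encoded)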

3. Preprocessing the data

Earlier in the article, we discussed that the Decision Tree does not require much preprocessing, but there are instances when we have to perform some data preprocessing and feature encoding to generate better results.

Again, based on the type of values, we choose the encoding method, as sketched below.

  • If the column whose values are to be predicted, a.k.a. the target column, has categorical values, we use label encoding.
  • But if the column which is predicting the values, a.k.a. the feature column, has categorical values, we use one-hot encoding.
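A minimal, illustrative sketch of these two encodings with scikit-learn and pandas (ppscore performs the equivalent steps internally):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label-encode a categorical *target* column
target = pd.Series(["Placed", "Not Placed", "Placed"])
encoded_target = LabelEncoder().fit_transform(target)          # array([1, 0, 1])

# One-hot encode a categorical *feature* column
feature = pd.DataFrame({"gender": ["M", "F", "M"]})
encoded_feature = pd.get_dummies(feature, columns=["gender"])  # gender_F and gender_M columns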

4. Finalizing the Evaluation Metric

Now, on the basis of the supervised learning method chosen, we also have to finalize the evaluation metric for our score.

So if the chosen method is classification, then we use the weighted F1 score as the evaluation metric. This score is basically a weighted average of precision and recall. For the Predictive Power Score, we first calculate the F1 score of the naive model (the model that always predicts the most common class) and then use it to normalize the F1 score of the actual model into the final score.

The F1 score lies in the range of 0 to 1; the higher, the better.

Following is the mathematical formula used in this case:

PPS = (F1_model – F1_naive) / (1 – F1_naive)

If the method used is regression, then the Mean Absolute Error (MAE) is used as the evaluation metric. Again, in this case as well, we first calculate the MAE of the naive model, and then, using this baseline, we normalize the MAE of the actual model into the Predictive Power Score.

The resulting score again lies between 0 and 1 and higher is better, even though the MAE itself measures error, where lower is better. The mathematical formula used for converting the MAE values into the score is mentioned below.

PPS = 1 – (MAE_model / MAE_naive)
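Both normalizations translate directly into code; here is a small sketch with made-up example numbers:

def pps_from_f1(f1_model: float, f1_naive: float) -> float:
    """Classification case: normalize the model's weighted F1 against the naive baseline."""
    return (f1_model - f1_naive) / (1 - f1_naive)

def pps_from_mae(mae_model: float, mae_naive: float) -> float:
    """Regression case: one minus the ratio of the model's MAE to the naive baseline MAE."""
    return 1 - (mae_model / mae_naive)

print(pps_from_f1(f1_model=0.9, f1_naive=0.6))        # 0.75
print(pps_from_mae(mae_model=40.0, mae_naive=100.0))  # 0.6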

Predictive Power Score in Python

In this section, we will implement the Predictive Power Score in Python and also compare its results with the correlation matrix. We will calculate both the Predictive Power Score and the correlation for the columns of a given dataset.

i) Installing ppscore library for Predictive Power Score

If the ppscore library is not already installed, you can install it using the following command at the command prompt.

  • pip install ppscore

ii) Loading the libraries

Here we will be loading pandas, seaborn and the Predictive Power Score library (ppscore).

In [1]:
import pandas as pd
import seaborn as sns

import ppscore as pps

iii) Create heatmap for PPS matrix

For better interpretation, we will have to visualize the results of the predictive power score matrix. For this purpose, we will write a function to create a heatmap.

In [2]:
def heatmap(df):
    ax = sns.heatmap(df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
    ax.set_title('PPS matrix')
    ax.set_xlabel('feature')
    ax.set_ylabel('target')
    return ax

iv) Create a heatmap for correlation matrix

Similarly, we will have to visualize the results of the correlation matrix, and for this purpose we write another heatmap function.

In [3]:
def corr_heatmap(df):
    ax = sns.heatmap(df, vmin=-1, vmax=1, cmap="BrBG", linewidths=0.5, annot=True)
    ax.set_title('Correlation matrix')
    return ax

v) Loading the Dataset

Here, for this practical implementation, we are using a dataset that contains information about the placements of job aspirants after they have completed their MBA degree. We load the dataset using pandas.

In [4]:
df = pd.read_csv("placement.csv")
In [5]:
df.head()
Out[5]:
sl_no gender 10th_Grade 12th_Marks Degree_Grade workex Employability_Test MBA_Grade Job_Status Salary
0 1 M 67.00 91.00 58.00 No 55.0 58.80 Placed 270000.0
1 2 M 79.33 78.33 77.48 Yes 86.5 66.28 Placed 200000.0
2 3 M 65.00 68.00 64.00 No 75.0 57.80 Placed 250000.0
3 4 M 56.00 52.00 52.00 No 66.0 59.43 Not Placed NaN
4 5 M 85.80 73.60 73.30 No 96.8 55.50 Placed 425000.0

vi) Preprocessing the Dataset

Since there are some NaN values in the Salary column, we will remove those rows. Also, the workex and Job_Status columns have text values, so we convert them to binary numerical values for easier handling.

In [6]:
df = df[df['Salary'].notna()]
In [7]:
df.loc[df['workex'] == 'No', 'workex'] = 0
In [8]:
df.loc[df['workex'] == 'Yes', 'workex'] = 1
In [9]:
df.loc[df['Job_Status'] == 'Not Placed', 'Job_Status'] = 0
In [10]:
df.loc[df['Job_Status'] == 'Placed', 'Job_Status'] = 1
In [11]:
df.head()
Out[11]:
sl_no gender 10th_Grade 12th_Marks Degree_Grade workex Employability_Test MBA_Grade Job_Status Salary
0 1 M 67.00 91.00 58.00 0 55.0 58.80 1 270000.0
1 2 M 79.33 78.33 77.48 1 86.5 66.28 1 200000.0
2 3 M 65.00 68.00 64.00 0 75.0 57.80 1 250000.0
4 5 M 85.80 73.60 73.30 0 96.8 55.50 1 425000.0
7 8 M 82.00 64.00 66.00 1 67.0 62.14 1 252000.0

vii) Single Predictive Power Score

Here we are looking to find the relationship between the MBA_Grade and workex columns. The ppscore entry of the returned dictionary tells us about the strength of the relationship; the low value here suggests that the relationship is not strong. The other entries contain relevant information such as the metric and the model used by ppscore.

In [12]:
pps.score(df, "MBA_Grade", "workex")
Out[12]:
{'x': 'MBA_Grade',
 'y': 'workex',
 'task': 'classification',
 'ppscore': 0.2687331422358646,
 'metric': 'weighted F1',
 'baseline_score': 0.4864864864864865,
 'model_score': 0.6244845865535521,
 'model': DecisionTreeClassifier()}
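We can sanity-check the reported ppscore against the weighted-F1 formula given earlier, using the baseline_score and model_score entries of this dictionary (re-running may produce slightly different numbers because of the random cross-validation splits, but the relationship between the three values still holds):

score = pps.score(df, "MBA_Grade", "workex")
manual_pps = (score["model_score"] - score["baseline_score"]) / (1 - score["baseline_score"])
print(manual_pps)         # ≈ 0.2687 for the run shown above
print(score["ppscore"])   # matches the manually normalized value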

viii) Predictive Power Score Matrix

By using the pps library’s matrix function, we can generate a matrix of our dataset.

In [13]:
matrix = pps.matrix(df)
In [14]:
matrix
Out[14]:
sl_no gender 10th_Grade 12th_Marks Degree_Grade workex Employability_Test MBA_Grade Job_Status Salary
sl_no 1.000000 0.000000 0.000000 0.000000 0.000000 0.004119 0.000000 0.000000 0 0.000000
gender 0.030294 1.000000 0.078647 0.000301 0.103646 0.000000 0.000000 0.000000 0 0.026784
10th_Grade 0.000000 0.012969 1.000000 0.000000 0.000000 0.002736 0.000000 0.000000 0 0.018676
12th_Marks 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0 0.000000
Degree_Grade 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0 0.000000
workex 0.013792 0.000000 0.000000 0.008364 0.022069 1.000000 0.068917 0.157935 0 0.000000
Employability_Test 0.000000 0.002994 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0 0.000000
MBA_Grade 0.000000 0.026816 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 0.000000
Job_Status 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1 1.000000
Salary 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 1.000000
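Assuming the wide matrix shown above, with rows as targets and columns as features (matching the heatmap labels used earlier), plain pandas can rank the single-column predictors of any target, for example workex:

# Rank how well each column predicts 'workex', dropping the trivial self-score
workex_predictors = matrix.loc["workex"].drop("workex").sort_values(ascending=False)
print(workex_predictors.head())   # MBA_Grade comes out on top for this run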

ix) Heatmap for Predictive Power Score

Using the heatmap function built earlier and the matrix obtained just above, we will build a heatmap for the Predictive Power Score. This heatmap helps in visualizing the relationships that the different columns have with each other.

As we can see, the values in each cell range from 0 to 1, which makes them easy to compare, and there is no explicit need for normalization.

It can, however, become harder to read this matrix when the dataset has a large number of columns.

In [15]:
heatmap(matrix)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x21ad6808b70>

[Figure: heatmap of the PPS matrix]

x) Heatmap for Correlation matrix

Now we will build the heatmap using the correlation matrix. Here, as we see, the values range from -1 to 1, which makes them harder to compare directly and may require rescaling.

In [16]:
corr_heatmap(df.corr())
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x21ad6970128>
[Figure: heatmap of the correlation matrix]

Applications of Predictive Power Score

Having seen its practical implementation, let us see some applications of predictive power score for real-world data science projects.

  1. Finding various patterns in a given dataset.
  2. Performing feature selection (a sketch follows this list).
  3. Detecting information leakage between columns that the correlation matrix may miss, e.g. a column that almost perfectly predicts another.
  4. PPS values are already normalized between 0 and 1, so they can be compared directly without any further scaling.
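For example, a hedged sketch of PPS-based feature selection on the placement dataset (the target column and the 0.05 threshold are arbitrary choices for illustration):

# Keep only the columns whose PPS towards the chosen target exceeds the threshold
target, threshold = "workex", 0.05
selected = [
    col for col in df.columns
    if col != target and pps.score(df, col, target)["ppscore"] > threshold
]
print(selected)   # e.g. ['MBA_Grade', ...] depending on the cross-validation splits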

Conclusion

We have reached the end of this article. We learned what the Predictive Power Score is and saw its implementation in Python. We also compared the Predictive Power Score with correlation. Finally, we learned about the applications of PPS and its pros and cons.

Predictive Power Score GitHub Link
