Pandas Tutorial – Rolling, Correlation and Apply

Introduction

In this article, we will continue with the pandas tutorial and cover rolling(), corr(), and apply() functions. We will look at the syntax and examples of these functions for better understanding.

Importing Pandas Library

We start the tutorial by importing the pandas and NumPy libraries.

In [1]:
import pandas as pd
import numpy as np

Pandas Rolling : rolling()

The pandas rolling() function provides rolling window calculations: a statistic such as a sum or mean is computed over a window that slides across the data.

Syntax

DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)

window : int or offset – This parameter determines the size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.

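If it is set to an offset, it is the time period of each window. Each window is then variable-sized, depending on the observations that fall within the time period. This is only valid for datetime-like indexes.

min_periods : int, default None – The minimum number of observations a window must contain to produce a value (otherwise the result is NA). For a window given as an offset the default is 1; for an integer window the default is the window size.

center : bool, default False – If True, the window label is set at the center of the window instead of at its right edge.

win_type : str, default None – The type of window function used to weight the observations. If None, all observations are weighted equally. Otherwise it must be the name of a SciPy window function, such as ‘triang’ or ‘gaussian’.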

on : str(optional) – For a DataFrame, a datetime-like column or MultiIndex level on which to calculate the rolling window, rather than the DataFrame’s index.

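axis : int or str, default 0 – The axis along which to roll: 0 (or ‘index’) rolls over rows and 1 (or ‘columns’) rolls over columns. Recent pandas releases deprecate this parameter.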

closed : str, default None – This is used for making the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints.

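The function returns a Window object (when win_type is given) or a Rolling object. An aggregation such as sum() or mean() is then called on that object to produce the result.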
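To illustrate the offset form of the window parameter described above, here is a minimal sketch on a small, made-up time series. The dates and values are illustrative assumptions, not data from this tutorial.

```python
import pandas as pd

# Hypothetical irregular daily series: note the gap between Jan 2 and Jan 5.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"])
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=idx)

# "2D" is an offset window: each window covers the two days ending at the
# row's own timestamp, so the number of rows per window can vary.
rolled = ts.rolling("2D").sum()
print(rolled)
```

Because of the gap in the dates, the window ending on Jan 5 contains only that one row, while the windows ending on Jan 2 and Jan 6 each contain two rows.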

Example 1: Using win_type parameter in Pandas Rolling()

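In this first example, we use different values of the win_type parameter. With win_type set, sum() returns a weighted sum: each observation in the window is multiplied by the window function's weight before the values are added. Note that win_type requires SciPy to be installed.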

In [2]:
df = pd.DataFrame({'A': [7, 3, 5,9, 2]})
In [3]:
df
Out[3]:
A
0 7
1 3
2 5
3 9
4 2

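In the example below, the weighted rolling sum is calculated with a window of 2 and the ‘triang’ (triangular) window type. For a window of 2, both weights are 0.5, so the first result is 0.5 × 7 + 0.5 × 3 = 5.0.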

In [4]:
df.rolling(2, win_type='triang').sum()
Out[4]:
A
0 NaN
1 5.0
2 4.0
3 7.0
4 5.5
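The weighted sum above can be reproduced by hand, which may help clarify what win_type does. This is only a sketch: the weights array is an assumption based on a triangular window of length 2 (both weights equal to 0.5), and it uses a plain rolling window with apply() rather than SciPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [7, 3, 5, 9, 2]})

# Assumed weights: a triangular window of length 2 weights both points by 0.5.
weights = np.array([0.5, 0.5])

# Multiply each window by the weights and add up the products.
manual = df['A'].rolling(2).apply(lambda w: np.dot(w, weights), raw=True)
print(manual)
```

The values match the Out[4] column: NaN for the first row, then 5.0, 4.0, 7.0 and 5.5.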

Similarly, we can pass ‘gaussian’ as the win_type value. A Gaussian window needs a standard deviation, which is passed to the aggregation function as the std argument.

In [5]:
df.rolling(2, win_type='gaussian').sum(std=3)
Out[5]:
A
0 NaN
1 9.862071
2 7.889657
3 13.806900
4 10.848278
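As a rough check of the Gaussian result, the weights can be computed directly with NumPy. The sketch below assumes the usual Gaussian window formula, exp(-0.5 * (n / std)^2), with n measured from the center of the window; the variable names are illustrative.

```python
import numpy as np

window, std = 2, 3

# Offsets of each point from the window center: [-0.5, 0.5] for a window of 2.
n = np.arange(window) - (window - 1) / 2
weights = np.exp(-0.5 * (n / std) ** 2)

# Weighted sum for the first full window, which holds the values 7 and 3.
first = float(np.dot(weights, [7, 3]))
print(weights, first)
```

Both weights come out at roughly 0.986, so the first full window gives about 9.862, in line with row 1 of Out[5].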

Example 2: Using min_periods parameter in Pandas Rolling()

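The min_periods parameter specifies the minimum number of observations a window must contain to produce a value. With min_periods=1, the first row gets a value even though its window holds only one observation.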

In [6]:
df.rolling(2, min_periods=1).sum()
Out[6]:
A
0 7.0
1 10.0
2 8.0
3 14.0
4 11.0

In the next case, min_periods is set to 2, so the row at index 0, whose window holds only one observation, has the value NaN.

In [7]:
df.rolling(2, min_periods=2).sum()
Out[7]:
A
0 NaN
1 10.0
2 8.0
3 14.0
4 11.0
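One way to see why only the first row changes between the two calls is to count the observations in each window. This is a small sketch; the names counts, has_value_min1 and has_value_min2 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [7, 3, 5, 9, 2]})

# Number of observations available in each window of size 2.
counts = df['A'].rolling(2, min_periods=1).count()

# A window produces a value only when its count reaches min_periods.
has_value_min1 = counts >= 1
has_value_min2 = counts >= 2
print(counts.tolist(), has_value_min2.tolist())
```

Only the first window has a single observation, so it is the only row that turns into NaN when min_periods is raised from 1 to 2.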

Pandas Correlation : corr()

The pandas corr() function computes the pairwise correlation of columns, excluding NA/null values.

Syntax

DataFrame.corr(method='pearson', min_periods=1)

method : {‘pearson’, ‘kendall’, ‘spearman’} – The correlation method to use: ‘pearson’ is the standard correlation coefficient, ‘kendall’ is the Kendall Tau correlation coefficient, and ‘spearman’ is the Spearman rank correlation.

min_periods : int,optional – This optional parameter decides the minimum number of observations required per pair of columns to have a valid result.

The output of the function is a DataFrame with correlation matrix.
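The effect of min_periods is easiest to see with missing values. The following sketch uses a small made-up DataFrame (the column names and numbers are assumptions for illustration) in which only three rows have a value in both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'b' has two missing values, so only three
# rows have a value in both columns.
data = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [2, np.nan, np.nan, 8, 10]})

enough = data.corr(min_periods=3)   # three complete pairs are sufficient
too_few = data.corr(min_periods=4)  # the 'a'/'b' pair has too few observations
print(enough.loc['a', 'b'], too_few.loc['a', 'b'])
```

With min_periods=3 the pair gets a correlation value, while with min_periods=4 the same entry is NaN.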

Example 1: Simple example of corr() function

We create a DataFrame from a CSV file and call corr() with the method set to “pearson”, which is also the default. The result contains the correlation value for every pair of columns that hold numerical values.

In [8]:
df = pd.read_csv('employees.csv')
In [9]:
df.head()
Out[9]:
  First Name  Gender Start Date Last Login Time  Salary  Bonus % Senior Management             Team
0    Douglas    Male   8/6/1993        12:42 PM   97308    6.945              True        Marketing
1     Thomas    Male  3/31/1996         6:53 AM   61933    4.170              True              NaN
2      Maria  Female  4/23/1993        11:17 AM  130590   11.858             False          Finance
3      Jerry    Male   3/4/2005         1:00 PM  138705    9.340              True          Finance
4      Larry    Male  1/24/1998         4:47 PM  101004    1.389              True  Client Services

The ‘pearson’ method uses the standard (Pearson) correlation coefficient to calculate the correlation value.

In [10]:
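# Numeric columns are selected explicitly; pandas 2.0+ raises an error on text columns.
df[['Salary', 'Bonus %']].corr(method='pearson')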
Out[10]:
Salary Bonus %
Salary 1.000000 -0.036381
Bonus % -0.036381 1.000000

Example 2: Finding correlation value using Kendall method

In this example, the Kendall method is used. It uses Kendall Tau correlation coefficient for calculating the correlation value.

In [11]:
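# Numeric columns are selected explicitly; pandas 2.0+ raises an error on text columns.
df[['Salary', 'Bonus %']].corr(method='kendall')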
Out[11]:
Salary Bonus %
Salary 1.0000 -0.0234
Bonus % -0.0234 1.0000
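The method list also includes ‘spearman’, which the examples above do not show. Here is a minimal sketch comparing it with ‘pearson’ on made-up data (the columns x and y are assumptions, not part of employees.csv), where y always increases with x but not along a straight line.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line.
data = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
data['y'] = data['x'] ** 3

pearson = data.corr(method='pearson').loc['x', 'y']
spearman = data.corr(method='spearman').loc['x', 'y']
print(pearson, spearman)
```

Spearman works on ranks, so a perfectly increasing relationship scores 1.0 even though the Pearson value stays below 1.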

Pandas Apply : apply()

The pandas apply() function applies a function along an axis of a DataFrame.

Syntax

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)

func : function – Here the function which has to be applied is passed.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – The axis along which the function is applied.

raw : bool, default False – Determines whether each row or column is passed to the function as a Series (False) or as an ndarray (True).

result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None – This is used for specifying the type of result expected. It only takes effect when axis=1.

args : tuple – This contains the positional arguments that are passed to the function in addition to the array/series.

**kwds – Additional keyword arguments that are passed on to func.
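As a quick illustration of args and **kwds, the sketch below passes one extra positional and one extra keyword argument through apply(). The helper scale and its parameters factor and offset are hypothetical names made up for this example.

```python
import pandas as pd

frame = pd.DataFrame([[64, 81]] * 3, columns=['P', 'Q'])

# Hypothetical helper: 'factor' is filled from args, 'offset' from a keyword.
def scale(col, factor, offset=0):
    return col * factor + offset

# Each column is passed as the first argument; 2 and offset=1 are forwarded.
scaled = frame.apply(scale, args=(2,), offset=1)
print(scaled)
```

Every value in column P becomes 64 * 2 + 1 = 129 and every value in column Q becomes 81 * 2 + 1 = 163.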

Example 1: Applying NumPy's square root function

Here we use the apply() function to apply NumPy's square root function, np.sqrt, to every column of the DataFrame.

In [12]:
df = pd.DataFrame([[64,81]] * 3, columns=['P', 'Q'])
In [13]:
df
Out[13]:
P Q
0 64 81
1 64 81
2 64 81
In [14]:
df.apply(np.sqrt)
Out[14]:
P Q
0 8.0 9.0
1 8.0 9.0
2 8.0 9.0
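For an element-wise NumPy function such as np.sqrt, calling the function on the DataFrame directly should give the same result as going through apply(). A short sketch to check this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[64, 81]] * 3, columns=['P', 'Q'])

via_apply = df.apply(np.sqrt)   # column by column through apply()
direct = np.sqrt(df)            # NumPy function called on the whole DataFrame
same = via_apply.equals(direct)
print(same)
```

apply() becomes more useful when the function is not a ready-made element-wise operation, as in the next examples.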

Example 2: Applying reducing function i.e. sum function over either axis

In this example, the pandas apply() function applies NumPy's sum function over each of the two axes. The two calls below show the difference in the results.

In the first call, axis=0 passes each column to np.sum, so the rows of columns “P” and “Q” are added to produce one total per column.

In [15]:
df.apply(np.sum, axis=0)
Out[15]:
P    192
Q    243
dtype: int64

In the second call, axis=1 passes each row to np.sum, so the values of “P” and “Q” in that row are added (64 + 81 = 145). Because all 3 rows hold the same values, we get the same result for each row.

In [16]:
df.apply(np.sum, axis=1)
Out[16]:
0    145
1    145
2    145
dtype: int64
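The same axis logic holds for any reducing function, not only np.sum. Below is a small sketch with a user-defined function; value_range is a hypothetical helper that returns the difference between the largest and smallest value it receives.

```python
import pandas as pd

df = pd.DataFrame([[64, 81]] * 3, columns=['P', 'Q'])

# Hypothetical reducer: spread between the largest and smallest value.
def value_range(values):
    return values.max() - values.min()

per_column = df.apply(value_range, axis=0)  # one result per column
per_row = df.apply(value_range, axis=1)     # one result per row
print(per_column)
print(per_row)
```

Each column holds identical values, so the per-column range is 0, while each row spans 81 - 64 = 17.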

Example 3: Understanding the usage of result_type parameter.

For this example, we use a lambda function to see how the result_type parameter is used. By default, the result for each row is returned as a list-like object, so the output is a Series of lists.

In [17]:
df.apply(lambda x: [7, 9], axis=1)
Out[17]:
0    [7, 9]
1    [7, 9]
2    [7, 9]
dtype: object

When result_type is set to “expand”, each list-like result is expanded into columns, so the output is a DataFrame.

In [18]:
df.apply(lambda x: [7, 9], axis=1, result_type='expand')
Out[18]:
0 1
0 7 9
1 7 9
2 7 9
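The result_type parameter also accepts ‘broadcast’, which is not shown above. In the sketch below, the returned list is expected to be written back into the shape of the original DataFrame, so the original column labels are kept.

```python
import pandas as pd

df = pd.DataFrame([[64, 81]] * 3, columns=['P', 'Q'])

# 'broadcast' keeps the original index and column labels of df.
broadcast = df.apply(lambda x: [7, 9], axis=1, result_type='broadcast')
print(broadcast)
```

Compared with ‘expand’, the columns are labeled P and Q instead of 0 and 1.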

Conclusion

We have reached the end of this article. We learned about three pandas functions: rolling(), corr() and apply(). These functions are helpful for applying operations over a pandas DataFrame. We also looked at their syntax and worked through examples that show how they are used.

Reference – https://pandas.pydata.org/docs/
