
## Introduction

In this article, we will look at a few pandas statistical functions: std(), which computes the standard deviation; quantile(), which returns the value at a given quantile of the data; and boxplot(), which draws a box plot summarizing the distribution of each column. We will go through the syntax and examples of each function to understand its usage.

### Importing Pandas Library

We will start the tutorial by importing the pandas library, along with NumPy, which is used later to generate sample data.

```
import pandas as pd
import numpy as np
```

## Pandas Standard Deviation : std()

The pandas std() function computes the standard deviation over the desired axis of a pandas DataFrame.

### Syntax

**pandas.DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, \*\*kwargs)**

- **axis : {index (0), columns (1)}** – The axis over which the standard deviation is calculated.
- **skipna : bool, default True** – Whether to exclude NA/null values when computing the result.
- **level : int or level name, default None** – Used with a MultiIndex DataFrame to select the level on which the function is applied.
- **ddof : int, default 1** – Delta degrees of freedom. The divisor used in the calculation is N – ddof, where N is the number of elements.
- **numeric_only : bool, default None** – Whether to include only int, float, and boolean columns.
- **\*\*kwargs** – Additional keyword arguments.
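To make the effect of **ddof** concrete, here is a minimal sketch. The `scores` frame and its values are made up for illustration and are not part of the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the population standard deviation is exactly 2
scores = pd.DataFrame({"a": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# Default ddof=1: the sum of squared deviations is divided by N - 1 (sample std)
sample_std = scores["a"].std()

# ddof=0: the sum of squared deviations is divided by N (population std)
population_std = scores["a"].std(ddof=0)

print("sample std (ddof=1):    ", sample_std)
print("population std (ddof=0):", population_std)
```

The sample value is slightly larger because its divisor is smaller. Note that NumPy's `np.std` defaults to ddof=0, so the two libraries give different results unless ddof is set explicitly.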

### Example 1: Applying pandas std() function over the rows

Here the standard deviation is computed with the pandas std() function. With axis=0 the calculation runs down the rows, so the result contains one value per column.

```
df = pd.read_csv('employees.csv')
```

```
df.head()
```

Here the **skipna** parameter decides whether null values are skipped. In this case, skipping the **null values** did not make any difference.

**NOTE:** – **Null values** in a dataset can **alter** the standard deviation. With skipna=True they are ignored and the result is computed from the remaining values. With skipna=False, any column that contains a null value returns NaN. In the example that we considered, the standard deviation is identical in both cases (including and excluding the null values), which indicates that the columns used in the calculation contain no null values.

Therefore, remember that when there are no null values, skipna has no effect on the standard deviation, while even a single null value changes the result when skipna=False.

Thus, you should analyze the dataset first and understand the impact of its null values.

```
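# numeric_only=True limits the calculation to numeric columns
# (pandas 2.0+ raises an error if text columns are included)
df.std(axis=0, skipna=True, numeric_only=True)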
```

```
# Same calculation, but a column containing a null value now returns NaN
df.std(axis=0, skipna=False, numeric_only=True)
```
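Since the employees dataset is not reproduced here, the behaviour of **skipna** can be sketched on a small stand-in frame. The `demo` frame and its column names are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column is complete, the other has a missing value
demo = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0],
    "with_gap": [1.0, np.nan, 3.0, 5.0],
})

# skipna=True ignores the NaN and uses the remaining values 1, 3, 5
skipped = demo.std(axis=0, skipna=True)

# skipna=False propagates the NaN, so the affected column returns NaN
kept = demo.std(axis=0, skipna=False)

print(skipped)
print(kept)
```

The column without missing values gives the same result in both calls, which matches what was observed above for the employees data.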

### Example 2: Applying pandas std() function over the columns

Using the pandas std() function, we will now compute the standard deviation with axis=1, which runs across the columns and returns one value per row. Again the standard deviation is the same for both cases of skipna.

```
# numeric_only=True limits each row's calculation to the numeric columns
df.std(axis=1, skipna=True, numeric_only=True)
```

Again in this example, there is no difference between the two results calculated across the columns. This suggests that null values are not affecting the calculation in our case.

```
# Same calculation, but a row containing a null value now returns NaN
df.std(axis=1, skipna=False, numeric_only=True)
```
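The difference between the two axis values is easiest to see from the index of the result. The sketch below uses a hypothetical 3x3 `frame`, not the employees data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three rows and three numeric columns
frame = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [3.0, 2.0, 1.0],
    "z": [5.0, 2.0, 9.0],
})

# axis=0: one standard deviation per column, indexed by column name
per_column = frame.std(axis=0)

# axis=1: one standard deviation per row, indexed by row label
per_row = frame.std(axis=1)

print(per_column)
print(per_row)
```

The second row holds the values 2, 2, 2, so its row-wise standard deviation is zero.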

## Pandas Quantile : quantile()

The pandas **quantile()** function returns the values at the given quantile over the requested axis.

### Syntax

**pandas.DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')**

- **q : float or array-like, default 0.5 (50% quantile)** – The quantile(s) to compute, each a value with 0 <= q <= 1.
- **axis : {0, 1, 'index', 'columns'} (default 0)** – The axis over which the operation is performed.
- **numeric_only : bool, default True** – If False, the quantile of datetime and timedelta data will be computed as well.
- **interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}** – The interpolation method to use when the desired quantile lies between two data points.
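The **interpolation** options are easier to compare side by side. This is a small sketch on a made-up single-column frame named `values`; the 0.4 quantile is chosen because it falls between the second and third data points.

```python
import pandas as pd

# Hypothetical data: the 0.4 quantile sits at position 0.4 * (4 - 1) = 1.2,
# i.e. between 20 (position 1) and 30 (position 2)
values = pd.DataFrame({"v": [10, 20, 30, 40]})

methods = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {
    method: values.quantile(0.4, interpolation=method)["v"]
    for method in methods
}

for method, value in results.items():
    print(f"{method:>8}: {value}")
```

With 'linear' the result is interpolated between the two neighbours, 'lower' and 'higher' pick one of them, 'midpoint' averages them, and 'nearest' picks whichever neighbour is closer to the computed position.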

### Example 1: Computing quantile using pandas quantile()

In this example, we will calculate different quantiles for the data.

**NOTE**: Quantiles are used to divide the data into fixed portions. The quantiles can range from **0% to 100%**. Generally, quantiles that are frequently used are **25%, 50%, and 75%**.

```
df = pd.DataFrame(np.array([[5, 75], [10, 150], [15, 300], [20, 600]]),
columns=['P', 'Q'])
```

```
df
```

In the example below, the **10th percentile** (the 0.1 quantile) is computed.

```
df.quantile(.1)
```

In the example below, the **25th, 50th and 75th percentiles** are computed.

```
df.quantile([.25, .5,.75])
```
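To see where these numbers come from, the default linear interpolation can be reproduced by hand. The helper `linear_quantile` below is a hypothetical function written for this illustration; it is not part of pandas, and it only sketches the default 'linear' method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[5, 75], [10, 150], [15, 300], [20, 600]]),
                  columns=['P', 'Q'])


def linear_quantile(sorted_values, q):
    """Hypothetical helper: quantile by linear interpolation on sorted data."""
    position = q * (len(sorted_values) - 1)       # fractional index into the data
    lower = int(np.floor(position))               # neighbour just below
    upper = min(lower + 1, len(sorted_values) - 1)  # neighbour just above
    fraction = position - lower                   # distance past the lower neighbour
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


manual_p = linear_quantile(np.sort(df['P'].to_numpy()), 0.1)
manual_q = linear_quantile(np.sort(df['Q'].to_numpy()), 0.1)

print(manual_p, df['P'].quantile(0.1))
print(manual_q, df['Q'].quantile(0.1))
```

For column P the position is 0.1 * 3 = 0.3, so the result lies 30% of the way from 5 to 10.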

### Example 2: Computing quantile of datetime and timedelta data

In this example, we will compute the quantile of datetime and timedelta data using pandas quantile function.

```
df_date = pd.DataFrame({'P': [9, 27],
'Q': [pd.Timestamp('2017'),
pd.Timestamp('2019')],
'R': [pd.Timedelta('9 days'),
pd.Timedelta('25 days')]})
```

```
df_date
```

Here the **numeric_only** parameter of the quantile method is set to False, which allows the quantile of datetime and timedelta columns to be calculated as well. With the help of the quantile function, we calculate the **40% and 85%** quantiles of the data.

```
df_date.quantile([0.40,0.85], numeric_only=False)
```

## Pandas Boxplot : boxplot()

The pandas **boxplot** function helps in building a box plot from DataFrame columns.

### Syntax

**pandas.DataFrame.boxplot(column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, backend=None, \*\*kwargs)**

- **column : str or list of str, optional** – The column name or list of names from which the boxplot is created.
- **by : str or array-like, optional** – The column used to group the data. One box plot is drawn per group value.
- **ax : object of class matplotlib.axes.Axes, optional** – The matplotlib axes to be used by boxplot.
- **fontsize : float or str** – The tick label font size, in points or as a string.
- **rot : int or float, default 0** – The rotation angle of labels (in degrees) with respect to the screen coordinate system.
- **grid : bool, default True** – Whether to display the grid.
- **figsize : A tuple (width, height) in inches** – The size of the figure to create in matplotlib.
- **layout : tuple (rows, columns), optional** – The arrangement of the subplots.
- **return_type : {'axes', 'dict', 'both'} or None, default 'axes'** – The kind of object that should be returned.
- **backend : str, default None** – The plotting backend to use instead of the one set in the option plotting.backend.
- **\*\*kwargs** – Additional keyword arguments, passed on to matplotlib's boxplot function.

The image below explains the boxplot and the points it represents. At the two extreme ends, **outliers** are drawn as individual points. Moving inward, the **lower and upper limits** of the remaining data are represented in the form of **whiskers or bars**.

The two edges of the box represent the **first and third quartiles (Q1 and Q3)** of the dataset, not its minimum and maximum. The line inside the box shows the **median**. Q1, Q2 and Q3 are the quartiles, which mark the **25%, 50% and 75%** points of the dataset respectively, so Q2 is the median.

The difference between the Q3 and Q1 quartiles is known as the **interquartile range**. By default, the whiskers extend to the most extreme data points that lie within 1.5 times this range from the edges of the box, and anything beyond them is drawn as an outlier.
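The quantities a box plot draws can also be computed directly with quantile(). The following is a minimal sketch on a made-up Series named `data`, assuming the common 1.5 x IQR rule for the whisker limits (this is matplotlib's default).

```python
import pandas as pd

# Hypothetical data with one obviously extreme value
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = data.quantile(0.25)   # lower edge of the box
q2 = data.quantile(0.50)   # median line inside the box
q3 = data.quantile(0.75)   # upper edge of the box
iqr = q3 - q1              # interquartile range (height of the box)

# Points outside these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

Here only the value 30 falls outside the fences, so it is the single point a box plot would mark as an outlier, and the upper whisker would stop at 9.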

**NOTE**: The boxplot is a very important visualization tool for data science aspirants to learn.

### Example 1: Visualizing data through pandas boxplot() function

Here the data is generated using NumPy and then visualized with the pandas boxplot() function.

```
np.random.seed(1234)
```

```
df = pd.DataFrame(np.random.randn(100, 4),
columns=['Col1', 'Col2', 'Col3', 'Col4'])
```

```
df.head()
```

In this example, we can see how the descriptive statistics of the data are conveyed. Of the 4 columns shown in the visualization below, **Col1** doesn't have outliers, but the other three do.

**Outliers** are values that fall outside the range in which most of the data lies. In a boxplot they are the points drawn beyond the whiskers.

Apart from this, we can see that the minimum and maximum values are similar across all 4 columns.

This is how data is explained with the help of a **boxplot**. From it we can read the minimum and maximum values, the median, the quartiles and the outliers.

```
boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3','Col4'])
```
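The numbers behind the plot can be inspected without drawing anything. The sketch below regenerates the same random data and counts, per column, the points that fall outside the whisker limits, again assuming the default 1.5 x IQR rule. The variable names `summary` and `outlier_counts` are chosen for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(100, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])

# Five-number summary per column: the values a box plot is built from
summary = df.describe().loc[['min', '25%', '50%', '75%', 'max']]

# Whisker limits per column under the 1.5 * IQR rule
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Comparing a DataFrame with a Series aligns the Series on the column names
is_outlier = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()

print(summary)
print(outlier_counts)
```

Comparing `outlier_counts` with the plot is a quick way to confirm which columns contain outliers.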

### Example 2: Grouped Boxplots by a variable

Here the boxplot is created by grouping the data by another column, X, which holds a category label for each row. For each of the two numeric columns, one box is drawn per group.

Any outliers in the data can be easily seen along with the other details of the dataset.

```
df = pd.DataFrame(np.random.randn(50, 2),
columns=['Col1', 'Col2'])
```

```
# One label per row: 25 rows in group 'A', then 25 rows in group 'B'
# (50 labels in total, so every row of df belongs to a group)
df['X'] = pd.Series(['A'] * 25 + ['B'] * 25)
```

```
boxplot = df.boxplot(by='X')
```
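The grouped boxplot corresponds to computing the quartiles separately for each value of X. A minimal sketch of that calculation follows; it assumes an even split of 25 rows per group and a fixed seed, both chosen here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)  # assumed seed, only to make the sketch repeatable
df = pd.DataFrame(np.random.randn(50, 2),
                  columns=['Col1', 'Col2'])

# Assumed grouping: first 25 rows labelled 'A', last 25 labelled 'B'
df['X'] = ['A'] * 25 + ['B'] * 25

# Quartiles of each numeric column within each group;
# the result is indexed by (group label, quantile)
group_quartiles = df.groupby('X').quantile([0.25, 0.5, 0.75])

print(group_quartiles)
```

Each row of `group_quartiles` matches one horizontal line of a box in the grouped plot: the lower edge, the median line, or the upper edge.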

## Conclusion

We have reached the end of this tutorial. We covered some of the most important pandas statistical functions, focusing on std(), quantile() and boxplot(), which are used to describe a dataset efficiently. We looked at the syntax and examples of these functions to better understand them and their usage.


*Reference –* https://pandas.pydata.org/docs/
