Importing Pandas Library

Let's start this tutorial by importing the pandas library, along with NumPy, which is used later in the filter() examples.

import pandas as pd
import numpy as np

Pandas Groupby : groupby()

The pandas groupby() function splits a DataFrame into groups, using either a mapper or one or more columns, so that an operation such as an aggregation can be applied to each group.

Syntax

pandas.DataFrame.groupby(by, axis, level, as_index, sort, group_keys, squeeze, observed)

by : mapping, function, label, or list of labels – It is used to determine the groups for groupby.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – The axis along which the operation is applied.

level : int, level name, or sequence of such, default None – If the axis is a MultiIndex (hierarchical), group by a particular level or levels.

as_index : bool, default True – For aggregated output, return object with group labels as the index.

sort : bool, default True – This is used for sorting group keys.

group_keys : bool, default True – When calling apply, this parameter adds group keys to index to identify pieces.

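squeeze : bool, default False – This parameter is used to reduce the dimensionality of the return type if possible. It was deprecated in pandas 1.1 and removed in pandas 2.0, so leave it out when using a current version.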

observed : bool, default False – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
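
The effect of observed is easiest to see with a categorical column that has a category with no rows. The following is a minimal sketch; the column names key and val and the data are invented for illustration.

```python
import pandas as pd

# Hypothetical data: category "c" is declared but never occurs.
key = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])
df_cat = pd.DataFrame({"key": key, "val": [1, 2, 3]})

# observed=False: every declared category gets a group, even the empty one.
all_cats = df_cat.groupby("key", observed=False)["val"].sum()

# observed=True: only categories that actually occur in the data.
seen = df_cat.groupby("key", observed=True)["val"].sum()

print(all_cats)
print(seen)
```

Passing observed explicitly, as above, keeps the result independent of the default used by the installed pandas version.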

The function returns a GroupBy object that contains information about the groups.
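
To get a feel for the returned object, it can be inspected before any aggregation is applied. This is a small sketch on invented sample data (the variable name cars is not from the tutorial); it also shows what as_index=False changes in the aggregated output.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
cars = pd.DataFrame({
    "Cars": ["Bentley", "Bentley", "Aston Martin", "Aston Martin"],
    "Max Speed": [380, 370, 275, 350],
})

grouped = cars.groupby("Cars")

# Number of groups and the number of rows in each group.
n_groups = grouped.ngroups
sizes = grouped.size()

# Iterating over the object yields (group key, sub-DataFrame) pairs.
for key, frame in grouped:
    print(key, list(frame["Max Speed"]))

# as_index=False keeps the group labels as a regular column
# instead of moving them into the index.
flat = cars.groupby("Cars", as_index=False).mean()
print(flat)
```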

Example 1: Computing mean using groupby() function

First, we create a small DataFrame of cars and their maximum speeds.

df = pd.DataFrame({'Cars': ['Bentley', 'Bentley',
                               'Aston Martin', 'Aston Martin'],
                    'Max Speed': [380, 370, 275, 350]})

df

In this example, the mean of the 'Max Speed' column is computed for each distinct value of the Cars column using the pandas groupby function.

df.groupby(['Cars']).mean()
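
More than one statistic can be computed per group in a single call. The sketch below assumes the same car data as above and uses agg() with a list of function names; the variable names speeds and summary are chosen here for illustration.

```python
import pandas as pd

speeds = pd.DataFrame({
    "Cars": ["Bentley", "Bentley", "Aston Martin", "Aston Martin"],
    "Max Speed": [380, 370, 275, 350],
})

# Select the column to aggregate, then request several statistics at once.
summary = speeds.groupby("Cars")["Max Speed"].agg(["mean", "min", "max"])
print(summary)
```

Selecting the column before aggregating also avoids trying to average non-numeric columns when the DataFrame contains any.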

Example 2: Using hierarchical indexes with pandas groupby function

In this example, a DataFrame with a MultiIndex is created and then used to show how the pandas groupby function can group by individual index levels.

arrays = [['Mclaren', 'Mercedes', 'Mclaren', 'Mercedes'],
           ['Sports', 'Luxury', 'Sports', 'Luxury']]

index = pd.MultiIndex.from_arrays(arrays, names=('Cars', 'Type'))

df = pd.DataFrame({'Max Speed': [380, 370, 275, 350]},
                   index=index)

df

Here the groupby function is called twice, and both calls use the level parameter. The first call identifies the level by its position (0, which is Cars), while the second identifies it by its name (Type).

df.groupby(level=0).mean()

df.groupby(level="Type").std()
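
A level can be given by position or by name, and several levels can be passed as a list. The following sketch rebuilds the same MultiIndex DataFrame so that it runs on its own; the variable names on the left-hand side are illustrative.

```python
import pandas as pd

arrays = [["Mclaren", "Mercedes", "Mclaren", "Mercedes"],
          ["Sports", "Luxury", "Sports", "Luxury"]]
index = pd.MultiIndex.from_arrays(arrays, names=("Cars", "Type"))
speeds = pd.DataFrame({"Max Speed": [380, 370, 275, 350]}, index=index)

# Level 0 and the level named "Cars" are the same level.
by_position = speeds.groupby(level=0).mean()
by_name = speeds.groupby(level="Cars").mean()

# Grouping by both levels at once.
by_both = speeds.groupby(level=["Cars", "Type"]).max()

print(by_position)
print(by_both)
```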

Pandas Where: where()

The pandas where() function keeps the values where a condition is True and replaces the values where it is False, with NaN by default.

Syntax

pandas.DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, try_cast=False)

cond : bool Series/DataFrame, array-like, or callable – Where cond is True, the original value is kept; where it is False, the value is replaced.

other : scalar, Series/DataFrame, or callable – Entries where cond is False are replaced with corresponding value from other.

inplace : bool, default False – It is used to decide whether to perform the operation in place on the data.

axis : int, default None – This is used to specify the alignment axis, if needed.

level : int, default None – This is used to specify the alignment level, if needed.

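try_cast : bool, default False – This parameter is used to try to cast the result back to the input type. It was deprecated in pandas 1.3 and removed in pandas 2.0, so leave it out when using a current version.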
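
The roles of cond and other can be seen on a tiny numeric DataFrame. This is a minimal sketch with invented data; the names scores, masked and capped are not from the tutorial.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
scores = pd.DataFrame({"a": [1, 5, 9], "b": [8, 2, 6]})

# Without `other`, values failing the condition become NaN.
masked = scores.where(scores > 4)

# With `other`, values failing the condition are replaced by that value.
capped = scores.where(scores > 4, other=0)

print(masked)
print(capped)
```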

Example 1: Simple example of pandas where() function

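Here the where() function is used to filter the data on a single condition. The data is read from a players.csv file, sorted by team, and then masked so that only the rows for one team keep their values.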

df = pd.read_csv('players.csv')

df.head()

df.sort_values("Team", inplace = True)

filtering = df['Team'] == "Boston Celtics"

df.where(filtering,inplace=True)

df

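As we can see, the filtering operation has kept the desired rows, but all the other rows are still displayed with NaN in every column. So we'll use the dropna() function to drop them and extract the useful data. Note that dropna() removes every row containing at least one NaN, so matching rows that already had a missing value in the original data (for example in College or Salary) are dropped as well.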

df.dropna()
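
Because players.csv is not included here, the sketch below uses a small invented DataFrame (the player names and salaries are hypothetical) to compare where() followed by dropna() with plain Boolean indexing, which selects the matching rows directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv; "Player B" has a missing salary.
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Team": ["Boston Celtics", "Boston Celtics", "Other Team"],
    "Salary": [1_000_000.0, np.nan, 2_000_000.0],
})
is_celtics = players["Team"] == "Boston Celtics"

# where() + dropna(): also drops Player B because of the missing salary.
via_where = players.where(is_celtics).dropna()

# dropna(how="all") only drops rows in which every value is NaN.
via_where_all = players.where(is_celtics).dropna(how="all")

# Boolean indexing keeps every matching row, missing values included.
via_mask = players[is_celtics]

print(via_where["Name"].tolist())
print(via_mask["Name"].tolist())
```

Which variant is appropriate depends on whether rows with missing values should survive the filter.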

Example 2: Multi-condition operations in pandas where() function

In the second example of the where() function, we combine two different conditions into one filtering operation using the & operator.

df_mul = pd.read_csv('players.csv')

df_mul.sort_values("Team", inplace = True)

mul_filter1 = df_mul["Team"]=="Boston Celtics"

mul_filter2 = df_mul["Weight"]>215

df_mul.where(mul_filter1 & mul_filter2, inplace = True)

df_mul

Again, we can see that the filtering operation has worked, but the rows that do not satisfy both conditions are still displayed with NaN in every column. So we'll use the dropna() function to drop them and extract the useful data.

df_mul.dropna()

As we can see, all the values in the Weight column are greater than 215 and all the players belong to the team we specified, i.e. Boston Celtics. This is how multiple conditions are combined in the where function of pandas.
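
When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because & binds more tightly than == and >. A minimal sketch on invented data (names and weights are hypothetical), with query() shown as an alternative spelling:

```python
import pandas as pd

# Hypothetical stand-in for players.csv.
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Team": ["Boston Celtics", "Boston Celtics", "Other Team", "Boston Celtics"],
    "Weight": [235, 180, 250, 220],
})

# Parentheses around each comparison are required when combining with &.
heavy_celtics = players[(players["Team"] == "Boston Celtics")
                        & (players["Weight"] > 215)]

# The same selection expressed as a query string.
heavy_celtics_q = players.query("Team == 'Boston Celtics' and Weight > 215")

print(heavy_celtics)
```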

Pandas Filter : filter()

The pandas filter() function returns a subset of the DataFrame's rows or columns according to the specified index labels. It filters on the labels only, not on the contents of the DataFrame.

Syntax

pandas.DataFrame.filter(items, like, regex, axis)

items : list-like – Keep the labels from the axis that are in items.

like : str – Keep the labels from the axis that contain the given string, i.e. for which "like in label == True".

regex : str (regular expression) – This is used for keeping labels from axis for which re.search(regex, label) == True.

axis : {0 or ‘index’, 1 or ‘columns’, None}, default None – This is the axis to filter on. By default this is the info axis, which is ‘columns’ for a DataFrame.

Example 1: Filtering columns by name using pandas filter() function

In this example, the pandas filter operation is applied to the columns to select them by name.

df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
                   index=['Audi', 'Bentley'],
                   columns=['one', 'two', 'three'])

df.filter(items=['one', 'three'])

Example 2: Using regular expression to filter columns

In this example, regex is used along with the pandas filter function. With the help of the regex, we fetch the column(s) whose name ends with “o”. The ‘$’ is not a wildcard; it is an anchor that matches the end of the string, so ‘o$’ matches only labels whose last character is “o”.

df.filter(regex='o$', axis=1)
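
A few more patterns make the role of the anchors clearer. The sketch below rebuilds the same small DataFrame so that it runs on its own; the variable names on the left-hand side are illustrative.

```python
import numpy as np
import pandas as pd

df_cars = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                       index=["Audi", "Bentley"],
                       columns=["one", "two", "three"])

# '$' anchors the match at the end of the label.
ends_with_o = df_cars.filter(regex="o$", axis=1)

# '^' anchors the match at the start of the label.
starts_with_t = df_cars.filter(regex="^t", axis=1)

# Without an anchor, the pattern may match anywhere in the label.
contains_o = df_cars.filter(regex="o", axis=1)

print(list(ends_with_o.columns))
print(list(starts_with_t.columns))
print(list(contains_o.columns))
```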

Example 3: Filtering rows with “like” parameter

The like parameter keeps the labels that contain the given string. With axis=0 it is matched against the row labels (the index), not against the values in the rows. Here “ntl” occurs in the label “Bentley”, so only that row is returned.

df.filter(like='ntl', axis=0)
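
Since filter() looks only at labels, selecting rows by their values needs Boolean indexing instead. A short sketch contrasting the two, again on the same small DataFrame (variable names are illustrative):

```python
import numpy as np
import pandas as pd

df_cars = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
                       index=["Audi", "Bentley"],
                       columns=["one", "two", "three"])

# Label-based: keep rows whose index label contains "ntl".
by_label = df_cars.filter(like="ntl", axis=0)

# A substring that occurs in no label gives an empty result, not an error.
no_match = df_cars.filter(like="xyz", axis=0)

# Value-based: keep rows where column "one" is greater than 1.
by_value = df_cars[df_cars["one"] > 1]

print(by_label)
print(by_value)
```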

Conclusion

We have reached the end of the article. We learned about functions that are frequently used for grouping and filtering data in a dataset: pandas groupby(), where() and filter(). Each function was covered with its syntax and parameters, followed by worked examples.

Output of df.head() (where(), Example 1):

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
0	Avery Bradley	Boston Celtics	0.0	PG	25.0	6-2	180.0	Texas	7730337.0
1	Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
2	John Holland	Boston Celtics	30.0	SG	27.0	6-5	205.0	Boston University	NaN
3	R.J. Hunter	Boston Celtics	28.0	SG	22.0	6-5	185.0	Georgia State	1148640.0
4	Jonas Jerebko	Boston Celtics	8.0	PF	29.0	6-10	231.0	NaN	5000000.0

Output of df.dropna() (where(), Example 1):

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
0	Avery Bradley	Boston Celtics	0.0	PG	25.0	6-2	180.0	Texas	7730337.0
1	Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
3	R.J. Hunter	Boston Celtics	28.0	SG	22.0	6-5	185.0	Georgia State	1148640.0
6	Jordan Mickey	Boston Celtics	55.0	PF	21.0	6-8	235.0	LSU	1170960.0
7	Kelly Olynyk	Boston Celtics	41.0	C	25.0	7-0	238.0	Gonzaga	2165160.0
8	Terry Rozier	Boston Celtics	12.0	PG	22.0	6-2	190.0	Louisville	1824360.0
9	Marcus Smart	Boston Celtics	36.0	PG	22.0	6-4	220.0	Oklahoma State	3431040.0
10	Jared Sullinger	Boston Celtics	7.0	C	24.0	6-9	260.0	Ohio State	2569260.0
11	Isaiah Thomas	Boston Celtics	4.0	PG	27.0	5-9	185.0	Washington	6912869.0
12	Evan Turner	Boston Celtics	11.0	SG	27.0	6-7	220.0	Ohio State	3425510.0
13	James Young	Boston Celtics	13.0	SG	20.0	6-6	215.0	Kentucky	1749840.0
14	Tyler Zeller	Boston Celtics	44.0	C	26.0	7-0	253.0	North Carolina	2616975.0

Output of df_mul.dropna() (where(), Example 2):

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
1	Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
6	Jordan Mickey	Boston Celtics	55.0	PF	21.0	6-8	235.0	LSU	1170960.0
7	Kelly Olynyk	Boston Celtics	41.0	C	25.0	7-0	238.0	Gonzaga	2165160.0
9	Marcus Smart	Boston Celtics	36.0	PG	22.0	6-4	220.0	Oklahoma State	3431040.0
10	Jared Sullinger	Boston Celtics	7.0	C	24.0	6-9	260.0	Ohio State	2569260.0
12	Evan Turner	Boston Celtics	11.0	SG	27.0	6-7	220.0	Ohio State	3425510.0
14	Tyler Zeller	Boston Celtics	44.0	C	26.0	7-0	253.0	North Carolina	2616975.0

Output of df (groupby(), Example 2):

		Max Speed
Cars	Type
Mclaren	Sports	380
Mercedes	Luxury	370
Mclaren	Sports	275
Mercedes	Luxury	350

Output of df.groupby(['Cars']).mean() (groupby(), Example 1):

	Max Speed
Cars
Aston Martin	312.5
Bentley	375.0

Output of df.filter(items=['one', 'three']) (filter(), Example 1):

	one	three
Audi	1	3
Bentley	4	6

Output of df.filter(regex='o$', axis=1) (filter(), Example 2):

	two
Audi	2
Bentley	5