Introduction
While working with your data science or machine learning projects you will have to spend a lot of time with data preprocessing with the help of the Pandas library. It is very likely that your data set might contain missing data or duplicate data for which you would like to drop the entire column. Or you might just want to drop the column because you think that the feature is not important for creating the model. In this article, we’ll see how pandas in-built functions such as drop(), dropna() and drop_duplicates() are used for fulfilling this purpose.
Importing Pandas Library
Beginning this tutorial by importing pandas library
import pandas as pd
import numpy as np
We will commence this article with the drop function in pandas.
Pandas Drop : drop()
Pandas drop() function is used for removing or dropping desired rows and/or columns from dataframe.
For removing rows or columns, we can either specify the labels and the corresponding axis or they can be removed by using index values as well.
Let’s understand its syntax and then look at some of its examples.
Syntax
dataframe.drop(labels=None,axis=0,inplace=False)
labels : single label or list-like – In this parameter index or column names which are required to be dropped are provided.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – This parameter helps in knowing the place from where labels are to be dropped. If specified as 0, it will be dropped from index and if specified as 1, it will be dropped from columns.
inplace – boolean – The inplace parameter ensures whether the operations performed on the dataframe are permanent or not. If true is passed to it, then the changes due to the operations are made permanent, otherwise, they are not. The default value is false.
As a result, this function returns the dataframe without the removed index or column.
Example 1: dropping columns using column names
Here we are dropping columns using column names. A dataframe is created using arange function and then reshape function is used to structure the dataframe.
We have to specify axis=1 to specify that we want to drop columns.
df = pd.DataFrame(np.arange(15).reshape(3, 5),
columns=['A', 'B', 'C', 'D','E'])
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 |
1 | 5 | 6 | 7 | 8 | 9 |
2 | 10 | 11 | 12 | 13 | 14 |
df.drop(['A', 'E'], axis=1)
B | C | D | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
2 | 11 | 12 | 13 |
This is another example where instead of specifying axis parameter in drop() function, we provide the columns keyword for providing the information that we want to drop columns.
As you can see both these examples are producing the same result.
df.drop(columns=['A', 'E'])
B | C | D | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
2 | 11 | 12 | 13 |
Example 2: dropping the rows by using index values
In this example, we are dropping rows using index values 0 and 2. By using pandas drop function, we can drop the rows and generate such output.
df.drop([0, 2])
A | B | C | D | E | |
---|---|---|---|---|---|
1 | 5 | 6 | 7 | 8 | 9 |
Example 3: drop function over multi-index dataframe
For this example, we will create a multi-index dataframe.
midx = pd.MultiIndex(levels=[['BMW', 'Jaguar', 'Mercedes'],
['speed', 'weight', 'length']],
codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
[0, 1, 2, 0, 1, 2, 0, 1, 2]])
midx
MultiIndex([( 'BMW', 'speed'), ( 'BMW', 'weight'), ( 'BMW', 'length'), ( 'Jaguar', 'speed'), ( 'Jaguar', 'weight'), ( 'Jaguar', 'length'), ('Mercedes', 'speed'), ('Mercedes', 'weight'), ('Mercedes', 'length')], )
df = pd.DataFrame(index=midx, columns=['big', 'small'],
data=[[125, 130], [200, 100], [10, 6], [300, 200],
[150, 70], [8, 6], [320, 250],
[75, 65], [12, 10]])
df
big | small | ||
---|---|---|---|
BMW | speed | 125 | 130 |
weight | 200 | 100 | |
length | 10 | 6 | |
Jaguar | speed | 300 | 200 |
weight | 150 | 70 | |
length | 8 | 6 | |
Mercedes | speed | 320 | 250 |
weight | 75 | 65 | |
length | 12 | 10 |
Here a specific index and column is dropped using drop function
df.drop(index='Jaguar', columns='small')
big | ||
---|---|---|
BMW | speed | 125 |
weight | 200 | |
length | 10 | |
Mercedes | speed | 320 |
weight | 75 | |
length | 12 |
In this example, only the length index is dropped. As it can be seen, for the three car brands, length index is dropped.
df.drop(index='length', level=1)
big | small | ||
---|---|---|---|
BMW | speed | 125 | 130 |
weight | 200 | 100 | |
Jaguar | speed | 300 | 200 |
weight | 150 | 70 | |
Mercedes | speed | 320 | 250 |
weight | 75 | 65 |
After learning how to remove specific index or columns, let’s see how to deal with missing values. For this we will look at dropna function.
Pandas Dropna : dropna()
As mentioned above, dropna() function in pandas removes the missing values.
NOTE – Remember NA is abbreviation of Not Available i.e. the values are not present there. NA should not be confused with an empty string or 0.
Syntax
dataframe.dropna(axis=0,how=’any’,thresh=None, subset=None,inplace=False)
axis – 0 or ‘index’, 1 or ‘column’ (default 0) – This helps in specifying the rows or columns which contain missing values and thus removing them. The default value is 0 i.e. missing values from rows are removed, if 1 is passed to this parameter, then missing values from column are removed.
how – ‘any’ or ‘all’ (default – any) – By using this parameter a condition for removing missing values is provided. If mentioned as ‘any’ then even if single value is missing, then that row will be removed but if specified as ‘all’, then all the values must be missing, for that row or column to be removed.
thresh – int(optional) – Thresh parameter is used to specify the count of Non-NA values which should be present in passed dataframe.
subset – array-like(optional) – This subset is a part of dataframe considered along the other axis from where null values are removed.
inplace – boolean – The inplace parameter ensures whether the operations performed on the dataframe are permanent or not. If true is passed to it, then the changes due to the operations are made permanent, otherwise they are not. The default value is false.
The final output will be a dataframe with NA values removed.
Example 1: dropping NA values using dropna() function
In the given dataframe, nan is abbreviation for the word ‘Not a Number’ and NaT is for the term ‘Not a Time’ basically it represents missing values in datatime datatype.
df = pd.DataFrame({"name": ['George', 'James', 'Maxwell','Michael','Clarke'],
"subject_liked": [np.nan, 'Geography', 'History',np.nan,'English'],
"born": [pd.NaT, pd.Timestamp("1985-04-02"),pd.NaT, pd.Timestamp("1998-07-09"),
pd.Timestamp("2005-10-15"),]})
df
name | subject_liked | born | |
---|---|---|---|
0 | George | NaN | NaT |
1 | James | Geography | 1985-04-02 |
2 | Maxwell | History | NaT |
3 | Michael | NaN | 1998-07-09 |
4 | Clarke | English | 2005-10-15 |
In this dropna() example, all the rows containing even single NA values are dropped. So here by default how parameter is any.
df.dropna()
name | subject_liked | born | |
---|---|---|---|
1 | James | Geography | 1985-04-02 |
4 | Clarke | English | 2005-10-15 |
[adrotate banner=”3″]
Example 2: dropping NA values by using columns
Here axis is specified as ‘columns’, so all the NA values in the columns are dropped.
df.dropna(axis='columns')
name | |
---|---|
0 | George |
1 | James |
2 | Maxwell |
3 | Michael |
4 | Clarke |
Example 3: using ‘how’ parameter
With this example, we understand the usage of how parameter. When specified as all, only rows which have all the values as NA are only dropped. Since this is not the case in this dataframe, no rows are dropped.
df.dropna(how='all')
name | subject_liked | born | |
---|---|---|---|
0 | George | NaN | NaT |
1 | James | Geography | 1985-04-02 |
2 | Maxwell | History | NaT |
3 | Michael | NaN | 1998-07-09 |
4 | Clarke | English | 2005-10-15 |
Example 4: using thresh parameter of dropna function
With this parameter, we can specify the minimum number of NA values which should be present in a row for dropping it. As it can be seen in the example given below, the 0th index row had 2 NA values, due to this reason it is dropped.
df.dropna(thresh=2)
name | subject_liked | born | |
---|---|---|---|
1 | James | Geography | 1985-04-02 |
2 | Maxwell | History | NaT |
3 | Michael | NaN | 1998-07-09 |
4 | Clarke | English | 2005-10-15 |
Example 5: using subset parameter in pandas dropna()
A large dataframe can be divided into small subsets for dropping values. This is achieved by dropna() function’s subset parameter.
df.dropna(subset=['name', 'born'])
name | subject_liked | born | |
---|---|---|---|
1 | James | Geography | 1985-04-02 |
3 | Michael | NaN | 1998-07-09 |
4 | Clarke | English | 2005-10-15 |
To handle duplicate values, we use drop_duplicates function.
Pandas Drop Duplicates: drop_duplicates()
Pandas drop_duplicates() function is useful in removing duplicate rows from dataframe.
Syntax
dataframe.drop_duplicates(subset,keep,inplace)
subset : column label or sequence of labels – This parameter specifies the columns for identifying duplicates. By default all the columns are considered.
keep : {first,last,False},default ‘first’ – This determines which duplicates should be kept in the dataframe.If specified as first, then all the duplicates except first are dropped. Similarly, if specified as last, then all the duplicates except last are dropped. If false is specified, then all the duplicates are dropped.
inplace – boolean – The inplace parameter ensures whether the operations performed on the dataframe are permanent or not. If true is passed to it, then the changes due to the operations are made permanent, otherwise they are not. The default value is false.
The resulting dataframe consists of all the values except the duplicated dropped.
Example 1: simple example of drop_duplicates() function
x = {'A': [6, 6, 6, 2], 'B': [9, 9, 9, 3], 'C': [7, 7, 7, 5]}
df = pd.DataFrame(x)
df
A | B | C | |
---|---|---|---|
0 | 6 | 9 | 7 |
1 | 6 | 9 | 7 |
2 | 6 | 9 | 7 |
3 | 2 | 3 | 5 |
In this, the duplicate rows are dropped where the first duplicate values are not dropped. This is the reason, 0th index row is not dropped but the other two duplicates i.e. 1st and 2nd rows are dropped.
df.drop_duplicates()
A | B | C | |
---|---|---|---|
0 | 6 | 9 | 7 |
3 | 2 | 3 | 5 |
Example 2: using ‘keep’ parameter
In this drop_duplicates function, we assign the value to keep parameter as false, this is the reason all the duplicate rows are dropped.
df.drop_duplicates(keep=False)
A | B | C | |
---|---|---|---|
3 | 2 | 3 | 5 |
Example 3: using subset parameter
Here a subset of dataframe is used to drop columns. For this, we use subset parameter. So the duplicate values in A and B columns are removed.
df.drop_duplicates(subset=['A', 'B'])
A | B | C | |
---|---|---|---|
0 | 6 | 9 | 7 |
3 | 2 | 3 | 5 |
Conclusion
It’s time to end this tutorial, in this article we went over the functions which are used for dropping unwanted values, dropping missing values and dropping duplicate rows. The functions we covered were drop(), dropna() and drop_duplicates(). These functions which are provided by pandas library are useful for data preprocessing tasks.
Reference – https://pandas.pydata.org/docs/