Tutorial – Pandas Drop, Pandas Dropna, Pandas Drop Duplicate

Tutorial - Pandas Drop, Pandas Dropna, Pandas Drop Duplicate

Introduction

While working with your data science or machine learning projects you will have to spend a lot of time with data preprocessing with the help of the Pandas library. It is very likely that your data set might contain missing data or duplicate data for which you would like to drop the entire column. Or you might just want to drop the column because you think that the feature is not important for creating the model. In this article, we’ll see how pandas in-built functions such as drop(), dropna() and drop_duplicates() are used for fulfilling this purpose.

Importing Pandas Library

Beginning this tutorial by importing pandas library

In [1]:
import pandas as pd
import numpy as np

We will commence this article with the drop function in pandas.

Pandas Drop : drop()

Pandas drop() function is used for removing or dropping desired rows and/or columns from dataframe.

For removing rows or columns, we can either specify the labels and the corresponding axis or they can be removed by using index values as well.

Let’s understand its syntax and then look at some of its examples.

Syntax

dataframe.drop(labels=None,axis=0,inplace=False)

labels : single label or list-like – In this parameter index or column names which are required to be dropped are provided.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – This parameter helps in knowing the place from where labels are to be dropped. If specified as 0, it will be dropped from index and if specified as 1, it will be dropped from columns.

inplace – boolean – The inplace parameter ensures whether the operations performed on the dataframe are permanent or not. If true is passed to it, then the changes due to the operations are made permanent, otherwise, they are not. The default value is false.

As a result, this function returns the dataframe without the removed index or column.

Example 1: dropping columns using column names

Here we are dropping columns using column names. A dataframe is created using arange function and then reshape function is used to structure the dataframe.

We have to specify axis=1 to specify that we want to drop columns.

In [2]:
df = pd.DataFrame(np.arange(15).reshape(3, 5),

                  columns=['A', 'B', 'C', 'D','E'])
In [3]:
df
Out[3]:
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
In [4]:
df.drop(['A', 'E'], axis=1)
Out[4]:
B C D
0 1 2 3
1 6 7 8
2 11 12 13

This is another example where instead of specifying axis parameter in drop() function, we provide the columns keyword for providing the information that we want to drop columns.

As you can see both these examples are producing the same result.

In [5]:
df.drop(columns=['A', 'E'])
Out[5]:
B C D
0 1 2 3
1 6 7 8
2 11 12 13

Example 2: dropping the rows by using index values

In this example, we are dropping rows using index values 0 and 2. By using pandas drop function, we can drop the rows and generate such output.

In [6]:
df.drop([0, 2])
Out[6]:
A B C D E
1 5 6 7 8 9

Example 3: drop function over multi-index dataframe

For this example, we will create a multi-index dataframe.

In [7]:
midx = pd.MultiIndex(levels=[['BMW', 'Jaguar', 'Mercedes'],

                             ['speed', 'weight', 'length']],

                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],

                            [0, 1, 2, 0, 1, 2, 0, 1, 2]])

midx
Out[7]:
MultiIndex([(     'BMW',  'speed'),
            (     'BMW', 'weight'),
            (     'BMW', 'length'),
            (  'Jaguar',  'speed'),
            (  'Jaguar', 'weight'),
            (  'Jaguar', 'length'),
            ('Mercedes',  'speed'),
            ('Mercedes', 'weight'),
            ('Mercedes', 'length')],
           )
In [8]:
df = pd.DataFrame(index=midx, columns=['big', 'small'],

                  data=[[125, 130], [200, 100], [10, 6], [300, 200],

                        [150, 70], [8, 6], [320, 250],

                        [75, 65], [12, 10]])
In [9]:
df
Out[9]:
big small
BMW speed 125 130
weight 200 100
length 10 6
Jaguar speed 300 200
weight 150 70
length 8 6
Mercedes speed 320 250
weight 75 65
length 12 10

Here a specific index and column is dropped using drop function

In [10]:
df.drop(index='Jaguar', columns='small')
Out[10]:
big
BMW speed 125
weight 200
length 10
Mercedes speed 320
weight 75
length 12

In this example, only the length index is dropped. As it can be seen, for the three car brands, length index is dropped.

In [11]:
df.drop(index='length', level=1)
Out[11]:
big small
BMW speed 125 130
weight 200 100
Jaguar speed 300 200
weight 150 70
Mercedes speed 320 250
weight 75 65

After learning how to remove specific index or columns, let’s see how to deal with missing values. For this we will look at dropna function.

Pandas Dropna : dropna()

As mentioned above, dropna() function in pandas removes the missing values.

NOTE – Remember NA is abbreviation of Not Available i.e. the values are not present there. NA should not be confused with an empty string or 0.

Syntax

dataframe.dropna(axis=0,how=’any’,thresh=None, subset=None,inplace=False)

axis – 0 or ‘index’, 1 or ‘column’ (default 0) – This helps in specifying the rows or columns which contain missing values and thus removing them. The default value is 0 i.e. missing values from rows are removed, if 1 is passed to this parameter, then missing values from column are removed.

how – ‘any’ or ‘all’ (default – any) – By using this parameter a condition for removing missing values is provided. If mentioned as ‘any’ then even if single value is missing, then that row will be removed but if specified as ‘all’, then all the values must be missing, for that row or column to be removed.

thresh – int(optional) – Thresh parameter is used to specify the count of Non-NA values which should be present in passed dataframe.

subset – array-like(optional) – This subset is a part of dataframe considered along the other axis from where null values are removed.

inplace – boolean – The inplace parameter ensures whether the operations performed on the dataframe are permanent or not. If true is passed to it, then the changes due to the operations are made permanent, otherwise they are not. The default value is false.

The final output will be a dataframe with NA values removed.

Example 1: dropping NA values using dropna() function

In the given dataframe, nan is abbreviation for the word ‘Not a Number’ and NaT is for the term ‘Not a Time’ basically it represents missing values in datatime datatype.

In [12]:
df = pd.DataFrame({"name": ['George', 'James', 'Maxwell','Michael','Clarke'],

                   "subject_liked": [np.nan, 'Geography', 'History',np.nan,'English'],

                   "born": [pd.NaT, pd.Timestamp("1985-04-02"),pd.NaT, pd.Timestamp("1998-07-09"),
                           
                           pd.Timestamp("2005-10-15"),]})
In [13]:
df
Out[13]:
name subject_liked born
0 George NaN NaT
1 James Geography 1985-04-02
2 Maxwell History NaT
3 Michael NaN 1998-07-09
4 Clarke English 2005-10-15

In this dropna() example, all the rows containing even single NA values are dropped. So here by default how parameter is any.

In [14]:
df.dropna()
Out[14]:
name subject_liked born
1 James Geography 1985-04-02
4 Clarke English 2005-10-15

[adrotate banner=”3″]

Example 2: dropping NA values by using columns

Here axis is specified as ‘columns’, so all the NA values in the columns are dropped.

In [15]:
df.dropna(axis='columns')
Out[15]:
name
0 George
1 James
2 Maxwell
3 Michael
4 Clarke

Example 3: using ‘how’ parameter

With this example, we understand the usage of how parameter. When specified as all, only rows which have all the values as NA are only dropped. Since this is not the case in this dataframe, no rows are dropped.

In [16]:
df.dropna(how='all')
Out[16]:
name subject_liked born
0 George NaN NaT
1 James Geography 1985-04-02
2 Maxwell History NaT
3 Michael NaN 1998-07-09
4 Clarke English 2005-10-15

Example 4: using thresh parameter of dropna function

With this parameter, we can specify the minimum number of NA values which should be present in a row for dropping it. As it can be seen in the example given below, the 0th index row had 2 NA values, due to this reason it is dropped.

In [17]:
df.dropna(thresh=2)
Out[17]:
name subject_liked born
1 James Geography 1985-04-02
2 Maxwell History NaT
3 Michael NaN 1998-07-09
4 Clarke English 2005-10-15

Example 5: using subset parameter in pandas dropna()

A large dataframe can be divided into small subsets for dropping values. This is achieved by dropna() function’s subset parameter.

In [18]:
df.dropna(subset=['name', 'born'])
Out[18]:
name subject_liked born
1 James Geography 1985-04-02
3 Michael NaN 1998-07-09
4 Clarke English 2005-10-15

To handle duplicate values, we use drop_duplicates function.

Pandas Drop Duplicates: drop_duplicates()

Pandas drop_duplicates() function is useful in removing duplicate rows from dataframe.

Syntax

dataframe.drop_duplicates(subset,keep,inplace)

subset : column label or sequence of labels – This parameter specifies the columns for identifying duplicates. By default all the columns are considered.

keep : {first,last,False},default ‘first’ – This determines which duplicates should be kept in the dataframe.If specified as first, then all the duplicates except first are dropped. Similarly, if specified as last, then all the duplicates except last are dropped. If false is specified, then all the duplicates are dropped.

inplace – boolean – The inplace parameter ensures whether the operations performed on the dataframe are permanent or not. If true is passed to it, then the changes due to the operations are made permanent, otherwise they are not. The default value is false.

The resulting dataframe consists of all the values except the duplicated dropped.

Example 1: simple example of drop_duplicates() function

In [19]:
x = {'A': [6, 6, 6, 2], 'B': [9, 9, 9, 3], 'C': [7, 7, 7, 5]}
In [20]:
df = pd.DataFrame(x)
In [21]:
df
Out[21]:
A B C
0 6 9 7
1 6 9 7
2 6 9 7
3 2 3 5

In this, the duplicate rows are dropped where the first duplicate values are not dropped. This is the reason, 0th index row is not dropped but the other two duplicates i.e. 1st and 2nd rows are dropped.

In [22]:
df.drop_duplicates()
Out[22]:
A B C
0 6 9 7
3 2 3 5

Example 2: using ‘keep’ parameter

In this drop_duplicates function, we assign the value to keep parameter as false, this is the reason all the duplicate rows are dropped.

In [23]:
df.drop_duplicates(keep=False)
Out[23]:
A B C
3 2 3 5

Example 3: using subset parameter

Here a subset of dataframe is used to drop columns. For this, we use subset parameter. So the duplicate values in A and B columns are removed.

In [24]:
df.drop_duplicates(subset=['A', 'B'])
Out[24]:
A B C
0 6 9 7
3 2 3 5

Conclusion

It’s time to end this tutorial, in this article we went over the functions which are used for dropping unwanted values, dropping missing values and dropping duplicate rows. The functions we covered were drop(), dropna() and drop_duplicates(). These functions which are provided by pandas library are useful for data preprocessing tasks.

Reference – https://pandas.pydata.org/docs/

Follow Us

Leave a Reply

Your email address will not be published. Required fields are marked *