How To Use Sklearn Simple Imputer (SimpleImputer) for Filling Missing Values in Dataset

Before addressing missing data, you should first understand why it is missing. Missing data falls into one of the following categories:

1. Missing at Random (MAR)

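In this scenario, whether a value is missing depends on other observed variables in the dataset, but not on the missing value itself. E.g. in a survey, the phone number field may be left blank more often by female respondents because of security concerns, so the missingness is related to the gender variable.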

2. Missing Completely at Random (MCAR)

In this scenario, the data is missing purely at random and the missingness has no relationship with any variable in the dataset. E.g. some data might be missing because of a technical issue or human error.

3. Missing Not at Random (MNAR)

In this scenario, the data is not missing randomly: the missingness depends on the very value that was supposed to be captured. MNAR is quite tricky to spot and deal with. E.g. in a survey form, people with a high income may leave the Income field blank because they do not want to disclose it.
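The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration: the survey columns (age, income), the probabilities, and the thresholds are all made-up assumptions, not data from this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data (column names and distributions are assumptions)
survey = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 10% chance of being missing
mcar = survey["income"].mask(rng.random(n) < 0.10)

# MAR: the chance of a missing income depends on an observed column (age)
p_mar = np.where(survey["age"] > 60, 0.40, 0.05)
mar = survey["income"].mask(rng.random(n) < p_mar)

# MNAR: the chance of a missing income depends on the income value itself
p_mnar = np.where(survey["income"] > 65_000, 0.50, 0.05)
mnar = survey["income"].mask(rng.random(n) < p_mnar)

print("true mean:         ", survey["income"].mean())
print("observed mean MNAR:", mnar.mean())
```

In the MNAR case, high incomes are hidden more often, so the mean of the observed values tends to understate the true mean. This is why MNAR data is hard to repair with simple imputation.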

How to Deal with Missing Data

There are various strategies for handling missing data, but which one works best depends on your dataset. There is no rule of thumb, so you will have to assess your dataset and experiment with various strategies.

1. Dropping the Variables with Missing Data

In this strategy, the row or column containing the missing data is deleted completely. This should be used cautiously as you may end up losing important information about the data. Domain knowledge is quite useful to decide whether dropping the columns is the ideal solution for your dataset.
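As a minimal sketch of this strategy, the snippet below uses pandas `dropna` on a small made-up DataFrame (the data and the name `df_demo` are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B is mostly missing, column A has one gap
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 8.0],
    "C": [9.0, 10.0, 11.0, 12.0],
})

# Drop every row that contains at least one missing value
rows_dropped = df_demo.dropna(axis=0)

# Drop every column that contains at least one missing value
cols_dropped = df_demo.dropna(axis=1)

# Keep only columns that have at least 3 non-null values
thresh_dropped = df_demo.dropna(axis=1, thresh=3)

print(rows_dropped.shape, list(cols_dropped.columns), list(thresh_dropped.columns))
```

Here dropping rows leaves a single row, which shows how quickly information can be lost. The `thresh` variant is a softer option that removes only the sparsest column.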

2. Imputation of Data

In this technique, the missing data is filled up or imputed by a suitable substitute and there are multiple strategies behind it.

i) Replace with Mean 

Here all the missing data is replaced by the mean of the corresponding column. It works only with numeric columns. We have to be cautious, though, because if the column contains outliers its mean will be misleading.

ii) Replace with Median

Here the missing data is replaced with the median value of that column. Again, this applies only to numeric columns.

iii) Replace with Most Frequent Occurring

In this technique, the missing values are filled with the value which occurs the highest number of times in a particular column. This approach is applicable for both numeric and categorical columns.

iv) Replace with Constant

In this approach, the missing data is replaced by a constant value throughout. This can be used with both numeric and categorical columns.
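The four strategies can be compared on a single column. The sketch below uses plain pandas `fillna` as a stand-in (not SimpleImputer), on a made-up series with one outlier, to show how differently the mean and the median behave.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one outlier (100)
s = pd.Series([1.0, 2.0, 2.0, np.nan, 100.0])

filled_mean = s.fillna(s.mean())          # mean is pulled up by the outlier
filled_median = s.fillna(s.median())      # median is robust to the outlier
filled_mode = s.fillna(s.mode().iloc[0])  # most frequently occurring value
filled_const = s.fillna(-1)               # fixed placeholder value

print(s.mean())    # 26.25
print(s.median())  # 2.0
```

The mean of 26.25 is far from the typical values 1 and 2, while the median stays at 2.0, which is the caution raised above about outliers.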

Sklearn Simple Imputer

Sklearn provides a class SimpleImputer, in the sklearn.impute module, that can apply all four imputation strategies for missing data that we discussed above.

Sklearn Imputer vs SimpleImputer

Older versions of sklearn had a class Imputer for all imputation transformations. Imputer was deprecated in version 0.20 and removed in version 0.22, and it has been replaced by the class SimpleImputer. So for all imputation purposes, you should now use SimpleImputer in Sklearn.

Examples of Simple Imputer in Sklearn

Create Toy Dataset

We will create a toy dataset of random numbers and then randomly set some values to null. To make this dataset more suitable for our examples, we copy one value in column B into another row of the dataframe so that the column contains a repeated value.

In [1]:

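import numpy as np
import pandas as pd

# Create a random dataset of 10 rows and 4 columns
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

# Randomly set some values as null
df = df.mask(np.random.random((10, 4)) < .15)

# Copy row 9 of column B into row 8 so that column B has a repeated value
# (.loc avoids the chained assignment df['B'][8] = ...)
df.loc[8, 'B'] = df.loc[9, 'B']
df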

Out[1]:

          A         B         C         D
0 -0.520643       NaN  0.080238       NaN
1  1.225041  0.505089 -1.997088       NaN
2 -0.004976 -0.082857  0.376651 -0.626456
3  1.880424  0.527540  0.129820  1.384916
4 -0.476005       NaN -1.499829  0.334039
5  2.134381 -0.365297  1.554248 -0.118477
6  0.103352  0.458311  0.424156       NaN
7 -0.759686 -0.356959  1.261324 -0.278455
8 -0.331476  0.810893 -0.466366 -0.582135
9 -1.991735 -0.343604 -0.393095 -1.190406

 

i) Sklearn SimpleImputer with Mean

We first create an instance of SimpleImputer with strategy as 'mean'. This is the default strategy, so the mean is used even if no strategy is passed. The dataset is then fit and transformed, and we can see that the null values of columns B and D are replaced by the mean of their respective columns.

In [2]:
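A minimal sketch of this step is given here. It follows the same pattern as the median example, but runs on a small stand-in frame so that it is self-contained; `df_demo` and its values are assumptions for illustration, not the toy dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the toy dataset: nulls in columns B and D
df_demo = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [np.nan, 2.0, 4.0, 6.0],
    "C": [0.5, 1.5, 2.5, 3.5],
    "D": [10.0, np.nan, np.nan, 30.0],
})

mean_imputer = SimpleImputer(strategy="mean")  # "mean" is also the default
result_mean_imputer = mean_imputer.fit_transform(df_demo)

# By default fit_transform returns a NumPy array, so wrap it in a DataFrame
filled_mean = pd.DataFrame(result_mean_imputer, columns=df_demo.columns)
print(filled_mean)

# Per-column means learned during fit
print(mean_imputer.statistics_)
```

With the article's own dataset, the same three lines would be applied to `df` with `columns=list('ABCD')`.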

 

ii) Sklearn SimpleImputer with Median

We first create an instance of SimpleImputer with strategy as 'median' and then the dataset is fit and transformed. We can see that the null values of columns B and D are replaced by the median of their respective columns.

In [3]:

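from sklearn.impute import SimpleImputer

# Impute each column's nulls with that column's median
median_imputer = SimpleImputer(strategy='median')

# fit_transform returns a NumPy array, so the column names are lost
result_median_imputer = median_imputer.fit_transform(df)

# Wrap the array back into a DataFrame with the original column names
pd.DataFrame(result_median_imputer, columns=list('ABCD'))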

 

iii) Sklearn SimpleImputer with Most Frequent

We first create an instance of SimpleImputer with strategy as 'most_frequent' and then the dataset is fit and transformed.

If several values are tied for most frequent, which includes the case where every value occurs only once, SimpleImputer imputes with the smallest of those values in the column.

We can see that the null values of column B are replaced with -0.343604, which is the most frequently occurring value in that column. In column D no value occurs more than once, so the nulls are replaced by the smallest value, -1.190406.

In [4]:
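A minimal sketch of this step follows. To stay self-contained it rebuilds only columns B and D from the values printed in Out[1], and it assumes that B[8] holds the same value as B[9], as set up in the dataset cell; the name `df_bd` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Columns B and D as printed in Out[1]; B[8] is assumed equal to B[9]
df_bd = pd.DataFrame({
    "B": [np.nan, 0.505089, -0.082857, 0.527540, np.nan,
          -0.365297, 0.458311, -0.356959, -0.343604, -0.343604],
    "D": [np.nan, np.nan, -0.626456, 1.384916, 0.334039,
          -0.118477, np.nan, -0.278455, -0.582135, -1.190406],
})

most_frequent_imputer = SimpleImputer(strategy="most_frequent")
result_most_frequent = most_frequent_imputer.fit_transform(df_bd)

filled_mf = pd.DataFrame(result_most_frequent, columns=df_bd.columns)
print(filled_mf)
```

Column B has one repeated value, so that value fills its nulls. Column D has no repeats, so the smallest value in the column is used instead.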

 

iv) Sklearn SimpleImputer with Constant

We first create an instance of SimpleImputer with strategy as 'constant' and fill_value as 99. If we don't supply fill_value, it defaults to 0 for numerical data and to the string "missing_value" for string or object data. Also, for a numeric column, SimpleImputer does not accept a string as fill_value.

The dataset is fit and transformed and we can see that all nulls are replaced by 99.

In [5]:
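A minimal sketch of this step follows, run on a small stand-in frame so that it is self-contained (`df_demo` and its values are assumptions for illustration). It also shows the default fill of 0 for numeric data when fill_value is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with a few nulls
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 5.0, 6.0],
})

# Replace every null with the constant 99
constant_imputer = SimpleImputer(strategy="constant", fill_value=99)
filled_99 = pd.DataFrame(
    constant_imputer.fit_transform(df_demo), columns=df_demo.columns
)

# Without fill_value, numeric nulls are replaced with 0
default_imputer = SimpleImputer(strategy="constant")
filled_default = pd.DataFrame(
    default_imputer.fit_transform(df_demo), columns=df_demo.columns
)

print(filled_99)
print(filled_default)
```

With the article's own dataset, the same calls would be applied to `df` with `columns=list('ABCD')`.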

 

 

  • Veer Kumar

