Introduction
As a machine learning practitioner or data scientist, you would like to work with a data set that has no missing values. Unfortunately, a perfect world does not exist, and neither does a perfect data set. It is very common to come across data sets where some data is missing, and almost all machine learning techniques require us to address missing data before we can apply them.
In this article we will try to understand why a data set might contain missing data and how we can deal with it.
Reasons for Missing Data
If some data is missing in a data set, the gap might be purely random, or there might be a pattern behind it. The first step towards handling missing data is to identify which of the following scenarios your missing data falls into.
Missing at Random
In this case, the fact that a particular value is missing has something to do with other data present in the data set. There is a relationship between the missingness and the other, observed data.
For example, in a survey it is common to see the “Income” column empty for some respondents. A closer analysis of other data such as “Area of residence”, “Education” and “Salaried/Business” might reveal that the missing “Income” values come from respondents who are either very rich and hesitant to reveal a high income, or from weaker sections and uncomfortable revealing a lower income. Another example: an empty “Phone Number” column might be related to the fact that the survey was filled in by female respondents who chose not to give their number for privacy reasons.
This might sound counterintuitive, but this scenario is known as Missing at Random (MAR) in statistics, even though there is a pattern behind the missing data.
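One way to look for such a pattern is to compare the share of missing values across the groups of another column. The following is only a minimal sketch, assuming pandas is available; the survey table, its column names and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN marks a missing income.
survey = pd.DataFrame({
    "education": ["graduate", "graduate", "school", "school", "school", "graduate"],
    "income": [52000.0, 61000.0, np.nan, np.nan, 18000.0, 58000.0],
})

# Share of missing "income" values within each "education" group.
missing_rate_by_group = survey["income"].isna().groupby(survey["education"]).mean()
print(missing_rate_by_group)
```

If the missing rate differs a lot between groups, the missingness is probably related to that column, which points towards MAR rather than a purely random miss.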
Missing Completely at Random
In this case, the fact that a particular value is missing is just a random occurrence. Unlike MAR, there is no relationship between the missing data and the other data present in the data set. The data might have been lost due to some glitch during collection, or it might not be available at the source at all. The miss is completely random by nature, and hence in statistics this scenario is known as Missing Completely at Random (MCAR).
Missing Not at Random
This is the scenario where data is missing because the participants themselves are missing. For example, uneducated people might not fill in a survey form at all. So if you are working with socioeconomic data, you might be missing data from uneducated people and creating a model that is biased towards the educated class that participated in the survey. This scenario is known as Missing Not at Random (MNAR) in statistics and is much trickier to deal with than the two above.
Missing data due to incorrect Data Engineering
This reason is obvious, but not many people mention it while explaining the reasons behind missing data. If you see missing data in places where you think there should be none, you should verify whether the data was loaded and transformed properly. For example, if you see many null values in a “Name” or “First Name” column, something is fishy, and you should ask your Data Engineering team to verify the data set again and fix any misses in the data engineering pipeline.
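A quick way to spot such suspicious columns is to count the missing values per column before doing anything else. Here is a small sketch, assuming pandas; the table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; None and NaN both count as missing.
df = pd.DataFrame({
    "first_name": ["Asha", None, "Ravi", None],
    "age": [34, 41, np.nan, 29],
    "city": ["Pune", "Delhi", "Mumbai", "Delhi"],
})

missing_count = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()  # fraction of missing values per column

print(pd.DataFrame({"missing_count": missing_count, "missing_share": missing_share}))
```

A column such as "first_name" being half empty would be a good reason to check the pipeline before trying any of the treatments discussed later.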
Why should we deal with missing data in machine learning
Short answer: most algorithms in popular machine learning libraries, for example scikit-learn, do not work with null or missing values, so you need to come up with ways to handle them. This is because the internal computations of these algorithms break down when they encounter null or missing data.
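To see why, remember that many algorithms are built on sums, means and dot products, and a single missing value is enough to spoil those. The sketch below uses NumPy only to illustrate the arithmetic; it is not meant to show how any particular library fails, and the numbers are made up.

```python
import numpy as np

# Hypothetical feature values with one missing entry, plus some model weights.
ages = np.array([25.0, 32.0, np.nan, 41.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])

plain_mean = ages.mean()            # the single NaN turns the whole mean into NaN
weighted_sum = ages @ weights       # the same happens to a dot product
nan_aware_mean = np.nanmean(ages)   # this variant skips the missing entry

print(plain_mean, weighted_sum, nan_aware_mean)
```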
Methods to deal with missing data
There are many ways to deal with missing data, but they are guidelines and not hard rules. A method might work well for one data set and not for another. You are the best judge of which method will work for you, and for this you need a good understanding of your data set.
Having said this, let us take a look at the various techniques.
Deletion of Data
In this approach we completely delete the rows or columns containing missing data. We have to be careful in doing so, as it might lead to loss of information if the data is MAR, as discussed above. We should also try to avoid deletion if we are working with a small data set.
Deletion of row
If a row contains too many null or missing values, it is unlikely to add any value to our machine learning model. For example, if 6-7 fields out of 10 are empty in a row, it is better to delete that row.
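In pandas, this kind of row deletion can be sketched with the thresh parameter of dropna, which keeps only rows with at least that many non-missing values. The table and the cut-off of 3 below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with five features; the middle row is mostly empty.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, np.nan, 5.0],
    "d": [4.0, 7.0, 6.0],
    "e": [5.0, np.nan, 7.0],
})

# Keep a row only if it has at least this many non-missing values (assumed cut-off).
min_non_missing = 3
cleaned = df.dropna(axis=0, thresh=min_non_missing)

print(cleaned)
```

Here the middle row has only one filled field out of five, so it is dropped, while the last row survives with a single gap.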
Deletion of column or feature
If the large majority of values in a column or feature are missing, we can drop the column from consideration. Generally, if more than 70% of the data is missing, we delete the column. But again be cautious: unless you are sure the missing data is MCAR in nature, deleting the column might lead to loss of information.
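The 70% guideline above can be sketched in pandas as follows. The table is hypothetical, and the threshold is simply the figure mentioned above, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical table; "fax_number" is missing in 4 of 5 rows.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "fax_number": [np.nan, np.nan, np.nan, np.nan, "555-0100"],
    "city": ["Pune", "Delhi", "Mumbai", "Delhi", "Pune"],
})

max_missing_share = 0.7  # guideline threshold, adjust for your data

# Fraction of missing values per column, then the columns above the threshold.
missing_share = df.isna().mean()
columns_to_drop = missing_share[missing_share > max_missing_share].index
reduced = df.drop(columns=columns_to_drop)

print(list(reduced.columns))
```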
Imputation of Data
Imputation of data means finding a suitable substitute for missing data. Deleting missing data might lead to loss of information if the data is MAR or if the data set itself is very small, so a safer choice is to replace the missing data with a value. At the same time, we need to be careful not to add extra bias by doing so.
Let us take a look at various imputation techniques.
Replacing by Mean
In this technique we replace the missing values with the mean of the column. We have to be careful, because if the data in the column is skewed, replacing missing values with the mean will add bias to the data.
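A minimal pandas sketch of mean imputation, using a made-up "age" column:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing values.
df = pd.DataFrame({"age": [10.0, 12.0, np.nan, 14.0, np.nan]})

# mean() ignores missing values by default, so this is the mean of 10, 12 and 14.
age_mean = df["age"].mean()
df["age"] = df["age"].fillna(age_mean)

print(df["age"].tolist())
```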
Replacing by Median
In this technique we replace the missing values with the median of the column. This is a safer choice than the mean if the data in the column is skewed or has outliers.
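The difference between the two becomes clear on a column with an outlier. The sketch below assumes pandas and uses made-up income figures (in thousands), where one very large value pulls the mean far away from the typical respondent.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes in thousands; 1000 is an outlier, one value is missing.
incomes = pd.Series([30.0, 32.0, 35.0, np.nan, 38.0, 1000.0])

mean_value = incomes.mean()      # pulled upwards by the outlier
median_value = incomes.median()  # stays close to the typical value

filled = incomes.fillna(median_value)

print(mean_value, median_value)
print(filled.tolist())
```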
Replacing by Mode
In this technique we replace the missing values with the mode, i.e. the most frequently occurring value in the column. While the two methods above apply only to numerical data, the mode can also be applied to categorical data.
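A small sketch of mode imputation on a categorical column, assuming pandas and a made-up "colour" feature. Note that mode() can return several values when there is a tie; taking the first one, as done here, is just one possible choice.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
colours = pd.Series(["red", "blue", None, "red", "green", None])

# mode() returns a Series of the most frequent value(s); take the first.
most_frequent = colours.mode().iloc[0]
filled = colours.fillna(most_frequent)

print(filled.tolist())
```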
Replacing by another category
For a categorical feature column, we can represent missing data as a new category of its own. We can replace the missing values with ‘NA’ or ‘Unknown’ or some other relevant term and treat it as a new categorical value for that feature.
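In pandas this is a one-line fillna call. The column below is hypothetical, and 'Unknown' is just one of the labels suggested above.

```python
import pandas as pd

# Hypothetical categorical column with missing entries.
employment = pd.Series(["salaried", None, "business", None])

# Treat "missing" as a category of its own.
filled = employment.fillna("Unknown")

print(filled.value_counts())
```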
Predicting missing values
In this method we apply regression or classification techniques to come up with an educated guess for each missing value. We have to be careful not to add extra bias from the predictive model itself.
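As a rough illustration of the idea, the sketch below fits a straight line with numpy.polyfit on the rows where the value is known and uses it to fill the gap. This simple least-squares line merely stands in for whatever regression model you would really use, and the columns and figures are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: salary (in thousands) is missing for one person.
df = pd.DataFrame({
    "years_experience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "salary": [30.0, 40.0, np.nan, 60.0, 70.0],
})

# Fit a line on the rows where salary is known.
known = df[df["salary"].notna()]
slope, intercept = np.polyfit(known["years_experience"], known["salary"], deg=1)

# Predict salary only for the rows where it is missing.
is_missing = df["salary"].isna()
df.loc[is_missing, "salary"] = slope * df.loc[is_missing, "years_experience"] + intercept

print(df)
```

With real data the fit will not be perfect, so it is worth checking how well the model predicts the known values before trusting it to fill in the unknown ones.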
In the End …
Dealing with missing data can be trickier than you might have thought at first. There are considerations such as loss of information and the introduction of bias that one has to weigh while treating missing data.
I hope this article helps you make choices about the missing data in your data set.