Categorical Data Encoding with Sklearn LabelEncoder and OneHotEncoder

Label encoding may look intuitive to us humans, but machine learning algorithms can misinterpret it by assuming the encoded values have an ordinal ranking. For example, if Apple gets an encoding of 1 and Broccoli an encoding of 3, that does not mean Broccoli ranks higher than Apple, but it can mislead the ML algorithm into treating it that way. This is why label encoding is rarely used for encoding categorical features in machine learning.
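The point can be seen directly with Sklearn's `LabelEncoder` (a minimal sketch; the food values here are illustrative, and the exact integer each category receives simply follows the alphabetical order of the classes):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column used only for illustration
foods = ["Apple", "Broccoli", "Carrot", "Banana"]

le = LabelEncoder()
encoded = le.fit_transform(foods)

# LabelEncoder assigns integers by the sorted (alphabetical) order of classes
print(le.classes_.tolist())   # ['Apple', 'Banana', 'Broccoli', 'Carrot']
print(encoded.tolist())       # [0, 2, 3, 1]
```

A downstream model sees only the integers 0–3 and has no way to know that the gaps between them are meaningless.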

One-hot encoding overcomes this shortcoming of label encoding and is commonly used with machine learning algorithms. However, it also has disadvantages. When the cardinality of the categorical variable is high, i.e. the column has many distinct values, one-hot encoding produces a large number of additional columns, which may not fit in memory or may hurt model performance.
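Sklearn's `OneHotEncoder` creates one column per distinct value, with a 1 marking the presence of that value in a row (again a minimal sketch with illustrative data; `.toarray()` converts the sparse result to a dense matrix and works across scikit-learn versions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; OneHotEncoder expects a 2-D input
foods = np.array([["Apple"], ["Broccoli"], ["Apple"], ["Carrot"]])

ohe = OneHotEncoder()
onehot = ohe.fit_transform(foods).toarray()

print(ohe.categories_[0].tolist())  # ['Apple', 'Broccoli', 'Carrot']
print(onehot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```

With 3 distinct values this adds 3 columns; a column with thousands of distinct values would add thousands of columns, which is exactly the cardinality problem described above.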

Example of LabelEncoder in Sklearn

We will now see how to do categorical encoding in Sklearn with `LabelEncoder`. We will walk through an end-to-end example: loading a dataset, applying label encoding, and then creating an ML model.
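The article's original dataset is not reproduced here, so the sketch below uses a small hypothetical DataFrame with two categorical columns, encoding each with its own `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset standing in for the one used in the article
df = pd.DataFrame({
    "color":  ["Red", "Green", "Blue", "Green", "Red"],
    "size":   ["S", "M", "L", "S", "M"],
    "target": [1, 0, 1, 0, 1],
})

# Fit a separate LabelEncoder per categorical column
for col in ["color", "size"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

Note that a fresh encoder is fitted per column; reusing one encoder across columns would mix their class vocabularies.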

Create and Train Model
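The article's actual model code is not shown here; as a hedged sketch, a classifier trained on a label-encoded feature might look like this (the dataset, target, and choice of `LogisticRegression` are all illustrative assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical data; repeated to give the model something to fit
df = pd.DataFrame({
    "food":   ["Apple", "Broccoli", "Carrot", "Apple", "Banana", "Broccoli"] * 5,
    "target": [0, 1, 1, 0, 0, 1] * 5,
})

# Label-encode the single categorical feature
df["food"] = LabelEncoder().fit_transform(df["food"])

X_train, X_test, y_train, y_test = train_test_split(
    df[["food"]], df["target"], test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```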

Create and Train Model
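For the one-hot version, only the encoding step changes; the sketch below uses `pd.get_dummies` as a convenient equivalent of Sklearn's `OneHotEncoder` (the data and model remain illustrative assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same hypothetical data as in the label-encoding sketch
df = pd.DataFrame({
    "food":   ["Apple", "Broccoli", "Carrot", "Apple", "Banana", "Broccoli"] * 5,
    "target": [0, 1, 1, 0, 0, 1] * 5,
})

# One-hot encode: one indicator column per distinct food value
X = pd.get_dummies(df[["food"]])

X_train, X_test, y_train, y_test = train_test_split(
    X, df["target"], test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```

Because each category now gets its own column, the model no longer has to interpret an arbitrary integer ordering.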

Comparison

As discussed in the label encoding vs. one-hot encoding section above, the same shortcomings of label encoding are clearly visible in these examples as well.

With label encoding, the model reached an accuracy of only 66.8%, but with one-hot encoding the accuracy shot up to 86.74%, an improvement of roughly 20 percentage points.

  • Veer Kumar

    I am passionate about Analytics and I am looking for opportunities to hone my current skills to gain prominence in the field of Data Science.
