Categorical Data Encoding with Sklearn LabelEncoder and OneHotEncoder

Label encoding may look intuitive to us humans, but machine learning algorithms can misinterpret the encoded values by assuming they carry an ordinal ranking. In the example below, Apple has an encoding of 1 and Broccoli has an encoding of 3. That does not mean Broccoli is "greater than" Apple, yet the numeric ordering can mislead the ML algorithm. This is why label encoding is not commonly used to encode input features for machine learning.
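A minimal sketch of the problem, using an illustrative list of items (not the article's dataset; note that LabelEncoder assigns codes in alphabetical order, so the exact numbers here differ from the 1 and 3 mentioned above):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical values (assumed, not from the article)
items = ["Apple", "Broccoli", "Carrot", "Apple", "Carrot"]

le = LabelEncoder()
codes = le.fit_transform(items)

print(le.classes_)  # classes are sorted alphabetically
print(codes)        # Apple -> 0, Broccoli -> 1, Carrot -> 2
```

The resulting codes (0, 1, 2) are arbitrary identifiers, but a model that consumes them as numbers may treat Carrot (2) as "twice" Broccoli (1), which is meaningless for these categories.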

One Hot Encoding overcomes this shortcoming of label encoding and is commonly used with machine learning algorithms. However, it also has a disadvantage: when the cardinality of the categorical variable is high, i.e. the column has too many distinct values, it produces a very wide encoding with a large number of additional columns that may not fit in memory or may hurt model performance.
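The same illustrative categories, one-hot encoded (again an assumed toy input, not the article's data):

```python
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input: one row per sample, one column per feature
items = [["Apple"], ["Broccoli"], ["Carrot"], ["Apple"]]

ohe = OneHotEncoder()
encoded = ohe.fit_transform(items).toarray()  # dense 0/1 indicator matrix

print(ohe.categories_)
print(encoded)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```

Each row contains a single 1 in the column for its category, so no ordering between categories is implied; the cost is one extra column per distinct value, which is exactly what blows up for high-cardinality columns.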

Example of LabelEncoder in Sklearn

We will now see how to do categorical encoding using Sklearn's LabelEncoder. We will walk through an end-to-end example: take a dataset, apply label encoding, and create an ML model.

Create and Train Model
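The article's original dataset and model code are not reproduced here, so the following is a minimal sketch of the label-encoding workflow on a small synthetic DataFrame with a decision tree; the column names, values, and resulting accuracy are illustrative assumptions, not the article's actual results:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the article's dataset (assumed columns)
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue",
              "red", "green", "blue", "red", "green", "blue"],
    "size":  ["S", "M", "L", "M", "S", "L", "M", "S", "L", "S", "M", "L"],
    "label": [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
})

# Label-encode each categorical feature column independently
for col in ["color", "size"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df[["color", "size"]], df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

The model now trains on integer codes, which is exactly where the spurious ordering discussed earlier can leak into the learned splits.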

Create and Train Model
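For the one-hot variant, the same sketch can be repeated with OneHotEncoder in place of LabelEncoder (again on an assumed synthetic dataset, not the article's):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same synthetic stand-in data as in the label-encoding sketch
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue",
              "red", "green", "blue", "red", "green", "blue"],
    "size":  ["S", "M", "L", "M", "S", "L", "M", "S", "L", "S", "M", "L"],
    "label": [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
})

# One-hot encode both categorical features into 0/1 indicator columns
ohe = OneHotEncoder(handle_unknown="ignore")
X = ohe.fit_transform(df[["color", "size"]]).toarray()
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

Here the two categorical columns expand into six indicator columns (three colors plus three sizes), so the model sees membership flags rather than an artificial numeric order.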

Comparison

As we discussed in the label encoding vs one hot encoding section above, we can clearly see the same shortcomings of label encoding in the above examples as well.

With label encoding, the model reached an accuracy of only 66.8%, but with one hot encoding the accuracy shot up to 86.74%, an improvement of roughly 20 percentage points.
