Categorical Data Encoding with Sklearn LabelEncoder and OneHotEncoder

Label encoding may look intuitive to us humans, but machine learning algorithms can misinterpret it by assuming the encoded values have an ordinal ranking. For example, if Apple gets an encoding of 1 and Broccoli an encoding of 3, that does not mean Broccoli ranks higher than Apple, but it can mislead the ML algorithm into treating it that way. This is why label encoding is rarely used for encoding categorical features in machine learning.
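The point can be seen directly with Sklearn's `LabelEncoder` (a minimal sketch; the food values here are illustrative, and the exact integer each category receives simply follows the alphabetical order of the classes):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column used only for illustration
foods = ["Apple", "Broccoli", "Carrot", "Banana"]

le = LabelEncoder()
encoded = le.fit_transform(foods)

# LabelEncoder assigns integers by the sorted (alphabetical) order of classes
print(le.classes_.tolist())   # ['Apple', 'Banana', 'Broccoli', 'Carrot']
print(encoded.tolist())       # [0, 2, 3, 1]
```

A downstream model sees only the integers 0–3 and has no way to know that the gaps between them are meaningless.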

One-hot encoding overcomes this shortcoming of label encoding and is commonly used with machine learning algorithms. However, it also has disadvantages. When the cardinality of the categorical variable is high, i.e. the column has many distinct values, one-hot encoding produces a large number of additional columns, which may not fit in memory or may hurt model performance.
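Sklearn's `OneHotEncoder` creates one column per distinct value, with a 1 marking the presence of that value in a row (again a minimal sketch with illustrative data; `.toarray()` converts the sparse result to a dense matrix and works across scikit-learn versions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; OneHotEncoder expects a 2-D input
foods = np.array([["Apple"], ["Broccoli"], ["Apple"], ["Carrot"]])

ohe = OneHotEncoder()
onehot = ohe.fit_transform(foods).toarray()

print(ohe.categories_[0].tolist())  # ['Apple', 'Broccoli', 'Carrot']
print(onehot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```

With 3 distinct values this adds 3 columns; a column with thousands of distinct values would add thousands of columns, which is exactly the cardinality problem described above.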

Example of LabelEncoder in Sklearn

We will now see how to do categorical encoding in Sklearn with `LabelEncoder`. We will walk through an end-to-end example: loading a dataset, applying label encoding, and then creating an ML model.
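The article's original dataset is not reproduced here, so the sketch below uses a small hypothetical DataFrame with two categorical columns, encoding each with its own `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset standing in for the one used in the article
df = pd.DataFrame({
    "color":  ["Red", "Green", "Blue", "Green", "Red"],
    "size":   ["S", "M", "L", "S", "M"],
    "target": [1, 0, 1, 0, 1],
})

# Fit a separate LabelEncoder per categorical column
for col in ["color", "size"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

Note that a fresh encoder is fitted per column; reusing one encoder across columns would mix their class vocabularies.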

Create and Train Model
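The article's actual model code is not shown here; as a hedged sketch, a classifier trained on a label-encoded feature might look like this (the dataset, target, and choice of `LogisticRegression` are all illustrative assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical data; repeated to give the model something to fit
df = pd.DataFrame({
    "food":   ["Apple", "Broccoli", "Carrot", "Apple", "Banana", "Broccoli"] * 5,
    "target": [0, 1, 1, 0, 0, 1] * 5,
})

# Label-encode the single categorical feature
df["food"] = LabelEncoder().fit_transform(df["food"])

X_train, X_test, y_train, y_test = train_test_split(
    df[["food"]], df["target"], test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```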

Create and Train Model
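For the one-hot version, only the encoding step changes; the sketch below uses `pd.get_dummies` as a convenient equivalent of Sklearn's `OneHotEncoder` (the data and model remain illustrative assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same hypothetical data as in the label-encoding sketch
df = pd.DataFrame({
    "food":   ["Apple", "Broccoli", "Carrot", "Apple", "Banana", "Broccoli"] * 5,
    "target": [0, 1, 1, 0, 0, 1] * 5,
})

# One-hot encode: one indicator column per distinct food value
X = pd.get_dummies(df[["food"]])

X_train, X_test, y_train, y_test = train_test_split(
    X, df["target"], test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```

Because each category now gets its own column, the model no longer has to interpret an arbitrary integer ordering.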

Comparison

As discussed in the label encoding vs. one-hot encoding section above, the same shortcomings of label encoding are clearly visible in these examples as well.

With label encoding, the model reached an accuracy of only 66.8%, but with one-hot encoding the accuracy shot up to 86.74%, an improvement of roughly 20 percentage points.

  • Veer Kumar

    I am passionate about Analytics and I am looking for opportunities to hone my current skills to gain prominence in the field of Data Science.
