Introduction:
In this tutorial, we will learn how to apply K-means clustering with the Sklearn library. We will first have a brief overview of what is meant by clustering, followed by an explanation of what the K-means algorithm is. Then, we will go through the working principle of the K-means algorithm, after which we will implement an end-to-end example that uses this algorithm to perform customer segmentation on the ‘Mall_Customers.csv’ dataset. We will then look at two methods, namely the Elbow and Silhouette methods, by which we can determine the optimum number of clusters within a given dataset.
What is Clustering?
Clustering is the task of segmenting a set of data into distinct groups such that the data points in the same group bear similar characteristics, as opposed to the data points that lie in other groups/clusters. Our main objective here is to segregate groups having similar characteristics and assign them unique clusters.
- The points present in the same cluster should have similar properties
- The points present in different clusters should be as dissimilar as possible
What is K-Means Algorithm?
K-Means Clustering comes under the category of Unsupervised Machine Learning algorithms, which group an unlabeled dataset into distinct clusters. The K defines the number of pre-defined clusters that need to be created; for instance, if K=2, there will be 2 clusters, and similarly for K=3, there will be three clusters. The primary goal while implementing k-means is to define k clusters such that the total within-cluster variation (or error) is minimal.
The cluster center is the arithmetic mean of all the data points that belong to that cluster. The squared distance between a given point and its cluster center is its variation. The goal of k-means clustering is to ascertain these k clusters and their centers while reducing the total error.
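Formally, if μ_j denotes the center of cluster C_j, the standard k-means objective is to minimize the total within-cluster sum of squared distances:

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

This is the quantity Sklearn exposes as the inertia_ attribute of a fitted KMeans model, which we will use later for the Elbow method.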
How does the K-Means Algorithm Work?
The steps of the working principle that governs the K-Means algorithm are listed below, followed by a minimal sketch of the loop in code:
Step-1: To decide the number of clusters, we select an appropriate value of K.
Step-2: Now choose K random points as the initial centroids.
Step-3: Each data point is assigned to its nearest centroid, and this forms the initial K clusters.
Step-4: Now we calculate the variance and position a new centroid for every cluster.
Step-5: The 3rd step is repeated, meaning every data point is assigned to the new nearest centroid.
Step-6: If any reassignment has occurred, execution continues from step-4; otherwise, execution finishes.
Step-7: Finally, the model is ready.
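To make these steps concrete, here is a minimal NumPy sketch of the loop described above. It is illustrative only (it assumes no cluster ever becomes empty and uses plain random initialization); Sklearn's KMeans, used in the rest of this tutorial, is the recommended implementation and defaults to the smarter k-means++ initialization.

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: choose k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3/5: assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids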
Example of K Means Clustering in Python Sklearn
We can easily implement K-Means clustering in Python with the Sklearn KMeans() class of the sklearn.cluster module. For this example, we will use the Mall Customer dataset to segment the customers into clusters based on their Age, Annual Income, Spending Score, etc.
Import Libraries
Let us import the libraries that we will require.
from sklearn.cluster import KMeans
from sklearn import preprocessing
import sklearn.cluster as cluster
import sklearn.metrics as metrics
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
Load Dataset
Let us load the dataset into a dataframe and take a look at some of its rows. Also, with the shape attribute, it can be seen that it has 200 rows and 4 columns.
df = pd.read_csv(r"Mall_Customers.csv")
df.head()
| | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | Male | 19 | 15 | 39 |
| 1 | Male | 21 | 15 | 81 |
| 2 | Female | 20 | 16 | 6 |
| 3 | Female | 23 | 16 | 77 |
| 4 | Female | 31 | 17 | 40 |
df.shape
(200, 4)
Objective
Customer segmentation deals with grouping customers together based on common patterns in their attributes. To keep the example simple and to visualize the clustering on a 2-D graph, we will use only two attributes: Annual Income and Spending Score. Don’t worry, after this we will also show you how you can use more than 2 attributes for clustering and still visualize the results in 2-D with the help of Principal Component Analysis (PCA).
Apply Feature Scaling
Clustering algorithms like K-means require feature scaling of the data as part of preprocessing to produce good results. This is because clustering techniques rely on distance calculations between the data points, so attributes measured in different units should be brought to a common scale.
- For more details, you may read the following article –
Sklearn Feature Scaling with StandardScaler, MinMaxScaler, RobustScaler and MaxAbsScaler
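For reference, MinMaxScaler (used below) rescales each feature independently to the [0, 1] range:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}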
We have used MinMaxScaler for our example, as shown below. In the new dataframe df_scale it can be seen that both attributes, Annual Income and Spending Score, are normalized.
scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Annual Income (k$)','Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns=['Annual Income (k$)','Spending Score (1-100)'])
df_scale.head(5)
| | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|
| 0 | 0.000000 | 0.387755 |
| 1 | 0.000000 | 0.816327 |
| 2 | 0.008197 | 0.051020 |
| 3 | 0.008197 | 0.775510 |
| 4 | 0.016393 | 0.397959 |
Applying Kmeans with 2 Clusters (K=2)
km = KMeans(n_clusters=2)
y_predicted = km.fit_predict(df_scale)  # fit on the scaled data prepared above
y_predicted
array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
km.cluster_centers_
array([[0.36929553, 0.31163817], [0.37861485, 0.73950929]])
df['Clusters'] = km.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)",hue = 'Clusters', data=df,palette='viridis')
<AxesSubplot:xlabel='Spending Score (1-100)', ylabel='Annual Income (k$)'>
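As a side note, a fitted model can also assign new customers to the existing clusters with predict(). Below is a minimal sketch using a hypothetical new customer with an annual income of 60 k$ and a spending score of 50 (values invented for illustration); the point must be scaled with the same fitted scaler before prediction.

# A hypothetical new customer: [Annual Income (k$), Spending Score (1-100)]
new_customer = scaler.transform([[60, 50]])  # reuse the scaler fitted above
print(km.predict(new_customer))  # index of the nearest cluster center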
Finding Optimum number of Clusters in K Means
i) Elbow Method with Within-Cluster-Sum of Squared Error (WCSS)
The Elbow Method is a popular technique for determining the optimal number of clusters. Here, we calculate the Within-Cluster Sum of Squared Errors (WCSS) for various values of k and choose the k at which the decrease in WCSS first starts to level off. In the plot of WCSS versus k, this can be observed as an elbow.
- The Squared Error for a data point is the square of the distance of the point from its cluster center.
- The WCSS score is the summation of the Squared Errors for all the data points.
- Distance metrics like Euclidean Distance or Manhattan Distance can be used.
Continuing with our example, we run K-means for k = 2 to 11 and record the WCSS in each iteration.
K = range(2, 12)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans = kmeans.fit(df_scale)
    wss_iter = kmeans.inertia_  # inertia_ is the WCSS of the fitted model
    wss.append(wss_iter)
Let us now plot the WCSS vs K graph. It can be seen below that there is an elbow bend at K=5, i.e. it is the point after which WCSS does not diminish much with an increase in the value of K.
plt.xlabel('K')
plt.ylabel('Within-Cluster-Sum of Squared Errors (WSS)')
plt.plot(K,wss)
ii) The Silhouette Method
The silhouette value measures how similar a data point is to its own cluster compared to the other clusters. It ranges between -1 and +1, and higher values denote a better clustering.
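For a single point i, let a(i) be the mean distance from i to the other points in its own cluster, and b(i) the mean distance from i to the points of the nearest neighboring cluster. The silhouette value of i is then:

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}

The silhouette score reported below is the average of s(i) over all points.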
Below we calculate the Silhouette Score for k=2 to 12 and it can be seen that the maximum value is for k=5. This is in line with the elbow method.
import sklearn.cluster as cluster
import sklearn.metrics as metrics

for i in range(2, 13):
    labels = cluster.KMeans(n_clusters=i, random_state=200).fit(df_scale).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(df_scale, labels, metric="euclidean",
                                         sample_size=1000, random_state=200)))
Silhouette score for k(clusters) = 2 is 0.33340205479521
Silhouette score for k(clusters) = 3 is 0.4514909309424474
Silhouette score for k(clusters) = 4 is 0.49620078745146784
Silhouette score for k(clusters) = 5 is 0.5594854531227246
Silhouette score for k(clusters) = 6 is 0.5380652777999084
Silhouette score for k(clusters) = 7 is 0.43787929453711455
Silhouette score for k(clusters) = 8 is 0.43074523601514214
Silhouette score for k(clusters) = 9 is 0.4421331695270676
Silhouette score for k(clusters) = 10 is 0.44563877983976935
Silhouette score for k(clusters) = 11 is 0.44293254541345917
Silhouette score for k(clusters) = 12 is 0.4427512711673661
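Beyond the average score, sklearn.metrics also provides silhouette_samples(), which returns the per-point silhouette values and can help spot weakly assigned points. A short sketch for k=5 (the threshold of 0.1 is an arbitrary choice for illustration):

labels = cluster.KMeans(n_clusters=5, random_state=200).fit(df_scale).labels_
sample_scores = metrics.silhouette_samples(df_scale, labels)
# Points with low (or negative) values lie near or across a cluster boundary
print("Points with silhouette < 0.1:", (sample_scores < 0.1).sum())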
Applying Kmeans with 5 Clusters (K=5)
# We will use 2 variables for this example, fitted on the scaled dataframe
kmeans = cluster.KMeans(n_clusters=5, init="k-means++")
kmeans = kmeans.fit(df_scale)
Finally, let us plot the graph with k=5 clusters, and we can see that the K-Means result now looks good.
df['Clusters'] = kmeans.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)",hue = 'Clusters', data=df,palette='viridis')
K Means Clustering in Python Sklearn with Principal Component Analysis
In the above example, we used only two attributes to perform clustering because it is easier for us to visualize the results in a 2-D graph. We cannot visualize anything beyond 3 attributes in 3-D, and in real-world scenarios there can be hundreds of attributes. So how can we visualize the clustering results?
Well, it can be done by applying principal component analysis (PCA) on the dataset to reduce its dimensionality to only two components while still preserving most of the information. Clustering can then be applied to this transformed dataset and visualized in a 2-D plot. Moreover, PCA can also help to avoid the curse of dimensionality.
- For more details on PCA, you can read the following article –
Complete Tutorial of PCA in Python Sklearn with Example
So let us see this practically below where we will use 3 attributes on the same dataset.
Load Dataset
Let us again load the dataset into a dataframe as before.
df = pd.read_csv(r"Mall_Customers.csv")
df.head()
| | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | Male | 19 | 15 | 39 |
| 1 | Male | 21 | 15 | 81 |
| 2 | Female | 20 | 16 | 6 |
| 3 | Female | 23 | 16 | 77 |
| 4 | Female | 31 | 17 | 40 |
Apply Feature Scaling
This time we are applying feature scaling to our desired columns: Age, Annual Income, and Spending Score.
scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Age','Annual Income (k$)','Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns=['Age','Annual Income (k$)','Spending Score (1-100)'])
df_scale.head(5)
| | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|
| 0 | 0.019231 | 0.000000 | 0.387755 |
| 1 | 0.057692 | 0.000000 | 0.816327 |
| 2 | 0.038462 | 0.008197 | 0.051020 |
| 3 | 0.096154 | 0.008197 | 0.775510 |
| 4 | 0.250000 | 0.016393 | 0.397959 |
Applying PCA
Now let us reduce the dimensionality of the dataset to two components.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df_scale)
pca_df = pd.DataFrame(data=principalComponents,
                      columns=['principal component 1', 'principal component 2'])
pca_df.head()
| | principal component 1 | principal component 2 |
|---|---|---|
| 0 | -0.192221 | 0.319683 |
| 1 | -0.458175 | -0.018152 |
| 2 | 0.052562 | 0.551854 |
| 3 | -0.402357 | -0.014239 |
| 4 | -0.031648 | 0.155578 |
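To check how much of the information in the three scaled attributes these two components actually preserve, you can inspect the explained_variance_ratio_ attribute of the fitted PCA object:

# Fraction of the variance of the scaled data captured by each component
print(pca.explained_variance_ratio_)
print("Total variance preserved:", pca.explained_variance_ratio_.sum())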
Finding Optimum Value of K
i) Elbow Method with Within-Cluster-Sum of Squared Error (WCSS)
K = range(2, 12)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans = kmeans.fit(pca_df)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)
plt.xlabel('K')
plt.ylabel('Within-Cluster-Sum of Squared Errors (WSS)')
plt.plot(K,wss)
ii) The Silhouette Method
Using the Silhouette method on the PCA-transformed data, the score below is highest for K=2, but among the larger values of K it peaks at K=5, in line with the elbow method. Hence it can be concluded that the dataset can be segmented properly with 5 clusters.
import sklearn.cluster as cluster
import sklearn.metrics as metrics

for i in range(2, 12):
    labels = cluster.KMeans(n_clusters=i, random_state=200).fit(pca_df).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(pca_df, labels, metric="euclidean",
                                         sample_size=1000, random_state=200)))
Silhouette score for k(clusters) = 2 is 0.4736269407502857
Silhouette score for k(clusters) = 3 is 0.44839082753844756
Silhouette score for k(clusters) = 4 is 0.43785291876777566
Silhouette score for k(clusters) = 5 is 0.45130680489606634
Silhouette score for k(clusters) = 6 is 0.4507847568968469
Silhouette score for k(clusters) = 7 is 0.4458795480456887
Silhouette score for k(clusters) = 8 is 0.4132957148795121
Silhouette score for k(clusters) = 9 is 0.4170428610065107
Silhouette score for k(clusters) = 10 is 0.4309783655094101
Silhouette score for k(clusters) = 11 is 0.42535265774570674
Applying Kmeans with 5 Clusters (K=5)
K-Means clustering is applied with the Sklearn KMeans() class by passing the value k=5.
kmeans = cluster.KMeans(n_clusters=5)
kmeans = kmeans.fit(pca_df)
pca_df['Clusters'] = kmeans.labels_
sns.scatterplot(x="principal component 1", y="principal component 2", hue='Clusters', data=pca_df, palette='viridis')
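Because the rows of pca_df line up with the rows of the original dataframe, the labels can also be copied back to df to profile each segment in the original attributes. A minimal sketch:

# Attach the labels to the original dataframe and profile each segment
df['Clusters'] = kmeans.labels_
print(df.groupby('Clusters')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean())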
Conclusion
We hope you liked our tutorial and now better understand how to implement K-means clustering using Sklearn (Scikit-learn) in Python. Here, we have illustrated an end-to-end example of using a dataset to build a K-means clustering model to achieve customer segmentation.