Introduction:
In this tutorial, we will learn how to apply K-Means clustering using the Sklearn library. We will start with a brief overview of what clustering means, then look at what the K-Means algorithm is and how it works. After that, we will write end-to-end code that uses this algorithm to perform customer segmentation on the 'Mall_Customers.csv' dataset. Finally, we will look at two methods, the Elbow method and the Silhouette method, for choosing the optimum number of clusters in a given dataset.
What is Clustering?
Clustering is the task of segmenting a set of data into distinct groups such that the data points in the same group have similar characteristics, while data points in other groups/clusters differ from them. Our main objective here is to find groups with similar characteristics and assign each of them to its own cluster.
 The points present in the same cluster should have similar properties
 The points present in the different clusters should be as dissimilar as possible
What is the K-Means Algorithm?
K-Means clustering comes under the category of unsupervised machine learning algorithms, which group an unlabeled dataset into distinct clusters. K is the number of clusters to be created and is chosen in advance: if K=2 there will be two clusters, and if K=3 there will be three. The primary goal of K-Means is to define K clusters such that the total within-cluster variation (or error) is minimum.
The cluster center is the arithmetic mean of all the data points that belong to that cluster. The squared distance between a point and its cluster center is called the variation of that point. The goal of K-Means clustering is to find these K clusters and their centers while minimizing the total variation.
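In symbols, the two quantities described above can be written as follows. The notation is introduced here only for illustration and is not taken from Sklearn: C_j is the set of points in cluster j, mu_j is its center, and W is the total within-cluster variation that K-Means tries to minimize.

```latex
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,
\qquad
W = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```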
How does the K-Means Algorithm Work?
The steps of the K-Means algorithm are listed below:
Step 1: Select an appropriate value of K, the number of clusters.
Step 2: Choose K random points as the initial centroids.
Step 3: Assign each data point to its nearest centroid; this forms K clusters.
Step 4: Recompute the centroid of every cluster as the mean of the points assigned to it.
Step 5: Repeat step 3, that is, reassign every data point to its new nearest centroid.
Step 6: If any reassignment has occurred, go back to step 4; otherwise execution finishes.
Step 7: The model is ready.
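The steps above can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions, not how Sklearn implements KMeans: the function name simple_kmeans, the toy points and the choice to keep the old centroid when a cluster becomes empty are all made up for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def simple_kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop as soon as no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = simple_kmeans(points, k=2)
print(labels)
print(centroids)
```

The first three points should end up sharing one label and the last three the other; which group is called 0 and which is called 1 depends on the random initial centroids.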
Example of K-Means Clustering in Python Sklearn
We can easily implement K-Means clustering in Python with the KMeans class of the sklearn.cluster module. For this example, we will use the Mall Customers dataset to segment customers into clusters based on attributes such as Age, Annual Income and Spending Score.
Import Libraries
Let us import the important libraries that will be required by us.
from sklearn.cluster import KMeans
from sklearn import preprocessing
import sklearn.cluster as cluster
import sklearn.metrics as metrics
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
Load Dataset
Let us load the dataset into a dataframe and take a look at some of its rows. The shape attribute shows that it has 200 rows and 4 columns.
df = pd.read_csv(r"Mall_Customers.csv")
df.head()
   Gender  Age  Annual Income (k$)  Spending Score (1-100)
0  Male     19                  15                      39
1  Male     21                  15                      81
2  Female   20                  16                       6
3  Female   23                  16                      77
4  Female   31                  17                      40
df.shape
(200, 4)
Objective
Customer segmentation deals with grouping customers together based on common patterns in their attributes. To keep the example simple and to visualize the clustering on a 2D graph, we will use only two attributes: Annual Income and Spending Score. Later on, we will also show how to use more than two attributes for clustering and still visualize the results in 2D with the help of Principal Component Analysis (PCA).
Apply Feature Scaling
Clustering algorithms like K-Means need feature scaling as part of data preprocessing to produce good results. This is because they rely on distances between data points, so a feature with a larger numeric range would otherwise dominate the distance calculation. Hence it is proper to bring features measured in different units onto a common scale.
For more details, you may read the following article: Sklearn Feature Scaling with StandardScaler, MinMaxScaler, RobustScaler and MaxAbsScaler
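To make the idea of a common scale concrete, here is a small hand-written sketch of min-max scaling, which maps the smallest value of a feature to 0 and the largest to 1. With its default settings, MinMaxScaler applies this same formula to each column. The helper name min_max_scale and the toy values are assumptions made for illustration; they are not read from Mall_Customers.csv, and the helper assumes the values are not all identical.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant column would cause a division by zero here.
    return [(v - lo) / (hi - lo) for v in values]


# Toy values chosen for illustration only.
income_k = [15, 40, 65, 90, 115]   # thousands of dollars
score = [10, 30, 50, 70, 90]       # points on a 1-100 scale

print(min_max_scale(income_k))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_scale(score))     # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After scaling, both features span the same 0 to 1 range, so neither dominates the distance between two customers.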
We have used MinMaxScaler for our example as shown below. In the new dataframe df_scale it can be seen that both attributes, Annual Income and Spending Score, are scaled to the range 0 to 1.
scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Annual Income (k$)', 'Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns=['Annual Income (k$)', 'Spending Score (1-100)'])
df_scale.head(5)
   Annual Income (k$)  Spending Score (1-100)
0            0.000000                0.387755
1            0.000000                0.816327
2            0.008197                0.051020
3            0.008197                0.775510
4            0.016393                0.397959
Applying K-Means with 2 Clusters (K=2)
km = KMeans(n_clusters=2)
# Fit on the scaled features; the cluster centers printed below lie in the 0 to 1 range
y_predicted = km.fit_predict(df_scale)
y_predicted
array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
km.cluster_centers_
array([[0.36929553, 0.31163817], [0.37861485, 0.73950929]])
df['Clusters'] = km.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)", hue='Clusters', data=df, palette='viridis')
<AxesSubplot:xlabel='Spending Score (1-100)', ylabel='Annual Income (k$)'>
Finding Optimum number of Clusters in K Means
i) Elbow Method with Within-Cluster Sum of Squared Errors (WCSS)
The Elbow Method is a popular technique for determining the optimal number of clusters. Here, we calculate the Within-Cluster Sum of Squared Errors (WCSS) for various values of k and choose the k at which the decrease in WCSS first starts to level off. In the plot of WCSS versus k, this can be observed as an elbow.
 The Squared Error for a data point is the square of the distance of a point from its cluster center.
 The WCSS score is the summation of Squared Errors for all given data points.
 Distance metrics like Euclidean Distance or Manhattan Distance can be used in general; the inertia_ value reported by Sklearn's KMeans is based on squared Euclidean distance.
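The definition above can be checked by hand on a tiny example. The sketch below computes the WCSS for a given assignment of points to clusters using squared Euclidean distance; the function name wcss and the four toy points are assumptions made for illustration. For a fitted KMeans model, the inertia_ attribute reports this same kind of quantity.

```python
def wcss(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for cluster_id in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster_id]
        # Cluster center: the coordinate-wise mean of the member points.
        center = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, center)) for p in members
        )
    return total


# Toy data: two pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(wcss(points, [0, 0, 0, 0]))  # one cluster: 104.0
print(wcss(points, [0, 0, 1, 1]))  # two clusters: 4.0
```

Splitting the data into its two natural groups cuts the WCSS sharply. Adding further clusters beyond that point can only remove the small remaining error, which is what produces the elbow shape.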
Continuing with our example, we calculate the WCSS for k=2 to k=11, fitting a new model in each iteration.
K = range(2, 12)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans = kmeans.fit(df_scale)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)
Let us now plot the WCSS vs K cluster graph. It can be seen below that there is an elbow bend at K=5 i.e. it is the point after which WCSS does not diminish much with the increase in value of K.
plt.xlabel('K')
plt.ylabel('Within-Cluster Sum of Squared Errors (WCSS)')
plt.plot(K,wss)
ii) The Silhouette Method
The silhouette value measures how similar a data point is to its own cluster compared with the other clusters. It ranges between -1 and +1, and higher values denote a better clustering.
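As a rough illustration of how the silhouette value is obtained, the sketch below computes it by hand for each point as (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The function name silhouette_values, the toy points and the choice to return 0 for a point that is alone in its cluster are assumptions for this example. It needs Python 3.8 or later for math.dist.

```python
import math


def silhouette_values(points, labels):
    """Silhouette coefficient of every point, using Euclidean distance."""
    values = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(lab, [])
        if not own or not by_cluster:
            # Assumed convention: 0 for a lone point or a single cluster.
            values.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        values.append((b - a) / max(a, b))
    return values


# Toy data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_values(points, [0, 1, 0, 1])   # deliberately poor grouping
print(sum(good) / len(good))
print(sum(bad) / len(bad))
```

The natural grouping gives an average close to +1, while the poor grouping gives a negative average.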
Below we calculate the Silhouette Score for k=2 to 12 and it can be seen that the maximum value is for k=5. This is in line with the elbow method.
import sklearn.cluster as cluster
import sklearn.metrics as metrics
for i in range(2, 13):
    labels = cluster.KMeans(n_clusters=i, random_state=200).fit(df_scale).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(df_scale, labels, metric="euclidean", sample_size=1000, random_state=200)))
Silhouette score for k(clusters) = 2 is 0.33340205479521
Silhouette score for k(clusters) = 3 is 0.4514909309424474
Silhouette score for k(clusters) = 4 is 0.49620078745146784
Silhouette score for k(clusters) = 5 is 0.5594854531227246
Silhouette score for k(clusters) = 6 is 0.5380652777999084
Silhouette score for k(clusters) = 7 is 0.43787929453711455
Silhouette score for k(clusters) = 8 is 0.43074523601514214
Silhouette score for k(clusters) = 9 is 0.4421331695270676
Silhouette score for k(clusters) = 10 is 0.44563877983976935
Silhouette score for k(clusters) = 11 is 0.44293254541345917
Silhouette score for k(clusters) = 12 is 0.4427512711673661
Applying K-Means with 5 Clusters (K=5)
# We will use 2 Variables for this example
kmeans = cluster.KMeans(n_clusters=5, init="k-means++")
kmeans = kmeans.fit(df[['Annual Income (k$)', 'Spending Score (1-100)']])
Finally, let us plot the graph with k=5 clusters and we can see that now the KMeans result looks good.
df['Clusters'] = kmeans.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)", hue='Clusters', data=df, palette='viridis')
K Means Clustering in Python Sklearn with Principal Component Analysis
In the above example, we used only two attributes to perform clustering because it is easier to visualize the results in a 2D graph. We cannot visualize more than three attributes, even in 3D, and in real-world scenarios there can be hundreds of attributes. So how can we visualize the clustering results?
Well, it can be done by applying principal component analysis (PCA) to the dataset to reduce its dimension to only two while still preserving most of the variance in the data. Clustering can then be applied to this transformed dataset and visualized in a 2D plot. Moreover, PCA can also help to mitigate the curse of dimensionality.
For more details on PCA, you can read the following article: Complete Tutorial of PCA in Python Sklearn with Example
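To give a feel for what reducing three attributes to two means, here is a small sketch of PCA done by hand with NumPy: center the data, take a singular value decomposition, and project onto the first two directions. The data is synthetic and generated only for this illustration (it is not the mall dataset), and the variable names are assumptions. The third column is built to be almost a combination of the first two, so two components should retain nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-feature data: the third column is nearly a linear
# combination of the first two, plus a little noise.
base = rng.normal(size=(200, 2))
third = base @ np.array([0.6, 0.4]) + 0.05 * rng.normal(size=200)
X = np.column_stack([base, third])

# PCA by hand: center each column, then decompose.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two principal directions (the first two rows of Vt).
X_2d = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)
print(X_2d.shape)                  # (200, 2)
print(explained_ratio[:2].sum())   # share of variance kept by two components
```

On real data the share kept by two components can be much lower, so it is worth checking this number before trusting a 2D plot of the clusters.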
So let us see this practically below where we will use 3 attributes on the same dataset.
Load Dataset
Let us again load the dataset in the dataframe like before.
df = pd.read_csv(r"Mall_Customers.csv")
df.head()
   Gender  Age  Annual Income (k$)  Spending Score (1-100)
0  Male     19                  15                      39
1  Male     21                  15                      81
2  Female   20                  16                       6
3  Female   23                  16                      77
4  Female   31                  17                      40
Apply Feature Scaling
This time we are applying feature scaling on our desired columns Age, Annual Income and Spending Score.
scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns=['Age', 'Annual Income (k$)', 'Spending Score (1-100)'])
df_scale.head(5)
        Age  Annual Income (k$)  Spending Score (1-100)
0  0.019231            0.000000                0.387755
1  0.057692            0.000000                0.816327
2  0.038462            0.008197                0.051020
3  0.096154            0.008197                0.775510
4  0.250000            0.016393                0.397959
Applying PCA
Now let us reduce the dimensionality of the dataset into two components.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df_scale)
pca_df = pd.DataFrame(data=principalComponents,
                      columns=['principal component 1', 'principal component 2'])
pca_df.head()
   principal component 1  principal component 2
0               0.192221               0.319683
1               0.458175               0.018152
2               0.052562               0.551854
3               0.402357               0.014239
4               0.031648               0.155578
Finding Optimum Value of K
i) Elbow Method with Within-Cluster Sum of Squared Errors (WCSS)
K = range(2, 12)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans = kmeans.fit(pca_df)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)
plt.xlabel('K')
plt.ylabel('Within-Cluster Sum of Squared Errors (WCSS)')
plt.plot(K, wss)
ii) The Silhouette Method
Using the Silhouette method, the score for K=5 (0.451) is the highest among K=3 and above, although K=2 scores slightly higher (0.474). We therefore segment the dataset with 5 clusters, as in the two-attribute example.
import sklearn.cluster as cluster
import sklearn.metrics as metrics
for i in range(2, 12):
    labels = cluster.KMeans(n_clusters=i, random_state=200).fit(pca_df).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(pca_df, labels, metric="euclidean", sample_size=1000, random_state=200)))
Silhouette score for k(clusters) = 2 is 0.4736269407502857
Silhouette score for k(clusters) = 3 is 0.44839082753844756
Silhouette score for k(clusters) = 4 is 0.43785291876777566
Silhouette score for k(clusters) = 5 is 0.45130680489606634
Silhouette score for k(clusters) = 6 is 0.4507847568968469
Silhouette score for k(clusters) = 7 is 0.4458795480456887
Silhouette score for k(clusters) = 8 is 0.4132957148795121
Silhouette score for k(clusters) = 9 is 0.4170428610065107
Silhouette score for k(clusters) = 10 is 0.4309783655094101
Silhouette score for k(clusters) = 11 is 0.42535265774570674
Applying K-Means with 5 Clusters (K=5)
K-Means clustering is applied with Sklearn's KMeans class by passing n_clusters=5.
kmeans = cluster.KMeans(n_clusters=5)
kmeans = kmeans.fit(pca_df)
pca_df['Clusters'] = kmeans.labels_
sns.scatterplot(x="principal component 1", y="principal component 2", hue='Clusters', data=pca_df, palette='viridis')
Conclusion
We hope you liked our tutorial and now better understand how to implement K-Means clustering using Sklearn (Scikit-Learn) in Python. Here, we have illustrated an end-to-end example of building a K-Means clustering model on a dataset to achieve customer segmentation.