Image Classification using Bag of Visual Words Model


Deep learning models now dominate image classification: they deliver state-of-the-art accuracy and have become the industry standard. Before the deep learning boom, however, there were many classical techniques for image classification. In this article, we will look at one such approach: image classification with the Bag of Visual Words model.

What is Bag of Visual Words

Relation with Bag of Words

The concept of “Bag of Visual Words” is borrowed from the related “Bag of Words” concept in Natural Language Processing.

In the bag of words model, a text is represented by the frequency of its words, without taking into account the order in which they appear (hence the name ‘bag’).

The main idea behind counting the words is:

Documents that share a large number of the same keywords, regardless of the order the keywords appear in, are considered to be relevant to each other.
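
As a small illustration of this idea, the sketch below counts words with Python's `collections.Counter`. The three example sentences and the helper names `bag_of_words` and `shared_count` are made up for this illustration and are not part of the original notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())

def shared_count(a, b):
    """Number of word occurrences that two bags have in common."""
    return sum((a & b).values())

doc_a = bag_of_words("the cat chased the dog")
doc_b = bag_of_words("the dog chased the cat")
doc_c = bag_of_words("stock prices fell sharply today")

print(doc_a == doc_b)              # same bag, even though the word order differs
print(shared_count(doc_a, doc_b))  # large overlap: the documents look related
print(shared_count(doc_a, doc_c))  # no overlap: the documents look unrelated
```

The first two sentences say different things, yet their bags are identical. That loss of order is exactly the trade-off the bag model accepts.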

Bag of Visual Words

In computer vision, the same concept is used in the bag of visual words model. Here, instead of taking words from a text, image patches and their feature vectors are extracted from the image and collected into a bag. A feature vector is a numerical description of a distinctive pattern found in the image.

Put simply, Bag of Visual Words represents an image as an unordered collection of image patches, as shown in the illustration below.

[Figure: Bag of Visual Words]

What is a Feature?

A feature of an image consists of keypoints and descriptors. Keypoints are distinctive points in an image; ideally, the same keypoints are found even if the image is rotated, shrunk, or enlarged. A descriptor is a numerical description of the region around a keypoint, so its main task is to describe an interesting patch (keypoint) in the image.


Image classification with Bag of Visual Words

Image classification with Bag of Visual Words has three steps:

  1. Feature Extraction – Determining the image features for the images of each label.
  2. Codebook Construction – Building a visual vocabulary by clustering, followed by frequency analysis.
  3. Classification – Classifying images with an SVM, based on the generated vocabulary.

Let us go through each of the steps in detail.

Feature Extraction

The first step in building a bag of visual words is feature extraction: extracting descriptors from each image in the dataset.

Feature representation methods deal with how to represent the patches as numerical vectors. These vectors are called feature descriptors.

A good descriptor should be able to handle variations in intensity, rotation, and scale, as well as affine transformations, to some extent.

One of the most famous descriptors is the Scale-Invariant Feature Transform (SIFT); another is ORB.

SIFT converts each patch to a 128-dimensional vector. After this step, each image is a collection of vectors of the same dimension (128 for SIFT, 32 for the ORB descriptors used in the code later in this article), where the order of the vectors is of no importance.

[Figure: Feature Extraction]

Codewords and Codebook Construction

The vectors generated in the feature extraction step are now converted into codewords, which are analogous to words in text documents. A codeword is a vector that represents a group of similar patches. Together, the codewords form a codebook, which is analogous to a word dictionary.

This step is normally accomplished with the k-means clustering algorithm. The outline of k-means clustering is shown below –

Given k:

  1. Select initial centroids at random.
  2. Assign each object to the cluster with the nearest centroid.
  3. Compute each centroid as the mean of the objects assigned to it.
  4. Repeat steps 2 and 3 until no change.
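
The four steps above can be sketched in plain NumPy. This is a minimal illustration and not the implementation used later in the article (which imports `kmeans` from SciPy). The function name `kmeans_sketch`, the synthetic two-blob data, and the rule for empty clusters are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain k-means following the four steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated synthetic blobs standing in for descriptor vectors.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
codebook, labels = kmeans_sketch(points, k=2)
print(codebook.shape)  # one row per cluster centre
```

Each row of `codebook` is one cluster centre, which is what the bag of visual words model treats as a codeword.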

Some points to consider here –

  • Clustering, an unsupervised learning method, is commonly used to create the visual vocabulary or codebook.
  • Each cluster center produced by k-means becomes a codeword.
  • The number of clusters is the codebook size.
  • The codebook can be learned on a separate training set.
  • Provided the training set is sufficiently representative, the codebook will be “universal”.
  • The codebook is used for quantizing features. Quantizing a feature vector means mapping it to the index of the nearest codeword in the codebook.
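
Quantization, as described in the last point, can be sketched as a nearest-neighbour lookup. The toy two-dimensional codebook and descriptors below are invented for illustration, and `quantize` is a hypothetical helper name.

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each descriptor to the index of its nearest codeword."""
    # Pairwise Euclidean distances, shape (n_descriptors, n_codewords).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

# A toy codebook with three 2-D codewords (real descriptors have 32 or 128 dimensions).
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, -1.0], [0.5, 8.0], [11.0, 0.5]])
print(quantize(descriptors, codebook))  # [0 1 2 1]
```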

To summarize this step, each patch in an image is mapped to a codeword through the clustering process, and the image can then be represented by the histogram of its codewords.

[Figure: Codebook Construction]

Example of Codebook


The next step consists of representing each image as a histogram of codewords.

This is done by first applying the keypoint detector (feature extractor) and descriptor to every training image, and then matching every keypoint descriptor to its nearest codeword in the codebook.

The result of this is a histogram where the bins correspond to the codewords, and the count of every bin corresponds to the number of times the corresponding codeword matches a keypoint in the given image. In this way, an image can be represented by a histogram of codewords.
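
Given the codeword index of every keypoint in an image, the histogram is just a count per index. A minimal sketch with made-up indices and a hypothetical codebook of size 5:

```python
import numpy as np

def codeword_histogram(word_indices, codebook_size):
    """Count how often each codeword index occurs in one image."""
    return np.bincount(word_indices, minlength=codebook_size)

# Hypothetical codeword indices for the six keypoints of one image.
words = np.array([0, 3, 3, 1, 3, 0])
hist = codeword_histogram(words, codebook_size=5)
print(hist)  # [2 1 0 3 0]
```

Every image yields a vector of the same length (the codebook size) regardless of how many keypoints it has, which is what makes the histograms usable as classifier input.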

The histograms of the training images can then be used to learn a classification model. Here I am using SVM as a classification model.

[Figure: Image classification with bag of visual words, schematic diagram (Source: Reference [1])]

Coding Image Classifier using Bag Of Visual Words

In this example, we will use the bag of visual words approach to perform image classification on a dog and cat dataset.

Importing the required libraries

In [3]:
import cv2
import numpy as np
import os
import matplotlib.pyplot as plt
import random
import pylab as pl
from sklearn.metrics import confusion_matrix,accuracy_score

Defining the training path

In [4]:

In [5]:
['Dog', 'Cat']
In [10]:

Function to List all the filenames in the directory

In [11]:
def img_list(path):
    return (os.path.join(path,f) for f in os.listdir(path))
In [12]:
for training_name in class_names:
In [13]:
In [14]:
In [15]:
In [16]:

Append all the image path and its corresponding labels in a list

In [18]:
In [19]:
for i in range(len(image_paths)):

Shuffle Dataset and split into Training and Testing

In [20]:
dataset = D
train = dataset[:180]
test = dataset[180:]

image_paths, y_train = zip(*train)
image_paths_test, y_test = zip(*test)
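
The slicing above assumes that `D` already holds shuffled (path, label) pairs. One way such a list could be built is sketched below. The file names, the class ids 0 and 1, the seed, and the split point of 8 are all assumptions chosen for a ten-item toy example.

```python
import random

# Hypothetical (image_path, label) pairs; 0 and 1 are assumed class ids.
image_paths = [f"img_{i:03d}.jpg" for i in range(10)]
image_classes = [0] * 5 + [1] * 5

D = list(zip(image_paths, image_classes))
random.seed(42)    # fixed seed so the split is reproducible
random.shuffle(D)  # shuffle paths and labels together so pairs stay aligned

split = 8          # assumed split point for this toy data
train, test = D[:split], D[split:]

train_paths, y_train = zip(*train)
test_paths, y_test = zip(*test)
print(len(train_paths), len(test_paths))
```

Shuffling the zipped pairs, rather than the two lists separately, keeps every path attached to its own label.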

Feature Extraction using ORB

In [21]:
In [22]:
In [23]:
In [24]:
<matplotlib.image.AxesImage at 0x7f5dd1b37e50>

Function for plotting keypoints

In [25]:
def draw_keypoints(vis, keypoints, color = (0, 255, 255)):
    # Draw a small circle on `vis` at the location of every keypoint.
    for kp in keypoints:
        x, y = kp.pt
        plt.imshow(cv2.circle(vis, (int(x), int(y)), 2, color))

Plotting the keypoints

In [26]:
kp = orb.detect(im,None)
kp, des = orb.compute(im, kp)

Appending descriptors of the training images in list

In [27]:
for image_pat in image_paths:
    keypoints,descriptor= orb.compute(im, kp)
In [28]:
for image_path,descriptor in des_list[1:]:
In [29]:
(81096, 32)
In [30]:

Performing K Means clustering on Descriptors

In [31]:
from scipy.cluster.vq import kmeans,vq
In [32]:

Creating histogram of training image

In [33]:
for i in range(len(image_paths)):
    for w in words:
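
A minimal sketch of how clustering and histogram building can fit together with `kmeans` and `vq` from `scipy.cluster.vq`. It runs on synthetic 32-dimensional descriptors rather than real ORB output. Here `des_list` is simplified to a plain list of arrays, and the names `voc` and `im_features` and the codebook size `k = 8` are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Stand-in for des_list: three "images" with 40, 55 and 30 synthetic 32-D descriptors.
des_list = [rng.random((n, 32)) for n in (40, 55, 30)]

# Stack every descriptor into one float array; kmeans expects floating-point input.
descriptors = np.vstack(des_list).astype(float)

k = 8  # assumed codebook size for this toy example
# iter is the number of k-means runs; the codebook with the lowest distortion is kept.
voc, distortion = kmeans(descriptors, k, iter=1)

# One histogram row per image, one column per codeword.
im_features = np.zeros((len(des_list), k), dtype="float32")
for i, des in enumerate(des_list):
    words, distances = vq(des, voc)  # index of the nearest codeword per descriptor
    for w in words:
        im_features[i][w] += 1

print(im_features.shape)
```

Each row of `im_features` sums to the number of descriptors in that image, since every descriptor contributes exactly one count.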

Applying standardisation on training feature

In [34]:
from sklearn.preprocessing import StandardScaler
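
Standardisation rescales every histogram bin to zero mean and unit variance across the training images. The arithmetic can be sketched in plain NumPy. The feature matrix is made up, and `fit_standardizer` and `standardize` are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np

def fit_standardizer(features):
    """Return the per-column mean and standard deviation of the training features."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # guard: leave constant columns unscaled
    return mean, std

def standardize(features, mean, std):
    """Apply previously fitted statistics to a feature matrix."""
    return (features - mean) / std

# Made-up histogram matrix: 4 training images, 3 codewords.
train_features = np.array([[0.0, 4.0, 1.0],
                           [2.0, 4.0, 3.0],
                           [4.0, 4.0, 5.0],
                           [6.0, 4.0, 7.0]])
mean, std = fit_standardizer(train_features)
scaled = standardize(train_features, mean, std)
print(scaled.mean(axis=0))  # approximately 0 in every column
print(scaled.std(axis=0))   # 1 for varying columns, 0 for the constant one
```

The statistics are fitted on the training features only. The test features should be transformed with the same `mean` and `std` so that both sets live on the same scale.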

Creating Classification Model with SVM

In [35]:
from sklearn.svm import LinearSVC
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=80000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,

Testing the Classification Model

In [36]:

In [37]:
for image_pat in image_paths_test:
    keypoints_test,descriptor_test= orb.compute(image, kp)
In [38]:
In [39]:
from scipy.cluster.vq import vq
for i in range(len(image_paths_test)):
    for w in words:
In [40]:
array([[ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 4.,  4.,  1., ...,  0.,  3.,  4.],
       [ 1.,  6.,  2., ...,  1.,  2.,  1.],
       [ 3.,  2.,  1., ..., 18.,  0.,  1.],
       [ 2.,  2., 11., ...,  1.,  3.,  2.],
       [ 0.,  3.,  3., ...,  2.,  0.,  2.]], dtype=float32)
In [41]:
In [42]:
for i in y_test:
    if i==1:
In [43]:
for i in clf.predict(test_features):
    if i==1:
In [44]:
['Cat', 'Dog', 'Dog', 'Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog', 'Cat', 'Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat', 'Dog', 'Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Dog', 'Cat', 'Dog']
In [45]:
['Dog', 'Cat', 'Dog', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Cat', 'Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Dog', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat', 'Dog']
In [46]:
array([0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0])
In [47]:



As the results above show, we achieved an accuracy of 65% with this classical bag of visual words technique. Deep learning models have since raised the bar to more than 90% accuracy, but before that, accuracy in the range of 65% to 75% was the benchmark for older techniques such as this one.
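
The 65% figure can be checked against the two label lists printed earlier. The sketch below assumes that the first list holds the true classes and the second the predictions, which is inferred from the order of the cells. The labels are abbreviated to "C" and "D" for compactness.

```python
from collections import Counter

# Labels transcribed from the two printed lists ("C" = Cat, "D" = Dog).
y_true = list("CDDCDDCCDD" "DCCDDCDCDC" "CCDDDCDCCD" "DCCCDDDDCD")
y_pred = list("DCDCDDDCCD" "DCCDDCDDCC" "CDCDDDDDDD" "CCDCCDDDCD")

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(correct, len(y_true), accuracy)  # 26 40 0.65

# Confusion counts keyed by (true label, predicted label).
confusion = Counter(zip(y_true, y_pred))
print(confusion[("C", "C")], confusion[("C", "D")])  # Cats: 10 right, 8 called Dog
print(confusion[("D", "D")], confusion[("D", "C")])  # Dogs: 16 right, 6 called Cat
```

Under that assumption, 26 of the 40 test images are classified correctly, and the errors lean towards predicting Dog.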

Reference –

[1] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 524–531, IEEE, 2005.


  • Sachin Mohan

    My name is Sachin Mohan, and I am an undergraduate student of Computer Science and Engineering. My area of interest is artificial intelligence, specifically deep learning and machine learning. I have attended various online and offline courses on machine learning and deep learning from different national and international institutes. My interest in these fields led me to intern at ISRO, and I was also the 1st runner-up in the TCS EngiNX 2019 contest. I love to share my knowledge and experience, and my philosophy of learning is "learning by doing", which is why I believe in education that includes both theoretical and practical knowledge.
