Project 3. Scene recognition

Bag of words

The hypothesis of the project: given images from several contextual classes, we can extract features from them, cluster the features, and store the cluster centers as a vocabulary. If every image is then described by a histogram of feature frequencies, we can train a classifier such as an SVM and build a model that assigns a class label to images it has never seen before.

Scene recognition is one of the core vision tasks. For this project we build a classifier based on the bag-of-words approach, where the "words" are visual features. The spatial arrangement of features within an image does not matter in this paradigm; what matters is the distribution of features among images of the same class.

First we collect features from the training images. A larger vocabulary generally makes it easier to discriminate between classes; the downside of a large vocabulary is slower clustering. K-means is used to cluster the collected features into a visual vocabulary. Once the vocabulary is built from all available training images, we compute a histogram for each image describing which vocabulary features it contains. We then use an SVM to build a classification model from the histograms belonging to each class. Finally, we test the algorithm on new, previously unseen images and try to classify them into one of the predefined classes.

The hypothesis has been confirmed: an accuracy of 0.6107 was achieved. The confusion matrix shows strong values on the diagonal, which means that most images are classified correctly into the classes they belong to (see Image 1), although some classes are confused more often than others, such as 'kitchen' and 'forest'. This issue could likely be mitigated by increasing the vocabulary size.
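The clustering step described above can be sketched as a naive Lloyd's k-means in plain Python. This is only a stand-in for VLFeat's optimized implementation; the function name and parameters here are illustrative, not part of the project code:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster feature vectors into k visual words (naive Lloyd's algorithm).

    Illustrative stand-in for VLFeat's k-means; real SIFT descriptors
    would be 128-dimensional and far more numerous.
    """
    rng = random.Random(seed)
    # initialize centers by picking k distinct training features
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: each center becomes the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(vals) / len(cl)
                                   for vals in zip(*cl))
    return centers
```

The returned centers are the visual vocabulary; in the actual pipeline they are fed into a KD-tree so that new features can be matched to words quickly.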

The algorithm of the program:
VLFeat paths are set to be able to call the SIFT, Kmeans and KDTrees related functions
read all filenames
set vocabulary size
build vocabulary:
      for every image:
           read it
           convert to single precision
           extract SIFT features
           append to feature matrix
      convert feature matrix to single precision
      cluster using kmeans
      return vocabulary
build a forest out of vocabulary using KDTree
for every class and image in it:
      make histogram:
           read image
           convert to gray scale
           extract SIFT features
           query the forest with the extracted features
           build a histogram from the results
           normalize it by dividing by its largest value
use SVM to classify histograms and obtain models
classify every test image
visualize it and get the score
calculate confusion matrix


Image 1. Confusion matrix for the trained classes





Page owner: Georgy Megrelishvili. Date October 24, 2011