Project 3: Scene recognition with bag of words

Project Overview

Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based only on a histogram of the frequency of visual words. Visual words are identified by clustering a large corpus of example features.

The major steps are:

Collect a lot of features.
Use k-means to cluster those features into a visual vocabulary.
For each training image, build a three-level spatial pyramid, compute a visual-word histogram for every cell of the pyramid, and concatenate the histograms.
Feed these histograms to an SVM.
Build the same kind of histogram for each test image and classify it with the SVM just trained.
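
The last two steps can be illustrated with a small sketch. It is written in plain Python rather than the MATLAB used in this project, and it makes several assumptions that the write-up does not state: a linear SVM, binary labels (+1 / -1), and training by batch subgradient descent on the regularised hinge loss. The function names, the hyperparameters (lam, lr, epochs) and the toy 3-bin histograms are all hypothetical. A multi-class scene classifier would typically combine several such binary classifiers, for example one per category.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss.

    samples: list of equal-length feature lists (here: histograms).
    labels:  list of +1 / -1 class labels.
    All hyperparameters are illustrative assumptions.
    """
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # this sample violates the margin
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy "histograms": class +1 is dominated by word 0, class -1 by word 2.
train_x = [[8, 1, 1], [7, 2, 1], [9, 0, 1], [1, 1, 8], [1, 2, 7], [1, 0, 9]]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)

# Classify two unseen toy histograms.
test_predictions = [predict(w, b, [9, 1, 0]), predict(w, b, [0, 1, 9])]
```

In practice the histograms would be the concatenated pyramid histograms, and a library SVM would normally replace this hand-written trainer.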

Build Vocabulary

In this part, I use k-means to cluster SIFT features into a vocabulary of visual words. I use 100 training images from each scene category. For each image, I extract its SIFT features with vl_dsift, then cluster the features gathered from all the training images with vl_kmeans. The resulting cluster centers are the vocabulary.
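
The clustering step can be sketched as follows. This is not the vl_kmeans call used in the project but a minimal plain-Python version of Lloyd's k-means iteration, run on toy 2-D points instead of real 128-dimensional SIFT descriptors. The function names, the fixed iteration count and the random seed are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def build_vocabulary(features, vocab_size, iterations=20, seed=0):
    """Cluster features with Lloyd's algorithm; return the cluster centers."""
    rng = random.Random(seed)
    # Initialise the centers with randomly chosen features.
    centers = [list(f) for f in rng.sample(features, vocab_size)]
    for _ in range(iterations):
        # Assignment step: group every feature under its nearest center.
        clusters = [[] for _ in range(vocab_size)]
        for f in features:
            clusters[nearest_center(f, centers)].append(f)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[i] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers


# Toy "descriptors" forming two well-separated groups.
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
vocab = build_vocabulary(features, vocab_size=2)
```

Each returned center plays the role of one visual word. With real data, the features would be the dense SIFT descriptors collected from all training images.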

Make Histogram

In this part, I use vl_kdtreebuild() to index the columns of the vocabulary. Then I use vl_kdtreequery() to find the nearest visual word for each feature and accumulate the histogram. For extra credit, I built a three-level spatial pyramid. At each pyramid level, I calculate a histogram for every subregion and finally concatenate them all together. The three-level pyramid increases the accuracy by about 12 percentage points.
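
The pyramid construction can be sketched in plain Python as below. The sketch assumes each feature has already been assigned a visual-word index (the job of the kd-tree query) and that level l splits the image into a 2^l x 2^l grid, so three levels give 1 + 4 + 16 = 21 cells. That grid layout is a common convention and an assumption here, since the write-up does not spell it out. No per-level weighting is applied, and the function name and the toy points are hypothetical.

```python
def pyramid_histogram(points, width, height, vocab_size, levels=3):
    """Concatenate per-cell visual-word histograms over all pyramid levels.

    points: iterable of (x, y, word) with 0 <= x < width, 0 <= y < height
            and 0 <= word < vocab_size.
    Level l is assumed to use a 2**l by 2**l grid of cells.
    """
    histogram = []
    for level in range(levels):
        cells = 2 ** level
        level_hist = [0] * (cells * cells * vocab_size)
        for x, y, word in points:
            # Locate the grid cell that contains this feature.
            col = min(int(x * cells / width), cells - 1)
            row = min(int(y * cells / height), cells - 1)
            cell = row * cells + col
            level_hist[cell * vocab_size + word] += 1
        histogram.extend(level_hist)
    return histogram


# Four toy features in a 100 x 100 image with a 3-word vocabulary.
points = [(10, 10, 0), (90, 10, 1), (10, 90, 1), (90, 90, 2)]
hist = pyramid_histogram(points, width=100, height=100, vocab_size=3)
```

Under these assumptions the final descriptor has 21 times the vocabulary size entries, and its first vocab_size entries are the plain bag-of-words histogram of the whole image.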

Results

The spatial pyramid method improves the prediction accuracy from 0.6107 to 0.7273 (three levels). The three-level pyramid also works better than the two-level pyramid, improving the accuracy from 0.6827 to 0.7273.

Basic result: accuracy 0.6107

Two-level spatial pyramid result: accuracy 0.6827

Three-level spatial pyramid result: accuracy 0.7273
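
The gains quoted above follow directly from the three reported accuracies. The short Python snippet below only restates that arithmetic; the variable names are chosen for illustration.

```python
# Accuracies reported in this section.
basic = 0.6107
two_level = 0.6827
three_level = 0.7273

# Absolute gains, in accuracy (multiply by 100 for percentage points).
gain_two = two_level - basic
gain_three = three_level - basic
gain_three_over_two = three_level - two_level

# Relative gain of the three-level pyramid over the basic histogram.
relative_gain_three = gain_three / basic

print(f"2-level vs basic:   +{gain_two * 100:.2f} points")
print(f"3-level vs basic:   +{gain_three * 100:.2f} points")
print(f"3-level vs 2-level: +{gain_three_over_two * 100:.2f} points")
print(f"3-level relative gain over basic: {relative_gain_three * 100:.1f}%")
```

The three-level pyramid therefore adds 11.66 percentage points over the basic histogram, which is the roughly 12% improvement mentioned earlier.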