CS 143 / Project 3: Scene recognition with bag of words / Andrew Ayer <andrew>

Background

The goal of this project was to implement a simple scene classifier using a bag of words model. Features are extracted from training images and used to build a visual vocabulary. An SVM is trained using histograms of "word" frequency in each training image. The SVM is then used to classify a test image based on the histogram of "word" frequency in that image.

Algorithm

The algorithm is implemented as described in the project handout.

Creating Visual Vocabulary

In this step, every training image is fed to the 'vlfeat' library to extract SIFT descriptors. Parameters include the sigma value for smoothing the image, the bin size, and the step size; I used 0.1, 4, and 8 respectively. Each training image produces around 1,000 descriptors. The descriptors for all images are clustered using K-Means to create a visual vocabulary. I used a vocabulary size of 200 words. The descriptors of each image can optionally be randomly sampled to speed up the clustering. I experimented with sampling 100 descriptors per image, but this gave only a modest speedup at the expense of some accuracy. Since the vocabulary can be generated once and saved to a file, I decided random sampling wasn't worth using.
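The clustering step can be illustrated with a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. This is not the code used for the project, which relied on 'vlfeat'; the function names `build_vocabulary`, `nearest_index`, and `squared_distance`, the iteration count, and the random initialization are all assumptions made for illustration. Descriptors are taken to be plain sequences of numbers.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance between two descriptors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_index(point, centers):
    # Index of the center closest to the given descriptor.
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def build_vocabulary(descriptors, vocab_size, iterations=20, seed=0):
    # Hypothetical helper: cluster descriptors into vocab_size visual words.
    rng = random.Random(seed)
    # Initialize the centers with randomly chosen descriptors.
    centers = [list(d) for d in rng.sample(descriptors, vocab_size)]
    for _ in range(iterations):
        # Assignment step: group each descriptor with its nearest center.
        clusters = [[] for _ in centers]
        for d in descriptors:
            clusters[nearest_index(d, centers)].append(d)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers
```

With the settings described above, `descriptors` would hold the SIFT descriptors pooled from all training images and `vocab_size` would be 200. A real run over that many descriptors would want an optimized K-Means rather than this pure-Python loop.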

Creating Histograms for Training Images

In this step, SIFT descriptors are found for each training image, as described above. For each descriptor, the nearest visual word is found, and a count is kept of how many times each word occurs in the training image. These counts form the image's histogram. The histogram is normalized such that all counts are between 0 and 1000, so the size of an image (and hence the number of descriptors it produces) does not matter.
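A minimal sketch of the histogram step follows. The text does not say exactly how the counts are scaled into the 0 to 1000 range, so this sketch assumes the bins are scaled to sum to 1000; the function name `word_histogram` and the `total` parameter are hypothetical.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two descriptors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def word_histogram(descriptors, vocabulary, total=1000.0):
    # Count how often each visual word is the nearest word to a descriptor.
    counts = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(d, vocabulary[i]))
        counts[nearest] += 1
    n = len(descriptors)
    if n == 0:
        return [0.0] * len(vocabulary)
    # Assumed normalization: scale so the bins sum to `total`, which makes
    # the histogram independent of how many descriptors the image produced.
    return [total * c / n for c in counts]
```

For example, an image with four descriptors, three of which fall nearest to the first of two words, gives the histogram `[750.0, 250.0]`.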

Training SVM

The histogram of each training image, along with its label, is fed to a linear SVM for training. I experimented with many different values for the lambda parameter to the SVM. I found that increasing lambda improved accuracy greatly, though the improvement leveled off at around 256, which is the value I settled on for lambda.
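To show where lambda enters the training, here is a rough sketch of a one-vs-all linear SVM trained with a Pegasos-style stochastic subgradient method. It is an assumption-laden illustration, not the solver used in the project: the function names, the epoch count, and the omission of a bias term are all choices made here for brevity, and the effect of any particular lambda value depends on the solver and on the scale of the histograms.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_binary_svm(features, signs, lam, epochs=50, seed=0):
    # Pegasos-style updates for a linear SVM without a bias term.
    # signs[i] is +1 or -1; lam is the regularization weight (lambda).
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    order = list(range(len(features)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            x, y = features[i], signs[i]
            violated = y * dot(w, x) < 1.0  # hinge loss is active
            # Regularization shrink: (1 - eta * lam) with eta = 1 / (lam * t).
            shrink = 1.0 - 1.0 / t
            w = [shrink * wj for wj in w]
            if violated:
                step = 1.0 / (lam * t)
                w = [wj + step * y * xj for wj, xj in zip(w, x)]
    return w

def train_one_vs_all(features, labels, lam=256.0):
    # One binary classifier per scene label; 256 is the value from the text.
    classes = sorted(set(labels))
    return {c: train_binary_svm(features,
                                [1 if l == c else -1 for l in labels], lam)
            for c in classes}

def classify(models, x):
    # Pick the label whose classifier gives the highest score.
    return max(models, key=lambda c: dot(models[c], x))
```

Here `features` would be the list of training histograms and `labels` the matching scene names; `classify` is then applied to the histogram of each test image.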

Results

I achieved an accuracy of 0.6167. The confusion matrix is below: