CS143 Project 3: Scene recognition with bag of words

Andy Loomis
October 24th, 2011

Overview

For this project I implemented a bag of words model to classify scenes into one of 15 different categories. The dataset I used was first introduced in Lazebnik et al. 2006 and contains 15 scene categories with several hundred images each, for a total of 4485 images. The basic bag of words pipeline is divided into 5 steps.

  1. Collect a large set of dense SIFT descriptors from the training images.
  2. Select a random subset of these features and cluster them into a visual vocabulary using k-means.
  3. Build a histogram of words from the vocabulary for each image in the training set.
  4. Train an SVM on these histograms.
  5. Build a histogram of words from the vocabulary for each image in the test set, and use the SVM to classify each image.
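Steps 2 and 3 above can be sketched in NumPy. This is an illustration only, not the project's actual code: the function names are my own, and a naive k-means stands in for whatever clustering implementation was actually used.

```python
import numpy as np

def build_vocabulary(features, k, iters=10, seed=0):
    """Cluster a sample of SIFT descriptors into k visual words (naive k-means)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center, then recompute centers.
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(features, vocab):
    """Histogram of nearest-word assignments, normalized to sum to 1."""
    d = np.linalg.norm(features[:, None] - vocab[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(vocab))
    return hist / hist.sum()
```

In the actual pipeline, `build_vocabulary` would be run once on the pooled random subset of training features, and `bag_of_words` once per image.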

Testing

For extra credit I implemented a cross-validation procedure to test the accuracy of the bag of words pipeline. The cross-validation consisted of 10 independent splits. For each split, 100 training images and 100 testing images were randomly selected from each category. The reported performance is the average classification rate across all categories and splits.
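A split procedure like the one described could look as follows (a hypothetical sketch; `make_splits` is not from the project code):

```python
import numpy as np

def make_splits(labels, n_splits=10, n_train=100, n_test=100, seed=0):
    """For each split, sample disjoint train/test index sets per category."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    splits = []
    for _ in range(n_splits):
        train, test = [], []
        for c in np.unique(labels):
            # Shuffle this category's image indices, then carve off train/test.
            idx = rng.permutation(np.flatnonzero(labels == c))
            train.extend(idx[:n_train])
            test.extend(idx[n_train:n_train + n_test])
        splits.append((np.array(train), np.array(test)))
    return splits
```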

Results 1

For the basic pipeline I used a vocabulary of 200 visual words, generated from 512 randomly selected features per training image, and a linear SVM to classify the test images. This gave a classification rate of 65.39 ± 1.33%.

Improved Pipeline

To improve the above results I made some significant changes to the basic pipeline. I implemented the spatial pyramid and pyramid match kernel described in Lazebnik et al. 2006. The spatial pyramid places increasingly finer grids over each image and generates a histogram in each grid cell, which encodes some spatial information into the histogram representation. A pyramid of depth 2 produces 5 histograms (1 for the first level and 4 for the second); I used a pyramid of depth 3, which produces 21 histograms per image.
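Assuming keypoint coordinates normalized to [0, 1), the pyramid histograms could be built like this (a sketch with hypothetical names, not the project's code; level l uses a 2^l × 2^l grid, so depth 3 yields 1 + 4 + 16 = 21 histograms):

```python
import numpy as np

def spatial_pyramid(xy, words, vocab_size, depth=3):
    """Stack one visual-word histogram per grid cell at each pyramid level.

    xy    : (N, 2) keypoint coordinates in [0, 1)
    words : (N,) visual-word index of each keypoint
    """
    hists = []
    for l in range(depth):
        n = 2 ** l  # grid is n x n at level l
        for i in range(n):
            for j in range(n):
                in_cell = ((xy[:, 0] >= j / n) & (xy[:, 0] < (j + 1) / n) &
                           (xy[:, 1] >= i / n) & (xy[:, 1] < (i + 1) / n))
                hists.append(np.bincount(words[in_cell], minlength=vocab_size))
    return np.concatenate(hists)
```

Every feature lands in exactly one cell per level, so each level's histograms jointly sum to the feature count.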

With a vocabulary of 200 words and a spatial pyramid of depth 3, the feature vector for each image has 4200 dimensions (21 histograms × 200 words). In this space a linear SVM was no longer adequate: the maximum classification rate I could achieve was 45.56%. Switching from the linear SVM to the nonlinear pyramid match kernel improved the classification rate to almost 80%.
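The pyramid match kernel sums a histogram intersection at each level, with the weight doubling at each finer level as in Lazebnik et al. 2006. A minimal sketch over concatenated pyramid vectors (a hypothetical helper, assuming level l contributes 4^l histograms, with the coarsest two levels sharing the smallest weight):

```python
import numpy as np

def pyramid_match_kernel(x, y, vocab_size, depth=3):
    """Weighted histogram intersection over spatial pyramid levels."""
    k = 0.0
    offset = 0
    for l in range(depth):
        size = (4 ** l) * vocab_size  # histograms at this level, flattened
        # Level weights for depth 3: 1/4, 1/2, 1 (finest level weighted most).
        w = 0.5 ** (depth - 1) if l == 0 else 0.5 ** (depth - 1 - l)
        k += w * np.minimum(x[offset:offset + size],
                            y[offset:offset + size]).sum()
        offset += size
    return k
```

In practice the kernel would be evaluated over all pairs of images to form a precomputed Gram matrix for the SVM.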

In an attempt to further improve the results I also experimented with the soft assignment of features to vocabulary bins, using two techniques presented in Gemert et al. 2008: kernel codebooks and codeword uncertainty. Both use a Gaussian kernel to distribute a single feature across multiple bins. The codeword uncertainty method additionally scales the contributions to each bin so that each feature contributes a constant total amount across all of the bins.
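The two soft-assignment variants differ only in a per-feature normalization, which can be seen in a small sketch (hypothetical helper, not the project's code; `sigma` is the Gaussian standard deviation discussed below):

```python
import numpy as np

def soft_histograms(features, vocab, sigma):
    """Kernel codebook and codeword uncertainty histograms (Gemert et al. 2008)."""
    # Gaussian kernel response of every feature to every visual word.
    d = np.linalg.norm(features[:, None] - vocab[None], axis=2)
    K = np.exp(-d ** 2 / (2 * sigma ** 2))
    # Kernel codebook: sum raw responses per bin.
    kernel_codebook = K.sum(axis=0)
    # Codeword uncertainty: normalize so each feature contributes exactly 1 in total.
    codeword_uncertainty = (K / K.sum(axis=1, keepdims=True)).sum(axis=0)
    return kernel_codebook, codeword_uncertainty
```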

Results 2

The table below shows the results of the improved pipeline using the three feature assignment techniques described above. Hard assignment of each feature to a single bin gives the best result, at nearly 80%, while the codeword uncertainty and kernel codebook techniques both reach roughly 75%.

  Hard Assignment    Codeword Uncertainty    Kernel Codebook
  79.91 ± 0.96%      75.65 ± 0.78%           75.46 ± 0.78%

The only free parameter in the Gaussian kernel methods is the standard deviation of the distribution; a larger standard deviation smooths the histograms to a greater extent. The Gemert paper suggests a standard deviation of 200 for the codeword uncertainty method and 100 for the kernel codebook method. I experimented with some other values of sigma, but none of them seemed to perform better than hard assignment. Below are the confusion matrix for hard assignment and the per-category classification results, averaged over the ten trials.
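The averaged per-category rates come from a row-normalized confusion matrix, whose diagonal mean is the reported classification rate. A minimal sketch of that computation (hypothetical helper, not the project's code):

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Row i gives the fraction of category-i test images assigned to each class."""
    M = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        M[t, p] += 1
    return M / M.sum(axis=1, keepdims=True)
```

Averaging these matrices over the ten cross-validation trials gives the figure shown.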