Computer Vision, Project 3: Scene Recognition with Bag of Words

Bryce Richards


Project Description

Scene recognition is the task of classifying a photograph as belonging to a particular scene category -- forest, bathroom, mountains, etc. The bag of words method, inspired by natural language processing, classifies a scene based only (or primarily) on the frequency of "visual words" present in that scene. In this project we implement the basic pipeline of Lazebnik et al.'s 2006 paper, available at http://www.di.ens.fr/willow/pdfs/cvpr06b.pdf.


Algorithm Design

The algorithm consists of four main steps. First, we collect tens or hundreds of thousands of local features from the training images and cluster them into a few hundred visual words. Then, for each training image, we build a histogram recording the distribution of the visual words contained in that image. We use these labeled histograms to train a linear SVM. Lastly, we compute the histograms of the test images and use the SVM to classify each test image as belonging to one of the 15 scene categories. The results are displayed in a confusion matrix that shows the percentage of images from each scene that are correctly identified. More details about each step of the process follow.

Step 1: Build Visual Vocabulary. We used SIFT descriptors to collect local features from each of the training images, sampling the images densely at 8-pixel intervals. From all the collected features, we then sampled 100,000 in total. To ensure that we sampled from every training image, we randomly picked 100,000 / (number of training images per class * number of classes) features from each image. Then, to create the visual vocabulary, we clustered all of these features using k-means clustering, with k = vocabulary size = 200.
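Since our actual implementation relied on the course starter code, the snippet below is only a minimal Python sketch of this step; OpenCV's SIFT and scikit-learn's KMeans are assumed stand-ins, not the libraries we used.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def dense_sift(gray, step=8):
        """SIFT descriptors on a dense grid with step-pixel spacing."""
        sift = cv2.SIFT_create()
        keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                     for y in range(step, gray.shape[0] - step, step)
                     for x in range(step, gray.shape[1] - step, step)]
        _, desc = sift.compute(gray, keypoints)
        return desc  # shape: (num_keypoints, 128)

    def build_vocabulary(image_paths, vocab_size=200, total_samples=100_000):
        per_image = total_samples // len(image_paths)  # even sampling per image
        samples = []
        for path in image_paths:
            desc = dense_sift(cv2.imread(path, cv2.IMREAD_GRAYSCALE))
            keep = np.random.choice(len(desc), min(per_image, len(desc)), replace=False)
            samples.append(desc[keep])
        features = np.vstack(samples)  # roughly 100,000 x 128
        return KMeans(n_clusters=vocab_size).fit(features).cluster_centers_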

Step 2: Build Histograms of Training Images. We converted each training image into a histogram counting the number of times each visual vocabulary word was found in the image. For every SIFT feature found in an image, we determined which of the 200 cluster centers it was closest to, and added 1 to that word's histogram bin. We then normalized each histogram so that its bins sum to 1.
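A sketch of this hard-assignment histogram under the same assumptions, with SciPy's cdist as a stand-in for the nearest-center lookup:

    import numpy as np
    from scipy.spatial.distance import cdist

    def bag_of_words_histogram(desc, vocab):
        """desc: (num_features, 128) SIFT descriptors; vocab: (200, 128) centers."""
        nearest = cdist(desc, vocab).argmin(axis=1)  # closest word per feature
        hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
        return hist / hist.sum()                     # bins sum to 1

Normalizing each histogram means images that happen to yield more dense-SIFT features do not produce systematically larger bin counts than smaller images.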

Step 3: SVM Generation. We fed all of the training images' histograms (each labeled with the scene class of its image) into a linear SVM. We did not find that adjusting the SVM parameters had much effect on classification accuracy, so we left them all as they were in the starter code.
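Illustratively, the training step amounts to the following; LinearSVC and its default parameters are assumptions, not the starter code's SVM:

    from sklearn.svm import LinearSVC

    def train_scene_svm(train_hists, train_labels, C=1.0):
        """Train a one-vs-rest linear SVM on labeled training histograms."""
        return LinearSVC(C=C).fit(train_hists, train_labels)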

Step 4: Classify Test Images. As we did for the training images, we converted all the test images into histograms of visual word counts. We then classified each image with our linear SVM. This gives us a "confusion matrix" recording how well we identified each class of test images: the diagonal entries record how often images of a given class were correctly assigned to that class. We displayed this matrix and averaged its diagonal entries, which gives the percentage of the time our bag of words model correctly identified a scene.
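A sketch of this evaluation step, with scikit-learn's confusion_matrix assumed as a stand-in:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def evaluate(svm, test_hists, test_labels, categories):
        preds = svm.predict(test_hists)
        cm = confusion_matrix(test_labels, preds, labels=categories)
        cm = cm / cm.sum(axis=1, keepdims=True)  # row-normalize: per-class rates
        return cm, np.diag(cm).mean()            # mean diagonal = overall accuracy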





Results and Extra Credit
Our initial implementation performed reasonably well, achieving an accuracy of 0.620. Below is a visualization of the confusion matrix.


Baseline Result




Varying Vocabulary Size: To determine the optimal size of the visual vocabulary, we tested our baseline algorithm with visual vocabularies of size 10, 20, 50, 100, 200, 500, 1000, and 5000. As one might expect, the algorithm performed worse with the very small vocabulary sizes. However, somewhat surprisingly, its results were nearly identical for vocabularies of sizes 100, 200, 500, and 1000. Below is a table showing performance by vocabulary size (a code sketch of the sweep follows it), and below that are the confusion matrices for vocabularies of size 10, 100, 500, and 5000.

vocab size    accuracy
10            0.4660
20            0.5133
50            0.5813
100           0.6180
200           0.6147
500           0.6193
1000          0.6180
5000          0.5813
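The sweep itself just reruns the pipeline at each vocabulary size. A sketch, reusing the hypothetical helpers from the earlier snippets (train_paths, test_paths, train_labels, test_labels, and categories are placeholder names, not variables from our code):

    import cv2
    import numpy as np

    def histogram_for_image(path, vocab):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        return bag_of_words_histogram(dense_sift(gray), vocab)

    for vocab_size in [10, 20, 50, 100, 200, 500, 1000, 5000]:
        vocab = build_vocabulary(train_paths, vocab_size=vocab_size)
        train_hists = np.array([histogram_for_image(p, vocab) for p in train_paths])
        test_hists = np.array([histogram_for_image(p, vocab) for p in test_paths])
        svm = train_scene_svm(train_hists, train_labels)
        _, acc = evaluate(svm, test_hists, test_labels, categories)
        print(f"vocab size {vocab_size}: accuracy {acc:.4f}")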



Vocabulary Size = 10, Accuracy = 0.4660



Vocabulary Size = 100, Accuracy = 0.6180



Vocabulary Size = 500, Accuracy = 0.6193



Vocabulary Size = 5000, Accuracy = 0.5813


Cross-Validation: To verify that our measured performance was reliable, we ran 10 trials of cross-validation. That is, from all the images in each scene category, we randomly picked disjoint sets of 100 images to serve as the training and testing images, and then ran the algorithm. This way, we can be sure that the algorithm's baseline performance was not distorted by some artificial difference between the predesignated training and testing images. The results of the 10 trials are displayed below, along with the average and standard deviation of the performances; a sketch of the splitting procedure follows the table.

trial                 accuracy
1                     0.6340
2                     0.6267
3                     0.6087
4                     0.6233
5                     0.6460
6                     0.6307
7                     0.6320
8                     0.6120
9                     0.6180
10                    0.6413
average               0.6273
standard deviation    0.0121
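One way the per-trial split could be drawn, as a sketch (paths_by_category and run_trial are hypothetical placeholders for the per-category image lists and the full train-and-evaluate pipeline):

    import numpy as np

    def random_split(category_paths, rng, n_train=100, n_test=100):
        """Disjoint random train/test image sets drawn from one scene category."""
        order = rng.permutation(len(category_paths))
        train = [category_paths[i] for i in order[:n_train]]
        test = [category_paths[i] for i in order[n_train:n_train + n_test]]
        return train, test

    rng = np.random.default_rng()
    accuracies = []
    for trial in range(10):
        # paths_by_category: hypothetical dict mapping scene name -> image paths
        splits = {c: random_split(p, rng) for c, p in paths_by_category.items()}
        accuracies.append(run_trial(splits))  # run_trial: hypothetical wrapper
    print(f"average {np.mean(accuracies):.4f}, std dev {np.std(accuracies):.4f}")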



Soft Histogram: We attempted to boost the classifier's accuracy by implementing a soft histogram. That is, for each feature collected from an image, instead of adding 1 to the bin of the vocabulary word closest to that feature, we added weight to several vocabulary words' bins. We tried several ways of doing this, but none of them boosted accuracy. For instance, the simplest method we tried was to add 1/k^2 to a vocabulary word's bin if that vocabulary word was the k-th closest word to the feature. So that we weren't adding to completely irrelevant bins, we only did this for the three closest vocabulary words. Slightly more sophisticated variations of this approach proved equally fruitless. To boost accuracy, we likely should have followed the "kernel codebook encoding" methods of Chatfield et al., but we were short on time. A sketch of our simple weighting scheme is below, followed by the result of running the algorithm with it.
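A minimal sketch of this 1/k^2 weighting, under the same Python assumptions as the earlier snippets:

    import numpy as np
    from scipy.spatial.distance import cdist

    def soft_histogram(desc, vocab, n_nearest=3):
        """Add 1/k^2 to the bins of each feature's k-th closest word, k = 1..3."""
        dists = cdist(desc, vocab)                         # (num_features, vocab_size)
        ranked = np.argsort(dists, axis=1)[:, :n_nearest]  # 3 closest words per feature
        hist = np.zeros(len(vocab))
        for k in range(n_nearest):
            np.add.at(hist, ranked[:, k], 1.0 / (k + 1) ** 2)  # weights 1, 1/4, 1/9
        return hist / hist.sum()

Note that the hard-assignment histogram is the special case n_nearest = 1, since the closest word receives weight 1.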


Results with a soft histogram, accuracy = 0.6107



Conclusion
Our bag of words scene recognition algorithm performed modestly well, with a typical accuracy between 62 and 63 percent. For better results, we should have devoted more attention to tuning parameters and used an SVM with a more sophisticated kernel. However, even with saving and loading intermediate data, running the algorithm took a very long time, which made adjusting parameters tedious. It is also interesting to note which scenes our algorithm performed best and worst on: it had an easy time recognizing Inside City and Highway images, and a difficult time recognizing Industrial images.