Computer Vision, Project 3: Scene Recognition with Bag of Words
Bryce Richards
Project Description
Scene recognition is the task of classifying a photograph as belonging to a particular scene category -- forest, bathroom, mountains, etc.
The bag of words method, inspired by natural language processing, classifies a scene based only (or primarily) on the frequency of "visual words"
present in that scene. In this project we implement the basic pipeline of Lazebnik et al.'s 2006 paper, available at http://www.di.ens.fr/willow/pdfs/cvpr06b.pdf.
Algorithm Design
The algorithm consists of four main steps. The first is to collect tens or hundreds of thousands of local features from the training images, and
cluster these features into a few hundred visual words. Then, for each training image, we build a histogram recording the distribution of the
visual words that are contained in that image. We feed these histograms to a linear SVM. Lastly, we compute the histograms of the test images, and
use the SVM to classify each test image as belonging to one of the 15 scenes for which we trained the SVM. The results are displayed in a visual
matrix that shows the percentage of images from each scene that are correctly identified. More details about each step of the process follow.
Step 1: Build Visual Vocabulary We used SIFT descriptors to collect local features of
each of the training images. We sampled the images densely, at 8 pixel intervals. Of all the collected features, we then sampled 100,000 of them in
total. To ensure that we sampled from every training image, we randomly picked 100,000/(number of training images per class * number of classes) features from each image.
Then, to create the visual vocabulary, we clustered all of these features using k-means clustering, with k = vocabulary size = 200.
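The clustering step above can be sketched in plain NumPy. This is a minimal k-means loop, not the actual implementation used in the project (which presumably used a library routine); the function name and the tiny iteration count are illustrative assumptions, and real SIFT descriptors would be 128-dimensional.

```python
import numpy as np

def build_vocabulary(features, k=200, iters=10, seed=0):
    """Cluster local descriptors into k visual words with plain k-means (sketch)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the k cluster centers by sampling k distinct descriptors.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center (squared Euclidean distance).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the descriptors assigned to it.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers  # shape (k, descriptor_dim): the visual vocabulary
```

The pairwise-distance computation here materializes an (n, k) matrix, which is fine at this sketch's scale but would need chunking for 100,000 descriptors against a large vocabulary.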
Step 2: Build Histograms of Training Images We converted each training image into a histogram that
counts the number of times each visual vocabulary word was found in the image. For every SIFT feature found in an image, we found which of the
200 cluster centers it was closest to, and added 1 to that word's histogram count. We then normalized each histogram, so that the sum of all the bins'
entries is 1.
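The histogram construction just described (hard-assign each feature to its nearest word, then L1-normalize) can be sketched as follows; the function name is illustrative, not taken from the project code.

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocab):
    """Count nearest-word assignments for each descriptor; return an L1-normalized histogram."""
    descriptors = np.asarray(descriptors, dtype=float)
    vocab = np.asarray(vocab, dtype=float)
    # Nearest visual word for every descriptor (squared Euclidean distance).
    dists = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    # One count per descriptor, one bin per vocabulary word.
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # bins sum to 1
```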
Step 3: SVM Generation We fed all of the training images' histograms (labeled with the scene
class of the image) into a linear SVM. We did not find that adjusting the SVM parameters had much effect on the accuracy of the scene classifications,
so we left them all as they were in the starter code.
Step 4: Classify Test Images As we did for the training images, we converted all the
test images into histograms of visual word counts. We then classified each image with our linear SVM. This gives us a "confusion matrix" recording
how well we identified each class of test images. The diagonal entries of this matrix record how often we correctly identified members of a particular
class as belonging to that class. We displayed this matrix and averaged its diagonal, which gives the percentage of the time our bag of words
model correctly identified a scene.
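The confusion-matrix bookkeeping from Step 4 can be sketched as below. Rows are true classes, columns are predicted classes, each row is normalized to fractions, and the reported accuracy is the mean of the diagonal; the function name is illustrative.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of class-i
    test images that were classified as class j."""
    m = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        m[t, p] += 1
    return m / m.sum(axis=1, keepdims=True)

def mean_diagonal_accuracy(true_labels, pred_labels, n_classes):
    """Average per-class accuracy: the mean of the confusion matrix's diagonal."""
    return np.mean(np.diag(confusion_matrix(true_labels, pred_labels, n_classes)))
```

Note that averaging the diagonal weights every class equally, which matches an evaluation setup with the same number of test images per class.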
Results and Extra Credit
Our initial implementation performed reasonably well, achieving an accuracy of 0.620. Below is a visualization of the confusion matrix.
Baseline Result
Varying Vocabulary Size: To determine the optimal size of the visual vocabulary, we tested our baseline algorithm on visual vocabularies of size 10, 20, 50, 100, 200,
500, 1000, and 5000. As one might expect, the algorithm performed worse with the very small vocabulary sizes. However, somewhat surprisingly, its
results were almost exactly identical for vocabularies of sizes 100, 200, 500, and 1000. Below is a chart showing the performances by vocabulary size,
and below that are the confusion matrices for vocabularies of size 10, 100, 500, and 5000.
vocab size | accuracy
        10 |  0.4660
        20 |  0.5133
        50 |  0.5813
       100 |  0.6180
       200 |  0.6147
       500 |  0.6193
      1000 |  0.6180
      5000 |  0.5813
Vocabulary Size = 10, Accuracy = 0.4660
Vocabulary Size = 100, Accuracy = 0.6180
Vocabulary Size = 500, Accuracy = 0.6193
Vocabulary Size = 5000, Accuracy = 0.5813
Cross-Validation: To verify the reliability of our algorithm's performance, we ran 10 trials of cross-validation.
That is, of all the images from each scene category, we randomly picked disjoint sets of 100 images to serve as the training and testing images,
and then ran the algorithm.
This way, we can be sure that the algorithm's baseline performance was not distorted by some artificial difference between the predesignated
training and testing images. The results of the 10 trials are displayed below, along with the average and standard deviation of the performances.
trial              | accuracy
1                  | .6340
2                  | .6267
3                  | .6087
4                  | .6233
5                  | .6460
6                  | .6307
7                  | .6320
8                  | .6120
9                  | .6180
10                 | .6413
average            | .6273
standard deviation | .0121
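The per-trial split described above (disjoint random train and test sets of 100 images per class) can be sketched like this; the function name and the dict-of-paths input format are illustrative assumptions, not the project's actual data layout.

```python
import numpy as np

def random_split(paths_by_class, n_train=100, n_test=100, seed=0):
    """For each class, draw disjoint random train/test subsets of image paths."""
    rng = np.random.default_rng(seed)
    train, test = {}, {}
    for cls, paths in paths_by_class.items():
        # Shuffle indices once, then slice: the two subsets cannot overlap.
        idx = rng.permutation(len(paths))
        train[cls] = [paths[i] for i in idx[:n_train]]
        test[cls] = [paths[i] for i in idx[n_train:n_train + n_test]]
    return train, test
```

Running this with a fresh seed per trial, then averaging the resulting accuracies, reproduces the cross-validation procedure used for the table above.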
Soft Histogram: We attempted to boost the classifier's accuracy by implementing a soft histogram.
That is, for each feature collected from an image, instead of adding 1 to the bin of the vocabulary word closest to that feature, we added to several
vocabulary words' bins. We tried several ways of doing this, but none of them boosted accuracy. For instance, the simplest method we tried was
to add 1/k^2 to a vocabulary word's bin if that vocabulary word was the kth closest word to the feature. So that we weren't adding to completely
irrelevant bins, we only did this for the three closest vocabulary words. Slightly more sophisticated modifications of this approach proved
fruitless. To boost accuracy, we should have followed the "kernel codebook encoding" methods of Chatfield et al., but we were short on
time. Below is the result of our algorithm with our simple version of a soft histogram.
Results with a soft histogram, accuracy = 0.6107
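The simplest soft-assignment scheme described above (weight 1/k^2 for the kth-closest of the three nearest words) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def soft_histogram(descriptors, vocab, m=3):
    """Soft bag-of-words histogram: each descriptor adds 1/k^2 to the bin of its
    kth-closest visual word, for k = 1..m, then the histogram is L1-normalized."""
    descriptors = np.asarray(descriptors, dtype=float)
    vocab = np.asarray(vocab, dtype=float)
    dists = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    # Indices of the m closest words for every descriptor.
    nearest = np.argsort(dists, axis=1)[:, :m]
    hist = np.zeros(len(vocab))
    for k in range(m):
        # np.add.at handles repeated indices correctly (unbuffered in-place add).
        np.add.at(hist, nearest[:, k], 1.0 / (k + 1) ** 2)
    return hist / hist.sum()
```

With m = 1 this reduces to the hard-assignment histogram of Step 2, which makes it easy to A/B-test the soft variant against the baseline.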
Conclusion
Our bag of words scene recognition algorithm performed modestly well, with a typical accuracy between 62 and 63 percent. For even better results,
we should have devoted more attention to tuning parameters and used an SVM with a more sophisticated kernel. However, even with saving and loading data,
running the algorithm took a very long time, which made it tedious to adjust parameters. It is also interesting to note which scenes our algorithm
performed best and worst on. It seemed to have an easy time recognizing Inside City and Highway images, and a difficult time recognizing Industrial
images.