Scene Recognition with Bag of Words

Implemented by Andrew Scheff (ajscheff) for CSCI1430, Fall 2011

"Jade must be chiseled before it can be considered a gem."
                                                 -Chinese Proverb

In this project, we were given a large set of labeled training images spanning 15 categories and tasked with building a model for scene recognition. Given a new image, the goal is to categorize it into one of the 15 categories from the training set. We use a bag of words model that first builds up a "vocabulary" of local image features, or "visual words," and then compares images by which visual words they contain.

The Algorithm

The first step is to build our vocabulary of visual words. We need a vocabulary that is large enough to describe a wide variety of local features, but not so large that it separates two features that are only slightly different and should be considered the same. This is the classic bias-variance trade-off. The vocabulary is built by taking a random sample of local features from all training images and running k-means on these features, where k is the number of words we want in the vocabulary. Our local features are 128-dimensional vectors representing SIFT descriptors.
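As a rough illustration, here is a minimal Python sketch of the vocabulary-building step. It uses scikit-learn's k-means as a stand-in clustering implementation; the function name, sample size, and the assumption that dense SIFT descriptors are extracted elsewhere are all illustrative choices, not a description of the actual project code.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptor_sets, vocab_size=200, sample_size=100000):
        # descriptor_sets: list of (n_i, 128) arrays of SIFT descriptors, one per training image.
        # Returns a (vocab_size, 128) array of cluster centers -- the visual word vocabulary.
        all_descriptors = np.vstack(descriptor_sets)
        # Subsample so k-means stays tractable on a large training set.
        if len(all_descriptors) > sample_size:
            idx = np.random.choice(len(all_descriptors), sample_size, replace=False)
            all_descriptors = all_descriptors[idx]
        kmeans = KMeans(n_clusters=vocab_size, n_init=5).fit(all_descriptors)
        return kmeans.cluster_centers_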

The next step is analyzing the training database and training the SVM. Now that we have our vocabulary of visual words, we can describe any image by which visual words appear in it. For each image in the training set we extract a dense set of local features, and for each feature we find the word in our vocabulary that best describes it. Then we build a histogram over these words that describes the image. Two different images and their visual word histograms are shown below. Notice that the different images are represented by very different visual word distributions.
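To make the histogram step concrete, here is a hedged sketch of how an image's bag-of-words descriptor could be computed, assuming the vocabulary from the previous step and the image's dense SIFT descriptors are already available (the function and variable names are mine, for illustration only).

    import numpy as np
    from scipy.spatial.distance import cdist

    def bag_of_words_histogram(descriptors, vocabulary):
        # descriptors: (n, 128) dense SIFT features from one image.
        # vocabulary:  (k, 128) cluster centers from the k-means step.
        # Assign each local feature to its nearest visual word.
        distances = cdist(descriptors, vocabulary)    # (n, k) pairwise distances
        words = np.argmin(distances, axis=1)          # nearest word index per feature
        # Count word occurrences and normalize so image size doesn't matter.
        hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
        return hist / hist.sum()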


These histograms, along with the labels of the images they were built from, serve as the training data for a linear SVM. We have 15 categories, so we train 15 different one-vs-all classifiers. When given a test image, we compute its visual word histogram and pick the category whose classifier scores that histogram most confidently.
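Here is a rough sketch of the one-vs-all training and classification logic, using scikit-learn's LinearSVC as a stand-in linear SVM; the regularization constant C and the helper names are placeholders, not values or code from my experiments.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_one_vs_all(train_histograms, train_labels, categories, C=1.0):
        # Train one binary linear SVM per category: that category vs. everything else.
        classifiers = {}
        for category in categories:
            binary_labels = (np.asarray(train_labels) == category).astype(int)
            classifiers[category] = LinearSVC(C=C).fit(train_histograms, binary_labels)
        return classifiers

    def classify(test_histogram, classifiers):
        # Pick the category whose classifier is most confident (largest signed margin).
        scores = {category: clf.decision_function(test_histogram.reshape(1, -1))[0]
                  for category, clf in classifiers.items()}
        return max(scores, key=scores.get)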

Results

I started with the suggested parameters (local feature size 4, spaced out by a step of 8, and a vocabulary of 200 words) and tried a couple of variations for science. Below is a summary of my results. Each table shows the percentage of test images classified correctly for each category. In the accompanying color matrix, the warmness of the color in cell (a, b) corresponds to the number of images from category a (the row) classified into category b (the column). As you can see, the diagonal entries of these color matrices are warmer, which means we're getting many correct classifications (an image assigned to the category it actually belongs to).
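For reference, a confusion matrix like the color grids described above can be drawn with a few lines of matplotlib. This sketch assumes the true and predicted labels are lists of category indices; the colormap choice is arbitrary.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_confusion(true_labels, predicted_labels, category_names):
        # Rows = true category, columns = predicted category; warmer color = more images.
        n = len(category_names)
        confusion = np.zeros((n, n))
        for t, p in zip(true_labels, predicted_labels):
            confusion[t, p] += 1
        plt.imshow(confusion, cmap='hot', interpolation='nearest')
        plt.xticks(range(n), category_names, rotation=90)
        plt.yticks(range(n), category_names)
        plt.colorbar()
        plt.show()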

Suggested Parameters: size 4, step 8, vocab size 200

Suburb 94% | Coast 78% | Forest 95% | Highway 80% | Inside City 62%
Mountain 81% | Open Country 40% | Street 56% | Tall Building 75% | Office 89%
Bedroom 45% | Industrial 31% | Kitchen 51% | Living Room 9% | Store 52%
mean accuracy: 62.53%            running time: 26 minutes

Increased feature size and spacing: size 8, step 16, vocab size 200

Suburb 84% | Coast 73% | Forest 94% | Highway 79% | Inside City 57%
Mountain 79% | Open Country 51% | Street 64% | Tall Building 76% | Office 86%
Bedroom 48% | Industrial 26% | Kitchen 63% | Living Room 28% | Store 54%
mean accuracy: 64.13%            running time: 39 minutes

Increased vocab size: size 4, step 8, vocab size 300

Suburb 92% | Coast 78% | Forest 94% | Highway 77% | Inside City 64%
Mountain 82% | Open Country 39% | Street 54% | Tall Building 72% | Office 90%
Bedroom 44% | Industrial 34% | Kitchen 51% | Living Room 7% | Store 49%
mean accuracy: 61.80%            running time: 30 minutes

The best results I got were achieved with the larger feature size. Small features are worst for the living room images, which typically have large objects in the foreground and very little similarity at the scale of small features. With the larger feature size I was able to achieve a better-than-chance result for this category.