Bag of Words Scene Recognition
by Sam Swarr (sswarr)

CS 143 Fall 2011

Building the Vocabulary

In order to have a vocabulary, we must first have "words". In the computer vision sense, "words" are local image features that appear frequently across images. To find these words, I used vl_dsift to compute 128-dimensional SIFT descriptors on a dense grid for each image in my training set (100 images from each of 15 categories). I then pooled 600 randomly chosen descriptors from each training image into one large collection and ran vl_kmeans on it, asking for 200 clusters. The 200 cluster centers formed my vocabulary of "words".
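The pooling and clustering steps can be sketched in plain Python. This is only an illustration under stated assumptions: the project itself used vl_dsift and vl_kmeans, whereas the helper names here (sample_features, kmeans) are hypothetical, the clustering is a basic Lloyd's k-means, and the 2-dimensional toy points stand in for real 128-dimensional descriptors.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sample_features(per_image_features, per_image, rng):
    """Pool up to `per_image` randomly chosen descriptors from every image."""
    pool = []
    for features in per_image_features:
        pool.extend(rng.sample(features, min(per_image, len(features))))
    return pool

def kmeans(points, k, iterations, rng):
    """Basic Lloyd's algorithm; returns k cluster centers (the "words")."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Toy stand-in data (assumption): two "images" whose 2-D features sit
# near (0, 0) and (5, 5). The project used 600 samples per image and k = 200.
rng = random.Random(0)
images = [
    [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(20)],
    [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(20)],
]
pool = sample_features(images, per_image=10, rng=rng)
vocab = kmeans(pool, k=2, iterations=10, rng=rng)
```

With real data the same two calls would use per_image=600 and k=200, although a pure-Python loop would be far slower than vl_kmeans on a collection of that size.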

Making Histograms

The next step was to make a histogram for each training image showing how frequently each word appears in it. This was done by again computing features for each training image and then, for each feature of that image, finding the closest word in the vocabulary using vl_alldist2. Here are some example histograms:

[Figure: three Suburb images (suburb1, suburb2, suburb3) and their histograms (suburbhist1, suburbhist2, suburbhist3)]
These are three histograms for images from the Suburb category. Note how similar they are. An image's histogram is its "Bag of Words". The horizontal axis spans the 200 words of the vocabulary and the height of each spike is the frequency with which that word appears in the image. (The histograms have been normalized to account for different-sized images that may have more or fewer features.)
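As a rough sketch of the histogram step, the snippet below assigns each feature to its nearest word and normalizes the counts. The function name bag_of_words is hypothetical, a brute-force search stands in for vl_alldist2, and dividing by the total count is an assumption, since the write-up does not say which normalization was used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bag_of_words(features, vocab):
    """Count how often each vocabulary word is the nearest word to a
    feature, then normalize so the histogram sums to 1 (assumed choice)."""
    counts = [0.0] * len(vocab)
    for f in features:
        nearest = min(range(len(vocab)), key=lambda i: squared_distance(f, vocab[i]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy 3-word vocabulary and 4 features (illustrative values only).
vocab = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
features = [[1.0, 1.0], [9.0, 0.5], [0.5, -1.0], [8.0, 2.0]]
hist = bag_of_words(features, vocab)
print(hist)  # [0.5, 0.5, 0.0]
```

Two features fall nearest the first word and two nearest the second, so the histogram puts half of its mass on each and none on the third word.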

Training Classifiers

Now that we have histograms for the training images, we need to use them to define what each image category looks like in terms of our vocabulary. In other words, what does a typical Forest or typical Office image histogram look like? To do that, I used the primal_svm function with its default settings and a linear kernel. The result is 15 classifiers, one for each image category. A classifier assigns a positive or negative weight to each of the 200 words in the vocabulary, plus a bias term. Positive weights correspond to words that typify that classifier's category and negative weights correspond to words that are rare for that category.

[Figure: classifier weights for the Suburb category]
This is the classifier for the Suburb category. Note how it has positive spikes on the words that had high frequency in the histograms above.
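To make the one-classifier-per-category idea concrete, here is a minimal sketch that trains one linear classifier per category against all the others. It is not primal_svm: as a swapped-in technique it uses plain full-batch subgradient descent on an L2-regularized hinge loss, and the function names, hyperparameters, and 3-word toy histograms are all assumptions made for illustration.

```python
def train_linear_svm(histograms, labels, lam=0.01, learning_rate=0.1, epochs=200):
    """Subgradient descent on the L2-regularized hinge loss.
    `labels` holds +1 (in the category) or -1 (any other category)."""
    dim, n = len(histograms[0]), len(histograms)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(histograms, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

def train_one_vs_all(histograms, categories):
    """Train one (weights, bias) pair per category."""
    classifiers = {}
    for category in sorted(set(categories)):
        labels = [1 if c == category else -1 for c in categories]
        classifiers[category] = train_linear_svm(histograms, labels)
    return classifiers

# Toy 3-word histograms (assumption): Suburb images favor word 0,
# Forest images favor word 2.
histograms = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
categories = ["Suburb", "Suburb", "Forest", "Forest"]
classifiers = train_one_vs_all(histograms, categories)
```

On this toy data the Suburb classifier ends up with a positive weight on word 0 and a negative weight on word 2, which mirrors the positive and negative spikes described above.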

Classifying the Test Images

Now we have everything we need to start classifying test images. For each test image, compute its histogram. Then take the dot product of that histogram with each classifier's weights and add that classifier's bias, which gives one confidence value per category. The image is assigned to the category whose classifier produced the highest confidence. Hopefully, that is the classifier for the correct image category!
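The scoring rule amounts to an argmax over per-category linear scores. The sketch below assumes each classifier is stored as a (weights, bias) pair; the function name classify and the three made-up classifiers over a 3-word vocabulary are hypothetical.

```python
def classify(histogram, classifiers):
    """Score the histogram with every classifier (dot product plus bias)
    and return the best category along with all the scores."""
    scores = {
        category: sum(w_i * h_i for w_i, h_i in zip(weights, histogram)) + bias
        for category, (weights, bias) in classifiers.items()
    }
    return max(scores, key=scores.get), scores

# Made-up classifiers for illustration only.
classifiers = {
    "Suburb": ([2.0, -1.0, -1.0], -0.1),
    "Forest": ([-1.0, 2.0, -1.0], 0.0),
    "Office": ([-1.0, -1.0, 2.0], 0.2),
}
label, scores = classify([0.6, 0.3, 0.1], classifiers)
print(label)  # Suburb
```

Here the Suburb classifier scores about 0.7 while the other two score below zero, so the image is labeled Suburb.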


Results

I classified 100 test images from each category with these results:

100 Training Images per Category; 600 Features per Training Image; 200-Word Vocabulary; Linear SVM to Train Classifiers

[Figure: confusion matrix plot]
The strong diagonal suggests that the classification was fairly successful. Indeed, for 10 of the 15 categories, the majority of images were classified correctly. The classifier did very well on the Suburb, Forest, and Office categories. It struggled the most with Living Room pictures, which it most often mistook for Bedroom, Office, or Kitchen images.

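Here is the numerical confusion matrix. Each row is a true category and each column is a predicted category, both in the order Suburb, Coast, Forest, Highway, City, Mountain, Open Country, Street, Tall Building, Office, Bedroom, Industrial, Kitchen, Living Room, Store: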
    93     1     0     1     1     0     0     1     1     0     1     0     0     1     0
     4    78     0     4     0     2    12     0     0     0     0     0     0     0     0
     0     0    94     0     0     3     1     2     0     0     0     0     0     0     0
     0     7     1    81     4     2     3     0     0     0     1     0     1     0     0
     4     4     0     1    60     0     1     5     5     0     1     0    10     1     8
     8     0     3     2     0    81     2     2     1     0     1     0     0     0     0
     5    23     9     9     0     9    43     1     0     0     1     0     0     0     0
     1     0     0     9    21     3     0    52     4     0     0     3     0     2     5
     0     3     2     0     2     4     0     7    73     2     1     3     2     0     1
     0     0     0     0     0     0     0     0     0    89     3     0     8     0     0
     1     0     0     0     1     2     2     0     3    14    42     2    23     6     4
     7     5     1     9     8     5     4     3     9     2     1    28     5     3    10
     1     0     0     0    12     1     0     1     1    20     7     1    51     2     3
     2     0     0     1     4     3     0     1     3    22    26     2    18    12     6
     1     0     3     3    18     7     0     3     6     2     4     1     4     1    47

The overall accuracy was 61.60%.
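The overall figure can be checked directly from the matrix: it is the sum of the diagonal divided by the total number of test images. The sketch below copies the matrix from above and assumes rows are true categories, which matches each row summing to 100; the function names are hypothetical.

```python
CATEGORIES = [
    "Suburb", "Coast", "Forest", "Highway", "City", "Mountain",
    "Open Country", "Street", "Tall Building", "Office", "Bedroom",
    "Industrial", "Kitchen", "Living Room", "Store",
]

# Rows: true category; columns: predicted category (assumed orientation).
CONFUSION = [
    [93, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    [4, 78, 0, 4, 0, 2, 12, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 94, 0, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0, 0],
    [0, 7, 1, 81, 4, 2, 3, 0, 0, 0, 1, 0, 1, 0, 0],
    [4, 4, 0, 1, 60, 0, 1, 5, 5, 0, 1, 0, 10, 1, 8],
    [8, 0, 3, 2, 0, 81, 2, 2, 1, 0, 1, 0, 0, 0, 0],
    [5, 23, 9, 9, 0, 9, 43, 1, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 9, 21, 3, 0, 52, 4, 0, 0, 3, 0, 2, 5],
    [0, 3, 2, 0, 2, 4, 0, 7, 73, 2, 1, 3, 2, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 89, 3, 0, 8, 0, 0],
    [1, 0, 0, 0, 1, 2, 2, 0, 3, 14, 42, 2, 23, 6, 4],
    [7, 5, 1, 9, 8, 5, 4, 3, 9, 2, 1, 28, 5, 3, 10],
    [1, 0, 0, 0, 12, 1, 0, 1, 1, 20, 7, 1, 51, 2, 3],
    [2, 0, 0, 1, 4, 3, 0, 1, 3, 22, 26, 2, 18, 12, 6],
    [1, 0, 3, 3, 18, 7, 0, 3, 6, 2, 4, 1, 4, 1, 47],
]

def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def most_confused_with(matrix, names, category):
    """Return the wrong category most often predicted for `category`."""
    i = names.index(category)
    j = max((k for k in range(len(names)) if k != i), key=lambda k: matrix[i][k])
    return names[j], matrix[i][j]

accuracy = overall_accuracy(CONFUSION)
worst = most_confused_with(CONFUSION, CATEGORIES, "Living Room")
print(f"{accuracy:.2%}")  # 61.60%
```

The diagonal sums to 924 out of 1500 test images, which reproduces the reported 61.60%, and the same matrix shows Living Room images being labeled Bedroom more often than anything else.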

Classifications

Category          Accuracy
Suburb            93%
Coast             78%
Forest            94%
Highway           81%
City              60%
Mountain          81%
Open Country      43%
Street            52%
Tall Building     73%
Office            89%
Bedroom           42%
Industrial        28%
Kitchen           51%
Living Room       12%
Store             47%

[Image columns: Sample Training Images, Correct Classifications, Incorrect Classifications]