Project 3: Scene recognition with bag of words

mashby's writeup

Intro

In the bag of words approach to image classification, a scene is categorized by how frequently each of a set of visual words appears in the image. For this project, scenes are classified with the bag of words approach by training and testing on 15 image categories.

The algorithm

The first step in the algorithm is to collect features from a set of images and create a visual vocabulary, or bag of words. For my implementation, I sample 300 random SIFT features from each of 100 pictures in each of 15 categories (1500 pictures total, 450,000 total samples). SIFT features are taken with size set to 4 and step set to 8 (as suggested in the handout). Then, from this set of 450,000 random features, a vocabulary is created by clustering the features with k-means and using the cluster centers as our words.
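The vocabulary step can be sketched roughly as follows. This is a minimal NumPy illustration, not the code used for the project: the function names, the plain Lloyd's k-means loop, the iteration count, and the `vocab_size` parameter are all assumptions (the vocabulary size is not stated in this writeup), and the descriptors are assumed to be already extracted as one array per image.

```python
import numpy as np

def sample_descriptors(per_image_descriptors, n_per_image=300, seed=0):
    """Randomly keep up to n_per_image descriptors from each image and stack them."""
    rng = np.random.default_rng(seed)
    sampled = []
    for desc in per_image_descriptors:
        desc = np.asarray(desc, dtype=np.float64)
        n = min(n_per_image, len(desc))
        idx = rng.choice(len(desc), n, replace=False)
        sampled.append(desc[idx])
    return np.vstack(sampled)

def build_vocabulary(features, vocab_size, n_iters=20, seed=0):
    """Cluster descriptors with plain Lloyd's k-means; the centers are the visual words."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=np.float64)
    # Initialize the centers from randomly chosen descriptors.
    centers = features[rng.choice(len(features), vocab_size, replace=False)].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every descriptor to every center.
        d2 = ((features ** 2).sum(axis=1)[:, None]
              - 2.0 * features @ centers.T
              + (centers ** 2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        for k in range(vocab_size):
            members = features[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                centers[k] = members.mean(axis=0)
    return centers
```

With real data, `per_image_descriptors` would hold the SIFT descriptors of the 1500 training pictures, so the sampled pool would have 450,000 rows.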

The next step is to convert each of the training images to a histogram representation that keeps track of how often each word in our vocabulary appears in the image. To do this, SIFT features are extracted from each image with the same parameters as above (size = 4 and step = 8). Using the distances between each feature vector and the vocab vectors, a nearest vocab neighbor is chosen for each SIFT vector, which adds one vote for that vocab word to the histogram.
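The voting scheme described above might look like this in NumPy. It is a sketch under assumptions: the function name is hypothetical, Euclidean distance is assumed as the distance measure, and the optional `normalize` flag is an addition for illustration (this writeup does not say whether the histograms were normalized).

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocab, normalize=False):
    """Count, for each vocab word, how many descriptors have it as nearest neighbor."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    vocab = np.asarray(vocab, dtype=np.float64)
    # Squared Euclidean distance between every descriptor and every vocab word.
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          - 2.0 * descriptors @ vocab.T
          + (vocab ** 2).sum(axis=1)[None, :])
    nearest = d2.argmin(axis=1)
    # One vote per descriptor for its nearest word.
    hist = np.bincount(nearest, minlength=len(vocab)).astype(np.float64)
    if normalize and hist.sum() > 0:
        hist /= hist.sum()  # assumption: optional, not stated in the writeup
    return hist
```

For example, with the two-word vocabulary `[[0, 0], [10, 10]]` and the descriptors `[[0, 1], [9, 9], [10, 11]]`, the first word receives one vote and the second receives two.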

For the next three steps (learning a set of SVMs from the histograms, testing images, and scoring), I generally stuck with the baseline provided. I kept the linear kernel and lambda at 0.1.
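To make the "set of SVMs" idea concrete, here is a rough one-vs-all sketch with linear SVMs. It is not the provided baseline: the solver below is plain full-batch subgradient descent, and the function names, learning rate, iteration count, and the exact way lambda scales the regularizer are assumptions made for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, n_iters=500, lr=0.1):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent; y is +1 or -1."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1  # examples that currently violate the margin
        grad_w = lam * w - X[active].T @ y[active] / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_all(X, labels, n_classes, lam=0.1):
    """Train one binary SVM per category: that category against all the others."""
    labels = np.asarray(labels)
    models = [train_linear_svm(X, np.where(labels == c, 1.0, -1.0), lam=lam)
              for c in range(n_classes)]
    W = np.stack([w for w, _ in models])
    b = np.array([bias for _, bias in models])
    return W, b

def predict(X, W, b):
    """Assign each histogram to the category whose SVM gives the highest score."""
    return (X @ W.T + b).argmax(axis=1)
```

Here `X` would hold one histogram per row and `labels` the integer category index of each training image.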

Results

Average accuracy: 62.53%

Here are the results, with classification accuracy organized by image category:

Suburb		96%
Forest		79%
InsideCity	93%
OpenCountry	79%
TallBuilding	62%
Bedroom		84%
Kitchen		44%
Store		55%
Coast		74%
Highway		89%
Mountain	43%
Street		33%
Office		50%
Industrial	10%
Livingroom	47%
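As a quick check, the reported average is the unweighted mean of the 15 per-category accuracies in the table above. The variable names below are illustrative only.

```python
# Per-category accuracies (percent), copied from the table above.
per_category_accuracy = {
    "Suburb": 96, "Forest": 79, "InsideCity": 93, "OpenCountry": 79,
    "TallBuilding": 62, "Bedroom": 84, "Kitchen": 44, "Store": 55,
    "Coast": 74, "Highway": 89, "Mountain": 43, "Street": 33,
    "Office": 50, "Industrial": 10, "Livingroom": 47,
}

# Unweighted mean over categories: 938 / 15.
average = sum(per_category_accuracy.values()) / len(per_category_accuracy)
print(f"Average accuracy: {average:.2f}%")  # Average accuracy: 62.53%
```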
 

Here's the confusion matrix visualization
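For reference, a confusion matrix of this kind can be computed from the true and predicted category indices along the following lines. This is a sketch with hypothetical function names; the convention that rows are true categories and columns are predicted categories is an assumption, not necessarily the orientation of the visualization.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """counts[i, j] = number of test images of true category i predicted as category j."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(counts, (true_labels, predicted_labels), 1)
    return counts

def row_normalize(counts):
    """Turn counts into per-category rates; the diagonal is the per-category accuracy."""
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)  # avoid dividing by zero for empty rows
```

Under this convention, the diagonal of the row-normalized matrix corresponds to the per-category accuracies listed above.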

While performing superbly for a select few categories, my bag of words implementation has somewhat mixed results overall. With the features randomly selected and clustered in this attempt, performance was very high for the 'Suburb' (96%) and 'InsideCity' (93%) categories, but poor for others, particularly 'Industrial' (10%). On the whole, results were decent, with a baseline-level average accuracy of 62.5%.