Scene Recognition Via Bag of Words
Purpose:
We attempt to develop a classifier that assigns scenes to one of 15 categories: suburb, coast, forest, highway, inside city, mountain, open country, street, tall building, office, bedroom, industrial, kitchen, living room, or store.
Algorithm:
Main Idea
We follow the bag of words model to classify our scenes. The general pipeline is as follows. To classify our images we first build a vocabulary of image features. We then use this vocabulary to make histograms for each training image. We then train one-vs-all linear classifiers to handle future queries. Finally, we test with more images and evaluate our result using a confusion matrix, which records each test scene's true label against the label we predicted.
Build Vocabulary
We are given a set of training images with known labels. We examine our training images and extract SIFT features from each of them. To speed up the process we randomly select k features from each image rather than use all of them. Once we have extracted features from all of our images, we use k-means to cluster them into a vocabulary of size n. This set of visual words forms the vocabulary we use to build histograms for individual images later.
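As a rough sketch of this step (in Python with OpenCV and scikit-learn, not the code used for this project; the function name and parameters are illustrative), the vocabulary could be built like this:

```python
# Illustrative sketch: sample SIFT descriptors from each training image,
# then cluster the pooled descriptors with k-means to form the visual vocabulary.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, vocab_size=200, samples_per_image=250):
    sift = cv2.SIFT_create()
    sampled = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is None:
            continue
        # Randomly keep at most `samples_per_image` descriptors to speed up clustering.
        idx = np.random.choice(len(desc), min(samples_per_image, len(desc)), replace=False)
        sampled.append(desc[idx])
    all_desc = np.vstack(sampled)
    # Each cluster center is one "visual word" of the vocabulary.
    kmeans = KMeans(n_clusters=vocab_size, n_init=4).fit(all_desc)
    return kmeans.cluster_centers_  # shape: (vocab_size, 128)
```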
Building an Image Histogram
For each training image we first recompute the SIFT features. We then query the vocabulary with each SIFT feature and record its nearest neighbor. We build a histogram of how many SIFT features selected each vocabulary word as their nearest neighbor. Finally we normalize the histogram, because otherwise larger images would have larger histograms.
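A minimal sketch of the histogram step, under the same assumptions as above (OpenCV SIFT, `vocab` being the array of cluster centers; the helper name is illustrative):

```python
# Illustrative sketch: turn one grayscale image into a normalized
# bag-of-words histogram by assigning each descriptor to its nearest visual word.
import cv2
import numpy as np
from scipy.spatial.distance import cdist

def bow_histogram(img_gray, vocab):
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(img_gray, None)
    hist = np.zeros(len(vocab))
    if desc is not None:
        nearest = np.argmin(cdist(desc, vocab), axis=1)          # nearest word per descriptor
        hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    # Normalize so images with more features do not get larger histograms.
    return hist / max(hist.sum(), 1)
```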
Training Our Classifier
Once we have histograms for each of our training images, we train one-vs-all linear SVM classifiers, using all of that histogram data combined with our knowledge of the true label of each image.
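A sketch of the one-vs-all training, assuming scikit-learn's LinearSVC (again illustrative, not the project's actual code):

```python
# Illustrative sketch: one linear SVM per category, trained to separate
# that category (label 1) from all other categories (label 0).
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(histograms, labels, categories):
    # histograms: (num_images, vocab_size); labels: array of category names.
    labels = np.asarray(labels)
    classifiers = {}
    for cat in categories:
        y = (labels == cat).astype(int)
        clf = LinearSVC(C=1.0)
        clf.fit(histograms, y)
        classifiers[cat] = clf
    return classifiers
```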
Evaluation
Given a query image, we use our linear SVMs to find our confidence that the image belongs to the category of each one-vs-all classifier, and we select the category we are most confident about. To test our algorithm, we run it on many images from each category where we know the true label. We evaluate our accuracy and also view a confusion matrix. The confusion matrix has the true label on one axis and the predicted label on the other (in our case, this is a 15 x 15 matrix). If we do well, we will see a high accuracy percentage and a strong diagonal in the confusion matrix.
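The evaluation step could look roughly like the following (a sketch reusing the hypothetical classifiers from above; `evaluate` is an illustrative name):

```python
# Illustrative sketch: pick the most confident one-vs-all SVM for each test image,
# then summarize the results as accuracy plus a 15 x 15 confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(classifiers, test_hists, true_labels, categories):
    # Confidence of every category's classifier for every test image.
    scores = np.column_stack([classifiers[c].decision_function(test_hists)
                              for c in categories])
    predictions = [categories[i] for i in scores.argmax(axis=1)]
    accuracy = np.mean([p == t for p, t in zip(predictions, true_labels)])
    cm = confusion_matrix(true_labels, predictions, labels=categories)
    return accuracy, cm  # a strong diagonal in cm means few confusions
```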
Extra Credit: Spatial Information
To add some spatial information I compared entire images in addition to quadrants of each image. I still built my vocabulary the same way, examining only the whole image. I then built 5 histograms per image and trained 5 classifiers for each one-vs-all category. To do this I split each image into fourths (upper left, upper right, lower left, and lower right). Then for each query I aggregated the results of all 5 versions of the image. In the paper they discuss weighting matches at lower levels more heavily, because there both spatial location and SIFT features match up; however, I am not sure this is a valid assumption, and therefore I weighted them equally. My logic is that, if anything, the full version of the image should be weighted more, because its similarity encompasses the entire image rather than a smaller portion of it. To resolve these conflicting arguments I simply weighted them equally.
As seen in the results section, I obtained significantly better results using this additional spatial breakdown. While the results suggest this is a good method, I am still skeptical. It seems to operate on the assumption that all images are taken with the same rotation. If this algorithm were applied to real data, I could easily see it hurting the result. For example, if I were to gather images from a random person's camera, half might be taken with the camera horizontal and the other half vertical. Even if we were to rotate all vertical images, some would likely then be upside down. Directly comparing spatial locations seems as though it could easily damage the results on less well-behaved data sets. (This is not even accounting for cases where the sensor is rotated by a non-multiple of 90 degrees.)
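A sketch of this spatial variant, under the same illustrative assumptions as the earlier snippets (it reuses the hypothetical `bow_histogram` helper; the region split and equal weighting follow the description above):

```python
# Illustrative sketch: one histogram for the whole image plus one per quadrant,
# scored by separate one-vs-all classifiers and combined with equal weights.
import numpy as np

def spatial_histograms(img_gray, vocab):
    h, w = img_gray.shape
    regions = [img_gray,                   # whole image
               img_gray[:h//2, :w//2],     # upper left
               img_gray[:h//2, w//2:],     # upper right
               img_gray[h//2:, :w//2],     # lower left
               img_gray[h//2:, w//2:]]     # lower right
    return [bow_histogram(r, vocab) for r in regions]   # 5 histograms per image

def spatial_score(classifiers_per_region, hists, category):
    # classifiers_per_region: list of 5 dicts (category -> trained LinearSVC).
    # Equal weighting: simply sum the confidences of the 5 region classifiers.
    return sum(classifiers_per_region[r][category].decision_function([hists[r]])[0]
               for r in range(5))
```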
Results:
With Spatial Addition
Using parameters:
Vocab Size = 200
Training Images per Class = 100
SIFT Samples per Image (vocab building) = 250
Our accuracy is 71.07% with a confusion matrix that looks like:

Baseline
Param Set One:
Using parameters:
Vocab Size = 200
Training Images per Class = 100
SIFT Samples per Image (vocab building) = 250
We find 62.2% accuracy, with a confusion matrix that looks like:
Param Set Two:
Using parameters:
Vocab Size = 150
Training Images per Class = 70
SIFT Samples per Image (vocab building) = 150
We find 59% accuracy, with a confusion matrix that looks like:

Param Set Three:
Using parameters:
Vocab Size = 50
Training Images per Class = 10
SIFT Samples per Image (vocab building) = 50
We find 48% accuracy, with a confusion matrix that looks like:
