Scene Recognition

Overview

This project classifies images into one of 15 scene categories (bedroom, suburb, industrial, kitchen, living room, coast, forest, highway, inside city, mountain, open country, street, tall building, office, store). At a high level, the pipeline is: build a visual vocabulary from the training images; represent each training image as a histogram of the visual words in that vocabulary; train a one-vs-all classifier for each category; represent each test image as a histogram of visual words; feed these histograms into the classifiers; and assign each test image to the category whose classifier scores it most confidently.
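The final step of the pipeline, picking the category whose classifier responds most strongly, can be sketched as follows. This is a minimal illustration and not the project's code: it assumes each one-vs-all classifier is a linear scorer (a weight vector plus a bias), which the write-up does not state, and the names `linear_score`, `classify_one_vs_all`, and the toy weights are hypothetical.

```python
def linear_score(weights, bias, histogram):
    """Decision value of one classifier (assumed linear) for one histogram."""
    return sum(w * h for w, h in zip(weights, histogram)) + bias


def classify_one_vs_all(classifiers, histogram):
    """Score the histogram with every classifier and return the best category.

    `classifiers` maps a category name to a (weights, bias) pair.
    """
    scores = {
        category: linear_score(weights, bias, histogram)
        for category, (weights, bias) in classifiers.items()
    }
    best_category = max(scores, key=scores.get)
    return best_category, scores


# Toy example with a 2-word vocabulary and two of the 15 categories.
# The weights are made up for illustration only.
toy_classifiers = {
    "forest": ([1.0, -1.0], 0.0),
    "highway": ([-1.0, 1.0], 0.0),
}
toy_histogram = [0.75, 0.25]
predicted, all_scores = classify_one_vs_all(toy_classifiers, toy_histogram)
print(predicted, all_scores)
```

Because every classifier is trained independently, the raw scores are only comparable to the extent that the classifiers are similarly scaled; taking the maximum is the simplest decision rule under that assumption.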

The visual vocabulary is built by first extracting SIFT descriptors from each training image on a dense sampling grid. The descriptors from all training images are then pooled and clustered with k-means, with the number of clusters equal to the vocabulary size; the resulting cluster centers are the visual words.
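The clustering step can be sketched with a plain implementation of Lloyd's k-means algorithm. This is a sketch under stated assumptions rather than the function the project actually called: it uses only the Python standard library, a fixed number of iterations, and random initialization from the data, and it runs on toy 2-D points instead of 128-dimensional SIFT descriptors. The name `build_vocabulary` and its parameters are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_vocabulary(descriptors, vocab_size, iterations=20, seed=0):
    """Cluster descriptors into `vocab_size` centers with Lloyd's k-means."""
    rng = random.Random(seed)
    # Initialize the centers with distinct descriptors drawn at random.
    centers = [list(d) for d in rng.sample(descriptors, vocab_size)]
    for _ in range(iterations):
        # Assignment step: attach each descriptor to its nearest center.
        clusters = [[] for _ in range(vocab_size)]
        for d in descriptors:
            nearest = min(
                range(vocab_size),
                key=lambda k: squared_distance(d, centers[k]),
            )
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy "descriptors": two well-separated groups of 2-D points.
toy_descriptors = [
    (0, 0), (0, 1), (1, 0),
    (100, 100), (100, 101), (101, 100),
]
vocabulary = build_vocabulary(toy_descriptors, vocab_size=2)
print(sorted(vocabulary))
```

In practice the pooled descriptor set is large, so a vectorized or library k-means would be used; the logic of the two alternating steps is the same.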

To turn an image into a histogram of visual words from the previously built vocabulary, we first run the same dense SIFT extraction on the image. Each descriptor is then assigned to its closest visual word in the vocabulary, as measured by squared Euclidean distance, and the histogram counts how many descriptors are assigned to each word.
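The assignment and counting described above can be sketched as follows. This is a minimal standard-library illustration on toy 2-D vectors, not the project's code. Normalizing the counts so that the histogram sums to 1 is an assumption (the write-up does not say whether histograms were normalized), and the name `bag_of_words_histogram` is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(descriptors, vocabulary):
    """Count nearest-word assignments and return a normalized histogram."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        # Index of the visual word closest to this descriptor.
        nearest = min(
            range(len(vocabulary)),
            key=lambda k: squared_distance(d, vocabulary[k]),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    # Assumption: normalize so images with different numbers of
    # descriptors produce comparable histograms.
    return [c / total for c in counts]


# Toy vocabulary of two visual words and four toy descriptors.
toy_vocabulary = [(0, 0), (10, 10)]
toy_image_descriptors = [(1, 0), (0, 2), (1, 1), (9, 9)]
histogram = bag_of_words_histogram(toy_image_descriptors, toy_vocabulary)
print(histogram)
```

Using the squared distance instead of the true Euclidean distance does not change which word is nearest, since the square root is monotonic; it only avoids an unnecessary computation.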

Baseline

For my baseline, I used the following parameters:

This baseline resulted in 61.47% accuracy, with the confusion matrix shown below:

SIFT Density Parameters

To try to improve on the baseline, I experimented with increasing the SIFT step size and bin size. I did not try decreasing them, because the baseline values were already small: decreasing them further would have produced thousands of tiny SIFT descriptors per image, none of which would have captured relevant information.
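To give a rough sense of how the step and bin sizes trade off against the number of descriptors, the following sketch counts the grid positions at which a dense descriptor fits inside an image. It is a simplified model, not the exact behavior of any particular SIFT library: it assumes a descriptor made of 4 x 4 spatial bins, so that each patch is `4 * bin_size` pixels on a side, and it ignores border offsets. The image size and all parameter values below are hypothetical and are not the ones used in this project.

```python
def dense_grid_count(width, height, step, bin_size, spatial_bins=4):
    """Approximate number of dense descriptors that fit in an image.

    Assumes each descriptor covers a square patch of
    spatial_bins * bin_size pixels and that patches are placed every
    `step` pixels in both directions.
    """
    patch = spatial_bins * bin_size
    if patch > width or patch > height:
        return 0
    cols = (width - patch) // step + 1
    rows = (height - patch) // step + 1
    return cols * rows


# Hypothetical 256 x 256 image with three hypothetical settings.
for step, bin_size in [(4, 2), (8, 4), (16, 8)]:
    count = dense_grid_count(256, 256, step, bin_size)
    print(f"step={step:2d} bin_size={bin_size} -> {count} descriptors")
```

Under this model, halving the step roughly quadruples the number of descriptors, while halving the bin size halves the side of the patch each descriptor sees, which is the trade-off the paragraph above describes.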

Larger bin and step size:

This parameterization resulted in 44.4% accuracy, with the confusion matrix shown below:

Vocabulary Size

I also experimented with vocabulary sizes both smaller and larger than the baseline of 200.

A vocabulary size of 50 resulted in 60.33% accuracy, with the confusion matrix shown below:

A vocabulary size of 10 resulted in 44.13% accuracy, with the confusion matrix shown below:

Doubling the vocabulary size from the baseline of 200 resulted in 63.67% accuracy, with the confusion matrix shown below:

Conclusion

Overall, I found that increasing the bin and step size for the dense SIFT decreased performance markedly (from 61.47% to 44.4%). As far as vocabulary size is concerned, a fourfold decrease (200 to 50) led to only a small decrease in accuracy (61.47% to 60.33%), but a twentyfold decrease (200 to 10) caused a significant falloff (to 44.13%). In the other direction, doubling the vocabulary size led to only a small increase in accuracy (to 63.67%).