Scene Recognition
Benjamin Leib
bgleib

The Algorithm

This program uses a bag-of-words model to do scene recognition. The pipeline involves several steps. First, features are extracted from the training images; in my implementation these are SIFT features sampled densely with a bin size of 4 and a step size of 8, found using the vlfeat library. These features are then used to create a 'visual vocabulary': I first randomly sample 1/50th of the features (to reduce computational cost) and run them through vlfeat's K-means implementation, which clusters them into 200 categories. The cluster centers of these categories define my 200-word visual vocabulary. I then go back through the training images and create, for each one, a histogram of its distribution over the 'words', where each feature in the image is assigned to the histogram bin of the word (cluster center) it is closest to under the usual L2 distance. After normalizing these histograms, I feed them, along with the images' scene labels, into a linear SVM implementation to train a series of one-vs-all models, one model per scene category. With these models, I then classify the test images by building their histograms the same way as for the training images and running them through the SVM models.
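The vocabulary-building step can be sketched as follows. This is an illustrative sketch only: it substitutes scikit-learn's KMeans for vlfeat's K-means, and the `descriptors` array is placeholder data standing in for the densely sampled SIFT descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder for the 128-dimensional dense SIFT descriptors from all
# training images (hypothetical data, not real features).
descriptors = rng.random((20000, 128))

# Randomly keep 1/50th of the descriptors to reduce clustering cost.
idx = rng.choice(len(descriptors), len(descriptors) // 50, replace=False)
sample = descriptors[idx]

# Cluster the sampled descriptors into 200 categories; the cluster
# centers define the 200-word visual vocabulary.
vocab = KMeans(n_clusters=200, n_init=10, random_state=0).fit(sample).cluster_centers_
print(vocab.shape)  # (200, 128)
```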
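The histogram step described above (assign each feature to its nearest visual word under L2 distance, then normalize) could look roughly like this; `bow_histogram` is a hypothetical helper name, and the features and vocabulary here are random placeholders.

```python
import numpy as np

def bow_histogram(features, vocab):
    # Pairwise L2 distances between each feature and each cluster center.
    dists = np.linalg.norm(features[:, None, :] - vocab[None, :, :], axis=2)
    # Each feature votes for the bin of its nearest visual word.
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    # Normalize so the histogram sums to 1.
    return hist / hist.sum()

rng = np.random.default_rng(1)
vocab = rng.random((200, 128))   # stand-in for the learned vocabulary
feats = rng.random((300, 128))   # stand-in for one image's dense SIFT features
h = bow_histogram(feats, vocab)
print(h.shape, round(h.sum(), 6))  # (200,) 1.0
```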
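Finally, the one-vs-all linear SVM stage might be sketched with scikit-learn's `LinearSVC`, which trains one-vs-rest linear models internally, matching the one-model-per-scene-category setup. The histograms and labels below are random placeholders, not the actual training data.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
# Placeholder normalized histograms (200 bins) and labels for 15 scene
# categories; real inputs would be the bag-of-words histograms above.
X_train = rng.random((150, 200))
y_train = rng.integers(0, 15, size=150)
X_test = rng.random((20, 200))

# LinearSVC fits one linear SVM per class in a one-vs-rest scheme, then
# predicts by taking the class whose model scores highest.
clf = LinearSVC(C=1.0).fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred.shape)  # (20,)
```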

Results

I achieved an overall accuracy of 66.27% on 1500 test images. See below for the confusion matrix. As is evident from the matrix, the most common errors were almost exactly what you would expect: for instance, the two most frequent confusions appear to be bedrooms with living rooms and open country with coasts.