CS 143 Project 3

Scene Recognition with Bag of Words

Will Allen (weallen)

Methods


For this project, we implemented a simple scene recognition algorithm using a bag-of-words representation of SIFT features to classify images as belonging to a particular scene.

Algorithm

There were several steps in our pipeline:

Step 1:

We first extracted dense SIFT features from all of the images in the training database, using a step size of 4 and a bin size of 8 (this was done with the vl_dsift function, implemented in C, from the VLFeat vision library). This gave us a matrix of 128 rows by several million columns, where each column is a SIFT feature from some image. We then sampled ~30% of the features i.i.d. without replacement to obtain a computationally feasible training subset. These sampled SIFT features were clustered with the vl_kmeans function into 200 clusters, and the 128-dimensional center of each cluster became an entry in our vocabulary.
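The vocabulary-building step can be sketched in Python with NumPy. This is a minimal stand-in, not our actual implementation: descriptors are stored one per row (transposed relative to VLFeat's column layout), the data is random, and a bare-bones Lloyd's k-means replaces vl_kmeans.

```python
import numpy as np

def build_vocabulary(descriptors, k=200, sample_frac=0.3, iters=10, seed=0):
    """Sample a fraction of the SIFT descriptors i.i.d. without
    replacement and cluster them with Lloyd's k-means; the cluster
    centers form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    n = descriptors.shape[0]
    idx = rng.choice(n, size=max(k, int(sample_frac * n)), replace=False)
    sample = descriptors[idx]

    # Initialize the centers with k distinct sampled descriptors.
    centers = sample[rng.choice(sample.shape[0], size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center (squared L2).
        d2 = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            members = sample[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

# Toy usage: 1000 fake 128-D "SIFT" descriptors, 8-word vocabulary.
descs = np.random.default_rng(1).random((1000, 128))
vocab = build_vocabulary(descs, k=8)
```

In practice the pairwise-distance matrix above would be far too large for millions of descriptors, which is one reason libraries like VLFeat do this in optimized C.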

Step 2:

We then computed dense SIFT descriptors for each image in the training and test sets. For each image, we built a histogram over our vocabulary: for each SIFT feature in the image, we looked up the nearest (by L2 distance) word in the vocabulary and incremented that word's histogram bin. Each histogram was then normalized. To speed up the lookup, we built a kd-tree over the vocabulary, which is faster than the naive N x 200 comparisons, where N is the number of SIFT features in a given image. We took this normalized histogram to be the bag-of-words representation of the image under our particular 200-word vocabulary.
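The histogram computation can be sketched as follows (NumPy, with toy random data). For simplicity this sketch does the brute-force N x 200 nearest-word search; our actual pipeline used a kd-tree over the vocabulary for the same lookup.

```python
import numpy as np

def bag_of_words(descriptors, vocab):
    """Map each SIFT descriptor (one per row) to its nearest
    vocabulary word by L2 distance, then return the normalized
    histogram of word counts — the image's bag-of-words vector."""
    # Squared L2 distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest word per descriptor
    hist = np.bincount(words, minlength=vocab.shape[0]).astype(float)
    return hist / hist.sum()   # normalize so the bins sum to 1

# Toy usage: 50 fake descriptors against a 10-word vocabulary.
rng = np.random.default_rng(0)
h = bag_of_words(rng.random((50, 128)), rng.random((10, 128)))
```

Since squared L2 is monotone in L2, taking the argmin of squared distances gives the same nearest word without the square root.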

Step 3:

We then trained a 1-vs.-all linear SVM classifier for each scene category on the training histograms computed in Step 2, together with the labels of the corresponding training images.
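A minimal 1-vs.-all training sketch in Python (NumPy): here plain subgradient descent on the regularized hinge loss stands in for the linear SVM solver we actually used, and the data and hyperparameters are toy placeholders.

```python
import numpy as np

def train_one_vs_all(X, y, n_classes, epochs=50, lr=0.1, lam=1e-3, seed=0):
    """Train one binary linear classifier per category by subgradient
    descent on the regularized hinge loss (a simple stand-in for a
    linear SVM solver). X holds one bag-of-words histogram per row."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)  # +1 for this category, -1 for the rest
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                # The hinge loss is active when the signed margin is below 1.
                if t[i] * (W[c] @ X[i] + b[c]) < 1:
                    W[c] = (1 - lr * lam) * W[c] + lr * t[i] * X[i]
                    b[c] += lr * t[i]
                else:
                    W[c] = (1 - lr * lam) * W[c]
    return W, b

# Toy usage: two linearly separable "histogram" classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
W, b = train_one_vs_all(X, y, n_classes=2)
preds = (X @ W.T + b).argmax(axis=1)  # 1-vs.-all: highest score wins
```

Each binary classifier only learns "this category vs. everything else"; the multi-class decision comes from comparing their scores, as in the prediction step below.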

Step 4:

To classify a test image, we converted it to its bag-of-words representation, evaluated all 15 classifiers on that histogram, and predicted the category whose classifier responded most strongly.
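The prediction step can be sketched as follows (NumPy). The weight matrix and bias vector stand in for the parameters of the 15 trained classifiers; the values here are hand-picked toys, not learned parameters.

```python
import numpy as np

def predict_scene(hist, W, b):
    """1-vs.-all prediction: evaluate every binary linear classifier
    (w_c . hist + b_c) on the image's bag-of-words histogram and
    return the category whose classifier responds most strongly."""
    scores = W @ hist + b  # one decision value per category
    return int(scores.argmax())

# Toy usage: 3 categories over a 4-word vocabulary, with weights
# chosen so that category 2's classifier fires on this histogram.
W = np.zeros((3, 4))
W[2, 0] = 1.0            # category 2 responds to word 0
b = np.zeros(3)
hist = np.array([0.7, 0.1, 0.1, 0.1])
pred = predict_scene(hist, W, b)
```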

Results


Our approach classified ~60% of test images correctly. There are several improvements we could have made to achieve better performance: for example, we did not experiment with different parameter settings, and we used the simplest (linear) kernel for the SVM.

The confusion matrix of our results indicates that the algorithm did well on certain categories (such as Suburb, InsideCity, and Bedroom) and poorly on others (particularly Office, Industrial, and TallBuilding). It seems likely that the categories on which the algorithm performed poorly shared visual words with other categories in the training set, so the classifier could not distinguish between those categories as well.