For this assignment, I followed the recommended pipeline, but ran a few extra experiments to see if I could improve my results. In particular, I played around with soft-binning, with adding spatial information to the feature descriptors, and with varying both the number of SIFT vectors sampled to build the vocabulary and the vocabulary size.
For most of my experiments, I used a vocabulary size of approximately 800. Building the vocabulary was pretty straightforward. For each image in our training set, I first compute the SIFT vectors. I then randomly sample 100 vectors from each image's feature matrix and append them to a "meta" feature matrix. Then, I run K-means on the "meta" matrix, feeding in the vocab size as the number of clusters. The result is a dim x vocab_size matrix of cluster centers, which represents our vocabulary.
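Concretely, the vocabulary step looks roughly like the sketch below. This is a minimal MATLAB/VLFeat version of the idea, not my exact submission: the image loading, the vl_dsift step size, and names like image_paths are assumptions for illustration.

```matlab
% Sketch of building the vocabulary: sample descriptors per image, then K-means.
vocab_size = 800;            % number of visual words
samples_per_image = 100;     % descriptors sampled from each training image
all_features = [];

for i = 1:length(image_paths)
    img = imread(image_paths{i});
    if size(img, 3) > 1, img = rgb2gray(img); end
    % Dense SIFT: locations is 2 x n (x, y), sift_features is 128 x n
    [locations, sift_features] = vl_dsift(single(img), 'step', 8);

    % Randomly sample a fixed number of descriptors from this image
    n = size(sift_features, 2);
    idx = randperm(n, min(samples_per_image, n));
    all_features = [all_features, single(sift_features(:, idx))];  % "meta" matrix
end

% Cluster the sampled descriptors; vocab is dim x vocab_size
vocab = vl_kmeans(all_features, vocab_size);
```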
I included spatial information after talking to Paul / reading the footnote in the Lazebnik 2006 paper. I extended each SIFT descriptor from 128 to 130 dimensions (so the feature matrix goes from 128 x n to 130 x n) by simply concatenating the "locations" matrix onto the "sift_features" matrix. As the results section will show, this led to a relatively good increase in accuracy.
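The concatenation itself is only a line or two; here is a sketch (whether to rescale the x-y values relative to the 0-255 descriptor range is a choice I'm glossing over):

```matlab
% 128 x n descriptors plus 2 x n (x, y) locations -> 130 x n features
[locations, sift_features] = vl_dsift(single(img), 'step', 8);
features_xy = [single(sift_features); single(locations)];
```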
In building the histogram, I first computed the SIFT vectors for each image and concatenated the location information onto the feature matrix, as above. I then initially used vl_dist2 to compute the distance matrix between every feature and every vocab word. After this, I took the argmin of each feature's distance vector and incremented the histogram bin for the corresponding vocabulary word. I concluded by returning a normalized version of the histogram.
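A rough sketch of that hard-binning version of make_hist for a single image is below. I write the distance step with VLFeat's vl_alldist2 (pairwise squared distances between columns) since that's the call I can vouch for here; as noted above, my actual code used vl_dist2, and the variable names are placeholders.

```matlab
% Hard binning: assign each 130-D feature to its nearest vocab word.
% features is 130 x n; vocab is 130 x vocab_size.
vocab_size = size(vocab, 2);
D = vl_alldist2(single(features), single(vocab));  % n x vocab_size squared distances

[~, nearest] = min(D, [], 2);            % index of the closest word per feature
hist_counts = zeros(vocab_size, 1);
for j = 1:length(nearest)
    hist_counts(nearest(j)) = hist_counts(nearest(j)) + 1;
end

hist_counts = hist_counts / sum(hist_counts);  % normalize so the bins sum to 1
```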
To save some time, I used VLFeat's KD-tree. However, before I settled on the KD-tree with "hard" binning, I experimented a little with soft-binning as described in this paper, which was linked off of the Chatfield paper listed on the assignment page. I tried both the kernel codebook encoding and the codeword uncertainty encoding (they are left commented out in the make_hist file). The former uses a Gaussian kernel to smooth the distances between features and vocabulary words, ultimately distributing "weighted" votes across the histogram bins. Codeword uncertainty is similar, but it normalizes the smoothed distance between each feature and word by the sum of these smoothed distances over all words. Unfortunately, the SVM ran into singular matrices with both of these approaches and was not able to train properly. I abandoned them, but I think it would be interesting to try to implement them correctly in the future.
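For reference, here is a sketch of the KD-tree query I ended up using for hard binning, plus the two soft encodings as I understood them from the paper. The bandwidth sigma is a value I would have had to tune, and this is the idea rather than a verified, working implementation (the SVM issue above is exactly where it fell apart).

```matlab
% Hard binning via VLFeat's KD-tree (what I ended up using):
forest  = vl_kdtreebuild(single(vocab));
nearest = vl_kdtreequery(forest, single(vocab), single(features));  % 1 x n word indices

% Soft binning (the versions left commented out in make_hist):
sigma = 100;                                        % assumed smoothing bandwidth
D = vl_alldist2(single(features), single(vocab));   % n x vocab_size squared distances
K = exp(-D ./ (2 * sigma^2));                       % Gaussian kernel of each distance

% Kernel codebook: every feature casts a weighted vote over every word
hist_kcb = sum(K, 1)';
hist_kcb = hist_kcb / sum(hist_kcb);

% Codeword uncertainty: normalize each feature's weights across words first
K_unc    = bsxfun(@rdivide, K, sum(K, 2));
hist_unc = sum(K_unc, 1)';
hist_unc = hist_unc / sum(hist_unc);
```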
Sample Size = 100 per image, Vocab Size = 800, Using X-Y information

Accuracy: 0.6547
This was the best performance I was able to achieve. In my initial run, with a vocabulary size of 200, a sample size of 50 per image while building the vocab, and no spatial information, my accuracy was 0.613. With the parameters / changes described above, accuracy rose to 0.6547, an improvement of roughly 4 percentage points (about a 7% relative gain). I think most of this boost was due to the use of spatial info (effectively the "single level" implementation), as Lazebnik et al. showed that larger vocab sizes alone did not lead to gains in accuracy comparable to those achieved through the use of spatial info. Even if I had gotten soft-binning to work, I don't expect it would have offered a huge advantage over the current implementation, if the results in the papers linked above are any indication.
Overall, the simple addition of x-y information for the feature vectors and vocab led to a pretty decent performance boost. The few soft-binning strategies I tried out seemed promising until the SVM couldn't converge on a solution because of the presence of singular matrices. In the future, I would try varying the vocab size more, while holding everything else constant, to discern more precisely how vocab size alone impacts the classifier's accuracy.
Shoutout to vmoussav and psastras for helping me w/ including spatial info / forcing me to use a KD-tree.