Project 3: Scene Recognition with Bag of Words

Varun Singh

For this assignment I classified images by scene based on a 'bag of words' model of image features. First, I went through all the training images and extracted features from them using the VLFeat library's vl_dsift function. To speed things up, I only used around 1/10 of the images, randomly chosen from the entire training set. Because I chose the images randomly, there was some variance in my results.
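
A rough sketch of this sampling and extraction step is below (not my actual code; variable names like train_image_paths are placeholders, and the vl_dsift parameters are the ones described in the next paragraph):

    % Randomly sample ~1/10 of the training images, then extract dense SIFT
    % descriptors from a sampled image with VLFeat's vl_dsift.
    numTrain  = numel(train_image_paths);                  % cell array of image file paths
    sampleIdx = randperm(numTrain, round(numTrain / 10));  % random 1/10 subset

    img = imread(train_image_paths{sampleIdx(1)});
    if size(img, 3) == 3, img = rgb2gray(img); end         % vl_dsift wants a grayscale image
    img = single(img);                                     % ... in single precision
    [~, descriptors] = vl_dsift(img, 'Size', 4, 'Step', 8);  % descriptors: 128 x N matrix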

Then, I collected dense features on each image, with a bin size of 4 and a step of 8. I clustered each image's features with K-means into a number of clusters equal to the vocabulary size, again using VLFeat, in order to reduce the total number of features I was collecting as I iterated over the images. Without per-image clustering I ended up with 154,289 features; with it I had only 28,800. This significantly sped up both the collection of features as I iterated over the images and the final K-means clustering. I also got better results doing this (.6580 clustering only once vs. .6613 clustering after every image). After going through all the sampled training images, I again clustered all of the (already clustered) features I had collected into a final numVocab features, which were my 'visual words'.
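
A sketch of this two-stage clustering, again with placeholder names, using VLFeat's vl_kmeans (which takes one data point per column and returns the cluster centers):

    % Stage 1: reduce each sampled image's descriptors to vocabSize centers.
    % Stage 2: pool those centers and cluster them once more into the vocabulary.
    vocabSize     = 200;
    pooledCenters = [];
    for i = sampleIdx
        img = imread(train_image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        [~, descriptors] = vl_dsift(single(img), 'Size', 4, 'Step', 8);

        centers = vl_kmeans(single(descriptors), vocabSize);   % per-image reduction
        pooledCenters = [pooledCenters, centers];
    end
    vocab = vl_kmeans(pooledCenters, vocabSize);                % 128 x vocabSize visual words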

Then, I went through all the training images and created a representation of each one as a histogram of the visual words found in that image. For the feature at each point in the image, I simply found the closest 'visual word' and incremented that word's bin.
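
A sketch of this histogram step for one image, using vl_alldist2 (which returns pairwise squared distances between columns); img, vocab, and vocabSize are as in the sketches above:

    % Assign each dense SIFT descriptor to its nearest visual word and count
    % the assignments to form the bag-of-words histogram.
    [~, descriptors] = vl_dsift(single(img), 'Size', 4, 'Step', 8);
    d2 = vl_alldist2(single(descriptors), vocab);    % N x vocabSize squared distances
    [~, nearestWord] = min(d2, [], 2);               % index of the closest word per descriptor
    bowHist = accumarray(nearestWord, 1, [vocabSize, 1]);
    bowHist = bowHist / sum(bowHist);                % illustrative normalization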

I didn't modify the last 3 steps of the assignment, and simply used the baseline given to us:

After converting my training images to the histogram representation, I trained a 1-vs-all classifier for each scene category. Then, I classified each test image by converting it to a histogram representation and running all 15 classifiers (one for each scene) on it. The image was classified as the scene whose classifier returned the highest confidence rating. Out of these classifications I built a confusion matrix:
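
Since I used the provided baseline, the code below is only a sketch of what a 1-vs-all setup looks like with linear SVMs via VLFeat's vl_svmtrain (the lambda value and variable names are illustrative, and the actual baseline may differ in its details):

    % Train one linear SVM per scene (1 vs. all). train_hists is a
    % vocabSize x numTrainImages matrix of histograms; train_labels and
    % categories are cell arrays of scene names.
    lambda  = 0.0001;                              % illustrative regularization strength
    numCats = numel(categories);
    W = zeros(vocabSize, numCats);
    B = zeros(1, numCats);
    for c = 1:numCats
        binary = double(strcmp(train_labels, categories{c}));
        binary(binary == 0) = -1;                  % +1 for this scene, -1 for all others
        [W(:, c), B(c)] = vl_svmtrain(train_hists, binary(:)', lambda);
    end

    % Classify one test histogram: the scene whose SVM is most confident wins.
    scores = W' * test_hist + B';                  % one confidence value per scene
    [~, best] = max(scores);
    predicted_category = categories{best};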

This is a visualization of it:

Finally, I measured the accuracy of my scene classification as the mean per-scene accuracy: for each scene, the number of test images classified correctly divided by the number of test images for that scene, averaged over all scenes.
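
In terms of the confusion matrix, this is just the mean of the per-row accuracies (a small sketch, assuming confusion_matrix(i, j) counts test images of true scene i that were labeled as scene j):

    % Mean per-scene accuracy from the confusion matrix.
    perSceneAccuracy = diag(confusion_matrix) ./ sum(confusion_matrix, 2);
    accuracy = mean(perSceneAccuracy);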

I got an accuracy of .6613 with these parameters, using 1/10 of the training images. Using all the training images, I got an accuracy of .6767.

One of the suggested extra credit items I did was soft assignment when assigning visual words to histogram bins, using 'Kernel Codebook Encoding' (Chatfield et al.). This improved my accuracy: using 1/10 of the images, it went from .6613 to .6647. However, given the variance from randomly choosing the images, this difference was not statistically significant.

So, I redid the classification using soft-assignment and all of the training images. Now, my accuracy went from .6767 to .6813. Because there is now no random variance except from k-means, this is a significant improvement, and thus soft assignment does help.

My parameters for soft assignment were:

Number of bins: 3 - I incremented the 3 closest bins for each feature.

sigma^2: 6250

Equation: weight added to each bin = exp(-(distance^2) / (2 * sigma^2))
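
A sketch of this soft assignment, reusing descriptors, vocab, and vocabSize from the earlier sketches and assuming the Gaussian kernel form from Chatfield et al. (note that vl_alldist2 already returns squared distances):

    % Kernel codebook encoding: instead of incrementing only the nearest bin,
    % spread weight over the 3 closest visual words with a Gaussian kernel.
    sigmaSq = 6250;
    numSoft = 3;
    bowHist = zeros(vocabSize, 1);
    for n = 1:size(descriptors, 2)
        d2 = vl_alldist2(single(descriptors(:, n)), vocab);  % squared distances to every word
        [sortedD2, order] = sort(d2, 'ascend');
        for k = 1:numSoft
            weight = exp(-sortedD2(k) / (2 * sigmaSq));      % closer words get larger weights
            bowHist(order(k)) = bowHist(order(k)) + weight;
        end
    end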

Another extra credit I did was to try different vocabulary sizes. I tried 10/20/50/100/400/1000 visual words in addition to the default 200 that I tried originally. I couldn't do 10,000 because converting all the training images to the histogram representation was going to take many, many hours.

The accuracies I got with soft assignment were:

Vocab size:    10       20       50       100      200      400      1000
Accuracy:      .3933    .5133    .6220    .6507    .6687    .6847    .7073

Clearly, accuracy is strictly increasing as the vocab size gets bigger, at least up to a vocab size of 1000.

The reason this value differs from the one above for a vocab size of 200 is that I'm randomly sampling 1 out of every 10 images, so there is some randomness involved. Also, because my MATLAB was seg-faulting on the larger vocabs, for vocab sizes >= 200 I clustered each image's features into exactly 200 'words' (as opposed to vocabSize clusters), then combined all those clusters at the end and clustered them again to get the appropriate number of words.
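
In terms of the vocabulary-building sketch above, this workaround just caps the stage-1 cluster count (the rest of that sketch is unchanged):

    % Cap the per-image cluster count at 200 for large vocabularies; the final
    % k-means pass over the pooled centers still produces vocabSize words.
    perImageK = min(vocabSize, 200);
    centers   = vl_kmeans(single(descriptors), perImageK);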