CS 143 / Project 3 / Scene Recognition with Bag of Words

The implementation was relatively straightforward. I didn't add any bells or whistles, and there's one error message at the end of the run, which doesn't compromise the results, only the printed output.

Here we implemented scene recognition using a variety of methods. For image representation, we implemented tiny images and a bag of SIFT features. To perform scene matching, we used a nearest neighbor classifier and a linear SVM classifier. Much of this functionality was implemented using MATLAB functions from VLFeat.org.


Image Representation

Tiny Images

To implement a tiny image descriptor, I first cropped every image to a perfect square, based on the length of its shorter side. I then resized the image using MATLAB's imresize function, and linearized it using the im2col function.
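As a rough NumPy sketch of those two steps (not my actual MATLAB code: nearest-neighbor resampling stands in for imresize, and a plain flatten stands in for im2col; the function name is mine):

```python
import numpy as np

def tiny_image(img, size=16):
    """Crop a grayscale image to a square, shrink it, and flatten it."""
    h, w = img.shape
    s = min(h, w)                           # side length of the square crop
    top, left = (h - s) // 2, (w - s) // 2
    crop = img[top:top + s, left:left + s]
    # Nearest-neighbor resampling; imresize interpolates more carefully.
    idx = (np.arange(size) * s // size).astype(int)
    small = crop[np.ix_(idx, idx)]
    return small.astype(float).ravel()      # linearize into one feature vector
```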


Bag of SIFT

Luckily, rather than reusing my SIFT feature detector from the last project, I was able to use VLFeat's vl_dsift function. I ran vl_dsift with the 'fast' setting and a step size of 10. I then ran knnsearch on the resulting SIFT features to match them against the vocabulary we built earlier (see next section). This returns a matrix of indices holding each feature's best match, and a matrix of the corresponding smallest distances. I then tabulated the features across the entire image into a histogram over the 400 visual words in the vocabulary. Rather than incrementing each bin by one, I weighted each vote by its distance, thus giving better matches more weight.
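The matching-and-voting step looks roughly like this in NumPy (a sketch, not my MATLAB code; the inverse-distance weight 1/(1 + d) is just one plausible choice of weighting):

```python
import numpy as np

def weighted_bow_histogram(features, vocab):
    """Vote each SIFT descriptor into its nearest visual word, weighted by distance."""
    # Squared distance from every descriptor to every vocabulary word;
    # this plays the role of knnsearch with k = 1.
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)                            # index of the best match
    dist = np.sqrt(d2[np.arange(len(features)), idx])  # its distance
    hist = np.zeros(len(vocab))
    # Closer matches get more weight; 1/(1 + d) is an assumed weighting form.
    np.add.at(hist, idx, 1.0 / (1.0 + dist))
    return hist
```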


Building the Vocabulary

To actually build the vocabulary from the training data, I pulled a very, very large number of SIFT features out of a training set. I'm currently testing the accuracy vs. efficiency trade-off of a smaller step size. I had started at 20, which produced okay results, but I cranked it down to 5 to see how much it could possibly improve. It's been running the whole time I've been typing this.

After generating the massive set of SIFT features, I clustered them with vl_kmeans into a vocabulary of 400 distinct visual words. Clustering, it turns out, also takes a huge amount of time when you're working with 3,800,991 SIFT features. I think I might have to tone this down.

Well, it was taking over an hour, so I turned the step size back up to 20. It's now running vl_kmeans on just 254,700 SIFT features.
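The clustering itself is just k-means over the pooled descriptors. A tiny NumPy version of Lloyd's algorithm shows the shape of the computation (vl_kmeans is far faster and uses smarter initialization; seeding with the first k features here is purely for illustration):

```python
import numpy as np

def build_vocab(feats, k, iters=20):
    """Cluster pooled SIFT features into k visual words with plain Lloyd's k-means."""
    centers = feats[:k].astype(float).copy()   # naive seeding; vl_kmeans does better
    for _ in range(iters):
        # Assign every feature to its nearest center...
        d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # ...then move each center to the mean of its assigned features.
        for j in range(k):
            members = feats[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```

On millions of features the distance matrix above would not fit in memory, which is exactly why the step size (and so the feature count) matters so much here.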


Classifying

K-Nearest Neighbor

To build the nearest neighbor classifier, I took the histogram representation of each test image and expanded it using repmat so that the repeated matrix matched the dimensions of the matrix holding the representations of all the training images and their histograms. I then took the sum-squared difference, found the minimum in the resulting column, and its index thus gave the nearest neighbor.
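A NumPy sketch of the same computation, where broadcasting plays the role of the repmat expansion (names are mine, not from my MATLAB code):

```python
import numpy as np

def nearest_neighbor_labels(test_hists, train_hists, train_labels):
    """Label each test histogram with the label of its closest training histogram."""
    # Broadcasting expands both sets, like repmat in the MATLAB version.
    diff = test_hists[:, None, :] - train_hists[None, :, :]
    ssd = (diff ** 2).sum(axis=2)            # sum-squared difference
    return [train_labels[i] for i in ssd.argmin(axis=1)]
```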


SVM Classifier

This was the hardest part to build. I first had to create all the SVMs, one per category, each testing that category against all the others. To do this, I relabeled the training data as either belonging to the category or not. I then ran the training data with the new labels through the vl_svmtrain function, which returns a weight vector W and a bias B, the parameters of the linear SVM. vl_svmtrain also takes a regularization parameter lambda, which I tested from 1 down to 0.000001; tuning it was a finicky process, to say the least.
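Since I can't reproduce vl_svmtrain here, this is a hedged NumPy sketch of training one such one-vs-all linear SVM by subgradient descent on the hinge loss. The learning rate, epoch count, and update rule are illustrative choices of mine, not what vl_svmtrain does internally:

```python
import numpy as np

def train_binary_svm(X, y, lam=1e-3, lr=0.1, epochs=1000):
    """Fit w, b minimizing lam/2 * ||w||^2 + mean hinge loss; y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                 # examples violating the margin
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Here lam plays the same role as the lambda passed to vl_svmtrain: larger values shrink w harder, smaller values fit the training labels more aggressively, which is why tuning it is so finicky.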

Then, for each image histogram, I looped through the trained SVMs, found the one with the highest computed score according to W .* image + B, and returned the corresponding category.
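The one-vs-all decision then reduces to an argmax over the per-category scores. Sketched in NumPy, with W stacked one row per category (illustrative names again):

```python
import numpy as np

def predict_categories(hists, W, B, categories):
    """Score every histogram against every one-vs-all SVM and pick the best category."""
    scores = hists @ W.T + B          # (n_images, n_categories) score matrix
    return [categories[i] for i in scores.argmax(axis=1)]
```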

I got the parameters to produce results around 0.68 at one point, but I can't seem to recreate that.


Results

Tiny Image with K-Nearest Neighbors: 0.197 accuracy

Tiny Image with SVM classifier: 0.164 accuracy, Lambda = 0.00006

Bag of SIFT with K-Nearest Neighbors: 0.465 accuracy

Bag of SIFT with SVM classifier: 0.582 accuracy, Lambda = 0.00002