The first step in the algorithm is to collect features from a set of images and build a visual vocabulary, or bag of words. For my implementation, I sample 300 random SIFT features from each of 100 pictures in 15 categories (1,500 pictures total, 450,000 samples). SIFT features are extracted with size set to 4 and step set to 8 (as suggested in the handout). Then, from this set of 450,000 random features, a vocabulary is created by clustering them with k-means and using the cluster centers as our words.
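The vocabulary-building step can be sketched as follows. This is a hedged illustration, not my actual implementation: the real pipeline extracts dense SIFT descriptors from the images, while here random 128-D vectors stand in for the sampled descriptors, and scikit-learn's `MiniBatchKMeans` stands in for whatever k-means routine was used.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors, vocab_size, seed=0):
    """Cluster sampled SIFT descriptors; the cluster centers become the 'words'."""
    km = MiniBatchKMeans(n_clusters=vocab_size, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km.cluster_centers_  # shape: (vocab_size, 128)

# Stand-in for the 450,000 sampled 128-D SIFT descriptors (300 per image).
rng = np.random.default_rng(0)
descriptors = rng.random((5000, 128)).astype(np.float32)
vocab = build_vocabulary(descriptors, vocab_size=50)
```

The vocabulary size (50 here) is a placeholder; the report does not state the value actually used.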
The next step is to convert each training image to a histogram representation, that is, a histogram that tracks how frequently each vocabulary word appears in the image. To do this, SIFT features are extracted from each image with the same parameters as above (size = 4, step = 8); then, using the distances between each feature vector and the vocabulary vectors, the nearest vocabulary word is chosen for each SIFT vector, which adds one vote for that word to the histogram.
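The nearest-word voting described above can be sketched like this (again a hedged illustration with random vectors standing in for real SIFT descriptors; the normalization step is my assumption, a common way to make histograms comparable across images):

```python
import numpy as np
from scipy.spatial.distance import cdist

def image_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest vocab word and count the votes."""
    d = cdist(descriptors, vocab)      # (n_features, vocab_size) distances
    words = d.argmin(axis=1)           # index of the nearest word per feature
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()           # normalize so feature count doesn't matter

rng = np.random.default_rng(1)
vocab = rng.random((50, 128))          # placeholder vocabulary
feats = rng.random((300, 128))         # placeholder SIFT features for one image
h = image_histogram(feats, vocab)
```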
For the next three steps (learning a set of SVMs from the histograms, testing images, and scoring), I generally stuck with the provided baseline. I kept the linear kernel and lambda at 0.1.
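As a rough sketch of this stage, scikit-learn's `LinearSVC` trains one-vs-rest linear SVMs over the histograms. The original baseline is not shown here, and the mapping from its lambda regularizer to `LinearSVC`'s `C` parameter (`C = 1/(n*lambda)`) is one common convention, assumed rather than taken from the source; the toy data is hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical stand-in data: 60 histogram vectors (50-D) across 3 categories.
rng = np.random.default_rng(2)
X = rng.random((60, 50))
y = np.repeat(np.arange(3), 20)

# LinearSVC fits one linear SVM per class (one-vs-rest) with a linear kernel.
# C = 1/(n*lambda) is an assumed translation of the lambda = 0.1 regularizer.
clf = LinearSVC(C=1.0 / (len(X) * 0.1))
clf.fit(X, y)
pred = clf.predict(X)
```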
Here are the results, with classification accuracy organized by image category:

Suburb 96%
Forest 79%
InsideCity 93%
OpenCountry 79%
TallBuilding 62%
Bedroom 84%
Kitchen 44%
Store 55%
Coast 74%
Highway 89%
Mountain 43%
Street 33%
Office 50%
Industrial 10%
Livingroom 47%
Here's the confusion matrix visualization:
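The matrix behind such a visualization can be computed with scikit-learn's `confusion_matrix`; row-normalizing puts per-category accuracy on the diagonal, which is what the heat map displays. The toy labels below are hypothetical, not the actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted labels for a 3-category toy example.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred)          # rows = true, cols = predicted

# Row-normalize: each diagonal entry is that category's accuracy.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
# A heat map of cm_norm (e.g. plt.imshow) gives the usual visualization.
```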
While performing superbly for a select few categories, my bag-of-words implementation has somewhat mixed results. With the features randomly selected and clustered in this attempt, performance ended up very high for the 'Suburb' and 'InsideCity' categories, but poor for others, particularly 'Industrial'. On the whole, results were decent, with a fairly baseline average accuracy of 62.5%.