We first extracted dense SIFT features from all of the images in the training database, using a step size of 4 and a bin size of 8 (via the vl_dsift function of the VLFeat vision library, which is implemented in C). This gave us a matrix of 128 rows by several million columns, where each column is a SIFT descriptor from some image. We then sampled roughly 30% of these features i.i.d. without replacement to obtain a computationally feasible training subset. The sampled SIFT features were clustered into 200 clusters with the vl_kmeans function, and the 128-dimensional center of each cluster was used as an entry in our vocabulary.
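The following MATLAB sketch illustrates this vocabulary-building step. It assumes VLFeat's MATLAB toolbox is on the path; trainImagePaths, vocabSize, and sampleFraction are placeholder names rather than identifiers from our actual code.

```matlab
% Sketch of the vocabulary construction described above (assumed name:
% trainImagePaths is a cell array of training image file names).
vocabSize      = 200;   % number of k-means clusters / visual words
sampleFraction = 0.3;   % fraction of descriptors kept for clustering

allFeatures = [];
for i = 1:numel(trainImagePaths)
    im = imread(trainImagePaths{i});
    if size(im, 3) == 3
        im = rgb2gray(im);
    end
    % Dense SIFT with a step of 4 pixels and a bin size of 8 pixels;
    % descriptors is a 128 x numKeypoints matrix.
    [~, descriptors] = vl_dsift(single(im), 'Step', 4, 'Size', 8);
    allFeatures = [allFeatures, single(descriptors)]; %#ok<AGROW>
end

% Sample ~30% of the descriptors uniformly at random, without replacement.
numFeatures = size(allFeatures, 2);
keep        = randperm(numFeatures, round(sampleFraction * numFeatures));

% Cluster the sampled descriptors; each 128-dimensional cluster center
% becomes one column of the 128 x 200 vocabulary matrix.
vocab = vl_kmeans(allFeatures(:, keep), vocabSize);
```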
Our approach classified roughly 60% of the test images correctly. Several improvements could have yielded better performance: for example, we did not experiment with different parameter settings (such as the step size, bin size, or vocabulary size), and we used the simplest kernel for the SVM.
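As one concrete example of the second point, VLFeat's vl_homkermap can approximate a chi-squared kernel while still training fast linear SVMs with vl_svmtrain. The sketch below is hypothetical rather than our submitted code: trainHists (a vocabSize x numTrainImages matrix of bag-of-words histograms), trainLabels (a cell array of category names), and the untuned lambda value are all assumptions about the surrounding pipeline.

```matlab
% Hypothetical improvement: lift the histograms with a homogeneous kernel
% map (the default kernel is chi-squared), then train one-vs-all linear SVMs.
lambda      = 0.0001;                  % SVM regularizer (untuned)
categories  = unique(trainLabels);     % assumed cell array of class names
mappedHists = vl_homkermap(single(trainHists), 1);

models = cell(numel(categories), 1);
for c = 1:numel(categories)
    % One-vs-all labels in {-1, +1} for category c.
    y = 2 * double(strcmp(trainLabels, categories{c})) - 1;
    [w, b] = vl_svmtrain(mappedHists, y, lambda);
    models{c} = struct('w', w, 'b', b);
end
```

In our reading of the VLFeat documentation, the kernel map trades a modest increase in feature dimensionality for accuracy closer to an exact chi-squared kernel SVM, while keeping the speed of a linear solver.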
The confusion matrix of our results indicates that the algorithm did well on certain categories (such as Suburb, InsideCity, and Bedroom) and poorly on others (particularly Office, Industrial, and TallBuilding). It seems likely that the categories on which the algorithm performed poorly shared visual words with other categories in the training set, so the classifier could not distinguish between them as reliably: