By combining SIFT descriptors into a Bag of Words model, we can build a vocabulary of visual words for different scenes and then use histogram distance to determine whether an image belongs to a scene category.
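As a rough illustration of the histogram-distance idea, the sketch below compares normalized word histograms with a chi-squared distance and picks the closest category. The choice of chi-squared, the function names, and the per-category reference histograms are all assumptions made for this example; the write-up does not say which distance was used.

```python
def chi_squared_distance(hist_a, hist_b):
    """Chi-squared distance between two histograms of equal length."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def nearest_category(hist, category_histograms):
    """Return the category whose reference histogram is closest to hist.

    category_histograms is a hypothetical dict mapping a category name
    to one representative normalized histogram for that category.
    """
    return min(
        category_histograms,
        key=lambda name: chi_squared_distance(hist, category_histograms[name]),
    )


# Toy example with a vocabulary of three visual words.
references = {
    "kitchen": [0.7, 0.2, 0.1],
    "forest": [0.1, 0.2, 0.7],
}
query = [0.6, 0.3, 0.1]
label = nearest_category(query, references)
```

With these toy values the query is much closer to the "kitchen" reference, so that is the label returned.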
For each image in the training dataset, we take a random sample of SIFT descriptors and pass them to k-means, which clusters them into a vocabulary of k visual words. We then build a histogram for each image that counts how many of its descriptors fall into each visual word. Once the training dataset has been analyzed, we test each image by assigning its descriptors to the vocabulary and computing its histogram. The histogram is then passed to an SVM that has been trained on the training dataset.
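The vocabulary and histogram steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the code used for the reported results: the descriptors are short toy vectors rather than 128-dimensional SIFT descriptors, the k-means loop is a basic Lloyd iteration with a fixed iteration count, and all function names are hypothetical. The SVM step is left out; the normalized histograms returned here are what would be handed to it.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary center closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda i: squared_distance(descriptor, vocabulary[i]),
    )


def build_vocabulary(descriptors, k, iterations=10, seed=0):
    """Cluster sampled descriptors into k visual words (basic k-means)."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[i] = [
                    sum(m[j] for m in members) / len(members) for j in range(dim)
                ]
    return centers


def bow_histogram(descriptors, vocabulary):
    """Normalized count of how many descriptors fall into each visual word."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Toy data: two well-separated groups of 2-D "descriptors".
sampled = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
vocabulary = build_vocabulary(sampled, k=2)
image_descriptors = [[0.0, 0.05], [5.0, 5.05], [5.05, 5.0]]
histogram = bow_histogram(image_descriptors, vocabulary)
```

Normalizing the histogram keeps images with different numbers of sampled descriptors comparable, which matters when the sample size varies between images.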
The vocabulary size was set to 100 and 200, giving accuracies of 0.6093 and 0.6167, respectively. Therefore, increasing the vocabulary size doesn't seem to improve performance much when the random sample size is smaller than the vocabulary size.
 
       
The Bag of Words model seems to work; however, the loss of spatial information does seem to decrease performance, as can be seen with the Lazebnik algorithm, which keeps spatial information. Even without spatial information, we still get a good estimate of what the scene is.
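To make the role of spatial information concrete, here is a simplified sketch in the spirit of a spatial pyramid: the image is divided into progressively finer grids, and a word histogram is computed per cell and concatenated. It assumes keypoint positions are normalized to the range [0, 1] and that each keypoint has already been assigned a visual word index. The function name is hypothetical, and the per-level weighting used in the published method is omitted for brevity.

```python
def spatial_pyramid_histogram(points, words, vocab_size, levels=2):
    """Concatenate per-cell word counts over grids of 1x1, 2x2, 4x4, ...

    points: list of (x, y) keypoint positions, each coordinate in [0, 1]
    words:  list of visual word indices, one per keypoint
    """
    features = []
    for level in range(levels + 1):
        cells = 2 ** level  # grid is cells x cells at this level
        hist = [0] * (cells * cells * vocab_size)
        for (x, y), word in zip(points, words):
            # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * vocab_size + word] += 1
        features.extend(hist)
    return features


# Two toy images with the same words in swapped positions.
words = [0, 1]
layout_a = [(0.1, 0.1), (0.9, 0.9)]
layout_b = [(0.9, 0.9), (0.1, 0.1)]

flat_a = spatial_pyramid_histogram(layout_a, words, vocab_size=2, levels=0)
flat_b = spatial_pyramid_histogram(layout_b, words, vocab_size=2, levels=0)
pyramid_a = spatial_pyramid_histogram(layout_a, words, vocab_size=2, levels=1)
pyramid_b = spatial_pyramid_histogram(layout_b, words, vocab_size=2, levels=1)
```

With `levels=0` this reduces to the plain Bag of Words count, so the two layouts are indistinguishable; adding one finer level is enough to tell them apart.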