Scene Recognition Using Bag of Words Models

by Maggie Pace

This webpage displays some of my results for the Brown University CS 1430: Introduction to Computer Vision project, Scene Recognition with Bag of Words. When considering how to detect the scene of an image, one would usually expect the spatial arrangement of different features in the image to be important. However, the bag-of-words recognition model throws out most spatial information, yet still produces remarkable results at scene recognition.

The scene recognition pipeline that I implemented goes as follows:

  1. Create a vocabulary of visual words by collecting local features and clustering them
  2. Create a histogram bag-of-words representation of each training image
  3. Train a 1-vs-all classifier for each scene category based on the bag-of-words histograms
  4. Convert each test image to its bag-of-words representation and evaluate it against all 15 classifiers to determine which scene category it belongs to
  5. Build a confusion matrix and measure accuracy

The final 3 steps were given in the stencil code.

Confusion matrix from the scene recognition training and classification. Redder squares mean that more images were assigned to that category. Squares on the diagonal, where the test image's true category matches the classifier's prediction, signify correct classifications. This chart depicts an accuracy of 0.6147.

The Algorithm

First, I generated a vocabulary. I did this by running vl_dsift on all of the training images to collect many local features. I reshaped the matrix generated by this process and passed it into vl_kmeans to cluster the features. The resulting cluster centers represent the visual words, and together they form the vocabulary.
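Below is a minimal sketch of this vocabulary-building step. The variable names (image_paths, vocab_size) and the dense SIFT step size are illustrative assumptions, not values from the stencil; vl_dsift and vl_kmeans are the VLFeat functions named above.

    % Sketch of vocabulary construction. Assumes image_paths holds paths to
    % grayscale training images; vocab_size and the 'Step' value are assumed.
    vocab_size = 200;                              % number of visual words
    features = cell(1, numel(image_paths));
    for i = 1:numel(image_paths)
        img = im2single(imread(image_paths{i}));   % vl_dsift needs single grayscale
        % dense SIFT over a grid; a coarse step keeps the feature count manageable
        [~, descriptors] = vl_dsift(img, 'Step', 10, 'Fast');
        features{i} = single(descriptors);         % 128 x N per image
    end
    all_features = cat(2, features{:});
    % k-means clusters the descriptors; the centers are the visual words
    vocab = vl_kmeans(all_features, vocab_size);   % 128 x vocab_size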

Next, I represented each image, training and test alike, as a combination of these visual words. After running vl_dsift on an image, I found, for each resulting feature vector, the closest visual word in the vocabulary generated in the previous step. To accomplish this, I created a distance matrix using the function all_dist, which computes the distances between each feature vector in the image and each visual word. I then generated a histogram for the image by incrementing the count of the bin corresponding to the word with the smallest distance to each feature vector. Each histogram is normalized so that every image has the same weight in the classifier regardless of its size; I normalized by dividing the amount by which I incremented each bin by the total number of feature vectors.
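The sketch below shows how this histogram step might look as a function. VLFeat's vl_alldist2 stands in for the all_dist helper mentioned above, and the function and variable names are illustrative assumptions.

    % Sketch of the bag-of-words histogram step. Assumes vocab is the
    % 128 x K matrix of visual words from the previous step.
    function histogram = bag_of_words(img, vocab)
        [~, descriptors] = vl_dsift(im2single(img), 'Step', 10, 'Fast');
        % pairwise distances between every descriptor and every visual word;
        % vl_alldist2 stands in here for the all_dist helper
        distances = vl_alldist2(single(descriptors), vocab);   % N x K
        [~, nearest] = min(distances, [], 2);   % index of the closest word
        histogram = histc(nearest, 1:size(vocab, 2));
        % normalize so image size does not affect the classifier's weighting
        histogram = histogram / numel(nearest);
    end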

The following is an example of one of these histograms. Note that the ordering of the words along the x-axis carries no significance, since the histogram represents an unordered bag of words.

In the third step, a 1-vs-all linear SVM classifier was trained for each scene based on the histograms generated in the previous step. Finally, the test images were classified using these trained models, and the resulting data was used to create the confusion matrix seen above.
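Although this step came with the stencil code, the following is a minimal sketch of what 1-vs-all training and evaluation can look like with VLFeat's vl_svmtrain; the histogram matrices, label variables, and lambda value are all assumptions for illustration.

    % Sketch of 1-vs-all SVM training and classification. Assumes
    % train_hists / test_hists are K x N matrices of normalized histograms
    % and train_labels is a cell array of category names.
    lambda = 0.0001;                               % regularization (assumed)
    categories = unique(train_labels);
    num_cats = numel(categories);
    W = zeros(size(train_hists, 1), num_cats);
    B = zeros(1, num_cats);
    for c = 1:num_cats
        % +1 for images of this category, -1 for everything else
        binary_labels = 2 * strcmp(train_labels, categories{c}) - 1;
        [W(:, c), B(c)] = vl_svmtrain(train_hists, binary_labels, lambda);
    end
    % each test image gets the category whose SVM scores it most confidently
    scores = bsxfun(@plus, W' * test_hists, B');   % num_cats x num_test
    [~, best] = max(scores, [], 1);
    predictions = categories(best);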