CS143 Project 3

Project 3: Scene recognition with bag of words

by rmcverry
10/24/11

Project Overview

High-level scene recognition is an important problem in computer vision: we want to see whether computers can sort images into a thematic, scene-based classification scheme. A good way to approach this problem is the Bag of Words model. The model is simple: visual "words" (SIFT descriptors in our case) can be loosely or tightly associated with particular scenes. Given a dictionary of vocabulary "words", we learn how these words are distributed across the images of each scene by training an SVM on the word-frequency histograms of the training images. We later use the SVM to classify new test images: we compute the histogram of vocabulary "words" for a test image and match it against the scene models with the SVM.

The Bag of Words Model Workflow:

  1. Build a Vocabulary List from a Collection of Images
    1. From each image, collect a large set of SIFT features
    2. Run k-means to cluster these features into a visual vocabulary
  2. Build an SVM Model for each Scene
    1. For each training image, build a histogram of its visual word frequencies
    2. For each scene, pump the corresponding training image histograms into an SVM to create a representative model
  3. Classify a Test Image
    1. Compute the histogram of its visual vocabulary word frequencies
    2. Test that histogram against the SVM model for each scene
    3. Assign the image to the scene that produced the best result.
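
As a rough end-to-end sketch, the pipeline looks roughly like the MATLAB outline below. The names build_vocabulary and make_hist match the scripts described later, while train_svm and classify_svm are hypothetical wrappers for the SVM code; the exact signatures are my guesses rather than the actual interfaces.

% Step 1: build a 200-word visual vocabulary from the training images
vocab = build_vocabulary(train_image_paths, 200);

% Step 2: one histogram per training image, then one SVM model per scene
train_hists = make_hist(train_image_paths, vocab);
svm_models  = train_svm(train_hists, train_labels);   % hypothetical wrapper

% Step 3: histogram each test image and assign the scene whose SVM scores highest
test_hists  = make_hist(test_image_paths, vocab);
predictions = classify_svm(svm_models, test_hists);   % hypothetical wrapper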



Build Vocabulary List

We need to devise a strong vocabulary for our Bag of Words model. Our vocabulary list is a collection of visual "words"; specifically, the words are scale-invariant (SIFT) features, which let us identify and describe local structure in images. In my code I used a vocabulary size of 200 "words" taken from all 1500 training images. Specifically, I used vl_dsift to give me a matrix of SIFT descriptors (one per column) from each image. I then concatenated these features and pumped them into vl_kmeans, which clusters all of the SIFT descriptors into 200 clusters. I take the center of each cluster as a word in my vocabulary list.

Note: the relative advantage of clustering all features from all 1500 training images was minimal. Clustering a randomized subset of features, drawn from a randomized subset of images, can still find 200 strong vocabulary words and reduces the running time of this step tremendously.
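
A minimal sketch of this step, using VLFeat's vl_dsift and vl_kmeans. The dense-SIFT parameters, the per-image feature subsampling, and the image_paths cell array are illustrative assumptions, not necessarily the exact settings of my build_vocabulary.m:

function vocab = build_vocabulary_sketch(image_paths, vocab_size)
% Collect dense SIFT descriptors from every training image and cluster them.
all_feats = [];
for i = 1:length(image_paths)
    img = imread(image_paths{i});
    if size(img, 3) > 1, img = rgb2gray(img); end   % vl_dsift expects grayscale
    [~, feats] = vl_dsift(single(img), 'Size', 4, 'Step', 8);
    % Keep a random subset of descriptors per image to speed up k-means
    perm = randperm(size(feats, 2));
    keep = perm(1:min(100, length(perm)));
    all_feats = [all_feats, single(feats(:, keep))];
end
% Cluster the 128-dimensional descriptors; each column of vocab is a center
vocab = vl_kmeans(all_feats, vocab_size);
end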

Creating a Histogram for Each Image

In our workflow, the SVM creates a model for each scene based on the corresponding training images' word-frequency histograms. The next part of the project is to build these histograms. Given an image and a vocabulary list, we first find all of the SIFT features in the image using vl_dsift(). We then create a histogram with 200 buckets, one per word in the vocabulary. For each SIFT feature in the image, I compare it against all 200 vocabulary words using vl_alldist2, find the word at minimum distance, and increment its bucket. I then normalize the histogram.
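
A sketch of the histogram construction for a single image, assuming the image is already a grayscale array and vocab is the 128 x 200 matrix of cluster centers from vl_kmeans (the dense-SIFT parameters are the same assumptions as above):

function h = make_hist_sketch(img, vocab)
% Build a normalized visual-word frequency histogram for one image.
[~, feats] = vl_dsift(single(img), 'Size', 4, 'Step', 8);
vocab_size = size(vocab, 2);
h = zeros(vocab_size, 1);
% Distance from every descriptor to every vocabulary word
dists = vl_alldist2(single(feats), vocab);    % (num descriptors) x vocab_size
[~, nearest] = min(dists, [], 2);             % index of the closest word per descriptor
for k = 1:length(nearest)
    h(nearest(k)) = h(nearest(k)) + 1;        % increment that word's bucket
end
h = h / sum(h);                               % normalize to a frequency distribution
end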

The make_hist() script is used in two stages. During the learning stage, we take the histograms of every training image in a scene and pump them into the SVM to determine that scene's model; in my execution I built histograms for 100 training images per scene class. make_hist.m is called a second time during the testing phase, to build the histogram for each test image.
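
To make the two stages concrete, here is a hedged sketch of the 1-vs-all setup, assuming train_hists and test_hists hold one normalized histogram per row, train_labels is a cell array of scene names, and train_linear_svm is a hypothetical stand-in for the actual SVM routine (returning a weight vector and bias for a binary linear SVM):

categories = unique(train_labels);
num_categories = length(categories);

% Training: one binary linear SVM per scene (this scene vs. all others)
W = zeros(size(train_hists, 2), num_categories);
B = zeros(1, num_categories);
for c = 1:num_categories
    binary_labels = 2 * strcmp(train_labels, categories{c}) - 1;     % +1 this scene, -1 otherwise
    [W(:, c), B(c)] = train_linear_svm(train_hists, binary_labels);  % hypothetical helper
end

% Testing: score every test histogram against every scene model, take the best
scores = test_hists * W + repmat(B, size(test_hists, 1), 1);
[~, best] = max(scores, [], 2);
predicted_categories = categories(best);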

Results

For my final test I ran a full evaluation.

I set the parameters to the following:

num_train_per_class = 100
vocab_size = 200

In build_vocabulary.m and make_hist.m, my vl_dsift() calls used a size of 4 and a step of 8.

The complete run through training and testing showed that my code had an accuracy of 0.6173. The confusion matrix shows a strong response down the diagonal, which means that most images were assigned to the correct scene.

Confusion Matrix: