Project 3: Scene Recognition with Bag of Words

Seth Goldenberg

Background

Scene recognition is a widely studied task in the field of computer vision. For this project, I implemented the baseline method described by Lazebnik et al. (2006), which uses a bag-of-words model to build a visual vocabulary for describing a scene. The visual words are local SIFT descriptors drawn from the training images, with no regard to their spatial arrangement within the image. A classifier is then trained on these words using machine learning, and its performance is evaluated on a scene recognition task over a 15-category scene dataset.

Pipeline

Collecting Features

We start off by collecting image features for our training images: 100 images for each of the 15 scene categories. For image features, I extract dense SIFT descriptors over each image, on a grid with a step of 8 pixels and a bin size of 4 pixels. This provides a large number of descriptors per image. At no point in the pipeline is the spatial layout of the image considered. This has upsides and downsides. On one hand, matching scenes based on components without regard to where they appear in an image makes the model more forgiving of variation in the image's structure: certain objects may appear frequently in images of a category without showing up in the same locations, and this model handles those cases. On the other hand, a particular arrangement of objects may itself be characteristic of a scene category, and that cue is lost. A sketch of the dense extraction step appears after the figure below.

Visualization of a SIFT descriptor centered around an image point (Credit: Aresh T. Saharkhiz)
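The original code likely used VLFeat's vl_dsift in MATLAB; the following Python sketch approximates the same dense extraction with OpenCV. This is an assumption rather than the original implementation: OpenCV exposes a keypoint size rather than vl_dsift's bin size, so the patch parameter is only a rough stand-in, and "scene.jpg" is a placeholder path.

    import cv2

    def dense_sift(gray, step=8, patch=16):
        """Compute 128-D SIFT descriptors on a regular grid.

        step mirrors the 8-pixel spacing described above; patch is a
        stand-in for vl_dsift's 4-pixel bin size, which OpenCV does not
        expose directly.
        """
        h, w = gray.shape
        grid = [cv2.KeyPoint(float(x), float(y), float(patch))
                for y in range(step, h - step, step)
                for x in range(step, w - step, step)]
        # compute() fills in a descriptor at each fixed grid point
        _, descriptors = cv2.SIFT_create().compute(gray, grid)
        return descriptors  # shape: (number of grid points, 128)

    gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
    descriptors = dense_sift(gray)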

I also experimented with the Gist descriptor, a global image descriptor. Like SIFT, Gist examines the orientations of edge energies in an image. It differs in that it accumulates edge energies over blocks of the image and produces a single feature vector for the entire image. Categorization is thus learned from all the edges of the image, as opposed to a bag of words collected from the edges of small patches. For parameters, I mimicked Antonio Torralba's scene recognition example. In general, SIFT descriptors are better suited to object recognition due to their local nature, while the Gist descriptor is strongest at scene recognition since it considers the entire image. It should be noted that neither SIFT nor Gist considers the color information of an image. A simplified sketch of what Gist computes appears after the figure below.

Example of a Gist descriptor visualization (Credit: Prof. James Hays)
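As a rough illustration, here is a much-simplified Gist-like descriptor in Python: Gabor filter energies averaged over a 4 x 4 grid of blocks. The filter bank and frequencies are illustrative assumptions, not Torralba's reference implementation (which also prefilters the image).

    import numpy as np
    from skimage.filters import gabor
    from skimage.transform import resize

    def gist_like(gray, freqs=(0.1, 0.2, 0.3, 0.4), n_orient=8, grid=4):
        """Simplified Gist: per-block Gabor energy, 4 * 8 * 16 = 512 values."""
        gray = resize(gray, (200, 200), anti_aliasing=True)
        feats = []
        for freq in freqs:
            for i in range(n_orient):
                real, imag = gabor(gray, frequency=freq,
                                   theta=i * np.pi / n_orient)
                energy = real ** 2 + imag ** 2
                h, w = energy.shape[0] // grid, energy.shape[1] // grid
                # average the edge energy within each spatial block
                for by in range(grid):
                    for bx in range(grid):
                        feats.append(energy[by*h:(by+1)*h,
                                            bx*w:(bx+1)*w].mean())
        return np.asarray(feats)  # one global 512-D vector per image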

Building a Bag of Words

With 1500 images and around 1000 SIFT features per image, there are more than enough sample points to build a visual vocabulary for categorization. I randomly sampled 100 SIFT features from each image, giving a grand total of 150,000 features, still a large amount of data for training. With the Gist descriptor, I simply used the one descriptor per training image, a total of 1500 512-dimensional vectors, considerably less data than the 150,000 128-dimensional vectors from the SIFT features. The higher dimensionality of the SIFT features is not an issue.

In the next step, we use k-means to cluster all the sampled SIFT features into N clusters, where N is the number of words in the vocabulary. I experimented with vocabulary sizes of N = 10, 20, 50, 100, 200, 400, 1000, and 5000 on separate trial runs. For each image, I then built a histogram of its descriptors' cluster assignments. The histograms are normalized, since images of different sizes contribute different numbers of SIFT features to the bag of words; for classification purposes, we can't let larger images have more weight in our model. A sketch of this step follows.
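This is a minimal sketch of the vocabulary and histogram construction, assuming scikit-learn and a stacked matrix all_sampled_descs holding the 150,000 sampled descriptors (both names are mine, not the original code's; N = 200 is just one value from the range tried).

    import numpy as np
    from sklearn.cluster import KMeans

    N = 200  # vocabulary size; trials ranged from N = 10 to N = 5000
    kmeans = KMeans(n_clusters=N, n_init=3).fit(all_sampled_descs)

    def bow_histogram(image_descs):
        """Quantize one image's descriptors against the vocabulary and
        L1-normalize so larger images don't carry more weight."""
        words = kmeans.predict(image_descs)
        hist = np.bincount(words, minlength=N).astype(float)
        return hist / hist.sum()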

Training a Model

Once we have normalized histograms, we pass all 1500 of them, along with each image's category label, into a support vector machine. With Gist descriptors, we pass the entire 1500 x 320 matrix into the SVM.
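A sketch of the training step, assuming scikit-learn; X is the 1500-row matrix of normalized histograms (or Gist descriptors) and y holds the 1500 category labels. The linear kernel is my assumption, since the writeup doesn't state which kernel was used.

    from sklearn.svm import LinearSVC

    # one-vs-rest linear SVM over the 15 scene categories
    svm = LinearSVC(C=1.0).fit(X, y)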

Classifying Test Images

To evaluate the models, I built a histogram for each test image, using a KD-tree built from the vocabulary to assign each of the image's descriptors to its nearest visual word, and classified the histogram with the trained SVM. I then constructed a confusion matrix from the resulting assignments to evaluate the classification performance for each category.
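A sketch of the test-time loop, reusing the hypothetical kmeans and svm objects from the earlier sketches; scipy's cKDTree plays the role of the KD-tree described above.

    import numpy as np
    from scipy.spatial import cKDTree
    from sklearn.metrics import confusion_matrix

    tree = cKDTree(kmeans.cluster_centers_)  # one node per visual word

    def classify(image_descs):
        _, words = tree.query(image_descs)   # nearest word per descriptor
        hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
        return svm.predict((hist / hist.sum()).reshape(1, -1))[0]

    predictions = [classify(d) for d in test_descriptors]
    cm = confusion_matrix(test_labels, predictions)  # rows: true, cols: predicted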

Results

From the SIFT-only trials below, you can see that the bag of words model works best at N = 5000. I ran out of memory while attempting a vocabulary of 10,000 words on a department Linux machine, so I stuck with a vocabulary size of 5000 for the rest of my trials.

Vocabulary Size (N)    Accuracy
10                     0.4620
20                     0.5393
50                     0.6100
100                    0.6300
200                    0.6547
400                    0.6653
1000                   0.6833
5000                   0.7020

The Gist descriptor alone performed with an accuracy of 0.6573. I used an image size of 200 x 200 pixels. Images that weren't square were cropped before computing the descriptor, to guarantee that each image descriptor would have the same dimensionality; a sketch of this preprocessing follows.
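This sketch assumes the crop was taken at the center of the image and followed by a scale to 200 x 200; the writeup doesn't specify where the crop was taken.

    from skimage.transform import resize

    def center_square(gray, side=200):
        """Center-crop to a square, then scale to side x side so every
        Gist descriptor has the same dimensionality."""
        h, w = gray.shape
        s = min(h, w)
        top, left = (h - s) // 2, (w - s) // 2
        return resize(gray[top:top + s, left:left + s], (side, side),
                      anti_aliasing=True)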

Testing with Random Splits

To combine the SIFT and Gist descriptors into a single classifier, I trained each model individually, extracted each model's confidence for every scene category, summed the confidences, and used the category with the maximum combined confidence as the guess for that image. This gives each model equal weight. On my first run, with a vocabulary size of 5000 words, performance was very good: 77.3% accuracy, compared to 70.2% for SIFT alone (N = 5000) and 65.7% for Gist alone. This seemed like somewhat of a fluke, given how haphazardly I merged my two models.
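A sketch of the merge, assuming two scikit-learn SVMs (sift_svm and gist_svm, both hypothetical names) whose per-category decision values stand in for the confidences described above. Raw SVM margins are not calibrated probabilities, so summing them is exactly the kind of rough heuristic described.

    import numpy as np

    # each decision_function call yields one (num images x 15) score matrix;
    # summing them weights the two models equally
    scores = (sift_svm.decision_function(sift_histograms)
              + gist_svm.decision_function(gist_features))
    predictions = sift_svm.classes_[np.argmax(scores, axis=1)]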

To see if I actually had a good classifier, I did a random split of training and test images. I ran 5 trials, with 100 training and 100 test images randomly selected on each run. Here are the overall accuracies.

Model              Accuracy    St. Dev.
SIFT (N = 5000)    0.7267      0.0072
Gist               0.6561      0.0104
Combined           0.7785      0.0077

As you can see, the combined classifier's performance was no fluke. The Gist descriptor alone performed slightly worse than the SIFT bag of words model. By skipping the step of building a vocabulary, the Gist model encodes far less information than the bag of words, so it's not surprising that it did not perform as well. It still holds up well against the baseline, though. The combined classifier takes into account both local and global image information, which may account for why the two classifiers perform better together than they do individually.

Confusion Matrices

(Confusion matrices were shown for each SIFT vocabulary size, N = 10, 20, 50, 100, 200, 400, 1000, and 5000, as well as for the Gist descriptor alone and for the combined Gist and SIFT model with N = 5000.)