Project 3: Scene Recognition with Bag of Words
Tala Huhe (thuhe)
Overview
Scene recognition is a basic computer vision task. The goal of scene recognition is to determine which category an input image falls into. This is a task that humans are fundamentally very good at, but computers often struggle with, in part because there are many possible ways to represent an image.
One such representation is the bag of words model. Inspired by the natural language processing literature, this model discards all spatial information in an image. Instead, the image is treated as a set (i.e. a bag) of features (the words in our model).
In this project, we implement a scene categorizer using the bag of words representation for our images. Since this algorithm does not account for spatial relationships between our features, we are bound to miscategorize some scenes. In practice, it turns out that this algorithm performs well enough for our purposes.
Algorithm
Below is a rough outline of our algorithm; a short end-to-end code sketch follows the list.
- We begin by extracting features from our training images. These features will help us build a vocabulary of words, which we will use to classify images.
- We then cluster all of the features we found from all of the images. We use the center of each cluster as a different word in our image vocabulary.
- We take the features that we extracted from each image, and categorize each feature as the word it is closest to. By doing this, we have constructed our bag of words representation -- each image is simply a collection of words in our vocabulary.
- We take our bag of words representation and construct a histogram detailing which words each image contains. We will use this to represent the image as a whole during testing time.
- For each class of images, we build a one-vs-all classifier, trained on the histogram representations of our training images.
- Finally, at testing time, we classify an image by running all of our classifiers and choosing the class whose classifier reports the highest confidence.
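As a loose illustration of how these steps fit together, here is an end-to-end sketch. The helper functions used here are illustrative names, sketched in the sections below, rather than the project's actual code:

```python
import numpy as np

def run_pipeline(train_paths, train_labels, test_paths, categories,
                 vocab_size=200):
    # Step 1: extract dense SIFT features from every training image.
    train_feats = [dense_sift(p) for p in train_paths]
    # Step 2: cluster all features into a visual vocabulary.
    vocab = build_vocabulary(train_feats, vocab_size)
    # Steps 3-4: represent each training image as a histogram of visual words.
    train_hists = np.array([image_histogram(assign_words(f, vocab), vocab_size)
                            for f in train_feats])
    # Step 5: train one classifier per scene category.
    classifiers = train_classifiers(train_hists, train_labels, categories)
    # Step 6: classify each test image using all of the trained classifiers.
    return [predict(image_histogram(assign_words(dense_sift(p), vocab),
                                    vocab_size), classifiers)
            for p in test_paths]
```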
The following image, taken from Chatfield et al., is a rough description of our pipeline:

Feature Extraction
We use SIFT vectors to represent features in our images. They are constructed by computing histograms of gradient orientations over a local patch, and can be extracted at multiple scales.
Some nice properties of these vectors are invariance to scale and orientation, and partial invariance to affine transforms and brightness shifts. Because of these properties, they are a fairly standard way of representing features in images. We extract SIFT vectors densely, at regular intervals, across each image.
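The extraction code itself is not reproduced here, but as a rough sketch, dense extraction could be done by placing SIFT keypoints on a fixed grid, e.g. with OpenCV. The `step` and `bin_size` values below are illustrative defaults, not the parameters used in this project:

```python
import cv2

def dense_sift(image_path, step=8, bin_size=8):
    """Extract SIFT descriptors on a regular grid rather than at detected keypoints."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    # Place a keypoint every `step` pixels; `bin_size` controls the patch scale.
    keypoints = [cv2.KeyPoint(float(x), float(y), bin_size)
                 for y in range(0, gray.shape[0], step)
                 for x in range(0, gray.shape[1], step)]
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors  # one 128-dimensional vector per grid point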

Visual Vocabulary
We cluster our SIFT vectors using k-means, where k is the desired size of our vocabulary. This will effectively separate our domain of features into several discrete classes. We consider each of these classes to be a separate vocabulary word.
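A minimal sketch of this step, assuming the descriptors from all training images are stacked into a single array (the SciPy-based implementation and the vocabulary size are assumptions for illustration); `assign_words` performs the nearest-word assignment described below:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def build_vocabulary(all_descriptors, vocab_size=200):
    """Cluster descriptors from all training images; each center is one visual word."""
    data = np.vstack(all_descriptors).astype(np.float32)
    vocab, _ = kmeans(data, vocab_size)
    return vocab  # vocab_size x 128 array of cluster centers

def assign_words(descriptors, vocab):
    """Map each descriptor to the index of its nearest visual word."""
    words, _ = vq(descriptors.astype(np.float32), vocab)
    return words
```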
In order to determine which words occur in an image, we first extract feature vectors from each image. We assume that each feature is an occurrence of the word that is most similar to it. This makes vocabulary words rather inconsistent, as every random texture in the image is regarded as meaningful to some degree. Below are 100 instances of one vocabulary word. It seems to correspond to vertical edges.

Classification
We generate a histogram for each image containing the number of occurrences of each visual word. We feed these histograms into an SVM for classification. At testing time, we test an image using all of our classifiers and use the classifier which reports the highest confidence.
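As a sketch of this step (scikit-learn's LinearSVC, the C value, and the histogram normalization are assumptions, not necessarily the settings used here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def image_histogram(words, vocab_size):
    """Normalized count of how often each visual word occurs in one image."""
    hist, _ = np.histogram(words, bins=np.arange(vocab_size + 1))
    return hist / max(hist.sum(), 1)

def train_classifiers(train_hists, train_labels, categories):
    """Train one linear SVM per scene category (one-vs-all)."""
    classifiers = {}
    for cat in categories:
        binary_labels = [1 if lbl == cat else 0 for lbl in train_labels]
        classifiers[cat] = LinearSVC(C=1.0).fit(train_hists, binary_labels)
    return classifiers

def predict(hist, classifiers):
    """Return the category whose SVM reports the highest confidence."""
    scores = {cat: clf.decision_function([hist])[0]
              for cat, clf in classifiers.items()}
    return max(scores, key=scores.get)
```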
To get an idea of word distribution, below are some histograms for our images. In each graph, the X axis represents different visual words that occur. The Y axis represents the relative frequency of each visual word within the image.

Results
This method performed fairly well. Below are some test results:
We can statistically judge our results by building a confusion matrix for our classifications. The rows of this matrix represent ground-truth classes, and the columns represent the classes predicted by our classifiers. Thus, the ideal confusion matrix is an NxN identity matrix, where N is the number of classes we want to categorize into. Below is our confusion matrix:

Quantitatively, the average of the values on our diagonal is about 0.6273. This implies we have about a 62.73% accuracy on our classifications.
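For reference, a row-normalized confusion matrix and the diagonal average could be computed along these lines (a sketch, not the project's actual evaluation code):

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix; rows are ground truth, columns are predictions."""
    index = {cat: i for i, cat in enumerate(categories)}
    cm = np.zeros((len(categories), len(categories)))
    for t, p in zip(true_labels, predicted_labels):
        cm[index[t], index[p]] += 1
    cm /= cm.sum(axis=1, keepdims=True)
    accuracy = np.mean(np.diag(cm))  # average of the diagonal entries
    return cm, accuracy
```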