Project 3: Scene Recognition with Bag of Words

Hobart Reynolds (hcreynol)

October 24, 2011


In this project, we attempt to classify images as belonging to certain types of scenes (such as a street scene or a kitchen scene). We do this using a bag of words model: features are extracted from each scene and treated as independent of one another. Using a training set, we train a classifier on these features; we then evaluate the classifier by extracting features from a testing set and classifying them.

Algorithm

Step 1: Using a training set, build a vocabulary (a set of representative features that summarize the features extracted from the various images). This is done by:

Step 1a: Get a set of features for each image. I used dense SIFT descriptors, which are a large number of 128-element vectors computed at different locations in the image. From each image's SIFT descriptors, I randomly sampled a subset, pooling the samples into one large collection of descriptors over the entire training set.
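
As a rough illustration, here is a minimal Python sketch of this step. It uses OpenCV's SIFT computed on a regular grid; the library choice, grid spacing, and descriptor size are illustrative assumptions, not necessarily what this project used.

    import cv2
    import numpy as np

    def dense_sift(gray_img, step=8, size=8.0):
        # Build a regular grid of keypoints and compute a 128-element
        # SIFT descriptor at each one (grid spacing is an assumption).
        h, w = gray_img.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), size)
                     for y in range(step, h - step, step)
                     for x in range(step, w - step, step)]
        sift = cv2.SIFT_create()
        _, descriptors = sift.compute(gray_img, keypoints)
        return descriptors                      # shape: (num_keypoints, 128)

    def sample_descriptors(image_paths, per_image=100, seed=0):
        # Randomly sample a fixed number of descriptors from each training
        # image and pool them into one large collection.
        rng = np.random.default_rng(seed)
        pooled = []
        for path in image_paths:
            desc = dense_sift(cv2.imread(path, cv2.IMREAD_GRAYSCALE))
            idx = rng.choice(len(desc), size=min(per_image, len(desc)),
                             replace=False)
            pooled.append(desc[idx])
        return np.vstack(pooled)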

Step 1b: Use a clustering algorithm to get the vocabulary words. I used the k-means algorithm, where k is the vocabulary size. The algorithm clusters the features into k different clusters, and the cluster centers are then treated as vocabulary words. Note that the vocabulary words were not necessarily SIFT descriptors extracted from the images; they were merely 128-element vectors at the centers of clusters of SIFT descriptors extracted from the images.
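
A sketch of the clustering step, assuming scikit-learn's k-means (any k-means implementation would do; this was not necessarily the one used here):

    from sklearn.cluster import KMeans

    def build_vocabulary(pooled_descriptors, vocab_size=200):
        # Cluster the pooled SIFT descriptors; the k cluster centers
        # become the vocabulary words. Each center is a 128-element
        # vector that need not equal any actual descriptor.
        kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
        kmeans.fit(pooled_descriptors)
        return kmeans.cluster_centers_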

Step 2: For each image, the features (SIFT descriptors) are extracted once again. A histogram of word frequencies is then constructed: for each feature, the distance to each vocabulary word is computed, and the histogram bin of the closest vocabulary word is incremented. The histogram therefore reflects the distribution of vocabulary words in each image. Note that all features, not just the ones that were sampled in Step 1a, are assigned to clusters and counted in the histogram.
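
A sketch of the histogram construction (the nearest-word assignment via scipy is an assumption, as is normalizing the counts, which keeps histograms comparable across images with different numbers of descriptors):

    import numpy as np
    from scipy.spatial.distance import cdist

    def word_histogram(descriptors, vocabulary):
        # Assign every descriptor to its nearest vocabulary word and
        # count the assignments into one bin per word.
        nearest = np.argmin(cdist(descriptors, vocabulary), axis=1)
        hist = np.bincount(nearest, minlength=len(vocabulary)).astype(float)
        return hist / hist.sum()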

Step 3: A classifier is trained on the histograms and scene labels of the training data. In this project, I used a set of SVMs (support vector machines). Because SVMs are inherently two-class classifiers, a one-vs-all classifier was created for each of the scene categories (15 different SVMs were created, each of which reported the confidence that a scene image belonged to its category).
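
A sketch of the one-vs-all training, assuming scikit-learn SVMs (the linear kernel and C value are illustrative assumptions, not necessarily what was used here):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_one_vs_all(histograms, labels, categories):
        # Train one binary SVM per scene category: positive examples are
        # images of that category, negatives are everything else.
        classifiers = {}
        for category in categories:
            y = (np.asarray(labels) == category).astype(int)
            classifiers[category] = LinearSVC(C=1.0).fit(histograms, y)
        return classifiers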

Step 4: The classifier was evaluated on the testing set. For each test image, the SIFT descriptors were extracted and a histogram was built (as was done for the training images in Step 2); the image was then assigned to the scene category whose SVM from Step 3 gave the highest confidence. The accuracy of the classifier was determined by comparing the assigned category to the scene category to which the test image actually belonged.
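
A sketch of the classification step, using each SVM's signed distance to its separating hyperplane as the confidence score:

    import numpy as np

    def classify(test_histograms, classifiers):
        # One confidence column per category, one row per test image;
        # each image goes to the category with the highest confidence.
        categories = list(classifiers)
        scores = np.column_stack(
            [classifiers[c].decision_function(test_histograms)
             for c in categories])
        return [categories[i] for i in np.argmax(scores, axis=1)]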

The evaluation of the algorithm can be seen in terms of the following confusion matrix: a matrix of the scene assignments of the test images, where the rows indicate the actual scene labels and the columns indicate the classifier's labels. So, the number in row i, column j indicates how many images from the ith scene category were assigned to the jth scene category by the classifier. (If i = j, then the image was assigned to the correct category.) Below is a graphic of a confusion matrix (using 100 descriptors per image and 200 vocabulary words) and the numerical output of that matrix. In the graphic, red indicates high numbers, blue indicates low numbers, and green indicates numbers in between.

    96     1     0     0     1     0     0     0     0     0     1     0     0     1     0
     3    82     0     3     1     3     8     0     0     0     0     0     0     0     0
     0     0    95     0     0     3     0     2     0     0     0     0     0     0     0
     0    11     0    77     2     4     3     1     0     0     1     1     0     0     0
     4     4     1     1    60     0     0     5     8     0     0     0     7     2     8
     7     1     3     2     0    80     4     2     0     0     1     0     0     0     0
    10    27     8     7     0     8    38     2     0     0     0     0     0     0     0
     1     0     0     9    22     3     0    52     3     0     0     1     0     2     7
     0     3     1     0     0     5     0     7    76     2     1     2     2     0     1
     0     0     0     0     1     0     0     0     0    87     2     0    10     0     0
     2     0     0     0     2     2     2     0     2    13    44     3    20     7     3
     5     2     0     9     8     5     9     4     9     2     2    29     3     3    10
     0     0     0     0     8     1     0     1     0    20     8     3    51     2     6
     2     0     0     2     3     2     0     1     4    23    23     3    20    10     7
     2     0     3     3    17     7     1     3     5     2     4     0     3     2    48
    

For this matrix, 100 test images of each of the 15 categories were classified. For example, this classifier was best at recognizing suburb (96% accurate) and inside city (95% accurate) scenes. It was worst at recognizing industrial scenes (only 10% accurate: 23% of the industrial test images were incorrectly classified as highway scenes, another 23% as mountain scenes, and so on). For the total accuracy, sum the main diagonal (the correctly classified images) and divide by the number of test images: 925 correctly classified images out of 1500 test images gives 61.67% total accuracy.
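
For reference, a small sketch of how the confusion matrix and total accuracy can be computed from the actual and assigned labels:

    import numpy as np

    def confusion_matrix(actual, predicted, categories):
        # Row = actual scene label, column = label assigned by the classifier.
        index = {c: i for i, c in enumerate(categories)}
        conf = np.zeros((len(categories), len(categories)), dtype=int)
        for a, p in zip(actual, predicted):
            conf[index[a], index[p]] += 1
        return conf

    def total_accuracy(conf):
        # Diagonal entries are the correctly classified images;
        # here, 925 / 1500 = 61.67%.
        return np.trace(conf) / conf.sum()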


Parameters

The parameters to be tuned for this algorithm were the number of training images, the number of descriptors to be sampled from each training image, and the size of the vocabulary. I did not vary the number of training images, instead using the maximum from each category (the 15 categories each had 100 images, so 1500 training images were used), because the greatest variety of training images per category would probably give the best performance. I varied the number of descriptors randomly sampled from each image, trying 50 (75000 total descriptors), 75 (112500 total), 100 (150000 total), and 400 (600000 total). I used vocabulary sizes of 10, 20, 50, 100, 200, 400, and 1000. See the Results section for an evaluation of these parameters.


Results

Below are the results of running the algorithm on various parameters. Note that due to randomly sampling from the descriptors and random initializations of the k-means algorithm, the accuracy of the classifiers may change slightly if run repeatedly.

Number of Descriptors           Vocabulary Size    Accuracy
50 per image (75000 total)      200                61.13%
75 per image (112500 total)     200                61.53%
100 per image (150000 total)    200                61.67%
400 per image (600000 total)    200                62.07%
100 per image (150000 total)    10                 44.80%
100 per image (150000 total)    20                 51.33%
100 per image (150000 total)    50                 59.07%
100 per image (150000 total)    100                61.27%
100 per image (150000 total)    400                62.00%
100 per image (150000 total)    1000               60.40%

As the number of descriptors sampled from each image increases, the accuracy appears to increase slightly and steadily. This is to be expected because building the vocabulary uses clustering: the more samples there are, the better they represent the true data, and so the more accurately the cluster centers reflect the actual distribution of descriptors, which should improve the performance of the algorithm.

As vocabulary size increases, the accuracy increases dramatically at first: with 10 vocabulary words the accuracy was 44.80%, but it rose to 62.00% with 400 vocabulary words. However, as vocabulary size continues to increase past some point, performance appears to gradually decrease, with 1000 vocabulary words yielding an accuracy of only 60.40%. This is probably because if the vocabulary is too large, natural clusters become subdivided into multiple vocabulary words. Descriptors that should be treated as the same vocabulary word are then treated as different ones, so images with similar descriptors can have drastically different histograms, which may confuse the classifier and cause a decrease in accuracy.

Some of the confusion matrices from the above evaluations are below:

100 descriptors per image, 10 vocabulary words:

100 descriptors per image, 50 vocabulary words:

100 descriptors per image, 400 vocabulary words:

These matrices display the increase in accuracy as the number of vocabulary words increases. While the classifier with 10 vocabulary words has difficulty recognizing suburb and bedroom scenes, the other classifiers do very well at recognizing them. Interestingly, all of the classifiers perform very well on the inside city and highway scenes, and all of them perform very poorly on the industrial scenes, repeatedly misidentifying them as highway, mountain, and office scenes. This probably indicates that the inside city and highway scenes have very distinct representations, which make them easily recognizable, while the industrial scenes have very ambiguous features, making it difficult to recognize them with high confidence.