Project 3: Scene recognition with bag of words

CS 143: Introduction to Computer Vision

Justin Ardini (jardini)

Overview

The goal of this assignment is to implement a bag of words model for classifying images. This type of model, unlike many others, ignores spatial information from images, instead classifying images entirely based on histograms of word frequencies. In the context of image processing, a word is a local image descriptor such as the commonly-used SIFT descriptor. We will be using a bag of words model to recognize semantic categories of images. The image database used was first introduced in a paper by Lazebnik et al.

Algorithm

I will break down my bag of words algorithm into three steps: building a visual vocabulary, forming histograms of word frequencies, and using a linear SVM to classify images.

Building a Visual Vocabulary

The first part of my bag of words model builds a vocabulary of visual words from a training set of images. To gather these visual words, I first extract features from each training image by computing a dense grid of SIFT descriptors. I then randomly sample 100 of the SIFT descriptors from each image and run all of the sampled descriptors through k-means clustering.

The k-means cluster centers form a visual vocabulary of n words. For my algorithm, I use a vocabulary size of 200.
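For illustration, here is a minimal Python/NumPy sketch of this step (not the code used for the project; names like build_vocabulary and descriptor_sets are placeholders). It assumes the dense SIFT descriptors have already been extracted for each image and uses scikit-learn's k-means for the clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, vocab_size=200, samples_per_image=100, seed=0):
    """Cluster sampled SIFT descriptors into a visual vocabulary.

    descriptor_sets: list of (N_i, 128) arrays, one per training image,
    holding dense SIFT descriptors extracted elsewhere.
    Returns a (vocab_size, 128) array of cluster centers (the visual words).
    """
    rng = np.random.default_rng(seed)
    # Sample a fixed number of descriptors per image to keep k-means tractable.
    sampled = [d[rng.choice(len(d), size=samples_per_image, replace=False)]
               for d in descriptor_sets]
    all_descriptors = np.vstack(sampled)

    # The k-means cluster centers become the vocabulary of visual words.
    kmeans = KMeans(n_clusters=vocab_size, n_init=10).fit(all_descriptors)
    return kmeans.cluster_centers_
```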

Histograms of Word Frequencies

The next step of the algorithm is to run over the training set of images and gather histograms of word frequencies. I use a simple nearest-neighbor approach to determine the vocabulary word most similar to a given descriptor, with the sum of squared differences as the distance metric. By running over all visual words in a training image and finding the nearest word in the vocabulary for each, I obtain a histogram of vocabulary word frequencies.
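A corresponding sketch of the histogram step, again in Python for illustration only: the squared Euclidean distance in cdist is the sum-of-squared-differences metric described above, and np.bincount tallies the nearest-word assignments. Normalizing the histogram is an extra step not mentioned above, included here only as a common choice.

```python
import numpy as np
from scipy.spatial.distance import cdist

def bag_of_words_histogram(descriptors, vocabulary):
    """Build a histogram of visual-word frequencies for one image.

    descriptors: (N, 128) dense SIFT descriptors from the image.
    vocabulary:  (vocab_size, 128) cluster centers from k-means.
    """
    # Squared Euclidean distance == sum of squared differences.
    distances = cdist(descriptors, vocabulary, metric='sqeuclidean')
    nearest_word = distances.argmin(axis=1)

    histogram = np.bincount(nearest_word, minlength=len(vocabulary)).astype(float)
    return histogram / histogram.sum()  # optional normalization
```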

SVM Classification

Once histograms have been computed for the training set, they are used to train linear SVMs; this training provides a set of parameters to apply to the test data. I then obtain histograms for the test images in the same manner as for the training images and pass them to the linear SVMs, which produce a confidence score for each image category. Based on these confidence scores, I build a confusion matrix as a representation of the number of correct classifications.
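As a sketch of this step, the snippet below uses scikit-learn's LinearSVC, which trains one-vs-rest linear SVMs for multi-class data; the regularization constant and function names are illustrative assumptions rather than the settings actually used in the project.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix

def train_and_classify(train_hists, train_labels, test_hists, test_labels):
    """Train linear SVMs on training histograms and classify test histograms."""
    svm = LinearSVC(C=1.0).fit(train_hists, train_labels)  # one-vs-rest by default

    # decision_function gives one confidence score per category;
    # each test image is assigned the category with the largest score.
    scores = svm.decision_function(test_hists)
    predictions = svm.classes_[np.argmax(scores, axis=1)]

    return (accuracy_score(test_labels, predictions),
            confusion_matrix(test_labels, predictions))
```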

Results

The algorithm presented involves a number of parameters which can be modified to obtain different results. My results all use a fixed vocabulary size of 200 and the same linear SVM settings. Furthermore, all results use a fixed set of 100 training images and 100 test images for each of the 15 classes.

For a baseline result, I gather a dense grid of width-4 SIFT descriptors with a step size of 8. For each training image, I take 100 random samples of these descriptors to feed into k-means. These parameters provided an overall accuracy of 0.6173. The accuracy for each class of images and the confusion matrix are given below.

Width-4 SIFT descriptor (per-class accuracy, %)
Suburb: 93, Coast: 79, Forest: 95, Highway: 75, City: 58, Mountain: 83, Country: 42, Street: 48, Building: 72, Office: 90, Bedroom: 43, Industrial: 36, Kitchen: 52, Living Room: 11, Store: 49

These results provide a good starting point, but clearly the algorithm does not do very well classifying living rooms or industrial scenes. In general, the algorithm performed much better with outdoor categories than indoor ones. For intuition as to why this is the case, consider a sample set of images from the forest category and a sample set from the living room category. It is clear that the forest scenes have many common repeated elements, most obviously the tree branches. So, the algorithm can usually find common vocabulary words from these images. The living room scenes, however, have much greater variability, so there are few common words for the algorithm to latch onto.


Forest images (easy) vs. living room images (hard)

I attempted to improve upon the baseline score by using a larger SIFT descriptor. The results for a width-8 SIFT descriptor are promising, with accuracy improving to 0.6573. The detailed results are given below.

Width-8 SIFT descriptor (per-class accuracy, %)
Suburb: 92, Coast: 78, Forest: 94, Highway: 76, City: 59, Mountain: 81, Country: 63, Street: 70, Building: 78, Office: 88, Bedroom: 42, Industrial: 27, Kitchen: 59, Living Room: 31, Store: 48

I continued by once again doubling the SIFT descriptor size, next trying a width of 16. This provided an accuracy of 0.6453. Below are the detailed results. Notice that the variance in accuracy between different classes is lower than it is for smaller descriptors.

Width-16 SIFT descriptor (per-class accuracy, %)
Suburb: 89, Coast: 77, Forest: 87, Highway: 79, City: 46, Mountain: 81, Country: 67, Street: 79, Building: 81, Office: 82, Bedroom: 31, Industrial: 19, Kitchen: 55, Living Room: 44, Store: 51

Since the different-scale SIFT descriptors each capture overlapping but distinct features from the images, I then gathered width 4, 8 and 16 SIFT descriptors and used all of them together. To do this, I implemented a naive modification where descriptors from all scales were combined and run through the standard pipeline; a sketch of this combination appears after the table below. This gave an accuracy of 0.6213 with 50 samples per image from each of the three descriptor sizes. I suspect a more sophisticated combination of different-scale descriptors could do much better than this.

Width 4, 8 and 16 SIFT descriptors (per-class accuracy, %)
Suburb: 84, Coast: 79, Forest: 85, Highway: 79, City: 43, Mountain: 81, Country: 66, Street: 75, Building: 79, Office: 86, Bedroom: 33, Industrial: 19, Kitchen: 48, Living Room: 35, Store: 40
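Here is a small sketch of the naive multi-scale combination described above. Because SIFT descriptors are 128-dimensional regardless of bin size, sampled descriptors from each scale can simply be stacked and fed through the same vocabulary and histogram pipeline as before; the function name and parameters are placeholders.

```python
import numpy as np

def sample_multiscale_descriptors(per_scale_descriptors, samples_per_scale=50, seed=0):
    """Naively combine dense SIFT descriptors from several bin sizes.

    per_scale_descriptors: list of (N_s, 128) arrays for one image, one entry
    per SIFT width (e.g. 4, 8 and 16).
    """
    rng = np.random.default_rng(seed)
    # Sample from each scale, then stack into one descriptor pool for the image.
    sampled = [d[rng.choice(len(d), size=samples_per_scale, replace=False)]
               for d in per_scale_descriptors]
    return np.vstack(sampled)  # (num_scales * samples_per_scale, 128)
```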

Here is a sampling of the images that the algorithm attempted to classify. The higher the confidence score, the more confident the classification. Notice that words that occur in multiple categories of images, like railings, cause problems for the algorithm.

Correct

Label: Suburb, Confidence: -0.0245
Label: Coast, Confidence: -0.3519
Label: Forest, Confidence: 1.1192
Label: Highway, Confidence: -0.7269
Label: Living room, Confidence: -0.3158
Label: Store, Confidence: -0.4008

Incorrect

Label: Living room, Confidence: -0.4343
Label: Highway, Confidence: -0.0275
Label: Open Country, Confidence: -0.4693
Label: Mountain, Confidence: -0.4050
Label: Bedroom, Confidence: -0.5354
Label: Forest, Confidence: 0.3892

There are many ways the algorithm could be improved. Some obvious improvements include adding spatial information, using a non-linear SVM, and using additional feature representations beyond SIFT.