The goal of this assignment is to implement a bag of words model for classifying images.
This type of model, unlike many others, ignores spatial information from images, instead
classifying images entirely based on histograms of word frequencies. In the context of image
processing, a word is a local image descriptor such as the commonly-used SIFT descriptor.
We will be using a bag of words model to recognize semantic categories of images. The image
database used was first introduced in a paper by
Lazebnik et al.
I will break down my bag of words algorithm into three steps: building a visual vocabulary, forming histograms of word frequencies, and using a linear SVM to classify images.
The first part of my bag of words model builds a vocabulary of visual words from a training set of images. To gather these visual words, I first extract features from each training image by computing a dense grid of SIFT descriptors. I randomly sample 100 of the SIFT descriptors from each image and run all of the sampled descriptors through k-means clustering.
The result of k-means clustering is a visual vocabulary of n words. For my algorithm, I use a vocabulary size of 200.
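The sampling-and-clustering step above can be sketched in numpy. This is a minimal illustration, not the original implementation: it assumes the dense SIFT descriptors have already been extracted (one array per training image), and it uses a plain Lloyd's k-means rather than whatever library routine was actually used. The function name and parameters are mine.

```python
import numpy as np

def build_vocabulary(descriptor_sets, samples_per_image=100, vocab_size=200,
                     n_iters=20, seed=0):
    """Cluster randomly sampled descriptors into a visual vocabulary.

    descriptor_sets: list of (n_i, d) arrays, one per training image,
    assumed already extracted (e.g. dense SIFT, d = 128).
    Returns a (vocab_size, d) array of cluster centers (the "words").
    """
    rng = np.random.default_rng(seed)
    sampled = []
    for desc in descriptor_sets:
        # take up to samples_per_image descriptors from this image
        idx = rng.choice(len(desc), size=min(samples_per_image, len(desc)),
                         replace=False)
        sampled.append(desc[idx])
    data = np.vstack(sampled)

    # Plain Lloyd's k-means: init centers from random descriptors,
    # then alternate nearest-center assignment and mean updates.
    centers = data[rng.choice(len(data), size=vocab_size, replace=False)]
    for _ in range(n_iters):
        # squared distance from every descriptor to every center
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for k in range(vocab_size):
            members = data[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```

With 100 samples from each of 1500 training images and a vocabulary of 200, this clusters 150,000 descriptors into 200 centers.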
The next step of the algorithm is to run over the training set of images and gather histograms of word frequencies. I use a simple nearest-neighbor approach to determine the vocabulary word most similar to a given descriptor, with the sum of squared differences as the distance metric. By running over all visual words in a training image and finding the nearest word in the vocabulary, I obtain a histogram of frequencies of vocabulary words.
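The histogram step is a small piece of numpy. This is a sketch under one stated assumption: the histogram is L1-normalized so images with different descriptor counts are comparable (the report does not say whether normalization was used). The function name is mine.

```python
import numpy as np

def word_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest vocabulary word and count.

    descriptors: (n, d) array of visual words from one image.
    vocab: (k, d) array of cluster centers.
    Nearest-neighbor metric is the sum of squared differences (SSD).
    Returns a length-k histogram, normalized to sum to 1.
    """
    # SSD between every descriptor and every vocabulary word
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```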
Once histograms are provided for the training set, the histograms are used to train linear
SVMs. This training provides a set of parameters to use for our test data. I then obtain histograms
for the test data in the same manner as the training data. These histograms are passed to the linear
SVMs to obtain a confidence score for each image category. Based on these confidence scores, I
build a confusion matrix as a representation of the number of correct classifications.
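The final scoring step can be sketched as follows. The SVM training itself is assumed done by a library and is not shown; this illustrates only how per-class confidence scores turn into predictions and then into a confusion matrix. The function name is mine, and the assumption (consistent with a one-vs-all linear SVM setup) is that each image is assigned to the class with the highest score.

```python
import numpy as np

def confusion_matrix(true_labels, confidences):
    """Build a confusion matrix from per-class SVM confidence scores.

    true_labels: length-n sequence of integer class labels.
    confidences: (n_images, n_classes) array, one linear-SVM score per
    category; each image is predicted as the highest-scoring class.
    Entry [i, j] counts images of true class i predicted as class j,
    so the diagonal holds the correct classifications.
    """
    n_classes = confidences.shape[1]
    pred = confidences.argmax(axis=1)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred):
        cm[t, p] += 1
    return cm
```

Overall accuracy is then the trace of the matrix divided by its sum.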
The algorithm presented involves a number of parameters which can be modified to obtain different results. My results all use a fixed vocabulary size of 200 and the same linear SVM. Furthermore, all results use a fixed set of 100 training images and 100 test images for each of the 15 classes.
For a baseline result, I gather a dense grid of width-4 SIFT descriptors with a step size of 8. For each training image, I take 100 random samples of these descriptors to feed into k-means. These parameters provided an overall accuracy of 0.6173. The accuracy for each class of images and the confusion matrix are given below.
Width-4 SIFT descriptor, per-class accuracy (%):
Suburb 93 | Coast 79 | Forest 95 | Highway 75 | City 58 | Mountain 83 | Country 42 | Street 48 | Building 72 | Office 90 | Bedroom 43 | Industrial 36 | Kitchen 52 | Living Room 11 | Store 49
These results provide a good starting point, but clearly the algorithm does not do very well classifying living rooms or industrial scenes. In general, the algorithm performed much better with outdoor categories than indoor ones. For intuition as to why this is the case, consider a sample set of images from the forest category and a sample set from the living room category. It is clear that the forest scenes have many common repeated elements, most obviously the tree branches. So, the algorithm can usually find common vocabulary words from these images. The living room scenes, however, have much greater variability, so there are few common words for the algorithm to latch onto.
I attempted to improve upon the baseline score by using a larger SIFT descriptor. The results for a width-8 SIFT descriptor are promising, providing an accuracy of 0.6573. Here are the details for the width-8 SIFT descriptor.
Width-8 SIFT descriptor, per-class accuracy (%):
Suburb 92 | Coast 78 | Forest 94 | Highway 76 | City 59 | Mountain 81 | Country 63 | Street 70 | Building 78 | Office 88 | Bedroom 42 | Industrial 27 | Kitchen 59 | Living Room 31 | Store 48
I continued by once again doubling the SIFT descriptor size, next trying a width of 16. This provided an accuracy of 0.6453. Below are the detailed results. Notice that the variance in accuracy between different classes is lower than it is for smaller descriptors.
Width-16 SIFT descriptor, per-class accuracy (%):
Suburb 89 | Coast 77 | Forest 87 | Highway 79 | City 46 | Mountain 81 | Country 67 | Street 79 | Building 81 | Office 82 | Bedroom 31 | Industrial 19 | Kitchen 55 | Living Room 44 | Store 51
Since the different-scale SIFT descriptors each captured overlapping but distinct features from the images, I then gathered width-4, 8, and 16 SIFT descriptors and used all of them together. To do this, I implemented a naive modification where the descriptors from all scales were combined and run through the standard pipeline. This gave an accuracy of 0.6213 with 50 samples per image from each of the three descriptors. I suspect a more sophisticated combination of different-scale descriptors could do much better than this.
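The naive combination described above amounts to pooling the per-scale descriptor sets before clustering. A minimal sketch, with a function name and sampling details of my own choosing; it relies on the fact that a SIFT descriptor is 128-dimensional regardless of its spatial width, so descriptors from different scales can be stacked into one array:

```python
import numpy as np

def combine_scales(per_scale_descriptors, samples_per_scale=50, seed=0):
    """Naively pool descriptors from several SIFT scales for one image.

    per_scale_descriptors: list of (n_i, d) arrays, one per scale
    (e.g. widths 4, 8, and 16); all share descriptor dimension d.
    Samples a fixed number from each scale and stacks them, so the
    pooled set feeds the same k-means / histogram pipeline unchanged.
    """
    rng = np.random.default_rng(seed)
    pooled = []
    for desc in per_scale_descriptors:
        idx = rng.choice(len(desc), size=min(samples_per_scale, len(desc)),
                         replace=False)
        pooled.append(desc[idx])
    return np.vstack(pooled)
```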
Width-4, 8, and 16 SIFT descriptors combined, per-class accuracy (%):
Suburb 84 | Coast 79 | Forest 85 | Highway 79 | City 43 | Mountain 81 | Country 66 | Street 75 | Building 79 | Office 86 | Bedroom 33 | Industrial 19 | Kitchen 48 | Living Room 35 | Store 40
Here is a sampling of some of the images that the algorithm attempted to classify. The higher the confidence score, the more confident the classification. Notice that words that occur in multiple categories of images, like railings, cause problems for the algorithm.
Label: Suburb, Confidence: -0.0245
Label: Coast, Confidence: -0.3519
Label: Forest, Confidence: 1.1192
Label: Highway, Confidence: -0.7269
Label: Living Room, Confidence: -0.3158
Label: Store, Confidence: -0.4008
Label: Living Room, Confidence: -0.4343
Label: Highway, Confidence: -0.0275
Label: Open Country, Confidence: -0.4693
Label: Mountain, Confidence: -0.4050
Label: Bedroom, Confidence: -0.5354
Label: Forest, Confidence: 0.3892
There are many ways the algorithm can be improved. Some obvious improvements include adding spatial information, using a non-linear SVM, and using additional feature representations beyond SIFT.