CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this project was to implement a 'visual bag of words', which functions as a vocabulary of visual features, and pair it with a linear SVM to classify images. Classification was at the level of scenes rather than specific objects: the training and testing images ranged from forests to cities to kitchens.

Algorithm Description

Before the algorithm is described in any detail, it must be noted that in addition to the bag of SIFT features paired with a linear SVM, 'tiny image features' and a k-nearest-neighbors classifier were also implemented. All results are discussed in the 'Results' section.

The algorithm works as follows:

Part 1: building the visual vocabulary.

1.1
First, a visual vocabulary must be built from the training data. Each vocab word will ultimately represent a 'feature'. One intuitive way to think about this is that a vocabulary word can be a SIFT descriptor of a coconut. If an image has many coconuts, it is very likely a beach.
1.2
In order to build a vocabulary, SIFT descriptors are sampled from the training images (whose categories are already known). The descriptors are then clustered with k-means; each cluster center represents a vocab word. The number of training images to sample was a free parameter; I found that sampling half of the training images given by image_paths increased speed enough to be worth the accuracy cost.
1.3
This was done in the file build_vocabulary.m, which had many free parameters. The SIFT descriptors were found with vl_dsift, and vl_kmeans was used to find their cluster centers.
1.4
Through experimentation, I found that passing a 'size' parameter of 16 to vl_dsift was too big; a 'size' of 8 returned much more accurate results. My 'step' parameter was 8, though values between 4 and 10 were all suitable; I chose 8 as a tradeoff between speed and accuracy. The 'fast' parameter was left in because excluding it improved accuracy by only about 0.5%, while the time taken to build the vocabulary matrix was much greater.
1.5
Vocabulary size, an argument passed to build_vocabulary, was an extremely important parameter. Accuracy improved by approximately 5-8% when I changed the vocabulary size from 400 to 800. Of course, this greatly increased the time it takes to build vocab.mat: depending on the computer, it can take up to almost an hour, with over two thirds of that time spent in vl_kmeans.
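The vocabulary-building steps above can be sketched as follows. The original uses vl_dsift and vl_kmeans in MATLAB; this is a minimal pure-Python k-means stand-in, with toy 2-D points playing the role of 128-D SIFT descriptors, so it illustrates the idea rather than the actual implementation.

```python
import random

def build_vocabulary(descriptors, vocab_size, iters=20, seed=0):
    """Cluster descriptors with k-means; the cluster centers are the vocab words.
    Toy stand-in for vl_kmeans run on vl_dsift output."""
    rng = random.Random(seed)
    centers = rng.sample(descriptors, vocab_size)  # initialize centers from the data
    for _ in range(iters):
        # assignment step: group each descriptor with its nearest center
        clusters = [[] for _ in range(vocab_size)]
        for d in descriptors:
            nearest = min(range(vocab_size),
                          key=lambda k: sum((a - b) ** 2
                                            for a, b in zip(d, centers[k])))
            clusters[nearest].append(d)
        # update step: move each center to the mean of its assigned descriptors
        for k, members in enumerate(clusters):
            if members:
                centers[k] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers

# two well-separated blobs of fake descriptors -> two vocab words
fake_descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocab = build_vocabulary(fake_descriptors, vocab_size=2)
```

The cost driver noted above is visible here: the assignment step compares every sampled descriptor against every center on every iteration, which is why doubling the vocabulary size roughly doubles the time spent in vl_kmeans.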

Part 2: building the bag of SIFT features.

2.1
Now that we have our vocabulary words, we can compare the test images (the images to be classified) against them. First, SIFT features are extracted from the test images with the same function, vl_dsift, and the same 'size' parameter (8), but with a smaller 'step' (6). The 'fast' option was also included to lower the overall computation time.
2.2
The function vl_dsift returned a huge number of descriptors for each image. Each descriptor was compared to the vocabulary, and the closest vocab word was found (using min and vl_alldist2). A histogram over the vocabulary was then built, incrementing a bin each time its vocab word was the closest match. Each image got its own histogram, which was then normalized to prevent large images (with many dSIFT features) from having much larger histograms than small images (with few dSIFT features). The normalized histogram is the image description vector that is passed to the classifier.
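Step 2.2 amounts to a nearest-neighbor lookup against the vocabulary followed by normalization. A pure-Python sketch of the same idea (the actual pipeline does this in MATLAB with vl_alldist2 and min):

```python
def bag_of_words_histogram(descriptors, vocab):
    """Build a normalized histogram of nearest vocab words for one image."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        # find the vocab word closest to this descriptor (squared L2 distance)
        nearest = min(range(len(vocab)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(d, vocab[k])))
        hist[nearest] += 1.0
    total = sum(hist)
    # normalize so images with many descriptors don't dwarf images with few
    return [h / total for h in hist] if total else hist
```

For example, with a two-word vocabulary at (0, 0) and (10, 10), an image whose descriptors are (0, 1), (1, 0), and (9, 10) yields the histogram [2/3, 1/3].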

Part 3: the linear SVM classifier.

3.1
The first step in building the classifier is to actually make the SVMs: 15 linear SVMs were created using vl_svmtrain, each trained as a one-vs-all classifier for a specific category. For example, the 'kitchen' SVM was trained to recognize images as either 'kitchen' or 'not kitchen'. The only free parameter in svm_classify.m is the LAMBDA value that vl_svmtrain takes as input. With a lambda of 0.01, accuracy ranged from 0.49 to 0.66 with an average of 0.57; when I switched to 0.001, the average accuracy jumped to 0.66, the best I found.
3.2
Now all that is left is to apply the classifiers to the test images and decide which image is which scene: each of the 15 SVMs is applied to each test image, and the SVM with the highest response determines that image's predicted category. So if a test image has a response of 0.2 to the 'kitchen' SVM but 0.8 to the 'mountain' SVM, the image is classified as a 'mountain' scene. Once every test image is classified, the scene recognition pipeline is complete.
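Putting part 3 together, here is a hedged pure-Python sketch of one-vs-all training and the argmax decision rule described above. The real pipeline uses vl_svmtrain; a plain hinge-loss subgradient trainer stands in here, with lam playing the role of the LAMBDA parameter, and the toy two-feature histograms are invented for illustration.

```python
def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=200):
    """Train one one-vs-all linear SVM by hinge-loss subgradient descent.
    X: feature histograms; y: +1 (in category) or -1 (not in category);
    lam: L2 regularization weight (the role LAMBDA plays for vl_svmtrain)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1 - lr * lam) * wj for wj in w]  # weight decay from the L2 term
            if margin < 1:  # hinge loss active: step toward the violated example
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def classify(hist, svms):
    """Apply every one-vs-all SVM; the category with the highest response wins."""
    return max(svms, key=lambda c: sum(wj * xj
                                       for wj, xj in zip(svms[c][0], hist))
                                   + svms[c][1])

# toy setup: 'kitchen' images load on feature 1, 'mountain' images on feature 2
hists = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
svms = {
    'kitchen':  train_linear_svm(hists, [+1, +1, -1, -1]),
    'mountain': train_linear_svm(hists, [-1, -1, +1, +1]),
}
```

Note how lam trades margin against fitting the training histograms; in the real pipeline this is the knob that moved average accuracy from 0.57 to 0.66.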


Results Visualization and Discussion

Tiny image features and nearest neighbor (k = 1) classifier:


Accuracy of the tiny image feature representation with the nearest neighbor classifier (k = 1) was 0.225. Experimentation with different k values yielded little improvement, and in many cases a significant decline in accuracy (sometimes down to ~19%). Pairing tiny image features with the linear SVM classifier also yielded poor results, with accuracy around 19%.

Bag of SIFT features and nearest neighbor (k = 11) classifier:


Accuracy of the bag of SIFT feature representation with the nearest neighbor classifier (k = 11) was 0.563. Experimentation with different k values yielded accuracies from ~0.51 to 0.563, though it is possible some k value would push the accuracy even higher. However, the best k is heavily dependent on the dataset, so it will not likely transfer to different tests.
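For reference, the k-nearest-neighbors classifier used in these runs can be sketched in a few lines. This is a pure-Python illustration (the actual implementation was in MATLAB), and the example histograms and labels below are invented:

```python
from collections import Counter

def knn_classify(hist, train_hists, train_labels, k=11):
    """Predict the label of `hist` by majority vote among its k nearest
    training histograms (squared L2 distance)."""
    by_distance = sorted(range(len(train_hists)),
                         key=lambda i: sum((a - b) ** 2
                                           for a, b in zip(hist, train_hists[i])))
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]
```

The sensitivity to k discussed above comes directly from the vote: a larger k smooths over noisy neighbors but can let a dense nearby category outvote the true one.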

Bag of SIFT features and linear SVM classifier:


Accuracy of the bag of SIFT feature representation with the linear SVM classifier varied from 0.64 to 0.68; the run reported here was one of the better tests, with an accuracy of 0.671. In total, the pipeline takes about 1-1.25 hours to run from start to finish: building the vocabulary matrix takes between 40 minutes and an hour (with ~2/3rds of that time spent in vl_kmeans), and building the bag of SIFT feature representation takes 10-20 minutes, depending on the computer. Tiny image features, nearest neighbor classification, and linear SVM classification each take only a few seconds.

Category name   Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen         0.610      LivingRoom, InsideCity         Industrial, Suburb
Store           0.500      TallBuilding, Industrial       Forest, Highway
Bedroom         0.490      Store, TallBuilding            Kitchen, Office
LivingRoom      0.210      Industrial, Bedroom            Street, Kitchen
Office          0.870      Bedroom, Kitchen               Bedroom, Kitchen
Industrial      0.340      InsideCity, InsideCity         Kitchen, TallBuilding
Suburb          0.970      Store, Industrial              LivingRoom, LivingRoom
InsideCity      0.620      Industrial, Store              Store, Kitchen
TallBuilding    0.790      Industrial, Store              Store, Street
Street          0.790      Store, LivingRoom              Highway, InsideCity
Highway         0.830      Street, Industrial             InsideCity, InsideCity
OpenCountry     0.480      Coast, Coast                   Forest, Coast
Coast           0.820      OpenCountry, Industrial        OpenCountry, Mountain
Mountain        0.800      Kitchen, TallBuilding          Forest, OpenCountry
Forest          0.940      OpenCountry, Mountain          Mountain, OpenCountry

(The original table also showed sample training images, sample true positives, and example images for each false positive and false negative; only the category labels are reproduced here.)

Overall, the most important parameters were the vocabulary size and the lambda value in the SVM classifier. The results are good in the sense that the false positives are not drastically different from their correct categories. For example, a 'bedroom' scene classified as a 'living room' is a reasonable mistake, as the two scenes share many similar features; a 'street' being classified as a 'highway' is also reasonable. It would be worse if a 'mountain' were classified as a 'living room', as those scenes share very few features. Thus, this pipeline's accuracy is justified, and does not really present any unexpected results.