CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this project was to implement a 'visual bag of words', which functions as a vocabulary of visual features, and pair it with a linear SVM to classify images. Classification was at the level of scenes rather than specific objects: the training and testing images ranged from forests to cities to kitchens.

Algorithm Description

Before the algorithm is described in any detail, it must be noted that in addition to the bag of SIFT features paired with a linear SVM, 'tiny image features' and a k-nearest-neighbors classifier were also implemented. All results are discussed in the 'Results' section.

The algorithm works as follows:

Part 1: building the visual vocabulary.

1.1
First, a visual vocabulary must be built from the training data. Each vocab word will ultimately represent a 'feature'. One intuitive way to think about this is that a vocabulary word can be a SIFT descriptor of a coconut. If an image has many coconuts, it is very likely a beach.
1.2
In order to build a vocabulary, SIFT descriptors are sampled from the training images (whose categories are already known). The descriptors are then clustered with k-means; each cluster center represents a vocab word. The number of training images to sample was a free parameter; I found that sampling half of the training images given by image_paths increased speed enough to be worth the accuracy cost.
1.3
This was done in the file build_vocabulary.m, which had many free parameters. The SIFT descriptors were found with vl_dsift, and vl_kmeans was used to find their cluster centers.
1.4
Through experimentation, I found that passing a 'size' parameter of 16 to vl_dsift was too big; a 'size' of 8 returned much more accurate results. My 'step' parameter was 8, though values between 4 and 10 were all suitable; I chose 8 as a tradeoff between speed and accuracy. The 'fast' parameter was left in because excluding it improved accuracy by only about 0.5%, while the time taken to build the vocabulary matrix was much greater.
1.5
Vocabulary size, an argument passed to build_vocabulary, was an extremely important parameter. Accuracy improved by approximately 5-8% when I changed the vocabulary size from 400 to 800. Of course, this greatly increased the time it takes to build vocab.mat: depending on the computer, it can take up to almost an hour, with over two thirds of that time spent in vl_kmeans.
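The vocabulary-building steps above can be sketched as follows. The original uses vl_dsift and vl_kmeans in MATLAB; this is a minimal pure-Python k-means stand-in, with toy 2-D points playing the role of 128-D SIFT descriptors, so it illustrates the idea rather than the actual implementation.

```python
import random

def build_vocabulary(descriptors, vocab_size, iters=20, seed=0):
    """Cluster descriptors with k-means; the cluster centers are the vocab words.
    Toy stand-in for vl_kmeans run on vl_dsift output."""
    rng = random.Random(seed)
    centers = rng.sample(descriptors, vocab_size)  # initialize centers from the data
    for _ in range(iters):
        # assignment step: group each descriptor with its nearest center
        clusters = [[] for _ in range(vocab_size)]
        for d in descriptors:
            nearest = min(range(vocab_size),
                          key=lambda k: sum((a - b) ** 2
                                            for a, b in zip(d, centers[k])))
            clusters[nearest].append(d)
        # update step: move each center to the mean of its assigned descriptors
        for k, members in enumerate(clusters):
            if members:
                centers[k] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers

# two well-separated blobs of fake descriptors -> two vocab words
fake_descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocab = build_vocabulary(fake_descriptors, vocab_size=2)
```

The cost driver noted above is visible here: the assignment step compares every sampled descriptor against every center on every iteration, which is why doubling the vocabulary size roughly doubles the time spent in vl_kmeans.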

Part 2: building the bag of SIFT features.

2.1
Now that we have our vocabulary words, we can compare the test images (the images to be classified) against them. First, SIFT features are extracted from the test images with the same function, vl_dsift, and the same 'size' parameter (8), but with a smaller 'step' (6). The 'fast' option was also included to lower the overall computation time.
2.2
The function vl_dsift returned a huge number of descriptors for each image. Each descriptor was compared to the vocabulary, and the closest vocab word was found (using min and vl_alldist2). A histogram over the vocabulary was then built, incrementing a bin each time its vocab word was the closest match. Each image got its own histogram, which was then normalized to prevent large images (with many dSIFT features) from having much larger histograms than small images (with few dSIFT features). The normalized histogram is the image description vector that is passed to the classifier.
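Step 2.2 amounts to a nearest-neighbor lookup against the vocabulary followed by normalization. A pure-Python sketch of the same idea (the actual pipeline does this in MATLAB with vl_alldist2 and min):

```python
def bag_of_words_histogram(descriptors, vocab):
    """Build a normalized histogram of nearest vocab words for one image."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        # find the vocab word closest to this descriptor (squared L2 distance)
        nearest = min(range(len(vocab)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(d, vocab[k])))
        hist[nearest] += 1.0
    total = sum(hist)
    # normalize so images with many descriptors don't dwarf images with few
    return [h / total for h in hist] if total else hist
```

For example, with a two-word vocabulary at (0, 0) and (10, 10), an image whose descriptors are (0, 1), (1, 0), and (9, 10) yields the histogram [2/3, 1/3].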

Part 3: the linear SVM classifier.

3.1
The first step in building the classifier is to actually make the SVMs: 15 linear SVMs were created using vl_svmtrain, each trained as a one-vs-all classifier for a specific category. For example, the 'kitchen' SVM was trained to recognize images as either 'kitchen' or 'not kitchen'. The only free parameter in svm_classify.m is the LAMBDA value that vl_svmtrain takes as input. With a lambda of 0.01, accuracy ranged from 0.49 to 0.66 with an average of 0.57; when I switched to 0.001, the average accuracy jumped to 0.66, the best I found.
3.2
Now all that is left is to apply the classifiers to the test images and decide which image is which scene: each of the 15 SVMs is applied to each test image, and the SVM with the highest response determines that image's predicted category. So if a test image has a response of 0.2 to the 'kitchen' SVM but 0.8 to the 'mountain' SVM, the image is classified as a 'mountain' scene. Once every test image is classified, the scene recognition pipeline is complete.
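Putting part 3 together, here is a hedged pure-Python sketch of one-vs-all training and the argmax decision rule described above. The real pipeline uses vl_svmtrain; a plain hinge-loss subgradient trainer stands in here, with lam playing the role of the LAMBDA parameter, and the toy two-feature histograms are invented for illustration.

```python
def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=200):
    """Train one one-vs-all linear SVM by hinge-loss subgradient descent.
    X: feature histograms; y: +1 (in category) or -1 (not in category);
    lam: L2 regularization weight (the role LAMBDA plays for vl_svmtrain)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1 - lr * lam) * wj for wj in w]  # weight decay from the L2 term
            if margin < 1:  # hinge loss active: step toward the violated example
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def classify(hist, svms):
    """Apply every one-vs-all SVM; the category with the highest response wins."""
    return max(svms, key=lambda c: sum(wj * xj
                                       for wj, xj in zip(svms[c][0], hist))
                                   + svms[c][1])

# toy setup: 'kitchen' images load on feature 1, 'mountain' images on feature 2
hists = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
svms = {
    'kitchen':  train_linear_svm(hists, [+1, +1, -1, -1]),
    'mountain': train_linear_svm(hists, [-1, -1, +1, +1]),
}
```

Note how lam trades margin against fitting the training histograms; in the real pipeline this is the knob that moved average accuracy from 0.57 to 0.66.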


Results Visualization and Discussion

Tiny image features and nearest neighbor (k = 1) classifier:


Accuracy of the tiny image feature representation with the nearest neighbor classifier (k = 1) was 0.225. Experimentation with different k values yielded little improvement, and in many cases a significant decline in accuracy (sometimes down to ~19%). Pairing tiny image features with the linear SVM classifier also yielded poor results, with accuracy around 19%.

Bag of SIFT features and nearest neighbor (k = 11) classifier:


Accuracy of the bag of SIFT feature representation with the nearest neighbor classifier (k = 11) was 0.563. Experimentation with different k values yielded accuracies from ~0.51 to 0.563, though it is possible some k value would push the accuracy even higher. However, the best k is heavily dependent on the dataset, so it will not likely transfer to different tests.
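For reference, the k-nearest-neighbors classifier used in these runs can be sketched in a few lines. This is a pure-Python illustration (the actual implementation was in MATLAB), and the example histograms and labels below are invented:

```python
from collections import Counter

def knn_classify(hist, train_hists, train_labels, k=11):
    """Predict the label of `hist` by majority vote among its k nearest
    training histograms (squared L2 distance)."""
    by_distance = sorted(range(len(train_hists)),
                         key=lambda i: sum((a - b) ** 2
                                           for a, b in zip(hist, train_hists[i])))
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]
```

The sensitivity to k discussed above comes directly from the vote: a larger k smooths over noisy neighbors but can let a dense nearby category outvote the true one.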

Bag of SIFT features and linear SVM classifier:


Accuracy of the bag of SIFT feature representation with the linear SVM classifier varied from 0.64 to 0.68; the run reported here was one of the better tests, with an accuracy of 0.671. In total, the pipeline takes about 1-1.25 hours to run from start to finish: building the vocabulary matrix takes between 40 minutes and an hour (with ~2/3rds of that time spent in vl_kmeans), and building the bag of SIFT feature representation takes 10-20 minutes, depending on the computer. Tiny image features, nearest neighbor classification, and linear SVM classification each take only a few seconds.

Category name   Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen         0.610      LivingRoom, InsideCity         Industrial, Suburb
Store           0.500      TallBuilding, Industrial       Forest, Highway
Bedroom         0.490      Store, TallBuilding            Kitchen, Office
LivingRoom      0.210      Industrial, Bedroom            Street, Kitchen
Office          0.870      Bedroom, Kitchen               Bedroom, Kitchen
Industrial      0.340      InsideCity, InsideCity         Kitchen, TallBuilding
Suburb          0.970      Store, Industrial              LivingRoom, LivingRoom
InsideCity      0.620      Industrial, Store              Store, Kitchen
TallBuilding    0.790      Industrial, Store              Store, Street
Street          0.790      Store, LivingRoom              Highway, InsideCity
Highway         0.830      Street, Industrial             InsideCity, InsideCity
OpenCountry     0.480      Coast, Coast                   Forest, Coast
Coast           0.820      OpenCountry, Industrial        OpenCountry, Mountain
Mountain        0.800      Kitchen, TallBuilding          Forest, OpenCountry
Forest          0.940      OpenCountry, Mountain          Mountain, OpenCountry

(The original table also showed sample training images, sample true positives, and example images for each false positive and false negative; only the category labels are reproduced here.)

Overall, the most important parameters were the vocabulary size and the lambda value in the SVM classifier. The results are good in the sense that the false positives are not drastically different from their correct categories. For example, a 'bedroom' scene classified as a 'living room' is a reasonable mistake, as the two scenes share many similar features; a 'street' being classified as a 'highway' is also reasonable. It would be worse if a 'mountain' were classified as a 'living room', as those scenes share very few features. Thus, this pipeline's accuracy is justified, and does not really present any unexpected results.