CS 143 / Project 3 / Scene Recognition with Bag of Words

Introduction

In this project, we set out to build a program that recognizes scenes. Each approach samples features from a training set of images and then trains a classifier that assigns a given image to one of fifteen possible categories. The approaches differ in how they handle the two steps of the problem: sampling features and classifying those features based on the training data. The three combinations of methods used are:

  1. Tiny image representation with a nearest neighbor classifier;
  2. Bag of SIFT representation with a nearest neighbor classifier;
  3. Bag of SIFT representation with a linear SVM classifier.

Tiny images and Nearest Neighbor

In the tiny image representation, every image is simply rescaled to a tiny format: here, each image is resized to 16 x 16 pixels, effectively representing it as a 256-dimensional feature vector. Then a k-nearest-neighbor classifier is used: we simply find the k nearest labelled training points to a given test point and assign the test point the majority-vote label.
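Since the project code is not reproduced here, the following is a minimal sketch of this pipeline in Python with NumPy, Pillow, and scikit-learn (the original project was not written in Python); train_paths, train_labels, and test_paths are hypothetical variables holding the dataset file paths and labels, and the zero-mean/unit-length normalization is a common optional refinement rather than part of the description above.

    import numpy as np
    from PIL import Image
    from sklearn.neighbors import KNeighborsClassifier

    def tiny_image_feature(path, size=16):
        # Resize to a size x size grayscale image and flatten: 16 x 16 -> 256 dims.
        img = Image.open(path).convert("L").resize((size, size))
        feat = np.asarray(img, dtype=np.float64).ravel()
        feat -= feat.mean()                           # zero mean (optional refinement)
        return feat / (np.linalg.norm(feat) + 1e-10)  # unit length (optional refinement)

    X_train = np.stack([tiny_image_feature(p) for p in train_paths])
    X_test  = np.stack([tiny_image_feature(p) for p in test_paths])

    # k = 5: each test point receives the majority-vote label of its
    # 5 nearest training points.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, train_labels)
    predictions = knn.predict(X_test)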

While tiny images seem like a very rudimentary idea, they do perform better than chance (1 in 15 categories, about 6.7%): with 1 nearest neighbor, recognition accuracy was 19.5%; with 5 nearest neighbors it improved to 22.5%.

It should be noted that these results are better than using the tiny image representation with a support vector machine classifier, which, depending on the SVM training parameters, performs at only about 10-14% accuracy. In that combination, the SVM classifiers tend to lump almost all the data into just a few categories in a wildly inaccurate way.

Bag of SIFT words and k-Nearest Neighbor

Bag of words is a strategy borrowed from natural language processing, where documents are treated as unordered collections of words drawn from a large vocabulary; every document is essentially represented as a histogram of word counts. In the visual world, the "words" are SIFT features, densely sampled from the image. Our vocabulary is created by densely sampling SIFT features from the training set and then clustering them with a k-means algorithm down to a workable size; I used a vocabulary of 400 visual words. Each test image is then densely sampled, and every sample is added to the histogram bin of the vocabulary word it is most similar to. In the end, each image is represented as a 400-bin histogram.
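A minimal sketch of the vocabulary construction and histogram assignment, assuming OpenCV's SIFT for the dense sampling and scikit-learn's k-means (the step sizes of 30 and 10 are the ones discussed below, and train_paths is again a hypothetical list of training image paths):

    import numpy as np
    import cv2
    from sklearn.cluster import KMeans

    VOCAB_SIZE = 400  # number of visual words
    sift = cv2.SIFT_create()

    def dense_sift(gray, step):
        # Sample SIFT descriptors on a regular grid instead of at detected
        # keypoints ("dense sampling"); each descriptor has 128 dimensions.
        h, w = gray.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                     for y in range(step // 2, h, step)
                     for x in range(step // 2, w, step)]
        _, desc = sift.compute(gray, keypoints)
        return desc

    # Vocabulary: pool descriptors from every training image (step size 30),
    # then cluster them into 400 visual words with k-means.
    pooled = np.vstack([dense_sift(cv2.imread(p, cv2.IMREAD_GRAYSCALE), 30)
                        for p in train_paths])
    kmeans = KMeans(n_clusters=VOCAB_SIZE, n_init=1).fit(pooled)

    def bag_of_sift(path):
        # Densely sample the image (step size 10); each sample falls into the
        # bin of its most similar visual word, giving a 400-bin histogram.
        desc = dense_sift(cv2.imread(path, cv2.IMREAD_GRAYSCALE), 10)
        words = kmeans.predict(desc)
        hist = np.bincount(words, minlength=VOCAB_SIZE).astype(np.float64)
        return hist / hist.sum()  # normalizing the histogram is a common refinement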

The nearest-neighbor classifier then works as before, determining which labelled training examples a test image is closest to. The difference from the tiny image representation is that these features live in a 400-dimensional space, whereas the previous feature space had 256 dimensions.

Results with this method were already significantly better than with the previous one. Using a step size of 30 to create the vocabulary was slow but seems to pay off in accuracy. When sampling the images themselves, I used a step size of 10, again slowing down the code but improving the results. (I did take to saving the computed features for future runs, though.)

Results using 1-nearest-neighbor were around 45.3%, with 5-nearest-neighbor performing slightly better at 46.8%. Using the chi-squared distance metric (instead of the default Euclidean distance) led to a 49.6% recognition rate.
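The chi-squared distance is not a built-in metric of scikit-learn's k-NN, but the classifier accepts a callable; a sketch, with X_train and X_test now holding the bag-of-SIFT histograms from the previous snippet:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def chi2_distance(a, b, eps=1e-10):
        # Chi-squared distance between two histograms:
        # 0.5 * sum_i (a_i - b_i)^2 / (a_i + b_i)
        return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

    # A callable metric forces brute-force neighbor search, which is slow but
    # fine at this dataset size.
    knn = KNeighborsClassifier(n_neighbors=5, metric=chi2_distance)
    knn.fit(X_train, train_labels)
    predictions = knn.predict(X_test)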

Bag of SIFT words and Support Vector Machine Classifiers

Using the same image representation as before, I tried a final classifier: the support vector machine. Here, binary classifiers are trained on a set of training data together with its labels. The SVM tries to define a function that partitions the data as well as possible, so that the sign of the function's output for a given input determines which side of the binary decision the image falls on. As these classifiers are binary, a one-vs-all classifier is needed for every category: the entire dataset is used in training, so each SVM sees both examples of its label and counterexamples. A test image is then assigned the label whose SVM classifier is most "confident" that the image pertains to that label.
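A sketch of this one-vs-all setup in the same scikit-learn terms (LinearSVC actually handles multiple classes with one-vs-rest on its own, but the explicit loop mirrors the procedure just described; the regularization value C=10.0 is a placeholder):

    import numpy as np
    from sklearn.svm import LinearSVC

    categories = sorted(set(train_labels))
    svms = {}
    for cat in categories:
        # One binary "this category vs. everything else" classifier, trained
        # on the entire training set: +1 for the category, -1 for the rest.
        y = np.where(np.asarray(train_labels) == cat, 1, -1)
        svms[cat] = LinearSVC(C=10.0).fit(X_train, y)

    def classify(x):
        # The most "confident" classifier wins: largest signed distance from
        # its decision boundary.
        scores = {cat: clf.decision_function(x.reshape(1, -1))[0]
                  for cat, clf in svms.items()}
        return max(scores, key=scores.get)

    predictions = [classify(x) for x in X_test]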

I did not alter the sampling frequency of the bag of SIFT words (as I had saved both the vocabulary and the image features from previous runs). I experimented extensively with one parameter of the SVM training function, LAMBDA, which determines how strongly the function is penalised for fitting the training data too closely, i.e. the regularization strength. Recognition rates varied significantly with the order of magnitude of this parameter, ranging anywhere from 36.3% to 56.7%, with an apparent peak in performance around lambda = 0.0001.
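In the scikit-learn sketch, the corresponding knob is C, which behaves roughly like the inverse of LAMBDA (large C means weak regularization, i.e. the function may fit the training data closely); a sweep over orders of magnitude, scored by cross-validation on the training set, looks like this:

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    # Evaluate each regularization setting with 3-fold cross-validation and
    # keep whichever value scores best.
    for C in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
        acc = cross_val_score(LinearSVC(C=C), X_train, train_labels, cv=3).mean()
        print(f"C = {C:>7}: cross-validated accuracy = {acc:.3f}")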

Interestingly, recognition also decreases when a larger vocabulary is used, dropping to approximately 52.6%. There is a very elusive interplay between the different parameters, and 56.7% was the highest rate I was able to attain.

Conclusion

Even after playing around with the parameters, I could not find the "perfect mix" of fine-tuning to push recognition much higher. Being close to 60% seemed reasonable for this approach. Adding spatial information to the image representation (the bag of words discards all layout) would probably improve recognition rates further.

Results visualisation


Accuracy (mean of diagonal of confusion matrix) is 0.567

Per-category accuracy, with the true labels of two sample false positives and the predicted labels of two sample false negatives:

    Category       Accuracy   False positives (true label)   False negatives (predicted as)
    Kitchen        0.620      LivingRoom, Bedroom            Store, LivingRoom
    Store          0.350      Kitchen, InsideCity            InsideCity, Mountain
    Bedroom        0.240      LivingRoom, TallBuilding       Kitchen, Suburb
    LivingRoom     0.190      Kitchen, Industrial            InsideCity, Bedroom
    Office         0.760      Bedroom, Bedroom               Coast, Kitchen
    Industrial     0.170      LivingRoom, TallBuilding       Mountain, Street
    Suburb         0.910      Mountain, Industrial           Store, Industrial
    InsideCity     0.530      Highway, TallBuilding          Highway, TallBuilding
    TallBuilding   0.670      Bedroom, Industrial            Coast, Kitchen
    Street         0.610      TallBuilding, InsideCity       TallBuilding, Forest
    Highway        0.740      Store, Coast                   Coast, Coast
    OpenCountry    0.260      Industrial, InsideCity         Forest, Highway
    Coast          0.830      OpenCountry, OpenCountry       Bedroom, Suburb
    Mountain       0.680      Store, OpenCountry             Bedroom, Street
    Forest         0.950      Mountain, OpenCountry          Street, Bedroom