CS 143 / Project 3 / Scene Recognition with Bag of Words

Introduction

In this project, we set out to build a program that recognizes scenes. Each approach samples features from a training set of images and then trains a classifier that assigns a given image to one of fifteen possible categories. The approaches differ in how they handle the two steps of the problem: sampling features and classifying those features based on the training data. The three combinations of methods used are:

  1. Tiny image representation with a nearest neighbor classifier;
  2. Bag of SIFT representation with a nearest neighbor classifier;
  3. Bag of SIFT representation with a linear SVM classifier.

Tiny images and Nearest Neighbor

In the tiny image representation, every image is simply rescaled to a tiny format: here, each image is resized to 16 x 16 pixels, effectively representing it as a 256-dimensional feature vector. Then a k-nearest-neighbor classifier is used: we simply find the k nearest labelled training points to a given test point and assign the test point the majority-vote label.
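Since the project code is not reproduced here, the following is a minimal sketch of this pipeline in Python with NumPy, Pillow, and scikit-learn (the original project was not written in Python); train_paths, train_labels, and test_paths are hypothetical variables holding the dataset file paths and labels, and the zero-mean/unit-length normalization is a common optional refinement rather than part of the description above.

    import numpy as np
    from PIL import Image
    from sklearn.neighbors import KNeighborsClassifier

    def tiny_image_feature(path, size=16):
        # Resize to a size x size grayscale image and flatten: 16 x 16 -> 256 dims.
        img = Image.open(path).convert("L").resize((size, size))
        feat = np.asarray(img, dtype=np.float64).ravel()
        feat -= feat.mean()                           # zero mean (optional refinement)
        return feat / (np.linalg.norm(feat) + 1e-10)  # unit length (optional refinement)

    X_train = np.stack([tiny_image_feature(p) for p in train_paths])
    X_test  = np.stack([tiny_image_feature(p) for p in test_paths])

    # k = 5: each test point receives the majority-vote label of its
    # 5 nearest training points.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, train_labels)
    predictions = knn.predict(X_test)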

While tiny images seem like a very rudimentary idea, they do perform better than chance (1 in 15 categories, about 6.7%): with 1 nearest neighbor, recognition accuracy was 19.5%; with 5 nearest neighbors it improved to 22.5%.

It should be noted that these results are better than using the tiny image representation with a support vector machine classifier, which, depending on the SVM training parameters, performs at only about 10-14% accuracy. In that combination, the SVM classifiers tend to lump almost all the data into just a few categories in a wildly inaccurate way.

Bag of SIFT words and k-Nearest Neighbor

Bag of words is a strategy borrowed from natural language processing, where documents are treated as unordered collections of words drawn from a large vocabulary; every document is essentially represented as a histogram of word counts. In the visual world, the "words" are SIFT features, densely sampled from the image. Our vocabulary is created by densely sampling SIFT features from the training set and then clustering them with a k-means algorithm down to a workable size; I used a vocabulary of 400 visual words. Each test image is then densely sampled, and every sample is added to the histogram bin of the vocabulary word it is most similar to. In the end, each image is represented as a 400-bin histogram.
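A minimal sketch of the vocabulary construction and histogram assignment, assuming OpenCV's SIFT for the dense sampling and scikit-learn's k-means (the step sizes of 30 and 10 are the ones discussed below, and train_paths is again a hypothetical list of training image paths):

    import numpy as np
    import cv2
    from sklearn.cluster import KMeans

    VOCAB_SIZE = 400  # number of visual words
    sift = cv2.SIFT_create()

    def dense_sift(gray, step):
        # Sample SIFT descriptors on a regular grid instead of at detected
        # keypoints ("dense sampling"); each descriptor has 128 dimensions.
        h, w = gray.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                     for y in range(step // 2, h, step)
                     for x in range(step // 2, w, step)]
        _, desc = sift.compute(gray, keypoints)
        return desc

    # Vocabulary: pool descriptors from every training image (step size 30),
    # then cluster them into 400 visual words with k-means.
    pooled = np.vstack([dense_sift(cv2.imread(p, cv2.IMREAD_GRAYSCALE), 30)
                        for p in train_paths])
    kmeans = KMeans(n_clusters=VOCAB_SIZE, n_init=1).fit(pooled)

    def bag_of_sift(path):
        # Densely sample the image (step size 10); each sample falls into the
        # bin of its most similar visual word, giving a 400-bin histogram.
        desc = dense_sift(cv2.imread(path, cv2.IMREAD_GRAYSCALE), 10)
        words = kmeans.predict(desc)
        hist = np.bincount(words, minlength=VOCAB_SIZE).astype(np.float64)
        return hist / hist.sum()  # normalizing the histogram is a common refinement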

The nearest-neighbor classifier then works as before, determining which labelled training examples a test image is closest to. The difference from the tiny image representation is that these features live in a 400-dimensional space, whereas the previous feature space had 256 dimensions.

Results with this method were already significantly better than with the previous one. Using a step size of 30 to create the vocabulary was slow but seems to pay off in accuracy. When sampling the images themselves, I used a step size of 10, again slowing down the code but improving the results. (I did take to saving the computed features for future runs, though.)

Results using 1-nearest-neighbor were around 45.3%, with 5-nearest-neighbor performing slightly better at 46.8%. Using the chi-squared distance metric (instead of the default Euclidean distance) led to a 49.6% recognition rate.
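The chi-squared distance is not a built-in metric of scikit-learn's k-NN, but the classifier accepts a callable; a sketch, with X_train and X_test now holding the bag-of-SIFT histograms from the previous snippet:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def chi2_distance(a, b, eps=1e-10):
        # Chi-squared distance between two histograms:
        # 0.5 * sum_i (a_i - b_i)^2 / (a_i + b_i)
        return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

    # A callable metric forces brute-force neighbor search, which is slow but
    # fine at this dataset size.
    knn = KNeighborsClassifier(n_neighbors=5, metric=chi2_distance)
    knn.fit(X_train, train_labels)
    predictions = knn.predict(X_test)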

Bag of SIFT words and Support Vector Machine Classifiers

Using the same image representation as before, I tried a final classifier: the support vector machine. Here, binary classifiers are trained on a set of training data together with its labels. The SVM tries to define a function that partitions the data as well as possible, so that the sign of the function's output for a given input determines which side of the binary decision the image falls on. As these classifiers are binary, a one-vs-all classifier is needed for every category: the entire dataset is used in training, so each SVM sees both examples of its label and counterexamples. A test image is then assigned the label whose SVM classifier is most "confident" that the image pertains to that label.
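A sketch of this one-vs-all setup in the same scikit-learn terms (LinearSVC actually handles multiple classes with one-vs-rest on its own, but the explicit loop mirrors the procedure just described; the regularization value C=10.0 is a placeholder):

    import numpy as np
    from sklearn.svm import LinearSVC

    categories = sorted(set(train_labels))
    svms = {}
    for cat in categories:
        # One binary "this category vs. everything else" classifier, trained
        # on the entire training set: +1 for the category, -1 for the rest.
        y = np.where(np.asarray(train_labels) == cat, 1, -1)
        svms[cat] = LinearSVC(C=10.0).fit(X_train, y)

    def classify(x):
        # The most "confident" classifier wins: largest signed distance from
        # its decision boundary.
        scores = {cat: clf.decision_function(x.reshape(1, -1))[0]
                  for cat, clf in svms.items()}
        return max(scores, key=scores.get)

    predictions = [classify(x) for x in X_test]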

I did not alter the sampling frequency of the bag of SIFT words (as I had saved both the vocabulary and the image features from previous runs). I experimented extensively with one parameter of the SVM training function, LAMBDA, which determines how strongly the function is penalised for fitting the training data too closely, i.e. the regularization strength. Recognition rates varied significantly with the order of magnitude of this parameter, ranging anywhere from 36.3% to 56.7%, with an apparent peak in performance around lambda = 0.0001.
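In the scikit-learn sketch, the corresponding knob is C, which behaves roughly like the inverse of LAMBDA (large C means weak regularization, i.e. the function may fit the training data closely); a sweep over orders of magnitude, scored by cross-validation on the training set, looks like this:

    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    # Evaluate each regularization setting with 3-fold cross-validation and
    # keep whichever value scores best.
    for C in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
        acc = cross_val_score(LinearSVC(C=C), X_train, train_labels, cv=3).mean()
        print(f"C = {C:>7}: cross-validated accuracy = {acc:.3f}")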

Interestingly, recognition also decreases when a larger vocabulary is used, dropping to approximately 52.6%. There is a very elusive interplay between the different parameters, and 56.7% was the highest rate I was able to attain.

Conclusion

Even after playing around with the parameters, I could not find the "perfect mix" of fine-tuning to push recognition much higher. Being close to 60% seemed reasonable for this approach. Adding spatial information to the image representation (the bag of words discards all layout) would probably improve recognition rates further.

Results visualisation


Accuracy (mean of diagonal of confusion matrix) is 0.567

Per-category accuracy, with the true labels of two sample false positives and the predicted labels of two sample false negatives:

    Category       Accuracy   False positives (true label)   False negatives (predicted as)
    Kitchen        0.620      LivingRoom, Bedroom            Store, LivingRoom
    Store          0.350      Kitchen, InsideCity            InsideCity, Mountain
    Bedroom        0.240      LivingRoom, TallBuilding       Kitchen, Suburb
    LivingRoom     0.190      Kitchen, Industrial            InsideCity, Bedroom
    Office         0.760      Bedroom, Bedroom               Coast, Kitchen
    Industrial     0.170      LivingRoom, TallBuilding       Mountain, Street
    Suburb         0.910      Mountain, Industrial           Store, Industrial
    InsideCity     0.530      Highway, TallBuilding          Highway, TallBuilding
    TallBuilding   0.670      Bedroom, Industrial            Coast, Kitchen
    Street         0.610      TallBuilding, InsideCity       TallBuilding, Forest
    Highway        0.740      Store, Coast                   Coast, Coast
    OpenCountry    0.260      Industrial, InsideCity         Forest, Highway
    Coast          0.830      OpenCountry, OpenCountry       Bedroom, Suburb
    Mountain       0.680      Store, OpenCountry             Bedroom, Street
    Forest         0.950      Mountain, OpenCountry          Street, Bedroom