CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this assignment was to perform scene recognition by using a bag of words model, based on a hand-labeled data set of 15 scene categories.

Algorithm

We start from a low-accuracy baseline and then replace one stage of the pipeline at a time with a stronger method, ending with solid accuracy. Our starting pipeline consists of two stages:

Obtaining tiny image feature representations for our training and test data: to do this we simply shrink each image in our data to 16x16 resolution.
Nearest neighbor classification: since the training data has ground truth labels, we find, for each tiny image feature in the test set, its nearest neighbor in the training set and use that neighbor's ground truth label as the prediction. We use L2 distance as our measure of distance.
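
The two stages above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project code: the function names are hypothetical, images are assumed to be 2-D grayscale arrays of at least 16x16 pixels, and the shrinking is done by block averaging, which is only one of several ways to downsample.

```python
import numpy as np

def tiny_image(image, size=16):
    """Shrink a 2-D grayscale image to size x size by block averaging (assumed method)."""
    h, w = image.shape
    bh, bw = h // size, w // size  # block height and width
    # Crop so that both dimensions split evenly into `size` blocks.
    cropped = image[:bh * size, :bw * size].astype(np.float64)
    blocks = cropped.reshape(size, bh, size, bw)
    # Average each block, then flatten to a feature vector of length size*size.
    return blocks.mean(axis=(1, 3)).ravel()

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Give each test feature the label of its L2-nearest training feature."""
    # Squared L2 distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          - 2.0 * test_feats @ train_feats.T
          + np.sum(train_feats ** 2, axis=1)[None, :])
    nearest = np.argmin(d2, axis=1)
    return [train_labels[i] for i in nearest]
```

Because the squared distance has the same minimizer as the distance itself, the square root can be skipped.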

Implementing this gave us the following results:

CS 143 Project 3 results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.191

The above image visualizes our results as a heat map of the confusion matrix. The y-axis represents the correct label for a given image, whereas the x-axis represents the label our algorithm has given it. A good classifier should therefore show a strong diagonal.

As we can see from the above image, 19.1% accuracy is a clear improvement over random chance (roughly 6.7% with 15 categories), but not enough to be at all reliable.
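
To make the accuracy measure concrete, here is a small sketch of how a row-normalized confusion matrix and the mean of its diagonal could be computed, together with the chance level for 15 categories. The function names are hypothetical and the evaluation code actually used in the project may differ.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, categories):
    """Row i, column j: fraction of images of category i that were labeled j."""
    index = {name: i for i, name in enumerate(categories)}
    counts = np.zeros((len(categories), len(categories)))
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t], index[p]] += 1
    # Normalize each row; guard against categories with no test images.
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)

def mean_diagonal_accuracy(conf):
    """Average of the per-category accuracies on the diagonal."""
    return float(np.mean(np.diag(conf)))

# Uniform random guessing over 15 categories is right 1 time in 15.
chance_level = 1.0 / 15
```

Each diagonal entry is the accuracy for one category, so the mean of the diagonal weights all categories equally regardless of how many test images each has.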

Bag of SIFTs

To improve accuracy, we replace the tiny image representation with a 'bag of SIFT features' representation.

First, we need to establish a 'visual vocabulary.' We form this vocabulary by computing SIFT features from our training data and then clustering them with k-means. The resulting cluster centers are our visual words, and any new feature can be assigned to the word whose center is closest to it. To keep the running time manageable, we cluster over a random sample of the training data (in our case 30%), extract SIFT features with a step size of 8, and compute k=200 clusters.
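
As an illustration of the clustering step, the following is a plain NumPy sketch of k-means (Lloyd's algorithm) over a random sample of descriptors. It assumes the SIFT descriptors have already been extracted into an (n, d) array; sampling individual descriptors (rather than whole images), the fixed iteration count, and the function names are all assumptions made for the sketch, not details taken from the project.

```python
import numpy as np

def squared_distances(a, b):
    """Pairwise squared L2 distances between rows of a and rows of b."""
    return (np.sum(a ** 2, axis=1)[:, None]
            - 2.0 * a @ b.T
            + np.sum(b ** 2, axis=1)[None, :])

def build_vocabulary(descriptors, k=200, sample_fraction=0.3, iterations=20, seed=0):
    """Cluster a random sample of descriptors into k visual words."""
    rng = np.random.default_rng(seed)
    n = descriptors.shape[0]
    # Keep at least k descriptors so every cluster can get an initial center.
    n_sample = min(n, max(k, int(sample_fraction * n)))
    sample = descriptors[rng.choice(n, size=n_sample, replace=False)].astype(np.float64)
    # Initialize centers from k distinct sampled descriptors.
    centers = sample[rng.choice(n_sample, size=k, replace=False)].copy()
    for _ in range(iterations):
        assign = np.argmin(squared_distances(sample, centers), axis=1)
        for j in range(k):
            members = sample[assign == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return centers
```

The returned array has one row per visual word, so with the settings in the text it would have 200 rows.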

Now that we have a visual vocabulary, we can represent our images as 'bags of SIFT descriptors.' We find a set of SIFT descriptors for every image in the training and testing sets. Rather than saving them all, we assign each descriptor to its nearest 'vocab word' (k-means cluster centroid) and store only the histogram of word counts. We then divide the histogram by the number of features found, so that large images do not skew the results, and re-use our nearest neighbor classifier over these histograms to label images in the test set.
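
The histogram step can be sketched as follows, assuming the descriptors of one image are given as an (n, d) array and the vocabulary as a (k, d) array of cluster centers. The function name is hypothetical.

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocabulary):
    """Normalized histogram of nearest visual words for one image."""
    # Squared L2 distance from every descriptor to every vocabulary word.
    d2 = (np.sum(descriptors ** 2, axis=1)[:, None]
          - 2.0 * descriptors @ vocabulary.T
          + np.sum(vocabulary ** 2, axis=1)[None, :])
    words = np.argmin(d2, axis=1)
    counts = np.bincount(words, minlength=vocabulary.shape[0]).astype(np.float64)
    # Divide by the number of descriptors so image size does not matter.
    return counts / max(len(descriptors), 1)
```

After normalization the entries of each histogram sum to 1, so an image with many descriptors and an image with few are directly comparable.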

Vocab SIFT size = 8, vocab size = 200, SIFT size = 4
Accuracy (mean of diagonal of confusion matrix) is 0.510

Linear SVMs

The last replacement in our pipeline is to train 1-vs-all linear SVMs. Simply put, nearest neighbor weights all 200 of our vocab words equally, whereas a linear SVM learns a weight per word and can therefore downweight words that are less relevant or that are present in all image types. Since a linear SVM is a binary classifier (a point is on one side of the hyperplane or the other), we train one linear SVM per category, each separating that category from all the others, and then assign to each test image the category whose SVM gives it the highest score. The final results of this can be seen below.
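
The 1-vs-all scheme can be illustrated with the sketch below. The text does not say which SVM solver was used, so this stand-in trains each binary classifier by plain subgradient descent on the regularized hinge loss. The function names and the hyperparameters (`lam`, `epochs`, `lr`) are assumptions chosen for illustration only.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Binary linear SVM, y in {-1, +1}.

    Minimizes lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b)))
    with full-batch subgradient descent (a simple stand-in solver).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # samples that violate the margin
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_all(X, labels, categories, **kwargs):
    """Train one binary SVM per category: that category vs. all others."""
    labels = np.asarray(labels)
    models = [train_linear_svm(X, np.where(labels == c, 1.0, -1.0), **kwargs)
              for c in categories]
    W = np.stack([w for w, _ in models])
    B = np.array([b for _, b in models])
    return W, B

def predict(W, B, X, categories):
    """Assign each row of X the category whose SVM scores it highest."""
    scores = X @ W.T + B
    return [categories[i] for i in np.argmax(scores, axis=1)]
```

Taking the argmax over raw scores means a test image still receives a label when no individual SVM places it on the positive side of its hyperplane.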

Final results

Vocab SIFT size = 8, vocab size = 200, SIFT size = 4
Accuracy (mean of diagonal of confusion matrix) is 0.634
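
As a consistency check, the overall figure should equal the mean of the per-category accuracies reported in the results table that follows, since those are the diagonal entries of the confusion matrix. The short script below simply copies the 15 values from that table and averages them.

```python
# Per-category accuracies copied from the results table.
per_category_accuracy = {
    "Kitchen": 0.610, "Store": 0.280, "Bedroom": 0.390, "LivingRoom": 0.170,
    "Office": 0.980, "Industrial": 0.530, "Suburb": 0.950, "InsideCity": 0.590,
    "TallBuilding": 0.820, "Street": 0.490, "Highway": 0.780,
    "OpenCountry": 0.360, "Coast": 0.800, "Mountain": 0.830, "Forest": 0.930,
}

# Mean of the diagonal = unweighted mean over the 15 categories.
overall = sum(per_category_accuracy.values()) / len(per_category_accuracy)
print(f"{overall:.3f}")
```

The values sum to 9.51, and 9.51 / 15 = 0.634, which agrees with the reported accuracy.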

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label

Kitchen         0.610      Bedroom, Bedroom                  Office, Office
Store           0.280      LivingRoom, Industrial            Kitchen, Bedroom
Bedroom         0.390      LivingRoom, InsideCity            LivingRoom, Office
LivingRoom      0.170      Kitchen, Bedroom                  Bedroom, Kitchen
Office          0.980      LivingRoom, LivingRoom            Kitchen, Kitchen
Industrial      0.530      Bedroom, Store                    TallBuilding, InsideCity
Suburb          0.950      Store, Coast                      TallBuilding, LivingRoom
InsideCity      0.590      Kitchen, Kitchen                  LivingRoom, Suburb
TallBuilding    0.820      Industrial, Store                 Forest, Mountain
Street          0.490      TallBuilding, Mountain            Industrial, Industrial
Highway         0.780      Suburb, Industrial                Coast, Coast
OpenCountry     0.360      Coast, Highway                    Coast, Suburb
Coast           0.800      Highway, OpenCountry              Mountain, Highway
Mountain        0.830      OpenCountry, OpenCountry          Suburb, Suburb
Forest          0.930      OpenCountry, OpenCountry          Street, Mountain