CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this assignment was to perform scene recognition by using a bag of words model, based on a hand-labeled data set of 15 scene categories.

Algorithm

We start from a low-accuracy baseline and gradually replace each stage of the pipeline with a stronger method, ending with solid accuracy. Our starting pipeline consists of two stages:

  1. Obtaining tiny image feature representations for our training and test data. To do this, we simply shrink each image to 16x16 resolution.
  2. Nearest neighbor classification. Since the training data has ground truth labels, we find the nearest training-set neighbor of each tiny image feature in the test set and use that neighbor's label as the prediction. We use L2 (Euclidean) distance as our distance measure.
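
The two stages above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's actual code: the function names tiny_image and nearest_neighbor_classify are hypothetical, images are assumed to be grayscale lists of pixel rows, and shrinking by block averaging is an assumption about how the resize is done.

```python
import math


def tiny_image(pixels, size=16):
    """Shrink a grayscale image (list of pixel rows) to a flat size*size feature."""
    height, width = len(pixels), len(pixels[0])
    feature = []
    for ty in range(size):
        # Range of source rows averaged into this output row (at least one row).
        y0 = ty * height // size
        y1 = max((ty + 1) * height // size, y0 + 1)
        for tx in range(size):
            x0 = tx * width // size
            x1 = max((tx + 1) * width // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feature.append(sum(block) / len(block))
    return feature


def nearest_neighbor_classify(train_features, train_labels, test_features):
    """Label each test feature with the label of its L2-nearest training feature."""
    predictions = []
    for feature in test_features:
        nearest = min(
            range(len(train_features)),
            key=lambda i: math.dist(train_features[i], feature),
        )
        predictions.append(train_labels[nearest])
    return predictions
```

Because every test feature is compared against every training feature, the cost grows with the product of the two set sizes, which is acceptable for 256-dimensional tiny images.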

Implementing this gave us the following results:

[Figure: CS 143 Project 3 results visualization (confusion matrix heat map)]


Accuracy (mean of diagonal of confusion matrix) is 0.191

The above image visualizes our results as a heat map. The y-axis represents the correct label for a given image, whereas the x-axis represents the label our algorithm has given it. A good implementation should, in other words, have a strong diagonal.

As the heat map shows, this baseline does somewhat better than random chance (1 in 15 categories, or roughly 6.7% accuracy), but it is not accurate enough to be reliable.
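
The accuracy figure reported with each heat map is the mean of the diagonal of the confusion matrix. A minimal sketch of that computation follows; the function names are hypothetical, and normalizing each row by the number of test images in that category is an assumption (it is what makes each diagonal entry a per-category accuracy).

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Rows are correct labels, columns are predicted labels, each row sums to 1."""
    index = {name: i for i, name in enumerate(categories)}
    counts = [[0] * len(categories) for _ in categories]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row for a category with no test images.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


def mean_diagonal_accuracy(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)
```

With 15 categories, a classifier that guesses uniformly at random would be expected to score about 1/15 on this measure.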

Bag of SIFTs

To improve accuracy, we'll replace our tiny image representations with 'bag of SIFT features' representations.

First, we need to establish a 'visual vocabulary.' We form this vocabulary by computing SIFT features from our training data and then clustering them with kmeans. The resulting cluster centers are our visual words, and any new feature can be assigned to the word whose cluster center is closest to it. To reduce computation time, we cluster over a random sample of the training data (30% in our case), extract SIFT features with a step size of 8, and compute k=200 clusters.
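
The clustering step can be illustrated with a small kmeans loop. This is a sketch under stated assumptions rather than the code used for the results: build_vocabulary is a hypothetical name, the descriptors are assumed to be already extracted (real SIFT descriptors are 128-dimensional), and the fixed iteration count and seeded initialization are illustrative choices.

```python
import math
import random


def build_vocabulary(descriptors, vocab_size, iterations=10, seed=0):
    """Cluster descriptors with kmeans and return the cluster centers."""
    rng = random.Random(seed)
    # Initialize the centers from randomly chosen descriptors.
    centers = [list(d) for d in rng.sample(descriptors, vocab_size)]
    for _ in range(iterations):
        # Assignment step: attach every descriptor to its nearest center.
        clusters = [[] for _ in centers]
        for d in descriptors:
            nearest = min(range(len(centers)), key=lambda i: math.dist(centers[i], d))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(column) / len(members) for column in zip(*members)]
    return centers
```

For the settings described above, vocab_size would be 200 and descriptors would hold the SIFT features from the sampled training data.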

Now that we have a visual vocabulary, we can represent our images as 'bags of SIFT descriptors.' We find a set of SIFT descriptors for every image in the training and testing sets. Rather than saving them all, we store each image as a histogram that counts how many of its descriptors fall nearest to each 'vocab word,' or kmeans cluster centroid. We then normalize each histogram by the number of features found, to prevent large images from skewing the results, and re-use our nearest neighbor classifier over these histograms to label images in the test set.
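
The histogram construction described above might look like the following sketch. The name bag_of_words_histogram is hypothetical, and the descriptors and vocabulary are assumed to be plain lists of equal-length vectors.

```python
import math


def bag_of_words_histogram(descriptors, vocabulary):
    """Count descriptors per nearest vocab word, normalized by descriptor count."""
    histogram = [0.0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)), key=lambda i: math.dist(vocabulary[i], d))
        histogram[nearest] += 1.0
    total = len(descriptors)
    if total:
        # Dividing by the feature count makes images of different sizes comparable.
        histogram = [count / total for count in histogram]
    return histogram
```

Each image then becomes a single vector with one entry per vocab word (200 in our case), which can be passed to the same nearest neighbor classifier used for tiny images.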


Vocab SIFT size = 8, vocab size = 200, SIFT size = 4
Accuracy (mean of diagonal of confusion matrix) is 0.510

Linear SVMs

The last replacement in our pipeline is to train 1-vs-all linear SVMs. Simply put, nearest neighbor weights all 200 of our vocab words equally, whereas a linear SVM can learn to downweight words that are less relevant or that appear in all image types. Since a linear SVM is a binary classifier (a point is on one side of the hyperplane or the other), we train one linear SVM per category and then assign each test image the category whose SVM gives the highest score. The final results can be seen below.
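
To make the 1-vs-all scheme concrete, here is a small self-contained sketch. It is not the solver used for the reported results: the training routine is a simple subgradient descent on the regularized hinge loss, swapped in for illustration, and the function names and the lam, rate, and epochs parameters are all assumptions.

```python
def train_linear_svm(features, targets, lam=0.01, rate=0.1, epochs=200):
    """Fit weights and bias for one binary linear SVM (targets are +1 or -1)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # L2 regularization shrinks the weights a little on every step.
            weights = [w * (1.0 - rate * lam) for w in weights]
            if margin < 1.0:
                # Hinge loss is active: move the hyperplane toward this example's side.
                weights = [w + rate * y * v for w, v in zip(weights, x)]
                bias += rate * y
    return weights, bias


def train_one_vs_all(features, labels, categories, **svm_options):
    """Train one binary SVM per category: that category versus all others."""
    models = {}
    for category in categories:
        targets = [1.0 if label == category else -1.0 for label in labels]
        models[category] = train_linear_svm(features, targets, **svm_options)
    return models


def classify(models, feature):
    """Return the category whose SVM gives the highest score for this feature."""
    def score(category):
        weights, bias = models[category]
        return sum(w * v for w, v in zip(weights, feature)) + bias
    return max(models, key=score)
```

Taking the highest raw score, rather than requiring exactly one SVM to answer positively, means every test image receives a label even when no SVM or several SVMs claim it.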

Final results


Vocab SIFT size = 8, vocab size = 200, SIFT size = 4
Accuracy (mean of diagonal of confusion matrix) is 0.634

[Table image columns not shown: Sample training images, Sample true positives]

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen         0.610      Bedroom, Bedroom                  Office, Office
Store           0.280      LivingRoom, Industrial            Kitchen, Bedroom
Bedroom         0.390      LivingRoom, InsideCity            LivingRoom, Office
LivingRoom      0.170      Kitchen, Bedroom                  Bedroom, Kitchen
Office          0.980      LivingRoom, LivingRoom            Kitchen, Kitchen
Industrial      0.530      Bedroom, Store                    TallBuilding, InsideCity
Suburb          0.950      Store, Coast                      TallBuilding, LivingRoom
InsideCity      0.590      Kitchen, Kitchen                  LivingRoom, Suburb
TallBuilding    0.820      Industrial, Store                 Forest, Mountain
Street          0.490      TallBuilding, Mountain            Industrial, Industrial
Highway         0.780      Suburb, Industrial                Coast, Coast
OpenCountry     0.360      Coast, Highway                    Coast, Suburb
Coast           0.800      Highway, OpenCountry              Mountain, Highway
Mountain        0.830      OpenCountry, OpenCountry          Suburb, Suburb
Forest          0.930      OpenCountry, OpenCountry          Street, Mountain