CS 143 / Project 3 / Scene Recognition with Bag of Words

Overview

The goal of this assignment was to recognize scenes using various methods. These methods included

  1. Tiny images representation and nearest neighbor classifier.
  2. Bag of SIFT representation and nearest neighbor classifier.
  3. Bag of SIFT representation and linear SVM classifier.

Tiny images and nearest neighbor

The tiny images representation is created by simply resizing the image to a smaller, fixed size (e.g. 16x16) and representing this smaller image as a vector. The nearest neighbor classifier simply labels the test image with the label of the training image whose representation is closest to that of the test image (Euclidean distance, in our case).
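To make the two pieces concrete, here is a minimal numpy sketch. The project itself is MATLAB code; the function names below are hypothetical, and the block averaging is a stand-in for whatever resizing routine (e.g. imresize) the real code uses:

```python
import numpy as np

def tiny_image(img, size=16):
    """Shrink a grayscale image (2-D array) to size x size by block
    averaging, then flatten to a vector."""
    h, w = img.shape
    ys = np.arange(size + 1) * h // size
    xs = np.arange(size + 1) * w // size
    out = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            out[i, j] = img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return out.ravel()

def nearest_neighbor(train_feats, train_labels, test_feat):
    """Label the test vector with the label of the closest training
    vector under Euclidean distance."""
    d = np.linalg.norm(train_feats - test_feat, axis=1)
    return train_labels[int(np.argmin(d))]
```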

The accuracy of this combination was 19.1%. I did not add any optimizations to the code, though using k-nearest neighbors instead of a single nearest neighbor would likely improve performance.

Bag of SIFT and nearest neighbor

The bag of SIFT representation is created by first finding SIFT descriptors for a dense sampling of local features, and then clustering the descriptors using k-means. The cluster centers form the vocabulary of visual words we'll use to classify test images. In my implementation, I sampled SIFT features with a step size of 100, using the 'fast' parameter of vl_dsift, and a vocabulary size of 400 (i.e. 400 cluster centers).
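The clustering step can be sketched as plain Lloyd's k-means over the matrix of sampled descriptors. This is a hypothetical numpy stand-in: the project actually uses vl_dsift to get descriptors and VLFeat's k-means, and a real implementation would use random initialization rather than the deterministic first-k initialization used here for simplicity:

```python
import numpy as np

def build_vocabulary(descriptors, k=400, iters=10):
    """Lloyd's k-means over an n x d matrix of SIFT descriptors.
    The returned cluster centers are the visual vocabulary."""
    # deterministic init for illustration; real k-means picks random seeds
    centers = descriptors[:k].astype(float).copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for c in range(k):
            members = descriptors[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers
```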

Once the vocabulary has been created, we represent an image by sampling SIFT features (this time with a step size of 10) and forming a 400-dimensional histogram that counts how many of the image's SIFT descriptors lie in each cluster of the vocabulary. The histogram is then normalized so that the size of the image does not dramatically affect the magnitude of the bag-of-features representation.
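The histogram construction is a nearest-center lookup followed by a normalized count. A numpy sketch (hypothetical function name; I use L1 normalization here, though the project code may normalize differently):

```python
import numpy as np

def bag_of_sift_histogram(descriptors, vocab):
    """Assign each descriptor (row of an n x d matrix) to its nearest
    visual word in vocab (k x d) and return a normalized k-bin histogram."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    # L1-normalize so the number of sampled features (image size)
    # does not change the histogram's magnitude
    return hist / hist.sum()
```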

The nearest neighbor classifier is the same as described in the previous section.

The accuracy of this combination was 50.6%, showing that bag of SIFT is a much better representation of images than tiny images (which matches intuition).

Bag of SIFT and linear SVM

In this combination, we represent the images in the same way as described in the section above, but we classify them using 15 1-vs-all linear support vector machines (SVMs), one for each image category. For each linear SVM, the SIFT feature space is partitioned by a learned hyperplane that divides training images in the given category from images that aren't in the category. In my implementation I used vl_svmtrain with lambda = 0.0001. Both larger and smaller values of lambda yielded lower accuracies (I tested all powers of 10 between 0.00001 and 0.01).
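For intuition, one of the 1-vs-all classifiers can be sketched as subgradient descent on the L2-regularized hinge loss. This is not the project's actual training code (that is vl_svmtrain, which uses a different solver); it is a minimal Pegasos-style stand-in with hypothetical names:

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, epochs=200, lr=0.1):
    """Train one 1-vs-all linear SVM by subgradient descent on the
    L2-regularized hinge loss. X is n x d; y is +/-1 per image
    (+1 = image belongs to this category)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    y = np.asarray(y, dtype=float)
    for _ in range(epochs):
        # margin violators: points with y * (w.x + b) < 1
        viol = y * (X @ w + b) < 1
        # subgradient of lam*||w||^2/2 + mean hinge loss
        gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```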

Each test image is then evaluated with all 15 SVMs and given the label of the most confident SVM. Confidence is the signed distance from the hyperplane, measured by W*X + B, where '*' is the dot product, W and B are the learned hyperplane parameters, and X is the representation of the test image.
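Picking the most confident of the 15 one-vs-all SVMs is just an argmax over the signed scores W*X + B. A small sketch (hypothetical names, assuming each SVM is stored as a (W, B, label) tuple):

```python
import numpy as np

def predict_category(svms, x):
    """svms: list of (w, b, label) tuples, one per 1-vs-all classifier.
    Return the label of the SVM with the largest score w.x + b."""
    scores = [w @ x + b for (w, b, _) in svms]
    return svms[int(np.argmax(scores))][2]
```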

This combination yielded the best results, with an accuracy of 64.6%. A full report of the results is shown below:


Accuracy (mean of diagonal of confusion matrix) is 0.646

(The sample images from the original results table are omitted; only the per-category accuracies and the text labels attached to the false-positive and false-negative thumbnails survive. For each category, the first two labels are the true labels of the sample false positives and the last two are the wrongly predicted labels of the sample false negatives.)

Category      Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen       0.520      Bedroom, LivingRoom            LivingRoom, Bedroom
Store         0.460      Bedroom, Bedroom               LivingRoom, Mountain
Bedroom       0.530      Office, LivingRoom             Kitchen, LivingRoom
LivingRoom    0.310      InsideCity, Store              Store, Industrial
Office        0.840      LivingRoom, Kitchen            Kitchen, Kitchen
Industrial    0.440      Bedroom, Store                 Highway, LivingRoom
Suburb        0.940      Mountain, Highway              TallBuilding, Industrial
InsideCity    0.530      Street, Kitchen                LivingRoom, Highway
TallBuilding  0.720      InsideCity, Suburb             InsideCity, Street
Street        0.680      LivingRoom, InsideCity         InsideCity, InsideCity
Highway       0.790      Street, Industrial             Coast, Street
OpenCountry   0.500      Street, Coast                  Coast, Mountain
Coast         0.740      InsideCity, OpenCountry        Highway, Highway
Mountain      0.760      Store, Industrial              Bedroom, Industrial
Forest        0.930      OpenCountry, OpenCountry       Street, Mountain