CS 143 / Project 3 / Scene Recognition with Bag of Words

Overview

The goal of this assignment was to recognize scenes using various methods. These methods included

  1. Tiny images representation and nearest neighbor classifier.
  2. Bag of SIFT representation and nearest neighbor classifier.
  3. Bag of SIFT representation and linear SVM classifier.

Tiny images and nearest neighbor

The tiny images representation is created by simply resizing the image to a smaller, fixed size (e.g. 16x16) and representing this smaller image as a vector. The nearest neighbor classifier simply labels the test image with the label of the training image whose representation is closest to that of the test image (Euclidean distance, in our case).
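To make the two pieces concrete, here is a minimal numpy sketch. The project itself is MATLAB code; the function names below are hypothetical, and the block averaging is a stand-in for whatever resizing routine (e.g. imresize) the real code uses:

```python
import numpy as np

def tiny_image(img, size=16):
    """Shrink a grayscale image (2-D array) to size x size by block
    averaging, then flatten to a vector."""
    h, w = img.shape
    ys = np.arange(size + 1) * h // size
    xs = np.arange(size + 1) * w // size
    out = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            out[i, j] = img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return out.ravel()

def nearest_neighbor(train_feats, train_labels, test_feat):
    """Label the test vector with the label of the closest training
    vector under Euclidean distance."""
    d = np.linalg.norm(train_feats - test_feat, axis=1)
    return train_labels[int(np.argmin(d))]
```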

The accuracy of this combination was 19.1%. I did not add any optimizations to the code, though using k-nearest neighbors instead of a single nearest neighbor would likely improve performance.

Bag of SIFT and nearest neighbor

The bag of SIFT representation is created by first finding SIFT descriptors for a dense sampling of local features, and then clustering the descriptors using k-means. The cluster centers form the vocabulary of visual words we'll use to classify test images. In my implementation, I sampled SIFT features with a step size of 100, using the 'fast' parameter of vl_dsift, and a vocabulary size of 400 (i.e. 400 cluster centers).
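The clustering step can be sketched as plain Lloyd's k-means over the matrix of sampled descriptors. This is a hypothetical numpy stand-in: the project actually uses vl_dsift to get descriptors and VLFeat's k-means, and a real implementation would use random initialization rather than the deterministic first-k initialization used here for simplicity:

```python
import numpy as np

def build_vocabulary(descriptors, k=400, iters=10):
    """Lloyd's k-means over an n x d matrix of SIFT descriptors.
    The returned cluster centers are the visual vocabulary."""
    # deterministic init for illustration; real k-means picks random seeds
    centers = descriptors[:k].astype(float).copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for c in range(k):
            members = descriptors[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers
```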

Once the vocabulary has been created, we represent an image by sampling SIFT features (this time with a step size of 10) and forming a 400-dimensional histogram that counts how many of the image's SIFT descriptors lie in each cluster of the vocabulary. The histogram is then normalized so that the size of the image does not dramatically affect the magnitude of the bag-of-features representation.
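The histogram construction is a nearest-center lookup followed by a normalized count. A numpy sketch (hypothetical function name; I use L1 normalization here, though the project code may normalize differently):

```python
import numpy as np

def bag_of_sift_histogram(descriptors, vocab):
    """Assign each descriptor (row of an n x d matrix) to its nearest
    visual word in vocab (k x d) and return a normalized k-bin histogram."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    # L1-normalize so the number of sampled features (image size)
    # does not change the histogram's magnitude
    return hist / hist.sum()
```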

The nearest neighbor classifier is the same as described in the previous section.

The accuracy of this combination was 50.6%, showing that bag of SIFT is a much better representation of images than tiny images (which matches intuition).

Bag of SIFT and linear SVM

In this combination, we represent the images in the same way as described in the section above, but we classify them using 15 1-vs-all linear support vector machines (SVMs), one for each image category. For each linear SVM, the SIFT feature space is partitioned by a learned hyperplane that divides training images in the given category from images that aren't in the category. In my implementation I used vl_svmtrain with lambda = 0.0001. Both larger and smaller values of lambda yielded lower accuracies (I tested all powers of 10 between 0.00001 and 0.01).
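For intuition, one of the 1-vs-all classifiers can be sketched as subgradient descent on the L2-regularized hinge loss. This is not the project's actual training code (that is vl_svmtrain, which uses a different solver); it is a minimal Pegasos-style stand-in with hypothetical names:

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, epochs=200, lr=0.1):
    """Train one 1-vs-all linear SVM by subgradient descent on the
    L2-regularized hinge loss. X is n x d; y is +/-1 per image
    (+1 = image belongs to this category)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    y = np.asarray(y, dtype=float)
    for _ in range(epochs):
        # margin violators: points with y * (w.x + b) < 1
        viol = y * (X @ w + b) < 1
        # subgradient of lam*||w||^2/2 + mean hinge loss
        gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```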

Each test image is then evaluated with all 15 SVMs and given the label of the most confident SVM. Confidence is the signed distance from the hyperplane, measured by W*X + B, where '*' is the dot product, W and B are the learned hyperplane parameters, and X is the representation of the test image.
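Picking the most confident of the 15 one-vs-all SVMs is just an argmax over the signed scores W*X + B. A small sketch (hypothetical names, assuming each SVM is stored as a (W, B, label) tuple):

```python
import numpy as np

def predict_category(svms, x):
    """svms: list of (w, b, label) tuples, one per 1-vs-all classifier.
    Return the label of the SVM with the largest score w.x + b."""
    scores = [w @ x + b for (w, b, _) in svms]
    return svms[int(np.argmax(scores))][2]
```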

This combination yielded the best results, with an accuracy of 64.6%. A full report of the results is shown below:


Accuracy (mean of diagonal of confusion matrix) is 0.646

(The sample images from the original results table are omitted; only the per-category accuracies and the text labels attached to the false-positive and false-negative thumbnails survive. For each category, the first two labels are the true labels of the sample false positives and the last two are the wrongly predicted labels of the sample false negatives.)

Category      Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen       0.520      Bedroom, LivingRoom            LivingRoom, Bedroom
Store         0.460      Bedroom, Bedroom               LivingRoom, Mountain
Bedroom       0.530      Office, LivingRoom             Kitchen, LivingRoom
LivingRoom    0.310      InsideCity, Store              Store, Industrial
Office        0.840      LivingRoom, Kitchen            Kitchen, Kitchen
Industrial    0.440      Bedroom, Store                 Highway, LivingRoom
Suburb        0.940      Mountain, Highway              TallBuilding, Industrial
InsideCity    0.530      Street, Kitchen                LivingRoom, Highway
TallBuilding  0.720      InsideCity, Suburb             InsideCity, Street
Street        0.680      LivingRoom, InsideCity         InsideCity, InsideCity
Highway       0.790      Street, Industrial             Coast, Street
OpenCountry   0.500      Street, Coast                  Coast, Mountain
Coast         0.740      InsideCity, OpenCountry        Highway, Highway
Mountain      0.760      Store, Industrial              Bedroom, Industrial
Forest        0.930      OpenCountry, OpenCountry       Street, Mountain