CS 143 / Project 3 / Scene Recognition with Bag of Words

Algorithms & Decisions

For this project, I stuck to the baseline implementation but tried to thoroughly tune each algorithm's parameters. I was able to achieve an accuracy of roughly 70% (0.695 with bag-of-SIFT features and linear SVMs). Below are descriptions of each algorithm used in the project and how I chose to implement them.

Tiny Images

This algorithm builds descriptors by simply downscaling images and vectorizing them.

To implement it, I used the imresize function to downscale each image to 16 by 16 pixels, used reshape to turn the 16 by 16 matrix into a 256-dimensional vector, and then normalized the vector and made it zero-mean (by subtracting the mean of the vector from each element).
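As a concrete illustration, here is a minimal MATLAB sketch of this step; the variable names (image_paths, image_feats) are placeholders rather than names taken from the starter code, and the zero-mean step is applied before unit-length normalization so that both properties hold, which is one reasonable ordering.

    % Build one 256-dimensional tiny-image descriptor per image.
    image_feats = zeros(length(image_paths), 256);
    for i = 1:length(image_paths)
        img   = double(imread(image_paths{i}));   % grayscale scene image
        small = imresize(img, [16 16]);           % downscale to 16 by 16
        vec   = reshape(small, 1, 256);           % flatten to a row vector
        vec   = vec - mean(vec);                  % zero mean
        vec   = vec / norm(vec);                  % unit length
        image_feats(i, :) = vec;
    end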

Nearest Neighbor Classifier

This algorithm simply finds the training descriptor closest in feature space to each test feature and assigns that training example's category to the test feature. It requires no training (unlike the SVM classifier), but it gives equal weight to every feature dimension, which can be ineffective.

I used the vl_alldist2 function to find the distances between every pair of training and test features, and the min function to find each test feature's nearest neighbor. I then assigned each test feature the category of its nearest neighbor.
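A minimal sketch of the 1-NN step, assuming train_feats and test_feats are matrices with one descriptor per row and train_labels is a cell array of category names (all illustrative names):

    % vl_alldist2 expects one column per data point, hence the transposes.
    D = vl_alldist2(train_feats', test_feats');   % num_train x num_test pairwise distances
    [~, nearest] = min(D, [], 1);                 % index of the closest training example
    predicted_labels = train_labels(nearest);     % inherit that example's category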

Bag of SIFTs

This algorithm builds descriptors for images based on the frequency of occurrence of visual "words" in them.

First, I needed to build a vocabulary. For each image, I densely calculated SIFT features with a step size of 28 and a bin size of 4 using the vl_dsift function. These features were all put into a cell vector (with one cell per image). I then used the cell2mat function to concatenate all of the feature matrices in these cells into one large matrix, which served as my representative set of SIFT features for the training data. I then clustered these features using vl_kmeans. I used a vocabulary size of 400 for debugging, and then 800 and 2000 when aiming for maximum performance.
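A minimal sketch of the vocabulary construction under the same assumptions (image_paths and vocab_size are illustrative names):

    vocab_size = 800;                              % 400 for debugging; 800 or 2000 for final runs
    sift_cells = cell(length(image_paths), 1);
    for i = 1:length(image_paths)
        img = single(imread(image_paths{i}));      % vl_dsift expects a single-precision grayscale image
        [~, descriptors] = vl_dsift(img, 'step', 28, 'size', 4);
        sift_cells{i} = single(descriptors);       % 128 x num_features for this image
    end
    all_sifts = cell2mat(sift_cells');             % one 128 x total_features matrix
    vocab = vl_kmeans(all_sifts, vocab_size);      % 128 x vocab_size cluster centers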

Next, I needed to build a descriptor for each image based on the vocabulary. For each image, I computed dense SIFT features with a step size of 8 and a bin size of 4. Then I used vl_alldist2 to compute the distances between the image's features and the words in the vocabulary, assigned each feature to its nearest word, and computed a histogram of how often each visual word appeared in the image. The final step was to normalize this histogram so that the descriptor was invariant to the size of the image (and therefore to the number of SIFT features extracted from it).
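A minimal sketch of the per-image histogram, assuming vocab is the 128-by-vocab_size matrix of cluster centers built above (shown for a single image; image_path and bag_feat are illustrative names):

    img = single(imread(image_path));              % one training or test image
    [~, descriptors] = vl_dsift(img, 'step', 8, 'size', 4);
    D = vl_alldist2(vocab, single(descriptors));   % vocab_size x num_features distances
    [~, assignments] = min(D, [], 1);              % nearest visual word for each feature
    counts = histc(assignments, 1:size(vocab, 2)); % occurrences of each word
    bag_feat = counts / sum(counts);               % normalize away the number of features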

One vs. All Linear SVMs

This algorithm trains one linear SVM per category, dividing the training set into members and nonmembers of that category. Each test feature is then scored by every SVM, and it is assigned the category of the SVM that most confidently classifies it as a member.

For each category, I built a vector that divided the training data into member and nonmember portions (coded as ones and negative ones). I then used vl_svmtrain to train an SVM on the labeled data. I found that a lambda value of 0.0002 maximized my accuracy.

Then I scored each test image using the calculated SVM parameters. The SVM that produced the maximum score for the image (i.e. the one most confident that the image lay on the member side of its hyperplane) labeled the image with its category.
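A minimal sketch of the training and scoring steps, assuming categories is a cell array of the fifteen category names and train_feats, train_labels, and test_feats are as in the earlier sketches:

    lambda   = 0.0002;                             % regularization value that worked best for me
    num_cats = length(categories);
    W = zeros(size(train_feats, 2), num_cats);     % one weight vector per category
    B = zeros(1, num_cats);                        % one bias per category
    for c = 1:num_cats
        binary_labels = 2 * strcmp(train_labels, categories{c}) - 1;  % +1 member, -1 nonmember
        [W(:, c), B(c)] = vl_svmtrain(train_feats', binary_labels, lambda);
    end
    scores = W' * test_feats' + repmat(B', 1, size(test_feats, 1));   % num_cats x num_test
    [~, best] = max(scores, [], 1);                % the most confident SVM wins
    predicted_labels = categories(best);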

Results

Tiny Images and Nearest Neighbor

Accuracy: 0.199

Important Parameters: images downscaled to 16 by 16 pixels; descriptors made zero-mean and unit length.

Bag of SIFTs and Nearest Neighbor

Accuracy: 0.399

Important Parameters: dense SIFT with a step size of 8 and a bin size of 4 for image histograms (step size 28 for building the vocabulary); vocabulary clustered with vl_kmeans.

Bag of SIFTs and SVMs

Accuracy: 0.695

Important Parameters: same dense SIFT and vocabulary settings as above; SVM lambda of 0.0002.

Confusion Matrix

Accuracy (mean of diagonal of confusion matrix) is 0.695

Per-category results (bag of SIFTs and SVMs). The sample training, true positive, false positive, and false negative images from the results page are not reproduced here; the labels listed are those attached to the sample false positive (true label) and false negative (predicted label) images.

Category name   Accuracy   Sample false positive / false negative labels
Kitchen         0.560      LivingRoom, Store, Bedroom, Store
Store           0.560      InsideCity, Kitchen, Kitchen, Bedroom
Bedroom         0.470      LivingRoom, LivingRoom, LivingRoom, LivingRoom
LivingRoom      0.480      Kitchen, Office, TallBuilding, Bedroom
Office          0.840      LivingRoom, Kitchen, Kitchen, Kitchen
Industrial      0.540      InsideCity, Street, Store, LivingRoom
Suburb          0.990      Industrial, InsideCity, Coast
InsideCity      0.460      LivingRoom, Store, Industrial, Kitchen
TallBuilding    0.790      Street, InsideCity, LivingRoom, Mountain
Street          0.660      TallBuilding, Forest, Industrial, Mountain
Highway         0.860      Industrial, Store, LivingRoom, Suburb
OpenCountry     0.590      Coast, Coast, Forest, Highway
Coast           0.820      OpenCountry, OpenCountry, Highway, OpenCountry
Mountain        0.880      Highway, OpenCountry, Suburb, Forest
Forest          0.930      OpenCountry, Store, Mountain, OpenCountry