CS 143 / Project 3 / Scene Recognition with Bag of Words

Algorithms & Decisions

For this project, I stuck to the baseline implementation but tried to thoroughly tune each algorithm's parameters. I was able to achieve an accuracy of roughly 70% (0.695 with bag-of-SIFT features and linear SVMs). Below are descriptions of each algorithm used in the project and how I chose to implement them.

Tiny Images

This algorithm builds descriptors by simply downscaling images and vectorizing them.

To implement it, I used the imresize function to downscale each image to 16 by 16 pixels, used reshape to turn the 16 by 16 matrix into a 256-dimensional vector, and then normalized the vector and made it zero-mean (by subtracting the mean of the vector from each element).
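As a concrete illustration, here is a minimal MATLAB sketch of this step; the variable names (image_paths, image_feats) are placeholders rather than names taken from the starter code, and the zero-mean step is applied before unit-length normalization so that both properties hold, which is one reasonable ordering.

    % Build one 256-dimensional tiny-image descriptor per image.
    image_feats = zeros(length(image_paths), 256);
    for i = 1:length(image_paths)
        img   = double(imread(image_paths{i}));   % grayscale scene image
        small = imresize(img, [16 16]);           % downscale to 16 by 16
        vec   = reshape(small, 1, 256);           % flatten to a row vector
        vec   = vec - mean(vec);                  % zero mean
        vec   = vec / norm(vec);                  % unit length
        image_feats(i, :) = vec;
    end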

Nearest Neighbor Classifier

This algorithm simply finds the training descriptor closest in feature space to each test feature and assigns that training example's category to the test feature. It requires no training (unlike the SVM classifier), but it gives equal weight to every feature dimension, which can be ineffective.

I used the vl_alldist2 function to find the distances between every pair of training and test features, and the min function to find each test feature's nearest neighbor. I then assigned each test feature the category of its nearest neighbor.
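A minimal sketch of the 1-NN step, assuming train_feats and test_feats are matrices with one descriptor per row and train_labels is a cell array of category names (all illustrative names):

    % vl_alldist2 expects one column per data point, hence the transposes.
    D = vl_alldist2(train_feats', test_feats');   % num_train x num_test pairwise distances
    [~, nearest] = min(D, [], 1);                 % index of the closest training example
    predicted_labels = train_labels(nearest);     % inherit that example's category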

Bag of SIFTs

This algorithm builds descriptors for images based on the frequency of occurrence of visual "words" in them.

First, I needed to build a vocabulary. For each image, I densely calculated SIFT features with a step size of 28 and a bin size of 4 using the vl_dsift function. These features were all put into a cell vector (with one cell per image). I then used the cell2mat function to concatenate all of the feature matrices in these cells into one large matrix, which served as my representative set of SIFT features for the training data. I then clustered these features using vl_kmeans. I used a vocabulary size of 400 for debugging, and then 800 and 2000 when aiming for maximum performance.
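A minimal sketch of the vocabulary construction under the same assumptions (image_paths and vocab_size are illustrative names):

    vocab_size = 800;                              % 400 for debugging; 800 or 2000 for final runs
    sift_cells = cell(length(image_paths), 1);
    for i = 1:length(image_paths)
        img = single(imread(image_paths{i}));      % vl_dsift expects a single-precision grayscale image
        [~, descriptors] = vl_dsift(img, 'step', 28, 'size', 4);
        sift_cells{i} = single(descriptors);       % 128 x num_features for this image
    end
    all_sifts = cell2mat(sift_cells');             % one 128 x total_features matrix
    vocab = vl_kmeans(all_sifts, vocab_size);      % 128 x vocab_size cluster centers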

Next, I needed to build a descriptor for each image based on the vocabulary. For each image, I computed dense SIFT features with a step size of 8 and a bin size of 4. Then I used vl_alldist2 to compute the distances between the image's features and the words in the vocabulary, assigned each feature to its nearest word, and computed a histogram of how often each visual word appeared in the image. The final step was to normalize this histogram so that the descriptor was invariant to the size of the image (and therefore to the number of SIFT features extracted from it).
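A minimal sketch of the per-image histogram, assuming vocab is the 128-by-vocab_size matrix of cluster centers built above (shown for a single image; image_path and bag_feat are illustrative names):

    img = single(imread(image_path));              % one training or test image
    [~, descriptors] = vl_dsift(img, 'step', 8, 'size', 4);
    D = vl_alldist2(vocab, single(descriptors));   % vocab_size x num_features distances
    [~, assignments] = min(D, [], 1);              % nearest visual word for each feature
    counts = histc(assignments, 1:size(vocab, 2)); % occurrences of each word
    bag_feat = counts / sum(counts);               % normalize away the number of features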

One vs. All Linear SVMs

This algorithm trains one linear SVM per category, dividing the training set into members and nonmembers of that category. Each test feature is then scored by every SVM, and it is assigned the category of the SVM that most confidently classifies it as a member.

For each category, I built a vector that divided the training data into member and nonmember portions (coded as ones and negative ones). I then used vl_svmtrain to train an SVM on the labeled data. I found that a lambda value of 0.0002 maximized my accuracy.

Then I scored each test image using the calculated SVM parameters. The SVM that produced the maximum score for the image (i.e. the one most confident that the image lay on the member side of its hyperplane) labeled the image with its category.
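A minimal sketch of the training and scoring steps, assuming categories is a cell array of the fifteen category names and train_feats, train_labels, and test_feats are as in the earlier sketches:

    lambda   = 0.0002;                             % regularization value that worked best for me
    num_cats = length(categories);
    W = zeros(size(train_feats, 2), num_cats);     % one weight vector per category
    B = zeros(1, num_cats);                        % one bias per category
    for c = 1:num_cats
        binary_labels = 2 * strcmp(train_labels, categories{c}) - 1;  % +1 member, -1 nonmember
        [W(:, c), B(c)] = vl_svmtrain(train_feats', binary_labels, lambda);
    end
    scores = W' * test_feats' + repmat(B', 1, size(test_feats, 1));   % num_cats x num_test
    [~, best] = max(scores, [], 1);                % the most confident SVM wins
    predicted_labels = categories(best);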

Results

Tiny Images and Nearest Neighbor

Accuracy: 0.199

Important Parameters: images downscaled to 16 by 16 pixels; descriptors made zero-mean and unit length.

Bag of SIFTs and Nearest Neighbor

Accuracy: 0.399

Important Parameters: dense SIFT with a step size of 8 and a bin size of 4 for image histograms (step size 28 for building the vocabulary); vocabulary clustered with vl_kmeans.

Bag of SIFTs and SVMs

Accuracy: 0.695

Important Parameters: same dense SIFT and vocabulary settings as above; SVM lambda of 0.0002.

Confusion Matrix

Accuracy (mean of diagonal of confusion matrix) is 0.695

Per-category results (bag of SIFTs and SVMs). The sample training, true positive, false positive, and false negative images from the results page are not reproduced here; the labels listed are those attached to the sample false positive (true label) and false negative (predicted label) images.

Category name   Accuracy   Sample false positive / false negative labels
Kitchen         0.560      LivingRoom, Store, Bedroom, Store
Store           0.560      InsideCity, Kitchen, Kitchen, Bedroom
Bedroom         0.470      LivingRoom, LivingRoom, LivingRoom, LivingRoom
LivingRoom      0.480      Kitchen, Office, TallBuilding, Bedroom
Office          0.840      LivingRoom, Kitchen, Kitchen, Kitchen
Industrial      0.540      InsideCity, Street, Store, LivingRoom
Suburb          0.990      Industrial, InsideCity, Coast
InsideCity      0.460      LivingRoom, Store, Industrial, Kitchen
TallBuilding    0.790      Street, InsideCity, LivingRoom, Mountain
Street          0.660      TallBuilding, Forest, Industrial, Mountain
Highway         0.860      Industrial, Store, LivingRoom, Suburb
OpenCountry     0.590      Coast, Coast, Forest, Highway
Coast           0.820      OpenCountry, OpenCountry, Highway, OpenCountry
Mountain        0.880      Highway, OpenCountry, Suburb, Forest
Forest          0.930      OpenCountry, Store, Mountain, OpenCountry