CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this project is scene recognition. The algorithm has two parts: extracting features and classifying them. The strategies I used for each part are described below. The training data are labelled with 15 categories.

The tiny image method is simple. It resizes each image to 16×16 pixels, subtracts the mean, and divides by the norm. This is the baseline feature implementation.
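The tiny image steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the image is assumed to be a grayscale list of rows at least 16 pixels on each side, and block averaging stands in for whatever resizing routine was actually used.

```python
import math


def downsample(image, size=16):
    """Shrink a 2-D grayscale image (list of rows) to size x size by block averaging.

    Assumes the image has at least `size` rows and `size` columns.
    """
    rows, cols = len(image), len(image[0])
    small = []
    for i in range(size):
        r0, r1 = i * rows // size, (i + 1) * rows // size
        out_row = []
        for j in range(size):
            c0, c1 = j * cols // size, (j + 1) * cols // size
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            out_row.append(sum(block) / len(block))
        small.append(out_row)
    return small


def tiny_image_feature(image, size=16):
    """Return the flattened, zero-mean, unit-norm tiny image feature."""
    small = downsample(image, size)
    flat = [value for row in small for value in row]
    mean = sum(flat) / len(flat)
    centered = [value - mean for value in flat]
    norm = math.sqrt(sum(value * value for value in centered))
    if norm == 0:
        # A constant image has no contrast; leave it as all zeros.
        return centered
    return [value / norm for value in centered]
```

Each image then becomes a 256-dimensional vector, so two images can be compared directly with a Euclidean distance.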

The bag of words method first builds a vocabulary and then builds a histogram based on that vocabulary. To build the vocabulary, I extract SIFT features from all training images and cluster them with k-means; the cluster centroids form the vocabulary. Then, for each image, I extract more SIFT features and assign each one to its nearest k-means centroid. The histogram records how many SIFT features fall into each cluster.
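The two stages can be sketched as below, assuming the SIFT descriptors have already been extracted and are given as lists of numbers. The function names are hypothetical, and the k-means loop is a bare-bones Lloyd iteration seeded with the first k descriptors, which is an assumption made for determinism rather than what the project necessarily did.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest to the descriptor (first one wins ties)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(descriptor, centroids[k]))


def build_vocabulary(descriptors, k, iterations=10):
    """Cluster descriptors with a simple k-means; returns k centroids."""
    centroids = [list(d) for d in descriptors[:k]]  # assumed initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_centroid(d, centroids)].append(d)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def bag_of_words_histogram(descriptors, vocabulary):
    """Count how many descriptors fall into each vocabulary cluster."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_centroid(d, vocabulary)] += 1
    return histogram


# Toy 2-D "descriptors" standing in for 128-D SIFT features.
vocabulary = build_vocabulary([[0, 0], [10, 10], [0, 1], [10, 11]], k=2)
histogram = bag_of_words_histogram([[1, 1], [9, 9], [0, 0]], vocabulary)
```

The histogram has one bin per vocabulary word, so its length equals the vocabulary size regardless of how many descriptors an image produces.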

The nearest neighbour classifier takes the features of both the training and the test images, finds the nearest training sample for every test image, and assigns that sample's label to the test image.
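A minimal sketch of this 1-nearest-neighbour rule is given below. The function names are hypothetical and Euclidean distance is assumed as the metric; the same code works for tiny image vectors or bag of words histograms.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_classify(train_features, train_labels, test_features):
    """Label each test feature with the label of its closest training feature."""
    predictions = []
    for feature in test_features:
        best = min(range(len(train_features)),
                   key=lambda i: squared_distance(feature, train_features[i]))
        predictions.append(train_labels[best])
    return predictions


# Toy example with 2-D features.
train_features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_labels = ["Forest", "Forest", "Bedroom"]
predicted = nearest_neighbor_classify(
    train_features, train_labels, [[0.2, 0.1], [4.0, 4.5]])
```

There is no training step: all the work happens at test time, and the cost grows with the number of training samples.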

The 1-vs-all linear support vector machine method trains one binary classifier per category, for example "forest" vs "non-forest" or "bedroom" vs "non-bedroom". Each test case is evaluated with all 15 classifiers, and the predicted category is the one whose classifier returns the highest score.
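The 1-vs-all scheme can be sketched as follows. Training the linear SVMs themselves is left out; the sketch only shows how the binary labels are formed and how the final category is chosen. The function names and the toy weights are hypothetical, and each classifier is assumed to be a (weights, bias) pair produced by some linear SVM trainer.

```python
def binary_labels(train_labels, category):
    """+1 for images of `category`, -1 for everything else."""
    return [1 if label == category else -1 for label in train_labels]


def svm_score(feature, weights, bias):
    """Linear SVM decision value w . x + b."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def one_vs_all_predict(feature, classifiers):
    """Return the category whose classifier gives the highest score.

    `classifiers` maps a category name to a (weights, bias) pair.
    """
    return max(classifiers,
               key=lambda category: svm_score(feature, *classifiers[category]))


# Hand-written toy classifiers for three categories and 2-D features.
classifiers = {
    "forest": ([1.0, 0.0], 0.0),
    "bedroom": ([0.0, 1.0], 0.0),
    "kitchen": ([-1.0, -1.0], 0.5),
}
labels_for_forest = binary_labels(["forest", "bedroom", "forest"], "forest")
prediction = one_vs_all_predict([0.9, 0.1], classifiers)
```

Taking the highest score rather than the first positive one means every test case receives exactly one category, even when several classifiers, or none, return a positive value.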

Here is the list of accuracy of the algorithm (with vocabulary size 400):

Parameters

There are many free parameters in this project. I experimented with LAMBDA in the SVM from 0.00001 to 10 and found that 0.00001 gives the best result with vocabulary size 50.
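One way to organise such a sweep is sketched below. The powers-of-ten spacing is an assumption (only the end points 0.00001 and 10 are stated above), and `evaluate` is a hypothetical callable that trains the SVMs with a given LAMBDA and returns the resulting accuracy.

```python
def lambda_grid(min_exponent=-5, max_exponent=1):
    """Powers of ten from 10**min_exponent to 10**max_exponent, inclusive."""
    return [float("1e%d" % e) for e in range(min_exponent, max_exponent + 1)]


def best_lambda(grid, evaluate):
    """Return the grid value for which evaluate(value) is largest.

    `evaluate` is a placeholder for "train with this LAMBDA and
    measure accuracy"; it is not defined here.
    """
    return max(grid, key=evaluate)


grid = lambda_grid()  # seven values: 1e-05, 0.0001, ..., 1.0, 10.0
```

Because the best value found sits at the lower end of the range, extending the grid below 0.00001 would show whether the optimum lies inside the range or beyond it.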

I also experimented with the vocabulary size. The results are reported at the end of this webpage.

Result page

Results visualization for tiny image + nearest neighbour


Accuracy (mean of diagonal of confusion matrix) is 0.225
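The accuracy figure can be reproduced from true and predicted labels roughly as follows. This is a sketch with hypothetical function names; it assumes each row of the confusion matrix is normalised by the number of test images in that true category, so that the diagonal holds the per-category accuracies.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalised confusion matrix: rows are true categories, columns predictions."""
    index = {category: i for i, category in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_diagonal(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Toy example with two categories.
matrix = confusion_matrix(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
accuracy = mean_diagonal(matrix)
```

As a check against the numbers on this page, the fifteen per-category accuracies listed for tiny image + nearest neighbour sum to 3.37, and 3.37 / 15 ≈ 0.225.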

Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
Kitchen       | 0.080    | LivingRoom, Office              | TallBuilding, Forest
Store         | 0.020    | Office, TallBuilding            | Forest, Bedroom
Bedroom       | 0.180    | Kitchen, Office                 | Kitchen, Mountain
LivingRoom    | 0.100    | Bedroom, TallBuilding           | Suburb, TallBuilding
Office        | 0.180    | LivingRoom, Forest              | Industrial, Mountain
Industrial    | 0.130    | Kitchen, InsideCity             | Street, Suburb
Suburb        | 0.370    | Coast, Street                   | Coast, Street
InsideCity    | 0.060    | Bedroom, Mountain               | Industrial, Suburb
TallBuilding  | 0.220    | Street, Office                  | Suburb, Street
Street        | 0.420    | LivingRoom, Mountain            | Industrial, Forest
Highway       | 0.560    | InsideCity, LivingRoom          | OpenCountry, Coast
OpenCountry   | 0.350    | TallBuilding, Highway           | Coast, Highway
Coast         | 0.390    | Forest, Highway                 | Highway, Forest
Mountain      | 0.180    | Street, Bedroom                 | Highway, Coast
Forest        | 0.130    | Bedroom, Coast                  | Industrial, Street

Results visualization for bag of SIFT + nearest neighbour


Accuracy (mean of diagonal of confusion matrix) is 0.544

Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
Kitchen       | 0.420    | Bedroom, InsideCity             | LivingRoom, Bedroom
Store         | 0.560    | Street, Suburb                  | Forest, Bedroom
Bedroom       | 0.290    | TallBuilding, LivingRoom        | Kitchen, Industrial
LivingRoom    | 0.250    | Office, Store                   | Office, TallBuilding
Office        | 0.820    | LivingRoom, Store               | Kitchen, InsideCity
Industrial    | 0.350    | TallBuilding, TallBuilding      | Bedroom, LivingRoom
Suburb        | 0.870    | Forest, OpenCountry             | Highway, Store
InsideCity    | 0.290    | Industrial, Street              | Suburb, Industrial
TallBuilding  | 0.450    | Industrial, OpenCountry         | Industrial, Coast
Street        | 0.520    | TallBuilding, TallBuilding      | Store, Highway
Highway       | 0.790    | Industrial, Bedroom             | Coast, InsideCity
OpenCountry   | 0.430    | Industrial, Coast               | TallBuilding, Highway
Coast         | 0.600    | OpenCountry, Industrial         | OpenCountry, Highway
Mountain      | 0.600    | Industrial, OpenCountry         | Forest, Forest
Forest        | 0.920    | Mountain, Mountain              | OpenCountry, Suburb

Results visualization for bag of SIFT + SVM


Accuracy (mean of diagonal of confusion matrix) is 0.649

Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
Kitchen       | 0.510    | InsideCity, Store               | Store, InsideCity
Store         | 0.510    | Kitchen, Street                 | Mountain, Street
Bedroom       | 0.440    | Store, Kitchen                  | Kitchen, Kitchen
LivingRoom    | 0.350    | Kitchen, Forest                 | Bedroom, TallBuilding
Office        | 0.810    | Kitchen, Bedroom                | LivingRoom, Suburb
Industrial    | 0.540    | Store, LivingRoom               | LivingRoom, InsideCity
Suburb        | 0.910    | OpenCountry, Mountain           | Bedroom, TallBuilding
InsideCity    | 0.550    | Store, TallBuilding             | LivingRoom, TallBuilding
TallBuilding  | 0.760    | Suburb, Mountain                | Store, Bedroom
Street        | 0.620    | OpenCountry, Industrial         | Industrial, Store
Highway       | 0.800    | Coast, Street                   | Street, Street
OpenCountry   | 0.490    | Mountain, Forest                | Bedroom, Coast
Coast         | 0.770    | Mountain, OpenCountry           | OpenCountry, Mountain
Mountain      | 0.760    | OpenCountry, Street             | Coast, Coast
Forest        | 0.920    | OpenCountry, Bedroom            | LivingRoom, Mountain

EXTRA CREDIT: Below is a graph showing the accuracy with different sizes of vocabulary:

I think that when the vocabulary is relatively small, changing its size makes a big difference to the accuracy. As the vocabulary grows, the influence of its size becomes smaller. Beyond that point, improving performance requires changing other parameters or using a new algorithm.