CS 143 / Project 3 / Scene Recognition with Bag of Words

This project implements scene recognition using several approaches and evaluates them with confusion matrices and accuracy scores. Specifically, tiny image features and bag of SIFT features are used for image representation, while support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are used for classification. All four combinations of these approaches are evaluated, and the results are shown below.

Parameters

There are many tunable parameters in this project. I experimented with a variety of them and settled on the values that seem to produce the best results for this particular implementation. For tiny image features, images are resized to 16 x 16 resolution, vectorized, and then normalized. For k-NN, k = 1 performs very well (often better than values of k greater than 1), and k values around 10 also do well. For bag of SIFT features, I use a step size of 8 and a bin size of 4. For SVM, performance differs depending on the image representation. Lambda values around 10^-4 seem to do well with bag of SIFT features.
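As an illustration of the tiny image feature described above, here is a minimal sketch in plain Python. It is not the project's code: the function name `tiny_image`, the block-averaging resize, and the choice of zero-mean, unit-length normalization are all assumptions, since the report only says the image is resized, vectorized, and normalized.

```python
import math

def tiny_image(pixels, size=16):
    """Build a tiny image feature from a grayscale image given as a list of rows.

    Assumptions (not from the report): resizing is done by averaging
    rectangular blocks, and normalization means zero mean and unit length.
    """
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")

    feature = []
    for r in range(size):
        # Source rows that map onto output row r.
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            # Source columns that map onto output column c.
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            feature.append(sum(block) / len(block))

    # Vectorized 16 x 16 image: subtract the mean, then scale to unit length.
    mean = sum(feature) / len(feature)
    centered = [v - mean for v in feature]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0:
        return centered  # constant image: nothing to scale
    return [v / norm for v in centered]
```

With the default `size=16`, the returned vector has 256 entries regardless of the input resolution.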

Tiny Image with k-NN

Accuracy varied between 0.18 and 0.23. At k = 1 the accuracy is 0.225. At k = 20 it is 0.215, and at k = 40 it is 0.213.
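To make the role of k concrete, the following is a small sketch of k-NN classification by majority vote. The function names, the squared Euclidean distance, and the tie-breaking rule (prefer the label of the closest neighbor among tied labels) are assumptions for illustration, not details taken from the project.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train_features, train_labels, query, k=1):
    """Return the majority label among the k training features nearest to query."""
    order = sorted(range(len(train_features)),
                   key=lambda i: squared_distance(train_features[i], query))
    nearest = [train_labels[i] for i in order[:k]]
    votes = Counter(nearest)
    best = max(votes.values())
    # Ties are broken in favor of the closest neighbor among the tied labels.
    for label in nearest:
        if votes[label] == best:
            return label

# Toy data (hypothetical): two categories in a 2-D feature space.
train = [[0, 0], [0, 1], [5, 5], [5, 6], [6, 5]]
labels = ["Forest", "Forest", "Street", "Street", "Street"]
print(knn_predict(train, labels, [0.2, 0.1], k=1))
print(knn_predict(train, labels, [0.2, 0.1], k=5))
```

In this toy set, a large k lets distant points from a more populous category outvote the nearby ones, which is one way a larger k can end up less accurate than k = 1.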

Tiny Image with SVM

The best accuracy I could get with a linear SVM was 0.186, at lambda = 1e-4. I also tried a radial kernel SVM, but the result was worse.
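The sketch below shows what lambda controls in a linear SVM and how binary SVMs can be combined for multi-class scene labels. It assumes a common formulation of the objective (lambda/2 times the squared weight norm plus the mean hinge loss) and a one-vs-all scheme; the report does not state either, and all names here are hypothetical.

```python
def score(w, b, x):
    """Linear decision value w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_objective(w, b, features, labels, lam):
    """Regularized hinge loss; labels are +1 or -1.

    Assumed form: lam / 2 * ||w||^2 + mean(max(0, 1 - y * (w . x + b))).
    A smaller lam puts less weight on keeping w small.
    """
    regularizer = 0.5 * lam * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - y * score(w, b, x))
                for x, y in zip(features, labels))
    return regularizer + hinge / len(features)

def one_vs_all_predict(classifiers, x):
    """Pick the category whose binary classifier gives the highest score.

    classifiers maps a category name to its (w, b) pair.
    """
    return max(classifiers, key=lambda cat: score(*classifiers[cat], x))
```

Under this scheme, one (w, b) pair would be trained per scene category, and a test image gets the label of the most confident classifier.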

Bag of SIFT with k-NN

With k = 1 the accuracy is 0.524. At k = 10 it improves to 0.531, which is also the best I could get with this configuration. The confusion matrix and table of classifier results are included here:

Bag of SIFT with k-NN, k = 10


Accuracy (mean of diagonal of confusion matrix) is 0.531
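The accuracy figure above is the mean of the diagonal of the confusion matrix. A minimal sketch of that computation follows; the function names are hypothetical, and it assumes each row of the matrix is normalized by the number of test images in that true category.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: rows are true labels, columns predictions."""
    index = {cat: i for i, cat in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def mean_diagonal(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)

# Per-category accuracies reported for bag of SIFT with k-NN, k = 10.
per_category = [0.460, 0.520, 0.330, 0.150, 0.780, 0.300, 0.910, 0.420,
                0.330, 0.560, 0.770, 0.260, 0.670, 0.550, 0.960]
print(round(sum(per_category) / len(per_category), 3))  # 0.531
```

Averaging the 15 per-category accuracies from the results table reproduces the reported 0.531.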

[Image-only columns: sample training images, sample true positives]

| Category name | Accuracy | False positives with true label | False negatives with wrong predicted label |
|---|---|---|---|
| Kitchen | 0.460 | Bedroom, InsideCity | Office, Store |
| Store | 0.520 | LivingRoom, InsideCity | Office, Office |
| Bedroom | 0.330 | Office, Kitchen | Kitchen, Office |
| LivingRoom | 0.150 | Office, Bedroom | Office, Office |
| Office | 0.780 | Bedroom, Bedroom | InsideCity, Bedroom |
| Industrial | 0.300 | Store, LivingRoom | Highway, Office |
| Suburb | 0.910 | Mountain, OpenCountry | InsideCity, Bedroom |
| InsideCity | 0.420 | Suburb, TallBuilding | Coast, Forest |
| TallBuilding | 0.330 | LivingRoom, Bedroom | Store, Street |
| Street | 0.560 | TallBuilding, Kitchen | Store, InsideCity |
| Highway | 0.770 | Coast, Coast | Coast, Coast |
| OpenCountry | 0.260 | Bedroom, Industrial | Mountain, Coast |
| Coast | 0.670 | Industrial, Suburb | Forest, Highway |
| Mountain | 0.550 | Kitchen, OpenCountry | Forest, Forest |
| Forest | 0.960 | Bedroom, OpenCountry | Mountain, Suburb |
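For reference, the bag of SIFT representation evaluated in this section can be sketched as a histogram of visual words. The code below assumes that a vocabulary of cluster centers already exists (for example from k-means over training descriptors) and that the histogram is normalized to sum to 1; the report states neither, and the function names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary center closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: sum((d - v) ** 2
                                 for d, v in zip(descriptor, vocabulary[i])))

def bag_of_words_histogram(descriptors, vocabulary):
    """Count how often each visual word is the nearest center, then normalize.

    descriptors: local feature vectors sampled from one image.
    vocabulary: cluster centers (assumed to be precomputed).
    """
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]
```

Normalizing the histogram keeps images with different numbers of sampled descriptors comparable, which matters because the step size determines how many descriptors each image yields.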

Bag of SIFT with SVM

I was able to get an accuracy of 0.678 with this combination, achieved at lambda = 10^-4 (0.0001). Detailed results are shown below:

Bag of SIFT with Linear SVM, lambda = 0.0001


Accuracy (mean of diagonal of confusion matrix) is 0.678

[Image-only columns: sample training images, sample true positives]

| Category name | Accuracy | False positives with true label | False negatives with wrong predicted label |
|---|---|---|---|
| Kitchen | 0.600 | Office, Bedroom | Office, LivingRoom |
| Store | 0.580 | Bedroom, InsideCity | Kitchen, LivingRoom |
| Bedroom | 0.440 | Office, OpenCountry | LivingRoom, LivingRoom |
| LivingRoom | 0.330 | Industrial, Bedroom | Office, Store |
| Office | 0.840 | Kitchen, Bedroom | Bedroom, Store |
| Industrial | 0.490 | Kitchen, InsideCity | LivingRoom, Highway |
| Suburb | 0.940 | Industrial, Kitchen | TallBuilding, Coast |
| InsideCity | 0.500 | Industrial, Store | Store, Kitchen |
| TallBuilding | 0.760 | Office, InsideCity | Forest, OpenCountry |
| Street | 0.730 | TallBuilding, Industrial | Industrial, Highway |
| Highway | 0.840 | OpenCountry, Street | Industrial, Coast |
| OpenCountry | 0.540 | Coast, TallBuilding | Suburb, Highway |
| Coast | 0.770 | OpenCountry, Industrial | Highway, OpenCountry |
| Mountain | 0.870 | LivingRoom, OpenCountry | OpenCountry, Forest |
| Forest | 0.940 | Store, Store | Mountain, Street |

Besides linear SVM, I also tried a radial kernel SVM for classifying the features. However, radial kernels do not seem to work well for this task. After trying about a dozen different parameter settings, I was only able to reach an accuracy of about 0.55. This can be achieved under several different settings; two of them are sigma = 2 with lambda = 1, and sigma = 2^5 with lambda = 1e-2 (sigma is the radial kernel parameter). The confusion matrix for sigma = 2^5 and lambda = 1e-2 is shown below (Store seems to be predicted much more often than the other scenes):

Bag of SIFT with Radial Kernel, sigma = 32 and lambda = 0.01


Accuracy (mean of diagonal of confusion matrix) is 0.554
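To clarify what the sigma parameter of the radial kernel does, here is a small sketch of a Gaussian (RBF) kernel. The parameterization exp(-||x - y||^2 / (2 sigma^2)) is an assumption; libraries differ in how they scale this parameter, and the report does not say which form was used.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian radial basis function kernel between two feature vectors.

    Assumed form: exp(-||x - y||^2 / (2 * sigma^2)). A larger sigma makes
    the kernel decay more slowly with distance.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# The same pair of points looks far more similar under sigma = 32 than sigma = 2.
print(rbf_kernel([0, 0], [3, 4], sigma=2))
print(rbf_kernel([0, 0], [3, 4], sigma=32))
```

Under this form, a very large sigma makes all pairs of features look nearly identical and a very small one makes every feature look unique, so sigma and lambda generally have to be tuned together.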