CS 143 / Project 3 / Scene Recognition with Bag of Words

Overview

The aim of this project is to perform simple scene recognition using a variety of feature descriptors, including (i) tiny images and (ii) histograms of quantized local features, and a variety of classifiers, including (i) nearest neighbor classifiers and (ii) linear classifiers learned by support vector machines.

For this project, we implement the pipelines in the following order:

  1. Tiny images feature representation + Nearest neighbor classifier
  2. Bag of SIFT feature representation + Nearest neighbor classifier
  3. Bag of SIFT feature representation + linear SVM classifier

Tiny Images

The algorithm for tiny images is fairly simple. It merely involves resizing the image with imresize() to a 16 x 16 image, then normalizing it by subtracting the mean of the image and dividing by the square root of the sum of all squared elements (the L2 norm). The zero-mean step makes the feature invariant to brightness shifts, and the unit-length step makes it invariant to contrast changes.
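
A minimal sketch of this feature, assuming a grayscale input image and the Image Processing Toolbox's imresize(); the helper name is just illustrative:

    % Minimal sketch of the tiny image feature (hypothetical helper name).
    % Assumes img is a grayscale image matrix, e.g. from imread().
    function feature = tiny_image_feature(img)
        small   = imresize(img, [16 16]);            % shrink to 16 x 16
        feature = double(small(:))';                 % flatten to a 1 x 256 row vector
        feature = feature - mean(feature);           % zero mean -> brightness invariance
        feature = feature / (norm(feature) + eps);   % unit length -> contrast invariance
    end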

It's a good initial method because it assumes that images of the same scene have relatively similar global properties. This may work well, for example, for images of mountain peaks, which tend to have a roughly triangular shape in the central region of the image. However, it discounts a lot of information, such as perspective and differences in subject-background composition, which leads us to the next representation.

Bag of SIFT

This image representation quantizes the occurrence of local image features in the image and compares the resulting histograms. Creating it requires two steps. First, build a vocabulary of visual words, which you can think of as motif descriptors (SIFT descriptors in this case). The vocabulary is found by grabbing SIFT descriptors from all the training images and running k-means to find the 400 (or any other number of) centroids that best summarize the distribution of SIFT descriptors.
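
A sketch of the vocabulary-building step, assuming VLFeat's vl_dsift() and vl_kmeans() are on the path and that image_paths is a cell array of training image file names (the variable names are assumptions, not the actual starter code):

    % Sketch of building the visual vocabulary with VLFeat (assumed on the path).
    vocab_size = 400;                                % any reasonable number of visual words
    sift_per_image = cell(1, numel(image_paths));
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        % Dense SIFT with a coarse step, since only a sample of descriptors is needed here
        [~, descriptors] = vl_dsift(single(img), 'step', 10, 'fast');
        sift_per_image{i} = single(descriptors);     % 128 x N matrix of descriptors
    end
    all_sift = cat(2, sift_per_image{:});
    % k-means over the pooled descriptors; each column of vocab is one visual word
    vocab = vl_kmeans(all_sift, vocab_size);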

Then, sample SIFT descriptors from each image. For each descriptor, find the closest centroid and count this as an occurrence of that centroid in the image, building a histogram of the nearest visual words over all the SIFT descriptors of the image. Finally, divide the histogram by its length (norm) to normalize it, so that images with more sampled descriptors do not produce larger feature vectors.
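
A sketch of turning one image into a normalized bag-of-SIFT histogram, again assuming VLFeat (vl_dsift, vl_alldist2) and a vocab matrix with one visual word per column, as above:

    % Sketch of converting one image into a normalized bag-of-SIFT histogram.
    % Assumes vocab is the 128 x vocab_size matrix from the k-means step above.
    function hist_feat = bag_of_sift(img, vocab)
        [~, descriptors] = vl_dsift(single(img), 'step', 3, 'fast');  % step 3 as in the parameter list below
        d = vl_alldist2(single(descriptors), vocab);    % #descriptors x vocab_size distances
        [~, nearest] = min(d, [], 2);                   % closest visual word per descriptor
        counts = histc(nearest, 1:size(vocab, 2));      % occurrences of each word
        hist_feat = counts(:)' / (norm(counts) + eps);  % normalize to unit length
    end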

The downside of this representation is that it doesn't encode any spatial information or the relationships between different descriptors in the image, which can be important: in scenes of streets, for example, cars are almost always at the bottom of the image and below the buildings.

Nearest Neighbor Classifier

The idea behind it is fairly simple. Imagine a cloud of points formed by the feature vectors (tiny images or bag-of-SIFT histograms) of all the training images. For each test image, find its nearest neighbor in this cloud of points and give the test image that neighbor's label. A simple extension of this would be k-nearest neighbors, labelling the test image with the majority label among the k closest training images.
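
A sketch of the 1-NN step, assuming feature matrices with one column per image and VLFeat's vl_alldist2() for the pairwise distances (variable names are illustrative):

    % Sketch of 1-NN classification; one column per image in the feature matrices.
    % train_feats: D x N, test_feats: D x M, train_labels: cell array of N category names.
    d = vl_alldist2(train_feats, test_feats);    % N x M pairwise (squared) distances
    [~, nn_index] = min(d, [], 1);               % closest training image for each test image
    predicted_labels = train_labels(nn_index);   % copy over its label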

SVM linear classifier

Simply put, a linear classifier learns a hyperplane that best separates the positive and negative training examples. To use linear SVMs for a recognition task like this, we train a 1-vs-all SVM for each scene category on the training images, then evaluate every classifier on each test image and label the test image with the category whose classifier returns the highest confidence.
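
A sketch of the 1-vs-all setup, assuming VLFeat's vl_svmtrain() and cell arrays of category and label strings (all variable names are illustrative):

    % Sketch of 1-vs-all linear SVMs with VLFeat's vl_svmtrain.
    % train_feats: D x N with one column per image; train_labels, categories: cell arrays of names.
    lambda = 0.00001;                                % regularization weight (see parameter list below)
    num_categories = numel(categories);
    W = zeros(size(train_feats, 1), num_categories);
    B = zeros(1, num_categories);
    for c = 1:num_categories
        binary_labels = 2 * double(strcmp(categories{c}, train_labels)) - 1;   % +1 / -1
        [W(:, c), B(c)] = vl_svmtrain(train_feats, binary_labels, lambda);
    end
    % Evaluate every classifier on each test image; keep the most confident category.
    confidences = bsxfun(@plus, W' * test_feats, B');   % num_categories x num_test
    [~, best] = max(confidences, [], 1);
    predicted_labels = categories(best);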

Most accurate parameters

  1. lambda = 0.00001
  2. vocab size = 100
  3. step size = 3 (for finding SIFT descriptors in both training and test images)
  4. sample size = 500 (I subsample descriptors from each training image so that every training image contributes equally)

Results in a table

Tiny Image + NN


Accuracy (mean of diagonal of confusion matrix) is 0.225

Bag of SIFT + NN


Accuracy (mean of diagonal of confusion matrix) is 0.457

Bag of SIFT + SVM


Accuracy (mean of diagonal of confusion matrix) is 0.623
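
All three accuracies are computed the same way; a sketch, assuming cell arrays of true and predicted category names:

    % Sketch of the reported accuracy: mean of the diagonal of the row-normalized confusion matrix.
    num_categories = numel(categories);
    confusion = zeros(num_categories);
    for i = 1:numel(test_labels)
        row = find(strcmp(test_labels{i}, categories));        % true category index
        col = find(strcmp(predicted_labels{i}, categories));   % predicted category index
        confusion(row, col) = confusion(row, col) + 1;
    end
    confusion = bsxfun(@rdivide, confusion, sum(confusion, 2));   % rows sum to 1
    accuracy = mean(diag(confusion));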

Per-category results (Bag of SIFT + linear SVM), with the true labels of the sampled false positives and the predicted labels of the sampled false negatives:

Category name    Accuracy   False positives (true label)    False negatives (predicted label)
Kitchen          0.480      LivingRoom, LivingRoom          Office, Bedroom
Store            0.440      LivingRoom, LivingRoom          Forest, Office
Bedroom          0.340      LivingRoom, LivingRoom          Suburb, Store
LivingRoom       0.260      Industrial, Bedroom             Kitchen, Bedroom
Office           0.950      Industrial, Store               Kitchen, Kitchen
Industrial       0.310      Street, InsideCity              Street, TallBuilding
Suburb           0.890      InsideCity, OpenCountry         InsideCity, Store
InsideCity       0.710      Kitchen, Street                 LivingRoom, Kitchen
TallBuilding     0.830      Bedroom, Mountain               Coast, Bedroom
Street           0.580      InsideCity, Industrial          InsideCity, InsideCity
Highway          0.720      Industrial, Industrial          Coast, Industrial
OpenCountry      0.320      Coast, Coast                    InsideCity, Highway
Coast            0.760      Highway, Highway                OpenCountry, InsideCity
Mountain         0.830      LivingRoom, OpenCountry         Suburb, Highway
Forest           0.930      OpenCountry, Mountain           Street, Mountain