CS 143 / Project 3 / Scene Recognition with Bag of Words

Overview

The aim of this project is to perform simple scene recognition using a variety of feature descriptors, including (i) tiny images and (ii) histograms of quantized local features, and a variety of classifiers, including (i) nearest neighbor classifiers and (ii) linear classifiers learned by support vector machines.

For this project, we implement the pipelines in the following order:

  1. Tiny images feature representation + Nearest neighbor classifier
  2. Bag of SIFT feature representation + Nearest neighbor classifier
  3. Bag of SIFT feature representation + linear SVM classifier

Tiny Images

The algorithm for tiny images is fairly simple. It merely involves resizing the image with imresize() to a 16 x 16 image, then normalizing it by subtracting the mean of the image and dividing by the square root of the sum of all squared elements (the L2 norm). The zero-mean step makes the feature invariant to brightness shifts, and the unit-length step makes it invariant to contrast changes.
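
A minimal sketch of this feature, assuming a grayscale input image and the Image Processing Toolbox's imresize(); the helper name is just illustrative:

    % Minimal sketch of the tiny image feature (hypothetical helper name).
    % Assumes img is a grayscale image matrix, e.g. from imread().
    function feature = tiny_image_feature(img)
        small   = imresize(img, [16 16]);            % shrink to 16 x 16
        feature = double(small(:))';                 % flatten to a 1 x 256 row vector
        feature = feature - mean(feature);           % zero mean -> brightness invariance
        feature = feature / (norm(feature) + eps);   % unit length -> contrast invariance
    end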

It's a good initial method because it assumes that images of the same scene have relatively similar global properties. This may work well, for example, for images of mountain peaks, which tend to have a roughly triangular shape in the central region of the image. However, it discounts a lot of information, such as perspective and differences in subject-background composition, which leads us to the next representation.

Bag of SIFT

This image representation quantizes the occurrence of local image features in the image and compares the resulting histograms. Creating it requires two steps. First, build a vocabulary of visual words, which you can think of as motif descriptors (SIFT descriptors in this case). The vocabulary is found by grabbing SIFT descriptors from all the training images and running k-means to find the 400 (or any other number of) centroids that best summarize the distribution of SIFT descriptors.
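
A sketch of the vocabulary-building step, assuming VLFeat's vl_dsift() and vl_kmeans() are on the path and that image_paths is a cell array of training image file names (the variable names are assumptions, not the actual starter code):

    % Sketch of building the visual vocabulary with VLFeat (assumed on the path).
    vocab_size = 400;                                % any reasonable number of visual words
    sift_per_image = cell(1, numel(image_paths));
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        % Dense SIFT with a coarse step, since only a sample of descriptors is needed here
        [~, descriptors] = vl_dsift(single(img), 'step', 10, 'fast');
        sift_per_image{i} = single(descriptors);     % 128 x N matrix of descriptors
    end
    all_sift = cat(2, sift_per_image{:});
    % k-means over the pooled descriptors; each column of vocab is one visual word
    vocab = vl_kmeans(all_sift, vocab_size);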

Then, sample SIFT descriptors from each image. For each descriptor, find the closest centroid and count this as an occurrence of that centroid in the image, building a histogram of the nearest visual words over all the SIFT descriptors of the image. Finally, divide the histogram by its length (norm) to normalize it, so that images with more sampled descriptors do not produce larger feature vectors.
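
A sketch of turning one image into a normalized bag-of-SIFT histogram, again assuming VLFeat (vl_dsift, vl_alldist2) and a vocab matrix with one visual word per column, as above:

    % Sketch of converting one image into a normalized bag-of-SIFT histogram.
    % Assumes vocab is the 128 x vocab_size matrix from the k-means step above.
    function hist_feat = bag_of_sift(img, vocab)
        [~, descriptors] = vl_dsift(single(img), 'step', 3, 'fast');  % step 3 as in the parameter list below
        d = vl_alldist2(single(descriptors), vocab);    % #descriptors x vocab_size distances
        [~, nearest] = min(d, [], 2);                   % closest visual word per descriptor
        counts = histc(nearest, 1:size(vocab, 2));      % occurrences of each word
        hist_feat = counts(:)' / (norm(counts) + eps);  % normalize to unit length
    end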

The downside of this representation is that it doesn't encode any spatial information or the relationships between different descriptors in the image, which can be important: in scenes of streets, for example, cars are almost always at the bottom of the image and below the buildings.

Nearest Neighbor Classifier

The idea behind it is fairly simple. Imagine a cloud of points formed by the feature vectors (tiny images or bag-of-SIFT histograms) of all the training images. For each test image, find its nearest neighbor in this cloud of points and give the test image that neighbor's label. A simple extension of this would be k-nearest neighbors, labelling the test image with the majority label among the k closest training images.
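
A sketch of the 1-NN step, assuming feature matrices with one column per image and VLFeat's vl_alldist2() for the pairwise distances (variable names are illustrative):

    % Sketch of 1-NN classification; one column per image in the feature matrices.
    % train_feats: D x N, test_feats: D x M, train_labels: cell array of N category names.
    d = vl_alldist2(train_feats, test_feats);    % N x M pairwise (squared) distances
    [~, nn_index] = min(d, [], 1);               % closest training image for each test image
    predicted_labels = train_labels(nn_index);   % copy over its label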

SVM linear classifier

Simply put, a linear classifier learns a hyperplane that best separates the positive and negative training examples. To use linear SVMs for a recognition task like this, we train a 1-vs-all SVM for each scene category on the training images, then evaluate every classifier on each test image and label the test image with the category whose classifier returns the highest confidence.
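
A sketch of the 1-vs-all setup, assuming VLFeat's vl_svmtrain() and cell arrays of category and label strings (all variable names are illustrative):

    % Sketch of 1-vs-all linear SVMs with VLFeat's vl_svmtrain.
    % train_feats: D x N with one column per image; train_labels, categories: cell arrays of names.
    lambda = 0.00001;                                % regularization weight (see parameter list below)
    num_categories = numel(categories);
    W = zeros(size(train_feats, 1), num_categories);
    B = zeros(1, num_categories);
    for c = 1:num_categories
        binary_labels = 2 * double(strcmp(categories{c}, train_labels)) - 1;   % +1 / -1
        [W(:, c), B(c)] = vl_svmtrain(train_feats, binary_labels, lambda);
    end
    % Evaluate every classifier on each test image; keep the most confident category.
    confidences = bsxfun(@plus, W' * test_feats, B');   % num_categories x num_test
    [~, best] = max(confidences, [], 1);
    predicted_labels = categories(best);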

Most accurate parameters

  1. lambda = 0.00001
  2. vocab size = 100
  3. step size = 3 (for finding SIFT descriptors in both training and test images)
  4. sample size = 500 (I subsample descriptors from each training image so that every training image contributes equally)

Results in a table

Tiny Image + NN


Accuracy (mean of diagonal of confusion matrix) is 0.225

Bag of SIFT + NN


Accuracy (mean of diagonal of confusion matrix) is 0.457

Bag of SIFT + SVM


Accuracy (mean of diagonal of confusion matrix) is 0.623
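
All three accuracies are computed the same way; a sketch, assuming cell arrays of true and predicted category names:

    % Sketch of the reported accuracy: mean of the diagonal of the row-normalized confusion matrix.
    num_categories = numel(categories);
    confusion = zeros(num_categories);
    for i = 1:numel(test_labels)
        row = find(strcmp(test_labels{i}, categories));        % true category index
        col = find(strcmp(predicted_labels{i}, categories));   % predicted category index
        confusion(row, col) = confusion(row, col) + 1;
    end
    confusion = bsxfun(@rdivide, confusion, sum(confusion, 2));   % rows sum to 1
    accuracy = mean(diag(confusion));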

Per-category results (Bag of SIFT + linear SVM), with the true labels of the sampled false positives and the predicted labels of the sampled false negatives:

Category name    Accuracy   False positives (true label)    False negatives (predicted label)
Kitchen          0.480      LivingRoom, LivingRoom          Office, Bedroom
Store            0.440      LivingRoom, LivingRoom          Forest, Office
Bedroom          0.340      LivingRoom, LivingRoom          Suburb, Store
LivingRoom       0.260      Industrial, Bedroom             Kitchen, Bedroom
Office           0.950      Industrial, Store               Kitchen, Kitchen
Industrial       0.310      Street, InsideCity              Street, TallBuilding
Suburb           0.890      InsideCity, OpenCountry         InsideCity, Store
InsideCity       0.710      Kitchen, Street                 LivingRoom, Kitchen
TallBuilding     0.830      Bedroom, Mountain               Coast, Bedroom
Street           0.580      InsideCity, Industrial          InsideCity, InsideCity
Highway          0.720      Industrial, Industrial          Coast, Industrial
OpenCountry      0.320      Coast, Coast                    InsideCity, Highway
Coast            0.760      Highway, Highway                OpenCountry, InsideCity
Mountain         0.830      LivingRoom, OpenCountry         Suburb, Highway
Forest           0.930      OpenCountry, Mountain           Street, Mountain