CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this project is to recognize scenes given a set of training data. This was accomplished using a variety of methods broken down into two categories: feature extraction and classification. For feature extraction, I implemented the "tiny images" method and the "bag of words (SIFT descriptor)" method. For classification, I implemented the KNN, or K nearest neighbors, algorithm, as well as SVM classification. The steps taken to implement these methods are detailed below.

Tiny Images Feature Representation

The "Tiny Images" method of feature extraction shrinks each image to a small, fixed size. In my program, images are shrunk to 16x16 px. This method of feature extraction accomplishes two things:

  1. The amount of memory used to store an image's set of features is small: one value per pixel of the tiny image (256 pixels at 16x16)
  2. Scaling down an image tends to keep its most prominent, large-scale features while discarding fine, noisy detail (this is a general tendency; in practice, it doesn't always hold)
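
To make the shrinking step concrete, here is a minimal Python sketch. It is not the project's own code: the function name `tiny_image` is hypothetical, the image is assumed to be grayscale and stored as a list of rows, and block averaging is an assumed resizing method (the text only says images are shrunk to 16x16).

```python
def tiny_image(pixels, size=16):
    """Shrink a grayscale image (a list of equal-length rows) to size x size
    by averaging blocks of pixels, then flatten it into one feature vector.
    Block averaging is an assumption; any image-resize routine would do."""
    height, width = len(pixels), len(pixels[0])
    feature = []
    for r in range(size):
        # Source rows covered by output row r (always at least one row).
        r0 = r * height // size
        r1 = max(r0 + 1, (r + 1) * height // size)
        for c in range(size):
            # Source columns covered by output column c.
            c0 = c * width // size
            c1 = max(c0 + 1, (c + 1) * width // size)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            feature.append(sum(block) / len(block))
    return feature


# Usage: a 32x32 image whose right half is bright.
image = [[1 if x >= 16 else 0 for x in range(32)] for _ in range(32)]
feature = tiny_image(image)
print(len(feature))  # 256 values, one per pixel of the 16x16 image
```

The resulting vector can be passed directly to a classifier; whether to also normalize it (for example to zero mean) is a separate design choice not covered in the text.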

Bag of SIFTs Feature Representation

I first formed a "vocabulary" of visual words by sampling features (SIFT descriptors) from the training set and clustering them via K-means. This resulted in a set of cluster means.
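
The vocabulary step can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: `build_vocabulary` and `squared_distance` are hypothetical names, the descriptors are assumed to be already extracted as lists of numbers, and the clustering is plain k-means (Lloyd's algorithm) with a fixed number of iterations and a seeded random start.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_vocabulary(descriptors, vocab_size, iterations=10, seed=0):
    """Cluster descriptors with k-means and return the cluster means,
    which serve as the visual words."""
    rng = random.Random(seed)
    # Start from vocab_size distinct descriptors chosen at random.
    centers = [list(d) for d in rng.sample(descriptors, vocab_size)]
    for _ in range(iterations):
        # Assignment step: group each descriptor with its nearest center.
        clusters = [[] for _ in centers]
        for d in descriptors:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(d, centers[i]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Usage with toy 2-D "descriptors" (real SIFT descriptors have 128 dimensions).
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
vocabulary = build_vocabulary(toy, vocab_size=2)
```
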
To represent an image, such as a test image, I sampled features (SIFT descriptors) from it. For each feature, I determined which cluster it fell into by finding the cluster mean closest to that feature. Doing this for every SIFT feature gives a histogram of the number of SIFT features that fell into each cluster. This histogram is the feature representation that gets used during classification.
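
The histogram step can be sketched like this, as a hedged illustration rather than the project's code. The names `bag_of_words_histogram` and `squared_distance` are hypothetical, and both the descriptors and the vocabulary (the cluster means) are assumed to be lists of equal-length numeric vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(descriptors, vocabulary):
    """Count how many descriptors fall nearest to each cluster mean."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(d, vocabulary[i]))
        histogram[nearest] += 1
    return histogram


# Usage with a toy 2-word vocabulary and three toy descriptors.
vocabulary = [[0, 0], [10, 10]]
descriptors = [[1, 0], [9, 9], [0, 1]]
print(bag_of_words_histogram(descriptors, vocabulary))  # [2, 1]
```

Dividing each count by the number of descriptors is a common optional step so that images with different descriptor counts stay comparable; the text does not say whether that was done here.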

KNN (K Nearest Neighbors) Classification

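The KNN method classifies a test image by finding the K training images whose feature vectors are closest to it and assigning the label that occurs most often among those neighbors.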
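
A minimal sketch of nearest-neighbor voting in Python follows. It is illustrative only: `knn_classify` is a hypothetical name, Euclidean distance is an assumed metric, and the value of K used in the project is not stated in the text.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train_features, train_labels, test_feature, k=1):
    """Return the most common label among the k nearest training features."""
    # Pair every training label with its distance to the test feature,
    # then sort so the closest training images come first.
    distances = sorted(
        (squared_distance(f, test_feature), label)
        for f, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Usage with toy 2-D features.
train_features = [[0, 0], [1, 1], [10, 10], [11, 11]]
train_labels = ["Kitchen", "Kitchen", "Forest", "Forest"]
print(knn_classify(train_features, train_labels, [9, 9], k=3))  # Forest
```
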

SVM Classification

To train 1-vs-all linear SVMs, I learned, for each class, the parameters w and b of the linear decision function y = w·x + b, where x is a training feature vector. The parameters w and b are tuned to best separate the training images that belong to class "X" from those of all other classes. This was done for every class "X".
Next, for each test image, I evaluated every classifier and found the class "X" whose classifier gave the strongest (highest-scoring) response. The test image was then classified as class X.
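
The 1-vs-all scheme can be sketched as follows. This is a hedged illustration, not the project's code: the function names are hypothetical, and the training routine is a simple full-batch subgradient descent on the regularized hinge loss, swapped in for whatever SVM solver was actually used. The values of `lam`, `learning_rate`, and `epochs` are arbitrary assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(features, targets, lam=0.01, learning_rate=0.1, epochs=200):
    """Fit w and b for y = w.x + b, with targets of +1 or -1, by minimizing
    (lam / 2) * |w|^2 + mean hinge loss with subgradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(features, targets):
            if y * (dot(w, x) + b) < 1:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def train_one_vs_all(features, labels):
    """Train one binary classifier per class: that class versus all others."""
    models = {}
    for category in sorted(set(labels)):
        targets = [1 if label == category else -1 for label in labels]
        models[category] = train_linear_svm(features, targets)
    return models


def classify(models, x):
    """Pick the class whose classifier gives the highest score w.x + b."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])


# Usage with toy 2-D features and two classes.
features = [[2, 1], [3, -1], [-2, 1], [-3, -1]]
labels = ["Coast", "Coast", "Forest", "Forest"]
models = train_one_vs_all(features, labels)
print(classify(models, [2.5, 0]))  # Coast
```

Taking the highest raw score, rather than requiring a positive score, means every test image receives a label even when no classifier claims it.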

Accuracy

The accuracy of my program is as follows:
Tiny Images & KNN: 0.204
Bag of SIFTs & KNN: 0.512
Bag of SIFTs & SVM: 0.639

System with Highest Accuracy

Bag of SIFTs & SVM

CS 143 Project 3 results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.639
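
As a sketch of how this number is defined, the Python below builds a row-normalized confusion matrix and averages its diagonal. The function name `mean_per_class_accuracy` is hypothetical and this is not the project's code. The last lines check that the 15 per-category accuracies reported in the results table average to the stated value.

```python
def mean_per_class_accuracy(true_labels, predicted_labels):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    categories = sorted(set(true_labels) | set(predicted_labels))
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    # confusion[i][j] counts images of true class i predicted as class j.
    confusion = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[index[t]][index[p]] += 1
    # Per-class accuracy, only for classes that actually occur as true labels.
    diagonal = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row)]
    return sum(diagonal) / len(diagonal)


# Usage: class A is right half the time, class B always.
print(mean_per_class_accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"]))  # 0.75

# Consistency check against the per-category accuracies in the results table.
per_category = [0.530, 0.500, 0.390, 0.480, 0.780, 0.540, 0.920, 0.520,
                0.620, 0.700, 0.790, 0.430, 0.720, 0.770, 0.900]
print(round(sum(per_category) / len(per_category), 3))  # 0.639
```
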

(The "Sample training images" and "Sample true positives" columns contained only images and are omitted.)

Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
Kitchen       | 0.530    | Office, InsideCity              | Industrial, Bedroom
Store         | 0.500    | LivingRoom, TallBuilding        | LivingRoom, Industrial
Bedroom       | 0.390    | LivingRoom, Store               | Industrial, Office
LivingRoom    | 0.480    | TallBuilding, Industrial        | Bedroom, Store
Office        | 0.780    | Store, LivingRoom               | LivingRoom, Kitchen
Industrial    | 0.540    | InsideCity, InsideCity          | Kitchen, Store
Suburb        | 0.920    | Industrial, Industrial          | InsideCity, TallBuilding
InsideCity    | 0.520    | Store, Store                    | TallBuilding, LivingRoom
TallBuilding  | 0.620    | Highway, Street                 | LivingRoom, LivingRoom
Street        | 0.700    | Coast, Highway                  | InsideCity, Industrial
Highway       | 0.790    | OpenCountry, OpenCountry        | Store, Mountain
OpenCountry   | 0.430    | Coast, Highway                  | Store, Mountain
Coast         | 0.720    | OpenCountry, Industrial         | Street, Highway
Mountain      | 0.770    | OpenCountry, OpenCountry        | Bedroom, TallBuilding
Forest        | 0.900    | TallBuilding, OpenCountry       | Store, Mountain