The goal of this project is to recognize scenes given a set of training data. This was accomplished using a variety of methods broken down into two categories: feature extraction and classification. For feature extraction, I implemented the "tiny images" method and the "bag of words (SIFT descriptor)" method. For classification, I implemented the KNN, or K nearest neighbors, algorithm, as well as SVM classification. The steps taken to implement these methods are detailed below.

Tiny Images Feature Representation

The "Tiny Images" method of feature extraction shrinks each image down to a small, fixed size. In my program, images are shrunk to 16x16 pixels. This method of feature extraction accomplishes two things:

The amount of memory used to store an image's set of features is small (the number of pixels in the tiny version of the image, 16 x 16 = 256)
Scaling down an image keeps its most prominent features while discarding noisier detail (this holds in general; in practice, it doesn't always work)
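
The shrinking step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes a grayscale image stored as a list of rows, at least 16 pixels on each side, and it uses block averaging as the resize method, which the writeup does not specify. The name tiny_image is hypothetical.

```python
def tiny_image(image, size=16):
    """Shrink a grayscale image (list of rows) to size x size by block
    averaging, then flatten the result into one feature vector.

    Assumes the image is at least `size` pixels tall and wide.
    """
    height, width = len(image), len(image[0])
    features = []
    for i in range(size):
        # Source rows that map to output row i.
        r0, r1 = i * height // size, (i + 1) * height // size
        for j in range(size):
            # Source columns that map to output column j.
            c0, c1 = j * width // size, (j + 1) * width // size
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(block) / len(block))
    return features


# Toy 32x32 image whose pixel value equals its row index.
image = [[float(r) for _ in range(32)] for r in range(32)]
features = tiny_image(image)
print(len(features))  # 256
```

Each image ends up as a vector of 256 numbers regardless of its original resolution, which is what makes the representation cheap to store and compare.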

Bag of SIFTs Feature Representation

I first formed a "vocabulary" of visual words by sampling features (SIFT descriptors) from the training set and clustering them via K-means. This resulted in a set of cluster means.
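
The clustering step can be sketched with a plain version of K-means (Lloyd's algorithm). This is an illustrative sketch under stated assumptions: descriptors are lists of numbers (real SIFT descriptors have 128 dimensions, the toy data here has 2), the initial centers are simply the first k points so the result is deterministic, and the function names are hypothetical rather than taken from the project.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Return k cluster means for the given points.

    Assumption: the first k points serve as the initial centers.
    """
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dim = len(members[0])
                centers[c] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers


# Toy "descriptors" forming two well-separated groups.
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
vocabulary = kmeans(points, k=2)
```

The returned cluster means play the role of the visual words in the vocabulary.
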
To represent an image, I sampled features (SIFT descriptors) from it. For each feature, I determined which cluster it fell into by finding the cluster mean closest to that feature. Counting the SIFT features that fell into each cluster gives a histogram with one bin per visual word. This histogram is the feature representation that gets used during classification.
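
The histogram construction can be sketched as below. It is a minimal illustration that assumes the vocabulary is a list of cluster means and that closeness is measured by Euclidean distance; the function names and the toy 2-D values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(descriptors, vocabulary):
    """Count how many descriptors fall closest to each cluster mean."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda c: squared_distance(d, vocabulary[c]))
        histogram[nearest] += 1
    return histogram


# Toy vocabulary of two visual words and five descriptors from one image.
vocabulary = [[0, 0], [10, 10]]
descriptors = [[1, 0], [9, 9], [0, 2], [10, 12], [11, 11]]
histogram = bag_of_words_histogram(descriptors, vocabulary)
print(histogram)  # [2, 3]
```

The histogram length equals the vocabulary size, so every image gets a feature vector of the same dimension no matter how many descriptors were sampled from it.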

KNN (K Nearest Neighbors) Classification

The KNN method of classification, as used here with K = 1 (a single nearest neighbor), is as follows:

For each test image:
Determine the training image whose features are closest to those of this test image
Classify the test image as the same class as this training image
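
The steps above can be sketched as a nearest-neighbor lookup. This is an illustrative sketch, assuming features are equal-length lists of numbers and that "closest" means smallest Euclidean distance; the function names and toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_classify(train_features, train_labels, test_features):
    """Label each test vector with the class of its closest training vector."""
    predictions = []
    for x in test_features:
        best = min(range(len(train_features)),
                   key=lambda i: squared_distance(x, train_features[i]))
        predictions.append(train_labels[best])
    return predictions


train_features = [[0, 0], [5, 5]]
train_labels = ["Forest", "Street"]
test_features = [[1, 1], [4, 6]]
predictions = nearest_neighbor_classify(train_features, train_labels, test_features)
print(predictions)  # ['Forest', 'Street']
```

No training is needed beyond storing the training features and labels; all the work happens at classification time.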

SVM Classification

To train 1-vs-all linear SVMs, I determined, for each class "X", the parameters w and b of a linear decision function y = wx + b, where x is the feature vector of an image. The parameters w and b are tuned to best draw a line between the training images that belong to class "X" and the training images of all other classes. This was done for every class "X".
Next, for each test image, I evaluated the decision function of every class and determined which one classified the image most strongly, that is, which gave the highest value of y. The test image is then assigned to that class.
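
The prediction step can be sketched as follows. Training is left out: the sketch assumes each class already has a (w, b) pair from some SVM solver, and the weights shown are hand-picked hypothetical values, not trained ones. The function names are likewise illustrative.

```python
def decision_value(w, b, x):
    """Evaluate the linear decision function y = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def one_vs_all_predict(classifiers, x):
    """Return the class whose decision function scores x highest.

    `classifiers` maps a class name to its (w, b) pair.
    """
    def score(name):
        w, b = classifiers[name]
        return decision_value(w, b, x)

    return max(classifiers, key=score)


# Hypothetical parameters for two classes over 2-D features.
classifiers = {
    "Coast": ([1.0, 0.0], -0.5),
    "Forest": ([0.0, 1.0], -0.5),
}
print(one_vs_all_predict(classifiers, [2.0, 0.1]))  # Coast
print(one_vs_all_predict(classifiers, [0.1, 3.0]))  # Forest
```

Taking the highest score rather than requiring a positive one means every test image receives a label, even when no single classifier claims it.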

Accuracy

The accuracy of my program is as follows:
Tiny Images & KNN: 0.204
Bag of SIFTs & KNN: 0.512
Bag of SIFTs & SVM: 0.639

System with Highest Accuracy

Bag of SIFTs & SVM

CS 143 Project 3 results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.639
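
The accuracy measure named above can be sketched as below. The sketch assumes a confusion matrix with one row per true category and one column per predicted category, each row normalized by the number of test images in that category; the function name and the toy labels are hypothetical.

```python
def mean_diagonal_accuracy(true_labels, predicted_labels, categories):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    index = {name: i for i, name in enumerate(categories)}
    n = len(categories)
    confusion = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[index[t]][index[p]] += 1  # row = true, column = predicted
    per_class = []
    for i, row in enumerate(confusion):
        total = sum(row)
        # Fraction of this category's images that were labeled correctly.
        per_class.append(row[i] / total if total else 0.0)
    return sum(per_class) / n


categories = ["Coast", "Forest"]
true_labels = ["Coast", "Coast", "Forest", "Forest", "Forest", "Forest"]
predicted = ["Coast", "Forest", "Forest", "Forest", "Forest", "Coast"]
accuracy = mean_diagonal_accuracy(true_labels, predicted, categories)
print(accuracy)  # 0.625
```

Under this definition the overall figure is the average of the per-category accuracies. The fifteen per-category values reported in this writeup sum to 9.59, and 9.59 / 15 is approximately 0.639, which agrees with the accuracy stated above.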

[Image columns: Sample training images, Sample true positives]

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen         0.530      Office, InsideCity                Industrial, Bedroom
Store           0.500      LivingRoom, TallBuilding          LivingRoom, Industrial
Bedroom         0.390      LivingRoom, Store                 Industrial, Office
LivingRoom      0.480      TallBuilding, Industrial          Bedroom, Store
Office          0.780      Store, LivingRoom                 LivingRoom, Kitchen
Industrial      0.540      InsideCity, InsideCity            Kitchen, Store
Suburb          0.920      Industrial, Industrial            InsideCity, TallBuilding
InsideCity      0.520      Store, Store                      TallBuilding, LivingRoom
TallBuilding    0.620      Highway, Street                   LivingRoom, LivingRoom
Street          0.700      Coast, Highway                    InsideCity, Industrial
Highway         0.790      OpenCountry, OpenCountry          Store, Mountain
OpenCountry     0.430      Coast, Highway                    Store, Mountain
Coast           0.720      OpenCountry, Industrial           Street, Highway
Mountain        0.770      OpenCountry, OpenCountry          Bedroom, TallBuilding
Forest          0.900      TallBuilding, OpenCountry         Store, Mountain

Flora Jin (fjin)

CS 143 / Project 3 / Scene Recognition with Bag of Words