CS 143 / Project 3 / Scene Recognition with Bag of Words

The objective of this project was to build a 15-category scene recognition pipeline following Lazebnik et al. 2006. A tiny-image pipeline was also developed as a jumping-off point.

A 400-word vocabulary was built from the training images by extracting dense SIFT features (using 'fast' vl_dsift with a step size of 20 or 30) and clustering them with k-means. A bag of SIFTs was then extracted from each image (again using 'fast' vl_dsift, with a step of 10) and binned into a histogram over the k-means cluster centers. A kNN classifier and a linear SVM classifier (trained on 100 training images) were developed to assign test images to categories. This pipeline classified images with about 64% accuracy.

From this basic pipeline, other options were explored. Soft binning was attempted, but it decreased accuracy to 50% when 3 nearest neighbors were considered and to 40% when 15 nearest neighbors were considered. This could very well be the result of a bad implementation, but the approach was abandoned. Using vl_dsift without the 'fast' parameter was also attempted; it showed no appreciable improvement in performance, and in fact a slight decrease, so it too was abandoned.

Next, a 512-dimensional GIST vector was computed for each image (using LMgist by Aude Oliva and Antonio Torralba) and appended to the image's 400-dimensional histogram, which raised accuracy from 0.64 to 0.726. Finally, the bag of SIFTs was binned spatially, giving a (400*bins + 512)-dimensional vector in total, which raised accuracy further, to 0.761 with 4 spatial bins and 0.777 with 16. However, these gains are not very elegant and come at the cost of increased computation time and memory use.
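The histogram step of the basic pipeline can be sketched as follows. This is a minimal pure-Python illustration, not the project's MATLAB code: the function names, the `num_neighbors` parameter, and the inverse-distance weighting used for soft binning are all assumptions made for the example. With `num_neighbors=1` each descriptor votes for its single nearest vocabulary word (hard assignment); larger values spread each vote over several words, which is one possible reading of "soft binning".

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(descriptors, vocabulary, num_neighbors=1):
    """Return an L1-normalized histogram of visual-word votes.

    descriptors   -- list of feature vectors (e.g. SIFT descriptors)
    vocabulary    -- list of cluster centers (e.g. from k-means)
    num_neighbors -- 1 for hard assignment; >1 spreads each vote over
                     the nearest words (assumed soft-binning scheme)
    """
    histogram = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        distances = [squared_distance(descriptor, word) for word in vocabulary]
        nearest = sorted(range(len(vocabulary)), key=lambda i: distances[i])
        nearest = nearest[:num_neighbors]
        # Assumed weighting: closer words receive a larger share of the vote.
        weights = [1.0 / (1.0 + math.sqrt(distances[i])) for i in nearest]
        total_weight = sum(weights)
        for index, weight in zip(nearest, weights):
            histogram[index] += weight / total_weight
    total = sum(histogram)
    return [value / total for value in histogram] if total > 0 else histogram


# Toy example: a 3-word vocabulary and 3 two-dimensional descriptors.
toy_vocabulary = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
toy_descriptors = [[1.0, 1.0], [9.0, 9.0], [0.0, 1.0]]
hard_histogram = bag_of_words_histogram(toy_descriptors, toy_vocabulary)
soft_histogram = bag_of_words_histogram(toy_descriptors, toy_vocabulary, num_neighbors=2)
```

In the toy example, hard assignment leaves the third word with no votes, while the two-neighbor variant gives it a nonzero share. Each soft vote still sums to one per descriptor, so both histograms have the same total mass.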

Performance of feature representations

GIST features   Spatial binning   Num. spatial bins   Performance
No              No                1                   0.64
Yes             No                1                   0.726
Yes             Yes               4                   0.761
Yes             Yes               16                  0.777
No              Yes               16                  0.724
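The feature layouts compared in the table can be sketched as below. This is an illustrative pure-Python version under several assumptions: the spatial bins are taken to be a square grid (so 4 and 16 bins correspond to 2x2 and 4x4 grids), the concatenated histogram is L1-normalized as a whole, and the names `spatial_bag_of_words` and `full_feature` are hypothetical. The global descriptor stands in for the 512-dimensional GIST vector and is simply appended.

```python
def spatial_bag_of_words(points, word_ids, width, height, grid, vocab_size):
    """Concatenate one visual-word histogram per cell of a grid x grid layout.

    points   -- (x, y) location of each descriptor inside the image
    word_ids -- index of the nearest vocabulary word for each descriptor
    """
    feature = [0.0] * (grid * grid * vocab_size)
    for (x, y), word in zip(points, word_ids):
        # Map the pixel location to a grid cell, clamping the far edge.
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        cell = row * grid + col
        feature[cell * vocab_size + word] += 1.0
    total = sum(feature)
    return [value / total for value in feature] if total > 0 else feature


def full_feature(points, word_ids, width, height, grid, vocab_size,
                 global_descriptor=()):
    """Spatially binned histogram with an optional global descriptor appended."""
    histogram = spatial_bag_of_words(points, word_ids, width, height,
                                     grid, vocab_size)
    return histogram + list(global_descriptor)


# Dimensionality of the 16-bin + GIST configuration: 400 * 16 + 512.
placeholder_gist = [0.0] * 512
feature_length = len(full_feature([], [], 256, 256, 4, 400, placeholder_gist))
```

Under these assumptions the five table rows correspond to feature lengths of 400, 912, 2112, 6912, and 6400, which shows why the spatially binned variants cost more memory and computation.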

The highest performance, 78%, was seen when binning each image into 16 spatial bins and including the GIST features. The confusion matrix and example classifications are below.

CS 143 Project 3 results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.777
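As a small illustration of how this figure is obtained, the sketch below builds a row-normalized confusion matrix and averages its diagonal. The function names are hypothetical and the two-category data is a toy example; the `per_category` list copies the per-category accuracies from the table below, whose mean reproduces the reported value.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: rows are true categories,
    columns are predicted categories."""
    index = {name: i for i, name in enumerate(categories)}
    counts = [[0] * len(categories) for _ in categories]
    for true_name, predicted_name in zip(true_labels, predicted_labels):
        counts[index[true_name]][index[predicted_name]] += 1
    matrix = []
    for row in counts:
        row_total = sum(row)
        matrix.append([c / row_total if row_total else 0.0 for c in row])
    return matrix


def mean_diagonal(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Toy example with two categories.
toy_matrix = confusion_matrix(
    ["Coast", "Coast", "Forest", "Forest"],
    ["Coast", "Forest", "Forest", "Forest"],
    ["Coast", "Forest"],
)
toy_accuracy = mean_diagonal(toy_matrix)

# Per-category accuracies copied from the results table.
per_category = [0.620, 0.640, 0.610, 0.590, 0.950, 0.690, 0.990, 0.770,
                0.830, 0.910, 0.870, 0.680, 0.750, 0.830, 0.930]
overall = round(sum(per_category) / len(per_category), 3)
```

Because every diagonal entry is weighted equally, this measure treats all 15 categories the same regardless of how many test images each one has.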

(Sample training images, sample true positives, and the false positive and false negative thumbnails are not shown; only their labels are listed.)

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen         0.620      Bedroom, LivingRoom               Bedroom, Bedroom
Store           0.640      TallBuilding, InsideCity          TallBuilding, InsideCity
Bedroom         0.610      LivingRoom, LivingRoom            LivingRoom, LivingRoom
LivingRoom      0.590      Industrial, Suburb                Office, Store
Office          0.950      LivingRoom, Store                 Bedroom, Store
Industrial      0.690      Store, TallBuilding               Store, TallBuilding
Suburb          0.990      Industrial, Industrial            LivingRoom
InsideCity      0.770      Highway, Industrial               Kitchen, Kitchen
TallBuilding    0.830      InsideCity, InsideCity            Store, Store
Street          0.910      OpenCountry, InsideCity           Highway, InsideCity
Highway         0.870      OpenCountry, Kitchen              Street, Coast
OpenCountry     0.680      Coast, Coast                      Highway, Coast
Coast           0.750      OpenCountry, OpenCountry          OpenCountry, OpenCountry
Mountain        0.830      OpenCountry, OpenCountry          Forest, Street
Forest          0.930      OpenCountry, OpenCountry          Mountain, Mountain