CS 143 / Project 3 / Scene Recognition with Bag of Words

The objective of this project was to create a 15-category scene recognition pipeline following Lazebnik et al. 2006. A tiny-image pipeline was also developed as a jumping-off point. A 400-word vocabulary was built from the training images by extracting dense SIFT features (using 'fast' vl_dsift with a step size of 20 or 30) and clustering these features with k-means. From there, a bag of SIFTs (again using 'fast' vl_dsift, with a step size of 10) was assembled and binned into a histogram over the k-means clusters for each test image. A kNN classifier and a linear SVM classifier (trained on 100 test images) were developed to assign test images to categories. This pipeline classified images with about 64% accuracy.

From this basic pipeline, other options were explored. Soft binning was attempted, but it decreased accuracy to 50% when 3 nearest neighbors were considered and to 40% when 15 nearest neighbors were considered. This could very well be the result of a bad implementation, but the approach was abandoned. Using vl_dsift without the 'fast' parameter was also attempted, but this showed no appreciable improvement in performance, and in fact a slight decrease, so it too was abandoned.

Next, a 512-dimensional GIST vector was computed for each image (using LMgist by Aude Oliva and Antonio Torralba) and appended to the 400-dimensional histogram, which greatly increased performance. Finally, the bag of SIFTs was binned spatially, giving a feature vector of 400*bins + 512 dimensions in total, which also increased performance. However, these gains are not very elegant and come at the cost of increased computation time and memory use.
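As a rough illustration of the histogram step, the sketch below assigns each descriptor to its nearest vocabulary word and counts the assignments. It is plain Python rather than the MATLAB/VLFeat code used for the project. The function names, the toy 2-D descriptors, and the equal-vote soft-binning rule (each descriptor gives 1/k to each of its k nearest words) are assumptions made for illustration, not necessarily the weighting that was tried here.

```python
import math


def nearest_words(descriptor, vocab, k=1):
    """Return the indices of the k vocabulary words closest to the descriptor."""
    dists = [math.dist(descriptor, word) for word in vocab]
    return sorted(range(len(vocab)), key=lambda i: dists[i])[:k]


def bag_of_words(descriptors, vocab, k=1):
    """Build an L1-normalized histogram of visual words.

    k=1 is hard assignment. k>1 is a simple soft binning in which each
    descriptor casts an equal vote of 1/k for its k nearest words
    (an assumed weighting, chosen only for illustration).
    """
    hist = [0.0] * len(vocab)
    for d in descriptors:
        for i in nearest_words(d, vocab, k):
            hist[i] += 1.0 / k
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy data: a 3-word vocabulary and four 2-D "descriptors".
vocab = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 1.0), (8.0, 0.0), (0.0, 9.0)]

hard = bag_of_words(descriptors, vocab, k=1)
soft = bag_of_words(descriptors, vocab, k=2)
print(hard)
print(soft)
```

With a real vocabulary, `vocab` would hold the 400 k-means centers and each descriptor would be a 128-dimensional SIFT vector. Soft binning spreads each vote over several words, which flattens the histogram and may explain part of the accuracy drop reported above.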

Performance of feature representations

GIST Features	Spatial Binning	Number of Spatial Bins	Accuracy
No	No	1	0.64
Yes	No	1	0.726
Yes	Yes	4	0.761
Yes	Yes	16	0.777
No	Yes	16	0.724
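To make the feature sizes behind these rows concrete, here is a minimal Python sketch of spatial binning. It assumes a uniform square grid (for example 2x2 for 4 bins and 4x4 for 16 bins) and unnormalized per-cell counts. The grid layout, the function names, and the toy keypoints are all assumptions, since the original implementation is not shown.

```python
def spatial_histogram(points, words, width, height, grid, vocab_size):
    """Concatenate one visual-word histogram per cell of a grid x grid layout.

    points: (x, y) keypoint locations in pixels
    words:  vocabulary index assigned to each keypoint
    """
    feat = [0.0] * (grid * grid * vocab_size)
    for (x, y), w in zip(points, words):
        # Clamp so keypoints on the right/bottom border fall in the last cell.
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        cell = row * grid + col
        feat[cell * vocab_size + w] += 1.0
    return feat


def feature_length(vocab_size, num_bins, gist_dims):
    """Length of the concatenated descriptor: vocab_size * bins + GIST dims."""
    return vocab_size * num_bins + gist_dims


# Toy example: 2x2 grid, 2-word vocabulary, 100x100 image.
toy = spatial_histogram(
    points=[(0, 0), (99, 99), (50, 10)],
    words=[0, 1, 1],
    width=100, height=100, grid=2, vocab_size=2,
)
print(toy)

# Feature lengths for the configurations in the table above.
print(feature_length(400, 1, 0))     # SIFT histogram only
print(feature_length(400, 1, 512))   # with GIST
print(feature_length(400, 4, 512))   # 4 spatial bins with GIST
print(feature_length(400, 16, 512))  # 16 spatial bins with GIST
print(feature_length(400, 16, 0))    # 16 spatial bins, no GIST
```

The best configuration therefore uses a vector roughly 17 times longer than the baseline histogram, which is consistent with the computation and memory cost noted earlier.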

The highest performance, 78%, was seen when binning each image into 16 spatial bins and including the GIST features. The confusion matrix and example classifications are below.

CS 143 Project 3 results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.777
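The accuracy figure above can be reproduced from per-category results. The following Python sketch builds a row-normalized confusion matrix and averages its diagonal; the function names and the two-class toy labels are hypothetical, while the per-category accuracies in the last step are copied from the results table in this report.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    images from category i that were predicted as category j."""
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_diagonal(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Toy two-class example.
cm = confusion_matrix(["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"])
print(mean_diagonal(cm))

# Per-category accuracies reported in the results table.
per_category = [0.62, 0.64, 0.61, 0.59, 0.95, 0.69, 0.99, 0.77,
                0.83, 0.91, 0.87, 0.68, 0.75, 0.83, 0.93]
overall = sum(per_category) / len(per_category)
print(round(overall, 3))
```

Averaging the diagonal weights every category equally, so it matches overall accuracy only when each category has the same number of test images.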

Category name	Accuracy	False positives (true label)	False negatives (wrong predicted label)
Kitchen	0.620	Bedroom, LivingRoom	Bedroom, Bedroom
Store	0.640	TallBuilding, InsideCity	TallBuilding, InsideCity
Bedroom	0.610	LivingRoom, LivingRoom	LivingRoom, LivingRoom
LivingRoom	0.590	Industrial, Suburb	Office, Store
Office	0.950	LivingRoom, Store	Bedroom, Store
Industrial	0.690	Store, TallBuilding	Store, TallBuilding
Suburb	0.990	Industrial, Industrial	LivingRoom
InsideCity	0.770	Highway, Industrial	Kitchen, Kitchen
TallBuilding	0.830	InsideCity, InsideCity	Store, Store
Street	0.910	OpenCountry, InsideCity	Highway, InsideCity
Highway	0.870	OpenCountry, Kitchen	Street, Coast
OpenCountry	0.680	Coast, Coast	Highway, Coast
Coast	0.750	OpenCountry, OpenCountry	OpenCountry, OpenCountry
Mountain	0.830	OpenCountry, OpenCountry	Forest, Street
Forest	0.930	OpenCountry, OpenCountry	Mountain, Mountain

James Besancon (jbesanco)
