CS143, Project 3:

Scene Recognition with Bag of Words

by Betsy Hilliard (betsy), 10/24/11

Background and Motivation

Scene recognition is a major open problem in computer vision and the topic of much current research. The goal is to categorize an image linguistically, as humans do so naturally and accurately every day.

Algorithm

One algorithm for scene recognition is the bag-of-features model, inspired by the bag-of-words model from linguistics research. The pipeline has four steps. First, collect features from a set of training images, sample those features, and use k-means to create a "vocabulary" of features. Second, for each training image, build a histogram of how often each vocabulary word (the nearest cluster center for each feature) occurs. Third, train a support vector machine on these histograms. Finally, build histograms for the test images and classify them with the trained SVM.
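The vocabulary and histogram steps can be sketched in a few lines. This is a minimal illustration using only NumPy, not the code used for this project: the function names, the random stand-in descriptors, and the plain k-means loop are all assumptions, and the SVM training step is left out.

```python
import numpy as np

def assign_words(features, vocab):
    """Index of the nearest vocabulary word for every descriptor."""
    # Squared Euclidean distance between each descriptor and each word.
    dists = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def build_vocabulary(features, k, iters=20, seed=0):
    """Cluster (n, d) descriptors into k visual words with plain k-means."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct, randomly chosen descriptors.
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = assign_words(features, centers)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties out
                centers[j] = members.mean(axis=0)
    return centers

def word_histogram(features, vocab):
    """Normalised frequency of each visual word in one image."""
    counts = np.bincount(assign_words(features, vocab), minlength=len(vocab))
    return counts / counts.sum()

# Hypothetical data: random 8-D descriptors stand in for real image features.
rng = np.random.default_rng(1)
train_descriptors = rng.normal(size=(500, 8))
vocab = build_vocabulary(train_descriptors, k=10)
hist = word_histogram(rng.normal(size=(75, 8)), vocab)
```

Each image ends up as one fixed-length histogram regardless of how many features it contains, which is what lets a standard classifier such as an SVM be trained on top.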

Baseline Results

Vocab=200; 100 training images; 110 testing images; sample 75 features from each image

Accuracy = 63.5%	Accuracy = 63.3%

Category	Correct Answers	False Positives	False Negatives
suburb
coast
forest
highway
inside city
mountain
open country
street
tall building
office
bedroom
industrial
kitchen
living room
store

Extra Credit

Sampling Spatially

I decided to see whether forcing spatial sampling alone would be enough to improve performance. I sampled evenly from each of the 4 quadrants of each image to create the vocabulary. Since the vocabulary is built from only a sample of the features, I hypothesized that sampling spatially might create a more descriptive vocabulary. However, the performance was about the same.

Accuracy = 62.3%
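Quadrant sampling of this kind might look like the sketch below. The function name, the `per_quadrant` count, and the synthetic grid of feature positions are assumptions for illustration; the report does not state how many features were drawn from each quadrant.

```python
import numpy as np

def sample_by_quadrant(xy, width, height, per_quadrant, seed=0):
    """Return indices of features, drawing up to per_quadrant from each quadrant."""
    rng = np.random.default_rng(seed)
    right = xy[:, 0] >= width / 2
    bottom = xy[:, 1] >= height / 2
    # Quadrant id: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
    quadrant = right.astype(int) + 2 * bottom.astype(int)
    chosen = []
    for q in range(4):
        idx = np.flatnonzero(quadrant == q)
        take = min(per_quadrant, len(idx))
        if take > 0:
            chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen) if chosen else np.array([], dtype=int)

# Hypothetical feature positions on a regular grid in a 100 x 80 image.
xs = np.arange(5, 100, 10)
ys = np.arange(5, 80, 10)
xy = np.array([(x, y) for y in ys for x in xs], dtype=float)
picked = sample_by_quadrant(xy, width=100, height=80, per_quadrant=5)
```

The indices returned can then be used to select the corresponding descriptors before clustering.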

Vocabulary Size Effects

Choosing the vocabulary size is an important decision in bag-of-features recognition. I ran my baseline algorithm with different vocabulary sizes and found that, with this image set, a vocabulary of 200-400 words is reasonable. Note: the axis labels are incorrect in this section.

vocab = 10, accuracy = 44.3%	vocab = 20, accuracy = 52.67%

vocab = 50, accuracy = 58.91%	vocab = 100, accuracy = 59.82%

vocab = 200, accuracy = 62.55%	vocab = 400, accuracy = 62.97%

vocab = 1000, accuracy = 62.06%	vocab = 5000, accuracy = 58.55%
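As a quick check on the 200-400 recommendation, the accuracies listed above can be tabulated and compared against the best one. The 0.5 percentage point tolerance is an assumed cut-off chosen for illustration, not something taken from the experiments.

```python
# Accuracies (%) reported above, keyed by vocabulary size.
reported = {10: 44.3, 20: 52.67, 50: 58.91, 100: 59.82,
            200: 62.55, 400: 62.97, 1000: 62.06, 5000: 58.55}

best_size = max(reported, key=reported.get)
tolerance = 0.5  # assumed cut-off in percentage points, not from the report
near_best = sorted(size for size, acc in reported.items()
                   if reported[best_size] - acc <= tolerance)
print(best_size, near_best)  # prints: 400 [200, 400]
```

Accuracy rises quickly up to about 200 words, flattens, and then drops again at 5000 words.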

Indoor vs. Outdoor Natural vs. Outdoor Man-made

I wanted to see if categorizing the images twice would help performance. My intuition came from the boosted decision tree concept we discussed in class. I hypothesized that if my classifier could accurately divide the images into indoor, outdoor natural, and outdoor man-made scenes, then training one-vs-all SVMs only among scenes of the same basic type might be more accurate, because it would remove the possibly misleading vocabulary from the other basic categories. To do this I took 20 images from each of the inside categories (office, bedroom, kitchen, living room, and store), 25 images from each of the outside natural categories (forest, open country, mountain, and coast), and 16 images from each of the outside man-made categories (suburb, industrial, tall building, inside city, highway, and street), plus 1 extra from each man-made category except street and highway, for a total of 100 images in each basic category.
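The grouping and the per-category image counts described above can be written out and checked directly. The variable names are illustrative; the category lists and counts come from the paragraph above.

```python
# Basic (coarse) categories and the scene categories they contain.
groups = {
    "inside": ["office", "bedroom", "kitchen", "living room", "store"],
    "outside natural": ["forest", "open country", "mountain", "coast"],
    "outside man-made": ["suburb", "industrial", "tall building",
                         "inside city", "highway", "street"],
}
base_count = {"inside": 20, "outside natural": 25, "outside man-made": 16}
no_extra = {"street", "highway"}  # man-made categories that get no extra image

per_category = {}
for group, scenes in groups.items():
    for scene in scenes:
        extra = 1 if group == "outside man-made" and scene not in no_extra else 0
        per_category[scene] = base_count[group] + extra

# Images per basic category, and the scene -> basic category lookup.
totals = {g: sum(per_category[s] for s in scenes) for g, scenes in groups.items()}
coarse_label = {s: g for g, scenes in groups.items() for s in scenes}
print(totals)
```

Each basic category comes out to 100 training images, so the first-stage classifier sees balanced classes.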

I then tested whether the classifier was accurate at the level of the basic categories and found that it was 80.7% correct (3 trials). The classifier was very good (91%) at picking out the natural images and pretty good at picking out inside images (85%). The lower overall success rate comes from the poor classifier for outside man-made scenes. This might be because some of those scenes are really only partially outdoors.

Test 1	Test 2	Test 3

Next, I classified each sub-category one-vs-rest within its basic category. These classifiers worked with mixed success. Though the classifier was good at deciding that an image was indoors, it was bad at deciding exactly which inside scene it was.

	Inside	Outside(man-made)	Outside(natural)
Test 1
Test 2
Test 3
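The two-stage scheme can be sketched as follows. Linear scorers with made-up weights stand in for the trained one-vs-all SVMs, and the 3-bin histogram is a toy input; none of these values come from the experiments.

```python
import numpy as np

def one_vs_all_predict(weights, biases, labels, x):
    """Pick the label whose linear one-vs-all scorer gives the highest score."""
    scores = weights @ x + biases
    return labels[int(np.argmax(scores))]

def classify_two_stage(x, coarse, fine):
    """coarse: (W, b, labels); fine: dict mapping coarse label -> (W, b, labels)."""
    group = one_vs_all_predict(*coarse, x)
    scene = one_vs_all_predict(*fine[group], x)  # only scenes within that group
    return group, scene

# Hypothetical weights for a toy 3-bin histogram.
coarse = (np.eye(3), np.zeros(3),
          ["inside", "outside natural", "outside man-made"])
fine = {
    "inside": (np.array([[0., 1., 0.], [0., 0., 1.]]), np.zeros(2),
               ["kitchen", "office"]),
    "outside natural": (np.array([[1., 0., 0.], [0., 0., 1.]]), np.zeros(2),
                        ["coast", "forest"]),
    "outside man-made": (np.array([[1., 0., 0.], [0., 1., 0.]]), np.zeros(2),
                         ["street", "highway"]),
}
result = classify_two_stage(np.array([0.7, 0.2, 0.1]), coarse, fine)
print(result)  # prints: ('inside', 'kitchen')
```

Because the second stage only compares scenes within the chosen basic category, an error in the first stage cannot be recovered later.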

Scenes from my life

It is possible that databases of scenes are contrived, which brings into question the practicality of a trained algorithm. To see how well my scene recognizer would do outside the database, I ran it on some scenes from my life. I tried to pick images that fell into different categories. Some of them are a bit more ambiguous, and the classifier picked up on that in many cases (e.g., Is the Boston skyline from the harbor tall buildings? industrial? coast?; Are the Cliffs of Moher coastline or a mountain?; Can there be trees in front of tall buildings or are you really in a forest?; Can there be a mountain right next to streets?; If you are short like me will short buildings seem tall?). In some cases the categorization was a bit ironic, like my mother's office being labeled a kitchen. I was surprised that it didn't get any of the offices correct and that it did get the store correct.

Image		Where?		Labeled
		My Parents' house		forest/natural
		Cliffs of Moher, Ireland (aka Cliffs of Insanity!)		coast, mountain/natural
		Coast of St Andrews, Scotland		coast/natural
		Puck's Glen, Argyll, Scotland		forest/natural
		Forest in MA		forest/man-made
		Parents' Backyard		store/inside
		Victoria St., Edinburgh, Scotland		street/man-made
		The Elephant House, Edinburgh, Scotland (aka where Harry Potter was written)		inside city/inside
		Galway, Ireland		tall building, mountain/inside
		Glencoe, Scotland		suburb/natural
		Arthur's Seat, Scotland		street/man-made
		View from Wallace Monument, Scotland		coast/natural
		Irish countryside		coast, open country/natural
		Tall building in San Antonio, TX		forest/natural
		Boston Skyline		industrial, coast/natural
		My Office		highway, industrial/inside
		My dad's office		forest/natural
		My boyfriend's office		forest/natural
		My Mom's office		kitchen/inside
		My bedroom		highway/natural
		Parents' guest bedroom		highway/natural
		4th floor kitchen		industrial/inside
		3rd floor kitchen		industrial/inside
		My kitchen		kitchen/inside
		My Parents' kitchen		forest/natural
		My Parents' living room		industrial/natural
		A grocery store in NYC		store/inside

Discussion of Results

The living room classifier performs extremely badly. This may be because the definition of a living room is very broad. Many of these images were classified as offices or stores. It makes sense that a living room, which contains chairs and tables, would look like an office. I am not sure why it might look just as similar to a store. The classifier was also bad at deciding that an image was a street. Many of the streets were classified as inside city, which seems very reasonable since many of the inside city images are taken along streets. The classification of natural images was much more successful, with open country the worst of the group. Most of the misclassified open country images were labeled coast or mountain instead. I found this confusion with my own scenes as well. I think that since all three have large, basically flat regions, they could be confused easily.

Overall, I think it might be beneficial to double classify. This is a natural human step as well. Humans, even without the sense of sight, are very good at realizing whether they are outside or inside. Therefore we don't even consider that the room we just walked into could possibly be a highway. The pipeline could be edited to classify the image once and then classify within that class. One could also conceive of a way to avoid hard-classifying an image that is not confidently placed in a higher-level class. With this concept, the classifier could reserve judgement on an image it wasn't sure about and run it through the sub-classifiers of the two most confident higher-level classes.
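One way the "reserve judgement" idea might work is sketched below. This was not implemented for this project: the margin threshold, the linear stand-in scorers, and the rule of picking the sub-class with the highest raw score are all assumptions. Raw scores from separately trained SVMs are not calibrated against one another, so a real version would likely need some score normalisation.

```python
import numpy as np

def classify_with_deferral(x, coarse, fine, margin=0.1):
    """Use the top coarse class if confident; otherwise try the top two."""
    W, b, groups = coarse
    scores = W @ x + b
    order = np.argsort(scores)[::-1]  # coarse classes, best first
    confident = scores[order[0]] - scores[order[1]] >= margin
    candidates = order[:1] if confident else order[:2]
    best = None
    for g in candidates:
        Wf, bf, scenes = fine[groups[g]]
        fine_scores = Wf @ x + bf
        j = int(np.argmax(fine_scores))
        # Assumption: raw sub-classifier scores are comparable across groups.
        if best is None or fine_scores[j] > best[0]:
            best = (float(fine_scores[j]), groups[g], scenes[j])
    return best[1], best[2]

# Hypothetical weights for a toy 3-bin histogram.
coarse = (np.eye(3), np.zeros(3),
          ["inside", "outside natural", "outside man-made"])
fine = {
    "inside": (np.array([[0., 1., 0.], [0., 0., 1.]]), np.zeros(2),
               ["kitchen", "office"]),
    "outside natural": (np.array([[1., 0., 0.], [0., 0., 1.]]), np.zeros(2),
                        ["coast", "forest"]),
    "outside man-made": (np.array([[1., 0., 0.], [0., 1., 0.]]), np.zeros(2),
                         ["street", "highway"]),
}
x = np.array([0.45, 0.40, 0.15])  # top two coarse scores are close
soft = classify_with_deferral(x, coarse, fine, margin=0.1)
hard = classify_with_deferral(x, coarse, fine, margin=0.0)
print(soft, hard)
```

In this toy case the hard decision commits to "inside" and answers "kitchen", while the deferred version also consults the "outside natural" sub-classifier and answers "coast".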

Page made by Betsy Hilliard, 10/21/2011