Scene recognition is a major open problem in computer vision research and has been the topic of a lot of recent work. The goal is to categorize an image linguistically, as humans do so naturally and accurately every day.
One algorithm for attempting scene recognition is the bag-of-features model, inspired by the bag-of-words model from linguistics research. The pipeline has four steps. First, collect features from a set of training images, sample those features, and use k-means to create a "vocabulary" of features. Second, for each training image, build a histogram of how often each vocabulary word is the closest match to the image's features. Third, train a support vector machine (SVM) on these histograms. Finally, build histograms for the test images and classify them with the trained SVM.
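The vocabulary and histogram steps can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the results on this page: the function names, the tuple-based feature format, the fixed iteration count, and the choice to normalize each histogram are all assumptions. The SVM step is left out; in practice it would come from a library.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(feature, vocabulary):
    """Index of the vocabulary word closest to the feature."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(feature, vocabulary[i]))

def build_vocabulary(features, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) over a sample of feature vectors."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_word(f, centers)].append(f)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def image_histogram(image_features, vocabulary):
    """Fraction of an image's features assigned to each vocabulary word.

    Normalizing by the feature count is an assumption made here so that
    images with different numbers of features are comparable.
    """
    counts = [0] * len(vocabulary)
    for f in image_features:
        counts[nearest_word(f, vocabulary)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Toy usage with made-up 2-D "features".
sampled = [(0.0, 0.2), (0.3, 0.1), (4.0, 4.5), (5.1, 5.0), (9.0, 0.5)]
vocabulary = build_vocabulary(sampled, k=3)
histogram = image_histogram([(0.1, 0.1), (5.0, 5.0), (4.8, 5.2)], vocabulary)
```

Each training image's histogram, paired with its scene label, is what the SVM would be trained on.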
[Figures (images not recovered): Accuracy = 63.5%; Accuracy = 63.3%]
[Table of example images per category (images not recovered). Columns: Category, Correct Answers, False Positives, False Negatives. Categories: suburb, coast, forest, highway, inside city, mountain, open country, street, tall building, office, bedroom, industrial, kitchen, living room, store.]
I decided to see if just forcing spatial sampling would be enough to improve performance. I sampled evenly from each of the 4 quadrants of each image to create the vocabulary. Since the vocabulary is built from only a sample of the features, I hypothesized that sampling spatially might create a more descriptive vocabulary. However, the performance was actually about the same.
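A rough sketch of what sampling evenly from the four quadrants could look like is below. The `(x, y, descriptor)` keypoint layout, the function names, and the `per_quadrant` parameter are assumptions for illustration only; the original sampling code is not shown on this page.

```python
import random

def quadrant_of(x, y, width, height):
    """Return 0..3 for top-left, top-right, bottom-left, bottom-right.

    Assumes image coordinates with y growing downward.
    """
    return (1 if x >= width / 2 else 0) + (2 if y >= height / 2 else 0)

def sample_by_quadrant(keypoints, width, height, per_quadrant, seed=0):
    """Draw up to `per_quadrant` keypoints from each quadrant of one image.

    `keypoints` is a list of (x, y, descriptor) tuples (assumed layout).
    """
    rng = random.Random(seed)
    buckets = [[], [], [], []]
    for kp in keypoints:
        buckets[quadrant_of(kp[0], kp[1], width, height)].append(kp)
    sample = []
    for bucket in buckets:
        # A quadrant with few features contributes whatever it has.
        sample.extend(rng.sample(bucket, min(per_quadrant, len(bucket))))
    return sample
```

The sampled descriptors from all training images would then be pooled and clustered with k-means exactly as in the unconstrained case.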
[Figure (image not recovered): Accuracy = 62.3%]
Choosing the vocabulary size is an important decision in bag-of-features recognition. I ran my baseline algorithm with different vocabulary sizes and found that, with this image set, a vocabulary of 200-400 words is reasonable. Note: the axis labels are incorrect in this section.
| Vocabulary size | Accuracy |
|---|---|
| 10 | 44.3% |
| 20 | 52.67% |
| 50 | 58.91% |
| 100 | 59.82% |
| 200 | 62.55% |
| 400 | 62.97% |
| 1000 | 62.06% |
| 5000 | 58.55% |
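One way to read the numbers above is to ask which vocabulary sizes come within some tolerance of the best accuracy. The snippet below does that with the accuracies reported in this section; the helper name and the 0.5-point tolerance are assumptions chosen for illustration, not part of the original experiment.

```python
# Accuracies (percent) reported in this section, keyed by vocabulary size.
reported = {10: 44.3, 20: 52.67, 50: 58.91, 100: 59.82,
            200: 62.55, 400: 62.97, 1000: 62.06, 5000: 58.55}

def sizes_near_best(results, tolerance):
    """Vocabulary sizes whose accuracy is within `tolerance` points of the best."""
    best = max(results.values())
    return sorted(size for size, acc in results.items() if best - acc <= tolerance)

best_size = max(reported, key=reported.get)   # 400
close_sizes = sizes_near_best(reported, 0.5)  # [200, 400]
```

With a looser tolerance of one point, 1000 words would also qualify, so the 200-400 range is a judgment call rather than a sharp boundary.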
I wanted to see if categorizing the images twice would help with performance. My intuition for this came from the boosted decision tree concept we discussed in class. I hypothesized that if my classifier could accurately divide the images into indoor, outdoor natural, and outdoor man-made scenes, then training a one-vs-all SVM only among scenes of the same basic category might be more accurate, because it would remove the possibly misleading vocabulary from the other basic categories. To do this I took 20 images from each of the inside categories (office, bedroom, kitchen, living room, and store), 25 images from each of the outside natural categories (forest, open country, mountain, and coast), and 16 images (plus 1 extra from all but street and highway) from each of the outside man-made categories (suburb, industrial, tall building, inside city, highway, and street), for a total of 100 images in each basic category.
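The grouping and the per-category image counts described above can be written down and checked as follows. The dictionary and function names are made up for this sketch; only the category lists and the counts come from the text.

```python
# Basic categories and the scene categories they contain, as listed above.
BASIC_CATEGORIES = {
    "inside": ["office", "bedroom", "kitchen", "living room", "store"],
    "outside natural": ["forest", "open country", "mountain", "coast"],
    "outside man-made": ["suburb", "industrial", "tall building",
                         "inside city", "highway", "street"],
}

# Reverse lookup used to relabel a training image with its basic category.
BASIC_CATEGORY_OF = {scene: basic
                     for basic, scenes in BASIC_CATEGORIES.items()
                     for scene in scenes}

def images_per_scene(basic):
    """Number of training images drawn from each scene category."""
    scenes = BASIC_CATEGORIES[basic]
    if basic == "inside":
        return {s: 20 for s in scenes}
    if basic == "outside natural":
        return {s: 25 for s in scenes}
    # Man-made: 16 each, plus 1 extra from all but street and highway.
    return {s: 16 if s in ("street", "highway") else 17 for s in scenes}

totals = {basic: sum(images_per_scene(basic).values())
          for basic in BASIC_CATEGORIES}
# Every basic category ends up with 100 training images.
```

Balancing the three basic categories at 100 images each keeps the first-stage classifier from favoring whichever group has the most scene categories.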
I then tested how accurately the classifier separated the basic categories and found that it was 80.7% correct (3 trials). The classifier was very good (91%) at picking out the natural images and pretty good (85%) at picking out inside images. The lower overall success rate comes from the poor classifier for outside man-made scenes. This might be because some of those scenes are really only partially outdoors.
[Figures (images not recovered): Test 1; Test 2; Test 3]
Next, I classified each sub-category one-vs-rest against the other sub-categories in its basic category. I found that these classifiers worked with mixed success. Though the first classifier was good at deciding that an image was indoors, the sub-classifier was bad at deciding exactly which inside scene it was.
[Figures (images not recovered): results for Inside, Outside (man-made), and Outside (natural), one row each for Test 1, Test 2, and Test 3]
It is possible that databases of scenes are contrived, which calls into question the practicality of a trained algorithm. To see how my scene recognizer would do outside the database, I ran it on some of the scenes from my life. I tried to pick images that fell into different categories. Some of them are a bit more ambiguous, and the classifier picked up on that in many cases. (E.g., is the Boston skyline from the harbor tall buildings? Industrial? Coast? Are the Cliffs of Moher coastline or a mountain? Can there be trees in front of tall buildings, or are you really in a forest? Can there be a mountain right next to streets? If you are short like me, will short buildings seem tall?) In some cases the categorization was a bit ironic, like my mother's office being labeled a kitchen. I was surprised that it didn't get any of the offices correct and that it did get the store correct.
| Where? | Labeled (scene/basic category) |
|---|---|
| My Parents' house | forest/natural |
| Cliffs of Moher, Ireland (aka Cliffs of Insanity!) | coast, mountain/natural |
| Coast of St Andrews, Scotland | coast/natural |
| Puck's Glen, Argyll, Scotland | forest/natural |
| Forest in MA | forest/man-made |
| Parents' Backyard | store/inside |
| Victoria St., Edinburgh, Scotland | street/man-made |
| The Elephant House, Edinburgh, Scotland (aka where Harry Potter was written) | inside city/inside |
| Galway, Ireland | tall building, mountain/inside |
| Glencoe, Scotland | suburb/natural |
| Arthur's Seat, Scotland | street/man-made |
| View from Wallace Monument, Scotland | coast/natural |
| Irish countryside | coast, open country/natural |
| Tall building in San Antonio, TX | forest/natural |
| Boston Skyline | industrial, coast/natural |
| My Office | highway, industrial/inside |
| My dad's office | forest/natural |
| My boyfriend's office | forest/natural |
| My Mom's office | kitchen/inside |
| My bedroom | highway/natural |
| Parents' guest bedroom | highway/natural |
| 4th floor kitchen | industrial/inside |
| 3rd floor kitchen | industrial/inside |
| My kitchen | kitchen/inside |
| My Parents' kitchen | forest/natural |
| My Parents' living room | industrial/natural |
| A grocery store in NYC | store/inside |
The living room classifier performs extremely poorly. This may be because the definition of a living room is very broad. Many of these images were classified as offices or stores. It makes sense that a living room, which contains chairs and tables, would look like an office. I am not sure why it might look just as similar to a store. The classifier was also bad at deciding that an image was a street. Many of the streets were classified as inside city. This seems very reasonable, as many of the inside city images are taken along streets. The classification of natural images was much more successful, although open country was the worst of them. Most of the misclassified open country images were labeled as coast or mountain instead. I found this confusion with my own scenes as well. I think that since all three have large, basically flat regions, they could be confused easily.
Overall, I think it might be beneficial to double classify. This is a natural human step as well. Humans, even without the sense of sight, are very good at realizing whether they are outside or inside. Therefore we don't even consider that the room we just walked into could possibly be a highway. The pipeline could be edited to classify the image once and then classify within that class. One could also conceive of a way to avoid hard-classifying an image that is not confidently placed in a higher-level class. With this concept, the classifier could reserve judgement on an image it wasn't sure about and run it through the sub-classifiers of the two most confident classes.
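The reserve-judgement idea could be sketched like this. This is a hypothetical design, not something implemented for the results on this page: the function name, the score dictionaries, and the `margin` threshold are assumptions, and it also assumes that confidence scores from different sub-classifiers are comparable, which would need calibration in practice.

```python
def classify_two_stage(coarse_scores, fine_scores, margin=0.1):
    """Pick a scene label in two stages, deferring when the first is unsure.

    coarse_scores: {basic_category: confidence} from the first classifier.
    fine_scores:   {basic_category: {scene: confidence}} from each
                   basic category's sub-classifier.
    margin:        hypothetical threshold; if the top two basic categories
                   are closer than this, both sub-classifiers are consulted.
    """
    ranked = sorted(coarse_scores, key=coarse_scores.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    candidates = [top]
    if coarse_scores[top] - coarse_scores[runner_up] < margin:
        candidates.append(runner_up)  # reserve judgement on the basic category
    best = None
    for basic in candidates:
        for scene, score in fine_scores[basic].items():
            if best is None or score > best[2]:
                best = (basic, scene, score)
    return best[0], best[1]

# Made-up scores: the first stage barely prefers "inside".
coarse = {"inside": 0.55, "outside man-made": 0.50, "outside natural": 0.10}
fine = {
    "inside": {"office": 0.3, "kitchen": 0.4},
    "outside man-made": {"street": 0.7, "highway": 0.2},
    "outside natural": {"forest": 0.9},
}
label = classify_two_stage(coarse, fine)  # ("outside man-made", "street")
```

With `margin=0.0` the same scores are hard-classified as inside and come out as kitchen, which is the kind of early mistake the deferred version is meant to avoid.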
Page made by Betsy Hilliard, 10/21/2011