Project 3: Scene recognition with bag of words
Lu Zeng (lz7) - Fall 2011
Categorization is a problem germane to many domains. For writers of email applications, the problem is determining whether a given email message is spam, ordinary, or "important". For students in introductory jazz courses, the problem might be determining whether a 30-second clip is the work of Jelly Roll Morton or Dizzy Gillespie. In this project, the problem is determining which of 15 scene categories a given image belongs to.


The process is the same for each of the problems above. An email, clip, or scene is reduced to a set of features -- for the jazz clip, this might amount to a student identifying the tempo and instrumentation. For emails, features are single words or n-grams ("buy now", "Nigeria").
The music appreciation student is probably taught what to listen for. For images, we don't know a priori what the features are. Instead, we can extract a great number of features, cluster them into groups of similar features using k-means, and use the cluster centers as our 'vocabulary.' Features are extracted by the SIFT algorithm. In my implementation, I extracted a dense set using vl_dsift, with a size of 4 and a step of 8 pixels, and grouped these features into 200 words for the vocabulary. To speed up calculations, I randomly sample only half the features from each image.
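The vocabulary step can be sketched in a few lines. This is a minimal pure-Python illustration, not the code used for this project (which relied on VLFeat in MATLAB): the function names sample_pool and build_vocabulary, the fixed iteration count, and the random seeding are all assumptions made for the example, and the descriptors are plain lists of numbers standing in for SIFT descriptors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_pool(descriptors_per_image, fraction, rng):
    """Keep a random `fraction` of the descriptors from each image."""
    pool = []
    for descriptors in descriptors_per_image:
        if not descriptors:
            continue  # nothing to sample from an image with no features
        keep = max(1, int(len(descriptors) * fraction))
        pool.extend(rng.sample(descriptors, keep))
    return pool


def build_vocabulary(descriptors_per_image, num_words, fraction=0.5,
                     iterations=20, seed=0):
    """Cluster sampled descriptors with plain k-means (Lloyd's algorithm).

    Returns `num_words` cluster centers; these are the visual words.
    Hypothetical helper: a real run would use an optimized k-means.
    """
    rng = random.Random(seed)
    pool = sample_pool(descriptors_per_image, fraction, rng)
    # Initialize the centers with distinct randomly chosen descriptors.
    centers = [list(c) for c in rng.sample(pool, num_words)]
    for _ in range(iterations):
        # Assignment step: attach each descriptor to its nearest center.
        clusters = [[] for _ in centers]
        for d in pool:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(d, centers[i]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members,
        # leaving a center unchanged if no descriptor was assigned to it.
        centers = [
            [sum(column) / len(members) for column in zip(*members)]
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers
```

With the settings described above, the call would look like build_vocabulary(all_descriptors, num_words=200, fraction=0.5), where all_descriptors (a hypothetical name) holds one list of descriptors per training image.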
Vocabulary in hand, we represent each training image as a histogram -- for each feature extracted, we decide which word in our vocabulary it is closest to, and add a vote to that bin. The histograms of our labeled training images are then used to train a one-vs-all SVM classifier: one SVM per category, each separating that category from all the others.
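The histogram and the one-vs-all decision can be sketched as follows. Again this is only an illustration under stated assumptions: the names bag_of_words_histogram and predict_one_vs_all are made up for the example, the histogram holds raw vote counts, and the per-category linear classifiers (a weight vector and a bias each) are assumed to have been trained already, which this sketch does not do.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(descriptors, vocabulary):
    """Count, for each visual word, how many descriptors are closest to it."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(d, vocabulary[i]))
        histogram[nearest] += 1  # one vote for the nearest word
    return histogram


def predict_one_vs_all(histogram, classifiers):
    """Pick the category whose linear classifier gives the highest score.

    `classifiers` maps a category label to a hypothetical, already
    trained (weights, bias) pair, one pair per category.
    """
    def score(label):
        weights, bias = classifiers[label]
        return sum(w * h for w, h in zip(weights, histogram)) + bias

    return max(classifiers, key=score)


# Toy usage: a two-word vocabulary and two made-up category classifiers.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[0.1, 0.2], [9.0, 9.5], [0.3, 0.0]]
histogram = bag_of_words_histogram(descriptors, vocabulary)
classifiers = {
    "forest": ([1.0, -1.0], 0.0),
    "street": ([-1.0, 1.0], 0.0),
}
label = predict_one_vs_all(histogram, classifiers)
```

Two of the three toy descriptors fall nearest the first word, so the histogram is [2, 1] and the made-up "forest" classifier scores highest.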
The baseline implementation reaches an accuracy of 60.87%. Notably, the forest and suburb images were very easy to classify, while living room, kitchen, and bedroom images were often confused with one another. This is not a surprise, as many bedrooms functionally serve as kitchens, kitchens as living rooms, living rooms as bedrooms, etc. InsideCity also seems to be confused with street; semantically, streets are inside cities, and looking at some images, both groups feature tall buildings and windows.