Project 3: Scene recognition with bag of words

Lu Zeng (lz7) - Fall 2011

Categorization is a problem germane to many domains. For writers of email applications, the problem is determining whether a given email message is spam, ordinary, or "important". For students in introductory jazz courses, the problem might be determining whether a 30-second clip is the work of Jelly Roll Morton or Dizzy Gillespie. In this project, the problem is determining which of 15 scene categories a given image belongs to.

Forest scene and mountain scene from Francis Parker School. Or are they all 'indoor'?

The process is the same for each of the problems above. An email, clip, or scene is reduced to a set of features -- for the jazz clip, this might amount to a student identifying the tempo and instrumentation. For emails, features are single words or n-grams ("buy now", "Nigeria").

The music appreciation student is probably taught what to listen for. For images, we don't know a priori what the features are. Instead, we can extract a great number of features, cluster them with k-means, and use the cluster centers as our 'vocabulary.' The features here are SIFT descriptors. In my implementation, I extracted a dense set using vl_dsift, with a bin size of 4 and a step of 8 pixels, and grouped these features into 200 words for the vocabulary. To speed up calculations, I randomly sampled only half the features from each image.
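A minimal sketch of the vocabulary step, in plain Python rather than the VLFeat calls used in the project. It assumes the descriptors have already been extracted and are given as lists of numbers; the names build_vocabulary, nearest_index, and squared_distance are hypothetical, and the fixed iteration count and the sampling from the pooled feature list (rather than per image) are simplifications of mine.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def build_vocabulary(features, vocab_size, iterations=10,
                     sample_fraction=0.5, seed=0):
    """Cluster a random subset of features into vocab_size words (k-means)."""
    rng = random.Random(seed)

    # Assumption: subsample the pooled features to cut down on computation.
    n_keep = max(vocab_size, int(len(features) * sample_fraction))
    sample = rng.sample(features, min(n_keep, len(features)))

    # Start from vocab_size randomly chosen features as the initial centers.
    centers = rng.sample(sample, vocab_size)

    for _ in range(iterations):
        # Assignment step: put each feature in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for feature in sample:
            clusters[nearest_index(feature, centers)].append(feature)

        # Update step: move each center to the mean of its members.
        # Empty clusters keep their previous center.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]

    return centers
```

With real SIFT descriptors each feature would be a 128-dimensional vector and vocab_size would be 200; the sketch works the same way on toy 2-D points.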

Vocabulary in hand, we represent each training image as a histogram -- for each feature extracted, we decide which word in our vocabulary it is closest to, and add a vote to that bin. The histograms of our labeled training images are then used to train a one-vs-all SVM classifier.
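The voting step and the one-vs-all decision can be sketched as follows. This is an illustration under assumptions, not the project code: the function names are hypothetical, normalizing the histogram is my choice (the text above does not say whether the counts were normalized), and the classifiers are assumed to be already-trained linear models given as a (weights, bias) pair per category. SVM training itself is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(features, vocabulary, normalize=True):
    """Count, for each vocabulary word, how many features are closest to it."""
    histogram = [0.0] * len(vocabulary)
    for feature in features:
        closest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(feature, vocabulary[i]))
        histogram[closest] += 1.0  # one vote per feature

    # Assumption: scale to sum to 1 so images with more features
    # do not get larger histograms.
    total = sum(histogram)
    if normalize and total > 0:
        histogram = [count / total for count in histogram]
    return histogram


def predict_one_vs_all(histogram, classifiers):
    """Return the label whose linear classifier gives the highest score.

    classifiers maps each label to a (weights, bias) pair, one binary
    "this category vs. everything else" model per category.
    """
    def score(label):
        weights, bias = classifiers[label]
        return sum(w * h for w, h in zip(weights, histogram)) + bias

    return max(classifiers, key=score)
```

In a one-vs-all setup each category gets its own binary classifier, and the image is assigned to whichever classifier is most confident, which is what the max over scores does here.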

The baseline implementation reaches an accuracy of 60.87%. Notably, the forest and suburb images were very easy to classify, while living room, kitchen, and bedroom images were often confused with one another. This is not a surprise, as many bedrooms functionally serve as kitchens, kitchens as living rooms, living rooms as bedrooms, etc. InsideCity also seems to be confused with street; semantically, streets are inside cities, and looking at some images, both groups feature tall buildings and windows.
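Observations like these come from comparing predicted labels against true labels. A small sketch of how accuracy and the most frequent confusions could be tallied, assuming two parallel lists of labels; the function names are hypothetical, and any labels passed in the example usage are toy data, not the project's results.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of images whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused(true_labels, predicted_labels, top=3):
    """Return the most frequent misclassified (true, predicted) pairs."""
    counts = confusion_counts(true_labels, predicted_labels)
    errors = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    errors.sort(key=lambda item: item[1], reverse=True)
    return errors[:top]
```

For example, most_confused(true_labels, predicted_labels, top=1) returns the single (true, predicted) pair that accounts for the most mistakes, which is how a kitchen/bedroom mix-up would surface.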