Project 3 (Scene recognition with bag of words)
- Emanuel Zgraggen (ez), October 24th, 2011
Introduction
The task of this assignment is to implement the baseline method for scene recognition presented in Lazebnik et al. 2006. The algorithm tries to classify scenes into one of 15 categories by training and testing on the 15 scene database introduced in the same paper.
Approach
Baseline
The baseline algorithm collects many densely sampled SIFT features from all the training images.
By running k-means, these features are clustered into M visual words, the vocabulary. For each
training image we then compute a histogram of how often each visual word occurs. This gives us an M-dimensional
representation ("bag of words") of every training image. We then train one 1-vs-all classifier (linear SVM) per scene category
based on the observed "bag of words" representations of all the training images. To classify a test image we compute its "bag of words"
representation and evaluate all of our trained classifiers. The test image is assigned to the scene category of the classifier
with the highest confidence.
The general idea is that similar images will have similar distributions of visual words.
The vocabulary size parameter M has a significant influence on the performance of this approach. Different values have been tested to find a good trade-off between speed and accuracy.
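The vocabulary and histogram steps above can be sketched as follows. This is a minimal numpy-only illustration: a plain k-means loop stands in for whatever clustering library the project actually uses, and the feature data, vocabulary size, and iteration count are made-up toy values.

```python
import numpy as np

def build_vocabulary(features, M, iters=10, seed=0):
    """Cluster descriptors into M visual words with a plain k-means loop."""
    rng = np.random.default_rng(seed)
    words = features[rng.choice(len(features), M, replace=False)]
    for _ in range(iters):
        # assign every feature to its nearest visual word
        d = np.linalg.norm(features[:, None, :] - words[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each word to the mean of its assigned features
        for m in range(M):
            if np.any(labels == m):
                words[m] = features[labels == m].mean(axis=0)
    return words

def bow_histogram(features, words):
    """M-dimensional normalized histogram of visual-word occurrences."""
    d = np.linalg.norm(features[:, None, :] - words[None, :, :], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(words))
    return counts / counts.sum()

rng = np.random.default_rng(1)
feats = rng.normal(size=(500, 128))  # toy stand-in for dense SIFT descriptors
vocab = build_vocabulary(feats, M=20)
h = bow_histogram(rng.normal(size=(80, 128)), vocab)
```

The resulting histograms are what the 1-vs-all linear SVMs are trained on.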
GIST
This approach uses the GIST descriptor as a global representation of an image. The algorithm
does not compute a vocabulary, but instead trains 1-vs-all classifiers (linear SVM) for each scene category
based on the GIST representation of all the training images. To classify a test image we calculate its GIST descriptor
and evaluate all of our trained classifiers.
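The decision rule is the same here as in the baseline: evaluate every 1-vs-all classifier and keep the most confident one. A small sketch, with hypothetical hand-picked weights standing in for trained linear SVMs:

```python
import numpy as np

def classify(x, W, b, categories):
    """Evaluate every 1-vs-all linear classifier (rows of W) on descriptor x
    and return the category whose classifier reports the highest confidence."""
    confidences = W @ x + b
    return categories[int(np.argmax(confidences))]

categories = ["forest", "kitchen", "highway"]
# toy weights: each row is one 1-vs-all classifier (hypothetical values)
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.zeros(3)
label = classify(np.array([0.2, 0.9]), W, b, categories)  # -> "kitchen"
```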
Spatial Pyramid
This approach works similarly to the baseline method. The only difference is how the histograms of images
are computed. Instead of computing one overall histogram, we subdivide the image into
increasingly fine spatial bins and compute a histogram for each bin. The algorithm weights all the
histograms according to their level and concatenates them into one large vector, which is used
to train our 1-vs-all classifiers. With levels numbered 0 through L, the histograms of a bin at level l
are weighted by the following formula: 1 / 2^(L - l).
The dimension of the final concatenated histogram is: M * 1/3 * (4^(L + 1) - 1).
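The pyramid construction can be sketched as below. This is an illustrative numpy version under assumed conventions (feature positions normalized to [0, 1), level l split into a 2^l x 2^l grid, the 1 / 2^(L - l) weighting from the text); the actual implementation may differ in detail.

```python
import numpy as np

def pyramid_histogram(positions, word_ids, M, L):
    """Weighted, concatenated bag-of-words histograms over a spatial pyramid.

    positions: (N, 2) feature coordinates normalized to [0, 1)
    word_ids:  (N,) visual-word index of each feature
    Levels run 0..L; level l splits the image into a 2^l x 2^l grid and its
    histograms are weighted by 1 / 2^(L - l).
    """
    parts = []
    for l in range(L + 1):
        g = 2 ** l                                    # grid size at this level
        cell = np.minimum((positions * g).astype(int), g - 1)
        for row in range(g):
            for col in range(g):
                in_bin = (cell[:, 0] == row) & (cell[:, 1] == col)
                h = np.bincount(word_ids[in_bin], minlength=M)
                parts.append(h / 2 ** (L - l))        # level weight
    return np.concatenate(parts)

rng = np.random.default_rng(0)
M, L = 10, 2
vec = pyramid_histogram(rng.random((200, 2)), rng.integers(0, M, 200), M, L)
```

The length of `vec` matches the dimension formula above: sum over l of 4^l bins times M, i.e. M * (4^(L+1) - 1) / 3.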

Combination of Approaches
Different combinations of approaches have been tested. To combine two methods, the algorithm
trains two sets of 1-vs-all classifiers, one for each method.
A test image is assigned to the scene category for which the mean of the classifier confidences is highest.
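The averaging step can be sketched as follows, with made-up confidence values standing in for the outputs of, say, the baseline and GIST classifier sets:

```python
import numpy as np

def combine(conf_a, conf_b, categories):
    """Average the confidences of two sets of 1-vs-all classifiers and
    pick the category with the highest mean confidence."""
    mean = (np.asarray(conf_a) + np.asarray(conf_b)) / 2.0
    return categories[int(np.argmax(mean))]

categories = ["forest", "kitchen", "highway"]
# hypothetical per-category confidences from two different methods
label = combine([0.1, 0.6, -0.2], [0.3, 0.4, 0.1], categories)  # -> "kitchen"
```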
Results
All experiments were performed on 4 random splits of the 15 scene database (100 training images / 100 test images).