Project 3 (Scene recognition with bag of words)

- Emanuel Zgraggen (ez), October 24th, 2011

Introduction

The task given by this assignment is to implement the baseline method for scene recognition presented in Lazebnik et al. 2006. The algorithm will try to classify scenes into one of 15 categories by training and testing on the 15 scene database introduced by Lazebnik et al. 2006.

Approach

Baseline
The baseline algorithm collects many densely sampled SIFT features from all training images. Running k-means clusters these features into M visual words, the vocabulary. For each training image, a histogram counting how often each visual word occurs is computed, giving an M-dimensional representation ("bag of words") of every training image. We then train a 1-vs-all classifier (linear SVM) for each scene category on the observed "bag of words" representations of all training images. To classify a test image, we compute its "bag of words" representation and evaluate all of our trained classifiers; the test image is assigned to the scene category of the classifier with the highest confidence. The general idea is that similar images have similar distributions of visual words.
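The histogram step above can be sketched as follows. This is an illustrative Python/NumPy sketch (not the report's actual implementation): each descriptor is assigned to its nearest vocabulary word and the resulting counts are normalized.

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    """Map local descriptors to a normalized visual-word histogram.

    descriptors: (N, D) array of dense SIFT descriptors from one image
    vocabulary:  (M, D) array of k-means cluster centers (the visual words)
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()   # normalize so histograms are comparable
```

Normalizing makes images with different numbers of sampled features comparable before the SVM sees them.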

Example images and their "bag of words" representation (M = 200).

The vocabulary size parameter M has a significant influence on the performance of this approach. Different values have been tested to find a good trade-off between speed and accuracy.

Average accuracies with different M values (4 random splits, 100 training images / 100 test images).

GIST
This approach uses the GIST descriptor as a global representation of an image. The algorithm does not compute a vocabulary; instead, it trains 1-vs-all classifiers (linear SVM) for each scene category based on the GIST representations of all training images. To classify a test image, we compute its GIST descriptor and evaluate all of our trained classifiers.

Spatial Pyramid
This approach works similarly to the baseline method. The only difference is how the image histograms are computed. Instead of computing one overall histogram, we subdivide the image into increasingly fine grids of spatial bins and compute a histogram for each bin. The algorithm weights all the histograms according to their level and concatenates them into one large vector, which is used to train our 1-vs-all classifiers. The weight of each histogram is 1 / 2^(L - l), where L is the finest pyramid level and l is the current level (levels 0 through L, with level l divided into 4^l bins). The dimension of the final concatenated histogram is M * 1/3 * (4^(L + 1) - 1).

Example pyramid with L = 2. The final histogram will be of dimension M * 21.

Combination of Approaches
Different combinations of approaches have been tested. To combine two methods, the algorithm trains two sets of 1-vs-all classifiers, one for each method. A test image is assigned to the scene category for which the mean of the two classifiers' confidences is highest.
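This late-fusion rule is simple to state in code. A minimal Python/NumPy sketch (illustrative only; the per-category decision values are assumed to come from the two trained SVM banks):

```python
import numpy as np

def combine_scores(scores_a, scores_b):
    """Combine two 1-vs-all classifier banks by averaging confidences.

    scores_a, scores_b: arrays of decision values for one test image,
    one entry per scene category, from the two methods being combined.
    Returns the index of the category with the highest mean confidence.
    """
    mean = (np.asarray(scores_a, dtype=float)
            + np.asarray(scores_b, dtype=float)) / 2.0
    return int(mean.argmax())
```

Averaging raw confidences assumes the two classifier banks produce scores on comparable scales; in practice the decision values may need calibration first.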

Results

All experiments were performed on 4 random splits of the 15 scene database (100 training images / 100 test images).

Approach                 Mean Accuracy (%)   Comments
Baseline                 64.02 ± 0.45        Vocabulary size = 200
GIST                     71.15 ± 0.76
Spatial Pyramid          67.45 ± 2.85        L = 2
Baseline + GIST          76.12 ± 1.17        Vocabulary size = 200
GIST + Spatial Pyramid   74.67 ± 1.83        Vocabulary size = 200, L = 2

(Mean confusion matrices were reported per approach alongside these figures.)