Project 3: Scene recognition with bag of words

CS 143: Introduction to Computer Vision

Li Sun


Overview

In this project, I implemented a bag-of-words model for recognizing natural scenes. I also applied the Spatial Pyramid Matching technique described in Lazebnik et al. 2006, which raises the accuracy of the scene classifier from 0.6182 to 0.7100.

The basic flow of the scene classifier is as follows:

  1. SIFT features are extracted at densely sampled keypoints in all training images.
  2. All extracted features are clustered with k-means to build a visual vocabulary.
  3. A histogram of visual-word frequencies is built for each training image.
  4. These histograms are used to train an SVM.
  5. Histograms are built for all testing images in the same way and classified with the trained SVM.
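Step 3 above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `nearest_word` and `bow_histogram` are hypothetical, the vocabulary is assumed to be already available as a list of cluster centres, and toy 2-D points stand in for real 128-dimensional SIFT descriptors.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary centre closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(descriptor, vocabulary[k]))

def bow_histogram(descriptors, vocabulary):
    """Normalised visual-word frequency histogram for one image."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = len(descriptors)
    if total == 0:
        # An image with no descriptors gets an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [counts[k] / total for k in range(len(vocabulary))]

# Toy data (assumption): three visual words and four descriptors in 2-D.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.1, 0.9)]
histogram = bow_histogram(descriptors, vocabulary)
```

Normalising by the number of descriptors keeps histograms comparable between images that yield different numbers of keypoints.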

Performance

Cross-Validation

An easy way to test the scene classifier is to train it on one fixed set of images and measure its accuracy on another fixed set. However, the result of such a test may not reflect the actual performance of the classifier, because accuracy varies with the choice of training and testing sets.

To get a more reliable estimate, I use cross-validation. Performance is tested over 4 iterations, and 100 training and 100 testing images are randomly picked for each iteration. I report the mean accuracy and the standard deviation over the iterations.
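The evaluation loop described above might look like the sketch below. The names `repeated_random_splits` and `toy_evaluate` are hypothetical, and `toy_evaluate` is only a stand-in for training the SVM and measuring its accuracy. The sketch also assumes the sample standard deviation (`statistics.stdev`), since the report does not say which estimator was used.

```python
import random
import statistics

def repeated_random_splits(items, n_train, n_test, n_iterations, evaluate, seed=0):
    """Run evaluate(train, test) on several random splits.

    Returns (mean score, sample standard deviation, list of scores).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        # Sample without replacement so training and testing sets never overlap.
        picked = rng.sample(items, n_train + n_test)
        train, test = picked[:n_train], picked[n_train:]
        scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.stdev(scores), scores

def toy_evaluate(train, test):
    """Stand-in classifier: threshold at the mean training value."""
    threshold = sum(train) / len(train)
    correct = sum((x >= threshold) == (x >= 200) for x in test)
    return correct / len(test)

# Toy data (assumption): 400 items whose true label is "value >= 200".
items = list(range(400))
mean_acc, std_acc, scores = repeated_random_splits(
    items, n_train=100, n_test=100, n_iterations=4, evaluate=toy_evaluate)
```

With only 4 iterations the standard deviation is itself a rough estimate, so small differences between configurations should be read with caution.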

Result

lambda = 0.01, accuracy = 0.7525, standard_deviation = 1.2005

Discussion

Spatial Pyramid Matching

The Spatial Pyramid Matching technique was proposed in Lazebnik et al. 2006. It outperforms a basic bag-of-words model in scene recognition because it retains some of the geometric information in an image, which a basic bag-of-words model discards. Details can be found in the original paper.

From the following two diagrams, we can see that applying Spatial Pyramid Matching raises the accuracy of the classifier from 0.6182 to 0.7100.

Basic bag of words model (accuracy = 0.6182)
Spatial Pyramid Matching (accuracy = 0.7100)
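The idea behind Spatial Pyramid Matching can be illustrated with a short sketch: the image is divided into progressively finer grids, a word histogram is built for each grid cell, and the weighted histograms are concatenated. The function name `spatial_pyramid_histogram` is hypothetical, keypoint positions are assumed to be scaled to [0, 1), and the number of pyramid levels used in this project is not stated, so `levels` is a free parameter here. The level weights follow Lazebnik et al. 2006.

```python
def spatial_pyramid_histogram(points, words, vocab_size, levels=2):
    """Concatenate weighted per-cell word histograms for levels 0..levels.

    points: (x, y) keypoint positions scaled to [0, 1)
    words:  visual-word index of each keypoint
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # grid cells per side at this level
        # Weights from Lazebnik et al. 2006:
        # 1/2^L for level 0, 1/2^(L - l + 1) for level l >= 1.
        if level == 0:
            weight = 1 / 2 ** levels
        else:
            weight = 1 / 2 ** (levels - level + 1)
        hist = [0.0] * (cells * cells * vocab_size)
        for (x, y), word in zip(points, words):
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * vocab_size + word] += weight
        feature.extend(hist)
    return feature

# Toy data (assumption): three keypoints, two visual words, a two-level pyramid.
points = [(0.1, 0.1), (0.9, 0.9), (0.6, 0.2)]
words = [0, 1, 0]
feature = spatial_pyramid_histogram(points, words, vocab_size=2, levels=1)
```

Level 0 is the ordinary bag-of-words histogram, so the basic model is the special case `levels=0`; finer levels record where in the image each word occurs.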


Different lambdas in SVM

Because a linear primal SVM is used to classify scenes, an appropriate value of the parameter lambda must be chosen for the SVM to work well on this task.

First, the meaning of lambda needs to be made clear. Lambda is defined by

lambda = 1/C,

where C is the penalty for each misclassified datum in the training set. Therefore, the smaller lambda is, the larger the penalty the SVM pays for each misclassified datum, and the harder it tries to avoid misclassifying training data.
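The relation lambda = 1/C can be checked numerically. The sketch below assumes the standard hinge-loss form of the linear SVM objective; the report does not state the exact loss used by the solver, so this is an illustration only, and the function names are hypothetical. With lambda = 1/C the two objectives differ only by the constant factor lambda, so they are minimised by the same weights.

```python
import math

def hinge_losses(w, b, samples):
    """Hinge loss max(0, 1 - y * (w.x + b)) for each (x, y), y in {-1, +1}."""
    return [max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
            for x, y in samples]

def objective_lambda(w, b, samples, lam):
    """lambda/2 * ||w||^2 + sum of hinge losses."""
    return 0.5 * lam * sum(wi * wi for wi in w) + sum(hinge_losses(w, b, samples))

def objective_c(w, b, samples, c):
    """1/2 * ||w||^2 + C * sum of hinge losses."""
    return 0.5 * sum(wi * wi for wi in w) + c * sum(hinge_losses(w, b, samples))

# Toy data (assumption): three 2-D samples and an arbitrary weight vector.
samples = [((1.0, 2.0), 1), ((-1.0, 0.5), -1), ((0.2, -0.3), 1)]
w, b, lam = (0.5, -0.25), 0.1, 0.01

lhs = objective_lambda(w, b, samples, lam)
rhs = lam * objective_c(w, b, samples, c=1 / lam)
same_up_to_scale = math.isclose(lhs, rhs)
```

A small lambda therefore weights the misclassification term heavily relative to the regulariser, which is the behaviour described above.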

I experimented with different lambda values 1, 0.1, 0.05, 0.02, 0.01, 0.005, 0.001, 0.0001, 0.00001, as the following diagrams show:

lambda = 1, accuracy = 0.7149, standard_deviation = 0.8308
lambda = 0.1, accuracy = 0.7100, standard_deviation = 1.0927
lambda = 0.05, accuracy = 0.7355, standard_deviation = 0.5781
lambda = 0.02, accuracy = 0.7519, standard_deviation = 0.1089
lambda = 0.01, accuracy = 0.7525, standard_deviation = 1.2005
lambda = 0.005, accuracy = 0.7463, standard_deviation = 1.3306
lambda = 0.001, accuracy = 0.7395, standard_deviation = 0.9559
lambda = 0.0001, accuracy = 0.7300, standard_deviation = 1.0513
lambda = 0.00001, accuracy = 0.7300, standard_deviation = 1.0470

We can see that the scene classifier performs best when lambda is approximately 0.01 (accuracy 0.7525), and that lambda = 0.02 is nearly as accurate (0.7519).
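The selection can be reproduced from the accuracies reported above. This is only a small sketch over the numbers in this report; the variable names are hypothetical.

```python
# Mean accuracy for each lambda, copied from the results listed above.
results = {
    1: 0.7149,
    0.1: 0.7100,
    0.05: 0.7355,
    0.02: 0.7519,
    0.01: 0.7525,
    0.005: 0.7463,
    0.001: 0.7395,
    0.0001: 0.7300,
    0.00001: 0.7300,
}

# Rank the settings from most to least accurate.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
best_lambda, best_accuracy = ranked[0]
runner_up_lambda, runner_up_accuracy = ranked[1]
gap = best_accuracy - runner_up_accuracy
```

The gap between the two best settings is very small, so the choice between lambda = 0.01 and lambda = 0.02 is not clear-cut from accuracy alone.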