In this project, I implemented a bag-of-words model for recognizing natural scenes. The Spatial Pyramid Matching technique described in Lazebnik et al. 2006 is applied, and it significantly improves the performance of our scene classifier.
A basic flow of our scene classifier is as follows:
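A generic bag-of-words flow — extract local descriptors, cluster them into a visual-word dictionary, and represent each image as a histogram of word assignments — can be sketched as below. The descriptor dimension, dictionary size, and the toy k-means are illustrative assumptions, not the exact components used in this project:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_dictionary(descriptors, k=10, iters=20):
    """Toy k-means to build a visual-word dictionary (k and iters are illustrative)."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Map each descriptor to its nearest visual word; return a normalized histogram."""
    dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# toy "image": 200 random 8-dimensional local descriptors
desc = rng.normal(size=(200, 8))
centers = build_dictionary(desc, k=10)
h = bow_histogram(desc, centers)
```

The resulting histogram `h` is the fixed-length image representation that the classifier consumes.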
An easy way to test the performance of our scene classifier is to train it on a fixed set of images and then measure accuracy on another fixed set of images. However, the result of this testing method may not reflect the classifier's actual performance, because it can vary considerably with the particular choice of training and testing sets.
To obtain a result that better reflects the actual performance of our scene classifier, I use cross-validation. The performance is tested over 4 iterations in total, with 100 training and 100 testing images randomly picked for each iteration. We report the average accuracy and its standard deviation.
lambda = 0.01, accuracy = 0.7525, standard deviation = 1.2005
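The cross-validation protocol above can be sketched as follows. The nearest-centroid classifier and the toy data are illustrative stand-ins for the real features and SVM:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_validate(features, labels, train_fn, n_iters=4, n_train=100, n_test=100):
    """Randomly resample train/test splits; report mean accuracy and its std."""
    accs = []
    for _ in range(n_iters):
        idx = rng.permutation(len(labels))
        tr, te = idx[:n_train], idx[n_train:n_train + n_test]
        clf = train_fn(features[tr], labels[tr])
        accs.append(np.mean(clf(features[te]) == labels[te]))
    return np.mean(accs), np.std(accs)

# toy data: two well-separated classes; nearest-centroid stands in for the SVM
X = np.concatenate([rng.normal(-2, 1, (150, 5)), rng.normal(2, 1, (150, 5))])
y = np.array([0] * 150 + [1] * 150)

def train_nearest_centroid(Xtr, ytr):
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return lambda Z: (np.linalg.norm(Z - c1, axis=1)
                      < np.linalg.norm(Z - c0, axis=1)).astype(int)

mean_acc, std_acc = cross_validate(X, y, train_nearest_centroid)
```

Averaging over several random splits smooths out the split-to-split variation described above.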
The Spatial Pyramid Matching technique was proposed in Lazebnik et al. 2006. Its performance in scene recognition is superior to that of a basic bag-of-words model because it retains some of the geometric information in the image, which basic bag-of-words models usually discard. Details about this technique can be found in the original paper.
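A minimal sketch of the Spatial Pyramid Matching feature follows: word histograms are computed over increasingly fine grids and concatenated, with finer levels weighted more heavily, following the weighting scheme in Lazebnik et al. 2006. The wordmap size, dictionary size, and 3-level pyramid are illustrative assumptions:

```python
import numpy as np

def spm_feature(wordmap, n_words, n_levels=3):
    """Concatenate visual-word histograms over a pyramid of grids.
    Level l splits the image into 2^l x 2^l cells; level 0 gets weight
    2^-L and level l > 0 gets weight 2^(l-L-1), as in the original paper."""
    L = n_levels - 1
    feats = []
    for l in range(n_levels):
        weight = 2.0 ** (-L) if l == 0 else 2.0 ** (l - L - 1)
        cells = 2 ** l
        for band in np.array_split(wordmap, cells, axis=0):
            for cell in np.array_split(band, cells, axis=1):
                hist = np.bincount(cell.ravel(), minlength=n_words).astype(float)
                feats.append(weight * hist / wordmap.size)
    f = np.concatenate(feats)
    return f / f.sum()  # L1-normalize the final descriptor

# toy 16x16 wordmap over a 5-word dictionary
rng = np.random.default_rng(0)
wm = rng.integers(0, 5, size=(16, 16))
f = spm_feature(wm, n_words=5)
```

With 3 levels the descriptor has (1 + 4 + 16) cells x 5 words = 105 dimensions; the per-cell histograms are what preserve the coarse geometric layout.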
As the following comparison shows, applying the Spatial Pyramid Matching technique significantly improves the accuracy of our classifier.
| Basic bag-of-words model | Spatial Pyramid Matching |
|---|---|
| accuracy = 0.6182 | accuracy = 0.7100 |
Since we use a linear primal SVM to classify scenes, we need to choose an appropriate value of the parameter lambda so that the SVM works well on our scene-recognition task.
First, we need to clarify the meaning of the parameter lambda. It is defined by

lambda = 1/C,

where C is the penalty assigned to each misclassified datum in the training set. Therefore, the smaller lambda is, the larger the penalty the SVM pays for each misclassified datum, and thus the more strongly the SVM tends to avoid misclassification.
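To illustrate how lambda controls the trade-off between the margin and the misclassification penalty, here is a minimal primal linear SVM trained by subgradient descent on the regularized hinge loss. The toy data, learning rate, and epoch count are assumptions; the project's actual solver may differ:

```python
import numpy as np

def train_linear_svm(X, y, lam, epochs=200, lr=0.01):
    """Minimize (lam/2)*||w||^2 + mean(hinge loss); y must be in {-1, +1}.
    Smaller lam (i.e. larger C = 1/lam) penalizes misclassification more."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1  # points violating the margin
        grad_w = lam * w - (y[mask][:, None] * X[mask]).sum(axis=0) / n
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy two-class data in 2D
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1.5, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
w, b = train_linear_svm(X, y, lam=0.01)
acc = np.mean(np.sign(X @ w + b) == y)
```

With a very large lambda the regularizer dominates and w is pushed toward zero; with a very small lambda the hinge term dominates, which is why very small lambda values risk overfitting the training set.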
I experimented with lambda values of 1, 0.1, 0.05, 0.02, 0.01, 0.005, 0.001, 0.0001, and 0.00001; the results are shown below:
| lambda | accuracy | standard deviation |
|---|---|---|
| 1 | 0.7149 | 0.8308 |
| 0.1 | 0.7100 | 1.0927 |
| 0.05 | 0.7355 | 0.5781 |
| 0.02 | 0.7519 | 0.1089 |
| 0.01 | 0.7525 | 1.2005 |
| 0.005 | 0.7463 | 1.3306 |
| 0.001 | 0.7395 | 0.9559 |
| 0.0001 | 0.7300 | 1.0513 |
| 0.00001 | 0.7300 | 1.0470 |
We can see that the scene classifier performs best when lambda is approximately 0.01.