For this project we implement a basic bag-of-words model to classify scenes into one of 15 categories, training and testing on the 15-scene database introduced in Lazebnik et al. 2006.
Algorithm
The algorithm can be summarized in five steps (a code sketch of the vocabulary and bag-of-words steps follows the list):
- Collect many local features and cluster them into a vocabulary of visual words.
- Represent each training image as a distribution of visual words.
- Train 1-vs-all classifiers for each scene category based on observed bags of words in training data.
- Classify each test image by converting to bag of words representation and evaluating all 15 classifiers on the query.
- Build a confusion matrix and measure accuracy.
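As a reference for the first two steps, the sketch below builds a vocabulary with k-means and converts an image into a histogram of visual words. This is a minimal Python sketch using scikit-learn, not the project code itself, and it assumes a hypothetical extract_local_features(image) helper that returns an (n, d) array of local descriptors (e.g. dense SIFT).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(train_images, vocab_size=200):
    """Cluster local features sampled from the training images into visual words."""
    feats = np.vstack([extract_local_features(img) for img in train_images])  # hypothetical helper
    return MiniBatchKMeans(n_clusters=vocab_size, random_state=0).fit(feats)

def bag_of_words(image, vocab):
    """Represent one image as an L1-normalized histogram of visual-word counts."""
    words = vocab.predict(extract_local_features(image))          # nearest visual word per descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()
```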
Raw Result
The confusion matrix obtained without any additional features or parameter tuning is shown below:
The accuracy is 61.53%.
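The evaluation itself is straightforward; a minimal sketch using scikit-learn's metrics, assuming arrays of true and predicted labels y_test and y_pred:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)   # 15x15 matrix: rows = true class, columns = predicted class
acc = accuracy_score(y_test, y_pred)    # fraction of test images classified correctly
```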
Experiments with Vocabulary Sizes
Vocabulary sizes of 10, 20, 50, 100, 200, 400, and 1000 were tried, and the performance is reported below,
where we see that accuracy increases with vocabulary size up to a point, after which it declines as the vocabulary grows larger.
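The sweep itself is a simple loop; a sketch assuming the build_vocabulary / bag_of_words helpers above and a hypothetical train_and_evaluate(...) that fits the 1-vs-all classifiers and returns test accuracy:

```python
import numpy as np

for vocab_size in [10, 20, 50, 100, 200, 400, 1000]:
    vocab = build_vocabulary(train_images, vocab_size=vocab_size)
    X_train = np.array([bag_of_words(img, vocab) for img in train_images])
    X_test = np.array([bag_of_words(img, vocab) for img in test_images])
    acc = train_and_evaluate(X_train, y_train, X_test, y_test)   # hypothetical helper
    print(f"vocab size {vocab_size}: accuracy {acc:.2%}")
```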
Cross-validation and Parameters Tuning
The original training and testing sets are combined, and for each iteration a new training/testing split is drawn at random from the whole data set. In each of the following experiments, 5 iterations are run and the mean accuracy and standard deviation are reported.
The learning parameter lambda is also tuned by trying 10 values for each train/test split. With this tuning, the accuracy reaches 68.17% at lambda = 0.01.
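A minimal sketch of the re-splitting and lambda sweep, assuming precomputed bag-of-words features X and labels y; scikit-learn's LinearSVC is used here as a stand-in for the original SVM code, with lambda mapped to C = 1/lambda as an assumption about how the regularizer is parameterized:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

results = {}
for lam in np.logspace(-6, 0, 10):                      # 10 candidate lambda values
    accs = []
    for it in range(5):                                 # 5 random train/test splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=it)
        clf = LinearSVC(C=1.0 / lam).fit(X_tr, y_tr)    # 1-vs-all linear SVMs
        accs.append(clf.score(X_te, y_te))
    results[lam] = (np.mean(accs), np.std(accs))        # mean accuracy and std over the 5 runs
```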
Spatial Pyramid
Spatial information is added by implementing the spatial pyramid described in Lazebnik et al. 2006. Rather than a single level, the full pyramid features are used. We tried L = 0, 1, and 2; for each L, we tuned the learning parameter lambda over the range 10^-6 to 1 and kept the value giving the best average performance.
L | Accuracy Mean | Accuracy Standard Deviation
---|---|---
0 (1x1) | 68.17% | 0.9%
1 (2x2) | 73.65% | 0.65%
2 (4x4) | 75.11% | 0.71%
We see that the accuracy increases as spatial information is added to the features.
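A sketch of how the pyramid features can be built, assuming each image's visual-word assignments words and their (x, y) pixel coordinates coords are available as arrays; the level weights from Lazebnik et al. 2006 (1/2^L for level 0, 1/2^(L-l+1) for level l >= 1) are applied here, though the exact weighting used in the project code is an assumption:

```python
import numpy as np

def spatial_pyramid(words, coords, img_w, img_h, vocab_size, L=2):
    """Concatenate per-cell word histograms over pyramid levels 0..L."""
    parts = []
    for l in range(L + 1):
        cells = 2 ** l                                             # cells x cells grid at this level
        weight = 2.0 ** (-L) if l == 0 else 2.0 ** (l - L - 1)     # Lazebnik-style level weights
        cx = np.minimum(coords[:, 0] * cells // img_w, cells - 1).astype(int)
        cy = np.minimum(coords[:, 1] * cells // img_h, cells - 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                hist = np.bincount(words[in_cell], minlength=vocab_size).astype(float)
                parts.append(weight * hist)
    feat = np.concatenate(parts)
    return feat / (feat.sum() + 1e-12)                             # normalize the full pyramid vector
```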
RBF Kernel
The radial basis function (RBF):
K(a, b) = exp(-gamma * ||a - b||^2)
was also used to build the kernel. The gamma parameter was tried at 10^-6, 10^-3, and 1 on the features without spatial information. The best accuracy was 68.8%, which is similar to that of the linear kernel.
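Since scikit-learn's SVC implements exactly this kernel when kernel='rbf', the experiment can be sketched as below, assuming the non-pyramid bag-of-words features X_train/X_test and labels y_train/y_test from the earlier sketches:

```python
from sklearn.svm import SVC

for gamma in [1e-6, 1e-3, 1.0]:
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)   # multi-class handled internally
    print(f"gamma={gamma:g}: accuracy {clf.score(X_test, y_test):.2%}")
```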
Final Result
For the final result, we use spatial pyramid features at L = 2 and the linear kernel with lambda = 0.0022, and run the cross-validation 5 times. The result is
accuracy = 75.11%
standard deviation = 0.71%
and the confusion matrix of one train/test split is shown below:
References
- Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2169-2178, June 2006.
- R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010