For this project we implement a basic bag-of-words model to classify scenes into one of 15 categories, training and testing on the 15-scene database introduced in Lazebnik et al. 2006.
Algorithm
The algorithm can be summarized in five steps (a code sketch of the vocabulary and bag-of-words steps follows the list):
- Collect many local features and cluster them into a vocabulary of visual words.
- Represent each training image as a distribution of visual words.
- Train 1-vs-all classifiers for each scene category based on observed bags of words in training data.
- Classify each test image by converting to bag of words representation and evaluating all 15 classifiers on the query.
- Build a confusion matrix and measure accuracy.
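As a reference for the first two steps, the sketch below builds a vocabulary with k-means and converts an image into a histogram of visual words. This is a minimal Python sketch using scikit-learn, not the project code itself, and it assumes a hypothetical extract_local_features(image) helper that returns an (n, d) array of local descriptors (e.g. dense SIFT).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(train_images, vocab_size=200):
    """Cluster local features sampled from the training images into visual words."""
    feats = np.vstack([extract_local_features(img) for img in train_images])  # hypothetical helper
    return MiniBatchKMeans(n_clusters=vocab_size, random_state=0).fit(feats)

def bag_of_words(image, vocab):
    """Represent one image as an L1-normalized histogram of visual-word counts."""
    words = vocab.predict(extract_local_features(image))          # nearest visual word per descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()
```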
Raw Result
The confusion matrix obtained without any additional features or parameter tuning is shown below:
The accuracy is 61.53%.
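The evaluation itself is straightforward; a minimal sketch using scikit-learn's metrics, assuming arrays of true and predicted labels y_test and y_pred:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)   # 15x15 matrix: rows = true class, columns = predicted class
acc = accuracy_score(y_test, y_pred)    # fraction of test images classified correctly
```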
Experiments with Vocabulary Sizes
Vocabulary sizes of 10, 20, 50, 100, 200, 400, and 1000 were tried, and the performance is reported below,
where we see that accuracy increases with vocabulary size up to a point, after which it declines as the vocabulary grows larger.
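The sweep itself is a simple loop; a sketch assuming the build_vocabulary / bag_of_words helpers above and a hypothetical train_and_evaluate(...) that fits the 1-vs-all classifiers and returns test accuracy:

```python
import numpy as np

for vocab_size in [10, 20, 50, 100, 200, 400, 1000]:
    vocab = build_vocabulary(train_images, vocab_size=vocab_size)
    X_train = np.array([bag_of_words(img, vocab) for img in train_images])
    X_test = np.array([bag_of_words(img, vocab) for img in test_images])
    acc = train_and_evaluate(X_train, y_train, X_test, y_test)   # hypothetical helper
    print(f"vocab size {vocab_size}: accuracy {acc:.2%}")
```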
Cross-validation and Parameters Tuning
The original training and testing sets are combined, and for each iteration a new training/testing split is drawn at random from the whole data set. In each of the following experiments, 5 iterations are run and the mean accuracy and standard deviation are reported.
The learning parameter lambda is also tuned by trying 10 values for each train/test split. With this tuning, the accuracy reaches 68.17% at lambda = 0.01.
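A minimal sketch of the re-splitting and lambda sweep, assuming precomputed bag-of-words features X and labels y; scikit-learn's LinearSVC is used here as a stand-in for the original SVM code, with lambda mapped to C = 1/lambda as an assumption about how the regularizer is parameterized:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

results = {}
for lam in np.logspace(-6, 0, 10):                      # 10 candidate lambda values
    accs = []
    for it in range(5):                                 # 5 random train/test splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=it)
        clf = LinearSVC(C=1.0 / lam).fit(X_tr, y_tr)    # 1-vs-all linear SVMs
        accs.append(clf.score(X_te, y_te))
    results[lam] = (np.mean(accs), np.std(accs))        # mean accuracy and std over the 5 runs
```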
Spatial Pyramid
Spatial information is added by implementing the spatial pyramid described in Lazebnik et al. 2006. Rather than a single level, the full pyramid features are used. We tried L = 0, 1, and 2; for each L, we tuned the learning parameter lambda over the range 10^-6 to 1 and kept the value giving the best average performance.
L | Accuracy Mean | Accuracy Standard Deviation
---|---|---
0 (1x1) | 68.17% | 0.9%
1 (2x2) | 73.65% | 0.65%
2 (4x4) | 75.11% | 0.71%
We see that the accuracy increases as spatial information is added to the features.
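A sketch of how the pyramid features can be built, assuming each image's visual-word assignments words and their (x, y) pixel coordinates coords are available as arrays; the level weights from Lazebnik et al. 2006 (1/2^L for level 0, 1/2^(L-l+1) for level l >= 1) are applied here, though the exact weighting used in the project code is an assumption:

```python
import numpy as np

def spatial_pyramid(words, coords, img_w, img_h, vocab_size, L=2):
    """Concatenate per-cell word histograms over pyramid levels 0..L."""
    parts = []
    for l in range(L + 1):
        cells = 2 ** l                                             # cells x cells grid at this level
        weight = 2.0 ** (-L) if l == 0 else 2.0 ** (l - L - 1)     # Lazebnik-style level weights
        cx = np.minimum(coords[:, 0] * cells // img_w, cells - 1).astype(int)
        cy = np.minimum(coords[:, 1] * cells // img_h, cells - 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                hist = np.bincount(words[in_cell], minlength=vocab_size).astype(float)
                parts.append(weight * hist)
    feat = np.concatenate(parts)
    return feat / (feat.sum() + 1e-12)                             # normalize the full pyramid vector
```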
RBF Kernel
The radial basis function (RBF):
K(a, b) = exp(-gamma * ||a - b||^2)
was also used to build the kernel. The gamma parameter was tried at 10^-6, 10^-3, and 1 on the features without spatial information. The best accuracy was 68.8%, which is similar to that of the linear kernel.
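Since scikit-learn's SVC implements exactly this kernel when kernel='rbf', the experiment can be sketched as below, assuming the non-pyramid bag-of-words features X_train/X_test and labels y_train/y_test from the earlier sketches:

```python
from sklearn.svm import SVC

for gamma in [1e-6, 1e-3, 1.0]:
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)   # multi-class handled internally
    print(f"gamma={gamma:g}: accuracy {clf.score(X_test, y_test):.2%}")
```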
Final Result
For the final result, we use spatial pyramid features at L = 2 and the linear kernel with lambda = 0.0022, and run the cross-validation 5 times. The result is
accuracy = 75.11%
standard deviation = 0.71%
and the confusion matrix of one train/test split is shown below:
References
- Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2169-2178, June 2006.
- R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010