Project 3: Scene recognition with bag of words

CS 143: Introduction to Computer Vision

Hang Su

Email: hangsu@cs.brown.edu
Oct. 24th, 2011
Figure 1: 15-category dataset with 3000 images
Figure 2: Confusion matrix using a 3-level pyramid and an SVM with RBF kernel (accuracy = 0.76)

1 Algorithm

1.1 Pipeline

The task of this project is conceptually straightforward: multi-class classification of given test images. This is a traditional computer vision task using machine learning techniques, and it has many successful applications, e.g., automatic photo tagging and image search.

The main problem in such a task is finding a proper representation of each image, i.e., the feature vector used in the later learning procedure. The learning part, although it affects the result considerably, is not the focus of this project. Figure 3 and Figure 4 are flow charts showing the general pipeline employed in this project. As they show, images are represented as histograms of visual words, and SVMs are used for the learning procedure. The training stage must generate a model (one classifier per class) together with a vocabulary.

Figure 3: Training flow chart
Figure 4: Testing flow chart

1.2 Feature extraction and encoding

First of all, the low-level features of each image are represented by densely sampled SIFT vectors ([4] has shown that dense features work better for scene classification); this throws away the spatial information of each image. The vectors from all the training images are clustered into visual words, which together form a vocabulary. Each test image is likewise first represented by many SIFT vectors, but these vectors are then projected onto the vocabulary and a histogram is built as the image's high-level representation.
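As a minimal sketch of this stage (the helper names are my own, and scikit-learn's k-means stands in for the clustering in the original MATLAB/VLFeat code; dense SIFT extraction is assumed done elsewhere, e.g. with vl_dsift):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptor_list, vocab_size=200):
    # descriptor_list: one (n_i, 128) array of dense SIFT descriptors per
    # training image; pool them all and cluster into visual words.
    kmeans = MiniBatchKMeans(n_clusters=vocab_size, random_state=0)
    kmeans.fit(np.vstack(descriptor_list))
    return kmeans.cluster_centers_            # (vocab_size, 128) word centers

def encode_bow(descriptors, vocabulary):
    # Hard-assign each descriptor to its nearest visual word, then build an
    # L1-normalized histogram over the vocabulary.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()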

This sort of high-level representation (encoding low-level features as a histogram) has proved appropriate for scene recognition, because spatial information is usually not important in this case. However, a later paper [1] proposed a way of introducing weak geometry into the bag-of-words representation through spatial histograms [2]. This approach is implemented in this project, and experiments show a decent improvement in performance.

1.3 Training

Given feature vectors of ~200 dimensions, an SVM is used in this project. Both linear and non-linear kernels were tried; the latter proved better when its parameters were well tuned.
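A minimal sketch of the comparison (scikit-learn in place of the original SVM code; the parameter values here are placeholders, not the tuned ones):

from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score

def compare_kernels(X, y):
    # X: (n_images, dim) histogram features, y: category labels.
    # Both classifiers handle the multi-class case internally.
    linear_acc = cross_val_score(LinearSVC(C=10.0), X, y, cv=2).mean()
    rbf_acc = cross_val_score(SVC(kernel="rbf", gamma=0.1, C=10.0), X, y, cv=2).mean()
    return linear_acc, rbf_acc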

1.4 Testing

In order to achieve reliable knowledge of the performance, testing must never be done on training data. Moreover, the images from which the vocabulary is formed are not suitable for testing either. In practice, 2-fold cross-validation is used for both performance evaluation and parameter selection in this project. This has the advantage over typical k-fold cross-validation that the training and test sets are both large, and every data point is used for both training and validation.
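A sketch of the split generator, under the 100/100-per-category protocol described in Section 2 (the helper is hypothetical):

import numpy as np

def repeated_two_fold(n_per_class=200, n_trials=10, seed=0):
    # Each trial randomly halves a category's images into 100 train / 100 test;
    # both halves serve once as the test set. The vocabulary should be rebuilt
    # from the training half in each fold, per the caveat above.
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        perm = rng.permutation(n_per_class)
        yield perm[:100], perm[100:]
        yield perm[100:], perm[:100]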


2 Result

Figure 2 shows a typical confusion matrix from my implementation, using a 3-level pyramid and an RBF-kernel SVM, with a k-d tree for nearest-feature lookup. Testing is run 10 times using 2-fold cross-validation; in each trial the dataset for each category is randomly partitioned into 100 training images and 100 test images. Figure 5 shows the result of each trial. The average accuracy in this experiment is 0.752 with a standard deviation of only 0.012, so the trained model has both reasonable accuracy and good generalization ability.

Figure 5: Results using cross-validation

3 Discussion

3.1 Spatial pyramid matching

Lazebnik et al. proposed a method in [1] to add spatial information to the traditional bag-of-words approach. The basic idea is that words matched at similar positions are more reliable than others. This is realized through a coarse-to-fine binning strategy with an appropriate weight for each level. In a setting with M bins (visual words) and L+1 levels, the pyramid match kernel is defined (following [1]) as

K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m), \qquad \kappa^L(X_m, Y_m) = \frac{1}{2^L}\,\mathcal{I}^0_m + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}}\,\mathcal{I}^l_m,

where \mathcal{I}^l_m is the histogram intersection at pyramid level l for word m: the weighted sum of intersections over levels is then summed over the different bins (visual words).

The modification to the original pipeline is rather small: the histogram is now one long concatenation of the weighted histograms from the different levels (see the sketch after Figure 6). Figure 6 shows the result of an experiment using different numbers of levels and different vocabulary sizes. As shown in the figure, using 3 levels improves the average accuracy by ~7% over a single level. However, there is no significant increase in performance beyond 3 levels.

Figure 6: Experiment with different pyramid levels and vocabulary sizes
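A minimal sketch of the concatenated pyramid histogram (my own helper, with the level weights of [1]; the feature positions xs, ys are assumed available from the dense SIFT grid):

import numpy as np

def spatial_pyramid(words, xs, ys, width, height, vocab_size, levels=3):
    # words: visual-word index per feature; xs, ys: feature positions.
    # Level l splits the image into 2^l x 2^l cells; level weights follow [1]:
    # 1/2^L for level 0 and 1/2^(L-l+1) for l >= 1, where L = levels - 1.
    L = levels - 1
    parts = []
    for l in range(levels):
        cells = 2 ** l
        cx = np.minimum((xs * cells / width).astype(int), cells - 1)
        cy = np.minimum((ys * cells / height).astype(int), cells - 1)
        weight = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        for i in range(cells):
            for j in range(cells):
                mask = (cx == i) & (cy == j)
                h = np.bincount(words[mask], minlength=vocab_size).astype(float)
                parts.append(weight * h)
    hist = np.concatenate(parts)              # length = vocab_size * (1+4+16+...)
    return hist / max(hist.sum(), 1e-12)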

3.2 Vocabulary size

As shown in Figure 6, vocabulary size also influences performance. However, no significant improvement is observed beyond 200 words (400 words performs almost the same as 200). It should be noted that increasing the vocabulary size dramatically increases training/testing time, so most of my tests were done with 200 words.

3.3 Non-linear SVM

A non-linear SVM with an RBF kernel brings about a 6% increase in performance over the linear SVM. The RBF kernel function is defined as

K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right).

A proper value of σ is important for its performance, as is the parameter λ (= 1/C); the latter is the penalty parameter for outliers, related to the generalization ability of the learned model. To find proper values for both, 2-fold cross-validation is run under different settings; the results are shown in Figure 7. A value of 0.1 seems relatively good for both parameters. The results also show that the SVM's performance is not very sensitive to its parameters, which is one of its good properties.

Figure 7: Experiment with different SVM parameters
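The parameter search can be sketched as follows (scikit-learn notation: its RBF kernel is exp(-gamma * ||x - x'||^2), so gamma = 1/(2σ²) and C = 1/λ in the notation above; the grid values are illustrative):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tune_rbf_svm(X, y):
    # 2-fold cross-validated grid search over the two RBF-SVM parameters.
    grid = {"gamma": [0.01, 0.1, 1.0], "C": [1.0, 10.0, 100.0]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=2)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_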

3.4 Kernel codebook encoding (soft assignment)

This encoding method (see [3]) is also implemented in this project. The general idea is not to assign each feature to a single bin (visual word), but to assign it to all bins with weights determined by the distances:

w_i = \frac{\exp(-d(x, c_i)^2 / 2\sigma^2)}{\sum_{j=1}^{M} \exp(-d(x, c_j)^2 / 2\sigma^2)},

where d(x, c_i) is the distance from feature x to the i-th word center c_i. However, using this strategy actually decreased performance and slowed down the program (soft assignment requires all the distances, so a k-d tree cannot be used).
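A minimal sketch of the soft-assignment encoder (hypothetical helper; the σ value is a placeholder, not a tuned one):

import numpy as np

def encode_soft(descriptors, vocabulary, sigma=100.0):
    # Each descriptor votes for every visual word with a Gaussian weight on
    # its distance to the word center; per-feature weights sum to 1, and the
    # pooled histogram is L1-normalized. Note: needs all pairwise distances.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    hist = w.sum(axis=0)
    return hist / hist.sum()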

3.5 k-d tree

A k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. It is especially useful for fast nearest-neighbor search. [5] recommends that the number of data points N satisfy N ≫ 2^k; otherwise the efficiency is no better than exhaustive search. Although this relationship is clearly violated in this project (SIFT descriptors have 128 dimensions), using a k-d tree still modestly improves both speed and performance. VLFeat apparently does a great job of optimization!
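For illustration, nearest-word assignment with SciPy's k-d tree (a stand-in for the VLFeat kd-tree used in the project):

import numpy as np
from scipy.spatial import cKDTree

def assign_words_kdtree(descriptors, vocabulary):
    # Build the tree once over the word centers, then query all descriptors;
    # for 128-D SIFT the N >> 2^k condition fails, yet an optimized tree is
    # still faster in practice than brute-force distance computation.
    tree = cKDTree(vocabulary)
    _, words = tree.query(descriptors, k=1)   # index of nearest center
    return words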


References

[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In Proc. CVPR, 2006.

[2] K. Grauman and T. Darrell. Pyramid Match Kernels: Discriminative Classification with Sets of Image Features. In Proc. ICCV, 2005.

[3] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The Devil is in the Details: an Evaluation of Recent Feature Encoding Methods. In Proc. BMVC, 2011.

[4] L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In Proc. CVPR, 2005.

[5] P. Indyk. Chapter 39: Nearest Neighbors in High-Dimensional Spaces. In J. E. Goodman and J. O'Rourke (eds.), Handbook of Discrete and Computational Geometry (2nd ed.), 2004.


Last update: Oct. 24th, 2011