CS 143 / Project 3 / Scene Recognition with Bag of Words
The goal of this project is to recognize 15 categories of scenes using different image representations and classifiers, and to compare the performance of each model. The following are the methods I implemented for this project:
- Tiny images representation + K-NN
- Bag of SIFT representation + K-NN
- Bag of SIFT representation + Linear SVM
- Bag of SIFT representation (Spatial Pyramid) + Non-Linear SVM
Tiny images representation + K-NN
In this method, we resize each image to a 16-by-16 tiny image, normalize it to zero mean and unit length, and use these tiny images as the representation.
We calculate the L2 distance between every test image and every training image, and run the K-NN algorithm to vote for the label.
Using this method, we observed that K = 1 gives the best result.
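The two steps above (tiny-image encoding, then K-NN voting on L2 distances) can be sketched as follows. This is a minimal NumPy sketch, not the actual project code: `tiny_image` and `knn_predict` are hypothetical helper names, and block averaging stands in for whatever resampling the real code uses.

```python
import numpy as np

def tiny_image(img, size=16):
    """Downsample a grayscale image to size x size by block averaging,
    then normalize to zero mean and unit length."""
    h, w = img.shape
    # crop so both sides divide evenly, then average each block
    tiny = img[:h - h % size, :w - w % size].reshape(
        size, h // size, size, w // size).mean(axis=(1, 3))
    vec = tiny.ravel().astype(float)
    vec -= vec.mean()                       # zero mean
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec  # unit length

def knn_predict(train_feats, train_labels, test_feats, k=1):
    """Label each test vector by majority vote among its k nearest
    training vectors under L2 distance."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    votes = train_labels[idx]
    return np.array([np.bincount(v).argmax() for v in votes])
```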
K | 1 | 3 | 5 | 10 | 15 |
---|---|---|---|---|---|
Accuracy | .227 | .209 | .212 | .214 | .217 |
Table 1. Different K's and their performances

The confusion matrix for the tiny images representation with the 1-NN classifier.
Bag of SIFT representation + K-NN/Linear-SVM
Algorithm
- Densely sample SIFT features from each training image with step size 8 and bin size 4. From each training image, randomly select 150 SIFT features and add them to a descriptor pool.
- Run the k-means clustering algorithm on the descriptor pool to build a k-word visual vocabulary.
- Represent every training and test image as a histogram of the visual words built in step (2). For each image, we sample SIFT features with step size 8 and bin size 4, then assign each newly observed SIFT feature to its most similar visual word.
- Use K-NN/Linear-SVM to classify each test image.
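The vocabulary-building and histogram-encoding steps above can be sketched as follows. This is a NumPy-only sketch; the project itself uses VLFeat's SIFT and k-means routines, so `build_vocab` and `bow_histogram` here are hypothetical stand-ins operating on already-extracted descriptors.

```python
import numpy as np

def build_vocab(descriptors, k, iters=20, seed=0):
    """Plain Lloyd k-means over the pooled descriptors.
    Returns a (k, d) array of visual words."""
    rng = np.random.default_rng(seed)
    words = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest word (squared L2)
        d = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):
                words[j] = members.mean(axis=0)
    return words

def bow_histogram(descriptors, words):
    """Encode one image's descriptors as an L1-normalized histogram
    over the visual words (nearest-word assignment)."""
    d = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(words)).astype(float)
    return hist / hist.sum()
```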
K-NN classifier
Use K-NN as the classifier. At evaluation time, a test image is assigned the label with the highest vote. We observed that larger K gives better accuracy. Table 2 shows the prediction accuracy for different K's.
K | 1 | 3 | 5 | 10 | 15 |
---|---|---|---|---|---|
Accuracy | .576 | .580 | .591 | .615 | .617 |
Table 2. Different K's and their performances. Vocab size = 400.

The confusion matrix for the bag of SIFT representation with the 15-NN classifier.
Linear-SVM
Instead of the k-NN classifier, we train 1-vs-all linear SVMs. At evaluation time, a test image is assigned the label of the classifier with the highest response. Table 3 shows the prediction accuracy for different vocabulary sizes.
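As a rough illustration of the 1-vs-all scheme, here is a minimal NumPy sketch that trains one hinge-loss linear classifier per class by subgradient descent and predicts with the highest response. This is an assumed training procedure for illustration, not the project's actual SVM code.

```python
import numpy as np

def train_1vall_svms(X, y, lam=0.001, lr=0.1, epochs=500):
    """Train one binary linear SVM per class by subgradient descent
    on the L2-regularized hinge loss. X: (n, d) histograms, y: (n,) labels."""
    n, d = X.shape
    classes = np.unique(y)
    W, b = np.zeros((len(classes), d)), np.zeros(len(classes))
    for c, cls in enumerate(classes):
        t = np.where(y == cls, 1.0, -1.0)        # one-vs-all targets
        w, bias = np.zeros(d), 0.0
        for _ in range(epochs):
            mask = t * (X @ w + bias) < 1        # margin violators
            gw = lam * w - (t[mask][:, None] * X[mask]).sum(axis=0) / n
            gb = -t[mask].sum() / n
            w, bias = w - lr * gw, bias - lr * gb
        W[c], b[c] = w, bias
    return W, b

def predict_1vall(W, b, X):
    """Assign each test image the label whose classifier responds highest."""
    return (X @ W.T + b).argmax(axis=1)
```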
Vocab_size | 400 | 800 |
---|---|---|
Accuracy | .637 | .659 |
Table 3. Different vocab_size and their performances. alpha = 0.001.

The confusion matrix for the bag of SIFT representation with vocab_size = 800 and linear SVMs.
Bag of SIFT representation (Spatial Pyramid) + Non-Linear SVM
Bag of SIFT features represent an image as an orderless collection of features, disregarding all spatial information in a given image. To capture the spatial layout of the features, we use the spatial pyramid method (introduced in Lazebnik et al. 2006) to generate our new image representation. Then, instead of a linear SVM, we use a non-linear SVM as our new classifier. As the paper suggests, we compute the kernel matrix using equation (4) in Lazebnik et al. 2006.
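A minimal sketch of the spatial pyramid histogram and the histogram-intersection (pyramid match) kernel of equation (4) is below, assuming features arrive with coordinates normalized to [0, 1) and precomputed visual-word assignments; the function names are hypothetical, not the project's. Level weights follow Lazebnik et al.: 1/2^L for level 0 and 1/2^(L-l+1) for level l.

```python
import numpy as np

def spatial_pyramid(points, words_idx, vocab_size, L=2):
    """Build the spatial-pyramid histogram of one image.
    points: (n, 2) feature coordinates in [0, 1); words_idx: (n,) visual-word
    index per feature. Level l has 2^l x 2^l cells. Returns one long,
    L1-normalized vector of weighted per-cell histograms."""
    chunks = []
    for l in range(L + 1):
        cells = 2 ** l
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        weight = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        for i in range(cells):
            for j in range(cells):
                sel = (cx == i) & (cy == j)
                h = np.bincount(words_idx[sel], minlength=vocab_size)
                chunks.append(weight * h)
    v = np.concatenate(chunks).astype(float)
    return v / v.sum() if v.sum() > 0 else v

def intersection_kernel(A, B):
    """Histogram-intersection kernel matrix between the rows of A and B.
    On weighted pyramid vectors this realizes the pyramid match kernel."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)
```

The resulting kernel matrix can then be handed to any SVM solver that accepts a precomputed kernel.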
Kernel Matrix

Pyramid match kernel matrix for the training images.
Setup: vocab_size = 400, L = 2 for the spatial pyramid, and densely sampled SIFT features with step size = 8 and bin size = 4.
Tuning parameters
We tune the SVM parameters on a validation dataset to pick the best lambda. Note: all of the following experiments use vocab_size = 400 and densely sampled SIFT features with step size = 8 and bin size = 4.
Accuracy rate as a function of lambda. L = 2.
Here we pick lambda = 0.01 as our regularization weight. We then fix lambda at 0.01 and run the experiment with different pyramid levels. For each level, the result was evaluated by randomly picking 100 training and 100 testing images per iteration.
L | 0 (1X1) | 1 (2X2) | 2 (4X4) | 3 (8X8) |
---|---|---|---|---|
Avg. accuracy | .768 | .797 | .806 | .804 |
Table 4. Different pyramid levels and their performances. Iterations = 5, lambda = 0.001.
We can see from the table above that performance is essentially unchanged when going from L = 2 to L = 3. After tuning the parameters on validation data, we pick lambda = 0.01 and L = 2, which yield the best result.
Different vocabulary size
We also ran the experiment with different vocabulary sizes. We observed that increasing the vocabulary size does not always increase accuracy, which is consistent with the results in Lazebnik et al. 2006.

Mean accuracy rate as a function of vocabulary size. lambda = 0.001.
Vocab_size (M) | 50 | 200 | 400 | 800 |
---|---|---|---|---|
Avg. accuracy | .763 | .792 | .805 | .803 |
Table 5. Different vocabulary sizes and their performances. Iterations = 5, lambda = 0.001.
Cross Validation Final Results
Based on the above experiments, the final setup is lambda = 0.01, L = 2, M = 400. We further change the step size for sampling SIFT features from 8 to 4, and run the vl_dsift function without the 'fast' parameter.
The final result was evaluated by randomly picking 100 training and 100 testing images for 5 iterations.
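The random-split evaluation protocol used throughout (100 training / 100 testing images per iteration, accuracy averaged over iterations) might look like the sketch below; `train_fn` and `predict_fn` are hypothetical placeholders for any of the classifiers above, not names from the project code.

```python
import numpy as np

def avg_accuracy(features, labels, train_fn, predict_fn,
                 n_train=100, n_test=100, iters=5, seed=0):
    """Average accuracy over random train/test splits.
    train_fn(X, y) -> model; predict_fn(model, X) -> predicted labels."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(iters):
        perm = rng.permutation(len(labels))           # fresh random split
        tr, te = perm[:n_train], perm[n_train:n_train + n_test]
        model = train_fn(features[tr], labels[tr])
        pred = predict_fn(model, features[te])
        accs.append((pred == labels[te]).mean())
    return float(np.mean(accs))
```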
