CSCI1430 Project 3: Scene Recognition with Bag of Words

Soravit Beer Changpinyo (schangpi)

1. Introduction

The goal of this project is to classify scenes into one of 15 categories using a bag-of-words model. Bag-of-words models ignore spatial information in the image; in other words, we assume that an image can be represented as an unordered collection of pixels, or of local features computed at those pixels, regardless of where in the image they occur. We implement the baseline method of Lazebnik et al. 2006, which represents each image as a histogram of visual word frequencies and feeds this histogram to an SVM classifier.

2. Algorithm

2.1 Basic Flow

The pipeline of the algorithm is as follows. For each image, I extracted dense SIFT descriptors. I then pooled the features from all training images and clustered them into visual words using K-means. Next, I built a histogram for each image over these visual words, that is, I assigned each feature of the image to its nearest visual word; I used a KD-tree data structure to speed up this nearest-neighbor search. To construct a classifier, I fed the histograms of the training images to an SVM. Finally, I used the classifier to classify the test images based on their histograms.
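A minimal sketch of this pipeline using the VLFeat MATLAB toolbox (vl_dsift, vl_kmeans, vl_kdtreebuild, vl_kdtreequery) is given below. The variable names, the example vocabulary size, the file name, and the histc-based histogram are illustrative assumptions, not a copy of my actual code.

    % --- Build the vocabulary from pooled training features ---
    % all_descrs: 128 x M matrix of SIFT descriptors sampled from all
    % training images (uint8 from vl_dsift, converted to single for K-means).
    vocab_size = 200;                                   % example value
    vocab  = vl_kmeans(single(all_descrs), vocab_size); % 128 x vocab_size cluster centers
    forest = vl_kdtreebuild(vocab);                     % KD-tree over the visual words

    % --- Represent one image as a histogram of visual words ---
    img = single(imread('example_scene.jpg'));              % grayscale image, hypothetical file
    [~, descrs] = vl_dsift(img, 'size', 4, 'step', 8);      % dense SIFT, 128 x N
    words = vl_kdtreequery(forest, vocab, single(descrs));  % nearest visual word per feature
    h = histc(double(words), 1:vocab_size);                 % word-frequency histogram
    h = h / sum(h);                                         % normalize to sum to 1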

2.2 Extra Credit

First, to make sure that the accuracy measurement is trustworthy, I used 5-fold cross-validation across random test/train splits (5 folds is a reasonable choice given how expensive cross-validation is). Second, I trained and validated on these folds in order to tune the learning parameters (vocabulary size and the SVM regularization parameter lambda). I chose the number of training images to be 100 (the maximum). I tuned the sampling rate manually during the first few runs, since cross-validation combined with trying many vocabulary sizes already made the program quite slow; in the end, I chose to sample 800 features per image. A sketch of the fold construction appears below.
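This is a minimal sketch of how the 5 random folds could be constructed; the variable names (image_paths, fold_id) are illustrative, and the actual splitting code may differ in details.

    num_folds  = 5;
    num_images = numel(image_paths);                      % all labeled images
    fold_id = mod(randperm(num_images), num_folds) + 1;   % random fold label 1..5 per image

    for f = 1:num_folds
        test_idx  = (fold_id == f);
        train_idx = ~test_idx;
        % build the vocabulary and train the SVMs on train_idx only,
        % then evaluate on test_idx and record the accuracy for this fold
    end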

Accuracies, along with confusion matrices, are reported in the next section.

3. Results and Discussion

As always, there are a large number of free parameters. One way to avoid excessive computational cost is to optimize one parameter at a time.

3.1 Sampling Rates

I decided to tune the sampling rate first. I fixed the vocabulary size to 20 and tried the following three scenarios (in general, I tried to make sure that the features did not concentrate on one part of the image):

I: nsamples = 200, vl_dsift(img, 'size', 4, 'step', 8)
II: nsamples = 500, vl_dsift(img, 'size', 4, 'step', 8)
III: nsamples = 800, vl_dsift(img, 'size', 4, 'step', 6)
where nsamples is the number of features sampled from each image, and lambda was fixed at 0.1. A sketch of the per-image sampling step is shown below.
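The following sketch illustrates the sampling for scenario III (800 features, bin size 4, step 6). Random subsampling of the dense SIFT columns is one way to keep the retained features spread over the image; my actual selection scheme may differ in details.

    nsamples = 800;
    [~, descrs] = vl_dsift(single(img), 'size', 4, 'step', 6);        % 128 x N dense SIFT
    keep = randperm(size(descrs, 2), min(nsamples, size(descrs, 2))); % random column subset
    descrs = descrs(:, keep);                                         % at most 128 x nsamples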

Scenario        I        II       III
Accuracy (%)    58.13    61.40    62.87

(The confusion matrix for each scenario is shown in the corresponding figure.)

3.2 Vocabulary Sizes and SVM Parameters

I tuned the vocabulary size and the SVM parameters at the same time. For the vocabulary size, I tried 10, 20, 50, 100, 200, and 400. Unfortunately, I did not try larger sizes because the program slowed down considerably as the vocabulary size increased (by my estimate, running K-means on all folds would take roughly 15 and 150 hours for vocabulary sizes of 1000 and 10000, respectively, even though I made every attempt to keep my implementation efficient).

For the SVM parameters, I first set the maximum number of Newton steps to 60 instead of 20, which allows the SVM parameters to get closer to their optimal values. Then, for each vocabulary size, I tried the following values of the regularization parameter lambda: 1e-4, 1e-3, 0.005, 0.01, 0.05, 0.1, 0.5, and 1. The regularization parameter is important because it controls the complexity of the model; adjusting it lets us trade off underfitting against overfitting. A sketch of this grid search is shown below.
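In this sketch, the helper cross_validate(vocab_size, lambda, max_newton_steps) is hypothetical: it stands in for running the full 5-fold pipeline with the given settings and returning the mean validation accuracy.

    vocab_sizes = [10 20 50 100 200 400];
    lambdas     = [1e-4 1e-3 0.005 0.01 0.05 0.1 0.5 1];
    max_newton_steps = 60;                        % raised from 20

    acc = zeros(numel(vocab_sizes), numel(lambdas));
    for i = 1:numel(vocab_sizes)
        for j = 1:numel(lambdas)
            acc(i, j) = cross_validate(vocab_sizes(i), lambdas(j), max_newton_steps);
        end
    end
    [best_acc, best_ind] = max(acc(:));           % best setting over the whole grid
    [bi, bj] = ind2sub(size(acc), best_ind);
    fprintf('best: vocab = %d, lambda = %g, accuracy = %.4f\n', ...
            vocab_sizes(bi), lambdas(bj), best_acc);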

The average accuracy on the validation sets over all folds, for each vocabulary size (rows) and lambda (columns):

Vocab size    0.0001    0.001     0.005     0.01      0.05      0.1       0.5       1
10            0.3873    0.3867    0.3873    0.3867    0.3813    0.3827    0.3700    0.3600
20            0.4827    0.4833    0.4813    0.4860    0.4847    0.4867    0.4660    0.4427
50            0.5500    0.5507    0.5653    0.5680    0.5673    0.5613    0.5093    0.4827
100           0.5407    0.5860    0.6047    0.6087    0.5960    0.5733    0.5260    0.4920
200           0.5587    0.6013    0.6353    0.6387    0.6167    0.5987    0.5267    0.4920
400           0.5847    0.6247    0.6520    0.6467    0.6173    0.5900    0.5180    0.4787


(Figure: the results above plotted with error bars; red lines are standard deviations.)


Changing lambda appears to affect the results more at larger vocabulary sizes.

It is worth mentioning that cross-validation is a more reliable way to determine the parameters of a model than using a single train/validation split, because a model's performance can vary considerably from fold to fold. However, cross-validation is also much more expensive.

3.3 Final Results

Using the model tuned in 3.1 and 3.2, the final average accuracy over all folds is 0.668800, with a standard deviation of 0.014379. This standard deviation appears to be lower than those observed during cross-validation above. The confusion matrix is shown below; it is clearly better than the ones in 3.1. Of course, there is still room for improvement: one could try richer features such as GIST, or more complex kernels in the SVM. Moreover, even though ignoring spatial information does not give poor results, adding such information could lead to much better performance.
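For reference, this is a minimal sketch of how an accuracy and a confusion matrix can be computed from predicted and true category indices (1 through 15). The variable names are illustrative, and the row-normalization and mean-of-diagonal conventions are assumptions that may differ from the actual evaluation code.

    num_categories = 15;
    conf = accumarray([true_labels(:), predicted_labels(:)], 1, ...
                      [num_categories, num_categories]);   % counts: rows = true, cols = predicted
    conf = bsxfun(@rdivide, conf, sum(conf, 2));            % normalize so each row sums to 1
    accuracy = mean(diag(conf));                            % mean per-class accuracy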