Introduction
Summary
The general flow of this project is as follows:
- Collect a lot of features and use k-means to cluster those features into a visual vocabulary.
- For each training image, build a histogram of word frequencies (assigning each feature found in the image to the nearest word in the vocabulary).
- Feed these histograms to an SVM.
- Build histograms for the test images and classify them with the trained SVM.
Notes about my algorithm
My project uses the standard image categories, shown in the image below.

Here is a short description of my algorithm:
- I built my vocabulary by calling vl_sift on every training image and passing the resulting SIFT descriptors to vl_kmeans, which clustered them into as many words as the chosen vocabulary size.
- Next, I created the histograms for the training images: for each training image I built a histogram of word frequencies, assigning each feature found in the image to the nearest word in the vocabulary.
- For the baseline program I fed these histograms to a linear SVM; the function primal_svm took care of training it on the training data.
- The last step is test-image classification: I built a histogram for each test image and compared it against the training data, and the trained SVM turned those comparisons into a class label for each test image. (See the sketch after this list.)
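Here is a minimal sketch of the vocabulary and histogram steps, assuming VLFeat (vl_sift, vl_kmeans, vl_alldist2) is on the path; train_paths and vocab_size are hypothetical placeholder names, and the 1-in-50 descriptor sampling is the variant discussed in the Results section below:

```matlab
% Step 1: cluster sampled SIFT descriptors into a visual vocabulary.
% (train_paths and vocab_size are placeholders; images assumed grayscale.)
all_descs = [];
for i = 1:numel(train_paths)
    img = single(imread(train_paths{i}));         % vl_sift wants single
    [~, descs] = vl_sift(img);                    % descs: 128 x num_features
    all_descs = [all_descs, descs(:, 1:50:end)];  % sample 1 of every 50
end
vocab = vl_kmeans(single(all_descs), vocab_size); % 128 x vocab_size centers

% Step 2: build a word-frequency histogram per training image.
hists = zeros(numel(train_paths), vocab_size);
for i = 1:numel(train_paths)
    img = single(imread(train_paths{i}));
    [~, descs] = vl_sift(img);
    dists = vl_alldist2(vocab, single(descs));    % vocab_size x num_features
    [~, words] = min(dists, [], 1);               % nearest word per feature
    hists(i, :) = histc(words, 1:vocab_size) / numel(words);
end
```

The hists matrix (one normalized histogram per row), together with the class labels, is what gets handed to primal_svm for training.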
Results
I kept the size of the vocabulary constant at 200 words. I started with 20 images per class and used all of the SIFT descriptors to build the vocabulary with k-means; the results weren't great and the runtime was excessive. Next, I tried sampling 1 out of every 50 SIFT descriptors per image, which reduced the runtime considerably without hurting accuracy. Below are the results obtained with 20, 50, and 100 images per class.
Vocabulary size 200, 20 images per class: accuracy = 0.5433
Vocabulary size 200, 50 images per class: accuracy = 0.6227
Vocabulary size 200, 100 images per class: accuracy = 0.6147

Graph comparison

Extra Credit
Experimenting with different vocabulary sizes
From the above tests, I picked vocabulary size 200 and 50 training images per class as my "ideal" settings. Next, I experimented with several vocabulary sizes to measure the performance of my algorithm: 10, 20, 50, 100, 200, 400, and 1000. As mentioned in class, the performance did improve with an increase in vocabulary size, but so did the running time. The accuracy results are shown below.
Vocabulary size 10: accuracy = 0.4047
Vocabulary size 20: accuracy = 0.5267
Vocabulary size 50: accuracy = 0.5847
Vocabulary size 100: accuracy = 0.6073
Vocabulary size 200: accuracy = 0.6147
Vocabulary size 400: accuracy = 0.6373
Vocabulary size 1000: accuracy = 0.6447
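The sweep itself is a small loop; a hedged sketch reusing all_descs from the pipeline sketch above (training and evaluation are elided since they just repeat the earlier steps):

```matlab
% Hypothetical vocabulary-size sweep.
for vocab_size = [10 20 50 100 200 400 1000]
    vocab = vl_kmeans(single(all_descs), vocab_size);
    % ... rebuild the histograms, retrain the SVM, measure accuracy ...
end
```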
Experimenting with non-linear kernels
Most datasets are not linearly separable. Thanks to the kernel trick, we can map the dataset to a higher-dimensional space where the data "magically" becomes linearly separable. A variety of kernels can be used, and the right choice depends on the dataset; it seems to be more of an art than a science: either you have a particular gut feeling, or you go for trial and error and hope you get lucky. I wrote up and tried the kernels specified below. If we let $d = \lVert x - y \rVert$ be the Euclidean distance between feature vectors $x$ and $y$:

Radial basis (Gaussian): $K(x, y) = \exp\left(-\frac{d^2}{2\sigma^2}\right)$

Exponential kernel: $K(x, y) = \exp\left(-\frac{d}{2\sigma^2}\right)$

Cauchy kernel: $K(x, y) = \dfrac{1}{1 + d^2 / \sigma^2}$
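For illustration, a hedged sketch of how these kernel matrices can be computed from the histogram features; X (an N x D matrix, one histogram per row) and sigma are placeholder names, and pdist2 comes from MATLAB's Statistics Toolbox:

```matlab
% Pairwise kernel matrices over histogram features.
D = pdist2(X, X);                        % Euclidean distances d(i, j)
K_rbf    = exp(-D.^2 ./ (2 * sigma^2));  % radial basis (Gaussian)
K_exp    = exp(-D    ./ (2 * sigma^2));  % exponential
K_cauchy = 1 ./ (1 + D.^2 ./ sigma^2);   % Cauchy
```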

Performance
Radial Basis
I first used the following parameters: 20 images per class for training and testing, 200 words, and sigma = 1. With a linear SVM and the same parameters I originally got an accuracy of 0.5433 (shown above); with the RBF kernel, the accuracy rose to 0.5500.
Next I tried 50 images per class, 400 words, and sigma = 1, and got an accuracy of 0.5827. Originally, with a vocabulary size of 200 and 50 images per class, I got an accuracy of 0.6227, so the linear SVM outperformed the non-linear one. One of the main reasons is probably that sigma had not been finely tuned. Next, I tried different sigmas to see which one was most suitable for the dataset.
Testing with different sigmas
I reused the 400-word vocabulary with 50 images per class. Using this vocabulary, I tested the performance of the SVM with the following sigmas: 0.1, 0.5, 1.5, and 3, with the hope of identifying the most suitable value. Below are the accuracy results:
For sigma = 0.1, accuracy = 0.0973. This is clearly not the right choice of sigma.
For sigma = 0.5, accuracy = 0.2560. Now it's getting better, but it is still underperforming.
For sigma = 1, we already know the accuracy is 0.5827.
For sigma = 1.5, accuracy = 0.6733. This is the best score so far; things are getting exciting!
Let's try increasing sigma further.
For sigma = 2, accuracy = 0.6600.
Given this decrease in performance, I estimate that the sweet spot for sigma lies somewhere close to 1.5.
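A sweep like this is easy to script; a hedged sketch, reusing the distance matrix D from the kernel snippet above (the training-and-scoring step is left as a comment since it depends on the SVM interface):

```matlab
% Hypothetical sigma sweep for the RBF kernel.
for sigma = [0.1 0.5 1 1.5 2]
    K = exp(-D.^2 ./ (2 * sigma^2));  % RBF kernel matrix for this sigma
    % ... train the kernel SVM on K and measure test accuracy ...
end
```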