Project Description
In this project we used a dense sampling of SIFT features over a dataset of scene images to create a bag of words model for each image. The models for a set of training images were used to train an SVM. This SVM was then used to classify test images distributed over 15 different categories of scenes.
In this writeup I will show how I built the original vocabulary, trained two different one-vs-all SVMs with different kernel functions, tested my classifiers using cross-validation, and made a simple but useful speed-up over the stencil code.
Building a Vocabulary
In order to make a bag of words model for each image in the training dataset, a vocabulary of words had to be defined. To do this, I pooled into one dataset all of the SIFT features sampled from all of the 1500 training images. The SIFT features were spaced 8 pixels apart throughout each image and had a bin size of 4.
Once this set of approximately 1.5M SIFT features was collected, I found 200 cluster centers for the space of features using the k-means implementation provided by VLFeat's vl_kmeans. I then saved the vectors of these 200 cluster centers in a vocabulary matrix.
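A minimal sketch of this vocabulary step, assuming VLFeat is on the path and that image_paths is a cell array holding the paths of the 1500 training images (both names are placeholders, not stencil identifiers), might look like:

```matlab
function vocab = build_vocabulary(image_paths, vocab_size)
    % Pool densely sampled SIFT descriptors from every training image.
    all_features = [];
    for i = 1:length(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        % One descriptor every 8 pixels, with a bin size of 4
        [~, descriptors] = vl_dsift(single(img), 'Step', 8, 'Size', 4);
        all_features = [all_features, single(descriptors)]; %#ok<AGROW>
    end
    % Cluster the pooled descriptors into vocab_size visual words;
    % vocab is a 128 x vocab_size matrix of cluster centers.
    vocab = vl_kmeans(all_features, vocab_size);
end
```

With vocab_size = 200 this produces the 200-word vocabulary matrix described above; in practice the pooled descriptor matrix is large, so subsampling features before clustering is a common memory-saving variation.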
Calculating Image Bag of Words
In order to efficiently compare images in the dataset, a bag of words had to be calculated for each image. This was done by densely sampling the SIFT features of the image and comparing each feature to the previously calculated vocabulary: each feature was assigned a 'word' number based on the index of the closest feature vector in the vocabulary. Finally, a histogram over all of the words in the image was built and normalized. This normalized histogram was the 'bag of words' model for the image.
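A sketch of this step, again assuming VLFeat and the vocab matrix from the previous section (the choice of norm for the final histogram is an assumption, since the writeup does not specify it):

```matlab
function bow = get_bag_of_words(image_path, vocab)
    % Densely sample SIFT features with the same parameters as the vocabulary.
    img = imread(image_path);
    if size(img, 3) == 3, img = rgb2gray(img); end
    [~, descriptors] = vl_dsift(single(img), 'Step', 8, 'Size', 4);
    % Distance from every descriptor to every vocabulary word
    dists = vl_alldist2(single(descriptors), single(vocab));
    [~, words] = min(dists, [], 2);            % index of the nearest word
    bow = histc(words, 1:size(vocab, 2));      % word counts per vocabulary entry
    bow = bow(:)' / norm(bow);                 % normalized histogram (L2 assumed)
end
```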
As a speed-up I maintained a cell array of all of the bag of words models I had calculated. This way I didn't have to calculate all of the bag of words models for all of the images in the dataset at once, but I also didn't have to recalculate any histograms after the first time. This made a big difference in the running time when it came to testing different SVMs and running cross-validation.
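A minimal sketch of this caching idea, assuming images are referred to by a stable index (the persistent-variable bookkeeping here is illustrative rather than the stencil's):

```matlab
function bow = cached_bag_of_words(idx, image_path, vocab)
    % Cell array cache so each histogram is computed at most once.
    persistent bow_cache
    if isempty(bow_cache), bow_cache = {}; end
    if idx <= numel(bow_cache) && ~isempty(bow_cache{idx})
        bow = bow_cache{idx};                        % reuse the stored model
    else
        bow = get_bag_of_words(image_path, vocab);   % compute it the first time
        bow_cache{idx} = bow;                        % keep it for later runs
    end
end
```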
Training the SVM
In this project I used the primal_svm code to train both the linear SVM models and, with an RBF kernel, the non-linear SVM models.
Linear SVM
I tried several different regularization parameters for the linear SVM, and found the best-performing one to be lambda = 0.015. The results of using this parameter with a linear SVM are shown in the figures and listings below.
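Below is a rough sketch of the one-vs-all training loop with this lambda. It assumes primal_svm's convention of reading the n-by-d feature matrix from a global variable X and returning a weight vector and bias (the header of primal_svm.m documents the exact interface); train_bows, test_bows, train_labels, and categories are placeholder names, not stencil identifiers.

```matlab
global X
X = train_bows;                      % n x 200 bag-of-words features, one row per image
lambda = 0.015;
num_categories = numel(categories);
W = zeros(size(X, 2), num_categories);
B = zeros(1, num_categories);
for c = 1:num_categories
    % +1 for images of this category, -1 for all others (one-vs-all)
    y = double(strcmp(train_labels, categories{c}));
    y(y == 0) = -1;
    [W(:, c), B(c)] = primal_svm(1, y, lambda);   % first argument 1 => linear SVM
end
% Classify each test image by the most confident of the 15 binary SVMs
scores = test_bows * W + repmat(B, size(test_bows, 1), 1);
[~, best] = max(scores, [], 2);
predicted_labels = categories(best);
```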
Radial Basis Function SVM
The radial basis function kernel was computed using the compute_kernel function from the primal_svm suite of MATLAB code. The essential difference between an RBF kernel and a linear kernel is that the decision boundary created by the SVM is no longer restricted to a hyperplane: the SVM can separate one class of points from all other points with a much more complicated surface.
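For reference, this is a small sketch of the RBF kernel computation itself, not the library's compute_kernel; gamma is a free parameter assumed here, and in the non-linear case it is this kernel matrix, rather than the raw features, that is handed to the SVM solver.

```matlab
function K = rbf_kernel(X1, X2, gamma)
    % Rows of X1 and X2 are bag-of-words vectors.
    % ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, computed for all pairs.
    sq1 = sum(X1 .^ 2, 2);
    sq2 = sum(X2 .^ 2, 2);
    dist2 = bsxfun(@plus, sq1, sq2') - 2 * (X1 * X2');
    K = exp(-gamma * dist2);   % entries near 1 for very similar histograms
end
```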
Testing
In the first test run I simply used a linear SVM with the stencil split of training and test images. Below is the confusion matrix that was produced; this run represents one of the higher accuracies I found overall.
[Figure: confusion matrix for the linear SVM. Accuracy = 63.8%; each category's accuracy is labeled.]
Cross-Validation
To ensure that neither of my SVMs was over-fitting to one particular split of the data into training and test sets, I used a cross-validation scheme of 10 random splits of all the available images in each category. Below are the accuracies from each of the 10 runs with the linear and non-linear SVMs, followed by a sketch of how a split was generated. The non-linear SVM seemed to do slightly better, with the highest accuracy being 64.0% and the lowest 60.6%; the linear SVM ranged from 60.7% to 63.1%.
Linear SVM accuracy over 10 runs:
0.6073, 0.6220, 0.6253, 0.6307, 0.6240, 0.6073, 0.6133, 0.6287, 0.6307, 0.6160

RBF SVM accuracy over 10 runs:
0.6400, 0.6060, 0.6280, 0.6367, 0.6240, 0.6193, 0.6227, 0.6140, 0.6327, 0.6240
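A sketch of how one random split might be generated, assuming all_labels is a cell array holding the category label of every available image; the 100/100 train/test count per category is a guess for illustration, not necessarily the stencil's.

```matlab
num_runs = 10;
acc = zeros(num_runs, 1);
for run = 1:num_runs
    train_idx = [];  test_idx = [];
    for c = 1:numel(categories)
        idx = find(strcmp(all_labels, categories{c}));
        idx = idx(randperm(numel(idx)));            % shuffle this category
        train_idx = [train_idx; idx(1:100)];        %#ok<AGROW> random train images
        test_idx  = [test_idx;  idx(101:200)];      %#ok<AGROW> disjoint test images
    end
    % ... build the bags of words (using the cache), train the SVMs,
    %     classify the test images, then record the run's accuracy:
    % acc(run) = mean(strcmp(predicted_labels, all_labels(test_idx)));
end
```

Because the bag-of-words cache described earlier is keyed by image, only the SVM training and classification have to be repeated for each of the 10 splits, which is what kept the cross-validation runs fast.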