In this project, scene categorization follows the process below (the data is split into training and test images):

1. Build vocabulary: collect SIFT descriptors from the training images and use k-means to cluster them into the desired number of visual words.
2. Build histograms for the training set: for each training image, compute its SIFT descriptors and their distances to each word in the vocabulary, assign each descriptor to the word at minimum distance, and build the histogram of word occurrences for the image.
3. Train a one-vs-all SVM: train a linear SVM per category on the histograms obtained in step 2.
4. Classify every image in the test set with the trained SVM models.
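Steps 1 and 2 above can be sketched in a few lines of NumPy (a minimal sketch: `build_vocabulary` and `bow_histogram` are illustrative names, a plain Lloyd-iteration k-means stands in for whatever k-means implementation is actually used, and SIFT extraction is assumed to happen elsewhere):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster descriptors into k visual words with plain Lloyd k-means."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # squared distance from every descriptor to every center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):            # keep the old center if a cluster empties
                centers[j] = members.mean(0)
    return centers

def bow_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest word; return an L1-normalized histogram."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

# toy usage with random vectors standing in for 128-D SIFT descriptors
descs = np.random.default_rng(1).normal(size=(500, 128))
vocab = build_vocabulary(descs, 20)
h = bow_histogram(descs, vocab)   # 20-bin histogram summing to 1
```

In the full pipeline, one such histogram per training image becomes the feature vector fed to the one-vs-all linear SVM in step 3.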

Although the submitted code can sample a random subset of an image's descriptors, I mostly ran the experiments with the full set of descriptors to avoid the variance introduced by randomization.

Below is a result for a vocabulary built from a random half of the descriptors: accuracy = 0.6193.


Extra credit

 

1. Report performance on different vocabulary sizes:

Eight vocabulary sizes were examined; overall, 400 is the best size. The graph shows the relationship between vocabulary size and accuracy:

 

|V| = 10,    Accuracy = 0.4213
|V| = 40,    Accuracy = 0.5680
|V| = 80,    Accuracy = 0.6047
|V| = 200,   Accuracy = 0.6113
|V| = 400,   Accuracy = 0.6180
|V| = 800,   Accuracy = 0.6067
|V| = 1600,  Accuracy = 0.6033
|V| = 10000, Accuracy = 0.5650

2. Pyramid Match Kernel:

I use a non-linear SVM with the histogram intersection matrix supplied as a precomputed kernel. With vocabulary size 200, accuracy improves to 0.7113.
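This kernel trick can be sketched as follows, assuming scikit-learn's `SVC` with `kernel="precomputed"` (the Dirichlet-sampled toy histograms below merely stand in for real bag-of-words histograms):

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Histogram intersection: K[i, j] = sum_d min(A[i, d], B[j, d])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(-1)

# toy L1-normalized "histograms": class 0 concentrates mass on bin 0,
# class 1 on bin 3
rng = np.random.default_rng(0)
Xtr = np.vstack([rng.dirichlet([8, 1, 1, 1], 40), rng.dirichlet([1, 1, 1, 8], 40)])
ytr = np.array([0] * 40 + [1] * 40)
Xte = np.vstack([rng.dirichlet([8, 1, 1, 1], 10), rng.dirichlet([1, 1, 1, 8], 10)])
yte = np.array([0] * 10 + [1] * 10)

clf = SVC(kernel="precomputed")                    # non-linear SVM on a custom kernel
clf.fit(intersection_kernel(Xtr, Xtr), ytr)        # train-vs-train Gram matrix
pred = clf.predict(intersection_kernel(Xte, Xtr))  # test-vs-train Gram matrix
acc = (pred == yte).mean()
```

The histogram intersection kernel is positive semi-definite, so it is a valid SVM kernel; for multiple scene categories the same precomputed-kernel SVM is trained one-vs-all.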