Project 3: Scene Recognition with Bag of Words

Design Overview

1. build_vocabulary.m:
For each image in the loop, I called vl_dsift to extract dense SIFT descriptors with bin size 4 and step 8, then randomly sampled 100 of those descriptors. After the loop, I clustered all the sampled descriptors with k-means, using k = vocab_size.
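The steps above can be sketched roughly as follows. This is a minimal sketch, not the actual build_vocabulary.m: the function name build_vocabulary_sketch, the variable names, the grayscale conversion, and the use of VLFeat's vl_kmeans for clustering are all assumptions (the write-up only says "k-means"). It also assumes image_paths is a cell array of file names.

```matlab
function vocab = build_vocabulary_sketch(image_paths, vocab_size)
% Hypothetical sketch of the vocabulary-building step.
% image_paths : cell array of image file names (assumed input format)
% vocab_size  : number of k-means clusters (visual words)
% vocab       : 128 x vocab_size matrix of cluster centers

samples_per_image = 100;                 % descriptors kept per image
sampled = cell(1, numel(image_paths));

for i = 1:numel(image_paths)
    img = imread(image_paths{i});
    if size(img, 3) == 3
        img = rgb2gray(img);             % vl_dsift expects a grayscale image
    end
    img = im2single(img);                % vl_dsift expects class single

    % Dense SIFT with bin size 4 and step 8; descriptors is 128 x N.
    [~, descriptors] = vl_dsift(img, 'size', 4, 'step', 8);

    % Randomly keep at most 100 descriptors from this image.
    n = size(descriptors, 2);
    idx = randperm(n, min(samples_per_image, n));
    sampled{i} = single(descriptors(:, idx));
end

% Cluster all sampled descriptors (assumption: vl_kmeans from VLFeat).
all_descriptors = cat(2, sampled{:});
vocab = vl_kmeans(all_descriptors, vocab_size);
end
```

Sampling a fixed number of descriptors per image keeps the k-means input small and gives every image roughly equal weight in the vocabulary.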

2. make_hit.m:
I looped through all the SIFT descriptors of the image. For each descriptor, I found the nearest word in the vocab and added 1 to the histogram row that corresponds to that word.
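The nearest-word counting described above can be sketched as follows. This is an illustrative, vectorized version rather than the actual make_hit.m: the function name make_hist_sketch, the argument layout (one descriptor or word per column), and the use of squared Euclidean distance are assumptions. It does not normalize the histogram, since the write-up does not mention normalization.

```matlab
function hist_col = make_hist_sketch(descriptors, vocab)
% Hypothetical sketch of the histogram step.
% descriptors : D x N matrix, one SIFT descriptor per column
% vocab       : D x K matrix, one visual word per column
% hist_col    : K x 1 vector of counts, one row per visual word

descriptors = double(descriptors);
vocab = double(vocab);
K = size(vocab, 2);

% Squared Euclidean distances, K x N:
% ||v||^2 + ||d||^2 - 2 * v' * d for every word v and descriptor d.
d2 = bsxfun(@plus, sum(vocab .^ 2, 1)', sum(descriptors .^ 2, 1)) ...
     - 2 * (vocab' * descriptors);

% Index of the nearest word for each descriptor (1 x N).
[~, nearest] = min(d2, [], 1);

% Add 1 to the row of the nearest word for every descriptor.
hist_col = accumarray(nearest(:), 1, [K 1]);
end
```

For example, with the two words (0,0) and (10,10) and the descriptors (1,0), (9,10), and (11,9), the first descriptor falls in row 1 and the other two in row 2:

```matlab
h = make_hist_sketch([1 9 11; 0 10 9], [0 10; 0 10]);   % h is [1; 2]
```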

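Below are the accuracies for different vocab sizes, followed by the corresponding confusion matrices. Accuracy increases with vocab size, but the gains flatten out: going from 20 to 100 words adds about 0.10, while going from 100 to 400 adds only about 0.015.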

vocab size	accuracy
20	0.5507
50	0.6100
100	0.6480
200	0.6560
400	0.6630

confusion matrix for vocab size 20

confusion matrix for vocab size 50

confusion matrix for vocab size 100

confusion matrix for vocab size 200

confusion matrix for vocab size 400