Project 3: Scene recognition with bag of words
Project Overview
Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based only on a histogram of the frequency of visual words. Visual words are identified by clustering a large corpus of example features.
The major steps are:
- Collect a lot of features.
- Use k-means to cluster those features into a visual vocabulary.
- For each training image, build a three-level spatial pyramid. Compute a histogram for each region of the pyramid and concatenate them.
- Feed these histograms to an SVM.
- Build a histogram for each test image and classify it with the trained SVM.
Build Vocabulary
In this part, I use k-means to cluster SIFT features into a vocabulary of visual words. I use 100 training images from each scene category. For each image, I extract SIFT features with vl_dsift, then cluster the features gathered from all training images with vl_kmeans. The resulting cluster centers form the vocabulary.
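To make the clustering step concrete, here is a minimal sketch of k-means in plain Python. It is not the project's MATLAB code and does not call VLFeat: the function names, the toy 2-D "descriptors", and the deterministic initialization are all assumptions chosen for illustration (real SIFT descriptors are 128-dimensional, and vl_kmeans offers its own initialization options).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(features, k, iterations=20):
    """Plain Lloyd's algorithm; returns k cluster centers (the 'vocabulary')."""
    # Assumption: deterministic init from evenly spaced samples, kept simple
    # so the toy example is reproducible.
    centers = [list(features[i * len(features) // k]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every feature to its nearest center.
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_center(f, centers)].append(f)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centers[i] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers


# Hypothetical toy data standing in for SIFT descriptors.
toy_features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
                (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
vocabulary = kmeans(toy_features, k=2)
```

Each returned center plays the role of one visual word. In the project itself, the features come from vl_dsift over all training images.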
Make Histogram
In this part, I use vl_kdtreebuild() to index the columns of the vocabulary, then use vl_kdtreequery() to find the nearest visual word for each feature and count the matches to form the histogram. For extra credit, I built a three-level spatial pyramid. At each pyramid level, I compute a histogram for every subregion and finally concatenate them all. The three-level pyramid increases accuracy by about 12 percentage points (from 0.6107 to 0.7273).
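The histogram construction can be sketched as follows, again in plain Python rather than MATLAB. Several things here are assumptions, not details from the project: the function names, a brute-force nearest-word search standing in for the kd-tree query, pyramid levels laid out as 1x1, 2x2, and 4x4 grids, and raw counts with no per-level weighting or normalization.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the closest visual word (brute force instead of a kd-tree)."""
    return min(range(len(vocabulary)),
               key=lambda i: sum((d - v) ** 2
                                 for d, v in zip(descriptor, vocabulary[i])))


def pyramid_histogram(positions, descriptors, vocabulary,
                      width, height, levels=3):
    """Concatenate per-cell word histograms over all pyramid levels."""
    k = len(vocabulary)
    words = [nearest_word(d, vocabulary) for d in descriptors]
    histogram = []
    for level in range(levels):
        cells = 2 ** level  # assumed grid: 1x1, 2x2, 4x4
        level_hist = [0] * (cells * cells * k)
        for (x, y), word in zip(positions, words):
            # Find the grid cell containing the feature, clamped to the grid.
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            level_hist[(cy * cells + cx) * k + word] += 1
        histogram.extend(level_hist)
    return histogram


# Hypothetical toy input: a 4x4 image, three features, two visual words.
toy_vocabulary = [(0.0, 0.0), (1.0, 1.0)]
toy_positions = [(0.5, 0.5), (3.5, 0.5), (3.5, 3.5)]
toy_descriptors = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.1)]
hist = pyramid_histogram(toy_positions, toy_descriptors, toy_vocabulary,
                         width=4, height=4)
```

Under these assumptions the feature vector has length k * (1 + 4 + 16), so it grows 21-fold relative to the plain bag-of-words histogram. The first k entries are exactly that plain histogram.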
Results
The spatial pyramid improves prediction accuracy from 0.6107 to 0.7273 (three levels). The three-level pyramid also outperforms the two-level pyramid, improving accuracy from 0.6827 to 0.7273.
Basic result: accuracy 0.6107

2-level spatial pyramid result: accuracy 0.6827

3-level spatial pyramid result: accuracy 0.7273
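For reference, the gains implied by these three numbers can be worked out directly. The short sketch below uses only the accuracies reported above. The variable names are illustrative, and it separates the absolute gain in percentage points from the relative gain.

```python
# Accuracies reported in the results above.
accuracy = {"baseline": 0.6107, "2-level": 0.6827, "3-level": 0.7273}

# Absolute gain of the 3-level pyramid over the baseline, in percentage points.
gain_points = (accuracy["3-level"] - accuracy["baseline"]) * 100

# Relative gain over the baseline, in percent.
relative_gain = (accuracy["3-level"] / accuracy["baseline"] - 1) * 100

# Absolute gain from adding the third level to a 2-level pyramid.
third_level_points = (accuracy["3-level"] - accuracy["2-level"]) * 100

print(f"{gain_points:.2f} points, {relative_gain:.1f}% relative, "
      f"{third_level_points:.2f} points from the third level")
```

The absolute gain is about 11.7 percentage points, which corresponds to a relative improvement of roughly 19% over the baseline.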
