Project 3 -- Scene Recognition with Bag of Words

By: Kaijian Gao

Algorithms

The first step is to collect local features for the visual word vocabulary. For each image in the training set, I apply the vl_dsift function (with a size of 4 and a step of 8) to obtain a dense collection of SIFT features. I use the horzcat function to concatenate the SIFT_features of all the images, and apply vl_kmeans to the combined features of all images. vl_kmeans returns the vocabulary we want, with vocab_size = 200 cluster centers serving as the visual words.
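A minimal sketch of this vocabulary-building step is shown below. It assumes VLFeat is on the MATLAB path and that image_paths is a cell array of paths to the (grayscale) training images; the variable names are illustrative rather than the exact ones in my code.

    all_features = [];
    for i = 1:length(image_paths)
        img = single(imread(image_paths{i}));               % vl_dsift expects single precision
        [~, SIFT_features] = vl_dsift(img, 'Size', 4, 'Step', 8);
        all_features = horzcat(all_features, single(SIFT_features));  % 128 x N, grows per image
    end
    vocab_size = 200;
    vocab = vl_kmeans(all_features, vocab_size);             % 128 x 200 matrix of word centers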

Then, for each training image, I create a histogram of word frequencies by assigning each local feature in that image to the closest word in the vocabulary. The closest word is found by computing the distances between the image's features and the vocabulary words with vl_alldist2, then taking the word with the smallest distance. The histogram is then normalized by dividing each bin by the sum over all bins, so that the size of an image (and hence the number of features it yields) does not affect its histogram representation.
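The sketch below shows how a single image would be converted into a normalized histogram, reusing vocab from the previous step; again, the variable names are illustrative.

    img = single(imread(image_paths{i}));
    [~, SIFT_features] = vl_dsift(img, 'Size', 4, 'Step', 8);
    % Distance from every local feature to every vocabulary word.
    D = vl_alldist2(single(SIFT_features), vocab);           % num_features x vocab_size
    [~, assignments] = min(D, [], 2);                        % index of the nearest word per feature
    bow_hist = histcounts(assignments, 1:(vocab_size + 1));  % word counts, 1 x vocab_size
    bow_hist = bow_hist / sum(bow_hist);                     % normalize away the feature count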

After computing the histograms for each image, I train a 1-vs-all classifier for each scene category, based on the bags of words obtained from the training data. Then I convert each test image into a bag-of-words representation (by building a histogram from it in the same way) and classify it with the trained SVMs, assigning the category whose classifier gives the highest score.
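Below is a hedged sketch of the 1-vs-all training and prediction step using VLFeat's vl_svmtrain, a linear SVM solver. It assumes train_image_feats and test_image_feats hold one histogram per row, train_labels is a cell array of category names, and lambda is a regularization constant that would need tuning; all of these names are illustrative assumptions.

    categories = unique(train_labels);                       % the 15 scene category names
    lambda = 0.0001;                                         % example value; would be tuned
    W = zeros(vocab_size, length(categories));
    B = zeros(1, length(categories));
    for c = 1:length(categories)
        % +1 for training images of this category, -1 for everything else.
        binary_labels = 2 * double(strcmp(categories{c}, train_labels(:)')) - 1;
        [W(:, c), B(c)] = vl_svmtrain(train_image_feats', binary_labels, lambda);
    end

    % Each test image receives the label of the classifier with the highest score.
    scores = test_image_feats * W + repmat(B, size(test_image_feats, 1), 1);
    [~, best] = max(scores, [], 2);
    predicted_categories = categories(best);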

Finally, I build a confusion matrix and measure the accuracy of the bag of words classification.
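A possible implementation of this evaluation step, assuming test_labels holds the ground-truth category names and predicted_categories holds the SVM predictions (both illustrative names), is:

    num_categories = length(categories);
    confusion_matrix = zeros(num_categories);
    for i = 1:length(test_labels)
        row = find(strcmp(test_labels{i}, categories));            % true category
        col = find(strcmp(predicted_categories{i}, categories));   % predicted category
        confusion_matrix(row, col) = confusion_matrix(row, col) + 1;
    end
    % Express each row as percentages and average the diagonal to get the accuracy.
    confusion_matrix = 100 * confusion_matrix ./ repmat(sum(confusion_matrix, 2), 1, num_categories);
    accuracy = mean(diag(confusion_matrix)) / 100;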

Results

This is the confusion matrix that resulted from my classifier, presented numerically (entries are per-class percentages, so each row sums to 100):

	confusion_matrix =

    94     1     0     1     1     0     0     0     1     0     1     0     0     1     0
     1    79     0     6     0     4    10     0     0     0     0     0     0     0     0
     0     0    94     0     0     5     0     1     0     0     0     0     0     0     0
     1    12     1    76     2     2     3     1     0     0     1     0     1     0     0
     4     3     1     1    59     0     0     6     8     0     1     0    11     0     6
     6     0     5     2     0    83     2     2     0     0     0     0     0     0     0
     5    27     8     6     0     8    43     1     0     0     2     0     0     0     0
     2     0     0    10    20     3     1    52     4     0     0     2     0     3     3
     0     3     2     0     3     5     0     6    73     1     3     3     0     0     1
     0     0     0     0     0     0     0     0     0    88     2     0    10     0     0
     3     0     0     0     2     3     2     0     3    13    45     3    17     6     3
     6     2     1    11     7     4     5     4    10     2     1    33     3     3     8
     1     0     0     0     9     1     0     0     2    21     8     3    50     0     5
     1     0     0     0     3     3     0     0     8    23    24     2    17     8    11
     1     0     3     3    18     6     0     3     5     4     3     1     3     2    48

After classifying all 3015 test images with a training set of 1500 images, my classifier achieves an overall accuracy of 61.67%. From the confusion matrix, images belonging to the classes "Suburb" and "InsideCity" were classified most accurately (94% each), while those of the class "Industrial" were rarely classified correctly (only 8% accuracy), with many of them mislabeled as "Highway" or "Mountain" scenes.