Scene Recognition with Bag of Words
Kyle Cackett
Scene categorization with a bag of words model ignores spatial information and relies only on image features. The pipeline first extracts a large number of descriptors from a set of images, then clusters the descriptors into a small set of "visual words" that compose a visual vocabulary. Descriptors are then extracted from a set of pre-categorized training images, and each descriptor is matched to the nearest word in the visual vocabulary to build a histogram cataloging the frequency of each visual word in each training image. These histograms are used to train a support vector machine (SVM). Histograms are then constructed for uncategorized images and passed to the SVM for classification. Using a visual vocabulary of 200 words (built by clustering a descriptor set formed from 10 random descriptor samples per image into 200 clusters) and a linear SVM with a lambda of 0.5, I achieved an accuracy of 0.6473.
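The overall pipeline can be summarized with the MATLAB-style sketch below; the helper function names are hypothetical placeholders for the steps detailed in the following sections, not names from the actual code.

    % Hypothetical top-level driver; each helper stands in for a step
    % described below (these names are illustrative only).
    vocab       = build_vocabulary(train_paths, 200);      % dense SIFT + k-means
    train_hists = build_histograms(train_paths, vocab);    % one histogram per image
    test_hists  = build_histograms(test_paths, vocab);
    predictions = classify_svm(train_hists, train_labels, test_hists, 0.5);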
To build the visual vocabulary I used vl_dsift to extract descriptors from 1500 training images (100 images from each category), with a bin size of 4 and a step size of 8 to create the dense descriptor set. I randomly sampled 10 descriptors from each image to create a raw set of 15,000 descriptors, then used k-means to cluster this set into a vocabulary of visual words. I experimented with vocabulary sizes of 100 and 200 to see whether vocabulary size affected performance, and found that the larger vocabulary performed slightly better (see results section). I also experimented with the number of samples taken from each image to build the raw descriptor set; somewhat surprisingly, using 100 samples per image rather than 10 reduced the accuracy of my classifier (see results section).
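The sketch below illustrates this vocabulary-building step, assuming VLFeat is on the MATLAB path; vl_dsift and vl_kmeans are the functions described above, while variable names such as image_paths are illustrative.

    vocab_size = 200;
    samples_per_image = 10;
    all_descriptors = [];
    for i = 1:numel(image_paths)
        % Images are assumed grayscale; convert with rgb2gray first if not.
        img = single(imread(image_paths{i}));
        [~, descriptors] = vl_dsift(img, 'Size', 4, 'Step', 8);
        % Randomly sample a few descriptors per image to keep k-means tractable.
        sample = randperm(size(descriptors, 2), samples_per_image);
        all_descriptors = [all_descriptors, descriptors(:, sample)];
    end
    % Cluster the 15,000 sampled descriptors; the 128-dimensional cluster
    % centers are the visual words.
    vocab = vl_kmeans(single(all_descriptors), vocab_size);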
Training histograms are created by computing the set of descriptors for each pre-categorized image, matching each descriptor to its closest word in the visual vocabulary, and counting the frequency of each visual word in each training image (thus building a histogram). Descriptors were computed using vl_dsift with bin size 4 and step size 8, as when creating the vocabulary. Euclidean distances between descriptors and visual words were computed with the vl_alldist2 function, and each descriptor was assigned to its closest word. Finally, every histogram was normalized by dividing each bin's frequency by the maximum frequency in any bin. A more uniform normalization scheme would be to divide each bin by the total number of descriptors; I did not investigate the performance of this alternative.
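A sketch of histogram construction for a single image is below; vocab is the 128 x vocab_size matrix of visual words from the previous step, and image_path is illustrative.

    img = single(imread(image_path));
    [~, descriptors] = vl_dsift(img, 'Size', 4, 'Step', 8);
    % Distance from every descriptor to every visual word (columns of vocab).
    dists = vl_alldist2(single(descriptors), vocab);
    [~, nearest] = min(dists, [], 2);          % closest word per descriptor
    histogram = histc(nearest, 1:size(vocab, 2));
    histogram = histogram / max(histogram);    % max-frequency normalization
    % The alternative scheme (not evaluated): histogram / sum(histogram).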
Training histograms are passed to the primal_svm function to learn the classifiers. In my implementation I experimented with different parameters for training the linear classifiers, varying the lambda parameter to evaluate its effect on performance (see results section). Once trained, the SVMs are used to classify uncategorized images. Descriptors are extracted from each image as above and used to build a histogram cataloging the frequency of each visual word for each unclassified image. These histograms are then given to the 1-vs-all classifiers, each of which assigns its label with a certain confidence. To classify an image we take the label with the maximum confidence. A confusion matrix is constructed displaying correct classifications along the diagonal and incorrect classifications on the off-diagonals.
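A sketch of this training and classification stage follows. The calling convention assumed for primal_svm (Chapelle's primal SVM code, which reads the training data from a global matrix X and takes a binary label vector and lambda) is an assumption and may differ from the local implementation; treat the call as illustrative.

    % 1-vs-all training: one linear classifier per category.
    global X;
    X = train_hists;                       % n_train x vocab_size histograms
    lambda = 0.5;
    categories = unique(train_labels);
    num_cats = numel(categories);
    W = zeros(size(train_hists, 2), num_cats);
    B = zeros(1, num_cats);
    for c = 1:num_cats
        binary = 2 * double(strcmp(train_labels, categories{c})) - 1;  % +1 / -1
        [W(:, c), B(c)] = primal_svm(1, binary, lambda);   % 1 => linear kernel
    end
    % Confidence of every test histogram under every classifier; the most
    % confident label is the prediction.
    confidences = test_hists * W + repmat(B, size(test_hists, 1), 1);
    [~, best] = max(confidences, [], 2);
    predicted = categories(best);
    % Confusion matrix: rows are true categories, columns predictions.
    [~, truth] = ismember(test_labels, categories);
    confusion = accumarray([truth, best], 1, [num_cats, num_cats]);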
As mentioned above, I experimented with different vocabulary sizes, lambda values, and descriptor sample sizes per image to evaluate the effect of these parameters on performance. With a vocabulary size of 100, a lambda of 0.1, and 10 descriptors sampled from each image, I achieved an accuracy of 0.6027. The confusion matrix for these parameter values is included below.
By simply increasing the lambda parameter to 0.5, I achieved an accuracy of 0.6300. The confusion matrix is again included below.
I then experimented with a larger vocabulary size of 200 words. With a lambda of 0.1 I achieved an accuracy of 0.6267 (left confusion matrix), and after increasing the lambda parameter to 0.5 I achieved my highest accuracy of 0.6473 (right confusion matrix).
Interestingly, after increasing the number of sampled descriptors per image to 100, with a vocabulary size of 200 words and a lambda of 0.5, I saw performance decrease to 0.6233.
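For reference, the parameter sweeps reported above are summarized in the table below; all accuracies are from the runs described in this section.

    Vocabulary size | Lambda | Samples per image | Accuracy
    100             | 0.1    | 10                | 0.6027
    100             | 0.5    | 10                | 0.6300
    200             | 0.1    | 10                | 0.6267
    200             | 0.5    | 10                | 0.6473
    200             | 0.5    | 100               | 0.6233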