Scene Recognition with Bag of Words
Kyle Cackett
Scene categorization with a bag of words model ignores spatial information and relies only on image features. The pipeline first extracts a large number of descriptors from a set of images, then clusters the descriptors into a small set of "visual words" that compose a visual vocabulary. Descriptors are then extracted from a set of pre-categorized training images, and each descriptor is matched to the nearest word in the visual vocabulary to build a histogram cataloging the frequency of each visual word in each training image. These histograms are used to train a support vector machine (SVM). Histograms are then constructed for uncategorized images and passed to the SVM for classification. Using a visual vocabulary of 200 words (built by clustering a descriptor set formed from 10 random descriptor samples per image into 200 clusters) and a linear SVM with a lambda of 0.5, I achieved an accuracy of 0.6473.
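The overall pipeline can be summarized with the MATLAB-style sketch below; the helper function names are hypothetical placeholders for the steps detailed in the following sections, not names from the actual code.

    % Hypothetical top-level driver; each helper stands in for a step
    % described below (these names are illustrative only).
    vocab       = build_vocabulary(train_paths, 200);      % dense SIFT + k-means
    train_hists = build_histograms(train_paths, vocab);    % one histogram per image
    test_hists  = build_histograms(test_paths, vocab);
    predictions = classify_svm(train_hists, train_labels, test_hists, 0.5);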
To build the visual vocabulary I used vl_dsift to extract descriptors from 1500 training images (100 images from each category), with a bin size of 4 and a step size of 8 to create the dense descriptor set. I randomly sampled 10 descriptors from each image to create a raw set of 15,000 descriptors, then used k-means to cluster this set into a vocabulary of visual words. I experimented with vocabulary sizes of 100 and 200 to see whether vocabulary size affected performance, and found that the larger vocabulary performed slightly better (see results section). I also experimented with the number of samples taken from each image to build the raw descriptor set; somewhat surprisingly, using 100 samples per image rather than 10 reduced the accuracy of my classifier (see results section).
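The sketch below illustrates this vocabulary-building step, assuming VLFeat is on the MATLAB path; vl_dsift and vl_kmeans are the functions described above, while variable names such as image_paths are illustrative.

    vocab_size = 200;
    samples_per_image = 10;
    all_descriptors = [];
    for i = 1:numel(image_paths)
        % Images are assumed grayscale; convert with rgb2gray first if not.
        img = single(imread(image_paths{i}));
        [~, descriptors] = vl_dsift(img, 'Size', 4, 'Step', 8);
        % Randomly sample a few descriptors per image to keep k-means tractable.
        sample = randperm(size(descriptors, 2), samples_per_image);
        all_descriptors = [all_descriptors, descriptors(:, sample)];
    end
    % Cluster the 15,000 sampled descriptors; the 128-dimensional cluster
    % centers are the visual words.
    vocab = vl_kmeans(single(all_descriptors), vocab_size);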
Training histograms are created by computing the set of descriptors for each pre-categorized image, matching each descriptor to its closest word in the visual vocabulary, and counting the frequency of each visual word in each training image (thus building a histogram). Descriptors were computed using vl_dsift with bin size 4 and step size 8, as when creating the vocabulary. Euclidean distances between descriptors and visual words were computed with the vl_alldist2 function, and each descriptor was assigned to its closest word. Finally, every histogram was normalized by dividing each bin's frequency by the maximum frequency in any bin. A more uniform normalization scheme would be to divide each bin by the total number of descriptors; I did not investigate the performance of this alternative.
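A sketch of histogram construction for a single image is below; vocab is the 128 x vocab_size matrix of visual words from the previous step, and image_path is illustrative.

    img = single(imread(image_path));
    [~, descriptors] = vl_dsift(img, 'Size', 4, 'Step', 8);
    % Distance from every descriptor to every visual word (columns of vocab).
    dists = vl_alldist2(single(descriptors), vocab);
    [~, nearest] = min(dists, [], 2);          % closest word per descriptor
    histogram = histc(nearest, 1:size(vocab, 2));
    histogram = histogram / max(histogram);    % max-frequency normalization
    % The alternative scheme (not evaluated): histogram / sum(histogram).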
Training histograms are passed to the primal_svm function to learn the classifiers. In my implementation I experimented with different parameters for training the linear classifiers, varying the lambda parameter to evaluate its effect on performance (see results section). Once trained, the SVMs are used to classify uncategorized images. Descriptors are extracted from each image as above and used to build a histogram cataloging the frequency of each visual word for each unclassified image. These histograms are then given to the 1-vs-all classifiers, each of which assigns its label with a certain confidence. To classify an image we take the label with the maximum confidence. A confusion matrix is constructed displaying correct classifications along the diagonal and incorrect classifications on the off-diagonals.
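A sketch of this training and classification stage follows. The calling convention assumed for primal_svm (Chapelle's primal SVM code, which reads the training data from a global matrix X and takes a binary label vector and lambda) is an assumption and may differ from the local implementation; treat the call as illustrative.

    % 1-vs-all training: one linear classifier per category.
    global X;
    X = train_hists;                       % n_train x vocab_size histograms
    lambda = 0.5;
    categories = unique(train_labels);
    num_cats = numel(categories);
    W = zeros(size(train_hists, 2), num_cats);
    B = zeros(1, num_cats);
    for c = 1:num_cats
        binary = 2 * double(strcmp(train_labels, categories{c})) - 1;  % +1 / -1
        [W(:, c), B(c)] = primal_svm(1, binary, lambda);   % 1 => linear kernel
    end
    % Confidence of every test histogram under every classifier; the most
    % confident label is the prediction.
    confidences = test_hists * W + repmat(B, size(test_hists, 1), 1);
    [~, best] = max(confidences, [], 2);
    predicted = categories(best);
    % Confusion matrix: rows are true categories, columns predictions.
    [~, truth] = ismember(test_labels, categories);
    confusion = accumarray([truth, best], 1, [num_cats, num_cats]);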
As mentioned above, I experimented with different vocabulary sizes, lambda values, and descriptor sample sizes per image to evaluate the effect of these parameters on performance. With a vocabulary size of 100, a lambda of 0.1, and 10 descriptors sampled from each image, I achieved an accuracy of 0.6027. The confusion matrix for these parameter values is included below.
By simply increasing the lambda parameter to 0.5, I achieved an accuracy of 0.6300. The confusion matrix is again included below.
I then experimented with a larger vocabulary size of 200 words. With a lambda of 0.1 I achieved an accuracy of 0.6267 (left confusion matrix), and after increasing the lambda parameter to 0.5 I achieved my highest accuracy of 0.6473 (right confusion matrix).
Interestingly, after increasing the number of sampled descriptors per image to 100, with a vocabulary size of 200 words and a lambda of 0.5, I saw performance decrease to 0.6233.
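For reference, the parameter sweeps reported above are summarized in the table below; all accuracies are from the runs described in this section.

    Vocabulary size | Lambda | Samples per image | Accuracy
    100             | 0.1    | 10                | 0.6027
    100             | 0.5    | 10                | 0.6300
    200             | 0.1    | 10                | 0.6267
    200             | 0.5    | 10                | 0.6473
    200             | 0.5    | 100               | 0.6233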