Scene Recognition with Bags of Words

Charles Yeh, Oct. 2011

 

Scene recognition with bags of words begins by building a visual vocabulary. SIFT descriptors are collected from a set of training images, k-means is run on all of the descriptors, and the cluster centers are used as vocab words for training and recognition. A histogram is then created for each training image by finding the closest vocab word to each feature in that image, where each vocab word is a bar on the histogram.
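The vocabulary-building and histogram steps above can be sketched as follows. This is a minimal illustration, not the original implementation (which presumably used dense SIFT; here the descriptors are synthetic 128-D vectors, and scikit-learn's KMeans stands in for whatever clustering routine was actually used):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, vocab_size=400, seed=0):
    """Cluster SIFT descriptors; the cluster centers become the vocab words."""
    km = KMeans(n_clusters=vocab_size, n_init=4, random_state=seed)
    km.fit(descriptors)
    return km.cluster_centers_

def bag_of_words_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest vocab word and count occurrences."""
    # Pairwise squared distances, shape (n_descriptors, vocab_size).
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalize so image size doesn't matter

# Demo on synthetic "descriptors"; real code would extract dense SIFT here.
rng = np.random.default_rng(0)
train_desc = rng.random((500, 128))
vocab = build_vocabulary(train_desc, vocab_size=10)
hist = bag_of_words_histogram(rng.random((50, 128)), vocab)
```

Each image, regardless of how many features it contains, ends up as one fixed-length normalized histogram over the vocabulary.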

These histograms were fed into SVMs, which were then used to classify new images. The following are the histograms for the first three training images with feature size 8, step 16, and vocab size 400. Even though all three images are in the same category, it's interesting that they contain most of the words in the vocabulary. This suggests that it's the counts of each vocab word within a scene, not merely whether certain words appear at all, that define its category. Using feature size 4, step 8, and vocab size 200, I got an accuracy of 62.07%.
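The classification step can be sketched like so. This is a hedged example, not the original code: the histograms and labels here are random placeholders, and scikit-learn's LinearSVC (which trains one-vs-rest linear SVMs internally) stands in for whatever SVM package was actually used:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: each row is an image's normalized vocab histogram,
# each label a scene category. Real inputs come from the BoW step.
rng = np.random.default_rng(1)
n_train, vocab_size = 60, 20
X_train = rng.random((n_train, vocab_size))
y_train = rng.integers(0, 3, size=n_train)  # 3 hypothetical scene categories

# One linear SVM per category (one-vs-rest), trained on the histograms.
clf = LinearSVC(C=1.0, max_iter=2000)
clf.fit(X_train, y_train)
pred = clf.predict(X_train)
accuracy = (pred == y_train).mean()
```

At test time, each image's histogram is scored by every per-category SVM and the highest-scoring category wins; accuracy is then the fraction of test images assigned their true category.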

It was interesting to see that most errors occurred on industrial images, which tended to be confused with highways. Kitchens were also occasionally confused with forests. It's possible that the dominant gradient orientations in each pair of confused categories are similar.


Different Vocab Sizes

I experimented with several vocab sizes (10, 200, 400, and 1000), but only a vocab size of 400 increased accuracy. Mouse-over a size to view the results.

The accuracy increase at 400 vocab words is possibly because 200 words didn't describe the images in enough detail, while 1000 was too many and began to introduce noise and unnecessary dimensions.


Different Feature Sizes and Steps

I also tried different feature sizes and steps, all with a vocab size of 200, and found that a larger size and step yielded better results. Small features may be too small to be distinguishing, so increasing the feature size may have allowed more features unique to specific scenes to emerge.

Mouse-over a feature size and step to view the confusion matrix. Size 4, Step 8 | Size 8, Step 16 | Size 16, Step 32


Size=8, Step=16, variable Vocab

Finally, I combined the two experiments and found that a size of 8, step of 16, and vocab of 400 yielded by far the best results, with an accuracy of 64.67%.

The other results were not as high.

Size 8, Step 16, Vocab 100: Accuracy = 61.13%
Size 8, Step 16, Vocab 200: Accuracy = 62.40%