CS1430 Project 3:
Scene recognition with bag of words

Bryan Tyler Parker

Overview

For this assignment we had to employ the 'bag of words' model for image classification. Simply put, this model extracts features from a given image and builds a histogram of how frequently certain features ('visual words', determined from a training set) occur.

Algorithm

Step 1: Establish a vocabulary of visual words through feature clustering

For this step, I took each image, applied a vl_imsmooth pass to it, and then ran vl_dsift over it. The smoothing simplifies the image a bit, so we are more likely to pick up key features rather than unimportant high-frequency ones. I cumulatively appended each image's feature set to a matrix 'features'.

Once each image had its features extracted, I ran vl_kmeans over them all, binning them according to the desired vocabulary size. This resulted in a set of common features: a vocabulary of visual words.
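
In code, this step looks roughly like the sketch below (image_paths, vocab_size, and the smoothing sigma are placeholders, not necessarily the starter code's names or values):

    % Rough sketch of vocabulary construction; names and sigma are placeholders.
    features = [];
    for i = 1:length(image_paths)
        img = single(imread(image_paths{i}));   % assumes grayscale input
        img = vl_imsmooth(img, 2);              % suppress high-frequency detail
        [~, d] = vl_dsift(img, 'Fast');         % dense SIFT descriptors, 128 x N
        features = [features, single(d)];       % accumulate features across all images
    end
    vocab = vl_kmeans(features, vocab_size);    % 128 x vocab_size cluster centers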

Step 2: Convert all training images into the histogram representation

To convert all training images into the histogram representation, we run code very similar to what we used to build the vocabulary. For each image, I smooth it, run dsift on it, and apply kmeans. Now, however, I take the kmeans result and build a distance matrix between it and the vocabulary (using pdist2). For each feature point, the minimum entry in its row of the distance matrix yields the vocabulary word that point is closest to. From these assignments I build a histogram, then normalize it.
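
A rough sketch of this step for a single image (the per-image kmeans size k and the variable names are my placeholders, not the actual code):

    % Sketch of the histogram step; vocab is the 128 x vocab_size matrix from step 1.
    img = vl_imsmooth(single(imread(path)), 2); % smooth, as in step 1
    [~, d] = vl_dsift(img, 'Fast');
    feats = vl_kmeans(single(d), k);            % per-image clustering; k is a placeholder
    D = pdist2(feats', vocab');                 % k x vocab_size distance matrix
    [~, nearest] = min(D, [], 2);               % closest vocab word per row
    h = histc(nearest, 1:size(vocab, 2));       % count hits per vocab word
    h = h / sum(h);                             % normalized histogram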

Step 3: Train 1-vs-all classifiers

Using our training histogram set, we train 1-vs-all classifiers (SVMs) for each scene category based on observed bags of words in training data. This code was provided for us.
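
That code isn't reproduced here, but a 1-vs-all trainer along these lines might use VLFeat's vl_svmtrain; the lambda value and variable names below are my guesses, not the handout's:

    % Hypothetical sketch of 1-vs-all SVM training; train_hists is
    % vocab_size x num_train, train_labels is a cell array of category names.
    lambda = 0.0001;                            % placeholder regularization strength
    W = zeros(size(train_hists, 1), numel(categories));
    B = zeros(1, numel(categories));
    for c = 1:numel(categories)
        y = -ones(1, size(train_hists, 2));
        y(strcmp(train_labels, categories{c})) = 1; % +1 for this class, -1 for the rest
        [W(:, c), B(c)] = vl_svmtrain(single(train_hists), y, lambda);
    end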

Step 4: Profit!

We can now classify a whole slew of test images and see how well our classifier works. Knowing what each set of images should be classified as, we create a confusion matrix, which shows how often images from each true category were assigned to each predicted category. Ideally, this would be the identity matrix, meaning every image is classified properly.
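
Building the confusion matrix amounts to counting (true, predicted) pairs; a minimal sketch, assuming true_idx and pred_idx hold the category index of each test image:

    % Sketch of the confusion matrix; true_idx and pred_idx are placeholder names.
    n = numel(categories);
    confusion = zeros(n);
    for i = 1:numel(true_idx)
        t = true_idx(i); p = pred_idx(i);
        confusion(t, p) = confusion(t, p) + 1;  % count each (true, predicted) pair
    end
    confusion = bsxfun(@rdivide, confusion, sum(confusion, 2)); % rows become frequencies
    accuracy = mean(diag(confusion));           % diagonal holds correct classifications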

Results

Baseline Implementation

For this and my 'improvement' extra credits, I used a vocabulary size of 50 over 50 images per class, for time/testing reasons.

Size: 50 - Number Images Per Class: 50 - Accuracy: 49.33%

Originally, I ran a vocabulary of 200 over 50 images per class, but with a slightly faulty distance function. I imagine the accuracy improvement from a larger vocabulary would carry over roughly linearly to my extra-credit variants.

Extra Credit: Experiment with features at multiple scales

I ran feature extraction on two additional Gaussian levels and ran kmeans over the combined results. When building the histograms, I did the same thing in the same order.

The idea behind this is to capture features at multiple levels of detail. With each additional Gaussian level, one gains larger, subtler features and loses smaller, more distinct ones.
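
A sketch of the multi-scale extraction, with illustrative sigma values (the actual levels I used may differ):

    % Sketch of multi-scale extraction: pool features over several Gaussian levels.
    % img is the single-precision grayscale image.
    features = [];
    for sigma = [2, 4, 8]                       % illustrative smoothing levels
        level = vl_imsmooth(img, sigma);
        [~, d] = vl_dsift(level, 'Fast');
        features = [features, single(d)];       % pool descriptors from every scale
    end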

Size: 50 - Number Images Per Class: 50 - Accuracy: 51.20%

A minor improvement. Let's try using just one Gaussian level:

Size: 50 - Number Images Per Class: 50 - Accuracy: 47.33%

Worse than our baseline. I imagine this is because the original, baseline scale is the most important: additional Gaussian levels can help when used alongside it, but by themselves they are not as good.

Extra Credit: Use "soft assignment" to assign visual words to histogram bins

For this, I took my resulting distance matrix, sorted each row, and kept the indices of the sort, giving a matrix whose rows effectively rank the vocabulary words by distance. I then squared this rank matrix and summed it column-wise, yielding a vector with one entry per vocabulary word; a low value meant that word related strongly to the image's features. I normalized the vector, inverted it, and used the result as my histogram.
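
A minimal sketch of this rank-weighted scheme, assuming D is the numFeatures x vocab_size distance matrix from pdist2 (the final step is one guess at the inversion described above):

    % Sketch of rank-based soft assignment; D is numFeatures x vocab_size.
    [~, order] = sort(D, 2);                    % vocab words sorted by distance, per feature
    ranks = zeros(size(D));
    for i = 1:size(D, 1)
        ranks(i, order(i, :)) = 1:size(D, 2);   % rank 1 = closest vocab word
    end
    scores = sum(ranks .^ 2, 1);                % low sum = word frequently near features
    scores = scores / sum(scores);              % normalize
    h = max(scores) - scores;                   % invert so strong words get large bins
    h = h / sum(h);                             % final soft histogram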

Size: 50 - Number Images Per Class: 50 - Accuracy: 32.80%

My implementation is worse than the baseline, even after trying a few variations on it.

Extra Credit: Experiment with different vocabulary sizes

Now, a caveat: these runs were done on code with a less effective distance metric, so they are not as good as the revised baseline. Despite this, the relative performance of the different runs should still be interesting. All of these were run over 100 images per class.

Vocab Size: 20 - Accuracy: 14.07%

Vocab Size: 50 - Accuracy: 18.20%

Vocab Size: 200 - Accuracy: 23.40%

Vocab Size: 400 - Accuracy: 22.73%

Vocab Size: 600 - Accuracy: 23.87%

Here we generally see steady improvement, but with diminishing returns: as the vocabulary grows larger and larger, each additional vocabulary entry matters less and less, simply because the extra words begin to "overdescribe" the scene.