Scene recognition with bag of words (Mason McGill | mm28)
The goal of a "bag of words" model of image classification is to sort images by scene category (e.g. forest, mountain, industrial). The pipeline:
Step 1: Establishing a vocabulary of visual words through feature clustering
SIFT represents local image patches as 128-dimensional feature vectors designed to be robust to affine transformation; classic SIFT detects key points where texture changes occur, while the vl_dsift function from the VLFeat.org library, used here, computes the descriptors at key points sampled on a dense grid. I set the "size" parameter of vl_dsift to 4 and the "step" parameter to 8. I also passed in the parameter "fast", which instructs the function to use piecewise-flat, rather than Gaussian, windowing for faster computation; this had little effect on the results. The feature vectors from 10 random key points per image were clustered into 200 clusters using k-means, and the centroids of these clusters were taken to be the "words" that define the relevant features of the data set.
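Below is a minimal MATLAB sketch of this step, assuming the grayscale training images are already loaded into a cell array; the variable names, and the use of VLFeat's vl_kmeans for the clustering, are illustrative rather than taken from the original code.

    vocab_size = 200;                  % number of visual words
    samples_per_image = 10;            % random key points kept per image
    all_features = [];
    for i = 1:numel(images)
        im = single(images{i});        % vl_dsift expects single precision
        [~, descriptors] = vl_dsift(im, 'size', 4, 'step', 8, 'fast');
        % Keep the descriptors of 10 randomly chosen key points.
        idx = randperm(size(descriptors, 2));
        all_features = [all_features, ...
                        single(descriptors(:, idx(1:samples_per_image)))];
    end
    % Cluster the sampled descriptors; the centroids are the visual words.
    vocabulary = vl_kmeans(all_features, vocab_size);   % 128 x 200 matrix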
Step 2: Converting all training images into the histogram representation
For each training image, I classified all of the key points as one of the 200 words defined previously, using nearest-neighbor classification against the cluster centroids. I then represented each image as a histogram of word counts (i.e. bin[1] = number of "word 1"s; bin[2] = number of "word 2"s). The histograms were normalized so that the relative frequencies of the words, rather than the raw counts (which vary with the number of key points in an image), are what get compared.
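A sketch of the histogram computation for one image (illustrative names; "vocabulary" is the 128 x 200 centroid matrix from step 1, and vl_alldist2 is one way to compute the nearest-neighbor distances):

    im = single(im);                   % the image being converted
    [~, descriptors] = vl_dsift(im, 'size', 4, 'step', 8, 'fast');
    % Pairwise distances from every descriptor to every centroid; the
    % closest centroid is the word that key point is classified as.
    dists = vl_alldist2(single(descriptors), vocabulary);
    [~, words] = min(dists, [], 2);
    % Histogram of word counts, normalized so relative frequencies matter
    % rather than the total number of key points.
    h = histc(words, 1:size(vocabulary, 2));
    h = h / sum(h);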
Step 3: Learning a set of one-vs-all classifiers from the training histograms
Using the normalized histograms as feature vectors describing each image, a set of support vector machines was trained, one per scene category, to distinguish between members and non-members of that category.
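One way to implement this step is with VLFeat's vl_svmtrain (a sketch; the original may have used a different SVM package, and the variable names and lambda value are illustrative). "train_hists" is assumed to be a 200 x N matrix of normalized histograms and "train_labels" an N x 1 vector of category indices.

    lambda = 0.0001;                   % regularizer, chosen by tuning
    num_categories = max(train_labels);
    W = zeros(size(train_hists, 1), num_categories);
    B = zeros(1, num_categories);
    for c = 1:num_categories
        % One-vs-all labels: +1 for members of category c, -1 otherwise.
        binary_labels = 2 * double(train_labels == c) - 1;
        [W(:, c), B(c)] = vl_svmtrain(single(train_hists), binary_labels, lambda);
    end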
Step 4: Classifying each test image and building a confusion matrix
Each test image was presented to each of the support vector machines and assigned the category of the machine with the strongest response. From these predictions I constructed the confusion matrix: a 2D histogram of the test images, organized by target category on one axis and predicted category on the other.
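A sketch of the classification and tallying, continuing with the same illustrative names ("test_hists" is a 200 x M matrix of test histograms, "test_labels" an M x 1 column of true categories):

    num_test = numel(test_labels);
    predictions = zeros(num_test, 1);
    for i = 1:num_test
        responses = W' * test_hists(:, i) + B';   % one response per category
        [~, predictions(i)] = max(responses);     % strongest response wins
    end
    % Rows index the true category, columns the predicted category.
    confusion = accumarray([test_labels, predictions], 1, ...
                           [num_categories, num_categories]);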
Step 5: Visualization and scoring
The confusion matrix:
Responses along the diagonal are correct. All others are incorrect. Overall accuracy: 60.07%
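The rendering and scoring take only a few lines (a sketch; normalizing each row by the number of test images in that category turns counts into rates, so the mean of the diagonal gives the overall accuracy when every category has the same number of test images):

    confusion_norm = confusion ./ repmat(sum(confusion, 2), 1, num_categories);
    imagesc(confusion_norm); colormap(gray); axis image;
    xlabel('Predicted category'); ylabel('True category');
    accuracy = mean(diag(confusion_norm));
    fprintf('Overall accuracy: %.2f%%\n', 100 * accuracy);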
Tuning parameters
I used a test set to tune the parameters, and then a separate validation set to confirm that I had not over-fit the data and that the parameters carried over to other images.
I also tested multiple vocabulary sizes, and found that a vocabulary of 200 words does not over-fit the data when using around 50-100 training examples per class.
Using cross-validation, I found that the performance of the system is relatively consistent. With 50 training/test examples per class, rather than 100, the accuracy over 10 randomized runs had a standard deviation of 0.0088.
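The randomized evaluation amounts to rerunning the whole pipeline on fresh splits (a sketch; random_split and run_pipeline are hypothetical helpers standing in for the splitting code and for steps 1-4 above):

    num_runs = 10;
    accuracies = zeros(num_runs, 1);
    for run = 1:num_runs
        % Hypothetical helper: draw 50 training and 50 test examples per class.
        [train_idx, test_idx] = random_split(labels, 50);
        % Hypothetical helper: steps 1-4, returning overall accuracy.
        accuracies(run) = run_pipeline(train_idx, test_idx);
    end
    fprintf('Mean: %.4f, standard deviation: %.4f\n', ...
            mean(accuracies), std(accuracies));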