CSCI1430 Project 3 : Scene Categorization

Michael Price (mprice)


The goal of the project is to train a recognizer for images of certain types of scenes, based on a 'bag of visual words' model. At test time, the algorithm is given an input image and predicts which type of scene it depicts.


Algorithm


The algorithm trains a classifier that represents each scene category by its distribution of visual words, where the visual words are themselves clusters of SIFT features. The training algorithm can thus be broken into three main parts:
1. Create a vocabulary of visual words.
2. Create per-image histograms of these visual words.
3. Train an SVM to model the decision boundaries between clusters of histograms that are associated with certain scene labels.
Finally, test images are converted to histograms and classified against the SVM's learned model.

To create a vocabulary of visual words, we extract SIFT features from each training image. We then cluster the entire extracted set via k-means into a specified number of cluster centers. That number of cluster centers is the size of our vocabulary of visual words, and each visual word is a cluster center in SIFT feature space.
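
A minimal sketch of this step in Python (not the project's actual code), assuming OpenCV's SIFT implementation and scikit-learn's k-means; the function name, sampling default, and seed are illustrative:

import cv2
import numpy as np
from sklearn.cluster import KMeans

# Sample a few SIFT descriptors from each training image and cluster
# them; each k-means center becomes one 128-dimensional visual word.
def build_vocabulary(image_paths, vocab_size, features_per_image=20, seed=0):
    sift = cv2.SIFT_create()
    rng = np.random.default_rng(seed)
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is None:
            continue
        # Keep a small random sample per image to speed up clustering.
        idx = rng.choice(len(desc), size=min(features_per_image, len(desc)),
                         replace=False)
        descriptors.append(desc[idx])
    all_desc = np.vstack(descriptors)
    kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=seed)
    kmeans.fit(all_desc)
    return kmeans.cluster_centers_  # shape: (vocab_size, 128)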

Our next level of 'featurization' occurs when we compute histograms for each training image. We sample SIFT features from the image, determine the closest visual word for each feature, and keep a running count per visual word. The resulting histogram roughly represents the distribution of visual words in the image.
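
A sketch of the histogram step under the same assumptions, using a simple nearest-center assignment (scipy's cdist is one way to compute the descriptor-to-word distances):

import cv2
import numpy as np
from scipy.spatial.distance import cdist

# Build a normalized bag-of-words histogram for one image: assign each
# SIFT descriptor to its nearest visual word and count the hits.
def image_histogram(image_path, vocab):
    sift = cv2.SIFT_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is None:
        return np.zeros(len(vocab))
    nearest = cdist(desc, vocab).argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    # Normalize so images with different feature counts are comparable.
    return hist / hist.sum()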

We then feed these histograms (along with the correct scene label corresponding to each histogram) to one linear SVM per scene category, each of which models a "does a given histogram belong to this category or not?" boundary.
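
A sketch of the classification step, assuming scikit-learn. LinearSVC fits one binary one-vs-rest boundary per category by default, which matches the "belongs to this category or not?" formulation; the regularization constant C is left at an arbitrary default:

import numpy as np
from sklearn.svm import LinearSVC

# Train one-vs-rest linear SVMs on training histograms and classify
# test histograms: each test image is scored against every category's
# boundary, and the highest-scoring category wins.
def train_and_predict(train_hists, train_labels, test_hists):
    clf = LinearSVC(C=1.0)  # one-vs-rest is LinearSVC's default
    clf.fit(train_hists, train_labels)
    return clf.predict(test_hists)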

Parameters: I used a number of different vocabulary sizes (discussed below) and held the following values constant:
Number of features taken per training image: I randomly sampled 20 features per training image to cluster to form the visual vocabulary. This seemed sufficient and provided a speedup.
Number of training/test images per category per iteration: 50.
Number of cross validation iterations: 10.


Results


The highest and most trustworthy accuracy value I can report is 0.6564, obtained with the parameters given above and a vocabulary size of 1000.

Cross validation: To get more trustworthy results, I ran the algorithm on 10 random test/train splits of the data. The visual vocabulary was recomputed for each split from only the training data for that split. The reported results are averages over the 10 splits.
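
A sketch of this cross-validation loop, reusing the hypothetical helpers above; paths_by_category is assumed to map each scene label to its image paths, and the split sizes match the parameters listed earlier:

import numpy as np

# Run n_iters random train/test splits; the vocabulary is rebuilt from
# the training images of each split to avoid leaking test data.
def cross_validate(paths_by_category, vocab_size, n_iters=10, n_per_split=50,
                   seed=0):
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_iters):
        train_paths, train_labels, test_paths, test_labels = [], [], [], []
        for label, paths in paths_by_category.items():
            perm = rng.permutation(len(paths))  # assumes >= 100 images/category
            for i in perm[:n_per_split]:
                train_paths.append(paths[i]); train_labels.append(label)
            for i in perm[n_per_split:2 * n_per_split]:
                test_paths.append(paths[i]); test_labels.append(label)
        vocab = build_vocabulary(train_paths, vocab_size)
        train_h = np.array([image_histogram(p, vocab) for p in train_paths])
        test_h = np.array([image_histogram(p, vocab) for p in test_paths])
        preds = train_and_predict(train_h, train_labels, test_h)
        accuracies.append(np.mean(preds == np.array(test_labels)))
    return np.mean(accuracies), np.std(accuracies)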

Different vocabulary sizes: I took measurements across a range of vocabulary sizes, from 10 to 5000. The results are shown in the table below:

Vocab Size | Average Accuracy | Standard Deviation
10         | 0.4517           | 0.0196
20         | 0.5284           | 0.0197
50         | 0.6025           | 0.0182
200        | 0.6275           | 0.0191
400        | 0.6449           | 0.0175
1000       | 0.6564           | 0.0122
5000       | 0.6591           | 0.0126

In general, a larger vocabulary gives more accurate results. I would imagine, however, that increasing the vocabulary size to 10,000 or 15,000 would cause overfitting and adversely affect the results, as I was creating the vocabulary from a total of only 15,000 features to begin with; at that scale nearly every cluster would contain a single feature.

Here is an example of a confusion matrix from one of 10 iterations with a vocab size of 5000:

It would appear that 'industrial' images are confused with offices (fair enough).
...but also that kitchens are confused with forests.

Here is an example of a confusion matrix from one of 10 iterations with a vocab size of 10:

As you can see, it is much farther from the identity matrix, with many more off-diagonal confusions.