CSCI 1430 Project 2

Scene Recognition with Bag of Words
Reese Kuppig (rkuppig)

The objective of this project was to implement a scene recognition algorithm to accurately categorize images from a set of specific scene types. This identification is achieved by building an image feature vocabulary, based on the SIFT feature set, with which to describe images of each category. A set of SVMs are then trained to discriminate between the different types of scenes and they decide to which category each novel image belongs. 



Algorithm Pipeline:

The scene recognition algorithm implemented for this project breaks down into several key steps:

1. Identify common image features in a set of training images and build a feature vocabulary.
2. Use the feature vocabulary to describe the training images and train an SVM to discriminate between scenes.
3. Apply the SVM to novel images in a testing set and fine-tune the SVM's performance.

 

Feature Vocabulary:

This algorithm aims to discriminate between 15 categories of scenes (forest, tall building, kitchen, etc.), and for each category there is a training set of 100 images, for a total of 1500 training images. But to make use of these images for training, they must be quantified in some way, so they are each analyzed using the dense SIFT algorithm, which subdivides each image and analyzes the qualities of patches in each subdivision, producing a 128-vector for each patch.

After collecting these feature vectors for each image, the whole collection of features is processed by k-means, grouping similar features together across all training images in all scene categories. The centers of the k-clusters represent common image features found across training images, essentially forming a feature vocabulary, such that each image may be described by a histogram of the occurrences in the image of each feature 'word' of the vocabulary.   



Training the SVMs:

Once an image feature vocabulary is established, the qualities of each image in the training set may be quantified as a unique histogram of feature 'word' occurrences from the vocabulary. Now SVMs may be trained to discriminate between categories.

This algorithm implements a non-linear SVM, which works by mapping the feature vectors to be learned to a higher-dimension space according to a kernel function. The Gaussian/RBF kernel was chosen for this algorithm, which maps points according to the kernel function:

k(\mathbf{x_i},\mathbf{x_j})=\exp(-\gamma \|\mathbf{x_i} - \mathbf{x_j}\|^2)

By applying the kernel function, mapping the feature vectors to a higher dimension, it is more likely that the trained SVMs will be able to identify and capture a hyperplane dividing the different categories. Since SVMs natively can only discriminate between two categories, this algorithm trains an SVM for each image category, which will discriminate between the specific category and all other categories grouped together.

Once the SVMs are trained, the testing image sets are evaluated to determine the performance of the SVMs. The SVM performance results can be seen below in the form of confusion matrices, which visually depict the distribution of SVM categorizations against the actual image categories. 



Results:

The performance of the SVM classifiers quickly reaches a plateau around 70% accuracy with vocabulary sizes of 200+ feature 'words'. As can be seen in the confusion matrices, the classifier appears to have trouble distinguishing between the 'office,' 'highway,' 'mountain,' and 'industrial' scenes especially. The figure below displays the most common image features in each of these four scene types, and since there seem to be unique dominant features for each category, the confusion is perhaps due to the nature of the one-vs-all SVM classifiers. While these four scene types may have unique features when compared to each other, when each is compared to the 14 other categories, it is likely that the unique features seen here become much less unique and distinguishable.

 

The performance results for varying vocabulary sizes using non-linear, Gaussian/RBF-kernel SVMs:

 

The greatest accuracy achieved was 0.6987 with the 500 word vocabulary.

 

The confusion matrices for varying vocabulary sizes using non-linear, Gaussian/RBF-kernel SVMs: