The approach we will take to scene recognition is a bag-of-words method. This ignores most information about the arrangement of features in an image and makes recognition choices based on the distribution of the features in the image and a feature vocabulary we generate from the training set. The algorithm we use here follows the baseline method described in Lazebnik et al. 2006.
The basic pipeline is as follows:
In this algorithm, we compute features using the scale-invariant feature transform (SIFT). First, we compute the SIFT features of each training image and pool them into one large collection. To create the vocabulary, we run k-means clustering on this collection with a predetermined vocabulary size and use the cluster centers as our words.
Note that clustering all of these features often proves too memory-intensive to run. As a workaround, we use a probabilistic selection method to randomly select which features to include. Our results here were achieved using a 10% sampling rate.
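As a rough illustration of this step, the sketch below keeps each feature with a fixed probability and then clusters the sample with a plain Lloyd's-style k-means. It is written in dependency-free Python for readability rather than speed. The function names (`sample_features`, `kmeans`, `build_vocabulary`), the fixed iteration count, and the seeding are assumptions made for the example, not details of our implementation. Features are assumed to be equal-length lists of numbers (SIFT descriptors are typically 128-dimensional).

```python
import random


def sample_features(features, rate, rng):
    """Keep each feature independently with probability `rate`."""
    return [f for f in features if rng.random() < rate]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(feature, centers):
    """Index of the center closest to `feature` (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(feature, centers[i]))


def kmeans(features, k, rng, iterations=20):
    """Lloyd's algorithm with a fixed iteration count (an assumption).

    Returns k cluster centers, which serve as the vocabulary words.
    """
    # Initialize the centers with k distinct features chosen at random.
    centers = [list(f) for f in rng.sample(features, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_center(f, centers)].append(f)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def build_vocabulary(features, vocab_size, sample_rate=0.1, seed=0):
    """Subsample the pooled features, then cluster the sample."""
    rng = random.Random(seed)
    sampled = sample_features(features, sample_rate, rng)
    if len(sampled) < vocab_size:
        raise ValueError("not enough sampled features for this vocabulary size")
    return kmeans(sampled, vocab_size, rng)
```

With a 10% sampling rate, k-means only ever sees roughly a tenth of the pooled descriptors, which is what keeps memory use manageable. The trade-off is that the vocabulary depends on which features happen to be drawn, so fixing the seed makes runs repeatable.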
We again compute the SIFT features of each image. For each feature, we find the closest vocabulary word and increase that word's count by one. After computing all of the histogram values, we divide by the total count to eliminate image size as a confounding variable.
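The histogram step can be sketched as follows. This is a minimal, dependency-free illustration: the name `build_histogram` is hypothetical, distances are assumed to be Euclidean, and the vocabulary is assumed to be a list of cluster centers with the same dimension as the features.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_histogram(features, vocabulary):
    """Normalized count of how often each vocabulary word is the nearest one."""
    counts = [0] * len(vocabulary)
    for f in features:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(f, vocabulary[i]))
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:  # an image with no features gets an all-zero histogram
        return [0.0] * len(vocabulary)
    # Dividing by the total makes the bins sum to 1 regardless of how many
    # features the image produced.
    return [c / total for c in counts]
```

For example, with a two-word vocabulary `[[0, 0], [10, 10]]`, an image whose features are `[[1, 1], [0, 2], [2, 0], [9, 9]]` yields `[0.75, 0.25]`: three features fall nearest the first word and one nearest the second.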
We use a support vector machine (SVM) to create a set of 15 classifiers, one per class of training images. Each classifier is trained on the histograms of the training images to separate its own class from the rest, which allows us to classify test images. In this model, we use a simple linear classifier, but more complicated SVMs can be used.
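To make the one-classifier-per-class idea concrete, here is a small sketch that trains a linear SVM for each class by full-batch subgradient descent on the regularized hinge loss. This is a stand-in chosen for brevity, not the solver behind our results. The function names and the hyperparameters (`lam`, `lr`, `epochs`) are assumptions, and a practical pipeline would normally call an optimized SVM library instead.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=1000):
    """Binary linear SVM; `labels` are +1 or -1. Returns (weights, bias).

    Minimizes (lam / 2) * |w|^2 + mean hinge loss with a constant step size.
    All hyperparameter defaults are illustrative assumptions.
    """
    n = len(samples)
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_all(histograms, class_ids, num_classes):
    """One binary classifier per class: that class (+1) against the rest (-1)."""
    classifiers = []
    for c in range(num_classes):
        labels = [1 if cid == c else -1 for cid in class_ids]
        classifiers.append(train_linear_svm(histograms, labels))
    return classifiers
```

Calling `train_one_vs_all(histograms, class_ids, 15)` would return 15 `(weights, bias)` pairs, one per scene class. Each pair defines a hyperplane in histogram space, and the signed score `w · h + b` acts as that classifier's confidence.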
We create histograms for each test image and compare them to those generated for the training data. Each test image is evaluated under all 15 classifiers, and the one that gives the highest confidence value is selected as the classification for that image.
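The selection rule described here amounts to an argmax over the per-class confidence values. In the sketch below, each classifier is assumed to be a `(weights, bias)` pair for a linear SVM, and the confidence is taken to be the signed score `w · h + b`. The names `decision_value` and `classify` are hypothetical.

```python
def decision_value(weights, bias, histogram):
    """Signed score of a linear classifier; larger means more confident."""
    return sum(w * h for w, h in zip(weights, histogram)) + bias


def classify(histogram, classifiers):
    """Index of the class whose classifier gives the highest score."""
    scores = [decision_value(w, b, histogram) for w, b in classifiers]
    return max(range(len(scores)), key=lambda c: scores[c])
```

Note that the winning score does not have to be positive: even if every classifier rejects the image, the least negative one still determines the label, so every test image receives exactly one class.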
Once we have classified all of the test images, we can compare the predictions to the actual classes. We record which classes of images were classified where in a structure called a confusion matrix. (A perfect classification would have nonzero values only on the diagonal.) We can compute an accuracy score by taking the mean of the per-class values on the diagonal.
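A minimal sketch of this evaluation is shown below. It assumes that rows index the true class and columns the predicted class, and it reads the accuracy score as the mean of the row-normalized diagonal (the fraction of each class classified correctly, averaged over classes). Both conventions are assumptions of the example, as are the function names.

```python
def confusion_matrix(true_ids, predicted_ids, num_classes):
    """matrix[t][p] counts test images of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_ids, predicted_ids):
        matrix[t][p] += 1
    return matrix


def mean_per_class_accuracy(matrix):
    """Average, over classes, of the fraction of that class on the diagonal."""
    rates = []
    for c, row in enumerate(matrix):
        total = sum(row)
        if total:  # skip classes with no test images
            rates.append(row[c] / total)
    return sum(rates) / len(rates) if rates else 0.0
```

For instance, true labels `[0, 0, 1, 1, 2, 2]` and predictions `[0, 1, 1, 1, 2, 0]` give the matrix `[[1, 1, 0], [0, 2, 0], [1, 0, 1]]`. The per-class rates are 0.5, 1.0, and 0.5, so the score is 2/3. When every class has the same number of test images, this equals the plain fraction of correctly classified images.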
Using a vocabulary size of 200, the algorithm achieved an accuracy of 0.6213. The confusion matrix is visualized below: