Project 3: Scene recognition

Hari Narayanan

Algorithm

The approach we will take to scene recognition is a bag-of-words method. This ignores most information about the arrangement of features in an image and makes recognition choices based on the distribution of the features in the image and a feature vocabulary we generate from the training set. The algorithm we use here follows the baseline method described in Lazebnik et al. 2006.

The basic pipeline is as follows:

1. Use the training sample to create a feature vocabulary
2. For each training image, assign each feature to the nearest vocabulary "word" and create a histogram measuring vocabulary frequencies
3. Make a set of classifiers based on the histograms
4. Apply step 2 to each test image and apply all of the classifiers to each test image
5. Evaluate results

1. Creating the vocabulary

In this algorithm, we compute features using the scale-invariant feature transform (SIFT). First, we compute the features of each training image and combine them into one large collection of features. To create the vocabulary, we run k-means clustering on this collection with a predetermined vocabulary size and use the cluster centers as our words.

Note that clustering all of these features often proves too memory-intensive to run. As a workaround, we use a probabilistic selection method to randomly choose which features to incorporate. Our results here were achieved using a 10% sampling rate.
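The sampling and clustering steps can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the SIFT descriptors are taken to be already computed and stored as lists of numbers, the function names (sample_features, kmeans) are hypothetical, and the clustering is a plain Lloyd's k-means rather than whatever optimized routine the project actually used.

```python
import random

def sample_features(features, rate=0.10, seed=0):
    """Keep each feature independently with probability `rate`."""
    rng = random.Random(seed)
    return [f for f in features if rng.random() < rate]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers (the vocabulary)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers
```

With this sketch, a vocabulary of 200 words would come from something like kmeans(sample_features(all_features), 200), where all_features is the combined feature collection (a hypothetical variable name).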

2. Making histograms

We again compute the SIFT features of each image. For each feature, we find the closest vocabulary word and increase its count by one. After computing all of the counts, we divide by the total count to eliminate image size as a confounding variable.
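As a rough sketch of this step, assuming the features of one image and the vocabulary are both given as lists of equal-length numeric vectors (the names nearest_word and build_histogram are hypothetical):

```python
def nearest_word(feature, vocab):
    """Index of the vocabulary word closest to `feature` (Euclidean distance)."""
    return min(
        range(len(vocab)),
        key=lambda i: sum((x - y) ** 2 for x, y in zip(feature, vocab[i])),
    )

def build_histogram(features, vocab):
    """Normalized histogram of vocabulary-word frequencies for one image."""
    counts = [0] * len(vocab)
    for f in features:
        counts[nearest_word(f, vocab)] += 1
    total = sum(counts)
    if total == 0:
        # An image with no features gets an all-zero histogram.
        return [0.0] * len(vocab)
    # Dividing by the total makes histograms comparable across image sizes.
    return [c / total for c in counts]
```

Because of the normalization, an image that yields 300 features and one that yields 3000 both produce a histogram whose entries sum to 1.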

3. Training the classifiers

We use a support vector machine (SVM) to create a set of 15 classifiers, one per class of training images. Each classifier is trained on the histograms of the training images and learns to separate its own class from the others, which lets us classify test images. In this model, we use a simple linear classifier, but more complicated SVMs can be used.
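The one-classifier-per-class setup can be illustrated with a small linear SVM trained by batch subgradient descent on the regularized hinge loss. This is a sketch under assumptions, not the solver used for the reported results: the optimizer, the hyperparameters (reg, learning_rate, epochs), and all function names are chosen here purely for illustration.

```python
def decision_value(model, histogram):
    """Signed confidence w . x + b of a linear classifier."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, histogram)) + bias

def train_linear_svm(histograms, labels, reg=0.01, learning_rate=0.1, epochs=200):
    """Linear SVM via batch subgradient descent; labels must be +1 or -1."""
    dim = len(histograms[0])
    n = len(histograms)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        # Gradient of the L2 regularizer (reg / 2) * ||w||^2.
        grad_w = [reg * w for w in weights]
        grad_b = 0.0
        for x, y in zip(histograms, labels):
            # Only examples inside the margin contribute to the hinge loss.
            if y * decision_value((weights, bias), x) < 1:
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

def train_one_vs_all(histograms, image_classes, class_names):
    """Train one binary classifier per class: that class against all others."""
    models = {}
    for name in class_names:
        labels = [1 if c == name else -1 for c in image_classes]
        models[name] = train_linear_svm(histograms, labels)
    return models
```

Here image_classes holds the class of each training histogram, and the returned dictionary maps each class name to its (weights, bias) pair. In practice a library SVM solver would likely replace train_linear_svm.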

4. Classifying test images

We create histograms for each test image and compare them to those generated for the training data. Each test image is evaluated under all 15 classifiers, and the one that gives the highest confidence value is selected as the classification for that image.
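Selecting the class with the highest confidence amounts to an argmax over the classifiers. The sketch below assumes each classifier is a linear model stored as a (weights, bias) pair in a dictionary keyed by class name; that representation and the function name classify are assumptions made for illustration.

```python
def classify(histogram, models):
    """Return the class whose linear classifier gives the highest confidence."""
    def confidence(name):
        weights, bias = models[name]
        return sum(w * x for w, x in zip(weights, histogram)) + bias
    return max(models, key=confidence)

# Hypothetical two-class example with hand-picked weights.
example_models = {
    "forest": ([1.0, -1.0], 0.0),
    "street": ([-1.0, 1.0], 0.0),
}
print(classify([0.9, 0.1], example_models))  # forest
```

Note that the confidences are compared directly across classifiers, so a test image always receives exactly one label even if every classifier returns a negative value.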

5. Evaluating results

Once we have classified all of the test images, we can compare the predictions to the actual classes. We record how many images of each actual class were assigned to each predicted class in a structure called a confusion matrix. (A perfect classification would have nonzero values only on the diagonal.) We can compute an accuracy score by taking the mean of the per-class values on the diagonal.
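A minimal sketch of this evaluation, assuming rows of the matrix index the actual class and columns the predicted class, and that the per-class diagonal value means the fraction of each class's test images that were classified correctly (function names are hypothetical):

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] = number of images of class i that were predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def mean_per_class_accuracy(matrix):
    """Mean of the row-normalized diagonal, skipping classes with no test images."""
    rates = []
    for i, row in enumerate(matrix):
        total = sum(row)
        if total:
            rates.append(row[i] / total)
    return sum(rates) / len(rates)

true = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
cm = confusion_matrix(true, pred, ["a", "b"])
print(cm)                           # [[1, 1], [0, 2]]
print(mean_per_class_accuracy(cm))  # 0.75
```

In the small example, class "a" is right half of the time and class "b" every time, so the score is (0.5 + 1.0) / 2 = 0.75.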

Results

Basic model

Using a vocabulary size of 200, the algorithm achieved an accuracy of 0.6213. The confusion matrix is visualized below:

Modifications

With a vocabulary size of 20, the algorithm achieved an accuracy of 0.5133.
With a vocabulary size of 50, the algorithm achieved an accuracy of 0.5687.