Scene Recognition Report

Vazheh Moussavi (vmoussav)



Project Overview

In this project, we want to create a model that assigns a category from a fixed set to an image, using a "bag of words" model. We begin by constructing a vocabulary of "visual words": we extract a large sample of image features (dense SIFT, in this case) and cluster them (using K-means code provided by VLFeat), so that the cluster centers represent each word's location in the feature space. Once we have the visual words, we can take in a set of labeled images, describe each one by its approximate distribution over words (by building a histogram of nearest-word assignments for the image's extracted features), and use those histograms to train a Support Vector Machine for classification. We can then introduce a new, unlabeled test set and plug in its histogram distributions to classify each image.
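As a rough illustration only (the project itself used MATLAB and VLFeat), here is a minimal Python sketch of that pipeline; dense_sift, train_images, train_labels, and test_images are assumed placeholders, not names from the original code.

    # Minimal sketch of the bag-of-words pipeline. dense_sift(image) is an
    # assumed helper returning an (N, 128) array of descriptors.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_vocabulary(train_images, vocab_size=200):
        # Sample dense SIFT features from the training images and cluster them;
        # the cluster centers are the "visual words".
        feats = np.vstack([dense_sift(img) for img in train_images])
        kmeans = KMeans(n_clusters=vocab_size).fit(feats)
        return kmeans.cluster_centers_

    def bow_histogram(image, vocab):
        # Assign each feature to its nearest visual word and build a
        # normalized histogram of word counts.
        feats = dense_sift(image)
        dists = np.linalg.norm(feats[:, None, :] - vocab[None, :, :], axis=2)
        words = dists.argmin(axis=1)
        hist = np.bincount(words, minlength=len(vocab)).astype(float)
        return hist / hist.sum()

    vocab = build_vocabulary(train_images, vocab_size=200)
    X_train = np.array([bow_histogram(img, vocab) for img in train_images])
    X_test = np.array([bow_histogram(img, vocab) for img in test_images])
    clf = LinearSVC(C=1.0).fit(X_train, train_labels)  # C plays the role of the lambda setting
    pred = clf.predict(X_test)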

Basic Results

Using the original settings (vocab size=200, linear SVM, original lambda value, relatively sparse vocabulary building, etc.), I obtained an accuracy of 0.62. Here is a confusion matrix showing accuracies over each category pair (ideally, we want a red diagonal with dark blue everywhere else).
Clearly, that isn't what we got. There isn't much of a trend in the category confusions: the classifier performs poorly on natural scenes (MITopencountry) as well as man-made ones (industrial, livingroom).
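For reference, a per-category-normalized confusion matrix can be computed along these lines (a sketch using scikit-learn; test_labels and pred follow the hypothetical names from the sketch above):

    # Rows are true categories, columns are predicted categories; after row
    # normalization the diagonal holds per-category accuracies.
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(test_labels, pred)
    cm = cm / cm.sum(axis=1, keepdims=True)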

Extensions/Experimentation

Vocabulary Sizes

The simplest (and most time-consuming) thing to try was varying the vocabulary size, since we don't have much intuition about how many words we should have, and non-parametric clustering algorithms (ones that don't fix the number of clusters in advance) are slow. I tried values of 10, 20, 50, 100, 200, 400, and 1000. Sweeping the slack variable (lambda) as well, accuracy seems to peak somewhere between 200 and 400 words, but the improvement is modest. Time to try more interesting things.
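The sweep itself is just a grid search; a hedged sketch, reusing the hypothetical helpers from the overview sketch, with placeholder lambda values rather than the ones actually tried:

    # Rebuild the vocabulary at each size, then evaluate a linear SVM at each
    # regularization setting (lambda is mapped to scikit-learn's C as 1/lambda).
    import numpy as np
    from sklearn.svm import LinearSVC

    for vocab_size in [10, 20, 50, 100, 200, 400, 1000]:
        vocab = build_vocabulary(train_images, vocab_size=vocab_size)
        X_train = np.array([bow_histogram(img, vocab) for img in train_images])
        X_test = np.array([bow_histogram(img, vocab) for img in test_images])
        for lam in [1e-3, 1e-2, 1e-1, 1.0]:  # example slack values only
            clf = LinearSVC(C=1.0 / lam).fit(X_train, train_labels)
            acc = (clf.predict(X_test) == test_labels).mean()
            print(vocab_size, lam, acc)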

Different Kernels

A linear SVM isn't very interesting. It doesn't implicitly map the data into a higher-dimensional space where a more complex separation becomes possible, which is much of the point of using an SVM.

Radial Basis Kernel

The RBF (radial basis function) kernel measures how close each point is to the support vectors, falling off exponentially with squared distance, and is generally a much more realistic kernel to use than a linear one.
We can see that the results do get a decent boost, which is to be expected given the more complex decision boundaries the RBF can separate, giving a max accuracy of 0.67.
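As a sketch of what swapping in an RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), looks like; the gamma and C values here are placeholders, not the settings used in the project:

    from sklearn.svm import SVC

    # Same histograms as before, only the kernel changes.
    clf = SVC(kernel='rbf', gamma='scale', C=1.0)
    clf.fit(X_train, train_labels)
    acc = (clf.predict(X_test) == test_labels).mean()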

Histogram Kernel

The problem with the RBF is that it doesn't take into account the specific constraints that come with data in histogram form. A histogram intersection kernel can better measure similarity knowing that each histogram must sum to 1, by summing element-wise minima (which play the role of differences), with alpha and beta as scaling parameters.
Things did improve slightly, but as you can see, it wasn't long until increasing alpha destroyed numerical precision and made things blow up, with alpha=beta=1 giving the only good results before things went to ...
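The exact formula from the original report isn't reproduced here; one common generalized histogram intersection form, consistent with the description above, is K(x, y) = sum_i min(x_i^alpha, y_i^beta), which reduces to the standard intersection kernel at alpha = beta = 1. A hedged sketch using a precomputed Gram matrix:

    import numpy as np
    from sklearn.svm import SVC

    def hist_intersection_kernel(X, Y, alpha=1.0, beta=1.0):
        # Gram matrix of pairwise kernel values between rows of X and rows of Y.
        return np.array([[np.minimum(x ** alpha, y ** beta).sum() for y in Y] for x in X])

    clf = SVC(kernel='precomputed', C=1.0)
    clf.fit(hist_intersection_kernel(X_train, X_train), train_labels)
    pred = clf.predict(hist_intersection_kernel(X_test, X_train))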

Single-Level Spatial Content

I also tried a simple way of incorporating spatial information: splitting the image features into quadrants based on the features' locations and building a separate set of histogram bins for each quadrant. This should improve results, since it steps away from the restrictive "bag of words" assumption that location doesn't matter (see the sketch below).
This worked really well with the histogram kernel (alpha=1), giving a top accuracy of 0.782, rewarding our addition of an intuitive discriminant.
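A sketch of the quadrant histograms, assuming a hypothetical dense_sift_with_locations helper that also returns each feature's (x, y) position:

    import numpy as np

    def spatial_bow_histogram(image, vocab):
        # Split features into four quadrants by location and concatenate one
        # normalized histogram per quadrant (4 * vocab_size dimensions total).
        feats, locs = dense_sift_with_locations(image)  # (N, 128), (N, 2)
        h, w = image.shape[:2]
        dists = np.linalg.norm(feats[:, None, :] - vocab[None, :, :], axis=2)
        words = dists.argmin(axis=1)
        hists = []
        for in_top in (True, False):
            for in_left in (True, False):
                mask = ((locs[:, 1] < h / 2) == in_top) & ((locs[:, 0] < w / 2) == in_left)
                hist = np.bincount(words[mask], minlength=len(vocab)).astype(float)
                hists.append(hist / max(hist.sum(), 1.0))
        return np.concatenate(hists)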

Cross-Validation (kinda)

While all of the reported accuracies come from the original 100/100 train/test split, most of my experimentation was done on randomized subsets of the test data so as to (somewhat) avoid "training on the test set". It also made running many trials fast enough to be practical.
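A sketch of that randomized-subset evaluation (the subset size and seed are placeholders; clf, X_test, and test_labels follow the earlier sketches):

    import numpy as np

    # Score on a random half of the test images so many configurations can be
    # tried quickly without repeatedly touching the full test split.
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_test), size=len(X_test) // 2, replace=False)
    acc = (clf.predict(X_test[idx]) == np.asarray(test_labels)[idx]).mean()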