Algorithm
Below I describe my algorithm.
Get Features
Densely sample the features with a bin size of 4 and a step size of 8
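As a sketch, the dense sampling grid can be generated like this (the margin convention is my illustrative assumption, chosen so a descriptor with 4x4 spatial bins of bin_size pixels stays inside the image; the descriptor computation itself is not shown):

```python
import numpy as np

def dense_sample_locations(height, width, bin_size=4, step_size=8):
    """Centers of a dense sampling grid over an image.

    Features are extracted every step_size pixels; the margin keeps a
    descriptor with 4x4 spatial bins of bin_size pixels inside the
    image (an illustrative convention, not necessarily the extractor's).
    """
    margin = 2 * bin_size
    ys = np.arange(margin, height - margin, step_size)
    xs = np.arange(margin, width - margin, step_size)
    grid_x, grid_y = np.meshgrid(xs, ys)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)
```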
Clustering
K-means
- randomly select 100 features from each image
- use K-means to cluster this subset of features into a vocabulary of visual words
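Sketched in NumPy below; a plain Lloyd's K-means stands in for whatever K-means implementation was actually used, and the iteration count is illustrative:

```python
import numpy as np

def build_vocabulary(feature_sets, sample_per_image=100, vocab_size=8,
                     n_iter=20, seed=0):
    """Cluster a random subset of features into a visual-word vocabulary.

    feature_sets: list of (n_i, d) arrays, one per image.
    Returns a (vocab_size, d) array of cluster centers (the "words").
    """
    rng = np.random.default_rng(seed)
    samples = []
    for feats in feature_sets:
        idx = rng.choice(len(feats), size=min(sample_per_image, len(feats)),
                         replace=False)
        samples.append(feats[idx])
    data = np.vstack(samples)

    # Initialize centers from random data points, then run Lloyd's steps.
    centers = data[rng.choice(len(data), size=vocab_size, replace=False)]
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(vocab_size):
            members = data[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```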
Create Training Data
Represent each training image as a distribution of words
For each training image,
- use a simple 1-NN to convert each feature to a word
- reduce the image to a histogram of word counts
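Concretely, the feature-to-word step is just a nearest-neighbor lookup followed by a count (a minimal sketch; the distance metric is assumed to be Euclidean):

```python
import numpy as np

def image_histogram(features, vocab):
    """Reduce one image's features to a histogram of visual-word counts.

    Each feature is assigned to its nearest vocabulary center (1-NN,
    squared Euclidean distance), then the assignments are counted.
    """
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)               # 1-NN word assignment
    return np.bincount(words, minlength=len(vocab))
```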
Train SVM
For each class,
- train a one-versus-all SVM with a simple linear kernel
- The SVM separates the two classes along the maximum-margin hyperplane. What that means is: of all hyperplanes that perfectly separate the data, choose the one whose perpendicular distance (margin) to the nearest points is greatest, such that the region between the hyperplane and its margins contains no points
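A minimal sketch of the one-versus-all training loop; the Pegasos-style subgradient solver below stands in for whatever SVM solver I actually used, and it omits a bias term for simplicity:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, n_epochs=100, seed=0):
    """Binary linear SVM via Pegasos-style subgradient descent.

    X: (n, d) image histograms; y: labels in {-1, +1}.
    Minimizes (lam/2)*||w||^2 + mean hinge loss; no bias term.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * (X[i] @ w)
            w *= 1 - eta * lam             # shrink from the L2 penalty
            if margin < 1:                 # hinge subgradient step
                w += eta * y[i] * X[i]
    return w

def train_one_vs_all(X, labels, classes, lam=0.1):
    """One SVM per class: that class is +1, all other classes are -1."""
    return {c: train_linear_svm(X, np.where(labels == c, 1.0, -1.0), lam)
            for c in classes}
```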
Classify Test Data
For each test image,
- evaluate each one-versus-all SVM on the test image
- take the SVM whose decision value is maximal
- that SVM (a 'scene' versus 'not-scene' classifier) most strongly believes the test image belongs to its scene
- (although there seems to be no principled way to compare outputs of different SVMs, since they are not probabilities)
- output that scene as the classification
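The decision rule in code, assuming each class's trained SVM is reduced to a weight vector w and bias b (hypothetical representation; as noted above, the decision values are raw margins, not probabilities):

```python
import numpy as np

def classify(hist, models):
    """Return the class whose one-versus-all SVM scores the image highest.

    models: dict mapping class name -> (w, b) for that class's SVM.
    Comparing raw decision values across classifiers is a heuristic,
    since they are not probabilities.
    """
    scores = {c: float(hist @ w + b) for c, (w, b) in models.items()}
    return max(scores, key=scores.get)
```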
Results
Params
- bin_size = 4
- step_size = 8
- vocab_size = 200
- linear kernel
- lambda = 0.1
Confusion Matrix
Accuracy
My accuracy, with the above parameters, was: 0.6200
Extra Credit
Different Vocabulary Sizes
I ran my code with the above parameters on vocabulary sizes of 10, 20, 50, 100, 200, 400.
vocab_size = 10
accuracy =
vocab_size = 20
accuracy =
vocab_size = 50
accuracy = 0.5740
vocab_size = 100
accuracy = 0.6113
vocab_size = 200
accuracy = 0.6200 (same as above)
vocab_size = 400
accuracy = 0.6420
Discussion
As expected, vocabulary size is positively correlated with accuracy.
Non-linear (RBF) SVM kernel
Params
- RBF kernel (characteristic length scale squared i.e. variance = 1)
- bin_size = 4
- step_size = 8
- vocab_size = 200
- lambda = 0.1
accuracy = 0.6200
Discussion
While the accuracy happens to be exactly the same as with the linear kernel, the confusion matrix is somewhat different (you can verify this by comparing them side by side). Had I had more time to tune the parameters, I would expect a larger increase in accuracy. When I tried more extreme variances, I got dismal results: the SVM ensemble classified every test image as the same scene. Since SVMs are binary classifiers, we must aggregate the results of each SVM in some way to get multiclass classification. I believe my implementation was correct, but I did not have time to properly tune the parameters (to which SVMs are very sensitive). Thus, I attribute the failure to a single one-versus-all SVM (e.g. the bedroom-or-not classifier) returning large values, whose voice dominated the classification. Since these values are not probabilities, there is no principled way to compare the voices of different SVMs.
Extreme variances tried:
- variance = 0.001
- variance = 100
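The degenerate behavior at extreme variances is easy to see from the kernel itself (assuming the common convention K(x, y) = exp(-||x - y||^2 / (2 * variance)); the exact scaling in my run may differ). With a tiny variance, every off-diagonal kernel value collapses to 0, so the kernel matrix approaches the identity; with a huge variance, every value saturates to 1, so all images look identical to the SVM:

```python
import numpy as np

def rbf_kernel(X, Y, variance=1.0):
    """RBF kernel matrix: K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 * variance))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * variance))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
# Tiny variance: distinct points look maximally dissimilar (K -> identity).
# Huge variance: all points look identical (K -> all-ones matrix).
```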