Algorithm
Below I describe my algorithm.
Get Features
Densely sample the features with a bin size of 4 and a step size of 8
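As a sketch, the dense sampling grid can be generated like this (the margin convention is my illustrative assumption, chosen so a descriptor with 4x4 spatial bins of bin_size pixels stays inside the image; the descriptor computation itself is not shown):

```python
import numpy as np

def dense_sample_locations(height, width, bin_size=4, step_size=8):
    """Centers of a dense sampling grid over an image.

    Features are extracted every step_size pixels; the margin keeps a
    descriptor with 4x4 spatial bins of bin_size pixels inside the
    image (an illustrative convention, not necessarily the extractor's).
    """
    margin = 2 * bin_size
    ys = np.arange(margin, height - margin, step_size)
    xs = np.arange(margin, width - margin, step_size)
    grid_x, grid_y = np.meshgrid(xs, ys)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)
```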
Clustering
K-means
- randomly select 100 features from each image
- use K-means to cluster this subset of features into a vocabulary of visual words
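Sketched in NumPy below; a plain Lloyd's K-means stands in for whatever K-means implementation was actually used, and the iteration count is illustrative:

```python
import numpy as np

def build_vocabulary(feature_sets, sample_per_image=100, vocab_size=8,
                     n_iter=20, seed=0):
    """Cluster a random subset of features into a visual-word vocabulary.

    feature_sets: list of (n_i, d) arrays, one per image.
    Returns a (vocab_size, d) array of cluster centers (the "words").
    """
    rng = np.random.default_rng(seed)
    samples = []
    for feats in feature_sets:
        idx = rng.choice(len(feats), size=min(sample_per_image, len(feats)),
                         replace=False)
        samples.append(feats[idx])
    data = np.vstack(samples)

    # Initialize centers from random data points, then run Lloyd's steps.
    centers = data[rng.choice(len(data), size=vocab_size, replace=False)]
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(vocab_size):
            members = data[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```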
Create Training Data
Represent each training image as a distribution of words
For each training image,
- use a simple 1-NN to convert each feature to a word
- reduce the image to a histogram of word counts
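Concretely, the feature-to-word step is just a nearest-neighbor lookup followed by a count (a minimal sketch; the distance metric is assumed to be Euclidean):

```python
import numpy as np

def image_histogram(features, vocab):
    """Reduce one image's features to a histogram of visual-word counts.

    Each feature is assigned to its nearest vocabulary center (1-NN,
    squared Euclidean distance), then the assignments are counted.
    """
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)               # 1-NN word assignment
    return np.bincount(words, minlength=len(vocab))
```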
Train SVM
For each class,
- train a one-versus-all SVM with a simple linear kernel
- The SVM separates the two classes along the maximum-margin hyperplane. What that means is: of all hyperplanes that perfectly separate the data, choose the one whose perpendicular distance (margin) to the nearest points is greatest, such that the region between the hyperplane and its margins contains no points
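A minimal sketch of the one-versus-all training loop; the Pegasos-style subgradient solver below stands in for whatever SVM solver I actually used, and it omits a bias term for simplicity:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, n_epochs=100, seed=0):
    """Binary linear SVM via Pegasos-style subgradient descent.

    X: (n, d) image histograms; y: labels in {-1, +1}.
    Minimizes (lam/2)*||w||^2 + mean hinge loss; no bias term.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * (X[i] @ w)
            w *= 1 - eta * lam             # shrink from the L2 penalty
            if margin < 1:                 # hinge subgradient step
                w += eta * y[i] * X[i]
    return w

def train_one_vs_all(X, labels, classes, lam=0.1):
    """One SVM per class: that class is +1, all other classes are -1."""
    return {c: train_linear_svm(X, np.where(labels == c, 1.0, -1.0), lam)
            for c in classes}
```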
Classify Test Data
For each test image,
- evaluate each one-versus-all SVM on the test image
- take the SVM whose decision value is maximal
- that SVM (a 'scene' versus 'not-scene' classifier) most strongly believes the test image belongs to its scene
- (although there seems to be no principled way to compare outputs of different SVMs, since they are not probabilities)
- output that scene as the classification
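The decision rule in code, assuming each class's trained SVM is reduced to a weight vector w and bias b (hypothetical representation; as noted above, the decision values are raw margins, not probabilities):

```python
import numpy as np

def classify(hist, models):
    """Return the class whose one-versus-all SVM scores the image highest.

    models: dict mapping class name -> (w, b) for that class's SVM.
    Comparing raw decision values across classifiers is a heuristic,
    since they are not probabilities.
    """
    scores = {c: float(hist @ w + b) for c, (w, b) in models.items()}
    return max(scores, key=scores.get)
```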
Results
Params
- bin_size = 4
- step_size = 8
- vocab_size = 200
- linear kernel
- lambda = 0.1
Confusion Matrix
Accuracy
My accuracy, with the above parameters, was: 0.6200
Extra Credit
Different Vocabulary Sizes
I ran my code with the above parameters on vocabulary sizes of 10, 20, 50, 100, 200, 400.
vocab_size = 10
accuracy =
vocab_size = 20
accuracy =
vocab_size = 50
accuracy = 0.5740
vocab_size = 100
accuracy = 0.6113
vocab_size = 200
accuracy = 0.6200 (same as above)
vocab_size = 400
accuracy = 0.6420
Discussion
As expected, vocabulary size is positively correlated with accuracy.
Non-linear (RBF) SVM kernel
Params
- RBF kernel (characteristic length scale squared i.e. variance = 1)
- bin_size = 4
- step_size = 8
- vocab_size = 200
- lambda = 0.1
accuracy = 0.6200
Discussion
While the accuracy happens to be exactly the same as with the linear kernel, the confusion matrix is somewhat different (you can verify this by comparing them side by side). Had I had more time to tune the parameters, I would expect a larger increase in accuracy. When I tried more extreme variances, I got dismal results: the SVM ensemble classified every test image as the same scene. Since SVMs are binary classifiers, we must aggregate the results of each SVM in some way to get multiclass classification. I believe my implementation was correct, but I did not have time to properly tune the parameters (to which SVMs are very sensitive). Thus, I attribute the failure to a single one-versus-all SVM (e.g. the bedroom-or-not classifier) returning large values, whose voice dominated the classification. Since these values are not probabilities, there is no principled way to compare the voices of different SVMs.
Extreme variances tried:
- variance = 0.001
- variance = 100
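The degenerate behavior at extreme variances is easy to see from the kernel itself (assuming the common convention K(x, y) = exp(-||x - y||^2 / (2 * variance)); the exact scaling in my run may differ). With a tiny variance, every off-diagonal kernel value collapses to 0, so the kernel matrix approaches the identity; with a huge variance, every value saturates to 1, so all images look identical to the SVM:

```python
import numpy as np

def rbf_kernel(X, Y, variance=1.0):
    """RBF kernel matrix: K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 * variance))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * variance))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
# Tiny variance: distinct points look maximally dissimilar (K -> identity).
# Huge variance: all points look identical (K -> all-ones matrix).
```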