In this project, I implemented an image classification system using a bag-of-words model. To do this, I started by extracting SIFT descriptors from a large number of training images. A SIFT descriptor encodes information about the image gradients around a point and can be used to uniquely identify small regions. It is possible to sample these descriptors only at locations determined to be prominent and distinctive, but for this project we sampled them across a dense grid over the original images. Since this yields millions of descriptors, many of which we hope are similar, we run k-means on them to extract a vocabulary of 50-1000 descriptors that epitomize most of our images. Next, we created histograms describing how strongly a set of already-labeled images resemble each vocabulary word. To do this, we computed dense SIFT descriptors for each image, found which "vocabulary" descriptor each was most like, and counted the fraction of the image's descriptors assigned to each vocabulary cluster. We then used these histograms to train a one-vs-all SVM for each scene class. Finally, to classify a new image, we construct its histogram, run it through all the SVMs, and see which one responds most strongly.
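For concreteness, here is a minimal sketch of that pipeline in Python, using OpenCV and scikit-learn rather than the project's actual code; the grid step, patch size, and use of `LinearSVC` are illustrative assumptions, not settings taken from the report.

```python
import numpy as np
import cv2
from scipy.spatial.distance import cdist
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC

def dense_sift(gray, step=8, size=8):
    """Compute SIFT descriptors on a dense grid instead of detected keypoints."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(step, gray.shape[0] - step, step)
           for x in range(step, gray.shape[1] - step, step)]
    _, desc = sift.compute(gray, kps)
    return desc  # shape: (num_points, 128)

def build_vocab(train_grays, vocab_size=400):
    """Pool dense SIFT from all training images, then k-means into a vocabulary."""
    pool = np.vstack([dense_sift(g) for g in train_grays])
    return MiniBatchKMeans(n_clusters=vocab_size, random_state=0).fit(pool).cluster_centers_

def bow_histogram(gray, vocab):
    """Assign each descriptor to its nearest vocabulary word and count fractions."""
    words = cdist(dense_sift(gray), vocab).argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalize so image size does not matter

# LinearSVC performs the one-vs-all reduction internally: at test time the
# class whose SVM responds most strongly wins, as described above.
# clf = LinearSVC().fit(train_histograms, train_labels)
# label = clf.predict([bow_histogram(test_gray, vocab)])
```

Normalizing the histogram to sum to one means images of different sizes produce comparable feature vectors, which matters since the dense grid yields more descriptors for larger images.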
| Vocabulary size | 20 | 50 | 100 | 200 | 400 | 1000 |
|---|---|---|---|---|---|---|
| Accuracy | 0.5200 | 0.5807 | 0.6053 | 0.6173 | 0.6233 | 0.6140 |
As the table shows, the best performance occurred with a vocabulary of 400 SIFT descriptors. The confusion matrix for this vocabulary size is:
The only problem is that creating this original vocabulary was very expensive and took a very long time. While this calculation only has to happen once, I was curious what effect reducing the amount of data being clustered would have. To do so, before adding each image's SIFT descriptors to the pool to be clustered, I clustered that image's descriptors down to some fraction of their original number (a code sketch of this step follows the table). Here are the results.
| Pre-clustering | None | 1/5 | 1/10 | 1/20 | 1/50 |
|---|---|---|---|---|---|
| Accuracy | 0.6233 | 0.6140 | 0.6160 | 0.6207 | 0.6180 |
| Time to SIFT (s) | 87 | 119 | 110 | 98 | 96 |
| Time to cluster (s) | 1761 | 698 | 343 | 173 | 53 |
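Here is a minimal sketch of the preclustering step, assuming the `dense_sift` helper from the earlier sketch; the per-image k-means replaces each image's descriptors with a smaller set of centers before pooling.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def precluster(desc, fraction=1 / 20):
    """Reduce an image's descriptors to k-means centers, k = fraction * count."""
    k = max(1, int(len(desc) * fraction))
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(desc).cluster_centers_

def build_vocab_preclustered(train_grays, vocab_size=400, fraction=1 / 20):
    """Pool each image's cluster centers, then k-means the pool into a vocabulary."""
    pool = np.vstack([precluster(dense_sift(g), fraction) for g in train_grays])
    return MiniBatchKMeans(n_clusters=vocab_size, random_state=0).fit(pool).cluster_centers_
```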
It appears that preclustering to get 1/20th the data works well, causing our accuracy to drop only 0.2% while decreasing the clustering time by an order of magnitude. Its confusion matrix is:
This looks almost identical to the matrix without preclustering. This could prove useful if we were to experiment with descriptors other than SIFT, but since I am keeping this project limited to SIFT, computing the vocabulary once is fine even if it takes a long time.
Now that we've pinned down our vocabulary size, we're going to optimize our parameters. The most important one is the SVM regularization parameter lambda. As it turns out, we can make significant improvements just by changing this one value (note: all previous results set it to the default of 0.1). A sketch of this sweep appears after the table.
| Lambda | 0.1 | 0.01 | 0.008 | 0.001 |
|---|---|---|---|---|
| Accuracy | 0.6233 | 0.6847 | 0.6873 | 0.6647 |
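Here is a hedged sketch of the sweep, reusing the `train_histograms` and `train_labels` names from the first sketch. The report's lambda reads like the regularization weight of a primal SVM (as in VLFeat's `vl_svmtrain`); in scikit-learn's `LinearSVC` the analogous knob is `C`, roughly 1 / (lambda × n_samples), so the conversion below is an assumption, not an exact equivalence.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

for lam in [0.1, 0.01, 0.008, 0.001]:
    C = 1.0 / (lam * len(train_histograms))  # assumed lambda-to-C mapping
    clf = LinearSVC(C=C, max_iter=10000)
    acc = cross_val_score(clf, train_histograms, train_labels, cv=5).mean()
    print(f"lambda={lam}: accuracy={acc:.4f}")
```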
Just by changing this one parameter, we are able to achieve a performance increase of over 6%. The resulting confusion matrix is:
There is a noticeable difference from the base algorithm, especially for the later classes, which became markedly more accurate. I was also curious how a radial basis function kernel would affect my SVM compared to this linear one (using the same optimal lambda) as I varied the kernel's scale length.
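A sketch of that experiment follows, again assuming the names from the earlier sketches. scikit-learn's `SVC` parameterizes the RBF kernel as exp(-gamma · ||x - y||²), so a scale length sigma maps to gamma = 1 / (2 · sigma²); the sigma range here is illustrative, not the one I actually swept.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for sigma in np.logspace(-2, 1, 7):
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
    acc = cross_val_score(clf, train_histograms, train_labels, cv=5).mean()
    print(f"sigma={sigma:.3f}: accuracy={acc:.4f}")
```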
It appears that the accuracy curve has two peaks, both above the best-performing linear kernel: one at a very small scale length and another at a very large one, and both perform similarly well. The confusion matrix for the better of the two, which occurs around a scale length of 0.85, is:
This confusion matrix is almost indistinguishable from the previous one with a linear kernel, meaning that our gains are relatively minor: each peak provides only a 0.5% performance boost over the linear kernel. It is possible that the radial kernel has a better SVM parameter for each scale length, but I lacked the time and interest to perform a full two-dimensional optimization.
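For reference, that two-dimensional search could look like the sketch below, using scikit-learn's `GridSearchCV` over `C` (standing in for lambda) and `gamma` (standing in for scale length); the grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": np.logspace(0, 3, 4),       # stand-in for sweeping lambda
    "gamma": np.logspace(-3, 1, 5),  # stand-in for sweeping scale length
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(train_histograms, train_labels)
print(search.best_params_, search.best_score_)
```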