Scene recognition with bag of words

Alex Hills (ahills)

Purpose

This project tackles image classification: given a set of labeled training images, train classifiers that can assign the correct label to a new, unseen image. For instance, given many pictures of forests, the system should be able to tag a completely new forest image as a forest.

Feature extraction and vocabulary building

The first step was producing the data to train the classifiers on. For this project, we extracted features from the images (a variant of SIFT features). After performing this feature extraction, we clustered the features using k-means to produce centroids, forming a vocabulary of visual words against which the set of features making up any specific image could be described.
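
A minimal sketch of this step, assuming OpenCV's SIFT and scikit-learn's k-means (the grid step, default sample counts, and helper names here are illustrative, not the project's exact code):

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(image_paths, vocab_size=200, samples_per_image=500):
        sift = cv2.SIFT_create()
        all_descriptors = []
        for path in image_paths:
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            # Sample keypoints on a dense grid instead of running SIFT's detector.
            step = 8
            keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                         for y in range(step, img.shape[0], step)
                         for x in range(step, img.shape[1], step)]
            _, descriptors = sift.compute(img, keypoints)
            if descriptors is None:
                continue
            # Keep a random subset so no single image dominates the clustering.
            idx = np.random.choice(len(descriptors),
                                   min(samples_per_image, len(descriptors)),
                                   replace=False)
            all_descriptors.append(descriptors[idx])
        # Cluster the pooled descriptors; the centroids are the visual words.
        kmeans = KMeans(n_clusters=vocab_size, n_init=4)
        kmeans.fit(np.vstack(all_descriptors))
        return kmeans.cluster_centers_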

Image histograms

We then took every image used to generate our vocabulary and created a histogram for it, binning each of the image's features by the nearest centroid in the vocabulary.
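
A minimal sketch of the histogram step, assuming a vocabulary like the one returned by the build_vocabulary sketch above (SciPy's cdist does the nearest-centroid lookup):

    import numpy as np
    from scipy.spatial.distance import cdist

    def bag_of_words_histogram(descriptors, vocab):
        # Assign each feature descriptor to its closest visual word...
        words = np.argmin(cdist(descriptors, vocab), axis=1)
        # ...and count how often each word occurs in this image.
        hist = np.bincount(words, minlength=len(vocab)).astype(float)
        # Normalize so images yielding different feature counts are comparable.
        return hist / hist.sum()

Normalizing matters because the number of features per image varies, and raw counts would otherwise swamp the comparison.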

SVM training and classification

The final step was to train SVMs and then use them to classify new images. Each SVM is trained on the histograms as a one-vs-all classifier for a single category. To classify an image, we extract its features, build its histogram as before, and run the histogram through all the trained SVMs; the image is assigned the label of the SVM that responds with the highest confidence.
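
A minimal sketch of the one-vs-all setup, assuming scikit-learn's LinearSVC (the regularization constant and function names are illustrative):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_one_vs_all(train_histograms, train_labels, categories):
        svms = {}
        for category in categories:
            # Positives are this category; negatives are every other category.
            binary = (np.asarray(train_labels) == category).astype(int)
            svms[category] = LinearSVC(C=1.0).fit(train_histograms, binary)
        return svms

    def classify(test_histograms, svms):
        # Each SVM's decision_function is its confidence; take the argmax.
        categories = list(svms)
        scores = np.column_stack([svms[c].decision_function(test_histograms)
                                  for c in categories])
        return [categories[i] for i in np.argmax(scores, axis=1)]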

Rating

We then built a confusion matrix, which showed what category each image was classified as by the algorithm alongside its actual category.
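
A minimal sketch of that matrix, assuming string category labels; rows are true categories and columns are predicted ones, so correct classifications land on the diagonal:

    import numpy as np

    def confusion_matrix(true_labels, predicted_labels, categories):
        index = {c: i for i, c in enumerate(categories)}
        matrix = np.zeros((len(categories), len(categories)), dtype=int)
        for t, p in zip(true_labels, predicted_labels):
            matrix[index[t], index[p]] += 1
        # Overall accuracy is the fraction of mass on the diagonal.
        accuracy = np.trace(matrix) / matrix.sum()
        return matrix, accuracy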

Results

The first time I ran my code, I ended up with an accuracy of 0.2, meaning that 20% of the images were classified correctly (far higher than the roughly 7% expected from random guessing). This is still a pretty bad score, and it was hard to analyze: the pipeline was clearly working, but something was going wrong.

After some tweaking, I also compared features from three levels of a Gaussian pyramid (sketched below). This increased my scores by about 5%, regardless of how many samples I took or the vocabulary size.
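
A minimal sketch of this multi-scale variant, reusing the dense-SIFT idea from the first sketch; the pre-smoothing kernel and the default number of levels here are assumptions:

    import cv2
    import numpy as np

    def multiscale_descriptors(img, sift, levels=3, step=8):
        descriptors = []
        level_img = cv2.GaussianBlur(img, (5, 5), 1.0)  # pre-smoothing (assumed kernel)
        for _ in range(levels):
            keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                         for y in range(step, level_img.shape[0], step)
                         for x in range(step, level_img.shape[1], step)]
            if not keypoints:
                break
            _, desc = sift.compute(level_img, keypoints)
            if desc is not None:
                descriptors.append(desc)
            # Blur and downsample to move to the next pyramid level.
            level_img = cv2.pyrDown(level_img)
        return np.vstack(descriptors)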

I also tested a number of different sampling rates and vocabulary sizes. Here are some scores (all use multiple levels of a Gaussian pyramid and pre-smoothing unless otherwise specified):
Vocab 200, 2000 samples per image: 31% (this was the best)
Vocab 200, 500 samples per image: 27%
Vocab 100, 500 samples per image: 22%
Vocab 200, 1500 samples per image, no pre-smoothing: 26.8% (this showed how important the pre-smoothing was)
Vocab 200, 500 samples per image, pre-smoothed, no Gaussian pyramid: 20%
Vocab 100, 500 samples per image, pre-smoothed, no Gaussian pyramid: 16%
Vocab 50, 500 samples per image, pre-smoothed, no Gaussian pyramid: 9%