This project classifies images into one of 15 scene categories (bedroom, suburb, industrial, kitchen, living room,
coast, forest, highway, inside city, mountain, open country, street, tall building, office, store). At a high level,
the implementation follows this pipeline: build a visual vocabulary from the training images, represent each training image as
a histogram of the visual words in that vocabulary, train a one-vs-all classifier for each category, represent each
test image as a histogram of visual words, feed these representations into the classifiers, and label each
test image with the category whose classifier claims it most strongly.
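
The final assignment step can be sketched as follows. This is only an illustration, assuming each one-vs-all classifier is linear (a weight vector plus a bias) and that "claims it most strongly" means the highest score. The function name `classify_one_vs_all` and the data layout are hypothetical, not taken from the project code.

```python
from typing import Dict, List, Tuple

# Assumption: each category's classifier is linear, stored as (weights, bias).
Classifier = Tuple[List[float], float]


def classify_one_vs_all(histogram: List[float],
                        classifiers: Dict[str, Classifier]) -> str:
    """Return the category whose classifier gives the highest score."""
    best_label = ""
    best_score = float("-inf")
    for label, (weights, bias) in classifiers.items():
        # Linear decision value: dot(weights, histogram) + bias.
        score = sum(w * x for w, x in zip(weights, histogram)) + bias
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy example with a two-word vocabulary and two categories.
toy_classifiers = {
    "forest": ([1.0, -1.0], 0.0),
    "coast": ([-1.0, 1.0], 0.0),
}
predicted = classify_one_vs_all([0.8, 0.2], toy_classifiers)
```

Taking the maximum score, rather than thresholding each classifier separately, guarantees that every test image receives exactly one label even when no classifier, or more than one, returns a positive score.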
The visual vocabulary is built by first densely sampling each training image and computing a SIFT descriptor at
each sample location. The set of SIFT descriptors from all of the training images is then fed into a k-means function,
which clusters them into as many clusters as the vocabulary size; the cluster centers are the visual words.
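
The clustering step might look like the sketch below. It is a plain Lloyd's-algorithm k-means written for illustration, not the k-means function used in the project, and it takes the descriptors as already-extracted lists of numbers. The names `build_vocabulary` and `squared_distance`, the iteration count, and the random initialization are all assumptions.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_vocabulary(descriptors: List[Sequence[float]], vocab_size: int,
                     iterations: int = 20, seed: int = 0) -> List[List[float]]:
    """Cluster descriptors with basic k-means; return the cluster centers."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with randomly chosen descriptors.
    centers = [list(d) for d in rng.sample(list(descriptors), vocab_size)]
    for _ in range(iterations):
        clusters: List[List[Sequence[float]]] = [[] for _ in range(vocab_size)]
        # Assignment step: each descriptor joins its nearest center.
        for d in descriptors:
            nearest = min(range(vocab_size),
                          key=lambda k: squared_distance(d, centers[k]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy data: two well-separated groups of 2-D "descriptors".
toy_descriptors = [[0, 0], [0, 1], [10, 10], [10, 11]]
toy_vocabulary = build_vocabulary(toy_descriptors, vocab_size=2)
```

Real SIFT descriptors are 128-dimensional and number in the hundreds of thousands, so a pure-Python loop like this would be far too slow in practice; it is meant only to show what the clustering computes.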
To turn an image into a histogram of visual words from the previously built vocabulary, we first perform
the same dense SIFT extraction on the image. Each descriptor is then assigned to the closest visual word
in the vocabulary by squared Euclidean distance, and the histogram records how many descriptors fall on each word.
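
A minimal sketch of this assignment and counting step is shown below. The function name `image_histogram` is hypothetical, and the optional normalization (dividing by the number of descriptors so that images of different sizes are comparable) is an assumption; the text above does not say whether the histograms were normalized.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def image_histogram(descriptors: List[Sequence[float]],
                    vocabulary: List[Sequence[float]],
                    normalize: bool = True) -> List[float]:
    """Count how many descriptors fall on each visual word."""
    counts = [0.0] * len(vocabulary)
    for d in descriptors:
        # Nearest visual word by squared Euclidean distance.
        nearest = min(range(len(vocabulary)),
                      key=lambda k: squared_distance(d, vocabulary[k]))
        counts[nearest] += 1.0
    total = sum(counts)
    if normalize and total > 0:
        # Assumption: scale the histogram so its entries sum to 1.
        return [c / total for c in counts]
    return counts


# Toy example: three descriptors near word 0, one near word 1.
toy_vocabulary = [[0, 0], [10, 10]]
toy_descriptors = [[1, 0], [0, 2], [1, 1], [9, 9]]
toy_histogram = image_histogram(toy_descriptors, toy_vocabulary)
```

Squared distance is sufficient here because it preserves the ordering of the true Euclidean distances, so the nearest word is the same and the square root can be skipped.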
For my baseline, I used the following parameters:
To try to improve on my baseline, I experimented with increasing the SIFT step size and bin size. I did not experiment with decreasing them, because the baseline values were already small; decreasing them further would have produced thousands of tiny SIFT descriptors, each covering too small a patch to capture relevant information.
Larger bin and step size:
I also experimented with different vocabulary sizes, both larger and smaller than the baseline of 200.
Overall, I found that increasing the bin and step size for the dense SIFT decreased performance markedly. As for vocabulary size, a fourfold decrease (200 to 50) led to a negligible decrease in accuracy, but a twentyfold decrease (200 to 10) led to a significant drop in accuracy. In the other direction, doubling the vocabulary size led to only a negligible increase in accuracy.