High-level scene recognition is an important problem in computer vision: we want to see whether computers can sort images into a thematic, scene-based classification scheme. A good way to approach this problem is with a bag-of-words model. The idea is simple: visual "words" (SIFT descriptors in our case) are loosely or tightly associated with particular scenes. Given a dictionary of vocabulary "words", we learn the distribution of these words across images from each scene. To do this we train an SVM on the word distributions, and we use that SVM later to classify new test images: we compute the histogram of vocabulary "words" for a test image and let the SVM match it to one of the scene distributions.
We need to devise a strong vocabulary for our bag-of-words model. The vocabulary is a collection of visual "words"; specifically, each word is a scale-invariant feature (SIFT) descriptor, which identifies and describes local structure in an image. In my code I used a vocabulary of 200 "words" built from all 1500 training images. Specifically, I call vl_dsift to get a matrix of SIFT features from each image, concatenate these features, and feed them into vl_kmeans. The vl_kmeans algorithm clusters all of the SIFT features into 200 clusters, and I take the center of each cluster as a vocabulary word.
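The vocabulary-building step can be sketched in Python, with NumPy and SciPy's k-means standing in for the MATLAB/VLFeat calls. The random arrays below are placeholders for real 128-D dense-SIFT descriptors, and `build_vocabulary` is a hypothetical helper name, not code from the project:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def build_vocabulary(descriptor_sets, vocab_size=200, seed=0):
    """Stack per-image descriptor arrays and k-means them into vocab_size words."""
    all_feats = np.vstack(descriptor_sets).astype(np.float64)
    # Cluster every descriptor; the cluster centers become the vocabulary,
    # mirroring the vl_kmeans step in the MATLAB pipeline.
    centers, _ = kmeans2(all_feats, vocab_size, minit='++', seed=seed)
    return centers  # shape (vocab_size, 128) for SIFT

# Placeholder "SIFT" descriptors: 3 images, 500 descriptors each.
rng = np.random.default_rng(0)
fake_sift = [rng.random((500, 128)) for _ in range(3)]
vocab = build_vocabulary(fake_sift, vocab_size=200)
print(vocab.shape)  # (200, 128)
```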
Note: the relative advantage of clustering all features from all 1500 training images was minimal. Clustering a randomized subset of features from a randomized subset of images can still find 200 strong vocabulary words while reducing the running time of this step tremendously.
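A sketch of that subsampling idea (Python/NumPy analogue; `subsample_descriptors`, the per-image count, and the sizes are illustrative, not taken from the original code):

```python
import numpy as np

def subsample_descriptors(descriptor_sets, per_image=100, seed=0):
    """Randomly keep at most per_image descriptors from each image
    before handing everything to k-means."""
    rng = np.random.default_rng(seed)
    kept = []
    for feats in descriptor_sets:
        idx = rng.choice(len(feats), size=min(per_image, len(feats)),
                         replace=False)
        kept.append(feats[idx])
    return np.vstack(kept)

rng = np.random.default_rng(1)
sets = [rng.random((500, 128)) for _ in range(10)]
sub = subsample_descriptors(sets, per_image=100)
print(sub.shape)  # (1000, 128) instead of (5000, 128)
```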
In our workflow, the SVM builds a model for each scene based on the word-frequency histograms of that scene's images. The next part of the project is to build these histograms. Given an image and a vocabulary, we first find all of the SIFT features in the image using vl_dsift(). We then create a histogram with 200 buckets, one per vocabulary word. For each SIFT feature in the image, I compare it to all 200 words in the vocabulary, find the nearest word using vl_alldist2, and increment that word's bucket. Finally, I normalize the histogram.
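The histogram step can be sketched in Python as follows; the brute-force pairwise distance computation stands in for vl_alldist2, and the arrays are again placeholders for real SIFT data:

```python
import numpy as np

def make_hist(feats, vocab):
    """Assign each descriptor to its nearest vocabulary word and
    return the L1-normalized word-count histogram."""
    # Pairwise squared distances, shape (n_feats, vocab_size),
    # playing the role of vl_alldist2.
    d2 = (feats**2).sum(1)[:, None] + (vocab**2).sum(1)[None, :] \
         - 2.0 * feats @ vocab.T
    nearest = d2.argmin(axis=1)              # closest word per feature
    hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    return hist / hist.sum()                 # normalize

rng = np.random.default_rng(0)
vocab = rng.random((200, 128))               # placeholder vocabulary
feats = rng.random((1000, 128))              # placeholder SIFT features
h = make_hist(feats, vocab)
print(h.shape, round(h.sum(), 6))            # (200,) 1.0
```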
The make_hist.m script is used in two stages. During the learning stage, we take the histograms from each image in a scene and feed them into the SVM to learn a model; in my run I computed histograms for 100 images per scene class. make_hist.m is called a second time during the testing phase.
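The training stage can be sketched with scikit-learn's LinearSVC standing in for the SVM used in the project. The histograms here are synthetic: each class gets a bump in its own block of vocabulary words so the toy problem is separable, which is not real training data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic histograms: 100 images per class for 3 scene classes.
X, y = [], []
for label in range(3):
    h = rng.random((100, 200))
    h[:, label * 60:(label + 1) * 60] += 2.0   # class-specific word bump
    h /= h.sum(axis=1, keepdims=True)          # normalize like make_hist
    X.append(h)
    y += [label] * 100
X = np.vstack(X)
y = np.array(y)

# Fit a linear SVM on the word-frequency histograms.
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.score(X, y))
```

At test time the same histogram routine is run on a new image and `clf.predict` picks the scene whose model matches best.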
For my final test I ran a full evaluation with the following parameters: in build_vocabulary.m and make_hist.m, my vl_dsift() calls used a bin size of 4 and a step of 8.
The complete run-through and testing showed that my code had an accuracy of 0.6173. The confusion matrix shows a strong response down the diagonal, which means that the scenes of most images were correctly recognized.
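To make the diagonal claim concrete, here is a minimal sketch of how a confusion matrix and the resulting accuracy are computed; the labels below are made up and unrelated to the actual evaluation:

```python
import numpy as np

def confusion_matrix(true, pred, n_classes):
    """cm[i, j] counts images of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true, pred):
        cm[t, p] += 1
    return cm

true = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(true, pred, 3)
# Correct classifications lie on the diagonal, so accuracy is
# the diagonal sum over the total count.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.666...
```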