Scene Recognition with Bag of Words

aabouche

One way for a computer to identify scenes from images is to use a bag of words model. The class of location that an image falls into can be identified by recognizing which visual words the image contains. First, the computer must learn a visual vocabulary from a set of labeled images. In this case, the algorithm is given 1,500 training images and a fixed vocabulary size (I tried vocabularies of 100 and 230 words). The algorithm computes the SIFT descriptors for each of the 1,500 images and randomly chooses 500 descriptors from each image to combine into a larger collection of SIFT descriptors (750,000 in total). This larger collection is then put through kmeans and clustered into 100 clusters (the vocabulary size), and each cluster center becomes one word in the vocabulary.
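To illustrate the vocabulary step, here is a minimal Python sketch of sampling descriptors from each image and clustering them. It is only an illustration under stated assumptions, not the project's code: the function names (sample_descriptors, kmeans) are hypothetical, random 4-dimensional vectors stand in for real SIFT descriptors, and a plain Lloyd's k-means replaces whatever kmeans implementation was actually used.

```python
import random

def sample_descriptors(per_image, n_per_image, rng):
    """Pool up to n_per_image randomly chosen descriptors from every image."""
    pooled = []
    for descs in per_image:
        k = min(n_per_image, len(descs))
        pooled.extend(rng.sample(descs, k))
    return pooled

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters, rng):
    """Plain Lloyd's k-means; returns k cluster centers (the visual words)."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

rng = random.Random(0)
# Toy stand-in for SIFT (assumption): 6 "images", 20 descriptors each, 4 dimensions.
images = [[[rng.random() for _ in range(4)] for _ in range(20)] for _ in range(6)]
pooled = sample_descriptors(images, 5, rng)  # 6 images * 5 descriptors = 30
vocab = kmeans(pooled, 3, 10, rng)           # a 3-word vocabulary
```

In the setup described above, the same idea would run with 1,500 images, 500 descriptors per image, and a vocabulary size of 100 or 230.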

Next, each of the training images is converted into a histogram over the vocabulary. First, the SIFT features of the training image are found. The histogram has one bin per word in the vocabulary, and each bin holds the number of times that word occurs among the image's SIFT features. Because the SIFT features from a training image are probably not exactly the same as any vocabulary word, each feature is assigned to the closest word. vl_alldist2 takes in the collection of SIFT features and the vocabulary. The function compares every pair of columns that can be formed between the two matrices and stores the distance for each pair in a result matrix. From the result matrix, the closest word to each feature can be identified and counted in the histogram.
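The histogram step can be sketched in a few lines of Python. This is an illustrative sketch, not the project's code: pairwise_sq_dists is a hypothetical stand-in for the role vl_alldist2 plays, it stores one descriptor per row rather than per column, and it assumes squared Euclidean distance. The 2-dimensional toy data is made up.

```python
def pairwise_sq_dists(features, vocab):
    """Distance matrix: entry [i][j] is the squared distance from feature i to word j."""
    return [[sum((x - y) ** 2 for x, y in zip(f, w)) for w in vocab]
            for f in features]

def bag_of_words_histogram(features, vocab):
    """Count how many features fall closest to each vocabulary word."""
    hist = [0] * len(vocab)
    for row in pairwise_sq_dists(features, vocab):
        hist[row.index(min(row))] += 1  # index of the nearest word
    return hist

# Toy data (assumption): three 2-D "words" and four 2-D "features".
vocab = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
features = [[1.0, 0.5], [9.0, 9.5], [0.2, 0.1], [1.0, 9.0]]
hist = bag_of_words_histogram(features, vocab)
print(hist)  # [2, 1, 1]
```

Two features land nearest the first word and one each nearest the second and third, so the bins always sum to the number of features in the image.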

These training histograms show how words are distributed within each category of location (the training labels). This information is given to an SVM, which learns which distributions of words go with each category. The test images are converted into histograms in the same way and given to the SVM, which determines a label for each of them.
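As a rough sketch of the classification step, the Python below trains one linear SVM per category (one-vs-rest) and labels a new histogram with the category whose SVM scores highest. Everything here is an assumption made for illustration: the write-up does not say which SVM variant, kernel, or multi-class scheme was used, the training loop is a simple subgradient descent on the hinge loss, and the histograms and category names are made up.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit w, b for labels y in {-1, +1} by subgradient descent on the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            # The regularizer always shrinks w; the hinge term acts only inside the margin.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def train_one_vs_rest(X, labels):
    """One binary SVM per category: that category against all the others."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if label == c else -1 for label in labels]
        models[c] = train_linear_svm(X, y)
    return models

def predict(models, x):
    """Pick the category whose SVM gives the highest score."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])

def normalize(hist):
    """Scale a histogram so its bins sum to 1."""
    total = sum(hist)
    return [v / total for v in hist]

# Made-up 3-word histograms for three hypothetical categories.
train_hists = [[8, 1, 1], [7, 2, 1], [1, 8, 1], [2, 7, 1], [1, 1, 8], [1, 2, 7]]
train_labels = ["kitchen", "kitchen", "forest", "forest", "street", "street"]
models = train_one_vs_rest([normalize(h) for h in train_hists], train_labels)
pred = predict(models, normalize([9, 1, 0]))  # histogram of an unseen test image
```

Normalizing the histograms is a design choice of this sketch, so that images with different numbers of SIFT features are comparable; the write-up does not state whether the histograms were normalized.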

Results

The first run used a vocabulary of 100 words. The accuracy was 0.6193.

The second run used a vocabulary of 230 words. The accuracy was 0.6180.