CS 143 / Project 3 / Scene Recognition with Bag of Words

Project Objective

The goal of this project was to implement basic scene recognition. The system was first built with simple methods, tiny images and nearest neighbors, then progressed to more advanced implementations using bag-of-words features and linear classifiers trained as SVMs. While the simple implementations had poor recognition rates (under 20%), a tuned system can obtain results approaching 70%, roughly 10x better than chance.

Development and Performance

Tiny Images and Nearest Neighbors (14.5%)

Tiny images encode only the lowest frequencies, as most of the high-frequency content of an image is lost when it is downsized to such a small size (16x16). Additionally, the representation is not shift invariant, so different viewpoints of the same scene may be classified with different labels. Nevertheless, when paired with a nearest-neighbor classifier, 14.5% of the test images were labeled correctly, still roughly twice the chance rate of 1/15 ≈ 6.7%. To improve performance, I normalized each image and made it zero mean. This gave slight performance gains by increasing intensity invariance, but did not overcome the inherent losses of the tiny-image representation.
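
As a concrete reference, here is a minimal Python sketch of this baseline (the function names and the use of PIL/scipy are my illustration, not the project's actual code):

    import numpy as np
    from PIL import Image
    from scipy.spatial.distance import cdist

    def tiny_image_feature(path, size=16):
        """Resize to size x size, flatten, then zero-mean and unit-normalize."""
        img = Image.open(path).convert('L').resize((size, size))
        feat = np.asarray(img, dtype=np.float64).ravel()
        feat -= feat.mean()                      # zero mean for intensity invariance
        norm = np.linalg.norm(feat)
        return feat / norm if norm > 0 else feat

    def nearest_neighbor_labels(train_feats, train_labels, test_feats):
        """Give each test image the label of its closest training image."""
        dists = cdist(test_feats, train_feats)   # all pairwise Euclidean distances
        return [train_labels[i] for i in dists.argmin(axis=1)]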

Bag of Words and Nearest Neighbors (55.4%)

To build a vocabulary for our BoW representation, we sample every training image with a large step size (15) and collect SIFT features. We then use k-means to cluster these features and take the resulting cluster centers as our vocabulary. To determine which word a feature represents, we simply assign it to the nearest cluster center. The bag-of-words model encodes much more useful information about the image; even at small vocabulary sizes (100 words) it still outperforms the tiny-image representation. This is because the bag of words describes the types of structures found in a scene and how frequently they occur. We know that man-made scenes tend to have strong, orthogonal edges compared to natural scenes, and this distinction can be highlighted in a BoW model.
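
A rough Python sketch of this pipeline, assuming a dense_sift(image, step) helper (e.g., from a VLFeat binding) that returns an (N, 128) array of descriptors; scikit-learn's k-means stands in for whatever clustering implementation is actually used:

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    def build_vocabulary(images, vocab_size=100, step=15):
        """Cluster densely sampled SIFT features; the centers become the vocabulary."""
        feats = [dense_sift(img, step=step) for img in images]  # assumed helper
        feats = np.vstack(feats)                                # concatenate all images
        return KMeans(n_clusters=vocab_size).fit(feats).cluster_centers_

    def bow_histogram(image, vocab, step=5):
        """Assign each feature to its nearest word and build a normalized histogram."""
        feats = dense_sift(image, step=step)
        words = cdist(feats, vocab).argmin(axis=1)              # nearest cluster center
        hist, _ = np.histogram(words, bins=np.arange(len(vocab) + 1))
        return hist / hist.sum()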

Tiny Images and Linear SVM (12.3%)

One-vs-all linear SVMs were trained on the input, and the classifier with the most positive confidence was used to label each test image. However, I did not see any improvement over the nearest-neighbor classifier on tiny images; accuracy actually dropped slightly, to 12.3%. This suggests that the fault lies in the feature, not the classifier. As previously discussed, the tiny image does not encode useful information for recognizing a scene, so even with a more intelligent classifier we cannot obtain better results. Often the voting was unstable, with results strongly shifted toward one category in particular. I did tune my lambda value to produce a set of classifiers that were more varied in their voting.
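
The voting scheme looks roughly like the following sketch, where scikit-learn's LinearSVC stands in for the project's SVM solver, and the C = 1/lambda mapping is my assumption about how its regularization parameter relates to lambda:

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_one_vs_all(train_feats, train_labels, categories, lam=1e-4):
        """Train one binary linear SVM per category: that category vs. all others."""
        labels = np.asarray(train_labels)
        return {cat: LinearSVC(C=1.0 / lam).fit(train_feats,
                                                np.where(labels == cat, 1, -1))
                for cat in categories}

    def classify(test_feats, svms):
        """Label each test image by the classifier with the most positive confidence."""
        cats = list(svms)
        conf = np.column_stack([svms[c].decision_function(test_feats) for c in cats])
        return [cats[i] for i in conf.argmax(axis=1)]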

Bag of Words and Linear SVM (62.2%)

While linear SVMs did not benefit the performance of tiny-image features, they did increase the performance of my bag-of-words features, from 55.4% to 62.2%. Some words in the vocabulary may not be helpful for recognition because images from many categories contain them. A linear SVM can assign a very small weight to these words, indicating that they have little effect on the outcome. Nearest-neighbor classifiers do not share this luxury and may be swayed by words that contribute nothing to the recognition process.
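
One way to observe this down-weighting (a hypothetical diagnostic built on the classifiers sketched above, not something the project required) is to look at the largest absolute weight each visual word receives across the one-vs-all SVMs; words that stay near zero everywhere contribute almost nothing to any decision:

    import numpy as np

    def word_importance(svms):
        """Max absolute weight assigned to each visual word across all category SVMs."""
        W = np.vstack([clf.coef_.ravel() for clf in svms.values()])  # (n_cats, vocab_size)
        return np.abs(W).max(axis=0)    # near-zero entries mark uninformative words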

An Interesting Discovery: Vocabulary Source Is Not a Major Factor

During development I had a bug: while collecting SIFT features for my vocabulary, I was overwriting my feature matrix instead of concatenating onto it, so the vocabulary was built from the features of only one image. Yet I did not see a negative impact on my performance, nor did I see a change when I switched to using features from all of the images. This suggests that it is the separation and quantization of the SIFT space that matters for a BoW model, and that having a diverse, intelligently chosen vocabulary is not as important.
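
In Python terms (using the same assumed dense_sift helper as above), the bug was essentially the difference between these two loops:

    import numpy as np

    # Buggy version: each iteration overwrites the previous features, so k-means
    # ends up clustering descriptors from only the last image.
    for img in images:
        feats = dense_sift(img, step=15)

    # Fixed version: accumulate features from every image, then concatenate.
    all_feats = []
    for img in images:
        all_feats.append(dense_sift(img, step=15))
    all_feats = np.vstack(all_feats)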

More Test Results

Testing Stability with Respect to Lambda in TinyImages+SVM
These classifiers were unstable, voting mainly for one category:
Lambda = 0.00000003
Lambda = 0.001
Lambda = 0.01
Lambda = 0.1
This one voted for 3 main categories, making it slightly more stable:
Lambda = 10, 3 main categories

I had the opportunity to test many different vocabulary sizes, ranging from 100 to 5000. I found that increasing the vocabulary size did produce modest performance boosts of a few percent. However, runtime increased greatly at the largest sizes: since assigning a feature to a word requires a distance computation against every cluster center, the cost grows linearly with vocabulary size, and tens of times more distance calculations per image were needed to gain those few percent. It would be more beneficial to better encode image content in the features than to keep growing the vocabulary that describes them.

Testing Different Vocab Sizes
All of these were constructed with a vocabulary based only on the last image:
Lambda = 0.001, Vocab Size = 800
Lambda = 0.001, Vocab Size = 1000
Lambda = 0.0001, Vocab Size = 1000
Lambda = 0.0001, Vocab Size = 1500

These had a vocabulary built from all images:
Lambda = 0.0001, Vocab Size = 100
Lambda = 0.0001, Vocab Size = 400
Lambda = 0.0001, Vocab Size = 5000 (67.3%)