The goal of this assignment was to perform scene recognition by using a bag of words model, based on a hand-labeled data set of 15 scene categories.
We start from a low-accuracy baseline and progressively swap stronger feature representations and classifiers into the existing pipeline, ending with solid accuracy. Our starting pipeline consists of two stages: a tiny-image feature representation and a nearest neighbor classifier.
Implementing this gave us the following results:
The above image visualizes our results as a heat map. The y-axis represents the correct label for a given image, whereas the x-axis represents the label our algorithm has given it. A good implementation should, in other words, have a strong diagonal.
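To make the heat map concrete, here is a minimal sketch of how such a confusion matrix could be computed from true and predicted labels. The function names and the row normalization are illustrative assumptions, not the code used to produce the figure.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row i, column j holds the fraction of images whose true label is
    categories[i] and whose predicted label is categories[j]."""
    index = {name: i for i, name in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[index[true]][index[predicted]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Normalize each row so the diagonal reads as per-category accuracy.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_per_class_accuracy(matrix):
    """Average of the diagonal: a strong diagonal means a high value."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)
```

With rows normalized this way, each diagonal entry is the accuracy for one category, so a strong diagonal corresponds directly to a high mean accuracy.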
As the heat map shows, this baseline gives us some improvement over random chance (which, with 15 categories, would be roughly 6.7% accuracy), but it is not accurate enough to be at all reliable.
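For reference, a baseline of this kind can be sketched as follows. This is a simplified illustration under stated assumptions: images are plain grayscale lists of rows, the thumbnail `size` is a hypothetical parameter (the report does not state the resolution used), and the zero-mean, unit-length normalization is a common choice rather than a confirmed detail of our implementation.

```python
import math


def tiny_image(image, size=4):
    """Shrink a grayscale image (list of rows) to size x size by block
    averaging, then make the result zero-mean and unit-length.
    Assumes the image is at least size x size pixels."""
    height, width = len(image), len(image[0])
    feature = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feature.append(sum(block) / len(block))
    mean = sum(feature) / len(feature)
    centered = [v - mean for v in feature]
    norm = math.sqrt(sum(v * v for v in centered)) or 1.0
    return [v / norm for v in centered]


def nearest_neighbor_classify(train_features, train_labels, test_features):
    """Give each test feature the label of its closest training feature
    (squared Euclidean distance)."""
    predictions = []
    for feature in test_features:
        best = min(
            range(len(train_features)),
            key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(feature, train_features[i])
            ),
        )
        predictions.append(train_labels[best])
    return predictions
```

Because the classifier only needs a list of feature vectors, the same `nearest_neighbor_classify` sketch works unchanged when the tiny-image features are swapped for another representation.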
To improve accuracy, we'll replace our tiny image representation with a 'bag of SIFT features' representation.
First, we need to establish a 'visual vocabulary.' We form this vocabulary by computing SIFT features from our training data and then clustering them with k-means. The resulting cluster centers are our visual words, and any new feature can be assigned to the word whose center is closest to it. To reduce running time, we cluster over a random sample of the training data (in our case 30%) and extract SIFT features with a step size of 8, computing k=200 clusters to form our visual vocabulary.
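The clustering step can be sketched as below. This is a plain Lloyd's-algorithm k-means written for illustration, assuming the SIFT descriptors have already been extracted into a list of numeric vectors; `build_vocabulary`, its parameters, and the choice to sample descriptors (rather than images) are assumptions, not the library routine we actually called.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_vocabulary(descriptors, k, sample_fraction=0.3, iterations=20, seed=0):
    """Cluster a random sample of descriptors into k visual words.
    Requires at least k descriptors."""
    rng = random.Random(seed)
    sample_size = max(k, int(len(descriptors) * sample_fraction))
    sample = rng.sample(descriptors, min(sample_size, len(descriptors)))
    # Initialize the centers with k distinct sampled descriptors.
    centers = [list(c) for c in rng.sample(sample, k)]
    for _ in range(iterations):
        # Assignment step: attach each descriptor to its nearest center.
        clusters = [[] for _ in range(k)]
        for d in sample:
            nearest = min(range(k), key=lambda i: squared_distance(d, centers[i]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                dims = len(members[0])
                centers[i] = [
                    sum(m[j] for m in members) / len(members) for j in range(dims)
                ]
    return centers
```

In the setting described above, this would be called with `k=200` and `sample_fraction=0.3`; a pure-Python loop like this is far slower than an optimized library implementation, so it is meant only to show the logic.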
Now that we have a visual vocabulary, we can represent our images as 'bags of SIFT descriptors.' We compute a set of SIFT descriptors for every image in the training and testing sets. Rather than saving them all, we store each image as a histogram that counts how many of its descriptors fall nearest to each 'vocab word,' or k-means cluster centroid. We then normalize by the number of features found, so that large images do not skew the results, and re-use our nearest neighbor classifier over these histograms to label images in the test set.
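The histogram construction described above can be sketched as follows, assuming each image's descriptors and the vocabulary are given as lists of numeric vectors. The function name is hypothetical.

```python
def bag_of_words_histogram(descriptors, vocabulary):
    """Count how many descriptors fall nearest to each vocabulary word,
    then divide by the number of descriptors so image size does not matter."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(
            range(len(vocabulary)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(d, vocabulary[i])),
        )
        counts[nearest] += 1
    total = len(descriptors)
    if total == 0:
        # No features found: return an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]
```

With a 200-word vocabulary, every image becomes a 200-dimensional vector whose entries sum to 1, regardless of how many descriptors the image produced.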
The last replacement in our pipeline is to train 1-vs-all linear SVMs. Simply put, instead of classifying by nearest neighbor, which weights all 200 of our vocab words equally, we can learn to downweight words that are less relevant or that appear in all image types. Since a linear SVM is a binary classifier (a point is on one side of the hyperplane or the other), we train one linear SVM for each of our training categories, and then assign to each test image the category whose SVM gives it the best score. The final results of this can be seen below.
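The 1-vs-all scheme can be sketched as below. The trainer here is a simple per-sample subgradient descent on the regularized hinge loss, written for illustration; it is an assumption standing in for whatever SVM solver was actually used, and `reg`, `lr`, and `epochs` are hypothetical parameters.

```python
def train_linear_svm(features, targets, reg=0.001, lr=0.5, epochs=300):
    """Train a binary linear SVM (targets are +1 or -1) by subgradient
    descent on reg/2 * ||w||^2 + hinge loss. Returns (weights, bias)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Hinge step: push the violating sample toward its side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def train_one_vs_all(features, labels, categories, **svm_options):
    """Train one binary SVM per category: that category vs. everything else."""
    models = {}
    for category in categories:
        targets = [1 if label == category else -1 for label in labels]
        models[category] = train_linear_svm(features, targets, **svm_options)
    return models


def classify(models, feature):
    """Return the category whose SVM gives the highest score w . x + b."""
    def score(category):
        w, b = models[category]
        return sum(wi * xi for wi, xi in zip(w, feature)) + b
    return max(models, key=score)
```

Taking the highest raw score, rather than requiring a positive one, means every test image receives a label even when no individual SVM claims it.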
Category name | Accuracy | False positives (true label) | False negatives (wrong predicted label)
---|---|---|---
Kitchen | 0.610 | Bedroom, Bedroom | Office, Office
Store | 0.280 | LivingRoom, Industrial | Kitchen, Bedroom
Bedroom | 0.390 | LivingRoom, InsideCity | LivingRoom, Office
LivingRoom | 0.170 | Kitchen, Bedroom | Bedroom, Kitchen
Office | 0.980 | LivingRoom, LivingRoom | Kitchen, Kitchen
Industrial | 0.530 | Bedroom, Store | TallBuilding, InsideCity
Suburb | 0.950 | Store, Coast | TallBuilding, LivingRoom
InsideCity | 0.590 | Kitchen, Kitchen | LivingRoom, Suburb
TallBuilding | 0.820 | Industrial, Store | Forest, Mountain
Street | 0.490 | TallBuilding, Mountain | Industrial, Industrial
Highway | 0.780 | Suburb, Industrial | Coast, Coast
OpenCountry | 0.360 | Coast, Highway | Coast, Suburb
Coast | 0.800 | Highway, OpenCountry | Mountain, Highway
Mountain | 0.830 | OpenCountry, OpenCountry | Suburb, Suburb
Forest | 0.930 | OpenCountry, OpenCountry | Street, Mountain