The goal of this project was to implement a 'visual bag of words', a vocabulary of visual features, and pair it with a linear SVM to classify images. The classification targets were scene categories rather than specific objects: the training and test images ranged from forests to cities to kitchens.
Before the algorithm is described in detail, note that two additional components were implemented alongside the bag of SIFT features and the linear SVM: 'tiny image' features and a k-nearest-neighbors classifier. All results are discussed in the 'Results' section.
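For orientation, the core of a bag-of-words representation can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names `nearest_word` and `bag_of_words` are hypothetical, the toy descriptors are 2-dimensional (real SIFT descriptors are 128-dimensional), and the L1 normalization of the histogram is an assumption.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary entry closest to the descriptor
    (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_words(descriptors, vocabulary):
    """Count how often each visual word is the nearest one, then
    L1-normalize so images with different descriptor counts are comparable
    (the normalization is an assumption of this sketch)."""
    histogram = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(histogram)
    return [count / total for count in histogram] if total else histogram

# Toy data: three "visual words" and four descriptors from one image.
vocabulary = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
descriptors = [[1.0, 1.0], [9.0, 9.0], [0.5, 0.2], [1.0, 9.0]]
histogram = bag_of_words(descriptors, vocabulary)
print(histogram)  # [0.5, 0.25, 0.25]
```

The resulting histogram is the feature vector handed to the classifier; the vocabulary itself would come from clustering descriptors sampled from the training images.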
The algorithm works as follows:
The tiny image feature representation with a nearest neighbor classifier (k = 1) reached an accuracy of 0.225. Other k values yielded little improvement and in many cases a noticeable decline in accuracy (sometimes down to ~0.19). Using the linear SVM classifier instead also yielded poor results, with an accuracy of ~0.19.
The bag of SIFT feature representation with a nearest neighbor classifier (k = 11) reached an accuracy of 0.563. Other k values gave accuracies ranging from ~0.51 to 0.563, although a k value that does even better may still exist. However, the best k is heavily dependent on the dataset, so it is unlikely to transfer to different tests.
The bag of SIFT feature representation with the linear SVM classifier varied in accuracy from 0.64 to 0.68. The instance represented by this image was one of the better tests, with an accuracy of 0.671. From start to finish, the pipeline takes about 1 to 1.25 hours to run. Building the vocabulary matrix takes between 40 minutes and an hour (with about two thirds of that time spent in vl_kmeans), and computing the bag of SIFT feature representation takes 10 to 20 minutes, all depending on the computer being used. Tiny image features, nearest neighbor classification, and linear SVM classification each take only a few seconds.
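As a rough illustration of the tiny image baseline, the sketch below downsamples a grayscale image by block averaging and classifies it with a 1-nearest-neighbor lookup. Several details are assumptions rather than facts from this report: the helper names `tiny_image` and `nearest_neighbor`, the target size, the zero-mean and unit-length normalization, and the toy gradient images.

```python
import math

def tiny_image(image, size=4):
    """Downsample a square grayscale image (list of rows) to size x size by
    block averaging, then shift to zero mean and scale to unit length.
    Assumes the image side is a multiple of `size`."""
    block = len(image) // size
    pixels = []
    for row in range(size):
        for col in range(size):
            values = [image[row * block + r][col * block + c]
                      for r in range(block) for c in range(block)]
            pixels.append(sum(values) / len(values))
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    norm = math.sqrt(sum(p * p for p in centered))
    return [p / norm for p in centered] if norm else centered

def nearest_neighbor(feature, train_features, train_labels):
    """Return the label of the closest training feature (k = 1)."""
    distances = [sum((a - b) ** 2 for a, b in zip(feature, t))
                 for t in train_features]
    return train_labels[distances.index(min(distances))]

# Toy 8x8 images: a horizontal and a vertical brightness gradient.
horizontal = [[c for c in range(8)] for r in range(8)]
vertical = [[r for c in range(8)] for r in range(8)]
# A brighter, higher-contrast horizontal gradient as the test image.
query = [[2 * c + 50 for c in range(8)] for r in range(8)]

train_features = [tiny_image(horizontal), tiny_image(vertical)]
train_labels = ["horizontal", "vertical"]
prediction = nearest_neighbor(tiny_image(query), train_features, train_labels)
print(prediction)  # horizontal
```

The normalization is what lets the brighter query match its training counterpart. Even so, a representation this coarse discards most of the detail that separates scene categories, which is consistent with the low accuracy reported above.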
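To make the role of k concrete, here is a minimal sketch of k-nearest-neighbor voting on toy one-dimensional features. The function name `knn_predict` and the data are hypothetical; the point is only that the same query can receive different labels for different k.

```python
from collections import Counter

def knn_predict(feature, train_features, train_labels, k):
    """Label a feature by majority vote among its k nearest training
    features (squared Euclidean distance)."""
    order = sorted(
        range(len(train_features)),
        key=lambda i: sum((a - b) ** 2
                          for a, b in zip(feature, train_features[i])),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: the single closest neighbor is 'Coast', but the two
# next-closest neighbors are both 'OpenCountry'.
train_features = [[0.0], [1.0], [1.2], [5.0]]
train_labels = ["Coast", "OpenCountry", "OpenCountry", "Coast"]
query = [0.4]

prediction_k1 = knn_predict(query, train_features, train_labels, k=1)
prediction_k3 = knn_predict(query, train_features, train_labels, k=3)
print(prediction_k1, prediction_k3)  # Coast OpenCountry
```

A larger k smooths over individual noisy neighbors but can also outvote a correct close match, which is one reason the best value varies from dataset to dataset.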
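A linear SVM is a binary classifier, so a multi-class problem like this one needs some scheme for combining several of them. The sketch below assumes a one-vs-all setup, which is a common choice but is not confirmed by this report; the names `svm_score` and `predict_one_vs_all` are hypothetical and the weights are hand-picked for illustration rather than trained.

```python
def svm_score(weights, bias, feature):
    """Signed score of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, feature)) + bias

def predict_one_vs_all(models, feature):
    """`models` maps each category to (weights, bias). The category whose
    classifier gives the highest score wins."""
    return max(models,
               key=lambda category: svm_score(*models[category], feature))

# Hand-picked (not trained) classifiers over a 2-bin histogram feature.
models = {
    "Forest": ([1.0, -1.0], 0.0),
    "Street": ([-1.0, 1.0], 0.0),
    "Office": ([0.0, 0.0], -0.5),
}
prediction = predict_one_vs_all(models, [0.8, 0.2])
print(prediction)  # Forest
```

Prediction amounts to one dot product per category, which fits the observation that the classification step takes only a few seconds.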
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.610 | LivingRoom, InsideCity | Industrial, Suburb
Store | 0.500 | TallBuilding, Industrial | Forest, Highway
Bedroom | 0.490 | Store, TallBuilding | Kitchen, Office
LivingRoom | 0.210 | Industrial, Bedroom | Street, Kitchen
Office | 0.870 | Bedroom, Kitchen | Bedroom, Kitchen
Industrial | 0.340 | InsideCity, InsideCity | Kitchen, TallBuilding
Suburb | 0.970 | Store, Industrial | LivingRoom, LivingRoom
InsideCity | 0.620 | Industrial, Store | Store, Kitchen
TallBuilding | 0.790 | Industrial, Store | Store, Street
Street | 0.790 | Store, LivingRoom | Highway, InsideCity
Highway | 0.830 | Street, Industrial | InsideCity, InsideCity
OpenCountry | 0.480 | Coast, Coast | Forest, Coast
Coast | 0.820 | OpenCountry, Industrial | OpenCountry, Mountain
Mountain | 0.800 | Kitchen, TallBuilding | Forest, OpenCountry
Forest | 0.940 | OpenCountry, Mountain | Mountain, OpenCountry
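As a consistency check, the per-category accuracies in the table can be averaged and compared with the overall accuracy of 0.671 quoted earlier. The snippet assumes every category has the same number of test images, so that the plain mean equals the overall accuracy; the variable names are illustrative.

```python
# Per-category accuracies copied from the table above.
per_category_accuracy = {
    "Kitchen": 0.610, "Store": 0.500, "Bedroom": 0.490,
    "LivingRoom": 0.210, "Office": 0.870, "Industrial": 0.340,
    "Suburb": 0.970, "InsideCity": 0.620, "TallBuilding": 0.790,
    "Street": 0.790, "Highway": 0.830, "OpenCountry": 0.480,
    "Coast": 0.820, "Mountain": 0.800, "Forest": 0.940,
}

mean_accuracy = sum(per_category_accuracy.values()) / len(per_category_accuracy)
best = max(per_category_accuracy, key=per_category_accuracy.get)
worst = min(per_category_accuracy, key=per_category_accuracy.get)

print(round(mean_accuracy, 3))  # 0.671
print(best, worst)              # Suburb LivingRoom
```

The mean matches the reported figure, and the spread between the best and worst categories (0.970 versus 0.210) shows how unevenly the errors are distributed.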
Overall, the most important parameters were the vocabulary size and lambda in the SVM classifier. The results are good in that most false positives are not drastically different from their correct categories. For example, a 'bedroom' scene classified as a 'living room' is a reasonable mistake, as the two scenes share many similar features. A 'street' being classified as a 'highway' is also reasonable. It would be worse if a 'mountain' were classified as a 'living room', as the two have very few features in common. Thus, the pipeline's accuracy is plausible, and the results are largely as expected.
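To show where lambda enters, the sketch below evaluates a common form of the linear SVM training objective: a regularization term scaled by lambda plus the average hinge loss. The exact objective depends on the solver, so this form, the function name `svm_objective`, and the toy data are assumptions made for illustration.

```python
def svm_objective(weights, bias, features, labels, lam):
    """lam / 2 * ||w||^2 plus the mean hinge loss.
    Labels are +1 or -1."""
    regularizer = 0.5 * lam * sum(w * w for w in weights)
    hinge = [
        max(0.0, 1.0 - y * (sum(w * x for w, x in zip(weights, f)) + bias))
        for f, y in zip(features, labels)
    ]
    return regularizer + sum(hinge) / len(hinge)

# Toy 1-D data: the third example is on the wrong side of the boundary.
features = [[2.0], [-2.0], [0.5]]
labels = [1, -1, -1]
weights, bias = [1.0], 0.0

small_lambda = svm_objective(weights, bias, features, labels, lam=0.1)
large_lambda = svm_objective(weights, bias, features, labels, lam=10.0)
print(small_lambda, large_lambda)  # 0.55 5.5
```

With a small lambda the objective is dominated by the classification errors, while a large lambda mostly penalizes the size of the weights. That trade-off is why the classifier's accuracy is sensitive to this parameter.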