I stuck to baseline implementations for all parts of the project. The only somewhat interesting thing I did was tune the input parameters (alpha for vl_svmtrain, step and size for vl_dsift) to get better results.
First, for tiny image features, I simply called imresize to shrink each image to 16x16, then made the result zero mean by subtracting its mean. For the nearest neighbor classifier, I used only 1 nearest neighbor and let vl_alldist2 do most of the work.
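The two steps above can be sketched in numpy (the original uses MATLAB's imresize and vl_alldist2; here block averaging stands in for imresize, and the function names are my own):

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Shrink a grayscale image to size x size by block averaging (a stand-in
    for imresize), then zero-center it by subtracting its mean."""
    h, w = img.shape
    img = img[: h - h % size, : w - w % size]      # crop so blocks divide evenly
    bh, bw = img.shape[0] // size, img.shape[1] // size
    tiny = img.reshape(size, bh, size, bw).mean(axis=(1, 3))
    feat = tiny.flatten()
    return feat - feat.mean()                      # zero mean, as in the write-up

def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """1-NN: compute all pairwise squared Euclidean distances (what
    vl_alldist2 does) and take the label of the closest training feature."""
    d = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    return [train_labels[i] for i in d.argmin(axis=1)]
```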
For my bag of SIFT representation, I built the vocabulary by extracting SIFT features from the training images and clustering them with k-means, where k was the vocabulary size. For each input image, I then extracted SIFT features, assigned each one to its nearest cluster center, and counted how often each "vocab word" occurred to form the image's histogram representation.
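A minimal numpy sketch of this pipeline, assuming descriptors have already been extracted (the original uses vl_dsift and VLFeat's k-means; the toy Lloyd's loop and function names below are stand-ins):

```python
import numpy as np

def build_vocab(descriptors, k, iters=20, seed=0):
    """Toy k-means (Lloyd's algorithm) over SIFT-like descriptors; the cluster
    centers become the vocabulary of k "vocab words"."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)                  # nearest center per descriptor
        for j in range(k):
            if np.any(assign == j):
                centers[j] = descriptors[assign == j].mean(axis=0)
    return centers

def bag_of_words(descriptors, vocab):
    """Assign each of an image's descriptors to its nearest vocab word and
    build a normalized histogram of the counts."""
    d = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(axis=1), minlength=len(vocab)).astype(float)
    return hist / hist.sum()                       # normalize for image size
```

Normalizing the histogram keeps images with different numbers of detected features comparable.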
Finally, for my linear SVM classifier, I first trained one-vs-all SVMs (one for each category), and then ran each test image's features against every SVM. Whichever SVM scored highest for a test feature determined the category assigned to that image.
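The one-vs-all decision rule reduces to a single matrix product once the per-category weights and biases are trained (the original gets these from vl_svmtrain; the function name here is my own):

```python
import numpy as np

def one_vs_all_predict(W, b, feats, categories):
    """Score every test feature against every category's linear SVM.
    W has one row of weights per category, b one bias per category;
    the highest-scoring SVM names the predicted category."""
    scores = feats @ W.T + b            # shape: (num_images, num_categories)
    return [categories[i] for i in scores.argmax(axis=1)]
```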
Using tiny image features and the nearest neighbor classifier, I achieved an accuracy of 0.204 (20.4%). Below is the resulting confusion matrix.
Using bag of SIFT features and the nearest neighbor classifier, I achieved an accuracy of 0.517 (51.7%). This takes about 10 minutes to run on my MacBook Air if the vocab.mat file is already built. Below is the resulting confusion matrix.
Using bag of SIFT features and the linear SVM classifier, I achieved an accuracy of 0.658 (65.8%). This takes roughly the same amount of time as the previous run. Below are the resulting confusion matrix and table of classifier results.
Category name | Accuracy | Sample training images | | Sample true positives | | False positives with true label | | False negatives with wrong predicted label |
---|---|---|---|---|---|---|---|---|---
Kitchen | 0.470 | ![]() | ![]() | ![]() | ![]() | ![]() LivingRoom | ![]() LivingRoom | ![]() Office | ![]() Store
Store | 0.550 | ![]() | ![]() | ![]() | ![]() | ![]() Street | ![]() TallBuilding | ![]() Industrial | ![]() Office
Bedroom | 0.490 | ![]() | ![]() | ![]() | ![]() | ![]() Office | ![]() LivingRoom | ![]() Store | ![]() LivingRoom
LivingRoom | 0.360 | ![]() | ![]() | ![]() | ![]() | ![]() Industrial | ![]() Store | ![]() Bedroom | ![]() Bedroom
Office | 0.780 | ![]() | ![]() | ![]() | ![]() | ![]() LivingRoom | ![]() Store | ![]() LivingRoom | ![]() Kitchen
Industrial | 0.460 | ![]() | ![]() | ![]() | ![]() | ![]() OpenCountry | ![]() Street | ![]() OpenCountry | ![]() LivingRoom
Suburb | 0.920 | ![]() | ![]() | ![]() | ![]() | ![]() InsideCity | ![]() InsideCity | ![]() LivingRoom | ![]() TallBuilding
InsideCity | 0.510 | ![]() | ![]() | ![]() | ![]() | ![]() Street | ![]() Street | ![]() Kitchen | ![]() TallBuilding
TallBuilding | 0.690 | ![]() | ![]() | ![]() | ![]() | ![]() InsideCity | ![]() Industrial | ![]() Industrial | ![]() LivingRoom
Street | 0.710 | ![]() | ![]() | ![]() | ![]() | ![]() Mountain | ![]() InsideCity | ![]() TallBuilding | ![]() Store
Highway | 0.800 | ![]() | ![]() | ![]() | ![]() | ![]() Street | ![]() Industrial | ![]() Bedroom | ![]() Coast
OpenCountry | 0.490 | ![]() | ![]() | ![]() | ![]() | ![]() Industrial | ![]() TallBuilding | ![]() Coast | ![]() LivingRoom
Coast | 0.820 | ![]() | ![]() | ![]() | ![]() | ![]() InsideCity | ![]() OpenCountry | ![]() InsideCity | ![]() Bedroom
Mountain | 0.860 | ![]() | ![]() | ![]() | ![]() | ![]() Store | ![]() TallBuilding | ![]() Suburb | ![]() Suburb
Forest | 0.960 | ![]() | ![]() | ![]() | ![]() | ![]() InsideCity | ![]() TallBuilding | ![]() TallBuilding | ![]() Mountain