The goal of this assignment was to recognize scenes using several combinations of image representation and classifier: tiny images with a nearest neighbor classifier, bag of SIFT with a nearest neighbor classifier, and bag of SIFT with one-vs-all linear SVMs.
The tiny images representation is created by resizing the image to a small, fixed size (e.g. 16x16) and flattening the result into a vector. The nearest neighbor classifier labels each test image with the label of the training image whose representation is closest (in our case, in Euclidean distance) to the test image's representation.
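The two steps above can be sketched in numpy (a minimal illustration, not the assignment's MATLAB code; block averaging stands in for proper image resampling, and it assumes grayscale images whose dimensions divide evenly by the target size):

```python
import numpy as np

def tiny_image(img, size=16):
    """Shrink a grayscale image to size x size by block averaging and flatten it."""
    h, w = img.shape
    img = img[: h - h % size, : w - w % size]       # crop so blocks divide evenly
    bh, bw = img.shape[0] // size, img.shape[1] // size
    tiny = img.reshape(size, bh, size, bw).mean(axis=(1, 3))
    vec = tiny.flatten()
    return (vec - vec.mean()) / (vec.std() + 1e-8)  # zero mean, unit variance

def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """Label each test vector with the label of its Euclidean-nearest training vector."""
    preds = []
    for x in test_feats:
        dists = np.linalg.norm(train_feats - x, axis=1)
        preds.append(train_labels[np.argmin(dists)])
    return preds
```

The per-vector normalization is a common small tweak; without it, the raw 256-dimensional vector works the same way.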
The accuracy of this combination was 19.1%. I did not add any optimizations, though extensions such as k-nearest neighbors in place of 1-nearest neighbor would likely improve performance.
The bag of SIFT representation is created by first finding SIFT descriptors for a dense sampling of local features, and then clustering the descriptors using k-means. The cluster centers form the vocabulary of visual words we'll use to classify test images. In my implementation, I sampled SIFT features with a step size of 100, using the 'fast' parameter of vl_dsift, and a vocabulary size of 400 (i.e. 400 cluster centers).
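The vocabulary-building step can be sketched as plain Lloyd's k-means in numpy (a stand-in for the VLFeat/MATLAB k-means actually used; `descriptors` would be the (N, 128) array of dense SIFT descriptors pooled across training images, and the pairwise-distance broadcast is memory-hungry, so a real run would chunk it):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size=400, n_iters=20, seed=0):
    """Cluster descriptors with Lloyd's k-means; the centers are the visual vocabulary."""
    descriptors = descriptors.astype(float)
    rng = np.random.default_rng(seed)
    # initialize centers with randomly chosen descriptors
    centers = descriptors[rng.choice(len(descriptors), vocab_size, replace=False)]
    for _ in range(n_iters):
        # assign each descriptor to its nearest center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for k in range(vocab_size):
            members = descriptors[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers
```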
Once the vocabulary has been created, we represent an image by sampling SIFT features (this time I used a step size of 10) and forming a 400-dimensional histogram that counts how many of the image's SIFT descriptors lie in each cluster of the vocabulary. The histogram is then normalized so that the size of the image does not dramatically affect the magnitude of the bag-of-features vector.
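The quantization and normalization step can be sketched as follows (a numpy illustration of the idea, assuming at least one descriptor per image; `descriptors` is the (M, 128) array for one image and `vocab` the (400, 128) array of cluster centers):

```python
import numpy as np

def bag_of_sift_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word and L1-normalize the counts."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # visual word index per descriptor
    hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    return hist / hist.sum()                         # counts become frequencies
```

L1 normalization makes the histogram sum to 1, so an image with many sampled descriptors produces a vector of the same magnitude as one with few.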
The nearest neighbor classifier is the same as described in the previous section.
The accuracy of this combination was 50.6%, showing that bag of SIFT is a much better representation of images than tiny images (which matches intuition).
In this combination, we represent the images in the same way as described in the section
above, but we classify them using 15 1-vs-all linear support vector machines (SVMs),
one for each image category. For each linear SVM, the bag-of-SIFT feature space is partitioned
by a learned hyperplane that separates training images in the given category from
images that aren't in the category. In my implementation I used vl_svmtrain
with lambda = 0.0001. Both larger and smaller values of lambda yielded lower accuracies
(I tested all powers of 10 between 0.00001 and 0.01).
Each test image is then evaluated by all 15 SVMs and given the label of the most confident one. Confidence is the (unnormalized) signed distance from the hyperplane, W*X + B, where '*' is the dot product, W and B are the learned hyperplane parameters, and X is the representation of the test image.
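The training and one-vs-all prediction can be sketched in numpy (a rough stand-in for vl_svmtrain, which uses its own SGD solver; here a simple hinge-loss SGD with a constant learning rate, where `y` is +1 for in-category training images and -1 otherwise, and `lam` is the regularizer lambda discussed above):

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, eta=0.1, n_epochs=50, seed=0):
    """Fit one binary linear SVM by SGD on the L2-regularized hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            w *= 1.0 - eta * lam                 # shrinkage from the L2 regularizer
            if y[i] * (X[i] @ w + b) < 1:        # hinge subgradient on a violated margin
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

def one_vs_all_predict(svms, x):
    """Label x with the category whose SVM is most confident (largest w.x + b)."""
    scores = {label: w @ x + b for label, (w, b) in svms.items()}
    return max(scores, key=scores.get)
```

Here `svms` maps each of the 15 category names to its learned (W, B) pair, and the most confident (largest-score) SVM supplies the label.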
This combination yielded the best results, with an accuracy of 64.6%. A full report of the results is shown below:
| Category name | Accuracy | False positives (true label) | False negatives (predicted label) |
|---|---|---|---|
| Kitchen | 0.520 | Bedroom, LivingRoom | LivingRoom, Bedroom |
| Store | 0.460 | Bedroom, Bedroom | LivingRoom, Mountain |
| Bedroom | 0.530 | Office, LivingRoom | Kitchen, LivingRoom |
| LivingRoom | 0.310 | InsideCity, Store | Store, Industrial |
| Office | 0.840 | LivingRoom, Kitchen | Kitchen, Kitchen |
| Industrial | 0.440 | Bedroom, Store | Highway, LivingRoom |
| Suburb | 0.940 | Mountain, Highway | TallBuilding, Industrial |
| InsideCity | 0.530 | Street, Kitchen | LivingRoom, Highway |
| TallBuilding | 0.720 | InsideCity, Suburb | InsideCity, Street |
| Street | 0.680 | LivingRoom, InsideCity | InsideCity, InsideCity |
| Highway | 0.790 | Street, Industrial | Coast, Street |
| OpenCountry | 0.500 | Street, Coast | Coast, Mountain |
| Coast | 0.740 | InsideCity, OpenCountry | Highway, Highway |
| Mountain | 0.760 | Store, Industrial | Bedroom, Industrial |
| Forest | 0.930 | OpenCountry, OpenCountry | Street, Mountain |