This project implements scene recognition with several approaches and evaluates them using confusion matrices and accuracy scores. Specifically, tiny image features and bag of SIFT features are used for image representation, while support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are used for classification. All four combinations of these approaches are evaluated, and the results are shown below.
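As a rough illustration of the evaluation, the sketch below builds a row-normalized confusion matrix from true and predicted labels and averages its diagonal to get an accuracy score. The function names and the toy labels are made up for this example, and this is only one common way to compute these quantities, not necessarily how the project's own evaluation code works.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    test images of category i that were predicted as category j."""
    index = {name: i for i, name in enumerate(categories)}
    counts = [[0] * len(categories) for _ in categories]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_class_accuracy(matrix):
    """Average of the diagonal, i.e. the mean per-category accuracy."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Toy labels, invented for illustration only.
categories = ["Kitchen", "Store", "Forest"]
truth = ["Kitchen", "Kitchen", "Store", "Store", "Forest", "Forest"]
guess = ["Kitchen", "Store", "Store", "Store", "Forest", "Kitchen"]

cm = confusion_matrix(truth, guess, categories)
accuracy = mean_class_accuracy(cm)
print(cm)
print(accuracy)
```

Each row of the matrix sums to 1, so the diagonal entries can be read directly as per-category accuracies.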
There are many tunable parameters in this project. I experimented with a variety of them and settled on values that seem to produce the best results for this particular implementation. For tiny image features, images are resized to 16 x 16 resolution, vectorized, and then normalized. For k-NN, k = 1 actually performs very well (often even better than larger values of k), and k values around 10 also do well. For bag of SIFT features, I use a step size of 8 and a bin size of 4. For SVM, performance depends on which image representation is used. Lambda values around 10^-4 seem to do well with bag of SIFT features.
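The tiny image pipeline described above (resize to 16 x 16, vectorize, normalize) could look roughly like the following NumPy sketch. Two details are assumptions, since the text does not specify them: the resize is approximated by block averaging, and "normalized" is taken to mean zero mean and unit length. `tiny_image_feature` is a hypothetical name.

```python
import numpy as np


def tiny_image_feature(image, size=16):
    """Shrink a grayscale image to size x size, flatten it, then normalize.

    Assumes the image is at least `size` pixels in each dimension.
    """
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    # Crop so both sides divide evenly, then average each block of pixels.
    # (Block averaging is a stand-in for whatever resize routine was used.)
    bh, bw = h // size, w // size
    cropped = image[: bh * size, : bw * size]
    small = cropped.reshape(size, bh, size, bw).mean(axis=(1, 3))
    feature = small.ravel()                # vectorize: 16 * 16 = 256 values
    feature = feature - feature.mean()     # assumed normalization: zero mean
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature  # and unit length


rng = np.random.default_rng(0)
image = rng.random((240, 320))  # random stand-in for a real grayscale photo
feature = tiny_image_feature(image)
print(feature.shape)
```

Removing the mean and scaling to unit length makes the feature less sensitive to overall brightness and contrast, which is a common reason to normalize tiny images before comparing them.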
With tiny image features and k-NN, accuracy varied between 0.18 and 0.23. At k = 1 the accuracy is 0.225, at k = 20 it is 0.215, and at k = 40 it is 0.213.
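For reference, a minimal k-NN classifier over feature vectors might be sketched as below. It assumes Euclidean distance and a simple majority vote, with ties broken by whichever label sorts first. Neither choice is stated in the text, and the toy vectors are invented.

```python
import numpy as np


def knn_predict(train_feats, train_labels, test_feats, k=1):
    """Label each test vector by majority vote among its k nearest
    training vectors (squared Euclidean distance)."""
    train_feats = np.asarray(train_feats, dtype=np.float64)
    test_feats = np.asarray(test_feats, dtype=np.float64)
    train_labels = np.asarray(train_labels)
    # Pairwise squared distances, shape (num_test, num_train).
    diffs = test_feats[:, None, :] - train_feats[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    predictions = []
    for row in nearest:
        labels, counts = np.unique(train_labels[row], return_counts=True)
        # argmax returns the first maximum, so ties go to the label
        # that sorts first (an arbitrary choice for this sketch).
        predictions.append(labels[np.argmax(counts)].item())
    return predictions


# Toy 2-D features, invented for illustration only.
train = [[0, 0], [0, 1], [5, 5], [5, 6], [6, 5]]
labels = ["Forest", "Forest", "Street", "Street", "Street"]
test = [[0.2, 0.1], [5.2, 5.1]]

pred_k1 = knn_predict(train, labels, test, k=1)
pred_k3 = knn_predict(train, labels, test, k=3)
print(pred_k1, pred_k3)
```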
With tiny image features, the best accuracy I could get with a linear SVM was 0.186, obtained with lambda = 1e-4. I also tried a radial kernel SVM, but the result was worse.
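To make the role of lambda concrete, here is a small one-vs-all linear SVM sketch. It minimizes lambda/2 * ||w||^2 plus the mean hinge loss with plain subgradient descent, which is a simple substitute for a real SVM solver and is not the training routine used in this project. The function names, learning rate, epoch count, and toy data are all assumptions made for illustration.

```python
import numpy as np


def train_linear_svm(X, y, lam=1e-4, epochs=1000, lr=0.1):
    """Binary linear SVM; y must contain +1 / -1.

    Minimizes lam/2 * ||w||^2 + mean(hinge loss) by subgradient descent.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    n = len(y)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # samples inside or beyond the margin
        grad_w = lam * w - X[active].T @ y[active] / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def train_one_vs_all(X, labels, categories, lam=1e-4):
    """Train one binary SVM per category (that category vs. the rest)."""
    labels = np.asarray(labels)
    return {c: train_linear_svm(X, np.where(labels == c, 1.0, -1.0), lam)
            for c in categories}


def predict(models, X):
    """Pick the category whose SVM gives the highest score."""
    X = np.asarray(X, dtype=np.float64)
    categories = list(models)
    scores = np.stack([X @ models[c][0] + models[c][1] for c in categories],
                      axis=1)
    return [categories[i] for i in scores.argmax(axis=1)]


# Toy, well-separated 3-D features, invented for illustration only.
X_train = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
           [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
           [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
y_train = ["Forest", "Forest", "Highway", "Highway", "Office", "Office"]

models = train_one_vs_all(X_train, y_train, ["Forest", "Highway", "Office"])
predictions = predict(models, X_train)
print(predictions)
```

A smaller lambda penalizes large weights less, so the classifier fits the training data more tightly. A larger lambda regularizes more strongly.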
With bag of SIFT features and k-NN, k = 1 gives an accuracy of 0.524. At k = 10 it improves to 0.531, which is also the best I could get with this configuration. The confusion matrix and table of classifier results are included here:
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.460 | Bedroom, InsideCity | Office, Store
Store | 0.520 | LivingRoom, InsideCity | Office, Office
Bedroom | 0.330 | Office, Kitchen | Kitchen, Office
LivingRoom | 0.150 | Office, Bedroom | Office, Office
Office | 0.780 | Bedroom, Bedroom | InsideCity, Bedroom
Industrial | 0.300 | Store, LivingRoom | Highway, Office
Suburb | 0.910 | Mountain, OpenCountry | InsideCity, Bedroom
InsideCity | 0.420 | Suburb, TallBuilding | Coast, Forest
TallBuilding | 0.330 | LivingRoom, Bedroom | Store, Street
Street | 0.560 | TallBuilding, Kitchen | Store, InsideCity
Highway | 0.770 | Coast, Coast | Coast, Coast
OpenCountry | 0.260 | Bedroom, Industrial | Mountain, Coast
Coast | 0.670 | Industrial, Suburb | Forest, Highway
Mountain | 0.550 | Kitchen, OpenCountry | Forest, Forest
Forest | 0.960 | Bedroom, OpenCountry | Mountain, Suburb
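The bag of SIFT representation behind these results turns a set of local descriptors into a single histogram over a visual vocabulary. The sketch below shows only that histogram step, under several assumptions: the descriptors and vocabulary are already computed (for example by dense SIFT and k-means, neither of which is shown), each descriptor is assigned to its nearest visual word, and the histogram is normalized to sum to 1. The 2-D toy vectors stand in for real 128-dimensional SIFT descriptors, and `bag_of_features` is a hypothetical name.

```python
import numpy as np


def bag_of_features(descriptors, vocabulary):
    """Histogram of nearest visual words, normalized to sum to 1.

    Assumes at least one descriptor is given.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    vocabulary = np.asarray(vocabulary, dtype=np.float64)
    # Squared distances, shape (num_descriptors, vocabulary_size).
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    words = dists.argmin(axis=1)  # index of the closest visual word
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / hist.sum()


# Toy vocabulary and descriptors, invented for illustration only.
vocabulary = [[0, 0], [10, 10], [0, 10]]
descriptors = [[1, 0], [0, 1], [9, 9], [1, 9]]

histogram = bag_of_features(descriptors, vocabulary)
print(histogram)
```

Normalizing the histogram keeps images with different numbers of descriptors comparable, which matters when the sampling step size changes how many descriptors each image yields.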
With bag of SIFT features and a linear SVM, I was able to reach an accuracy of 0.678, achieved at lambda = 10^-4 (0.0001). Detailed results are shown below:
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.600 | Office, Bedroom | Office, LivingRoom
Store | 0.580 | Bedroom, InsideCity | Kitchen, LivingRoom
Bedroom | 0.440 | Office, OpenCountry | LivingRoom, LivingRoom
LivingRoom | 0.330 | Industrial, Bedroom | Office, Store
Office | 0.840 | Kitchen, Bedroom | Bedroom, Store
Industrial | 0.490 | Kitchen, InsideCity | LivingRoom, Highway
Suburb | 0.940 | Industrial, Kitchen | TallBuilding, Coast
InsideCity | 0.500 | Industrial, Store | Store, Kitchen
TallBuilding | 0.760 | Office, InsideCity | Forest, OpenCountry
Street | 0.730 | TallBuilding, Industrial | Industrial, Highway
Highway | 0.840 | OpenCountry, Street | Industrial, Coast
OpenCountry | 0.540 | Coast, TallBuilding | Suburb, Highway
Coast | 0.770 | OpenCountry, Industrial | Highway, OpenCountry
Mountain | 0.870 | LivingRoom, OpenCountry | OpenCountry, Forest
Forest | 0.940 | Store, Store | Mountain, Street
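As a quick sanity check, the overall accuracy reported for this configuration can be recovered by averaging the per-category accuracies in the table above. This assumes every category contributes the same number of test images, which the text does not state explicitly.

```python
# Per-category accuracies copied from the table above.
per_class = {
    "Kitchen": 0.600, "Store": 0.580, "Bedroom": 0.440,
    "LivingRoom": 0.330, "Office": 0.840, "Industrial": 0.490,
    "Suburb": 0.940, "InsideCity": 0.500, "TallBuilding": 0.760,
    "Street": 0.730, "Highway": 0.840, "OpenCountry": 0.540,
    "Coast": 0.770, "Mountain": 0.870, "Forest": 0.940,
}

# With balanced test sets, overall accuracy equals the unweighted mean.
mean_accuracy = sum(per_class.values()) / len(per_class)
print(round(mean_accuracy, 3))  # 0.678
```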
Besides the linear SVM, I also tried a radial kernel SVM for classifying the features. However, radial kernels do not seem to work well for this task. After trying about a dozen different parameter settings, I was only able to reach an accuracy of about 0.55. This can be achieved under several different settings, for example sigma = 2 and lambda = 1, or sigma = 2^5 and lambda = 1e-2 (sigma is the radial kernel parameter). The confusion matrix for sigma = 2^5 and lambda = 1e-2 is shown below (Store appears to be predicted much more often than the other scenes):