The aim of this project is to perform simple scene recognition using a variety of feature descriptors, including (i) tiny images and (ii) histograms of quantized local features, and a variety of classifiers, including (i) nearest neighbor classifiers and (ii) linear classifiers learned by support vector machines.
For this project, we were supposed to implement them in the order they are described below.
The algorithm for tiny images is fairly simple. It resizes the image to 16 x 16 using imresize(), then normalizes it by subtracting the image mean and dividing by the square root of the sum of all squared elements, so that the result has zero mean and unit length. The normalization at the end is needed so that the representation is invariant to changes in overall brightness.
It's a good initial method because it assumes that images of the same scene have relatively similar global properties. This may work well, for example, for images of mountain peaks, which tend to have a triangular shape in the central region of the image. However, it ignores many things altogether, such as perspective and differences in subject-background composition, which leads us to the next representation.
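As a rough illustration of the normalization step only, here is a minimal Python sketch. It assumes the image has already been resized and flattened into a list of pixel values; the resizing done by imresize() is not reproduced, and the function name `normalize_tiny_image` is made up for this example.

```python
import math

def normalize_tiny_image(pixels):
    """Return a zero-mean, unit-length copy of a flattened tiny image."""
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    # Square root of the sum of all squared (mean-subtracted) elements.
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        return centered  # constant image: nothing to scale
    return [c / norm for c in centered]

# Hypothetical flattened patch, and the same patch with a brightness offset.
patch = [10.0, 20.0, 30.0, 40.0]
brighter = [p + 50.0 for p in patch]

feature = normalize_tiny_image(patch)
feature_brighter = normalize_tiny_image(brighter)
```

Because the mean is subtracted first, `feature` and `feature_brighter` come out the same, which is the brightness invariance the normalization is meant to give.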
The second image representation, histograms of quantized local features, counts the occurrences of local image features in each image and compares the resulting histograms. Creating this representation requires two steps. The first is to build a vocabulary, which you can think of as a set of representative motifs, in this case representative SIFT descriptors. It is found by collecting SIFT descriptors from all the training images and then running kmeans to find the 400 (or any other number of) centroids that best summarize the distribution of SIFT descriptors.
The second step is to sample SIFT descriptors from each training image. For each descriptor, find the closest centroid and count it as one occurrence of that centroid in the image. This gives, for each training image, a histogram of the nearest centroids of all its SIFT descriptors. Finally, divide the histogram by its length to normalize it.
The downside of this representation is that it doesn't encode any spatial information, i.e. the relationships between the different descriptors in the image, which may be important. In street scenes, for example, cars are almost always at the bottom of the image, below the buildings.
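To make the vocabulary step concrete, here is a small sketch of plain k-means (Lloyd's algorithm) in Python. It is not the kmeans routine used in the project. It works on toy 2-D points instead of 128-dimensional SIFT descriptors, and the names `build_vocabulary`, `iterations` and `seed` are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_vocabulary(descriptors, vocab_size, iterations=20, seed=0):
    """Cluster descriptors with plain k-means and return the centroids."""
    rng = random.Random(seed)
    # Start from vocab_size distinct descriptors picked at random.
    centroids = [list(d) for d in rng.sample(descriptors, vocab_size)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for d in descriptors:
            nearest = min(range(vocab_size),
                          key=lambda i: squared_distance(d, centroids[i]))
            clusters[nearest].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy stand-in for SIFT descriptors: two well-separated groups of 2-D points.
toy_descriptors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
vocabulary = build_vocabulary(toy_descriptors, vocab_size=2)
```

In the real pipeline `vocab_size` would be 400 (or whatever vocabulary size is chosen) and the input would be the SIFT descriptors gathered from the training images.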
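The histogram step could look roughly like the Python sketch below. It takes "length" to mean the Euclidean (L2) norm of the histogram, which is an assumption, and it uses toy 2-D points in place of real SIFT descriptors. The function name `bag_of_words_histogram` is made up for the example.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bag_of_words_histogram(descriptors, vocabulary):
    """Count the nearest centroid of every descriptor, then normalize."""
    counts = [0.0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: squared_distance(d, vocabulary[i]))
        counts[nearest] += 1.0
    # Assumption: normalize by the Euclidean norm of the histogram.
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        return counts  # no descriptors were given
    return [c / norm for c in counts]

# Toy vocabulary with two "visual words" and three descriptors from one image.
toy_vocabulary = [[0.0, 0.0], [10.0, 10.0]]
image_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
histogram = bag_of_words_histogram(image_descriptors, toy_vocabulary)
```

Here two descriptors fall on the first centroid and one on the second, so the raw counts are 2 and 1 before normalization.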
The idea behind the nearest neighbor classifier is fairly simple. Imagine a cloud of points, one for the feature vector (e.g. the tiny image vector) of each training image. For each test image, find its nearest neighbor in this cloud of points and give the test image the label of that nearest neighbor. A simple extension is to use k-nearest neighbors, labelling the test image with the majority label among its k nearest training images.
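A minimal sketch of this classifier in Python, covering both the 1-nearest-neighbor case and the k-nearest-neighbors extension. The feature vectors and the name `knn_classify` are made up for illustration, and squared Euclidean distance is assumed as the distance measure.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(test_feature, train_features, train_labels, k=1):
    """Label a test feature by majority vote among its k nearest neighbors."""
    # Indices of the training features, sorted from nearest to farthest.
    order = sorted(range(len(train_features)),
                   key=lambda i: squared_distance(test_feature, train_features[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D training features with their scene labels.
train_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_labels = ["Forest", "Forest", "Street", "Street", "Street"]

label_1nn = knn_classify([1.0, 1.0], train_features, train_labels, k=1)
label_3nn = knn_classify([4.0, 4.0], train_features, train_labels, k=3)
```

With `k=1` this reduces to the plain nearest neighbor rule described above.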
Simply put, a linear classifier learns a hyperplane that best separates the positive and negative training examples. To use a linear SVM for a multi-class recognition task like this, we train one classifier per scene category on the training images, run every classifier on each test image, and label the test image with the category of the classifier that returns the highest confidence.
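The sketch below illustrates only the final decision rule, not the SVM training itself. It assumes each category already has a learned weight vector `w` and bias `b`. The values here are invented for the example, and `predict_category` is a hypothetical helper name.

```python
def predict_category(feature, classifiers):
    """Return the category whose linear classifier is most confident."""
    def confidence(name):
        w, b = classifiers[name]
        # Signed score relative to the hyperplane: w . x + b
        return sum(wi * xi for wi, xi in zip(w, feature)) + b
    return max(classifiers, key=confidence)

# Hypothetical one-vs-all classifiers, one (w, b) pair per scene category.
classifiers = {
    "Forest": ([1.0, 0.0], -0.5),
    "Street": ([0.0, 1.0], -0.5),
}

first_prediction = predict_category([0.9, 0.1], classifiers)
second_prediction = predict_category([0.2, 0.7], classifiers)
```

Taking the maximum over all categories means a test image still gets a label when every classifier returns a negative score.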
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.480 | LivingRoom, LivingRoom | Office, Bedroom
Store | 0.440 | LivingRoom, LivingRoom | Forest, Office
Bedroom | 0.340 | LivingRoom, LivingRoom | Suburb, Store
LivingRoom | 0.260 | Industrial, Bedroom | Kitchen, Bedroom
Office | 0.950 | Industrial, Store | Kitchen, Kitchen
Industrial | 0.310 | Street, InsideCity | Street, TallBuilding
Suburb | 0.890 | InsideCity, OpenCountry | InsideCity, Store
InsideCity | 0.710 | Kitchen, Street | LivingRoom, Kitchen
TallBuilding | 0.830 | Bedroom, Mountain | Coast, Bedroom
Street | 0.580 | InsideCity, Industrial | InsideCity, InsideCity
Highway | 0.720 | Industrial, Industrial | Coast, Industrial
OpenCountry | 0.320 | Coast, Coast | InsideCity, Highway
Coast | 0.760 | Highway, Highway | OpenCountry, InsideCity
Mountain | 0.830 | LivingRoom, OpenCountry | Suburb, Highway
Forest | 0.930 | OpenCountry, Mountain | Street, Mountain