In this project, we set out to build a program that recognizes scenes. We explore several strategies that sample features from a training set of images and then train a classifier to assign a given image to one of fifteen possible categories. The approaches differ in how they handle the two steps of the problem: sampling features and classifying them based on the training data. The three combinations of methods used are:

1. Tiny image representation with a nearest-neighbor classifier
2. Bag of SIFT features with a nearest-neighbor classifier
3. Bag of SIFT features with a linear SVM classifier
In the tiny image representation, every image is simply rescaled to a tiny format: each image is resized to 16 × 16 pixels, effectively representing it as a 256-dimensional feature vector. A k-nearest-neighbor classifier is then used: we find the k nearest labelled training points to a given test point and assign the test point the majority-vote label.
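As a rough sketch (not the code actually used), the tiny-image feature and the majority-vote k-NN step could look like this in numpy. The helper names `tiny_image_feature` and `knn_predict` are mine, and the crude block-averaging resize stands in for whatever image-resizing routine the real pipeline uses:

```python
import numpy as np
from collections import Counter

def tiny_image_feature(img, size=16):
    """Resize a grayscale image to size x size by block averaging, then flatten.
    Zero-mean, unit-length normalisation of the result often helps slightly."""
    h, w = img.shape
    img = img[:h - h % size, :w - w % size]          # crop to a multiple of `size`
    bh, bw = img.shape[0] // size, img.shape[1] // size
    feat = img.reshape(size, bh, size, bw).mean(axis=(1, 3)).ravel()
    feat = feat - feat.mean()
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat

def knn_predict(train_feats, train_labels, test_feat, k=5):
    """Majority vote over the k nearest training features (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

For a 16 × 16 target this yields exactly the 256-dimensional feature described above.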
While tiny images seem like a very rudimentary idea, they do perform better than chance (1/15, about 6.7 %): with 1 nearest neighbor there was 19.5 % recognition accuracy; using 5 nearest neighbors this improves to 22.5 %.
It should be noted that these results are better than pairing the tiny image representation with a support vector machine classifier, which, depending on the SVM training parameters, performs at about 10–14 % accuracy. In this combination, the SVM classifiers tend to lump almost all the data into just a few categories, wildly inaccurately.
Bag of words is a strategy used primarily in natural language processing, in which documents are treated as unordered collections of words drawn from a large vocabulary; every document is essentially represented as a histogram of word counts. In the visual domain, the "words" are SIFT features densely sampled from the image. The vocabulary is created by densely sampling SIFT features from the training set and then clustering them with a k-means algorithm into a workable size; I used a vocabulary of 400 visual words. Each test image is then densely sampled, and every sample is added to the histogram bin of the vocabulary word it is most similar to. In the end, each image is represented as a 400-bin histogram.
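A minimal numpy sketch of the vocabulary-building and histogram steps, assuming the dense SIFT descriptors have already been extracted and stacked row-wise (e.g. an N × 128 array). The plain Lloyd's-iteration k-means here stands in for whatever clustering library was actually used, and the function names are mine:

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size=400, iters=20, seed=0):
    """Cluster stacked SIFT descriptors into `vocab_size` visual words
    with plain Lloyd's k-means; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), vocab_size, replace=False)]
    for _ in range(iters):
        # squared distance from every descriptor to every center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for k in range(vocab_size):
            pts = descriptors[assign == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return centers

def bag_of_words_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word and return
    an L1-normalised histogram over the vocabulary."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```

With a 400-word vocabulary, `bag_of_words_histogram` produces exactly the 400-bin representation described above.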
The nearest-neighbor classifier then again determines which labelled training histograms a test image is closest to. The difference from the tiny image representation is that the features now live in a 400-dimensional space, whereas previously the feature space had 256 dimensions.
Results with this method were significantly better than with the previous one. Using a step size of 30 to build the vocabulary was slow but paid off in accuracy. For the actual image sampling I used a step size of 10, again slowing down the code but improving results. (I did save the computed features for future runs, though.)
Results using 1 nearest neighbor were around 45.3 %, with 5 nearest neighbors performing slightly better at 46.8 %. Using the chi-squared distance metric (instead of the default Euclidean distance) led to a 49.6 % recognition rate.
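The chi-squared distance referred to here weights each bin's squared difference by the bins' total mass, d(h, g) = ½ Σᵢ (hᵢ − gᵢ)² / (hᵢ + gᵢ), which tends to suit histogram comparisons better than Euclidean distance. A small sketch (the `eps` guard against empty bins is my addition):

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms:
    0.5 * sum((a - b)^2 / (a + b)), with eps guarding division by zero."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

Swapping this in for the Euclidean distance in the nearest-neighbor search is the only change needed.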
Using the same image representation as before, I tried a final classifier: the support vector machine. Here, binary classifiers are trained on a set of training data along with labels. The SVM tries to find a function that partitions the data as well as possible, so that the sign of the function's output for a given input determines which side of the binary decision an image falls on. Since these classifiers are binary, a one-vs-all classifier was needed for every category: the entire dataset was used in training, so each SVM saw both examples of its label and counterexamples. A test image is then assigned the label whose SVM classifier is most "confident" that the image pertains to it.
I did not alter the sampling frequency of the bag of SIFT words (as I had saved both the vocabulary and the image features from previous runs). I experimented a lot with one parameter of the SVM training function, LAMBDA, which determines how strongly the function is penalised for fitting the training data too closely (i.e., the regularisation strength). Recognition rates varied significantly with the order of magnitude of this parameter, ranging anywhere from 36.3 % to 56.7 %. Performance appears to peak at lambda values around 0.0001.
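To make the one-vs-all scheme and the role of lambda concrete, here is a toy numpy sketch using plain subgradient descent on the regularised hinge loss. This is my own stand-in for the actual SVM package used; the function names, learning rate, and epoch count are all illustrative:

```python
import numpy as np

def train_one_vs_all_svm(X, y, lam=1e-4, lr=0.05, epochs=2000):
    """Train one linear SVM per class on the full dataset (+1 for the class,
    -1 for everything else), minimising lam/2*||w||^2 + mean hinge loss.
    Returns (classes, W, b); class scores for x are W @ x + b."""
    classes = np.array(sorted(set(y)))
    n, d = X.shape
    W = np.zeros((len(classes), d))
    b = np.zeros(len(classes))
    for ci, c in enumerate(classes):
        t = np.where(np.asarray(y) == c, 1.0, -1.0)
        w, bias = np.zeros(d), 0.0
        for _ in range(epochs):
            viol = t * (X @ w + bias) < 1            # hinge-loss violators
            grad_w = lam * w - (t[viol][:, None] * X[viol]).sum(0) / n
            grad_b = -t[viol].sum() / n
            w -= lr * grad_w
            bias -= lr * grad_b
        W[ci], b[ci] = w, bias
    return classes, W, b

def predict_one_vs_all(classes, W, b, X):
    """The most 'confident' binary classifier wins: argmax of decision values."""
    return classes[np.argmax(X @ W.T + b, axis=1)]
```

Larger `lam` shrinks the weights and smooths the decision boundaries; too small a value lets the classifier fit the training histograms too closely, matching the sensitivity described above.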
Interestingly, recognition also decreases when a larger vocabulary size is used, to approximately 52.6 %. The interplay between the different parameters is elusive, and 56.2 % was the highest rate I was able to attain with a combination I could reproduce.
Even after experimenting with the parameters, I could not find the "perfect mix" of fine-tuning to push recognition much higher. Being close to 60 % seemed reasonable. Adding spatial information to the image feature representations would probably improve recognition rates further.
| Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |
|---|---|---|---|---|---|
| Kitchen | 0.620 | ![]() ![]() | ![]() ![]() | ![]() LivingRoom ![]() Bedroom | ![]() Store ![]() LivingRoom |
| Store | 0.350 | ![]() ![]() | ![]() ![]() | ![]() Kitchen ![]() InsideCity | ![]() InsideCity ![]() Mountain |
| Bedroom | 0.240 | ![]() ![]() | ![]() ![]() | ![]() LivingRoom ![]() TallBuilding | ![]() Kitchen ![]() Suburb |
| LivingRoom | 0.190 | ![]() ![]() | ![]() ![]() | ![]() Kitchen ![]() Industrial | ![]() InsideCity ![]() Bedroom |
| Office | 0.760 | ![]() ![]() | ![]() ![]() | ![]() Bedroom ![]() Bedroom | ![]() Coast ![]() Kitchen |
| Industrial | 0.170 | ![]() ![]() | ![]() ![]() | ![]() LivingRoom ![]() TallBuilding | ![]() Mountain ![]() Street |
| Suburb | 0.910 | ![]() ![]() | ![]() ![]() | ![]() Mountain ![]() Industrial | ![]() Store ![]() Industrial |
| InsideCity | 0.530 | ![]() ![]() | ![]() ![]() | ![]() Highway ![]() TallBuilding | ![]() Highway ![]() TallBuilding |
| TallBuilding | 0.670 | ![]() ![]() | ![]() ![]() | ![]() Bedroom ![]() Industrial | ![]() Coast ![]() Kitchen |
| Street | 0.610 | ![]() ![]() | ![]() ![]() | ![]() TallBuilding ![]() InsideCity | ![]() TallBuilding ![]() Forest |
| Highway | 0.740 | ![]() ![]() | ![]() ![]() | ![]() Store ![]() Coast | ![]() Coast ![]() Coast |
| OpenCountry | 0.260 | ![]() ![]() | ![]() ![]() | ![]() Industrial ![]() InsideCity | ![]() Forest ![]() Highway |
| Coast | 0.830 | ![]() ![]() | ![]() ![]() | ![]() OpenCountry ![]() OpenCountry | ![]() Bedroom ![]() Suburb |
| Mountain | 0.680 | ![]() ![]() | ![]() ![]() | ![]() Store ![]() OpenCountry | ![]() Bedroom ![]() Street |
| Forest | 0.950 | ![]() ![]() | ![]() ![]() | ![]() Mountain ![]() OpenCountry | ![]() Street ![]() Bedroom |