For scene recognition, we tried three types of classifiers. Let's walk through them!
This classifier is exactly what it sounds like: completely random. It doesn't look for any features in the image at all; it simply assigns each image a random category. We do not expect much.
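As a quick illustration, here is a minimal sketch of this baseline (the function and argument names are ours, not from the original code; assuming the standard 15 scene categories, chance accuracy is about 1/15 ≈ 0.067, which lines up with the measured accuracy reported below):

```python
import random

def random_classifier(test_image_paths, categories):
    # Ignore the image content entirely; pick a category uniformly at random.
    return [random.choice(categories) for _ in test_image_paths]
```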
Accuracy (mean of diagonal of confusion matrix) is 0.074
Tiny images with nearest neighbor classification has slightly better accuracy.
To get the tiny-image features, we load each image and resize it to 16x16. Then we reshape that 16x16 matrix into a 1x256 vector, which lets us compute distances between images for nearest neighbor classification.
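A minimal sketch of the tiny-image feature, assuming grayscale input and Pillow/NumPy (the function name is ours):

```python
import numpy as np
from PIL import Image

def tiny_image_feature(path, size=16):
    # Load as grayscale, shrink to 16x16, and flatten to a length-256 vector.
    img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
    return np.asarray(img, dtype=np.float32).reshape(-1)  # shape: (256,)
```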
With the features in hand, we run nearest neighbor classification: we take each test image's feature and compare it against every training image's feature. Whichever training feature is closest in Euclidean distance, we take that training image's label and classify the test image as such.
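A sketch of the nearest neighbor classifier, assuming features are stacked as rows of NumPy arrays (names are ours; it generalizes to k neighbors with a majority vote, which comes up below):

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_neighbor_classify(train_feats, train_labels, test_feats, k=1):
    # Euclidean distance between every test feature and every training feature.
    dists = cdist(test_feats, train_feats, metric="euclidean")
    # Indices of the k closest training images for each test image.
    nearest = np.argsort(dists, axis=1)[:, :k]
    labels = []
    for row in nearest:
        votes = [train_labels[i] for i in row]
        # Majority vote among the k nearest neighbors (k=1 reduces to plain 1-NN).
        labels.append(max(set(votes), key=votes.count))
    return labels
```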
This improves classification because similarly structured images land close together in feature space. The accuracy is still not high, though, because we compare entire images rather than key features; two pictures of a forest would need to look nearly identical to be reliably classified as a forest.
Modifying the number of nearest neighbors did not change much; accuracy generally peaked around k = 9.
Bag of SIFT with nearest neighbor classification has even better accuracy.
The first thing to do is build the vocabulary for the bag-of-SIFT representation. We first set the maximum size of the vocabulary, which defines how many centroids there will be. For each training image, we iterate over a subset of pixels with a given stepsize and extract SIFT descriptors. Once we have them all, we cluster them with k-means to find the resulting centroids.
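A sketch of the vocabulary building. The original pipeline likely used a dedicated dense-SIFT routine; here OpenCV's SIFT applied on a regular grid of keypoints is a stand-in, and the function names are ours:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, vocab_size=400, step=16):
    sift = cv2.SIFT_create()
    all_descs = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Sample keypoints on a regular grid with the given stepsize.
        kps = [cv2.KeyPoint(float(x), float(y), float(step))
               for y in range(step, img.shape[0] - step, step)
               for x in range(step, img.shape[1] - step, step)]
        _, descs = sift.compute(img, kps)
        if descs is not None:
            all_descs.append(descs)
    all_descs = np.vstack(all_descs)
    # Cluster the pooled descriptors; the centroids are the visual vocabulary.
    kmeans = KMeans(n_clusters=vocab_size, n_init=3).fit(all_descs)
    return kmeans.cluster_centers_  # shape: (vocab_size, 128)
```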
From there, for both the training and test sets, we load all the images and extract SIFT features again, this time with a smaller stepsize for subsampling. We assign each feature to the nearest cluster centroid and build a histogram per image, then normalize the histogram to reduce bias from larger images vs. smaller images. A kd-tree over the centroids was used to speed up the nearest-centroid lookups during histogram generation.
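A sketch of the histogram construction with the kd-tree lookup (same grid-sampled OpenCV SIFT stand-in as above; names are ours):

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree

def bags_of_sifts(image_paths, vocab, step=8):
    sift = cv2.SIFT_create()
    tree = cKDTree(vocab)  # kd-tree over the centroids for fast assignment
    feats = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kps = [cv2.KeyPoint(float(x), float(y), float(step))
               for y in range(step, img.shape[0] - step, step)
               for x in range(step, img.shape[1] - step, step)]
        _, descs = sift.compute(img, kps)
        # Assign each descriptor to its nearest centroid, then count per centroid.
        _, assignments = tree.query(descs)
        hist = np.bincount(assignments, minlength=len(vocab)).astype(np.float32)
        feats.append(hist / hist.sum())  # normalize so image size doesn't bias counts
    return np.vstack(feats)
```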
With that done, we run nearest neighbor classification on these histogram features, keeping k = 9 based on how it performed last time.
Parameters tested:

| Vocab size | Stepsize | Histogram stepsize | Accuracy | Time elapsed |
|---|---|---|---|---|
| 100 | 16 | 8 | 0.504 | 181.635s (all) |
| 200 | 16 | 25 | 0.423 | 698.381s (before kd-tree) |
| 200 | 16 | 8 | 0.510 | 110.958s (SIFT + kNN only) |
| 400 | 16 | 8 | 0.515 | 698.381s (all) |
The best parameters were vocab/centroid count = 400, stepsize = 16, and histogram stepsize = 8. Runs were taking quite a long time by this point, so we settled on these parameters rather than searching further.
Results visualization
Accuracy (mean of diagonal of confusion matrix) is 0.515
This has even better accuracy than the nearest neighbor approach.
After computing the histogram features from SIFT, we train a linear SVM classifier on the training data and apply it to the test data to determine the classifications. We train one SVM per category, each deciding whether an image belongs to that category or not; the category whose SVM produces the highest score for an image claims that image.
This performs better than nearest neighbors because nearest neighbors can be negatively influenced by uninformative visual words. The linear SVM classifier, by contrast, can weight features according to how relevant they are, making it more accurate.
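A one-vs-all sketch of this setup, using scikit-learn's LinearSVC as a stand-in for whatever SVM implementation was actually used; note that sklearn exposes C rather than lambda, so the C ≈ 1/lambda conversion here is our assumption:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_classify(train_feats, train_labels, test_feats, lam=0.001):
    categories = sorted(set(train_labels))
    scores = np.zeros((len(test_feats), len(categories)))
    for j, cat in enumerate(categories):
        # One-vs-all: positives are this category, negatives are everything else.
        y = np.array([1 if lbl == cat else -1 for lbl in train_labels])
        # C is (roughly) an inverse regularization strength, so C ~ 1/lambda.
        clf = LinearSVC(C=1.0 / lam)
        clf.fit(train_feats, y)
        scores[:, j] = clf.decision_function(test_feats)
    # Each test image gets the label of the classifier with the highest score.
    return [categories[i] for i in np.argmax(scores, axis=1)]
```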
We find the best accuracy by varying the regularization parameter lambda, while keeping the best parameters from the previous experiment.
Parameters tested:

| Lambda | Accuracy | Time elapsed |
|---|---|---|
| 0.00001 | 0.548 | 689.611s |
| 0.0001 | 0.553 | 742.430s |
| 0.001 | 0.593 | 811.503s |
| 0.01 | 0.486 | 848.360s |
| 0.1 | 0.465 | 899.614s |
| 1 | 0.376 | 931.604s |
The best parameter was lambda = 0.001, at vocab = 400, stepsize = 16, and histogram stepsize = 8.
Results visualization
Accuracy (mean of diagonal of confusion matrix) is 0.593