For scene recognition, we tried three types of classifiers. Let's walk through them!
This classifier is exactly what it sounds like: completely random. It doesn't look for any features in the image at all; it simply assigns each image a random category. We do not expect much.
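As a quick illustration, here is a minimal sketch of this baseline (the function and argument names are ours, not from the original code; assuming the standard 15 scene categories, chance accuracy is about 1/15 ≈ 0.067, which lines up with the measured accuracy reported below):

```python
import random

def random_classifier(test_image_paths, categories):
    # Ignore the image content entirely; pick a category uniformly at random.
    return [random.choice(categories) for _ in test_image_paths]
```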
Accuracy (mean of diagonal of confusion matrix) is 0.074
Tiny images with nearest neighbor classification has slightly better accuracy.
To get the tiny-image features, we load each image and resize it to 16x16. Then we reshape that 16x16 matrix into a 1x256 vector, which lets us compute distances between images for nearest neighbor classification.
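A minimal sketch of the tiny-image feature, assuming grayscale input and Pillow/NumPy (the function name is ours):

```python
import numpy as np
from PIL import Image

def tiny_image_feature(path, size=16):
    # Load as grayscale, shrink to 16x16, and flatten to a length-256 vector.
    img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
    return np.asarray(img, dtype=np.float32).reshape(-1)  # shape: (256,)
```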
With the features in hand, we run nearest neighbor classification: we take each test image's feature and compare it against every training image's feature. Whichever training feature is closest in Euclidean distance, we take that training image's label and classify the test image as such.
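A sketch of the nearest neighbor classifier, assuming features are stacked as rows of NumPy arrays (names are ours; it generalizes to k neighbors with a majority vote, which comes up below):

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_neighbor_classify(train_feats, train_labels, test_feats, k=1):
    # Euclidean distance between every test feature and every training feature.
    dists = cdist(test_feats, train_feats, metric="euclidean")
    # Indices of the k closest training images for each test image.
    nearest = np.argsort(dists, axis=1)[:, :k]
    labels = []
    for row in nearest:
        votes = [train_labels[i] for i in row]
        # Majority vote among the k nearest neighbors (k=1 reduces to plain 1-NN).
        labels.append(max(set(votes), key=votes.count))
    return labels
```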
This improves classification because similarly structured images land close together in feature space. The accuracy is still not high, though, because we compare entire images rather than key features; two pictures of a forest would need to look nearly identical to be reliably classified as a forest.
Modifying the number of nearest neighbors did not change much; accuracy generally peaked around k = 9.
Bag of SIFT with nearest neighbor classification has even better accuracy.
The first thing to do is build the vocabulary for the bag-of-SIFT representation. We first set the maximum size of the vocabulary, which defines how many centroids there will be. For each training image, we iterate over a subset of pixels with a given stepsize and extract SIFT descriptors. Once we have them all, we cluster them with k-means to find the resulting centroids.
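A sketch of the vocabulary building. The original pipeline likely used a dedicated dense-SIFT routine; here OpenCV's SIFT applied on a regular grid of keypoints is a stand-in, and the function names are ours:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, vocab_size=400, step=16):
    sift = cv2.SIFT_create()
    all_descs = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Sample keypoints on a regular grid with the given stepsize.
        kps = [cv2.KeyPoint(float(x), float(y), float(step))
               for y in range(step, img.shape[0] - step, step)
               for x in range(step, img.shape[1] - step, step)]
        _, descs = sift.compute(img, kps)
        if descs is not None:
            all_descs.append(descs)
    all_descs = np.vstack(all_descs)
    # Cluster the pooled descriptors; the centroids are the visual vocabulary.
    kmeans = KMeans(n_clusters=vocab_size, n_init=3).fit(all_descs)
    return kmeans.cluster_centers_  # shape: (vocab_size, 128)
```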
From there, for both the training and test sets, we load all the images and extract SIFT features again, this time with a smaller stepsize for subsampling. We assign each feature to the nearest cluster centroid and build a histogram per image, then normalize the histogram to reduce bias from larger images vs. smaller images. A kd-tree over the centroids was used to speed up the nearest-centroid lookups during histogram generation.
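A sketch of the histogram construction with the kd-tree lookup (same grid-sampled OpenCV SIFT stand-in as above; names are ours):

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree

def bags_of_sifts(image_paths, vocab, step=8):
    sift = cv2.SIFT_create()
    tree = cKDTree(vocab)  # kd-tree over the centroids for fast assignment
    feats = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kps = [cv2.KeyPoint(float(x), float(y), float(step))
               for y in range(step, img.shape[0] - step, step)
               for x in range(step, img.shape[1] - step, step)]
        _, descs = sift.compute(img, kps)
        # Assign each descriptor to its nearest centroid, then count per centroid.
        _, assignments = tree.query(descs)
        hist = np.bincount(assignments, minlength=len(vocab)).astype(np.float32)
        feats.append(hist / hist.sum())  # normalize so image size doesn't bias counts
    return np.vstack(feats)
```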
With that done, we run nearest neighbor classification on these histogram features, keeping k = 9 based on how it performed last time.
Parameters tested:

| Vocab size | Stepsize | Histogram stepsize | Accuracy | Time elapsed |
|---|---|---|---|---|
| 100 | 16 | 8 | 0.504 | 181.635s (all) |
| 200 | 16 | 25 | 0.423 | 698.381s (before kd-tree) |
| 200 | 16 | 8 | 0.510 | 110.958s (SIFT + kNN only) |
| 400 | 16 | 8 | 0.515 | 698.381s (all) |
The best parameters were vocab/centroid count = 400, stepsize = 16, and histogram stepsize = 8. Runs were taking quite a long time by this point, so we settled on these parameters rather than searching further.
Results visualization
Accuracy (mean of diagonal of confusion matrix) is 0.515
This has even better accuracy than the nearest neighbor approach.
After computing the histogram features from SIFT, we train a linear SVM classifier on the training data and apply it to the test data to determine the classifications. We train one SVM per category, each deciding whether an image belongs to that category or not; the category whose SVM produces the highest score for an image claims that image.
This performs better than nearest neighbors because nearest neighbors can be negatively influenced by uninformative visual words. The linear SVM classifier, by contrast, can weight features according to how relevant they are, making it more accurate.
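A one-vs-all sketch of this setup, using scikit-learn's LinearSVC as a stand-in for whatever SVM implementation was actually used; note that sklearn exposes C rather than lambda, so the C ≈ 1/lambda conversion here is our assumption:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_classify(train_feats, train_labels, test_feats, lam=0.001):
    categories = sorted(set(train_labels))
    scores = np.zeros((len(test_feats), len(categories)))
    for j, cat in enumerate(categories):
        # One-vs-all: positives are this category, negatives are everything else.
        y = np.array([1 if lbl == cat else -1 for lbl in train_labels])
        # C is (roughly) an inverse regularization strength, so C ~ 1/lambda.
        clf = LinearSVC(C=1.0 / lam)
        clf.fit(train_feats, y)
        scores[:, j] = clf.decision_function(test_feats)
    # Each test image gets the label of the classifier with the highest score.
    return [categories[i] for i in np.argmax(scores, axis=1)]
```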
We find the best accuracy by varying the regularization parameter lambda, while keeping the best parameters from the previous experiment.
Parameters tested:

| Lambda | Accuracy | Time elapsed |
|---|---|---|
| 0.00001 | 0.548 | 689.611s |
| 0.0001 | 0.553 | 742.430s |
| 0.001 | 0.593 | 811.503s |
| 0.01 | 0.486 | 848.360s |
| 0.1 | 0.465 | 899.614s |
| 1 | 0.376 | 931.604s |
The best parameter was lambda = 0.001, at vocab = 400, stepsize = 16, and histogram stepsize = 8.
Results visualization
Accuracy (mean of diagonal of confusion matrix) is 0.593