CS143 Introduction to Computer Vision Project 3 - SCENE RECOGNITION WITH BAG OF WORDS Hung-I Chuang Login:hichuang

Introduction

In this project, we classify scenes into one of 15 categories using a bag of words model. A bag of words model ignores spatial arrangement and classifies an image based only on a histogram of visual word frequencies. Visual words are generated by clustering local features from the training set of the 15 scene categories.

Algorithm

Brown University - Computer Science

First, we need to collect a large number of features from each of the 15 scene categories. I use SIFT descriptors (step 8 and size 4) to represent the features of each training image as a 128 x n matrix, where n is the number of sample points. I then cluster these features into 200 visual words using k-means clustering to form the visual vocabulary.
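The clustering step can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the reported results: the function names, the farthest-first initialization, and the 2-D toy points standing in for 128-dimensional SIFT descriptors are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_vocabulary(descriptors, vocab_size, iterations=20, seed=0):
    """Cluster descriptors with Lloyd's k-means; the centers are the visual words."""
    rng = random.Random(seed)
    # Farthest-first initialization (an assumption, chosen so the toy run is stable):
    # start from one random descriptor, then repeatedly add the descriptor that is
    # farthest from all centers picked so far.
    centers = [list(rng.choice(descriptors))]
    while len(centers) < vocab_size:
        farthest = max(descriptors,
                       key=lambda d: min(squared_distance(d, c) for c in centers))
        centers.append(list(farthest))

    for _ in range(iterations):
        # Assignment step: attach every descriptor to its nearest center.
        clusters = [[] for _ in centers]
        for d in descriptors:
            nearest = min(range(len(centers)),
                          key=lambda k: squared_distance(d, centers[k]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        for k, members in enumerate(clusters):
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy stand-in for SIFT descriptors: 2-D points around two well-separated blobs.
toy_rng = random.Random(1)
toy_descriptors = (
    [[toy_rng.gauss(0, 0.1), toy_rng.gauss(0, 0.1)] for _ in range(50)]
    + [[toy_rng.gauss(5, 0.1), toy_rng.gauss(5, 0.1)] for _ in range(50)]
)
vocabulary = build_vocabulary(toy_descriptors, vocab_size=2)
```

In the setting described above, `descriptors` would hold the 128-dimensional SIFT columns pooled over all training images and `vocab_size` would be 200.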

After defining our bag of visual words, we represent each image by a histogram of visual word occurrences. We count occurrences by assigning each SIFT feature to its nearest visual word. Here I use K-NN to find the nearest visual word for each SIFT feature.
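The histogram construction can be illustrated with a short sketch. It uses a brute-force nearest-neighbor search and raw counts; the function names and the tiny 2-D vocabulary are hypothetical and only serve the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (brute-force search)."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))


def word_histogram(descriptors, vocabulary):
    """Count how many descriptors of one image fall on each visual word."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    return counts


# Toy example: three 2-D "visual words" and four descriptors from one image.
toy_vocabulary = [[0, 0], [10, 10], [0, 10]]
toy_descriptors = [[1, 1], [9, 9], [0, 1], [1, 9]]
histogram = word_histogram(toy_descriptors, toy_vocabulary)
```

Here two descriptors land on the first word and one on each of the others. Whether the counts should also be normalized by the number of descriptors is a design choice that the text above does not specify.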

Next, we train a classifier for each of the 15 scenes on the visual word histograms of the training data, using one-vs-others SVM classification. Here I use a linear SVM and test different values of lambda to find which gives the best result.
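A one-vs-others training loop might look like the sketch below. It is an assumption-laden illustration: the solver is plain full-batch subgradient descent on the regularized hinge loss, `lam` is taken to be the regularization weight, and the scene names and histograms are made up. The scale of `lam` depends on the solver, so values here are not comparable to the lambda values reported in this write-up.

```python
def train_linear_svm(features, labels, lam=0.01, epochs=200, lr=0.1):
    """Binary linear SVM. labels are +1 / -1.

    Minimizes lam/2 * ||w||^2 + mean hinge loss with subgradient descent.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_rest(features, scene_labels, lam=0.01):
    """Train one binary SVM per scene: that scene (+1) against all others (-1)."""
    classifiers = {}
    for scene in sorted(set(scene_labels)):
        binary = [1 if s == scene else -1 for s in scene_labels]
        classifiers[scene] = train_linear_svm(features, binary, lam=lam)
    return classifiers


# Hypothetical training histograms over a 3-word vocabulary.
toy_histograms = [
    [1.0, 0.0, 0.0], [0.8, 0.1, 0.1],   # "forest"
    [0.0, 1.0, 0.0], [0.1, 0.8, 0.1],   # "kitchen"
    [0.0, 0.0, 1.0], [0.1, 0.1, 0.8],   # "street"
]
toy_scenes = ["forest", "forest", "kitchen", "kitchen", "street", "street"]
classifiers = train_one_vs_rest(toy_histograms, toy_scenes)
```

With 15 scenes the same loop would produce 15 weight vectors, one per category.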

Now that we have our classifiers, we do the same thing for the test data. We convert each test image to a histogram of visual words using K-NN and then evaluate all 15 classifiers to assign a scene to it.
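One common way to turn the per-scene classifiers into a single decision is to pick the scene whose classifier gives the highest score. The sketch below assumes that rule; the weights, scene names, and function names are hypothetical.

```python
def decision_score(w, b, histogram):
    """Signed score of a linear classifier (w, b) on one histogram."""
    return sum(wi * xi for wi, xi in zip(w, histogram)) + b


def predict_scene(classifiers, histogram):
    """Evaluate every one-vs-others classifier and return the top-scoring scene."""
    return max(classifiers,
               key=lambda scene: decision_score(*classifiers[scene], histogram))


# Hypothetical classifiers over a 3-word vocabulary: scene -> (weights, bias).
toy_classifiers = {
    "forest": ([1.0, -1.0, -1.0], 0.0),
    "kitchen": ([-1.0, 1.0, -1.0], 0.0),
    "street": ([-1.0, -1.0, 1.0], 0.0),
}
first_prediction = predict_scene(toy_classifiers, [0.7, 0.2, 0.1])
second_prediction = predict_scene(toy_classifiers, [0.1, 0.1, 0.8])
```

The first histogram is dominated by the first visual word and is assigned to "forest"; the second is dominated by the third word and is assigned to "street".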

Finally, we build a confusion matrix and measure the accuracy by averaging its diagonal. For the baseline with 200 visual words, the accuracy is 0.6107.
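The evaluation step can be sketched as below, assuming the confusion matrix is row-normalized so that each diagonal entry is the per-scene accuracy. The labels and predictions are toy values, not results from this project.

```python
def confusion_matrix(true_labels, predicted_labels, scenes):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of images
    of scene i that were predicted as scene j."""
    index = {scene: i for i, scene in enumerate(scenes)}
    counts = [[0] * len(scenes) for _ in scenes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_diagonal(matrix):
    """Average of the diagonal entries, i.e. the mean per-scene accuracy."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


toy_scenes = ["forest", "kitchen", "street"]
toy_true = ["forest", "forest", "kitchen", "kitchen", "street", "street"]
toy_pred = ["forest", "kitchen", "kitchen", "kitchen", "street", "forest"]
matrix = confusion_matrix(toy_true, toy_pred, toy_scenes)
accuracy = mean_diagonal(matrix)
```

In this toy case the per-scene accuracies are 0.5, 1.0 and 0.5, so the averaged diagonal is 2/3.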

Figure: accuracy versus lambda (200 visual words)

The diagram shows the accuracy obtained with different values of lambda on the 200-word vocabulary. lambda = 1.75 returns the highest accuracy, improving it to 0.658.

Extra Credit

I experimented with how the number of visual words affects the accuracy of scene recognition, using vocabularies of 10, 20, 50, 100, 200, 400, 1000, 2000 and 4000 words. I ran each with lambda = 1.75 and found that, in general, the more visual words, the better the accuracy.

The highest accuracy, 69.07%, occurred with 4000 visual words.

The plot below shows how the number of visual words affects accuracy.

Figure: accuracy versus number of visual words

The confusion matrix for 4000 visual words and lambda = 1.75 is shown below.