CS 1430 Project 3: Scene recognition with bag of words

Kumud Nepal
October 25, 2011

Overview

The bag of words model classifies images by building histograms of visual word frequencies. These words are identified by clustering a large number of example features. For this project we use SIFT features, which are scale-invariant. First, dense SIFT features are extracted with the open-source vl_dsift function, using a bin size of 4 and a step size of 8. Because vl_dsift extracts a very large number of features, only 1/4 of them are randomly chosen for use. These features are then clustered into a visual vocabulary of 200 words using the vl_kmeans function. For each training image, a histogram of word frequencies is built (normalized to avoid influence from image size) and fed into an SVM solver. The SVM solver builds models for the different classes and classifies the test images from their histograms as well.

As additional (extra credit) exercises, different vocabulary sizes ranging from 100 to 1000 are tested. Cross-validation is also used to measure performance rather than a fixed test/train split: 100 images are randomly picked for each label from the 200 given and accuracy is measured; this is repeated 5 times (manually) and the average and standard deviation of the performance are reported. Finally, different distance parameters were used to build the histograms fed into the SVM.
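To make the pipeline concrete, a minimal MATLAB sketch of the vocabulary-building step is shown below. It assumes VLFeat is on the path; the variable image_paths (a cell array of training image file names) and the RGB-to-grayscale conversion are illustrative assumptions, not part of the provided code.

% Sketch of vocabulary construction, assuming RGB training images
% listed in the hypothetical cell array image_paths.
vocab_size   = 200;               % number of visual words
all_features = [];
for i = 1:length(image_paths)
    img = single(rgb2gray(imread(image_paths{i})));
    % Dense SIFT with the bin size / step size used in this project
    [~, features] = vl_dsift(img, 'Size', 4, 'Step', 8);
    % Keep a random quarter of the descriptors to speed up clustering
    order    = randperm(size(features, 2));
    num_keep = round(size(features, 2) / 4);
    all_features = [all_features, single(features(:, order(1:num_keep)))];
end
% Cluster the sampled descriptors into the visual vocabulary (128 x 200)
vocab = vl_kmeans(all_features, vocab_size);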



Results

Fig. 1: Visual illustration of the confusion matrix built after classification with a vocabulary size of 200, a fixed test/training split, and 1/4 of the features randomly chosen for clustering. Accuracy: 0.6027



89	2	0	1	2	0	0	0	1	0	1
3	78	0	4	1	4	9	0	0	0	1
1	0	94	0	0	4	0	1	0	0	0
0	9	0	76	5	4	3	0	0	1	1
4	4	2	1	60	0	0	6	4	0	0
5	2	5	2	0	75	3	2	3	0	1
4	23	13	9	0	9	37	2	0	0	2
1	0	0	10	21	3	0	51	2	0	0
0	2	2	0	2	4	0	6	74	0	3
0	0	0	0	0	0	0	0	0	85	4
2	2	0	0	0	2	0	1	1	15	35
4	2	2	10	7	3	6	5	7	3	2
1	0	0	0	7	1	0	0	5	16	17
1	0	0	1	2	3	0	3	5	28	21
2	0	2	3	17	6	0	4	5	3	5

Fig. 2: Textual representation of the confusion matrix in Fig. 1



Extra Credit Efforts

As mentioned earlier, some parameters were tweaked in the algorithm as extra-effort exercises. The first was changing the vocabulary size and studying its effect on model accuracy. The second was trying different random splits of the test and training image sets for cross-validation: the 100 images from the training folder and the 100 images from the test folder were pooled for each label, then 100 were randomly chosen for training and the remaining 100 used as the test set. This random split was done 5 times and accuracy was measured for each run; the average and standard deviation are presented later. As a third effort, different parameters for creating the histogram (essentially changing the kernel fed into the SVM) were tested, e.g. KCHI2 and KL1 as 'kernel' inputs to the vl_alldist2 call used to generate the histogram, as well as different 'metric' parameters such as CHI2 and L1.
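A minimal sketch of one such random split is given below, assuming a hypothetical cell array all_image_paths holding the 200 pooled image file names of a single category.

% Sketch of one random split for a single category (200 pooled images);
% all_image_paths is a hypothetical cell array of those file names.
idx         = randperm(200);
train_idx   = idx(1:100);           % 100 random images for training
test_idx    = idx(101:200);         % remaining 100 for testing
train_paths = all_image_paths(train_idx);
test_paths  = all_image_paths(test_idx);
% Histograms, SVM training, and accuracy measurement then proceed as in
% the base pipeline; repeating this five times yields the mean and
% standard deviation reported in Table 2.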


Variable Vocabulary Size

Vocab Size	Accuracy	# Features chosen for clustering
100	0.5820	1/4
100	0.5293	1/10
200	0.6027	1/4
200	0.5653	1/10
400	0.6073	1/4
1000	0.6000	1/4

Table 1: Accuracy for different vocabulary sizes and different random samplings of SIFT features


Fig. 3: Graph showing varying vocabulary size vs. accuracy for the same number of samples extracted from images



Random Splits of training and test images

Random split	Accuracy
1	0.6127
2	0.6080
3	0.6087
4	0.6070
5	0.6073
Average: 0.60874
Standard deviation: 0.00231

Table 2: Average and standard deviation of accuracy over 5 runs with randomly split test and training sets for each scene label




Fig. 4: Random split vs. accuracy



Different kernels to SVM/different parameters for histogram formulation

Parameter	Accuracy
KCHI2	0.0900
KL1	0.0900
CHI2	0.6027
L1	0.5860

Table 3: Different parameters used for calculating the pairwise distance between extracted features and the vocabulary. The vl_alldist2 function accepts either a kernel or a metric as its comparison argument: KCHI2 and KL1 are kernel parameters, while CHI2 and L1 are metric parameters. CHI2 is the chi-square distance. None of these accuracy measures was superior to the 0.6127 obtained earlier for the first random split.
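For reference, the sketch below shows how the comparison type passed to vl_alldist2 enters the histogram step; sift_feats (the 128 x N descriptors of one image) and vocab (128 x vocabulary size, as built earlier) are assumed variables.

% Sketch of assigning descriptors to words with different vl_alldist2
% comparison types; sift_feats (128 x N) and vocab (128 x K) assumed.
comparison = 'CHI2';                % also tried: 'L1', 'KCHI2', 'KL1'
D = vl_alldist2(single(sift_feats), single(vocab), comparison);
% CHI2 and L1 are distances, so the nearest word is the minimum;
% KCHI2 and KL1 are kernels (similarities), where the most similar
% word would be the maximum instead.
[~, words] = min(D, [], 2);
word_hist  = histc(words, 1:size(vocab, 2));
word_hist  = word_hist / sum(word_hist);   % normalize for image size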


Fig. 5: Visual illustration of the confusion matrix for the KCHI2 and KL1 'kernel' parameters for the vl_alldist2 function.