CS 143 / Project 3 / Scene Recognition with Bag of Words
The goal of this project is to recognize 15 categories of scenes using different image representations and classifiers, and to compare the performance of each model. The following are the methods I implemented for this project:
- Tiny images representation + K-NN
- Bag of SIFT representation + K-NN
- Bag of SIFT representation + Linear SVM
- Bag of SIFT representation (Spatial Pyramid) + Non-Linear SVM
Tiny images representation + K-NN
In this method, we resize each image to a 16-by-16 tiny image, normalize it to zero mean and unit length, and use these tiny images as the representation.
We calculate the L2 distance between every test image and every training image, and run the K-NN algorithm to vote for the label.
Using this method, we observed that K = 1 gives the best result.
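The two steps above (tiny-image encoding, then K-NN voting on L2 distances) can be sketched as follows. This is a minimal NumPy sketch, not the actual project code: `tiny_image` and `knn_predict` are hypothetical helper names, and block averaging stands in for whatever resampling the real code uses.

```python
import numpy as np

def tiny_image(img, size=16):
    """Downsample a grayscale image to size x size by block averaging,
    then normalize to zero mean and unit length."""
    h, w = img.shape
    # crop so both sides divide evenly, then average each block
    tiny = img[:h - h % size, :w - w % size].reshape(
        size, h // size, size, w // size).mean(axis=(1, 3))
    vec = tiny.ravel().astype(float)
    vec -= vec.mean()                       # zero mean
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec  # unit length

def knn_predict(train_feats, train_labels, test_feats, k=1):
    """Label each test vector by majority vote among its k nearest
    training vectors under L2 distance."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    votes = train_labels[idx]
    return np.array([np.bincount(v).argmax() for v in votes])
```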
K | 1 | 3 | 5 | 10 | 15 |
---|---|---|---|---|---|
Accuracy | .227 | .209 | .212 | .214 | .217 |
Table 1. Different K's and their performances

The confusion matrix for the tiny images representation with the 1-NN classifier.
Bag of SIFT representation + K-NN/Linear-SVM
Algorithm
- Densely sample SIFT features from each training image with step size 8 and bin size 4. From each training image, randomly select 150 SIFT features and add them to a descriptor pool.
- Run the k-means clustering algorithm on the descriptor pool to build a k-word visual vocabulary.
- Represent every training and test image as a histogram of the visual words built in step (2). For each image, we sample SIFT features with step size 8 and bin size 4, then assign each newly observed SIFT feature to its most similar visual word.
- Use K-NN/Linear-SVM to classify each test image.
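The vocabulary-building and histogram-encoding steps above can be sketched as follows. This is a NumPy-only sketch; the project itself uses VLFeat's SIFT and k-means routines, so `build_vocab` and `bow_histogram` here are hypothetical stand-ins operating on already-extracted descriptors.

```python
import numpy as np

def build_vocab(descriptors, k, iters=20, seed=0):
    """Plain Lloyd k-means over the pooled descriptors.
    Returns a (k, d) array of visual words."""
    rng = np.random.default_rng(seed)
    words = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest word (squared L2)
        d = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):
                words[j] = members.mean(axis=0)
    return words

def bow_histogram(descriptors, words):
    """Encode one image's descriptors as an L1-normalized histogram
    over the visual words (nearest-word assignment)."""
    d = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(words)).astype(float)
    return hist / hist.sum()
```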
K-NN classifier
Use K-NN as the classifier. At evaluation time, a test image is assigned the label with the highest vote. We observed that larger K gives better accuracy. Table 2 shows the prediction accuracy for different K's.
K | 1 | 3 | 5 | 10 | 15 |
---|---|---|---|---|---|
Accuracy | .576 | .580 | .591 | .615 | .617 |
Table 2. Different K's and their performances. Vocab size = 400.

The confusion matrix for the bag of SIFT representation with the 15-NN classifier.
Linear-SVM
Instead of the k-NN classifier, we train 1-vs-all linear SVMs. At evaluation time, a test image is assigned the label of the classifier with the highest response. Table 3 shows the prediction accuracy for different vocabulary sizes.
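As a rough illustration of the 1-vs-all scheme, here is a minimal NumPy sketch that trains one hinge-loss linear classifier per class by subgradient descent and predicts with the highest response. This is an assumed training procedure for illustration, not the project's actual SVM code.

```python
import numpy as np

def train_1vall_svms(X, y, lam=0.001, lr=0.1, epochs=500):
    """Train one binary linear SVM per class by subgradient descent
    on the L2-regularized hinge loss. X: (n, d) histograms, y: (n,) labels."""
    n, d = X.shape
    classes = np.unique(y)
    W, b = np.zeros((len(classes), d)), np.zeros(len(classes))
    for c, cls in enumerate(classes):
        t = np.where(y == cls, 1.0, -1.0)        # one-vs-all targets
        w, bias = np.zeros(d), 0.0
        for _ in range(epochs):
            mask = t * (X @ w + bias) < 1        # margin violators
            gw = lam * w - (t[mask][:, None] * X[mask]).sum(axis=0) / n
            gb = -t[mask].sum() / n
            w, bias = w - lr * gw, bias - lr * gb
        W[c], b[c] = w, bias
    return W, b

def predict_1vall(W, b, X):
    """Assign each test image the label whose classifier responds highest."""
    return (X @ W.T + b).argmax(axis=1)
```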
Vocab_size | 400 | 800 |
---|---|---|
Accuracy | .637 | .659 |
Table 3. Different vocab_size and their performances. alpha = 0.001.

The confusion matrix for the bag of SIFT representation with vocab_size = 800 and linear SVMs.
Bag of SIFT representation (Spatial Pyramid) + Non-Linear SVM
Bag of SIFT features represent an image as an orderless collection of features, disregarding all spatial information in a given image. To capture the spatial layout of the features, we use the spatial pyramid method (introduced in Lazebnik et al. 2006) to generate our new image representation. Then, instead of a linear SVM, we use a non-linear SVM as our new classifier. As the paper suggests, we compute the kernel matrix using equation (4) in Lazebnik et al. 2006.
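A minimal sketch of the spatial pyramid histogram and the histogram-intersection (pyramid match) kernel of equation (4) is below, assuming features arrive with coordinates normalized to [0, 1) and precomputed visual-word assignments; the function names are hypothetical, not the project's. Level weights follow Lazebnik et al.: 1/2^L for level 0 and 1/2^(L-l+1) for level l.

```python
import numpy as np

def spatial_pyramid(points, words_idx, vocab_size, L=2):
    """Build the spatial-pyramid histogram of one image.
    points: (n, 2) feature coordinates in [0, 1); words_idx: (n,) visual-word
    index per feature. Level l has 2^l x 2^l cells. Returns one long,
    L1-normalized vector of weighted per-cell histograms."""
    chunks = []
    for l in range(L + 1):
        cells = 2 ** l
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        weight = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        for i in range(cells):
            for j in range(cells):
                sel = (cx == i) & (cy == j)
                h = np.bincount(words_idx[sel], minlength=vocab_size)
                chunks.append(weight * h)
    v = np.concatenate(chunks).astype(float)
    return v / v.sum() if v.sum() > 0 else v

def intersection_kernel(A, B):
    """Histogram-intersection kernel matrix between the rows of A and B.
    On weighted pyramid vectors this realizes the pyramid match kernel."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)
```

The resulting kernel matrix can then be handed to any SVM solver that accepts a precomputed kernel.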
Kernel Matrix

Pyramid match kernel matrix for the training images.
Setup: vocab_size = 400, L = 2 for the spatial pyramid, and densely sampled SIFT features with step size = 8 and bin size = 4.
Tuning parameters
We tune the SVM parameters on a validation dataset to pick the best lambda. Note: all of the following experiments use vocab_size = 400 and densely sampled SIFT features with step size = 8 and bin size = 4.
Accuracy rate as a function of lambda. L = 2.
Here we pick lambda = 0.01 as our regularization weight. We then fix lambda at 0.01 and run the experiment with different pyramid levels. For each level, the result was evaluated by randomly picking 100 training and 100 testing images per iteration.
L | 0 (1X1) | 1 (2X2) | 2 (4X4) | 3 (8X8) |
---|---|---|---|---|
Avg. accuracy | .768 | .797 | .806 | .804 |
Table 4. Different pyramid levels and their performances. Iterations = 5, lambda = 0.001.
We can see from the table above that performance is essentially unchanged when going from L = 2 to L = 3. After tuning the parameters on validation data, we pick lambda = 0.01 and L = 2, which yield the best result.
Different vocabulary size
We also ran the experiment with different vocabulary sizes. We observed that increasing the vocabulary size does not always increase accuracy, which is consistent with the results in Lazebnik et al. 2006.

Mean accuracy rate as a function of vocabulary size. lambda = 0.001.
Vocab_size (M) | 50 | 200 | 400 | 800 |
---|---|---|---|---|
Avg. accuracy | .763 | .792 | .805 | .803 |
Table 5. Different vocabulary sizes and their performances. Iterations = 5, lambda = 0.001.
Cross Validation Final Results
Based on the above experiments, the final setup is lambda = 0.01, L = 2, M = 400. We further change the step size for sampling SIFT features from 8 to 4, and run the vl_dsift function without the 'fast' parameter.
The final result was evaluated by randomly picking 100 training and 100 testing images for 5 iterations.
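The random-split evaluation protocol used throughout (100 training / 100 testing images per iteration, accuracy averaged over iterations) might look like the sketch below; `train_fn` and `predict_fn` are hypothetical placeholders for any of the classifiers above, not names from the project code.

```python
import numpy as np

def avg_accuracy(features, labels, train_fn, predict_fn,
                 n_train=100, n_test=100, iters=5, seed=0):
    """Average accuracy over random train/test splits.
    train_fn(X, y) -> model; predict_fn(model, X) -> predicted labels."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(iters):
        perm = rng.permutation(len(labels))           # fresh random split
        tr, te = perm[:n_train], perm[n_train:n_train + n_test]
        model = train_fn(features[tr], labels[tr])
        pred = predict_fn(model, features[te])
        accs.append((pred == labels[te]).mean())
    return float(np.mean(accs))
```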
