CS 143 / Project 3 / Scene Recognition with Bag of Words

The goal of this project is to recognize 15 categories of scenes using different image representations and classifiers, and to compare the performance of each model. The following are the methods I implemented for this project.


Tiny images representation + K-NN

In this method, we resize each image to a 16 by 16 tiny image with zero mean and unit length, and use these tiny images as the representation. We compute the L2 distance between every test image and every training image, and run the K-NN algorithm to vote for the label. With this method, we observed that K = 1 gives the best result.

K          1      3      5      10     15
Accuracy   .227   .209   .212   .214   .217

Table 1. Different K's and their performances

The confusion matrix for the tiny images representation and the 1-NN classifier.
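A minimal sketch of this pipeline in MATLAB with VLFeat, assuming grayscale images; train_paths, test_paths, and train_labels are illustrative names, not the actual starter-code API:

    N = numel(train_paths);
    train_feats = zeros(N, 16 * 16);
    for i = 1:N
        img = im2double(imresize(imread(train_paths{i}), [16 16]));
        v = img(:) - mean(img(:));                % zero mean
        train_feats(i, :) = v' / norm(v);         % unit length
    end
    % ... build test_feats from test_paths in the same way ...
    D = vl_alldist2(train_feats', test_feats');   % (squared) L2 distances, train x test
    [~, nn] = min(D, [], 1);                      % nearest training image per test image
    predicted = train_labels(nn);                 % the K = 1 case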


Bag of SIFT representation + K-NN/Linear-SVM

Algorithm

  1. Densely sample SIFT features from each training image with step size 8 and bin size 4. From each training image, randomly select 150 SIFT features and add them to a descriptor pool.
  2. Run the k-means clustering algorithm on the descriptor pool to build a k-word visual vocabulary.
  3. Represent every training and test image as a histogram of the visual words built in step (2). For each image, we sample SIFT features with step size 8 and bin size 4 and assign each feature to its nearest visual word (see the sketch after this list).
  4. Use K-NN or a linear SVM to classify each test image.
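A minimal sketch of steps 1 through 3 using VLFeat's vl_dsift, vl_kmeans, and vl_alldist2; train_imgs and img are illustrative names rather than the actual starter-code API:

    pool = single([]);
    vocab_size = 400;
    for i = 1:numel(train_imgs)
        [~, sift] = vl_dsift(single(train_imgs{i}), 'step', 8, 'size', 4, 'fast');
        sel = randperm(size(sift, 2), min(150, size(sift, 2)));
        pool = [pool, single(sift(:, sel))];      % 128 x N descriptor pool
    end
    vocab = vl_kmeans(pool, vocab_size);          % 128 x k visual word centers

    % Histogram of visual words for one image img:
    [~, sift] = vl_dsift(single(img), 'step', 8, 'size', 4);
    D = vl_alldist2(vocab, single(sift));         % word-to-feature distances
    [~, word] = min(D, [], 1);                    % nearest visual word per feature
    h = histc(word, 1:vocab_size);
    h = h / sum(h);                               % normalized histogram
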
K-NN classifier

Use K-NN as the classifier: at evaluation time, a test image is assigned the label with the highest vote among its K nearest training images. We observed that larger K gives better accuracy. Table 2 shows the prediction accuracy for different K's.

K          1      3      5      10     15
Accuracy   .576   .580   .591   .615   .617

Table 2. Different K's and their performances. Vocab size = 400.

The confusion matrix for bag of SIFT representation and the 15-NN classifier.
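A sketch of the voting step, assuming train_hists and test_hists store one histogram per column and that labels are numeric (all names illustrative):

    K = 15;
    D = vl_alldist2(train_hists, test_hists);    % pairwise distances between histograms
    [~, idx] = sort(D, 1, 'ascend');             % training images sorted by distance
    predicted = zeros(1, size(test_hists, 2));
    for j = 1:size(test_hists, 2)
        votes = train_labels(idx(1:K, j));       % labels of the K nearest neighbors
        predicted(j) = mode(votes);              % majority vote
    end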


Linear-SVM

Instead of the k-NN classifier, we train 1-vs-all linear SVMs. At evaluation time, a test image is assigned the label of the classifier with the highest response. Table 3 shows the prediction accuracy for different vocabulary sizes.

Vocab_size   400    800
Accuracy     .637   .659

Table 3. Different vocab_sizes and their performances. alpha = 0.001.

The confusion matrix for bag of SIFT representation with vocab_size = 800 and linear SVMs.
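A sketch of the 1-vs-all setup using VLFeat's vl_svmtrain, under the assumption that the alpha above is the SVM regularization weight (variable names illustrative):

    lambda = 0.001;
    cats = unique(train_labels);
    W = zeros(size(train_hists, 1), numel(cats));
    B = zeros(1, numel(cats));
    for c = 1:numel(cats)
        y = 2 * double(train_labels == cats(c)) - 1;   % +1 for this class, -1 for the rest
        [W(:, c), B(c)] = vl_svmtrain(single(train_hists), y, lambda);
    end
    scores = W' * test_hists + repmat(B', 1, size(test_hists, 2));
    [~, best] = max(scores, [], 1);                    % classifier with the highest response
    predicted = cats(best);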


Bag of SIFT representation (Spatial Pyramid) + Non-Linear SVM

Bag of SIFT features represent an image as an orderless collection of local features, disregarding all spatial information in the image. In order to capture the spatial layout of the features, we used the spatial pyramid method (introduced in Lazebnik et al. 2006) to generate a new image representation. Then, instead of the linear SVM, we used a non-linear SVM as our classifier. As the paper suggests, we compute the kernel matrix using equation (4) in Lazebnik et al. 2006.
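For completeness, the pyramid match kernel combines the histogram intersections I computed at each level into a single weighted sum; in the paper's notation (LaTeX):

    \kappa^L(X, Y) = \mathcal{I}^L + \sum_{\ell=0}^{L-1} \frac{1}{2^{L-\ell}}
                     \left( \mathcal{I}^{\ell} - \mathcal{I}^{\ell+1} \right)
                   = \frac{1}{2^L} \mathcal{I}^0
                     + \sum_{\ell=1}^{L} \frac{1}{2^{L-\ell+1}} \mathcal{I}^{\ell}

where \mathcal{I}^{\ell} is the histogram intersection of the two images' histograms at level \ell; finer levels get larger weights because matches found there are more informative.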

Kernel Matrix

Pyramid match kernel matrix for the training images.
Setup: vocab_size = 400, L = 2 for the spatial pyramid, and densely sampled SIFT features with step size = 8 and bin size = 4.
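A sketch of the kernel matrix computation, assuming the per-level weights have already been folded into each concatenated pyramid histogram, so the kernel reduces to a plain histogram intersection between pyramid vectors (pyramids is an illustrative D x N matrix, one weighted pyramid per column):

    N = size(pyramids, 2);
    K = zeros(N, N);
    for i = 1:N
        % intersect column i with every column at once, then sum over the D bins
        K(:, i) = sum(min(pyramids, repmat(pyramids(:, i), 1, N)), 1)';
    end

Folding the weights into the vectors is valid here because min(w*a, w*b) = w*min(a, b) for w > 0, which would not hold for, say, a linear kernel.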


Tuning parameters

We tune the SVM parameters on a validation dataset to pick the best lambda. Note: all of the following experiments use vocab_size = 400 and densely sampled SIFT features with step size = 8 and bin size = 4.

Accuracy rate as a function of lambda. L = 2.

Here we pick lambda = 0.01 as our regularization weight. We then fix lambda at 0.01 and run the experiment with different pyramid levels. For each level, the result was evaluated by randomly picking 100 training and 100 testing images for each iteration.
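A sketch of the lambda sweep; train_svm and predict_svm are hypothetical helper names standing in for our non-linear SVM routines, with K_train and K_val the precomputed kernel matrices against the training set:

    lambdas = [0.0001 0.001 0.01 0.1 1];   % candidate grid (illustrative)
    acc = zeros(size(lambdas));
    for i = 1:numel(lambdas)
        model = train_svm(K_train, train_labels, lambdas(i));    % hypothetical helper
        acc(i) = mean(predict_svm(model, K_val) == val_labels);  % hypothetical helper
    end
    [~, best] = max(acc);
    lambda = lambdas(best);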



L               0 (1x1)   1 (2x2)   2 (4x4)   3 (8x8)
Avg. accuracy   .768      .797      .806      .804

Table 4. Different pyramid levels and their performances. Iterations = 5, lambda = 0.001.

We can see from the table above that performance plateaus from L = 2 to L = 3 (.806 vs. .804). After tuning the parameters on validation data, we pick lambda = 0.01 and L = 2, which yield the best result.


Different vocabulary size

We also ran the experiment with different vocabulary sizes. We observed that increasing the vocabulary size does not always increase accuracy, which is consistent with the results in Lazebnik et al. 2006.

Mean accuracy rate as a function of vocabulary size. lambda = 0.001.

Vocab_size (M)   50     200    400    800
Avg. accuracy    .763   .792   .805   .803

Table 5. Different vocabulary sizes and their performances. Iterations = 5, lambda = 0.001.



Cross Validation Final Results

Based on the above experiments, the final setup is lambda = 0.01, L = 2, M = 400. We further change the step size for sampling SIFT features from 8 to 4, and run the vl_dsift function without the 'fast' parameter.

The final results were evaluated by randomly picking 100 training and 100 testing images for 5 iterations.
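A sketch of this protocol, where random_split and run_pipeline are hypothetical stand-ins for the per-category sampling and for the full train/predict pipeline above; confusionmat is from the Statistics Toolbox:

    iters = 5;
    acc = zeros(1, iters);
    for it = 1:iters
        [train_set, test_set] = random_split(all_images, 100, 100);  % hypothetical helper
        predicted = run_pipeline(train_set, test_set);               % hypothetical helper
        C = confusionmat(test_set.labels, predicted);
        C = C ./ repmat(sum(C, 2), 1, size(C, 2));   % row-normalize per category
        acc(it) = mean(diag(C));                     % mean of the diagonal
    end
    fprintf('mean accuracy over %d runs = %.3f\n', iters, mean(acc));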




Accuracy (mean of diagonal of confusion matrix) is 0.829

Category name    Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen          0.750      Bedroom, LivingRoom            LivingRoom, Industrial
Store            0.730      InsideCity, Industrial         Office, InsideCity
Bedroom          0.790      LivingRoom, Store              Kitchen, LivingRoom
LivingRoom       0.670      Bedroom, Kitchen               Office, Kitchen
Office           0.960      LivingRoom, LivingRoom         LivingRoom, Kitchen
Industrial       0.680      Kitchen, Store                 InsideCity, InsideCity
Suburb           1.000      OpenCountry, Industrial        (none)
InsideCity       0.790      Highway, Bedroom               Kitchen, Store
TallBuilding     0.900      Industrial, Industrial         Forest, InsideCity
Street           0.880      InsideCity, Highway            LivingRoom, InsideCity
Highway          0.880      Industrial, Industrial         Forest, Street
OpenCountry      0.700      Coast, Mountain                Coast, Mountain
Coast            0.820      OpenCountry, OpenCountry       Highway, OpenCountry
Mountain         0.920      OpenCountry, Industrial        Coast, OpenCountry
Forest           0.960      Store, OpenCountry             Mountain, Mountain