CSCI1430 Project 3: Scene Recognition with Bag of Words


The goal of this assignment is to use the bag-of-words model in an image classification task on a dataset of 15 hand-labeled scene categories, each containing 200 images. The whole system can be viewed as several independent stages: visual vocabulary creation, feature encoding, model training, inference, and accuracy evaluation.


  1. Create Visual Vocabulary
  2. We built the visual vocabulary from a sample of dense SIFT descriptors computed over the 1500 training images. For each image, we first randomly permute all of its SIFT descriptors and sample 15% of them into a descriptor pool. After the sampling step, we run a k-means clustering algorithm on the pool; the resulting cluster means are taken as our visual vocabulary.
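The original pipeline is implemented in MATLAB with VLFeat; purely as an illustration, the sampling-plus-clustering step above can be sketched in NumPy (the function name, parameters, and the plain Lloyd-iteration k-means below are our own choices for the sketch, not the actual implementation):

```python
import numpy as np

def build_vocabulary(descriptor_sets, sample_frac=0.15, k=200, iters=20, seed=0):
    """Pool a random sample of descriptors from each image, then run k-means.
    The cluster means form the visual vocabulary (one row per visual word)."""
    rng = np.random.default_rng(seed)
    pool = []
    for desc in descriptor_sets:                    # desc: (n_i, 128) SIFT descriptors
        n = max(1, int(sample_frac * len(desc)))
        idx = rng.permutation(len(desc))[:n]        # random permutation, keep a fraction
        pool.append(desc[idx])
    pool = np.vstack(pool)

    centers = pool[rng.permutation(len(pool))[:k]]  # initialize from random pool points
    for _ in range(iters):                          # plain Lloyd iterations
        d = ((pool[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                        # nearest center per descriptor
        for j in range(k):
            members = pool[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers
```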
  3. Compute features
  4. In this step we represent each training and test image as a histogram over the visual vocabulary. For each image, we first compute dense SIFT descriptors, then bin each descriptor into the nearest visual word built in the previous step. We use the spatial pyramid encoding described in Lazebnik et al. 2006, and normalize the final feature vector so that its entries sum to 1.
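The base (level-0, no pyramid) encoding of this step can be sketched in NumPy as follows; this is a hypothetical helper for illustration, not the actual MATLAB code:

```python
import numpy as np

def bow_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word and return an
    L1-normalized histogram over the vocabulary."""
    d = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d.argmin(1)                                   # nearest-word index per descriptor
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()                              # entries sum to 1
```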
  5. Training
  6. We train a kernel SVM on these features. We use the histogram intersection function as our kernel, which is the same as eq. 1 in Lazebnik et al. 2006. We experimented with both Newton's method and gradient descent to estimate the model parameters; both give similar results.
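The histogram intersection kernel itself is simple; a NumPy sketch (our own helper name) is shown below. The same function yields both the training Gram matrix, with identical arguments, and the test kernel at inference time, with test histograms as rows and training histograms as columns:

```python
import numpy as np

def hist_intersection_kernel(A, B):
    """Gram matrix of the histogram intersection kernel:
    K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(-1)
```

For L1-normalized histograms, K[i, i] = 1, since min(h, h) sums to the histogram's total mass.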
  7. Inference
  8. We use the MAP estimator to assign each test image the label with the highest probability. Because we used a kernel SVM in the training stage, we also need to compute a kernel matrix for the test images: for each test image, we use the same kernel function (histogram intersection) to evaluate its similarity to all training images.
  9. Accuracy Evaluation
  10. We use cross validation to evaluate model performance. For each of the 15 classes, we first pool all training and test images together, then randomly pick half as training data and the other half as test data. We report summary statistics of the resulting accuracies, including mean, median, mode, and standard deviation.
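The per-class random half/half split used in the evaluation step can be sketched as follows (a NumPy sketch with a hypothetical helper name, not the actual code):

```python
import numpy as np

def random_half_split(labels, rng):
    """For each class, randomly permute its image indices and assign half
    to the training set and the other half to the test set."""
    labels = np.asarray(labels)
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return np.array(train), np.array(test)
```

Repeating this split with fresh random permutations and re-running training and inference gives the accuracy samples whose mean, median, mode, and standard deviation are reported.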

Extra Credit

  1. Spatial Pyramid
  2. We built a spatial pyramid representation with three levels (L0, L1, and L2) for our images, using the histogram intersection function as the kernel. The weighting strategy is the same as described in Lazebnik et al. 2006. For each image, we recursively divide it into grids (1, 4, and 16 cells in our experiments) and compute a local histogram for each cell, which captures spatial information. All histograms are weighted by the weighting function, concatenated, and normalized to sum to 1.
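The pyramid construction can be sketched in NumPy as below, assuming descriptor positions normalized to [0, 1) and the level weights 1/4, 1/4, 1/2 from Lazebnik et al. 2006 for a three-level pyramid (the function name and argument layout are our own for the sketch):

```python
import numpy as np

def spatial_pyramid(words, xy, n_words, grids=(1, 2, 4), weights=(0.25, 0.25, 0.5)):
    """Concatenate per-cell visual-word histograms over 1x1, 2x2, and 4x4 grids
    (levels L0-L2), weight each level, and L1-normalize the result.
    `words`: visual-word index per descriptor; `xy`: positions in [0, 1)^2."""
    feats = []
    for g, w in zip(grids, weights):
        # Map each descriptor position to a grid cell index in [0, g*g).
        cell = (np.floor(xy[:, 0] * g).clip(0, g - 1) * g
                + np.floor(xy[:, 1] * g).clip(0, g - 1)).astype(int)
        for c in range(g * g):
            h = np.bincount(words[cell == c], minlength=n_words).astype(float)
            feats.append(w * h)                    # level weight applied per cell
    v = np.concatenate(feats)                      # length: n_words * (1 + 4 + 16)
    return v / v.sum()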
  3. Test with different vocabulary sizes
  4. We tried different vocabulary sizes, which are believed to relate to feature strength: a small vocabulary yields weak features, a large vocabulary strong ones. We found that when the vocabulary is small, the standard deviation of the accuracies becomes large. A possible explanation is that a small vocabulary is too limited to represent the visual complexity of our dataset, so performance varies considerably depending on which vocabulary happens to be chosen.
  5. Cross Validation
  6. In traditional cross validation, the data is split into several folds, and train/test partitions over these folds are formed in a round-robin fashion. In our experiments we use a different strategy with the same underlying idea: each time we randomly permute the whole dataset and select half as training data and the other half as test data.
  7. Test with different features
  8. We tried vl_dsift() and vl_phow() to compute our features. When using vl_phow(), we set the scale size vector to [4 8] as described in the SUN database paper. However, this did not bring any performance improvement; surprisingly, the vl_dsift() representation gives slightly more accurate estimations.


Comparison of accuracies across different system setups

Setup                                                          avg. accuracy
Baseline (vocab size 200 + linear SVM)                         .622
vocab size 200 + kernel SVM (histogram intersection kernel)    .713
vocab size 200 + kernel SVM + phow descriptor                  .722
vocab size 200 + kernel SVM + dsift descriptor                 .800
vocab size 200 + kernel SVM + phow descriptor                  .790

Confusion matrix of our best run, with accuracy = .818

Suburb 1.00    Forest .88     InsideCity .93   OpenCountry .83   TallBuilding .81
Bedroom .83    Kitchen .71    Store .93        Coast .92         Highway .99
Mountain .72   Street .63     Office .79       Industrial .60    Livingroom .69

Diagonal values of the previous confusion matrix, with the classes of highest and lowest correct classification rate shown in color.

Statistics of accuracies over different vocabulary sizes, cross validated with 5 random initializations

The plot shows that increasing the vocabulary size does not always improve performance, which is consistent with the experimental results in Lazebnik et al. 2006.

The following table shows the summary statistics of our cross validation results, with 5 runs for each vocabulary setting.

Vocabulary Size                 10      20      50      100     200     400     1000
Mean                            .637    .724    .756    .782    .783    .812    .802
Median                          .635    .727    .751    .783    .781    .814    .805
Mode                            .650    .733    .770    .792    .798    .818    .809
Std. Dev. (normalized by N-1)   .0106   .0079   .0125   .0078   .0102   .0057   .0085

Samples of failure cases; notice that some of the confusions are quite reasonable.

Some categories, like 'open country', may be visually diverse, so we may need more data to train them. Other categories may be semantically or visually similar, such as 'inside city' vs. 'street', 'living room' vs. 'kitchen', and 'open country' vs. 'highway'.
'bedroom' categorized as 'industrial', 'coast' as 'open country', 'industrial' as 'living room', 'industrial' as 'store', 'inside city' as 'street',
'living room' as 'kitchen', 'open country' as 'coast', 'open country' as 'highway', 'open country' as 'store', 'store' as 'highway'