Project 3: Bag of Words Model for Scene Classification
CS 143: Introduction to Computer Vision

Confusion matrix for scene classification on the 15 scene data set, using a 3200-word vocabulary + spatial pyramid matching kernel + gist (RBF kernel). Using 5-fold cross validation to tune parameters on the training set, the mean accuracy over the full test set (2985 images) is 83.12%.
Overview
In this project, we explore the classic bag of words model for scene classification, using the 15 scene data set. Each image is treated as a document containing a number of visual words. An image can thus be efficiently represented using the counts of the words from a precomputed dictionary of visual words. This dictionary is formed by collecting SIFT features from many images, followed by k-means clustering. We also explore the benefit of using a spatial pyramid matching kernel as discussed in Lazebnik et al. 2006, as well as the benefit of adding gist descriptors on top.
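As a rough sketch of the dictionary-building step (assuming VLFeat's vl_dsift and vl_kmeans are on the MATLAB path; the function name build_vocabulary, the sampling step, and the loop structure are illustrative, not the exact code used here):

% Illustrative sketch: build a visual word dictionary from training images.
% Assumes VLFeat is installed; vl_dsift extracts dense SIFT descriptors and
% vl_kmeans clusters them into num_words centers (one 128-d word per column).
function vocab = build_vocabulary(image_paths, num_words)
    all_feats = [];
    for i = 1:length(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        [~, feats] = vl_dsift(single(img), 'Step', 8, 'Fast');  % 128 x N descriptors
        all_feats = [all_feats, feats];                          % accumulate features
    end
    vocab = vl_kmeans(single(all_feats), num_words);
end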
Details
Basic Bag of Words Model
In the basic bag of words model as described in the project handout, we first compute a dictionary of SIFT visual words, then apply hard assignment to densely sampled SIFT features in any given image to form a normalized histogram of size 1 x num_words; this is the image-level representation. Then, using a one-vs-all SVM scheme, a typical accuracy on the test set is about 62%.
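A minimal sketch of this hard-assignment step (again assuming VLFeat; the function name and the dense sampling step are illustrative):

% Illustrative sketch: image-level bag-of-words representation.
% vocab is 128 x num_words (from the dictionary step above).
function hist_feat = bag_of_words(img, vocab)
    num_words = size(vocab, 2);
    [~, feats] = vl_dsift(single(img), 'Step', 4, 'Fast');
    % Hard assignment: each dense SIFT descriptor votes for its nearest word.
    dists = vl_alldist2(single(feats), single(vocab));   % num_feats x num_words
    [~, assignments] = min(dists, [], 2);
    hist_feat = histc(assignments, 1:num_words)';         % 1 x num_words counts
    hist_feat = hist_feat / sum(hist_feat);                % normalize
end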

A visualization of a sample dictionary of 200 visual words; each column is a word of 128 dimensions.
Spatial Pyramid Matching Kernels
Following Lazebnik et al. 2006, a better representation can be formed by adding spatial information to the basic bag of words model, which on its own encodes none. Such information can be quite useful when comparing a natural outdoor scene (open, horizon, clouds, etc.) to an indoor scene (closed, lack of horizon, cluttered). The spatial information is also encoded using hard assignment. The image is first divided into a number of cells (the finest level), and a histogram is built for each cell by counting word frequencies. At coarser levels, we simply combine the histograms from the finer levels, but apply a smaller weight to histograms at coarser levels, putting more trust in histograms with finer spatial information.
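The sketch below illustrates one way to assemble such a pyramid descriptor from per-feature word assignments and their (x, y) locations; the function name, the overall normalization, and the boundary handling are illustrative assumptions, with level weights following Lazebnik et al. 2006:

% Illustrative sketch: spatial pyramid descriptor with L+1 levels.
% assignments holds one word index per dense SIFT feature; x, y are the
% feature locations; level l splits the image into 2^l x 2^l cells.
function pyramid = spatial_pyramid(assignments, x, y, width, height, num_words, L)
    pyramid = [];
    for l = 0:L
        if l == 0, w = 1 / 2^L; else, w = 1 / 2^(L - l + 1); end  % level weight
        cells = 2^l;
        for row = 1:cells
            for col = 1:cells
                in_cell = x >= (col-1)*width/cells  & x < col*width/cells & ...
                          y >= (row-1)*height/cells & y < row*height/cells;
                h = zeros(1, num_words);
                if any(in_cell)
                    h = histc(assignments(in_cell), 1:num_words);
                end
                pyramid = [pyramid, w * h(:)'];   % weighted per-cell histogram
            end
        end
    end
    pyramid = pyramid / max(sum(pyramid), 1);     % overall normalization
end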
As defined in Lazebnik et al. 2006, the pyramid matching kernel is simply the sum of histogram intersections at each bin location, which is easy to compute. Given any two images (with their spatial pyramid representations), we can precompute k(x_i,x_j) before training time and store the kernel matrix K of size num_train x num_train, where K(i,j)=k(x_i,x_j). The primal_svm.m function conveniently allows us to define any custom kernel as long as we supply the kernel matrix as a global variable K. At test time, the prediction for each image x* becomes a linear combination of k(x*,x_i).
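A minimal sketch of precomputing the intersection kernel matrix (intersection_kernel is an illustrative name; rows of feats_a and feats_b are pyramid descriptors):

% Illustrative sketch: histogram intersection kernel between two sets of
% descriptors, K(i,j) = sum over bins of min(feats_a(i,:), feats_b(j,:)).
function K = intersection_kernel(feats_a, feats_b)
    K = zeros(size(feats_a, 1), size(feats_b, 1));
    for i = 1:size(feats_a, 1)
        row = repmat(feats_a(i, :), size(feats_b, 1), 1);
        K(i, :) = sum(min(row, feats_b), 2)';
    end
end

At training time, one would set the global variable K = intersection_kernel(train_feats, train_feats) before calling primal_svm.m, and at test time compute intersection_kernel(test_feats, train_feats) and form predictions as a weighted sum of those kernel values; the exact calling convention depends on the version of primal_svm.m.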
With this part implemented, performance reaches about 80%, which is significantly better than the basic bag of words model.
Adding Gist Features
Adding gist features allows spatial and gradient information to be represented in a different way, which can be beneficial. For each image pair, we define k(x_i,x_j)=k_pyr(x_i,x_j)+k_rbf(x_i,x_j), where the first kernel is the spatial pyramid matching kernel described above and the second is an RBF kernel in gist feature space. Since both kernels are Mercer kernels, their sum is a valid Mercer kernel as well. After this modification, there are two parameters to tune: (1) sigma for the RBF kernel, and (2) lambda for the primal SVM. These values are tuned using 5-fold cross validation on the training set.
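A sketch of the combined kernel, reusing the intersection_kernel sketch above and assuming MATLAB's pdist2 (Statistics Toolbox) for the pairwise gist distances; variable names are illustrative and the gist descriptors are assumed to be precomputed, one row per image:

% Illustrative sketch: sum of the pyramid match kernel and an RBF kernel
% computed on precomputed gist descriptors.
function K = combined_kernel(pyr_a, pyr_b, gist_a, gist_b, sigma)
    K_pyr = intersection_kernel(pyr_a, pyr_b);   % spatial pyramid match kernel
    D2 = pdist2(gist_a, gist_b).^2;              % squared Euclidean distances
    K_rbf = exp(-D2 / (2 * sigma^2));            % RBF kernel in gist space
    K = K_pyr + K_rbf;                           % sum of two Mercer kernels
end

The RBF bandwidth sigma (along with the SVM's lambda) would then be chosen by the 5-fold cross validation described above.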
The above formulation might be insufficient, and the relative weight between the two kernels may need tuning as well: k(x_i,x_j)=k_pyr(x_i,x_j)+w*k_rbf(x_i,x_j). However, I have not examined this yet.
Results

Per-category performance on the 15 scene data set, as well as the mean accuracy (see legend). All experiments use hard assignment and the spatial pyramid matching kernel. Using a larger vocabulary size alone shows little improvement, but adding gist features helps a lot. Note how adding gist can decrease performance for certain categories (especially indoor scenes). I have also tried soft assignment, but the improvement is not significant, so it is not reported in this graph.