Oct 24, 2011
Sungmin Lee
In this assignment, I implemented image classification using a bag-of-words model inspired by the Lazebnik et al. 2006 paper. I used the 15-scene dataset from the paper for both the training and test sets, and produced a range of results by varying the vocabulary size and the number of training images.
The main idea of this project is to build a histogram of dense image features for each image and compare histograms to measure how closely two images correspond. To accomplish this, we collect dense image features and cluster them into a fixed number of bins, the visual words. According to several experiments, a larger vocabulary generally yields better results. Here is the basic flow of the algorithm; a top-level MATLAB sketch follows the list.
< Fig 1. Bag of words model (from Oxford Visual Geometry Group) >
- Collect many local features with which to build a visual vocabulary
- Use k-means (or a hierarchical k-means tree for a speed-up) to cluster those features into a visual vocabulary
- For each training image, build a histogram of word frequencies
(assigning each feature found in the training image to the nearest word in the vocabulary).
- Feed these histograms to an SVM to train.
- Build a histogram for test images and classify them with the SVM you just trained.
<Algorithm 1. Basic Flow of bag of words model (from Brown cs143)>
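As a rough top-level sketch, the steps above can be strung together in MATLAB as follows. This is only a sketch, not the exact code I ran: build_vocabulary, make_histograms, and classify_with_svms are hypothetical helper names (sketched after Algorithm 2, Algorithm 3, and the SVM step below), and the file lists and label arrays are assumed to be prepared as in the dataset sketch in the next section.

% Hypothetical top-level driver for the pipeline above (MATLAB + VLFeat).
% vocab_files, train_files, test_files, train_labels, test_labels and
% categories are assumed to exist (see the dataset split sketch below).
vocab_size  = 200;                                          % number of visual words
vocab       = build_vocabulary(vocab_files, vocab_size);    % collect dense features and cluster them
train_hists = make_histograms(train_files, vocab);          % one word-frequency histogram per training image
test_hists  = make_histograms(test_files, vocab);           % same representation for the test images
[predicted, confusion] = classify_with_svms(train_hists, train_labels, test_hists, test_labels, categories);
accuracy = mean(strcmp(predicted(:), test_labels(:)));      % fraction of correctly classified test images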
First of all, I collected example scenes from each category in the 15-scene dataset from Lazebnik et al. 2006. I selected 50 images from each category to build the bag of visual words, 100 images from each category to build histograms and train the classifiers, and 100 images from each category, disjoint from the training images, as the test set.
< Fig 2. scenes from each category in the 15 scene dataset (from Lazebnik et al. 2006)>
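A minimal sketch of this per-category split in MATLAB is shown below; the directory layout ('data/<category>/*.jpg') and the category names are assumptions about how the dataset is stored on disk, not the actual structure.

% Hypothetical per-category split; folder layout and category names are assumed.
categories = {'bedroom', 'coast', 'forest'};    % ... all 15 category names in practice
num_vocab = 50; num_train = 100; num_test = 100;
vocab_files = {}; train_files = {}; test_files = {};
train_labels = {}; test_labels = {};
for c = 1:numel(categories)
    files = dir(fullfile('data', categories{c}, '*.jpg'));
    names = arrayfun(@(f) fullfile('data', categories{c}, f.name), files, 'UniformOutput', false);
    names = names(:)';                          % row cell array of image paths
    order = randperm(numel(names));             % shuffle so the training and test sets do not overlap
    vocab_files  = [vocab_files,  names(order(1:num_vocab))];                     % 50 per category for the vocabulary
    train_files  = [train_files,  names(order(1:num_train))];                     % 100 per category for training
    test_files   = [test_files,   names(order(num_train+1:num_train+num_test))];  % 100 disjoint images for testing
    train_labels = [train_labels, repmat(categories(c), 1, num_train)];
    test_labels  = [test_labels,  repmat(categories(c), 1, num_test)];
end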
In this project I used SIFT (Scale-Invariant Feature Transform), published by D. Lowe in 1999, to collect local image features. SIFT is a very robust image descriptor that represents an image as a collection of feature vectors. I first collected dense SIFT features from all the training images and then clustered them into a vocabulary of 'visual words' using k-means.
Input:
img[] (input image array)
vocab_size (number of visual words in the vocabulary)
Output:
vocab[] (128 x vocab_size matrix of clustered visual words)
Functions:
dsift(img, size, step) - calculate dense SIFT features of a single image
kmeans(matrix, k) - apply kmeans algorithm to cluster matrix
Algorithm:
matrix[] ← NULL
for i from 1 to size(img[]) by 1
    sift_feature ← dsift(img[i], size, step)
    matrix[] ← concatenate(matrix[], sift_feature)
vocab ← kmeans(matrix, vocab_size)
<Algorithm 2. Cluster SIFT features into a vocabulary of 'visual words'>
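A minimal MATLAB sketch of Algorithm 2, assuming the VLFeat toolbox (vl_dsift, vl_kmeans); the 'size' and 'step' values passed to vl_dsift are illustrative, not the exact parameters used for the results below.

function vocab = build_vocabulary(image_paths, vocab_size)
% Pool dense SIFT descriptors from the vocabulary images and cluster them
% into visual words. Sketch assuming VLFeat; 'size'/'step' are illustrative.
    features = [];
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end            % vl_dsift expects a grayscale image
        [~, sift] = vl_dsift(single(img), 'size', 4, 'step', 8);  % 128 x N dense SIFT descriptors
        features = [features, single(sift)];                      % pool descriptors from every image
    end
    vocab = vl_kmeans(features, vocab_size);                       % 128 x vocab_size cluster centers
end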
Since we now have a clustered vocabulary of visual words, we are ready to convert all the training images into histogram representations. To assign each feature of an image to its most similar visual word, we need to compute the Euclidean distance between the feature and every word. I used a kd-tree in this project to speed up these nearest-neighbor searches. Counting the resulting assignments gives the word frequencies for each image.
< Fig 3. Histogram Representation >
Input:
img[] (input image array)
vocab[] (128 x vocab_size matrix of clustered visual words)
Output:
histogram[] (histogram representations giving the frequency of each visual word per image)
Functions:
dsift(img, size, step) - calculate dense SIFT features of a single image
kdtreebuild(matrix) - build a kd-tree from matrix
kdtreequery(forest, matrix, Y) - get the index of the nearest column of matrix for each column of Y
histc(index, edges) - count how many entries of index fall into each bin defined by edges
Algorithm:
forest ← kdtreebuild(vocab)
for i from 1 to size(img[]) by 1
    sift_feature ← dsift(img[i], size, step)
    index ← kdtreequery(forest, vocab, sift_feature)
    histogram[i] ← histc(index, edges)
<Algorithm 3. Convert images into histograms>
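A minimal MATLAB sketch of Algorithm 3, again assuming VLFeat (vl_dsift, vl_kdtreebuild, vl_kdtreequery). The kd-tree only needs to be built once, outside the per-image loop, and each histogram is normalized so that images with different numbers of features remain comparable.

function histograms = make_histograms(image_paths, vocab)
% Assign each dense SIFT descriptor to its nearest visual word and count
% word frequencies per image. Sketch assuming VLFeat; parameters illustrative.
    vocab_size = size(vocab, 2);
    forest = vl_kdtreebuild(vocab);                               % build the kd-tree over the vocabulary once
    histograms = zeros(numel(image_paths), vocab_size);
    for i = 1:numel(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) == 3, img = rgb2gray(img); end
        [~, sift] = vl_dsift(single(img), 'size', 4, 'step', 8);
        index = vl_kdtreequery(forest, vocab, single(sift));      % nearest word for each descriptor
        h = histc(double(index), 1:vocab_size);                   % word frequencies for this image
        histograms(i, :) = h / sum(h);                            % normalize to a unit-sum histogram
    end
end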
We can now learn a set of one-vs-all classifiers from the histograms using an SVM (support vector machine). Treating the SVM as a black box that learns a decision rule from labeled training data, we can classify each test image and build a confusion matrix. A confusion matrix is a visualization tool typically used in supervised learning. (cited from Wikipedia: http://en.wikipedia.org/wiki/Confusion_matrix )
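As a sketch of this step, the code below trains one linear SVM per category on the training histograms and accumulates the confusion matrix from the test predictions. It assumes VLFeat's vl_svmtrain as the linear SVM solver and an illustrative regularization value lambda; any other SVM package could be used in the same one-vs-all fashion.

function [predicted, confusion] = classify_with_svms(train_hists, train_labels, test_hists, test_labels, categories)
% One-vs-all linear SVMs on the word histograms, then a confusion matrix.
% Sketch assuming VLFeat's vl_svmtrain; lambda is an assumed value.
    lambda  = 0.0001;
    num_cat = numel(categories);
    X_train = double(train_hists');                               % D x N_train, one histogram per column
    X_test  = double(test_hists');                                % D x N_test
    W = zeros(size(X_train, 1), num_cat);
    B = zeros(1, num_cat);
    for c = 1:num_cat
        y = 2 * double(strcmp(train_labels, categories{c})) - 1;  % +1 for this category, -1 for the rest
        [W(:, c), B(c)] = vl_svmtrain(X_train, y, lambda);
    end
    scores = W' * X_test + B' * ones(1, size(X_test, 2));         % num_cat x N_test decision values
    [~, best] = max(scores, [], 1);                               % most confident classifier per test image
    predicted = categories(best);
    confusion = zeros(num_cat);                                   % rows: true category, columns: predicted
    for i = 1:numel(test_labels)
        t = find(strcmp(categories, test_labels{i}));
        confusion(t, best(i)) = confusion(t, best(i)) + 1;
    end
end

Normalizing each row of the confusion matrix by the number of test images per category and averaging the diagonal gives the overall classification accuracy.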
Here are some results and their confusion matrices.
According to several experiments, a larger vocabulary size generally yields better results.
< left = 0.4137 (vocab_size=10), right = 0.5153 (vocab_size=20) >
< left = 0.5753 (vocab_size=50), right = 0.6207 (vocab_size=200) >
< left = 0.6295 (vocab_size=400), right = 0.6427 (vocab_size=1000) >