CS143 Introduction to Computer Vision

Project 3: Scene Recognition by Bag of Words (by Margaret Kim, mk20)

Objective

To classify scenes with the bag of words model: train a classifier on bag-of-words histograms and test it on the 15 scene database.

Bag of Words Model

The bag of words model is a popular technique for image classification inspired by models used in natural language processing. It ignores or downplays word arrangement (spatial information in the image) and classifies based only on a histogram of the frequency of visual words. Visual words are identified by clustering a large corpus of example features. The baseline for this technique is described in "Beyond Bags of Features" by Lazebnik et al. (2006).

The general steps of the bag of words are:

  1. Extract features
  2. Learn "visual vocabulary"
  3. Quantize features using visual vocabulary
  4. Represent images by frequencies of "visual words"

Overview of the Algorithm

The steps of this project are:

  1. Establish a vocabulary of visual words through feature clustering
  2. Convert all training images into the histogram representation
  3. Learn a set of one-vs-all classifiers (SVMs) from the training histograms
  4. Classify each test image and build a confusion matrix

Details

The features extracted from the images are SIFT features. SIFT stands for scale-invariant feature transform; these features are tolerant to image noise, changes in illumination, uniform scaling, rotation, and minor changes in viewing direction. The features were sampled on a regular grid across each image.

SIFT grid
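
To make the regular grid concrete, the sketch below lists the points at which descriptors would be computed for a given size and step. It is only an illustration: the function name `grid_centers`, the 256x256 image size, and the rule that each descriptor spans a 4x4 arrangement of bins `size` pixels wide are assumptions, not details taken from this project's extraction code.

```python
def grid_centers(width, height, size, step):
    """Return the (x, y) centers of a regular sampling grid.

    Assumption: each descriptor covers a 4x4 arrangement of spatial bins,
    each `size` pixels wide, so a center is kept at least 2 * size pixels
    away from every image border.
    """
    margin = 2 * size
    xs = range(margin, width - margin + 1, step)
    ys = range(margin, height - margin + 1, step)
    return [(x, y) for y in ys for x in xs]


# Count descriptors for the three (size, step) settings reported in the
# Results section, on a hypothetical 256x256 image.
for size, step in [(4, 8), (8, 16), (4, 16)]:
    count = len(grid_centers(256, 256, size, step))
    print(f"size={size} step={step} descriptors={count}")
```

Under these assumptions, a smaller step yields many more descriptors per image, which increases both the amount of data available for clustering and the computation time.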

Once the features are extracted from all training images, we cluster them into a vocabulary of 200 visual words using k-means.
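
The clustering step can be sketched with a plain implementation of Lloyd's algorithm, the usual k-means iteration. This is a minimal illustration in pure Python, not the code used for the project: the function names, the iteration count, the seed, and the toy 2-D points standing in for 128-dimensional SIFT descriptors are all assumptions, and k is 2 here rather than 200.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    # Initialize the centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


# Toy data: two well-separated blobs of 2-D points (hypothetical values).
descriptors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
               (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
vocabulary = kmeans(descriptors, k=2)
print(sorted(vocabulary))
```

Each returned center plays the role of one visual word. With real descriptors, a numerical library would be far faster than these nested Python loops.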

Each training image is converted into its histogram representation by assigning every feature in the image to its nearest visual word and counting how often each word occurs.
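
A small sketch of this conversion follows. The function name `build_histogram` and the toy 2-D vocabulary are hypothetical. Dividing the counts by the number of features is also an assumption, made here so that images with different numbers of features remain comparable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_histogram(descriptors, vocabulary):
    """Count how often each visual word is the nearest one to a descriptor."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        word = min(range(len(vocabulary)),
                   key=lambda i: squared_distance(d, vocabulary[i]))
        counts[word] += 1
    total = sum(counts)
    # Assumption: normalize to relative frequencies when there are features.
    return [c / total for c in counts] if total else counts


# Toy example: a two-word vocabulary and four descriptors from one image.
vocabulary = [(0.0, 0.0), (5.0, 5.0)]
image_descriptors = [(0.1, 0.2), (4.8, 5.1), (5.3, 4.9), (5.0, 5.2)]
print(build_histogram(image_descriptors, vocabulary))  # [0.25, 0.75]
```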

The SVMs are then trained on the training histograms. Each test image is converted into a histogram in the same way and classified by the SVMs, and the predicted labels are compared against the true labels to build a confusion matrix.
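
The training and evaluation step can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the linear SVM is trained with a simple subgradient descent on the regularized hinge loss, the hyperparameters (`lam`, `lr`, `epochs`) are arbitrary, and the three-class toy histograms are made up for the example.

```python
def score(weights, bias, x):
    """Linear decision value w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on hinge loss plus L2 penalty; labels are +1 or -1."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            violated = y * score(weights, bias, x) < 1
            # The L2 penalty shrinks the weights on every step.
            weights = [w * (1 - lr * lam) for w in weights]
            if violated:  # the hinge term contributes only inside the margin
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def train_one_vs_all(histograms, labels, classes):
    """One binary SVM per class: that class is +1, every other class is -1."""
    return {c: train_linear_svm(histograms,
                                [1 if label == c else -1 for label in labels])
            for c in classes}


def classify(histogram, models):
    """Pick the class whose SVM gives the highest decision value."""
    return max(models, key=lambda c: score(*models[c], histogram))


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical histograms over a three-word vocabulary.
classes = ["forest", "street", "kitchen"]
train_x = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
           [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
           [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
train_y = ["forest", "forest", "street", "street", "kitchen", "kitchen"]
test_x = [[0.75, 0.15, 0.1], [0.15, 0.75, 0.1], [0.1, 0.15, 0.75]]
test_y = ["forest", "street", "kitchen"]

models = train_one_vs_all(train_x, train_y, classes)
predictions = [classify(h, models) for h in test_x]
matrix = confusion_matrix(test_y, predictions, classes)
for row in matrix:
    print(row)
```

Entries on the diagonal of the matrix count correctly classified test images, and off-diagonal entries show which scene categories are confused with one another.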

Results

The following are the results of varying the size and step values while generating the SIFT features.

Size = 4, Step = 8, Accuracy = 0.6353
Scene recognition with size 4 and step 8

Size = 8, Step = 16, Accuracy = 0.6380
Scene recognition with size 8 and step 16

Size = 4, Step = 16, Accuracy = 0.5920
Scene recognition with size 4 and step 16