Project 3 - Scene Recognition Using Bag of Words
Scott Newman (senewman)
October 24, 2011
The goal of this project was to attempt to classify an image as being part of a particular "scene."
Scenes can be anything: for example, given an image of a house, one might want to know
whether the image is of the kitchen, the backyard, the bedroom, etc. By extracting general features
from training data (images which have already been classified), the scene recognition algorithm
determines the scene to which an unknown image belongs by finding which
already-classified features its own features match most often.
Broadly, the scene recognition algorithm implemented for this project uses a bag-of-words model
to recognize an image as belonging to one of 15 scenes, as outlined in Lazebnik et al. 2006.
The first step of the algorithm is to collect a series of features from training images
and cluster them into a visual vocabulary of so-called "visual words." These features are
gleaned by examining texture, brightness, and color information, among other cues.
In addition, these features are extracted using the scale-invariant feature
transform, or SIFT, meaning they are invariant to translation, scaling, and rotation.
This allows features to be represented similarly across images with different lighting,
size, etc. We did not implement a feature extractor for this algorithm, so a more
technical description of feature extraction is omitted for brevity.
The library used for feature extraction for this project
provides useful machine learning and computer vision tools.
After all of these features are extracted from the training data, they are then
clustered using k-means clustering. These cluster centers can be thought of
as "visual words." This is where the "bag of words" model comes into play;
these "visual words" are not dependent on spatial information in the images.
One can abstractly think of it as chopping up an image into many little segments,
taking the most relevant ones, and throwing them randomly into a bag.
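As a rough illustration of this clustering step, the sketch below runs a minimal k-means over a stack of feature descriptors in NumPy; the function name, dimensions, and toy data are illustrative, not the project's actual code (which used a library's clustering routine).

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster feature descriptors (n x d) into k 'visual words' via k-means.
    Returns the k x d matrix of cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers by sampling k distinct descriptors.
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center (squared Euclidean distance).
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Toy example: 200 random 8-D "descriptors", vocabulary of 5 words.
feats = np.random.default_rng(1).random((200, 8))
vocab = build_vocabulary(feats, k=5)
```

Each cluster center is one "visual word"; note that nothing in the clustering retains where in an image a descriptor came from, which is exactly the bag-of-words abstraction.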
[Figure: bag-of-words representation of the Mona Lisa]
Once the visual vocabulary is extracted, all the training images are represented as
histograms of words from the visual vocabulary.
[Figure: example bag-of-words histogram]
This is a crucial step: by learning which features appear most often in the training
images of each scene, we can estimate how likely an unknown image is to belong to a
particular scene based on the relative frequency of its features.
As an example, imagine one of the visual words is a small picture of a stove top.
In creating a histogram as described above, one might see that this "stove word"
appears most often in training images belonging to a kitchen scene. Therefore,
an unknown image that contains a high relative frequency of this stove feature
might be likely to belong to a kitchen scene. Features
in a training image will generally not exactly match the words in the visual vocabulary,
so the most similar word is chosen in each case. "Similarity" here is defined
using a sum of squared differences (SSD) metric.
If we have a feature f = (f_1, ..., f_n) and a visual word w = (w_1, ..., w_n),
their "distance" is d(f, w) = (f_1 - w_1)^2 + ... + (f_n - w_n)^2.
Naturally, then, the word with minimum distance is chosen.
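The nearest-word selection can be sketched in a few lines of NumPy (the function name and toy vocabulary are illustrative):

```python
import numpy as np

def nearest_word(feature, vocab):
    """Return the index of the visual word with minimum
    sum-of-squared-differences distance to the given feature."""
    ssd = ((vocab - feature) ** 2).sum(axis=1)  # SSD to every word at once
    return int(ssd.argmin())

# Toy 2-D vocabulary of three "words".
vocab = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
idx = nearest_word(np.array([0.9, 1.2]), vocab)  # -> 1, closest to [1, 1]
```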
After the histograms are computed, a set of one-vs-all classifiers is learned from the
training histograms using a support vector machine, also called an SVM. A linear kernel
was used for this project. This training is based on the bag-of-words histograms
computed earlier in the pipeline.
Once this is done, test images are converted to the bag-of-words representation (as the training
images were) and then evaluated using the set of one-vs-all classifiers.
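To make the one-vs-all idea concrete, the sketch below trains one linear classifier per class and predicts by taking the classifier with the strongest response. Note the stand-in: for self-containedness it uses a simple perceptron update rather than a real SVM solver, and all names and toy data are illustrative.

```python
import numpy as np

def train_one_vs_all(X, y, n_classes, epochs=50, lr=0.1):
    """Train one linear classifier per class (perceptron stand-in for a linear SVM).
    X: n x d histograms, y: integer class labels. Returns (n_classes x d+1) weights."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias term
    W = np.zeros((n_classes, Xb.shape[1]))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)         # one-vs-all targets: c vs everything else
        for _ in range(epochs):
            for xi, ti in zip(Xb, t):
                if ti * (W[c] @ xi) <= 0:       # misclassified -> perceptron update
                    W[c] += lr * ti * xi
    return W

def predict(W, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W.T).argmax(axis=1)            # class whose classifier fires strongest

# Toy data: two easily separable clusters of fake 4-bin "histograms".
rng = np.random.default_rng(0)
X = np.vstack([rng.random((20, 4)) + [2, 0, 0, 0],
               rng.random((20, 4)) + [0, 2, 0, 0]])
y = np.array([0] * 20 + [1] * 20)
W = train_one_vs_all(X, y, n_classes=2)
```

A real linear SVM differs in its training objective (maximum margin with hinge loss), but the one-vs-all prediction scheme is the same.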
To evaluate accuracy, a confusion matrix is built. In a confusion matrix, each column
represents instances of a predicted class, and each row represents instances of an actual class.
This allows one to easily evaluate whether one class is being mistaken for another. An accuracy metric,
therefore, is determined by summing the confusion matrix's diagonal (which counts the
correctly classified images) and dividing by the total number of test images used.
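This evaluation step is simple to sketch (the label vectors below are toy data):

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Rows = actual classes, columns = predicted classes."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        M[a, p] += 1
    return M

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
M = confusion_matrix(actual, predicted, 3)
# Diagonal entries are correct classifications; off-diagonal entries
# show which class is being mistaken for which.
accuracy = np.trace(M) / M.sum()  # -> 4/6, correctly classified / total
```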
Implementation and Results
I did not do anything particularly special for my implementation of the algorithm described above.
When building the visual vocabulary, I aggregated all of the SIFT features extracted from each image
into one big matrix and then performed a single k-means clustering at the very end. One optimization
I tried was performing k-means clustering of the SIFT features of each image, and then at the very end
taking all of THOSE cluster centers and clustering them together. This requires many more k-means clustering
operations, yet each one runs over a much, much smaller set of elements. This sped up computation by around 5-10
minutes when extracting a lot of features (i.e., a step size of 1), yet reduced accuracy by roughly
0.05 to 0.1.
My histogram implementation was very straightforward. I used the distance metric described above, and to normalize
each histogram, I divided by the total number of features found in the image. This is so
the algorithm wouldn't favor big images with more features.
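Putting the assignment and normalization steps together, the histogram computation looks roughly like this (names and toy data are illustrative):

```python
import numpy as np

def bow_histogram(features, vocab):
    """Represent an image as a normalized histogram over the visual vocabulary."""
    # SSD from every feature to every word, then pick the nearest word per feature.
    ssd = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = ssd.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    # Divide by the feature count so large images with more features
    # are not favored over small ones.
    return hist / len(features)

vocab = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]])
h = bow_histogram(feats, vocab)  # -> [0.5, 0.5]: two features per word, out of four
```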
The actual results depended heavily on fine-tuning of parameters.
I used 80 training images per class, a vocabulary size of 250, and a step size of 8 and
bin size of 4 when extracting image features, giving an accuracy of approximately 0.65.
In my final version of the code, I did not perform optimizations to speed up computation, though my implementation definitely
could have benefited from them. It took approximately 45 minutes to run with the above parameters.
In tweaking parameters, I found that using more images was more important for accuracy than extracting more
features. Doubling the number of features extracted per image led to an approximate increase in accuracy in the range
[0.01, 0.03], whereas doubling the number of images from 25 to 50 led to an approximate increase in accuracy of
[0.1, 0.3]. That isn't to say this is a linear relationship; increasing the number of images from 40 to around 80
led to a much smaller increase in accuracy, around [0.02, 0.04]. Still, this is all to say that for a fixed total
number of features extracted from the training set, it is better to extract fewer features per image from more images
than more features per image from fewer images. Increasing the vocabulary size from 200 to 250 helped a tiny bit, increasing accuracy by [0.005, 0.01].
In general, with this project, MORE IS BETTER. More data consistently gave better results. I tried randomly sampling
n images per class instead of taking the first n images per class, which occasionally gave better results than the deterministic method that
yielded an accuracy of 0.65, as mentioned above. Using randomness, my accuracy ranged anywhere from around 0.62 to 0.67.
The following is the confusion matrix generated with the parameters given above.