Bag of words
The hypothesis of the project: given images from several contextual classes, we can extract features from them, cluster
the features, and store the cluster centres as a vocabulary. If every image is then described by a histogram of feature
frequencies, we can train a classifier such as an SVM and build a model. The model should be able to assign a class
label to new images it has never seen before.
Scene recognition is one of the core computer vision tasks. For this project we build a classifier based on a
bag-of-words approach, where the "words" are visual features. The spatial arrangement of features within an image does
not matter in this paradigm; the only thing that matters is the distribution of features among images of the same
class. First we collect features from the training images. The larger the vocabulary we build, the easier it is to
discriminate between classes; the downside of a large vocabulary is slower clustering. K-means is used to cluster the
features into a visual vocabulary. After the vocabulary is built from all the training images, we compute a histogram
for each image, describing which vocabulary features it contains. We then use an SVM to build a classification model
from the histograms belonging to each class. Finally, we test the algorithm on new, previously unseen images and try
to classify them into one of the predefined classes. The hypothesis was confirmed: an accuracy of 0.6107 was achieved.
The confusion matrix shows high values on the diagonal, which means that test images are mostly assigned to the classes
they belong to (see Image 1), although some classes are confused more often than others, such as 'kitchen' and
'forest'. This issue can be mitigated by increasing the vocabulary size.
The algorithm of the program:
set VLFeat paths so the SIFT, k-means and KDTree functions can be called
read all filenames
set the vocabulary size
build the vocabulary:
    for every image:
        read it
        convert to single precision
        extract SIFT features
        append them to the feature matrix
    convert the feature matrix to single precision
    cluster with k-means
    return the vocabulary
build a forest from the vocabulary using a KDTree
for every class and every image in it, make a histogram:
    read the image
    convert it to grayscale
    extract SIFT features
    query the forest with the extracted features
    build a histogram from the results
    normalize it by dividing by its largest value
use an SVM to classify the histograms and obtain the models
classify every test image
visualize the results and get the score
calculate the confusion matrix
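The steps above can be sketched end to end in Python. This is a hedged stand-in for the report's MATLAB/VLFeat pipeline: synthetic descriptor matrices replace real SIFT output, scikit-learn's k-means and LinearSVC replace VLFeat's k-means and the SVM, and SciPy's cKDTree plays the role of the KDTree forest. Class labels, counts, and the vocabulary size are all invented for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
k = 30  # vocabulary size (assumed; the report does not state the value it used)

def fake_descriptors(label, n=150):
    """Synthetic stand-in for SIFT: classes differ slightly in descriptor distribution."""
    return rng.normal(loc=0.5 * label, size=(n, 128))

train = [(lab, fake_descriptors(lab)) for lab in (0, 1) for _ in range(20)]
test = [(lab, fake_descriptors(lab)) for lab in (0, 1) for _ in range(5)]

# Build the vocabulary: k-means over all training descriptors.
vocab = KMeans(n_clusters=k, n_init=10, random_state=0)
vocab.fit(np.vstack([d for _, d in train]))

# Build a KD-tree over the cluster centres for fast nearest-word lookup,
# mirroring the KDTree forest step in the pipeline above.
tree = cKDTree(vocab.cluster_centers_)

def histogram(desc):
    """Assign each descriptor to its nearest word, count, normalize by the max bin."""
    _, words = tree.query(desc)
    h = np.bincount(words, minlength=k).astype(float)
    return h / h.max()

X_train = np.array([histogram(d) for _, d in train])
y_train = np.array([lab for lab, _ in train])
X_test = np.array([histogram(d) for _, d in test])
y_test = np.array([lab for lab, _ in test])

# Train a linear SVM on the histograms, classify the test images,
# and compute the confusion matrix.
clf = LinearSVC().fit(X_train, y_train)
pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)
print(cm.shape)  # (2, 2)
print(cm)
```

As in the report, a strong diagonal in the printed confusion matrix indicates that test images are mostly assigned to their true classes.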
Image 1. Confusion matrix for the trained classes