Nicholas Ragosta
CSCI 1430 Project 3
Goal:
To use a bag of words model to recognize and classify an image of an unknown scene.
Method:
Building Vocabulary
- First, we must build a vocabulary: a set of distinct features that occur in the training images.
- The function vl_dsift is run on each of the training images. This function densely samples key points from the image and returns a SIFT descriptor for each sampled pixel.
The matrix returned is 128 x P, where P is the number of sampled pixels.
- To speed up later computations, the SIFT matrix is reduced in size by randomly sampling 100 SIFT descriptors from the total number of pixels (P) sampled.
- vl_dsift is run on each of the training images, and the resulting matrices are stacked so that the end result is a 128 x (100*N) matrix, where N is the number of
training images and 100 is the number of points randomly sampled from each training image's SIFT descriptor matrix.
- Finally, k-means is run on the combined SIFT descriptor matrix, with the number of clusters set to a value called vocab size. This produces a matrix that contains
a number of distinct SIFT descriptor cluster centers equal to vocab size. Each of these cluster centers is a word in our bag of words model.
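The vocabulary-building steps above can be sketched in numpy (the project itself uses VLFeat's MATLAB functions; this is an illustrative stand-in, with `build_vocabulary` a hypothetical name and a plain Lloyd's k-means in place of a library call):

```python
import numpy as np

def build_vocabulary(descriptor_sets, vocab_size, n_samples=100, n_iters=20, seed=0):
    """Build a bag-of-words vocabulary by sampling descriptors and clustering.

    descriptor_sets: list of (128, P_i) arrays, one per training image
    (stand-ins for the vl_dsift output described above).
    """
    rng = np.random.default_rng(seed)
    # Randomly keep n_samples descriptors per image, then stack: 128 x (n_samples*N).
    sampled = [d[:, rng.choice(d.shape[1], n_samples, replace=False)]
               for d in descriptor_sets]
    data = np.hstack(sampled).T              # (n_samples*N) x 128 rows for clustering

    # Plain Lloyd's k-means; each final center is one "visual word".
    centers = data[rng.choice(len(data), vocab_size, replace=False)]
    for _ in range(n_iters):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)             # nearest center for each descriptor
        for k in range(vocab_size):
            members = data[labels == k]
            if len(members):
                centers[k] = members.mean(0) # move center to the cluster mean
    return centers.T                         # 128 x vocab_size vocabulary
```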
Creating a Histogram Representation
- Next, we create a histogram representation of each of the training images.
- Again, vl_dsift is run on each of the training images. vl_alldist2 is then run to compute the distance between the SIFT descriptor of each key point in the training image and the
SIFT descriptors of the vocabulary words.
- The minimum distance between each training image interest point and the vocab words is then found, and the interest point is assigned the feature type to which it most closely corresponds.
- The indices of the minimum distances are stored, and the number of occurrences of each index is counted. This produces a histogram of the number of occurrences of each
feature type (or vocab word) in a given training image.
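A minimal numpy sketch of this histogram step (again a stand-in for the vl_dsift / vl_alldist2 pipeline; `bow_histogram` is a hypothetical name):

```python
import numpy as np

def bow_histogram(descriptors, vocab):
    """Count visual-word occurrences for one image.

    descriptors: 128 x P SIFT descriptors for the image;
    vocab: 128 x vocab_size matrix of cluster centers.
    """
    # Squared Euclidean distance from every descriptor to every vocab word.
    dists = ((descriptors[:, :, None] - vocab[:, None, :]) ** 2).sum(0)  # P x vocab_size
    nearest = dists.argmin(1)    # index of the closest vocab word per descriptor
    # Tally how often each word index occurs: the image's histogram.
    return np.bincount(nearest, minlength=vocab.shape[1])
```

In practice the histogram is often normalized by the number of descriptors so that image size does not affect the comparison, though the raw counts already match the description above.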
Learning from training histograms
- The histograms of each image type are then fed into our SVM function. The SVM examines all the histograms of each image type to determine a general histogram distribution that fits
each image type.
- For example, the SVM looks at all 100 histograms of the image type "Living Room" and determines a feature distribution that closely matches the distributions found in those 100 histograms. This distribution
is used to determine whether a test image should be classified as a Living Room.
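To make the training step concrete, here is a minimal one-vs-all linear SVM sketch in numpy, trained with Pegasos-style subgradient descent. This is not the course's SVM function, just an illustrative substitute; `train_one_vs_all_svm` and its parameters are assumptions:

```python
import numpy as np

def train_one_vs_all_svm(X, y, n_classes, lam=0.01, n_iters=500, seed=0):
    """One-vs-all linear SVM over bag-of-words histograms.

    X: n_images x vocab_size histogram matrix; y: integer class labels.
    Returns weights W (n_classes x vocab_size) and biases b.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n_classes, d))
    b = np.zeros(n_classes)
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)   # +1 for this class, -1 for all others
        for it in range(1, n_iters + 1):
            i = rng.integers(n)           # pick one training histogram at random
            eta = 1.0 / (lam * it)        # decaying step size
            margin = t[i] * (W[c] @ X[i] + b[c])
            W[c] *= (1 - eta * lam)       # shrink weights (regularizer gradient)
            if margin < 1:                # hinge loss is active: push the margin out
                W[c] += eta * t[i] * X[i]
                b[c] += eta * t[i]
    return W, b
```

Each per-class classifier answers "is this histogram class c or not"; a test image is assigned the class whose classifier scores it highest.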
Classifying a test image
- As stated previously, the feature distributions determined to most closely resemble the histograms of each training image class are used to classify the test images.
- The feature distribution of each test image is compared to the model of each image class, and a classification is determined.
- 100 test images of each image class are classified; the distribution of assigned identities is stored in our confusion matrix, and an accuracy score is reported.
A perfect classifier would correctly classify every test image, and our confusion matrix would be an identity matrix.
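The confusion matrix and accuracy described above can be computed as follows (a small numpy sketch; the function names are assumptions):

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Entry (i, j) counts test images of true class i classified as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of test images on the diagonal, i.e. classified correctly.
    A perfect classifier puts all mass on the diagonal (a scaled identity)."""
    return np.trace(cm) / cm.sum()
```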
Results:
The baseline performs fairly well, correctly classifying the scenes 66.33 percent of the time when a vocabulary size of 200 was used. This classification method performed especially
well for certain scenes. For example, it correctly classified Suburbs over 90 percent of the time. Conversely, it had a difficult time classifying Industrial, with an accuracy of
just over 40 percent for that scene type. Below is the confusion matrix of the baseline method run with a vocab size of 200 words.
Extra Credit and Discussion:
I examined the effect of varying the vocabulary size on classification accuracy. Increasing the vocab size yielded higher classification accuracy. This result was to be
expected, because a larger vocabulary size allows for a more accurate and complete description of a particular scene and should lead to greater disparity between scene types. Below is a plot comparing
vocab size with accuracy, along with the confusion matrices produced for each vocab size tested.
[Plot: classification accuracy vs. vocabulary size, with the confusion matrix produced for each vocab size tested]
Vocab Size of 10 Words (47.47%)
Vocab Size of 50 Words (60.93%)
Vocab Size of 100 Words (65.53%)
Vocab Size of 500 Words (67.33%)
Vocab Size of 1000 Words (70.00%)