The aim of this assignment is to implement a bag-of-words model for scene categorization, following the baseline algorithm presented in Lazebnik et al. (2006).
The implementation was done purely in MATLAB. The data set contains 15 scene categories with 200 images per category (100 for training and 100 for testing). First, local features are collected from the training set and clustered (using k-means) into vocabularies of varying sizes. Each training image is then represented as a distribution over visual words using histogram encoding.
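The vocabulary-building and encoding steps can be sketched as follows. This is a minimal illustration in Python/NumPy rather than the MATLAB used in the assignment, with a plain k-means loop standing in for whatever k-means routine the actual implementation called; function names and parameters here are illustrative, not from the original code.

```python
import numpy as np

def build_vocabulary(features, k, iters=20, seed=0):
    """Cluster local feature descriptors into k visual words (plain k-means).

    features: (N, D) array of descriptors pooled from all training images.
    Returns the (k, D) array of cluster centers, i.e. the vocabulary.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers with k randomly chosen descriptors.
    centers = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each descriptor to its nearest center (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def encode_histogram(image_features, vocab):
    """Represent one image as a normalized histogram of visual-word counts."""
    dists = np.linalg.norm(image_features[:, None, :] - vocab[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```

Each image, regardless of how many local features it yields, is thus mapped to a fixed-length vector whose dimension equals the vocabulary size.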
The subsequent steps were provided in stencil code. A 1-vs-all classifier is trained for each scene category. Each test image is then histogram-encoded and scored against every classifier, and the category whose classifier responds most strongly is assigned.
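The 1-vs-all scheme can be sketched as below. Since the stencil code is not reproduced here, this sketch uses regularized least squares on ±1 targets as a simple stand-in for the actual binary classifiers; the function names and the choice of classifier are assumptions for illustration only.

```python
import numpy as np

def train_one_vs_all(X, y, n_classes, lam=1e-3):
    """Train one linear classifier per category (1-vs-all).

    X: (N, D) histogram-encoded training images; y: integer labels.
    Here each binary classifier is fit by ridge regression on +/-1 targets,
    a simple stand-in for the classifiers used in the assignment.
    Returns a (D, n_classes) weight matrix, one column per category.
    """
    D = X.shape[1]
    W = np.zeros((D, n_classes))
    A = X.T @ X + lam * np.eye(D)
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)   # category c vs. all the rest
        W[:, c] = np.linalg.solve(A, X.T @ t)
    return W

def predict(X, W):
    """Score each test vector with every classifier; highest score wins."""
    return (X @ W).argmax(axis=1)
```

The key point is that a 15-way decision is reduced to 15 independent binary problems, which is exactly how the stencil queries each test image against each scene category.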
| Vocabulary Size | Accuracy |
|-----------------|----------|
| 10              | 0.4307   |
| 20              | 0.5107   |
| 50              | 0.5707   |
| 100             | 0.6093   |
| 200             | 0.6293   |
| 400             | 0.6247   |
| 1000            | 0.6060   |
| 10000           | 0.5707   |
The accuracy of the model is highest with a vocabulary size of 200 and degrades as the size is increased or decreased. Smaller vocabularies likely oversimplify the feature space, merging distinct feature groups into a single cluster so that a visual word no longer represents the mean of one coherent group. Larger vocabularies likely over-partition the feature space, splitting each feature group across many clusters; too few features then accumulate in each bin during histogram encoding, producing sparse and noisy histograms.
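For reference, the accuracies in the table above are the fraction of test images classified correctly, which can be read off the diagonal of a confusion matrix. A minimal sketch (Python shown for illustration; the assignment used MATLAB, and the helper name here is hypothetical):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry (i, j) counts test images of true category i predicted as j."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1
    return M

# Overall accuracy is the fraction of images on the diagonal. Because this
# data set has the same number of test images (100) per category, this
# equals the mean of the per-category accuracies.
def accuracy(M):
    return np.trace(M) / M.sum()
```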
The best accuracy was obtained with a vocabulary size of 200. The image of the confusion matrix obtained from that run is given below.