Following the instructions for the "baseline" scene recognition, the results below were obtained:
I examined the effect of vocabulary size on performance, trying sizes of 10, 20, 50, 100, 200, 500, 1000, and 2000. The respective accuracies were as follows:
Vocab size | Accuracy |
10 | 0.4760 |
20 | 0.5393 |
50 | 0.5947 |
100 | 0.6153 |
200 | 0.6680 |
500 | 0.6620 |
1000 | 0.6073 |
2000 | 0.5227 |
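For reference, here is a minimal sketch of where the vocabulary size enters the encoding, assuming the usual hard-assignment bag-of-words histogram: the vocabulary is a set of k cluster centers, and each local descriptor votes for its nearest center, so k is also the histogram length. The function name `hard_histogram`, its arguments, and the L1 normalization are assumptions for illustration and are not taken from the project code.

```python
import numpy as np

def hard_histogram(descriptors, vocab):
    """Bag-of-words histogram with hard assignment (illustrative sketch).

    descriptors: (n, d) array of local features from one image.
    vocab:       (k, d) array of cluster centers; k is the vocabulary size.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    vocab = np.asarray(vocab, dtype=float)
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    # Each descriptor votes for its single nearest word.
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=vocab.shape[0]).astype(float)
    # L1-normalize (an assumption) so images with different feature counts
    # are comparable; guard against an image with no descriptors.
    return hist / max(hist.sum(), 1.0)
```

With a small k, many dissimilar descriptors share one bin; with a very large k, similar descriptors are split across bins and the histogram becomes sparse. Either effect may contribute to the drop in accuracy at both ends of the table.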
I also measured performance using soft assignment (kernel codebook encoding). In these experiments, gamma was set to 10^-4. This produced "soft"-looking assignments while still making it clear what the hard assignment would have been, because the weight of the word that hard assignment would have selected was 1-2 orders of magnitude larger than the others. However, the accuracy decreased to 0.5307 for a vocabulary size of 200, compared with 0.6680 in the table above. This could be due either to the choice of gamma or to a different vocabulary size being optimal for soft assignment.
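A minimal sketch of kernel codebook encoding follows, under stated assumptions: each descriptor spreads its vote over all words with weight proportional to exp(-gamma * squared distance), and the weights for each descriptor are normalized to sum to 1. The function name `soft_histogram`, the use of squared Euclidean distance, and both normalizations are assumptions for illustration; the experiments above may have used a different variant.

```python
import numpy as np

def soft_histogram(descriptors, vocab, gamma=1e-4):
    """Bag-of-words histogram with soft (kernel codebook) assignment.

    descriptors: (n, d) array of local features from one image.
    vocab:       (k, d) array of cluster centers.
    gamma:       kernel width parameter; larger values approach hard assignment.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    vocab = np.asarray(vocab, dtype=float)
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    # Subtract each row's minimum before exponentiating for numerical
    # stability; the per-descriptor normalization below cancels this shift.
    w = np.exp(-gamma * (d2 - d2.min(axis=1, keepdims=True)))
    # Normalize so every descriptor contributes a total weight of 1.
    w /= w.sum(axis=1, keepdims=True)
    hist = w.sum(axis=0)
    # L1-normalize the final histogram (an assumption, as noted above).
    return hist / max(hist.sum(), 1e-12)
```

In this formulation gamma controls how soft the assignment is: gamma = 0 gives every word equal weight, while a large gamma concentrates nearly all weight on the nearest word. Sweeping gamma together with the vocabulary size would be one way to separate the two explanations given above.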