The goal of this project is to examine the task of scene recognition using methods of varying complexity.
This is done in two steps: extracting an image representation (tiny images or a bag of SIFT features) and training a classifier on top of it (nearest neighbor or a linear SVM). The resulting accuracies (mean of the diagonal of the confusion matrix) for the three pipelines:

Tiny images + nearest neighbor: 0.221
Bag of SIFT + nearest neighbor: 0.531
Bag of SIFT + linear SVM: 0.666
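For concreteness, this accuracy is the mean of the diagonal of a row-normalized confusion matrix. A minimal MATLAB sketch of the computation, assuming cell arrays test_labels and predicted_categories and a list of category names categories (these variable names are illustrative, not necessarily the ones in our code):

% Build the confusion matrix: rows are true categories, columns predicted.
num_categories = numel(categories);
confusion = zeros(num_categories);
for i = 1:numel(test_labels)
    row = find(strcmp(test_labels{i}, categories));
    col = find(strcmp(predicted_categories{i}, categories));
    confusion(row, col) = confusion(row, col) + 1;
end
confusion = bsxfun(@rdivide, confusion, sum(confusion, 2)); % row-normalize to per-class rates
accuracy = mean(diag(confusion));                           % mean of the diagonal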
One free parameter in the Bag of SIFT model is the size of the vocabulary, i.e., the number of clusters we store. Reducing the vocabulary size compresses the feature vectors by reducing their dimension, while increasing it makes the visual words more fine-grained. We investigated how performance varies with vocabulary size; the relationship is shown in the graph below:
We can see that performance improves sharply as the vocabulary size grows from 20 to 200, but more slowly for larger vocabularies. This is because as more clusters are introduced, both the sensitivity to noise and the dimensionality of the feature vector increase.
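For reference, the vocabulary itself can be built by clustering dense SIFT descriptors with k-means. A sketch using VLFeat's vl_dsift and vl_kmeans, assuming VLFeat is on the path, images are grayscale, and image_paths and vocab_size are given inputs (both names are illustrative):

all_features = [];
for i = 1:numel(image_paths)
    img = single(imread(image_paths{i}));                 % vl_dsift expects single precision
    [~, descriptors] = vl_dsift(img, 'step', 8, 'fast');  % dense SIFT on an 8-pixel grid
    all_features = [all_features, single(descriptors)];   % 128 x N, grows across images
end
% The k-means cluster centers are the visual words; transpose so each row
% of vocab is one word, matching the knnsearch call in the snippet below.
vocab = vl_kmeans(all_features, vocab_size)';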
One disadvantage of vector quantization is that two features assigned to two different (even very close) clusters are considered totally different. To remedy this problem, we experimented with "soft assignment": each feature is assigned, with a given weight, to several nearby visual words in feature space. The weight is a Gaussian of the distance d from the feature to the visual word, exp(-0.5*d^2/variance), as in the snippet below:
% For each SIFT feature, look up its K nearest visual words and add a
% Gaussian-weighted vote to each; ii indexes the current image in an
% outer loop and variance is the kernel bandwidth.
K = 3;
[indices, distances] = knnsearch(vocab, features', 'K', K); % Statistics Toolbox
weights = exp(-0.5*(distances.^2)./variance);
N_features = size(features, 2);
for u = 1:N_features
    for v = 1:K
        image_feats(ii, indices(u,v)) = image_feats(ii, indices(u,v)) + weights(u,v);
    end
end
Unfortunately, we did not see any substantial performance gain from this.