The scene recognition assignment has three major phases, each discussed in depth below. Each phase pairs a feature descriptor with a classifier. The feature descriptors are the tiny_image descriptor and the bag_of_sifts descriptor; the classifiers are the nearest neighbor classifier and the linear SVM classifier. The resulting feature and classifier combinations that were implemented and tested are:

- tiny images with nearest neighbor
- bag of SIFTs with nearest neighbor
- bag of SIFTs with linear SVM
This feature descriptor was established by shrinking each image down to a small thumbnail and flattening the thumbnail into a vector. The problem is that this method is very lossy, especially for an image's high frequencies. As seen below, the tiny image code was written with normalization in mind, but normalization was removed since it did not appear to improve results. However, subtracting the mean of image_vector from image_vector (zero mean) did improve results marginally.
% Tiny image feature: shrink each image to 16x16 and flatten into a 256-d vector
image_path_size = size(image_paths);
image_feats = zeros(image_path_size(1),256);
for i=1:1:image_path_size(1)
    current_image = imread(char(image_paths(i)));
    resized_image = imresize(current_image, [16,16]);
    image_vector = single(resized_image(:)); % cast so the mean subtraction is not clamped to uint8 range
    image_vector = image_vector - mean(image_vector); % zero mean improved results marginally
    % Unit normalization was tried but removed; it did not improve results:
    % sum_f = sum(image_vector);
    % image_vector_normalized = image_vector./sum_f;
    image_feats(i,:) = image_vector;
end
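The same idea can be sketched outside MATLAB. The numpy version below is an illustrative translation, not the submitted code; it assumes a grayscale image whose dimensions are multiples of the thumbnail size, and uses simple block averaging in place of imresize's interpolation:

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Shrink a grayscale image to size x size, flatten, and zero-center.

    Block averaging stands in for MATLAB's imresize interpolation;
    any rows/columns beyond a multiple of `size` are cropped.
    """
    h, w = img.shape
    thumb = img[:h - h % size, :w - w % size].astype(float)
    # Average contiguous blocks down to a size x size thumbnail
    thumb = thumb.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    vec = thumb.ravel()
    return vec - vec.mean()  # zero mean, as in the MATLAB code
```

The zero-centering at the end mirrors the `image_vector - mean(image_vector)` step that marginally improved results.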
The nearest neighbor classifier was the first one I established. It uses vl_alldist2 to compute the distance from each test image to every training image; with this distance matrix we can find the nearest neighbors. The number of neighbors produced the most variability in results for this portion of the code. There were two options: either simply pick the single nearest neighbor, or vote on the image category using a histogram over some k nearest neighbors. I ended up coding both alternatives since they improved results in different situations: 1 nearest neighbor was more effective with tiny image features, while 4 nearest neighbors proved most effective with bag of SIFT feature descriptions. To implement the k nearest neighbors method I generated histograms by hand, using strcmp and sum to count the votes in each bin before indexing into a string array to grab the corresponding label.
%nearest neighbors
% 1-NN alternative: simply take the single nearest neighbor
% k = 1;
% test_image_feats_size = size(test_image_feats);
% distance = vl_alldist2(train_image_feats',test_image_feats');
% [y,index_array] = sort(distance);
% predicted_categories = cell(test_image_feats_size(1),1);
% for j=1:test_image_feats_size(1)
%     predicted_categories(j) = train_labels(index_array(1,j));
% end
k = 4; % 4 neighbors worked best with bag of SIFT features
temp = cell(1);
test_image_feats_size = size(test_image_feats);
distance = vl_alldist2(train_image_feats',test_image_feats'); % train x test distance matrix
[y,index_array] = sort(distance); % each column: training indices sorted by distance
predicted_categories = cell(test_image_feats_size(1),1);
binranges = zeros(15,1);
strings = cell(15,1);
strings{1} = 'Kitchen';
strings{2} = 'Store';
strings{3} = 'Bedroom';
strings{4} = 'LivingRoom';
strings{5} = 'Office';
strings{6} = 'Industrial';
strings{7} = 'Suburb';
strings{8} = 'InsideCity';
strings{9} = 'TallBuilding';
strings{10} = 'Street';
strings{11} = 'Highway';
strings{12} = 'OpenCountry';
strings{13} = 'Coast';
strings{14} = 'Mountain';
strings{15} = 'Forest';
for j=1:test_image_feats_size(1)
    % Histogram of the k nearest labels, built with strcmp and sum
    for c=1:15
        binranges(c) = sum(strcmp(strings{c},train_labels(index_array(1:k,j))));
    end
    [value,label_ind] = max(binranges);
    if value == 1
        % No category got more than one vote: fall back to the 1-NN label
        predicted_categories(j) = train_labels(index_array(1,j));
    else
        temp{1} = strings{label_ind};
        predicted_categories(j) = temp;
    end
end
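The voting scheme above can be expressed compactly in other languages. This numpy sketch is an illustrative analogue (squared Euclidean distances stand in for vl_alldist2, and Python's Counter replaces the hand-built histogram); it is not the submitted code:

```python
import numpy as np
from collections import Counter

def knn_predict(train_feats, train_labels, test_feats, k=4):
    """Label each test row by majority vote among its k nearest
    training rows; ties fall back to the single nearest neighbor."""
    # Pairwise squared Euclidean distances, shape: test x train
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)[:, :k]  # k nearest train indices per test row
    preds = []
    for row in order:
        votes = Counter(train_labels[i] for i in row)
        label, count = votes.most_common(1)[0]
        # If no label got more than one vote, use the 1-NN label
        preds.append(label if count > 1 else train_labels[row[0]])
    return preds
```

As in the MATLAB version, the tie-breaking rule falls back to the single nearest neighbor when all k votes are distinct.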
This feature descriptor was implemented by first generating a vocabulary used to describe and label the training images. The vocab was generated by running vl_dsift on each training image to grab SIFT features, then clustering these features with vl_kmeans. This process was extremely slow, but a smaller step size for vl_dsift and omitting the 'fast' flag steadily improved results. After the vocab was established, each training image was described by finding the vocabulary clusters nearest to its SIFT features and normalizing the resulting histogram, as seen below.
%Bag of SIFT (after vocab is created)
load('vocab.mat')
vocab_size = size(vocab', 2);
image_path_size = size(image_paths);
image_feats = zeros(image_path_size(1), vocab_size); % preallocate instead of growing
for j=1:image_path_size(1)
    current_image = imread(char(image_paths(j)));
    [~, sift_features] = vl_dsift(single(current_image),'step',5); % step 5 gave the best results
    distance = vl_alldist2(single(sift_features),vocab'); % each feature vs each vocab word
    [y,index_array] = min(distance'); % nearest vocab word per feature
    binranges = (1:vocab_size);
    bincounts = histc(index_array,binranges); % histogram over vocab words
    sum_f = sum(bincounts);
    bincounts_normalized = bincounts./sum_f; % L1 normalize so image size does not matter
    image_feats(j,:) = bincounts_normalized;
end
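The cluster-assignment-and-histogram step is the core of the bag of words model. A numpy sketch of just that step, as an illustrative analogue (assuming SIFT descriptors as rows and a precomputed vocabulary, rather than vl_dsift/vl_alldist2), could look like:

```python
import numpy as np

def bag_of_words(descriptors, vocab):
    """Assign each descriptor (row) to its nearest vocab word and
    return an L1-normalized histogram of word counts."""
    # Squared distances from every descriptor to every vocab word
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)  # word index per descriptor
    hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    return hist / hist.sum()     # normalize, as in the MATLAB code
```

The L1 normalization at the end matches the `bincounts./sum_f` step, so images with different numbers of SIFT features remain comparable.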
This classifier produced by far the best results on the test images. The technique is to call vl_svmtrain once for each unique string in the train_labels array, which effectively creates a one-vs-all classifier for each image category, and to store the values returned by vl_svmtrain for each category. Then every classifier is run on each test image, and the category whose classifier produces the largest confidence is selected. The confidence is calculated as W*X + B, where '*' is the inner (dot) product, W and B are the learned hyperplane parameters, and X is the image feature vector.
%SVM
categories = unique(train_labels);
num_categories = length(categories);
lambda = .000008;
w_tot = zeros(num_categories, size(train_image_feats,2));
b_tot = zeros(num_categories, 1);
for i=1:num_categories
    % One-vs-all labels: +1 for this category, -1 for everything else
    label_logical = strcmp(categories(i),train_labels);
    label_binary = ones(size(label_logical)) .* -1;
    label_binary(label_logical) = 1;
    [w,b] = vl_svmtrain(train_image_feats',label_binary',lambda);
    w_tot(i,:) = w;
    b_tot(i) = b;
end
num_test_image_feats = size(test_image_feats,1); % was size(...), which returns a vector
predicted_categories = cell(num_test_image_feats,1);
for i=1:num_test_image_feats
    weight = zeros(num_categories,1);
    for j=1:num_categories
        % Confidence = w . x + b for each one-vs-all classifier
        weight(j) = dot(w_tot(j,:),test_image_feats(i,:)) + b_tot(j);
    end
    [max_weight,ind] = max(weight);
    predicted_categories(i) = categories(ind);
end
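The W*X + B scoring rule vectorizes naturally. This numpy sketch is an illustrative analogue of the prediction step only (it assumes w_tot stacks one learned hyperplane per row and b_tot holds the matching offsets; training itself would still need an SVM solver):

```python
import numpy as np

def one_vs_all_predict(w_tot, b_tot, feats, categories):
    """Score every image against every one-vs-all hyperplane
    (confidence = w . x + b) and keep the highest-scoring category."""
    scores = feats @ w_tot.T + b_tot  # shape: images x categories
    return [categories[i] for i in scores.argmax(axis=1)]
```

A single matrix product replaces the nested MATLAB loops, which is also how the confidences would typically be computed at scale.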
Image Comparison: Tiny Images with Nearest Neighbor
Using the tiny image representation, the maximum accuracy obtained was 0.204. This accuracy came from zero-centering the image vectors rather than normalizing them. Furthermore, the nearest neighbor classifier only degraded results as more neighbors were used for voting, so for this combination only the single nearest neighbor was used.
Image Comparison: Bag of SIFTs
For the bag of SIFT representation with nearest neighbor, the maximum accuracy obtained was 0.531. Unlike tiny images, this maximum was achieved using multiple nearest neighbors: 4 nearest neighbors produced the best results. Here the step size was 10 when creating the vocab and 5 in bag_of_sifts.
For the bag of SIFT representation with the linear SVM classifier, the best accuracy obtained was 0.699, with a lambda of .0000008 and a step size of 10 for the vocab and 5 for bag of sifts. The full results can be seen in the table below. It should be noted that tweaking parameters produced the largest accuracy gains here: different lambda values resulted in accuracies as low as 0.61, so it is clear that lambda has a huge impact. The other biggest gains came from removing the 'fast' flag from vl_dsift in bag_of_sifts and build_vocabulary, as well as from decreasing the step size in both files, which ended up at 10 when scanning training images to build the vocab and 5 when categorizing images in bag_of_sifts.
(The sample training, true-positive, false-positive, and false-negative images did not survive extraction; the per-category accuracies and the labels attached to the example misclassifications are reproduced below.)

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.570 | LivingRoom, Bedroom | Office, Office
Store | 0.570 | Kitchen, LivingRoom | Industrial, Suburb
Bedroom | 0.470 | LivingRoom, TallBuilding | LivingRoom, LivingRoom
LivingRoom | 0.390 | Bedroom, Industrial | Industrial, Kitchen
Office | 0.930 | Store, Kitchen | Kitchen, Kitchen
Industrial | 0.560 | InsideCity, Street | LivingRoom, Highway
Suburb | 0.990 | Mountain, OpenCountry | TallBuilding
InsideCity | 0.570 | Store, TallBuilding | Kitchen, TallBuilding
TallBuilding | 0.750 | InsideCity, Industrial | InsideCity, Store
Street | 0.670 | InsideCity, Highway | InsideCity, LivingRoom
Highway | 0.830 | Street, OpenCountry | Coast, Suburb
OpenCountry | 0.550 | Mountain, Mountain | Coast, Coast
Coast | 0.860 | OpenCountry, OpenCountry | Mountain, Highway
Mountain | 0.830 | Kitchen, OpenCountry | OpenCountry, OpenCountry
Forest | 0.950 | Mountain, Store | Mountain, Mountain