CS 143 / Project 3 / Scene Recognition with Bag of Words

The scene recognition assignment has three major phases, each discussed in depth below. Each phase pairs a feature descriptor with a classifier. The feature descriptors are the tiny_image descriptor and the bag_of_sifts descriptor; the classifiers are the nearest neighbor classifier and the linear SVM classifier. The resulting feature and classifier combinations that were implemented and tested are:

  1. Tiny images representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier

Tiny Images Feature Descriptor

This feature descriptor is built by shrinking each image down to a small thumbnail and then flattening the thumbnail into a vector. The problem is that this method is very lossy, especially in terms of an image's high frequencies. As you can see below, the tiny image descriptor was implemented with normalization in mind, but normalization was removed since it did not appear to improve results. However, subtracting the mean of image_vector from image_vector (making it zero mean) did improve results marginally.


image_path_size = size(image_paths);
image_feats = zeros(image_path_size(1),256);   % 16*16 = 256 features per image
for i=1:image_path_size(1)
   % cast to single so the mean subtraction below does not saturate at zero
   % the way uint8 arithmetic would
   current_image = single(imread(char(image_paths(i))));
   resized_image = imresize(current_image, [16,16]);
   image_vector = resized_image(:);
   image_vector = image_vector - mean(image_vector);   % zero mean
%    sum_f = sum(image_vector);                        % normalization: removed,
%    image_vector = image_vector./sum_f;               % did not improve results
   image_feats(i,:) = image_vector;
end

Nearest Neighbor Classifier

The nearest neighbor classifier was the first one I implemented. It uses vl_alldist2 to compute the distance from each training image to each test image. With this distance matrix we can find the nearest neighbors. The number of nearest neighbors produced the most variability in results for this portion of the code. There were two options: either simply pick the single nearest neighbor, or vote on the image category using a histogram over some k nearest neighbors. I ended up coding both alternatives since they improved results in different situations. 1 nearest neighbor was more effective with the tiny image descriptor, while 4 nearest neighbors proved most effective with the bag of SIFT descriptor. To implement the k nearest neighbors method I generated the histograms by hand, using strcmp and sum to count the votes in each bin before indexing into a string array to grab the corresponding label.


%nearest neighbors
% k = 1;
% test_image_feats_size = size(test_image_feats);
% distance = vl_alldist2(train_image_feats',test_image_feats') ;
% [y,index_array] = sort(distance);
% predicted_categories = cell(test_image_feats_size(1),1);
% for j=1:test_image_feats_size(1)
%     predicted_categories(j) = train_labels(index_array(1,j));
% end


k = 3;
test_image_feats_size = size(test_image_feats);
distance = vl_alldist2(train_image_feats',test_image_feats');
[~,index_array] = sort(distance);           % sort training images by distance
predicted_categories = cell(test_image_feats_size(1),1);
strings = {'Kitchen'; 'Store'; 'Bedroom'; 'LivingRoom'; 'Office'; ...
           'Industrial'; 'Suburb'; 'InsideCity'; 'TallBuilding'; 'Street'; ...
           'Highway'; 'OpenCountry'; 'Coast'; 'Mountain'; 'Forest'};
binranges = zeros(numel(strings),1);
for j=1:test_image_feats_size(1)
    % histogram the labels of the k nearest training images
    for c=1:numel(strings)
        binranges(c) = sum(strcmp(strings{c},train_labels(index_array(1:k,j))));
    end
    [value,label_ind] = max(binranges);
    if value == 1
        % no category got more than one vote: fall back to the 1-NN label
        predicted_categories{j} = train_labels{index_array(1,j)};
    else
        predicted_categories{j} = strings{label_ind};
    end
end
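The hand-built voting histogram can also be expressed more compactly by mapping the string labels to integer indices once and voting with mode(). This is a sketch of that alternative, not the approach used in the project; note that mode() breaks ties toward the smallest index rather than falling back to the 1-NN label as the code above does.

```matlab
% Alternative k-NN voting sketch: integer label indices + mode()
distance = vl_alldist2(train_image_feats', test_image_feats');
[~, index_array] = sort(distance);
[categories, ~, label_idx] = unique(train_labels);  % integer index per training label
k = 4;                                              % 4 worked best for bag of SIFT
neighbor_labels = label_idx(index_array(1:k, :));   % k x num_test matrix of votes
predicted_categories = categories(mode(neighbor_labels, 1))';  % column cell of labels
```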

Bag of Sift

This feature descriptor was implemented by first generating a vocabulary that is then used to describe and label the training images. The vocabulary was generated by running vl_dsift on each training image to grab SIFT features, then clustering those features with vl_kmeans. This process was extremely slow, but a smaller step size for vl_dsift and omitting the 'fast' flag steadily improved results. Once the vocabulary was established, each training image was represented by finding the vocabulary clusters that best described its features and normalizing the resulting histogram, as seen below.


%Bag of Sift (after vocab is created)
load('vocab.mat')                        % vocab is vocab_size x 128
vocab_size = size(vocab, 1);

image_path_size = size(image_paths);
image_feats = zeros(image_path_size(1), vocab_size);   % preallocate
for j=1:image_path_size(1)
    current_image = imread(char(image_paths(j)));
    [locations, sift_features] = vl_dsift(single(current_image),'step',10);
    % distance from every SIFT feature to every vocabulary word
    distance = vl_alldist2(single(sift_features),vocab');
    [~,index_array] = min(distance,[],2);              % nearest word per feature
    bincounts = histc(index_array,1:vocab_size);       % visual-word histogram
    image_feats(j,:) = bincounts./sum(bincounts);      % normalize
end
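The vocabulary-building step described above (vl_dsift on each training image, then vl_kmeans) is not shown in the project excerpts. Here is a minimal sketch of what build_vocabulary might look like; the 200-word vocabulary size and the pool-everything approach are assumptions, not taken from the project code, and in practice one would subsample the descriptors rather than keep them all. Requires VLFeat, like the rest of the project.

```matlab
% Sketch of vocabulary construction (assumed names and vocab size)
vocab_size = 200;                        % assumed number of visual words
image_path_size = size(image_paths);
all_features = [];
for j=1:image_path_size(1)
    current_image = imread(char(image_paths(j)));
    % dense SIFT; a larger step keeps this tractable
    [~, sift_features] = vl_dsift(single(current_image), 'step', 10);
    all_features = [all_features, single(sift_features)];
end
% cluster the pooled 128-D descriptors into vocab_size centers
centers = vl_kmeans(all_features, vocab_size);
vocab = centers';                        % vocab_size x 128, as loaded from vocab.mat
save('vocab.mat', 'vocab');
```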

Linear SVM

This classifier produced by far the best results on the test images. The technique is to call vl_svmtrain once for each unique string in the train_labels array, which effectively creates a one-vs-all classifier for each image category. The W and B values returned from vl_svmtrain are stored for each category. Then, for each test image, every classifier is evaluated and the one producing the largest confidence is selected as the predicted category. The confidence is computed as W*X + B, where '*' is the inner (dot) product, W and B are the learned hyperplane parameters, and X is the image's feature vector.


%SVM
categories = unique(train_labels); 
num_categories = length(categories);

lambda = .000008;

w_tot = [];
b_tot = [];

for i=1:num_categories

    label_logical = strcmp(categories(i),train_labels);
    label_binary = ones(size(label_logical)) .* -1;
    label_binary(label_logical) = 1;
    [w,b] = vl_svmtrain(train_image_feats',label_binary',lambda);
    w_tot(i,:) = w;
    b_tot(i,:) = b;
end

num_test_images = size(test_image_feats,1);   % size() alone returns a vector,
                                              % which would break the loop bound
predicted_categories = cell(num_test_images,1);
for i=1:num_test_images
    weight = zeros(num_categories,1);
    for j=1:num_categories
        % confidence = W . X + B for each one-vs-all classifier
        weight(j) = dot(w_tot(j,:),test_image_feats(i,:)) + b_tot(j,:);
    end
    [max_weight,ind] = max(weight);
    predicted_categories(i) = categories(ind);
end
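The per-image confidence loop can also be collapsed into one matrix operation. A minimal equivalent sketch, assuming w_tot is num_categories x D and b_tot is num_categories x 1 as stored above:

```matlab
% All W*X + B confidences at once: each row of scores is one test image,
% each column one category's classifier.
scores = test_image_feats * w_tot' + repmat(b_tot', size(test_image_feats,1), 1);
[~, ind] = max(scores, [], 2);           % best category per test image
predicted_categories = categories(ind);  % column cell of predicted labels
```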

Results Phase 1: Tiny images representation and nearest neighbor classifier

Image Comparison

Using the tiny image representation, the maximum accuracy obtained was 0.204. This accuracy was obtained with zero-mean image vectors rather than with the (removed) normalization. Furthermore, the nearest neighbor classifier only degraded results as more neighbors were used for voting, so for this phase only 1 nearest neighbor was used.

Results Phase 2: Bag of SIFT representation and nearest neighbor classifier

Image Comparison

For the bag of SIFT representation the maximum accuracy obtained was 0.531. Unlike the tiny images, this maximum was achieved using multiple nearest neighbors; 4 nearest neighbors produced the best results. Here the vl_dsift step size was 10 when building the vocabulary and 5 in bag_of_sifts.

Results Phase 3: Bag of SIFT representation and linear SVM classifier

For the bag of SIFT representation with the linear SVM classifier the best accuracy obtained was 0.699. This was obtained with a lambda of .0000008 and a step size of 10 for the vocabulary and 5 for bag of sifts. The full results can be seen in the table in the section below. It should be noted that parameter tweaking produced the largest accuracy gains in this phase. Different lambda values gave accuracies as low as 0.61, so lambda clearly has a huge impact. The other big gains came from removing the 'fast' flag from vl_dsift in bag_of_sifts and build_vocabulary, and from decreasing the step size in both files: the final step size was 10 when looking through training images to build the vocabulary and 5 when categorizing images in bag_of_sifts.

CS 143 Project 3 results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.699
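The reported accuracy is the mean of the confusion matrix diagonal. A minimal sketch of how that figure can be computed from the predictions (variable names are assumptions; test_labels holds the ground-truth label for each test image):

```matlab
% Build a row-normalized confusion matrix and take the mean of its diagonal.
categories = unique(test_labels);
num_categories = length(categories);
confusion = zeros(num_categories);
for i=1:length(test_labels)
    row = find(strcmp(test_labels{i}, categories));            % true category
    col = find(strcmp(predicted_categories{i}, categories));   % predicted category
    confusion(row, col) = confusion(row, col) + 1;
end
confusion = confusion ./ repmat(sum(confusion, 2), 1, num_categories);
accuracy = mean(diag(confusion));
```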

Category name   Accuracy   Labels of misclassified samples (false positives' true labels / false negatives' predicted labels)

(Sample training and true positive images are omitted from this text version; only the labels attached to the misclassified sample images are recoverable here.)

Kitchen         0.570      LivingRoom, Bedroom, Office, Office
Store           0.570      Kitchen, LivingRoom, Industrial, Suburb
Bedroom         0.470      LivingRoom, TallBuilding, LivingRoom, LivingRoom
LivingRoom      0.390      Bedroom, Industrial, Industrial, Kitchen
Office          0.930      Store, Kitchen, Kitchen, Kitchen
Industrial      0.560      InsideCity, Street, LivingRoom, Highway
Suburb          0.990      Mountain, OpenCountry, TallBuilding
InsideCity      0.570      Store, TallBuilding, Kitchen, TallBuilding
TallBuilding    0.750      InsideCity, Industrial, InsideCity, Store
Street          0.670      InsideCity, Highway, InsideCity, LivingRoom
Highway         0.830      Street, OpenCountry, Coast, Suburb
OpenCountry     0.550      Mountain, Mountain, Coast, Coast
Coast           0.860      OpenCountry, OpenCountry, Mountain, Highway
Mountain        0.830      Kitchen, OpenCountry, OpenCountry, OpenCountry
Forest          0.950      Mountain, Store, Mountain, Mountain