CS 143 / Project 3 / Scene Recognition with Bag of Words

Figure of image recognition

The goal of this project is to implement scene recognition. First, two different image representations are implemented: tiny images and bags of quantized SIFT features. Then two different classification techniques are used: nearest neighbor and 1-vs-all linear SVMs.

Tiny images with nearest neighbor classification is a very simple method. The "tiny image" feature function implements the image representation by resizing each image to a small, fixed resolution. The nearest neighbor classifier is equally simple to understand: to classify a test feature into a particular category, one simply finds the "nearest" training example and assigns the test case the label of that nearest training example.

Bag of words models are a popular technique for image classification, inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of the frequency of visual words. First, a visual word "vocabulary" needs to be established; this vocabulary is then used to classify the images.

Program Description

get_tiny_images.m

This function implements the tiny image representation, which is a very simple method: each image is resized to a small, fixed resolution, and the resulting pixels are used directly as the feature.
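The report's implementation is in MATLAB; as an illustration only, the same idea can be sketched in Python with numpy (the subsampling-based resize and the zero-mean, unit-length normalization here are common choices for this project, not necessarily the exact ones used above):

```python
import numpy as np

def get_tiny_images(images, size=16):
    """Resize each grayscale image to size x size, flatten, and normalize.

    `images` is a list of 2-D numpy arrays. The resize here is a simple
    nearest-neighbor subsample so the sketch needs no image library;
    MATLAB code would typically call imresize instead.
    """
    feats = np.zeros((len(images), size * size))
    for i, img in enumerate(images):
        h, w = img.shape
        # pick a size x size grid of pixel positions (nearest-neighbor downsample)
        rows = (np.arange(size) * h) // size
        cols = (np.arange(size) * w) // size
        tiny = img[np.ix_(rows, cols)].astype(float).ravel()
        tiny -= tiny.mean()            # zero mean
        n = np.linalg.norm(tiny)
        if n > 0:
            tiny /= n                  # unit length
        feats[i] = tiny
    return feats
```

Making the tiny images zero mean and unit length gives a small, consistent boost in practice because it removes brightness and contrast differences between images.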

nearest_neighbor_classify.m

This function implements nearest neighbor classification. First, compute the distance between every pair of test and training images; then, for each test image, find the "nearest" training image and assign its category to the test image.
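A minimal numpy sketch of those two steps (the report's MATLAB version presumably uses vl_alldist2 for the pairwise distances; here the squared Euclidean distances are expanded as a^2 + b^2 - 2ab):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its nearest training feature."""
    # pairwise squared Euclidean distances, shape (n_test, n_train)
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          + np.sum(train_feats ** 2, axis=1)[None, :]
          - 2.0 * test_feats @ train_feats.T)
    nearest = np.argmin(d2, axis=1)      # index of nearest training example
    return [train_labels[j] for j in nearest]
```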

build_vocabulary.m

This function builds the vocabulary of visual words. First, randomly choose some training images from every category. Then compute the SIFT features of each image, and finally cluster all the features with k-means.
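The clustering step can be sketched as follows. The report's MATLAB code uses vl_kmeans on dense SIFT descriptors; this Python version substitutes a few iterations of plain Lloyd's k-means so the sketch is self-contained, and works on any (n, d) descriptor matrix:

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, iters=20, seed=0):
    """Cluster sampled local descriptors into `vocab_size` visual words.

    descriptors: (n, d) array. Returns a (vocab_size, d) array of cluster
    centers; each center is one visual word of the vocabulary.
    """
    rng = np.random.default_rng(seed)
    # initialize centers with randomly chosen descriptors
    idx = rng.choice(len(descriptors), vocab_size, replace=False)
    centers = descriptors[idx].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = np.argmin(d2, axis=1)
        # move each center to the mean of its assigned descriptors
        for k in range(vocab_size):
            pts = descriptors[assign == k]
            if len(pts) > 0:
                centers[k] = pts.mean(axis=0)
    return centers
```

Sampling features from only a subset of the training images (as the table of parameters below shows, 150-600 images) keeps this step tractable, since clustering every descriptor from every image would be very slow.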

get_bags_of_sifts.m

This function implements the image representation of the bag of words model. For each image, we first compute its SIFT features; then, by assigning each feature to its nearest cluster (visual word), we build a histogram for the image. After normalization, the histogram is used as the image's feature vector.

Code Example

% Build a bag-of-SIFT histogram for every image.
load('vocab.mat');                 % vocab: 128 x vocab_size cluster centers
vocab_size = size(vocab, 2);

N = size(image_paths, 1);
image_feats = zeros(N, vocab_size);

for i = 1:N
    A = imread(image_paths{i});
    A = mat2gray(A);               % scale intensities to [0, 1]

    % Dense SIFT; vl_dsift needs a single-precision grayscale image.
    [~, SIFT_features] = vl_dsift(single(A), 'step', 5, 'size', 6, 'fast');

    % Distance from every descriptor to every visual word, then
    % assign each descriptor to its nearest word.
    D = vl_alldist2(single(SIFT_features), vocab);
    [~, I] = min(D, [], 2);

    % Histogram of word assignments, normalized to unit length.
    his = histc(I, 1:vocab_size);
    image_feats(i, :) = his / norm(his);
end

svm_classify.m

This function implements the 1-vs-all linear SVMs. For each category we train a linear function, then use it to score every image; each image is then assigned the category whose SVM gives it the largest score.
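The report's MATLAB code trains each binary SVM with vl_svmtrain. As a rough illustration of the 1-vs-all scheme, the sketch below substitutes a basic hinge-loss subgradient loop (lam plays the role of vl_svmtrain's lambda regularizer; the learning rate and iteration count are arbitrary choices for the sketch):

```python
import numpy as np

def svm_classify(train_feats, train_labels, test_feats,
                 lam=1e-4, iters=500, lr=0.1):
    """1-vs-all linear SVMs: one w, b per category, argmax of scores."""
    cats = sorted(set(train_labels))
    labels = np.asarray(train_labels)
    n, d = train_feats.shape
    scores = np.zeros((len(test_feats), len(cats)))
    for c, cat in enumerate(cats):
        y = np.where(labels == cat, 1.0, -1.0)   # +1 this class, -1 the rest
        w, b = np.zeros(d), 0.0
        for _ in range(iters):
            margin = y * (train_feats @ w + b)
            viol = margin < 1                     # hinge-loss violators
            gw = lam * w - (y[viol][:, None] * train_feats[viol]).sum(axis=0) / n
            gb = -y[viol].sum() / n
            w -= lr * gw                          # subgradient step
            b -= lr * gb
        scores[:, c] = test_feats @ w + b         # this category's score
    # each test image takes the category with the highest score
    return [cats[j] for j in np.argmax(scores, axis=1)]
```

The argmax over per-category scores is what makes the scheme "1-vs-all": each binary SVM only separates its own class from everything else, and the final multiclass decision compares their raw scores.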

Results

Basic Result

Tiny images + nearest neighbor classifier:  0.191
Tiny images + linear SVM classifier:        0.529
Bag of SIFT + linear SVM classifier:        0.723

Results with different parameters

Images for vocab  Vocab size  SVM lambda  step/size (get_bags_of_sifts)  step/size (build_vocabulary)  Accuracy  Time
150               400         0.0001      5 / 6                          8 / 10                        0.709     25 min
600               400         0.0001      5 / 6                          8 / 10                        0.705     32 min
300               400         0.0001      4 / 5                          8 / 10                        0.703     41 min
300               300         0.0001      4 / 5                          8 / 10                        0.701     31 min
300               400         0.00001     5 / 6                          6 / 8                         0.670     27 min
300               400         0.0001      5 / 6                          6 / 8                         0.723     27 min
300               400         0.001       5 / 6                          6 / 8                         0.655     27 min
300               400         0.01        5 / 6                          6 / 8                         0.492     27 min
300               400         1           5 / 6                          6 / 8                         0.437     27 min
300               400         10          5 / 6                          6 / 8                         0.477     27 min
300               400         0.0001      5 / 6                          8 / 10                        0.703     27 min
300               400         0.0001      6 / 8                          8 / 10                        0.691     19 min
300               400         0.0001      8 / 10                         8 / 10                        0.667     12 min
300               400         0.0001      25 / 30                        25 / 40                       0.585     2 min

Results with different vocabulary sizes

(get_bags_of_sifts: step 5, size 6; build_vocabulary: step 8, size 10)

Vocabulary size  SVM    1NN    Time
10               0.473  0.387  2 min
20               0.577  0.467  3 min
50               0.645  0.489  5 min
100              0.681  0.493  8 min
200              0.691  0.509  14 min
400              0.696  0.525  21 min
500              0.709  0.525  34 min
1000             0.712  0.525  66 min
10000            0.727  0.493  6 h 36 min

CS 143 Project 3 results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.723

(The sample training images and sample true positives in the original page are images and are omitted here. For each category, the "false positives" column lists the true labels of images wrongly assigned to that category, and the "false negatives" column lists the labels wrongly predicted for images of that category.)

Category      Accuracy  False positives (true label)  False negatives (predicted label)
Kitchen       0.720     Store, LivingRoom             Bedroom, Bedroom
Store         0.630     InsideCity, InsideCity        Industrial, Industrial
Bedroom       0.540     LivingRoom, Industrial        Office, Highway
LivingRoom    0.400     Street, Bedroom               Office, Kitchen
Office        0.890     Store, Bedroom                Kitchen, Kitchen
Industrial    0.550     Street, InsideCity            TallBuilding, Highway
Suburb        0.960     Industrial, Industrial        LivingRoom, Coast
InsideCity    0.460     Bedroom, Coast                Coast, Kitchen
TallBuilding  0.790     Industrial, Street            Industrial, Forest
Street        0.810     InsideCity, Store             Suburb, Mountain
Highway       0.810     Industrial, Bedroom           Bedroom, Coast
OpenCountry   0.660     Coast, Mountain               Coast, LivingRoom
Coast         0.850     Highway, OpenCountry          Mountain, OpenCountry
Mountain      0.850     Coast, LivingRoom             OpenCountry, OpenCountry
Forest        0.930     OpenCountry, Store            TallBuilding, Street

Report Analysis

Based on the results above, tiny images with a nearest neighbor classifier is the simplest and cheapest way to implement image recognition, with accuracy of roughly 10%-20%. The linear SVM classifier is better than the nearest neighbor classifier, improving accuracy to roughly 50%-60%. The bag of SIFT representation is better than tiny images, though much slower: it reaches 60%-70% accuracy once its parameters (number of clusters, SVM regularization, number of patches sampled when building the vocabulary, and the size and step of the dense SIFT features) are tuned well.

With bag of SIFT and the nearest neighbor classifier, a larger vocabulary and a smaller step and size increase accuracy, but slow down the running time of the whole project. As the vocabulary size grows, accuracy and running time both increase quickly at first; once the size reaches about 400, SVM accuracy is about 70% and 1NN about 50%, and the rate of improvement slows down.