CS 143 / Project 3 / Scene Recognition with Bag of Words

Figure of image recognition

The goal of this project is to implement scene recognition. First, two different image representations are implemented: tiny images and bags of quantized SIFT features. Then two different classification techniques are used: nearest neighbor and 1-vs-all linear SVMs.

Tiny images with nearest neighbor classification is a very simple method. The "tiny image" feature function implements the image representation by resizing each image to a small, fixed resolution. The nearest neighbor classifier is equally simple to understand: to classify a test feature into a particular category, one simply finds the "nearest" training example and assigns the test case the label of that nearest training example.

Bag of words models are a popular technique for image classification, inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of the frequency of visual words. First, a visual word "vocabulary" needs to be established; this vocabulary is then used to classify the images.

Program Description

get_tiny_images.m

This function implements the tiny image representation, which is a very simple method: each image is resized to a small, fixed resolution, and the resulting pixels are used directly as the feature.
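The report's implementation is in MATLAB; as an illustration only, the same idea can be sketched in Python with numpy (the subsampling-based resize and the zero-mean, unit-length normalization here are common choices for this project, not necessarily the exact ones used above):

```python
import numpy as np

def get_tiny_images(images, size=16):
    """Resize each grayscale image to size x size, flatten, and normalize.

    `images` is a list of 2-D numpy arrays. The resize here is a simple
    nearest-neighbor subsample so the sketch needs no image library;
    MATLAB code would typically call imresize instead.
    """
    feats = np.zeros((len(images), size * size))
    for i, img in enumerate(images):
        h, w = img.shape
        # pick a size x size grid of pixel positions (nearest-neighbor downsample)
        rows = (np.arange(size) * h) // size
        cols = (np.arange(size) * w) // size
        tiny = img[np.ix_(rows, cols)].astype(float).ravel()
        tiny -= tiny.mean()            # zero mean
        n = np.linalg.norm(tiny)
        if n > 0:
            tiny /= n                  # unit length
        feats[i] = tiny
    return feats
```

Making the tiny images zero mean and unit length gives a small, consistent boost in practice because it removes brightness and contrast differences between images.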

nearest_neighbor_classify.m

This function implements nearest neighbor classification. First, compute the distance between every pair of test and training images; then, for each test image, find the "nearest" training image and assign its category to the test image.
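A minimal numpy sketch of those two steps (the report's MATLAB version presumably uses vl_alldist2 for the pairwise distances; here the squared Euclidean distances are expanded as a^2 + b^2 - 2ab):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its nearest training feature."""
    # pairwise squared Euclidean distances, shape (n_test, n_train)
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          + np.sum(train_feats ** 2, axis=1)[None, :]
          - 2.0 * test_feats @ train_feats.T)
    nearest = np.argmin(d2, axis=1)      # index of nearest training example
    return [train_labels[j] for j in nearest]
```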

build_vocabulary.m

This function builds the vocabulary of visual words. First, randomly choose some training images from every category. Then compute the SIFT features of each image, and finally cluster all the features with k-means.
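The clustering step can be sketched as follows. The report's MATLAB code uses vl_kmeans on dense SIFT descriptors; this Python version substitutes a few iterations of plain Lloyd's k-means so the sketch is self-contained, and works on any (n, d) descriptor matrix:

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, iters=20, seed=0):
    """Cluster sampled local descriptors into `vocab_size` visual words.

    descriptors: (n, d) array. Returns a (vocab_size, d) array of cluster
    centers; each center is one visual word of the vocabulary.
    """
    rng = np.random.default_rng(seed)
    # initialize centers with randomly chosen descriptors
    idx = rng.choice(len(descriptors), vocab_size, replace=False)
    centers = descriptors[idx].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = np.argmin(d2, axis=1)
        # move each center to the mean of its assigned descriptors
        for k in range(vocab_size):
            pts = descriptors[assign == k]
            if len(pts) > 0:
                centers[k] = pts.mean(axis=0)
    return centers
```

Sampling features from only a subset of the training images (as the table of parameters below shows, 150-600 images) keeps this step tractable, since clustering every descriptor from every image would be very slow.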

get_bags_of_sifts.m

This function implements the image representation of the bag of words model. For each image, we first compute its SIFT features; then, by assigning each feature to its nearest cluster (visual word), we build a histogram for the image. After normalization, the histogram is used as the image's feature vector.

Code Example

% Build a bag-of-SIFT histogram for every image.
load('vocab.mat');                 % vocab: 128 x vocab_size cluster centers
vocab_size = size(vocab, 2);

N = size(image_paths, 1);
image_feats = zeros(N, vocab_size);

for i = 1:N
    A = imread(image_paths{i});
    A = mat2gray(A);               % scale intensities to [0, 1]

    % Dense SIFT; vl_dsift needs a single-precision grayscale image.
    [~, SIFT_features] = vl_dsift(single(A), 'step', 5, 'size', 6, 'fast');

    % Distance from every descriptor to every visual word, then
    % assign each descriptor to its nearest word.
    D = vl_alldist2(single(SIFT_features), vocab);
    [~, I] = min(D, [], 2);

    % Histogram of word assignments, normalized to unit length.
    his = histc(I, 1:vocab_size);
    image_feats(i, :) = his / norm(his);
end

svm_classify.m

This function implements the 1-vs-all linear SVMs. For each category we train a linear function, then use it to score every image; each image is then assigned the category whose SVM gives it the largest score.
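The report's MATLAB code trains each binary SVM with vl_svmtrain. As a rough illustration of the 1-vs-all scheme, the sketch below substitutes a basic hinge-loss subgradient loop (lam plays the role of vl_svmtrain's lambda regularizer; the learning rate and iteration count are arbitrary choices for the sketch):

```python
import numpy as np

def svm_classify(train_feats, train_labels, test_feats,
                 lam=1e-4, iters=500, lr=0.1):
    """1-vs-all linear SVMs: one w, b per category, argmax of scores."""
    cats = sorted(set(train_labels))
    labels = np.asarray(train_labels)
    n, d = train_feats.shape
    scores = np.zeros((len(test_feats), len(cats)))
    for c, cat in enumerate(cats):
        y = np.where(labels == cat, 1.0, -1.0)   # +1 this class, -1 the rest
        w, b = np.zeros(d), 0.0
        for _ in range(iters):
            margin = y * (train_feats @ w + b)
            viol = margin < 1                     # hinge-loss violators
            gw = lam * w - (y[viol][:, None] * train_feats[viol]).sum(axis=0) / n
            gb = -y[viol].sum() / n
            w -= lr * gw                          # subgradient step
            b -= lr * gb
        scores[:, c] = test_feats @ w + b         # this category's score
    # each test image takes the category with the highest score
    return [cats[j] for j in np.argmax(scores, axis=1)]
```

The argmax over per-category scores is what makes the scheme "1-vs-all": each binary SVM only separates its own class from everything else, and the final multiclass decision compares their raw scores.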

Results

Basic Result

Tiny images + nearest neighbor classifier:  0.191
Tiny images + linear SVM classifier:        0.529
Bag of SIFT + linear SVM classifier:        0.723

Results with different parameters

Images for vocab  Vocab size  SVM lambda  step/size (get_bags_of_sifts)  step/size (build_vocabulary)  Accuracy  Time
150               400         0.0001      5 / 6                          8 / 10                        0.709     25 min
600               400         0.0001      5 / 6                          8 / 10                        0.705     32 min
300               400         0.0001      4 / 5                          8 / 10                        0.703     41 min
300               300         0.0001      4 / 5                          8 / 10                        0.701     31 min
300               400         0.00001     5 / 6                          6 / 8                         0.670     27 min
300               400         0.0001      5 / 6                          6 / 8                         0.723     27 min
300               400         0.001       5 / 6                          6 / 8                         0.655     27 min
300               400         0.01        5 / 6                          6 / 8                         0.492     27 min
300               400         1           5 / 6                          6 / 8                         0.437     27 min
300               400         10          5 / 6                          6 / 8                         0.477     27 min
300               400         0.0001      5 / 6                          8 / 10                        0.703     27 min
300               400         0.0001      6 / 8                          8 / 10                        0.691     19 min
300               400         0.0001      8 / 10                         8 / 10                        0.667     12 min
300               400         0.0001      25 / 30                        25 / 40                       0.585     2 min

Results with different vocabulary sizes

(get_bags_of_sifts: step 5, size 6; build_vocabulary: step 8, size 10)

Vocabulary size  SVM    1NN    Time
10               0.473  0.387  2 min
20               0.577  0.467  3 min
50               0.645  0.489  5 min
100              0.681  0.493  8 min
200              0.691  0.509  14 min
400              0.696  0.525  21 min
500              0.709  0.525  34 min
1000             0.712  0.525  66 min
10000            0.727  0.493  6 h 36 min

CS 143 Project 3 results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.723

(The sample training images and sample true positives in the original page are images and are omitted here. For each category, the "false positives" column lists the true labels of images wrongly assigned to that category, and the "false negatives" column lists the labels wrongly predicted for images of that category.)

Category      Accuracy  False positives (true label)  False negatives (predicted label)
Kitchen       0.720     Store, LivingRoom             Bedroom, Bedroom
Store         0.630     InsideCity, InsideCity        Industrial, Industrial
Bedroom       0.540     LivingRoom, Industrial        Office, Highway
LivingRoom    0.400     Street, Bedroom               Office, Kitchen
Office        0.890     Store, Bedroom                Kitchen, Kitchen
Industrial    0.550     Street, InsideCity            TallBuilding, Highway
Suburb        0.960     Industrial, Industrial        LivingRoom, Coast
InsideCity    0.460     Bedroom, Coast                Coast, Kitchen
TallBuilding  0.790     Industrial, Street            Industrial, Forest
Street        0.810     InsideCity, Store             Suburb, Mountain
Highway       0.810     Industrial, Bedroom           Bedroom, Coast
OpenCountry   0.660     Coast, Mountain               Coast, LivingRoom
Coast         0.850     Highway, OpenCountry          Mountain, OpenCountry
Mountain      0.850     Coast, LivingRoom             OpenCountry, OpenCountry
Forest        0.930     OpenCountry, Store            TallBuilding, Street

Report Analysis

Based on the results above, tiny images with a nearest neighbor classifier is the simplest and cheapest way to implement image recognition, with accuracy of roughly 10%-20%. The linear SVM classifier is better than the nearest neighbor classifier, improving accuracy to roughly 50%-60%. The bag of SIFT representation is better than tiny images, though much slower: it reaches 60%-70% accuracy once its parameters (number of clusters, SVM regularization, number of patches sampled when building the vocabulary, and the size and step of the dense SIFT features) are tuned well.

With bag of SIFT and the nearest neighbor classifier, a larger vocabulary and a smaller step and size increase accuracy, but slow down the running time of the whole project. As the vocabulary size grows, accuracy and running time both increase quickly at first; once the size reaches about 400, SVM accuracy is about 70% and 1NN about 50%, and the rate of improvement slows down.