Figure: image recognition overview
The goal of this project is to implement image recognition. First, two different image representations are implemented: tiny images and bags of quantized SIFT features. Then two different classification techniques are applied: nearest neighbor and 1-vs-all linear SVMs.
Tiny images with nearest neighbor classification is a very simple method. The "tiny image" feature function implements the image representation by resizing each image to a small, fixed resolution. The nearest neighbor classifier is equally simple to understand: when tasked with classifying a test feature into a particular category, one simply finds the "nearest" training example and assigns the test case the label of that nearest training example.
Bag of words models are a popular technique for image classification, inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of visual word frequencies. First, a visual word "vocabulary" needs to be established; this vocabulary is then used to represent and classify the images.
This function implements the tiny image representation, a very simple method: each image is resized to a small, fixed resolution, and the resulting pixels are used directly as the feature vector.
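As a rough illustration (the project's own code is MATLAB/VLFeat; this is a NumPy sketch, and the `get_tiny_image` name, the 16x16 default resolution, and the zero-mean/unit-length normalisation are assumptions, not taken from the report):

```python
import numpy as np

def get_tiny_image(img, size=16):
    """Shrink a grayscale image to size x size by nearest-pixel sampling,
    then zero-mean and unit-length normalise the flattened vector."""
    h, w = img.shape
    # crude nearest-pixel resize; a real implementation would use imresize
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    tiny = img[np.ix_(rows, cols)].astype(float).ravel()
    tiny -= tiny.mean()                 # zero mean
    n = np.linalg.norm(tiny)
    return tiny / n if n > 0 else tiny  # unit length
```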
This function implements nearest neighbor classification of the image representations. First, compute the distance between every pair of test and training images; then, for each test image, find the "nearest" training image and assign its category to the test image.
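The steps above can be sketched as follows (a NumPy illustration with hypothetical names; the report's implementation is in MATLAB):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its closest training
    feature under Euclidean distance."""
    # pairwise squared distances via ||a-b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (np.sum(test_feats**2, axis=1)[:, None]
          - 2 * test_feats @ train_feats.T
          + np.sum(train_feats**2, axis=1)[None, :])
    nearest = np.argmin(d2, axis=1)     # index of nearest training image
    return [train_labels[i] for i in nearest]
```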
This function is used to build the vocabulary of visual words. First, randomly choose some training images from every category; then extract SIFT features from each sampled image; finally, cluster all the features with k-means. The cluster centers become the visual words.
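A minimal sketch of the clustering step, assuming descriptors have already been extracted (plain Lloyd's k-means in NumPy; the report itself clusters with VLFeat, and the `build_vocabulary` signature here is illustrative):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster local descriptors (n x d) into k visual words with
    Lloyd's k-means; the returned centers are the vocabulary."""
    rng = np.random.default_rng(seed)
    # initialise centers from k random descriptors
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d2 = ((descriptors[:, None, :] - centers[None, :, :])**2).sum(-1)
        assign = d2.argmin(1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers
```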
This function implements the bag of words image representation. For each image, extract its SIFT features; then, by assigning each feature to its nearest cluster in the visual word vocabulary, build a histogram for the image. After normalisation, this histogram is used as the image's feature vector.
```matlab
load('vocab.mat');                  % vocab: 128 x vocab_size matrix of visual words
vocab_size = size(vocab, 2);
N = size(image_paths, 1);
image_feats = zeros(N, vocab_size);
for i = 1:N
    A = imread(image_paths{i});
    A = mat2gray(A);
    % dense SIFT: SIFT_features is 128 x num_descriptors
    [~, SIFT_features] = vl_dsift(single(A), 'step', 5, 'size', 6, 'fast');
    % distance from every descriptor to every visual word
    D = vl_alldist2(single(SIFT_features), vocab);
    [~, I] = min(D, [], 2);          % index of nearest word per descriptor
    his = histc(I, 1:vocab_size);    % histogram of word counts
    image_feats(i, :) = his / norm(his);  % L2-normalised histogram
end
```
This function implements the 1-vs-all linear SVMs. For each category we train a linear function, then use that function to score every image; finally, each image is assigned the category whose function gives it the largest score.
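The scoring and assignment step can be sketched as below (a NumPy illustration; in the report the weights themselves come from `vl_svmtrain`, and the function name and argument layout here are assumptions):

```python
import numpy as np

def one_vs_all_predict(W, b, feats, categories):
    """Score every image under each category's linear function w.x + b
    and assign the category with the largest score.
    W: (num_categories x d), b: (num_categories,), feats: (n x d)."""
    scores = feats @ W.T + b            # n x num_categories score matrix
    return [categories[i] for i in scores.argmax(axis=1)]
```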
| ![]() | ![]() | ![]() |
|---|---|---|
| Performance with tiny images and nearest neighbor classifier | Performance with tiny images and linear SVM classifier | Performance with bag of SIFT and linear SVM classifier |
| 0.191 | 0.529 | 0.723 |
| Images used to build vocabulary | Vocabulary size | lambda for vl_svmtrain | Step (bag of SIFT) | Size (bag of SIFT) | Step (build vocabulary) | Size (build vocabulary) | Accuracy | Time |
|---|---|---|---|---|---|---|---|---|
| 150 | 400 | 0.0001 | 5 | 6 | 8 | 10 | 0.709 | 25 min |
| 600 | 400 | 0.0001 | 5 | 6 | 8 | 10 | 0.705 | 32 min |
| 300 | 400 | 0.0001 | 4 | 5 | 8 | 10 | 0.703 | 41 min |
| 300 | 300 | 0.0001 | 4 | 5 | 8 | 10 | 0.701 | 31 min |
| 300 | 400 | 0.00001 | 5 | 6 | 6 | 8 | 0.670 | 27 min |
| 300 | 400 | 0.0001 | 5 | 6 | 6 | 8 | 0.723 | 27 min |
| 300 | 400 | 0.001 | 5 | 6 | 6 | 8 | 0.655 | 27 min |
| 300 | 400 | 0.01 | 5 | 6 | 6 | 8 | 0.492 | 27 min |
| 300 | 400 | 1 | 5 | 6 | 6 | 8 | 0.437 | 27 min |
| 300 | 400 | 10 | 5 | 6 | 6 | 8 | 0.477 | 27 min |
| 300 | 400 | 0.0001 | 5 | 6 | 8 | 10 | 0.703 | 27 min |
| 300 | 400 | 0.0001 | 6 | 8 | 8 | 10 | 0.691 | 19 min |
| 300 | 400 | 0.0001 | 8 | 10 | 8 | 10 | 0.667 | 12 min |
| 300 | 400 | 0.0001 | 25 | 30 | 25 | 40 | 0.585 | 2 min |
| Vocabulary size | SVM accuracy | 1-NN accuracy | Time |
|---|---|---|---|
| 10 | 0.473 | 0.387 | 2 min |
| 20 | 0.577 | 0.467 | 3 min |
| 50 | 0.645 | 0.489 | 5 min |
| 100 | 0.681 | 0.493 | 8 min |
| 200 | 0.691 | 0.509 | 14 min |
| 400 | 0.696 | 0.525 | 21 min |
| 500 | 0.709 | 0.525 | 34 min |
| 1000 | 0.712 | 0.525 | 66 min |
| 10000 | 0.727 | 0.493 | 6 h 36 min |
| Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |
|---|---|---|---|---|---|
| Kitchen | 0.720 | ![]() ![]() | ![]() ![]() | ![]() Store ![]() LivingRoom | ![]() Bedroom ![]() Bedroom |
| Store | 0.630 | ![]() ![]() | ![]() ![]() | ![]() InsideCity ![]() InsideCity | ![]() Industrial ![]() Industrial |
| Bedroom | 0.540 | ![]() ![]() | ![]() ![]() | ![]() LivingRoom ![]() Industrial | ![]() Office ![]() Highway |
| LivingRoom | 0.400 | ![]() ![]() | ![]() ![]() | ![]() Street ![]() Bedroom | ![]() Office ![]() Kitchen |
| Office | 0.890 | ![]() ![]() | ![]() ![]() | ![]() Store ![]() Bedroom | ![]() Kitchen ![]() Kitchen |
| Industrial | 0.550 | ![]() ![]() | ![]() ![]() | ![]() Street ![]() InsideCity | ![]() TallBuilding ![]() Highway |
| Suburb | 0.960 | ![]() ![]() | ![]() ![]() | ![]() Industrial ![]() Industrial | ![]() LivingRoom ![]() Coast |
| InsideCity | 0.460 | ![]() ![]() | ![]() ![]() | ![]() Bedroom ![]() Coast | ![]() Coast ![]() Kitchen |
| TallBuilding | 0.790 | ![]() ![]() | ![]() ![]() | ![]() Industrial ![]() Street | ![]() Industrial ![]() Forest |
| Street | 0.810 | ![]() ![]() | ![]() ![]() | ![]() InsideCity ![]() Store | ![]() Suburb ![]() Mountain |
| Highway | 0.810 | ![]() ![]() | ![]() ![]() | ![]() Industrial ![]() Bedroom | ![]() Bedroom ![]() Coast |
| OpenCountry | 0.660 | ![]() ![]() | ![]() ![]() | ![]() Coast ![]() Mountain | ![]() Coast ![]() LivingRoom |
| Coast | 0.850 | ![]() ![]() | ![]() ![]() | ![]() Highway ![]() OpenCountry | ![]() Mountain ![]() OpenCountry |
| Mountain | 0.850 | ![]() ![]() | ![]() ![]() | ![]() Coast ![]() LivingRoom | ![]() OpenCountry ![]() OpenCountry |
| Forest | 0.930 | ![]() ![]() | ![]() ![]() | ![]() OpenCountry ![]() Store | ![]() TallBuilding ![]() Street |
Based on the results above, tiny images with a nearest neighbor classifier is the simplest and cheapest method to implement, with accuracy around 10%-20%. The linear SVM classifier is better than the nearest neighbor classifier, improving accuracy to roughly 50%-60%. Bag of SIFT is better than tiny images, though much slower, reaching 60%-70% accuracy when the parameters (number of clusters, SVM regularization, number of patches sampled when building the vocabulary, and the size and step of the dense SIFT features) are tuned well. With bag of SIFT and a nearest neighbor classifier, a larger vocabulary and smaller step and size values help increase accuracy, but slow down the whole pipeline. As vocabulary size grows, both accuracy and running time rise quickly at first; once the size reaches about 400, the SVM reaches about 70% accuracy and 1-NN about 50%, and further gains slow down.