For this project, I stuck to the baseline implementation but tried to tune my algorithms' parameters thoroughly, achieving an accuracy of around 70%. Below are descriptions of each algorithm used in the project and how I chose to implement them.
This algorithm builds descriptors by simply downscaling images and vectorizing them.
To implement it, I used the imresize function to downscale each image to 16 by 16 pixels and reshape to turn the 16 by 16 matrix into a vector, then normalized the vector and made it zero-mean (by subtracting the mean of the vector from itself).
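The tiny-image descriptor can be sketched as follows. This is a numpy approximation, not the original MATLAB code: imresize's interpolation is replaced here with simple block averaging, and image sides are assumed to be multiples of 16.

```python
import numpy as np

def tiny_image(img, size=16):
    """Tiny-image descriptor: block-average the image down to size x size,
    then flatten, zero-mean, and unit-normalize the result.
    `img` is a 2-D grayscale array; block averaging stands in for imresize."""
    h, w = img.shape
    small = img[:h - h % size, :w - w % size]
    # Mean-pool each block to mimic downscaling to size x size.
    small = small.reshape(size, small.shape[0] // size,
                          size, small.shape[1] // size).mean(axis=(1, 3))
    vec = small.ravel().astype(float)
    vec -= vec.mean()                  # zero mean
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
d = tiny_image(img)                    # 256-dim, zero-mean, unit-norm vector
```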
This algorithm simply finds the training descriptor closest in feature space to each test descriptor and assigns that neighbor's category to the test descriptor. It requires no training (unlike the SVM classifier), but it gives equal weight to every dimension of the descriptor, which can be ineffective.
I used the vl_alldist2 function to compute the distances between all of the training and test features and the min function to find each test feature's nearest neighbor. I then assigned each test feature the category of its nearest neighbor.
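The same distance-then-min procedure can be sketched in numpy (a stand-in for vl_alldist2 and min in the original MATLAB):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Label each test descriptor with the category of its nearest
    training descriptor under Euclidean distance."""
    # Pairwise squared distances via |a - b|^2 = |a|^2 - 2ab + |b|^2.
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          - 2 * test_feats @ train_feats.T
          + np.sum(train_feats ** 2, axis=1)[None, :])
    nearest = np.argmin(d2, axis=1)        # index of each nearest neighbor
    return [train_labels[i] for i in nearest]

train = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = ['Kitchen', 'Forest']
test = np.array([[1.0, 0.0], [9.0, 9.0]])
print(nearest_neighbor_classify(train, labels, test))  # ['Kitchen', 'Forest']
```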
This algorithm builds descriptors for images based on the frequency of occurrences of visual "words" in them.
First, I needed to build a vocabulary. For each image, I densely computed SIFT features with a step size of 28 and a bin size of 4 using the vl_dsift function. These features were collected in a cell array (one cell for each image). I then used the cell2mat function to concatenate all of the feature matrices in these cells into one large matrix, which served as my representative set of SIFT features for the training data, and clustered these features using vl_kmeans. I used a vocabulary size of 400 for debugging, and then 800 and 2000 when aiming for maximum performance.
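The vocabulary-building step can be sketched with plain Lloyd's k-means in numpy. This is a simplified stand-in for vl_kmeans (which uses smarter initialization and distance computations), applied to the stacked descriptor matrix produced by the cell2mat step:

```python
import numpy as np

def build_vocabulary(all_sift_feats, vocab_size, n_iter=20, seed=0):
    """Cluster a stacked (n, d) matrix of SIFT descriptors into
    `vocab_size` visual words with plain Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    pick = rng.choice(len(all_sift_feats), vocab_size, replace=False)
    centers = all_sift_feats[pick].astype(float)   # random initial words
    for _ in range(n_iter):
        # Assign every descriptor to its nearest center ...
        d2 = ((all_sift_feats[:, None, :] - centers[None, :, :]) ** 2).sum(2)
        assign = np.argmin(d2, axis=1)
        # ... then move each center to the mean of its assigned descriptors.
        for k in range(vocab_size):
            members = all_sift_feats[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

feats = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                   np.random.default_rng(2).normal(5, 0.1, (20, 2))])
vocab = build_vocabulary(feats, 2)    # one (vocab_size, d) matrix of words
```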
Next, I needed to build a descriptor for each image based on the vocabulary. For each image, I computed SIFT features with a step size of 8 and a bin size of 4. Then I used vl_alldist2 to compute the distances between these features and the words in the vocabulary, assigned each feature to its nearest word, and computed a histogram of how often each visual word appeared in the image. The final step was to normalize this histogram so that the descriptor was invariant to the size of the image.
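The encoding step above can be sketched in numpy (again standing in for vl_alldist2 and the histogram logic in the original MATLAB):

```python
import numpy as np

def bags_of_sifts(image_feats, vocab):
    """Encode one image: assign each of its SIFT descriptors to the
    nearest visual word, then build a normalized word histogram so the
    descriptor does not depend on how many features the image has."""
    d2 = ((image_feats[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = np.argmin(d2, axis=1)                      # nearest word per feature
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()                           # L1-normalized histogram

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])
feats = np.array([[1.0, 1.0], [0.0, 2.0], [9.0, 9.0], [11.0, 10.0]])
print(bags_of_sifts(feats, vocab))  # [0.5 0.5]
```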
This algorithm trains a linear SVM for each category, dividing the training set into members and nonmembers of that category. Each test feature is then scored by every SVM, and the SVM that most confidently classifies it as a member of its own category provides the label.
For each category, I built a label vector that divided the training data into member and nonmember portions (coded as ones and negative ones), then used vl_svmtrain to train an SVM on the labeled data. I found that a lambda value of 0.0002 maximized my accuracy. Then I scored each image using the calculated SVM parameters. The SVM that produced the maximum score for the image (i.e., the one most confident that the image was on the member side of the hyperplane it computed) labeled the image with its category.
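The one-vs-all scheme can be sketched as follows. This is not vl_svmtrain: the solver here is a simple full-batch hinge-loss subgradient descent, which suffices to illustrate the member/nonmember training and max-score labeling.

```python
import numpy as np

def train_one_vs_all_svms(feats, labels, lam=0.0002, n_iter=300, lr=0.1):
    """Train one linear SVM per category on +1/-1 (member/nonmember)
    labels via hinge-loss subgradient descent. Returns {category: (w, b)}."""
    y_all = np.array(labels)
    models = {}
    for cat in sorted(set(labels)):
        y = np.where(y_all == cat, 1.0, -1.0)    # members vs. nonmembers
        w, b = np.zeros(feats.shape[1]), 0.0
        for _ in range(n_iter):
            viol = y * (feats @ w + b) < 1       # points inside the margin
            # Subgradient of lam/2*|w|^2 + mean hinge loss.
            w -= lr * (lam * w - (y[viol, None] * feats[viol]).sum(0) / len(y))
            b -= lr * (-y[viol].sum() / len(y))
        models[cat] = (w, b)
    return models

def svm_classify(models, test_feats):
    """Label each descriptor with the category whose SVM scores it highest."""
    cats = sorted(models)
    scores = np.stack([test_feats @ models[c][0] + models[c][1] for c in cats], 1)
    return [cats[i] for i in np.argmax(scores, axis=1)]

feats = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [6., 5.], [5., 6.]])
labels = ['A', 'A', 'A', 'B', 'B', 'B']
preds = svm_classify(train_one_vs_all_svms(feats, labels),
                     np.array([[0.5, 0.5], [5.5, 5.5]]))
print(preds)  # ['A', 'B']
```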
Tiny images + nearest neighbor — Accuracy: 0.199. Important parameters: images downscaled to 16 by 16 pixels.

Bag of SIFT + nearest neighbor — Accuracy: 0.399. Important parameters: dense SIFT with step size 8 and bin size 4.

Bag of SIFT + linear SVM — Accuracy: 0.695. Important parameters: lambda = 0.0002.
| Category name | Accuracy | False positives (true label) | False negatives (predicted label) |
|---|---|---|---|
| Kitchen | 0.560 | LivingRoom, Store | Bedroom, Store |
| Store | 0.560 | InsideCity, Kitchen | Kitchen, Bedroom |
| Bedroom | 0.470 | LivingRoom, LivingRoom | LivingRoom, LivingRoom |
| LivingRoom | 0.480 | Kitchen, Office | TallBuilding, Bedroom |
| Office | 0.840 | LivingRoom, Kitchen | Kitchen, Kitchen |
| Industrial | 0.540 | InsideCity, Street | Store, LivingRoom |
| Suburb | 0.990 | Industrial, InsideCity | Coast |
| InsideCity | 0.460 | LivingRoom, Store | Industrial, Kitchen |
| TallBuilding | 0.790 | Street, InsideCity | LivingRoom, Mountain |
| Street | 0.660 | TallBuilding, Forest | Industrial, Mountain |
| Highway | 0.860 | Industrial, Store | LivingRoom, Suburb |
| OpenCountry | 0.590 | Coast, Coast | Forest, Highway |
| Coast | 0.820 | OpenCountry, OpenCountry | Highway, OpenCountry |
| Mountain | 0.880 | Highway, OpenCountry | Suburb, Forest |
| Forest | 0.930 | OpenCountry, Store | Mountain, OpenCountry |