Project Description

In this project I experimented with a sliding window detector for recognizing human faces. I tested different dense SIFT feature densities, adding hard negatives to the training set, linear vs. non-linear SVMs, and a small range of search scale parameters. The results are shown in the sections below.

Feature Selection

The first step in improving the starter code was to replace the raw pixel features with something stronger. I used the vl_dsift function to compute SIFT descriptors over the sliding-window crops. In the plots below I compare a single SIFT descriptor computed over the whole 36x36-pixel patch against 9 evenly spaced SIFT descriptors with smaller spatial extent that overlapped each other by 30%.

These plots show the performance of a linear SVM classifier trained on those different features. I didn't think that the difference in performance justified the additional computation time, so I proceeded using only one SIFT descriptor per patch. Because the search patches were already small, I think using one descriptor for each patch captured a sufficient amount of gradient orientation information.

One SIFT descriptor per patch, using the whole patch
Nine SIFT descriptors per patch, evenly spaced and overlapping 30%
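The two descriptor layouts compared above can be sketched with a small geometry calculation. The project itself used MATLAB's vl_dsift; this Python helper and its names are illustrative only, assuming square descriptors whose neighbors overlap by a fixed fraction.

```python
def descriptor_centers(patch=36, grid=3, overlap=0.30):
    """Place a grid x grid array of square SIFT descriptors inside a patch
    so that neighbours overlap by the given fraction.  Returns the
    descriptor side length and the 1-D center coordinates (the 2-D grid
    is the Cartesian product of these with themselves)."""
    # grid descriptors with step = (1 - overlap) * extent must span the patch:
    #   extent + (grid - 1) * (1 - overlap) * extent = patch
    extent = patch / (1 + (grid - 1) * (1 - overlap))
    step = (1 - overlap) * extent
    centers = [extent / 2 + i * step for i in range(grid)]
    return extent, centers

# One descriptor covering the whole 36x36 patch:
print(descriptor_centers(grid=1))   # extent 36, centered at 18
# Nine descriptors with 30% overlap: each is 15px wide, stepping 10.5px
print(descriptor_centers(grid=3))
```

With a 3x3 grid and 30% overlap each descriptor comes out 15 pixels wide, so the nine-descriptor layout sees finer spatial structure at the cost of nine descriptor computations per window.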

Using Hard Negatives

The next step in improving the detector was to train using hard negatives. As a small speed-up I cached the calculation of features for the hard negative training examples. The plot below shows that just one pass of adding hard negative examples to the training set improved average precision by 0.04.
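The mining-plus-caching loop can be sketched in Python. The project itself was MATLAB/VLFeat; `extract_features`, the toy linear scorer, and the data below are all illustrative stand-ins for the real dense-SIFT pipeline.

```python
# One hard-negative mining pass with feature caching.
feature_cache = {}  # crop id -> feature, so a later pass reuses the work

def extract_features(crop_id, raw_value):
    """Stand-in for the expensive dense-SIFT computation, memoized."""
    if crop_id not in feature_cache:
        feature_cache[crop_id] = float(raw_value)
    return feature_cache[crop_id]

def mine_hard_negatives(crops, score, threshold=0.0):
    """Return the negative crops the current detector wrongly fires on."""
    hard = []
    for crop_id, raw in crops:
        f = extract_features(crop_id, raw)
        if score(f) > threshold:        # false positive -> hard negative
            hard.append((crop_id, f))
    return hard

# Toy detector: score = w * feature + b
w, b = 1.0, -0.5
negatives = [("n0", 0.2), ("n1", 0.9), ("n2", 0.7), ("n3", 0.1)]
hard = mine_hard_negatives(negatives, lambda f: w * f + b)
print([cid for cid, _ in hard])   # n1 and n2 score above the threshold
```

The mined crops are appended to the negative training set and the SVM is retrained; on a second pass the cache avoids recomputing features for crops already seen.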

Using hard negatives for an additional training pass

Linear vs. Non-linear SVM

To further improve detector performance, I compared linear and non-linear classifiers. The plots in the previous sections were made with a linear classifier, using lambda = 1 for the SIFT tests and lambda = 100 for the hard-negatives test.

Before using the non-linear SVM, I tried to select its parameters intelligently: lambda (regularization strength) and sigma (the width of the RBF kernel). The plots below show the RBF kernel matrix for different values of sigma. I selected sigma = 550 because it seemed to offer the best separation between the positive and negative training examples; in each kernel matrix, indices 1-1000 are the positive examples and 1001-2000 the negatives.

RBF kernel with sigma = 300
RBF kernel with sigma = 500
RBF kernel with sigma = 700
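The effect of sigma can be seen from the kernel formula itself. Below is a Python sketch of one common RBF parameterization; the exact form used by the course code may differ, and the points and values are illustrative.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel, k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    (One common parameterization; the course code may use another.)"""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

# A larger sigma makes distant points look more similar, smoothing the
# kernel matrix; too large and positives and negatives blur together.
x, y = [0.0, 0.0], [300.0, 400.0]     # Euclidean distance 500
for sigma in (300, 500, 700):
    print(sigma, round(rbf_kernel(x, y, sigma), 3))
```

Picking sigma is therefore a balance: small enough that faces and non-faces stay distinguishable in the kernel matrix, large enough that similar faces still register as similar.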

I selected lambda by running the non-linear SVM on the training set and choosing the lambda that gave the best separation between the confidences for the positive and negative image examples. That was lambda = 0.01.

Separation of the detector confidences for positive and negative training examples:

lambda = 1000
lambda = 100
lambda = 0.1
lambda = 0.01
Training confidences using 4 passes of adding hard negatives with lambda = 0.01.
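The lambda selection procedure above can be sketched as a small search over candidates, scoring each by the gap between positive and negative training confidences. The `fake_results` table here is toy data standing in for actually retraining the SVM at each lambda.

```python
def separation(pos_conf, neg_conf):
    """Gap between mean positive and mean negative confidence."""
    return sum(pos_conf) / len(pos_conf) - sum(neg_conf) / len(neg_conf)

def pick_lambda(candidates, train_eval):
    """train_eval(lam) -> (pos_confidences, neg_confidences) on the
    training set; pick the lambda with the largest separation."""
    return max(candidates, key=lambda lam: separation(*train_eval(lam)))

# Toy stand-in for retraining at each lambda: here weaker regularization
# happens to separate the training confidences better.
fake_results = {
    1000: ([0.6, 0.7], [0.5, 0.4]),
    100:  ([0.8, 0.9], [0.3, 0.2]),
    0.1:  ([1.2, 1.1], [-0.4, -0.5]),
    0.01: ([1.5, 1.6], [-0.9, -1.0]),
}
best = pick_lambda(fake_results, lambda lam: fake_results[lam])
print(best)
```

One caveat worth keeping in mind: separation on the training set can reward overfitting, so a held-out validation split would be a safer criterion.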

Detection Parameter Changes

The last improvement I made was to change the search granularity of the face detector. The precision-recall plots below compare different lambda values under different search parameters.

2 passes of adding hard negatives, sigma = 550, lambda = 1, window step size = 2px, scale factor = 1.4
4 passes of adding hard negatives, sigma = 550, lambda = 0.01, window step size = 2px, scale factor = 1.4
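The search granularity parameters (window step size and scale factor) directly control how many windows the detector must score. Here is an illustrative Python sketch of that count over a scale pyramid; `count_windows` and the 128x128 image size are assumptions for the example, not the project's code.

```python
def count_windows(img_w, img_h, win=36, step=2, scale=1.4):
    """Count sliding-window positions over a scale pyramid: the image
    shrinks by `scale` each level until the window no longer fits."""
    total, w, h, levels = 0, img_w, img_h, 0
    while min(w, h) >= win:
        nx = (int(w) - win) // step + 1   # horizontal positions
        ny = (int(h) - win) // step + 1   # vertical positions
        total += nx * ny
        levels += 1
        w, h = w / scale, h / scale       # next pyramid level
    return total, levels

print(count_windows(128, 128))
```

Halving the step size roughly quadruples the window count per level, and a smaller scale factor adds pyramid levels, so both parameters trade detection granularity directly against runtime.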

The best performance my detector achieved was AP = 0.552, roughly ten times the average precision of the baseline detector that used raw pixels, random negatives, and a linear SVM.