CS143 Project 4: Face detection with a sliding window

Andy Loomis
November 12th, 2011

Overview

For this project I implemented a sliding window face detector. The pipeline consists of training a classifier to discriminate between face and non-face image patches, then sampling patches from an image and classifying each one. I experimented with different feature representations using the HOG descriptor, with linear and nonlinear classifiers, and with iterative retraining using hard negatives.
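The sliding-window stage of the pipeline can be sketched roughly as follows. This is a minimal Python sketch rather than the actual implementation; the function names, the nearest-neighbor `rescale`, and all parameter values (patch size, step, scales) are illustrative assumptions.

```python
import numpy as np

def rescale(image, s):
    """Nearest-neighbor rescaling by factor s (stand-in for proper resampling)."""
    h, w = image.shape
    rows = (np.arange(int(h * s)) / s).astype(int)
    cols = (np.arange(int(w * s)) / s).astype(int)
    return image[np.ix_(rows, cols)]

def sliding_windows(image, patch=36, step=4):
    """Yield (row, col, window) for every patch-sized window, stepped by `step`."""
    h, w = image.shape
    for r in range(0, h - patch + 1, step):
        for c in range(0, w - patch + 1, step):
            yield r, c, image[r:r + patch, c:c + patch]

def detect(image, score_fn, patch=36, step=4, scales=(1.0, 0.8, 0.64), thresh=0.0):
    """Score every window at every scale; return (row, col, scale, score) hits."""
    hits = []
    for s in scales:
        scaled = rescale(image, s)
        for r, c, win in sliding_windows(scaled, patch, step):
            score = score_fn(win)  # classifier confidence for this window
            if score > thresh:
                # map window coordinates back into the original image
                hits.append((int(r / s), int(c / s), s, score))
    return hits
```

Smaller step and scale increments mean more subwindows and better localization, at the cost of many more classifier evaluations.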

The HOG Descriptor

Instead of using raw image patches as features, I implemented a HOG descriptor as described in Dalal and Triggs 2005. My HOG descriptor begins by calculating the horizontal and vertical gradients of the image patch using a simple [-1,0,1] derivative filter, and uses these values to compute the magnitude and orientation of the gradient at every pixel. The gradients are then divided into 9 bins based on their orientation, and the bins are aggregated into 4x4-pixel cells using bilinear interpolation. Finally, the cells are aggregated and normalized in overlapping blocks containing 4 cells each. Each pixel in the original image patch contributes to 4 cells, and each cell contributes to 4 blocks. This setup is equivalent to the R-HOG descriptor introduced in Dalal and Triggs. To test the descriptiveness of these features, I used a linear classifier to learn the weight placed on each feature, trained on 5000 positive and 5000 negative examples. A visualization of my HOG descriptor and the learned weights is shown below.
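In outline, the descriptor computation looks like the following. This is a simplified sketch, not the actual implementation: it hard-assigns each pixel to a single orientation bin instead of voting with bilinear interpolation, and it assumes the patch dimensions are divisible by the cell size.

```python
import numpy as np

def hog(patch, cell=4, nbins=9):
    """Simplified R-HOG: [-1,0,1] gradients, unsigned orientation bins,
    cell histograms, and L2-normalized 2x2-cell blocks."""
    patch = patch.astype(float)
    h, w = patch.shape
    # [-1, 0, 1] derivative filters (edge pixels use one-sided differences)
    gx = np.empty_like(patch); gy = np.empty_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gx[:, 0] = patch[:, 1] - patch[:, 0]; gx[:, -1] = patch[:, -1] - patch[:, -2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]
    gy[0, :] = patch[1, :] - patch[0, :]; gy[-1, :] = patch[-1, :] - patch[-2, :]
    mag = np.hypot(gx, gy)
    # unsigned orientation in [0, 180); each bin spans 180/nbins degrees
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((ang / (180.0 / nbins)).astype(int), nbins - 1)
    # accumulate gradient magnitudes into per-cell orientation histograms
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, nbins))
    rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    np.add.at(hist, (rr // cell, cc // cell, bins), mag)
    # overlapping 2x2-cell blocks, each L2-normalized
    feats = []
    for br in range(ch - 1):
        for bc in range(cw - 1):
            block = hist[br:br + 2, bc:bc + 2].ravel()
            feats.append(block / np.sqrt((block ** 2).sum() + 1e-6))
    return np.concatenate(feats)
```

For a 36x36 patch with 4x4 cells and 9 bins this yields 8x8 overlapping blocks of 4 cells each, i.e. a 2304-dimensional feature vector.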

[Figure: image patch, HOG features, positive weights, negative weights]

Looking at the example above, the positive weights placed on the features of my HOG descriptor clearly have face-like properties: the shape of the head, eyes, forehead, and even the shoulders are discernible. The performance of the basic pipeline increased dramatically once the raw image patch features were replaced with HOG features. Using a single-stage linear classifier with the default parameters, as well as the default step and scale parameters of the sliding window, increased the average precision of the face detections from 0.05 to 0.45. Increasing the number of subwindows my algorithm considered improved performance significantly again. I was then able to test several different bin sizes and cell sizes.

The graph above shows the precision-recall curves for various binning schemes of the HOG descriptor. I experimented with binning gradients into 9 bins of 20 degrees and 18 bins of 10 degrees, and with cell sizes of 4x4 and 6x6 pixels. The best performance came from a descriptor with 9 bins and 4x4-pixel cells, although cell size seems to have been the most important factor.

To speed up the computation of the HOG descriptor in Matlab, I implemented it in C and compiled it as a MEX function. This gave close to a 100x speedup in the feature computation.

Mining Hard Negatives

The next addition I made to my face detection pipeline was the mining of hard negatives. This involves first training a classifier on a random set of negative examples drawn from images containing no faces. The classifier is then run on another set of face-free images, so every detection it makes is a false positive. These false positives can then be used to retrain the classifier in an attempt to improve its performance.
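The mining step amounts to scoring every window of the face-free images and keeping the ones the classifier wrongly accepts. In this sketch the `decision_function` interface, the `feature_sets` layout, and the threshold are illustrative assumptions, not the actual interface used in the project.

```python
import numpy as np

def mine_false_positives(classifier, feature_sets, thresh=0.0):
    """Collect window features from face-free images that the classifier
    scores as faces. `feature_sets` is a list of (n_windows, n_features)
    arrays, one per image; every window scoring above `thresh` is a false
    positive by construction, since the images contain no faces."""
    fps, scores = [], []
    for feats in feature_sets:
        s = classifier.decision_function(feats)
        keep = s > thresh
        fps.append(feats[keep])
        scores.append(s[keep])
    # stack all false positives with their confidences for later selection
    return np.vstack(fps), np.concatenate(scores)
```

The returned confidences are what make it possible to rank the negatives by difficulty rather than sampling them blindly.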

I experimented with two techniques for selecting which false positives to use for retraining my classifier. The first was to randomly select a subset of all the false positives in a given image. The second was to sort the false positives by the confidence the classifier assigned to them and keep only the most confident detections, in other words the hardest of the negatives. The average precision of two-stage linear and nonlinear classifiers trained each way is shown below.
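The two selection strategies can be written as small helpers (hypothetical names; `fps` is an array of false-positive feature vectors, each with a classifier confidence in `scores`):

```python
import numpy as np

def hardest_negatives(fps, scores, k):
    """Keep the k false positives the classifier was most confident about."""
    order = np.argsort(scores)[::-1]  # sort by confidence, descending
    return fps[order[:k]]

def random_negatives(fps, k, seed=None):
    """Baseline: keep a random subset of k false positives."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(fps), size=min(k, len(fps)), replace=False)
    return fps[idx]
```

Both return the same number of examples, so any difference in the retrained classifier comes from which negatives were kept, not how many.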

From the above graphs you can see that using hard negatives in an iterative training procedure leads to greater precision in the final classifier. In addition, using only the hardest negatives improves performance more than using a random subset of the negatives. This effect also seems to be more pronounced for the nonlinear classifier than for the linear one.

In addition to experimenting with the types of negatives to use, I also considered multiple rounds of hard negative mining. Below is the precision-recall curve for a linear classifier trained with multiple rounds of mining. You can see that after two stages (one round of hard negative mining) there is no further performance gain from additional rounds. The performance gains for the nonlinear classifier were equivalent.

Results and Discussion

The best performance I was able to achieve was an average precision of 0.831. The precision-recall curve for this run is shown below. In general, nonlinear classifiers performed better than linear classifiers in almost all of the experiments I ran. One of the biggest challenges I had was controlling the false positive rate: because these classifiers encounter far more non-face patches than face patches, even a small false positive rate results in numerous extraneous detections. I imagine that a cascade approach could help in this area, but I did not have time to experiment with one.