Project 4: Face detection with a sliding window

Lu Zeng (lz7) - Fall 2011

To generalize wildly: finding faces is easy for humans and hard for machines. Does that stop the field of computer vision from trying this task? Surely not. In this project, we implement a sliding-window face detector.

Positive examples for Facebook!
Every time you tag a friend, some classifier at Facebook probably gets another positive example. Nom.


Inspired by the papers of Dalal-Triggs 2005 and Viola-Jones 2001, we use a variety of techniques to develop a sliding-window face detector in this project. The overarching concept is to train a classifier on features derived from positive and negative examples. As a baseline, a linear SVM trained on the trivial feature, the raw image patch itself, achieves an average precision (AP) of 0.05.

Normalizing the image patches increased average precision to 0.06. Fairly shabby.
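As a rough sketch of what that normalization amounts to (not the project's exact code; the variable names are illustrative):

    % Rough sketch: flatten one grayscale training crop and normalize it.
    % 'crop' is an assumed variable holding one training patch.
    patch = im2double(crop);        % one face (or non-face) training patch
    feat  = patch(:)';              % flatten into a row feature vector
    feat  = feat - mean(feat);      % remove the mean (lighting offset)
    s = std(feat);
    if s > 0
        feat = feat / s;            % scale to unit variance (contrast)
    end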

First, features are derived from 6000+ positive examples. These are crops of faces -- heads, really -- from the Caltech Web Faces project. Then, negative training patches are pulled from images that are known not to contain faces; at first, these are random, but later, once an initial classifier has been trained, we can use the classifier to give us the most face-like non-face patches.
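Pulling random negatives is straightforward; a minimal sketch (the file name, patch size, and sample count are made up for illustration):

    % Sketch: sample random patches from an image known to contain no faces.
    img = im2double(rgb2gray(imread('no_faces_scene.jpg')));  % hypothetical file
    patch_size  = 36;       % assumed to match the positive crops
    num_samples = 10;
    [h, w] = size(img);
    negatives = zeros(num_samples, patch_size * patch_size);
    for i = 1:num_samples
        r = randi(h - patch_size + 1);
        c = randi(w - patch_size + 1);
        crop = img(r:r+patch_size-1, c:c+patch_size-1);
        negatives(i, :) = crop(:)';
    end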

With this classifier in hand, we scan the test image by taking patches at some fixed interval -- this lets us detect faces at different positions in the photo. These patches are the same size as our training examples; by repeating the scan with the image at various scales, we can also detect faces of different sizes.
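Schematically, the detection loop looks something like the sketch below. The weights w_svm and bias b_svm stand in for whatever classifier has been trained; the scale handling is simplified, and the names are mine, not the project's.

    % Sketch of a multi-scale sliding window (simplified; not the actual code).
    step_size  = 2;                 % pixels between window origins
    patch_size = 36;                % same size as the training crops
    scale      = 1.0;               % 1.0 = original resolution
    detections = [];                % rows: [row, col, size, confidence]
    while min(size(img)) * scale >= patch_size
        scaled = imresize(img, scale);
        [h, w] = size(scaled);
        for r = 1:step_size:(h - patch_size + 1)
            for c = 1:step_size:(w - patch_size + 1)
                crop = scaled(r:r+patch_size-1, c:c+patch_size-1);
                conf = w_svm' * crop(:) + b_svm;   % linear score; w_svm, b_svm assumed trained
                if conf > 0
                    detections(end+1, :) = [[r, c, patch_size] / scale, conf]; %#ok<AGROW>
                end
            end
        end
        scale = scale / 1.5;        % shrink the image so the window covers larger faces
    end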

There are several steps that improve performance.

Key step 1: Represent the image with stronger features.

Because training SVMs on SIFT features mysteriously threw Matlab into a tizzy of ill-conditioned matrices, I used HOG features instead. The implementation I used was found here; it ran reasonably fast, even though it is written in Matlab. Training the linear SVM on HOG features from the training patches, with lambda = 1.0 (determined experimentally), gives AP = 0.239. That is a nearly five-fold increase from no more than three lines of code in crops2features.m; not bad, but we can do better.
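I can't reproduce that implementation here, but the contents of crops2features.m amount to something like the sketch below, with MATLAB's extractHOGFeatures (Computer Vision Toolbox) standing in for the third-party HOG code actually used; the crop size and cell size are assumptions.

    % Sketch: compute HOG features for every training crop.
    % crops is assumed to be an N x (36*36) matrix of flattened grayscale patches.
    num_crops = size(crops, 1);
    hog_feats = [];
    for i = 1:num_crops
        patch = reshape(crops(i, :), 36, 36);
        f = extractHOGFeatures(patch, 'CellSize', [6 6]);  % stand-in for the HOG code used
        hog_feats(i, :) = f; %#ok<AGROW>
    end
    % The linear SVM is then trained on hog_feats instead of the raw pixels.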

Key step 2: Use a non-linear SVM.

A linear SVM can learn only a hyperplane in high-dimensional feature space to separate positive and negative instances, and sometimes a hyperplane is not expressive enough. Classification with the non-linear SVM, with no parameters adjusted, resulted in AP = 0.340, a relative increase of roughly 40%.
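The non-linear SVM itself came from the course code; as a rough stand-in, an RBF-kernel SVM in current MATLAB would look like this (feats, labels, and test_feats are assumed to exist):

    % Sketch: non-linear (RBF kernel) SVM as a stand-in for the course code.
    % feats: N x D HOG features; labels: N x 1 vector of +1 (face) / -1 (non-face).
    model = fitcsvm(feats, labels, 'KernelFunction', 'rbf', 'KernelScale', 'auto');
    [pred, score] = predict(model, test_feats);  % second score column: confidence for the positive class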

Key step 3: Train the non-linear SVM on hard negatives.

Because the non-linear SVM is more expressive, we can expect some improvement if we mine the faceless images for patches that most resemble faces.

To find hard negatives, we use the classifier trained on random negatives to classify patches from images known not to contain faces. The classifier returns its confidence on each detection, and ideally, I would have chosen the most confident non-faces out of the entire pool of non-face crops. However, this would have required scanning all the images and then reloading them to extract crops from the bounding boxes; instead, I chose the 30% most confident detections from each image.
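Per image, that selection is just a sort over the detector's confidences (a sketch; the variable names are illustrative):

    % Sketch: keep the 30% most confident false positives from one non-face image.
    % confidences and crops come from running the current detector on that image.
    [~, order] = sort(confidences, 'descend');
    num_keep = ceil(0.3 * numel(confidences));
    hard_negatives = crops(order(1:num_keep), :);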

Another sensible way to choose hard negatives would have been to threshold on some confidence -- judging by the detection visualization on the class photo, there seems to be a big drop between confidences of 0.5 and 0.6.

Using these patches in training in addition to the random negatives resulted in AP = 0.365.

Step 4: Tune parameters.

We really are sliding along the image, taking a sub-image at regular intervals. If this interval is too large, we may miss faces entirely simply because no window lands on them. We also do this with the image at different scales; if the scale at which we begin is zoomed out too far, the classifier may miss small faces entirely.

The cost of coverage, of course, is memory: for larger images, there are too many crops to fit in memory at once. For this run, the parameters were as follows, except that when the image was larger than 500*500 px, start_scale was set to 3. A flimsy justification is that large images have large faces, which starting from a greater zoom-out will not miss. With these adjustments, the result was AP = 0.476.

step_size = 2; scale_factor = 1.5; min_scale_size = 50; start_scale = 2;
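My reading of how those parameters and the large-image exception fit together is sketched below (this is a reconstruction, not the detector's actual code, and the comments are my interpretation of each parameter):

    % Sketch: parameter setup with the large-image adjustment described above.
    step_size      = 2;      % pixels between window origins
    scale_factor   = 1.5;    % zoom-out factor between successive scales
    min_scale_size = 50;     % stop once the resized image falls below this size
    start_scale    = 2;      % index of the first (most zoomed-in) scale examined
    [h, w, ~] = size(img);
    if h * w > 500 * 500     % one reading of "greater than 500*500 px"
        start_scale = 3;     % large images start one scale further zoomed out
    end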

This isn't terribly impressive, and now we must address the flimsiness of the justification: from inspection, larger images are actually more likely to contain smaller faces. With a faster HOG implementation (or SIFT features that did not produce near-singular matrices) and some way around the memory limit, a better result could likely be obtained. Our cascade shows a modest improvement in false positive rate (FPR), however:

Stage 1. TPR: 0.495, FPR: 0.002, TNR: 0.498, FNR: 0.005
Stage 2. TPR: 0.479, FPR: 0.004, TNR: 0.496, FNR: 0.021


Finally, the class photo, with detections of at least 0.5 confidence: