Face Detection
Purpose:
We train a classifier that, given an input image, can place bounding boxes around all faces in the image.
Algorithm:
Main Idea
Our pipeline operates as follows:
1. The first set of training data is image crops of faces. For each crop we extract a descriptor to represent it, saving these descriptors as "yes" instances of faces.
2. The next set of training data is a collection of images known to contain no faces. We randomly extract crops from these images and save the associated descriptors as "no" instances of faces.
3. Based on this data, we then train an SVM to decide whether an image crop is a face.
4. To detect faces in images, we use a sliding window at multiple scales to extract image crops, then submit these crops to our SVM.
5. If we run step 4 on labeled images, we can find hard negatives, add them to our negative examples, and retrain our SVM (repeating as many times as we see fit).
6. Finally, we run our adjusted SVM on a test set.
7. We then need to run non-maximum suppression on our output because otherwise we would get many positive results for a single face.
8. Finally, we can evaluate our results.
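The suppression in step 7 (non-maximum suppression) can be sketched as below. The box format `[x1, y1, x2, y2]` and the IoU threshold are assumptions for illustration, not the exact parameters used in my implementation:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes that overlap it too much (step 7 of the pipeline).

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) SVM confidences.
    Returns the indices of the boxes to keep.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()[::-1]  # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes that do not overlap box i too heavily.
        order = rest[iou <= iou_threshold]
    return keep
```

For example, two near-identical boxes on the same face collapse to the single higher-scoring one, while a distant box on another face survives.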
Representing a Crop with a Descriptor
Given a crop, we need to decide on the best way to describe the "face" feature. I experimented with two types of image features, SIFT and HOG. As we have discussed in class and seen in previous assignments, SIFT and HOG features are much better than raw image crops because examining features like gradients is far more informative than raw pixel values. For my final results, I used SIFT because it provided the means for significant speed-ups in mining hard negatives and evaluating an image at test time: with my SIFT implementation, we could extract all image features at once (for every sliding window at a given scale) rather than selecting out crops via the window and passing each individually to the feature extractor. You will see HOG results in my Linear vs. Non-Linear SVM analysis. I later optimized for SIFT, so while my HOG analysis is useful for comparing SVMs, it is not meaningful to compare my SIFT results to my HOG results, since I used much better parameters in my SIFT implementation.
Linear vs. Non-Linear SVM
We can use a linear or a non-linear SVM as the decision boundary for an image crop. The meaningful insight came from using my HOG descriptors, where the non-linear SVM showed a small but significant increase in performance (linear followed by non-linear):


Interestingly, the non-linear SVM did not work at all with my SIFT descriptors. While my linear SVM found accuracies around 0.78, the non-linear SVM simply did not work: with some parameters I tried, it thought almost every crop was a face, and with others it found no faces in the entire test set. (The outcome was also affected by whether or not I mined for hard negatives.) Obviously, the non-linear SVM should not perform this badly, and given more attention I likely could have tuned parameters to get its results at least in the neighborhood of my linear SVM's. The graphs are not shown here: the SIFT feature with the linear SVM appears in the results section, and the non-linear SVM was not graphed because its performance was so poor.
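The linear vs. non-linear comparison can be sketched with scikit-learn (a tooling assumption; the report does not name a library). Here `X` stands in for a matrix of HOG or SIFT descriptors and `y` for the face/non-face labels; the toy data below is synthetic, and scikit-learn's `C` plays the role of an inverse regularization weight:

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in descriptors: rows are feature vectors,
# labels are 1 (face) and 0 (non-face).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 16)),   # "non-face" cluster
               rng.normal(3.0, 1.0, (50, 16))])  # "face" cluster
y = np.array([0] * 50 + [1] * 50)

# Linear decision boundary vs. non-linear (RBF kernel) boundary.
linear = LinearSVC(C=1.0).fit(X, y)
nonlinear = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(linear.score(X, y), nonlinear.score(X, y))
```

On real descriptors, the RBF kernel's `C` and `gamma` must be tuned; badly chosen values can produce exactly the all-positive or all-negative behavior described above.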
Mining Hard Negatives
The additional mining of hard negatives dramatically improved the performance of my SIFT-descriptor SVM. Hard mining on my HOG descriptors made only small changes (on the order of 2-4%) rather than roughly doubling performance as it did for SIFT. The results are not shown for HOG.
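One round of hard-negative mining can be sketched as follows. The function name, toy data, and scikit-learn usage are illustrative assumptions; in the real pipeline the candidate descriptors would come from sliding-window crops of face-free images:

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_hard_negatives(svm, negative_descriptors, margin=0.0):
    """Return descriptors from face-free images that the current SVM
    wrongly scores as faces; these become extra 'no' training examples."""
    scores = svm.decision_function(negative_descriptors)
    return negative_descriptors[scores > margin]

# Toy stand-in data: "faces" near +2, "non-faces" near -2, 8 dimensions.
rng = np.random.default_rng(1)
pos = rng.normal(+2.0, 1.0, (40, 8))
neg = rng.normal(-2.0, 1.0, (40, 8))
X = np.vstack([pos, neg])
y = np.array([1] * 40 + [0] * 40)

svm = LinearSVC(C=1.0).fit(X, y)

# Fresh crops from images known to contain no faces; some fool the SVM.
candidates = rng.normal(0.0, 2.0, (200, 8))
hard = mine_hard_negatives(svm, candidates)

# Retrain with the hard negatives appended as extra negative examples.
X2 = np.vstack([X, hard])
y2 = np.concatenate([y, np.zeros(len(hard), dtype=int)])
svm = LinearSVC(C=1.0).fit(X2, y2)
```

Repeating the mine-and-retrain step corresponds to running step 5 of the pipeline multiple times.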

Best Results:
SIFT Descriptor, Linear SVM
For the SIFT descriptor with 2 iterations of mining hard negatives, I obtained modest results. The lambda parameter for my SVM was 1.0. The resulting precision-recall curve, with an AP of 0.776, is:

SIFT Descriptor, Linear SVM (Fewer Training Images)
Here is the same SIFT descriptor setup as above, but trained on 200 fewer training images. It is marginally worse, with an AP of 0.768 and curve:
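AP figures like those above can be computed from per-detection labels and SVM scores, for example with scikit-learn (a tooling assumption; the labels and scores below are synthetic stand-ins, where 1 means a detection matched a true face and 0 means a false positive):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy stand-ins: detection labels and the SVM confidences for each window.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.35, 0.1])

# Points of the precision-recall curve, swept over score thresholds.
precision, recall, _ = precision_recall_curve(y_true, scores)

# Average precision summarizes the curve as a single number.
ap = average_precision_score(y_true, scores)
print(round(ap, 3))
```

Plotting `recall` against `precision` gives curves of the kind referenced above.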
