Project 4: Face detection with a sliding window

Seth Goldenberg

Background

Face detection is one of the biggest successes in the field of computer vision, to the point that it is commonly implemented in most commercially available digital cameras. This project performs face detection using a sliding window: each section of the image is classified as a face or not a face. I tested a variety of parameters and used both linear and non-linear support vector machines to train my classifiers. More details on the assignment can be found here.

Pipeline

Feature Extraction

The features from crops of positive and negative training examples are fed into a support vector machine. For features, I used Pedro Felzenszwalb's implementation of HOG features (code and description available here). I originally tried to extract a single dense SIFT feature from each 36 x 36 pixel image crop, but preparing each image for feature extraction added a large overhead. I also tried extracting all the features from the full images instead of from the crops, but this approach failed. My pipeline ran much faster with the HOG features, even though both feature extractors used Mex-compiled code. For my HOG features, I used a bin size of 8.
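The original feature extraction uses Felzenszwalb's Mex-compiled MATLAB code, which isn't reproduced here. As a rough illustration of the idea, here is a simplified Python sketch of per-crop HOG computation: gradient orientation histograms over 8 x 8 cells of a 36 x 36 crop, without the block normalization or contrast-sensitive channels of the real implementation.

```python
import numpy as np

def hog_features(img, cell_size=8, n_bins=9):
    """Simplified HOG sketch: per-cell unsigned-orientation histograms,
    weighted by gradient magnitude. No block normalization."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    n_cy, n_cx = img.shape[0] // cell_size, img.shape[1] // cell_size
    feats = np.zeros((n_cy, n_cx, n_bins))
    bin_w = 180.0 / n_bins
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = np.s_[cy*cell_size:(cy+1)*cell_size, cx*cell_size:(cx+1)*cell_size]
            idx = np.minimum((ang[sl] // bin_w).astype(int), n_bins - 1)
            for b in range(n_bins):
                feats[cy, cx, b] = mag[sl][idx == b].sum()
    return feats.ravel()
```

With a cell size of 8, a 36 x 36 crop yields a 4 x 4 grid of cells and a 144-dimensional descriptor; the real Felzenszwalb features add normalization and more channels per cell.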

Training a classifier

I trained support vector machines using both linear and RBF kernels. Not surprisingly, the linear classifier was faster to train but did not perform as well on similarly sized training sets. For the initial training of the classifier, I used known positive face crops and random negatives mined from images containing no faces. I also experimented with mining hard negatives and iteratively using them to improve the classifier. My results section shows the advantages (and disadvantages) of doing this.
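The hard negative mining loop described above can be sketched as follows. This is not the original MATLAB code; it uses scikit-learn's LinearSVC and synthetic stand-in features, and the pool size and stage counts are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for HOG features: positives cluster at +1, negatives at -1
pos = rng.normal(1.0, 0.5, (200, 10))
neg_pool = rng.normal(-1.0, 0.5, (2000, 10))   # patches from face-free images

# Stage 1: train on known positives plus random negatives
neg = neg_pool[rng.choice(len(neg_pool), 200, replace=False)]
X = np.vstack([pos, neg])
y = np.r_[np.ones(200), np.zeros(200)]
clf = LinearSVC(C=1.0).fit(X, y)

# Stages 2-3: mine the highest-scoring negatives (false positives) and retrain
for stage in range(2):
    scores = clf.decision_function(neg_pool)
    hard = neg_pool[np.argsort(scores)[-200:]]   # most face-like negatives
    X = np.vstack([X, hard])
    y = np.r_[y, np.zeros(len(hard))]
    clf = LinearSVC(C=1.0).fit(X, y)
```

Each round appends the classifier's worst mistakes on face-free images to the negative set, which is the iterative scheme evaluated in the results below.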

In order to tune the SVM parameters, I tested the detector using only random negatives and the default detector parameters. Each trial took 3-4 minutes, short enough that I could search for the parameters that achieved the highest average precision and lowest false positive rate. For the linear SVM, I used a lambda of 80. For the RBF kernel, I used a lambda of 1.0 and a sigma of 2.0.
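A parameter sweep of this kind might look like the sketch below, again in scikit-learn rather than the original MATLAB. Note the parameterization differs: scikit-learn's SVC uses C (roughly inverse to lambda) and gamma, where gamma = 1/(2*sigma^2) is the usual mapping from an RBF sigma; the grid values here are illustrative, not the ones I used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in data for the feature vectors
X = np.vstack([rng.normal(1, 1, (150, 10)), rng.normal(-1, 1, (150, 10))])
y = np.r_[np.ones(150), np.zeros(150)]

best = None
for gamma in [0.05, 0.125, 0.5]:       # sigma = 2.0 corresponds to gamma = 0.125
    for C in [0.1, 1.0, 10.0]:
        acc = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=C), X, y, cv=3).mean()
        if best is None or acc > best[0]:
            best = (acc, gamma, C)
```

In the actual pipeline the selection criterion was average precision and false positive rate on the detector, not cross-validated accuracy.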

Evaluating on the test set

In order to achieve decent results, I had to tweak the detector's default parameters. I used a step size of 2, a scale factor of 1.25, and a start scale of 1. This improved my results, though it increased the pipeline's runtime to 3-4 hours. Non-maximum suppression is used to remove redundant detections.
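The multi-scale window sweep and the greedy non-maximum suppression can be sketched as below. This is a hypothetical Python rendering of the detector's window logic, not the course starter code; box format and IoU threshold are my own choices.

```python
import numpy as np

def sliding_windows(h, w, win=36, step=2, scale=1.25, start_scale=1.0):
    """Enumerate (x, y, size) square windows over an image pyramid,
    with coordinates expressed in the original image."""
    boxes, s = [], start_scale
    while win * s <= min(h, w):
        size = int(round(win * s))
        stride = max(1, int(round(step * s)))
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                boxes.append((x, y, size))
        s *= scale
    return boxes

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        xi, yi, si = boxes[i]
        survivors = []
        for j in order[1:]:
            xj, yj, sj = boxes[j]
            ix = max(0, min(xi + si, xj + sj) - max(xi, xj))
            iy = max(0, min(yi + si, yj + sj) - max(yi, yj))
            inter = ix * iy
            iou = inter / (si * si + sj * sj - inter)
            if iou <= iou_thresh:
                survivors.append(j)
        order = np.array(survivors, dtype=int)
    return keep
```

The step size of 2 explains the runtime: halving the step roughly quadruples the number of windows classified at each scale.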

Results

Below are some of the precision-recall curves and positive-negative statistics from some of my trials. All results are based on a training size of 1000 positive and 1000 negative image examples unless otherwise noted.
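One note on reading the tables: in every row the four rates sum to 1, which suggests they are counts normalized by the total number of examples rather than the usual per-class rates. A sketch of that computation, under that assumption:

```python
import numpy as np

def rates(y_true, y_pred):
    """TPR/FPR/TNR/FNR as fractions of ALL examples (so they sum to 1).
    Assumption: this matches the normalization in the tables below, since
    each reported row sums to 1; the standard per-class rates would not."""
    n = len(y_true)
    tp = np.sum((y_true == 1) & (y_pred == 1)) / n
    fp = np.sum((y_true == 0) & (y_pred == 1)) / n
    tn = np.sum((y_true == 0) & (y_pred == 0)) / n
    fn = np.sum((y_true == 1) & (y_pred == 0)) / n
    return tp, fp, tn, fn
```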

No hard negative mining, Linear SVM
Stage 1. TPR: 0.479, FPR: 0.011, TNR: 0.489, FNR: 0.021

No hard negative mining, RBF-kernel SVM
Stage 1. TPR: 0.493, FPR: 0.002, TNR: 0.498, FNR: 0.007

2 rounds hard negative mining, Linear SVM
Stage 1. TPR: 0.479, FPR: 0.013, TNR: 0.487, FNR: 0.021
Stage 2. TPR: 0.487, FPR: 0.074, TNR: 0.425, FNR: 0.013
Stage 3. TPR: 0.488, FPR: 0.060, TNR: 0.440, FNR: 0.012

2 rounds hard negative mining, RBF-kernel SVM
Stage 1. TPR: 0.492, FPR: 0.001, TNR: 0.499, FNR: 0.008
Stage 2. TPR: 0.492, FPR: 0.009, TNR: 0.491, FNR: 0.007
Stage 3. TPR: 0.496, FPR: 0.013, TNR: 0.487, FNR: 0.004

1 round hard negative mining, Training Size = ~10000/category, Linear SVM
Stage 1. TPR: 0.408, FPR: 0.006, TNR: 0.570, FNR: 0.016
Stage 2. TPR: 0.391, FPR: 0.031, TNR: 0.568, FNR: 0.011

No hard negative mining, Training Size = ~10000/category, Linear SVM
Stage 1. TPR: 0.403, FPR: 0.007, TNR: 0.574, FNR: 0.016

2 rounds hard negative mining, Training Size = 5000/category, RBF-kernel SVM
Stage 1. TPR: 0.501, FPR: 0.002, TNR: 0.491, FNR: 0.005
Stage 2. TPR: 0.494, FPR: 0.006, TNR: 0.494, FNR: 0.006
Stage 3. TPR: 0.496, FPR: 0.009, TNR: 0.491, FNR: 0.004

No hard negative mining, Training Size = 5000/category, RBF-kernel SVM
Stage 1. TPR: 0.501, FPR: 0.002, TNR: 0.491, FNR: 0.005

And of course, here is the best detector of the bunch run on the test case of the class photo.

Confidence threshold = 1.2
20 out of 30 faces detected, 2 false positives
Confidence threshold = 1.3
19 out of 30 faces detected, 0 false positives
Confidence threshold = 1.2
3 out of 30 faces detected, 1 false positives
Confidence threshold = 1.3
2 out of 30 faces detected, 0 false positives

Discussion

Surprisingly, mining hard negatives had little effect on the results; sometimes it even hurt the detector's performance. Even with strong features like HOG, I was surprised I didn't achieve better results. There are far too many parameters to tune in this pipeline; I may simply not have found the right combination of features and parameters to make a good classifier.

Not surprisingly, the RBF-kernel SVM fares much better than the linear one. While mining hard negatives provided minimal gains for the linear classifier, it actually hurt the performance of the RBF-kernel classifier. This was the case with both the 1000 and 5000 examples-per-category training sets.

From the class images, it's easy to tell that the detector failed on smaller faces. I probably could have tuned the parameters down further to find more of these faces. There were also still a fair number of false positives. Judging from the precision-recall curves, my detector was able to detect some faces with high recall. Still, the best average precision I achieved was 0.775.