Overview
The goal of this project was to implement face detection. Image crops at many scales are extracted from each image using a sliding window, SIFT features are extracted from each crop, and the features are fed to a linear or non-linear SVM classifier. Finally, the classifier is run against a test set of images containing no faces; the resulting false positives are used to further train the classifier.
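The multi-scale sliding window can be sketched as follows. This is a minimal illustration, not the project's code: the crop size, step, and scale progression here are assumptions chosen for the sketch.

```python
import numpy as np

def sliding_window_crops(img, crop_size=36, step=4, start_scale=1,
                         scale_factor=1.5):
    """Extract square crops at multiple scales by repeatedly
    downsampling the image and sliding a fixed-size window."""
    crops, boxes = [], []
    scale = start_scale
    while min(img.shape[0] // scale, img.shape[1] // scale) >= crop_size:
        small = img[::scale, ::scale]  # crude nearest-neighbor downsample
        for y in range(0, small.shape[0] - crop_size + 1, step):
            for x in range(0, small.shape[1] - crop_size + 1, step):
                crops.append(small[y:y + crop_size, x:x + crop_size])
                # Box in original-image coordinates: (x, y, side length).
                boxes.append((x * scale, y * scale, crop_size * scale))
        scale = int(round(scale * scale_factor))
    return crops, boxes
```

A larger start_scale skips the finest windows entirely, which is why it trades recall (and AP) for speed in the results below.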
Algorithm
Crops to features
In this step, image crops are converted to feature vectors. I chose VLFeat's vl_dsift function to extract SIFT descriptors, using a bin size and step size of 10 and taking the first descriptor returned for each crop. I also enabled the "fast" option to vl_dsift, which modestly improved runtime at no noticeable cost to accuracy.
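The idea of a crop-to-descriptor step can be sketched as below. Note this is a hypothetical stand-in, not vl_dsift: it builds a SIFT-like grid of gradient-orientation histograms (4×4 cells × 8 orientations = 128 dimensions) so the sketch stays self-contained.

```python
import numpy as np

def crop_to_descriptor(crop, n_bins=8, grid=4):
    """Illustrative stand-in for a SIFT descriptor: a grid of
    magnitude-weighted gradient-orientation histograms."""
    gy, gx = np.gradient(crop.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # orientation in [0, 2*pi)
    h, w = crop.shape
    feat = []
    for i in range(grid):
        for j in range(grid):
            ys = slice(i * h // grid, (i + 1) * h // grid)
            xs = slice(j * w // grid, (j + 1) * w // grid)
            hist, _ = np.histogram(ang[ys, xs], bins=n_bins,
                                   range=(0, 2 * np.pi),
                                   weights=mag[ys, xs])
            feat.append(hist)
    feat = np.concatenate(feat)
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat
```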
get_hard_negatives
In this step, an image known to contain no faces is fed into the detector. Any crops classified as faces are therefore false positives, and these are returned to further train the classifier. The code is very similar to the ordinary detection code, except that non-maximum suppression is not applied.
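The mining step can be sketched as follows, assuming the crops from a face-free image have already been converted to feature vectors and the linear SVM is given by a weight vector and bias (the function name mirrors the report; the implementation is an assumption):

```python
import numpy as np

def get_hard_negatives(neg_feats, w, b, max_hard=100):
    """Score every crop feature from a face-free image with a linear
    model (score = w.x + b); any crop scored as a face is by
    construction a false positive. No non-maximum suppression is
    applied, so every false-positive crop is kept (up to max_hard)."""
    scores = neg_feats @ w + b
    hard_idx = np.where(scores > 0)[0]
    # Return the hardest (highest-scoring) negatives first.
    hard_idx = hard_idx[np.argsort(-scores[hard_idx])][:max_hard]
    return neg_feats[hard_idx], scores[hard_idx]
```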
Optimizations
As an optimization, I wrote the get_img_feats function, which computes SIFT features over the entire image in one pass. It is used in place of calling get_img_crops followed by crops2features.
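The idea behind this optimization can be sketched as computing one descriptor per grid position over the whole image, rather than re-extracting features crop by crop. The function name follows the report; the descriptor here is a hypothetical two-number stand-in (patch mean and standard deviation), not SIFT.

```python
import numpy as np

def get_img_feats(img, step=10, patch=10, descriptor=None):
    """Compute one descriptor per grid cell over the whole image in a
    single pass, avoiding redundant per-crop extraction."""
    if descriptor is None:
        # Hypothetical stand-in descriptor: mean and std of the patch.
        descriptor = lambda p: np.array([p.mean(), p.std()])
    h, w = img.shape
    feats, locs = [], []
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            feats.append(descriptor(img[y:y + patch, x:x + patch]))
            locs.append((y, x))
    return np.array(feats), np.array(locs)
```

Because overlapping crops share grid positions, each position's descriptor is computed exactly once here, whereas the per-crop path recomputes it for every crop that covers it.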
Detection Parameters
I tuned the following parameters to try to get the best average precision:
- Lambda for the linear SVM; Lambda and Sigma for the non-linear SVM (RBF kernel)
- Number of negative examples
- Start scale
- Number of hard-negative mining rounds (stages; 1 stage means only random negatives were used)
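The staged training loop these parameters control can be sketched as below. The training function here is a hypothetical stand-in (a mean-difference linear model, not the project's SVM), used only to make the loop self-contained.

```python
import numpy as np

def train_centroid(pos, neg):
    """Hypothetical stand-in for SVM training: a mean-difference
    linear model, just enough to exercise the staged loop."""
    w = pos.mean(axis=0) - neg.mean(axis=0)
    b = -w @ (pos.mean(axis=0) + neg.mean(axis=0)) / 2.0
    return w, b

def train_with_stages(pos_feats, rand_neg_feats, mine_fn, stages=2,
                      train_fn=train_centroid):
    """Stage 1 trains on positives plus random negatives; each later
    stage mines hard negatives with the current model and retrains."""
    neg_feats = rand_neg_feats
    w, b = train_fn(pos_feats, neg_feats)
    for _ in range(stages - 1):
        hard = mine_fn(w, b)  # false positives from face-free images
        if len(hard):
            neg_feats = np.vstack([neg_feats, hard])
        w, b = train_fn(pos_feats, neg_feats)
    return w, b
```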
Results
Table of results for some parameter values:
Lambda | Sigma | num_neg_examples | start_scale | stages | AP |
---|---|---|---|---|---|
1 | N/A | 5000 | 3 | 5 | 0.276 |
10 | N/A | 2000 | 3 | 1 | 0.385 |
1 | N/A | 5000 | 3 | 2 | 0.390 |
100 | N/A | 2000 | 3 | 1 | 0.393 |
100 | N/A | 5000 | 3 | 1 | 0.406 |
1 | N/A | 5000 | 3 | 1 | 0.425 |
10 | 350 | 1000 | 1 | 1 | 0.725 |
1 | N/A | 5000 | 1 | 2 | 0.735 |
10 | 350 | 1000 | 1 | 2 | 0.766 |
The first dramatic improvement came from using SIFT descriptors as features, which raised average precision from barely 0.05 to around 0.393. By comparison, mining hard negatives provided a modest improvement, as did switching to a non-linear SVM. The biggest gain beyond SIFT descriptors came from decreasing start_scale, which was used to achieve the best average precision of 0.766 with these parameters:
- Features: SIFT
- Classifier: Non-linear SVM
- Lambda: 10
- Number of examples: 1000
- Start scale: 1
- Step size: 4
- Number of stages: 2
Sample Detections
Although there are some false positives in the photo above, most of the faces are detected despite occlusions such as sunglasses and subjects not facing the camera directly.
Sample False Positives
Here are some examples of false positives. An upside-down 9 and a G were detected as faces in two of the images. In general, false positives tended to be roundish shapes.