CS143
Computer Vision
Project 4 - Face Detection Using a Sliding Window


Scott Newman (senewman)
November 14, 2011

Project Overview

This project attempts to recognize faces in an image using a sliding window. The goal is to determine the location and size of human faces; this is thus a specific instance of general object detection, a common and important task in computer vision. Face detection is used in many fields - law enforcement (CSI ZOOM+ENHANCE), social networking (Facebook tagging), etc.

Algorithm

The algorithm works by using a so-called "sliding window" to detect faces with an SVM classifier. The sliding window acts as its name describes - a window slides over the image repeatedly at varying scales, and at each position and scale a binary classifier decides whether the window contains a face.
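The scanning loop described above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual MATLAB code; `score_window` stands in for the trained SVM, and the image resizing at each scale is elided.

```python
# Hypothetical sketch of the sliding-window loop: scan an image at several
# scales with a fixed 36x36 window and collect the windows that a scoring
# function (standing in for the trained SVM) flags as faces.

def sliding_window_detect(image, score_window, win=36, step=6,
                          scale_factor=1.5, threshold=0.0):
    """Return (row, col, scale, score) for windows scoring above threshold."""
    detections = []
    scale = 1.0
    rows, cols = len(image), len(image[0])
    while rows // scale >= win and cols // scale >= win:
        # Dimensions of the image at this scale (a real implementation
        # would resize the image itself on each iteration).
        h, w = int(rows / scale), int(cols / scale)
        for r in range(0, h - win + 1, step):
            for c in range(0, w - win + 1, step):
                s = score_window(image, r, c, scale)
                if s > threshold:
                    detections.append((r, c, scale, s))
        scale *= scale_factor
    return detections
```

Shrinking the step and starting at a finer scale both increase the number of windows scored, which is the speed/recall trade-off discussed in the results below.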

The pipeline for detecting faces is as follows:
  1. Sample the initial training data. 36x36 pixel patches are used as the standard.
  2. Extract features from data we know a priori to be faces and non-faces (training data) using some type of feature descriptor. I used SIFT descriptors from the VLFeat library.
  3. Train a linear SVM on the features from step (2) (as outlined in Dalal and Triggs, 2005), using randomly mined negatives at first to give a rough baseline. The process is then repeated with hard negative mining: the classifier is run on images known to contain no faces, its false positives are added as negative examples, and the classifier is retrained. This greatly improves its precision when run on test images.
  4. Once the classifier is trained, run it on test images and evaluate its detections. Non-maximum suppression is performed to get rid of duplicate detections.
The data used is Caltech Web Faces for positive training (i.e., images with faces) and a series of resources for negative training (i.e., images without faces), such as the SUN Database.
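The non-maximum suppression in step (4) can be sketched as a greedy pass over score-sorted detections: keep the best box, then discard any remaining box that overlaps it too much. This is an illustrative Python sketch, not the project's actual implementation; the 0.3 overlap threshold is an assumed value.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def non_max_suppression(boxes, scores, overlap_thresh=0.3):
    """Greedily keep the highest-scoring boxes, dropping near-duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= overlap_thresh for j in keep):
            keep.append(i)
    return keep
```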

The pipeline above can also be run with a more powerful yet slower non-linear classifier to give better results.
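A non-linear SVM replaces the linear dot product with a kernel. Assuming a Gaussian (RBF) kernel is what is used here, a minimal sketch shows why the bandwidth sigma matters: with too small a sigma, the off-diagonal kernel entries vanish and the kernel matrix degenerates toward the identity.

```python
import numpy as np

# Illustrative sketch of a Gaussian (RBF) kernel matrix over feature
# vectors X (one row per example). Assumes this is the kernel family
# used by the non-linear classifier; not the project's actual code.

def rbf_kernel(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Clamp tiny negative values caused by floating-point error.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))
```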

Implementation and Results

I performed a few small optimizations and tweaks to make the code run faster and give higher precision. Instead of throwing out old features on each iteration of the pipeline, I kept them. I sped up both the linear and non-linear classifiers by nearly 90x by writing get_img_feats, which extracts features from the entire image at once instead of from individual crops. This collapses many small operations into one larger one and eliminates the overhead accrued from many function calls. This gave me satisfactory results, and I didn't need to extract more than one feature per crop.
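The idea behind the whole-image extraction can be sketched as follows. This is an illustrative Python sketch, not the actual get_img_feats: the "feature" here is a toy cell-average descriptor rather than SIFT, but the structure is the same - compute a dense feature grid once per image, then read each window's descriptor out of the grid instead of re-running the extractor on every crop.

```python
import numpy as np

def dense_cell_features(image, cell=6):
    """Average each cell x cell block once for the whole image."""
    h, w = image.shape
    gh, gw = h // cell, w // cell
    blocks = image[:gh * cell, :gw * cell].reshape(gh, cell, gw, cell)
    return blocks.mean(axis=(1, 3))

def window_feature(grid, r, c, win=36, cell=6):
    """Read a win x win window's descriptor out of the precomputed grid."""
    n = win // cell
    return grid[r // cell:r // cell + n, c // cell:c // cell + n].ravel()
```

Reading a window's feature becomes an array slice, so the per-window cost no longer includes a full descriptor computation.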

Overall, though, results were highly dependent on fine tuning of the parameters for feature extraction and sliding-window scaling. First, I set the start scale for the sliding window to 1 - this helped me catch small faces that a bigger window might not have caught. I also lowered the scale factor to 1.5, so the window wouldn't blow up as much after each iteration. I used the default lambda = 1.0 for my linear classifier and 1000 images for hard negative (and random) mining. For feature extraction, I used a bin size of 10 and a step size of 36 (one feature per patch). These are my results for the linear classifier with hard negative mining:
Lowering the start scale had by far the most impact on my results for both classifiers, at the cost of speed. With the parameters and optimizations outlined above, the pipeline took approximately 45 minutes to run. Using the default start scale and scaling factor, it took around 15 minutes, with AP hovering around [0.4, 0.5]. The AP of my linear classifier without hard negative mining was worse:


The only parameters I changed for my non-linear classifier were the number of images, lambda, and sigma. I used 300 images, lambda = 1.0 (the default), and sigma = 500. The default sigma yielded a degenerate kernel - I didn't get decent kernels except for large values of sigma. This took the better part of 2 hours to run, yet yielded higher precision than the linear classifier (and with less than 1/3 of the mined images). Results are as follows for the non-linear classifier with hard negative mining:
As with the linear classifier, the AP for the non-linear classifier without hard negative mining was worse, though slightly less so:
Some example results (individual faces):






The last image's face wasn't detected with the linear classifier but was detected with the non-linear classifier.

Thanks for reading!