Face Detection with a Sliding Window

Kyle Cackett

Face detection with a sliding window is performed by training a classifier on known examples of faces and non-faces, then using that classifier to separate faces from non-faces in a test set.  To train the classifier we first transform known face and non-face examples into a high-dimensional space.  This transformation is designed to separate the faces from the non-faces (that is, we hope the faces will cluster away from the non-faces in the higher-dimensional representation).  We then feed the high-dimensional coordinates of the faces and non-faces, along with a label vector, to an SVM, which uses this information to construct a boundary between the two classes; new inputs to the face detector are categorized as “face” or “non-face” according to which side of the boundary they fall on.
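In outline, the pipeline looks like the sketch below.  This is a minimal illustration rather than the project code: it assumes scikit-learn's LinearSVC, and extract_features is a hypothetical helper that maps a crop to its high-dimensional feature vector.

import numpy as np
from sklearn.svm import LinearSVC

def train_face_classifier(face_crops, nonface_crops, extract_features):
    # Transform every crop into the high-dimensional feature space.
    X = np.array([extract_features(c) for c in face_crops + nonface_crops])
    # Label vector: +1 for faces, -1 for non-faces.
    y = np.array([1] * len(face_crops) + [-1] * len(nonface_crops))
    # The SVM constructs a boundary between the two classes; new windows
    # are classified by which side of the boundary they fall on.
    return LinearSVC(C=1.0).fit(X, y)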

Representation

I experimented with several basic representations for image patches before settling on the recommended SIFT descriptors.  I tried using the raw image patch itself and the gradient of the image patch, but these representations were not expressive enough to adequately separate the face crops from the non-face crops; I achieved accuracies of less than 5% with them.  I therefore used SIFT features as the high-dimensional representation of my crops: for each window I extracted a single 128-dimensional SIFT vector that describes that window.
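As an illustration, one descriptor per window can be computed as follows.  This is a sketch assuming OpenCV 4.4 or later (where SIFT_create lives in the main module), not necessarily the library used for the project.

import cv2

def window_descriptor(patch):
    # patch: square grayscale crop as an 8-bit numpy array (e.g. 36x36).
    h, w = patch.shape
    # One keypoint centered on the patch and sized to cover it, so that
    # compute() returns exactly one 128-dimensional SIFT descriptor.
    keypoint = [cv2.KeyPoint(w / 2.0, h / 2.0, float(min(h, w)))]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(patch, keypoint)
    return descriptors[0]  # shape (128,)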

Mining Hard Negatives

Initially the classifier uses known crops of faces as positive examples and random crops from images containing no faces as negative examples.  To better find the boundary between faces and non-faces in the high-dimensional space, we must search for negative examples that lie as close to the boundary as possible (this produces a much more accurate classifier).  To find these hard negatives we first create a classifier using random non-face crops, then run it on a set of images without faces and use all the examples the detector incorrectly identifies as faces to train a new classifier (with the hope that the new classifier will better identify the boundary between faces and non-faces).  This process can be repeated any number of times.  When mining hard negatives I executed this process only once and used 2000 false-positive crops to train the new classifier.
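One mining round can be sketched as follows; run_detector is a hypothetical helper that returns every crop the current classifier labels as a face.

def mine_hard_negatives(classifier, nonface_images, run_detector, max_mined=2000):
    # Every detection in a face-free image is a false positive, and
    # therefore a negative example lying near the decision boundary.
    hard_negatives = []
    for image in nonface_images:
        hard_negatives.extend(run_detector(classifier, image))
        if len(hard_negatives) >= max_mined:
            break
    return hard_negatives[:max_mined]

The mined crops are then appended to the negative training set and the classifier is retrained on the enlarged data.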

The Classifier

SVM classifiers can be linear or non-linear.  Linear classifiers use a hyperplane in the high-dimensional space as the boundary between faces and non-faces.  Non-linear classifiers can have arbitrarily complex boundaries, created by applying operations to the training data via a kernel matrix before constructing the classifier.  However, non-linear classifiers are much more expensive to compute, so it is impractical to use all the available training data to construct one.  I experimented with both linear and non-linear classifiers; results are included below.
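The contrast can be sketched with scikit-learn (an assumption; any SVM library with linear and RBF kernels would do, and note that sklearn's C is an inverse regularization strength rather than the lambda reported below).

import numpy as np
from sklearn.svm import LinearSVC, SVC

X = np.random.randn(200, 128)                   # stand-in SIFT descriptors
y = np.where(np.random.randn(200) > 0, 1, -1)   # stand-in labels

# Linear SVM: the boundary is a hyperplane w.x + b = 0.
linear_clf = LinearSVC(C=1.0).fit(X, y)

# Non-linear SVM with an RBF kernel, k(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
# Training cost grows roughly quadratically with the number of examples,
# which is why using all available training data is impractical here.
sigma = 1000.0
rbf_clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)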

Results

All results shown below use SIFT representations for image crops.  I first experimented with linear classifiers, initially training with 1000 positive and 1000 negative examples (2000 training crops in total).  With a lambda of 1.0 and the default detector parameters I achieved an accuracy of .241 (shown below).

[Figure: Linear-Lambda=1-1_Stage-No_Max_Neg_Per_Image.jpg]

The classifier used to generate the curve above was trained by taking the first 1000 negative examples from the set of images without faces.  To diversify my negative examples, I then limited the number of negative training crops taken from each scene to 10 (this sampling scheme is sketched after the figure).  Leaving all other parameters the same, I achieved an accuracy of .308 (shown below).

[Figure: Linear-Lambda=1-1_Stage-Max=10_Crops_Per_Image.jpg]
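The per-scene cap amounts to the following sampling scheme (a sketch; random_crop is a hypothetical helper returning one random window from an image).

def sample_negatives(nonface_images, random_crop, max_per_image=10):
    # Capping the crops taken from any one scene diversifies the
    # negative set instead of exhausting it on the first few images.
    crops = []
    for image in nonface_images:
        crops.extend(random_crop(image) for _ in range(max_per_image))
    return crops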

I then added a second training stage in which I mined hard negatives using the classifier created above.  By adding the hard negatives found during mining to the existing training data, I trained a linear classifier that achieved an accuracy of .395, leaving all other parameters the same.

[Figure: Linear-Lambda=1-2_Stage-Max=10_Crops_Per_Image-Memory-Reduced_con_Thresh.jpg]

Finally, I trained a non-linear classifier.  I used an RBF kernel with a lambda of 5 and a sigma of 1000; I selected sigma experimentally so that the kernel matrix would contain a variety of values, not just zeros and ones (a sketch of this check follows the figure).  I used one training stage and a total of 2000 positive and 2000 negative training crops (4000 crops in total).  When running the detector on the test set I lowered the step size to 2 and the starting scale to 2 but did not adjust the scale factor or the confidence level.  With these parameters I achieved an accuracy of .720 (shown below).  By adjusting the scale factor I could presumably have obtained even greater accuracy.

[Figure: NonLinear.jpg]
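The sigma check mentioned above can be reproduced with a few lines of NumPy; the descriptors here are random stand-ins rather than real SIFT vectors.

import numpy as np

def rbf_kernel_matrix(X, sigma):
    # Pairwise squared distances via ||x - z||^2 = x.x + z.z - 2 x.z,
    # clipped at zero to guard against small negative rounding errors.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.random.randn(100, 128)      # stand-in for SIFT descriptors
K = rbf_kernel_matrix(X, sigma=1000.0)
print(K.min(), K.mean(), K.max())  # entries should span a range, not sit at 0 or 1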