Face detection with a sliding window

Feature representation

I represented each image patch with local histograms of oriented gradients (HoG): the patch is divided into cells, a gradient-orientation histogram is computed for each cell, and the concatenated histograms are normalized. Dalal and Triggs (2005) used more local normalization schemes, but global normalization is faster and did not seem to make much of a difference.
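A minimal sketch of this descriptor, assuming unsigned gradient orientations, 6x6-pixel cells, 9 orientation bins, and a single global L2 normalization (the report does not state its exact cell size or bin count, so these are placeholder values):

```python
import numpy as np

def hog_descriptor(patch, cell=6, bins=9):
    """Globally normalized HoG sketch: per-cell gradient-orientation histograms,
    concatenated and L2-normalized once over the whole descriptor."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation in [0, 180)
    h, w = patch.shape
    hists = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            cell_hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            hists.append(cell_hist)
    v = np.concatenate(hists)
    return v / (np.linalg.norm(v) + 1e-6)  # global (not block-local) normalization

# A 36x36 patch yields 6x6 cells * 9 bins = a 324-dimensional descriptor.
desc = hog_descriptor(np.random.rand(36, 36))
```

Replacing the single global normalization with the overlapping block-local normalization of Dalal and Triggs would only change the last line's scope.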

Using linear vs. nonlinear classifiers

Linear classifiers classified patches faster, but they are poor at extracting meaning from impoverished or naive data representations, such as raw pixel values. Nonlinear classifiers were more computationally demanding. Training each was fast and successful: training accuracy fell between .93 and .95, and could be brought above .99 by reducing lambda (i.e., weakening the regularization).
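The linear case can be sketched as a hinge-loss SVM trained by sub-gradient descent; the report does not state its solver, so this is an assumed setup. Here lambda plays the role of the regularization weight, and the bias (b0) shift is applied at detection time to trade precision for recall:

```python
import numpy as np

def train_linear_svm(X, y, lam=1.0, epochs=50, lr=1e-3):
    """Sub-gradient descent on (lam/2)||w||^2 + mean(hinge loss).
    lam is the regularization weight the report calls lambda."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                       # examples inside the margin
        grad_w = lam * w - (y[mask, None] * X[mask]).sum(axis=0) / n
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b, shift=0.0):
    # shifting the bias (b0) moves the decision boundary at detection time
    return np.sign(X @ w + b + shift)
```

A nonlinear SVM would replace the inner products with a kernel evaluation, at a correspondingly higher cost per patch.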

A linear SVM trained on 1000 positive and 1000 random negative examples, with lambda = 100 and a bias (b0) shift of 0.12:



A nonlinear SVM trained on 1000 positive and 1000 random negative examples, with lambda = 100 and a bias (b0) shift of 0.09:

Mining hard negative examples

After a classifier was initially trained, it could be used to mine non-face patches that it found difficult to classify correctly. In practice, mining hard negatives had little impact on the AP: fundamental limitations in the data representation prevented the classifiers from gaining anything from the new data.
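The mining step can be sketched as follows; `clf_score` (a function returning decision values) and the score threshold are placeholders, since the report does not give them:

```python
import numpy as np

def mine_hard_negatives(clf_score, negatives, threshold=0.0, k=200):
    """Score non-face patches with the current model and keep the ones it
    wrongly calls faces (false positives = hard negatives), hardest first."""
    scores = clf_score(negatives)
    mask = scores > threshold
    hard, hard_scores = negatives[mask], scores[mask]
    order = np.argsort(hard_scores)[::-1]        # highest (most confident) first
    return hard[order[:k]]

# Typical loop (5 rounds in the report): train, mine hard negatives from
# face-free scene images, add them to the negative set, and retrain.
```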

A linear SVM trained on 1000 positive and 1000 random negative examples, with lambda = 100 and a bias (b0) shift of 0.12, then retrained on mined hard negatives for 5 iterations: