In this project, we want to use a sliding-window model to search over an image and detect the presence (and, as a result, the location) of a face. The way it works is by training a classifier (in my case, a support vector machine) in a smart way, in the spirit of Dalal-Triggs 2005, where a linear classifier is trained on strong features such as SIFT or HOG descriptors, and Viola-Jones 2001, where a non-linear cascade is trained on simpler descriptors. In addition, we mine hard negative patches (i.e., those which our classifier tends to give a false positive result for). For my project, I chose HOG as my workhorse descriptor, using a MEX implementation kindly given to me by Paul Sastrasinh.
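For concreteness, here is a minimal sketch of that training setup in Python. My actual pipeline used the MEX HOG code mentioned above; here skimage's `hog` and scikit-learn's `LinearSVC` stand in for it, and the patch size and HOG parameters are just illustrative:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    """Compute a HOG descriptor for each fixed-size grayscale patch."""
    return np.array([
        hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for p in patches
    ])

def train_linear_detector(pos_patches, neg_patches, C=1.0):
    """Train a linear SVM on HOG features of face and non-face patches.

    pos_patches / neg_patches are assumed to be lists of equally sized
    grayscale arrays (e.g. 36x36 crops).
    """
    X = np.vstack([hog_features(pos_patches), hog_features(neg_patches)])
    y = np.concatenate([np.ones(len(pos_patches)), -np.ones(len(neg_patches))])
    clf = LinearSVC(C=C)
    clf.fit(X, y)
    return clf
```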
The testing portion of the project works by taking an image and examining patches at a fixed step size (4 pixels, in my case), and doing so over multiple resolutions. There is a significant time/accuracy tradeoff in this parameter: reducing the step size gives much better results but slows down evaluation greatly.
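A rough sketch of that multi-scale scan, reusing the hypothetical `hog_features` helper from the training sketch above; the scale schedule and confidence threshold are placeholders, and non-maximum suppression is omitted:

```python
from skimage.transform import rescale

def sliding_window_detections(image, clf, win=36, step=4,
                              scale_factor=0.9, min_scale=0.25, threshold=0.0):
    """Scan a grayscale image with a fixed-size window at several resolutions.

    Returns (x, y, scale, confidence) tuples, with x and y mapped back to
    original-image coordinates, for windows scoring above `threshold`.
    """
    detections = []
    scale = 1.0
    img = image
    while scale >= min_scale and min(img.shape[:2]) >= win:
        for y in range(0, img.shape[0] - win + 1, step):
            for x in range(0, img.shape[1] - win + 1, step):
                patch = img[y:y + win, x:x + win]
                feat = hog_features([patch])            # helper from the training sketch
                conf = clf.decision_function(feat)[0]   # signed distance to the hyperplane
                if conf > threshold:
                    detections.append((x / scale, y / scale, scale, conf))
        scale *= scale_factor
        img = rescale(image, scale)                     # rescan at a coarser resolution
    return detections
```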
Basic Results
Let's start with the linear classifier: you can see that I got fair results, with an average precision of around 0.4. We can do better, though.
Non-Linear Kernel
A linear SVM is pretty fast, but the range of decision boundaries it can model is much more limited than that of a non-linear kernel. The radial basis function (RBF) kernel can draw decision boundaries in a potentially infinite-dimensional feature space, so it's a good one to go with. Along with the inverse-slack variable (lambda), I played around with many values of the smoothing gamma term in the RBF, specifically different powers of 10 (exponents -1, 1, 2, and 3), although the classifier became incredibly averse to false positives once gamma grew larger than 10. I got decent results with gamma = 1 and gamma = 10 (shown below). There was only a slight improvement; my hypothesis is that the hyperparameters needed finer tuning than my powers-of-10 search provided.
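For reference, this is roughly what that hyperparameter sweep looks like with scikit-learn's `SVC`; the grid mirrors my powers-of-10 search, and `X_train`/`y_train` are assumed to be the HOG features and labels from the earlier training sketch:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# X_train, y_train: HOG features and +/-1 labels, as in the earlier sketch.
param_grid = {
    "gamma": [0.1, 1.0, 10.0, 100.0, 1000.0],  # the powers of 10 described above
    "C": [0.1, 1.0, 10.0],                     # slack penalty (roughly the inverse of lambda)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3,
                      scoring="average_precision")
search.fit(X_train, y_train)
rbf_clf = search.best_estimator_
```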
Mining Hard Negatives
My approach to mining hard negatives for each image was to sort the "detected" patches by confidence and take a subset of them in decreasing order (i.e., the ones the classifier was most confidently wrong about). Retraining on these should let the classifier discriminate more carefully. As you can see below, I got improvements for both the linear (left) and RBF (right) cases.
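A sketch of that mining step, reusing the hypothetical `sliding_window_detections` helper from above. Since the images passed in contain no faces, every detection is a false positive by construction:

```python
from skimage.transform import rescale

def mine_hard_negatives(clf, non_face_images, per_image=10, win=36, step=4):
    """Collect the highest-confidence false positives from face-free images."""
    hard_negatives = []
    for img in non_face_images:
        # every detection here is a false positive, since these images have no faces
        dets = sliding_window_detections(img, clf, win=win, step=step)
        dets.sort(key=lambda d: d[3], reverse=True)   # most confident mistakes first
        for x, y, scale, conf in dets[:per_image]:
            scaled = rescale(img, scale) if scale != 1.0 else img
            # map back to the resolution at which the detection fired
            xs, ys = int(round(x * scale)), int(round(y * scale))
            hard_negatives.append(scaled[ys:ys + win, xs:xs + win])
    return hard_negatives

# Retrain with the mined patches appended to the negative set, e.g.:
# neg_patches += mine_hard_negatives(clf, non_face_images)
# clf = train_linear_detector(pos_patches, neg_patches)
```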
Cascading
I chose to follow a cascading architecture similar to that of Wu et al. 2008. Their cascade works by successively training weak classifiers, but biasing them to avoid false negatives at (almost) all costs. The idea is that because faces are so rare among image patches, each level of the classifier can cheaply reject the most obviously non-face patches and pass the less obvious cases on to the next level.
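To make the idea concrete, here is a minimal sketch of how a patch's feature vector would pass through such a cascade; the `stages` list of (classifier, bias) pairs is a hypothetical structure, with the bias chosen as described next:

```python
def cascade_classify(feat, stages):
    """Run one feature vector through a cascade of (classifier, bias) stages.

    A patch is reported as a face only if every stage accepts it; the cheap
    early stages should reject the vast majority of non-face windows.
    """
    for clf, bias in stages:
        if clf.decision_function(feat)[0] + bias <= 0:
            return False   # rejected: stop evaluating further stages
    return True            # survived every stage
```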
My approach was, for each cascade level, to add a bias to my SVM's output so that the training false negative rate stayed below a certain threshold (I went with 0.01). This simply involved finding the positive data point at the confidence percentile given by the threshold and setting the bias from its confidence. I tried several different numbers of cascade levels (2, 3, and 5), but none seemed to give a significant boost. Part of this might be explained by the fact that the other models were already having their precision-recall tradeoff adjusted. The difference here is that the biasing is done within the model rather than as a post-evaluation procedure, which makes it slightly more principled, but doesn't give a higher average precision. My best result, with a 5-stage linear cascade, is shown below.
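For reference, a sketch of how that per-stage bias can be computed from the positive training confidences; the commented usage shows one hypothetical way to chain it with the earlier helpers:

```python
import numpy as np

def stage_bias(clf, X_pos, max_fn_rate=0.01):
    """Pick a bias so at most `max_fn_rate` of training positives score below zero.

    The cutoff is the confidence of the positive example at that percentile;
    negating it shifts the decision so the remaining positives pass the stage.
    """
    confidences = clf.decision_function(X_pos)
    cutoff = np.percentile(confidences, 100 * max_fn_rate)
    return -cutoff

# One hypothetical way to assemble a 5-stage linear cascade, retraining each
# stage on the negatives that survive the stages before it:
# stages = []
# for level in range(5):
#     clf = train_linear_detector(pos_patches, neg_patches)
#     stages.append((clf, stage_bias(clf, hog_features(pos_patches))))
#     neg_patches = [p for p in neg_patches
#                    if cascade_classify(hog_features([p]), stages)]
```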
Here's the class photo...clearly, my system was a little too shy.