In this project, I implement a sliding window face detector based on a framework similar to that of Viola-Jones. The main idea is that, in order to detect a face, we first build up a large dataset of faces and nonfaces and train a classifier on this data. Then, to actually find faces, we slide a window in discrete steps across the image at multiple scales and use the classifier to decide whether each window contains a face. To represent faces and windows we use SIFT features, which encode the distribution of intensity gradients over a 4 by 4 grid on each patch. The main problem we have to overcome when implementing this method is that we must classify thousands of sliding windows per image while finding at most a few dozen faces. Even if the false positive rate is as low as 1%, hundreds of false faces will crowd out the real ones. In addition, the sliding window may shift over a little or slightly change scale and detect the same face multiple times. Thankfully, we can handle both of these problems.
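The scanning loop described above can be sketched as follows. This is a minimal, dependency-light illustration, not the actual implementation: the `classify` callback stands in for the trained SVM (here it would take a SIFT descriptor rather than a raw patch), and the window size, step, and decimation factors are made-up values.

```python
import numpy as np

def sliding_windows(image, window=36, step=6):
    """Yield (row, col, patch) for every window position at one scale."""
    h, w = image.shape
    for r in range(0, h - window + 1, step):
        for c in range(0, w - window + 1, step):
            yield r, c, image[r:r + window, c:c + window]

def detect_faces(image, classify, window=36, step=6, factors=(1, 2, 4)):
    """Run the classifier over windows at several scales.

    `classify` is a placeholder for the trained classifier: it takes a
    patch and returns a confidence score (positive = face).
    """
    detections = []
    for f in factors:
        scaled = image[::f, ::f]  # crude nearest-neighbor downsample
        for r, c, patch in sliding_windows(scaled, window, step):
            score = classify(patch)
            if score > 0:
                # Map the detection back to original-image coordinates.
                detections.append((r * f, c * f, window * f, score))
    return detections
```

Each detection keeps its confidence score, which is exactly what the duplicate-removal step below needs.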
The first problem can be solved by mining hard negatives. We start by training our classifier on random non-faces collected from a large image databank. Once we have a classifier built from this random data, we run it on the original database of non-faces. Anything detected as a face must be a false positive, because by construction we know there are no faces there. We can extract these "hard negatives" and retrain our classifier. If we repeat this process over and over again, the classifier should eventually trigger on little or none of the dataset, meaning it has learned well what exactly a face and nonface is. The second problem can be solved by non-maximum suppression. Essentially, if we take all the overlapping detections we think are faces and remove all but the one we are most confident in, we should go down to one detection per face.
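The non-maximum suppression step might look something like the following greedy sketch, assuming each detection is a `(row, col, size, score)` tuple; the 0.3 overlap threshold is an assumed value, not one from this project.

```python
def overlap(a, b):
    """Intersection-over-union of two square boxes (row, col, size, score)."""
    r1 = max(a[0], b[0]); c1 = max(a[1], b[1])
    r2 = min(a[0] + a[2], b[0] + b[2]); c2 = min(a[1] + a[2], b[1] + b[2])
    inter = max(0, r2 - r1) * max(0, c2 - c1)
    union = a[2] ** 2 + b[2] ** 2 - inter
    return inter / union

def non_max_suppression(detections, thresh=0.3):
    """Keep only the highest-confidence detection in each overlapping cluster."""
    kept = []
    # Visit detections from most to least confident; drop any that
    # overlap too much with something already kept.
    for det in sorted(detections, key=lambda d: d[3], reverse=True):
        if all(overlap(det, k) <= thresh for k in kept):
            kept.append(det)
    return kept
```

Sorting by confidence first guarantees that when two boxes cover the same face, the one we are most sure about is the one that survives.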
After a few tests to find the optimal SVM parameters, I gauge the performance of the detector with the linear and nonlinear kernels under varying amounts of hard negative mining. Below you can see the performance of both the linear and radial kernels as functions of the number of mining iterations.
As you can see, the linear classifier starts off rather poorly at 0.64 accuracy with random negatives, and eventually climbs to a respectable 0.8 or so after 10 rounds of hard negative mining. The radial kernel starts off much better (0.81 accuracy), but it barely improves as more hard negatives get mined. The reason I stopped after 5 iterations is that it actually failed to find any more hard negatives with the parameters I used, so it was already perfect on the training data. Let's look at exactly how the performances differ. Below are the precision-recall curves for the linear classifier after 1 and 10 iterations (note these plots are from a different run than the previous ones, so the accuracies are slightly different).
Mining hard negatives improved the precision at almost all recalls, keeping it near 1 until the recall hit 0.25 or so. This means that as we lowered the threshold for what counts as a face, mining hard negatives made us better at deciding what actually is one. Let's see what happened similarly for the radial kernel.
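A precision-recall curve like the ones discussed here can be computed by sweeping the decision threshold over the classifier's scores. A minimal sketch, assuming `scores` are classifier confidences and `labels` mark true faces with 1:

```python
import numpy as np

def precision_recall(scores, labels):
    """Precision and recall at every possible score threshold
    (labels: 1 = face, 0 = nonface)."""
    order = np.argsort(-np.asarray(scores))  # highest score first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)        # true positives above each threshold
    fp = np.cumsum(1 - labels)    # false positives above each threshold
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    return precision, recall
```

Lowering the threshold moves us rightward along the curve: recall can only go up, while precision drops as false positives creep in, which is exactly the trade-off the plots show.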
The radial kernel with no hard negatives looks very similar to the linear kernel with hard negatives, except that it lacks the perfectly flat line at the start, which makes sense because random negatives can't give it a perfect sense of what a face is at any significant threshold. Once the hard negatives are mined, we see the same phenomenon as we did with the linear kernel, where the precision stays almost flat at 1, but now over a much larger range of recall before finally sloping down again. As expected, this kernel allows us to better classify face data, which is unlikely to fall in a linearly separable region. Chances are that in some high-dimensional space it just occupies a region which is surrounded on all sides by nonfaces.
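The intuition that a radial kernel can carve out an enclosed region can be seen in a toy example. This is an illustration only, not the project's SVM: the 2-D points, the uniform kernel weights, and the `gamma` value are all made up, standing in for the learned support vectors and dual coefficients.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (radial basis) kernel: similarity decays with squared distance."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# A toy "faces" cluster at the origin, surrounded on all sides by "nonfaces".
faces = np.array([[0.0, 0.0], [0.2, -0.1]])
nonfaces = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])

def score(x):
    # Kernel expansion with uniform weights: positive near the face cluster,
    # negative near the surrounding nonfaces. No linear function of x can
    # separate a region enclosed like this one.
    return (sum(rbf_kernel(x, f) for f in faces)
            - sum(rbf_kernel(x, n) for n in nonfaces))
```

The score is positive inside the cluster and negative near the surrounding points, even though no single hyperplane in the original 2-D space could make that split.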