Face Detection with a Sliding Window

Implemented by Andrew Scheff (ajscheff) for CSCI1430, Fall 2011

"Life is notes under our fingers -- we just have to figure out what notes to play."
                                                 -Jamie Fox, New Jersey Politician

We were given the task of improving upon a baseline face detection scheme that achieved an average precision of 5%. The baseline scheme does nothing fancy: it trains a linear SVM using raw face crops as positive training data and raw, random non-face image patches as negative training data. I took three different approaches to improving my face detector.
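
To make the baseline concrete, here is a minimal sketch of that pipeline in Python with scikit-learn (the course code was MATLAB, so this is an illustration rather than the actual starter code; the crop and patch loaders are assumed to exist):

```python
import numpy as np
from sklearn.svm import LinearSVC

def to_feature(patch):
    """Baseline feature representation: the raw pixel values, flattened."""
    return patch.astype(np.float64).ravel()

def train_baseline(face_crops, nonface_patches):
    """Train a linear SVM on raw face crops (positives) and random
    non-face patches (negatives), as the baseline scheme does."""
    X = np.array([to_feature(p) for p in list(face_crops) + list(nonface_patches)])
    y = np.array([1] * len(face_crops) + [-1] * len(nonface_patches))
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf
```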

Improving the Feature Representation

I chose to use a Histogram of Oriented Gradients (HoG) descriptor to represent image features. I used Oswaldo Ludwig's implementation of HoG, which can be found here. Simply using a HoG feature representation, without any parameter tweaking, boosted the average precision to 17.8%, with the precision-recall curve shown below:

The main parameter I varied in the HoG descriptor was the number of HoG windows per bounding box. The above PR curve was for a HoG descriptor with a 3x3 grid of windows and 9 histogram bins, which gives an 81-dimensional descriptor. I also tried 6x6, 9x9, and 12x12 grids of windows; their PR curves are shown below. I found that changing the number of histogram bins doesn't affect the AP very much.
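
As a sanity check on those dimensions, a w x w grid of windows with b histogram bins gives a w*w*b dimensional descriptor (3x3x9 = 81, 6x6x9 = 324, and so on). A rough equivalent using scikit-image's hog function, not Ludwig's MATLAB code, assuming 36x36 crops:

```python
import numpy as np
from skimage.feature import hog

def hog_descriptor(crop, grid=3, bins=9):
    """HoG over a grid x grid layout of windows with `bins` orientation
    bins per window, giving a grid*grid*bins dimensional descriptor."""
    cell = crop.shape[0] // grid          # window side length in pixels
    return hog(crop,
               orientations=bins,
               pixels_per_cell=(cell, cell),
               cells_per_block=(1, 1),    # one window per block
               feature_vector=True)

print(hog_descriptor(np.random.rand(36, 36), grid=3, bins=9).shape)  # (81,)
```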

I got the best performance from the 9x9 grid of windows, but I found that 6x6 was a good balance of performance and speed, so I used those parameters as I moved on to other improvements.

Mining for Hard Negatives

Mining for hard negatives did not improve performance greatly, at least with a linear classifier. I tried improving my classifier with one additional round of mining hard negatives, and I noticed that many of the training images yielded no hard negatives at all, while many others yielded only a few. Taking a maximum of 4 from any one image (so out of 1000 possible), I got a total of 597 hard negatives. On the second round of hard negative mining, very few were found in any image: 123 total. Below are the AP curves that I got for one and two rounds of mining hard negatives.

As you can see, these AP curves are not much better than the curve for the 6x6 grid of windows with no rounds of mining hard negatives, and because the number of hard negatives found fell off so drastically between rounds, I didn't think it would be worth trying this scheme for more rounds of mining.
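
For reference, one round of mining looks roughly like the sketch below; detect_faces and the detection fields are hypothetical stand-ins for my detection pipeline:

```python
import numpy as np

def mine_hard_negatives(clf, nonface_images, descriptor, per_image_cap=4):
    """One round of hard negative mining: run the current detector over
    images that contain no faces and keep the highest-confidence false
    positives (at most `per_image_cap` per image) as new negatives."""
    hard = []
    for img in nonface_images:
        dets = detect_faces(clf, img)  # hypothetical: every hit here is a false positive
        dets.sort(key=lambda d: d.confidence, reverse=True)
        hard.extend(descriptor(d.patch) for d in dets[:per_image_cap])
    return np.array(hard)

# The mined descriptors are appended to the negative training set and the
# SVM is retrained; the process repeats for as many rounds as are productive.
```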

Using a non-linear SVM

I found that varying sigma and lambda had a large effect on the performance of a non-linear classifier. Still using a HoG descriptor with a 6x6 grid of windows and 9 histogram bins, I ran the non-linear classifier with sigma and lambda both equal to 1.0 and got an AP of only 17.0%. The PR curve is below:

I then tried 3.0 and 5.0 for sigma, and for each of those I tried lambda values of 1.0, 0.1, and 0.01. The best curve came from lambda 0.01 and sigma 3.0. It is shown below:
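
In scikit-learn terms, this kind of classifier can be sketched with an RBF-kernel SVC, mapping sigma to gamma = 1/(2*sigma^2) and treating lambda as an inverse regularization strength; this correspondence to the course code's parameters is my assumption, not an exact equivalence:

```python
from sklearn.svm import SVC

def rbf_svm(sigma, lam):
    """Non-linear SVM with a Gaussian (RBF) kernel. The (sigma, lambda)
    to (gamma, C) mapping here is approximate."""
    return SVC(kernel='rbf',
               gamma=1.0 / (2.0 * sigma ** 2),  # kernel width from sigma
               C=1.0 / lam)                     # lambda as regularization

clf = rbf_svm(sigma=3.0, lam=0.01)  # the best setting found above
# clf.fit(X_train, y_train); scores = clf.decision_function(X_test)
```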

Additional Tweaks

I was able to gain several percentage points simply by using more data: instead of 1000 positive and 1000 negative examples, I used 5000 of each. With the larger training set I was able to do much better with a non-linear classifier using the above parameters. See below:

Increasing the density with which I search for faces in the test images helps as well: decreasing my step size from 4 to 2 pixels improved my performance. See below:
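
For reference, the step size (and, in the next tweak, the scale factor) enter the detector roughly as in this sketch of the multiscale sliding window; the 36x36 window size and skimage's rescale are my assumptions, not details from the course code:

```python
import numpy as np
from skimage.transform import rescale

def sliding_windows(image, window=36, step=2, scale_factor=1.5):
    """Yield (patch, scale, x, y) for every window position at every
    level of an image pyramid. Smaller `step` and `scale_factor` values
    mean a denser (and slower) search."""
    scale = 1.0
    img = image
    while min(img.shape[:2]) >= window:
        for y in range(0, img.shape[0] - window + 1, step):
            for x in range(0, img.shape[1] - window + 1, step):
                yield img[y:y + window, x:x + window], scale, x, y
        scale *= scale_factor
        img = rescale(image, 1.0 / scale, anti_aliasing=True)
```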

Using the same scheme as above, I introduced 2 rounds of hard negative mining and further increased the density of the scales at which I sample my test images by changing the scale factor from 1.5 to 1.25. My code is still running.