“Your music is bad, and you should feel bad!”
Linear SVM; interestingly, Zoidberg is the face detected with the highest confidence.
Linear SVM, Brown computer science founders.
The goal of this project was to experiment with a face detection pipeline using support vector machines (SVMs). Like most object detection pipelines, the face detector runs a sliding window over the test image at different scales and determines, for each window position and scale, whether or not the window contains a face (binary classification). The basic pipeline consisted of the following steps: extract features from positive (face) and negative (non-face) training crops, train an SVM on those features, then slide a window over each test image at multiple scales and classify every window.
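The sliding-window loop itself is simple; a minimal sketch is below. The `classifyWindow` stub is a hypothetical stand-in for the trained SVM (which would score HoG features extracted from the window), and all parameter values are illustrative.

```cpp
#include <cassert>
#include <vector>

// A detection: window position, pyramid scale, and classifier confidence.
struct Detection { int x, y; float scale, confidence; };

// Hypothetical classifier stub standing in for the trained SVM: here it
// simply flags windows whose top-left corner lies on a coarse grid.
float classifyWindow(int x, int y) { return (x % 36 == 0 && y % 36 == 0) ? 1.f : -1.f; }

// Slide a fixed-size window over the image at multiple scales; keep any
// window the classifier scores above the threshold.
std::vector<Detection> detect(int imageWidth, int imageHeight,
                              int windowSize, int step, float scaleFactor,
                              float threshold)
{
    std::vector<Detection> detections;
    // Shrink the effective image by scaleFactor until the window no longer fits.
    for (float scale = 1.f;
         windowSize * scale <= (float)imageWidth && windowSize * scale <= (float)imageHeight;
         scale *= scaleFactor)
    {
        const int w = (int)(imageWidth / scale);
        const int h = (int)(imageHeight / scale);
        for (int y = 0; y + windowSize <= h; y += step)
            for (int x = 0; x + windowSize <= w; x += step)
            {
                const float conf = classifyWindow(x, y); // SVM score on the window's features
                if (conf > threshold)
                    detections.push_back({x, y, scale, conf});
            }
    }
    return detections;
}
```

A real detector would also merge overlapping detections across scales; that step is omitted here.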
To improve accuracy, the SVM can be trained multiple times using any false detections as additional negative examples (mining for hard negatives); this can reduce the false positive rate.
The training set consisted of 6k cropped 36 x 36 faces from the Caltech Web Faces project. Non-face images were mined from Wu et al. and the SUN scene database. The detector was benchmarked against the CMU+MIT test set, which consists of 130 images containing 511 faces.
Linear SVM. Surprisingly good, considering how un-face-like some of the faces are.
The feature representation used for image patches makes a significant difference in detection rate. The default example (simply the image crops themselves) performed terribly. Switching the internal representation to SIFT yielded accuracies of about 37% using the default parameters of the base implementation (1000 positive examples, linear SVM, detector step size of 4, scale factor of 1.5, and no hard negative mining).
HoG features were then used as the internal representation instead of SIFT, which was relatively computationally expensive and gave mediocre results for how long it took to compute. Instead of using a preexisting HoG implementation, I decided to write my own, since there don't seem to be many good MATLAB HoG implementations available.
My HoG implementation gives an average accuracy of 51% (see below left), using the same settings as above - a big improvement over SIFT (which was about 37%), and better than the other HoG implementations I tried. It is also relatively fast, running the whole project pipeline from start to finish in 2 minutes (parallelization speeds this up to around 50 seconds). Using an RBF SVM, the accuracy jumps by about 10% (below right), although it is a little slower.
1000 positive and 1000 negative training examples without hard negative mining; linear SVM vs. RBF SVM with default detector values.
For speed, my HoG is implemented in C++ and compiled as a mex file. It follows the R-HoG descriptor outlined in Dalal and Triggs (2005), except that the image patch is variance-normalized before binning into the histogram; this makes the descriptor less sensitive to illumination changes, which gave a rough accuracy boost of around 5%.
/* HoG computation */
void hog(const float *imageDataIn, float *descriptorOut, const int iWidth, const int iHeight,
         const int ntheta, const int ncellsx, const int ncellsy, const int ngridx, const int ngridy,
         const bool signedgradient, const int interplevel)
{
    ...
}
The basic pipeline for the HoG descriptor is: variance-normalize the image patch, compute the image gradients, accumulate gradient magnitudes into per-cell orientation histograms over overlapping, Gaussian-weighted blocks, and concatenate the results into the final descriptor.
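The core binning step can be sketched as follows. This is an illustrative simplification, not the actual mex implementation: it uses unsigned gradients and omits the Gaussian weighting, block overlap, and normalization described above.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the core HoG step: compute gradients with central differences
// and accumulate gradient magnitude into per-cell orientation histograms
// (unsigned gradients; no interpolation, weighting, or block normalization).
std::vector<float> cellHistograms(const std::vector<float> &img,
                                  int width, int height,
                                  int cellSize, int ntheta)
{
    const float PI = 3.14159265f;
    const int ncellsx = width / cellSize, ncellsy = height / cellSize;
    std::vector<float> hist(ncellsx * ncellsy * ntheta, 0.f);
    for (int y = 1; y < height - 1; y++)
        for (int x = 1; x < width - 1; x++)
        {
            // central-difference gradients
            const float dx = img[y * width + x + 1] - img[y * width + x - 1];
            const float dy = img[(y + 1) * width + x] - img[(y - 1) * width + x];
            const float mag = std::sqrt(dx * dx + dy * dy);
            // unsigned orientation folded into [0, pi)
            float theta = std::atan2(dy, dx);
            if (theta < 0.f) theta += PI;
            int bin = (int)(theta / PI * ntheta);
            if (bin >= ntheta) bin = ntheta - 1;
            // accumulate magnitude into the pixel's cell
            const int cx = x / cellSize, cy = y / cellSize;
            if (cx < ncellsx && cy < ncellsy)
                hist[(cy * ncellsx + cx) * ntheta + bin] += mag;
        }
    return hist;
}
```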
As a random and interesting (at least I think it's kind of neat) side note, the variance and mean for normalizing the image patch can be estimated in a single pass:
/* image variance normalization */
void normalize(const float *imageDataIn, float *imageDataOut, const int iWidth, const int iHeight)
{
    float mu = 0.f, m2 = 0.f;
    // one-pass mean and variance computation (Knuth 1998, The Art of Computer Programming)
    for(int i = 0; i < iWidth * iHeight; i++)
    {
        float delta = imageDataIn[i] - mu;
        mu += delta / (float)(i + 1);
        m2 += delta * (imageDataIn[i] - mu);
    }
    // reciprocal of the standard deviation, with a small epsilon to avoid division by zero
    const float invStdev = 1.f / (sqrtf(m2 / (float)(iWidth * iHeight)) + 0.01f);
    for(int i = 0; i < iWidth * iHeight; i++)
    {
        imageDataOut[i] = (imageDataIn[i] - mu) * invStdev;
    }
}
The HoG returns a feature of dimension #theta_bins x #cells_x x #cells_y.
I found that the feature performed best with roughly 9 to 12 angular bins and a cell size of about 4x4 pixels, similar to what the original HoG authors found. However, I did not implement interpolation for histogram binning, since the block overlap coupled with Gaussian weighting seems to do this indirectly already.
During the first training stage, positive crops from the training data are coupled with negative crops randomly chosen from images without faces. To refine the SVM, the detector is then run on the non-face scenes, and any detections (which are necessarily false positives) are used as new negative training examples for another SVM, decreasing the false positive rate. This step can be repeated multiple times.
The strategy used for mining hard negatives was relatively simple: for each non-face scene image, any detected faces were sorted by confidence, and the top results were used as hard negative examples. Hard negative mining usually improved performance by about 5-10%, and seemed much more effective for the RBF SVM.
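The selection step can be sketched as below; this is a simplified illustration (operating on confidence scores only, where a real implementation would keep the corresponding image patches).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of hard-negative selection: every detection on a face-free scene
// is a false positive, so sort the detections by SVM confidence and keep
// the top k as new negative training examples.
std::vector<float> topHardNegatives(std::vector<float> confidences, size_t k)
{
    // highest-confidence (hardest) negatives first
    std::sort(confidences.begin(), confidences.end(),
              [](float a, float b) { return a > b; });
    if (confidences.size() > k)
        confidences.resize(k);
    return confidences;
}
```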
Below are some of the face detection results on different photos.
Face detection results on the class photo.
Below are some of the more interesting face detection results on the test set. I found it surprising how well the detector worked on faces that weren't photographs but drawings.
Decreasing the scale factor increases the accuracy significantly, at the cost of computation time. I found the RBF kernel difficult to tune correctly, and couldn't get it to beat the linear SVM by a significant margin; at best it stayed about even with the linear SVM, although the RBF needed much less training data to achieve the same accuracy. Even with this advantage, most of my results use the linear SVM, since it is faster to compute and uses much less memory.
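The scale factor controls the cost directly: it determines how many pyramid levels the detector must evaluate before the window no longer fits. A quick sketch of that count (the parameter values in the usage below are illustrative assumptions, not the project's settings):

```cpp
#include <cassert>

// Number of pyramid levels a sliding-window detector evaluates: keep
// shrinking the effective image by scaleFactor until the window no
// longer fits in the smaller image dimension.
int numScales(int minImageDim, int windowSize, float scaleFactor)
{
    int levels = 0;
    for (float scale = 1.f; windowSize * scale <= (float)minImageDim; scale *= scaleFactor)
        levels++;
    return levels;
}
```

For a 360-pixel image and a 36-pixel window, a scale factor of 1.5 gives 6 levels, while 1.1 gives 25, so the per-image work grows quickly as the factor approaches 1.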
Decreasing the detector step size can also increase accuracy, as can increasing the total number of negative training examples.
It is somewhat surprising how well the linear SVM managed to perform, although it should be possible to surpass it with the RBF kernel given careful parameter selection.