Project 4: Face Detection with a Sliding Window

Hobart Reynolds (hcreynol)

November 13, 2011


In this project, we attempt to detect faces in images. We do this by training a classifier, sliding a window across each image at multiple scales, and placing a bounding box around windows that are classified as faces.

Algorithm

Step 1: Obtain image patches (in this case, 36 x 36 pixel samples of images) that contain faces. These will be the positive training data. Additionally, obtain negative training data in the form of patches without faces. (In this case, the negative training data is entire images containing no faces, so any patch from such an image may be used.)

Step 2: Sample an equal number of positive and negative examples. The positive examples will be sampled randomly. The negative examples will be sampled as follows:

For the first iteration, there is no classifier that has been trained to distinguish faces from non-faces. Therefore, negative examples will be sampled randomly (random patches taken from different images in the negative training data, which are guaranteed not to have faces in them). For subsequent iterations, there is a classifier, so hard negatives will be sampled. A hard negative is a negative example (a non-face) that is incorrectly classified as a face by the classifier.

Step 3: Use the sampled positive and negative examples to build a classifier.

Step 4: Repeat steps 2 and 3 until a stopping condition is satisfied. This condition may be based on the classifier's performance on the training data (such as satisfying some threshold), on the number of iterations run, or on no more hard negatives being found.
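
As an illustration, steps 2-4 might be sketched as follows in Python. The helper sample_negative_feats, the batch sizes, and the use of scikit-learn's LinearSVC (with C = 1/lambda) are assumptions made for the sketch, not the original implementation.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_with_hard_negatives(pos_feats, sample_negative_feats,
                                  n_samples=1000, n_iters=8, lam=100.0):
        """Steps 2-4: alternately sample examples and retrain.

        pos_feats             -- feature vectors of all positive (face) patches
        sample_negative_feats -- hypothetical helper returning feature vectors
                                 of random non-face patches
        """
        rng = np.random.default_rng(0)
        clf = None
        for it in range(n_iters):
            # Step 2: sample equal numbers of positives and negatives.
            pos = pos_feats[rng.choice(len(pos_feats), n_samples, replace=False)]
            if clf is None:
                # First iteration: no classifier yet, so sample at random.
                neg = sample_negative_feats(n_samples)
            else:
                # Later iterations: keep only hard negatives, i.e. non-face
                # patches the current classifier mislabels as faces.
                cand = sample_negative_feats(10 * n_samples)
                hard = cand[clf.predict(cand) == 1]
                if len(hard) == 0:
                    break  # no hard negatives left: one possible stop condition
                neg = hard[:n_samples]
            # Step 3: retrain the classifier on the sampled examples.
            X = np.vstack([pos, neg])
            y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
            clf = LinearSVC(C=1.0 / lam).fit(X, y)  # C = 1/lambda (assumption)
        return clf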

To evaluate the face detection algorithm, the classifier was run on test data using a sliding window. A window was moved across each test image (this was repeated with windows of different sizes), and the patch in each window was run through the classifier, which attempted to determine whether the patch was a face or a non-face. If the patch was classified as a face, a bounding box was placed on it. If two bounding boxes overlapped enough (for example, if the same face was detected by two windows of different sizes), the boxes were consolidated into one box to eliminate redundant detections. Finally, the detections were compared to the ground truth detections (bounding boxes placed on the faces by human annotators). Each bounding box found by the algorithm was counted as a true positive if it overlapped enough with a ground truth bounding box, and as a false positive otherwise.
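
A sketch of the sliding-window evaluation and box consolidation follows, with the same caveats: the scales, step size, and overlap threshold are illustrative, and the greedy highest-score-first merge is one simple way to consolidate overlapping boxes (non-maximum suppression); the report does not specify the exact scheme.

    import cv2
    import numpy as np

    def iou(a, b):
        """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def detect(image, clf, extract_features, scales=(1.0, 0.8, 0.64),
               win=36, step=4, overlap_thresh=0.5):
        """Slide a 36x36 window over the image at several scales, score each
        patch with the classifier, and merge overlapping detections."""
        boxes, scores = [], []
        for s in scales:
            # Shrinking the image lets the fixed-size window match larger faces.
            img = cv2.resize(image, None, fx=s, fy=s)
            H, W = img.shape[:2]
            for y in range(0, H - win + 1, step):
                for x in range(0, W - win + 1, step):
                    feat = extract_features(img[y:y + win, x:x + win])
                    score = clf.decision_function(feat.reshape(1, -1))[0]
                    if score > 0:  # classified as a face
                        boxes.append(np.array([x, y, x + win, y + win]) / s)
                        scores.append(score)
        # Consolidate overlapping boxes, keeping the most confident one.
        keep = []
        for i in np.argsort(scores)[::-1]:
            if all(iou(boxes[i], boxes[j]) < overlap_thresh for j in keep):
                keep.append(i)
        return [boxes[i] for i in keep]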

The performance of the algorithm was measured by a precision-recall graph and average precision. Precision is the ratio of true faces detected to detections (for example, if the algorithm detects 10 faces and 5 of them are actually faces, precision is 50%), while recall is the ratio of true faces detected to total faces (for example, if the algorithm detects 5 true faces but there are actually 10 in the test set, recall is 50%). Precision over various recall levels can be averaged for a general performance of the algorithm.
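
These quantities can be computed directly from the scored detections. The sketch below uses one common formulation, in which average precision is the mean of the precision values at the ranks where true positives occur.

    import numpy as np

    def precision_recall_ap(scores, is_tp, n_ground_truth):
        """scores         -- classifier confidence of each detection
        is_tp          -- 1 if the detection overlaps a ground-truth box
                          enough, else 0
        n_ground_truth -- total number of annotated faces in the test set"""
        order = np.argsort(scores)[::-1]              # most confident first
        tp = np.cumsum(np.asarray(is_tp)[order])
        precision = tp / np.arange(1, len(tp) + 1)    # true faces / detections
        recall = tp / n_ground_truth                  # true faces / total faces
        ap = np.sum(precision * np.asarray(is_tp)[order]) / n_ground_truth
        return precision, recall, ap

    # Example from the text: 10 detections of which 5 are true faces gives a
    # final precision of 0.5; finding 5 of 10 annotated faces gives recall 0.5.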


Parameters

The algorithm could represent image patches in different ways. For this assignment, I experimented with normalized image patches and SIFT descriptors. A normalized patch is the image patch normalized to have a total intensity of 1 and vectorized, so that a 36 x 36 patch becomes a 1296-element vector. For the SIFT representation, I built SIFT descriptors with a bin size of 10 and a step size of 2, generating nine 128-element SIFT descriptors for each patch, vectorized into a 1152-element vector.
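
The two representations might be computed as follows; normalized_patch_feature follows the description above, while dense_sift stands in for a dense SIFT implementation (e.g. a VLFeat-style binding) and is an assumption.

    import numpy as np

    def normalized_patch_feature(patch):
        """36x36 grayscale patch -> 1296-element vector, total intensity 1."""
        v = patch.astype(float).ravel()
        return v / v.sum()

    def sift_feature(patch, dense_sift):
        """Per the parameters above, dense SIFT with bin size 10 and step 2
        yields nine 128-element descriptors on a 36x36 patch, flattened into
        a 9 * 128 = 1152-element vector. dense_sift is a hypothetical helper."""
        descs = dense_sift(patch, bin_size=10, step=2)  # shape (9, 128)
        return descs.ravel()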

Support Vector Machines (SVMs) were used to build the classifier. The SVM could be either linear or nonlinear. For a linear SVM, the parameter lambda could be varied. For the nonlinear SVM, lambda, sigma, and the type of nonlinear SVM could be varied.
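
How the report's lambda and sigma map onto a particular SVM library is an assumption; with scikit-learn, regularization strength is commonly expressed as C = 1/lambda and an RBF kernel's width as gamma = 1/(2*sigma^2), giving a sketch like:

    from sklearn.svm import LinearSVC, SVC

    def make_classifier(kind, lam=100.0, sigma=1.0):
        """Build a linear or RBF SVM; the lambda -> C and sigma -> gamma
        conversions are assumptions, not the report's exact setup."""
        if kind == "linear":
            return LinearSVC(C=1.0 / lam)
        if kind == "rbf":
            return SVC(kernel="rbf", C=1.0 / lam, gamma=1.0 / (2 * sigma ** 2))
        raise ValueError(kind)

    # Usage: clf = make_classifier("linear", lam=100.0); clf.fit(X, y)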

As a stopping condition (see step 4), I elected to use a fixed number of iterations. I did this because, with SIFT descriptors, the classifier was always able to achieve a false positive rate of 0% on the training data, so regardless of the threshold placed on the FPR, the algorithm would never have run for more than one iteration. Because the algorithm took a very long time to run, I was also concerned that it would run indefinitely if I ran it until no more hard negatives were detected. So, I chose a specific number of iterations, after which I stopped the algorithm and evaluated the classifier on the test data.

Below is a summary of the average precision achieved for various parameter values with linear SVMs. Precision-recall graphs for some of the parameter settings follow the table.

Representation       Iterations   Lambda   Average Precision   Graph
Normalized Patches   2            100      2.6%                n/a
Normalized Patches   4            100      4.2%                n/a
Normalized Patches   8            100      4.0%                n/a
SIFT                 2            100      21.7%               n/a
SIFT                 4            100      25.7%               n/a
SIFT                 8            100      30.7%               n/a
SIFT                 16           100      28.0%               Graph 1
SIFT                 16           10       19.0%               Graph 2
SIFT                 16           1000     34.6%               Graph 3
SIFT*                6            100      31.4%               Graph 4

* For this run, I modified step 2. The normal step 2 ignores the previous iterations' negative examples and samples its own. The modified step 2 accumulates the negative examples, so that negative examples sampled at the present iteration are also used in all subsequent iterations.
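
Relative to the earlier training-loop sketch, this modification only changes how negatives are carried between iterations: the pool grows instead of being replaced. A sketch, under the same assumptions as before:

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_accumulating(pos_feats, sample_negative_feats,
                           n_samples=1000, n_iters=6, lam=100.0):
        """Modified step 2: negative examples accumulate across iterations."""
        rng = np.random.default_rng(0)
        all_neg, clf = None, None
        for it in range(n_iters):
            pos = pos_feats[rng.choice(len(pos_feats), n_samples, replace=False)]
            neg = sample_negative_feats(n_samples)
            if clf is not None:
                neg = neg[clf.predict(neg) == 1]   # keep only hard negatives
            # Accumulate rather than replace the previous negatives.
            all_neg = neg if all_neg is None else np.vstack([all_neg, neg])
            X = np.vstack([pos, all_neg])
            y = np.concatenate([np.ones(len(pos)), np.zeros(len(all_neg))])
            clf = LinearSVC(C=1.0 / lam).fit(X, y)
        return clf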

[Graphs 1-4: precision-recall curves for the parameter settings marked in the table above.]

In the graphs, the algorithm is generally quite accurate at low recall. (The first faces recalled are the patches the algorithm is most confident are faces.) Precision drops rather rapidly as recall increases. It should also be noted that none of the runs successfully recalled all of the faces in the test data: the runs with 16 iterations recalled almost 70% of the faces, while the run with 6 iterations recalled only 45%.

It is evident that SIFT descriptors are a much better representation of faces than normalized patches. This is probably because the actual image patch relies heavily on spatial relationships between pixels, and these relationships may break down if the face is presented differently (such as from a different viewing angle). SIFT descriptors instead rely on image gradients, which are not as sensitive to such changes. This makes the algorithm many times more accurate, improving average precision from 2-4% to 20-35%.

Precision generally improves as the number of iterations increases. This is expected because, as the classifier is refined, retraining it on negative examples that it previously misclassified as faces will generally improve its performance. However, for the SIFT descriptors with a lambda of 100, the run with 8 iterations outperformed the one with 16 iterations. This could be because, as the classifier is repeatedly retuned, it becomes too biased toward the last negative examples it saw, which could degrade its performance. This seems likely because the run that accumulated negative examples, rather than ignoring previous ones, outperformed both, even though it ran for only 6 iterations.

Precision also generally improves as lambda increases. (A lambda that is too large would degrade performance, but no such lambda is seen here.)

I also attempted to run the algorithm using a nonlinear SVM. I ran it on normalized patches with lambda = 1.0, type = rbf, and sigma = 1.0 for 8 iterations. Average precision was 3.0%, worse than two of the linear runs. This was probably because the negative examples were not being accumulated, so the classifier was biased toward the last negative examples it saw. I also ran the nonlinear version with the same parameters but using the SIFT representation; unfortunately, it failed to detect any faces in the test data.


Results

Some example detections are below. Note: an image may contain multiple faces, but each detection is treated separately, so each example highlights at most one detected face at a time.

In this image, a real face is detected. Note that the algorithm's bounding box does not perfectly correspond to the ground truth bounding box, but it overlaps significantly.

In this image, a face in profile is detected.

In this image, a drawn face is detected.

Below are three examples of incorrect detections: