CSCI1430 Project 4: Face detection with a sliding window
Soravit Beer Changpinyo (schangpi)
1. Introduction
The goal of this project is to build a face detector based on the sliding window model.
The basic idea of the sliding window model is that each image patch can be classified independently as object or non-object.
Our face detector is inspired by two important papers: Dalal-Triggs 2005 and Viola-Jones 2001.
The first paper uses a linear classifier trained on strong, SIFT-like features, namely HoG descriptors.
The second paper instead uses a non-linear classifier, and proposes a cascade framework to alleviate
the computational cost that the non-linear classifier introduces.
To achieve a high-accuracy detector, we will make use of both the strong features described in the first paper
and a complex classifier as described in the second. We will also implement a method for mining
hard negatives to be used by the classifiers.
2. Algorithm
The following are the main steps of the algorithm as well as the design decisions I have made:
2.1 Mining Hard Negatives
The process was the same for linear and non-linear classifiers.
In the first pass, I picked 2700 random negatives from the training data.
Using the model trained on those negatives, I iteratively retrieved hard negatives and retrained the detector.
Specifically, I picked at most the 10 best negative crops from each training image, i.e., crops that the current detector
classified as a face with high confidence. At each stage, I capped the total number of new negative crops at 1000.
I kept only the 10 best negative crops per image to preserve the variety of hard negatives;
if an image is particularly hard, we can still collect more of its negative crops in the next stage.
The hard negatives from all stages were then combined (with the initial random negatives excluded) to train the detector used at test time.
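The project itself was implemented in MATLAB; the following Python sketch only illustrates the structure of the mining loop. The helpers run_detector, extract_crop, train_svm, and random_negative_crops, as well as the variables positives and non_face_images, are hypothetical stand-ins for the actual detection, cropping, and training code.

import numpy as np

def mine_hard_negatives(model, non_face_images, per_image=10, cap=1000):
    # Collect the most confident false-positive crops: at most
    # per_image crops per image (to preserve variety), cap crops per stage.
    crops, scores = [], []
    for img in non_face_images:
        boxes, confs = run_detector(model, img)    # every detection here is a false positive
        top = np.argsort(confs)[::-1][:per_image]  # the 10 most confident crops in this image
        for i in top:
            crops.append(extract_crop(img, boxes[i]))
            scores.append(confs[i])
    keep = np.argsort(scores)[::-1][:cap]          # at most 1000 new negatives per stage
    return [crops[i] for i in keep]

# First pass: random negatives; later passes: accumulated hard negatives only.
negatives = random_negative_crops(2700)
model = train_svm(positives, negatives)
hard_negatives = []
for stage in range(8):
    hard_negatives += mine_hard_negatives(model, non_face_images)
    model = train_svm(positives, hard_negatives)   # random negatives excluded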
2.2 Feature Representations
I converted each crop, both positive and negative, to HoG features using code from http://www.mathworks.com/matlabcentral/fileexchange/28689-hog-descriptor-for-matlab.
I used HoG because it is computationally cheaper to extract than SIFT features.
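For reference, a rough Python equivalent using scikit-image is sketched below. The report used the MATLAB code linked above, so the parameters here, including the 36x36 crop size, are assumptions chosen to reproduce a 4x4 grid of windows with 9 orientation bins (4*4*9 = 144 dimensions).

from skimage.feature import hog

def crop_to_hog(crop):
    # crop: a 36x36 grayscale array (crop size assumed, not stated in the report)
    return hog(crop,
               orientations=9,          # 9 orientation bins
               pixels_per_cell=(9, 9),  # 36/9 = 4 cells per side, i.e. 4x4 windows
               cells_per_block=(1, 1),  # one cell per block, no block overlap
               feature_vector=True)     # flatten to a 4*4*9 = 144-D vector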
2.3 Classifiers
The features extracted in 2.2 were then passed to SVM classifiers for training.
I used both linear SVMs and non-linear SVMs with an RBF kernel.
I kept the training set balanced, with equal numbers of positive and negative crops.
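The original classifiers were trained in MATLAB; the sketch below shows the corresponding choices in scikit-learn. Here pos_feats and neg_feats are hypothetical equal-sized arrays of HoG descriptors from positive and negative crops.

import numpy as np
from sklearn.svm import LinearSVC, SVC

X = np.vstack([pos_feats, neg_feats])  # balanced: equal numbers of positives and negatives
y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])

linear_model = LinearSVC(C=1.0).fit(X, y)                      # linear SVM, as in Dalal-Triggs
rbf_model = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)  # non-linear SVM with an RBF kernel

# Both models expose decision_function, whose value can serve as the
# detection confidence used when mining hard negatives.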
2.4 Stopping Criteria
I stopped retraining the model when any of the following was true:
(1) the current detector produced fewer than 20 false positives across all training images;
(2) the number of false positives no longer decreased;
(3) the number of iterations reached 8.
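These criteria translate directly into a small check, sketched here in Python (the variable names are illustrative):

def should_stop(false_pos, prev_false_pos, iteration, max_iters=8):
    # Stop retraining when any of the three criteria in 2.4 holds.
    return (false_pos < 20                  # (1) few false positives remain
            or false_pos >= prev_false_pos  # (2) false positives stopped decreasing
            or iteration >= max_iters)      # (3) iteration cap reached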
2.5 Testing
Our benchmark is the CMU+MIT test set. This test set contains 130 images with 511 faces.
I set the start scale to 2, the scale factor to 1.25, and the step size to 4 pixels.
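A Python sketch of the multi-scale sliding window with these parameters follows. The function score_window stands in for HoG extraction followed by the SVM decision function, the 36-pixel window size is an assumption, and "start scale 2" is interpreted here as beginning the search at half resolution; image border handling is simplified.

import numpy as np
from skimage.transform import rescale

def sliding_window_detect(image, score_window, win=36,
                          start_scale=2.0, factor=1.25, step=4, thresh=0.0):
    detections = []
    scale = start_scale
    while min(image.shape[:2]) / scale >= win:   # stop when the window no longer fits
        resized = rescale(image, 1.0 / scale)    # shrinking the image enlarges the window
        rows, cols = resized.shape[:2]
        for r in range(0, rows - win + 1, step):
            for c in range(0, cols - win + 1, step):
                s = score_window(resized[r:r + win, c:c + win])
                if s > thresh:
                    # map the window back into original image coordinates
                    detections.append((r * scale, c * scale, win * scale, s))
        scale *= factor
    return detections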
3. Results and Discussion
3.1 Hard Negatives and Classifiers
First, I would like to show the top hard negatives at different stages.
They make it clear why the complexity of the classifier is important.
The non-linear classifiers are better at eliminating hard negatives that appeared in previous stages.
With linear classifiers, on the other hand, some crops showed up over and over again.
The reason is that these crops share many basic patterns with faces,
and thus cannot be separated from faces by a simple linear hyperplane.
[Figure: the top 25 hard negatives for non-linear classifiers at stages 1 (random), 2, 3, 4, and 5]
[Figure: the top 25 hard negatives for linear classifiers at stages 1 (random), 2, 4, 6, and 8]
3.2 Average Precision and Comparisons
To see how much mining hard negatives improves accuracy, and to make the comparison fair,
I used the same amount of training data for random negatives as for hard negatives: 2*3871 crops in the linear case
and 2*1326 in the non-linear case.
The results show that using hard negatives instead of random ones improves accuracy by about 10%.
The type of classifier has an even larger impact on accuracy:
using non-linear instead of linear classifiers gives about a 20% improvement in each case.
[Figures: results for random negatives with linear classifiers, hard negatives with linear classifiers,
random negatives with non-linear classifiers, and hard negatives with non-linear classifiers]
3.3 HoG Parameters
Choosing the right parameters for extracting HoG features is important.
To illustrate this, I tried a model that uses 3x3 HoG windows per bounding box instead of 4x4,
with the number of orientation bins fixed at 9; this yields an 81-dimensional feature vector (3*3*9 = 81)
instead of a 144-dimensional one (4*4*9 = 144). The model was retrained 10 times with non-linear classifiers.
The resulting average precision score is 0.687. This value, together with the results in subsection 3.2, suggests that
having strong, complex features is more important than having hard negatives.
3.4 Amount of Training Data
Increasing the amount of training data also helps improve performance. Without hard negatives, in the
non-linear case, the result with 2*2700 training examples (instead of 2*1326)
is quite close to our best performance.
3.5 Sample Results
We have successfully created a face detector with high accuracy. Some false positives that are not real human faces are shown below.
[Figure: strange positive results]
I notice that most false positives are:
(1) parts of humans, especially parts of a face (eyes, mouth, chin, ears), but also knees and elbows;
(2) round objects: a soccer ball, a speaker, the letter 'O', glasses.
This suggests that we might want to add images containing such features to the negative training data.
Some other errors occur because the bounding box covering a face is too big,
which might be caused by inaccurate non-maximum suppression (a generic NMS sketch is given below).
Lastly, some false positives genuinely look like faces (e.g., the last image).
[Figure: interesting negative results]
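The report does not describe its non-maximum suppression step in detail; for reference, a standard greedy NMS over (x1, y1, x2, y2) boxes looks like the following, suppressing detections whose intersection-over-union with a more confident box is too high.

import numpy as np

def nms(boxes, scores, overlap_thresh=0.3):
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # most confident first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top box with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                 * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= overlap_thresh]  # drop heavily overlapping boxes
    return keep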
3.6 Discussion
Even though 0.81 is a reasonable average precision score, there is still much room for improvement.
One could try a higher-dimensional feature vector by adjusting the HoG parameters; I believe this would give a strong boost
in performance (see 3.3). Moreover, the SVM and detector parameters could be optimized further.
Lastly, one could certainly use more training data to improve performance. Adding training data with the specific
features discussed in the previous subsection might also help eliminate unreasonable false positives, though it may not
significantly improve accuracy.
Hard negatives provided only moderate improvements, and it is unlikely that mining even harder negatives would improve accuracy significantly.