Project 4: Facial Detection with a Sliding Window
Tala Huhe (thuhe)
Overview
Facial detection is one of the primary applications of computer vision. Given an image, a facial detector tries to find all of the faces in that image. As with many other computer vision tasks, it is conceptually simple for humans but difficult for computers.
One method for facial detection is to classify independent image patches as either faces or non-faces. By varying the size and position of patches taken from a test image, one can hope to find all of the faces within it. Such a model is called a sliding-window detector.
In this project, we implement a sliding-window face detector for the CMU+MIT test set. Though sliding windows are conceptually simple, they work well enough for our purposes.

Algorithm
Below is a rough outline of the training pipeline:
- We sample our initial training data for positive (faces) and negative (non-faces) patches.
- We extract feature vectors from the initial training patches.
- We train a linear SVM using the initial feature vectors. This gives us a rough classifier that we can use for facial detection. Later, we swap it out for a non-linear RBF SVM for comparison.
- In order to improve our classifier, we run it on images containing no faces. Every resulting detection is a false positive, and we use these to re-train our classifier. We repeat this process a few times.
- We run our classifier on the testing set, and look at the results.
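The core of the training steps above, fitting a linear classifier on labeled feature vectors, can be sketched as follows. This is a minimal illustration using a hand-rolled hinge-loss subgradient solver and synthetic feature vectors, not the actual SVM package or data used in the project:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Train a linear SVM (w, b) by subgradient descent on the hinge loss.
    X: (n, d) feature matrix; y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                      # margin violators
        if mask.any():
            grad_w = lam * w - (y[mask, None] * X[mask]).mean(axis=0)
            grad_b = -y[mask].mean()
        else:
            grad_w, grad_b = lam * w, 0.0
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic stand-ins for face / non-face feature vectors.
rng = np.random.default_rng(0)
pos = rng.normal(+2.0, 0.5, size=(100, 4))      # "faces"
neg = rng.normal(-2.0, 0.5, size=(100, 4))      # "non-faces"
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(100), -np.ones(100)])

w, b = train_linear_svm(X, y)
accuracy = (np.sign(X @ w + b) == y).mean()
```

The decision value `X @ w + b` is what later serves as the detection confidence; thresholding it decides face versus non-face.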
Positive and Negative Training Data
We use the Caltech Web Faces project for our positive training set. Below are some sample crops used as positives during training:

As for negative training data, we use a combination of Wu et al. and the SUN Database. Below are some sample training negatives:

Feature Extraction
We use SIFT descriptors to represent features in our images. Each descriptor is built over a spatial bin by computing histograms of gradient orientations across multiple directions and scales, and we sample the bins uniformly from the image.
These descriptors have some nice properties: they are invariant to scale, orientation, and affine transforms, and partially invariant to brightness shifts. Because of this, they are a fairly standard way of representing features in images. We extract SIFT descriptors densely, at regular intervals, in each test image.
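As a simplified illustration of the underlying idea (a full SIFT descriptor adds spatial sub-bins, interpolation, and more), the sketch below computes magnitude-weighted orientation histograms over a dense grid of patches. The patch size, stride, and bin count are illustrative placeholders, not the project's actual settings:

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Histogram of gradient orientations over one patch,
    weighted by gradient magnitude (a simplified, SIFT-like descriptor)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)       # orientation in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist     # L2-normalize

def dense_features(image, patch=16, stride=8):
    """Extract one descriptor per patch on a regular grid."""
    h, w = image.shape
    feats = []
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            feats.append(orientation_histogram(image[r:r + patch, c:c + patch]))
    return np.array(feats)

img = np.random.default_rng(1).random((64, 64))
F = dense_features(img)                          # 7 x 7 grid of descriptors
```

Each row of `F` is one normalized histogram; concatenating or stacking such descriptors gives the feature vector fed to the SVM.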
Detection
In order to process one image, we extract crops at various sizes and positions, then use our classifier to test whether each crop contains a face. By combining these independent detections across all crop sizes and positions, we can hope to find every face in the image.
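A minimal sketch of the crop-extraction step, assuming a grayscale image stored as a 2-D array; the base window size, scale set, and stride here are assumed placeholders:

```python
import numpy as np

def sliding_windows(image, base=36, scales=(1.0, 1.5, 2.0), stride=8):
    """Yield (row, col, size, crop) for crops at several sizes and positions."""
    h, w = image.shape
    for s in scales:
        size = int(round(base * s))
        for r in range(0, h - size + 1, stride):
            for c in range(0, w - size + 1, stride):
                yield r, c, size, image[r:r + size, c:c + size]

img = np.zeros((72, 72))
crops = list(sliding_windows(img))   # every crop gets scored by the classifier
```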
Of course, some crops overlap, and we may wind up with multiple detections for one face. In order to counteract this effect, when detections overlap we show only the crop with the highest reported confidence.
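This suppression step can be sketched as a greedy non-maximum suppression pass: sort detections by confidence, then discard any box that overlaps an already-kept box too much. The overlap threshold below is an assumed value, not the one used in the project:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (r1, c1, r2, c2)."""
    r1, c1 = max(a[0], b[0]), max(a[1], b[1])
    r2, c2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, r2 - r1) * max(0, c2 - c1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def non_max_suppression(boxes, scores, thresh=0.3):
    """Keep only the highest-confidence box among overlapping detections."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (30, 30, 40, 40)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)   # second box overlaps the first
```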
Mining Hard Negatives
The goal of this step is to lower the generalization error of the SVM by providing it with more training points. We do this by finding false positives for our SVM and re-training it on that data. To find false positives, we simply run our classifier on a random portion of our negative training set; since these images contain no faces, every detection is a false positive. We then re-train our classifier with this data added as negatives. Below are some crops which a linear SVM regarded as hard negatives:

As we can see, most of the hard negatives contain face-like features. However, some (notably in row 6) actually are faces seen from different perspectives. By training on these as negatives, we may actually be confusing our classifier about what is and isn't a face.
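The mining loop itself can be sketched as follows. The scoring function here is a hypothetical stand-in for the SVM's decision value, used only to make the example self-contained:

```python
import numpy as np

def mine_hard_negatives(score_fn, negative_patches, threshold=0.0):
    """Return face-free patches that the classifier scores as faces.
    Every such detection is a false positive by construction."""
    scores = score_fn(negative_patches)
    return negative_patches[scores > threshold]

# Hypothetical scorer: mean intensity stands in for the SVM decision value.
rng = np.random.default_rng(2)
negatives = rng.normal(0.0, 1.0, size=(500, 16))
score = lambda X: X.mean(axis=1)

hard = mine_hard_negatives(score, negatives)
# These false positives are appended to the negative training set and the
# SVM is re-trained; the whole loop repeats for a few rounds.
```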
Testing Results
Our algorithm performs reasonably well for our given test set. The following results are constructed with a linear SVM with no hard negative mining.




















We plotted each image with all recognized positives, colored by confidence: the most confident regions are yellow, and the least confident are black. The first thing to notice is that faces tend to have the yellowest boxes. Also notice the extremely high number of false positives. Since the precision-recall evaluation does not heavily penalize low-confidence false positives, the classifier has little incentive to suppress them. Finally, notice that in some images there isn't much of a color difference between faces and non-faces. This suggests that our classifier has a hard time deciding and hasn't been trained enough.



Below is the precision-recall curve for the above classifier. It reports an average precision of 0.272, or 27.2%.

Linear vs Non-Linear
Below is the precision-recall curve for a non-linear SVM, with no hard negative mining.

It reports an average precision of 0.436, or 43.6%. Even without achieving full recall, this classifier provides a 16.4 percentage point increase over the linear alternative.
Mining Hard Negatives
We mined for hard negatives using both the linear and non-linear SVMs. Below are the results of the linear and non-linear SVMs after two rounds of mining:


Tweaking Values and Final Results
We notice that we never achieve a full recall of 1 on our graphs. This is because our confidence threshold for the SVM is too high, so low-confidence true detections are discarded. By relaxing this value, we can get a higher overall average precision. We took the classifiers from above and lowered the threshold by 3 to generate the graphs below:


We can also tweak the scale factor to improve performance, allowing us to find faces of intermediate sizes. Below are our results after tweaking the scale factor, but without relaxing the threshold:


We can get some marginal increases by combining the two:


Here are some visualizations from our best classifier above (AP = 0.803). They are colored in the same way as the visualizations above. Notice that there is now a dramatic color difference between faces and non-faces, meaning our classifier is much more confident about the faces it finds.



In some cases, far too many positives were found in an image; this is due to the relaxed threshold. Below is one such image:
