Face Detection with a Sliding Window
Kyle Cackett
Face detection with a sliding window is performed by training a classifier on known examples of faces and non-faces and then using that classifier to separate faces from non-faces in a test set. To train the classifier we first transform known examples and non-examples of faces into a high dimensional space. This transformation is designed to separate the faces from the non-faces (that is, we hope the faces will cluster away from the non-faces in the higher dimensional representation). We then feed the high dimensional coordinates of the faces and non-faces, along with a label vector, to an SVM, which uses this information to construct a boundary between the two classes; that boundary is then used to categorize new inputs to the face detector as “face” or “non-face.”
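A minimal sketch of this training-and-classification pipeline is shown below, assuming a placeholder extract_features helper, stand-in random crops, and scikit-learn's LinearSVC as the SVM; these are illustrative choices, not the tooling used for the project.

import numpy as np
from sklearn.svm import LinearSVC

def extract_features(crops):
    # Hypothetical placeholder: map each image crop to a high dimensional
    # feature vector (later sections use a SIFT descriptor instead).
    return np.stack([crop.astype(float).ravel() for crop in crops])

# stand-in data: lists of equally sized grayscale patches
rng = np.random.default_rng(0)
face_crops = [rng.integers(0, 255, (36, 36)) for _ in range(5)]
nonface_crops = [rng.integers(0, 255, (36, 36)) for _ in range(5)]
test_window = rng.integers(0, 255, (36, 36))

X = np.vstack([extract_features(face_crops), extract_features(nonface_crops)])
y = np.concatenate([np.ones(len(face_crops)), -np.ones(len(nonface_crops))])

svm = LinearSVC(C=1.0)   # linear boundary in the high dimensional space
svm.fit(X, y)            # learn the face / non-face boundary

# classify a new window: positive decision value means "face"
score = svm.decision_function(extract_features([test_window]))[0]
label = "face" if score > 0 else "non-face"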
Representation
I experimented with several basic representations for image patches before settling on the recommended SIFT descriptors. I tried using the image patch itself and the gradient of the image patch. These representations were not expressive enough to adequately separate the face crops from the non-face crops; I achieved accuracies of less than 5% with them. I therefore chose SIFT features as the high dimensional representation of my crops. For each window I extracted a single 128 dimensional SIFT vector that describes that window.
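The sketch below shows one way to compute such a descriptor, assuming OpenCV's SIFT implementation and a single keypoint sized to cover the crop; the crop size and keypoint placement are assumptions rather than the project's exact settings.

import cv2
import numpy as np

def window_sift_descriptor(window):
    # window: a grayscale uint8 crop (e.g. 36x36; the size is illustrative).
    # One keypoint centered in the crop and scaled to span it yields a
    # single 128-dimensional SIFT descriptor for the whole window.
    h, w = window.shape
    sift = cv2.SIFT_create()
    kp = cv2.KeyPoint(w / 2.0, h / 2.0, float(min(h, w)))
    _, desc = sift.compute(window, [kp])
    return desc[0]    # shape (128,)

# example usage on a stand-in crop
crop = np.random.randint(0, 255, (36, 36), dtype=np.uint8)
print(window_sift_descriptor(crop).shape)    # (128,)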
Mining Hard Negatives
Initially the classifier uses known crops of faces as positive examples and random crops from images that contain no faces as negative examples. To better find the boundary between faces and non-faces in the high dimensional space, we must search for negative examples that lie as close to the boundary as possible (this produces a much more accurate classifier). To find these negatives we first create a classifier using random crops of non-faces, then run that classifier on a set of images without faces and use all the examples the detector incorrectly identifies as faces to train a new classifier (with the hope that this new classifier will better identify the boundary between faces and non-faces). This process can be repeated any number of times. When mining hard negatives I executed the process only once and used 2000 false-positive crops to train the new classifier.
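A rough sketch of one mining round follows; run_detector is a hypothetical helper that slides the current classifier over an image and returns the feature vectors of windows it labels as faces, and the 2000-crop cap mirrors the number used above.

import numpy as np
from sklearn.svm import LinearSVC

def mine_hard_negatives(svm, nonface_images, run_detector, max_hard=2000):
    # Every detection on a face-free image is by definition a false
    # positive, so its feature vector becomes a "hard" negative example.
    hard = []
    for img in nonface_images:
        hard.extend(run_detector(svm, img))
        if len(hard) >= max_hard:
            break
    return np.array(hard[:max_hard])

def retrain_with_hard_negatives(X, y, hard_negatives):
    # Append the mined negatives to the original training set and retrain.
    X_new = np.vstack([X, hard_negatives])
    y_new = np.concatenate([y, -np.ones(len(hard_negatives))])
    return LinearSVC(C=1.0).fit(X_new, y_new)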
The Classifier
SVM classifiers can be linear or non-linear. Linear classifiers use a hyperplane in the high dimensional space as the boundary between faces and non-faces. Non-linear classifiers can have arbitrarily complex boundaries, created by applying a non-linear transformation to the training data via a kernel matrix before constructing the classifier. However, non-linear classifiers are much more expensive to compute, so it is impractical to use all of the available training data to construct the classifier. I experimented with both linear and non-linear classifiers; results are included below.
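The contrast between the two classifiers, and the sigma sanity check described in the Results section, can be sketched as follows; scikit-learn's SVC and the dummy training arrays are assumptions, and note that SVC's gamma equals 1 / (2 * sigma^2) for this form of the RBF kernel.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 128))    # stand-in SIFT features
y_train = np.concatenate([np.ones(100), -np.ones(100)])

# Linear SVM: the boundary is a hyperplane w.x + b = 0 in feature space.
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Non-linear SVM: the RBF kernel K(a, b) = exp(-||a - b||^2 / (2 * sigma^2))
# replaces the dot product, allowing a curved boundary. Building and storing
# the N x N kernel matrix is what makes this expensive for large N.
sigma = 1000.0
rbf_svm = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X_train, y_train)

# Checking sigma: the kernel matrix should contain a spread of values,
# not just 0s (sigma too small) or 1s (sigma too large).
sq_dists = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))
print(K.min(), K.mean(), K.max())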
Results
All results shown below use SIFT representations for the image crops. I first experimented with linear classifiers. I initially trained with 1000 positive and 1000 negative examples (2000 training crops in total). With a lambda of 1.0 and the default detector parameters I achieved an accuracy of 0.241 (shown below).
The classifier used to generate the curve above was trained by taking the first 1000 negative examples from the set of images without faces. I decided to limit the number of negative training crops taken from each scene to 10 in an attempt to diversify my negative examples. Leaving all other parameters the same, I achieved an accuracy of 0.308 (shown below).
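A minimal sketch of that per-scene sampling limit is given below; the crop size and the random generator are assumptions made for illustration.

import numpy as np

def sample_negative_crops(images, per_scene=10, win=36, seed=0):
    # Draw at most `per_scene` random crops from each face-free image so
    # that no single scene dominates the negative training set.
    rng = np.random.default_rng(seed)
    crops = []
    for img in images:
        h, w = img.shape
        for _ in range(per_scene):
            y = int(rng.integers(0, h - win + 1))
            x = int(rng.integers(0, w - win + 1))
            crops.append(img[y:y + win, x:x + win])
    return crops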
I then added a second stage to training in which I mined hard negatives using the classifier created above. By adding the hard negatives found by mining the negative data set to the existing training data, I was able to train a linear classifier that achieved an accuracy of 0.395, leaving all other parameters the same.
Finally, I trained a non-linear classifier. I used an RBF kernel with lambda 5 and sigma 1000 (I selected sigma experimentally so that the kernel matrix would contain a variety of values, not just zeros and ones). I used a single training stage with 2000 positive and 2000 negative training crops (4000 crops total). When running the detector on the test set I lowered the step size to 2 and the starting scale to 2 but did not adjust the scale factor or the confidence level. With these parameters I achieved an accuracy of 0.720 (shown below). By adjusting the scale factor I could presumably have obtained an even greater accuracy.
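For reference, the detector loop these parameters control can be sketched as follows; the window size, the feature function, and the exact interpretation of the step size, starting scale, and scale factor are illustrative assumptions rather than the settings used above.

import numpy as np
import cv2

def detect_faces(svm, image, feature_fn, win=36, step=2, start_scale=1.0,
                 scale_factor=1.5, threshold=0.0):
    # Slide a fixed-size window over successively rescaled copies of the
    # image; every window scoring above the confidence threshold becomes a
    # detection mapped back to original-image coordinates.
    detections = []
    scale = start_scale
    while min(image.shape[:2]) / scale >= win:
        resized = cv2.resize(image, None, fx=1.0 / scale, fy=1.0 / scale)
        h, w = resized.shape[:2]
        for y in range(0, h - win + 1, step):
            for x in range(0, w - win + 1, step):
                feat = feature_fn(resized[y:y + win, x:x + win])
                score = svm.decision_function(feat[None, :])[0]
                if score > threshold:
                    detections.append((x * scale, y * scale, win * scale, score))
        scale *= scale_factor
    return detections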