The goal of the project is to train a recognizer for human faces, based on positive and negative training data in the form of image crops.
Algorithm
The major components of the algorithm are the feature representation of image crops and the iterative retraining of the linear and nonlinear classifiers. The basic pipeline breaks down as follows:
1. Choose an appropriate feature representation for the image crops.
2. Train an initial classifier on random subsamples of the positive and negative training data.
3. Iteratively retrain the classifier by including 'hard negatives' (false positives from the initial classifier on the negative training set) in the training subsample.
4. Evaluate on testing data.
For the image features, I used both SIFT features (from vl_dsift) and HoG features (my own implementation) and found similar performance.
Parameters:
I tried a number of feature parameter settings for my HoG features, and had to tweak the SVM parameters separately for each setting, but I held the following values constant:
Number of positive/negative training crops (before hard negatives): 6000, on both linear and nonlinear SVMs.
If using hard negatives, number of retraining iterations: Just the one.
Sliding window test set parameters: all defaults (step size 4, scale factor 1.5, min scale 50, start scale 3).
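For reference, a sliding-window search over an image pyramid with the step size and scale factor above might look like the sketch below. The 36x36 crop size, the nearest-neighbour downscaling, and the reading of "min scale" as a minimum image side length are my assumptions, not details from the report:

```python
import numpy as np

def sliding_windows(image, win=36, step=4, scale_factor=1.5, min_size=50):
    """Yield (x, y, scale, crop) over a coarse image pyramid.

    win=36 is a hypothetical crop size; step and scale_factor match the
    defaults quoted above. The image is repeatedly downscaled (here by
    nearest-neighbour striding, as a stand-in for proper resampling)
    until its shorter side drops below min_size.
    """
    scale = 1.0
    while min(image.shape[:2]) >= max(win, min_size):
        h, w = image.shape[:2]
        for y in range(0, h - win + 1, step):
            for x in range(0, w - win + 1, step):
                yield x, y, scale, image[y:y + win, x:x + win]
        # crude downscale by nearest-neighbour index striding
        idx_y = (np.arange(int(h / scale_factor)) * scale_factor).astype(int)
        idx_x = (np.arange(int(w / scale_factor)) * scale_factor).astype(int)
        image = image[idx_y][:, idx_x]
        scale *= scale_factor

# e.g. enumerate all candidate crops of a 100x100 test image
windows = list(sliding_windows(np.zeros((100, 100))))
```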
Results
SIFT with Nonlinear SVM:
The best accuracy value I managed to get was 0.545. This was achieved with SIFT features, a nonlinear SVM, and 150 mined hard negatives. I also tried the same settings with 500 hard negatives, but got a slightly lower result, which I attribute to my inability to tune the SVM parameters correctly.
The same settings without hard negatives yielded an accuracy of 0.528, slightly lower as expected. It is unclear whether the improvement is due to the characteristics of the hard negatives themselves, simply the addition of more data, or noise from my imperfect relative tuning of the parameters.
SIFT with Linear SVM:
I found that the addition of 500 hard negatives improved performance significantly for the linear classifier, which achieved accuracies of 0.402 without them and 0.444 with them. Again, the difference is partially due to the advantage of having more training data at all. I am more confident in the parameter tuning here, though, as I only had one degree of freedom with the linear SVM.
HoG Descriptor:
I implemented a HoG descriptor as well, and tested it on my linear classifier without mining hard negatives (to be compared against the 0.402 SIFT baseline). I experimented with 5 sets of parameters:
HoG 1:
- 9 orientation bins per histogram.
- cells of 3x3 pixels.
- cell normalization blocks of 3 cells.
- a block stride of 3 cells, meaning that my HoG windows do not overlap.
Accuracy: 0.396
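A minimal version of a HoG descriptor with these four knobs (orientation bins, cell size, normalization block size, block stride) might look like the following. This is a generic sketch, not my actual implementation; the HoG 1 settings are used as defaults:

```python
import numpy as np

def hog(img, bins=9, cell=3, block=3, stride=3):
    """Minimal HoG sketch. Defaults follow the HoG 1 settings above:
    9 orientation bins, 3x3-pixel cells, 3x3-cell normalization
    blocks, and a block stride of 3 cells (no overlap)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi              # unsigned orientation
    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    # one orientation histogram per cell, weighted by gradient magnitude
    hists = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hists[i, j], _ = np.histogram(a, bins=bins, range=(0, np.pi),
                                          weights=m)
    # L2-normalize blocks of cells; the stride controls block overlap
    feats = []
    for i in range(0, ch - block + 1, stride):
        for j in range(0, cw - block + 1, stride):
            b = hists[i:i+block, j:j+block].ravel()
            feats.append(b / (np.linalg.norm(b) + 1e-6))
    return np.concatenate(feats)
```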
HoG 2:
- 9 orientation bins per histogram.
- a larger cell of 9x9 pixels.
- cell normalization blocks of 2 cells.
- a block stride of 1 cell, meaning that my HoG windows overlap halfway.
Accuracy: 0.465
Intrigued that the coarser, overlapping HoG outperformed the finer, non-overlapping one, I sought to determine whether the cell size of the overlapping configuration was responsible, which led me to:
HoG 3:
- 9 orientation bins per histogram.
- cell of 9x9 pixels.
- cell normalization blocks of 2 cells.
- a block stride of 2 cells, so no overlap.
Accuracy: 0.247
The coarse, non-overlapping HoG performed quite poorly.
HoG 4:
- 9 orientation bins per histogram.
- cell of 3x3 pixels.
- cell normalization blocks of 3 cells.
- a block stride of 2 cells, so a 1/3 overlap.
(This is HoG 1 with overlap added.)
Accuracy: 0.409
Indeed, adding overlap/more windows improved performance.
Finally, I noticed that many of the HoG histogram bins had no elements, so I wondered whether I could decrease the number of histogram bins drastically without affecting performance too much.
HoG 5:
- 3 orientation bins per histogram.
(Otherwise identical to HoG2)
Accuracy: 0.414
This seems surprisingly reasonable for such a large reduction in feature size.
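The feature-size reduction is easy to quantify: with block normalization, the descriptor length is proportional to the number of orientation bins, so cutting 9 bins to 3 shrinks the feature by a factor of 3. A quick check, hypothetically assuming 36x36 crops (the report does not state the crop size):

```python
def hog_dim(img=36, cell=9, block=2, stride=1, bins=9):
    """Feature-vector length for given HoG parameters (HoG 2 defaults).
    The 36-pixel square crop size is an assumption, not a reported value."""
    cells = img // cell                        # cells per axis
    blocks = (cells - block) // stride + 1     # blocks per axis
    return blocks ** 2 * block ** 2 * bins     # total histogram entries

print(hog_dim(bins=9), hog_dim(bins=3))  # HoG 2 vs HoG 5
```

Under that assumption, HoG 5's descriptor is one third the length of HoG 2's, for a drop in accuracy from 0.465 to 0.414.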