CS143

Final Project: Real Time Gaze Tracking in a Retail Environment

by rmcverry
10/10/11

Introduction:

Consumer shopping behavior is an important field of research for product and retail store marketers. One particular area of interest is how consumers browse and absorb product arrangements and store displays, and whether product arrangement influences purchasing decisions. Previously, consumer habit studies had to be performed by undercover human observers on the store floor or in a camera control room. These individuals were paid to observe how consumers interacted and reacted to displays, noting in particular how consumers' eyes and gaze traveled around a display. We can now use computers to track a consumer's gaze, and we can build inexpensive systems that allow product and retail store marketers to gather even more data on their customers' shopping habits. My project was to develop an online gaze tracking system that can be deployed with a basic webcam and a low-end computer.

Specifications and Criteria:

I tried to follow the basic principles of engineering design in selecting my final project's specifications, criteria, and constraints.

Retail Environment:

Frame and Composition Considerations: The webcam should have a wide-angle lens. While aisles vary in width, they can be quite narrow, which forces consumers to stand and browse shelves at very close distances, so it is important for the camera to capture as wide a frame as possible. The background will be either another shelf full of products or an open space. The open space could also contain other shoppers, so it may be helpful to focus only on the shoppers closest to the camera. Additionally, we can assume that the camera will remain stationary. Most retail environments are well lit (often with harsh, intense lighting), but this is not guaranteed; some stores, such as Abercrombie and Fitch, are notoriously poorly lit.

Implementation Considerations: One of the constraints that I imposed on myself was that my program should work in real time. This would speed up the marketers' work by performing the analysis as the data is collected. However, this constraint forced me to be speed conscious: I had to minimize the time it took to analyze a single frame so that I could move on to the next frame as soon as possible.

Cost Considerations: Ideally, such an implementation would use one relatively inexpensive camera and a low-end computer.

Implementation:

I decided to write my project's code in Python using the open source library OpenCV. Python enabled me to rapidly prototype my project and to explore and learn from various OpenCV examples. This is the only decision I made that explicitly reduces performance. I found the OpenCV library fairly simple to use, and Python allowed me to debug errors quickly. In a future release it would be advantageous to use a fast numerical and statistical library such as NumPy/SciPy to speed up the linear algebra and statistical operations.

The general approach I took was to first use OpenCV's version of the Viola-Jones boosted cascade classifier to detect a face, then find features on that face to track using OpenCV's version of Harris corner detection. I then used a basic geometrical model – the head modeled as a sphere – and the relationships between the features to determine head rotation. Finally, more geometrical reasoning determines where the head is facing on a vertical plane directly behind the camera.

Pipeline:

Face Detection:

A face is detected using OpenCV's variation of the Viola-Jones AdaBoost cascade classifier, wrapped in the library function cv.HaarDetectObjects(). My implementation uses a pre-trained face classifier that I downloaded from OpenCV's website (see notes at the end of this writeup). Detecting a face can still be a costly operation, but I reduced the runtime for each frame after the face is first detected: for each subsequent frame, I only search for the face locally, using a padded search window around the previously detected face's location. See the results for implementation-specific information.
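
A minimal sketch of this padded-window search is shown below. It is written against the modern cv2 API rather than the legacy cv module used in the project, and the cascade file name, padding amount, and detector parameters are illustrative choices rather than the exact values used.

    import cv2

    # Illustrative cascade file name; any pre-trained frontal face cascade works
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

    def detect_face(gray, prev_rect=None, pad=40):
        # If a face was found last frame, only search a padded window around it
        x0, y0 = 0, 0
        roi = gray
        if prev_rect is not None:
            px, py, pw, ph = prev_rect
            x0 = max(px - pad, 0)
            y0 = max(py - pad, 0)
            x1 = min(px + pw + pad, gray.shape[1])
            y1 = min(py + ph + pad, gray.shape[0])
            roi = gray[y0:y1, x0:x1]
        faces = face_cascade.detectMultiScale(roi, scaleFactor=1.2,
                                              minNeighbors=3, minSize=(40, 40))
        if len(faces) == 0:
            return None
        # Keep only the largest (closest) face
        fx, fy, fw, fh = max(faces, key=lambda r: r[2] * r[3])
        return (fx + x0, fy + y0, fw, fh)  # translate back to full-frame coordinates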

Finding Good Features:

Once a face has been located for the first time (i.e., the previous frame did not contain a detected face), I search within the bounded face coordinates for good features to track. I first copy the subimage containing only the face, then take the first or second Gaussian pyramid level to get access to lower-level features. I find robust features using the library function cv.GoodFeaturesToTrack(), which is a modification of Harris corner detection. I then supplement these features with features found around the mouth.
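
The following sketch shows this step using the modern cv2 equivalents; the pyramid depth, corner count, and quality parameters are illustrative assumptions.

    import cv2

    def find_face_features(gray, face_rect, pyr_levels=1, max_corners=20):
        # Crop the face, drop to a lower Gaussian pyramid level, then find corners
        fx, fy, fw, fh = face_rect
        face = gray[fy:fy + fh, fx:fx + fw].copy()
        small = face
        for _ in range(pyr_levels):
            small = cv2.pyrDown(small)          # one Gaussian pyramid level
        corners = cv2.goodFeaturesToTrack(small, maxCorners=max_corners,
                                          qualityLevel=0.01, minDistance=5)
        if corners is None:
            return []
        scale = 2 ** pyr_levels
        # Map pyramid-level coordinates back into full-frame coordinates
        return [(fx + cx * scale, fy + cy * scale)
                for cx, cy in corners.reshape(-1, 2)]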

I wanted as much symmetry as possible between my features, so I chose to explicitly enforce spatial symmetry across the vertical midline by searching for the mouth with a pre-trained mouth classifier, in a similar manner to finding the face. I limited the search for the mouth to the lower half of the face sub-image to reduce search time. I then cropped to that region and explicitly found more features from both halves of the mouth using cv.GoodFeaturesToTrack().
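
A sketch of the mouth step, again in the cv2 API; the mouth cascade file name and the per-half corner count are illustrative.

    import cv2

    # Illustrative mouth cascade file name
    mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")

    def find_mouth_features(gray, face_rect, per_half=4):
        # Search only the lower half of the face for the mouth
        fx, fy, fw, fh = face_rect
        lower = gray[fy + fh // 2:fy + fh, fx:fx + fw]
        mouths = mouth_cascade.detectMultiScale(lower, scaleFactor=1.2, minNeighbors=5)
        if len(mouths) == 0:
            return []
        mx, my, mw, mh = max(mouths, key=lambda r: r[2] * r[3])
        feats = []
        # Take a few corners from the left and right halves of the mouth
        for x0, x1 in ((mx, mx + mw // 2), (mx + mw // 2, mx + mw)):
            half = lower[my:my + mh, x0:x1]
            corners = cv2.goodFeaturesToTrack(half, per_half, 0.01, 3)
            if corners is None:
                continue
            feats += [(fx + x0 + cx, fy + fh // 2 + my + cy)
                      for cx, cy in corners.reshape(-1, 2)]
        return feats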

Pair Up Features:

The next step is to pair up features. Since I modeled the head as a sphere (see Geometrical Model below), I wanted to take advantage of x-axis and y-axis symmetries to quickly approximate head rotation. Thus, I wanted each feature to be associated with a near-symmetrical partner. To do so, I used the face frame's top-left to bottom-right diagonal to divide the features into two groups. I then used a greedy algorithm that reflects a feature from the first group across the diagonal and finds its nearest neighbor in the second group. It then pops the two paired features from their respective groups and puts them into an ordered feature list. This step only happens during the online initialization phase; note that the distance between each pair is recorded.
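
A sketch of this greedy pairing follows; the normalized-coordinate reflection and the exact tie-breaking are illustrative, not necessarily the project's exact implementation.

    def pair_features(features, face_rect):
        # Split features by the face frame's top-left to bottom-right diagonal,
        # then greedily pair each feature with its reflected nearest neighbor.
        fx, fy, fw, fh = face_rect
        # Normalized coordinates inside the face frame; the diagonal is u == v
        norm = [((px - fx) / fw, (py - fy) / fh, (px, py)) for px, py in features]
        group_a = [f for f in norm if f[1] < f[0]]    # one side of the diagonal
        group_b = [f for f in norm if f[1] >= f[0]]   # the other side

        pairs = []
        for u, v, pa in group_a:
            if not group_b:
                break
            ru, rv = v, u   # reflect across the diagonal by swapping coordinates
            j = min(range(len(group_b)),
                    key=lambda k: (group_b[k][0] - ru) ** 2 + (group_b[k][1] - rv) ** 2)
            pb = group_b.pop(j)[2]
            dx, dy = abs(pa[0] - pb[0]), abs(pa[1] - pb[1])
            pairs.append((pa, pb, (dx, dy)))  # record the initial x and y separations
        return pairs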

Tracking the Face and Features:

Now that we have a face and features, we quickly track them across each subsequent frame. The face is found again to track the subject's translational motion; it is tracked within a subwindow as previously described. Note that the mouth is not re-detected, nor are new features found. Instead, I track the previously found features using OpenCV's library function cv.CalcOpticalFlowPyrLK(), which employs a pyramidal Lucas-Kanade method to quickly track features from frame to frame. A new distance is found for each feature pair, and that is used to determine rotational motion below.
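
A minimal tracking sketch with the modern cv2 equivalent of that call; the window size, pyramid level, and termination criteria are illustrative defaults.

    import cv2
    import numpy as np

    def track_features(prev_gray, gray, prev_pts):
        # Pyramidal Lucas-Kanade tracking of last frame's features into this frame
        p0 = np.float32(prev_pts).reshape(-1, 1, 2)
        p1, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, gray, p0, None,
            winSize=(15, 15), maxLevel=2,
            criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 20, 0.03))
        good = status.reshape(-1) == 1
        return p1.reshape(-1, 2), good   # new positions plus a per-feature success mask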

Geometrical Model and Reasoning:

To approximate head rotation, I used a very basic model: the head is treated as a sphere centered in the face's frame. I could then look for near-symmetrical features across both the x and y axes. The importance of this symmetry is shown below, where d is the original x component of the distance between two features in the image, d' is the current distance, and theta1 and theta2 represent the features' angles in the depth plane measured from the x,y origin. Phi is the rotational angle that we are looking for:

d = x1 - x2 = sin(theta1) - sin(theta2)
d' = x1' - x2' = sin(theta1 + phi) - sin(theta2 + phi)
d' = cos(phi)(sin(theta1) - sin(theta2)) + sin(phi)(cos(theta1) - cos(theta2))

If we choose two near-symmetrical features across the y axis, then cos(theta1) - cos(theta2) approaches zero. We can then isolate phi by taking the ratio of d' over d, which leaves us with:

d' / d = cos(phi)

or

phi = arccos(d'/d)

Thus, to find the horizontal and vertical rotation, we take the ratio between the current distance and the original distance between the two features along the relevant axis, and then take the arc cosine.
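
As a small worked sketch of that last step (the clamping of the ratio is an added safeguard, not part of the original derivation):

    import math

    def rotation_from_pair(d_original, d_current):
        # phi = arccos(d'/d); clamp the ratio to [-1, 1] to guard against noise
        if d_original == 0:
            return 0.0
        ratio = max(-1.0, min(1.0, d_current / d_original))
        return math.acos(ratio)

    # Example: a pair originally 60 px apart that is now 45 px apart
    # gives phi = arccos(0.75), roughly 41 degrees of rotation.
    phi = rotation_from_pair(60.0, 45.0)
    print(math.degrees(phi))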

Results:

The online initialization phase took less than 0.5 s, averaging around 0.4 s.
The tracking step took about 0.2 s per frame, or roughly 5 frames per second.

The results seemed correlated with my motions, but they were not precise. This is probably because my model was not robust enough and errors would accumulate.

Screenshot:

blue square - current face frame
white dot - current feature location
green dot - prev feature location

Example Output:

1st col: duration (secs), 2nd col: x (ft), 3rd col: y (ft)

go track
tracked
find mouth features
found mouth features
0.44 -3.0 7.07179676972
0.18 0.157568896023 4.79658801394
0.18 0.212155578549 4.72071440823
0.16 0.209731038507 4.71159583877
0.17 0.200674501491 4.71392270485
0.16 0.212285028784 4.67818015072
0.16 0.198946057771 4.67552897229
0.15 0.194352198886 4.67583101318
0.16 0.205477675059 4.65128805954
0.16 0.242197982622 4.67371771667
0.17 0.262352339648 4.68274899381
0.17 0.249420498757 4.67998769717
0.15 0.266545233027 4.69351635541
0.16 0.262625429657 4.69709801092
0.15 0.256093302317 4.66023688651
0.16 0.28655571098 4.67454955407
0.15 0.300295303596 4.63255286658
0.16 0.294018871151 4.60311139158
0.15 0.272590575999 4.59337417874
0.15 0.273671387702 4.5904357018
0.16 0.274429240153 4.58342379902
0.15 0.269678692404 4.57397031712
0.16 0.31046946868 4.57287905163
0.16 0.489875533822 4.41150045015
0.16 0.678800386373 3.99446740157
0.16 0.783758807568 3.91209347142
0.15 0.885156428515 3.83851045074
0.16 0.972098193634 3.7863119425
0.14 1.0084736988 3.80369185556
0.16 1.08943727703 3.82578979664
0.16 1.11645068556 3.81924548206
0.17 1.13073753289 3.82480262576
0.15 1.1439274364 3.82066417456
0.14 1.14918839768 3.80431319953
0.15 1.16559928773 3.80069895437
0.16 1.14481397565 3.80841469566
0.16 1.13706117367 3.81802213655
0.15 1.13260404503 3.82132390533
0.16 1.1149772371 3.8243271767
0.16 1.10299884748 3.83484400492
0.15 1.08734172553 3.84690562774
0.15 1.05949834563 3.83366998351
0.16 0.984033072408 3.86303828005
0.15 0.871572804079 3.90395740739
0.15 0.797252277148 3.92393788395
0.16 0.691500352026 4.0521723532
0.17 0.564223086385 4.26988048301
0.14 0.397159264238 4.34530049297
0.15 0.242870713862 4.35219689549
0.16 0.174605219415 4.31648368779
0.16 0.331477369077 4.29013944177
0.15 0.475947994753 4.28515002605
0.15 0.649552039018 4.27317393117
0.15 0.978011131131 4.27528072702
0.15 1.14162202956 4.08036692772
0.15 1.32735608721 3.8651454803
0.16 1.44882136636 3.72089958988
0.14 1.5764008596 3.60811101286
0.14 1.74074081461 3.53093408969
skip - too many features drifted
go track
tracked
find mouth features
found mouth features
0.15 1.83995859839 3.5169008416
0.13 0.954156413147 4.7352172372
0.14 1.54500070835 4.77205657378
skip - too many features drifted
go track
tracked
find mouth features
found mouth features
0.14 1.58847816107 4.69966848035
0.2 0.166116278503 4.90296466843
0.1 0.172447185932 4.87656820625
0.12 0.171829460098 4.87341725362
0.13 0.295586347572 4.8301969351

Future Improvements:

- Allow for more than one face; in the interest of speeding up the pipeline, I currently throw out every face except the first face detected.
- Threshold face depth to only allow faces that are close to the camera; this can be quickly implemented by rejecting faces whose bounding boxes fall below a minimum size, as sketched after this list.
- Product packaging may include faces. The software will need to discriminate between real faces and faces on packaging. Two quick fixes include rejecting faces that do not move, or eliminating faces that match textures from a database of product packaging images.
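
A sketch of the bounding-box size threshold mentioned above; the 120-pixel minimum width is an illustrative value, not one measured for this project.

    def face_is_close(face_rect, min_width=120):
        # Reject faces whose bounding box is too small, i.e. too far from the camera
        return face_rect[2] >= min_width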

Sources:

Cascades