Tracking and Structure From Motion
by Sam Swarr (sswarr)

CS 143 Fall 2011


Finding Features to Track

The first step was to find suitable features to track. Harris corners work very well for this, so I used the supplied Harris corner detector to find corners on the first frame of the hotel sequence. Of the ~3000 corners returned, I kept the 500 with the highest strength. (I also removed the ones corresponding to the stumps on the ground, as I did not wish to track those):

[Figure: the initial points to be tracked. The ones in red left the frame during tracking (or were otherwise removed).]
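
A minimal sketch of that selection step, written in Python with scikit-image (the function names here are just the scikit-image API, not the supplied detector, and the frame filename is hypothetical): compute a Harris response map, take its local maxima, and keep the 500 strongest.

    import numpy as np
    from skimage import io
    from skimage.feature import corner_harris, corner_peaks

    def strongest_corners(image_path, n_keep=500):
        """Detect Harris corners and keep the n_keep strongest responses."""
        gray = io.imread(image_path, as_gray=True)
        response = corner_harris(gray)                    # Harris response map
        peaks = corner_peaks(response, min_distance=3)    # (row, col) local maxima
        strengths = response[peaks[:, 0], peaks[:, 1]]
        order = np.argsort(strengths)[::-1]               # strongest first
        return peaks[order[:n_keep]]

    # pts = strongest_corners("hotel.seq0.png")  # hypothetical frame filename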

Implementing the Kanade-Lucas-Tomasi (KLT) Tracker

The next step was to track the initial points over the frames of the video using a KLT tracker. At each frame, the tracker computes the x and y derivatives of the frame, as well as the difference between the current frame and the previous frame (known as the temporal gradient). By assuming that neighboring pixels move similarly and that corresponding points keep constant brightness across sequential frames, we can solve a small linear system at each pixel to produce optical flow fields for both horizontal and vertical movement. The optical flow of a frame describes how each pixel in the image is moving at that point in time:

[Figures: the horizontal optical flow, the vertical optical flow, and a composite of the two.]
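
To make the least-squares step concrete, here is a minimal NumPy/SciPy sketch of a single-scale, window-based Lucas-Kanade flow computation. It is illustrative only; the 15-pixel window and the use of SciPy's uniform_filter are assumptions, not necessarily what the tracker used.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def lucas_kanade_flow(frame1, frame2, window=15):
        """Dense Lucas-Kanade optical flow between two grayscale frames."""
        frame1 = frame1.astype(np.float64)
        frame2 = frame2.astype(np.float64)

        Iy, Ix = np.gradient(frame1)     # spatial derivatives (rows = y, cols = x)
        It = frame2 - frame1             # temporal gradient

        # Windowed sums of the products that form the 2x2 normal equations
        Ixx = uniform_filter(Ix * Ix, window)
        Iyy = uniform_filter(Iy * Iy, window)
        Ixy = uniform_filter(Ix * Iy, window)
        Ixt = uniform_filter(Ix * It, window)
        Iyt = uniform_filter(Iy * It, window)

        # Solve [Ixx Ixy; Ixy Iyy] [u; v] = -[Ixt; Iyt] at every pixel
        det = Ixx * Iyy - Ixy ** 2
        det[det == 0] = np.finfo(float).eps
        u = (-Iyy * Ixt + Ixy * Iyt) / det   # horizontal flow
        v = ( Ixy * Ixt - Ixx * Iyt) / det   # vertical flow
        return u, v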

After computing the optical flow of a frame, each tracked point's position was updated according to the flow values at that point's location, and each point's position at every frame was stored.

[Figures: the paths of twenty random tracked points over the frames, and the final frame overlaid on top of the paths.]
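
A short sketch of that update step, assuming (x, y) point coordinates and the u, v flow fields from above; bilinear interpolation (SciPy's map_coordinates with order=1) stands in here for whatever sub-pixel sampling the tracker actually uses.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def advance_points(points, u, v):
        """Move an (N, 2) array of (x, y) points by the flow at their positions."""
        xs, ys = points[:, 0], points[:, 1]
        # map_coordinates indexes as (row, col), i.e. (y, x)
        du = map_coordinates(u, [ys, xs], order=1)   # bilinear sampling of the flow
        dv = map_coordinates(v, [ys, xs], order=1)
        return np.column_stack([xs + du, ys + dv])

Points whose updated positions leave the image (like the red points in the earlier figure) can be dropped at this stage.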

Structure From Motion

Now that the points have been tracked over the frames of the video, we can infer the 3-D structure of the building. Following the general process outlined in Section 3.4 of this 1992 paper by Tomasi and Kanade, I calculated the rotation and shape matrices; equations 16 and 17 from this 1997 paper by Morita and Kanade supplied some of the intermediate steps.
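
As a rough illustration of the factorization step, here is a NumPy sketch of the affine (rank-3) part: build the registered measurement matrix from the tracked positions, take its SVD, and split it into initial motion and shape matrices. The metric upgrade via the orthonormality constraints (equations 16 and 17) is not shown, and the shape of the tracks array is an assumption for this sketch.

    import numpy as np

    def factor_structure_from_motion(tracks):
        """Affine rank-3 factorization of tracked points.

        tracks: (F, P, 2) array holding the (x, y) position of each of the
        P points in each of the F frames.  Returns an initial motion matrix
        M (2F x 3) and shape matrix S (3 x P); the metric constraints still
        have to be enforced on M and S afterwards.
        """
        F, P, _ = tracks.shape
        # Registered measurement matrix: x rows stacked over y rows, with
        # each frame's centroid subtracted so translation drops out.
        W = np.vstack([tracks[:, :, 0], tracks[:, :, 1]])      # (2F, P)
        W = W - W.mean(axis=1, keepdims=True)

        # Rank-3 factorization via SVD: W ~= M @ S
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :3] * np.sqrt(s[:3])                # initial camera axes per frame
        S = np.sqrt(s[:3])[:, None] * Vt[:3, :]      # initial 3-D point coordinates
        return M, S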

The resulting shape matrix contains the three-dimensional coordinates for the tracked points. The rotation matrix contains the camera axes at each frame. If we plot these:
[Figures: four views of the recovered 3-D structure and camera axes.]

The red lines show the camera's position and direction at each frame. Rotating the model makes the rectangular shape of the building evident.
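
For reference, a sketch of how such a plot could be produced with Matplotlib from the M and S matrices above; since the orthographic model has no true camera center, the placement of each red line here is purely illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_structure(S, M, scale=20.0):
        """Scatter the 3-D points and draw each frame's viewing direction in red."""
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        ax.scatter(S[0], S[1], S[2], s=4)

        F = M.shape[0] // 2
        center = S.mean(axis=1)
        for f in range(F):
            i_axis, j_axis = M[f], M[F + f]       # camera x- and y-axes at frame f
            k_axis = np.cross(i_axis, j_axis)     # viewing direction
            start = center - scale * k_axis       # illustrative placement only
            end = center - 2 * scale * k_axis
            ax.plot(*zip(start, end), color="red")
        plt.show()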