Computer Vision, Project 5: Tracking and Structure from Motion

Bryce Richards


Project Description
In this project, we reconstruct the 3-D shape of a house from a sequence of frames extracted from a video.




Algorithm Design

The algorithm consists of three main steps. First, we select a number of keypoints on the house to track. Second, we track these points from frame to frame as the camera moves. Lastly, using this optical flow, we reconstruct the 3-D shape of the house. These steps are described in more detail below.

Step 1: Select Keypoints We used a Harris corner detector to select points to track. We chose to track the 300 keypoints with the strongest Harris corner responses. This gave us enough points for a detailed 3-D reconstruction, but few enough that the tracker could follow all of them reliably.



All 300 keypoints overlaid on the first video frame
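
Below is a minimal sketch of this keypoint selection step in Python (NumPy/SciPy). It computes the Harris response and keeps the strongest pixels; it omits the non-maximum suppression a full implementation would typically use, and the function and parameter names (harris_keypoints, n_points, sigma, k) are illustrative rather than the names used in our code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_keypoints(image, n_points=300, sigma=1.0, k=0.04):
    # image: 2-D float array (grayscale frame)
    # Image gradients (np.gradient returns the row/y derivative first)
    Iy, Ix = np.gradient(image)

    # Smoothed entries of the structure tensor
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)

    # Harris corner response: det(M) - k * trace(M)^2
    response = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

    # Keep the coordinates of the n_points strongest responses
    idx = np.argsort(response.ravel())[::-1][:n_points]
    ys, xs = np.unravel_index(idx, response.shape)
    return np.stack([xs, ys], axis=1).astype(float)  # (x, y) pairs
```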



Step 2: Feature Tracking We implemented a Kanade-Lucas-Tomasi tracker for the detected keypoints. This uses the x-, y-, and t-gradients of each pair of successive frames to predict where a point in the first frame will land in the second. More specifically, if I is the image intensity function, a first-order Taylor expansion gives the approximate relation I(x+u, y+v, t+1) ≈ I(x, y, t) + Ix*u + Iy*v + It, where u and v are the point's x- and y-displacement and Ix, Iy, and It are the x-, y-, and t-gradients of the image. Since a point's brightness is assumed constant between frames, I(x+u, y+v, t+1) = I(x, y, t), so the relation reduces to Ix*u + Iy*v + It ≈ 0. One equation cannot determine the two unknowns u and v, so we also assume that a point's motion matches that of its neighbors; stacking this constraint over a small window around each keypoint gives an overdetermined linear system whose least-squares solution is (u, v).
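
Below is a minimal sketch of one such tracking step, assuming a single non-iterative Lucas-Kanade update with a fixed square window that lies fully inside the image (a full KLT tracker would iterate and typically use an image pyramid). The names track_points and window are illustrative.

```python
import numpy as np

def track_points(I1, I2, points, window=7):
    # I1, I2: consecutive grayscale frames; points: (N, 2) array of (x, y)
    Iy, Ix = np.gradient(I1.astype(float))   # spatial gradients
    It = I2.astype(float) - I1                # temporal gradient
    half = window // 2
    moved = []
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        # Gradient values in the window centred on the keypoint
        wx = Ix[yi - half:yi + half + 1, xi - half:xi + half + 1].ravel()
        wy = Iy[yi - half:yi + half + 1, xi - half:xi + half + 1].ravel()
        wt = It[yi - half:yi + half + 1, xi - half:xi + half + 1].ravel()
        # Least-squares solution of  Ix*u + Iy*v + It = 0  over the window
        A = np.stack([wx, wy], axis=1)
        (u, v), *_ = np.linalg.lstsq(A, -wt, rcond=None)
        moved.append([x + u, y + v])
    return np.array(moved)
```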



Some tracked points left the field of view as the camera panned across the house. To handle this, we deleted any keypoint that came within 5 pixels of the edge of the frame. Below are the original keypoints that were deleted for this reason.


These points eventually leave (or come close to leaving) the frame
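
A small sketch of this border check, assuming keypoints are stored as an (N, 2) array of (x, y) coordinates; the name cull_border_points is illustrative.

```python
import numpy as np

def cull_border_points(points, frame_shape, margin=5):
    # points: (N, 2) array of (x, y); frame_shape: (height, width)
    h, w = frame_shape
    x, y = points[:, 0], points[:, 1]
    # Keep only keypoints at least `margin` pixels from every edge
    keep = (x >= margin) & (x < w - margin) & (y >= margin) & (y < h - margin)
    return points[keep]
```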



Below we show the keypoints at frames 1, 10, 20, 30, 40, and 50. Some drift in the keypoints is evident, but overall the tracking is accurate.

Frame 1




Frame 10



Frame 20



Frame 30



Frame 40



Frame 50


Step 3: Structure from Motion Using the keypoint tracks from step 2, we ran the affine structure from motion procedure described in "Shape and Motion from Image Streams under Orthography: a Factorization Method" (Tomasi and Kanade 1992).
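
The core of that procedure is a rank-3 factorization of the measurement matrix built from the keypoint tracks. Below is a minimal sketch of that factorization, assuming complete tracks (every keypoint visible in every frame) and omitting the metric-upgrade step Tomasi and Kanade use to resolve the remaining affine ambiguity; the names affine_sfm and tracks are illustrative.

```python
import numpy as np

def affine_sfm(tracks):
    # tracks: (F, P, 2) array; tracks[f, p] is the (x, y) position of
    # keypoint p in frame f (every keypoint visible in every frame)
    F, P, _ = tracks.shape

    # Stack x rows over y rows to form the 2F x P measurement matrix,
    # then subtract each row's centroid
    W = np.concatenate([tracks[:, :, 0], tracks[:, :, 1]], axis=0)
    W = W - W.mean(axis=1, keepdims=True)

    # Rank-3 factorization via SVD: W ~ M @ S
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])          # 2F x 3 motion (camera) matrix
    S = np.sqrt(s[:3])[:, None] * Vt[:3]   # 3 x P shape (structure) matrix
    return M, S
```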

Below are several views of the resulting 3-D reconstruction of the house. The red lines indicate the reconstructed camera motion from frame to frame.






Here is the reconstructed view of the camera alone, from three different perspectives.