CSCI1430 Project 5:
Tracking and Structure from Motion

Soravit Beer Changpinyo (schangpi)

1. Introduction

The goal of this project is to reconstruct the 3D shape of an object from a series of video frames. There are three main components: first, keypoint selection, in which we use a Harris corner detector to select keypoints; second, feature tracking, in which we implement a Kanade-Lucas-Tomasi tracker, which involves computing the optical flow between successive video frames; and third, structure from motion, in which we implement the affine structure-from-motion procedure and then eliminate the affine ambiguity by applying metric constraints. The structure-from-motion algorithms are described in section 3.4 of "Shape and Motion from Image Streams under Orthography: a Factorization Method" (Tomasi and Kanade, 1992) and section 2.3 of "A Sequential Factorization Method for Recovering Shape and Motion from Image Streams" (Morita and Kanade, 1997).

2. Algorithm

The pipeline of the algorithm is as follows.
(1) Use a Harris corner detector to select 500 keypoints from the first video frame (a sketch of this step follows the list).
(2) Compute the optical flow between successive video frames with the Kanade-Lucas-Tomasi tracker, solving at each pixel the system of linear equations

        [ sum(Ix*Ix)  sum(Ix*Iy) ] [u]     [ sum(Ix*It) ]
        [ sum(Ix*Iy)  sum(Iy*Iy) ] [v] = - [ sum(Iy*It) ]

where Ix and Iy are the spatial image gradients, It is the temporal difference between the two frames, and each sum is taken over a window centered at the pixel.
(3) Track each keypoint through the flow field: look up its displacement at its current position in each frame and add that displacement to its position to obtain its position in the next frame.
(4) Construct the registered measurement matrix from the keypoint tracks, compute its singular value decomposition, keep the rank-3 truncation, and split it into two factors: a motion matrix of camera parameters and a shape matrix of 3D points (see the factorization sketch after this list).
(5) Eliminate affine ambiguity by solving another system of linear equations, which is based on the assumptions that image axes are perpendicular and of unit length.
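
As a concrete illustration of step (1), here is a minimal sketch of Harris keypoint selection in Python with NumPy/SciPy. The window sigma and the constant k are illustrative choices rather than the values used in my implementation, and the sketch omits the non-maximum suppression a full detector would include.

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel

    def harris_keypoints(img, n_keypoints=500, sigma=1.0, k=0.04):
        # Spatial image gradients.
        Ix = sobel(img, axis=1)
        Iy = sobel(img, axis=0)
        # Smoothed entries of the second-moment (structure) matrix.
        Sxx = gaussian_filter(Ix * Ix, sigma)
        Syy = gaussian_filter(Iy * Iy, sigma)
        Sxy = gaussian_filter(Ix * Iy, sigma)
        # Harris corner response R = det(M) - k * trace(M)^2 at every pixel.
        R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
        # Keep the n_keypoints strongest responses as (row, col) positions.
        idx = np.argsort(R.ravel())[::-1][:n_keypoints]
        rows, cols = np.unravel_index(idx, R.shape)
        return np.stack([rows, cols], axis=1)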
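
Step (4) admits an equally compact sketch. Here tracks is assumed to be an (F, P, 2) array holding the positions of P tracked keypoints over F frames; the name and layout are illustrative.

    import numpy as np

    def affine_sfm(tracks):
        F, P, _ = tracks.shape
        # Registered measurement matrix: x rows stacked over y rows,
        # with each row's centroid subtracted (2F x P).
        W = np.concatenate([tracks[:, :, 0], tracks[:, :, 1]], axis=0)
        W = W - W.mean(axis=1, keepdims=True)
        # Singular value decomposition, keeping only the rank-3 part.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        # Split the singular values evenly between the two factors:
        # M (2F x 3) holds camera parameters, S (3 x P) the 3D points.
        M = U[:, :3] * np.sqrt(s[:3])
        S = np.sqrt(s[:3])[:, None] * Vt[:3, :]
        return M, S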

In step 2, the system of linear equations is derived from three assumptions: small motion (points do not move far between frames), brightness constancy (the projection of the same point looks the same in every frame), and spatial coherence (points move like their neighbors). My implementation of this step is fast for two reasons: (i) to compute the sums of gradient products over each 15 x 15 window, I use a box filter; (ii) I solve the 2 x 2 system for every pixel at once, using the closed-form inverse of a 2 x 2 matrix, as sketched below.
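
A minimal sketch of this vectorized solve, assuming NumPy/SciPy and two grayscale float frames im1 and im2. Note that uniform_filter averages rather than sums over the window, which rescales both sides of the system equally and so leaves the solution unchanged.

    import numpy as np
    from scipy.ndimage import uniform_filter, sobel

    def optical_flow(im1, im2, window=15):
        # Spatial and temporal derivatives (brightness constancy).
        Ix = sobel(im1, axis=1)
        Iy = sobel(im1, axis=0)
        It = im2 - im1
        # Box-filter sums of gradient products over the 15 x 15 window.
        Sxx = uniform_filter(Ix * Ix, window)
        Syy = uniform_filter(Iy * Iy, window)
        Sxy = uniform_filter(Ix * Iy, window)
        Sxt = uniform_filter(Ix * It, window)
        Syt = uniform_filter(Iy * It, window)
        # Solve the 2 x 2 system at every pixel at once via the
        # closed-form inverse: [u v]^T = -A^{-1} b.
        det = Sxx * Syy - Sxy ** 2
        det[det == 0] = np.finfo(float).eps  # guard against division by zero
        u = -(Syy * Sxt - Sxy * Syt) / det
        v = -(Sxx * Syt - Sxy * Sxt) / det
        return u, v

Step (3) then advances each keypoint by sampling u and v at its current sub-pixel position, for example with scipy.ndimage.map_coordinates, and adding the sampled displacement to the position.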

In step 5, my implementation follows section 2.3 of the paper "A Sequential Factorization Method for Recovering Shape and Motion from Image Streams" (Morita and Kanade, 1997).
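
The following sketch shows one way to set up those metric constraints, assuming the affine factors M and S from the factorization sketch above. Each constraint of the form a^T L b = c on the symmetric matrix L = Q Q^T contributes one row of a linear least-squares system in the six unique entries of L.

    import numpy as np

    def metric_upgrade(M, S):
        F = M.shape[0] // 2
        I, J = M[:F], M[F:]  # i_f and j_f rows of the motion matrix

        # One row of the linear system for the constraint a^T L b = c.
        def row(a, b):
            return np.array([a[0]*b[0],
                             a[0]*b[1] + a[1]*b[0],
                             a[0]*b[2] + a[2]*b[0],
                             a[1]*b[1],
                             a[1]*b[2] + a[2]*b[1],
                             a[2]*b[2]])

        A, c = [], []
        for f in range(F):
            A.append(row(I[f], I[f])); c.append(1.0)  # unit-length axes
            A.append(row(J[f], J[f])); c.append(1.0)
            A.append(row(I[f], J[f])); c.append(0.0)  # perpendicular axes
        l = np.linalg.lstsq(np.array(A), np.array(c), rcond=None)[0]
        L = np.array([[l[0], l[1], l[2]],
                      [l[1], l[3], l[4]],
                      [l[2], l[4], l[5]]])
        # Recover Q from L = Q Q^T and apply it to both factors.
        Q = np.linalg.cholesky(L)
        return M @ Q, np.linalg.inv(Q) @ S

The Cholesky factorization assumes the recovered L is positive definite; if noise makes it indefinite, a full implementation would fall back to, for example, an eigenvalue-based factorization.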

3. Results and Discussion

3.1 Keypoint Selection

3.2 Feature Tracking

3.3 Structure from Motion

First, we look at only the 3D points from different viewpoints. We can also look at the mesh plot of the 3D points from different viewpoints and at the 3D path of the cameras. These views show that the choice of keypoints to track matters a great deal: some tracked points drift over the sequence, which makes the building walls look longer than they actually are. Apart from this drift, our algorithm performs very well.
The camera's viewing direction for each frame, which we use to trace the camera path, is given by the cross product kf = if x jf, where if and jf are the f-th rows of the two halves of the motion matrix. For consistent results, we normalize each kf to unit length. The following are plots of each dimension of kf over the frames.
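
For completeness, a small sketch of this computation, reusing the metric motion matrix M from the sketches above:

    import numpy as np

    def camera_axes(M):
        # i_f and j_f are the f-th rows of the two halves of M.
        F = M.shape[0] // 2
        K = np.cross(M[:F], M[F:])                     # kf = if x jf
        K /= np.linalg.norm(K, axis=1, keepdims=True)  # normalize to unit length
        return K                                       # (F, 3): one axis per frame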