CSCI1430 Project 5:
Tracking and Structure from Motion
Soravit Beer Changpinyo (schangpi)
1. Introduction
The goal of this project is to reconstruct the 3D shape of an object from a series of video frames.
There are three main components to this: first, keypoint selection, in which we use
a Harris corner detector to select keypoints; second, feature tracking, in which we implement
a Kanade-Lucas-Tomasi (KLT) tracker, which involves computing optical flow between successive video frames;
third, structure from motion, in which we implement the affine structure from motion procedure
and then eliminate the affine ambiguity by applying metric constraints. Structure from motion algorithms
are described in section 3.4 of the paper "Shape and Motion from Image Streams under Orthography:
a Factorization Method" (Tomasi and Kanade 1992), and section 2.3 of the paper
"A Sequential Factorization Method for Recovering Shape and Motion from Image Streams"
(Morita and Kanade, 1997).
2. Algorithm
The pipeline of the algorithm is as follows.
(1) Use a Harris corner detector to select 500 keypoints from the first video frame.
(2) Compute the optical flow between successive video frames with the Kanade-Lucas-Tomasi tracker,
which reduces to solving a small system of linear equations at each pixel.
(3) Track each keypoint through the flow by looking up its displacement in the current frame's flow field
and adding that displacement to its position to obtain its position in the next frame.
(4) Construct the registered measurement matrix from the keypoint tracks, compute
its singular-value decomposition, select parts of the resulting matrices whose product gives the measurement
matrix of rank 3, and combine those matrices into two factors: camera parameters and 3-D points.
(5) Eliminate affine ambiguity by solving another system of linear equations, which is based on the
assumptions that image axes are perpendicular and of unit length.
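To make step (4) concrete, here is a minimal NumPy sketch of the rank-3 factorization; the function name `affine_sfm` and the synthetic data are illustrative, not taken from the actual implementation:

```python
import numpy as np

def affine_sfm(W):
    """Factor a registered measurement matrix W (2F x P) into an affine
    camera matrix M (2F x 3) and 3-D points S (3 x P), as in step (4).

    W is assumed to be registered: the mean of each row has been
    subtracted, so every row sums to (approximately) zero.
    """
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the three largest singular values: the ideal W has rank 3.
    U3, d3, Vt3 = U[:, :3], d[:3], Vt[:3, :]
    # Split the singular values evenly between the two factors.
    M = U3 * np.sqrt(d3)            # camera (motion) matrix, 2F x 3
    S = np.sqrt(d3)[:, None] * Vt3  # shape matrix, 3 x P
    return M, S

# Synthetic check: project random 3-D points with random affine cameras.
rng = np.random.default_rng(0)
S_true = rng.standard_normal((3, 40))
M_true = rng.standard_normal((12, 3))  # 6 frames, 2 rows each
W = M_true @ S_true
W = W - W.mean(axis=1, keepdims=True)  # register (center) each row
M, S = affine_sfm(W)
print(np.allclose(M @ S, W, atol=1e-6))
```

Note that M and S are only determined up to an invertible 3 x 3 transform; step (5) resolves this ambiguity.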
In step 2, the system of linear equations is derived from three assumptions: small motion -
points do not move very far; brightness constancy - the projection of the same point looks
the same in every frame; and spatial coherence - points move like their neighbors.
My implementation of this step is fast because
(i) to compute the sums over each 15 x 15 pixel window, I use a box filter, and
(ii) the optical flow is solved for all image pixels at once,
using the closed-form inverse of a 2 x 2 matrix.
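A sketch of how (i) and (ii) can be combined, assuming NumPy plus SciPy's uniform_filter as the box filter (the function name and the degeneracy threshold on the determinant are my own choices, not from the actual implementation):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lucas_kanade_flow(im1, im2, win=15):
    """Dense Lucas-Kanade flow (u, v) between two grayscale frames.

    Sums over each win x win window are computed with a box filter, and
    the per-pixel 2 x 2 system is solved in closed form for every pixel
    at once, as in (i) and (ii).
    """
    Iy, Ix = np.gradient(im1)
    It = im2 - im1
    # Windowed sums via a box filter (uniform_filter returns the mean;
    # the constant 1/win^2 factor cancels in the 2 x 2 solve).
    Ixx = uniform_filter(Ix * Ix, win)
    Iyy = uniform_filter(Iy * Iy, win)
    Ixy = uniform_filter(Ix * Iy, win)
    Ixt = uniform_filter(Ix * It, win)
    Iyt = uniform_filter(Iy * It, win)
    # Closed-form inverse of each 2 x 2 matrix [[Ixx, Ixy], [Ixy, Iyy]],
    # applied to the right-hand side -[Ixt, Iyt].
    det = Ixx * Iyy - Ixy * Ixy
    det[np.abs(det) < 1e-9] = np.inf  # flat regions: flow set to zero
    u = (-Iyy * Ixt + Ixy * Iyt) / det
    v = (Ixy * Ixt - Ixx * Iyt) / det
    return u, v

# Example: a smooth pattern shifted right by one pixel; the recovered
# horizontal flow at the image center should be close to 1.
g = np.linspace(0.0, 6.0, 64)
im1 = np.sin(g)[None, :] * np.cos(g)[:, None]
im2 = np.roll(im1, 1, axis=1)
u, v = lucas_kanade_flow(im1, im2)
```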
In step 5, my implementation follows section 2.3 of the paper
"A Sequential Factorization Method for Recovering Shape and Motion from Image Streams"
(Morita and Kanade, 1997).
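The metric constraints in step 5 can be sketched as a least-squares solve for the symmetric matrix L = Q Q^T, followed by a Cholesky factorization, following the Morita-Kanade formulation; the function name and the synthetic check below are illustrative, not from the actual implementation:

```python
import numpy as np

def metric_upgrade(M, S):
    """Eliminate the affine ambiguity in the factorization W = M S.

    For each frame f with rows i_f, j_f of M, the metric constraints
        i_f^T L i_f = 1,  j_f^T L j_f = 1,  i_f^T L j_f = 0
    (image axes are unit length and perpendicular) are linear in the six
    unknowns of the symmetric 3 x 3 matrix L = Q Q^T. We solve for L by
    least squares, recover Q by Cholesky decomposition (this assumes L
    comes out positive definite), and return the updated factors.
    """
    F = M.shape[0] // 2

    def row(a, b):
        # Coefficients of a^T L b in the unknowns
        # (L11, L12, L13, L22, L23, L33); off-diagonals appear twice.
        return np.array([a[0] * b[0],
                         a[0] * b[1] + a[1] * b[0],
                         a[0] * b[2] + a[2] * b[0],
                         a[1] * b[1],
                         a[1] * b[2] + a[2] * b[1],
                         a[2] * b[2]])

    A, c = [], []
    for f in range(F):
        i_f, j_f = M[2 * f], M[2 * f + 1]
        A += [row(i_f, i_f), row(j_f, j_f), row(i_f, j_f)]
        c += [1.0, 1.0, 0.0]
    l = np.linalg.lstsq(np.array(A), np.array(c), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    Q = np.linalg.cholesky(L)
    return M @ Q, np.linalg.inv(Q) @ S

# Synthetic check: distort metric cameras by a random invertible G,
# then verify the upgrade restores the metric constraints.
rng = np.random.default_rng(1)
M_true = np.vstack([np.linalg.qr(rng.standard_normal((3, 3)))[0][:2]
                    for _ in range(6)])   # 6 orthographic cameras
S_true = rng.standard_normal((3, 20))
G = rng.standard_normal((3, 3))
M2, S2 = metric_upgrade(M_true @ G, np.linalg.inv(G) @ S_true)
print(np.allclose(M2 @ S2, M_true @ S_true))  # reprojection preserved
```

The recovered factors are still only determined up to a global rotation, which is why the cameras below are compared through quantities such as k_f rather than absolute coordinates.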
3. Results and Discussion
3.1 Keypoint Selection
3.2 Feature Tracking
3.3 Structure from Motion
First, we look at only the 3D points from different viewpoints.
We can also look at the meshgrid plot from different viewpoints, as well as the 3D path of the cameras.
We can see that picking keypoints to track is very important: here we have drifting points,
so the building walls look longer than they are. Overall, our algorithm performs very well.
The camera viewing direction for each frame f is given by the cross product
k_f = i_f x j_f, where i_f and j_f are the two rows of the camera matrix for that frame.
For consistent results, we normalize each k_f to be a unit vector.
The following are plots of each dimension of k_f.
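A minimal NumPy sketch of this computation (the function name is illustrative):

```python
import numpy as np

def camera_axes_k(M):
    """Compute the viewing direction k_f = i_f x j_f for each frame from
    the metric camera matrix M (2F x 3), normalized to unit length so the
    per-frame plots are consistent."""
    ks = np.cross(M[0::2], M[1::2])  # pair up rows (i_f, j_f); F x 3
    return ks / np.linalg.norm(ks, axis=1, keepdims=True)

# Example: if i_f is the x-axis and j_f the y-axis, k_f is the z-axis.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(camera_axes_k(M))  # [[0. 0. 1.]]
```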