Structure From Motion

Kyle Cackett

 

Building a 3D model of an object from a sequence of video frames can be broken into three main phases.  First, a reliable set of key-point features must be detected.  Next, the motion of these features must be tracked through the series of frames.  Finally, the matrix of key-point locations tracked across the frames can be decomposed into two matrices representing the motion of the camera and the 3D structure of the object.

Detecting the Key-points

Good features to track are those whose motion can be estimated reliably.  It is impossible to track the motion of areas with no contrast: if we examine a window surrounding such an area in two frames, we see no difference between the frames.  Edges make usable key points when the motion is perpendicular to the edge boundary, but any component of motion along the boundary is impossible to detect (an example of the aperture problem).  Corners make up a reliable set of features because they do not suffer from the aperture problem.  We use a Harris corner detector to find salient image features.  Harris corner detection works by computing Ix and Iy, the x and y derivatives of the image intensity.  These derivatives are used to construct a matrix whose eigenvalues are both large whenever a corner is present.  We then plug the matrix into a corner response function that is large when both eigenvalues are large and of comparable magnitude.  Finally, we examine areas of the image with high corner response and perform non-maximum suppression to distill each area down to a single point.  The 500 points with the highest corner response (which we will track for our structure from motion algorithm) are shown below.

KeypointsOverlay.jpg
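As a concrete sketch, the detection pipeline described above might look like the following in Python.  This is an illustration, not the code used for the project; the smoothing sigma, the constant k = 0.04, and the 5x5 suppression window are assumed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_corners(img, sigma=1.0, k=0.04, n_best=500):
    """Return (row, col) coordinates of the strongest Harris responses."""
    # Image derivatives Ix and Iy.
    Iy, Ix = np.gradient(img.astype(float))
    # Entries of the structure matrix, accumulated over a local window.
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    # Corner response R = det(M) - k * trace(M)^2; large only when both
    # eigenvalues of the structure matrix are large and comparable.
    R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
    # Non-maximum suppression: keep a response only if it is the
    # maximum within its 5x5 neighborhood.
    R = np.where(R == maximum_filter(R, size=5), R, -np.inf)
    # Indices of the n_best strongest surviving responses.
    flat = np.argsort(R, axis=None)[::-1][:n_best]
    rows, cols = np.unravel_index(flat, R.shape)
    return np.stack([rows, cols], axis=1)
```

On a synthetic image of a bright square, the strongest responses land near the square's four corners, while flat regions and straight edges are suppressed.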

Optical Flow

To track features throughout the series of frames we use a Kanade-Lucas-Tomasi tracker.  This feature tracker assumes brightness constancy between frames and that points spatially near a feature have the same motion as the feature itself.  These assumptions lead to an over-constrained linear system which we solve using least squares.  The brightness constancy assumption gives the following equation:

    I(x + u, y + v, t + 1) = I(x, y, t)

Where (x, y) is the location of the feature, t is the time corresponding to the frame in question, and (u, v) is the motion between frames.  We Taylor expand the left-hand side to get:

    I(x + u, y + v, t + 1) ≈ I(x, y, t) + Ix·u + Iy·v + It


This leads to the following under-constrained system (one equation in the two unknowns u and v):

    Ix·u + Iy·v + It = 0

Assuming the pixels p1, ..., pn in a window around the feature undergo the same motion gives one such equation per pixel:

    Ix(p1)·u + Iy(p1)·v = -It(p1)
    ...
    Ix(pn)·u + Iy(pn)·v = -It(pn)

To solve this using least squares, write the system above as Ax = b, where A holds the spatial gradients, x = (u, v), and b holds the negated temporal derivatives.  The solution is given as:

    x = (A^T A)^-1 A^T b

This system is solved for each key point at each time step, and the resulting (u, v) is added to the key point's location to trace its motion throughout the video.
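The per-feature tracking step above can be sketched as follows.  This is an illustrative single-step implementation, not the project code; the window half-width of 7 pixels is an assumed parameter.

```python
import numpy as np

def lk_flow_at(I0, I1, y, x, half=7):
    """Estimate the motion (u, v) of the feature at (y, x) between
    frames I0 and I1 by least squares over a local window."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    # Spatial derivatives Ix, Iy of the first frame and the
    # temporal derivative It between the frames.
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    # Stack one equation Ix*u + Iy*v = -It per pixel in the window,
    # then solve the over-constrained system A x = b by least squares.
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

As a sanity check, shifting a smooth synthetic image by one pixel and running the solver at a well-textured point recovers a motion estimate close to the true shift.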

The following figure shows the set of key points that moved off screen during the sequence of frames:

OutOfFrame.jpg

The next figure shows the motion of 20 points throughout the sequence of frames:

Trajectories.jpg

Structure from Motion

To determine the structure of the object from the motion of the key points we use a series of matrix factorizations and multiplications described by Tomasi and Kanade.  First the key points are centered by subtracting, for each frame, the average of all key points in that frame from each key point.  This gives the registered measurement matrix.  We then use the components of the singular value decomposition of this matrix to get matrices representing the camera motion and the shape of the object.  Finally, we apply the metric transformation to each of these matrices to find the actual movement of the camera and the shape of the object.  The movement of the x, y, and z components of the camera as a function of time are given in the three plots below:

 

CameraXMovement.jpg
CameraYMovement.jpg
CameraZMovement.jpg
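The factorization step described above can be sketched as follows.  This is an affine-only illustration: it registers the measurement matrix and performs the rank-3 SVD truncation, but omits the metric transformation that the report applies afterward to resolve the remaining ambiguity.

```python
import numpy as np

def factor_structure(W):
    """Tomasi-Kanade style factorization of a 2F x P measurement
    matrix W (F frames, P points) into motion (2F x 3) and shape
    (3 x P), recovered up to an affine ambiguity."""
    # Register: subtract each row's mean (the per-frame centroid of the
    # key points) so camera translation drops out of the matrix.
    W_reg = W - W.mean(axis=1, keepdims=True)
    # Under the factorization model W_reg has rank 3; truncate the SVD
    # to its three largest singular values.
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3]
    # Split the singular values evenly between motion and shape.
    M = U3 * np.sqrt(s3)            # camera motion
    S = np.sqrt(s3)[:, None] * Vt3  # object shape
    return M, S
```

On noise-free synthetic data generated from a rank-3 model plus per-frame translation, the product of the recovered factors reproduces the registered measurement matrix exactly (up to floating-point precision).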

The predicted 3D structure of the house is given below.  Note that the corner detector selected some tree stumps in the background as key points, and because these points moved a great deal between frames the optical flow algorithm did not track them correctly (it did not record them moving out of frame).  The first 3D view includes the stump key points.  The next three views show the house with the stump key points removed.  Neither set of 3D points represents the house especially well, but if we discount the stumps the model resembles reality much more closely.

With Stumps (note the points to the far left that stretch the predicted structure out):

Viewpoint1.jpg

Three views without stumps:

Viewpoint1NoStumps.jpg
Viewpoint2NoStumps.jpg
Viewpoint3NoStumps.jpg