Structure From Motion
Kyle Cackett
Building
a 3D model of an object from a sequence of video frames can be broken into
three main phases. First, a reliable set
of key-point features must be detected, then the motion of these features must
be tracked through the series of frames, finally the matrix of key-point locations
tracked throughout the series of frames can be decomposed into two matrices
representing the motion of the camera and the 3D structure of the object.
Good features to track are those whose motion
can be estimated reliably. It is
impossible to track the motion of areas of no contrast because if we examine a
window surrounding the area in two frames we will see no difference between the
frames. It is easy to track image motion
using edges as key points if the motion is perpendicular to the edge boundary,
any component of motion along the edge boundary will be impossible to detect (this
is an example of the aperture problem). Corners
make up a reliable set of features because they do not suffer from the aperture
problem. We use a Harris corner detector
to find salient image features. Harris
corner detectors work by computing Ix and Iy (the x
and y derivatives of the intensity of the image). These derivatives can be used to construct a
matrix whose eigenvalues are large whenever a corner
is present. We then plug the matrix into
a corner response function which is large if both eigenvalues
are large and of comparable magnitude.
Finally, we examine areas of the image with high corner response and
perform non-maximum suppression to distill the area down to a single
point. The 500 points with the highest
corner response (which we will track for our structure from motion algorithm)
are shown below.
To track features throughout the series of frames
we use a Kanade-Lucas-Tomasi
tracker. This feature tracker assumes
brightness constancy between frames and that points spatially near the feature
have the same motion as the feature itself.
These assumptions lead to an over constrained linear system which we
solve using least squares. The
brightness constancy assumption gives the following equation:
Where x,y is the location of the feature,
t is the time corresponding to the frame in question, and u,v
is the motion between frames. We Taylor
expand the left term to get:
Which leads to the
following under constrained system:
Assuming nearby points
undergo the same motion gives the following approximation:
To solve this using
least squares let the terms in the equation above be represented as Ax=b. Then the solution is given as:
This equation is solved
at each time step at each pixel and the u,v
are added to the key points to plot their motion throughout the video.
The following figure
shows the set of key points that moved off screen during the sequence of
frames:
The next figure shows the motion of 20 points throughout the sequence of frames:
To determine the structure of the object from the motion of the key points we use a series of matrix factorizations and multiplications described by Tomasi and Kanade. First the key points are centered by subtracting the average of all key points from a single frame from each key point. This gives a registered measurement matrix. We then use the components of the single value decomposition of this matrix to get matrices representing transformations of the camera movement and shape of the object. Finally, we apply the metric transformation to each of these matrices to find the actual movement ofth ecamera and the shape of the object. The movement of the x, y, and z components of the camera as a function of time are given in the three plots below:
The predicted 3D structure of the house is given below. Note that the corner detector used some stumps in the background as key points and because these points moved a great deal between frames the optical flow algorithm did not correctly track these points (it did not record these points moving out of frame). The first 3D view uses the key points that are stumps. The second 3 views shows the house with the stump key points removed. Neither set of 3D points does a fantastic job representing the house but if we discount the stumps the model resembles reality much more closely.
With Stumps (note the points to the far left that stretch the predicted structure out):
3 views without stumps: