Structure From Motion
Purpose:
Given a series of input images, we attempt to derive the 3D locations of feature points based on their correspondences throughout the image sequence.
Algorithm:
Main Idea
Our pipeline operates as follows (a minimal end-to-end sketch follows the list):
1. Find good feature points to track throughout the image sequence. We use Harris corners.
2. Use optical flow to track the points through the images, discarding any point that leaves the image at any time.
3. Given the tracked points, reconstruct their 3D positions.
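For concreteness, here is a minimal end-to-end sketch of this pipeline using OpenCV stand-ins for each stage. The file names, parameter values, and variable names are illustrative assumptions rather than our actual code, and OpenCV's detector and tracker differ in details from the from-scratch math described below:

    import cv2
    import numpy as np

    # Hypothetical frame files; substitute the real sequence
    frames = [cv2.imread(f"frames/{i:03d}.png", cv2.IMREAD_GRAYSCALE)
              for i in range(50)]

    # 1. The 500 strongest Harris corners in the first frame
    pts = cv2.goodFeaturesToTrack(frames[0], maxCorners=500,
                                  qualityLevel=0.01, minDistance=8,
                                  useHarrisDetector=True, k=0.04)

    # 2. KLT optical-flow tracking from frame to frame
    tracks = [pts.reshape(-1, 2)]
    for prev, cur in zip(frames, frames[1:]):
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None,
                                                  winSize=(15, 15))
        tracks.append(pts.reshape(-1, 2))

    # Discard any point that ever leaves the image
    h, w = frames[0].shape
    inside = np.all([(t[:, 0] >= 0) & (t[:, 0] < w) &
                     (t[:, 1] >= 0) & (t[:, 1] < h) for t in tracks], axis=0)
    tracks = [t[inside] for t in tracks]

    # 3. Reconstruct 3D structure by factorization (sketched in the
    #    Structure From Motion section below)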
Finding Features
The features we choose to track across the frames of the image sequence are the 500 strongest Harris corners after non-maximum suppression. Harris corners are very good features given the equations used to solve the tracking problem. Without going into the details of the math, we can intuitively understand why they are good for this task: they are invariant to translation and rotation, two motions we expect to see frequently. We will also see in the tracking section that, because Harris corners lie in regions of the image with strong orthogonal gradients, they are ideal for our tracking criterion. Here are the features we decided to track:
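As a concrete illustration, here is a minimal from-scratch sketch of such a detector, assuming a grayscale image; the smoothing sigma, Harris k, and suppression radius are illustrative choices, not necessarily the ones we used:

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter

    def harris_corners(img, k=0.04, sigma=1.0, n_keep=500, nms_radius=8):
        # Image gradients (rows vary with y, columns with x)
        Iy, Ix = np.gradient(img.astype(float))
        # Structure-tensor entries, smoothed over a local window
        Sxx = gaussian_filter(Ix * Ix, sigma)
        Syy = gaussian_filter(Iy * Iy, sigma)
        Sxy = gaussian_filter(Ix * Iy, sigma)
        # Harris response: det(M) - k * trace(M)^2
        R = Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
        # Non-maximum suppression: keep only local maxima of the response
        R[R != maximum_filter(R, size=2 * nms_radius + 1)] = -np.inf
        # Return the n_keep strongest corners as (x, y) coordinates
        ys, xs = np.unravel_index(np.argsort(R, axis=None)[::-1][:n_keep],
                                  R.shape)
        return np.stack([xs, ys], axis=1)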
Kanade-Lucas-Tomasi Tracking
To track the features through the image sequence we use optical flow. We operate under the assumption that most observed motion is small. Further, we assume brightness constancy, meaning points keep similar intensity between successive frames. Lastly, we assume that spatially local pixels move uniformly across frames. Taking the first-order Taylor expansion of the brightness constancy equation I(x + u, y + v, t + 1) = I(x, y, t) yields the constraint I_x u + I_y v + I_t = 0, one equation in the two unknowns (u, v). By collecting the spatial and temporal gradients over 15x15 patches we obtain an overconstrained linear system, which we solve by least squares to estimate the optical flow at each feature. At each iteration we then move our tracked features according to the flow (a sketch of this step follows). If we look at the points tracked through the images, we notice that most points do fairly well, except those on the little blocks in the background. This is exactly what we expect, however, because those blocks violate our assumption of small frame-to-frame motion.
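Here is a minimal sketch of one such tracking step, assuming two grayscale frames and an array of (x, y) feature coordinates; the interpolation helper and window size follow the description above, but the function name and structure are our own for illustration:

    import numpy as np
    from scipy.ndimage import map_coordinates

    def lk_step(I0, I1, pts, win=15):
        # Spatial gradients of the first frame; temporal difference to the next
        Iy, Ix = np.gradient(I0.astype(float))
        It = I1.astype(float) - I0.astype(float)
        half = win // 2
        dy, dx = np.mgrid[-half:half + 1, -half:half + 1]
        new_pts = []
        for x, y in pts:
            ys, xs = (y + dy).ravel(), (x + dx).ravel()
            # Sample the gradients over the 15x15 patch (bilinear)
            gx = map_coordinates(Ix, [ys, xs], order=1)
            gy = map_coordinates(Iy, [ys, xs], order=1)
            gt = map_coordinates(It, [ys, xs], order=1)
            # 225 stacked constraints I_x u + I_y v = -I_t, solved by
            # least squares for the flow (u, v) at this feature
            A = np.stack([gx, gy], axis=1)
            uv = np.linalg.lstsq(A, -gt, rcond=None)[0]
            new_pts.append((x + uv[0], y + uv[1]))
        return np.array(new_pts)

Applying lk_step between every consecutive pair of frames, feeding each output back in as the next input, produces the tracks; a coarse-to-fine pyramidal variant would relax the small-motion assumption that the background blocks violate.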
If we examine a few points more closely and visually track their motion, we get the following result:
And just because it looks interesting, here is the plot of all the tracked point paths (a small plotting sketch follows). An interesting thing to note is that most of the paths are smooth, except for the points we lose to overly large motion. Once those points lose contact with their original strong gradients, the local spatial neighborhood provides no clues, and the motion becomes jagged since optical flow is then pretty much just guessing. Although the paths for these points are not very good, it is interesting that there is still somehow enough information to recover the general direction the points headed, albeit at a much smaller scale.
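Such a plot can be produced with matplotlib, assuming the frames list and per-frame tracks arrays from the pipeline sketch above:

    import numpy as np
    import matplotlib.pyplot as plt

    T = np.stack(tracks)                               # F x N x 2
    plt.imshow(frames[0], cmap='gray')
    plt.plot(T[:, :, 0], T[:, :, 1], linewidth=0.5)    # one path per point
    plt.scatter(T[0, :, 0], T[0, :, 1], s=2, c='red')  # starting positions
    plt.title('Tracked point paths')
    plt.show()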
Structure From Motion
For each frame, a camera projection matrix maps the same 3-space coordinates of each feature to its two image-space coordinates in that frame. Using this knowledge we can aggregate all of our measurements into one large matrix and attempt to factor it into a motion matrix (all frames' camera matrices stacked together in an organized fashion) and the actual 3-space coordinates. (Due to the factoring, an additional matrix sits between them.) While noise keeps these matrices from having the correct rank, we can extract the first three columns of each to retain the most important data. Finally, we impose the restriction that each camera's coordinate axes are orthonormal, which removes the affine ambiguity. (This is a very hand-wavy description meant to give a general idea of how the process works; see the papers for details.) A sketch of the factorization follows. Here are the results: the first set of images depicts 3-space plots of the feature points, and the next set shows a reconstructed, texture-mapped mesh along with the camera path and direction.
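Here is a minimal NumPy sketch of that factorization, assuming the measurement matrix W stacks the x rows of all F frames above the y rows; the metric-upgrade step is one standard way to impose the orthonormality constraints, not necessarily our exact formulation:

    import numpy as np

    def factorize(W):
        # Center each row: subtract each frame's mean track coordinate
        W = W - W.mean(axis=1, keepdims=True)
        # Noise makes W full rank, so keep only the three dominant
        # SVD components (the "first three columns" above)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :3] * np.sqrt(s[:3])            # 2F x 3 motion matrix
        S = np.sqrt(s[:3])[:, None] * Vt[:3]     # 3 x P structure matrix
        return M, S

    def metric_upgrade(M, S):
        # Remove the affine ambiguity: find Q such that each frame's two
        # rows of M @ Q are orthonormal, solving linearly for L = Q Q^T
        F = M.shape[0] // 2
        def coeffs(p, q):  # coefficients of the 6 unique entries of L
            return [p[0]*q[0], p[0]*q[1] + p[1]*q[0], p[0]*q[2] + p[2]*q[0],
                    p[1]*q[1], p[1]*q[2] + p[2]*q[1], p[2]*q[2]]
        A, b = [], []
        for i in range(F):
            a, c = M[i], M[F + i]
            A += [coeffs(a, a), coeffs(c, c), coeffs(a, c)]
            b += [1.0, 1.0, 0.0]
        l = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]
        L = np.array([[l[0], l[1], l[2]],
                      [l[1], l[3], l[4]],
                      [l[2], l[4], l[5]]])
        Q = np.linalg.cholesky(L)  # assumes noise leaves L positive definite
        return M @ Q, np.linalg.inv(Q) @ S

With the tracks from the sketches above, W could be assembled as T = np.stack(tracks); W = np.vstack([T[:, :, 0], T[:, :, 1]]).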