The first step in the structure from motion pipeline is keypoint selection. Keypoints are selected using a Harris corner detector, which scores each point based on the magnitude of the eigenvalues of the second moment matrix computed at that point. An example set of keypoints for the first frame of a video sequence is shown below.
I did not spend any time modifying this portion of the pipeline. More information about corner detection can be found in Shi and Tomasi 1994, Good Features to Track.
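For reference, the second moment matrix at a pixel is built from the image gradients over a small window w around it (this is the standard formulation, not anything specific to my implementation), and a point is kept as a corner when both eigenvalues of M are large:

$$
M = \sum_{(x,y)\in w}\begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
$$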
The next step in the structure from motion pipeline is feature tracking. To accomplish this I implemented the Lucas-Kanade optical flow algorithm. The key assumption of this algorithm is brightness constancy: a point in one frame should have the same value after translation in the next frame. This is represented by the equation below.
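In the standard notation, with (u, v) the displacement of the point between frames t and t + 1, the constraint is

$$
I(x, y, t) = I(x + u, y + v, t + 1)
$$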
Here, I(x,y,t) is the value of the image at column x, row y, and timestep t. We can approximate the right side of this equation using a truncated Taylor expansion.
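Expanding the right-hand side to first order gives

$$
I(x + u, y + v, t + 1) \approx I(x, y, t) + I_x u + I_y v + I_t
$$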
Here, Ix is the image gradient in the x-direction, Iy is the image gradient in the y-direction, and It is the temporal gradient. Substituting this into the first equation gives the following constraint relating u and v.
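In the standard form this is

$$
I_x u + I_y v + I_t = 0
$$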
If we assume that the pixels in the same neighborhood as the pixel at (x, y) have the same displacement, then we have a system of n equations in 2 unknowns.
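Stacking the constraint from each of the n pixels $p_1, \dots, p_n$ in the window gives

$$
\begin{bmatrix} I_x(p_1) & I_y(p_1) \\ \vdots & \vdots \\ I_x(p_n) & I_y(p_n) \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
=
-\begin{bmatrix} I_t(p_1) \\ \vdots \\ I_t(p_n) \end{bmatrix}
$$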
We can also introduce weights for the neighboring pixels, so that the closest pixels contribute most to the calculation of u and v. Then we can solve this equation using least squares. The final form of the equation is shown below.
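Writing the stacked gradients as A, the stacked (negated) temporal gradients as b, and the window weights as a diagonal matrix W, the weighted least-squares solution is

$$
\begin{bmatrix} u \\ v \end{bmatrix}
= (A^\top W A)^{-1} A^\top W\, b
= \begin{bmatrix} \sum_i w_i I_x^2 & \sum_i w_i I_x I_y \\ \sum_i w_i I_x I_y & \sum_i w_i I_y^2 \end{bmatrix}^{-1}
\begin{bmatrix} -\sum_i w_i I_x I_t \\ -\sum_i w_i I_y I_t \end{bmatrix}
$$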
My implementation of this equation uses a 15x15 pixel window and a Gaussian weighting function with a standard deviation of 3. Once you have calculated the matrix on the left-hand side, you can solve for the optical flow at every pixel using simple matrix inversion and multiplication. To track a point through this flow, you then look up the displacement vector at that point and translate the point by that amount.
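As a rough sketch of how the dense flow computation can be implemented (assuming NumPy and SciPy; the function names and details here are illustrative rather than my exact code, and the Gaussian filtering stands in for the weighted 15x15 window sums):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lucas_kanade_flow(frame0, frame1, sigma=3.0):
    """Dense single-step Lucas-Kanade flow (illustrative sketch).

    frame0, frame1: grayscale float images of the same shape.
    Returns the per-pixel flow components (u, v).
    """
    # Spatial gradients of the first frame and the temporal gradient.
    Iy, Ix = np.gradient(frame0)
    It = frame1 - frame0

    # Gaussian-weighted sums of the gradient products over each pixel's
    # neighborhood (this stands in for the weighted window sums).
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixt = gaussian_filter(Ix * It, sigma)
    Iyt = gaussian_filter(Iy * It, sigma)

    # Closed-form solution of the 2x2 system at every pixel.
    det = Ixx * Iyy - Ixy * Ixy
    det = np.where(np.abs(det) < 1e-9, np.inf, det)  # guard against flat regions
    u = (-Iyy * Ixt + Ixy * Iyt) / det
    v = ( Ixy * Ixt - Ixx * Iyt) / det
    return u, v
```

To track a keypoint, you would then sample u and v at its current position (for example with bilinear interpolation) and add that displacement to the position for the next frame.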
To see the effectiveness of this approach, below is a random selection of 20 keypoints that have been tracked through 51 frames. I have overlaid 3 of those frames with the paths that the keypoints travel along.
In addition, below are all of the keypoints that left the scene at some point during the sequence.
While the basic feature tracking algorithm works well for most of the good features in the test video, there are some cases in which it fails. Below is the same path plot as above, but instead of showing 20 random features, it displays the background features. It's clear that the tracking of these features fails: in most cases the features should end up off the screen, but instead they get stuck in low-texture regions.
The first step in improving the overall quality of the tracking is to iterate on the above algorithm. With an iterative approach, you solve for the optical flow at each pixel in the image, warp the image towards the next image in the sequence, and repeat. The final flow is the combination of the flows computed at each iteration. Below is a sketch of the iterative loop, followed by the tracking results for 10 iterations of the Lucas-Kanade optical flow algorithm.
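Here is a rough sketch of that loop, reusing the lucas_kanade_flow function from the previous sketch (the warp uses SciPy's map_coordinates; the details are illustrative rather than my exact code):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def iterative_flow(frame0, frame1, num_iters=10, sigma=3.0):
    """Iterative Lucas-Kanade: estimate, warp, and refine (illustrative sketch)."""
    h, w = frame0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    u_total = np.zeros((h, w))
    v_total = np.zeros((h, w))

    for _ in range(num_iters):
        # Warp the first frame toward the second by the flow accumulated so far.
        warped = map_coordinates(frame0, [ys - v_total, xs - u_total],
                                 order=1, mode='nearest')
        # Estimate the residual flow (lucas_kanade_flow is from the sketch above).
        du, dv = lucas_kanade_flow(warped, frame1, sigma)
        u_total += du
        v_total += dv
    return u_total, v_total
```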
And below are the features that went out of bounds using iterative Lucas-Kanade.
Now it's easy to see that these features are correctly tracked and all end up off the screen. Another issue that arises is tracking features with large displacements. Below are the results of tracking the background features again, but this time using only every fourth frame. Even with iterative tracking, the algorithm fails to track these features over large jumps.
To remedy this problem, I implemented a coarse-to-fine optical flow algorithm. I first construct an image pyramid for two adjacent frames. I then use the iterative Lucas-Kanade algorithm to solve for the optical flow at the coarsest level of the pyramid. I interpolate these values up to the next level of the pyramid and use them as the starting point for another run of the iterative algorithm. This continues until I have reached the finest resolution of the pyramid, which enables me to track features with larger displacements. A sketch of the coarse-to-fine loop follows, and the tracking results are shown below.
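A rough sketch of the coarse-to-fine loop (the pyramid here is built by blurring and subsampling, and nearest-neighbor upsampling stands in for the interpolation step; details are illustrative rather than my exact code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def pyramid_flow(frame0, frame1, num_levels=3, num_iters=10, sigma=3.0):
    """Coarse-to-fine iterative Lucas-Kanade (illustrative sketch)."""
    # Build Gaussian pyramids; the last entry is the coarsest level.
    pyr0, pyr1 = [frame0], [frame1]
    for _ in range(num_levels - 1):
        pyr0.append(gaussian_filter(pyr0[-1], 1.0)[::2, ::2])
        pyr1.append(gaussian_filter(pyr1[-1], 1.0)[::2, ::2])

    u = np.zeros(pyr0[-1].shape)
    v = np.zeros(pyr0[-1].shape)
    for f0, f1 in zip(reversed(pyr0), reversed(pyr1)):
        if u.shape != f0.shape:
            # Upsample the flow from the coarser level and rescale it.
            u = 2.0 * np.kron(u, np.ones((2, 2)))[:f0.shape[0], :f0.shape[1]]
            v = 2.0 * np.kron(v, np.ones((2, 2)))[:f0.shape[0], :f0.shape[1]]
        # Warp by the current estimate, then refine with iterative Lucas-Kanade
        # (iterative_flow is from the sketch above).
        h, w = f0.shape
        ys, xs = np.mgrid[0:h, 0:w].astype(float)
        warped = map_coordinates(f0, [ys - v, xs - u], order=1, mode='nearest')
        du, dv = iterative_flow(warped, f1, num_iters, sigma)
        u, v = u + du, v + dv
    return u, v
```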
Once we have tracked features through the entire video sequence, we can try to determine the structure of the scene. For this purpose we use the techniques described in Tomasi and Kanade 1992, Shape and Motion from Image Streams under Orthography: a Factorization Method. To help deal with the inherent ambiguity in the reconstruction process, we also implement some of the algorithms described in Morita and Kanade 1997, A Sequential Factorization Method for Recovering Shape and Motion from Image Streams. A sketch of the core factorization step follows, and the resulting reconstructions are shown below.
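The core of the factorization step can be sketched as follows (this shows only the rank-3 factorization of the registered measurement matrix; the metric upgrade that enforces the orthographic camera constraints, and the sequential variant from Morita and Kanade, are omitted; names are illustrative):

```python
import numpy as np

def factor_shape_and_motion(tracks):
    """Rank-3 factorization of the tracked points (illustrative sketch).

    tracks: (2F x P) measurement matrix, the x image coordinates of the P
            tracked points in all F frames stacked on top of the y coordinates.
    Returns the motion matrix (2F x 3) and the shape matrix (3 x P).
    """
    # Register the measurements: subtract each row's mean, i.e. the per-frame
    # centroid of the tracked points.
    W = tracks - tracks.mean(axis=1, keepdims=True)

    # Under orthography the registered matrix has rank 3, so take the best
    # rank-3 approximation via the SVD and split it between motion and shape.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])            # camera motion, up to an affine ambiguity
    S = np.sqrt(s[:3])[:, None] * Vt[:3, :]  # 3D shape, up to the same ambiguity
    return M, S
```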
Side View, 3/4 View, Front View, and Top View of the recovered 3D feature locations.
In addition to recovering the 3D locations of the features, we also recover the relative viewpoint of the camera. Below are three plots of the relative x, y, and z components of the look vector.