Structure From Motion - Kayle Gishen

Overview

Structure from motion is the process of using different viewpoints of the same object to infer its 3D structure. The general procedure is to determine key points in a sequence of observations, in this case video, use optical flow to track the features and finally decompose their 2D motion into a proposed 3D structure.

Algorithm

We begin by running a Harris corner detector on the first frame of the video sequence. From the returned list of features, we keep the 500 strongest.
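The selection step can be sketched as follows. This is a minimal illustration in Python rather than the Matlab used for the project; the function names, the constant k = 0.04, and the "score > 0" filter are assumptions made for the example, and no non-maximum suppression is applied.

```python
def harris_response(sxx, syy, sxy, k=0.04):
    """Harris score from windowed sums of Ix*Ix, Iy*Iy and Ix*Iy.

    k = 0.04 is a commonly used value, assumed here for illustration.
    """
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace


def strongest_corners(response, n=500):
    """Return up to n (row, col, score) tuples, strongest first.

    `response` is a list of rows of Harris scores (hypothetical layout).
    """
    scored = [
        (score, r, c)
        for r, row in enumerate(response)
        for c, score in enumerate(row)
        if score > 0.0  # keep only corner-like responses
    ]
    scored.sort(reverse=True)
    return [(r, c, score) for score, r, c in scored[:n]]
```

A window with strong gradients in both directions scores positive, while an edge-like window (gradient in one direction only) scores negative and is filtered out before the top n are taken.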

The next step follows the Kanade-Lucas-Tomasi (KLT) tracker: we compute the optical flow between consecutive frames, then use the resulting flow to interpolate the movement of the tracked features.

The optical flow is computed using image convolutions with a 15x15 matrix of ones, which sums values over each 15x15 window. Convolutions are used because Matlab is optimized for them rather than for explicit loops, and the performance difference is considerable. The inversion step of the optical flow is done in a loop over every pixel of the frame, as it is reasonably fast and simple to understand.
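A rough sketch of this two-stage layout is shown below, in pure Python rather than Matlab. It assumes the derivative images ix, iy and it have already been computed, and that the flow at each pixel comes from the usual 2x2 Lucas-Kanade system; the names box_sum and lucas_kanade, the zero padding at the borders, and the eps threshold are illustrative choices, not details from the original code.

```python
def box_sum(img, half):
    """Sum img over a (2*half+1) x (2*half+1) window, pixels outside count as zero.

    Equivalent to a same-size convolution with a matrix of ones.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for rr in range(max(0, r - half), min(h, r + half + 1)):
                for cc in range(max(0, c - half), min(w, c + half + 1)):
                    total += img[rr][cc]
            out[r][c] = total
    return out


def lucas_kanade(ix, iy, it, half=7, eps=1e-9):
    """Per-pixel flow (u, v) from derivative images given as lists of rows.

    half=7 gives the 15x15 window mentioned in the text.
    """
    h, w = len(ix), len(ix[0])

    def times(a, b):
        return [[a[r][c] * b[r][c] for c in range(w)] for r in range(h)]

    # Stage 1: windowed sums of the gradient products (the "convolutions").
    sxx = box_sum(times(ix, ix), half)
    syy = box_sum(times(iy, iy), half)
    sxy = box_sum(times(ix, iy), half)
    sxt = box_sum(times(ix, it), half)
    syt = box_sum(times(iy, it), half)

    # Stage 2: invert the 2x2 system [sxx sxy; sxy syy] [u v]' = -[sxt syt]'.
    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            a, b, d = sxx[r][c], sxy[r][c], syy[r][c]
            det = a * d - b * b
            if abs(det) < eps:
                continue  # untextured window: leave the flow at zero
            u[r][c] = (-d * sxt[r][c] + b * syt[r][c]) / det
            v[r][c] = (b * sxt[r][c] - a * syt[r][c]) / det
    return u, v
```

The nested loops in box_sum are exactly the part that a vectorized convolution replaces in Matlab; the per-pixel inversion is cheap by comparison because it is a closed-form 2x2 solve.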

During the interpolation phase, features that move out of frame are removed: once a point is no longer in the image, the flow cannot be used to interpolate its track into the next frame.
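One way to picture the interpolation and removal step is the sketch below. It assumes bilinear sampling of the flow field at sub-pixel feature positions, which the original text does not specify; the function names and the (x, y) point layout are hypothetical, and input points are assumed to lie inside the frame.

```python
import math


def bilinear(field, x, y):
    """Sample a 2D list at real-valued (x, y); x is the column, y is the row."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(field[0]) - 1)
    y1 = min(y0 + 1, len(field) - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * field[y0][x0] + fx * field[y0][x1]
    bottom = (1 - fx) * field[y1][x0] + fx * field[y1][x1]
    return (1 - fy) * top + fy * bottom


def advance_tracks(points, u, v):
    """Move each (x, y) by the interpolated flow.

    Returns (new_points, kept_indices); points that leave the frame are dropped,
    and kept_indices records which input tracks survived.
    """
    h, w = len(u), len(u[0])
    new_points, kept = [], []
    for i, (x, y) in enumerate(points):
        nx = x + bilinear(u, x, y)
        ny = y + bilinear(v, x, y)
        if 0 <= nx <= w - 1 and 0 <= ny <= h - 1:
            new_points.append((nx, ny))
            kept.append(i)
    return new_points, kept
```

Returning the surviving indices makes it possible to discard the earlier positions of a dropped feature as well, so that every remaining track spans all frames.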

Finally, using singular value decomposition and Cholesky decomposition, we remove the affine ambiguity from the tracked points and recover a believable structure from the video sequence. The method is from "A Sequential Factorization Method for Recovering Shape and Motion from Image Streams", Morita and Kanade, 1997.
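For reference, the role of the two decompositions can be written out as below. This is the standard batch form of the factorization under an orthographic camera, given as a sketch; the cited paper describes a sequential variant, and the symbols (F frames, P points, W, M, S, A, L) are notation chosen here rather than taken from the original write-up.

```latex
% W: 2F x P matrix of tracked image coordinates, with each row's mean subtracted.
% Rank-3 truncation of the SVD gives motion and shape up to an affine transform A.
\begin{align*}
W &= U \Sigma V^{\top} \approx U_3 \Sigma_3 V_3^{\top}, \\
\hat{M} &= U_3 \Sigma_3^{1/2}, \qquad \hat{S} = \Sigma_3^{1/2} V_3^{\top}, \\
W &\approx \hat{M}\hat{S} = (\hat{M} A)(A^{-1} \hat{S})
  \quad \text{for any invertible } 3 \times 3 \text{ matrix } A.
\end{align*}

% Metric constraints: the two camera axes in each frame f are orthonormal.
% \hat{i}_f^{\top} and \hat{j}_f^{\top} are the x and y rows of \hat{M} for frame f.
\begin{align*}
\hat{i}_f^{\top} L\, \hat{i}_f = 1, \qquad
\hat{j}_f^{\top} L\, \hat{j}_f = 1, \qquad
\hat{i}_f^{\top} L\, \hat{j}_f = 0, \qquad L = A A^{\top}.
\end{align*}

% The constraints are linear in the 6 unknowns of the symmetric matrix L.
% If the solved L is positive definite, Cholesky factorization yields A.
\begin{align*}
L = A A^{\top} \;\;(\text{Cholesky}), \qquad
M = \hat{M} A, \qquad S = A^{-1} \hat{S}.
\end{align*}
```

The SVD supplies the rank-3 split, and the Cholesky factor A is what removes the affine ambiguity; the result is still defined only up to a global rotation.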

Results

Predicted 3D Points

Predicted Camera Movement

X Movement

Y Movement

Z Movement

Reconstructed Mesh w/ Texture

Conclusions

The combination of a Harris corner detector and a KLT tracker provides well-defined data for structure from motion in the simple case used here. We assumed an orthographic camera, which simplified the algorithm by avoiding perspective transforms. The final mesh contains points that should not be there; because there was no interactive way to remove these points at the start, the mesh reconstruction assumes they are part of the object. The overall structure can be seen, but some of the points appear to be misaligned. This could be reduced by using iterative interpolation to better 'guess' the movement of the tracked features.

One limitation of structure from motion is that it cannot recover detailed structure of an object when that structure is present only as depth and shading cues in the image frames.