CSCI 1430 Project 5

Structure From Motion using the Kanade-Lucas-Tomasi Method
Reese Kuppig (rkuppig)

The objective of this project was to extract the shape of objects in a scene from a series of still images (essentially the frames of a video). Harris feature keypoints are tracked from frame to frame via optical flow, the resulting tracks are processed by the Kanade-Lucas-Tomasi method to recover a hypothesized camera motion, and finally a texture-mapped 3D representation of the scene is generated.



Algorithm Pipeline:

The structure-from-motion algorithm implemented for this project breaks down into several key steps:

1. Detect a set of the strongest Harris features in the first frame.
2. For each pair of consecutive frames, compute the spatial (x, y) and temporal (t) image gradients and estimate the optical flow.
3. Use the calculated optical flow for each frame to predict and track the motion of each keypoint, removing points that wander out of frame.
4. Apply the Tomasi-Kanade factorization to the keypoint tracks accumulated over the video frames to recover shape and camera motion.
5. Remove the affine ambiguity from the predicted 3D keypoint locations using the metric constraints described by Tomasi and Kanade.
6. Render a textured 3D mesh from the hypothesized 3D keypoint locations.


Harris Feature Keypoints:

Harris analysis of image features yields a set of keypoints centered on distinct features in the given image, features that should be easy to recognize and track from frame to frame. Generally speaking, the detected features are patches of distinct texture or, more commonly, distinct corners in the image, which is why this process is also termed Harris corner detection. Areas of uniform intensity cannot be tracked because no direction of motion can be determined, and edges suffer from the "aperture problem" illustrated by the barber-pole illusion. Corners are optimal features to track, since they possess strong gradients in both possible directions of motion, so 2D motion can be extracted reliably.
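
The detection step reduces to computing the Harris response from image gradients and keeping the strongest local maxima. Below is a minimal sketch in NumPy/SciPy; the function name, the filter choices, and the parameter values (k = 0.04, a 7-pixel non-maximum-suppression window) are illustrative assumptions, not necessarily the project's exact settings.

    import numpy as np
    from scipy import ndimage

    def harris_keypoints(image, n_points=500, sigma=1.0, k=0.04):
        """Return (row, col) coordinates of the n_points strongest Harris corners."""
        image = image.astype(float)
        Ix = ndimage.sobel(image, axis=1, mode='reflect')   # horizontal gradient
        Iy = ndimage.sobel(image, axis=0, mode='reflect')   # vertical gradient

        # Entries of the second-moment matrix M, smoothed over a Gaussian window.
        Ixx = ndimage.gaussian_filter(Ix * Ix, sigma)
        Iyy = ndimage.gaussian_filter(Iy * Iy, sigma)
        Ixy = ndimage.gaussian_filter(Ix * Iy, sigma)

        # Harris corner response: det(M) - k * trace(M)^2.
        R = (Ixx * Iyy - Ixy ** 2) - k * (Ixx + Iyy) ** 2

        # Non-maximum suppression, so detections do not cluster on one corner.
        R = np.where(R == ndimage.maximum_filter(R, size=7), R, -np.inf)

        # Keep the n_points strongest responses.
        idx = np.argsort(R, axis=None)[::-1][:n_points]
        return np.stack(np.unravel_index(idx, R.shape), axis=1)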

Features (500 in total) detected in the first frame of the sequence:


Feature Tracking Using the KLT Method:    

The KLT method makes several key assumptions that are critical to the success of the algorithm:

1. Brightness constancy:  the projection of the same point looks the same in every frame
2. Small motion:  points do not move very far between frames
3. Spatial coherence: points move like their neighbors

The brightness constancy constraint ensures that the same patch or feature can be detected in both frames; it would be exceedingly difficult to track features that substantially change appearance between frames. The small motion constraint ensures that features remain close enough between frames to be corresponded and tracked consistently, and it justifies linearizing brightness constancy into a single equation per pixel, Ix*u + Iy*v + It = 0, where Ix, Iy, and It are the image gradients. One equation in two unknowns is underconstrained, which is where spatial coherence helps: since the pixel in question and all of its neighbors are assumed to move together, each neighbor contributes an additional equation, and the resulting overconstrained system is solved by least squares for the (u, v) optical flow. The optical flow is computed between each pair of consecutive frames, and the predicted positions of the Harris keypoints are updated at each frame, as sketched below.
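
Here is a minimal sketch of that per-window least-squares solve in NumPy/SciPy; the 15-pixel window, the Sobel gradients, and the plain frame difference used for It are assumptions of this sketch rather than the project's exact choices.

    import numpy as np
    from scipy import ndimage

    def lucas_kanade_flow(frame0, frame1, points, window=15):
        """Estimate the (u, v) flow of each (row, col) point between two frames.

        Each pixel in the window around a point contributes one linearized
        brightness-constancy equation Ix*u + Iy*v = -It; the stacked system is
        solved by least squares. Assumes each point lies at least window // 2
        pixels from the image border.
        """
        frame0 = frame0.astype(float)
        frame1 = frame1.astype(float)
        Ix = ndimage.sobel(frame0, axis=1, mode='reflect')  # horizontal gradient
        Iy = ndimage.sobel(frame0, axis=0, mode='reflect')  # vertical gradient
        It = frame1 - frame0                                # temporal gradient

        half = window // 2
        flows = np.zeros((len(points), 2))
        for i, (r, c) in enumerate(np.round(points).astype(int)):
            win = (slice(r - half, r + half + 1), slice(c - half, c + half + 1))
            A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
            b = -It[win].ravel()
            flows[i] = np.linalg.lstsq(A, b, rcond=None)[0]
        return flows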

Finally, to simplify calculation we assume an orthographic camera, so the recovered shape and motion are determined only up to an affine transformation. The metric constraints described by Tomasi and Kanade eliminate this ambiguity, allowing the 3D structure to be plotted in the next step; a sketch of the full factorization follows.
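
To make the factorization and the metric upgrade concrete, here is a compact NumPy sketch of the standard Tomasi-Kanade formulation; the 2F x P measurement-matrix layout, the least-squares solve for L = QQ^T, and the Cholesky recovery of Q are assumptions of this sketch, not necessarily the exact project code.

    import numpy as np

    def factorize(W):
        """Tomasi-Kanade factorization of a 2F x P measurement matrix W.

        Rows 0..F-1 hold the x coordinates of P tracked points over F frames,
        rows F..2F-1 the y coordinates. Returns motion M (2F x 3) and shape
        S (3 x P) with the affine ambiguity removed by metric constraints.
        """
        F = W.shape[0] // 2
        # Center each row: removes the per-frame translation under orthography.
        W = W - W.mean(axis=1, keepdims=True)

        # Rank-3 approximation via SVD.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :3] * np.sqrt(s[:3])         # affine motion
        S = np.sqrt(s[:3])[:, None] * Vt[:3]  # affine shape

        # Metric constraints: for each frame i, with rows a = M[i], b = M[F+i],
        # require a L a^T = b L b^T = 1 and a L b^T = 0, where L = QQ^T.
        def quad(x, y):  # one equation in the 6 unknowns of symmetric L
            return [x[0]*y[0], x[0]*y[1] + x[1]*y[0], x[0]*y[2] + x[2]*y[0],
                    x[1]*y[1], x[1]*y[2] + x[2]*y[1], x[2]*y[2]]
        rows, rhs = [], []
        for i in range(F):
            a, b = M[i], M[F + i]
            rows += [quad(a, a), quad(b, b), quad(a, b)]
            rhs += [1.0, 1.0, 0.0]
        l = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
        L = np.array([[l[0], l[1], l[2]],
                      [l[1], l[3], l[4]],
                      [l[2], l[4], l[5]]])

        # Recover Q from L = QQ^T (assumes L is positive definite).
        Q = np.linalg.cholesky(L)
        return M @ Q, np.linalg.inv(Q) @ S

The rows of the recovered M are the per-frame camera x and y axes, so the camera viewing direction in each frame is their cross product; this is what the camera-direction plots later in this report are built from.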

Predicted locations of features in last frame of the sequence:


It is possible to make use of incomplete tracks when a feature moves out of frame or is occluded by another object, but this project did not implement that extension. Instead, features that moved out of frame were removed from the feature set entirely; the bounds test below sketches this pruning.
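
A minimal sketch of the pruning test, assuming tracked positions are stored as (row, col) coordinates in a NumPy array (the function name is hypothetical):

    import numpy as np

    def prune_tracks(points, image_shape):
        """Keep only the tracked points that are still inside the image bounds."""
        r, c = points[:, 0], points[:, 1]
        inside = (r >= 0) & (r < image_shape[0]) & (c >= 0) & (c < image_shape[1])
        return points[inside], inside  # survivors, plus a mask for bookkeeping

The returned mask can also be applied to the positions stored for earlier frames, so every surviving track spans the full sequence, which the factorization step requires. The starting positions of the removed features are highlighted in the image below: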


The tracked motion of 20 randomly selected features, from the first frame (positions represented by circles):

to the last frame (ending positions marked by X's):


Structure from Motion Results:

With around 500 keypoints, structure can be reasonably extracted from the scene motion. The right angle of the building corner is accurately represented, as is some of the window and roof detail. The red lines represent camera direction vectors, tracing the camera path through the frames. The dark gray areas have unusual shape because a few stray keypoints fell in those regions, where the lack of texture prevented their motion from being recovered accurately.


Camera direction decomposed into x,y,z components over the image sequence: