CS 143: Tracking and Structure from Motion

Evan Wallace

Algorithm

This project extracts a 3D model from a video sequence. It tracks feature points throughout the video and uses them to construct a 3D point cloud along with the camera's coordinate frame at each video frame.


The video sequence used

Keypoint Selection

Features are discovered using a Harris corner detector. Corners are good features to track because they can be localized to a single point. Motion along an edge is ambiguous because only the component of motion perpendicular to the edge can be observed (the aperture problem).
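
A minimal sketch of the corner detection step is shown below. It assumes a grayscale image im in double format; the smoothing window, the sensitivity constant k, and the response threshold are illustrative choices rather than the values used in the project.

% Harris corner detector sketch (assumes `im` is a grayscale double image)
k = 0.04;                              % Harris sensitivity constant
g = fspecial('gaussian', 9, 1);        % window for averaging the gradient products

% Image gradients and the entries of the structure tensor
[Ix, Iy] = gradient(im);
Ixx = imfilter(Ix .* Ix, g);
Iyy = imfilter(Iy .* Iy, g);
Ixy = imfilter(Ix .* Iy, g);

% Corner response: det(M) - k * trace(M)^2
R = Ixx .* Iyy - Ixy .^ 2 - k * (Ixx + Iyy) .^ 2;

% Keep local maxima with a sufficiently large response
corners = (R == ordfilt2(R, 9, ones(3))) & (R > 0.01 * max(R(:)));
[ys, xs] = find(corners);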

Feature Tracking

I implemented a Kanade-Lucas-Tomasi (KLT) tracker, which computes the optical flow between consecutive frames and then carries the features from the initial frame along that flow through the video sequence. The details of KLT tracking can be found on the project page. This was very fast but suffered from drift: errors accumulated over time, and by the end of the sequence the points were far from where they should have been.
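
The core of each tracking update is a small linear system. The sketch below shows one Lucas-Kanade step for a single feature; im1 and im2 stand for consecutive grayscale frames, (x, y) for the feature's current position, and w for the half-width of the tracking window, all placeholder names rather than the project's own.

% One Lucas-Kanade flow step for a single feature (names are placeholders)
w = 7;                                 % tracking window half-width
[Ix, Iy] = gradient(im1);              % spatial gradients of the first frame
It = im2 - im1;                        % temporal gradient

rows = round(y) + (-w:w);
cols = round(x) + (-w:w);
ix = Ix(rows, cols);  iy = Iy(rows, cols);  it = It(rows, cols);

% Solve the 2x2 system built from the window: A * [u; v] = b
A = [sum(ix(:) .^ 2),      sum(ix(:) .* iy(:));
     sum(ix(:) .* iy(:)),  sum(iy(:) .^ 2)];
b = -[sum(ix(:) .* it(:)); sum(iy(:) .* it(:))];
uv = A \ b;                            % the feature moves by uv(1) in x, uv(2) in y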


The optical flow visualized as (r, g, b) = (u, v, 0) + 0.5


The tracked feature points throughout the sequence

Structure from Motion

The last step is to use the feature points, whose motion we now know throughout the video sequence, to calculate the coordinate frames of the camera and the 3D locations of the points. I implemented two structure from motion methods.

The first method I implemented was the factorization method from Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade 1992). It assumes an orthographic camera, where all view rays are parallel and there are no perspective effects. This means there is no camera position, only a camera direction. The details can be found on the project page.
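
The heart of the factorization method is a rank-3 SVD of the measurement matrix. The sketch below assumes the tracked x and y coordinates are collected in F × N matrices xs and ys (placeholder names); the final metric upgrade, which enforces orthonormal camera axes, is omitted.

% Rank-3 factorization sketch (xs and ys are F x N tracked coordinates)
% Center each row so the world origin sits at the centroid of the points
W = [xs - mean(xs, 2) * ones(1, size(xs, 2));
     ys - mean(ys, 2) * ones(1, size(ys, 2))];

[U, S, V] = svd(W);

% Keep only the three largest singular values: W ~ M * X
M = U(:, 1:3) * sqrt(S(1:3, 1:3));     % 2F x 3 camera axes (motion)
X = sqrt(S(1:3, 1:3)) * V(:, 1:3)';    % 3 x N point cloud (shape, up to an affine ambiguity)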

The second method I implemented was the 8-point algorithm. This models a perspective camera, which follows a 3D path through the scene. However, it is much more susceptible to noise, which makes working with real-world data difficult. Because of this, I first tested it on a synthetic video sequence generated from different views of a static point set (a cube with a few extra points to disambiguate its orientation).


The epipolar lines in both images for the test data

The MATLAB code below takes two N × 2 matrices f1 and f2 of corresponding image points and computes the fundamental matrix F, the epipoles e1 and e2, the baseline b, and the rotation matrix R. The fundamental matrix is a 3 × 3 matrix that encodes the epipolar geometry between the two cameras: it maps a point in one image to the epipolar line in the other image on which its correspondence must lie. The epipoles are the locations of the other camera in each image (marked with an asterisk in the image above). The baseline is the direction of camera 2 in camera 1's coordinate frame (only the direction of the translation can be recovered, not its magnitude). The rotation matrix R defines camera 2's coordinate frame relative to camera 1's coordinate frame.

% Build the constraint matrix A; each row encodes the epipolar constraint
% [x1 y1 1] * F * [x2 y2 1]' = 0 for one correspondence (x1, y1) in f1 and (x2, y2) in f2
n = size(f1, 1);
A = [
    f1(:, 1)' .* f2(:, 1)'
    f1(:, 1)' .* f2(:, 2)'
    f1(:, 1)'
    f1(:, 2)' .* f2(:, 1)'
    f1(:, 2)' .* f2(:, 2)'
    f1(:, 2)'
    f2(:, 1)'
    f2(:, 2)'
    ones(1, n)
]';

% Solve A * f = 0: f is the right singular vector with the smallest singular value
[~, ~, V] = svd(A);
f = V(:, end);
F = reshape(f, [3 3])';

% Enforce the rank-2 constraint det(F) = 0 by zeroing the smallest singular value
[U, S, V] = svd(F);
S(3, 3) = 0;
F = U * S * V';

% Get the coordinates of the epipoles (the null vectors of F, converted
% from homogeneous coordinates)
[U, ~, V] = svd(F);
e1 = U(1:2, 3) / U(3, 3);
e2 = V(1:2, 3) / V(3, 3);

% The baseline (translation direction) comes from the epipole
b = [e1; 1];
b = b / norm(b);

% R is the rotation from frame 1 to frame 2
[U, ~, V] = svd(F);
W = [0 -1 0; 1 0 0; 0 0 1];
R = -U * W * V';
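
As a quick sanity check (not part of the original code), every correspondence should approximately satisfy the epipolar constraint under the convention used above:

% Each residual [x1 y1 1] * F * [x2 y2 1]' should be close to zero
f1h = [f1, ones(size(f1, 1), 1)];
f2h = [f2, ones(size(f2, 1), 1)];
residuals = sum((f1h * F) .* f2h, 2);
disp(max(abs(residuals)));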

The translation vector b and the rotation matrix R define the coordinate system of camera 2 relative to camera 1. This is all the information we need to triangulate the 3D position of each point. The 3D location is the intersection of the rays through that point in the imaging planes of both cameras. If the rays don't intersect exactly, the point closest to both rays (the midpoint of the shortest segment between them) is taken.
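
A sketch of that triangulation for one correspondence is below. It assumes camera 1 sits at the origin, p1 and p2 are 2 × 1 image points, and the ray of camera 2 is rotated into camera 1's frame by R; whether that rotation should be R or its transpose depends on the convention chosen for R, so treat this as illustrative rather than the project's exact code.

% Closest-point triangulation of one correspondence (illustrative only)
o1 = [0; 0; 0];                d1 = [p1; 1] / norm([p1; 1]);
o2 = b;                        d2 = R * [p2; 1];  d2 = d2 / norm(d2);

% Parameters of the closest points on the two rays o + t * d
w0 = o1 - o2;
a = d1' * d1;   c = d2' * d2;   m = d1' * d2;
d = d1' * w0;   e = d2' * w0;
t1 = (m * e - c * d) / (a * c - m ^ 2);
t2 = (a * e - m * d) / (a * c - m ^ 2);

% The 3D point is the midpoint of the shortest segment between the rays
X = ((o1 + t1 * d1) + (o2 + t2 * d2)) / 2;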

Results


The reconstructed 3D mesh (requires WebGL to rotate)

The dark part hanging off the house is caused by points on the background that drifted during tracking. I tried a bigger search window and iterative KLT, but they didn't improve the results.

Perspective camera reconstruction is unfortunately very sensitive to noise. I was unable to track feature points through real video sequences precisely enough, so I couldn't get the perspective reconstruction to produce satisfactory results. It worked well on the test data, however, which is shown below. The lines show the rays traveling out from the imaging plane of each camera; where they intersect is the 3D location of one of the feature points.


The reconstructed test data with camera rays (requires WebGL to rotate)