Project 5: Tracking and Structure from Motion

Brian Thomas

Baseline Version

The baseline version finds keypoints with a Harris corner detector and follows them with a KLT (Kanade-Lucas-Tomasi) tracker. Structure from motion is then recovered using the factorization method described in Morita and Kanade's 1997 paper ("A Sequential Factorization Method for Recovering Shape and Motion from Image Streams", Section 2.3).
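As a rough illustration of the factorization step, here is a minimal NumPy sketch. It is not the project's actual code. It assumes the tracked points are stacked into a 2F x P measurement matrix `W` (x coordinates of all F frames first, then y coordinates), that every point is visible in every frame, and that the metric-constraint matrix comes out positive definite so a Cholesky factor exists. The function name `factorize` and the helper `coeffs` are hypothetical.

```python
import numpy as np

def factorize(W):
    """Affine structure from motion by rank-3 factorization.

    W: (2F, P) array; rows 0..F-1 hold x coordinates, rows F..2F-1 hold y.
    Returns M (2F, 3) camera rows and S (3, P) shape, with M @ S equal to
    the centered measurement matrix (up to noise).
    """
    F = W.shape[0] // 2
    W_reg = W - W.mean(axis=1, keepdims=True)        # remove per-frame translation

    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    root = np.sqrt(s[:3])                            # keep the rank-3 part
    M_hat = U[:, :3] * root
    S_hat = root[:, None] * Vt[:3]

    def coeffs(a, b):
        # Coefficients of the 6 unique entries of symmetric L in a^T L b.
        return np.array([a[0] * b[0],
                         a[0] * b[1] + a[1] * b[0],
                         a[0] * b[2] + a[2] * b[0],
                         a[1] * b[1],
                         a[1] * b[2] + a[2] * b[1],
                         a[2] * b[2]])

    I, J = M_hat[:F], M_hat[F:]
    # Metric constraints: |i_f| = 1, |j_f| = 1, i_f . j_f = 0 for every frame.
    A = np.array([coeffs(i, i) for i in I]
                 + [coeffs(j, j) for j in J]
                 + [coeffs(i, j) for i, j in zip(I, J)])
    b = np.concatenate([np.ones(2 * F), np.zeros(F)])
    l = np.linalg.lstsq(A, b, rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])

    Q = np.linalg.cholesky(L)                        # L = Q Q^T (needs L positive definite)
    M = M_hat @ Q
    S = np.linalg.solve(Q, S_hat)                    # S = Q^{-1} S_hat
    return M, S
```

With noisy tracks, `L` can fail to be positive definite, in which case `np.linalg.cholesky` raises an error; a real implementation would need a fallback for that case.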

Keypoints in first frame
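For reference, a Harris detector can be sketched in a few lines of NumPy. This is only an illustration, not the code used for the figure: the window half-size, the constant `kappa = 0.04`, the box (rather than Gaussian) window, and the greedy minimum-distance selection are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def box_sum(a, half):
    """Sum of `a` over a (2*half+1) x (2*half+1) window around each pixel."""
    k = 2 * half + 1
    p = np.pad(a, half, mode="edge")
    # Integral image with a leading row and column of zeros.
    c = np.pad(p, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

def harris_response(img, half=2, kappa=0.04):
    """Harris corner response det(G) - kappa * trace(G)^2 at every pixel."""
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = box_sum(Ix * Ix, half)
    Syy = box_sum(Iy * Iy, half)
    Sxy = box_sum(Ix * Iy, half)
    return Sxx * Syy - Sxy ** 2 - kappa * (Sxx + Syy) ** 2

def top_corners(R, n, min_dist=5):
    """Greedily pick up to n strongest positive responses, min_dist apart.

    Returns an array of (x, y) pixel coordinates.
    """
    picked = []
    for idx in np.argsort(R, axis=None)[::-1]:
        y, x = np.unravel_index(idx, R.shape)
        if R[y, x] <= 0:
            break
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist ** 2 for px, py in picked):
            picked.append((x, y))
            if len(picked) == n:
                break
    return np.array(picked)
```

The response is positive at corners, negative along straight edges, and near zero in flat regions, which is why only positive responses are kept as keypoints.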

2D path over the Frame Sequence

The lines show each keypoint's path. The green circle indicates its final position. The final frame is displayed in the background.

Points that moved out of frame

A conservative point removal scheme was used. If a point got close enough to the edge that the optical flow's summation window was not fully inside the image, the point was removed. (Here, this means a 7.5-pixel margin around the image border acted as a pixel "kill zone".)
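The removal rule can be sketched as a simple mask over the tracks. This is an illustration under assumptions: tracks are stored as an (F, P, 2) array of (x, y) positions, valid pixel coordinates run from 0 to width-1 and 0 to height-1, and a point that enters the margin in any frame is dropped from the whole sequence. The names `in_bounds` and `prune_tracks` are hypothetical.

```python
import numpy as np

def in_bounds(points, height, width, margin=7.5):
    """True for each (x, y) point at least `margin` pixels from every border."""
    x, y = points[:, 0], points[:, 1]
    return ((x >= margin) & (x <= width - 1 - margin)
            & (y >= margin) & (y <= height - 1 - margin))

def prune_tracks(tracks, height, width, margin=7.5):
    """Drop every point that enters the border margin in any frame.

    tracks: (F, P, 2) array of (x, y) positions per frame and point.
    Returns the surviving tracks and the boolean keep-mask over points.
    """
    keep = np.ones(tracks.shape[1], dtype=bool)
    for pts in tracks:
        keep &= in_bounds(pts, height, width, margin)
    return tracks[:, keep], keep
```

Dropping the whole track keeps the measurement matrix complete, which the factorization step needs.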

Predicted 3D locations

Somehow, the output is flipped from the image! Otherwise, the result appears reasonable.
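One possible explanation, which was not verified for this project, is the mirror ambiguity of factorization under an orthographic camera: negating the depth axis of both the motion and the shape gives a second solution that reproduces exactly the same image tracks and still satisfies the metric constraints. The sketch below illustrates this, assuming a motion matrix `M` (2F x 3) and a shape matrix `S` (3 x P); the function name is hypothetical.

```python
import numpy as np

def mirror_solution(M, S):
    """Return the depth-mirrored solution (M D, D S) with D = diag(1, 1, -1).

    Since D @ D is the identity, (M @ D) @ (D @ S) equals M @ S, so both
    solutions explain the same 2D tracks.
    """
    D = np.diag([1.0, 1.0, -1.0])
    return M @ D, D @ S
```

Because the image data cannot distinguish the two, picking the right one takes outside knowledge, such as which parts of the scene are closer to the camera. A flipped image y-axis convention when plotting could produce a similar effect.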

It is easier to see the reconstruction by just looking at a scatter plot. Those images, corresponding in position and orientation to the mesh versions above, are shown below.

Note how the bottom is square!

(It's a bit more difficult to see, but still visible, on the mesh.)

Predicted 3D path of the cameras

The k_f were first normalized to be unit vectors.
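A minimal sketch of that computation follows. It assumes k_f is the cross product of the two camera axis rows i_f and j_f, as in the factorization literature, and that the motion matrix `M` stacks all i_f rows first and then all j_f rows. The function name is hypothetical.

```python
import numpy as np

def camera_directions(M):
    """Unit viewing direction k_f = i_f x j_f for every frame.

    M: (2F, 3) motion matrix; rows 0..F-1 are i_f, rows F..2F-1 are j_f.
    Returns an (F, 3) array of unit vectors.
    """
    F = M.shape[0] // 2
    k = np.cross(M[:F], M[F:])
    return k / np.linalg.norm(k, axis=1, keepdims=True)
```

Normalizing matters because noise leaves i_f and j_f only approximately unit length and orthogonal, so the raw cross products have slightly different lengths from frame to frame.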

Discussion

The algorithm performed well and gave a reasonable reconstruction of the scene in the video. Unfortunately, since we do not have ground truth for the reconstruction, we can at best say that it looks subjectively "pretty good" (aside from being flipped).

Pyramidal Iterative Refinement

For graduate credit, iterative refinement (Lucas-Kanade) was implemented, and an image pyramid was integrated. (Thus, the pyramid uses LK iterative refinement at each level to obtain global optical flow.) First, to demonstrate the iterative refinement, the original hotel scene was re-run:
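The general idea of coarse-to-fine Lucas-Kanade tracking of a single point can be sketched as follows. This is not the project's implementation: the 15 x 15 window, three pyramid levels, the [1, 2, 1]/4 smoothing kernel, bilinear sampling, and the stopping threshold are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at float coordinates (x = column, y = row), clamped to the image."""
    h, w = img.shape
    x = np.clip(x, 0, w - 1.001)
    y = np.clip(y, 0, h - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
            + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])

def downsample(img):
    """Smooth with a separable [1, 2, 1]/4 kernel, then keep every other pixel."""
    p = np.pad(img, 1, mode="edge")
    sm = (p[:-2] + 2 * p[1:-1] + p[2:]) / 4.0
    sm = (sm[:, :-2] + 2 * sm[:, 1:-1] + sm[:, 2:]) / 4.0
    return sm[::2, ::2]

def lk_refine(I, J, p, d, half=7, iters=10, eps=1e-3):
    """Iteratively update d so that J(p + d) matches I(p) over a square window."""
    offs = np.arange(-half, half + 1)
    gx, gy = np.meshgrid(p[0] + offs, p[1] + offs)
    T = bilinear(I, gx, gy)
    # Central-difference gradients of the template window.
    Ix = (bilinear(I, gx + 1, gy) - bilinear(I, gx - 1, gy)) / 2.0
    Iy = (bilinear(I, gx, gy + 1) - bilinear(I, gx, gy - 1)) / 2.0
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    for _ in range(iters):
        diff = T - bilinear(J, gx + d[0], gy + d[1])
        b = np.array([np.sum(diff * Ix), np.sum(diff * Iy)])
        step = np.linalg.lstsq(G, b, rcond=None)[0]   # tolerates a singular G
        d = d + step
        if np.hypot(step[0], step[1]) < eps:
            break
    return d

def pyramidal_lk(I, J, p, levels=3, **kw):
    """Coarse-to-fine flow (dx, dy) of point p = (x, y) from image I to image J."""
    pyr_I, pyr_J = [I], [J]
    for _ in range(levels - 1):
        pyr_I.append(downsample(pyr_I[-1]))
        pyr_J.append(downsample(pyr_J[-1]))
    d = np.zeros(2)
    for lvl in range(levels - 1, -1, -1):
        d = lk_refine(pyr_I[lvl], pyr_J[lvl], np.asarray(p, float) / 2 ** lvl, d, **kw)
        if lvl > 0:
            d = 2.0 * d          # carry the estimate down to the next finer level
    return d
```

Each coarser level halves the apparent motion, so a shift of several pixels at full resolution becomes a shift of about one pixel at the top of the pyramid, where the linearization behind Lucas-Kanade holds. The finer levels then only need to correct a small residual.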

2D path over the Frame Sequence

Predicted 3D locations

The reconstruction on this set was fantastic. The walls were smooth, except where there was not supposed to be smoothness, e.g., the windows poking out of the roof. And, out of all the reconstructions performed, this is the only one that correctly eliminated the background floor. (No dot on the floor actually remains in frame through the entire sequence, so all of those points should be eliminated by the current algorithm.)

Next, to show that the pyramidal structure worked correctly, large movement between images was simulated by giving only every 5th frame of the hotel sequence to the algorithm:

2D path over the Frame Sequence

Predicted 3D locations

Weirdly, the z values were flipped at the end! So I flipped them back and took these pictures. Like the original, the result is also mirrored. In the end, the results still look better (smoother walls, for instance) with this tracker than the original, even with significantly less data!

Of course, it's not quite as good as using all the data. For instance, this one still thinks some of the floor is in all the frames, as evidenced by the final picture above.