Structure From Motion

Overview

The point of this project was to recover 3D structure from a series of video frames. Doing this has three main components: selecting keypoints to track, tracking those points across frames, and recovering structure from motion.

Keypoint Selection

This portion of the project was given to us in the form of a Harris corner detector. I kept the 500 strongest interest points from the detector. These interest points are then tracked across frames to obtain corresponding (x, y) coordinates in every frame, and those coordinates are used in the structure-from-motion section to generate a 3D reconstruction.
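As a rough sketch of this selection step (not the provided starter code; the function and parameter names are my own), keeping the top responses of a Harris response map might look like the following:

    import numpy as np

    def top_keypoints(response, n=500, border=10):
        """Return (x, y) coordinates of the n strongest Harris responses.

        `response` is a 2D array of Harris corner scores, one per pixel.
        Responses near the image border are zeroed out so the tracker's
        local windows later stay inside the frame.
        """
        r = response.copy()
        r[:border, :] = 0
        r[-border:, :] = 0
        r[:, :border] = 0
        r[:, -border:] = 0
        # Indices of the n largest responses in the flattened array.
        flat = np.argsort(r, axis=None)[::-1][:n]
        ys, xs = np.unravel_index(flat, r.shape)
        return np.stack([xs, ys], axis=1).astype(np.float64)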

Tracking

This portion of the assignment consisted of tracking the keypoints across frames with the Kanade-Lucas-Tomasi (KLT) tracker: we compute the optical flow between consecutive frames and use that flow to shift the keypoints. The tracker rests on three assumptions: brightness constancy, small motion, and spatial coherence. The optical flow is computed from the x-gradient and y-gradient of a frame together with the temporal gradient between sequential frames. The interest points found in the previous section are useful here because corner-like regions of the image are the places where gradient-based optical flow works best.
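As a rough illustration of a single tracking step (a sketch under the stated assumptions, not the assignment's implementation; `klt_step` and its parameters are my own names):

    import numpy as np
    from scipy import ndimage

    def klt_step(frame0, frame1, points, win=7):
        """One KLT update: shift each keypoint by its local optical flow.

        For each point, solves the 2x2 Lucas-Kanade system
            [sum IxIx  sum IxIy] [u]   [-sum IxIt]
            [sum IxIy  sum IyIy] [v] = [-sum IyIt]
        over a (2*win+1)^2 window, which encodes brightness constancy,
        small motion, and spatial coherence. Points are assumed to lie
        at least `win` pixels from the border (enforced at selection).
        """
        I0 = frame0.astype(np.float64)
        I1 = frame1.astype(np.float64)
        Ix = ndimage.sobel(I0, axis=1, mode='nearest') / 8.0  # x-gradient
        Iy = ndimage.sobel(I0, axis=0, mode='nearest') / 8.0  # y-gradient
        It = I1 - I0                                          # temporal gradient

        moved = points.astype(np.float64)
        for k, (x, y) in enumerate(points):
            r, c = int(round(y)), int(round(x))
            sl = np.s_[r - win:r + win + 1, c - win:c + win + 1]
            ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
            A = np.array([[ix @ ix, ix @ iy],
                          [ix @ iy, iy @ iy]])
            b = -np.array([ix @ it, iy @ it])
            # Least squares stays stable if the window is gradient-poor.
            u, v = np.linalg.lstsq(A, b, rcond=None)[0]
            moved[k] = (x + u, y + v)
        return moved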



Following is a visualization of one frame of optical flow:

Any points that moved outside of the frame during tracking were discarded (a sketch of this filtering appears below). The discarded points are visualized in the following image:
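A minimal sketch of that filtering (the function name is mine; in practice the mask would be accumulated across all frames so that only complete tracks enter the measurement matrix used later):

    import numpy as np

    def in_frame_mask(points, width, height):
        """Boolean mask of keypoints still inside the frame bounds."""
        x, y = points[:, 0], points[:, 1]
        return (x >= 0) & (x < width) & (y >= 0) & (y < height)

    # e.g. keep &= in_frame_mask(tracked, W, H) after every tracking step,
    # then drop the discarded points' tracks from every frame at the end.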



Following is a visualization of the movements of 20 random keypoints superimposed over the first frame:

Structure from Motion

This section is concerned with generating 3D coordinates for the 2D tracked points: recovering the camera parameters and 3D points from the observed 2D points. We do this by building a measurement matrix from the observed 2D points and factorizing it via singular value decomposition. From the factorization we arrive at a possible decomposition into camera parameters and 3D points. This gives a possible 3D reconstruction, but one that is off by an affine transformation. This all follows the method laid out by Tomasi, Kanade, and Morita. To eliminate the affine ambiguity, we enforced the metric constraint that each frame's image axes must be unit length and perpendicular, using the method described by Morita and Kanade (1997).
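The factorization itself is compact. Here is a minimal sketch of the rank-3 decomposition (assuming the measurement matrix stacks all x-coordinate rows above all y-coordinate rows; variable names are mine):

    import numpy as np

    def factor_measurement_matrix(W):
        """Affine factorization of a 2F x P measurement matrix.

        Returns the affine camera matrix M (2F x 3) and shape S (3 x P),
        defined only up to an affine transformation, following
        Tomasi-Kanade.
        """
        # Register: subtract each row's centroid so translation drops out.
        W = W - W.mean(axis=1, keepdims=True)
        # Rank-3 truncated SVD; a rank-3 W is what the affine model predicts.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        sqrt_s = np.sqrt(s[:3])
        M = U[:, :3] * sqrt_s          # camera axes, two rows per frame
        S = sqrt_s[:, None] * Vt[:3]   # 3D points
        return M, S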

Following are three different views of the generated point cloud with an associated surface map (including camera directions indicated by red lines):

Following are three plots, showing the three dimensions of the estimated camera position through the frames (x, y, and z versus frame number):
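One way such per-frame curves (and the red camera-direction lines above) can be derived from the recovered camera matrix is via each frame's viewing direction, the cross product of its two image axes; this is a sketch under that assumption and the same row-stacking convention as before, not necessarily how the plots above were produced:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_camera_curves(M):
        """Plot each component of the per-frame viewing direction."""
        F = M.shape[0] // 2
        k = np.cross(M[:F], M[F:])               # i_f x j_f per frame
        k /= np.linalg.norm(k, axis=1, keepdims=True)
        fig, axes = plt.subplots(3, 1, sharex=True)
        for ax, label, curve in zip(axes, 'xyz', k.T):
            ax.plot(curve)
            ax.set_ylabel(label)
        axes[-1].set_xlabel('frame')
        plt.show()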

Affine Ambiguity Details

1. No metric constraints: This led to an extremely flattened 3D reconstruction of the hotel.

2. Metric constraints as explained in Morita and Kanade (1997), pp. 859-860: I used the system of equations described in the paper (G, l, and c); the paper gives a full explanation of the algorithm, and a sketch of this step appears after this list. Eliminating the affine ambiguity vastly improved my results (pictured above).
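A sketch of that metric upgrade, under the same row-stacking assumption as the factorization sketch above (the least-squares solve below corresponds to the paper's G l = c system; names and numerical details are mine):

    import numpy as np

    def metric_upgrade(M, S):
        """Resolve the affine ambiguity via metric constraints.

        Finds a 3x3 matrix A such that the rows of M @ A satisfy
        |i_f| = |j_f| = 1 and i_f . j_f = 0 for every frame. Since
        L = A A^T is symmetric, it has 6 unknowns, stacked into l.
        """
        F = M.shape[0] // 2

        def g_row(a, b):
            # Coefficients of a^T L b in terms of the 6 unknowns of L.
            return np.array([a[0]*b[0],
                             a[0]*b[1] + a[1]*b[0],
                             a[0]*b[2] + a[2]*b[0],
                             a[1]*b[1],
                             a[1]*b[2] + a[2]*b[1],
                             a[2]*b[2]])

        G, c = [], []
        for f in range(F):
            i, j = M[f], M[F + f]
            G += [g_row(i, i), g_row(j, j), g_row(i, j)]
            c += [1.0, 1.0, 0.0]
        l = np.linalg.lstsq(np.array(G), np.array(c), rcond=None)[0]

        L = np.array([[l[0], l[1], l[2]],
                      [l[1], l[3], l[4]],
                      [l[2], l[4], l[5]]])
        # Recover A from L = A A^T; clip tiny negative eigenvalues
        # caused by noise before taking the square root.
        w, V = np.linalg.eigh(L)
        A = V @ np.diag(np.sqrt(np.clip(w, 0, None)))
        return M @ A, np.linalg.inv(A) @ S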

Conclusion

Overall, this method worked well. Most of the keypoints were tracked well throughout the frames (except for a number of points in the less-textured portions of the images). Further, the structure-from-motion step generated a reasonable point cloud and surface map for the sequence of frames.