In this project, we take a series of related images (the frames of a video) and attempt to track feature points across each image. Using the tracked feature points and the movement of the camera, we attempt to extract the 3D structure of the object which is being viewed.
Algorithm
Step 1: Select keypoints to track through the series of images. In order to find correspondences between the images, we want to select points whose motion can be estimated reliably, such as corner points. For this reason, I used the Harris Corner Detector to find the keypoints. Below are the 500 points in the first frame for which the Harris Corner Detector was most confident (that they were corners).
     
    
Step 2: Track the keypoints across the frames. In order to perform feature tracking, I implemented the Kanada-Lucas-Tomasi tracker. This tracker first computed the optical flow of the image. Optical flow is the movement of each individual pixel from one image to another. Using the optical flow, the tracker could interpolate the new position of each keypoint, and by doing so could track the keypoints through the frames.
In order to compute optical flow, the KLT tracker assumes that each pixel will be of the same brightness in both images, the movements are small, and that, in general, points move like their neighbors. It then solves a system of equations to get the vertical and horizontal movement of each pixel. See the Feature Tracking section on the assignment page for the equations. The tracker builds an optical flow matrix for both the vertical and horizontal movements of each pixel. It then uses the current position of each keypoint to interpolate that points vertical and horizontal movement.
When the camera is moving, some of the keypoints found in the first frame will disappear off the edge in subsequent frames. Below are the 473 points from the first frame that remain visible throughout the sequence.
     
    
Below are the same 473 points, having been tracked across the sequence to the last frame. Note that while the points on the building appear to have been tracked accurately, the points on the ground in the first frame (lower left corner of the image) have not been tracked particularly well. Most of them should have disappeared off the edge.
     
    
Below are the 27 keypoints from the first frame that disappear off the edge of the image at some point during the sequence of frames.
     
    
Below are the tracked paths across the frames of 20 random points. In the first frame, the point is at the beginning of its path. In the last frame, the point is at the end of its path.
     
    
     
    
As seen in the images, most of the points remain in the same relative location on the structure. The motion of the camera (slightly moving down and to the left while rotating so that the building goes from being at a tilt to standing vertically) causes the points in the center of the image on the building to make small movements, while the points nearer to the edges of the building make larger movements. Most movements appear to be some sort of circular arc. The exception is the point that was on some sort of ground corner in the first frame (see the lower left of the image) and is improperly tracked. This point moves erratically, in a different direction from the others (it moves counterclockwise, while the others move clockwise). It is probable that this occured because the movement of the background was much greater than the movement of the building.
Step 3: Using the tracked keypoints, the structure of the object in the image can be estimated. This is done using the equation D = AX, where D is the matrix of the observations (the x and y coordinates of each tracked feature point across all frames), A is the matrix of cameras (the matrix of the camera properties for each frame), and X is the matrix of the true (x, y, z) coordinates of each tracked feature point. D was created from the observations in Step 2, and A and X can be extracted from D.
The singular value decomposition (SVD) of D is taken, so D = U * W * Vt (Vt being the transpose of V). Because A and X are both rank 3 matrices, D is also a rank 3 matrix. This means that, if the tracking were done perfectly, the first 3 columns of U (U3), the first 3 x 3 block of W (W3), and the first 3 columns of V (V3, with V3t being the first 3 rows of Vt) can be used to reform D (D = U3 * W3 * V3t). Because D is an approximation, this will actually be an estimate.
Using these matrices, A and X can be approximated. A = U3 * (W3 ^ 0.5) and X = (W3 ^ 0.5) * V3t. At this point, affine ambiguity must be eliminated. See section 3.4 of Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade 1992) for an overview of this algorithm, and section 2.3 of A Sequential Factorization Method for Recovering Shape and Motion from Image Streams (Morita and Kanade, 1997) for the steps to eliminate affine ambiguity.
Below are 3D plots of the structure and path of the camera as estimated by the tracked keypoints. In these images, the structure and motion of the camera have been mirrored. The camera moves down and to the left, but in these plots, it appears to move down and to the right. Likewise, the left part of the structure appears on the right, and vice versa. Note that the camera has no concept of where true "up" is.
     
    
Below are plots of the structure from above and below. You can see that the corner of the building juts out from the image. Also, note that the depth of the background could not be accurately determined. The camera lines appear to radiate out from the image. This is because we are using an orthographic camera, which can determine the camera's direction from the scene, but not its distance.
     
    
     
    
Below is a plot of the scene from below the camera's motion.
     
    
Below is a plot of the scene from the side. Note the poorly approximated background.
    