CS143 Introduction to Computer Vision

Project 5: Tracking and Structure from Motion (by Margaret Kim, mk20)

Objective

To reconstruct the 3D shape of an object from a series of observations (a video). There are three essential components to this: keypoint selection, feature tracking, and structure from motion.

Keypoint Selection

The characteristics of good features are:

  1. Repeatability -- The same feature can be found in several images despite geometric and photometric transformations
  2. Saliency -- Each feature is distinctive
  3. Compactness and Efficiency -- Many fewer features than image pixels
  4. Locality -- A feature occupies a relatively small area of the image; robust to clutter and occlusion

These properties are fulfilled by image "corners". If we look at the picture through a small window, we can detect a corner by observing a large change in intensity when the window is shifted in any direction. The Harris corner detector is built on this idea, and we use it to select our keypoints (which are corners).
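
As a concrete illustration, here is a minimal MATLAB sketch of this cornerness computation (the Gaussian window size, the constant k, and the threshold are illustrative assumptions, not necessarily the settings used in this project):

    % Harris keypoint selection: respond strongly where the image gradient
    % varies in more than one direction inside a small window.
    function pts = harris_keypoints(im, k, thresh)
        im = double(im);
        [Ix, Iy] = gradient(im);              % image gradients
        g = fspecial('gaussian', 9, 1.5);     % assumed window for the sums
        Sxx = imfilter(Ix.^2, g);
        Syy = imfilter(Iy.^2, g);
        Sxy = imfilter(Ix.*Iy, g);
        % Cornerness R = det(M) - k*trace(M)^2 of the second-moment matrix M
        R = Sxx.*Syy - Sxy.^2 - k*(Sxx + Syy).^2;
        R(R < thresh) = 0;                    % keep strong responses only
        R(R < imdilate(R, ones(3))) = 0;      % 3x3 non-maximum suppression
        [y, x] = find(R);
        pts = [x, y];
    end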

The following is a picture of a house which indicates all of the corners detected by the Harris corner detector.

[Image: Harris corners detected on the house picture]

Feature Tracking

To track the movement of the detected keypoints, we implement a Kanade-Lucas-Tomasi (KLT) tracker. This essentially involves computing the optical flow between successive video frames and moving the selected keypoints from the first frame along the flow field.

With the KLT tracker, we make the following three assumptions:

  1. Brightness constancy -- projection of the same point looks the same in every frame
  2. Small motion -- points do not move very far
  3. Spatial coherence -- points move like their neighbors

Consider the first assumption, brightness constancy: a point should have the same intensity after translation in the next frame (where I is the image function):

  I(x, y, t) = I(x + u, y + v, t + 1)

Take the first-order Taylor expansion of I(x + u, y + v, t + 1), where Ix and Iy are the x- and y-derivatives of the image and It is the temporal gradient:

  I(x + u, y + v, t + 1) ≈ I(x, y, t) + Ix u + Iy v + It

Therefore:

  I(x + u, y + v, t + 1) − I(x, y, t) ≈ Ix u + Iy v + It

And by the first equation:

  Ix u + Iy v + It ≈ 0

This is only one constraint, but we have two unknowns (u and v). We get more constraints by assuming that nearby pixels at points pi, i ∈ [1, 225] (in a 15×15 box around the pixel), move with the same u and v:

  [ Ix(p1)    Iy(p1)   ]           [ It(p1)   ]
  [ Ix(p2)    Iy(p2)   ]   [u]     [ It(p2)   ]
  [   ...        ...   ]   [v] = − [   ...    ]
  [ Ix(p225)  Iy(p225) ]           [ It(p225) ]

We can solve this overconstrained linear system via least-squares (abbreviating the above to Ad = b):

  A^T A d = A^T b, i.e.

  [ Σ Ix Ix   Σ Ix Iy ] [u]     [ Σ Ix It ]
  [ Σ Ix Iy   Σ Iy Iy ] [v] = − [ Σ Iy It ]

This results in two equations with two unknowns for each pixel. We can solve for u and v by inverting the 2x2 matrix on the left-hand side and multiplying it by the vector on the right-hand side.
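
A minimal MATLAB sketch of this per-pixel solve, using the 15×15 window from above with plain (unweighted) window sums, might look like:

    % One Lucas-Kanade flow step between consecutive frames im1 and im2,
    % solving the windowed least-squares system at every pixel at once.
    function [u, v] = lk_flow(im1, im2)
        im1 = double(im1);  im2 = double(im2);
        [Ix, Iy] = gradient(im1);             % spatial gradients
        It = im2 - im1;                       % temporal gradient
        box = ones(15);                       % sums over the 15x15 window
        Sxx = imfilter(Ix.*Ix, box);  Sxy = imfilter(Ix.*Iy, box);
        Syy = imfilter(Iy.*Iy, box);
        Sxt = imfilter(Ix.*It, box);  Syt = imfilter(Iy.*It, box);
        % Closed-form inverse of [Sxx Sxy; Sxy Syy] [u; v] = -[Sxt; Syt]
        detM = Sxx.*Syy - Sxy.^2;
        u = (-Syy.*Sxt + Sxy.*Syt) ./ detM;
        v = ( Sxy.*Sxt - Sxx.*Syt) ./ detM;
    end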

The following is a visualization of the optical flow for all the frames of the house video:

[Image: optical flow visualized for the frames of the house video]

Once we have the per-pixel optical flow, we can track a point through the flow by looking up its displacement in the flow field and adding it to its current position to get its position in the next frame. We use interp2 to do the (sub-pixel) lookup.
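
In code, one step of this lookup might read as follows (xs and ys are assumed to be the keypoint coordinate vectors, and u and v the flow fields from above):

    % Advance keypoints through the flow for one frame; interp2 returns the
    % flow values at the (generally sub-pixel) keypoint locations.
    dx = interp2(u, xs, ys);
    dy = interp2(v, xs, ys);
    xs = xs + dx;
    ys = ys + dy;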

[Images: tracked keypoints; the paths of 20 tracked points; the 20 tracked points on the last frame]

We also discard any keypoint whose predicted location moves out of frame (or anywhere near the borders, to be safe). The following picture highlights the discarded keypoints, and a sketch of the bookkeeping follows it.

[Image: the discarded keypoints highlighted]
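
The discard step might look like this (the 16-pixel safety margin and the variable names are assumptions):

    % Drop tracks whose predicted position leaves the frame or its border zone.
    [h, w] = size(frame);
    margin = 16;                              % assumed safety border
    keep = xs > margin & xs < w - margin & ys > margin & ys < h - margin;
    xs = xs(keep);
    ys = ys(keep);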

Structure from Motion

Now, we use the discovered keypoint tracks as input for the affine structure from motion procedure described in Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade 1992). The overview of the algorithm is:

  1. Given: m images and n tracked features xij
  2. For each image i, center the feature coordinates
  3. Construct a 2m by n measurement matrix, D
    1. Column j contains the projections of point j in all views
    2. Row i contains one coordinate of the projections of all the n points in image i
  4. Factorize D:
    1. Compute SVD: D = U W V^T
    2. Create U3 by taking the first 3 columns of U
    3. Create V3 by taking the first 3 columns of V
    4. Create W3 by taking the upper left 3 by 3 block of W
  5. Create the motion (affine) and shape (3D) matrices: A = U3 * W3^(1/2) and X = W3^(1/2) * V3^T
  6. Eliminate affine ambiguity

To eliminate the affine ambiguity (i.e., to apply the metric constraints), we recover QQ^T (eq. 16 of Tomasi and Kanade): let L = QQ^T and solve for L using least squares. Finally, compute the Cholesky decomposition of L = QQ^T to recover an appropriate Q; the motion matrix then becomes AQ and the shape matrix becomes Q^-1 X. A more detailed explanation of this "metric transformation" can be found in Section 2.3 of A Sequential Factorization Method for Recovering Shape and Motion from Image Streams (Morita and Kanade, 1997).
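
Putting steps 2 through 5 together, a minimal MATLAB sketch of the factorization might look like this (xs and ys are assumed m-by-n matrices of tracked coordinates, row i for image i and column j for point j; the least-squares setup for L is only indicated in comments):

    % Centered 2m x n measurement matrix (implicit expansion, R2016b+)
    D = [xs - mean(xs, 2); ys - mean(ys, 2)];
    [U, W, V] = svd(D);
    U3 = U(:, 1:3);  W3 = W(1:3, 1:3);  V3 = V(:, 1:3);
    A = U3 * sqrt(W3);                        % 2m x 3 affine motion
    X = sqrt(W3) * V3';                       % 3 x n shape
    % Metric upgrade: build and solve the least-squares system of eq. 16 for
    % L = Q*Q', then recover Q by Cholesky and apply it:
    % Q = chol(L, 'lower');  A = A * Q;  X = Q \ X;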

Here are the results of reconstructing 3D structure using the above algorithm.

[Images: the reconstructed 3D point cloud shown from three viewpoints]

The red lines represent the 3D path of the camera in the following three images: XY view, YZ view, and XZ view.

[Images: the camera path in the XY, YZ, and XZ views]
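
One way to obtain such a path, sketched here under the assumption that what is plotted is the per-frame viewing direction, is to derive it from the motion matrix A:

    % Rows i and m+i of A are the camera's image-plane axes for frame i;
    % their cross product gives the viewing direction for that frame.
    m = size(A, 1) / 2;
    cam = zeros(m, 3);
    for i = 1:m
        ki = cross(A(i, :), A(m + i, :));
        cam(i, :) = ki / norm(ki);            % unit viewing direction, frame i
    end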

In Conclusion

We can reconstruct the 3D structure of a scene from video in three steps: keypoint selection, feature tracking, and structure from motion.