We want to reconstruct the 3D shape of an object from a series of observations (a video). There are three essential components to this: keypoint selection, feature tracking, and structure from motion.
Two characteristics of good features are distinctiveness (the local intensity pattern is unique enough that the point is not confused with its neighbors) and repeatability (the same point can be reliably found again in subsequent frames).
Such properties are fulfilled by "corners" in the image. If we look at the image through a small window, we should be able to detect a corner by observing a large change in intensity when shifting the window in any direction. The Harris corner detector is based on this idea, and we use it to select keypoints (which are corners).
The following is a picture of a house with all of the corners detected by the Harris corner detector marked.
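As a rough illustration of this step, the sketch below computes a Harris response map in MATLAB. The filename, window size, threshold, and the constant k = 0.04 are illustrative assumptions, not values taken from our actual implementation.

```matlab
% Sketch of a Harris response computation; all parameters are illustrative.
I = im2double(rgb2gray(imread('house_frame1.png')));  % hypothetical filename
[Ix, Iy] = gradient(I);                 % image derivatives
g = fspecial('gaussian', 15, 2);        % window for the second-moment matrix
Sxx = imfilter(Ix.^2, g);               % windowed sums of Ix*Ix, Iy*Iy, Ix*Iy
Syy = imfilter(Iy.^2, g);
Sxy = imfilter(Ix.*Iy, g);
k = 0.04;                               % common empirical Harris constant
R = (Sxx.*Syy - Sxy.^2) - k*(Sxx + Syy).^2;   % det(M) - k*trace(M)^2
corners = imregionalmax(R) & (R > 0.01*max(R(:)));  % local maxima above a threshold
[ys, xs] = find(corners);               % keypoint coordinates
```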
To track the movement of the detected keypoints, we implement the Kanade-Lucas-Tomasi (KLT) tracker. This essentially involves computing optical flow between successive video frames and moving the selected keypoints from the first frame along the flow field.
With the KLT tracker, we make the following three assumptions:

1. Brightness constancy: a point has the same intensity in consecutive frames.
2. Small motion: points move only slightly between frames, so a first-order Taylor expansion is valid.
3. Spatial coherence: nearby points move together, with the same u and v.
The first assumption the KLT tracker makes is brightness constancy. A point should have the same value after translation in the next frame (where I is the image function):

$$I(x + u, y + v, t + 1) = I(x, y, t)$$
Take the first-order Taylor expansion of I(x + u, y + v, t + 1), where Ix and Iy are the x- and y-derivatives of the image and It is the temporal gradient:

$$I(x + u, y + v, t + 1) \approx I(x, y, t) + I_x u + I_y v + I_t$$
Therefore:

$$I(x + u, y + v, t + 1) - I(x, y, t) \approx I_x u + I_y v + I_t$$
And by the first equation (brightness constancy), the left-hand side is zero, so:

$$I_x u + I_y v + I_t = 0$$
This is only one constraint when we have two unknowns (u and v). We get more by assuming that nearby pixels at points pi, i ∈ [1, 225] (in a 15×15 box around the pixel) move with the same u and v:

$$\begin{bmatrix} I_x(p_1) & I_y(p_1) \\ \vdots & \vdots \\ I_x(p_{225}) & I_y(p_{225}) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} I_t(p_1) \\ \vdots \\ I_t(p_{225}) \end{bmatrix}$$
We can solve this overconstrained linear system via least squares (abbreviating the above to Ad = b), which amounts to solving the normal equations $A^T A d = A^T b$:

$$\begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}$$
This results in two equations with two unknowns for each pixel. We can solve for u and v by inverting the 2×2 matrix on the left-hand side and multiplying it by the vector on the right-hand side.
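As a minimal sketch of this per-pixel solve, assuming consecutive grayscale frames I1 and I2 and an interior pixel (x, y) (these variable names are ours, not from the original code):

```matlab
% Sketch of the Lucas-Kanade solve for one 15x15 window centered on (x, y);
% I1 and I2 are consecutive grayscale frames (assumed variables).
[Ix, Iy] = gradient(I1);               % spatial derivatives
It = I2 - I1;                          % temporal gradient
win = -7:7;                            % offsets for the 15x15 neighborhood
wx = Ix(y+win, x+win);                 % gradients over the window
wy = Iy(y+win, x+win);
wt = It(y+win, x+win);
A = [wx(:), wy(:)];                    % 225x2 matrix of [Ix Iy] rows
b = -wt(:);                            % right-hand side
d = (A'*A) \ (A'*b);                   % solve the 2x2 normal equations
u = d(1); v = d(2);                    % flow at this pixel
```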
The following is a visualization of the optical flow for all the frames of the house video:
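For reference, a flow field like this can be drawn with quiver. This is a minimal sketch assuming per-pixel flow arrays U and V and a frame I1, with an arbitrary subsampling step of 10 pixels for readability:

```matlab
% Sketch of a flow-field visualization (the subsampling step is illustrative).
[H, W] = size(U);
[xg, yg] = meshgrid(1:10:W, 1:10:H);   % subsampled grid of arrow origins
imshow(I1); hold on;
quiver(xg, yg, U(1:10:H, 1:10:W), V(1:10:H, 1:10:W), 'r');
hold off;
```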
Once we have the per-pixel optical flow, we can track a point through the flow by looking up its displacement in the current frame and adding that to its position to get its position in the next frame. We used interp2 to do the lookup, since keypoint locations are generally non-integer.
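Concretely, the lookup for a single keypoint might look like the following sketch (U and V are the horizontal and vertical flow fields; the variable names are ours):

```matlab
% Sketch of advancing one keypoint (x, y) along the flow field (U, V).
u = interp2(U, x, y);                  % displacement sampled at a sub-pixel location
v = interp2(V, x, y);
x = x + u;                             % predicted position in the next frame
y = y + v;
```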
We also discarded any keypoints whose predicted location moves out of frame (or anywhere near the borders, to be safe). The following picture highlights the discarded keypoints.
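One way to implement this check is sketched below; the 15-pixel safety margin is an illustrative choice, not the value we actually used.

```matlab
% Sketch of discarding keypoints near or beyond the frame borders.
[H, W] = size(U);                      % frame dimensions from the flow field
margin = 15;                           % illustrative safety margin
valid = x > margin & x < W - margin & y > margin & y < H - margin;
x = x(valid); y = y(valid);            % keep only safely in-frame keypoints
```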
Now, we use the discovered keypoint tracks as input for the affine structure from motion procedure described in Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade 1992). The overview of the algorithm is:

1. Stack the tracked x- and y-coordinates over all F frames and P points into a 2F×P measurement matrix W, centering each row by subtracting its mean.
2. Compute the singular value decomposition W = UΣVᵀ and keep only the three largest singular values, giving the rank-3 factorization W ≈ MS into a motion matrix M and a shape matrix S.
3. Resolve the remaining affine ambiguity by finding a 3×3 matrix Q such that MQ satisfies the metric constraints (each frame's image axes have unit length and are orthogonal); the reconstructed shape is then Q⁻¹S.
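The rank-3 factorization at the heart of the method reduces to a few lines. This sketch assumes a 2F×P measurement matrix W of tracked coordinates and omits the metric upgrade step:

```matlab
% Sketch of the Tomasi-Kanade factorization (metric upgrade omitted).
W = W - mean(W, 2);                    % center each row about its centroid
[Uw, D, Vw] = svd(W);                  % singular value decomposition
M = Uw(:,1:3) * sqrt(D(1:3,1:3));      % 2F-by-3 affine motion
X = sqrt(D(1:3,1:3)) * Vw(:,1:3)';     % 3-by-P shape, up to an affine transform
```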
In summary, we can reconstruct the 3D shape of an object from a video via these three steps: keypoint selection, feature tracking, and the structure from motion algorithm.