Algorithms
The algorithm is separated into a few parts.

Feature Tracking

First, good feature points to track are found with a Harris corner detection algorithm. I keep the 500 strongest feature points returned by the Harris corner detector and use them to track the structure in the image. The image below shows the initial 500 key features selected in the first video frame. (I used a series of images that portray a hotel building.)
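A minimal MATLAB sketch of this selection step, assuming the Image Processing Toolbox corner function; the image file name is only a placeholder:

    % Read the first frame (placeholder file name) and detect the 500
    % strongest Harris corners.
    frame1 = im2double(imread('hotel_frame_001.png'));
    keypoints = corner(frame1, 'Harris', 500);   % P-by-2 matrix of [x, y] locations
    x = keypoints(:, 1);
    y = keypoints(:, 2);

    % Overlay the selected keypoints on the frame.
    imshow(frame1); hold on
    plot(x, y, 'g.');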
After finding the key features in the first image, I use the Kanade-Lucas-Tomasi (KLT) tracker to follow the detected keypoints. I first compute the optical flow between successive video frames to determine how the image shifts, then I move the keypoints selected in the first frame along the flow field, essentially tracking their movement. The flow fields are calculated under the assumptions of brightness constancy and that all nearby pixels shift in the same way, which produces the following two equations.
Equation 1 (brightness constancy, linearized): for a pixel with spatial image gradients Ix, Iy and temporal derivative It,

    Ix * u + Iy * v + It = 0

Equation 2 (nearby pixels shift the same way): every pixel p_i in a small window around a keypoint shares the same flow (u, v), so

    Ix(p_i) * u + Iy(p_i) * v = -It(p_i),   i = 1, ..., n

Then, combining equations 1 and 2, I can solve for the flow fields u and v via least-squares. Stacking the window equations as A * [u; v] = b gives

    [u; v] = (A' * A)^-1 * A' * b

where A' * A = [sum(Ix^2), sum(Ix*Iy); sum(Ix*Iy), sum(Iy^2)] and A' * b = -[sum(Ix*It); sum(Iy*It)], with the sums taken over the window.
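A rough MATLAB sketch of this least-squares solve for a single window; the frame names, window size, and center coordinates (cx, cy) are illustrative rather than my exact implementation:

    % Lucas-Kanade flow for one window of two grayscale double frames I1, I2,
    % centered at integer coordinates (cx, cy); assumes the window stays in bounds.
    half = 7;                                  % 15x15 window (illustrative size)
    [Ix, Iy] = gradient(I1);                   % spatial gradients
    It = I2 - I1;                              % temporal derivative

    rows = (cy-half):(cy+half);
    cols = (cx-half):(cx+half);
    A = [reshape(Ix(rows, cols), [], 1), reshape(Iy(rows, cols), [], 1)];
    b = -reshape(It(rows, cols), [], 1);

    uv = (A' * A) \ (A' * b);                  % least-squares solution
    u = uv(1);  v = uv(2);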
With the flow fields in hand, I can track the key points across successive video frames. For better accuracy, I use interp2 to look up the flow at the sub-pixel locations of the key points. The image below shows the final positions of 20 randomly selected key points, and the paths they traversed to get there.
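A minimal sketch of how the keypoints are advanced with interp2, assuming u and v are dense per-pixel displacement maps; variable names are illustrative:

    % Advance keypoints (x, y) by the flow fields u, v, sampling the flow
    % at sub-pixel keypoint locations with bilinear interpolation.
    dx = interp2(u, x, y, 'linear');
    dy = interp2(v, x, y, 'linear');
    x = x + dx;
    y = y + dy;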
Note that some of the feature points drifted outside the bounds of the image and are discarded. Here is an image showing all the feature points in frame 1 that later moved out of bounds and were discarded.
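A small sketch of the discard step, assuming x and y hold the current keypoint coordinates and frame is the current image:

    % Keep only the keypoints that are still inside the image bounds.
    [h, w] = size(frame);
    inside = x >= 1 & x <= w & y >= 1 & y <= h;
    x = x(inside);
    y = y(inside);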
Structure from Motion

After obtaining the feature points across all video frames, the 3D structure of the object in the image can be recovered with the factorization method. Several steps are required to achieve this. First, the feature coordinates in each frame are centered (the per-frame centroid is subtracted), producing a 2N*P measurement matrix D, where N is the number of frames and P the number of features. Then we take the singular value decomposition (MATLAB function svd) of D and obtain U, W, V such that D = U W V'. We create U3 by taking the first 3 columns of U, V3 by taking the first 3 columns of V, and W3 by taking the upper-left 3*3 block of W. With U3, W3, and V3, we can obtain the motion (affine) and shape (3D) matrices, where M (motion matrix) = U3 * W3^(1/2) and X (shape matrix) = W3^(1/2) * V3'. This M and X are close to the true solutions, but they are unique only up to an affine transformation. A 3*3 nonsingular matrix A transforms M and X into the true solutions via the following equations:
actualM = M * A
actualX = inv(A) * X
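A sketch of this factorization step in MATLAB, assuming xs and ys are N-by-P matrices of the tracked x and y coordinates (frame i, point j); the stacking order of D is my own illustrative choice:

    % Build the centered 2N-by-P measurement matrix (x rows stacked over y rows).
    D = [xs - mean(xs, 2); ys - mean(ys, 2)];

    % Rank-3 factorization via the SVD.
    [U, W, V] = svd(D);
    U3 = U(:, 1:3);
    W3 = W(1:3, 1:3);
    V3 = V(:, 1:3);

    M = U3 * sqrtm(W3);    % affine motion matrix, 2N-by-3
    X = sqrtm(W3) * V3';   % affine shape matrix, 3-by-P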
To find A, we observe that the rows of M, which correspond to the camera axes i and j of each frame, must satisfy the normalization (metric) constraints. This provides an overdetermined system of 3 * (number of frames) equations. This is described in detail in the following excerpt of the Toshihiko Morita and Takeo Kanade paper, A Sequential Factorization Method for Recovering Shape and Motion from Image Streams. Note that in the excerpt, Mhat refers to the M described above, and M refers to the actualM described above.
To summarize, we get A by arranging the elements of M into G as described in equation (16) of the paper, where each element is obtained by following step (17). Then we obtain the vector c, also according to equation (16). With G and c, we can obtain l with the equation in (18). With l, we can build L, which is described in (14). Since we know that L = A * A' from (13), we can use the Cholesky factorization to obtain the lower triangular matrix A, which is what we want. Finally, with A, we can obtain actualM and actualX with the equations described in (9) and (10).
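A sketch of this metric upgrade under the D stacking used above (the first N rows of M are the per-frame i axes, the next N rows the j axes); the ordering of the six unknowns in l and the helper g are my own illustrative choices, and the code assumes the recovered L comes out positive definite:

    N = size(M, 1) / 2;
    G = zeros(3 * N, 6);
    c = zeros(3 * N, 1);

    % g(a, b) gives the coefficients of a' * L * b in the six unknowns
    % l = [L11; L12; L13; L22; L23; L33] of the symmetric matrix L.
    g = @(a, b) [a(1)*b(1); a(1)*b(2) + a(2)*b(1); a(1)*b(3) + a(3)*b(1); ...
                 a(2)*b(2); a(2)*b(3) + a(3)*b(2); a(3)*b(3)];

    for f = 1:N
        i_f = M(f, :)';        % x-axis row of frame f
        j_f = M(N + f, :)';    % y-axis row of frame f
        G(f, :)       = g(i_f, i_f)';   % |i_f|^2 = 1
        G(N + f, :)   = g(j_f, j_f)';   % |j_f|^2 = 1
        G(2*N + f, :) = g(i_f, j_f)';   % i_f . j_f = 0
        c(f) = 1;  c(N + f) = 1;        % c(2*N + f) stays 0
    end

    l = G \ c;                                              % least-squares solve
    L = [l(1) l(2) l(3); l(2) l(4) l(5); l(3) l(5) l(6)];   % symmetric L = A*A'
    A = chol(L, 'lower');                                   % Cholesky factor
    actualM = M * A;
    actualX = A \ X;                                        % inv(A) * X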
Structure from Motion Results
The following images show the recovered 3D positions of the tracked feature points, from different angles:
Laying them on top of a 2D Delaunay triangulation and a 3D texture-mapped mesh, it looks like this:
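A sketch of how such a mesh can be rendered in MATLAB, assuming pts1 holds the P-by-2 frame-1 keypoint coordinates and actualX is the 3-by-P shape matrix; the texture-mapping details are omitted:

    % Triangulate the keypoints in the 2D image plane of frame 1, then lift
    % the triangulation onto the recovered 3D points and render it as a surface.
    tri = delaunay(pts1(:, 1), pts1(:, 2));
    trisurf(tri, actualX(1, :), actualX(2, :), actualX(3, :));
    axis equal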
We can also visualize the camera positions and viewing directions. The images below show the 3D points (at different angles) with the blue dots marking the positions of the camera across the different frames, and the red lines showing the viewing direction of each camera. Note that the camera moved from right to left.
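Under the orthographic model used here, the viewing direction of each camera can be taken as the cross product of its two rows in the metric motion matrix; a small sketch (how the blue camera positions are placed for display is a plotting choice and is not shown):

    % Viewing direction of each camera, from the metric motion matrix actualM
    % (rows f and N+f are the image-plane axes of frame f).
    N = size(actualM, 1) / 2;
    dirs = zeros(N, 3);
    for f = 1:N
        dirs(f, :) = cross(actualM(f, :), actualM(N + f, :));
        dirs(f, :) = dirs(f, :) / norm(dirs(f, :));   % unit-length direction
    end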