Project 5: Tracking and Structure from Motion

Lu Zeng (lz7) - Fall 2011

Dance involves both structure AND motion.

This project begins with a video -- a series of images. By tracking the movement of trackable feature in the video, we can obtain the three-dimensional structure of the object portrayed, in this case, the hotel, as well as the camera direction.

First, we select the features.

A good feature is distinctive: if a point is a feature, moving in either the x or y direction results in a large change in intensity; intuitively, this is a corner. Points near edges exhibit great change in one direction, but no change in the other.

Mathematically, this means that the autocorrelation matrix has two large positive eigenvalues. A flat, featureless space has small eigenvalues, and an edge would have one very small and one very large eigenvalue. We could find the two eigenvalues for the 2x2 matrix by solving the quadratic equation, but there is an alternate Harris measure (call it 'r'), which is simply the ratio between the determinant and trace.

One conceivable advantage to solving for the eigenvalues is to be able to choose points on the smaller eigenvalue instead of on largest 'r.' A modification to harris.m to return the smaller eigenvalue of the autocorrelation matrix at every point resulted in the points chosen on right. The original harris.m resulted in the points on the left.

Points in the set differences are highlighted in green or blue. There isn't too big of a difference, but threshholding on the smaller eigenvalue gets all the small round indents in the lower left.

Tracking

I ended up using the features found by choosing the 500 points with greatest Harris measure. Now we track their movement across the frams with the Kanade-Shi-Tomasi algorithm.

This algorithm makes three key assumptions.

A point is the same brightness in all frames.
Points make small movements.
Points move like their neighbors.

The derivation is shown on the assignment page, but a sketch is as follows. Let (u, v) be how much the pixel at (x, y) moves, from time t to t+1.

(1) gives that I(x, y, t) = I(x + u, y + v, t+1), since brightness is constant. Taylor expansion results in one equation, two unknowns, (u,v). In words, we end up saying that the intensity of point p at time t is equal to the opposite of the image gradient at that point, linearly combined and weighted by the amount of movement in each direction.
To add more constraints, we use (3), and replicate the equation for a small window of points around the initial point of interest.
Use the usual method of least squares, and we're back to a system of two equations, two unknowns. Invert and solve.
With (u, v) in hand for each (x', y') at time t, set (x, y) at time t+1 to be (x' + u, y' + v).

Of the 500 keypoints we began with, 473 persist. The ones that have fallen off are plotted here on the first frame in blue, along with the ordinary points that do not fall off, in red. In the other image is the paths of twenty random keypoints and their paths.

Structure from Motion

For this project, we assume that the camera's distance to the scene does not change, and that changes in depth of points in the scene is small compared to the distance to the camera. We take many sets of coordinates in 2-d and try to estimate the points they came from in 3-D and the affine parameters of the camera. The points' locations through all the frames make up the measurement matrix, M, which is a product of the camera directions, A, and the 3-d locations, X.

We center the measurements by subtracting the average at every point. This means we don't have to consider the camera's translation from frame to frame.

With this centered matrix, Tomasi and Kanade prove in their 1992 paper that the measurement matrix has rank at most 3. Using Singular Value Decomposition, we can factor D into three matrices, the middle of which has eigenvalues on the diagonal in descending order. If our data had no noise, the first three eigenvalues would be the only non-zero eigenvalues. With noise, we simply take the first three columns of the left matrix, first 3x3 principal of the center matrix, and the first three rows of the right matrix.

The product of these three matrices is now rank 3, at maximum. To split it into two matrices (our proto-, undisambiguated-AX), we multiply the 3 columns from the left matrix by the square root of the 3x3 principal, and the square root of the 3x3 principle by the 3 rows from the right matrix.

This AX is ambiguous: if C is any invertible 3x3 matrix, we can say I = C*inv(C); AX = AIX = AC*inv(C)X = AC * inv(C)X. We have to find the C that satisfies certain metric constraints -- C = QQ' is symmetric, meaning that we have only 6 unknowns to find with the least squares method, as before. Having found these six elements, we can construct the whole C, then use Cholesky decomposition to find Q. The results, sans affine ambiguity, are AQ and inv(Q)*X.

Here is the mesh of the structure, from several vantagepoints, with the texture from the original image mapped. The center one looks most like the orignal hotel: the windows are nearly in the right places.

One thing I did that requires noting is that I removed (manually) the points that began on the circular dents in the lower left quadrant. This resulted in a mesh that looked a lot more like a house.

Below are the camera orientations for each frame, plotted from three orientations.