CS143
Computer Vision
Project 5 - Tracking and Structure from Motion


Scott Newman (senewman)
December 13, 2011

Project Overview

The goal of this project was to deduce the 3D shape (structure) of an object from a series of observations, such as a video. This is a useful and important task in computer vision with many wide-reaching applications, including computer-guided surgical training, virtual tourism, urban planning, and criminology, to name a few. In this particular project, the main goal was to deduce structure from a series of images of a rotating building.

Algorithm

The algorithm to recover structure from motion consists of two distinct stages - the tracking stage and the structure-from-motion stage.

In the first stage, various keypoints of the object are detected in the video's first frame and then "tracked" as the sequence continues (i.e., as the building rotates). Keypoints are points that are easy to track and that capture the structure of the object in the image, as described in Shi and Tomasi (1994). So, the first stage consists of two steps:
  1. Keypoint selection
  2. Feature tracking
Corners in the image make for good keypoints, so a Harris corner detector was used to find them. Here is the first frame of the image series, overlaid with the 500 keypoints found by the Harris corner detector:


These keypoints are then tracked as the building rotates upwards.
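
As a rough sketch of how this keypoint-selection step might look in Matlab (assuming the Image Processing Toolbox's corner function; the frame file name and the count of 500 are just illustrative):

    % Sketch of keypoint selection on the first frame, assuming the Image
    % Processing Toolbox. The file name is hypothetical.
    first_frame = im2double(imread('frame01.jpg'));      % load first frame as grayscale double
    keypoints   = corner(first_frame, 'Harris', 500);    % N x 2 array of [x, y] corner locations

    figure; imshow(first_frame); hold on;
    plot(keypoints(:, 1), keypoints(:, 2), 'g.');        % overlay the detected corners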

A Kanade-Lucas-Tomasi (KLT) tracker was used to track these interest points. At each frame, what is known as the optical flow is computed, and the original corner points are moved along the flow. In short, the optical flow is a vector field representing the apparent motion of the objects in the scene. The KLT tracker computes it by making a few assumptions about the motion of the objects in the image series, such as brightness constancy and small relative motion.

Brightness constancy means that a pixel that has moved should have the same brightness as it did in the previous frame. Restated mathematically,

    I(x + u, y + v, t + 1) = I(x, y, t)

The goal, then, is to find u and v such that this equation holds true; u and v represent the motion of a point in the x-direction and y-direction, respectively. In reality, this "I" function is not available in closed form and is hard to work with directly, so we linearize it by taking the first-order Taylor approximation of I(x+u, y+v, t+1), giving us one equation in two unknowns (u and v):

    Ix*u + Iy*v + It = 0

where Ix, Iy, and It are the image derivatives with respect to x, y, and time.

How do we solve this equation? We make another assumption - namely, that pixels in a small neighborhood around the point we're trying to track all move with the same u and v (this goes hand in hand with the "small motion" assumption mentioned above). If an object is not moving quickly or in odd patterns, this assumption makes sense - broadly speaking, small "chunks" of an image move together. If we choose, say, a 15x15 neighborhood around the tracked point, we get an overconstrained system of linear equations that we can solve with least squares.

To visualize the tracking, 20 random initial keypoints were chosen and their tracked motion was plotted on the initial image and the final image (the green star is the starting position). You can see how these tracked points moved from the first frame to the last frame, and that the tracked corners in the last frame correspond to the same corners in the first frame:


One problem that arises is that tracked keypoints might move out of the frame (such as corner points on the bottom right of the building that rotate far downward). These are dropped once they leave the frame. The following image shows all the initial keypoints that moved out of the frame at some point in the sequence (the green star is the starting position):


When computing the optical flow and moving the corner points along the flow field, interpolation must be used; otherwise, an error occurs because point lookups happen at non-integer coordinates. Interpolation also gives more accurate results than mere rounding, which causes the tracked points to drift.
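
To make the tracking step concrete, here is a minimal Matlab sketch of moving the keypoints along a precomputed dense flow field (u, v) for one pair of frames; the function name, the interp2-based sampling, and the bookkeeping of dropped points are illustrative assumptions, not a transcription of the actual code:

    % track_step.m - sketch of one tracking step. pts is an N x 2 array of
    % [x, y] keypoint positions; u and v are the dense flow fields for the
    % current pair of frames (all names are illustrative).
    function [pts, inside] = track_step(pts, u, v)
        [h, w] = size(u);

        % Sample the flow at the (generally non-integer) keypoint positions;
        % interp2 does the bilinear interpolation described above.
        du = interp2(u, pts(:, 1), pts(:, 2));
        dv = interp2(v, pts(:, 1), pts(:, 2));

        % Move each keypoint along the flow.
        pts = pts + [du, dv];

        % Flag keypoints that have left the frame so they can be dropped.
        inside = pts(:, 1) >= 1 & pts(:, 1) <= w & ...
                 pts(:, 2) >= 1 & pts(:, 2) <= h;
    end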

After the corner keypoints have been selected and tracked through the series of images, structure from motion can be deduced. The paper "A Sequential Factorization Method for Recovering Shape and Motion from Image Streams" describes the algorithm in detail. The basic idea behind the structure-from-motion algorithm is to factor the tracked-points matrix W using a singular value decomposition, take an optimal rank-three approximation of W, and use least squares to eliminate the affine ambiguity and recover W's true factors - the shape and motion matrices, which give the shape of the building and the motion of the camera. This is exactly what we want. For more details, see the paper mentioned above.
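
To make the factorization concrete, here is a rough sketch of the rank-three SVD step in Matlab; the registration of W and the metric-upgrade step are only indicated in comments, so this is a sketch of the idea rather than the full algorithm from the paper:

    % Sketch of the rank-three factorization step. W is assumed to be the
    % 2F x P measurement matrix of tracked points (F frames, P points),
    % already registered by subtracting each row's mean.
    [U, D, V] = svd(W, 'econ');

    U3 = U(:, 1:3);                 % columns for the three largest singular values
    D3 = D(1:3, 1:3);
    V3 = V(:, 1:3);

    M = U3 * sqrt(D3);              % affine motion (camera) matrix, 2F x 3
    S = sqrt(D3) * V3';             % affine shape matrix, 3 x P

    % M and S are only determined up to an affine ambiguity: W ~ (M*Q)*(Q\S)
    % for any invertible 3x3 Q. The metric constraints from the paper lead to
    % a least-squares problem for Q; the final motion and shape are M*Q and Q\S.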

Implementation and Results

My implementation of the previously described algorithm is straightforward, though I did perform one particular optimization that made computing the optical flow very speedy. Applying least squares over each pixel's neighborhood gives the following system of equations (the sums are taken over the neighborhood):

    [ sum(Ix*Ix)   sum(Ix*Iy) ] [ u ]      [ sum(Ix*It) ]
    [ sum(Ix*Iy)   sum(Iy*Iy) ] [ v ]  =  -[ sum(Iy*It) ]

We must solve this equation at each pixel to find the optical flow (since each pixel has a different neighborhood). One way to do this is to use for loops to iterate over all the pixels in the image, but this is slow in Matlab. My optimization was to use Matlab's vectorized arithmetic operations (such as ".*") to compute the products Ix*Ix, Ix*Iy, etc. and solve the equation for all pixels at once instead of one at a time. If we multiply the right-hand side by the inverse of the leftmost matrix, we get the [u, v] vector we want:

    [ u ]      [ sum(Ix*Ix)   sum(Ix*Iy) ]^-1 [ sum(Ix*It) ]
    [ v ]  =  -[ sum(Ix*Iy)   sum(Iy*Iy) ]    [ sum(Iy*It) ]

Since Matlab doesn't support a vectorized inverse operation (or an inverse of block matrices), I had to calculate the inverse by hand. This is simple - there is a standard formula for a 2x2 matrix:

    [ a   b ]^-1       1      [  d   -b ]
    [ c   d ]     =  -------  [ -c    a ]
                     ad - bc

So, I just used Matlab's ".*" and "./" operations to apply this formula at each pixel in the image. Matlab's built-in matrix operations are MUCH faster than looping, so this made computing the optical flow very fast - it takes less than a second per frame.
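
As a condensed sketch of this vectorized computation (assuming grayscale double images im1 and im2; the variable names and the use of conv2 for the neighborhood sums are illustrative assumptions):

    % Sketch of the vectorized flow computation for one pair of frames.
    win = ones(30);                          % 30x30 summation neighborhood

    [Ix, Iy] = gradient(im1);                % spatial derivatives
    It = im2 - im1;                          % temporal derivative

    % Per-pixel entries of the 2x2 system, summed over every neighborhood at once.
    A = conv2(Ix .* Ix, win, 'same');
    B = conv2(Ix .* Iy, win, 'same');
    C = conv2(Iy .* Iy, win, 'same');
    P = conv2(Ix .* It, win, 'same');
    Q = conv2(Iy .* It, win, 'same');

    % Apply the 2x2 inverse formula at every pixel with elementwise operations.
    detA = A .* C - B .* B;
    u = (-C .* P + B .* Q) ./ detA;
    v = ( B .* P - A .* Q) ./ detA;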

Besides the flow optimization, the rest of my implementation just directly follows the papers and doesn't do anything particularly special. I noticed that the initial keypoint selection detected some very odd corner points to the left of the building that seemed to just float in space, and these gave the resulting structure of the building a weird jutting corner that wasn't actually part of the building at all. Since these points were all to the left of the building, I corrected for this by merely pruning those points before tracking them over the image sequence.
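
A minimal sketch of this pruning, assuming the keypoints are stored as an N x 2 [x, y] array and x_min is a hand-picked cutoff column (a hypothetical name):

    % Keep only keypoints at or to the right of the cutoff column x_min.
    keypoints = keypoints(keypoints(:, 1) >= x_min, :);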

Neighborhood size slightly influenced the optical flow. If it's too small, you don't have enough local information; if it's too large, the assumption that a neighborhood moves together becomes invalid, since the entire image doesn't move in exactly the same way (only small neighborhoods do). A neighborhood size of 30x30 gave good results.

Here is my resulting wireframe structure of the building, from three different viewpoints (calculated using 500 initial keypoints and a neighborhood size of 30x30 for calculating flow):



Note the right angle of the building's adjoining walls. The red lines represent the camera motion. Here are graphs of the normalized camera motion in each direction (x, y, and z) for each frame:



Thanks for reading!