In this project, we take a sequence of images from a video shot with an orthographic camera and learn the structure of objects in the scene purely from the estimated position shifts of a specific set of "keypoints" in the images. These keypoints are tracked according to the model and algorithm proposed by Kanade-Lucas-Tomasi (KLT), using optical flow. The three big assumptions of the model are (1) brightness constancy (i.e., points in space roughly keep the same intensities over time), (2) small frame-to-frame motion, and (3) spatial coherence (nearby pixels move similarly). Working through the math, brightness constancy, combined with spatial coherence over a patch, gives the following linear system for each tracked point in the older image of a frame pair:

$$
\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
= -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix},
$$

where u and v respectively represent the horizontal and vertical optical flow of a pixel, I_x and I_y are the horizontal and vertical image gradients, and I_t is the element-wise difference of the new frame minus the old frame. The sums are taken over all the pixels in a patch surrounding the point. All that's left is solving this 2x2 system for u and v. Once we have the per-pixel optical flow, we track a point through the flow by looking up its displacement in the flow field with some type of interpolation (linear works well given the assumption of spatial coherence) and adding that to its position to get its position in the next frame. We also have to discard any keypoints that move out of the image frame.
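To make the tracking step concrete, here is a minimal numpy/scipy sketch of both halves: the per-pixel solve of the 2x2 system above (with the patch sums computed by a box filter), and advancing keypoints through the resulting flow field with bilinear interpolation. Function names and the window size are illustrative, not taken from the actual project code.

```python
import numpy as np
from scipy.ndimage import uniform_filter, map_coordinates

def dense_flow(Ix, Iy, It, win=15):
    """Per-pixel Lucas-Kanade: solve the 2x2 system above at every pixel,
    with the patch sums computed by a box filter of width `win`."""
    sxx = uniform_filter(Ix * Ix, win)
    sxy = uniform_filter(Ix * Iy, win)
    syy = uniform_filter(Iy * Iy, win)
    sxt = uniform_filter(Ix * It, win)
    syt = uniform_filter(Iy * It, win)
    det = sxx * syy - sxy * sxy
    det[det == 0] = np.finfo(float).eps  # guard against flat, gradient-free patches
    # Closed-form inverse of the 2x2 normal-equation matrix.
    u = (-syy * sxt + sxy * syt) / det
    v = ( sxy * sxt - sxx * syt) / det
    return u, v

def advance_keypoints(u, v, pts):
    """Move (x, y) keypoints one frame through the flow field and drop
    any that leave the image bounds."""
    xs, ys = pts[:, 0], pts[:, 1]
    # order=1 is bilinear interpolation; spatial coherence makes this reasonable.
    new_x = xs + map_coordinates(u, [ys, xs], order=1)
    new_y = ys + map_coordinates(v, [ys, xs], order=1)
    h, w = u.shape
    inside = (new_x >= 0) & (new_x <= w - 1) & (new_y >= 0) & (new_y <= h - 1)
    return np.stack([new_x, new_y], axis=1)[inside]
```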
Once we've determined the locations of the keypoints kept throughout the whole stream, we can proceed to recovering the 3D points of the scene, and thus the structure of the whole scene. I used the paper Shape and Motion from Image Streams under Orthography: a Factorization Method (specifically section 3) for reference. The paper was very helpful overall, although I had to expand the metric constraints given in equation 16 of section 3.3 to match the constraints used for computing the lower triangular matrices forming the 3x3 noisy-to-true transformations for the rotation and point-location matrices.
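As a sketch of how section 3 of the paper translates to code (my own reading, with hypothetical names, and assuming the measurement matrix stacks the x rows of all frames above the y rows): register the measurement matrix, take the rank-3 SVD, solve the expanded metric constraints for a symmetric L = QQ^T, then Cholesky-factor it into the lower triangular Q.

```python
import numpy as np

def factorize(W):
    """Rough sketch of the Tomasi-Kanade factorization (section 3).

    W : (2F, P) measurement matrix, assumed to stack the x-coordinates of
        P keypoints over F frames on top of the y-coordinates.
    Returns the camera rows R (2F, 3) and the 3D points S (3, P).
    """
    # Register: subtract each frame's centroid so translation drops out.
    W = W - W.mean(axis=1, keepdims=True)
    # Rank-3 factorization via SVD into noisy rotation/shape estimates.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    R_hat = U[:, :3] * np.sqrt(s[:3])
    S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]
    # Metric constraints (eq. 16, expanded): find symmetric L = Q Q^T with
    # i_f L i_f^T = 1, j_f L j_f^T = 1, i_f L j_f^T = 0 for every frame f.
    def coeffs(a, b):
        # Coefficients of the 6 unique entries of symmetric L in a L b^T.
        return [a[0]*b[0], a[0]*b[1] + a[1]*b[0], a[0]*b[2] + a[2]*b[0],
                a[1]*b[1], a[1]*b[2] + a[2]*b[1], a[2]*b[2]]
    F = W.shape[0] // 2
    A, rhs = [], []
    for f in range(F):
        i_f, j_f = R_hat[f], R_hat[F + f]
        A += [coeffs(i_f, i_f), coeffs(j_f, j_f), coeffs(i_f, j_f)]
        rhs += [1.0, 1.0, 0.0]
    l = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    # Cholesky gives the lower-triangular noisy-to-true transform Q; this
    # assumes the recovered L is positive definite (true for clean data).
    Q = np.linalg.cholesky(L)
    return R_hat @ Q, np.linalg.inv(Q) @ S_hat
```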
Basic Results
We used a synthetic image sequence of a hotel for testing. Here are the keypoints, taken from the top 500 most "corner-y" points returned by the Harris corner detector (left: initial keypoints; right: initial keypoints dropped because they exited the image; bottom: trajectories of 20 keypoints painted onto the first image):
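For reference, a minimal sketch of that kind of Harris selection (the parameters here are illustrative guesses, and a real pipeline would usually add non-maximum suppression so the chosen corners don't cluster):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def top_harris_corners(img, n=500, k=0.04, sigma=1.5):
    """Return the n highest-scoring (x, y) pixels by Harris response."""
    Ix = sobel(img, axis=1)  # horizontal gradient
    Iy = sobel(img, axis=0)  # vertical gradient
    # Smoothed second-moment matrix entries.
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    # Harris "corner-iness": det(M) - k * trace(M)^2
    R = Sxx * Syy - Sxy**2 - k * (Sxx + Syy)**2
    ys, xs = np.unravel_index(np.argsort(R, axis=None)[-n:], R.shape)
    return np.stack([xs, ys], axis=1)
```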
All of these points seem reasonable, mostly sticking to the corners of the house parts, plus a few points on those bumps to the left (more on those later). And here are some views of the 3D reconstructions of the house:
Overall, they are reasonably accurate, capturing the straight lines and 90-degree angles that you would expect from almost any man-made building. In addition, the keypoints seem to be giving a rotational change that visually matches the frame sequence (specifically, counter-clockwise and upwards). You might notice, though, one small problem: the reconstruction includes the black space from the bumps to the side, which isn't very useful and in fact comes from noisy drifting points (see the drift correction section for how I fixed this issue).
Drift Correction
It's hard to show, but some of the tracked points (specifically, the bumps towards the bottom left) lose their footing and don't stay on the same spot, potentially weakening our structure-from-motion system. One quick and easy way to catch this is to compare the patch currently surrounding each tracked point to the patch surrounding its original location. Ideally, we would want a comparison more invariant to rotation or translation (like SIFT descriptors), but I just used an L1 norm on the pixel-by-pixel difference, and it worked out pretty well in this case. One potential cost is that we might lose some good points along with the bad, but losing a small number (as below) shouldn't be a problem.
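A rough sketch of that check, assuming grayscale frames and floating-point keypoint positions (the threshold is a made-up number to tune against your intensity range):

```python
import numpy as np

def drift_mask(first_frame, cur_frame, orig_pts, cur_pts, half_win=7, thresh=10.0):
    """Flag keypoints whose current patch has drifted too far, in mean L1
    distance, from the patch around their original location."""
    keep = np.ones(len(cur_pts), dtype=bool)
    for i, ((x0, y0), (x1, y1)) in enumerate(zip(orig_pts, cur_pts)):
        p0 = first_frame[int(y0)-half_win:int(y0)+half_win+1,
                         int(x0)-half_win:int(x0)+half_win+1]
        p1 = cur_frame[int(y1)-half_win:int(y1)+half_win+1,
                       int(x1)-half_win:int(x1)+half_win+1]
        # Cast to float so uint8 subtraction can't wrap around.
        drifted = (p0.shape != p1.shape or
                   np.abs(p0.astype(float) - p1.astype(float)).mean() > thresh)
        if drifted:
            keep[i] = False  # patch ran off the edge or lost its footing
    return keep
```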
Below: (left) final points without drift fixing, (right) final points with drift fixing.
And here are some reconstruction shots with those points removed; the overall shape (angles and lines) is maintained. Although the world appears to be flipped upside down, this isn't actually a problem: the factorization only recovers structure up to a rotation (and reflection) of world coordinates, so everything is still correct relative to everything else. As a result, we have a much trimmer version of our 3D house, avoiding the unnecessary area to the left of the house contributed by the bump points:
Iterative Refinement
One possible way to improve tracking at a given frame is to update the values of our displacements (u and v) several times until the difference between iterations is small. It didn't seem to make too big a difference, probably because the tracker was already doing a pretty good job on this synthetic hotel sequence. One positive thing to note is that it seemed to remove the contribution of some of the aforementioned drifting points while maintaining the overall structure of the house (angles, etc.).
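In code, the refinement loop for one keypoint might look like the following sketch: keep the gradient matrix from the original patch, re-sample the new frame at the current estimate each iteration, and stop once the update to (u, v) falls below a tolerance. All names and constants here are illustrative, not from the original implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def refine_flow(I_old, I_new, Ix, Iy, x, y, half_win=7, tol=1e-2, max_iter=10):
    """Iteratively re-estimate (u, v) for one keypoint at (x, y).

    Ix, Iy are the spatial gradients of the old frame; the gradient matrix
    from the original patch is reused every iteration as an approximation.
    """
    dy, dx = np.mgrid[-half_win:half_win + 1, -half_win:half_win + 1]
    ix = map_coordinates(Ix, [y + dy, x + dx], order=1).ravel()
    iy = map_coordinates(Iy, [y + dy, x + dx], order=1).ravel()
    old_patch = map_coordinates(I_old, [y + dy, x + dx], order=1).ravel()
    A = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    u = v = 0.0
    for _ in range(max_iter):
        # Re-sample the new frame at the current displacement estimate.
        new_patch = map_coordinates(I_new, [y + v + dy, x + u + dx], order=1).ravel()
        it = new_patch - old_patch
        du, dv = np.linalg.solve(A, -np.array([ix @ it, iy @ it]))
        u, v = u + du, v + dv
        if abs(du) < tol and abs(dv) < tol:  # update is small: converged
            break
    return u, v
```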