Tracking and Structure from Motion
by Eli Bosworth (eboswort)

My first step was to track points across the frames of the video. I used a Harris corner detector to choose a group of points to track. The detector returned a huge number of candidates; I kept the 400 strongest, because that seemed like plenty to work with. This is an example of what my 400 starting points looked like:

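In code, the corner selection looks roughly like this. This is a minimal NumPy/SciPy sketch, not my exact implementation: the function name, the Harris constant k, and the non-maximum-suppression window are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def harris_corners(img, n_points=400, k=0.04, sigma=1.5):
    """Pick the n_points strongest Harris corner responses in a grayscale image."""
    # Image gradients via Sobel filters.
    Ix = ndimage.sobel(img, axis=1)
    Iy = ndimage.sobel(img, axis=0)
    # Smooth the products of gradients to form the structure tensor at each pixel.
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    # Harris response: det(M) - k * trace(M)^2.
    R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
    # Keep only local maxima so the chosen points spread out instead of clumping.
    maxima = (R == ndimage.maximum_filter(R, size=7))
    R = np.where(maxima, R, -np.inf)
    ys, xs = np.unravel_index(np.argsort(R, axis=None)[::-1][:n_points], R.shape)
    return np.stack([xs, ys], axis=1).astype(float)  # (n_points, 2) as (x, y)
```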
Next I used a Kanade-Lucas-Tomasi (KLT) tracker to follow these keypoints over the 51 frames of the video. The algorithm calculates optical flow using three gradients and the assumption that things have only moved small amounts between frames. Here is a visualization of the three gradients; from left to right they are the x-gradient, the y-gradient, and the temporal gradient:



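Computing the three gradients takes only a few lines. A sketch, assuming Sobel filters for the spatial derivatives (simple finite differences work too):

```python
import numpy as np
from scipy import ndimage

def klt_gradients(frame0, frame1):
    """Spatial and temporal gradients used by the Lucas-Kanade step.

    frame0, frame1: consecutive grayscale frames as float arrays.
    """
    # x- and y-gradients of the first frame; dividing by 8 normalizes the
    # Sobel kernel so the values approximate a per-pixel derivative.
    Ix = ndimage.sobel(frame0, axis=1) / 8.0
    Iy = ndimage.sobel(frame0, axis=0) / 8.0
    # Temporal gradient: how much each pixel changed between the two frames.
    It = frame1 - frame0
    return Ix, Iy, It
```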
I use some nifty linear algebra to get from these three gradients to the optical flow of the image, but the basic idea is that I observe the temporal change at a pixel and guess which direction it moved based on the spatial gradients at that pixel. Here is an example of what the resulting optical flow looks like, separated into the x direction on the left and the y direction on the right:


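The linear algebra boils down to solving a 2x2 least-squares system at every pixel, built by summing the gradient products over a small window. A minimal sketch of that step, taking the three gradients from above (the window size is an illustrative choice):

```python
import numpy as np
from scipy import ndimage

def lucas_kanade_flow(Ix, Iy, It, window=15):
    """Dense Lucas-Kanade optical flow from the three gradients.

    At every pixel, solves the 2x2 least-squares system built from
    Ix*u + Iy*v + It = 0, summed over a window around that pixel.
    """
    def wsum(a):  # sum over the window, implemented with a uniform filter
        return ndimage.uniform_filter(a, size=window) * window * window

    # Entries of the normal equations at each pixel.
    Axx, Axy, Ayy = wsum(Ix * Ix), wsum(Ix * Iy), wsum(Iy * Iy)
    bx, by = wsum(Ix * It), wsum(Iy * It)

    # Closed-form 2x2 solve; the guard handles textureless pixels (det ~ 0).
    det = Axx * Ayy - Axy ** 2
    det = np.where(np.abs(det) < 1e-6, np.inf, det)
    u = (-Ayy * bx + Axy * by) / det
    v = (Axy * bx - Axx * by) / det
    return u, v  # x- and y-components of the flow at every pixel
```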
At each frame I use this optical flow to move the 400 points that I originally selected. This way I can track individual points as they move throughout the movie. If any of the points I'm tracking get too close to the edge of the screen, I just throw them out, since I have plenty and it's difficult to deal with points whose positions are only known for some of the frames. Below is a picture showing 20 points and how they were tracked across the 51 frames, as well as a picture showing the points that move out of the frame and are dropped at some point in the video.


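The per-frame update is simple: look up the flow at each point's current position, move the point, and flag anything near the border. A sketch, assuming nearest-pixel lookup of the flow (bilinear interpolation would be slightly more accurate):

```python
import numpy as np

def advect_points(points, u, v, margin=2):
    """Move tracked points by the flow at their (rounded) positions and flag
    any point that drifts within `margin` pixels of the image border."""
    h, w = u.shape
    xs = np.clip(np.round(points[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(points[:, 1]).astype(int), 0, h - 1)
    moved = points + np.stack([u[ys, xs], v[ys, xs]], axis=1)
    inside = ((moved[:, 0] >= margin) & (moved[:, 0] < w - margin) &
              (moved[:, 1] >= margin) & (moved[:, 1] < h - margin))
    return moved, inside  # the caller drops points where `inside` is False
```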
Now that I have the 2D positions of a group of points in all the frames of the video, I can use that information to figure out the 3D locations of those points, as well as some information about the cameras. I start by subtracting the centroid of the points in each frame from the positions of the points in that frame. Then I stack all those centered points into a big measurement matrix and factorize it. From this I can recover a camera matrix A and a points matrix. If I turn the resulting points into a 3D shape I get a disappointing result:

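The factorization step could look like this. It's a sketch of the standard Tomasi-Kanade rank-3 factorization; the (F, P, 2) layout of the tracks is an assumption:

```python
import numpy as np

def factorize(tracks):
    """Affine structure from motion by rank-3 factorization.

    tracks: array of shape (F, P, 2) with the (x, y) of P points in F frames.
    Returns the 2F x 3 camera matrix A and the 3 x P points matrix X,
    both still subject to an affine ambiguity.
    """
    F, P, _ = tracks.shape
    # Subtract each frame's centroid so the world origin is the point centroid.
    centered = tracks - tracks.mean(axis=1, keepdims=True)
    # Stack into the 2F x P measurement matrix: x-rows on top, y-rows below
    # (any consistent ordering works).
    W = np.concatenate([centered[:, :, 0], centered[:, :, 1]], axis=0)
    # Rank-3 factorization via SVD; split the singular values between factors.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S3 = np.sqrt(np.diag(S[:3]))
    A = U[:, :3] @ sqrt_S3      # stacked camera axes
    X = sqrt_S3 @ Vt[:3, :]     # 3D points, up to an affine transform
    return A, X
```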
This is because I haven't gotten rid of the affine ambiguity. To fix this I need to find a linear transformation of A under which each camera's axes are perpendicular and of equal length. I apply this transformation to A and its inverse to the points matrix. This makes the points much more reasonable. The point cloud looks like this:



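A sketch of that metric upgrade, following the standard Tomasi-Kanade approach: solve a small least-squares problem for a symmetric matrix L = QQ^T that makes each frame's two axes unit-length and perpendicular, then recover Q by Cholesky decomposition. This assumes the recovered L comes out positive definite, which can fail on noisy tracks:

```python
import numpy as np

def metric_upgrade(A, X):
    """Resolve the affine ambiguity via the metric constraints.

    Finds Q so that in A @ Q each frame's two camera axes are perpendicular
    and unit-length, then applies Q to A and Q^-1 to X.
    """
    F = A.shape[0] // 2
    xs, ys = A[:F], A[F:]  # per-frame x- and y-axis rows (matches the stacking above)

    def row(a, b):
        # Coefficients of a^T L b in the 6 unknowns of symmetric 3x3 L.
        return np.array([a[0]*b[0], a[1]*b[1], a[2]*b[2],
                         a[0]*b[1] + a[1]*b[0],
                         a[0]*b[2] + a[2]*b[0],
                         a[1]*b[2] + a[2]*b[1]])

    M, rhs = [], []
    for i in range(F):
        M += [row(xs[i], xs[i]), row(ys[i], ys[i]), row(xs[i], ys[i])]
        rhs += [1.0, 1.0, 0.0]   # unit-length axes, zero dot product
    l = np.linalg.lstsq(np.array(M), np.array(rhs), rcond=None)[0]
    L = np.array([[l[0], l[3], l[4]],
                  [l[3], l[1], l[5]],
                  [l[4], l[5], l[2]]])
    Q = np.linalg.cholesky(L)    # L = Q Q^T; assumes L is positive definite
    return A @ Q, np.linalg.inv(Q) @ X
```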
Here you can see the point cloud turned into a 3D shape with the first frame mapped onto it. The red lines show where the camera was pointing in the various frames of the video.

I also experimented with one of the extra credit options: attempting to improve KLT tracking with an image pyramid. The idea is to help the KLT algorithm deal with movements larger than just one or two pixels. I took every third frame of the hotel video, to create larger differences between frames, and ran my normal KLT algorithm on it. Here's a video of my results without an image pyramid. I then added code to run the KLT algorithm on smaller versions of the frames and incorporate those results into the final optical flow. Here are my results with a 3-level image pyramid. With the pyramid the point tracking goes a little better: if you look at the points on the chimney you can see that with the pyramid they are tracked slightly more accurately. While these results are certainly not impressive, I think they show the potential of an image pyramid to increase KLT's accuracy on big movements if the pyramid were used in a more sophisticated way.
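For reference, here is a sketch of one standard way to combine pyramid levels (coarse-to-fine with warping), reusing the klt_gradients and lucas_kanade_flow functions sketched above. This is the conventional pyramidal-LK scheme rather than my exact combination of levels, and the level count and smoothing are illustrative:

```python
import numpy as np
from scipy import ndimage

def pyramid_flow(frame0, frame1, levels=3, window=15):
    """Coarse-to-fine Lucas-Kanade: estimate flow on downsampled frames first,
    then refine at each finer level after warping frame1 by the coarse flow."""
    # Build Gaussian pyramids, coarsest level last; the blur reduces aliasing.
    p0, p1 = [frame0], [frame1]
    for _ in range(levels - 1):
        p0.append(ndimage.zoom(ndimage.gaussian_filter(p0[-1], 1.0), 0.5))
        p1.append(ndimage.zoom(ndimage.gaussian_filter(p1[-1], 1.0), 0.5))

    u = np.zeros_like(p0[-1])
    v = np.zeros_like(p0[-1])
    for f0, f1 in zip(reversed(p0), reversed(p1)):
        # Upsample the coarser flow to this level and double its magnitude.
        if u.shape != f0.shape:
            u = 2.0 * ndimage.zoom(u, np.array(f0.shape) / np.array(u.shape))
            v = 2.0 * ndimage.zoom(v, np.array(f0.shape) / np.array(v.shape))
        # Warp frame1 back by the current flow, then solve for the residual flow.
        yy, xx = np.mgrid[:f0.shape[0], :f0.shape[1]].astype(float)
        warped = ndimage.map_coordinates(f1, [yy + v, xx + u],
                                         order=1, mode='nearest')
        Ix, Iy, It = klt_gradients(f0, warped)
        du, dv = lucas_kanade_flow(Ix, Iy, It, window)
        u, v = u + du, v + dv
    return u, v  # full-resolution flow accumulated across levels
```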