Project 5: Tracking and Structure from Motion

CS 143: Introduction to Computer Vision

Li Sun(li)

Overview

In this project, we reconstruct the 3D shape of an object from a series of observations (a video).

The basic flow of this project is as follows:

Detect keypoints in the first image
Track the keypoints across frames to obtain motion information
Reconstruct the 3D structure from the motion

Keypoint Detection and Selection

We detect keypoints with a Harris corner detector, and select a set of points with relatively high response.
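As a rough illustration of the detection step, the sketch below computes a Harris response map using NumPy only. The function names, the Gaussian window (sigma=1.5) and the constant k=0.04 are assumptions chosen for illustration; they are not the settings used in this project.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1D Gaussian kernel truncated at about 3 sigma and normalised to sum to 1.
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(img, sigma):
    # Separable Gaussian smoothing: filter every row, then every column.
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def harris_response(img, sigma=1.5, k=0.04):
    # sigma and k are illustrative defaults (assumptions), not project settings.
    Iy, Ix = np.gradient(img.astype(float))   # derivatives along rows (y) and columns (x)
    Sxx = smooth(Ix * Ix, sigma)              # entries of the windowed structure tensor
    Syy = smooth(Iy * Iy, sigma)
    Sxy = smooth(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy**2
    trace = Sxx + Syy
    return det - k * trace**2                 # positive at corners, negative on edges
```

On a synthetic image containing a bright square, this response is positive at the square's corners, negative along its edges, and zero in flat regions.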

There are two strategies for selecting keypoints. One is to select a fixed number of keypoints with the highest response (quota strategy); the other is to select every keypoint whose response is higher than a threshold (thresholding strategy). The latter appears to be preferable.

In real situations, we want to select keypoints with a larger response so that they provide more information for recovering the 3D structure. However, the number of useful points varies with the properties of the object, such as its geometric complexity and size, so it is hard to decide in advance how many keypoints the algorithm should keep. By setting a threshold instead, we can easily keep all points whose response is strong enough.
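The two strategies can be sketched as follows. This is a minimal illustration, assuming the response map is a 2D NumPy array that already contains only local maxima (non-maximum suppression is not shown); the function names are hypothetical.

```python
import numpy as np

def select_by_quota(response, quota):
    # Quota strategy: keep the `quota` pixels with the highest response.
    order = np.argsort(response, axis=None)[::-1][:quota]
    ys, xs = np.unravel_index(order, response.shape)
    return np.column_stack([xs, ys])          # one (x, y) row per keypoint

def select_by_threshold(response, threshold):
    # Thresholding strategy: keep every pixel whose response exceeds `threshold`.
    ys, xs = np.nonzero(response > threshold)
    return np.column_stack([xs, ys])
```

With the quota strategy the number of keypoints is fixed and the weakest accepted response depends on the image; with the thresholding strategy the weakest accepted response is fixed and the number of keypoints depends on the image.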

The following two images show keypoints selected by quota strategy and thresholding strategy.


Fig1: Keypoints selected by quota strategy (quota=500)	Fig2: Keypoints selected by thresholding strategy (threshold=0.0001)

In both figures above, the weakest selected keypoints have a response of around 10^-4, but the keypoints selected in Fig2 are more reasonable than those in Fig1. First, the small objects in the bottom-left corner of the image should intuitively be either all important or all unimportant, and the points selected in Fig2 match this intuition better. Second, more keypoints on the top of the house are selected in Fig2 than in Fig1, which makes reconstructing the top easier.

Feature Tracking

In this stage, we track all selected keypoints with a Kanade-Lucas-Tomasi (KLT) tracker. This essentially involves calculating the optical flow between successive video frames and moving the selected keypoints along the flow field.
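To make the optical flow computation concrete, here is a minimal sketch of a single Lucas-Kanade step for one keypoint. It assumes grayscale frames stored as 2D NumPy arrays, an integer keypoint position, and a small displacement; the function name and the window size are illustrative assumptions. A full tracker would typically iterate this step and interpolate the images at sub-pixel positions.

```python
import numpy as np

def lucas_kanade_step(img0, img1, x, y, half=7):
    """Estimate the displacement (u, v) of the window centred on pixel (x, y)."""
    img0 = img0.astype(float)
    img1 = img1.astype(float)
    Iy, Ix = np.gradient(img0)                # spatial derivatives of the first frame
    It = img1 - img0                          # temporal derivative between the frames
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    # Least-squares solution of Ix*u + Iy*v = -It over the window (normal equations).
    A = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(A, b)              # raises LinAlgError if the window has no texture
```

The 2x2 matrix A is the same structure tensor that the Harris detector examines, which is why corners with a strong response are good points to track.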

The following two images show the 2D paths of 20 randomly selected keypoints, drawn on the first frame and on the last frame. The initial position of each keypoint is marked with a circle and the final position with a cross; the keypoint's actual position in each image is shown in red.


Fig3: first frame	Fig4: last frame

We can see that all keypoints, except for those that disappeared during the movement, are tracked quite accurately.

The following image shows the keypoints that moved out of the frame during the sequence. These points are shown in red, and their paths before disappearing are also plotted.

Fig 5: Discarded keypoints and their paths (in red)
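Discarding such points can be sketched as a simple bounds check on the tracks. The layout below (two arrays of shape frames x points holding the x and y coordinates) and the function name are assumptions made for illustration.

```python
import numpy as np

def in_frame_mask(xs, ys, width, height):
    """Return a boolean mask of length P that is True for points inside every frame.

    xs, ys: arrays of shape (F, P) with the tracked coordinates of P points over F frames.
    """
    inside = (xs >= 0) & (xs <= width - 1) & (ys >= 0) & (ys <= height - 1)
    return inside.all(axis=0)                 # a point is kept only if it never leaves

# Example: the second point drifts past the right border in the last frame.
xs = np.array([[10.0, 630.0], [12.0, 636.0], [14.0, 642.0]])
ys = np.array([[20.0, 50.0], [20.5, 50.0], [21.0, 50.0]])
keep = in_frame_mask(xs, ys, width=640, height=480)
xs_kept, ys_kept = xs[:, keep], ys[:, keep]   # only complete tracks remain
```

Keeping only complete tracks matters for the reconstruction stage, since the factorization method expects every point to be observed in every frame.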

Structure from Motion

In this stage, we reconstruct the 3D structure from the keypoints tracked in each video frame. Using the method introduced in "Shape and Motion from Image Streams under Orthography: a Factorization Method" (Tomasi and Kanade 1992), we obtain the following 3D structure and the camera path during the video sequence.
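The core of the factorization method can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the exact code used for the results here: the measurement matrix W is assumed to have shape 2F x P with the x coordinates of all frames in the first F rows and the y coordinates in the last F rows, every point is assumed visible in every frame, and the metric constraints are solved by plain least squares. The function name is hypothetical.

```python
import numpy as np

def factorize(W):
    """Factor a (2F, P) measurement matrix into motion M (2F, 3) and shape S (3, P)."""
    F = W.shape[0] // 2
    t = W.mean(axis=1, keepdims=True)         # per-row centroid (image translation)
    U, s, Vt = np.linalg.svd(W - t, full_matrices=False)
    M_hat = U[:, :3] * np.sqrt(s[:3])         # rank-3 affine motion
    S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]  # rank-3 affine shape

    def coeffs(a, b):
        # Coefficients of a^T L b in the 6 unknowns of the symmetric 3x3 matrix L.
        return [a[0] * b[0], a[0] * b[1] + a[1] * b[0], a[0] * b[2] + a[2] * b[0],
                a[1] * b[1], a[1] * b[2] + a[2] * b[1], a[2] * b[2]]

    # Metric constraints: each frame's two camera axes are unit length and orthogonal.
    A, rhs = [], []
    for f in range(F):
        i_f, j_f = M_hat[f], M_hat[F + f]
        A += [coeffs(i_f, i_f), coeffs(j_f, j_f), coeffs(i_f, j_f)]
        rhs += [1.0, 1.0, 0.0]
    l = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    Q = np.linalg.cholesky(L)                 # L = Q Q^T; fails if L is not positive definite
    return M_hat @ Q, np.linalg.inv(Q) @ S_hat, t
```

The result is determined only up to a rotation or reflection of the coordinate frame. With noisy tracks the estimated L may not be positive definite, in which case the Cholesky step fails and a different way of recovering Q is needed.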

Fig 6: Reconstructed 3D structure

Even though the reconstruction is not very accurate, we can see the shape of a house in this 3D structure. Here are some other views of the 3D structure from different angles.

Fig 7: Side of the house

Fig 8: Front of the house

In the front view of the house, the dark region on the right is the background. It is recovered as well because some keypoints were detected on the background. We can't avoid this problem with our baseline method because the algorithm isn't able to tell the background from the foreground.

One may notice that the 3D structure is mirrored left to right relative to the original picture. This is only a visualization issue and could be fixed by modifying the visualization method.
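One possible fix at display time is to negate one coordinate of the point cloud before plotting. The sketch below assumes the shape is stored as a 3 x P array whose first row is the left-right axis; both the layout and the function name are assumptions.

```python
import numpy as np

def mirror_left_right(S):
    """Mirror a (3, P) point cloud by negating its first (assumed left-right) coordinate."""
    flip = np.diag([-1.0, 1.0, 1.0])
    return flip @ S
```

Mirroring leaves all distances between points unchanged, so the reconstructed shape itself is not altered, only its handedness.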