Project 5: Tracking and Structure from Motion

CS 143: Introduction to Computer Vision

Hang Su


Dec. 9th, 2011











Figure 1: 4 frames from the image sequence


Figure 2: Reconstructed 3D structure (mirror image; see README for the correct solution)

0 Overview

The goal of this project is to reliably track keypoints through an image sequence and recover the corresponding 3D structure. The pipeline consists of three major components: keypoint selection, feature tracking, and structure from motion.

1 Keypoint Selection

Keypoints are selected using Harris corner detection. Harris corners are well suited to the KLT tracker used in this project because both require that the second-moment matrix of local image gradients have two large eigenvalues, which indicates significant appearance change in perpendicular directions. I choose the 500 strongest Harris corners as keypoints; Figure 3 shows them overlaid on the first frame. Other choices of keypoints are tried and compared in section 4.

Figure 3: 500 Harris corners overlaid on the first frame (sizes indicate corner strength)
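As a rough illustration, the selection step can be sketched as below. This is a minimal NumPy version, not the exact implementation used here: the window size, the constant k, and the absence of non-maximum suppression are all simplifying assumptions.

```python
import numpy as np

def harris_corners(img, k=0.04, win=2):
    """Score each pixel with the Harris response R = det(M) - k*trace(M)^2,
    where M is the second-moment matrix of local gradients."""
    Iy, Ix = np.gradient(img.astype(float))   # central-difference gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    H, W = img.shape
    R = np.zeros((H, W))
    for y in range(win, H - win):
        for x in range(win, W - win):
            # Sum gradient products over a (2*win+1)^2 window.
            sxx = Ixx[y - win:y + win + 1, x - win:x + win + 1].sum()
            syy = Iyy[y - win:y + win + 1, x - win:x + win + 1].sum()
            sxy = Ixy[y - win:y + win + 1, x - win:x + win + 1].sum()
            R[y, x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return R

def top_keypoints(R, n=500):
    """Return the (y, x) coordinates of the n strongest responses."""
    idx = np.argsort(R, axis=None)[::-1][:n]
    return np.column_stack(np.unravel_index(idx, R.shape))
```

On a synthetic image, the response is positive at corners, negative on edges, and zero in flat regions, which is exactly the behavior the tracker relies on.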

2 Feature Tracking

I implemented a Kanade-Lucas-Tomasi (KLT) tracker for the keypoints detected in the last step. The KLT tracker makes point tracking a solvable problem via three assumptions: brightness constancy, small motion, and spatial coherence. Using the KLT tracker to find corresponding pixels between successive frames, we obtain the optical flow for each frame (Figures 4 and 5).

Figure 4: Optical flow visualization

Figure 5: Motion magnitudes
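The core of one tracking step is solving a 2x2 linear system built from image gradients over a window. A minimal sketch follows, assuming grayscale float images and a window safely inside the image; the actual implementation also handles boundaries and sub-pixel positions.

```python
import numpy as np

def lk_step(I0, I1, y, x, win=5):
    """One Lucas-Kanade step at integer position (y, x): under brightness
    constancy and small motion, solve A d = b, where A is the second-moment
    matrix of spatial gradients and b collects gradient-times-temporal-
    difference terms over the window."""
    Iy, Ix = np.gradient(I0.astype(float))
    It = I1.astype(float) - I0.astype(float)
    ys, xs = slice(y - win, y + win + 1), slice(x - win, x + win + 1)
    gx, gy, gt = Ix[ys, xs].ravel(), Iy[ys, xs].ravel(), It[ys, xs].ravel()
    A = np.array([[gx @ gx, gx @ gy],
                  [gx @ gy, gy @ gy]])
    b = -np.array([gx @ gt, gy @ gt])
    # A is invertible exactly when the point is corner-like (two large
    # eigenvalues) -- the reason Harris corners pair well with KLT.
    u, v = np.linalg.solve(A, b)
    return u, v  # estimated displacement in x and y
```

Note that this single step recovers only small, sub-pixel motion well; the refinements in section 4 address larger displacements.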

Chaining the correspondences between successive frames gives the 2D path of each keypoint through the whole sequence (Figure 6). Some keypoints move out of frame at some point along the sequence (Figure 7).

Figure 6: 2D paths of 20 random keypoints

Figure 7: Keypoints that move out of frame during the sequence

Though not apparent in this sequence, large motion cannot be tracked reliably with the basic KLT tracker alone. I implemented iterative refinement and a pyramid KLT tracker to address this problem. Performance comparisons are shown in section 4.

3 Structure from Motion

The tracked keypoints provide multi-view information from which we can reconstruct the scene's 3D structure. Here we use the factorization method described in [1] and [2].

Figure 8: Estimated 3D points (upper row) and fitted model (lower row) in three views. Red lines in the lower figures show the estimated camera direction for each frame.

Because we use an orthographic camera model, camera position is not recoverable, but camera direction is. Camera directions are plotted as lines in Figure 8, and Figure 9 shows the normalized camera direction in each dimension.

Figure 9: Camera direction estimation
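The heart of the method is a rank-3 factorization of the measurement matrix via SVD. The NumPy sketch below shows the decomposition and how a per-frame viewing direction falls out of the motion matrix; the metric-upgrade step of [1] that resolves the remaining affine ambiguity (and the mirror ambiguity mentioned above) is omitted here for brevity.

```python
import numpy as np

def factorize(W):
    """Tomasi-Kanade factorization. W is the 2F x P measurement matrix of
    tracked image coordinates: rows 0..F-1 hold x, rows F..2F-1 hold y.
    Returns motion M (2F x 3) and shape S (3 x P), up to an affine
    ambiguity (the metric constraints are omitted in this sketch)."""
    W = W - W.mean(axis=1, keepdims=True)  # centering removes translation
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Under orthography the centered W has rank 3; keep the top 3 factors.
    M = U[:, :3] * np.sqrt(s[:3])
    S = np.sqrt(s[:3])[:, None] * Vt[:3]
    return M, S

def camera_directions(M):
    """Viewing direction per frame, in reconstruction coordinates: the
    null space of each frame's 2x3 projection block, i.e. the cross
    product of its two rows."""
    F = M.shape[0] // 2
    d = np.cross(M[:F], M[F:])
    return d / np.linalg.norm(d, axis=1, keepdims=True)
```

On synthetic orthographic projections the factorization reproduces the measurement matrix exactly, confirming the rank-3 structure.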

4 Bells and Whistles

4.1 Other keypoint selection methods

[3] claims that requiring the smaller eigenvalue of the second-moment matrix to exceed a threshold extracts keypoints that are more reliable for tracking than the Harris corner criterion. As shown in Figure 10, however, the detection results of Harris corners and Good Features to Track are quite similar, in terms of both keypoint locations and strengths, and so are the reconstruction results.

Figure 10: Comparison between Harris corners and Good Features to Track.
The first figure shows the locations of Harris corners and Good Features to Track; the latter two show their strengths.
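The two criteria score the same second-moment matrix differently, which explains why their rankings end up so similar. A small sketch of both scores, assuming the window sums sxx, syy, sxy have already been accumulated as in Harris detection:

```python
import numpy as np

def corner_scores(sxx, syy, sxy, k=0.04):
    """Harris response vs. the Shi-Tomasi ("Good Features to Track")
    criterion, both computed from the entries of the 2x2 second-moment
    matrix [[sxx, sxy], [sxy, syy]]."""
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    harris = det - k * trace ** 2
    # Shi-Tomasi score: the smaller eigenvalue of the 2x2 symmetric matrix.
    shi_tomasi = trace / 2 - np.sqrt((trace / 2) ** 2 - det)
    return harris, shi_tomasi
```

Both scores are high only when both eigenvalues are large, so on most images the two detectors select nearly the same points.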

Other types of features, such as SIFT, are not appropriate for this task (Figure 11).

Figure 11: SIFT features

4.2 Iterative refinement

The original KLT tracker uses only a first-order Taylor approximation. Iterative refinement can be used to eliminate the error this approximation introduces. The algorithm is as follows (taken from the lecture slides):

I terminate the iteration when the average update magnitude exceeds 90% of the previous round's, i.e., when the updates have essentially stopped shrinking. Figure 12 compares the original results with those obtained using iterative refinement; the tracked keypoints are clearly more reliable with refinement.

First frame

Last frame

3D structure

Figure 12: Comparison of the original KLT tracker (upper row) with the iteratively refined version (lower row)
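A sketch of the refinement loop, including the termination rule described above. The single-step solver and the bilinear warp here are minimal stand-ins for the actual implementation; in particular, this toy version tracks one keypoint with a whole-image warp, which a real tracker would not do.

```python
import numpy as np

def lk_step(I0, I1, y, x, win=5):
    """One first-order Lucas-Kanade step at (y, x)."""
    Iy, Ix = np.gradient(I0.astype(float))
    It = I1.astype(float) - I0.astype(float)
    ys, xs = slice(y - win, y + win + 1), slice(x - win, x + win + 1)
    gx, gy, gt = Ix[ys, xs].ravel(), Iy[ys, xs].ravel(), It[ys, xs].ravel()
    A = np.array([[gx @ gx, gx @ gy], [gx @ gy, gy @ gy]])
    return np.linalg.solve(A, -np.array([gx @ gt, gy @ gt]))

def shift_image(img, dy, dx):
    """Bilinear sampling of img at (y + dy, x + dx), clamped at borders."""
    H, W = img.shape
    ys = np.clip(np.arange(H) + dy, 0, H - 1)
    xs = np.clip(np.arange(W) + dx, 0, W - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    return ((1 - wy) * (1 - wx) * img[np.ix_(y0, x0)]
            + (1 - wy) * wx * img[np.ix_(y0, x1)]
            + wy * (1 - wx) * img[np.ix_(y1, x0)]
            + wy * wx * img[np.ix_(y1, x1)])

def lk_iterative(I0, I1, y, x, win=5, max_iter=20):
    """Refine (u, v) by repeatedly warping I1 back by the current estimate
    and re-solving; stop once the update magnitude exceeds 90% of the
    previous one, i.e. once the updates have stopped shrinking."""
    u = v = 0.0
    prev = np.inf
    for _ in range(max_iter):
        du, dv = lk_step(I0, shift_image(I1, v, u), y, x, win)
        u, v = u + du, v + dv
        step = float(np.hypot(du, dv))
        if step > 0.9 * prev:
            break
        prev = step
    return u, v
```

Each round solves only for the residual motion left after warping, so the first-order approximation error shrinks with every iteration.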

4.3 Pyramid KLT tracking

Using a pyramid can improve tracking accuracy when there are large movements in the sequence. The framework is shown in Figure 13.

Figure 13: Pyramid tracking (figure taken from Learning OpenCV, O'Reilly 2008)


Our dataset doesn't have large movements, so to test the pyramid tracker I use only half of the images in the sequence, making the motion between consecutive frames larger. Figure 14 compares the original tracker with the pyramid tracker on this sparser sequence.

First frame

Last frame

Without pyramid

With pyramid

Figure 14: Pyramid KLT tracker. Using a pyramid improves tracking accuracy for keypoints with large movements.
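A coarse-to-fine sketch of the idea follows. As before, this is a single-keypoint, whole-image-warp toy version with minimal helper routines; the real tracker warps per-window and iterates at each level.

```python
import numpy as np

def lk_step(I0, I1, y, x, win=3):
    """One first-order Lucas-Kanade step at (y, x)."""
    Iy, Ix = np.gradient(I0.astype(float))
    It = I1.astype(float) - I0.astype(float)
    ys, xs = slice(y - win, y + win + 1), slice(x - win, x + win + 1)
    gx, gy, gt = Ix[ys, xs].ravel(), Iy[ys, xs].ravel(), It[ys, xs].ravel()
    A = np.array([[gx @ gx, gx @ gy], [gx @ gy, gy @ gy]])
    return np.linalg.solve(A, -np.array([gx @ gt, gy @ gt]))

def shift_image(img, dy, dx):
    """Bilinear sampling of img at (y + dy, x + dx), clamped at borders."""
    H, W = img.shape
    ys = np.clip(np.arange(H) + dy, 0, H - 1)
    xs = np.clip(np.arange(W) + dx, 0, W - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    return ((1 - wy) * (1 - wx) * img[np.ix_(y0, x0)]
            + (1 - wy) * wx * img[np.ix_(y0, x1)]
            + wy * (1 - wx) * img[np.ix_(y1, x0)]
            + wy * wx * img[np.ix_(y1, x1)])

def downsample(img):
    """One pyramid level: 2x2 block averaging."""
    H, W = img.shape
    return img[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def lk_pyramid(I0, I1, y, x, levels=3, win=3):
    """Track coarse-to-fine: estimate at the coarsest level, then double
    the estimate at each finer level and correct it with one LK step."""
    p0, p1 = [I0], [I1]
    for _ in range(levels - 1):
        p0.append(downsample(p0[-1]))
        p1.append(downsample(p1[-1]))
    u = v = 0.0
    for lvl in range(levels - 1, -1, -1):
        u, v = 2 * u, 2 * v  # propagate the estimate to the finer level
        du, dv = lk_step(p0[lvl], shift_image(p1[lvl], v, u),
                         y >> lvl, x >> lvl, win)
        u, v = u + du, v + dv
    return u, v
```

At the coarsest level a large displacement shrinks by the pyramid's scale factor, so it falls back inside the small-motion regime where a single LK step is accurate; the finer levels then only correct small residuals.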


[1] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 1992.

[2] T. Morita and T. Kanade. A sequential factorization method for recovering shape and motion from image streams. PAMI, 1997.

[3] J. Shi and C. Tomasi. Good features to track. In Proc. CVPR, 1994.

Last update: Dec. 9th, 2011