Project 5: Tracking and Structure from Motion

CS 143: Introduction to Computer Vision

Justin Ardini (jardini)

Overview

The goal of this assignment is to reconstruct a 3D model of an object based on a series of observations (video frames) under orthographic projection. The pipeline for this reconstruction involves three major steps:

  1. Select keypoints to track from the first image.
  2. Track the motion of these keypoints across all video frames.
  3. Use the tracked motion to recover both the camera motion and a 3D model of the object.

I will outline my approach for implementing each of these steps, then show the final results.

Keypoint Selection

To select keypoints, I apply a Harris corner detector to the first frame and keep the 500 most confident responses as keypoints. Harris corners are ideal because their motion can be estimated reliably from frame to frame; points along edges suffer from the aperture problem, and points in flat interior regions have no local structure to track.
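The write-up does not include code, so here is a minimal sketch of this step in Python with OpenCV (not the original implementation); the file path and the quality and spacing parameters are illustrative:

    import cv2
    import numpy as np

    # First video frame as a grayscale image (the path is hypothetical)
    first_frame = cv2.imread("frames/frame_000.png", cv2.IMREAD_GRAYSCALE)

    # Harris corners, keeping the 500 most confident responses; minDistance
    # keeps the corners from clustering on a single strong feature.
    keypoints = cv2.goodFeaturesToTrack(
        first_frame,
        maxCorners=500,        # keep the 500 strongest corners
        qualityLevel=0.01,     # discard corners far weaker than the best
        minDistance=5,         # minimum spacing between selected corners
        useHarrisDetector=True,
        k=0.04,                # Harris detector free parameter
    ).reshape(-1, 2)           # (N, 2) array of (x, y) positions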

Visualization of 500 keypoints over the first frame

Feature Tracking

To implement feature tracking, I use the Kanade-Lucas-Tomasi tracker. This tracker assumes brightness constancy, small motion from frame to frame, and spatial coherence in order to reliably track points. Given these assumptions, the tracker computes optical flow between video frames and moves keypoints along these flow fields.
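As a sketch of the tracking loop, again in Python with OpenCV rather than the original code: the pyramidal Lucas-Kanade call respects the small-motion assumption by searching a small window around each point at multiple scales (the window size and pyramid depth are illustrative):

    import cv2
    import numpy as np

    def track_keypoints(frames, keypoints):
        """Track keypoints through a list of grayscale frames with
        pyramidal Lucas-Kanade; returns an (F, N, 2) array of positions."""
        points = keypoints.astype(np.float32).reshape(-1, 1, 2)
        tracks = [points.reshape(-1, 2).copy()]
        for prev, curr in zip(frames, frames[1:]):
            # Brightness constancy + small motion: estimate per-point flow
            # in a 15x15 window over a 3-level image pyramid.
            points, status, _ = cv2.calcOpticalFlowPyrLK(
                prev, curr, points, None, winSize=(15, 15), maxLevel=2)
            tracks.append(points.reshape(-1, 2).copy())
        return np.stack(tracks)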

Left: 20 initial keypoint locations and paths showing tracked motion across all frames
Right: Final keypoint locations shown over the final video frame

Some keypoints travel outside the video frame boundary as they move; I simply discard these points.

The 27 points above went out-of-frame at some point and were discarded
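Filtering these points out is straightforward; a sketch, following the (F, N, 2) track layout of the tracking code above:

    import numpy as np

    def discard_out_of_frame(tracks, height, width):
        """Drop every track that ever leaves the image bounds.
        tracks is the (F, N, 2) array of (x, y) positions from the tracker."""
        x, y = tracks[..., 0], tracks[..., 1]
        in_bounds = ((x >= 0) & (x < width) &
                     (y >= 0) & (y < height)).all(axis=0)
        return tracks[:, in_bounds]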

Structure from Motion

I implement structure from motion using the factorization approach of Tomasi and Kanade (1992). Here is a quick summary of the procedure. First, center the keypoints in each frame and stack the centered coordinates into a measurement matrix M, with two rows per frame (one for x, one for y) and one column per tracked point. Next, compute the singular value decomposition M = UWV' and keep the three largest singular values along with the first three columns of U and V. Use these to reconstruct two matrices: one representing the rotation of the camera (the motion matrix) and another representing the 3D points of the object (the shape matrix).
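In code, the factorization takes only a few lines; a sketch with NumPy, using the same (F, N, 2) track layout as above and stacking the x rows over the y rows in M:

    import numpy as np

    def factor_measurement_matrix(tracks):
        """Rank-3 factorization of Tomasi & Kanade (1992). Returns the
        (2F, 3) motion matrix and the (3, N) shape matrix, determined
        only up to an invertible 3x3 ambiguity at this stage."""
        # Center the points in each frame
        centered = tracks - tracks.mean(axis=1, keepdims=True)
        # Measurement matrix M: x rows for all frames, then y rows
        M = np.concatenate([centered[..., 0], centered[..., 1]])
        # SVD, keeping the three largest singular values
        U, w, Vt = np.linalg.svd(M, full_matrices=False)
        U3, w3, Vt3 = U[:, :3], w[:3], Vt[:3]
        # Split the singular values evenly between the two factors
        motion = U3 * np.sqrt(w3)              # (2F, 3)
        shape = np.sqrt(w3)[:, None] * Vt3     # (3, N)
        return motion, shape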

The shape and motion matrices determined from the above procedure are not unique: any invertible 3x3 matrix inserted between them reproduces the same measurements. By forcing each frame's image axes to be perpendicular and of unit length, I solve the resulting matrix equation for a corrective transform and apply it to the shape and motion matrices. Applying these metric constraints gives a point cloud representing the 3D object and a matrix representing the camera motion.
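A sketch of this metric step: each constraint is linear in the entries of the symmetric matrix G = QQ', so stacking three equations per frame gives a small least-squares problem, and the Cholesky factor of G is the corrective transform (this assumes the x-over-y row ordering of the factorization sketch above):

    import numpy as np

    def apply_metric_constraints(motion, shape):
        """Resolve the affine ambiguity by forcing each frame's image
        axes to be unit length and perpendicular."""
        F = motion.shape[0] // 2

        def row(a, b):
            # Coefficients of a' G b in the six unknowns of symmetric G
            return np.array([a[0]*b[0],
                             a[0]*b[1] + a[1]*b[0],
                             a[0]*b[2] + a[2]*b[0],
                             a[1]*b[1],
                             a[1]*b[2] + a[2]*b[1],
                             a[2]*b[2]])

        A, rhs = [], []
        for f in range(F):
            i, j = motion[f], motion[F + f]   # x and y axes of frame f
            A += [row(i, i), row(j, j), row(i, j)]
            rhs += [1.0, 1.0, 0.0]            # |i| = |j| = 1, i . j = 0
        g = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)[0]
        G = np.array([[g[0], g[1], g[2]],
                      [g[1], g[3], g[4]],
                      [g[2], g[4], g[5]]])
        Q = np.linalg.cholesky(G)             # requires G positive definite
        return motion @ Q, np.linalg.inv(Q) @ shape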


Four views of the triangle mesh formed from the object's point cloud

The red lines represent the direction of the camera across all frames

Conclusion

The algorithm succeeded in creating a 3D model of a house from 51 video frames, although the model is by no means ideal. As expected, the resulting mesh is denser in regions where many keypoints were tracked and sparser far from tracked features. The model looks very good from the camera's viewpoint, but the inconsistent depth becomes apparent from other viewing angles. The mesh could be improved by using more camera angles or by more sophisticated approaches to reducing reconstruction error.

Finally, thanks to Professor Hays and his TAs for making this a great course! Here's a celebratory cat.