Project 5: Tracking and Structure Through Motion

Tala Huhe (thuhe)


Overview

Scene reconstruction is a fundamentally difficult problem in computer vision. Since images are 2D projections of a 3D scene, depth information is lost in the imaging process. However, given multiple perspectives of the same scene, we can hope to perform a 3D reconstruction. In this project, we are given a video clip of a house, and we use the clip to recover the original 3D locations of points on that house.


Algorithm

Below is a rough outline of our algorithm:

  1. We detect and extract potentially important points in the first frame of the video.
  2. We track the movement of these points throughout the video.
  3. We evaluate the movement of the tracked points in order to generate the original point locations. Along with these, we calculate the camera directions.

Keypoint Selection

We select interest points in the image using a Harris corner detector. This basic detector finds points in the image where a small shift in any direction causes a large change in brightness. We apply the detector to the first frame, then select the 500 points with the highest cornerness scores; in other words, we keep the 500 most distinctive points in the image. Below are our selected keypoints overlaid on the first frame of the video:
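The detection step above can be sketched as follows. This is a minimal NumPy version, not our exact implementation: the box-filter window, its radius, and the sensitivity constant k = 0.04 are illustrative choices, and no non-maximum suppression is applied.

```python
import numpy as np

def window_sum(a, r=2):
    """Sum each pixel's (2r+1) x (2r+1) neighborhood (edge-padded)."""
    p = np.pad(a, r, mode="edge")
    out = np.zeros_like(a)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_keypoints(img, n_points=500, k=0.04):
    """Return the (x, y) locations with the highest Harris cornerness."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)            # image gradients
    Sxx = window_sum(Ix * Ix)            # structure-tensor entries,
    Syy = window_sum(Iy * Iy)            # accumulated over a local window
    Sxy = window_sum(Ix * Iy)
    # Harris score: det - k * trace^2 of the structure tensor.
    R = Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
    idx = np.argsort(R.ravel())[::-1][:n_points]
    ys, xs = np.unravel_index(idx, R.shape)
    return np.column_stack([xs, ys])
```

The score is large only where both eigenvalues of the structure tensor are large, i.e. where the brightness changes under a shift in any direction, which is exactly the corner criterion described above.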


Feature Tracking

In order to track feature point movement throughout the video, we use a Kanade-Lucas-Tomasi tracker. This algorithm proceeds by computing the optical flow between successive frames, where optical flow is defined as the apparent motion of features in the scene. Below is a visualization of our optical flow computation. The red channel represents horizontal flow, with no flow at half brightness, and the green channel represents vertical flow.

Flow from the first 12 frames of the video.
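The per-point flow estimate can be sketched as below. This is a simplified single-scale Lucas-Kanade step, assuming small motion between frames; the window radius r = 7 is an illustrative choice, and a real tracker would add pyramid levels and iterative refinement.

```python
import numpy as np

def lk_flow(I0, I1, points, r=7):
    """Lucas-Kanade flow: for each point, solve the least-squares
    system [Ix Iy] d = -It over a (2r+1) x (2r+1) window."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)   # spatial gradients
    It = I1 - I0               # temporal gradient
    flows = []
    for x, y in points:
        ys = slice(int(y) - r, int(y) + r + 1)
        xs = slice(int(x) - r, int(x) + r + 1)
        A = np.column_stack([Ix[ys, xs].ravel(), Iy[ys, xs].ravel()])
        b = It[ys, xs].ravel()
        d, *_ = np.linalg.lstsq(A, -b, rcond=None)  # flow vector (u, v)
        flows.append(d)
    return np.array(flows)
```

Aggregating the brightness-constancy constraint over a window makes the system overdetermined, which is what lets the tracker recover both flow components at corner-like points.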

We then estimate the motion of our feature points using optical flow: essentially, we move each point in the direction of the flow at its location. Here is the movement of 20 random keypoints:


Drag the slider to view different frames of the video. An example of an invalid feature point is highlighted in red. This feature follows the flow field straight out of the image.
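The per-frame update and the invalid-point check just described amount to a few lines. This is a sketch under the assumption that points and flow vectors are stored as (x, y) arrays; the bounds test is the simple validity criterion used here, which flags tracks like the red one above once they leave the frame.

```python
import numpy as np

def advance_points(points, flow, shape):
    """Move each tracked point along its flow vector and flag points
    that drift outside the image; those tracks are invalid and dropped."""
    moved = points + flow
    h, w = shape
    valid = ((moved[:, 0] >= 0) & (moved[:, 0] < w) &
             (moved[:, 1] >= 0) & (moved[:, 1] < h))
    return moved, valid
```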

Structure from Motion

We reconstruct the structure of the objects using the factorization method of Tomasi and Kanade (Shape and Motion from Image Streams under Orthography: a Factorization Method). To make the calculations easier, we assume the camera has no perspective, so that our images are all parallel (orthographic) projections. With this assumption, we are able to compute the direction of each camera through the algorithm outlined in the paper above.
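The core of the factorization can be sketched as below. This is only the rank-3 SVD step: the metric upgrade that enforces orthonormal camera rows is omitted, so the recovered motion and structure are determined only up to an invertible 3x3 transform.

```python
import numpy as np

def factorize(W):
    """Tomasi-Kanade factorization of a 2F x P measurement matrix W
    (tracked image coordinates over F frames, P points) into a motion
    matrix M (2F x 3) and a structure matrix S (3 x P)."""
    # Register the measurements: subtract each row's centroid, which
    # cancels the per-frame translation under orthography.
    Wc = W - W.mean(axis=1, keepdims=True)
    # The registered matrix has rank 3 in the noise-free case, so the
    # top three singular components give the factorization.
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    root = np.sqrt(s[:3])
    M = U[:, :3] * root           # camera (motion) matrix
    S = root[:, None] * Vt[:3]    # 3-D point (structure) matrix
    return M, S
```

With noisy tracks, truncating to rank 3 acts as a least-squares denoiser, which is one of the practical strengths of the method.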

6 views of the 3D reconstruction of our house.

Below is the relative movement of the camera on each axis: