Tracking and Structure from Motion

Keypoint Selection

Harris corner detection was used to detect approximately 4000 keypoints in the first frame of the video, and the 500 strongest of these were kept. These keypoints were prime candidates for tracking: the image around them was generally heterogeneous in both the x and y directions, which minimizes the effect of the aperture problem.
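
The selection step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the implementation used for these results: the function names (box_sum, harris_response, strongest_keypoints), the box window of radius 2, and the constant k = 0.05 are all assumptions, and non-maximum suppression is omitted for brevity.

```python
import numpy as np

def box_sum(a, radius):
    """Sum of `a` over a (2*radius+1)^2 window centered on each pixel (zero padded)."""
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="constant")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]

def harris_response(image, radius=2, k=0.05):
    """Harris corner response det(M) - k * trace(M)^2 at every pixel.

    `radius` and `k` are assumed values, not taken from the report.
    """
    iy, ix = np.gradient(image.astype(float))  # gradients along rows (y) and columns (x)
    sxx = box_sum(ix * ix, radius)
    syy = box_sum(iy * iy, radius)
    sxy = box_sum(ix * iy, radius)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

def strongest_keypoints(response, count):
    """Return the `count` pixels with the largest response as rows of (x, y)."""
    flat = np.argsort(response, axis=None)[::-1][:count]
    rows, cols = np.unravel_index(flat, response.shape)
    return np.column_stack([cols, rows])
```

The response is large and positive only where the intensity changes in both directions, which is why corners rather than edges survive the "strongest 500" cut. A call such as strongest_keypoints(harris_response(frame), 500) would mirror the numbers quoted above, with frame standing for a hypothetical grayscale first frame.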

The valid (fully trackable) keypoints in the first frame.

Feature Tracking

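To track the keypoints through the video, Kanade-Lucas-Tomasi (KLT) optical flow estimation was implemented. The x, y, and time (t) gradients of the image intensity (I) were computed for each pixel at each frame. I_x, I_y, and I_t were then used to estimate the x-component (u) and y-component (v) of the optical flow at every point in the video by the standard Lucas-Kanade least-squares relation

\[
\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
= -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}
\]

where each sum runs over a window of pixels around the point. This relation follows from the brightness constancy constraint I_x u + I_y v + I_t = 0.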
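
As a rough illustration of the flow estimate at a single point, the sketch below builds the 2x2 Lucas-Kanade normal equations from I_x, I_y, and I_t over a square window and solves them for (u, v). It is a minimal single-scale NumPy version under stated assumptions: the function name lucas_kanade_flow, the window radius of 7, and the use of a plain frame difference for I_t are illustrative choices, not details taken from the original implementation.

```python
import numpy as np

def lucas_kanade_flow(frame0, frame1, x, y, radius=7):
    """Estimate the flow (u, v) at integer pixel (x, y) between two grayscale frames.

    Assumes the window lies fully inside the image and that the motion is
    small (roughly under one pixel); `radius` is an assumed value.
    """
    f0 = frame0.astype(float)
    f1 = frame1.astype(float)
    iy, ix = np.gradient(f0)   # spatial gradients along rows (y) and columns (x)
    it = f1 - f0               # temporal gradient as a simple frame difference

    win = (slice(y - radius, y + radius + 1), slice(x - radius, x + radius + 1))
    wx, wy, wt = ix[win].ravel(), iy[win].ravel(), it[win].ravel()

    # Normal equations of the least-squares problem I_x u + I_y v = -I_t.
    m = np.array([[wx @ wx, wx @ wy],
                  [wx @ wy, wy @ wy]])
    b = -np.array([wx @ wt, wy @ wt])
    u, v = np.linalg.solve(m, b)   # raises LinAlgError if the window has no texture
    return u, v
```

The 2x2 matrix is the same second-moment matrix that the Harris detector scores, so keypoints with a strong corner response are exactly the points where this system is well conditioned.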


The optical flow at each pixel in one frame of the video.

The magnitude of optical flow at each pixel in one frame of the video.

The 2D path of 20 random keypoints over the course of the video.

Keypoints that moved outside the bounds of the image at any point in the video were excluded from further analysis.
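
One way to express this exclusion is a boolean mask over the tracked paths. The sketch below assumes the tracks are stored as two arrays xs and ys of shape (frames, keypoints); that layout and the function name in_bounds_mask are hypothetical and chosen only for illustration.

```python
import numpy as np

def in_bounds_mask(xs, ys, width, height):
    """Return a boolean mask, one entry per keypoint, that is True only if the
    keypoint stays inside the image in every frame.

    xs, ys: arrays of shape (frames, keypoints) holding tracked coordinates.
    NaN coordinates compare as False, so lost tracks are excluded as well.
    """
    inside = (xs >= 0) & (xs <= width - 1) & (ys >= 0) & (ys <= height - 1)
    return inside.all(axis=0)
```

The surviving tracks would then be selected with xs[:, mask] and ys[:, mask] before building the measurement matrix.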

The keypoints that moved out of the frame at some point in the video.

Structure from Motion

The affine structure of the hotel in the video was estimated using the method described in Tomasi and Kanade (1992).

An overview of the algorithm

Center the keypoint coordinates in each frame by subtracting the centroid of the keypoints in that frame.
Construct a measurement matrix D by stacking the matrix of feature x coordinates (one row per frame) on top of the matrix of feature y coordinates (one row per frame).
Take the singular value decomposition of D: D = U W V'. Reduce U, W, and V to rank 3 by keeping only the three largest singular values and their singular vectors:
- U = U(:, 1:3)
- W = W(1:3, 1:3)
- V = V(:, 1:3)
Create motion (A) and shape (X) matrices.
- A = U W^0.5
- X = W^0.5 V'
Apply the constraints of orthographic projection (for each frame, the two rows of A must be orthogonal unit vectors) to eliminate the affine ambiguity.
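
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the code used for the results: the function names affine_sfm and metric_upgrade and the (frames, keypoints) layout of xs and ys are hypothetical, and the last step is written as the usual linear least-squares solve for a symmetric matrix L = C C' followed by a Cholesky factorization, which is one common way to impose the orthographic constraints.

```python
import numpy as np

def affine_sfm(xs, ys):
    """Rank-3 factorization of tracked points. xs, ys have shape (F, P)."""
    # Center the coordinates in each frame.
    xs_c = xs - xs.mean(axis=1, keepdims=True)
    ys_c = ys - ys.mean(axis=1, keepdims=True)
    # Measurement matrix D (2F x P): x rows stacked on top of y rows.
    D = np.vstack([xs_c, ys_c])
    # SVD, truncated to the three largest singular values.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U, s, Vt = U[:, :3], s[:3], Vt[:3, :]
    # Motion A = U W^0.5 and shape X = W^0.5 V'.
    root_w = np.diag(np.sqrt(s))
    return U @ root_w, root_w @ Vt

def metric_upgrade(A, X):
    """Remove the affine ambiguity using the orthographic constraints."""
    F = A.shape[0] // 2

    def row(p, q):
        # Coefficients of the 6 unique entries of symmetric L in p' L q.
        return np.array([p[0] * q[0], p[0] * q[1] + p[1] * q[0],
                         p[0] * q[2] + p[2] * q[0], p[1] * q[1],
                         p[1] * q[2] + p[2] * q[1], p[2] * q[2]])

    G, c = [], []
    for f in range(F):
        i, j = A[f], A[F + f]          # x row and y row of frame f
        G += [row(i, i), row(j, j), row(i, j)]
        c += [1.0, 1.0, 0.0]           # unit length, unit length, orthogonal
    l = np.linalg.lstsq(np.array(G), np.array(c), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    C = np.linalg.cholesky(L)          # fails if noise makes L non-positive-definite
    return A @ C, np.linalg.solve(C, X)
```

NumPy's U[:, :3] corresponds to U(:, 1:3) in the notation above, and np.linalg.svd returns V' directly (named Vt here). The upgraded factors still multiply to the same D, since (A C)(C^-1 X) = A X; only the choice of 3D coordinate frame changes.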

The reconstructed 3D structure, viewed from three angles.

The three components of the camera position, plotted over time. In each plot, the horizontal axis is the frame number (time) and the vertical axis is the position.