CS 143 Project 5: Tracking and Structure from Motion
Nabeel Gillani

Overview

For this assignment, I followed the recommended pipeline, but opted to track fewer points than recommended. I tracked only salient features on the house in order to generate a strong representation (and resulting mesh) of the 3D reconstruction. I computed the initial motion and shape matrices and removed affine ambiguity according to the papers linked from the assignment page.

Algorithms and Decisions

Computing Optical Flow

I vectorized this entire function in order to significantly reduce runtime (from 10 minutes to approximately 15 seconds). I start by finding the appropriate gradient matrices for the current image. I then compute the matrices of gradient products (i.e. IxIx, IxIy, IyIy, IxIt and IyIt) and use a box filter with window size equal to 15 to sum them over each window. Solving the resulting 2x2 linear system (by multiplying both sides by the inverse of the second-moment matrix) gives u and v at every pixel. I then use u and v to find the offset by which each tracked point should be incremented. Below is the set of points I decided to track (note the absence of points on the bumps near the ground):

Tracked Points - Frame 1

250 selected points
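
As a rough sketch of the vectorized flow solve (written here in Python/NumPy rather than my actual implementation; im1 and im2 are hypothetical names for consecutive grayscale frames), the per-pixel 2x2 system can be solved in closed form:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def lucas_kanade_flow(im1, im2, window=15):
        """Dense Lucas-Kanade flow: one 2x2 solve per pixel
        (a sketch of the approach, not my exact code)."""
        # Spatial gradients of the current frame, temporal difference.
        Iy, Ix = np.gradient(im1)
        It = im2 - im1

        # Window-averaged gradient products; averaging instead of summing
        # rescales both sides of the 2x2 system equally, so (u, v) is unchanged.
        Ixx = uniform_filter(Ix * Ix, window)
        Ixy = uniform_filter(Ix * Iy, window)
        Iyy = uniform_filter(Iy * Iy, window)
        Ixt = uniform_filter(Ix * It, window)
        Iyt = uniform_filter(Iy * It, window)

        # Closed-form inverse of [[Ixx, Ixy], [Ixy, Iyy]] at every pixel.
        det = Ixx * Iyy - Ixy * Ixy
        det = np.where(np.abs(det) < 1e-9, 1e-9, det)  # guard singular windows
        u = (-Iyy * Ixt + Ixy * Iyt) / det
        v = ( Ixy * Ixt - Ixx * Iyt) / det
        return u, v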

After tracking the progression of the keypoints throughout the frames, I visualized the trajectories of 20 randomly selected points. In the image below, the red lines (composed of circles) represent the path each point travels along, the green "+" symbol (which is tough to see, but sits at the opposite end of a red streak from the blue "+") marks where the point starts, and the blue "+" symbol marks where it ends.

2D Trajectories for Points

Number of Points depicted = 20
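
To advance the tracked points from frame to frame, the flow must be sampled at each point's subpixel location. A minimal sketch, assuming hypothetical coordinate arrays xs, ys and the flow fields u, v from the function above:

    from scipy.ndimage import map_coordinates

    def track_points(xs, ys, u, v):
        """Advance tracked points by the flow, bilinearly sampled at
        their subpixel positions (a sketch, not my exact code)."""
        # map_coordinates expects (row, col) = (y, x) ordering.
        dx = map_coordinates(u, [ys, xs], order=1)
        dy = map_coordinates(v, [ys, xs], order=1)
        return xs + dx, ys + dy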

Because I tracked mainly points on the house itself, not that many of them moved out of frame. The perpetrators are depicted below as green "+" signs:

Points that moved out of frame

Number of points: 6
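
Detecting these stray points is just a bounds check on the updated coordinates. A minimal sketch, where xs, ys, H, and W are hypothetical names for the tracked coordinates and frame dimensions:

    # Keep only points that remain inside the H x W frame after the update;
    # the complement ("lost") are the green "+" points above.
    in_bounds = (xs >= 0) & (xs <= W - 1) & (ys >= 0) & (ys <= H - 1)
    lost_xs, lost_ys = xs[~in_bounds], ys[~in_bounds]
    xs, ys = xs[in_bounds], ys[in_bounds]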

Finding Structure from Optical Flow Points

Once I built the matrix of tracked x and y points for each frame, I implemented the algorithm described in both Tomasi and Kanade's 1992 paper and Morita and Kanade's 1997 publication. Following the first paper, I used the provided notes to define the measurement matrix D as the 2FxP matrix of tracked x and y coordinates "stacked" on top of one another (F frames, P points). I then subtracted each row's mean from that row to register the points about their centroid. After computing the SVD of D, I took the appropriate rows / columns of the resulting output matrices to construct first renditions of the shape and motion matrices for our tracked points, still subject to affine ambiguity.
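
A compact sketch of this factorization step (again Python/NumPy rather than my actual code, with W denoting the 2FxP measurement matrix described above):

    import numpy as np

    def factor_measurements(W):
        """Rank-3 factorization of the 2F x P measurement matrix
        (Tomasi & Kanade); a sketch, not my exact code."""
        # Register the measurements: subtract each row's mean
        # (the centroid of the tracked points in that frame).
        D = W - W.mean(axis=1, keepdims=True)

        # SVD, then keep only the three largest singular values.
        U, s, Vt = np.linalg.svd(D, full_matrices=False)
        root = np.diag(np.sqrt(s[:3]))
        M = U[:, :3] @ root      # 2F x 3 motion (affine)
        S = root @ Vt[:3, :]     # 3 x P shape (affine)
        return M, S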

To remove the affine ambiguity, I referenced the second paper. Here, I first set up the matrix equations that force the rows of M to satisfy the normalization constraints of an orthographic camera and solved them for the matrix L. I then took the Cholesky decomposition of L to arrive at C. Right-multiplying the camera motion matrix M by C and left-multiplying the 3D point matrix S by C's inverse was the last step in removing the ambiguity and arriving at our final motion and shape matrices.
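
A sketch of this metric upgrade under the same conventions (assuming M stacks the x-rows above the y-rows; the constraint rows, the least-squares solve for the six unique entries of the symmetric L, and the Cholesky step, which requires L to come out positive definite):

    import numpy as np

    def remove_affine_ambiguity(M, S):
        """Metric upgrade: solve for L = C C^T from the orthographic
        normalization constraints, then apply C (a sketch, not my exact code)."""
        F = M.shape[0] // 2

        def coeffs(a, b):
            # Row of coefficients so that coeffs(a, b) @ l == a @ L @ b,
            # where l holds the 6 unique entries of the symmetric 3x3 L.
            return np.array([a[0]*b[0],
                             a[0]*b[1] + a[1]*b[0],
                             a[0]*b[2] + a[2]*b[0],
                             a[1]*b[1],
                             a[1]*b[2] + a[2]*b[1],
                             a[2]*b[2]])

        A, rhs = [], []
        for f in range(F):
            i, j = M[f], M[F + f]          # x-rows stacked above y-rows
            A.append(coeffs(i, i)); rhs.append(1.0)   # |i|^2 = 1
            A.append(coeffs(j, j)); rhs.append(1.0)   # |j|^2 = 1
            A.append(coeffs(i, j)); rhs.append(0.0)   # i . j = 0
        l = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)[0]

        L = np.array([[l[0], l[1], l[2]],
                      [l[1], l[3], l[4]],
                      [l[2], l[4], l[5]]])
        C = np.linalg.cholesky(L)          # requires L positive definite
        return M @ C, np.linalg.inv(C) @ S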

Results

Predicted 3D Points for Different Views

Bottom View
Corner View
Top View

Camera Position X, Y, and Z movement over all frames

Normalized Camera Positions

Texture mapped + mesh final 3D reconstruction

Number of final points: 243

Analysis

Starting with the point reconstructions, we can see from these distinct views that the world-point representations of the tracked points are indeed fairly accurate. The reconstruction may appear sparse due to the limited number of points tracked, but the corner view clearly reveals the forms of the window and chimney.

The plots of the camera position vector over successive frames make intuitive and observational sense. When the frames are viewed in sequence as a video, the camera appears to move "right" (in the positive x direction), "down" (negative y direction), and somewhat "outward" along a curved path.

The resulting texture-mapped figure is also a reasonable reconstruction of the original scene. The red streaks here are the plotted camera position vectors (directed inward toward the center of the house). The mesh looks coarse, but a close look at the points where the texture and mesh intersect reveals a nice correspondence between the more prominent corner features in the image and the intersection points of the mesh.

Summary

Overall, selecting a small number of Harris corners (~250) to track yielded better results, albeit a coarser mesh, than my implementation with ~500 initial points. Vectorizing the optical flow computation was crucial for fast testing and development. The algorithms presented in the papers were also straightforward to implement. An easy experiment I didn't try was changing the window size - something I expect would alter the relative shift of each point tracked over successive frames. I enjoyed this project and hope to learn more about the different applications of SfM in the future.

Shoutouts

Shoutout to edwallac for helping me understand the assignment better, and especially to vmoussav and psastras for helping me try to track down the most vicious bug known to mankind. They are beasts (in a good way), and I will love them forever.