The goal of this project was to experiment with orthographic structure from motion. Given a sequence of input images, shot under orthographic camera projection, we would like to recover the 3D shape of the scene.
The first step in the pipeline was to detect "good" features to track. Basically we want to find points in the image which are easy to track across frames - in practice these are corner-like structures. Once these points have been found, optical flow is used to estimate their tracks throughout the image sequence. For this project, we implemented the Lucas-Kanade optical flow algorithm to track the points. Tracked points falling outside the image during the sequence were simply discarded.
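A minimal single-point Lucas-Kanade step can be sketched in NumPy as follows. This is only an illustration of the basic least-squares flow solve, without the pyramids or iterative refinement a practical tracker would add; the function name and interface are my own:

```python
import numpy as np

def lucas_kanade_flow(prev, curr, pt, win=7):
    """Estimate flow (dx, dy) at integer point pt=(row, col) from prev to curr.

    Single-window Lucas-Kanade sketch: solves the 2x2 least-squares
    system stacking Ix*dx + Iy*dy = -It over a square window.
    """
    r = win // 2
    y, x = pt
    # Spatial gradients (np.gradient returns d/drow, d/dcol) and temporal difference
    Iy, Ix = np.gradient(prev.astype(float))
    It = curr.astype(float) - prev.astype(float)
    wy = slice(y - r, y + r + 1)
    wx = slice(x - r, x + r + 1)
    # One row per window pixel: [Ix, Iy] v = -It
    A = np.stack([Ix[wy, wx].ravel(), Iy[wy, wx].ravel()], axis=1)
    b = It[wy, wx].ravel()
    v, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return v  # (dx, dy): column then row displacement
```

On a smooth synthetic image translated by a subpixel amount, this recovers the shift closely; for larger motions the small-displacement assumption breaks down, which is where pyramidal variants come in.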
Once the features had been tracked, the 3D structure was estimated using matrix factorization methods, as described in Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade 1992).
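The core of the factorization method is compact enough to sketch directly: stack the tracked image coordinates into a 2F x P measurement matrix, register it by subtracting each row's centroid, and SVD-truncate to rank 3. Note this sketch stops at the affine reconstruction; the paper's metric upgrade (enforcing orthonormal camera rows) is omitted:

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization of a 2F x P measurement matrix of tracked points.

    Tomasi-Kanade sketch: center each row (removing the per-frame
    translation), then split the rank-3 SVD into motion M (2F x 3) and
    shape S (3 x P). The result is defined up to an affine ambiguity.
    """
    W = W - W.mean(axis=1, keepdims=True)       # registered measurement matrix
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])               # camera motion
    S = np.sqrt(s[:3])[:, None] * Vt[:3]        # 3D shape
    return M, S
```

Under noiseless orthographic projection the centered measurement matrix is exactly rank 3, so M @ S reproduces it; with real tracks the truncated SVD gives the best rank-3 fit.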
Instead of using the Harris corner detector provided, I implemented the similar, but slightly different, Shi-Tomasi corner detector as described in Good Features to Track (Shi and Tomasi 1994). This detector examines the eigenvalues of the Harris matrix (of gradients), and counts a point as a corner if the smaller of the two eigenvalues exceeds a threshold.
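The minimum-eigenvalue response has a closed form from the 2x2 gradient matrix, so the detector can be sketched in a few lines of NumPy (this is my own illustrative version, using a simple box window rather than the Gaussian weighting one might use in practice):

```python
import numpy as np

def shi_tomasi_response(img, win=3):
    """Shi-Tomasi corner response: min eigenvalue of the windowed gradient matrix.

    Builds the 2x2 structure matrix [[Sxx, Sxy], [Sxy, Syy]] summed over
    a win x win box at each pixel and returns its smaller eigenvalue;
    pixels where this exceeds a threshold count as corners.
    """
    Iy, Ix = np.gradient(img.astype(float))

    def box(a, r):
        # Box-filter by summing shifted copies (wraps at borders; fine for a sketch)
        out = np.zeros_like(a)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, 0), dx, 1)
        return out

    r = win // 2
    Sxx, Sxy, Syy = box(Ix * Ix, r), box(Ix * Iy, r), box(Iy * Iy, r)
    # Smaller eigenvalue of a symmetric 2x2 matrix, in closed form
    tr = Sxx + Syy
    det = Sxx * Syy - Sxy * Sxy
    return tr / 2 - np.sqrt(np.maximum(tr * tr / 4 - det, 0.0))
```

The response is near zero on flat regions (both eigenvalues small) and along edges (one eigenvalue small), and large only where the gradient varies in two directions - exactly the corner criterion above.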
In the hotel test sequence used for this project, the regular LK optical flow algorithm does a very poor job of estimating the motion on the ground, even though there appear to be good features to track on the small dots (which are in fact detected by the corner detector). To detect poorly tracked features, I assume that the movement from frame to frame is relatively small, so an affine transformation from one frame to the next should fit fairly well. Under this assumption, I estimate a best-fit affine transformation of the feature points from the previous frame to the current frame. The affine matrix is estimated using Random Sample Consensus (RANSAC), with a total of 100 iterations per frame.
For each iteration of the RANSAC algorithm, three points are randomly sampled (six values), and an affine transformation is estimated from them. The transformation is then applied to the previous frame's points, and the sum of squared errors between the optical-flow point locations in the current frame and the affine-transformed points is computed. Points with a high error are removed from the tracked point list.
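The filtering loop above can be sketched as follows. This is an illustrative NumPy version under the stated assumptions (the function name, the 2.0-pixel inlier threshold, and the inlier-counting model selection are my choices, not necessarily what the project used):

```python
import numpy as np

def ransac_affine(prev_pts, curr_pts, n_iters=100, thresh=2.0, rng=None):
    """Fit an affine transform between tracked point sets with RANSAC.

    Each iteration samples 3 correspondences, solves for the 2x3 affine
    transform exactly, and scores it by counting inliers. Returns the
    best affine matrix and a boolean inlier mask; points outside the
    mask are the poorly tracked features to discard.
    """
    rng = np.random.default_rng(rng)
    N = len(prev_pts)
    P = np.hstack([prev_pts, np.ones((N, 1))])  # homogeneous coords, N x 3
    best_inliers = np.zeros(N, dtype=bool)
    best_A = None
    for _ in range(n_iters):
        idx = rng.choice(N, size=3, replace=False)
        # Solve P[idx] @ A = curr_pts[idx]; 3 points determine the affine map
        A, *_ = np.linalg.lstsq(P[idx], curr_pts[idx], rcond=None)
        err = np.linalg.norm(P @ A - curr_pts, axis=1)  # residual per point
        inliers = err < thresh
        if best_A is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_A = inliers, A.T  # store as 2 x 3
    return best_A, best_inliers
```

Since three non-collinear correspondences determine an affine map exactly, any all-inlier sample reproduces the true transform and collects every well-tracked point, while samples contaminated by a drifting track fit poorly and score few inliers.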
Below are some of the results of the structure from motion algorithm.
The x, y and z translations of the camera per frame.
Yellow points are points which move out of frame. Green points are keypoints at frame 1, which follow the blue path to frame 51, where they reach the red point.