In this assignment, we implemented an algorithm for reconstructing the 3D shape of an object from a sequence of observations (a video), as described in the paper Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade, 1992). There are three essential components: keypoint selection, feature tracking, and structure from motion.
The first step of the algorithm is to select feature points from the first frame of the video, which we will use to construct the 3D structure. We do this by running a Harris corner detector on the first frame and keeping the 500 strongest responses as our keypoints. Here is an example of feature selection on the first frame of the "house" sequence:
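A minimal sketch of this step in NumPy (the function names, the box-filter window, and the sensitivity constant `k` are our own illustrative choices, not taken from the assignment code):

```python
import numpy as np

def _box1d(a, w, axis):
    # Box filter of width w along one axis, via cumulative sums.
    pad = w // 2
    a = np.moveaxis(a, axis, 0)
    a = np.pad(a, [(pad, pad)] + [(0, 0)] * (a.ndim - 1), mode="edge")
    cs = np.concatenate([np.zeros((1,) + a.shape[1:]), np.cumsum(a, axis=0)])
    return np.moveaxis(cs[w:] - cs[:-w], 0, axis)

def harris_corners(img, n_points=500, k=0.04, w=3):
    """Return (row, col) locations of the n_points strongest Harris responses.

    img: 2-D float array (a grayscale frame).
    """
    Iy, Ix = np.gradient(img.astype(float))
    # Entries of the 2x2 structure tensor, summed over a local window.
    Sxx = _box1d(_box1d(Ix * Ix, w, 0), w, 1)
    Syy = _box1d(_box1d(Iy * Iy, w, 0), w, 1)
    Sxy = _box1d(_box1d(Ix * Iy, w, 0), w, 1)
    # Harris response: det(M) - k * trace(M)^2.
    R = Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
    flat = np.argsort(R, axis=None)[::-1][:n_points]
    rows, cols = np.unravel_index(flat, R.shape)
    return np.column_stack([rows, cols])
```

A practical detector would also apply non-maximum suppression so the selected points do not cluster; that is omitted here for brevity.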
The second step of the algorithm is to track the keypoints from the previous step through every frame of the video, to obtain the corresponding set of keypoints for all frames. We do this by estimating the optical flow between consecutive frames: the spatial image gradients are combined with the temporal difference between the two frames to solve for the motion of each feature point. Any points that leave the image frame during tracking are discarded.
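One common way to solve for per-point motion from spatial gradients and the temporal difference is a Lucas-Kanade step; a single-level sketch is below (an assumed stand-in for the assignment's tracker, which would iterate this over every consecutive frame pair and may use pyramids or iterative refinement):

```python
import numpy as np

def lucas_kanade_step(I0, I1, pts, w=7):
    """Estimate the displacement of each point from frame I0 to frame I1.

    I0, I1: consecutive grayscale frames (2-D float arrays).
    pts: (P, 2) array of (row, col) keypoint locations in I0.
    Returns a (P, 2) array of (d_row, d_col) flow per point; points with a
    degenerate (textureless) local system get zero flow.
    """
    Iy, Ix = np.gradient(I0)
    It = I1 - I0                      # temporal difference between the frames
    h = w // 2
    flow = np.zeros_like(pts, dtype=float)
    for i, (r, c) in enumerate(np.round(pts).astype(int)):
        r0, r1 = max(r - h, 0), min(r + h + 1, I0.shape[0])
        c0, c1 = max(c - h, 0), min(c + h + 1, I0.shape[1])
        ix = Ix[r0:r1, c0:c1].ravel()
        iy = Iy[r0:r1, c0:c1].ravel()
        it = It[r0:r1, c0:c1].ravel()
        # Normal equations of the local brightness-constancy system.
        A = np.array([[ix @ ix, ix @ iy],
                      [ix @ iy, iy @ iy]])
        b = -np.array([ix @ it, iy @ it])
        if np.linalg.det(A) > 1e-9:   # skip degenerate windows
            dx, dy = np.linalg.solve(A, b)
            flow[i] = (dy, dx)
    return flow
```

Chaining these per-pair displacements (adding each step's flow to the current point positions) yields the full tracks through the video.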
Here is an example showing the optical flow for one frame. Chaining these per-frame flow estimates tracks the keypoints through the video. The image on the right shows the tracked paths of twenty randomly chosen keypoints (the end of each path is marked with a green X):
optical-flow tracks of 20 keypoints
Throughout the sequence, any keypoint that moves outside the frame is discarded and excluded from the rest of the algorithm. The following image shows, marked in red, the points from the first frame whose tracks left the image:
The following animated GIF shows all 500 keypoints as they are tracked over the N frames:
Structure From Motion:
The majority of the algorithm's work happens in this step. First, the tracked points are registered by translating them so that the centroid of each frame's points lies at the origin of our estimated world space. This yields a measurement matrix D whose columns correspond to the tracked feature points and whose rows hold the x and y image coordinates for each frame, so for N frames and P points, D is 2N x P. Using singular value decomposition, we factor D (which under orthography has rank 3, up to noise) into a 2N x 3 motion matrix A, representing the camera's orientation in each frame, and a 3 x P shape matrix X, holding the estimated 3D world-space positions of all keypoints. This factorization is only determined up to an affine ambiguity, so finally we remove it by finding a matrix G that enforces the metric constraints (as discussed in A Sequential Factorization Method for Recovering Shape and Motion from Image Streams, Morita and Kanade, 1997), replacing A with AG and X with G⁻¹X to obtain the final motion and shape.
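The registration and rank-3 factorization can be sketched as follows (an illustrative NumPy version of the SVD step only; the metric-constraint solve for G is a separate step and is not shown):

```python
import numpy as np

def factorize(D):
    """Rank-3 factorization of the measurement matrix (Tomasi-Kanade style).

    D: (2N, P) matrix of tracked image coordinates across N frames.
    Returns (A, X): a 2N x 3 motion matrix and a 3 x P shape matrix with
    D_registered ~= A @ X. The remaining affine ambiguity (A -> A G,
    X -> G^-1 X) is resolved later via the metric constraints.
    """
    # Register: translate each frame's points so their centroid is at the
    # origin (subtract the mean of every row of D).
    D = D - D.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # Under orthography the registered D has rank 3 (up to noise), so keep
    # only the three largest singular values, splitting them between the
    # two factors.
    root = np.sqrt(s[:3])
    A = U[:, :3] * root
    X = root[:, None] * Vt[:3]
    return A, X
```

Splitting the singular values symmetrically between A and X is one common convention; any split gives an equally valid affine factorization, since G absorbs the difference.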
The following images show the final 3D structure from various viewing angles. The red lines represent the camera's 3D look vector in each frame, tracing the path of the camera as it moved around the house: