The baseline version uses a KLT (Kanade-Lucas-Tomasi) tracker on keypoints found by a Harris corner detector. Structure from motion is then recovered using the method from Morita and Kanade's 1997 paper ("A Sequential Factorization Method for Recovering Shape and Motion from Image Streams", Section 2.3).
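To make the factorization idea concrete, here is a minimal sketch in Python with NumPy. It is not the sequential method from the paper cited above: it is the simpler batch, rank-3 SVD factorization of the measurement matrix, and it stops before the metric upgrade, so the recovered shape is only defined up to an invertible 3x3 transform. The function name `factorize_tracks` and the `(F, P, 2)` input layout are assumptions made for illustration; it also assumes every point is tracked in every frame.

```python
import numpy as np

def factorize_tracks(tracks):
    """Batch rank-3 factorization of tracked points (illustrative sketch).

    tracks: array of shape (F, P, 2) holding the (x, y) image position of
            each of P points in each of F frames (hypothetical layout).
    Returns (motion, shape) with motion of shape (2F, 3) and shape of
    shape (3, P), such that motion @ shape approximates the registered
    measurement matrix.
    """
    tracks = np.asarray(tracks, dtype=float)
    # Stack all x rows on top of all y rows: the 2F x P measurement matrix.
    W = np.concatenate([tracks[:, :, 0], tracks[:, :, 1]], axis=0)
    # Register each row to the centroid of the points in that frame.
    W_reg = W - W.mean(axis=1, keepdims=True)
    # Keep the three largest singular values (rank-3 approximation).
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    root = np.sqrt(s[:3])
    motion = U[:, :3] * root          # scale columns of U
    shape = root[:, None] * Vt[:3, :]  # scale rows of V^T
    return motion, shape
```

Calling `factorize_tracks(tracks)` on the surviving tracks gives a 3 x P point cloud that could be shown as a scatter plot, keeping in mind that without the metric constraints it may be skewed or mirrored relative to the true scene.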
The lines show the path of each tracked point. The green circle indicates its final state. The final image of the sequence is displayed in the background.
A conservative point-removal rule was put into effect: if a point got close enough to the edge that the optical flow's summation window was no longer fully inside the image, the point was removed. (Here, this means a 7.5-pixel margin around the image border acted as a pixel "kill zone".)
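The kill-zone test can be sketched as a simple mask over the tracked points. This is an illustration only: the function name `keep_inside_margin` is hypothetical, and it assumes pixel coordinates run from 0 to width - 1 and 0 to height - 1, which the write-up does not state.

```python
import numpy as np

def keep_inside_margin(points, width, height, margin=7.5):
    """Return a boolean mask of the points outside the border kill zone.

    points: array of shape (P, 2) with (x, y) positions in pixels.
    A point is kept only if it is at least `margin` pixels away from
    every image edge (assumed coordinate range: 0 .. size - 1).
    """
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    inside_x = (x >= margin) & (x <= width - 1 - margin)
    inside_y = (y >= margin) & (y <= height - 1 - margin)
    return inside_x & inside_y
```

In a tracking loop, a point whose mask entry turns false in any frame would be dropped for the rest of the sequence, for example with `points = points[keep_inside_margin(points, width, height)]`.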
Somehow, the output is flipped from the image! Otherwise, the result appears reasonable.
It is easier to see the reconstruction by just looking at a scatter plot. The scatter plots shown below correspond in position and orientation to the mesh versions above.
Note how the bottom is square!
(It's a bit more difficult to see, but still visible, on the mesh.)
The algorithm performed well and gave a reasonable reconstruction of the image in the video. Unfortunately, since we do not have ground truth on the reconstruction, we can at best say that the reconstruction looks subjectively "pretty good" (aside from being flipped).
The reconstruction on this set was fantastic. The walls were smooth, except where there was not supposed to be smoothness, e.g., the windows poking out of the roof. And, out of all the reconstructions performed, this is the only one that correctly eliminated the background floor. (No dot on the floor actually remains through the entire sequence, so all the floor points should be eliminated by the current algorithm.)
Next, to show that the pyramidal structure worked correctly, large movement between images was simulated by giving only every 5th frame of the hotel sequence to the algorithm:
Weirdly, the z values were flipped at the end! So I flipped them back and took these pictures. Like the original, the result is also mirrored. In the end, the results still look better (smoother walls, for instance) with this tracker than the original, even with significantly less data!
Of course, it's not quite as good as the result with all the data. For instance, this one still thinks some of the floor is in all the frames, as evidenced by the final picture above.
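A possible explanation for the mirrored and z-flipped outputs, offered as a guess rather than something confirmed for this implementation: a factorization of the tracks into motion and shape matrices has a built-in reflection ambiguity. Negating the z row of the shape and the matching column of the motion reproduces exactly the same image measurements, so the tracks alone cannot tell the two solutions apart. The small NumPy check below demonstrates this with random, hypothetical motion and shape matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
motion = rng.normal(size=(6, 3))  # hypothetical motion matrix, 3 frames (2F x 3)
shape = rng.normal(size=(3, 8))   # hypothetical shape matrix, 8 points (3 x P)

# Reflection that negates the depth (z) axis.
flip_z = np.diag([1.0, 1.0, -1.0])
motion_flipped = motion @ flip_z
shape_flipped = flip_z @ shape

# Both factorizations predict identical image measurements.
same_measurements = np.allclose(motion @ shape, motion_flipped @ shape_flipped)
print(same_measurements)
```

Since both solutions fit the data equally well, flipping the z values back by hand, as was done for the pictures above, is a legitimate way to pick the one that matches the scene.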