Structure from Motion

Charles Yeh, Dec. 2011


KLT Tracker

A KLT tracker is first used to track several key points, chosen with a Harris corner detector, across a sequence of images.
The detector was run, and the detected points were then thresholded by strength. This resulted in:
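
For concreteness, here is a minimal NumPy sketch of this detection step; the function name and the sigma, k, and threshold defaults are illustrative, not the values used for these results:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_corners(img, sigma=1.0, k=0.04, threshold=1e-3):
    """Score each pixel with the Harris response and keep the strongest.

    img: 2D float array (grayscale). Returns (row, col) coordinates of
    pixels whose response exceeds `threshold`. Parameter values here
    are illustrative defaults.
    """
    # Image gradients (np.gradient returns axis-0 then axis-1 derivatives).
    Iy, Ix = np.gradient(img)
    # Elements of the second-moment matrix, smoothed over a local window.
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    # Harris response: det(M) - k * trace(M)^2.
    response = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
    # Threshold by strength.
    rows, cols = np.nonzero(response > threshold)
    return np.stack([rows, cols], axis=1)
```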

The KLT tracker works by making three assumptions: a pixel keeps the same brightness across two images, local groups of pixels move together, and there are no large motions. It uses the spatial intensity derivatives and the temporal gradient to estimate the direction and magnitude of motion at each pixel. In the following visualizations of optical flow, red symbolizes motion rightwards while green symbolizes motion downwards.
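
A minimal sketch of the per-window motion estimate these assumptions lead to (the function and parameter names are illustrative):

```python
import numpy as np

def lucas_kanade_flow(Ix, Iy, It, x, y, win=7):
    """Estimate (u, v) motion at one pixel from spatial and temporal gradients.

    Ix, Iy: spatial intensity derivatives of the first frame.
    It: temporal gradient (second frame minus first).
    win: half-width of the local window; the "local groups move
    together" assumption is what makes this system well-posed.
    """
    # Gather the gradients in the window around (x, y).
    ix = Ix[y - win:y + win + 1, x - win:x + win + 1].ravel()
    iy = Iy[y - win:y + win + 1, x - win:x + win + 1].ravel()
    it = It[y - win:y + win + 1, x - win:x + win + 1].ravel()
    # Normal equations of the brightness-constancy constraint:
    # [sum IxIx  sum IxIy] [u]     [sum IxIt]
    # [sum IxIy  sum IyIy] [v] = - [sum IyIt]
    A = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    u, v = np.linalg.solve(A, b)  # fails if the window is textureless
    return u, v
```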

These estimated optical flows are accumulated across frames for each tracked point. The following are 20 randomly chosen paths of tracked points.
The left image is the first frame of the sequence, and the right image is the last. The bottom-right two points seem to drift upwards incorrectly.
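
A rough sketch of this accumulation step, assuming one flow field per frame pair and a nearest-pixel lookup (bilinear interpolation would be more accurate; all names are illustrative):

```python
import numpy as np

def track_points(flows_u, flows_v, points):
    """Accumulate per-frame optical flow into point paths.

    flows_u/flows_v: lists of per-frame flow fields (one per frame pair).
    points: (N, 2) array of starting (x, y) positions.
    Returns an array of shape (frames + 1, N, 2) of positions over time.
    """
    paths = [points.astype(float)]
    for u, v in zip(flows_u, flows_v):
        prev = paths[-1]
        # Sample the flow at each point's current (rounded) position.
        xs = np.clip(np.round(prev[:, 0]).astype(int), 0, u.shape[1] - 1)
        ys = np.clip(np.round(prev[:, 1]).astype(int), 0, u.shape[0] - 1)
        step = np.stack([u[ys, xs], v[ys, xs]], axis=1)
        paths.append(prev + step)
    return np.array(paths)
```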

Tracked points that move outside of the image at any point in the sequence are removed. The following points (shown on the first image of the sequence) moved outside the image boundary at some point.
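
One simple way to express this filter, assuming the paths are stored as a frames x N x 2 array as in the sketch above (an illustrative layout, not necessarily the one used here):

```python
import numpy as np

def keep_in_bounds(paths, width, height):
    """Keep only points that stay inside the image for the whole sequence.

    paths: (frames, N, 2) array of (x, y) positions from tracking.
    Returns a boolean mask of length N selecting the surviving points.
    """
    x, y = paths[..., 0], paths[..., 1]
    inside = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    return inside.all(axis=0)
```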


Structure from Motion

The next step is to use the tracked keypoints to recover the affine structure from motion. First, a measurement matrix D is created from the coordinates of the points over all images, with each row centered by subtracting its mean. The singular value decomposition of D is computed and, keeping only the three largest singular values (an ideal affine measurement matrix has rank 3), the motion and shape matrices are calculated:

D = UWV'

M = U(W^.5), S = (W^.5)V'
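
A minimal NumPy sketch of this factorization, assuming D has already been centered:

```python
import numpy as np

def affine_factorization(D):
    """Factor the centered 2F x P measurement matrix into motion and shape.

    A rank-3 truncated SVD, following the equations above.
    """
    U, w, Vt = np.linalg.svd(D, full_matrices=False)
    # An ideal affine measurement matrix has rank 3; keep the top 3.
    U3, w3, Vt3 = U[:, :3], w[:3], Vt[:3, :]
    sqrt_w = np.diag(np.sqrt(w3))
    M = U3 @ sqrt_w   # 2F x 3 motion (camera) matrix
    S = sqrt_w @ Vt3  # 3 x P shape matrix
    return M, S
```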

The orthonormality of each camera's i and j vectors is then used as a set of constraints to solve for a matrix L = QQ'. M and S are adjusted with

M = MQ, S = (Q^-1)S
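
Each orthonormality constraint is linear in the six unknowns of the symmetric matrix L, so L can be found by least squares and Q recovered by a Cholesky factorization. A sketch, assuming the recovered L is positive definite:

```python
import numpy as np

def metric_upgrade(M):
    """Solve for Q from the orthonormality of each camera's i and j rows.

    M: 2F x 3 affine motion matrix with the x-axis rows a_f in the top
    half and the y-axis rows b_f in the bottom half. The constraints
    a L a' = 1, b L b' = 1, a L b' = 0 are linear in the six unknowns
    of the symmetric matrix L = QQ'.
    """
    F = M.shape[0] // 2

    def row(p, q):
        # Coefficients of (Lxx, Lxy, Lxz, Lyy, Lyz, Lzz) in p' L q.
        return [p[0]*q[0], p[0]*q[1] + p[1]*q[0], p[0]*q[2] + p[2]*q[0],
                p[1]*q[1], p[1]*q[2] + p[2]*q[1], p[2]*q[2]]

    A_rows, rhs = [], []
    for f in range(F):
        a, b = M[f], M[F + f]
        A_rows += [row(a, a), row(b, b), row(a, b)]
        rhs += [1.0, 1.0, 0.0]

    l = np.linalg.lstsq(np.array(A_rows), np.array(rhs), rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    return np.linalg.cholesky(L)  # L = QQ'; assumes L is positive definite
```

The recovered Q would then be applied as M = M @ Q and S = np.linalg.inv(Q) @ S, matching the update above.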

The top half of M holds the cameras' x axes while the bottom half holds the y axes. The look vector (z axis) is found by crossing the two. The 3D points are shown here; the red lines are the cameras' look vectors. Camera position is not recoverable from orthographic input, so only direction is shown. Every other look vector is also removed for easier viewing.
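
The look vectors can be recovered with one cross product per frame (illustrative sketch):

```python
import numpy as np

def look_vectors(M):
    """Recover each camera's look direction from the metric motion matrix.

    The top half of M holds the x-axis rows and the bottom half the
    y-axis rows; the look vector is their cross product, normalized.
    """
    F = M.shape[0] // 2
    looks = np.cross(M[:F], M[F:])  # z = x cross y, row-wise
    return looks / np.linalg.norm(looks, axis=1, keepdims=True)
```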

These are the camera look-vector components in the X, Y, and Z dimensions respectively. The vertical axis is the component value and the horizontal axis is the image number within the sequence.


Effectiveness on Spinning Sphere

To test the effectiveness of the algorithm, I ran it on a small image sequence of a spinning sphere with a super-low polygon count.

Thresholding did not seem to work, since the strengths of nearly all the features were < .0001. The optical flow was highly inaccurate, as expected.

As a result, the tracked point paths were also highly inaccurate.

The ultimate result seemed highly random and incorrect, as expected.