CS129 / Project 7 / Final Project - Video Stabilization

For this project, I investigated video stabilization. Video stabilization is the act of removing unpleasant camera shake, jitter, and rolling shutter artifacts from a source video to produce a more pleasing output video. Camera shake and jitter are usually caused by the cameraperson using a handheld camera, such as a phone camera, with a less-than-steady hand. Rolling shutter is caused by (usually cheap) camera electronics capturing different scan lines of an image at different times, rather than all at the same instant. Video stabilization techniques work especially well when the camera has a steady average trajectory over time but a chaotic trajectory over shorter windows.

The following videos show an example of a shaky input video, and a stabilized version of the video using state-of-the-art video stabilization (in this case, the stabilization built into YouTube and available for all YouTube videos).

Stabilization Pipeline

The first step of my implementation was to determine frame-by-frame correspondences. I used the Harris interest point detector from project 6, together with RANSAC, to estimate homographies between pairs of frames. We require very accurate homography estimates: sub-pixel accuracy is not just desirable, but necessary. Without sub-pixel accuracy we can experience drift, whereby small errors in the estimated homographies accumulate over time. For example, suppose the camera starts at viewpoint V1, moves to a viewpoint W, then moves back to V2, where V1=V2. The accumulated errors in moving from V1->W->V2 may taint our estimate, causing V1 and V2 to appear to be two different positions. I am not certain that the Harris interest point detector is the best choice here, since it does not locate interest points with sub-pixel accuracy.
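
As a rough illustration of this frame-pair estimation step, the sketch below uses OpenCV rather than my project 6 code: ORB features stand in for the Harris detector, and cv2.findHomography performs the RANSAC fit. The feature count and threshold here are illustrative assumptions, not tuned values from my implementation:

import cv2
import numpy as np

def estimate_homography(frame_a, frame_b, ransac_thresh=2.0):
    """Estimate the homography mapping frame_a onto frame_b.

    Illustrative sketch: ORB features + brute-force matching replace the
    Harris detector and matcher from project 6.
    """
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    orb = cv2.ORB_create(nfeatures=2000)
    kps_a, desc_a = orb.detectAndCompute(gray_a, None)
    kps_b, desc_b = orb.detectAndCompute(gray_b, None)

    # Brute-force matching with cross-checking to reject ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)

    src = np.float32([kps_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kps_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects outlier correspondences; the mask marks the inliers.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    inliers = int(mask.sum()) if mask is not None else 0
    return H, inliers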

I mitigated this possible error by considering not just frame-to-frame homographies, but the average homography over a window of frames. For example, for frame i, I calculated the homography between frame i and each of the frames from i-1 to i-n, for some n, and then averaged these homographies. The best window size was video dependent, but typically the greater the value of n, the better the estimated homography. It was also important to threshold homographies based on the number of successfully matched points: if a homography was calculated from only a small number of correct interest point matches, it was excluded.

The pseudocode for this part of the pipeline is as follows:

determineInterestPoints(frames)
extractFeatures(frames)
H(1) = identity
for i = 2 to size(frames)
	homographies = []
	for j = max(1, i-n) to i-1
		homographies = homographies ++ calculateHomography(frames(j), frames(i))
	H(i) = mean(homographies)

In my implementation, I found that n=3 was sufficient to get a good estimate for the homography. At this point, we have an estimate of the homographies between frames for the entire duration of the video. I found this step to be very temperamental; it required the tweaking of many parameters to ensure that a sufficient number of interest points were selected and that the results were not tainted by incorrect homographies at any point.
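
As a more concrete sketch of the windowed averaging and inlier thresholding described above, the following Python version reuses the hypothetical estimate_homography helper from the earlier sketch; min_inliers is an illustrative threshold, not a tuned value from my implementation:

import numpy as np

def windowed_homographies(frames, n=3, min_inliers=30):
    """For each frame i, average the homographies mapping frames i-n..i-1
    onto frame i, skipping estimates supported by too few RANSAC inliers.
    """
    identity = np.eye(3)
    averaged = [identity]  # frame 0 has no predecessors
    for i in range(1, len(frames)):
        candidates = []
        for j in range(max(0, i - n), i):
            H, inliers = estimate_homography(frames[j], frames[i])
            if H is not None and inliers >= min_inliers:
                candidates.append(H / H[2, 2])  # normalise before averaging
        if candidates:
            averaged.append(np.mean(candidates, axis=0))
        else:
            averaged.append(identity)  # fall back if no reliable estimate
    return averaged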

The following video shows the frame-to-frame homographies applied to render the input from a fixed camera viewpoint.

With a perfect implementation, there should be no jitter whatsoever when using a fixed camera viewpoint. However, due to imperfect interest point detection and homography calculation in the first step of the pipeline, the homographies are imperfect estimates, and so we retain a small amount of jitter. The YouTube algorithm does a much better job of determining the frame-to-frame homographies.
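
For reference, this fixed-viewpoint rendering can be sketched as follows, assuming homographies[i] maps frame i-1 onto frame i (as in the sketches above): the per-frame homographies are chained together, and each frame is warped back into the coordinate system of the first frame. This is an illustrative reconstruction, not my exact rendering code:

import cv2
import numpy as np

def render_fixed_viewpoint(frames, homographies):
    """Warp every frame into the coordinate system of the first frame.

    Assumes homographies[0] is the identity and homographies[i] maps
    frame i-1 onto frame i.
    """
    h, w = frames[0].shape[:2]
    cumulative = np.eye(3)          # maps frame 0 -> frame i
    stabilized = [frames[0]]
    for i in range(1, len(frames)):
        cumulative = homographies[i] @ cumulative
        # Invert to map frame i back into frame 0's viewpoint.
        to_first = np.linalg.inv(cumulative)
        stabilized.append(cv2.warpPerspective(frames[i], to_first, (w, h)))
    return stabilized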

Step 2: Stabilization

The second step of my implementation was to determine a better camera trajectory than the shaky trajectory from the source video.

The fixed viewpoint as shown above does not produce a particularly pleasing output video. Instead, we want to derive a camera trajectory that is close to that of the input video but removes the high-frequency jitter.

For example, the following image shows derived trajectories using the technique on the videos shown above:

Advanced techniques, such as the L1 optimization technique used in Google’s YouTube video stabilization, try to mimic camera trajectories used by professional directors. In particular, the YouTube technique uses linear programming to derive an optimal trajectory that consists of a combination of static, linear, and parabolic camera trajectories. Such an approach gives the camera a more natural feel.

I implemented the naïve approach, whereby the derived camera trajectory is simply a Gaussian-filtered version of the source video's trajectory. Such an approach removes the high-frequency jitter. It does not, however, produce a great result, particularly compared to the YouTube result.
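
A minimal sketch of this smoothing, under the simplifying assumption that the camera trajectory is summarized by per-frame translation and rotation extracted from the homographies (rather than the full homography chain); scipy's gaussian_filter1d plays the role of the Gaussian kernel:

import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_trajectory(homographies, sigma=10.0):
    """Smooth a simplified camera trajectory with a Gaussian filter.

    Each homography is reduced to (dx, dy, rotation angle); the cumulative
    trajectory is low-pass filtered, and the corrected per-frame motion
    needed to follow the smoothed path is returned.
    """
    # Per-frame motion parameters extracted from the homographies.
    motion = np.array([
        [H[0, 2], H[1, 2], np.arctan2(H[1, 0], H[0, 0])]
        for H in homographies
    ])
    trajectory = np.cumsum(motion, axis=0)                    # original camera path
    smoothed = gaussian_filter1d(trajectory, sigma, axis=0)   # filtered camera path
    correction = smoothed - trajectory                        # per-frame nudge
    return motion + correction                                # corrected motion

Each row of the returned array can be converted back into a transform and applied with cv2.warpAffine; larger sigma values correspond to the wider kernels in the videos below, following the original trajectory less closely but removing more of the jitter.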

The following videos show the result of applying the Gaussian filter to derive a camera trajectory, using kernels of radius 10, 60, and 120 respectively.

It is clear from the videos that no matter what the derived camera trajectory is, there are still small high-frequency jitters present in the video due to inaccuracies in the homography calculation. A better approach to determining interest points in the video frames would reduce these inaccuracies, and in turn produce better results.

In conclusion, it is clear that the most important step in video stabilization is to determine the most accurate frame-to-frame homographies possible. Given perfect homographies, we can then apply techniques such as L1 optimization to derive interesting and pleasant camera trajectories. In this work, I found that mediocre results can be obtained using relatively simple interest point matching implementations.