Super Resolution from Image Sequences

Sam Eilertsen (skeilert)
May 2011

Super Resolution

Super resolution, the enhancement of image resolution by intelligent means, is an important and challenging topic. All digital capture systems are limited by their pixel resolution (and other capture media, like film, have their own resolution limits). The standards for capture, processing, and display of digital images are constantly rising as technologies improve. As display resolution increases, older content becomes increasingly obsolete. This is particularly relevant in video technology, where in the last decade standard-definition (480p) video has been rapidly replaced by high-definition (720p and 1080p), which is already waning as capture resolutions are pushed even higher (quad-HD, 4K).

"Unintelligent" methods for increasing resolution are very simple: create a high-resolution image by sampling the pixels of the lower-resolution image, apply some sort of filter to smooth out jaggedness, and possibly apply some sharpening. Little can be gained this way: sampling does nothing to improve resolution, smoothing decreases apparent sharpness, and sharpening quickly degrades the image. Fundamentally, quantizing an image is a process of throwing away information; once an image has been sampled at a certain resolution, the information needed to accurately reconstruct a higher resolution is lost. Clearly, something smarter is needed.
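To make the point concrete, here is a minimal sketch of the naive approach: nearest-neighbour upsampling simply repeats each low-resolution pixel, creating a larger image with no new information. The function name is mine, not from the report.

```python
import numpy as np

def naive_upscale(img, factor):
    """Nearest-neighbour upscale: each low-res pixel is repeated
    factor x factor times.  No information is added -- the pixels
    just get bigger."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

low = np.array([[0.0, 1.0],
                [1.0, 0.0]])
high = naive_upscale(low, 2)  # 4x4 image of repeated 2x2 blocks
```

Any smoothing or sharpening applied afterward only redistributes these repeated values; it cannot recover detail that was never sampled.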

The technique of super-resolution I have attempted to implement derives a high-resolution image from a sequence of frames of the same subject, rather than from a single frame. The assumption is that, because of small movements of the camera between frame captures, more information is captured in the sequence than is available in any single image, and it is possible to reconstruct a higher-resolution image by combining the frames. An important principle leveraged is reconstruction: when the high-resolution image is down-sampled in a manner similar to the original sampling, the result should be similar to the original, and the error can be iteratively corrected.


The algorithm implemented is described in Irani and Peleg. The pipeline consists of three steps: the camera movement is corrected for so the images align, an initial guess is constructed by averaging the aligned images, and the guess is iteratively corrected by generating low-resolution images from the high-resolution guess and back-projecting the error.
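The three steps can be summarized in Python-style pseudocode (the helper names are placeholders for the steps just described, not functions from my implementation):

```
def super_resolve(frames, n_iters):
    # Step 1: sub-pixel registration of each frame against a reference
    shifts = [register(frame, frames[0]) for frame in frames]
    # Step 2: initial guess by warping and averaging the aligned frames
    guess = average_aligned(frames, shifts)
    # Step 3: iterative refinement by simulate-and-back-project
    for _ in range(n_iters):
        simulated = [simulate_imaging(guess, s) for s in shifts]
        guess = back_project(guess, frames, simulated, shifts)
    return guess
```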

For the first step, images are aligned based on translation and rotation (motion of elements within the frame is not considered, although a more advanced implementation could account for it). This must be done to sub-pixel accuracy, since the algorithm depends on the sub-pixel offsets between frames, so the familiar SSD approach will not help us. Instead, alignment is found by solving a system of linear equations constructed from several factors, including the sum of differences between the images and the gradient of the image (equation 2 in Irani and Peleg). This is based on the assumption that motion is confined to small translations and rotations.
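The following is a minimal sketch of this idea, simplified to pure translation (dropping the rotation term of Irani and Peleg's equation 2): linearizing brightness constancy gives a 2x2 least-squares system in the image gradients and the frame difference. Function names and the test image are mine.

```python
import numpy as np

def estimate_shift(ref, moved):
    """Estimate the sub-pixel translation (dy, dx) of `moved` relative
    to `ref` by solving the linearised least-squares system built from
    the image gradients and the frame difference (translation-only
    simplification of Irani & Peleg, eq. 2)."""
    gy, gx = np.gradient(ref)          # gradients along rows, columns
    diff = moved - ref
    # Normal equations for min ||diff + gy*dy + gx*dx||^2
    A = np.array([[np.sum(gy * gy), np.sum(gy * gx)],
                  [np.sum(gy * gx), np.sum(gx * gx)]])
    b = -np.array([np.sum(gy * diff), np.sum(gx * diff)])
    return np.linalg.solve(A, b)

# Smooth synthetic blob, shifted by a known sub-pixel amount
yy, xx = np.mgrid[0:64, 0:64]
def blob(dy, dx):
    return np.exp(-(((yy - 32 - dy) ** 2 + (xx - 32 - dx) ** 2) / 50.0))

ref = blob(0.0, 0.0)
moved = blob(0.3, -0.2)                # shifted 0.3 px down, 0.2 px left
dy, dx = estimate_shift(ref, moved)    # recovers roughly (0.3, -0.2)
```

Note that this works only for small displacements, where the first-order Taylor expansion holds, which matches the paper's small-motion assumption.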

An initial guess is calculated simply by applying the calculated transformations to the images and averaging them. Irani and Peleg demonstrated that, with enough refinements, even an arbitrary guess can produce a reasonable result.
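Assuming the frames have already been warped onto the high-resolution grid, the averaging itself is a one-liner; this sketch (names mine) just makes the step explicit:

```python
import numpy as np

def initial_guess(aligned_frames):
    """Form the initial high-resolution estimate by averaging a stack of
    frames that have already been warped onto a common grid."""
    return np.mean(np.stack(aligned_frames), axis=0)

frames = [np.array([[0.0, 2.0], [2.0, 0.0]]),
          np.array([[2.0, 4.0], [4.0, 2.0]])]
guess = initial_guess(frames)
```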

Refinement takes place in two steps. First, an image sequence is derived from the guess, based on the calculated displacements of each frame in the original sequence. The derivation of this sequence simulates the imaging process: each pixel in the guess is projected into each new frame using a point-spread function, a Gaussian whose parameters are determined by the behavior of the imaging system. This process is similar to the convolution and downsampling used in bicubic scaling; however, in this case the image is convolved not with a matrix but with a smooth function.
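A minimal sketch of this simulation step, assuming a sampled Gaussian PSF and ignoring the per-frame displacement for brevity (real code would first warp the guess by each frame's displacement); all names are mine:

```python
import numpy as np

def gaussian_psf(radius, sigma):
    """Discrete Gaussian point-spread function, normalised to sum to 1."""
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    k = np.exp(-(yy ** 2 + xx ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def simulate_frame(guess, psf, factor):
    """Simulate the imaging process: blur the high-res guess with the
    PSF, then subsample every `factor`-th pixel."""
    h, w = guess.shape
    r = psf.shape[0] // 2
    padded = np.pad(guess, r, mode="edge")
    blurred = np.zeros_like(guess)
    for dy in range(psf.shape[0]):          # direct 2-D correlation
        for dx in range(psf.shape[1]):      # (fine for a symmetric PSF)
            blurred += psf[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return blurred[::factor, ::factor]

psf = gaussian_psf(2, 1.0)
frame = simulate_frame(np.ones((8, 8)), psf, 2)  # a flat image stays flat
```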

Once a sequence has been derived, the error is calculated by comparing each derived image to the corresponding image in the original sequence. This error is projected back onto the guess using a back-projection kernel, which is often the same as the point-spread function, although in some circumstances a different kernel is used.
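One refinement pass might look like the following sketch, where the back-projection kernel is reduced to plain nearest-neighbour spreading of the error (a faithful version would convolve with a kernel related to the PSF and undo each frame's displacement); names and the `step` parameter are mine:

```python
import numpy as np

def upsample(img, factor):
    """Nearest-neighbour upsample: crude stand-in for spreading the
    low-res error back over the high-res grid."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def back_project(guess, frames, simulated, factor, step=1.0):
    """One refinement pass: for each frame, take the low-res error
    between the observed frame and the simulated frame, project it back
    onto the high-res grid, and apply the averaged correction."""
    correction = np.zeros_like(guess)
    for frame, sim in zip(frames, simulated):
        correction += upsample(frame - sim, factor)
    return guess + step * correction / len(frames)

# Toy check: a zero guess whose simulated frame underestimates a flat
# observed frame by 1 gets corrected upward by exactly that error.
guess = np.zeros((4, 4))
new_guess = back_project(guess, [np.ones((2, 2))], [np.zeros((2, 2))], 2)
```

Iterating simulate-then-back-project drives the simulated sequence toward the observed one, which is the reconstruction principle described earlier.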