CS 129 Project 1: Image Alignment with Pyramids

Daniel Moreno (damoreno)
September 17th, 2012


I have implemented the automatic alignment algorithm in C++. It features two alignment metrics: SSD (sum of squared differences) and MI (mutual information). The software is multi-scale by default, but it can be run in single-scale mode with a command-line option. Additional options allow cropping the image by a fixed margin and performing automatic white balance.

Metrics

SSD is the simplest metric available. The idea is to compare each pixel in one image with the corresponding pixel in the other image and sum the squared differences of their values.

MI is defined using entropy: MI = h(img1) + h(img2) - h(img1, img2). The idea is that the joint entropy h(img1, img2) reaches its minimum when the images are perfectly aligned, because one image can then be easily predicted from the other. When the joint entropy is at its minimum, MI reaches its maximum.
Entropy is defined in terms of probabilities: h(img) = -sum_v P(v) log P(v), where the sum runs over the possible intensity values v. More details can be found in "Alignment by Maximization of Mutual Information" by Paul Viola. In my implementation I approximated the probabilities using intensity histograms. For the joint entropy we need the 2-dimensional joint histogram of both images.
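The histogram-based estimate can be sketched as follows (the flat-vector image layout and the choice of 256 bins are illustrative assumptions):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

static const int BINS = 256;  // one bin per 8-bit intensity value

// Entropy of a histogram: h = -sum p * log2(p) over non-empty bins.
double entropy(const std::vector<double>& hist, double total) {
    double h = 0.0;
    for (double c : hist)
        if (c > 0) {
            double p = c / total;
            h -= p * std::log2(p);
        }
    return h;
}

// MI = h(a) + h(b) - h(a, b), with probabilities approximated by
// marginal histograms and a 2-D joint histogram of the two images.
double mutual_information(const std::vector<uint8_t>& a,
                          const std::vector<uint8_t>& b) {
    std::vector<double> ha(BINS, 0), hb(BINS, 0), hab(BINS * BINS, 0);
    for (size_t i = 0; i < a.size(); ++i) {
        ha[a[i]] += 1;
        hb[b[i]] += 1;
        hab[a[i] * BINS + b[i]] += 1;  // joint occurrence of (a_i, b_i)
    }
    double n = (double)a.size();
    return entropy(ha, n) + entropy(hb, n) - entropy(hab, n);
}
```

When the two images are identical the joint histogram is concentrated on the diagonal, so the joint entropy equals the marginal entropy and MI is maximal; for unrelated images the joint histogram spreads out and MI drops toward zero.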

After experimenting with both metrics, my conclusion is that MI is more robust and performs as well as or better than SSD. I say that SSD is less robust because its accuracy is affected by the noise usually present at the borders of the test images, whereas MI seems mostly insensitive to it. To mitigate this problem, all the images are temporarily cropped during the optimization loop so that the metrics only consider the main content of the images.

Usually the difference in alignment between SSD and MI is a couple of pixels. The following is an example where SSD completely fails, due to the crack in the image, but MI aligns correctly without further issues:

SSD: 00308a-ssd.jpg
MI: 00308a-mi.jpg

Multi-scale search

In multi-scale mode a pyramid is built for each of the three images. The first level of the pyramid holds the original image, and each additional level is built by halving the resolution of the previous one. The number of levels is selected automatically to ensure that the smallest image has a width of at least 128 pixels. The search begins at the top level of the pyramid with an extended search window where all possible translations are evaluated (brute force). The search at each subsequent level starts from the solution of the previous level, with the search window drastically reduced to a 3x3 neighborhood. Each level has twice the resolution of the previous one, so the solution of one level is transferred to the next simply by doubling its pixel coordinates. A typical alignment of a high-resolution image (e.g. 3741x3228 pixels) takes around 8 seconds.

White-balance

White balance is performed as a post-processing step, after alignment, to improve the color quality of the images. The algorithm finds the maximum value of each color channel and maps that value to white (RGB = 255, 255, 255) by scaling each channel independently.
Normal: 01016a-mi-nwb.jpg
White-balanced: 01016a-mi.jpg
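This channel-scaling step can be sketched as follows (the interleaved RGB buffer layout is an illustrative assumption):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Per-channel white balance on an interleaved RGB byte buffer:
// the maximum of each channel is scaled up to 255 independently.
void white_balance(std::vector<uint8_t>& rgb) {
    for (int c = 0; c < 3; ++c) {
        uint8_t mx = 0;
        for (size_t i = c; i < rgb.size(); i += 3)
            mx = std::max(mx, rgb[i]);
        if (mx == 0 || mx == 255) continue;  // nothing to scale
        for (size_t i = c; i < rgb.size(); i += 3)
            rgb[i] = (uint8_t)(rgb[i] * 255 / mx);  // map channel max to white
    }
}
```

Scaling each channel independently removes a global color cast, at the risk of over-amplifying a channel whose maximum comes from noise; a robust variant could use a high percentile instead of the absolute maximum.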

Result Images


Other experiments

The described algorithm gives good results on all the images I have tested; however, when zooming in one can see that some regions are not perfectly aligned, especially close to the borders. My first guess is that an integer pixel displacement might not be the best model to match the different channels. In an effort to improve on this, I implemented the complete algorithm described in "Alignment by Maximization of Mutual Information" by Paul Viola. This algorithm maximizes the MI by stochastic hill climbing on its gradient. It involves estimating probability densities with the Parzen window method in order to compute the MI gradient at the current transformation. I was attracted to this idea because it extends easily to more complex transformations; in my case I used an affine matrix to transform coordinates from one image to the other with subpixel precision.

Although the algorithm works, the result does not look much better than the one I already had. Maybe an affine transformation is not enough to model the distortions, or maybe further testing is required.