Overview

Your average camera has a very limited field of view, requiring great discretion on the part of the photographer regarding which sector of his surroundings he wishes to capture. However, several photos taken in different directions from the same vantage point can be stitched together to simulate a wider field of view, allowing a continuous view of surrounding scene geometry in one image. This process can be done by hand, or automatically, by identifying which points in different images represent the same locations in the original scene, and transforming each image such that all of these correspondences are aligned. Below, I outline this process and show some results from my implementation.

Algorithm

In this description, I'll assume we're stitching only two images, but it's easy to see that stitching multiple images is simply a matter of iterating this process. The process is based on a paper by Brown et al.

Defining the Correspondences

  1. First, we must find "interest points" within both images: points within a scene which are distinctive enough that they should be identifiable across several different views of the scene. Corners generally have this property, since they tend to be locally unique features, and they can be detected with the Harris interest point detection algorithm, which is discussed in more detail here.
  2. For each image, find a subset of these points which is spread approximately evenly across the image while still favoring points with strong corner responses. This process is referred to in the paper as "adaptive non-maximal suppression", or ANMS; a code sketch of it follows this list.
  3. Define a feature descriptor for each of the remaining points. In my implementation, this involves greyscaling a 40-by-40 pixel patch around each point, resizing it by a factor of 1/5 (using imresize with bicubic interpolation, so each sample is a weighted average of the original pixels) to get an 8-by-8 descriptor, and then normalizing each pixel (subtracting the mean, dividing by the standard deviation) to better tolerate changes in intensity between images.
  4. Finally, determine matches between these features, a matter of n^2 comparisons between all features. Because a match whose distance merely falls below some absolute cutoff could still be spurious, we use Lowe's ratio test: a true correspondence is indicated by a very small ratio between the distances to the closest and second-closest matches. In my code, I accept a match when this 1st/2nd distance ratio is below a threshold of 1/10. A sketch of the descriptor and matching steps also follows this list.
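
As a rough sketch in Python/NumPy (illustrative, not the exact code behind these results), the ANMS step might look like the following. It assumes the Harris detector has already produced an array of corner coordinates and a parallel array of corner strengths; the function name anms, the default num_points, and the robustness constant c_robust are my own illustrative choices.

    import numpy as np

    def anms(coords, strengths, num_points=500, c_robust=0.9):
        """Adaptive non-maximal suppression: keep the num_points corners
        with the largest "suppression radius", i.e. the distance to the
        nearest corner that is significantly stronger. This yields a
        strong, spatially well-spread subset of the detected corners."""
        radii = np.full(len(coords), np.inf)
        for i in range(len(coords)):
            # A corner j suppresses corner i if it is markedly stronger.
            stronger = strengths > strengths[i] / c_robust
            if np.any(stronger):
                # Squared distances; the ranking of radii is unaffected.
                d2 = np.sum((coords[stronger] - coords[i]) ** 2, axis=1)
                radii[i] = d2.min()
        keep = np.argsort(-radii)[:num_points]
        return coords[keep]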
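
Steps 3 and 4 can be sketched in the same way. Here scikit-image's resize stands in for imresize, the interest points are assumed to lie at least 20 pixels from the image border, and the 1/10 ratio threshold appears as ratio=0.1; again, this is a sketch rather than the code used for the results below.

    import numpy as np
    from skimage.transform import resize

    def describe(gray, points, patch=40, scale=5):
        """Cut a patch around each (row, col) point, downsample it, and
        normalize it into a flattened 8x8 descriptor."""
        half = patch // 2
        descriptors = []
        for r, c in points:
            r, c = int(r), int(c)
            window = gray[r - half:r + half, c - half:c + half]
            # Bicubic-style weighted downsampling of the 40x40 patch.
            small = resize(window, (patch // scale, patch // scale),
                           order=3, anti_aliasing=True)
            # Bias/gain normalization for robustness to intensity changes.
            small = (small - small.mean()) / (small.std() + 1e-8)
            descriptors.append(small.ravel())
        return np.array(descriptors)

    def match(desc_a, desc_b, ratio=0.1):
        """Lowe-style ratio test: keep a pairing only when the closest
        match is much closer than the second-closest."""
        matches = []
        for i, d in enumerate(desc_a):
            dists = np.linalg.norm(desc_b - d, axis=1)
            first, second = np.argsort(dists)[:2]
            if dists[first] / dists[second] < ratio:
                matches.append((i, first))
        return matches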

Recovering the Homography

Given any set of perfect correspondences between two images, our goal is then to warp one to match the other to provide the illusion of continuity in the new scene. This is accomplished by finding the homography that maps between the two images: a projective transformation, modeled as a 3x3 matrix with 8 degrees of freedom. The 8 unknowns in this matrix can be solved for as a linear system built from four or more valid correspondences (i.e. four or more pairs of corresponding points, as determined above). This is discussed in detail in section three of Paul Heckbert's Projective Mappings for Image Warping, and that formulation has greatly informed my implementation.
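
As a sketch of that linear solve (following Heckbert's formulation, though not the exact code behind these results), each correspondence (x, y) -> (u, v) contributes two rows to a system in the eight unknowns a through h:

    import numpy as np

    def fit_homography(src, dst):
        """Solve for H (3x3, bottom-right entry fixed to 1) mapping src -> dst.

        src, dst: (n, 2) arrays of matched (x, y) points, n >= 4.
        Expanding [u, v, 1] ~ H [x, y, 1] and clearing the denominator
        gives two linear equations per correspondence."""
        A, b = [], []
        for (x, y), (u, v) in zip(src, dst):
            A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
            A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
            b.extend([u, v])
        # Exactly determined for four points; least squares for more.
        h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
        return np.append(h, 1).reshape(3, 3)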

However, being able to derive the transformation from four correspondences is not enough; we must also be able to find the best four correspondences from which to derive it. This can be accomplished by running the RANSAC algorithm over the set of correspondences derived in the section above: repeatedly choose four correspondences at random, compute the homography they define, and count how many of the remaining correspondences agree with it (its "inliers"), keeping the homography with the largest inlier set.

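A minimal sketch of that loop, reusing the fit_homography function from the previous sketch; the iteration count and inlier threshold here are placeholder values rather than the ones used for these results:

    import numpy as np

    def project(H, pts):
        """Apply homography H to (n, 2) points, returning (n, 2) points."""
        homog = np.column_stack([pts, np.ones(len(pts))])
        mapped = homog @ H.T
        return mapped[:, :2] / mapped[:, 2:3]

    def ransac_homography(src, dst, iters=1000, threshold=2.0, rng=None):
        """Keep the homography, fit to four random correspondences,
        that agrees with the most correspondences overall."""
        rng = np.random.default_rng() if rng is None else rng
        best_H, best_inliers = None, np.zeros(len(src), dtype=bool)
        for _ in range(iters):
            sample = rng.choice(len(src), size=4, replace=False)
            H = fit_homography(src[sample], dst[sample])
            # Inliers: matches whose projection lands near their partner.
            errors = np.linalg.norm(project(H, src) - dst, axis=1)
            inliers = errors < threshold
            if inliers.sum() > best_inliers.sum():
                best_H, best_inliers = H, inliers
        # (One could also refit H from every inlier at this point.)
        return best_H, best_inliers
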
Once the best homography is found, the final panorama is obtained by passing all pixels in one image through the transformation, and overlaying this result with the second image. Voila!
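
In code, that final warp-and-overlay might look like the sketch below (greyscale only, and with no blending where the images overlap). It samples the source image through the inverse homography rather than pushing pixels forward, a common way to avoid holes in the warped result, and reuses the project helper from the RANSAC sketch; as before, this is illustrative rather than the exact code behind these results.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def warp_and_overlay(img_a, img_b, H):
        """Warp greyscale img_a into img_b's frame via homography H
        (mapping a -> b), then paste img_b on top of the warped result."""
        h_a, w_a = img_a.shape
        h_b, w_b = img_b.shape
        # Map img_a's corners forward to size a canvas that holds both images.
        corners = project(H, np.array([[0, 0], [w_a, 0], [w_a, h_a], [0, h_a]], float))
        x_min = int(np.floor(min(corners[:, 0].min(), 0)))
        y_min = int(np.floor(min(corners[:, 1].min(), 0)))
        x_max = int(np.ceil(max(corners[:, 0].max(), w_b)))
        y_max = int(np.ceil(max(corners[:, 1].max(), h_b)))

        # Inverse-map every canvas pixel into img_a and sample bilinearly.
        xx, yy = np.meshgrid(np.arange(x_min, x_max), np.arange(y_min, y_max))
        src = project(np.linalg.inv(H), np.column_stack([xx.ravel(), yy.ravel()]))
        canvas = map_coordinates(img_a, [src[:, 1], src[:, 0]], order=1, cval=0.0)
        canvas = canvas.reshape(xx.shape)

        # Paste img_b at its (shifted) position in the shared canvas.
        canvas[-y_min:h_b - y_min, -x_min:w_b - x_min] = img_b
        return canvas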

Results

Following is a series of results from my program, first from some of my own images, and then the test data. It turns out that I'm extremely bad at taking panoramic photos; almost all of the photos I took seemed to have geometric aberrations that my transformations could not completely rectify. Several of the test panoramas look really good, though there are a few noticeable artifacts (non-overlapping edges, ghosted objects) due to slight movement of the vantage point between images, variation in color between the segments, or changed scene content between images.
November 27th, 2012