CS129 / Project 6 / Automated Panorama Stitching

In this project, I have written an automated image stitching tool, a simplified version of the method from Brown et al., "Multi-Image Matching using Multi-Scale Oriented Patches". The tool is written entirely in JavaScript and can be accessed from here.

Keypoint detection

Keypoints are selected using the Harmonic mean score. Given the auto-correlation matrix \(A\) of the image \[ A=G*\left[\matrix{I_x^2 & I_xI_y\cr I_xI_y & I_y^2}\right] \] where \(I_x\) and \(I_y\) are the image intensity derivatives, and \(G\) is a Gaussian kernel, the Harmonic mean score \(f_{HM}\) is defined as \[ f_{HM}=\frac{Det(A)}{Trace(A)}\]
A point \( (x,y) \) in the image is a corner if \(f_{HM}(x,y)>threshold\).
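
As a concrete illustration, here is a minimal sketch of the score computation in JavaScript. It assumes the gradient images \(I_x\) and \(I_y\) are already available as flat Float32Array buffers; the smooth helper stands in for the Gaussian convolution \(G\) (a single 3×3 binomial pass here for brevity, not necessarily the exact kernel the tool uses):

```javascript
// Stand-in for the Gaussian convolution G: one 3x3 binomial pass.
// src is a flat Float32Array of a w-by-h image; borders are left at zero.
function smooth(src, w, h) {
  const k = [1, 2, 1, 2, 4, 2, 1, 2, 1];        // 3x3 binomial kernel
  const dst = new Float32Array(w * h);
  for (let y = 1; y < h - 1; y++) {
    for (let x = 1; x < w - 1; x++) {
      let s = 0, i = 0;
      for (let dy = -1; dy <= 1; dy++)
        for (let dx = -1; dx <= 1; dx++)
          s += k[i++] * src[(y + dy) * w + (x + dx)];
      dst[y * w + x] = s / 16;
    }
  }
  return dst;
}

// Harmonic-mean corner score from the gradient images ix, iy.
function harmonicMeanScore(ix, iy, w, h) {
  const ixx = new Float32Array(w * h);
  const iyy = new Float32Array(w * h);
  const ixy = new Float32Array(w * h);
  for (let i = 0; i < w * h; i++) {
    ixx[i] = ix[i] * ix[i];
    iyy[i] = iy[i] * iy[i];
    ixy[i] = ix[i] * iy[i];
  }
  // Entries of A = G * [Ixx Ixy; Ixy Iyy]
  const a = smooth(ixx, w, h), b = smooth(iyy, w, h), c = smooth(ixy, w, h);
  const score = new Float32Array(w * h);
  for (let i = 0; i < w * h; i++) {
    const det = a[i] * b[i] - c[i] * c[i];
    const tr = a[i] + b[i];
    score[i] = tr > 0 ? det / tr : 0;           // f_HM = Det(A) / Trace(A)
  }
  return score;
}
```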

Local non-maximal suppression

After computing the corner score for each pixel, only keypoints whose score is a local maximum in a \(3\times3\) neighborhood are preserved; the rest are discarded.
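
A sketch of this step, reusing the score buffer from above (the keypoint record format is my own choice for illustration):

```javascript
// Keep only pixels that are strict maxima of the score in their
// 3x3 neighborhood and above the detection threshold.
function localMaxima(score, w, h, threshold) {
  const keypoints = [];
  for (let y = 1; y < h - 1; y++) {
    for (let x = 1; x < w - 1; x++) {
      const v = score[y * w + x];
      if (v <= threshold) continue;
      let isMax = true;
      for (let dy = -1; dy <= 1 && isMax; dy++)
        for (let dx = -1; dx <= 1; dx++)
          if ((dx || dy) && score[(y + dy) * w + (x + dx)] >= v) {
            isMax = false;
            break;
          }
      if (isMax) keypoints.push({ x: x, y: y, score: v });
    }
  }
  return keypoints;
}
```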

Subpixel precision

Once a location is identified as a keypoint, its position is refined by fitting a quadratic function to the score in a \(3\times3\) neighborhood and finding its maximum. Finite differences are used to approximate the derivatives of the score function. The final keypoint location \(X_m\) is given by \[X_m=X_0-[Hf_{HM}(X_0)]^{-1}\nabla f_{HM}(X_0) \] where \(H\) denotes the Hessian.
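
In 2D this amounts to a 2×2 linear solve. A sketch with central finite differences (bounds handling is omitted for illustration):

```javascript
// Refine a keypoint to subpixel precision via a quadratic fit.
// f is the corner-score image (flat Float32Array of width w);
// (x, y) is the integer keypoint location.
function refineSubpixel(f, w, x, y) {
  const v = (ix, iy) => f[iy * w + ix];
  // Gradient of the score at (x, y), central differences
  const gx = (v(x + 1, y) - v(x - 1, y)) / 2;
  const gy = (v(x, y + 1) - v(x, y - 1)) / 2;
  // Hessian of the score at (x, y)
  const hxx = v(x + 1, y) - 2 * v(x, y) + v(x - 1, y);
  const hyy = v(x, y + 1) - 2 * v(x, y) + v(x, y - 1);
  const hxy = (v(x + 1, y + 1) - v(x + 1, y - 1)
             - v(x - 1, y + 1) + v(x - 1, y - 1)) / 4;
  const det = hxx * hyy - hxy * hxy;
  if (det === 0) return { x: x, y: y };         // degenerate fit, keep X_0
  // X_m = X_0 - H^{-1} * gradient (2x2 inverse written out)
  const ox = -(hyy * gx - hxy * gy) / det;
  const oy = -(hxx * gy - hxy * gx) / det;
  return { x: x + ox, y: y + oy };
}
```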

Adaptive Non-Maximal Suppression (ANMS)

The keypoint detector described above may produce thousands of keypoints. ANMS is used to select a subset of them that has high corner scores and is evenly distributed across the image. ANMS is simple to implement: first a minimum suppression radius is computed for each keypoint, and then all keypoints whose radius is smaller than some selected value are discarded. The minimum suppression radius is given by\[r_i = \min_j |x_i-x_j| \quad \text{s.t.} \quad f_{HM}(x_i)<0.9\,f_{HM}(x_j) \]
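
A sketch of ANMS under these definitions. Here I keep the \(n\) keypoints with the largest radii, which is equivalent to discarding all radii below a cutoff; the quadratic search is fine for a few thousand points:

```javascript
// Adaptive non-maximal suppression: for each keypoint find the distance
// to the nearest sufficiently stronger keypoint (f(x_i) < 0.9 f(x_j)),
// then keep the n points with the largest suppression radii.
function anms(keypoints, n) {
  for (const p of keypoints) {
    let r2 = Infinity;                          // squared suppression radius
    for (const q of keypoints) {
      if (p.score < 0.9 * q.score) {
        const dx = p.x - q.x, dy = p.y - q.y;
        r2 = Math.min(r2, dx * dx + dy * dy);
      }
    }
    p.radius = Math.sqrt(r2);
  }
  return keypoints
    .slice()                                    // do not reorder the input
    .sort((a, b) => b.radius - a.radius)        // largest radius first
    .slice(0, n);
}
```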

Figure: detected keypoints after each stage. All keypoints (n=20715); after local non-maximal suppression (n=2159); after ANMS (n=500).


Descriptors

In my implementation, descriptors are sampled from a square patch centered at the keypoint location; orientation is not considered (all patches are sampled upright). Each descriptor is an \(8\times8\) normalized matrix whose values are intensities sampled from a \(24\times24\) patch in the original image. Normalization is carried out by subtracting the mean intensity from each value and dividing by the standard deviation. Subpixel locations are handled using bilinear interpolation.
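
A sketch of this extraction: 8×8 samples spaced 3 pixels apart cover the 24×24 patch, each sample fetched with bilinear interpolation, and the result normalized. Image bounds checks are omitted for brevity:

```javascript
// Bilinear lookup at a subpixel location (x, y) in a flat grayscale image.
function bilinear(img, w, x, y) {
  const x0 = Math.floor(x), y0 = Math.floor(y);
  const fx = x - x0, fy = y - y0;
  const i = y0 * w + x0;
  return img[i] * (1 - fx) * (1 - fy) + img[i + 1] * fx * (1 - fy)
       + img[i + w] * (1 - fx) * fy + img[i + w + 1] * fx * fy;
}

// Normalized 8x8 descriptor sampled from a 24x24 patch around keypoint kp.
function extractDescriptor(img, w, kp) {
  const d = new Float32Array(64);
  let k = 0;
  // 8x8 samples covering a 24x24 patch => one sample every 3 pixels
  for (let j = 0; j < 8; j++)
    for (let i = 0; i < 8; i++)
      d[k++] = bilinear(img, w, kp.x + (i - 3.5) * 3, kp.y + (j - 3.5) * 3);
  // Normalize to zero mean and unit standard deviation
  let mean = 0;
  for (const v of d) mean += v;
  mean /= 64;
  let varSum = 0;
  for (const v of d) varSum += (v - mean) * (v - mean);
  const std = Math.sqrt(varSum / 64) || 1;      // avoid division by zero
  for (let i = 0; i < 64; i++) d[i] = (d[i] - mean) / std;
  return d;
}
```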

The following image shows an example of descriptors rendered as image patches:

Image matching

Each time a new image is added, its keypoints are detected and its descriptors extracted. Next, the new set of descriptors is compared pairwise with the descriptors of the previously added images. For each image pair, RANSAC is used to estimate a homography that agrees with the largest number of keypoints. If a transformation with a sufficient number of inliers is found, that image pair is considered a match and the transformation is saved.
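
A sketch of the RANSAC loop. The 4-point homography fit uses the standard DLT system solved by Gaussian elimination with \(h_{22}=1\); the iteration count and inlier tolerance are parameters the caller must choose, so this is an illustration, not necessarily the tool's exact parameterization:

```javascript
// Fit a homography to exactly 4 correspondences { a: {x,y}, b: {x,y} }
// by solving the 8x8 DLT system (h22 fixed to 1) with Gauss-Jordan.
function fitHomography(pairs) {
  const M = [];                                 // 8x9 augmented system
  for (const { a, b } of pairs) {
    M.push([a.x, a.y, 1, 0, 0, 0, -a.x * b.x, -a.y * b.x, b.x]);
    M.push([0, 0, 0, a.x, a.y, 1, -a.x * b.y, -a.y * b.y, b.y]);
  }
  for (let c = 0; c < 8; c++) {
    let p = c;                                  // partial pivoting
    for (let r = c + 1; r < 8; r++)
      if (Math.abs(M[r][c]) > Math.abs(M[p][c])) p = r;
    [M[c], M[p]] = [M[p], M[c]];
    if (Math.abs(M[c][c]) < 1e-12) return null; // degenerate sample
    for (let r = 0; r < 8; r++) {
      if (r === c) continue;
      const f = M[r][c] / M[c][c];
      for (let k = c; k < 9; k++) M[r][k] -= f * M[c][k];
    }
  }
  const h = [];
  for (let i = 0; i < 8; i++) h.push(M[i][8] / M[i][i]);
  return h;
}

// Apply homography h (h22 = 1) to a point.
function transfer(h, p) {
  const wZ = h[6] * p.x + h[7] * p.y + 1;
  return { x: (h[0] * p.x + h[1] * p.y + h[2]) / wZ,
           y: (h[3] * p.x + h[4] * p.y + h[5]) / wZ };
}

// RANSAC: repeatedly fit to a random 4-sample, keep the fit with most inliers.
function ransacHomography(matches, iters, tol) {
  let best = { h: null, inliers: [] };
  if (matches.length < 4) return best;
  for (let it = 0; it < iters; it++) {
    const sample = [];
    while (sample.length < 4) {                 // 4 distinct correspondences
      const m = matches[Math.floor(Math.random() * matches.length)];
      if (!sample.includes(m)) sample.push(m);
    }
    const h = fitHomography(sample);
    if (!h) continue;
    const inliers = matches.filter(m => {
      const q = transfer(h, m.a);
      return Math.hypot(q.x - m.b.x, q.y - m.b.y) < tol;
    });
    if (inliers.length > best.inliers.length) best = { h: h, inliers: inliers };
  }
  return best;  // accept the pair as a match if best.inliers.length is large
}
```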

After all images have been processed, several situations may arise: there may be no matching image pairs at all; some images may have been matched to more than one other image; and the matches may form several disjoint subsets (e.g. images 1, 2, and 3 might form one subset, image 4 might be unmatched, and images 5 and 6 might form another subset). In order to handle all these cases and to produce roughly the same result independently of the order in which the images were added, I have come up with the following algorithm (sketched in code after the list):

  • Select the image with the most image matches as the root; if there are several candidates, pick any of them.
  • Create an empty list and add the root image to it.
  • Repeat:
    • Select an image not yet in the list that has the largest number of inliers with any image in the list.
    • If such an image is found, add it to the list and repeat; otherwise stop.
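
A sketch of this selection, assuming a table inliers[i][j] that stores the inlier count of the best homography found for each image pair (0 when the pair did not match):

```javascript
// Greedy selection of the images to stitch. n is the number of images.
function selectImages(n, inliers) {
  // 1. Root: the image with the most matched partners (ties broken arbitrarily).
  let root = 0, bestDeg = -1;
  for (let i = 0; i < n; i++) {
    let deg = 0;
    for (let j = 0; j < n; j++) if (j !== i && inliers[i][j] > 0) deg++;
    if (deg > bestDeg) { bestDeg = deg; root = i; }
  }
  // 2. Grow the list with the strongest remaining connection to it.
  const list = [root];
  const used = new Set(list);
  for (;;) {
    let best = -1, bestCount = 0;
    for (let i = 0; i < n; i++) {
      if (used.has(i)) continue;
      for (const j of list)
        if (inliers[i][j] > bestCount) { bestCount = inliers[i][j]; best = i; }
    }
    if (best < 0) break;                        // nothing connects to the list
    list.push(best);
    used.add(best);
  }
  return list;  // images reachable from the root, in stitching order
}
```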

The idea is that the images with the most matches belong to the largest subset of matched images, and we want to pick this subset to build the panorama; all other subsets and unmatched images are ignored. Starting from the root image, we search one by one for the images that match best with the current panorama list and add them. At the end, all images that can be matched to the root (directly or indirectly) are in the list, and the selection order ensures that, among the possibly many transformations available to map each image to the root image plane, the one with the most inliers is used. Later, the panorama is rendered by stitching only the images that were matched to the root, in the root image plane, which, being the image with the most matches, is intuitively located at the center of the panorama.

Multiscale

In order to make the tool invariant to image scale, I have built a pyramid for each input image. Keypoints are detected at all levels of the pyramid, and descriptors are extracted at the level where the keypoint was detected. This extension is simple to implement and, because the number of keypoints after ANMS stays the same, it does not affect the algorithm's speed after the detection stage. Keypoint detection gets slower as the number of pyramid levels grows, but because the image area is reduced to \(1/4\) at each level, detection at each level requires less time than at the previous one.
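
A sketch of the pyramid construction, reusing the smooth helper from the detection sketch above as the anti-aliasing blur before each 2× downsample (the tool's actual filter may differ):

```javascript
// Image pyramid for multiscale detection. Each level halves the resolution,
// quartering the pixel count; blur before subsampling to reduce aliasing.
function buildPyramid(img, w, h, levels) {
  const pyramid = [{ img: img, w: w, h: h }];
  for (let l = 1; l < levels; l++) {
    const prev = pyramid[l - 1];
    const blurred = smooth(prev.img, prev.w, prev.h);
    const nw = prev.w >> 1, nh = prev.h >> 1;
    const next = new Float32Array(nw * nh);
    for (let y = 0; y < nh; y++)
      for (let x = 0; x < nw; x++)
        next[y * nw + x] = blurred[(2 * y) * prev.w + 2 * x];
    pyramid.push({ img: next, w: nw, h: nh });
  }
  return pyramid;
}
```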

Results

Given datasets



Extra images


Multiscale