Finding correspondences between image scenes is something that we've learned to do exceptionally well - we can recognize the Empire State Building from just about any angle, and synthesizing the two images that our eyes receive into a quasi-3D model is done without any conscious effort. How might we construct a solution to enable a computer to achieve decent performance on this task? The standard methods to this end all follow the same basic pipeline, beautifully illustrated below.
The pipeline architecture has a conceptual elegance, along with some computational advantages in practice. Here you'll get an overview of the function of each stage, along with a discussion of the algorithm I've used to implement that stage in a pipeline of my own construction, and at the end you'll see the results of testing that pipeline on real data.
Identified keypoints are shown in red
The first stage in the pipeline aims to identify locations within an image that are distinct from their surroundings and therefore easily localizable should the image scene undergo some mild transformations. Thinking about three types of small image patches from an image of a stop sign helps illustrate this property: a uniform patch of red within the sign, a patch containing a section of the post, and a patch containing the crossing of the "T". Given just these patches, can you locate them within the original image? For the first, you can't even infer the scale of the patch from its content - it could be any uniform region within the sign. For the second, there is less ambiguity, since we know it must lie along the sign post, but without cues from elsewhere in the scene we cannot pin down exactly where along the post it was taken. The third, however, is totally unambiguous, since there is only one patch in the scene even resembling the "T" patch. This was a relatively easy task - locating patches within the very image they were taken from - and even so, the first two proved difficult. It becomes clear that to cope with translational, rotational, or projective transformations of the scene onto the image plane, and still be able to locate our patch within an image, we need to find patches like the "T".
What is a suitable measure of this property, a certain unambiguity under transformation that is often called "cornerness"? One way is to look at the image gradients - the directions in which image intensity changes fastest, which usually indicate edges in the scene. Uniform patches will have little to no gradient in any direction, patches with a single edge will have a strong gradient normal to that edge but very little parallel to it, and "corner" patches will have strong gradients in more than one direction. This gives us a good baseline for measuring how suitable a location is for use as a keypoint in our pipeline, and the implementation of this idea that I've built here is the Harris Corner Detector, circa 1988.
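To make that intuition concrete, here is a tiny, self-contained sketch that measures gradient energy along x and y for a flat patch, an edge patch, and a corner patch. The synthetic 16x16 patches and the filter sigma are my own illustrative assumptions, not part of the pipeline itself:

```python
# Illustrative sketch: only the "corner" patch has strong gradients in
# more than one direction. Patches and sigma are assumed for demonstration.
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_energy(patch, sigma=1.0):
    # Derivative-of-Gaussian gradients along y (rows) and x (columns).
    iy = gaussian_filter(patch, sigma, order=(1, 0))
    ix = gaussian_filter(patch, sigma, order=(0, 1))
    return np.sum(ix ** 2), np.sum(iy ** 2)

flat = np.ones((16, 16))                            # uniform region of the sign
edge = np.zeros((16, 16)); edge[:, 8:] = 1.0        # vertical edge, like the post
corner = np.zeros((16, 16)); corner[8:, 8:] = 1.0   # two edges meeting, like the "T"

for name, patch in [("flat", flat), ("edge", edge), ("corner", corner)]:
    ex, ey = gradient_energy(patch)
    print(f"{name:6s}  x-energy={ex:7.2f}  y-energy={ey:7.2f}")
# Expected pattern: flat ~ (0, 0), edge ~ (large, 0), corner ~ (large, large).
```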
The principle behind this algorithm is to take an image and, for each pixel, generate a "cornerness score" from the gradients dominating the small neighborhood around that pixel. A quick way to compute the X and Y gradients is to filter with a kernel obtained by convolving a Gaussian with a gradient filter in the corresponding direction. From these gradients one can derive a matrix M describing the self-similarity of the neighborhood around the pixel under translations, and the eigenvalues of this matrix reveal the distribution and strength of the gradients present in the patch. Knowing that the determinant of a matrix is the product of its eigenvalues and its trace is the sum of its eigenvalues, we use the simple formula score = det(M) - K*trace(M)^2 to condense that information into a scalar representing the cornerness of the pixel at the center of the patch. The parameterization K = 0.4 was chosen here for best accuracy. You can see that the image at right contains a great many red keypoints - a subsequent non-maximum suppression step also had to be carried out to thin out the set of keypoints and keep only those with the best score in each small neighborhood. This greatly increases the accuracy of our matches and decreases the computational cost of running the pipeline.
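Here is a condensed sketch of how such a detector can be put together with NumPy and SciPy. It follows the steps described above (derivative-of-Gaussian gradients, the det/trace score, non-maximum suppression), but the specific parameter values - sigmas, window size, the Harris constant, and the threshold - are illustrative assumptions rather than the exact settings of my implementation:

```python
# A condensed sketch of the Harris stage described above; parameter values
# (sigmas, window size, Harris constant, threshold) are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_keypoints(img, sigma=1.0, window=2.0, k=0.04,
                     nms_size=7, rel_threshold=1e-2):
    img = img.astype(np.float64)
    # X and Y gradients via a Gaussian convolved with a gradient filter
    # (scipy's `order` argument selects the derivative per axis).
    ix = gaussian_filter(img, sigma, order=(0, 1))
    iy = gaussian_filter(img, sigma, order=(1, 0))
    # Entries of the self-similarity (structure) matrix M, smoothed over a
    # neighborhood so each pixel summarizes the gradients around it.
    ixx = gaussian_filter(ix * ix, window)
    iyy = gaussian_filter(iy * iy, window)
    ixy = gaussian_filter(ix * iy, window)
    # score = det(M) - K * trace(M)^2, evaluated at every pixel at once.
    # K defaults to 0.04 here, a commonly cited value for the Harris constant.
    score = (ixx * iyy - ixy ** 2) - k * (ixx + iyy) ** 2
    # Non-maximum suppression: keep only local maxima above a relative threshold.
    local_max = score == maximum_filter(score, size=nms_size)
    strong = score > rel_threshold * score.max()
    ys, xs = np.nonzero(local_max & strong)
    return np.stack([xs, ys], axis=1)  # (x, y) keypoint coordinates

# keypoints = harris_keypoints(gray_image)  # gray_image: 2-D intensity array
```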
The second stage of the pipeline takes the keypoints derived in the first stage and is charged with representing those locations of the image in a form that promotes invariance to the various transformations the scene contents might undergo as they are projected onto a different camera. A simple feature might just consist of the image patch surrounding the keypoint. One problem with this choice is that the pixel level contains so little information and context that very small changes in brightness or rotation can render such a naive feature unrecognizable, even when it is taken from the same part of the scene in a different image. For this reason, I have implemented a feature extraction strategy based on the SIFT features described by David Lowe a decade ago; it consists of a 4x4 grid of patches around the keypoint, within which histograms of gradient magnitudes are aggregated over 8 different orientations. By capturing merely a distribution of orientations, we preserve the structural elements of the keypoint's neighborhood without being overly sensitive to pixel-level detail.
The algorithm used to extract these features is simple: filter the entire image with 8 oriented "derivative of Gaussian" filters, and around each keypoint pull out the surrounding 16x16 pixel patch. Within that patch there are 16 patches of 4x4 pixels. Within each of those, construct a histogram of the responses of its pixels to each oriented Gaussian derivative filter, with each pixel's contribution to the histogram weighted by its distance from the keypoint via a Gaussian falloff. We thus derive 16 8-vectors for each keypoint, each of which is a histogram of orientations within a 4x4 patch in the keypoint's neighborhood. These vectors are then concatenated and normalized to produce one 128-dimensional feature for each keypoint.
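A rough sketch of that descriptor computation is shown below. The 8 oriented responses are approximated here by projecting the gradient onto 8 evenly spaced directions, and the sigmas and weighting window are assumed values chosen for brevity, so treat this as the spirit of the stage rather than its exact code:

```python
# A rough sketch of the descriptor stage described above: 8 oriented gradient
# responses, a 16x16 window split into 16 cells of 4x4 pixels, Gaussian
# distance weighting, and a normalized 128-dimensional vector per keypoint.
import numpy as np
from scipy.ndimage import gaussian_filter

def sift_like_descriptors(img, keypoints, sigma=1.0):
    img = img.astype(np.float64)
    ix = gaussian_filter(img, sigma, order=(0, 1))
    iy = gaussian_filter(img, sigma, order=(1, 0))
    # Approximate the 8 oriented derivative-of-Gaussian responses by
    # projecting the gradient onto 8 evenly spaced directions (rectified).
    angles = np.arange(8) * (2.0 * np.pi / 8)
    oriented = np.stack([np.maximum(ix * np.cos(a) + iy * np.sin(a), 0.0)
                         for a in angles])              # shape (8, H, W)
    # Gaussian weight over the 16x16 window, centered on the keypoint.
    yy, xx = np.mgrid[-8:8, -8:8] + 0.5
    weight = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * 8.0 ** 2))
    descriptors = []
    for x, y in keypoints:
        x, y = int(round(x)), int(round(y))
        if y < 8 or x < 8 or y + 8 > img.shape[0] or x + 8 > img.shape[1]:
            continue  # skip keypoints too close to the image border
        patch = oriented[:, y - 8:y + 8, x - 8:x + 8] * weight  # (8, 16, 16)
        # Sum each orientation channel over every 4x4 cell -> 16 cells x 8 bins.
        cells = patch.reshape(8, 4, 4, 4, 4).sum(axis=(2, 4))
        vec = cells.reshape(-1)                         # 128-dimensional
        descriptors.append(vec / (np.linalg.norm(vec) + 1e-12))
    return np.array(descriptors)
```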
The final stage of the pipeline takes a pair of feature sets as input, and the goal is to find the most likely correspondences between features across the sets. This generates a sparse set of correspondences between the two images from which the sets were extracted, enabling further processing to model the scene if one wishes to extend the pipeline. We first need a measure of "distance" between the features - any number of imaginative schemes might be constructed to take advantage of particular properties of the features or the structure of the data, but since our features here are represented as 128-vectors, a simple scheme is to compare them by Euclidean distance.
In my implementation of this stage, we loop through one of the feature sets, finding for each feature the two features from the other set that are closest to it in terms of Euclidean distance. The ratio of these two distances is also calculated as a measure of our confidence in the match - a close tie between the two nearest candidates indicates some ambiguity in those features, and we should be hesitant to declare a certain match. We use these confidences as a final round of vetting, returning only the matches we are most confident about as corresponding features between the two input images.
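A compact sketch of this matching scheme, with the nearest-neighbor distance ratio used as the confidence measure, might look like the following (the num_matches knob and the brute-force distance matrix are my own simplifications):

```python
# A compact sketch of the matching stage: nearest-neighbor search by
# Euclidean distance plus the distance-ratio confidence test.
import numpy as np

def match_features(feats_a, feats_b, num_matches=100):
    # Pairwise Euclidean distances between the two sets of 128-vectors.
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=2)
    # For every feature in A, the indices of its two closest features in B.
    nearest_two = np.argsort(dists, axis=1)[:, :2]
    rows = np.arange(len(feats_a))
    best = dists[rows, nearest_two[:, 0]]
    second = dists[rows, nearest_two[:, 1]]
    # Ratio test: a ratio close to 1 means a near-tie, i.e. an ambiguous match.
    ratios = best / (second + 1e-12)
    keep = np.argsort(ratios)[:num_matches]          # most confident first
    matches = np.stack([keep, nearest_two[keep, 0]], axis=1)
    return matches, 1.0 - ratios[keep]               # (index_a, index_b), confidences

# matches, confidences = match_features(descriptors_1, descriptors_2)
```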
My implementation was tested on a pair of images of the Notre Dame cathedral in France, taken from different angles and distances, and even containing different occlusions in the foreground. The matching accuracy was a solid 94% on the 100 most confident matches, and you can see below that each stage of the pipeline played a key role in exploiting the structure of the scene to pick out reliable matches.