CS 143 / Project 2 / Local Feature Matching

Feature matching refers to the act of recognizing features of the same object across images taken from slightly different viewpoints. The goal of Project 2: Local Feature Matching was to implement an instance of the local feature matching pipeline:

  1. Find a set of distinctive key-points.
  2. Define a region around each key-point.
  3. Extract and normalize the region content.
  4. Compute a local descriptor from the normalized region.
  5. Match local descriptors.

Implementations of this pipeline tend to blur a few of these steps together. In my own implementation, steps (2) and (3) are little more than a matter of defining a parameter, while steps (1), (4), and (5) are of greater interest.

(1) Finding Key-Points

There are a number of different ways to find key points in an image. I elected to use Harris corner detection, largely due to its simplicity.

The Harris corner detection algorithm is based on a somewhat intuitive observation: near an object corner, the image intensity changes sharply no matter which direction you move. One way to find corners, then, is to compute image gradients and look for points where the intensity changes strongly in all directions. A point with a large gradient in only one direction lies on an edge; a point with large gradients in every direction is likely an object corner.
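Below is a minimal sketch of how such a cornerness response can be computed with NumPy and SciPy; the smoothing sigma and the constant k are hypothetical choices, not necessarily the values I used.

    import numpy as np
    from scipy.ndimage import sobel, gaussian_filter

    def harris_response(image, sigma=1.0, k=0.04):
        # Image gradients in x and y.
        Ix = sobel(image, axis=1)
        Iy = sobel(image, axis=0)

        # Entries of the second-moment matrix, averaged over a local window.
        Ixx = gaussian_filter(Ix * Ix, sigma)
        Iyy = gaussian_filter(Iy * Iy, sigma)
        Ixy = gaussian_filter(Ix * Iy, sigma)

        # Cornerness score: large only where the intensity changes strongly
        # in all directions (both eigenvalues of the matrix are large).
        det = Ixx * Iyy - Ixy * Ixy
        trace = Ixx + Iyy
        return det - k * trace ** 2

Key points can then be taken as local maxima of this response above some threshold.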

To the right are the (magnified) results of my own Harris corner detector, run on the left image of the Notre Dame cathedral shown above. The detector tends to fire at corners and edges of the building and, in the center, on the round, paned glass window.

(4) Computing a Local Descriptor

As with detecting key points, there are quite a few ways to compute a local descriptor for a key point. A local descriptor is a vector of data that characterizes the image content around a key point, i.e. information that makes it stand apart from other key points.

For this project, I implemented SIFT (scale-invariant feature transform) descriptors. SIFT descriptors are computed from the intensity gradients within a small window surrounding a point of interest (key point), 16x16 pixels in my implementation. This 16x16 window is subdivided into a 4x4 grid of 4x4-pixel cells; the gradients of the pixels in each cell are "binned" into eight orientation bins, resulting in an 8x4x4 = 128-dimensional descriptor.
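A rough sketch of this binning is below, assuming the gradient magnitude and orientation have already been computed for the whole image (orientations in [0, 2π)); the function and parameter names are my own for illustration, not my exact code.

    import numpy as np

    def sift_descriptor(magnitude, orientation, y, x, window=16, cells=4, bins=8):
        # Grab the 16x16 window of gradient data around the key point
        # (assumes the key point is at least 8 pixels from the image border).
        half = window // 2
        mag = magnitude[y - half:y + half, x - half:x + half]
        ori = orientation[y - half:y + half, x - half:x + half]

        cell = window // cells
        descriptor = []
        for cy in range(cells):
            for cx in range(cells):
                m = mag[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
                o = ori[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
                # Bin each pixel's orientation into one of eight bins,
                # weighted by its gradient magnitude.
                hist, _ = np.histogram(o, bins=bins, range=(0, 2 * np.pi), weights=m)
                descriptor.append(hist)

        descriptor = np.concatenate(descriptor)          # 4 x 4 x 8 = 128 values
        return descriptor / (np.linalg.norm(descriptor) + 1e-10)

Normalizing the final vector makes the descriptor less sensitive to overall contrast changes.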

(5) Matching Local Descriptors

The last step, matching the local descriptors from one image to those in the other, is non-trivial but not especially interesting.

Matching descriptors is a two-step process.

  1. First, compute the "nearest neighbors" of each descriptor in one image among the descriptors of the other. The distance metric could be tailored to the descriptor contents; for simplicity's sake, I opted instead for a generic Euclidean distance. The cost of doing so: a metric that exploits the descriptor contents might allow for better feature matching. The benefit: the type of descriptor can be swapped out without changing this step at all.
  2. Second, perform a "ratio test" by computing the ratio of the distance to the nearest neighbor to the distance to the second-nearest neighbor. The purpose of this test is simple: avoid matching a point that is very similar to many different points in the other image. If a descriptor's best match is barely closer than its second-best, it cannot be matched with any confidence. Both steps are shown in the sketch below.
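A compact sketch of both steps follows, assuming the descriptors from each image are stored as the rows of two arrays; the ratio threshold of 0.8 is a hypothetical value.

    import numpy as np

    def match_features(desc1, desc2, ratio_threshold=0.8):
        matches = []
        for i, d in enumerate(desc1):
            # Euclidean distance from this descriptor to every descriptor
            # in the other image.
            dists = np.linalg.norm(desc2 - d, axis=1)
            nearest, second = np.argsort(dists)[:2]
            # Ratio test: keep the match only if the nearest neighbor is
            # clearly better than the second-nearest.
            if dists[nearest] < ratio_threshold * dists[second]:
                matches.append((i, nearest))
        return matches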

Performance and Implementation

The decision to use the algorithms described above was made early on. Particular details of the implementation, however, made large differences in the performance of this feature-matching pipeline.

The pair of images above illustrates one particular change that produced a nearly 10% improvement in matching accuracy. The first image shows matched points before the change; the second shows them afterwards.

As stated above, a SIFT descriptor is computed from the gradients in a 16x16 window surrounding the point of interest. My early implementation first cropped that window out of the image, then applied filtering operations to the window to compute the gradients. This ordering, however, causes inaccuracies at the edges of each window: the filter must interpolate pixel data beyond the window boundary, using one of several padding strategies, so the computed gradients at those edges are not necessarily equal to the true image gradients. Reversing the order of computation, filtering the whole image first and then cropping windows from the gradient images, resulted in a nearly 10% improvement, to about 87%, shown in the images above.
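The two orderings look roughly like the sketch below (the image and key-point location are stand-ins); only the second avoids the boundary artifacts just described.

    import numpy as np
    from scipy.ndimage import sobel

    image = np.random.rand(256, 256)   # stand-in for a grayscale image
    y, x = 100, 120                    # a hypothetical key point location

    # Crop then filter: gradients near the window edges depend on how the
    # filter pads beyond the crop, so they may not equal the true gradients.
    window = image[y - 8:y + 8, x - 8:x + 8]
    gx_cropped = sobel(window, axis=1)

    # Filter then crop: gradients are computed once over the whole image,
    # so the cropped 16x16 window contains the true gradients everywhere.
    gx_full = sobel(image, axis=1)[y - 8:y + 8, x - 8:x + 8]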