CS 143 / Project 2 / Local Feature Matching

Examples of detected features

The Local Feature Matching project strives to find corresponding points in different images of the same object or scene. Such corresponding points have numerous practical applications, such as the automatic generation of a 3D model of an object from matched points in pairs of images, or object recognition.
In the first stage of the project, the computation of interest points, we look for points in both images for which we will seek correspondences. In particular, we tune the interest point detection so that we find the most distinctive and most discriminable points in the image.
In the second stage of the project we compute descriptors for the interest points, also known as 'features'. Having determined distinctive points in both images, we still face the problem of matching them: even if we found points in both images that refer to the same location on the object of interest, the interest points are useless unless we know which ones correspond to each other. To find these correspondences we assign each point a feature vector computed from its neighborhood in the image.
In the final stage of the project we match interest points from image one to interest points in image two by comparing their features, pairing points whose features are as similar as possible.

Computation of interest points

The interest points, which we will use to find corresponding locations in both images, need to be as discriminative as possible. An interest point that is very similar to its surroundings is not very useful. In particular, we look for points whose neighborhood changes maximally under a small shift, i.e. we select points that are the most dissimilar from their surroundings. To find such points we apply a variant of the Harris corner detector, which measures the 'cornerness' of a point based on the second moment matrix, a second-order approximation of the function describing a point's dissimilarity from shifted copies of its neighborhood.
We first compute the gradients of the image in the x and y directions by convolving it with Gaussian derivative filters. Then we form the entries of the second moment matrix M by taking the per-pixel products of the gradients (Ix^2, Ix*Iy, Iy^2) and smoothing them with a Gaussian blur filter. Lastly, we apply the Harris formula R = det(M) - alpha * trace(M)^2, which captures a property of the matrix's eigenvalues (both must be large at a corner) purely in terms of element-wise products of images, yielding an efficient measure of the 'cornerness' of each point.
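
The sketch below illustrates this pipeline in Python/NumPy; the function name, the Gaussian width sigma, and the constant alpha = 0.06 are illustrative choices rather than the exact values of our implementation.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def harris_cornerness(image, sigma=1.0, alpha=0.06):
        image = np.asarray(image, dtype=float)
        # Gradients via convolution with Gaussian derivative filters
        ix = gaussian_filter(image, sigma, order=(0, 1))
        iy = gaussian_filter(image, sigma, order=(1, 0))
        # Per-pixel products of the gradients, smoothed with a Gaussian blur,
        # give the entries of the second moment matrix M at every pixel
        ixx = gaussian_filter(ix * ix, sigma)
        iyy = gaussian_filter(iy * iy, sigma)
        ixy = gaussian_filter(ix * iy, sigma)
        # Harris response R = det(M) - alpha * trace(M)^2, element-wise
        return ixx * iyy - ixy * ixy - alpha * (ixx + iyy) ** 2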

Optimization by non-maximum suppression

One of the problems with interest points computed from 'cornerness' alone is that we may find several patches with high cornerness in the same neighborhood, which in turn yield similar features and make matching very difficult. To ameliorate this problem we perform local non-maximum suppression: for each cluster of interest points we keep only the one with the highest 'cornerness'. Specifically, we threshold the cornerness map into a binary image and run a connected-components algorithm on it. From each connected region we keep only the single interest point with the highest cornerness. The exact cutoff threshold is subject to manual tuning; a sketch of the procedure follows.
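
A minimal sketch of this suppression step, assuming SciPy's connected-component labelling; threshold is the manually tuned cutoff mentioned above.

    import numpy as np
    from scipy.ndimage import label, maximum_position

    def nonmax_suppress(cornerness, threshold):
        # Threshold the cornerness map into a binary image
        mask = cornerness > threshold
        # Detect connected regions of above-threshold pixels
        labels, n = label(mask)
        if n == 0:
            return np.empty((0, 2), dtype=int)
        # Keep only the strongest point of each connected region
        peaks = maximum_position(cornerness, labels, np.arange(1, n + 1))
        return np.array(peaks, dtype=int)  # one (row, col) per region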

We see in the above images that without local non-maximum suppression we detect many interest points that are too similar to one another, which is likely to lead to mismatches in the matching stage. To improve performance we reduce each cluster of nearby interest points to its local maximum. As the image of the implementation with local non-maximum suppression shows, each local maximum now has only one interest point associated with it.

Extraction of features

Trivial implementation using normalized image patches

For the first, simple implementation of the features we use the normalized patch around the interest point. We can assume that for small changes in the viewing angle of the original object/scene a local patch changes only slightly. Hence, we can expect two local patches around corresponding interest points to look more similar to each other than to other patches.
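
A sketch of this descriptor; the patch size of 16 pixels is illustrative, and boundary handling for points near the image edge is omitted.

    import numpy as np

    def patch_feature(image, row, col, size=16):
        # Cut a size x size patch centered on the interest point
        half = size // 2
        patch = image[row - half:row + half, col - half:col + half].astype(float)
        # Normalize to zero mean and unit length so the descriptor is
        # insensitive to additive and multiplicative brightness changes
        patch = patch - patch.mean()
        return (patch / (np.linalg.norm(patch) + 1e-8)).ravel()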

Examples of normalized local patches extracted from the image

Improved implementation using SIFT features

In the optimized version of our implementation we compute a SIFT feature for each interest point: we partition the neighborhood of each point into a grid of 16 x 16 points, at each of which we measure the local dominant orientation and its strength, i.e. for each point on the grid we compute a vector of some magnitude pointing in a direction in the xy plane. Subsequently we partition the 16 x 16 points into a 4 x 4 grid of cells, and for each cell we compute a histogram of the dominant orientations, weighted by the local magnitudes; concatenating the cell histograms yields the feature vector. In the implementation we first convolve the image in the x and y directions with a Sobel filter to approximate the gradient. We obtain the dominant orientation at each pixel by applying the function 'atan2' to the computed gradients, and we obtain the magnitude (i.e. the length of the vector) by taking the square root of the sum of the squared gradients. We compute these quantities once across the entire image; to obtain a SIFT feature we then simply collect the precomputed values in the point's neighborhood and build the histograms. A sketch of this computation follows the gradient figures below.

Magnitudes of local gradients:

Orientations of local gradients:
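
The following is a minimal sketch of the descriptor computation. The choice of 8 orientation bins per cell (giving a 4 * 4 * 8 = 128-dimensional descriptor) and the final normalization are standard SIFT conventions assumed here rather than details spelled out above.

    import numpy as np
    from scipy.ndimage import sobel

    def image_gradients(image):
        # Sobel approximation of the gradient, computed once per image
        gx = sobel(image.astype(float), axis=1)
        gy = sobel(image.astype(float), axis=0)
        magnitude = np.hypot(gx, gy)      # length of the gradient vector
        orientation = np.arctan2(gy, gx)  # dominant orientation in [-pi, pi]
        return magnitude, orientation

    def sift_feature(magnitude, orientation, row, col, width=16):
        # Collect the precomputed values in a width x width neighborhood
        half = width // 2
        mag = magnitude[row - half:row + half, col - half:col + half]
        ori = orientation[row - half:row + half, col - half:col + half]
        # 4 x 4 grid of cells, one 8-bin orientation histogram per cell,
        # weighted by gradient magnitude -> a 128-dimensional descriptor
        cell = width // 4
        hists = []
        for i in range(4):
            for j in range(4):
                m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
                o = ori[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
                h, _ = np.histogram(o, bins=8, range=(-np.pi, np.pi), weights=m)
                hists.append(h)
        feature = np.concatenate(hists)
        return feature / (np.linalg.norm(feature) + 1e-8)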

Feature matching

Simple nearest neighbor matching

In the simplest implementation of the feature matching algorithm, for each feature in image one we pick the feature in image two with the smallest Euclidean distance. One problem with this approach is that it does not take into account how easily a feature may be confused with a similar feature.
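
A sketch of this brute-force matcher; feats1 and feats2 are assumed to be arrays holding one feature vector per row.

    import numpy as np

    def match_nearest(feats1, feats2):
        # Pairwise Euclidean distances, one row per feature of image one
        d = np.linalg.norm(feats1[:, None, :] - feats2[None, :, :], axis=2)
        # For each feature in image one, pick the closest one in image two
        return list(enumerate(d.argmin(axis=1)))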

Matching performance with local patches and nearest-neighbor matching. Correct matches are indicated by green circles, incorrect matches by red circles. Even though we are not yet using SIFT features, the detection is around 90% accurate. However, the parameters have admittedly been overfitted to this specific image pair and do not generalize well to other images.

Matching based on NNDR

We obtain a slightly better matching algorithm with the nearest neighbor distance ratio (NNDR): instead of minimizing the distance between the current feature and its candidate match, we minimize the ratio between that distance and the distance to the second-closest feature. A match is only accepted when its nearest neighbor is clearly closer than the runner-up, which discards ambiguous matches.
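
A sketch of NNDR matching; the ratio cutoff of 0.8 is an illustrative placeholder, not our tuned parameter.

    import numpy as np

    def match_nndr(feats1, feats2, ratio=0.8):
        d = np.linalg.norm(feats1[:, None, :] - feats2[None, :, :], axis=2)
        matches = []
        for i, row in enumerate(d):
            j1, j2 = np.argsort(row)[:2]   # nearest and second-nearest match
            if row[j1] / row[j2] < ratio:  # keep only unambiguous matches
                matches.append((i, j1))
        return matches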

We observe that, for an admittedly small number of points, we achieve full accuracy with the SIFT descriptor. However, the parameters have been tuned specifically to this image pair and do not generalize well to other images.

Feature matching for other datasets

We see in the example of Mount Rushmore that, with our parameters, most of the features are either not detected or are rejected. While the eye of Thomas Jefferson is matched correctly, the second interest point is completely astray. We conclude that our parameters overfit the Notre Dame image pair and do not generalize well.

For our house the matching works surprisingly well. While we do not have a ground truth to verify our correspondences, most matches look reasonable. The house seems to have local structures resembling those of Notre Dame, which makes the optimized parameters appropriate for this dataset as well.

For the Pantheon the algorithm completely fails, not matching a single pair successfully. The optimized parameters are not appropriate for this image.

We can infer from these examples that our model is currently overfitting the Notre Dame example.

Conclusion

The project has shown that in order to match corresponding points in two images it is necessary not only to choose points with the highest discriminability compared to their surroundings, but also to use features that exhibit the highest possible invariance. Our normalized local patches proved to be a sufficient first approximation, but since they are not rotationally invariant, they perform much worse than the SIFT features, which are histograms of local orientations and are hence more robust to changes in scale and orientation.