Example of local feature matching for images of Notre Dame. 91 points were correctly matched; 29 points were incorrectly matched.
The goal of this assignment was to create a local feature matching algorithm as a simplified version of the SIFT matching pipeline. To do this, we first had to detect interest points using Harris corner detection. We then had to implement a SIFT-like local feature descriptor. Finally, we needed to match the features using the nearest neighbor distance ratio test.
The Harris corner detector algorithm is as follows: compute the image derivatives with Sobel filters, form the outer products of the derivatives at each pixel, and blur those outer products with a Gaussian to obtain the second moment matrix M at each pixel. The cornerness is then R = det(M) - alpha * trace(M)^2, i.e. the difference between the determinant and the (scaled, squared) trace of M at a given pixel. Pixels with high cornerness are then selected as interest points, as described below.
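A minimal MATLAB sketch of this cornerness computation might look like the following, assuming a grayscale image stored as a double array im; the value of alpha is illustrative (the report does not state one), and the Gaussian parameters are the ones reported below for the Notre Dame pair.

% Sketch of the Harris cornerness computation; variable names are illustrative.
sobel_x = fspecial('sobel')';              % 3x3 Sobel approximating the horizontal derivative
sobel_y = fspecial('sobel');               % 3x3 Sobel approximating the vertical derivative
Ix = imfilter(im, sobel_x);
Iy = imfilter(im, sobel_y);

g = fspecial('gaussian', 25, 2.25);        % larger Gaussian used to blur the outer products
Ixx = imfilter(Ix .* Ix, g);
Iyy = imfilter(Iy .* Iy, g);
Ixy = imfilter(Ix .* Iy, g);

alpha = 0.06;                              % assumed value, not taken from the report
cornerness = Ixx .* Iyy - Ixy .^ 2 ...     % det(M)
           - alpha * (Ixx + Iyy) .^ 2;     % minus alpha * trace(M)^2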
I calculated the derivatives using two 3-by-3 Sobel filters. The larger Gaussian I used to blur the outer products was a 25-by-25 Gaussian with a standard deviation of 2.25 (for the Notre Dame pair). I determined the threshold using graythresh, and used this threshold to convert the cornerness values from grayscale to black-and-white (i.e. real-valued to binary). I then ran bwlabel on the binary image to get connected components, and chose the pixel with the maximum cornerness value in each component as a keypoint location.
I ignored any pixels for which a feature_width-by-feature_width window centered
at that pixel would not fit inside the bounds of the image. The confidence value
for each potential feature location is the cornerness value at that point.
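A sketch of this keypoint selection step, continuing from the cornerness map above, might look like the following; graythresh and bwlabel are Image Processing Toolbox functions, the normalization to [0, 1] before graythresh is an assumption, and the exact border check and variable names are illustrative.

% Sketch of keypoint selection from the cornerness map; names are illustrative.
[h, w] = size(cornerness);
feature_width = 16;                                   % assumed descriptor window size

c_norm = (cornerness - min(cornerness(:))) / ...      % scale to [0, 1] so graythresh
         (max(cornerness(:)) - min(cornerness(:)));   % (Otsu's method) can be applied
bw = c_norm > graythresh(c_norm);                     % binary map of strong responses

labels = bwlabel(bw);                                 % connected components
num_components = max(labels(:));
x = zeros(num_components, 1);
y = zeros(num_components, 1);
confidence = zeros(num_components, 1);
for k = 1:num_components
    idx = find(labels == k);
    [confidence(k), best] = max(cornerness(idx));     % strongest pixel in the component
    [y(k), x(k)] = ind2sub([h, w], idx(best));
end

% Discard keypoints whose feature_width-by-feature_width window would fall
% outside the image bounds.
half = feature_width / 2;
keep = x > half & x <= w - half & y > half & y <= h - half;
x = x(keep);  y = y(keep);  confidence = confidence(keep);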
Each feature consists of a 4x4 grid of cells, each feature_width/4 pixels on a side, centered around an interest point. For each cell, an 8-bin histogram is formed, storing the local distribution of gradient orientations.
The histogram is constructed by first calculating the magnitude and direction of the gradient at each pixel in the cell. The gradient magnitudes are then weighted using a feature_width-by-feature_width Gaussian with standard deviation feature_width/2, so that the gradients of pixels closer to the interest point count more heavily than those farther away. Each weighted magnitude is then added to the appropriate bin (each bin spans 45 degrees), and is weighted again using linear interpolation, i.e. multiplied by (1 - d), where d is the normalized distance between the gradient orientation and the center of the bin to which it belongs.
This creates an 8-dimensional vector of weighted sums of gradient magnitudes for each cell, so each feature is a 4x4x8 = 128-dimensional vector. This vector is normalized to unit length, its values are capped at 0.2, and the vector is then normalized again.
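A sketch of the descriptor construction for a single keypoint at integer coordinates (x, y) is shown below, assuming feature_width = 16, that the window fits inside the image, and that gradients are computed with the same Sobel filters as the detector; the bin-assignment details and variable names are illustrative.

% Sketch of the SIFT-like descriptor for one keypoint; names are illustrative.
feature_width = 16;
cell_width = feature_width / 4;
num_bins = 8;
bin_width = 360 / num_bins;                       % 45-degree bins

% Gradients over the whole image (Sobel, as in the detector).
Ix = imfilter(im, fspecial('sobel')');
Iy = imfilter(im, fspecial('sobel'));
mag = sqrt(Ix .^ 2 + Iy .^ 2);
ori = mod(atan2d(Iy, Ix), 360);                   % orientation in [0, 360)

% feature_width-by-feature_width window centered on the keypoint,
% weighted by a Gaussian with sigma = feature_width / 2.
rows = y - feature_width/2 + 1 : y + feature_width/2;
cols = x - feature_width/2 + 1 : x + feature_width/2;
g = fspecial('gaussian', feature_width, feature_width / 2);
wmag = mag(rows, cols) .* g;
wori = ori(rows, cols);

feature = zeros(1, 4 * 4 * num_bins);
for cy = 1:4
    for cx = 1:4
        cr = (cy - 1) * cell_width + (1:cell_width);
        cc = (cx - 1) * cell_width + (1:cell_width);
        cell_mag = wmag(cr, cc);
        cell_ori = wori(cr, cc);

        cell_hist = zeros(1, num_bins);
        for p = 1:numel(cell_mag)
            b = min(floor(cell_ori(p) / bin_width) + 1, num_bins);  % bin this pixel falls in
            center = (b - 0.5) * bin_width;                         % center angle of that bin
            d = abs(cell_ori(p) - center) / bin_width;              % normalized distance to center
            cell_hist(b) = cell_hist(b) + cell_mag(p) * (1 - d);    % linear interpolation weight
        end
        feature(((cy - 1) * 4 + cx - 1) * num_bins + (1:num_bins)) = cell_hist;
    end
end

feature = feature / (norm(feature) + eps);        % normalize
feature = min(feature, 0.2);                      % cap large values at 0.2
feature = feature / (norm(feature) + eps);        % renormalize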
Weighting the magnitudes of the gradients by a Gaussian roughly doubled the number of matches found, while preserving the percentage of good matches (i.e. true positives). Using linear interpolation seemed to decrease the percentage of good matches in the Notre Dame pair (maybe because of the differences in orientation of the interest points?), but seemed to help in other test cases where scale, but not vantage point, changed between images. Normalizing and capping the feature vector considerably increased the percentage of good matches.
Features were matched by computing, for each feature in the first image, the distances to all features in the second image. The ratio between the distances to the first- and second-closest features was used to determine whether the pair was a true match: if the distance to the closest feature divided by the distance to the second-closest was less than 0.75 (for the Notre Dame pair), then the closest feature in the second image was considered a match to the feature under consideration in the first image.
The confidence used for each pair was 1 minus this ratio (i.e. 1 - NNDR); the larger the gap between the distances to the first- and second-closest features, the more confident we are that the closest one is a true match.
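A sketch of this matching step is shown below, assuming the descriptors are stacked row-wise into matrices features1 and features2; pdist2 comes from the Statistics and Machine Learning Toolbox, and sorting each row of the distance matrix is one way to obtain the two nearest neighbors.

% Sketch of nearest neighbor distance ratio matching; names are illustrative.
ratio_threshold = 0.75;                            % value used for the Notre Dame pair

dists = pdist2(features1, features2);              % n1 x n2 Euclidean distances
[sorted_dists, nn_idx] = sort(dists, 2);           % sort each row: nearest neighbors first

nndr = sorted_dists(:, 1) ./ sorted_dists(:, 2);   % nearest / second-nearest distance
is_match = nndr < ratio_threshold;

matches = [find(is_match), nn_idx(is_match, 1)];   % [index in image 1, index in image 2]
confidences = 1 - nndr(is_match);                  % larger gap -> higher confidence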
Below are some more examples of the pipeline. These seem to match better than the Notre Dame pair, most likely because there is less change in orientation. This makes sense, since I didn't account for any changes in orientation in my algorithms.