CS 143 / Project 2 / Local Feature Matching

Local feature matching relies on three major steps:

  1. Interest Point Detection
  2. Feature Description
  3. Feature Matching

In step one, we focus on detecting interest points in images. To do this, we employ a simple version of the Harris corner detector. Let I be an image; we calculate derivatives Ix and Iy by convolving the image with derivatives of a Gaussian filter in the x and y directions, respectively. We then generate the structure tensor A as

  A = [ Ix^2   IxIy ]
      [ IxIy   Iy^2 ]

in an elementwise fashion (each entry of A is a per-pixel product of derivatives). The eigenvalues of this matrix represent rates of change: for a corner, we want both eigenvalues to be large, indicating the image is changing rapidly in all directions, which makes the feature notable. To this end, we define a matrix of scores as R = g(Ix^2) * g(Iy^2) - g(IxIy)^2 - alpha * (g(Ix^2) + g(Iy^2))^2, where g is a Gaussian blur and alpha is a weighting for how distinctively corner-like the feature must be (here, we start with alpha = 0.05); this is simply det(A) - alpha * trace(A)^2 computed on the blurred tensor entries. Finally, we detect local maxima of the score, using non-maximum suppression to avoid drawing multiple interest points from the same feature. As a base implementation, this simply thresholds the score image and takes as interest points those pixels whose score is the largest in the 8x8 region around them (on the order of the feature size in this project, 16). This procedure looks as follows:


[Figures: base image; Ix^2; Iy^2; IxIy; cornerness scores; thresholded; after non-maximum suppression]
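As a concrete reference, the sketch below implements this detection step in Python/NumPy under the definitions above. The function names, the SciPy helpers, and the default sigma are assumptions for illustration, not the project's actual code.

```python
# A minimal sketch of the Harris detection step described above, using
# NumPy/SciPy. Function names and the blur sigmas are illustrative
# assumptions, not the project's actual code.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_scores(image, alpha=0.05, sigma=1.0):
    """Cornerness score R for a grayscale image (2D float array)."""
    # Derivatives of a Gaussian: order=(0, 1) differentiates along x,
    # order=(1, 0) along y.
    Ix = gaussian_filter(image, sigma, order=(0, 1))
    Iy = gaussian_filter(image, sigma, order=(1, 0))

    # Blurred entries of the structure tensor A, computed per pixel.
    gIx2 = gaussian_filter(Ix * Ix, sigma)
    gIy2 = gaussian_filter(Iy * Iy, sigma)
    gIxIy = gaussian_filter(Ix * Iy, sigma)

    # R = det(A) - alpha * trace(A)^2, elementwise.
    return gIx2 * gIy2 - gIxIy ** 2 - alpha * (gIx2 + gIy2) ** 2

def interest_points(R, threshold, window=8):
    """Threshold R, then keep only maxima of a window x window region."""
    is_max = R == maximum_filter(R, size=window)
    ys, xs = np.nonzero((R > threshold) & is_max)
    return xs, ys
```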

With our candidate interest points in hand, we turn to feature description. We use a basic version of the SIFT descriptor, which assigns a 128-dimensional feature vector to each interest point, computed from a 16x16 patch around the point. We calculate the image gradient at each pixel of the patch, then weight the gradient magnitudes by a Gaussian centered at the interest point with variance equal to half the feature width (8). The 16x16 region is divided into 16 4x4 cells, and each cell is given a length-8 histogram of gradient directions (bins separated by 45 degrees). Each pixel contributes its weight trilinearly: to the spatial bins to its left and right, the spatial bins above and below it, and the orientation bins on either side of its gradient direction. The feature vectors are then normalized, clamped at 0.2 (to limit the excessive contribution of any single large gradient), and normalized again.
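To make the descriptor concrete, here is a simplified sketch in Python/NumPy. It uses hard binning within each 4x4 cell for brevity, where the full pipeline spreads each sample trilinearly across neighboring spatial and orientation bins; the function name and the interpretation of the Gaussian width are assumptions for illustration.

```python
# A simplified sketch of the descriptor above. Hard binning is shown for
# brevity; the full pipeline instead spreads each sample trilinearly across
# the neighboring spatial and orientation bins. Names are illustrative.
import numpy as np

def sift_descriptor(image, x, y, feature_width=16):
    half = feature_width // 2
    patch = image[y - half:y + half, x - half:x + half].astype(float)

    # Image gradients over the patch, with Gaussian-weighted magnitudes.
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    yy, xx = np.mgrid[-half:half, -half:half] + 0.5
    sigma = feature_width / 2           # "half the feature width"
    mag = mag * np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))

    # One 8-bin orientation histogram per 4x4 cell: 16 cells * 8 bins = 128.
    desc = []
    for cy in range(0, feature_width, 4):
        for cx in range(0, feature_width, 4):
            hist, _ = np.histogram(
                ang[cy:cy + 4, cx:cx + 4], bins=8, range=(0, 2 * np.pi),
                weights=mag[cy:cy + 4, cx:cx + 4])
            desc.append(hist)
    desc = np.concatenate(desc)

    # Normalize, clamp at 0.2, renormalize.
    desc /= np.linalg.norm(desc) + 1e-8
    desc = np.minimum(desc, 0.2)
    return desc / (np.linalg.norm(desc) + 1e-8)
```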

Finally, we implement a basic feature matching procedure. The Euclidean distance between features is calculated, and each feature in the first image is assigned a list of matches and distances in the second image. We then compute the ratio of the second-closest distance to the closest distance: the larger this ratio, the more distinctive the nearest neighbor and the more likely a correct match has been made. This ratio is assigned as the match confidence, and matches are ranked by confidence (and thresholded above a certain confidence level). Our base implementation takes 929 features from image 1 and 760 features from image 2 and finds that 242 matches clear the ratio threshold of 1.2. Of these, 150 are correct and 92 incorrect, for an accuracy of 62%. If we raise the threshold to 1.4, this improves to 86/109 for 79% accuracy; at 1.6, we get 52/62 = 84%. This corroborates the idea that a higher distance ratio is more indicative of a correct match. The top 100 matches appear below:

We see that many of the incorrect points are actually very good matches for each other that could only be disambiguated by spatial positioning: for instance, the incorrect blue point near the bottom sits at the top-left corner of a door in both images, which would be indiscernible from 16x16 image patches alone. Note that the actual match for the left feature is obscured in the right image by the tree. Drawing other comparisons, it is clear most incorrectly matched features are reasonably similar, so the algorithm is doing its job effectively given the limited information we provide it.
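The matching step is compact enough to sketch directly. The NumPy version below assumes the descriptors are stacked as rows of arrays f1 and f2 (names illustrative) and uses confidence = second-closest / closest distance, as described above.

```python
# A minimal sketch of nearest-neighbor matching with the ratio test.
# f1 (n1 x 128) and f2 (n2 x 128) hold one descriptor per row; names and
# the broadcasting approach are illustrative assumptions.
import numpy as np

def match_features(f1, f2, confidence_threshold=1.2):
    # All pairwise Euclidean distances (n1 x n2).
    dists = np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=2)

    # Distances to the closest and second-closest feature in image 2.
    order = np.argsort(dists, axis=1)
    d1 = dists[np.arange(len(f1)), order[:, 0]]
    d2 = dists[np.arange(len(f1)), order[:, 1]]

    # Confidence = second-closest / closest; larger means more distinctive.
    confidence = d2 / (d1 + 1e-8)
    keep = confidence > confidence_threshold

    matches = np.column_stack([np.nonzero(keep)[0], order[keep, 0]])
    return matches[np.argsort(-confidence[keep])]   # ranked by confidence
```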

With this baseline implementation in place, we may experiment with the effect of various parts of the algorithm on matching accuracy. A first interesting question is how much the trilinear weighting in the SIFT descriptor affects the results; we remove it and check the accuracy. Only 130 matches clear the 1.2 threshold, with 79/130 = 61% accuracy. That is, far fewer good matches are found without trilinear interpolation, although the thresholding at the end yields roughly the same accuracy. If we instead compare the top 100, we have 65/100 correct without interpolation and 82/100 correct with it, so the trilinear interpolation in the SIFT pipeline clearly has a dramatic effect on making good matches. The matches without interpolation appear below:

We now play with some other parameters in the interest point generation stage. We may gauge the effect of alpha with a few runs:

  Alpha:                         0    0.01  0.03  0.05  0.07  0.09  0.2
  # of Top 100 Matches Correct:  72   81    84    82    86    81    73

We see the range 0.03 - 0.07 performs best, with higher and lower values reducing the accuracy. Rather than re-setting alpha to 0.07, we will leave it at 0.05 to avoid overfitting to this particular image.

Next we may examine the effect of changing the Gaussian blur widths. First we toggle the derivative of Gaussian variance:

  Variance (fraction of feature_width):  1/16  1/8  1/4  1/2  1
  # of Top 100 Matches Correct:          72    80   82   79   74

A variance on the order of 1/4 the feature size works best, which happens to be the default value used. Next, the structure tensor blur variance:

  Variance (fraction of feature_width):  1/32  1/16  1/8  1/4
  # of Top 100 Matches Correct:          87    90    87   78

It appears a variance on the order of feature_width/16 would improve our results significantly, so we make the change.

Next, we experiment a bit with the window of non-maximum suppression: will allowing closer interest points be better?

  Minimum X,Y Distance Between Maxima
  (fraction of feature_width):           1/8  1/4  1/2  1   2
  # of Top 100 Matches Correct:          93   90   90   86  48

Allowing more points of interest through appears to be preferred, as opposed to finding only dominant maxima. We adopt a 1/8 ratio for the remainder.

We may also toy with the feature size:

  Feature Size:                  16  32  64
  # of Top 100 Matches Correct:  93  96  86

The optimal feature size will depend on the image (we have not implemented sensitivity to scale), but for this image 32 appears to be roughly optimal. Our final matching after parameter tuning is the following, with 96 of the top 100 matches correct:

All of the mistakes are to very similar patches, so we are satisfied with the implementation. We may see how the matching looks on a few other sample images, although their accuracy can only be judged by eye:


Mount Rushmore


Statue of Liberty


Capricho Gaudi

The performance on Mount Rushmore is excellent, with almost all points appearing to match visually. The matcher can be very effective here because the point of view and scale are nearly identical across images. The Statue of Liberty is harder, due to the difference in scale and angle between the images. It does, however, correctly match many of the points on the figure itself, and it is interesting to note that the matches on the pedestal all land on the correct side. The final image does a good job matching building features, but is overly fond of matching random patches of vegetation (well, they are visually similar... some sense of spatial position is important).