CS 143 / Project 2 / Local Feature Matching

Zhile Ren (ren)
Department of Computer Science, Brown University
ren -at- cs.brown.edu

October 7, 2013

1 Introduction

In this project, I implement a SIFT-like feature matching algorithm, following chapter 4.1 of [4] and the idea of SIFT [3] by David Lowe.

For a quick example, cd to code/ and run

imname1 = '../data/Notre Dame/921919841_a30df938f2_o.jpg';  
imname2 = '../data/Notre Dame/4191453057_c86028ce1f_o.jpg';  
feature_width = 16;  
radius = 5;  
vis = 1;  
accuracy = 1;  
[x1, y1, x2, y2, matches, confidences] = ...  
sift_like_detector(imname1, imname2, feature_width, radius, vis, accuracy);

The script will visualize the feature correspondences and compute the matching accuracy on this test pair (Figure 1), which yields 83 total good matches and 17 total bad matches.


Figure 1: correspondence (left) and matching accuracy (right) of the example image


2 Interest point detection (get_interest_points.m)

Following the pipeline of the Harris corner detector described in [4] section 4.1.1, we implement

function [x, y] = get_interest_points(im, radius, feature_width)

where im is the grayscale image from imread, radius controls the window size for non-maximum suppression, and feature_width controls the width of the image border that we want to eliminate. The return values x and y are the coordinates of the interest points.

First, we blur the image with a Gaussian filter of size 6 × 6 and σ = 1. Then, using the derivative filters dx = [−1, 0, 1] and dy = dxᵀ, we obtain the Ix and Iy maps of the image, from which we form the Ix², Iy², and IxIy maps. As proposed in [1], we use the harmonic mean

$$ \frac{\det A}{\operatorname{tr} A} = \frac{\lambda_0 \lambda_1}{\lambda_0 + \lambda_1} \qquad (1) $$

which can be computed easily.
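As a concrete illustration, below is a minimal MATLAB sketch of this cornerness computation, assuming the Image Processing Toolbox (fspecial, imfilter); the variable names are illustrative rather than the exact project code. Note that the derivative products must be aggregated over a window before forming the response, since at a single pixel Ix²·Iy² − (IxIy)² is identically zero.

% Minimal sketch of the harmonic-mean cornerness map (equation 1).
g  = fspecial('gaussian', [6 6], 1);     % sigma = 1, 6x6 support
im = imfilter(im, g, 'symmetric');       % blur the input image

dx = [-1 0 1];                           % derivative filters
Ix = imfilter(im, dx,  'symmetric');
Iy = imfilter(im, dx', 'symmetric');     % dy = dx'

% Aggregate the derivative products over a window; without this,
% det(A) = Ix^2*Iy^2 - (IxIy)^2 would vanish at every pixel.
Ix2 = imfilter(Ix.^2,  g, 'symmetric');
Iy2 = imfilter(Iy.^2,  g, 'symmetric');
Ixy = imfilter(Ix.*Iy, g, 'symmetric');

cim = (Ix2.*Iy2 - Ixy.^2) ./ (Ix2 + Iy2 + eps);   % harmonic mean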

Then we perform non-maximum suppression. Using a slightly modified version of [2], we implement the function

function [r, c] = non_maximum(cim, radius, feature_width)

where cim is the harmonic-mean map following equation 1. The mechanism of this function is as follows: for each candidate point, we look at the neighboring window of the given radius, and keep the point only if it indeed attains the maximum value in that window. We also require that the harmonic mean of the point be above the median over all points. Finally, since we are going to extract features with window size feature_width, we discard points that lie within feature_width of the image border. This concludes the interest point selection; a sketch of the suppression logic is given below, and a visualization of the detected interest points is shown in Figure 2.
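The sketch below, in the spirit of Kovesi's nonmaxsuppts [2], implements this logic with ordfilt2 acting as a maximum filter; it is an approximation for illustration, not the verbatim project code.

function [r, c] = non_maximum(cim, radius, feature_width)
    % Gray-scale dilation: each pixel gets the max over its window.
    sze = 2*radius + 1;
    mx  = ordfilt2(cim, sze^2, ones(sze));
    % Keep local maxima whose response is above the median response.
    keep = (cim == mx) & (cim > median(cim(:)));
    % Drop points within feature_width of the border, where a full
    % descriptor patch cannot be extracted.
    keep([1:feature_width, end-feature_width+1:end], :) = false;
    keep(:, [1:feature_width, end-feature_width+1:end]) = false;
    [r, c] = find(keep);
end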



Figure 2: Interest points


If we only threshold the harmonic mean and skip non-maximum suppression, we get more than 380,000 candidate points, which is computationally too expensive. With non-maximum suppression we get about 2,500 points per image. Non-maximum suppression is therefore very helpful in choosing the most informative points.

3 Local feature description (get_features.m)

Given the interest points, the next step is to compute a feature vector for each point. Here we follow the idea of SIFT [3], which is briefly illustrated in Figure 3.



Figure 3: SIFT feature


In our implementation below

function [features] = get_features(im, x, y, feature_width)

features is an n × 128 matrix, where n is the number of interest points. After computing the Ix and Iy maps as before, we loop over all interest points: first we extract the feature_width × feature_width patch centered on the point and weight its Ix and Iy maps with a Gaussian window, so that pixels close to the center contribute more to the feature vector. Then, for each pixel in the patch, the gradient magnitude S(p) and orientation O(p) are computed by equation 2

$$ S(p) = \sqrt{I_x(p)^2 + I_y(p)^2}, \qquad O(p) = \arctan\frac{I_y(p)}{I_x(p)} \qquad (2) $$

A gradient orientation histogram is then computed in each subregion. Figure 3 shows only an 8 × 8 pixel patch and a 2 × 2 descriptor array; in our implementation, a 16 × 16 patch is used, producing a 4 × 4 array of eight-bin histograms, i.e. a 4 × 4 × 8 = 128-dimensional descriptor.
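As an illustration, the sketch below builds the descriptor for a single interest point at (x, y), assuming the Ix and Iy maps and feature_width = 16 are in scope; the final unit-length normalization follows standard SIFT practice, and the variable names are illustrative.

half = feature_width/2;
gx = Ix(y-half+1:y+half, x-half+1:x+half);     % patch gradients
gy = Iy(y-half+1:y+half, x-half+1:x+half);

g = fspecial('gaussian', feature_width, feature_width/2);
S = sqrt(gx.^2 + gy.^2) .* g;                  % weighted magnitudes (eq. 2)
O = atan2(gy, gx);                             % orientations in (-pi, pi]
bin = min(floor((O + pi) / (2*pi/8)) + 1, 8);  % quantize into 8 bins

feat = zeros(4, 4, 8);
for i = 1:feature_width
    for j = 1:feature_width
        ci = ceil(i/4); cj = ceil(j/4);        % which 4x4 cell (i,j) falls in
        feat(ci, cj, bin(i,j)) = feat(ci, cj, bin(i,j)) + S(i,j);
    end
end
feat = feat(:)';                               % flatten to 1 x 128
feat = feat / (norm(feat) + eps);              % normalize to unit length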

4 Feature matching (match_features.m)

Now that every interest point has a feature vector, we can match the points with simple techniques such as a nearest-neighbor classifier. The distance between two points is defined as the L2 norm of the difference of their feature vectors. After computing the minimum distance from each point in image 1 to the points in image 2, we keep only those matches whose distance is smaller than the mean of all the minimum distances. We also compute the nearest neighbor distance ratio [1] as

$$ \mathrm{NNDR} = \frac{d_1}{d_2} \qquad (3) $$

where d1 and d2 are the nearest and second-nearest neighbor distances; we manually choose 0.75 as the threshold. Intuitively, the smaller this ratio is, the more confident we can be about the match.
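A minimal sketch of this matching step is given below, assuming feat1 (n1 × 128) and feat2 (n2 × 128) from get_features; the structure is illustrative, and the actual match_features may differ in details.

function [matches, confidences] = match_features(feat1, feat2)
    n1 = size(feat1, 1);
    nn = zeros(n1, 1); d1 = zeros(n1, 1); nndr = zeros(n1, 1);
    for i = 1:n1
        % L2 distances from feature i to all features in image 2.
        d = sqrt(sum((feat2 - repmat(feat1(i,:), size(feat2,1), 1)).^2, 2));
        [ds, idx] = sort(d);
        nn(i) = idx(1);                  % index of the nearest neighbor
        d1(i) = ds(1);                   % nearest-neighbor distance
        nndr(i) = ds(1) / (ds(2) + eps); % equation 3
    end
    % Keep matches below the mean minimum distance and below the
    % manually chosen 0.75 NNDR threshold.
    keep = (d1 < mean(d1)) & (nndr < 0.75);
    matches = [find(keep), nn(keep)];
    confidences = nndr(keep);
end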

5 (a bit) more implementations

A user interface is provided for labeling pixel correspondences between a pair of images:

 function construct_gt(imname1, imname2, add)

A snapshot is shown in Figure 4.



Figure 4: interface for constructing ground truth


Here we manually labeled only three pairs of images:

 Capricho Gaudi/36185369_1dcbb23308_o.jpg to Capricho Gaudi/3689167599_fcc448cc47_o.jpg  
 House/IMG_0485.JPG to /House/IMG_0488.JPG  
 Pantheon Paris/6005180348_c465ccf905_o.jpg to Pantheon Paris/4872728253_3e73bd436b_o.jpg
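For reference, a labeling loop of this kind can be built on MATLAB's ginput, as in the hypothetical sketch below; the actual construct_gt (including its add argument) may differ, and the output file name is illustrative.

im1 = imread(imname1); im2 = imread(imname2);
figure(1); imshow(im1); figure(2); imshow(im2);
x1 = []; y1 = []; x2 = []; y2 = [];
while true
    figure(1); [xa, ya] = ginput(1);    % click a point in image 1
    if isempty(xa), break; end          % press Enter to finish
    figure(2); [xb, yb] = ginput(1);    % click the matching point in image 2
    x1(end+1,1) = xa; y1(end+1,1) = ya;
    x2(end+1,1) = xb; y2(end+1,1) = yb;
end
save('ground_truth.mat', 'x1', 'y1', 'x2', 'y2');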

6 More examples and discussions

feature_width turns out to be a really important parameter to tune. If we use 48 instead of 16, we get 94 total good matches and 6 total bad matches on the Notre Dame pair, about a 10% increase in accuracy. The radius matters as well: decreasing it yields more interest point candidates, which can potentially improve performance. With radius = 8 instead of 5, we get 73 good matches and 27 bad matches, a 10% decrease in accuracy.

We also tested the implemented SIFT-like detector on other image pairs; the results are shown below. It is clear that when there is a large change in viewpoint or scale, the performance of our implementation is poor.

6.1 Capricho Gaudi


31 total good matches, 29 total bad matches


6.2 House


44 total good matches, 20 total bad matches


6.3 Pantheon Paris


30 total good matches, 71 total bad matches


6.4 Episcopal Gaudi



6.5 Mount Rushmore



6.6 Sacre Coeur



6.7 Sleeping Beauty Castle Paris


6.8 Statue of Liberty



References

[1]   M. Brown, R. Szeliski, and S. Winder. Multi-image matching using multi-scale oriented patches. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 510–517. IEEE, 2005.

[2]   P. D. Kovesi. MATLAB and Octave functions for computer vision and image processing. Centre for Exploration Targeting, School of Earth and Environment, The University of Western Australia. Available from: <http://www.csse.uwa.edu.au/~pk/research/matlabfns/>.

[3]   D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.

[4]   R. Szeliski. Computer vision: algorithms and applications. Springer, 2011.