Project 2 run on the pre-evaluated Notre Dame pictures. 53 matches were made in total, of which 46 were correct: an accuracy of about 86%.
This project implements a feature matching pipeline, based on a simplified version of the standard SIFT pipeline. A working feature matching program should be able to take two images of the same scene, taken from different perspectives, and identify corresponding points in both images. The pipeline consists of three steps: first, identifying key points in both images; then, for each key point, computing a feature descriptor (a representation of the region around the point); and finally, matching points across the two images based on the proximity of their feature descriptors.
A basic Harris corner detector was implemented. The Harris detector depends on computing the image derivatives (and squaring and blurring them to suppress outliers); I experimented with directed Sobel filters but eventually settled on a simple [-1, 0, 1] filter (and its transpose for the vertical image derivative), as I found that it yielded more accurate points. That is to say: the Sobel filter would give me far more points after thresholding, but visualisations showed that the algorithm would still get hooked on points along edges. The simpler filter returned approximately a quarter as many points as the Sobel filter, but these points were better suited to finding successful results in the pipeline overall. (It should also be noted that the filter I use responds well to thresholding: the current threshold is set quite high, but if more points were truly desired, it could be lowered for a significant gain in points. This was, I found, less the case with the Sobel filter, where increasing the threshold generally did not translate into more accurate points. The blurring effect of the Sobel filter seems to lead to a general flattening of pixel values.)
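For concreteness, here is a minimal sketch of the cornerness computation just described, assuming a greyscale image supplied as a NumPy float array. The function name, the k constant, and the Gaussian sigma are illustrative choices rather than the project's exact values, and non-maximum suppression is omitted for brevity.

import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def harris_corners(image, k=0.04, threshold=0.01):
    # Horizontal derivative via the simple [-1, 0, 1] filter
    # (and its transpose for the vertical derivative), as described above.
    dx = np.array([[-1.0, 0.0, 1.0]])
    Ix = convolve(image, dx)
    Iy = convolve(image, dx.T)

    # Square the derivatives and blur them to suppress outliers.
    Ixx = gaussian_filter(Ix * Ix, sigma=1.0)
    Iyy = gaussian_filter(Iy * Iy, sigma=1.0)
    Ixy = gaussian_filter(Ix * Iy, sigma=1.0)

    # Harris cornerness: det(M) - k * trace(M)^2 at every pixel.
    R = (Ixx * Iyy - Ixy * Ixy) - k * (Ixx + Iyy) ** 2

    # Keep pixels whose response exceeds a fraction of the maximum;
    # a full implementation would also apply non-maximum suppression here.
    ys, xs = np.where(R > threshold * R.max())
    return xs, ys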
The output of the Harris detector visualised on one image.
The region around each point returned by the Harris corner detector is encoded into a 128-dimensional vector, as described by the SIFT algorithm. The region around the point is subdivided into 16 squares; in each square, an 8-bin histogram is built summarising the gradient directions of the pixels within that square. All these histograms are then concatenated into the large feature vector, which is subsequently normalised.
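A minimal sketch of this descriptor follows, assuming precomputed per-pixel gradient magnitude and orientation arrays and a key point lying at least 8 pixels from the image border; the function and argument names are hypothetical.

import numpy as np

def sift_descriptor(magnitude, orientation, x, y):
    # Take the 16x16 patch of gradients centred on the key point.
    mag = magnitude[y - 8:y + 8, x - 8:x + 8]
    ori = orientation[y - 8:y + 8, x - 8:x + 8]  # angles in [0, 2*pi)

    feature = []
    for i in range(4):           # 4x4 grid of 4x4-pixel squares
        for j in range(4):
            m = mag[4 * i:4 * i + 4, 4 * j:4 * j + 4]
            o = ori[4 * i:4 * i + 4, 4 * j:4 * j + 4]
            # 8-bin histogram of directions, weighted by gradient magnitude.
            hist, _ = np.histogram(o, bins=8, range=(0, 2 * np.pi), weights=m)
            feature.extend(hist)

    feature = np.asarray(feature)        # 16 squares * 8 bins = 128 dims
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature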
I did not implement any other feature descriptors, as midterms have taken a toll on my time. It would be interesting to see how different schemes stack up against each other. All I can say about this step is that it is, unfortunately, unreasonably slow. This was also part of the motivation for not using the Sobel filter in the Harris corner detector: by reducing the number of points found, the computation takes less time.
Matching points

We match points using the Nearest Neighbor Distance Ratio (NNDR). For each point in image 1, we calculate the distance from the point's feature descriptor to the descriptors of all the points in image 2. We then examine the ratio between the distance to the closest point in image 2 (in terms of feature descriptor distance) and the distance to the second-closest point. If the ratio is sufficiently low (a threshold that we can manipulate as a parameter), we can safely assume the match is a valid one: the lower the ratio, the higher the confidence that the match is a good one. I ended up setting the threshold at about 0.6, as this was not so low as to eliminate all matches, but was still rather discerning about which matches to accept. A higher threshold means more matches will be made, but less accuracy can be guaranteed.
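A sketch of this test, assuming the descriptors of each image are stacked row-wise into NumPy arrays; the function name and return format are illustrative.

import numpy as np

def match_nndr(features1, features2, ratio_threshold=0.6):
    # features1: (N1, 128) array, features2: (N2, 128) array.
    matches = []
    for i, f in enumerate(features1):
        # Euclidean distance from this descriptor to every descriptor
        # in the second image.
        dists = np.linalg.norm(features2 - f, axis=1)
        nearest, second = np.argsort(dists)[:2]
        # Accept only if the nearest neighbour is clearly better
        # than the runner-up.
        ratio = dists[nearest] / dists[second]
        if ratio < ratio_threshold:
            matches.append((i, nearest, ratio))
    # Most confident (lowest-ratio) matches first.
    return sorted(matches, key=lambda m: m[2])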
Again, due to time constraints I was unable to attempt a different matching scheme. The NNDR test seemed to work well as is, but it is computationally expensive. Ideally, this step would be replaced by a scheme that allows for more geometric reasoning and less computational brute force, but these two aims are at odds with each other. A more geometry-dependent scheme may not lead to a significant enough rise in results to merit its extra cost (although, again, this is pure conjecture).
Running the algorithm on the already established Notre Dame image pair led me to believe something good was happening. As described above, 53 matches were found in total, of which 46 were correct, giving an accuracy of about 86% for this image pair; this was with the corner threshold set at 0.01 and the matching NNDR ratio threshold at 0.6. Rerunning with a lower corner threshold (0.0075) and a higher NNDR ratio led to far more points: 197 in total, of which 166 were correct (84%). Setting the NNDR threshold to 0.8 has significant implications for performance: more matches are found (516), but accuracy drops to 72% in this case. These results confirm the class lecture notes, as well as intuition, that there is a necessary tradeoff between quality and quantity.
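For concreteness, a hypothetical driver tying the sketches above together with the two parameter settings compared in this section; img1/img2 and the precomputed gradient arrays (mag1, ori1, mag2, ori2) are assumed inputs.

import numpy as np

# Stricter corner threshold (0.01); 0.0075 yields far more points.
xs1, ys1 = harris_corners(img1, threshold=0.01)
xs2, ys2 = harris_corners(img2, threshold=0.01)

feats1 = np.array([sift_descriptor(mag1, ori1, x, y) for x, y in zip(xs1, ys1)])
feats2 = np.array([sift_descriptor(mag2, ori2, x, y) for x, y in zip(xs2, ys2)])

# NNDR at 0.6 favours quality; 0.8 trades accuracy for quantity.
matches = match_nndr(feats1, feats2, ratio_threshold=0.6)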
Here follow some more test cases on other image pairs. Sadly there was no ground truth to establish accuracy from, although visualising the 20 most confident matches can reveal something about the effectiveness of the algorithm.
Though the pipeline seemed to work well on the given images with ground truths, generally allowing for accuracies of around 82-86%, we must note that running it on other image pairs may or may not be as accurate. From my findings, some image pairs make for very good comparisons from the get-go; others need some fine-tuning of the parameters. It would be interesting to expand the pipeline with complementary or alternative strategies for each step, to create a more robust and consistently reliable system.