I initially implemented all the algorithms in as baseline a form as possible. My feature description originally worked much differently than my current implementation: I obtained the contribution to each bin by filtering the image windows with differently oriented filters, then summed all the positive responses for each filter and used that sum as the value for that orientation. There were no contributions spread across multiple bins. With this approach I would get anywhere from 30-50 correct matches in my top hundred. I then changed my implementation to calculate the angle and magnitude with atan2 and set up the multiple-orientation contributions, which improved performance by a few matches.

The biggest factor was changing the parameters in my get_interest_points. The first thing I tried was changing my threshold value. I reduced it by orders of magnitude (~100x) and found that my results jumped to about 60. I then added multiple binning, first spreading contributions to the 2 nearest neighboring bins. Again, this showed slight improvements, but nothing drastic. It did increase the runtime, since it now had to calculate 3x the indices and do 3x the I/O operations; while each is negligible in cost, they add seconds to the runtime because they are performed hundreds of thousands of times. However, it can help when a pixel lies along the boundary between two bins. I then realized that my blurs for interest point selection were very large (sigmas of 5 and 3). When I reduced these, I kept much more of the high frequencies of the images, which let me hone in on the corners better and gave more information to work with. However, when I first cut the values down, the program took an enormous amount of time. I found this was due to a massive number of interest points: with less blurring, points look less like their surroundings, so the intensity differences when shifting away from a point are larger and the corner scores were much higher than before. I had to raise my threshold back up to its previous level to get a reasonable number of interest points (500-900). Yet the interest points proved to be much more distinctive with the new parameters, and I had ~75 correct matches at this point. I then added the 3rd neighboring bin for feature descriptors, which did not produce any new results, although it may be beneficial for other images/parameters. Finally, by fine-tuning the feature binning falloff and the interest point parameters, I was able to obtain 90% correctness for the Notre Dame image pair.
Corner detection was implemented with a Harris corner detector. The image gradients were obtained by filtering the image with the convolution of a Gaussian filter and an appropriately oriented Sobel filter, producing the image derivatives. The "cornerness" of each pixel was calculated by approximating the local change in intensity as a quadratic function, which lets the eigenvalues of the second moment matrix estimate that change. If both eigenvalues are large, the image intensity changes strongly when moving in any direction, making the pixel a good candidate for a corner. Corner scores for points too close to the image border were set to 0, to avoid the hassle of resizing feature windows in the next step.
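A condensed sketch of this step is below; the parameter values (sigma, alpha) and the variable names (img for the grayscale input, feature_width for the descriptor window size) are illustrative assumptions rather than my exact code, and the score uses the standard det(M) - alpha*trace(M)^2 stand-in for the eigenvalue test:

```matlab
% Sketch of the Harris cornerness computation (illustrative parameters).
sigma = 1;
g     = fspecial('gaussian', 2*ceil(3*sigma)+1, sigma);   % small Gaussian
sobel = fspecial('sobel');                                % y-derivative template
dy_f  = conv2(g, sobel,  'same');   % Gaussian convolved with oriented Sobel
dx_f  = conv2(g, sobel', 'same');   % transpose gives the other orientation

Ix = imfilter(img, dx_f, 'symmetric');   % image derivatives
Iy = imfilter(img, dy_f, 'symmetric');

% Entries of the second moment matrix, accumulated with another Gaussian
Ixx = imfilter(Ix.^2,  g, 'symmetric');
Iyy = imfilter(Iy.^2,  g, 'symmetric');
Ixy = imfilter(Ix.*Iy, g, 'symmetric');

% Harris response: det(M) - alpha*trace(M)^2 approximates the eigenvalue test
alpha      = 0.04;
cornerness = Ixx.*Iyy - Ixy.^2 - alpha*(Ixx + Iyy).^2;

% Zero out responses near the border so feature windows stay inside the image
b = feature_width/2;
cornerness([1:b, end-b+1:end], :) = 0;
cornerness(:, [1:b, end-b+1:end]) = 0;
```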
The values were then thresholded using im2bw to produce a binary image in which every pixel above the threshold is 1 and every pixel below it is 0. The threshold was chosen as a percentage of the maximum corner value, which helps select a sensible threshold without knowing the image content. This choice of threshold turned out to be very important. A value that is too low results in a binary image that is mostly white, leaving only a few large connected components to be reduced into points, and therefore few points found. On the other hand, too high a threshold leads to a mostly black image, which also yields few points. An optimal value is one that produces many distinct, small connected components. The right value depends on many factors, including the sizes of the two Gaussians used earlier. I found that I had to drastically raise my threshold (by a factor of 100) when I changed my Gaussians' sigmas from 5 and 3 to 1 and 0.25. Otherwise, with the higher frequencies retained, the corner detector produced an immense number of points and the algorithm took far too long.
From the binary image, connected components were found using the bwconncomp function, which produced the set of objects and their pixel indices in the image. For each object, the pixel with the largest corner value was selected as the representative of that component and marked as an interest point. The more confident corner points are those with a higher corner value.
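A rough sketch of the thresholding and component-reduction steps (the threshold fraction here is an illustrative assumption, not my tuned value):

```matlab
% Binary mask of strong corner responses; threshold is a fraction of the max
thresh_frac = 0.01;
bw = im2bw(cornerness ./ max(cornerness(:)), thresh_frac);

% Reduce each connected component to its single strongest pixel
cc = bwconncomp(bw);
x = zeros(cc.NumObjects, 1);
y = zeros(cc.NumObjects, 1);
confidence = zeros(cc.NumObjects, 1);
for k = 1:cc.NumObjects
    idx = cc.PixelIdxList{k};                    % linear indices of this component
    [confidence(k), best] = max(cornerness(idx));
    [y(k), x(k)] = ind2sub(size(cornerness), idx(best));
end
```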
My features were described in a SIFT-like fashion. Each feature was represented by a 128-dimensional vector, built from 16 gradient histograms over a 4x4 grid of bins in a window around the interest point, each histogram with 8 directions. The direction of the image gradient is found using atan2(dy, dx) at each point, with the image gradients again obtained by filtering with Sobel filters. For each pixel in the window, its angle and magnitude are collected and stored in the histogram: it contributes to the nearest 4 spatial bins, and within those bins to the two nearest orientations. To help with the binning, the get index helper function was used to return the indices to contribute to within the histogram; it localizes the row and column values to the correct bin and finds the neighbors. I did not see drastic improvements upon implementing the nearest-bin contributions, but I did get a few more correct matches when I first implemented nearest-orientation binning. Each additional contribution adds a non-negligible amount of time to the runtime of the program, as each bin index needs to be calculated, range bounded, and updated with the correct values. Because these steps must be performed for every pixel in every feature window, the number of calls adds up to hundreds of thousands (roughly a third of a million). Each of the additional contributions adds about 2 seconds of runtime (despite each calculation taking only a few microseconds). However, with this increase in time we get better feature descriptors and better results. Even so, I was able to obtain 60/100 correct matches with an unoptimized implementation that only binned to the current bin and orientation.
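The sketch below shows the core of the descriptor for a single interest point at row r, column c. For brevity it uses hard binning only (each pixel votes into one spatial bin and one orientation), whereas my implementation spreads the contribution across the neighboring bins and orientations as described above; the 16x16 window and 4x4 grid follow from the 128-dimension description, and the variable names are illustrative.

```matlab
% Gradients of the (assumed grayscale) image via Sobel filters
sobel = fspecial('sobel');
dx = imfilter(img, sobel', 'symmetric');
dy = imfilter(img, sobel,  'symmetric');

half   = 8;                                       % 16x16 window around (r, c)
win_dx = dx(r-half:r+half-1, c-half:c+half-1);
win_dy = dy(r-half:r+half-1, c-half:c+half-1);

mag   = sqrt(win_dx.^2 + win_dy.^2);              % gradient magnitude
theta = mod(atan2(win_dy, win_dx), 2*pi);         % gradient orientation in [0, 2*pi)

hist_bins = zeros(4, 4, 8);                       % 4x4 spatial bins x 8 orientations
for i = 1:16
    for j = 1:16
        bin_r = ceil(i / 4);                      % which spatial bin (rows)
        bin_c = ceil(j / 4);                      % which spatial bin (columns)
        bin_o = min(floor(theta(i, j) / (2*pi/8)) + 1, 8);   % which orientation
        hist_bins(bin_r, bin_c, bin_o) = hist_bins(bin_r, bin_c, bin_o) + mag(i, j);
    end
end
feature = hist_bins(:)';                          % 1x128 descriptor, pre-normalization
```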
After being calculated, each histogram is normalized, its values are capped at 0.2, and it is normalized again. I did experiment with raising the vector to a power less than 1 rather than capping the values, but I found this did not make a significant difference in performance.
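In code, this step is just a few lines per descriptor (the eps term is a small safeguard added here against an all-zero histogram):

```matlab
feature = feature / (norm(feature) + eps);   % normalize to unit length
feature = min(feature, 0.2);                 % cap every entry at 0.2
feature = feature / (norm(feature) + eps);   % renormalize
```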
Features were matched with the ratio test of nearest neighbors. For each point, the distance to every point in the other set was calculated, and the nearest and 2nd nearest neighbors were found. The point was matched with its nearest neighbor, and the confidence in that match was the ratio of the 2nd-nearest-neighbor distance to the nearest-neighbor distance (d2/d1). If the 2nd neighbor is much farther away, we can be more confident that the match is not ambiguous. For my matching, I attempt to find a match for every point in features1 within features2.
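A sketch of this matching loop, assuming features1 and features2 are matrices with one descriptor per row:

```matlab
num1        = size(features1, 1);
matches     = zeros(num1, 2);
confidences = zeros(num1, 1);
for i = 1:num1
    % Euclidean distance from descriptor i to every descriptor in the other image
    diffs = bsxfun(@minus, features2, features1(i, :));
    dists = sqrt(sum(diffs.^2, 2));
    [sorted, order] = sort(dists);
    matches(i, :)   = [i, order(1)];          % match to the nearest neighbor
    confidences(i)  = sorted(2) / sorted(1);  % d2/d1: larger = less ambiguous
end
```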
The Notre Dame images produced very good results, with 90 correct matches. The building's distinct geometry with many corners allowed many features to be found. The algorithm performed well on distinct features like the spires and large building edges. However, it was confused at times by repeating patterns. For instance, the figures above the arches proved difficult, as they all have very similar shape and illumination; the algorithm did not always match the correct figures between the two images, instead matching them to different figures. Repetition was not a problem where the pattern is rotated, however, as with the circular window in the center: the different orientations of the gradients at those points allowed the algorithm to match them more easily despite the repetition.
The Eiffel Tower pair also produced very good results, with 90 correct matches against the ground truth I created. Its distinct geometry with many corners allowed many features to be found. The tower has many oriented surfaces that intersect, producing interest points with many different orientations. Moreover, since the tower narrows as you go up, the size of these intersections decreases as well. Thanks to this effect, even though a pattern of crosses repeats along the length of the tower, the change in scale allows the feature matcher to match the features correctly. The Eiffel Tower pair also highlighted the fact that my feature matcher is not scale-invariant. As demonstrated in the image below, when one of the images was twice the size of the other, the results were drastically different: many of the points did not match correctly, as the pixels within corresponding feature windows are very different. Scaling the images to the same size allowed me to obtain the good results above.
The capitol building illustrates the difficulties the feature matcher can encounter when presented with images containing large swaths of repeated patterns. In this example, the capitol building has countless columns and windows, and all of these areas appear under relatively uniform lighting, color, and orientation. Because of this, the matcher has a hard time with the row of columns it detected, since almost all of the columns look the same. However, it can still match unique points along distinct geometry such as the spire, and it dealt better with the columns in the dome, whose differing orientations help distinguish them.