Point correspondence between images (red circles are incorrect matches!)
The goal of this project is to build an efficient feature matching pipeline based on a simplified version of SIFT features. The motivation is to match two images of the same scene using local features. Local features are distinctive across images and robust to occlusion and clutter; they are also efficient to compute, and an image typically yields hundreds of them. Image matching underpins applications such as object recognition, panorama construction, and object tracking.
A local feature matching pipeline includes three major components: interest point detection, feature description, and feature matching.
The implementation broadly follows Lowe's paper "Distinctive Image Features from Scale-Invariant Keypoints".
There are several keypoint detection strategies, such as Harris corners and MSER; this project uses the Harris corner detector. The Harris algorithm finds points where shifting a small window (a patch around the candidate point) in any direction causes a large change in intensity. This change is measured with the sum of squared differences (SSD), and points whose corner response exceeds a threshold are kept as keypoints.
Keypoint extraction using the Harris corner detection algorithm
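As a rough sketch of this step, the corner response can be computed from image gradients with NumPy/SciPy. The function name `harris_corners` and the parameter values (`k`, the Gaussian `sigma`, and the threshold) are illustrative choices, not the project's exact settings:

```python
import numpy as np
from scipy import ndimage

def harris_corners(image, k=0.04, threshold=0.01, sigma=1.0):
    """Sketch of the Harris corner response; parameters are illustrative."""
    image = image.astype(float)
    # Image gradients via Sobel filters
    Ix = ndimage.sobel(image, axis=1)
    Iy = ndimage.sobel(image, axis=0)

    # Entries of the second-moment matrix, smoothed by a Gaussian window
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)

    # Harris corner response: det(M) - k * trace(M)^2
    response = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

    # Keep points whose response exceeds a fraction of the global maximum
    ys, xs = np.where(response > threshold * response.max())
    return xs, ys, response
```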
As the name suggests, non-maximal suppression removes all keypoints that are not local maxima. The idea is to slide a 3x3 window over the image and suppress every keypoint except the one with the maximum confidence in each window. This helps spread keypoints across the image, though we will see that adaptive non-maximal suppression distributes them even more evenly.
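A minimal sketch of this step, assuming the corner responses are stored in a 2D array (the helper name is hypothetical):

```python
import numpy as np
from scipy import ndimage

def non_max_suppression(response, size=3):
    """Keep only responses that equal the maximum of their size x size window."""
    local_max = ndimage.maximum_filter(response, size=size)
    return np.where(response == local_max, response, 0.0)
```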
The Harris Corner Detector provides large number of interest points that has some local non-maximum suppression in a 3x3 grid. But we want the points to be spatially diverse. Adaptive Non-maximal Suppression algorithm developed by Lowe is used to get feature points which are evenly distributed throughout the image. To perform adaptive non-maximal suppression for each interest point we compare the corner strength to all other interest points and we keep track of the minimum distance to a larger magnitude interest point. After we've computed this minimum radius for each point, we sort the list of interest points by descending radius and take the top N. The output is 500 interest points well distributed spatially.
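The following is a simple O(n^2) sketch of the suppression-radius computation; the robustness factor `c_robust` and the function signature are assumptions, not the project's exact code:

```python
import numpy as np

def anms(xs, ys, strengths, num_points=500, c_robust=0.9):
    """Adaptive non-maximal suppression sketch: keep the num_points
    interest points with the largest suppression radii."""
    coords = np.stack([xs, ys], axis=1).astype(float)
    strengths = np.asarray(strengths, dtype=float)
    radii = np.full(len(coords), np.inf)
    for i in range(len(coords)):
        # Points that are (robustly) stronger than point i can suppress it
        stronger = c_robust * strengths > strengths[i]
        if stronger.any():
            dists = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = dists.min()  # distance to nearest stronger point
    # Sort by descending radius and take the top num_points
    return np.argsort(-radii)[:num_points]
```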
After extracting and filtering keypoints, we compute a descriptor vector for each keypoint that is highly distinctive and partially invariant to the remaining variations, such as illumination and 3D viewpoint. I have implemented two kinds of feature descriptors: a SIFT-like descriptor (a histogram of oriented gradients) and a normalized patch around the pixel.
In SIFT, we take a 16x16 region around the keypoint and divide it into 16 cells of 4x4 pixels each. Within each cell we bin the gradient magnitudes by orientation, using 8 bins for 8 orientation ranges. Furthermore, we weight the gradient magnitudes with a Gaussian of sigma equal to half the window size. Concatenating the 16 cell histograms yields a 128-dimensional descriptor for the keypoint. The visualization of the feature descriptor is shown below.
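A simplified sketch of this descriptor is given below. It omits the dominant-orientation alignment and trilinear interpolation of full SIFT, and the bin edges, border handling, and normalization details are assumptions:

```python
import numpy as np
from scipy import ndimage

def sift_descriptor(image, x, y, window=16):
    """Build a 128-D SIFT-like descriptor for the keypoint at (x, y)."""
    half = window // 2
    patch = image[y - half:y + half, x - half:x + half].astype(float)

    # Gradient magnitude and orientation over the 16x16 patch
    gy = ndimage.sobel(patch, axis=0)
    gx = ndimage.sobel(patch, axis=1)
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)  # orientations in [-pi, pi]

    # Weight magnitudes by a Gaussian with sigma = half the window size
    yy, xx = np.mgrid[-half:half, -half:half]
    mag = mag * np.exp(-(xx ** 2 + yy ** 2) / (2 * (window / 2) ** 2))

    desc = []
    for cy in range(0, window, 4):          # 4x4 grid of 4x4-pixel cells
        for cx in range(0, window, 4):
            cell_mag = mag[cy:cy + 4, cx:cx + 4].ravel()
            cell_ori = ori[cy:cy + 4, cx:cx + 4].ravel()
            # 8-bin orientation histogram weighted by gradient magnitude
            hist, _ = np.histogram(cell_ori, bins=8, range=(-np.pi, np.pi),
                                   weights=cell_mag)
            desc.append(hist)
    desc = np.concatenate(desc)             # 16 cells x 8 bins = 128-D
    return desc / (np.linalg.norm(desc) + 1e-8)
```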
This is the simplest descriptor one could think of: a patch around the keypoint that captures local information, and it works reasonably well. I implemented this feature as a baseline against which to compare my SIFT-like feature. The visualization of both features is shown above. For this descriptor I sample an 8x8 grid around the interest point with a spacing of 4 pixels, and normalize each descriptor by subtracting its mean and dividing by its standard deviation.
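A minimal sketch of the normalized-patch descriptor under those assumptions (border handling is ignored for brevity, and the helper name is hypothetical):

```python
import numpy as np

def patch_descriptor(image, x, y, grid=8, spacing=4):
    """Sample an 8x8 grid spaced 4 pixels apart around (x, y),
    then standardize the resulting 64-D vector."""
    half = grid * spacing // 2
    patch = image[y - half:y + half:spacing, x - half:x + half:spacing]
    vec = patch.astype(float).ravel()                # 64-D vector
    return (vec - vec.mean()) / (vec.std() + 1e-8)   # zero mean, unit std
```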
In this part we perform feature matching to find the best correspondences between features in different images. I implemented two stages: Lowe's ratio test, followed by RANSAC to filter outliers.
After computing the feature descriptors, we compare each feature in image1 to every feature in image2 using the sum of squared differences (SSD). For each feature in image1 we find its two nearest neighbours in image2, and accept the nearest one as a match only if the ratio of the nearest SSD to the second-nearest SSD is below a threshold. I used a threshold of 0.5; lowering the threshold improves match quality but reduces the number of matches. Feature matching using the ratio test is shown in the figure below.
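A sketch of the ratio test as described above, assuming descriptors are stored as rows of NumPy arrays (the SSD-ratio formulation and the helper name are my assumptions):

```python
import numpy as np

def match_ratio_test(desc1, desc2, ratio=0.5):
    """Match descriptor rows of desc1 to desc2 using Lowe's ratio test."""
    matches = []
    for i, d in enumerate(desc1):
        ssd = np.sum((desc2 - d) ** 2, axis=1)  # SSD to every feature in image2
        nn = np.argsort(ssd)[:2]                # two nearest neighbours
        # Accept only if the best match is much closer than the second best
        if ssd[nn[0]] / (ssd[nn[1]] + 1e-12) < ratio:
            matches.append((i, nn[0]))
    return matches
```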
Although the ratio test yields a good number of matches, many bad matches remain. RANSAC can be used to filter them out. The algorithm is straightforward: we randomly select four feature pairs and compute a homography from them, then transform all keypoints with this homography and count how many matches are consistent with it (the inliers). The homography with the maximum number of inliers is selected, and matches that do not adhere to it are rejected. This filtering scheme is very robust and removes a significant number of bad matches. Feature matching using RANSAC is shown above.
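A sketch of this loop with a direct-linear-transform homography fit; the iteration count and inlier tolerance are illustrative values, and the inputs are assumed to be (N, 2) arrays of matched point coordinates:

```python
import numpy as np

def compute_homography(src, dst):
    """Direct linear transform: homography from >= 4 point pairs."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 3)  # null-space vector as a 3x3 matrix

def ransac_homography(pts1, pts2, iters=1000, tol=3.0):
    """Keep the homography with the most inliers over random 4-point samples."""
    best_inliers = np.zeros(len(pts1), dtype=bool)
    for _ in range(iters):
        idx = np.random.choice(len(pts1), 4, replace=False)
        H = compute_homography(pts1[idx], pts2[idx])
        # Project pts1 through H and measure reprojection error
        ones = np.ones((len(pts1), 1))
        proj = (H @ np.hstack([pts1, ones]).T).T
        proj = proj[:, :2] / (proj[:, 2:3] + 1e-12)
        err = np.linalg.norm(proj - pts2, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```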
Scale-Invariant Feature Transform (SIFT) has many applications, including object recognition, localization and mapping in robotics, panorama stitching, and object tracking. Adding full scale and orientation invariance to this pipeline would further improve the results.