Note: This session has a complementary soundtrack, because we know you all like Francophilic early 90s avant-pop.
Learning Objectives
Understand the principles of stereo vision.
Use two images taken from slightly different perspectives to find the depth of objects in an image.
Background
Feel free to read and follow the slides at the end of the corresponding stereo deck for more detail on the epipolar geometry described below. PPTX | PDF
A photograph flattens the three-dimensional world into a two-dimensional image plane. Information about the 3D structure of the world, like the size and distance of different objects from the camera, is distorted or lost.
How can we recover this lost information? We might take inspiration from our own visual system. Humans perceive depth using a number of mechanisms, including the fact that our vision is binocular: we have two eyes that see the world from slightly different places.
We can emulate our binocular vision by using two photos taken from slightly different perspectives. How far corresponding points move from one photo to the other will tell us how far away the objects are. Objects closer to the camera will move more.
When we take two images of the same scene from different perspectives, each point in the first photo corresponds to some point along a particular line in the second photo.
These "epipolar lines" are very helpful; they tell us where we should look for corresponding points in our two images. These lines are computed using the parameters of the two cameras, but in this lab we will ignore this process and provide you with images that have already been rectified, meaning that the images have been transformed such that corresponding points lie along horizontal lines across both photos.
Corresponding Points
To find the depth of a point in the image, we want to measure how far that point's projection moves in the sensor plane between our two stereo images. This is called the disparity. Disparity is related to depth by: $$z = \frac{ft}{x_l-x_r},$$ where \(z\) is the depth, \(t\) is the baseline distance between the optical centers of the two cameras, \(f\) is the (shared) focal length of both cameras, and \(x_l-x_r\) is the disparity.
To do this, we need to find the corresponding point in the second image. Since we are working with rectified images, we can restrict our search to the same horizontal line in both images.
Our approach will be to slide a small window across the second image and record the patch that is most similar to the patch in the first image.
The most similar patch contains the corresponding point, and the disparity between the location of these points in the rectified images tells us the depth of the points.
In this dataset, the focal length of the camera is 3740 pixels, and the baseline is 160 mm.
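As a sketch of the disparity-to-depth conversion (Python/NumPy here for illustration, since the lab stencil is MATLAB; the function and constant names below are our own, not part of the stencil):

```python
import numpy as np

# Constants from the handout (names are ours, for illustration)
FOCAL_LENGTH_PX = 3740   # shared focal length, in pixels
BASELINE_MM = 160        # baseline between optical centers, in mm

def disparity_to_depth(disparity):
    """Convert a disparity map (pixels) to depth (mm) via z = f*t/d.
    Zero disparities are mapped to infinite depth to avoid division
    by zero."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = FOCAL_LENGTH_PX * BASELINE_MM / disparity[valid]
    return depth
```

For example, a disparity of 100 pixels would correspond to a depth of 3740 × 160 / 100 = 5984 mm.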
Task
We will implement a block matching algorithm for finding corresponding points in two rectified images. The stencil code has a matrix called disparity, the same size as the input images. For each pixel (y, x) in the first image, we wish to set disparity(y, x) to be the horizontal distance between the pixel in the first image and its corresponding point in the second.
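For orientation, the whole search might be sketched as follows. This is a naive SSD version in Python/NumPy for illustration only; `block_match` and its parameters are our own names, not the stencil's API:

```python
import numpy as np

def block_match(img1, img2, max_disparity=64, half_win=3):
    """Naive SSD block matching on two rectified grayscale images.
    For each pixel (y, x) in the first image, slide a window along the
    same row of the second image, matching I_1(x, y) against
    I_2(x + d, y), and keep the offset d with the lowest cost."""
    h, w = img1.shape
    disparity = np.zeros((h, w), dtype=int)
    for y in range(half_win, h - half_win):
        for x in range(half_win, w - half_win):
            patch = img1[y - half_win:y + half_win + 1,
                         x - half_win:x + half_win + 1].astype(float)
            best_d, best_cost = 0, np.inf
            # only consider offsets whose window stays inside the image
            for d in range(min(max_disparity, w - 1 - half_win - x) + 1):
                cand = img2[y - half_win:y + half_win + 1,
                            x + d - half_win:x + d + half_win + 1]
                cost = np.sum((patch - cand) ** 2)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

A vectorized version (one image-difference per candidate disparity) is much faster, but the triple loop above mirrors the algorithm most directly.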
We've seen a few block matching approaches; let's parameterize them by pixel coordinates \(x,y\) within the window \(W\) and disparity \(d\), and try our old favorites plus some new approaches:
Sum of squared differences: $$\mathrm{SSD}(I_1,I_2) = \sum_{x,y \in W} (I_1(x,y) - I_2(x+d,y) ) ^2$$
Sum of absolute differences ('total variation'): $$\mathrm{SAD}(I_1,I_2) = \sum_{x,y \in W} |I_1(x,y) - I_2(x+d,y)|$$
Rank transform (compares local ranks rather than raw intensities): $$\mathrm{RT}(I_1,I_2) = \sum_{x,y \in W} |r(I_1,x,y) - r(I_2,x+d,y)|$$
$$ r(I,x,y) = \sum_{m,n \in W} [I(m,n) < I(x,y)] $$
Here \(r(I,x,y)\) is a signature of local variation: \(I(x,y)\) is the center pixel of a window, and we count how many neighboring values fall below it.
Census transform: $$\mathrm{CT}(I_1,I_2) = \sum_{x,y \in W} \mathrm{HAMMING}(\mathrm{BITSTRING}(I_1,x,y), \mathrm{BITSTRING}(I_2,x+d,y) )$$
$$ \mathrm{BITSTRING}(I,x,y) = \mathrm{CONCAT}_{m,n \in W}( [I(m,n) < I(x,y)] ) $$
This is like the rank transform, but instead of summing the binary 'less than' indicators from \(m,n\) pixel values in the window \(W\) around \(I(x,y)\), we concatenate them into a bitstring and compare them via Hamming distance.
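The measures above might be sketched like this (Python/NumPy for illustration; the lab itself is in MATLAB, and these helper names are our own):

```python
import numpy as np

def ssd(p, q):
    """Sum of squared differences between two equal-size windows."""
    return np.sum((p.astype(float) - q.astype(float)) ** 2)

def sad(p, q):
    """Sum of absolute differences between two equal-size windows."""
    return np.sum(np.abs(p.astype(float) - q.astype(float)))

def rank(patch):
    """Rank transform r(I, x, y): how many pixels in the window fall
    below the center pixel's value."""
    center = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    return int(np.sum(patch < center))

def hamming(b1, b2):
    """Hamming distance between two equal-length bitstrings."""
    return sum(c1 != c2 for c1, c2 in zip(b1, b2))
```

For the rank transform, two windows are then compared via the absolute difference of their ranks, per the RT formula; for the census transform, two bitstrings are compared via `hamming`.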
Example of bitstring() on a \(3\times3\) window:

 96  73 105
 81  84  79
101  98  84

Produces bitstring: 01011000
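As a sanity check, a minimal sketch of the census bitstring (Python/NumPy, illustrative; `census_bitstring` is our own name) reproduces the example above:

```python
import numpy as np

def census_bitstring(patch):
    """Row-major concatenation of 'pixel < center' bits, with the
    center pixel itself skipped."""
    h, w = patch.shape
    center = patch[h // 2, w // 2]
    return ''.join(str(int(patch[i, j] < center))
                   for i in range(h) for j in range(w)
                   if (i, j) != (h // 2, w // 2))

window = np.array([[96,  73, 105],
                   [81,  84,  79],
                   [101, 98,  84]])
# census_bitstring(window) gives "01011000", matching the handout.
```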
What are the advantages and disadvantages of these new measures compared to the more familiar distances that we have seen?
Stencil code
More experiments (inspiration)
Convert your disparity map to a depth map, and visualize it in 3D using MATLAB. Spin the camera around!
Post-processing: use morphology and/or an edge-aware smoothing filter.
How could we prevent averaging over depth edges? One way is to create a set of nine 'shiftable' windows, where the point of evaluation sits at a different position within each window: the four corners, the four edge centers, and the window center. At each location, we then pick the shifted window (and evaluation point) that minimizes the matching score. The intuition is that the match score will likely be lower when the window does not cross a depth edge.
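A minimal sketch of the shiftable-window idea (Python/NumPy with SSD as the matching score; the function name and signature are illustrative, and bounds checking is left out for brevity):

```python
import numpy as np

def shiftable_window_cost(left, right, y, x, d, half_win=2):
    """Evaluate a (2*half_win+1)^2 SSD window at nine shifted positions
    around (y, x) -- so the evaluation point lands on each corner, edge
    center, and the window center -- and keep the minimum cost. This
    lets the window avoid straddling a depth edge. Assumes all indices
    stay in bounds; matches left (y, x) against right (y, x + d)."""
    best = np.inf
    for oy in (-half_win, 0, half_win):
        for ox in (-half_win, 0, half_win):
            cy, cx = y + oy, x + ox          # shifted window center
            p = left[cy - half_win:cy + half_win + 1,
                     cx - half_win:cx + half_win + 1].astype(float)
            q = right[cy - half_win:cy + half_win + 1,
                      cx + d - half_win:cx + d + half_win + 1]
            best = min(best, float(np.sum((p - q) ** 2)))
    return best
```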
Can we adaptively vary the size of the window based on the image content? What adaptation condition would be appropriate?
Consider a coarse-to-fine pyramid search strategy with increasing image sizes (or decreasing window sizes). This might increase robustness to noise or accelerate the search.
Use your depth map to blur your image! Create depth-dependent kernels to make your own portrait mode.
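One way to sketch depth-dependent blur (Python/NumPy; a simple box blur stands in for a proper lens kernel, and all names here are our own):

```python
import numpy as np

def box_blur(img, k):
    """Box blur with radius k; edge pixels average over whatever part
    of the window is in bounds. k = 0 leaves the image unchanged."""
    if k == 0:
        return img.astype(float)
    h, w = img.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = img[max(0, y - k):y + k + 1,
                            max(0, x - k):x + k + 1].mean()
    return out

def portrait_blur(image, depth, focus_depth, n_bands=3, max_k=2):
    """Quantize |depth - focus_depth| into bands, blur the image once
    per band with increasing radius, and composite per pixel. Band 0
    (in focus) stays sharp."""
    dist = np.abs(depth - focus_depth)
    band = np.minimum((dist / (dist.max() + 1e-9) * n_bands).astype(int),
                      n_bands - 1)
    out = np.zeros(image.shape, dtype=float)
    for b in range(n_bands):
        blurred = box_blur(image, round(max_k * b / max(n_bands - 1, 1)))
        mask = band == b
        out[mask] = blurred[mask]
    return out
```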
Stretch Goal: Dynamic Programming scanline stereo
We saw in class how we can add ordering and uniqueness constraints to our solution by setting up stereo disparity estimation as a dynamic programming problem. This is closely related to the boundary seam creation methods from the textures and seams coursework project.
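One simple variant to start from (a per-scanline Viterbi pass with a linear smoothness penalty, rather than the full ordering/uniqueness formulation from class; Python/NumPy, and all names are illustrative):

```python
import numpy as np

def scanline_dp(left_row, right_row, max_d=3, smooth=0.1):
    """DP over one rectified scanline. Unary cost is the absolute
    intensity difference at each candidate disparity (matching
    left[x] against right[x - d], so d = x_l - x_r); a linear penalty
    discourages disparity jumps between neighboring pixels."""
    n, D = len(left_row), max_d + 1
    cost = np.full((n, D), np.inf)       # inf where x - d is out of bounds
    for x in range(n):
        for d in range(D):
            if x - d >= 0:
                cost[x, d] = abs(float(left_row[x]) - float(right_row[x - d]))
    acc = cost.copy()                    # accumulated cost table
    back = np.zeros((n, D), dtype=int)   # backpointers for the best path
    for x in range(1, n):
        for d in range(D):
            prev = acc[x - 1] + smooth * np.abs(np.arange(D) - d)
            back[x, d] = int(np.argmin(prev))
            acc[x, d] = cost[x, d] + prev[back[x, d]]
    disp = np.zeros(n, dtype=int)
    disp[-1] = int(np.argmin(acc[-1]))
    for x in range(n - 2, -1, -1):       # trace the optimal path back
        disp[x] = back[x + 1, disp[x + 1]]
    return disp
```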
Further reading
Classical: Brown, M. Z., Burschka, D., and Hager, G. D. 2003. Advances in Computational Stereo. IEEE Trans. Pattern Anal. Mach. Intell. 25, 8 (Aug. 2003), 993-1008.
Classical: Scharstein, D. and Szeliski, R. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Comput. Vision 47, 1-3 (Apr. 2002), 7-42.
Submission
Please upload your MATLAB code, input/result images, and any notes of interest as a PDF to Gradescope. Please use writeup.tex for your submission.
Acknowledgements
This lab was developed by the 1290 course staff. Thanks to Andrea Fusiello.