Note: This session has a complementary soundtrack, because we know you all like Francophilic early 90s avant-pop.
Learning Objectives
Understand the principles of stereo vision.
Use two images taken from slightly different perspectives to find the depth of objects in an image.
Background
Feel free to read and follow the slides at the end of the corresponding stereo deck for more detail on the epipolar geometry described below. PPTX | PDF
A photograph flattens the three-dimensional world into a two-dimensional image plane. Information about the 3D structure of the world, like the size and distance of different objects from the camera, is distorted or lost.
How can we recover this lost information? We might take inspiration from our own visual system. Humans perceive depth using a number of mechanisms, including the fact that our vision is binocular: we have two eyes that see the world from slightly different places.
We can emulate our binocular vision by using two photos taken from slightly different perspectives. How far corresponding points move from one photo to the other will tell us how far away the objects are. Objects closer to the camera will move more.
When we take two images of the same scene from different perspectives, each point in the first photo corresponds to some point along a particular line in the second photo.
These "epipolar lines" are very helpful; they tell us where we should look for corresponding points in our two images. These lines are computed using the parameters of the two cameras, but in this lab we will ignore this process and provide you with images that have already been rectified, meaning that the images have been transformed such that corresponding points lie along horizontal lines across both photos.
Corresponding Points
To find the depth of a point in the image, we want to measure how far that point's projection moves in the sensor plane between our two stereo images. This is called the disparity. Disparity is related to depth by: $$z = \frac{ft}{x_l-x_r},$$ where \(z\) is the depth, \(t\) is the baseline distance between the optical centers of the two cameras, \(f\) is the (shared) focal length of both cameras, and \(x_l-x_r\) is the disparity.
To do this, we need to find the corresponding point in the second image. Since we are working with rectified images, we can restrict our search to the same horizontal line in both images.
Our approach will be to slide a small window across the second image and record the patch that is most similar to the patch in the first image.
The most similar patch contains the corresponding point, and the disparity between the location of these points in the rectified images tells us the depth of the points.
In this dataset, the focal length of the camera is 3740 pixels, and the baseline is 160 mm.
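As a sketch of the disparity-to-depth conversion (Python/NumPy here for illustration, since the lab stencil is MATLAB; the function and constant names below are our own, not part of the stencil):

```python
import numpy as np

# Constants from the handout (names are ours, for illustration)
FOCAL_LENGTH_PX = 3740   # shared focal length, in pixels
BASELINE_MM = 160        # baseline between optical centers, in mm

def disparity_to_depth(disparity):
    """Convert a disparity map (pixels) to depth (mm) via z = f*t/d.
    Zero disparities are mapped to infinite depth to avoid division
    by zero."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = FOCAL_LENGTH_PX * BASELINE_MM / disparity[valid]
    return depth
```

For example, a disparity of 100 pixels would correspond to a depth of 3740 × 160 / 100 = 5984 mm.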
Task
We will implement a block matching algorithm for finding corresponding points in two rectified images. The stencil code has a matrix called disparity, the same size as the input images. For each pixel (y, x) in the first image, we wish to set disparity(y, x) to be the horizontal distance between the pixel in the first image and its corresponding point in the second.
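For orientation, the whole search might be sketched as follows. This is a naive SSD version in Python/NumPy for illustration only; `block_match` and its parameters are our own names, not the stencil's API:

```python
import numpy as np

def block_match(img1, img2, max_disparity=64, half_win=3):
    """Naive SSD block matching on two rectified grayscale images.
    For each pixel (y, x) in the first image, slide a window along the
    same row of the second image, matching I_1(x, y) against
    I_2(x + d, y), and keep the offset d with the lowest cost."""
    h, w = img1.shape
    disparity = np.zeros((h, w), dtype=int)
    for y in range(half_win, h - half_win):
        for x in range(half_win, w - half_win):
            patch = img1[y - half_win:y + half_win + 1,
                         x - half_win:x + half_win + 1].astype(float)
            best_d, best_cost = 0, np.inf
            # only consider offsets whose window stays inside the image
            for d in range(min(max_disparity, w - 1 - half_win - x) + 1):
                cand = img2[y - half_win:y + half_win + 1,
                            x + d - half_win:x + d + half_win + 1]
                cost = np.sum((patch - cand) ** 2)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

A vectorized version (one image-difference per candidate disparity) is much faster, but the triple loop above mirrors the algorithm most directly.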
We've seen a few block matching approaches; let's parameterize them by pixel coordinates \(x,y\) within the window \(W\) and disparity \(d\), and try our old favorites plus some new approaches:
Sum of squared differences: $$\mathrm{SSD}(I_1,I_2) = \sum_{x,y \in W} (I_1(x,y) - I_2(x+d,y) ) ^2$$
Sum of absolute differences ('total variation'): $$\mathrm{SAD}(I_1,I_2) = \sum_{x,y \in W} |I_1(x,y) - I_2(x+d,y)|$$
Rank transform (compares local ranks rather than raw intensities): $$\mathrm{RT}(I_1,I_2) = \sum_{x,y \in W} |r(I_1,x,y) - r(I_2,x+d,y)|$$
$$ r(I,x,y) = \sum_{m,n \in W} [I(m,n) < I(x,y)] $$
Here \(r(I,x,y)\) is a signature of local variation: \(I(x,y)\) is the center pixel of a window, and we count how many neighboring values fall below it.
Census transform: $$\mathrm{CT}(I_1,I_2) = \sum_{x,y \in W} \mathrm{HAMMING}(\mathrm{BITSTRING}(I_1,x,y), \mathrm{BITSTRING}(I_2,x+d,y) )$$
$$ \mathrm{BITSTRING}(I,x,y) = \mathrm{CONCAT}_{m,n \in W}( [I(m,n) < I(x,y)] ) $$
This is like the rank transform, but instead of summing the binary 'less than' indicators from \(m,n\) pixel values in the window \(W\) around \(I(x,y)\), we concatenate them into a bitstring and compare them via Hamming distance.
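The measures above might be sketched like this (Python/NumPy for illustration; the lab itself is in MATLAB, and these helper names are our own):

```python
import numpy as np

def ssd(p, q):
    """Sum of squared differences between two equal-size windows."""
    return np.sum((p.astype(float) - q.astype(float)) ** 2)

def sad(p, q):
    """Sum of absolute differences between two equal-size windows."""
    return np.sum(np.abs(p.astype(float) - q.astype(float)))

def rank(patch):
    """Rank transform r(I, x, y): how many pixels in the window fall
    below the center pixel's value."""
    center = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    return int(np.sum(patch < center))

def hamming(b1, b2):
    """Hamming distance between two equal-length bitstrings."""
    return sum(c1 != c2 for c1, c2 in zip(b1, b2))
```

For the rank transform, two windows are then compared via the absolute difference of their ranks, per the RT formula; for the census transform, two bitstrings are compared via `hamming`.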
Example of bitstring() on a \(3\times3\) window:

 96  73 105
 81  84  79
101  98  84

Produces bitstring: 01011000
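As a sanity check, a minimal sketch of the census bitstring (Python/NumPy, illustrative; `census_bitstring` is our own name) reproduces the example above:

```python
import numpy as np

def census_bitstring(patch):
    """Row-major concatenation of 'pixel < center' bits, with the
    center pixel itself skipped."""
    h, w = patch.shape
    center = patch[h // 2, w // 2]
    return ''.join(str(int(patch[i, j] < center))
                   for i in range(h) for j in range(w)
                   if (i, j) != (h // 2, w // 2))

window = np.array([[96,  73, 105],
                   [81,  84,  79],
                   [101, 98,  84]])
# census_bitstring(window) gives "01011000", matching the handout.
```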
What are the advantages and disadvantages of these new measures compared to the more familiar distances that we have seen?
Stencil code
More experiments (inspiration)
Convert your disparity map to a depth map, and visualize it in 3D using MATLAB. Spin the camera around!
Post-processing: use morphology and/or an edge-aware smoothing filter.
How could we prevent averaging over depth edges? One way is to create a set of nine 'shiftable' windows, where the point of evaluation sits at a different position within each window: the four corners, the four edge centers, and the window center. At each location, we then pick the shifted window (and evaluation point) that minimizes the matching score. The intuition is that the match score will likely be lower when the window does not cross a depth edge.
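A minimal sketch of the shiftable-window idea (Python/NumPy with SSD as the matching score; the function name and signature are illustrative, and bounds checking is left out for brevity):

```python
import numpy as np

def shiftable_window_cost(left, right, y, x, d, half_win=2):
    """Evaluate a (2*half_win+1)^2 SSD window at nine shifted positions
    around (y, x) -- so the evaluation point lands on each corner, edge
    center, and the window center -- and keep the minimum cost. This
    lets the window avoid straddling a depth edge. Assumes all indices
    stay in bounds; matches left (y, x) against right (y, x + d)."""
    best = np.inf
    for oy in (-half_win, 0, half_win):
        for ox in (-half_win, 0, half_win):
            cy, cx = y + oy, x + ox          # shifted window center
            p = left[cy - half_win:cy + half_win + 1,
                     cx - half_win:cx + half_win + 1].astype(float)
            q = right[cy - half_win:cy + half_win + 1,
                      cx + d - half_win:cx + d + half_win + 1]
            best = min(best, float(np.sum((p - q) ** 2)))
    return best
```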
Can we adaptively vary the size of the window based on the image content? What adaptation condition would be appropriate?
Consider a coarse-to-fine pyramid search strategy with increasing image sizes (or decreasing window sizes). This might increase robustness to noise or accelerate the search.
Use your depth map to blur your image! Create depth-dependent kernels to make your own portrait mode.
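One way to sketch depth-dependent blur (Python/NumPy; a simple box blur stands in for a proper lens kernel, and all names here are our own):

```python
import numpy as np

def box_blur(img, k):
    """Box blur with radius k; edge pixels average over whatever part
    of the window is in bounds. k = 0 leaves the image unchanged."""
    if k == 0:
        return img.astype(float)
    h, w = img.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = img[max(0, y - k):y + k + 1,
                            max(0, x - k):x + k + 1].mean()
    return out

def portrait_blur(image, depth, focus_depth, n_bands=3, max_k=2):
    """Quantize |depth - focus_depth| into bands, blur the image once
    per band with increasing radius, and composite per pixel. Band 0
    (in focus) stays sharp."""
    dist = np.abs(depth - focus_depth)
    band = np.minimum((dist / (dist.max() + 1e-9) * n_bands).astype(int),
                      n_bands - 1)
    out = np.zeros(image.shape, dtype=float)
    for b in range(n_bands):
        blurred = box_blur(image, round(max_k * b / max(n_bands - 1, 1)))
        mask = band == b
        out[mask] = blurred[mask]
    return out
```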
Stretch Goal: Dynamic Programming scanline stereo
We saw in class how we can add ordering and uniqueness constraints to our solution by setting up stereo disparity estimation as a dynamic programming problem. This is closely related to the boundary seam creation methods from the textures and seams coursework project.
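One simple variant to start from (a per-scanline Viterbi pass with a linear smoothness penalty, rather than the full ordering/uniqueness formulation from class; Python/NumPy, and all names are illustrative):

```python
import numpy as np

def scanline_dp(left_row, right_row, max_d=3, smooth=0.1):
    """DP over one rectified scanline. Unary cost is the absolute
    intensity difference at each candidate disparity (matching
    left[x] against right[x - d], so d = x_l - x_r); a linear penalty
    discourages disparity jumps between neighboring pixels."""
    n, D = len(left_row), max_d + 1
    cost = np.full((n, D), np.inf)       # inf where x - d is out of bounds
    for x in range(n):
        for d in range(D):
            if x - d >= 0:
                cost[x, d] = abs(float(left_row[x]) - float(right_row[x - d]))
    acc = cost.copy()                    # accumulated cost table
    back = np.zeros((n, D), dtype=int)   # backpointers for the best path
    for x in range(1, n):
        for d in range(D):
            prev = acc[x - 1] + smooth * np.abs(np.arange(D) - d)
            back[x, d] = int(np.argmin(prev))
            acc[x, d] = cost[x, d] + prev[back[x, d]]
    disp = np.zeros(n, dtype=int)
    disp[-1] = int(np.argmin(acc[-1]))
    for x in range(n - 2, -1, -1):       # trace the optimal path back
        disp[x] = back[x + 1, disp[x + 1]]
    return disp
```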
Further reading
Classical: Brown, M. Z., Burschka, D., and Hager, G. D. 2003. Advances in Computational Stereo. IEEE Trans. Pattern Anal. Mach. Intell. 25, 8 (Aug. 2003), 993-1008.
Classical: Scharstein, D. and Szeliski, R. 2002. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Comput. Vision 47, 1-3 (Apr. 2002), 7-42.
Submission
Please upload your MATLAB code, input/result images, and any notes of interest as a PDF to Gradescope. Please use writeup.tex for your submission.
Acknowledgements
This lab was developed by the 1290 course staff. Thanks to Andrea Fusiello.