Final Project: Lazy Snapping*

Tristan Hale (thale)

*based on the paper Lazy Snapping, by Yin Li, Jian Sun, Chi-Keung Tang, and Heung-Yeung Shum

The goal of the Lazy Snapping program is to cut out an object from an image with ease. Lazy Snapping is based on a graph cut and uses interactively drawn lines to specify which regions to keep and which to cut out. These are called the foreground and background of the image. A user draws foreground and background lines on a few regions of the image, and the isolated region is calculated and displayed with every new line. After only a few lines, the user should be able to cut out the correct region successfully.

Since the program must work at interactive speeds, a regular graph cut will be too slow. In my experience, using a full resolution graph cut on a regular size image (about 400x600 pixels) took about a minute. Using the Lazy Snapping technique, this could be reduced to less than a second.


In the Lazy Snapping paper, the image is initially segmented using a watershed algorithm. In my implementation, I've simplified this by merely downsampling the image using Matlab's imresize function. After scaling down the image (I settled on a scale of 1/8), it is passed into the following graph cutting algorithm.

We use the max-flow min-cut algorithm, like in the previous graph cut project. However, the values plugged in are changed significantly. Ultimately, we are trying to solve the following minimization problem:

E(X)=\sum_{i\in V}E_1(x_i)+\lambda \sum_{(i,j)\in E}E_2(x_i,x_j)

The first energy, Eis the weights set to each node in the graph. This becomes the T matrix for min-cut. The second energy, E2 is the weights set to each edge in the graph. This becomes the A matrix for min-cut. The lambda is a weight. I found that 10 worked well. I didn't see too much difference as I raised the weight, although above 1,000 seemed too high, and under 10 seemed too low.

To set the T values, which represent the starting source and sink energies of each node (pixel), we use a color value system. First, the average color for both the fore- and background constraining lines are calculated. Then, for each pixel, we calculate the sum of squares distance from the fore- and background average colors. These are called dF and dB. Then, the following values are set for each of the pixels' T values:

E_1(x_i=1)=0,\;\;\;\;\; E_1(x_i=0)=\infty,\;\;\;\;\; \forall i\in F
E_1(x_i=1)=\infty,\;\;\;\;\; E_1(x_i=0)=0,\;\;\;\;\; \forall i\in B
E_1(x_i=1)=\frac{d_i^F}{d_i^F+d_i^B},\;\;\;\;\; E_1(x_i=0)=\frac{d_i^B}{d_i^F+d_i^B},\;\;\;\;\; \forall i\in U

where F is the foreground pixels, B is the background pixels, and U is all unconstrained pixels. This sets the weights for both the source and sink (0 and 1) values for the matrix T.

Now for the A matrix, which represent the cost of cutting along an edge between two nodes (adjacent pixels), we use a color difference equation:
E_2(i,j)=\frac{1}{\epsilon +||C(i)-C(j)||^2}
where C is the color value at that pixel. This finds the SSD between the adjacent pixels' colors and inverts it, which results in a high value when the pixels are similar and a low value when they are different. That means that it will be easy to cut along this edge when the pixels have very different colors, and when the pixels are close in color, it will be much harder to make that cut.
This is altered from the equation provided in the paper. In theirs, they suggest using a 1 on the denominator in place of the epsilon. I found that this made the edge weights too similar. Using just an epsilon value makes for far more drastic differences in edge weights, which greatly improved my results.
The paper also suggests using an indicator function, so that the weight is only included when the two pixels are cast to different categories (0 or 1). This was very strangely documented in the paper. In fact, I'm still not sure I understand exactly what their notation meant or why the it would be necessary. I finally ended up excluding the indicator, and I found that in practice I could achieve good results without it.

After running the solver, we end up with a rough border which is based on the low-resolution grid. As a result, it looks blocky, and it doesn't find good edges:

Since my implementation uses a low-resolution image rather than an image segmented by a watershed, my segmentation is essentially a square grid. Since the watershed algorithm used in the paper accounts for edges, all the low-resolution graph-cutting is handled at a full-resolution pixel scale. In my implementation, the square regions don't take any image information into account, and are essentially arbitrary. As a result, to smooth the edge of the segmentation, a second graph cut must be applied in the "fuzzy" region around the low-resolution border. To do this, I find the region around the border with a radius equal to the grid size. That is, if I downsample the image to 1/8 scale, I find the region within 8 pixels of the low-res border.
With this subset of the image (usually quite a small percent of the total image), I set the borders to be source and sink constraints accordingly. I also add in any user-drawn lines that fall into the region.
I then pass this region with the calculated source and sink lines into the same segmenting algorithm. The only difference is that this time it requires a lot of painful Matlab indexing so that it only computes the region of interest. If the rest of the image is included, run time grows to an unusable length of time. The subset solving, on the other hand, is quite fast. In my implementation, I used bwdist to generate the source and sink lines around the fuzzy region, and this alone was as time-consuming than the solver.

Here's what it looks like after the high-resolution solver:
As you can see, it snuggles right up along the edge, because now it operates at the full pixel resolution.

On a regular size image (around 400x600 pixels), the cut takes between 0.5 and 1 second to compute. Compared to about 1 minute for a full-resolution cut using the same code, this is a huge speed increase which allows for interactive use.
In the paper, they report much better speeds, at about a quarter of a second. I wasn't able to achieve speeds quite this fast. My code is slowed down a lot, since I use a lot of bwdist and imresize to downsample my image and mark constrained regions. The actual solving code is quite fast. In my testing on the puppy image (above), the solver for the low resolution image took an average of 0.04 seconds, the high resolution solver took an average of 0.4 seconds, and the whole segmentation process to 0.7 seconds. That means that the imresize, bwdist, and other "setup" code I was running took roughly 0.3 seconds. Unfortunately, I haven't come up with a workaround for using imresize and and bwdist. I find that in practice, the lag doesn't hinder usability, although I would have been more pleased with total times under a half-second.

Since the initial node weights are based on color, segmentation is very easy on subjects that are a very different color from the background, like the puppy picture. It is also helpful when the fore- and/or background is a solid color. When there is a lot of discontinuous color, it means more lines will be needed to correctly segment the image, like the giraffe and cat pictures below.
It can also be a challenge to separate between the dark underbody of a figure and the dark shadow directly beneath it. This is visible in the golfball picture below. This picture also shows off another problem. Some edges are crisp, like the left side of the ball, whereas others are jaggy, like the ball's right side. This seems to be an effect of the foreground and background being a similar color at regions, which reduces the sharpness of the edge.
The last shortcoming is the fairly obvious alpha matting issue. Since this program just does a boolean matting, there is no smoothed edge alpha values and no good solution for furry or other translucent regions (see the foxes below).

Here are some resulting cuts: