Steve Gomez (steveg) . February 25, 2010

In this project, we find best completions for holes in photographs using other photos gathered from the internet. This is a simplified version of "Scene Completion Using Millions of Photographs", by Hays and Efros, in SIGGRAPH 2007. We are given the 120 best matching scenes from the internet-scale data set. We need to use local context matching to find a best alignment and a best graphcut to composite and blend our images.

Matching local context

I match local context with template matching in the signal domain, using multiplication on Fourier transforms as an optimization over convolution in the spatial domain. More on this later. We will look for areas in a matched photo that look like the area around the hole we are trying to fill. This is relatively straight-forward: if the local context matches well, the pixels are likely to 'make sense' when we transplant them to our missing region.

The local context mask I use is a dilation of the hole mask, minus the hole. So we have a 'ribbon' of pixels around the border of the cut that we will try to match. The ribbon is also multiplied by a distance transform of its pixels to the original mask, and weighted such that pixels closest to the hole influence the match more (with a linear fall off). Our matching score is based on the sum of squared difference (SSD), weighted by our template, in each color channel. We write this out as: sum( M.*Ai.^2 - 2*Ai.*B + B.^2 ), where M is the template mask, Ai is the mask 'window' over the compared scene at offset i, and B is the mask times (context in the original image).

If we write this instead as: sum(M.*Ai.^2) - sum(2*Ai.*B) + sum(B.^2), then the first two terms can be written as correlations. These will produce matrices that we can subtract to compute the "difference" in the SSD. I use Matlab's fft2 instead of performing sliding window correlations. Because convolution in the spatial domain is equivalent to multiplication in the signal domain, I piece-wise multiply my ffts of A and M, A and B, and take their inverse to get back the correlation matrices I wanted. Noteworthy, working with the fft requires us to pad out the matrices to avoid the problems at the border when we filter, so I use Matlab's built-in fft padding (for the two matrices I want to convolve, I pad each to be their combined size), then crop back afterwards. The product of all of this, once we compute the SSD matrix, is a heat map of where the template scored well with our local context -- it tells us the best places to "fill the hole" based on our mask.

Scoring matches

I tried a few things in scoring my matches. For the alignment, I compute the SSD (as described above) in the signal domain using the Fourier transform of our images/convolution kernels. When we're finished, we want to look for the minimum value in our result matrix, which corresponds to the window (mask) position with the fewest color differences in each channel. This is the basis for our score of this match: its best alignment score.

The alignment map is also weighted in my implementation using a distance transform to penalize alignments farther from the original masked region (if we scale the mask to the same dimensions as the heat map). The idea here is: if we have roughly similar scenes, the best match for a hole in one image is likely to occur in the same region of the other image. This tries to address the issue in the scene completion below, where we complete the right side of the street from another scene's left side, so the perspective is all wrong:

 Far template dist. issue Diff. depth of field issue

The penalty is assigned by taking a distance transform of the image pixels, normalizing their distance so that the farthest pixels in an image are 1 unit away (and pixels in the mask are 0), then multiply this distance by some penalty amount (20%) in the SSD score.

Another problem I tried to address was that regions with well-matching colors can still be jarring if the sharpness is wrong. This happened in a few of my images, where the matched region was clearly out of focus but transplanted into a focused image context. An example of this is shown above (center-bottom is out of focus). I attempted to characterize the density of high-frequency features, as a sharpness score under the masked region, and to penalize alignments that have very different densities from the source. I did this in a simple way by summing the gradient magnitudes under the mask. In the final version, I don't bother to compute this score, because it had a low weight (still shouldn't domain color constraints) and wasn't completely reliable, but a better metric could certainly be worth investigating.

Using graphcut

We'd like to disguise the seam from our image editing to make the photo as natural looking as possible. I use a graph cut when I transplant pixels from a match into the original image to choose a least salient seam.

For this project, I modified my graphcut arc cost to consider both differences in color intensity and the color gradient. I use the improved arc cost metric explained in "Graphcut Textures: Image and Video Synthesis Using Graph Cuts", by Kwatra et al, in SIGGRAPH 2003, which divides a difference in color intensity by the gradient along the arc (sum of directional gradient at start and end pixel, in both source and match images). This encourages seams to cross along edges. I found that averaging this cost function with the original cost function gave a good result and didn't weight either gradients or color intensities too heavily.

Results: sample successes and failures

The results are all reasonable given my heuristics, but not all images look natural or make sense. A big part of the weakness is in the SSD metric sometimes scoring matches well, even when a better match (combined with Poisson blending) would look better. Overall I was impressed with some of the results. In the scene below, we want to mask out the parking lot from a photo of some landmark building. My best completion succeeds in matching the original tree line to find a lawn scene to cut in, and finds a reasonable graphcut to composite at.

Some other successes are shown below:

Many photos did not work well with scene completion. The biggest issues were not related to color as much sharpness, scene semantics, and graphcut creating salient unnatural photos. More time spent addressing these issues would result in better completions. Here are some interesting failures:

 Almost perfect, except for disappearing limbs... This would fine (extending the mountain, with a bay in the background), but Poisson blending of the original water into the mountainside makes the lower half seem underwater. Sometimes the results create problems with image semantics.

Comparison with Poisson image fill

I tried comparing images created by diffusion (Poisson image fill) to those created with scene completion. In some cases, the diffusion methods work remarkably well -- in small areas, but also in larger spaces that are natural devoid of much texture (water). We see this in the sea scene below, where diffusion is almost as good as scene completion (which includes some water ripples).

In most other cases, scene completion is clearly a better choice. Large diffusion areas just look blurred and unnatural in most contexts. Below, we show a case where the blurry diffusion clearly looks wrong in the context of a cityscape, but scene completion finds another collection of buildings to splice in with the correct 'texture' of a city photograph.

 Poisson fill Scene completion