CS129 / Final Project / Scene Completion

1. Introduction

This is a simplified implementation of Scene Completion Using Millions of Photographs by James Hays and Alexei A. Efros. The implementation has two main parts: Semantic Scene Matching, which finds images in the database whose scene envelope is similar to that of the input image, and Local Context Matching, which searches those semantically similar scenes for the best patch to fill the hole in the incomplete image. To test these two parts, I used 51 test images from the project website and built a tiny database (6072 images) that already contains several images similar to the test images.

2. Semantic Scene Matching

To find images in the database that have a similar spatial layout and context to the input image, the gist scene descriptor is used to represent image features. The gist is the significant information that observers notice immediately when they see a scene. The matching procedure therefore pays attention to the basic and superordinate levels of the scene, that is, it treats the scene as an individual object rather than as a configuration of specific objects within the scene.

The gist descriptor is built from 8 oriented edge responses at 4 scales, aggregated over 4x4 non-overlapping windows. The images below visualize the gist of the scenes with the missing region removed.
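A rough sketch of how such a descriptor could be computed is given here; this is not the original code, it assumes SciPy and scikit-image and uses a generic Gabor filter bank in place of the exact filters of Oliva and Torralba [3, 4].

import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gist_descriptor(gray, n_orient=8, n_scale=4, grid=4):
    # Simplified gist: oriented filter responses averaged over a grid of
    # non-overlapping windows. `gray` is a 2-D float image in [0, 1] with
    # the missing region already zeroed out by the mask.
    h, w = gray.shape
    feats = []
    for s in range(n_scale):
        freq = 0.25 / (2 ** s)                    # one frequency band per scale
        for o in range(n_orient):
            theta = np.pi * o / n_orient
            k = gabor_kernel(freq, theta=theta)   # complex Gabor filter
            resp = np.abs(fftconvolve(gray, k, mode='same'))
            # Average the response inside each cell of the grid x grid layout.
            resp = resp[:h - h % grid, :w - w % grid]
            cells = resp.reshape(grid, h // grid, grid, w // grid)
            feats.append(cells.mean(axis=(1, 3)).ravel())
    return np.concatenate(feats)   # n_scale * n_orient * grid * grid values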

Before computing gist, the missing regions of the source image and the matching images are removed using a mask, and all images are resized to 192x256 (the most common aspect ratio in the data set). In addition to the SSD between the gist of the query image and the gists of the candidate images from the database, the color difference is also counted; the gist and color scores are scaled to a roughly 2:1 ratio. The output is the 20 most similar images.
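A sketch of this scoring step, assuming the gist descriptors and small color images have been precomputed for the whole database (the names, normalization, and weights are illustrative placeholders, not the original code):

import numpy as np

def rank_matches(query_gist, query_color, db_gists, db_colors, valid,
                 k=20, w_gist=2.0, w_color=1.0):
    # Score each database scene against the query and return the indices of
    # the k best matches. `valid` marks pixels outside the hole, so the
    # color difference is only accumulated on known pixels.
    gist_ssd = np.sum((np.asarray(db_gists) - query_gist) ** 2, axis=1)
    color_ssd = np.array([np.sum(((c - query_color) ** 2)[valid])
                          for c in db_colors])
    # Normalize each term, then combine with a roughly 2:1 gist-to-color weight.
    score = w_gist * gist_ssd / gist_ssd.max() + \
            w_color * color_ssd / color_ssd.max()
    return np.argsort(score)[:k]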

2.1 Partial scene matching results

Input1
Input2
Input3
Input4
Output1:
Output2:
Output3:
Output4:

3. Local Context Matching

After finding some semantically similar scenes, the next step is to find a patch from those scenes that best fills the hole in the source image. Graph cut is then applied to that best patch to find the optimal seam along which to cut, and Poisson blending is needed when pasting the patch into the incomplete image.

3.1 Finding the best patch

To find the patch that best fits the hole in the incomplete image, the most meaningful information comes from the pixels just outside the hole. To get those pixels, I dilated the source mask and subtracted the original mask, which gives a new template mask covering a band of about 80 pixels around the hole. I then cropped this template mask to its bounding box and applied the small mask to every candidate patch position in the matching image. Since the matching scenes are assumed to be roughly aligned with the incomplete source image, the small mask only needs to be placed in a limited region around the corresponding hole position rather than being searched over the whole matching image, which speeds up the search and avoids some semantically wrong results. The matching score combines the SSD of a color descriptor, the SSD of a texture descriptor, and the graph cut flow discussed in the next section. The texture descriptor is a 5x5 median filter of the pixel gradient magnitude. A sketch of this scoring step follows the mask illustrations below.

Original mask
After dilation
New template mask
Small mask
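A rough sketch of the mask construction and scoring described above, assuming SciPy (the helper names and the way candidate offsets are applied are placeholders, not the original code):

import numpy as np
from scipy.ndimage import binary_dilation, median_filter, shift

def texture(img):
    # Texture descriptor: 5x5 median filter of the gradient magnitude.
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    return median_filter(np.hypot(gy, gx), size=5)

def local_context_score(src, hole, scene, dy, dx, band=80):
    # `hole` is True inside the missing region. Dilating the hole mask and
    # subtracting the original gives the template band (~80 px) around it.
    template = binary_dilation(hole, iterations=band) & ~hole
    # Shift the matching scene by a small offset (the search is restricted
    # to a limited region because the scenes are roughly aligned).
    cand = shift(scene, (dy, dx, 0), order=0, mode='nearest')
    color_ssd = np.sum(((src - cand) ** 2)[template])
    tex_ssd = np.sum(((texture(src) - texture(cand)) ** 2)[template])
    return color_ssd + tex_ssd   # the graph cut cost (next section) is added too

The best patch is then the candidate scene and offset with the lowest total score.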

3.2 Graph cut and Poisson blending

In order to fit the selected patch to the incomplete scene, a graph cut is applied to the small template mask obtained in the previous section. For the template mask, the pixels outside the mask area must come from the source image and the pixels inside the mask area must come from the selected patch of the matching image. (As the image below shows, the red region must come from the source, the blue region must come from the patch, and the white region is the mask area.) Graph cut decides which part of the mask area belongs to the source image and which part belongs to the patch.

To implement the graph cut algorithm, the core is to build a graph over neighboring pixels, with one edge between every pair of adjacent pixels (4-way neighbors). For an edge between adjacent pixels s and t, the weight is M(s, t) = SSD(incomplete_image(s), matching_image(s)) + SSD(incomplete_image(t), matching_image(t)). The edge weight is the 'price' of cutting that edge. The max-flow/min-cut algorithm is then applied to find the optimal seam, i.e. the cut with the minimal sum of edge weights. According to the paper, it is better to minimize the edge weight in the gradient domain instead of the intensity domain, because humans are not sensitive to color shifts as long as they are seamless and low frequency. The final step is Poisson blending, which pastes the graph cut result onto the incomplete image so that the hole is filled seamlessly; a sketch of that blending step follows the images below.
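A minimal sketch of this step, assuming the PyMaxflow library and using intensity differences for brevity (the gradient-domain variant simply replaces the per-pixel cost); this is illustrative, not the original code:

import numpy as np
import maxflow   # PyMaxflow, assumed installed (pip install PyMaxflow)

def graphcut_seam(src, patch, must_src, must_patch):
    # Decide per pixel whether to keep the source image or the candidate
    # patch. `must_src` / `must_patch` are boolean masks of pixels forced to
    # come from the source / the patch (the red and blue regions above).
    h, w = src.shape[:2]
    cost = np.sum((src - patch) ** 2, axis=2)   # per-pixel disagreement

    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))

    # Edge weight between neighbors s and t: M = cost(s) + cost(t),
    # the "price" of cutting the seam between them.
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                m = cost[y, x] + cost[y, x + 1]
                g.add_edge(nodes[y, x], nodes[y, x + 1], m, m)
            if y + 1 < h:
                m = cost[y, x] + cost[y + 1, x]
                g.add_edge(nodes[y, x], nodes[y + 1, x], m, m)

    # Hard constraints: tie forced pixels to the source/sink terminals.
    INF = 1e9
    g.add_grid_tedges(nodes, must_src * INF, must_patch * INF)

    g.maxflow()
    return g.get_grid_segments(nodes)   # True where the patch pixel is used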

Color mask
Graphcut mask
Pixels from matching image
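For the final blending step, one quick stand-in for a full Poisson solver is OpenCV's seamless cloning, which solves the same Poisson equation. The helper below is only a sketch: it assumes 8-bit BGR images, that the graph cut mask does not touch the image border, and that the images are already aligned; the names are placeholders, not the original code.

import cv2
import numpy as np

def paste_with_poisson(incomplete, patch_pixels, patch_mask):
    # `patch_mask` is the graph cut result (True where pixels come from the
    # matching image). seamlessClone blends the masked region of
    # `patch_pixels` into `incomplete` so gradients match at the seam.
    mask255 = patch_mask.astype(np.uint8) * 255
    ys, xs = np.nonzero(patch_mask)
    center = ((xs.min() + xs.max()) // 2, (ys.min() + ys.max()) // 2)  # (x, y)
    return cv2.seamlessClone(patch_pixels, incomplete, mask255, center,
                             cv2.NORMAL_CLONE)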

4. Result Images

4.1 Part of the results

In this part, we compare two different sets of scene completion results. One set applies local context matching to the 20 similar images retrieved from the tiny 'homemade' database, and the other uses the 120 best matching scenes from an internet-scale database (Flickr).

Incomplete image
Results from 120 best
Results from tiny database

4.2 The rest of the results

4.3 Problems to discuss

The procedure of searching for similar images in the database is scene-oriented rather than object-oriented, and there is no mechanism in the local context matching step to keep objects whole. So it is possible for the completion results to contain incomplete objects. Besides that, some results have a wrong scale between objects or a wrong sense of space.

Incomplete objects
Incomplete objects
Wrong scale
Wrong sense of space

References

[1] J. Hays and A. Efros. Scene Completion Using Millions of Photographs. In Proc. SIGGRAPH, 2007.

[2] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut Textures: Image and Video Synthesis Using Graph Cuts. In Proc. SIGGRAPH, 2003.

[3] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. IJCV, 2001.

[4] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. In Visual Perception, Progress in Brain Research, 2006.