To find images in the database that share a similar spatial layout and context with the input image, the gist scene descriptor is used to represent image features. The gist is the significant information an observer notices immediately upon seeing a scene. The matching procedure therefore works at the basic and superordinate levels of the scene, that is, it treats the scene as a single entity rather than as a configuration of individual objects.
The gist descriptor is built from 8 oriented edge responses at 4 scales, aggregated over a 4x4 grid of non-overlapping windows.
The images below visualize the gist of the scenes with the missing region removed.
Before computing the gist, the missing region is removed from both the source image and the matched images using a mask,
and all images are resized to 192x256 (the most common aspect ratio in the data set). Besides the SSD between the
gist of the query image and the gist of each database image, a color difference term is also included.
The gist and color scores are weighted roughly 2:1, and the output is the 20 most similar images.
Output 4
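As a rough sketch of this matching step (not the exact code used), the snippet below assumes OpenCV and NumPy, approximates the oriented edge responses with Gabor filters, and uses illustrative parameter values; the helper names `gist_descriptor`, `color_descriptor`, and `top_matches`, the filter settings, and the `w_color` weight are all assumptions.

```python
import cv2
import numpy as np

N_ORIENT, N_SCALE, GRID = 8, 4, 4        # 8 orientations x 4 scales, 4x4 grid

def gist_descriptor(img_bgr, hole_mask):
    """img_bgr: HxWx3 uint8 image; hole_mask: HxW bool, True inside the missing region."""
    img = cv2.resize(img_bgr, (256, 192))                     # 192x256 as described above
    m = cv2.resize(hole_mask.astype(np.uint8), (256, 192)) > 0
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    gray[m] = 0.0                                             # remove the missing region first
    feats = []
    for s in range(N_SCALE):
        lam = 4.0 * (2 ** s)                                  # wavelength grows with scale
        for o in range(N_ORIENT):
            theta = np.pi * o / N_ORIENT
            kern = cv2.getGaborKernel((15, 15), 0.5 * lam, theta, lam, 0.5)
            resp = np.abs(cv2.filter2D(gray, cv2.CV_32F, kern))
            # average each response over a 4x4 grid of non-overlapping windows
            feats.append(cv2.resize(resp, (GRID, GRID), interpolation=cv2.INTER_AREA).ravel())
    return np.concatenate(feats)                              # 8 * 4 * 16 = 512 dimensions

def color_descriptor(img_bgr, hole_mask):
    img = cv2.resize(img_bgr, (256, 192)).astype(np.float32)
    img[cv2.resize(hole_mask.astype(np.uint8), (256, 192)) > 0] = 0.0
    return cv2.resize(img, (GRID, GRID), interpolation=cv2.INTER_AREA).ravel()

def top_matches(query, hole_mask, database, k=20, w_color=0.5):
    """database: iterable of (image, name). Both terms would need normalization
    in practice so that the gist:color weighting really is about 2:1."""
    qg, qc = gist_descriptor(query, hole_mask), color_descriptor(query, hole_mask)
    scored = []
    for img, name in database:
        g, c = gist_descriptor(img, hole_mask), color_descriptor(img, hole_mask)
        scored.append((np.sum((qg - g) ** 2) + w_color * np.sum((qc - c) ** 2), name))
    return [name for _, name in sorted(scored)[:k]]
```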
3. Local Context Matching
After retrieving some semantically similar scenes, the next step is to find a patch from those scenes that best fills the hole
in the source image. A graph cut is then applied to that best patch to find the optimal seam along which to cut it,
and Poisson blending is used when pasting the patch into the incomplete image.
3.1 Finding the best patch
To find the patch that best fits the hole of the incomplete image, the most meaningful information comes from the
pixels just outside the hole. To get those pixels, I dilated the source mask and then subtracted the original mask,
which yields a new template mask covering a band of about 80 pixels around the hole. I then cropped the new mask to its
smallest bounding box and applied this small mask to every possible patch of the matching image. Since the matching scenes
are assumed to be roughly aligned with the incomplete source image, the small mask only needs to be placed within a limited
region around the corresponding hole position rather than being searched over the whole matching image, which speeds up the
search and avoids some semantically wrong results. The matching score combines the SSD of a color descriptor, the SSD of a
texture descriptor, and the graph cut flow discussed in the next section. The texture descriptor is computed as a 5x5 median
filter of the pixel gradient. A sketch of this step follows the figure below.
Original mask | After dilation | New template mask | Small mask
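The sketch below covers the template-mask construction and the local patch search. The ~80-pixel band and the color + texture SSD score come from the description above; the search window size, the `w_tex` weight, the helper names `template_mask` and `best_offset`, and reading "pixel gradient" as gradient magnitude are assumptions.

```python
import cv2
import numpy as np

def template_mask(hole_mask, band=80):
    """hole_mask: HxW bool. Band of pixels just outside the hole:
    dilate the hole mask, then subtract the original mask."""
    kernel = np.ones((2 * band + 1, 2 * band + 1), np.uint8)  # one big dilation; iterating a small kernel is faster
    dilated = cv2.dilate(hole_mask.astype(np.uint8), kernel)
    return (dilated > 0) & ~hole_mask

def texture(gray):
    """Texture descriptor: 5x5 median filter of the pixel gradient (magnitude)."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    return cv2.medianBlur(np.sqrt(gx * gx + gy * gy), 5)

def best_offset(src, match, tmpl_mask, search=40, w_tex=1.0):
    """Slide the cropped template over a small window around the aligned position
    and keep the offset with the lowest color SSD + texture SSD."""
    ys, xs = np.where(tmpl_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    m = tmpl_mask[y0:y1, x0:x1]                               # the cropped "small mask"
    ref = src[y0:y1, x0:x1].astype(np.float32)
    src_t = texture(cv2.cvtColor(src, cv2.COLOR_BGR2GRAY).astype(np.float32))
    match_t = texture(cv2.cvtColor(match, cv2.COLOR_BGR2GRAY).astype(np.float32))
    ref_t = src_t[y0:y1, x0:x1]
    h, w = y1 - y0, x1 - x0

    best_cost, best_dyx = np.inf, (0, 0)
    for dy in range(-search, search + 1):                     # limited region around the hole position
        for dx in range(-search, search + 1):
            ty, tx = y0 + dy, x0 + dx
            if ty < 0 or tx < 0 or ty + h > match.shape[0] or tx + w > match.shape[1]:
                continue
            cand = match[ty:ty + h, tx:tx + w].astype(np.float32)
            cand_t = match_t[ty:ty + h, tx:tx + w]
            cost = (np.sum(((ref - cand) ** 2)[m]) +
                    w_tex * np.sum(((ref_t - cand_t) ** 2)[m]))
            if cost < best_cost:
                best_cost, best_dyx = cost, (dy, dx)
    return best_dyx, best_cost
```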
3.2 Graph cut and Poisson blending
In order to fit the selected patch to the incomplete scene, a graph cut is applied to the small template mask
obtained in the previous section. For the template mask, the pixels outside the mask area must come from the
source image and the pixels inside the mask area must come from the selected patch of the matching image.
(As the image below shows, the red region must come from the source, the blue region must come from the patch, and the white region is the mask area.)
The graph cut decides which part of the mask area belongs to the source image and which part belongs to the patch.
To implement the graph cut algorithm, the core is to build a graph of edge weights in which every edge connects two
adjacent pixels (4-way neighbors). For neighboring pixels p and q, the edge weight is M(p, q) = SSD(source(p), patch(p)) + SSD(source(q), patch(q)),
where source is the incomplete image and patch is the matching image. The edge weight is the 'price' of breaking that edge.
The max-flow/min-cut algorithm is then applied to find the seam with the minimal sum of edge weights. According to the paper, it is better
to minimize the edge weights in the gradient domain rather than the intensity domain, because humans are not sensitive to color
shifts as long as they are seamless and low frequency. The final step is Poisson blending, which pastes the graph cut result onto
the incomplete image so the hole is filled seamlessly.
Color mask | Graphcut mask | Pixels from matching image
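As a concrete sketch of the seam computation (using the PyMaxflow library as a stand-in max-flow/min-cut solver), the code below builds the 4-neighbor edge weights M(p, q) and reads off which template pixels should come from the patch. The input names (`src`, `patch`, `must_src`, `must_patch`) and the hard-constraint capacity are assumptions, and the costs are computed in the intensity domain; replacing `d` with a gradient-difference term gives the gradient-domain variant mentioned above.

```python
import numpy as np
import maxflow                                    # pip install PyMaxflow

def graphcut_seam(src, patch, must_src, must_patch):
    """src, patch: HxWx3 float images (already aligned); must_src / must_patch: HxW bool
    masks of pixels constrained to the source / the patch (the red / blue regions)."""
    h, w = src.shape[:2]
    d = np.sum((src - patch) ** 2, axis=2)        # per-pixel SSD between source and patch

    g = maxflow.Graph[float]()
    nodeids = g.add_grid_nodes((h, w))
    inf = 1e9                                     # "infinite" capacity for hard constraints

    for y in range(h):
        for x in range(w):
            p = nodeids[y, x]
            # edge weight between 4-neighbors p and q: M(p, q) = d(p) + d(q),
            # the price of putting the seam between them
            if x + 1 < w:
                m = d[y, x] + d[y, x + 1]
                g.add_edge(p, nodeids[y, x + 1], m, m)
            if y + 1 < h:
                m = d[y, x] + d[y + 1, x]
                g.add_edge(p, nodeids[y + 1, x], m, m)
            # hard constraints: tie fixed pixels to the source/sink terminals
            if must_src[y, x]:
                g.add_tedge(p, inf, 0)
            elif must_patch[y, x]:
                g.add_tedge(p, 0, inf)

    flow = g.maxflow()                            # total cut cost; usable as the
                                                  # "graph cut flow" term of the matching score
    from_patch = g.get_grid_segments(nodeids)     # True where the pixel comes from the patch
    return from_patch, flow
```

The pixels selected by the cut could then be composited onto the incomplete image with Poisson blending, for instance via OpenCV's cv2.seamlessClone.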
4. Result Images
4.1 Part of the results
In this part, we compare two sets of scene completion results: one set applies local context matching to the 20 similar images
retrieved from the tiny 'homemade' database, and the other set uses the 120 best matching scenes from an internet-scale database (Flickr).
Incomplete image | Results from the 120 best matches | Results from the tiny database
4.2 The rest of the results
4.3 Problems to discuss
The procedure for finding similar images in the database is scene-oriented rather than object-oriented, and there is no
mechanism in the local context matching step to keep objects whole. It is therefore possible for the completion results to contain
incomplete objects. In addition, some results show a wrong relative scale between objects or a wrong feeling of space.
Incomplete objects | Incomplete objects | Wrong scale | Wrong feeling of space
References
[1] J. Hays and A. Efros. Scene Completion Using Millions of Photographs. In Proc. SIGGRAPH, 2007.
[2] V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut Textures: Image and Video Synthesis Using Graph Cuts. In Proc. SIGGRAPH, 2003.
[3] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. IJCV, 2001.
[4] A. Oliva and A. Torralba. Building the Gist of a Scene: The Role of Global Image Features in Recognition. In Visual Perception, Progress in Brain Research, 2006.