Final Project Writeup: Image Completion + IM2GPS
Basia Korel (bkorel)
May 17, 2010


Project Overview

This project is motivated by the problem of global image geolocation: given an input image, estimate its geographic location by scene matching against a database of millions of geo-tagged photos collected from the Internet. Geolocation estimates are useful for numerous geographic information applications, such as estimating population density, cultural differences (1) and land coverage. This project is based on the IM2GPS paper (2); in that effort, many of the test input images contain a mixture of scenes with landmarks, generic scenes that provide little geographic information, and randomly sampled photos. What the majority of these images have in common is that they represent what the scene should actually look like. That is, "this is what a picture of this scene would look like on an average day" (no parades on the streets of Paris, no sand storm in the Sahara, no giant cruise ship blocking the view of a beautiful port city in the Greek isles). This motivated me to consider an approach to IM2GPS for images that contain "noise"; in this context I define noise as people, cars or any other objects that are not a permanent part of the scene.

Given a set of such input images, I investigated the problem of segmenting out the non-permanent portions of a scene (which are user-specified) and filling in the missing region, to try to improve the accuracy of IM2GPS. I use two different image completion techniques to generate a photo of what the scene might look like without the person/car/object: scene completion (3) and inpainting (4). Scene completion is also a data-driven approach, compositing the input from other semantically similar images, while inpainting fills in the hole using content already present in the photo. I analyze and compare the results of running IM2GPS on the original, scene-completed and inpainted input images.

This process depends on the user to define the region to be filled in, and also to select, after scene completion, the composited photo that best resembles what the true scene might look like without the person or car. Ideally, an enhanced IM2GPS application would automatically perform image segmentation through object detection and a higher-level understanding of which objects or portions of the scene should be eliminated from the photo, and would then complete the scene without user intervention.

Approach

IM2GPS

IM2GPS estimates the geographic location of an input image using a data-driven scene matching approach, leveraging a database of over 6 million geo-tagged photos collected from Flickr. To perform scene matching, several feature descriptors are extracted from the photos to measure how semantically similar two images are; the implementation combines many descriptors, such as the gist descriptor and color and texton histograms. Features are precomputed for every image in the database. For a new query image, the same feature vectors are computed, and the distances in each feature space are calculated between the query and every image in the database. Each feature is weighted so that all features have roughly equal influence, and the distances are aggregated to find the nearest neighbor scenes in the database. Running the algorithm yields a set of 40-NN and a set of 100-NN images. To compute accuracy, I use the 1st NN in the set of 40-NN, and I also check whether any nearest neighbor image in the full 40-NN set has the correct geolocation for the query image. The Flickr database, the precomputed feature database, and the code to compute features and find nearest neighbors were provided to me, since implementing each of these steps was outside the time constraints of this project.
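The weighted distance aggregation described above can be sketched as follows. This is a minimal illustration, not the actual course code; the dictionary layout and the `find_nearest_scenes` helper are hypothetical.

```python
import numpy as np

def find_nearest_scenes(query_feats, db_feats, weights, k=100):
    """Aggregate per-feature distances into one score and return the
    indices of the k nearest database scenes.

    query_feats: {feature name: 1-D descriptor for the query}
    db_feats:    {feature name: (n_images, dim) array of descriptors}
    weights:     {feature name: scalar weight, chosen so each feature
                  has roughly equal influence}
    """
    n_images = next(iter(db_feats.values())).shape[0]
    total = np.zeros(n_images)
    for name, q in query_feats.items():
        # L2 distance from the query to every database image in this space
        total += weights[name] * np.linalg.norm(db_feats[name] - q, axis=1)
    return np.argsort(total)[:k]
```

In practice the weights matter: an unnormalized high-dimensional descriptor would otherwise dominate the aggregate distance.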

Scene Completion

After running IM2GPS on an input image, the 100-NN set of images is used as the set of candidate scenes for scene completion. I used my Project 4 code to composite photos: an alignment is computed within the local mask region, and graph cut plus Poisson blending are used to seamlessly combine the images. For this step, the user must define the mask region to be removed from the input image. The user is then presented with 30 composited images. Because I use only the alignment cost as the scene completion cost, which on its own is a rather weak metric for ranking the resulting photos, the composited photo with the lowest cost is not selected automatically; instead, the user must specify which photo to use as the input for the second pass of IM2GPS.
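The alignment step can be sketched roughly as below. This is a toy version assuming grayscale images and a small translational search; the `best_alignment` helper and the wrap-around shift via `np.roll` are simplifications, not the Project 4 implementation.

```python
import numpy as np

def alignment_cost(query, candidate, border, dy, dx):
    """SSD between the query and a shifted candidate, evaluated only on
    the border pixels just outside the mask (the local context region)."""
    shifted = np.roll(np.roll(candidate, dy, axis=0), dx, axis=1)
    return ((query - shifted) ** 2)[border].sum()

def best_alignment(query, candidate, border, search=2):
    """Try small integer translations of the candidate and keep the
    cheapest one; returns (cost, dy, dx)."""
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            c = alignment_cost(query, candidate, border, dy, dx)
            if best is None or c < best[0]:
                best = (c, dy, dx)
    return best
```

The resulting per-candidate cost is what the 30 composites are ranked by before they are shown to the user.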

In general, my scene completion results were not as good as they could be, simply because the nearest neighbor scenes are selected using features extracted over the entire input image, including the region to be removed. For scene completion it would be better to compute the feature descriptors with the missing region excluded; however, I believe this would not have been a trivial change. I decided that the results were good enough, because the goal of scene completion here was not to have a seamlessly composited photo, but rather to have a photo filled in with content that could legitimately exist.

Inpainting

I also explored inpainting as an image completion technique. This is based on "Object Removal by Exemplar-based Inpainting" (4), which fills in a missing region by propagating both the texture and the structure contained in the original image. A description of the algorithm and the source code are available here: http://www.cc.gatech.edu/~sooraj/inpainting/.
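A greatly simplified, greedy sketch of the exemplar-based idea is below (grayscale only). It uses a count of known neighbors as a stand-in for Criminisi's confidence term and omits the structure-propagating data term entirely, so it is an illustration of the fill-front idea rather than the algorithm actually used.

```python
import numpy as np

def inpaint(image, mask, r=1):
    """Toy exemplar-based fill. mask is True on missing pixels.
    Repeatedly takes the missing pixel whose (2r+1)^2 patch has the most
    known pixels, then copies the center of the best-matching source
    patch. Assumes at least one fully known source patch exists."""
    img = image.astype(float).copy()
    known = ~mask
    h, w = img.shape
    # candidate source patches: entirely inside the image and fully known
    sources = [(sy, sx) for sy in range(r, h - r) for sx in range(r, w - r)
               if known[sy - r:sy + r + 1, sx - r:sx + r + 1].all()]
    while not known.all():
        ys, xs = np.where(~known)
        # fill-front priority: number of known pixels in the target patch
        def conf(p):
            y, x = p
            return known[max(0, y - r):y + r + 1,
                         max(0, x - r):x + r + 1].sum()
        y, x = max(zip(ys, xs), key=conf)
        # find the source patch that best matches the known part
        best, best_cost = None, np.inf
        for sy, sx in sources:
            cost, n = 0.0, 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w and known[yy, xx]:
                        cost += (img[yy, xx] - img[sy + dy, sx + dx]) ** 2
                        n += 1
            if n and cost / n < best_cost:
                best_cost, best = cost / n, (sy, sx)
        img[y, x] = img[best]
        known[y, x] = True
    return img
```

The real algorithm copies whole patches and weighs the isophote direction at the fill front, which is what lets it continue linear structures into the hole.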

Evaluation

Test Set

81 images were used in the test set to evaluate the performance of each IM2GPS approach. Many of the images came from the Flickr database, and some were from my own photo collection. I deliberately used photos containing people or other objects that are not a permanent component of the scene, such as cars or buses. A number of the images contain recognizable landmarks or provide some geographic information, because I was particularly interested in the accuracy on such scenes; a smaller set are "generic" scenes (e.g. a beach, mountain or desert landscape). Overall, the test set is not an even distribution over the entire globe, and I would ideally like to test with many more input images taken from all over the world, including a larger set that contains little geographic information.

Quantitative Results

Below are the accuracies of a correct geolocation estimate (within 200 km of the true location) for both the first nearest neighbor and the full set of 40 nearest neighbors returned from IM2GPS.

IM2GPS:
14.81% 1-NN correct (12 out of 81)
54.32% 40-NN correct (44 out of 81)

Scene Completion:
11.11% 1-NN correct (9 out of 81)
55.56% 40-NN correct (45 out of 81)

Inpainting:
16.05% 1-NN correct (13 out of 81)
62.96% 40-NN correct (51 out of 81)
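For concreteness, the accuracy metric above can be computed as follows. This is a sketch with hypothetical helper names; it assumes each nearest neighbor carries a (lat, lon) geotag and uses the haversine great-circle distance with a 200 km threshold.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def accuracy(truths, nn_lists, threshold_km=200.0, k=1):
    """Fraction of queries whose true location is within threshold_km of
    any of the first k nearest neighbors' geotags."""
    hits = sum(
        any(haversine_km(tlat, tlon, lat, lon) <= threshold_km
            for lat, lon in nns[:k])
        for (tlat, tlon), nns in zip(truths, nn_lists))
    return hits / len(truths)
```

With k=1 this gives the 1-NN numbers above; with k=40 it gives the 40-NN numbers, since a query counts as correct if any neighbor in the set falls within the threshold.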

Image Results

The first 3 nearest neighbors returned for each method are listed with each input image below (in the original figures these were shown as thumbnails, with results marked as 1-NN correct or 40-NN correct).


im2gps: India [NN: India, London, Oklahoma]
Scene Completion: India + India [NN: Spain, Austria, Barcelona]
Inpainting: India [NN: Spain, Austria, India]

im2gps: Italy [NN: Pisa, Mexico City, St Petersburg]
Scene Completion: Italy + Minnesota [NN: Paris, Mexico City, Los Angeles]
Inpainting: Italy [NN: Pisa, Mexico City, Pisa]


im2gps: Croatia [NN: Greece, Italy, Berlin]
Scene Completion: Croatia + Italy [NN: Greece, Venice, Nepal]
Inpainting: Croatia [NN: North Carolina, Berlin, Greece]





im2gps: Turkey [NN: Turkey, Venice, Barcelona]
Scene Completion: Turkey + Turkey [NN: Turkey, Rome, Rome]
Inpainting: Turkey [NN: Turkey, Turkey, England]

im2gps: Paris [NN: NYC, Italy, Paris]
Scene Completion: Paris + Paris [NN: Paris, DC, UK]
Inpainting: Paris [NN: Paris, London, Romania]



im2gps: Utah [NN: Mendoza, Uruguay, Berlin]
Scene Completion: Utah + Utah [NN: Utah, Utah, USA]
Inpainting: Utah [NN: Utah, Nevada, Scotland]

im2gps: Utah [NN: Netherlands, Colorado, Wyoming]
Scene Completion: Utah + Utah [NN: Utah, USA, Italy]
Inpainting: Utah [NN: Washington, Africa, Argentina]


im2gps: Croatia [NN: Spain, Rome, Czech Republic]
Scene Completion: Croatia + Malta [NN: Hyderabad, Spain, Oman]
Inpainting: Croatia [NN: Thailand, India, Spain]


im2gps: Fiji [NN: Hawaii, Fiji, Panama]
Scene Completion: Fiji + Monaco [NN: Hawaii, Fiji, Tunisia]
Inpainting: Fiji [NN: Fiji, South Africa, Tunisia]



im2gps: Peru [NN: Ireland, Peru, Namibia]
Scene Completion: Peru + Peru [NN: Ireland, Peru, Greece]
Inpainting: Peru [NN: Peru, Ireland, Greece]


im2gps: Spain [NN: Germany, Thailand, San Francisco]
Scene Completion: Spain + NYC [NN: Germany, New York, Paris]
Inpainting: Spain [NN: Spain, Libya, Paris]

im2gps: Toronto [NN: Spain, Albania, San Francisco]
Scene Completion: Toronto + Toronto [NN: London, Toronto, Barcelona]
Inpainting: Toronto [NN: Albania, Paris, Paris]

im2gps: Malta [NN: Colorado, Taiwan, Gambia]
Scene Completion: Malta + Spain [NN: Maldives, Japan, Gambia]
Inpainting: Malta [NN: Venice, Aruba, Vermont]

im2gps: Poland [NN: Beijing, Rio de Janeiro, Guatemala]
Scene Completion: Poland + Chile [NN: Rio de Janeiro, Washington, Hong Kong]
Inpainting: Poland [NN: India, Hong Kong, South Africa]

im2gps: Samoa [NN: Fiji, Fiji, Jamaica]
Scene Completion: Samoa + Samoa [NN: Fiji, Cuba, Taiwan]
Inpainting: Samoa [NN: Fiji, Fiji, Brazil]

im2gps: Vegas [NN: Rome, Beijing, Egypt]
Scene Completion: Vegas + Tokyo [NN: Palestine, Egypt, Cairo]
Inpainting: Vegas [NN: Egypt, Taipei, Paris]


Conclusion

I initially suspected that geolocating an image that contains a person would be less accurate than geo-estimating the same scene without the person. This seems intuitive, since scene matching compares image features computed over the entire query image with features from the database; if there is in fact a similar (or the same) scene in the database without the person, the distance between the two images would be greater than if the person had not been there.

IM2GPS on the original input image performs better than scene completion for 1-NN. This initially surprised me; however, it could make sense because scene completion may often bring in "too much" content from another photo and location that simply does not fit the query image's context. I suspect that if I had found matching scenes in scene completion according to (3) ("first compute its gist descriptor with the missing regions excluded... calculate the SSD between the gist of the query image and every gist in the database, weighted by the mask"), which is something I did not do, I could have had better results with scene completion.
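The mask-weighted SSD from (3) could be sketched like this. The gist computation itself is not shown; `masked_ssd` and the per-dimension `weight` vector (e.g. the fraction of valid pixels in each gist grid cell, zero where the descriptor is dominated by the hole) are hypothetical names for illustration.

```python
import numpy as np

def masked_ssd(query_desc, db_desc, weight):
    """Sum of squared differences between descriptors, down-weighting
    the dimensions that fall inside the missing region.

    query_desc: 1-D descriptor computed with the missing region excluded
    db_desc:    (n_images, dim) array, or a single 1-D descriptor
    weight:     per-dimension weights in [0, 1]
    """
    d = (query_desc - db_desc) ** 2
    return (weight * d).sum(axis=-1)
```

Ranking database scenes by this distance would keep a hole over, say, a person from penalizing otherwise well-matching scenes.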

Inpainting performs fairly well, which gives hope that filling in the "noise" of a query photo is a plausible thing to do. Overall, I feel I need to run this analysis on a much larger number and wider variety of photos to have truly conclusive results.


References
1. "Detecting cultural differences using consumer-generated geotagged photos", Yanai et al.
2. "IM2GPS: estimating geographic information from a single image", J. Hays and A. Efros.
3. "Scene Completion Using Millions of Photographs", J. Hays and A. Efros.
4. "Object Removal by Exemplar-based Inpainting", Criminisi et al.