Image Dating

- Emanuel Zgraggen (ez), December 5th, 2011

Introduction

This project explores ways of estimating the decade in which a picture was taken. All training and testing was done on a dataset of 2'195 Flickr images covering the five decades from 1930 to 1979. Some experiments were conducted to get a sense of how well humans perform at this task. The project presents an algorithm that significantly outperforms humans.

Dataset

The dataset used in this project contains 2'195 Flickr images (439 per decade) covering the five decades from 1930 to 1979. For each of the images we know the exact date on which it was taken.

Excerpt from the dataset (30s, 40s, 50s, 60s, 70s)
Averaged images for all decades (30s, 40s, 50s, 60s, 70s)

How good are humans?

This task is hard even for humans. To get an estimate of how good humans are, I randomly sampled 50 images from each decade and posted them on Amazon Mechanical Turk as classification HITs (Human Intelligence Tasks). Each image was dated by 5 different workers and each HIT was rewarded with $0.01. The workers were asked to assign a specific year in the range of 1930 - 1979 to each image. The 1'250 results were compared against the ground truth and converted into a simple accuracy (number of correct classifications / number of trials). For this evaluation only the decade had to be correct. With an accuracy of 26.64%, humans are only slightly better than chance (20%). One interesting phenomenon is that humans tend to assign images to more recent decades: far more images were classified as 70s than as any other decade.

Images where human performance was 100% (all of the workers classified the image right)
Images where human performance was 80%
Images where human performance was 60%
Images where human performance was 40%
Images where human performance was 20%
Images where human performance was 0% (none of the workers classified the image right)
Example HIT
Plot of average human performance for all 250 posted images (blue) vs. chance (red)
Human classification per decade (Note that the number of test images per decade was equal)
Histogram of human average performance (around 90 images were misclassified by all workers)

Approach

The problem is set up as a classification task: at test time, the algorithm should assign a unique category label to each image, where the categories are the decades from 1930 - 1979. The algorithm performs a random train / test split over the images of each decade. The training images are converted into their feature representation, and a 1-vs-all classifier is trained for each decade. The accuracy of the classifiers is then measured on the test images. I experimented with different features, which are explained in greater detail in the following sections.
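Below is a minimal sketch of this pipeline, assuming each image has already been converted to a fixed-length feature vector by one of the extractors described in the following sections. The use of scikit-learn and a 50/50 stratified split are illustrative assumptions, not the original implementation.

    # Minimal pipeline sketch (assumes precomputed feature vectors per image).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    def evaluate(features, labels):
        # Random train / test split, stratified so each decade is split evenly.
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.5, stratify=labels, random_state=0)
        # LinearSVC trains one 1-vs-all linear SVM per decade internally.
        clf = LinearSVC()
        clf.fit(X_train, y_train)
        return (clf.predict(X_test) == y_test).mean()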

SIFT Bag of Words
The algorithm collects densely sampled SIFT features from all training images. These features are clustered into M visual words (the vocabulary) by running k-means. For each training image, a histogram of how often each visual word occurs is computed. This yields an M-dimensional representation ("bag of words") that is used directly as input to the classifier.
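A sketch of the bag-of-words construction is given below. It assumes dense SIFT descriptors have already been extracted per image (e.g. with OpenCV or VLFeat), and uses scikit-learn's k-means as a stand-in for whatever clustering the original code used.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptors_per_image, M=200):
        # Stack the descriptors of all training images and cluster into M visual words.
        all_descriptors = np.vstack(descriptors_per_image)
        return KMeans(n_clusters=M, random_state=0).fit(all_descriptors)

    def bow_histogram(descriptors, vocabulary):
        # Assign each descriptor to its nearest visual word and count occurrences.
        words = vocabulary.predict(descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / hist.sum()   # normalize so the descriptor count does not matter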

HOG Bag of Words
This feature follows the same approach as the SIFT Bag of Words, but uses densely sampled HOG features instead of SIFT features.

GIST
Another representation of an image is the GIST descriptor, a global image descriptor that produces a single feature vector for the entire image. The algorithm does no clustering on this feature, but uses the global feature vector directly as input to the classifier.
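Computing the descriptor might look like the sketch below, assuming the third-party pyleargist package, which wraps the original GIST implementation (an assumption; the project may well have used the authors' MATLAB code instead).

    import leargist                      # assumed: the pyleargist package
    from PIL import Image

    def gist_descriptor(path):
        # color_gist returns one global feature vector for the whole image.
        return leargist.color_gist(Image.open(path))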

Color Histogram
None of the above features incorporates any color information. A simple color-sensitive feature can be produced by computing the histogram of each color channel of an image. The algorithm uses the normalized, concatenated channel histograms as the feature vector.

Example image with its corresponding RGB histograms
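A sketch of this feature, assuming an 8-bit RGB input; the number of bins per channel (32 here) is an assumption, not a value stated in this report.

    import numpy as np

    def color_histogram(image, bins=32):
        # image: H x W x 3 uint8 RGB array; one histogram per channel.
        hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
                 for c in range(3)]
        feature = np.concatenate(hists).astype(float)
        return feature / feature.sum()   # normalize for image size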

Textons
Inspired by Leung and Malik, this feature tries to encode the different textures in an image. The algorithm filters the input images with a set of filters, which yields an X-dimensional feature vector at each pixel (where X is the number of filters used). A texture vocabulary is created by clustering the filter-response vectors of all training images. For each image, a histogram of how often each texture word occurs is then computed.

The filterbank
Example images in their Texton representation and with their Texton histograms (350 texture words)
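The sketch below illustrates the idea; a small bank of Gaussian-derivative filters stands in for the full filter bank shown above (an assumption), and the vocabulary is built by running k-means over the per-pixel response vectors of the training images, just as for the bag of words.

    import numpy as np
    from scipy import ndimage

    def filter_responses(gray):
        # gray: 2-D float grayscale image.
        # Apply X filters; this yields one X-dimensional vector per pixel.
        responses = [ndimage.gaussian_filter(gray, sigma=s, order=o)
                     for s in (1, 2, 4)
                     for o in ((0, 0), (0, 1), (1, 0))]   # blur + x/y derivatives
        return np.stack([r.ravel() for r in responses], axis=1)

    def texton_histogram(gray, vocabulary):
        # Map every pixel to its nearest texture word, then count occurrences.
        words = vocabulary.predict(filter_responses(gray))
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / hist.sum()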

Results

All reported experiments were performed on random splits of the five-decade dataset, using linear SVMs.
To combine multiple features, the algorithm trains one set of 1-vs-all classifiers per feature. A test image is assigned to the decade with the highest mean classifier confidence.
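A sketch of this fusion rule, assuming each per-feature classifier exposes per-decade confidences (as scikit-learn's decision_function does):

    import numpy as np

    def predict_combined(classifiers, feature_sets, decades):
        # classifiers[i] was trained on feature_sets[i] (one matrix per feature type).
        # decision_function returns one confidence per decade for every test image.
        scores = np.mean([clf.decision_function(X)
                          for clf, X in zip(classifiers, feature_sets)], axis=0)
        return np.asarray(decades)[np.argmax(scores, axis=1)]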

Approach                                             Mean Accuracy (%)  Comments
Humans (MTurk Workers)                               26.64
SIFT Bag of Words                                    38.80              100 train / test images; vocabulary size = 200
HOG Bag of Words                                     33.00              100 train / test images; vocabulary size = 200
GIST                                                 33.20              200 train / test images
Color Histogram                                      32.60              200 train / test images
Texton                                               27.20              100 train / test images; vocabulary size = 350
SIFT Bag of Words + Color Histogram                  40.60              100 train / test images; vocabulary size = 200 (BoW)
SIFT Bag of Words + GIST + Color Histogram           43.40              200 train / test images; vocabulary size = 350 (BoW)
SIFT Bag of Words + GIST + Color Histogram + Texton  42.60              200 train / test images; vocabulary sizes = 350 (BoW), 350 (Texton)

Confusion matrices for all of the above approaches (figures)

Discussion

Humans probably rely on two different types of evidence when trying to date a photo: recognized objects in the photo (people, clothes, cars, etc.) and artifacts of the photographic process (color, borders, etc.). The experiments in this project show that humans are generally bad at this task.
A simple algorithm that combines different features can significantly outperform humans. The best results were produced by combining SIFT Bag of Words, GIST and Color Histogram features; adding Texton features did not improve the average performance. These features do not rely on any object recognition, but probably encode artifacts of the photographic process instead.