- Emanuel Zgraggen (ez), December 5th, 2011
This project explores ways of estimating the decade in which a picture was taken. All training and testing was done on a dataset of 2'195 Flickr images spanning the five decades from 1930 to 1979. Some experiments were conducted to get a sense of how good humans are at this task. The project presents an algorithm that significantly outperforms humans.
The dataset used in this project contains 2'195 Flickr images (439 per decade) spanning the five decades from 1930 to 1979. For each image we know the exact date on which it was taken.
How good are humans?
Even for humans this task is really hard. To estimate how well humans perform, I randomly sampled 50 images from each decade and posted them on Amazon Mechanical Turk as classification HITs (Human Intelligence Tasks). Each image was dated by 5 different workers, and each HIT was rewarded with $0.01. The workers were asked to assign a specific year in the range 1930 - 1979 to each image. The 1'250 results were compared against the ground truth and converted into a simple accuracy (number of correct classifications / number of trials), where only the decade had to be correct. With an accuracy of 26.64%, humans are only slightly better than chance (20%). One interesting phenomenon is that humans tend to assign images to more recent decades: far more images were classified into the 70s category than into any other.
The problem is set up as a classification task: at test time, the algorithm should assign a unique category label to each image, where the categories are the decades from 1930 - 1979. The algorithm performs a random train / test split of the images in each decade and converts the training images into their feature representations. It then trains one 1-vs-all classifier per decade and measures each classifier's accuracy on the test images. I experimented with different features, which are explained in greater detail in the following sections.
SIFT Bag of Words
The algorithm collects many densely sampled SIFT features from all the training images. By running k-means, these features are clustered into M visual words, the vocabulary. For each training image, a histogram of how often each visual word occurs is computed. This gives us an M-dimensional representation ("bag of words") which is used directly as input to the classifier.
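A sketch of the bag-of-words step, assuming the SIFT descriptors have already been extracted (e.g. with OpenCV or VLFeat); random arrays stand in for them here, and the vocabulary size M = 20 is an illustrative choice, not a value from this project.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
M = 20  # vocabulary size; a tunable choice, not from the write-up

# Stand-in for densely sampled 128-d SIFT descriptors:
# one array of descriptors per training image.
train_descs = [rng.normal(size=(200, 128)) for _ in range(10)]

# Build the visual vocabulary by k-means over all training descriptors.
vocab = KMeans(n_clusters=M, n_init=10, random_state=0)
vocab.fit(np.vstack(train_descs))

def bag_of_words(descriptors):
    """M-dimensional normalized histogram of visual-word occurrences."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=M).astype(float)
    return hist / hist.sum()

feature = bag_of_words(train_descs[0])  # input vector for the classifier
```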
HOG Bag of Words
This feature follows the same approach as the SIFT Bag of Words, but uses densely sampled HOG features instead of SIFT features.
GIST
Another representation for an image is the GIST descriptor, a global image descriptor that produces one feature vector for an entire image. The algorithm does not do any clustering on this feature, but rather uses the global feature vector directly as input to the classifier.
Color Histogram
None of the above features incorporates any color information. A simple color-sensitive feature can be produced by calculating a histogram over each color channel of an image. The algorithm uses the normalized, concatenated channel histograms as a feature vector.
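The color-histogram feature is simple enough to sketch in full; the bin count of 16 per channel is an illustrative choice, and the input image here is random.

```python
import numpy as np

def color_histogram(image, bins=16):
    """Concatenated, normalized per-channel histograms.

    image: H x W x 3 uint8 array; returns a 3*bins-dim feature vector.
    The bin count is an illustrative choice, not from the write-up.
    """
    hists = []
    for c in range(3):
        h, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        hists.append(h.astype(float) / h.sum())
    return np.concatenate(hists)

img = np.random.default_rng(2).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
vec = color_histogram(img)  # 48-dimensional feature vector
```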
Textons
Inspired by Leung and Malik, this feature tries to encode the different textures in an image. The algorithm filters the input images with a set of filters, which gives us an X-dimensional feature vector at each pixel (where X is the number of filters). The algorithm creates a texture vocabulary by clustering the filter-response vectors of all the training images. For each image we then create a histogram of how often each texture word occurs.
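The texton pipeline can be sketched as below. A tiny four-filter bank (Gaussians and Laplacians at two scales) stands in for the full Leung-Malik bank, random grayscale arrays stand in for images, and the vocabulary size of 8 is illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

def filter_responses(gray):
    """Stack of filter responses: H x W x 4 (X = 4 filters here)."""
    stack = [gaussian_filter(gray, s) for s in (1.0, 2.0)]
    stack += [gaussian_laplace(gray, s) for s in (1.0, 2.0)]
    return np.stack(stack, axis=-1)

images = [rng.normal(size=(32, 32)) for _ in range(5)]  # grayscale stand-ins
all_vecs = np.vstack([filter_responses(im).reshape(-1, 4) for im in images])

# Texture vocabulary: cluster the per-pixel filter-response vectors.
textons = KMeans(n_clusters=8, n_init=10, random_state=0).fit(all_vecs)

def texton_histogram(gray):
    """Normalized histogram of texton-word occurrences for one image."""
    words = textons.predict(filter_responses(gray).reshape(-1, 4))
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

feat = texton_histogram(images[0])
```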
All reported experiments were performed on random splits of the five-decade dataset using linear SVMs.
To combine multiple features, the algorithm trains multiple sets of 1-vs-all classifiers, one set per feature. A test image is assigned to the decade with the highest mean classifier confidence.
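The combination rule can be sketched with scikit-learn's `decision_function`, which returns per-decade 1-vs-all confidence scores. The two synthetic feature matrices below stand in for, e.g., SIFT bag-of-words and color-histogram representations of the same images.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
decades = np.repeat([1930, 1940, 1950, 1960, 1970], 40)

# Two synthetic feature representations of the same 200 images,
# shifted per decade so the demo is learnable.
shift = (decades[:, None] - 1950) / 10.0
feat_a = rng.normal(size=(200, 30)) + shift
feat_b = rng.normal(size=(200, 12)) + shift

# One set of 1-vs-all linear classifiers per feature.
clf_a = LinearSVC().fit(feat_a, decades)
clf_b = LinearSVC().fit(feat_b, decades)

def predict_combined(xa, xb):
    # Mean of the per-decade confidences across the feature-specific
    # classifiers; assign each image to the highest-scoring decade.
    conf = (clf_a.decision_function(xa) + clf_b.decision_function(xb)) / 2
    return clf_a.classes_[conf.argmax(axis=1)]

preds = predict_combined(feat_a, feat_b)
```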
Humans probably rely on two different pieces of evidence when they try to date a photo: recognized objects in the photo (people, clothes, cars, etc.) and artifacts of the photographic process (color, borders, etc.). The experiments in this project show that humans are generally bad at this task.
A simple algorithm that combines different features can significantly outperform humans at this task. The best results were produced by combining SIFT Bag of Words, GIST and Color Histogram features; adding Texton features did not contribute positively to the average performance. These features do not rely on object recognition, but probably encode artifacts of the photographic process instead.