Image Quality Assessment and Data Scale

Sam Birch (sbirch)

May 18, 2011.

Abstract

This project applied established algorithms for image quality assessment at a novel scale in order to empirically estimate the effect of training-database size on image quality assessment. The image features are derived from Ke et al. [1] and reflect common compositional attributes of "good" photos (Ke et al. attempt to distinguish professional from amateur photographs). This work uses a new dataset derived from approximately a year's worth of Flickr data, taking the most and least interesting photos from each day, for a total of about 1.4 million photographs.

Dataset

The dataset was collected in two stages. In the first pass, image metadata was queried through Flickr's image search API. By searching for all images uploaded on a given day and sorting ascending or descending by "interestingness", I could skim the top and bottom 3000 images for that day [1] (Flickr's API starts to return erroneous results after a couple thousand photos). The distribution has a mean of approximately 1.3 million photos per day, so these extrema represent, on average, the top and bottom 0.23% of any given day. (Fig 1: full distribution of photos per day.) This is a much narrower margin than the 10% used by Ke et al. In the second pass, the photos were downloaded from their URLs at Flickr's medium size (at most 640 px on a side).
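A minimal sketch of that collection process, assuming Flickr's standard REST endpoint and the flickr.photos.search method; the API key, date range, and paging limits below are placeholders, not values from the project:

    import requests

    REST_URL = "https://api.flickr.com/services/rest/"
    API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder

    def day_extreme(day_start, day_end, sort, pages=12, per_page=250):
        """Return metadata for ~3000 photos uploaded in [day_start, day_end],
        sorted by interestingness ('interestingness-desc' for the top of the
        day, 'interestingness-asc' for the bottom). Timestamps are Unix epoch
        seconds."""
        photos = []
        for page in range(1, pages + 1):
            params = {
                "method": "flickr.photos.search",
                "api_key": API_KEY,
                "min_upload_date": day_start,
                "max_upload_date": day_end,
                "sort": sort,
                "extras": "url_m",       # medium-size (<= 640 px) image URL
                "per_page": per_page,
                "page": page,
                "format": "json",
                "nojsoncallback": 1,
            }
            resp = requests.get(REST_URL, params=params).json()
            photos.extend(resp["photos"]["photo"])
        return photos

    # Second pass: fetch each image at Flickr's medium size.
    def download(photo, path):
        with open(path, "wb") as f:
            f.write(requests.get(photo["url_m"]).content)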

Implementation

After collecting the photographs, I computed six features for each one:

These six numbers formed the feature vector for each image. For example, the following image:

(Example image.)
has the feature vector:
(0.284374668954535, 5, 0.958191720005291, 241, 8624, -0.0465446681658742)
I split the dataset into 10,000 test images and left the remainder as potential training data. To classify a test image, I collected its 1000 nearest neighbors under an unweighted Euclidean norm (I also tried normalizing each feature by its variance, but this lowered 11-NN classification rates, so I reverted to the unweighted distance). These neighbors were then used to train a local support vector machine [2], which then classified the original point (lazy learning).
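A rough sketch of this lazy-learning step, assuming scikit-learn; the SVM kernel and parameters are my own assumptions rather than values from the report, and a real run would batch the neighbor searches for speed:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.svm import SVC

    def classify_lazy(train_X, train_y, query, k=1000):
        """Classify one test point: find its k nearest training points under an
        unweighted Euclidean norm, fit an SVM on just those neighbors, and
        predict the label of the test point."""
        nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(train_X)
        _, idx = nn.kneighbors(query.reshape(1, -1))
        local_X, local_y = train_X[idx[0]], train_y[idx[0]]
        if len(np.unique(local_y)) == 1:   # all neighbors agree; no SVM needed
            return local_y[0]
        svm = SVC()                        # kernel and C are assumptions
        svm.fit(local_X, local_y)
        return svm.predict(query.reshape(1, -1))[0]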

Results

(Fig 2: performance versus training set size.)

There is an upward trend in performance that does not appear to plateau. The performance of this system is not as good as that of Ke et al., but since the aim was to examine relative performance across training-set sizes, this is of little importance. Due to speed limitations, the system was only tested on 256 examples per class (512 test images per run), which may contribute to noise in the performance numbers.

Raw results:

(TP = true positives, TN = true negatives, FN = false negatives, FP = false positives.)

    Training examples   TP    TN    FN    FP    Precision   Recall   Accuracy   Specificity
    2048 (2K)           162   154    94   102      0.6136   0.6328     0.6172        0.6016
    16384 (16K)         164   159    92    97      0.6284   0.6406     0.6309        0.6211
    65536 (64K)         161   160    95    96      0.6265   0.6289     0.6270        0.6250
    262144 (256K)       161   162    95    94      0.6314   0.6289     0.6309        0.6328
    524288 (512K)       169   159    87    97      0.6353   0.6602     0.6406        0.6211
    1048576 (1M)        172   156    84   100      0.6324   0.6719     0.6406        0.6094
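The summary statistics above follow the usual confusion-matrix definitions; the short snippet below reproduces the 512K-example row:

    def metrics(tp, tn, fn, fp):
        """Precision, recall, accuracy, and specificity from confusion counts."""
        return {
            "precision":   tp / (tp + fp),
            "recall":      tp / (tp + fn),
            "accuracy":    (tp + tn) / (tp + tn + fn + fp),
            "specificity": tn / (tn + fp),
        }

    # 512K training examples: precision 0.6353, recall 0.6602,
    # accuracy 0.6406, specificity 0.6211.
    print(metrics(tp=169, tn=159, fn=87, fp=97))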

References

Footnotes

Acknowledgments

Special thanks to James Hays, for giving this project direction and entertaining many hours of questioning.