Synthetic Photographs for Learning Aesthetic Preferences Soja-Marie Morgens and Arnav Jhala

Late-Breaking Developments in the Field of Artificial Intelligence
Papers Presented at the Twenty-Seventh AAAI Conference on Artificial Intelligence
Baskin School of Engineering
University of California, Santa Cruz
Santa Cruz, California 95064
Abstract
Photo ranking algorithms aim to quantify features within photographs to determine aesthetic quality and learn user preferences over these features. However, current benchmark corpora of photographs contain nonquantifiable contextual information. Moreover, they do not control the variance of quantifiable features. The recently released Panorama data set contains annotated synthetic images with controlled contextual features. The images lie along the full range of quantifiable features. This paper focuses on improving the performance of a predictive learning model trained on the pairwise preferences collected on these images. Predictive models were trained on individual as well as group preferences. Feature selection improved prediction accuracy from the previously reported 67% to 91% on the group preference. On average, individual raters’ preferences were predicted with 87% accuracy. Top features varied widely among individuals. However, the most frequent top three features were tilted horizon line, cropped objects, and number of objects.
In the world of digital cameras and inexpensive data storage, online photography has been inundated with amateur photographs with a large variance in quality. Not only does this dilute the quality of the photographs seen, but it also allows personal aesthetic preferences to play a key role in photo search algorithms. Aesthetic photo ranking models can help users separate the good photos from the bad, actively improve their photos, or potentially suggest better shots to photographers in real time.
Benchmark studies in photo ranking have used the annotated DPChallenge.com photo contest database of 12,000 photos due to its robust labeling of each photo by at least 10 raters (Ke, Tang, and Jing 2006; Yeh et al. 2010; Luo and Tang 2008). Several of these studies have achieved upwards of 93% accuracy in binary classification of a good photo versus a bad photo (Luo and Tang 2008; Yeh et al. 2010; Su et al. 2012). Ke, Tang, and Jing and Yeh et al. use a large number of features traditionally taught in photography composition classes. The most prominent of these high-level features are rule of thirds¹, saliency², color, and balance³. Su et al. look at low-level features using Bag-of-Aesthetics-Preserving features (BoAP), which compare patches of the raw image for color and edges. Other studies have used databases such as Flickr.com (Bhattacharya, Sukthankar, and Shah 2012) and Photo.net (Datta, Li, and Wang 2007). Bhattacharya, Sukthankar, and Shah in particular used just three features, rule of thirds, aesthetic golden ratio⁴, and tilted horizon line, to achieve 86% accuracy at predicting photo ratings on a scale from 1 to 5.
However, contextual bias is inherent in any ratings and rankings of these real photo databases. A badly framed photo of a sleeping kitten will rate much higher than a well composed picture of a table. Moreover, with real photos it may be hard to separate out the weight of each feature in determining quality. Swanson, Escoffery, and Jhala (2012) introduced the Panorama data set of synthetic photos, which controls content bias and samples images along the full range of three quantifiable features. Such a data set could allow one to quickly tune a model to an individual's aesthetic preferences before applying it to real photos. The sections below give an overview of the Panorama data set, the feature selection and tuning presented in this paper, and the final results.
Synthetic Photograph Database
Instead of real photos, Swanson, Escoffery, and Jhala introduce a database of synthetic photos generated by users playing Panorama, a photo-taking game.
Panorama allows researchers to turn off various features,
such as content and color, and focus on a particular subset
of easily quantifiable features. The game Panorama generates
a grayscale landscape populated by simple shapes denoting
houses, trees, and windmills, and users take many photos
of it over a five-minute period. This initial effort sampled
¹ The rule of thirds states that photo subjects are best placed at the intersections of the lines that divide the photo into thirds.
² A saliency map highlights the subject in the photo, often looking at which pixels are most focused.
³ Photographers balance the visual weight of objects in a photo along the line of sight (x=-y) and across the vertical and horizontal lines.
⁴ The aesthetic golden ratio is the ideal ratio of sky to ground (or the inverse) in the picture.
Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
erence model is due to the group preference model training and predicting on only clear preferences. Raters likely
had softer preferences and were given a pairing only once.
Feature weights were also more evenly distributed in the
group model than the individual models, possibly reflecting
that individuals use a few features to determine quality. The
top five features in the group model were differences in the
golden ratio, tilted horizon, number of objects, objects occluded, and objects cropped. Within the individual models,
differences in the tilted horizon, number of objects, number
of objects cropped, and balance dominated, as can be seen
in Figure 2. The rule of thirds was among the least frequent top features
and was never the most heavily weighted feature. This contrasts with earlier studies, which often had the rule of thirds as one of their main
features. The golden ratio by itself was the most accurate
single-feature predictor, with 70% accuracy on the group
preference data set.
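The setup described in the Results section (ridge-penalized logistic regression, cross-validation, and reading feature weights off the fitted model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the batch gradient descent solver, learning rate, epoch count, five interleaved folds, the scaling of the ridge penalty, the feature names, and the toy data are all choices made here for the example.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-35.0, min(35.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, ridge=1e-8, lr=0.1, epochs=500):
    """Fit weights by batch gradient descent on ridge-penalized log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            # The ridge term shrinks weights toward zero; the bias is not penalized.
            w[j] -= lr * (grad_w[j] / n + 2.0 * ridge * w[j])
        b -= lr * grad_b / n
    return w, b


def predict(w, b, xi):
    """Return 1 if photo A is predicted to be preferred, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0


def cross_val_accuracy(X, y, folds=5, **kwargs):
    """Mean held-out accuracy over interleaved folds."""
    scores = []
    for k in range(folds):
        test_idx = set(range(k, len(X), folds))
        train = [i for i in range(len(X)) if i not in test_idx]
        w, b = train_logistic([X[i] for i in train], [y[i] for i in train], **kwargs)
        hits = sum(predict(w, b, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


def top_features(w, names, k=5):
    """Feature names ordered by absolute weight, largest first."""
    return [n for _, n in sorted(zip(map(abs, w), names), reverse=True)[:k]]


# Toy data (not from the study): each row holds feature differences (A minus B),
# and A is preferred (label 1) whenever it has the smaller horizon tilt.
rng = random.Random(0)
names = ["tilted_horizon", "golden_ratio", "num_objects"]
X = [[rng.uniform(-1, 1) for _ in names] for _ in range(200)]
y = [1 if row[0] < 0 else 0 for row in X]

accuracy = cross_val_accuracy(X, y)
w, b = train_logistic(X, y)
print(accuracy)
print(top_features(w, names, k=1))
```

Because the model is linear in the feature differences, the magnitude of each weight gives a direct, if rough, indication of how much that feature drives the predicted preference, which is the property the text relies on when ranking features.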
Figure 1: Screenshot from Panorama Game
synthetic photos along the full range of balance, rule of
thirds, and spacing from five amateur photographers, for a
total of 100 images.
These initial photos were then taken to Mechanical Turk,
where raters were asked for their pairwise preferences using the
four-alternative forced choice (4AFC) method presented in Yannakakis (2009). In total, 2470 unique
comparisons were generated, with at least 10 raters looking
at each comparison. For this study, these ten data points per
comparison were averaged, and those with a clearly delineated
preference (1300 in all) were used to train the group preference model. The individual preference models were trained
on the nine raters who each generated at least 500 comparisons with
no suspicious patterns. The cutoff of 500 or more ratings was
originally proposed in Swanson, Escoffery, and Jhala (2012), although it was not implemented there.
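The averaging and filtering step described above can be sketched as follows. This is only an illustration: the numeric encoding of the four 4AFC answers, the `clear_margin` cutoff of 0.5, and the function names are assumptions introduced here, since the text does not state how a "clearly delineated" preference was defined.

```python
from statistics import mean

# Assumed numeric encoding of the four 4AFC answers (hypothetical).
VOTE_SCORE = {"A": 1.0, "B": -1.0, "both": 0.0, "neither": 0.0}


def group_preference(votes, clear_margin=0.5):
    """Average one comparison's votes; return 'A', 'B', or None if unclear.

    clear_margin is an assumed cutoff, not a value from the study.
    """
    score = mean(VOTE_SCORE[v] for v in votes)
    if score >= clear_margin:
        return "A"
    if score <= -clear_margin:
        return "B"
    return None


def clear_comparisons(ratings, clear_margin=0.5):
    """Keep only comparisons whose averaged vote is clearly delineated.

    ratings maps a (photo_a, photo_b) pair to its list of rater votes.
    """
    labeled = {}
    for pair, votes in ratings.items():
        label = group_preference(votes, clear_margin)
        if label is not None:
            labeled[pair] = label
    return labeled


example = {
    ("img01", "img02"): ["A"] * 8 + ["B"] * 2,        # mean 0.6, kept as A
    ("img03", "img04"): ["A"] * 5 + ["B"] * 5,        # mean 0.0, dropped
    ("img05", "img06"): ["B"] * 7 + ["neither"] * 3,  # mean -0.7, kept as B
}
print(clear_comparisons(example))
```

Raising `clear_margin` keeps fewer but less ambiguous comparisons, which trades training set size against label noise.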
Figure 2: Number of models in which each feature was in
the top five features, based on weight
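A tally like the one in Figure 2 can be computed from the fitted weights of each model. The sketch below assumes each model is available as a mapping from feature name to weight; the rater names and weight values are hypothetical and only illustrate the counting.

```python
from collections import Counter


def top_k_counts(models, k=5):
    """Count how many models rank each feature among their k largest |weights|."""
    counts = Counter()
    for weights in models.values():
        ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
        counts.update(ranked[:k])
    return counts


# Hypothetical weights for two raters (illustrative values only).
models = {
    "rater_1": {"tilted_horizon": -2.1, "num_objects": 0.9, "rule_of_thirds": 0.1},
    "rater_2": {"tilted_horizon": -1.4, "num_objects": 0.2, "rule_of_thirds": 0.6},
}
print(top_k_counts(models, k=2))
```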
Feature Selection
Swanson, Escoffery, and Jhala used 17 high-level features,
such as balance, rule of thirds, symmetry, and number of objects
cropped, as well as a bag of 64 spatial features similar to
Su et al.’s BoAP. That study used 100 of the top features
and an SVM to generate its predictive model.
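As an illustration of how a high-level feature such as rule of thirds can be quantified, the sketch below measures how far a subject lies from the nearest intersection of the thirds lines. The function name, the assumption that the subject's center is known in pixel coordinates, and the normalization by the image diagonal are choices made for this example and are not taken from the cited studies.

```python
import math


def rule_of_thirds_distance(subject_xy, width, height):
    """Distance from the subject to the nearest thirds intersection.

    The result is normalized by the image diagonal (an assumed convention),
    so 0.0 means the subject sits exactly on an intersection.
    """
    x, y = subject_xy
    intersections = [
        (width * i / 3.0, height * j / 3.0) for i in (1, 2) for j in (1, 2)
    ]
    nearest = min(math.hypot(x - px, y - py) for px, py in intersections)
    return nearest / math.hypot(width, height)


# A subject on the upper-left intersection of a 300x300 frame scores 0.0;
# a centered subject is equidistant from all four intersections.
print(rule_of_thirds_distance((100, 100), 300, 300))
print(rule_of_thirds_distance((150, 150), 300, 300))
```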
This study separated the spatial features from the
high-level features and used logistic regression to obtain an interpretable model. The low-level spatial features were used to
train a predictive model separate from the one trained on the high-level features. In addition, the aesthetic golden ratio and tilted horizon line features presented in Bhattacharya, Sukthankar, and
Shah were added to the high-level feature set.
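The two added features, and the way a pair of photos becomes a single input to a preference model, can be sketched as follows. The definitions here are assumptions for illustration: the horizon is taken as a straight line given by its height at the left and right image edges, the golden ratio feature is the gap between the larger-to-smaller split and 1.618, and a comparison is encoded as photo A's features minus photo B's. The exact formulas in Bhattacharya, Sukthankar, and Shah may differ.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, about 1.618


def golden_ratio_deviation(horizon_left_y, horizon_right_y, height):
    """Absolute gap between the sky/ground split and the golden ratio.

    y is measured in pixels from the top edge. The larger region is divided
    by the smaller one so that sky:ground and ground:sky are treated alike.
    """
    sky = (horizon_left_y + horizon_right_y) / 2.0
    ground = height - sky
    small, large = sorted((sky, ground))
    if small <= 0:
        return math.inf  # horizon on or beyond the frame edge
    return abs(large / small - PHI)


def horizon_tilt_degrees(horizon_left_y, horizon_right_y, width):
    """Unsigned angle of the horizon relative to the horizontal axis."""
    return abs(math.degrees(math.atan2(horizon_right_y - horizon_left_y, width)))


def describe(left_y, right_y, width, height):
    """Two-element feature vector: golden ratio deviation and horizon tilt."""
    return [
        golden_ratio_deviation(left_y, right_y, height),
        horizon_tilt_degrees(left_y, right_y, width),
    ]


def pair_features(photo_a, photo_b):
    """Feature vector for a comparison: photo A's features minus photo B's."""
    return [a - b for a, b in zip(photo_a, photo_b)]


level = describe(100, 100, 400, 300)   # level horizon, sky:ground = 1:2
tilted = describe(60, 140, 400, 300)   # same mean height, tilted horizon
print(pair_features(level, tilted))
```

In this example the two photos share the same sky-to-ground split, so only the tilt component of the difference vector is nonzero.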
Conclusion and Future Studies
This study expands the findings of Swanson, Escoffery, and
Jhala, bringing the pairwise preference model accuracy to an
average of 87% for individual preference models. This improvement in accuracy allows better analysis of the predominant features. Moreover, while other studies have focused
on sophisticated features, this study showed a dominance of
more amateur features like cropped objects and tilted horizons in predicting preference. This suggests current photo
ranking models should expand their feature sets and training
data to include more amateur features and photos in order to
best rank the average photo.
Future studies will look into automatically generating the
synthetic image pairings in order to learn an individual’s preferences in as few pairings as possible. These test pairings
will then be brought to Mechanical Turk raters along with
real photos from DPChallenge.com and Flickr.com to
test for correlation between synthetic and real photo preferences. In conclusion, synthetic photo databases, pairwise
preferences, and individualized algorithms give alternative
means for delving into the strong features in amateur photography, features that need to be understood to create more
personalized photographic search engines.
Results
This paper used logistic regression with a ridge estimator of
1.0 × 10⁻⁸ and cross-validation to train for the binary classification of pairwise preference (whether a rater preferred
photo A or photo B). Logistic regression allowed easy viewing
of the weight of each feature. The group preference models had a 91% predictive accuracy using the low-level spatial features and 95% predictive accuracy using the high-level features. The individual preference models had an average predictive accuracy of 92% using the low-level features and 85% using the high-level features, with a range of
75% to 90%. The notable higher accuracy of the group pref-
References
Bhattacharya, S.; Sukthankar, R.; and Shah, M. 2012. A
framework for photo-quality assessment and enhancement
based on visual aesthetics. In Proceedings of the international conference on Multimedia, 271–280. ACM.
Datta, R.; Li, J.; and Wang, J. 2007. Learning the consensus
on visual quality for next-generation image management. In
Proceedings of the 15th international conference on Multimedia, 533–536. ACM.
Ke, Y.; Tang, X.; and Jing, F. 2006. The design of high-level
features for photo quality assessment. In Computer Vision
and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, 419–426. IEEE.
Luo, Y., and Tang, X. 2008. Photo and video quality evaluation: Focusing on the subject. Computer Vision–ECCV 2008
386–399.
Su, H.; Chen, T.; Kao, C.; Hsu, W.; and Chien, S. 2012.
Preference-aware view recommendation system for scenic
photos based on bag-of-aesthetics-preserving features. Multimedia, IEEE Transactions on 14(3):833–843.
Swanson, R.; Escoffery, D.; and Jhala, A. 2012. Learning visual composition preferences from an annotated corpus generated through gameplay. In Computational Intelligence and
Games (CIG), 2012 IEEE Conference on, 363–370. IEEE.
Yannakakis, G. N. 2009. Preference learning for affective modeling. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International
Conference on, 1–6. IEEE.
Yeh, C.; Ho, Y.; Barsky, B.; and Ouhyoung, M. 2010.
Personalized photograph ranking and selection system. In
Proceedings of the international conference on Multimedia,
211–220. ACM.