Measuring and Modifying the Intrinsic Memorability of Images
by Akhil Raju
S.B. EECS MIT 2014
Submitted to the
Department of Electrical Engineering and Computer Science
In Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
May 2015
Copyright 2015 Akhil Raju. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly
paper and electronic copies of this thesis document in whole or in part in any medium
now known or hereafter created.
Author:
Signature redacted
Department of Electrical Engineering and Computer Science
May 22, 2015
Certified by:
Signature redacted
Antonio Torralba, Associate Professor, Thesis Advisor
May 22, 2015
Signature redacted
Accepted by:
Prof. Albert Meyer, Chairman, Master of Engineering Thesis Committee
Measuring and Modifying the Intrinsic Memorability of
Images
by
Akhil Raju
Submitted to the Department of Electrical Engineering and Computer Science
on May 22, 2015, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Images have intrinsic memorable properties that enable humans to recall them. In
this thesis, I developed and carried out a procedure to measure the memorability of
an image by running hundreds of human trials and making use of a custom-designed
image dataset, the Mem60k dataset. The large store of ground-truth memorability
data enabled a variety of insights and applications. The data revealed information
about what qualities (emotional content, aesthetic appeal, etc.) in an image make it
memorable. Convolutional neural networks (CNNs) trained on the data could predict
an image's relative memorability with high accuracy. CNNs could also generate memorability heat maps which pinpoint which parts of an image are memorable. Finally,
with additional usage of a massive image database, I designed a pipeline that could
modify the intrinsic memorability of an image. The performance of each application
was tested and measured by running further human trials.
Thesis Supervisor: Antonio Torralba
Title: Associate Professor
Acknowledgments
I would like to thank my thesis supervisor, Professor Antonio Torralba, and Professor
Aude Oliva for their guidance and expertise throughout my thesis. I would like to
thank many others who have provided advice and assistance, like Phillip Isola.
Also, I would like to give a special thanks to Aditya Khosla for his continual
support and mentorship, for taking time to teach me a great deal and enabling me to
learn and explore computer vision. This thesis would not have been possible without
him.
Finally, I would like to thank my friends and family for always being supportive
and helpful.
Contents
1 Introduction                                            11
  1.1 Motivation                                          12
  1.2 Prior Research                                      13
  1.3 Thesis Overview                                     15

2 Measuring Image Memorability                            17
  2.1 Experimental Setup                                  17
  2.2 The Memorability 60K Dataset                        21

3 Memorability Heat Maps                                  25
  3.1 Building Memorability Heat Maps: Then and Now       25
  3.2 Validating the Heat Maps                            27
      3.2.1 Algorithmic Details - Creating Cartoons       27
      3.2.2 Evaluating Correctness                        29

4 Analysis of Memorability Data                           31
  4.1 Image Datasets                                      31
  4.2 Memorability and Emotions                           33
  4.3 Memorability and Popularity                         35
  4.4 Memorability and Aesthetic Appeal                   37
  4.5 Memorability and Objects                            38
      4.5.1 Predicting Memorability                       39
      4.5.2 Object Categories                             40
      4.5.3 Object Counts and Sizes                       41
  4.6 Memorability and Human Fixations                    42
  4.7 What Makes an Image Memorable                       45

5 Modifying the Memorability of Images                    47
  5.1 Overview of Modification Pipeline                   48
  5.2 Detecting Objects in an Image                       50
  5.3 Semantically Similar Image Retrieval                51
  5.4 Scene Completion                                    52
  5.5 Future Work                                         53

6 Conclusion                                              55
List of Figures
2-1  Experimental setup on Amazon's Mechanical Turk                              18
2-2  Experimental human consistency for different image display times            20
3-1  Example memorability heat maps                                              26
3-2  "Cartoonized" images at different levels of detail                          28
3-3  Cartoons at different levels of memorability, used for heat map validation  29
3-4  Results for heat map validation experiments                                 30
4-1  The difference in memorability between image datasets                       32
4-2  The difference in false alarm rates between image datasets                  33
4-3  The differences in memorability for different emotions - VSO dataset        34
4-4  The differences in memorability for different emotions - Art Photo dataset  35
4-5  Correlation between memorability and popularity                             36
4-6  Popularity for different memorability quartiles                             37
4-7  Correlation between memorability and aesthetics                             38
4-8  Aesthetics for different memorability quartiles                             39
4-9  Most and least memorable objects from Microsoft's COCO                      41
4-10 Example memorability heat map and human fixation saliency map               42
4-11 Correlation between human fixations and memorability                        43
4-12 Comparison between highly and less memorable images for fixation
     consistency and saliency entropy                                            44
5-1  Example facial modification results                                         48
5-2  Example object-detecting Edge Boxes                                         50
5-3  Results of object isolation using GrabCut and Edge Boxes                    53
Chapter 1
Introduction
A group gives a friendly wave to you as they approach during a conference. Two of
their faces are immediately recognizable; you remember their faces from a previous
conference, but the other two seem new. However, once the conversation begins, you
realize that you had met all four just a few months prior. What made two of their faces
so memorable, while the other two were harder to recall?
This attribute of memorability extends beyond faces as well. As we flip through
the pages of a magazine or peruse the Internet, some images stick in our minds more
easily than others, and upon seeing such an image again, we immediately recognize it. The
human visual system has the ability to recall a wide variety of types of images, and it
retains not only the semantic information from a given image, but also many of the
details from that image [2].
Image memorability, the study of how and why images are memorable, has become
a growing field of research over the last few years. While there is some variability in how
memorable an image is to each person, prior work has shown that the memorability
of an image is actually somewhat intrinsic, meaning that most people find the same
types of images memorable or unmemorable [9]. The intrinsic nature of memorability
allows us to measure it through experimentation and exploit it through modification.
The question remains, however, what about these faces, objects, or places make
them memorable? Why can we recall certain images more readily than others? While
these may seem like questions traditionally reserved for psychology and neuroscience,
applying fundamentals from computer vision and machine learning can allow us to
not only better understand what makes an image memorable but to also predict how
memorable an image is and even to change images and make them more memorable.
These are the general questions that the work in this thesis aims to answer. Building upon previous and ongoing research at Professor Antonio Torralba's Computer Vision Group at MIT's Computer Science and Artificial Intelligence Lab (CSAIL) (see
Prior Research, Section 1.2), my research serves to distinguish memorability as a distinct intrinsic image property, define certain human-interpretable characteristics that
help make images memorable, and design an algorithm that enables a computer to
automatically modify an input image to make it more memorable to human observers.
My work accomplishes these tasks by expanding the human studies into memorability,
critically analyzing what human data tells us about memorability, and utilizing the
data to train models that enable computers to automatically modify images, making
them more memorable.
1.1 Motivation
Studying image memorability grants us advances in both academic and industry-driven applications. First, research into image memorability helps enrich our understanding of how the human visual memory system operates. With greater insight
into qualities that make an image memorable, we can better understand the specific
visual cues that strengthen or weaken the visual recall of scenes, objects, people,
etc. This understanding, in turn, can form the groundwork for therapeutic methods in strengthening a person's memory. For instance, deeper knowledge of how the
mind remembers faces could assist in designing visual memory exercises targeted to
enhancing one's facial recall.
From an industry-driven perspective, predicting and modifying the intrinsic memorability of an image could have applications in a variety of sectors, including education and advertising. In many educational programs, remembering images and tying
images to words or phrases is a common method of learning anything from a new
language to how a biological cell works. A method to increase the memorability of
those educational images would improve the effectiveness of such practices. In advertising and general commerce, images used to display a product, person, or experience
are ubiquitous. Billions of dollars are spent in the United States alone to ensure
that viewers will remember the images displayed, and more concrete methods to measure the memorability of different approaches would make those efforts more efficient
and precise.
These examples, however, only scratch the surface of what is possible with a
greater understanding of image memorability. We also want to unlock the door for
others to continue work in this field in order to collectively further our knowledge of
human memory. To do this, we need a strong basis of human memorability data from
which others can begin to perform their own research and development of various
applications.
This thesis describes work that moves towards a better understanding of image
memorability while also opening a platform of data for others to use for research, as
well.
1.2 Prior Research
As mentioned above, existing research into image memorability has shown that, despite expected human variability with respect to memorability, humans tend to find
the same types of images memorable, providing evidence that memorability
is intrinsic to an image [9] [11].
Moreover, past research has shown that it is possible for a computer to predict these intrinsic qualities. Phillip Isola et al. created a support-vector-machine-based
regressor that could successfully (with a probability significantly higher than
chance) predict, given two images, which one would be more memorable to humans
[9]. Isola et al. collected human memorability data on approximately 2200 images by
running experiments on Amazon's Mechanical Turk, and they used that data to train
their SVR models. The research described by them first began to illuminate that
memorability could be measured by experimentation and then predicted by standard
machine learning tools.
Aditya Khosla et al. yielded similar successful results in prediction and measurement for the class of facial images [11]. Khosla et al. went further to show it was
possible to subtly modify the facial features in an image in order to make the face
seem more or less memorable. The modification process includes identifying facial
anchor points, some of which were annotated previously and some of which were calculated on the fly, in order to create an active appearance model (AAM). Once the
face is parameterized, the AAM is then fed into a memorability optimization function
that aims to maximize (or minimize) the face's memorability while not moving the
facial features too much. While the work by Isola et al. revealed the intrinsic and measurable nature of general scene memorability, the work by Khosla et al. showed that
those qualities extend well to human faces and that those qualities can be modified
without changing the semantic content of an image.
In 2012, Khosla et al. also showed that there are specific regions of an image that
are more memorable than others [13]. Intuitively, we understand this, and as we look
at images we can identify their most memorable attributes. However, the work by
Khosla et al. showed that those attributes are predictable and follow a pattern that
computers can understand and learn. The algorithm and machine learning pipeline
they introduce can break down an image into a memorability heat map, distinguishing
which regions are more likely to be remembered than others. As expected, sample
results show that things like people are more memorable than a single tree in a
picture of a forest, but the research proves that a computer system can understand
and distinguish these differences in memorability.
Together, these past experiments prove the feasibility of predicting and modifying the memorability of images. However, each example utilizes a relatively limited
subset of data. For both the work done by Isola et al. and by Khosla et al., they
use approximately 2000 images in their training and testing processes. Furthermore,
Isola et al. use images solely from the SUN database, and Khosla, understandably,
uses only facial images in his modification work. Thus, it is tough to generalize their
findings and applications to many different types of images. Part of the motivation
for the work in this thesis is to expand the number and variability of images used in
memorability experiments.
1.3 Thesis Overview
Chapter 2 describes how we expanded the human studies into memorability and
created the Mem60k dataset for image memorability research. Our research required
further experimentation on Amazon's Mechanical Turk, and we compiled images from
a wide variety of sources in order to obtain a varied and generalizable set of images.
This chapter describes the experimental procedure and the details of Mem60k.
Chapter 3 describes how we find the memorable regions of an image to build memorability heat maps. My work specifically focused on how we could experimentally
validate the heat maps generated by our algorithms.
Chapter 4 describes my analysis into what makes an image memorable. A
large set of human memorability data corresponding to a diverse set of images enables
insight into what memorability is and what factors influence it. My research compared
the attribute of memorability to other traits, like aesthetics and popularity. Similarly,
this thesis finds how emotions, objects, and other attributes affect the memorability
of images.
Chapter 5 describes image modification algorithms which automatically make an
image more (or less) memorable. We aim to go beyond faces and create a scalable
solution to modify the memorability of general images while making use of an
extensive image database. Chapter 5 describes the different approaches we explored.
Chapter 2
Measuring Image Memorability
In order to understand the inherent memorability of images and automatically predict
how memorable images will be, we need to collect a large quantity of human ground
truth data. We need to collect the probabilities that different images will be recalled
by humans, and these probabilities will give insight into how memorable images are.
More memorable images have higher likelihoods of being recalled than less memorable
images.
We measure these probabilities by running experiments with humans and evaluating the probabilities of image recall. This chapter describes the experimental
setup and some parameter selection techniques that enabled us to run cost-effective,
large-scale experiments quickly.
2.1 Experimental Setup
The type of memory we focus on in this thesis is short-term visual memory. In
order to test this, and to measure the likelihood of images being recalled, we run
online experiments (run on Amazon's Mechanical Turk platform) with humans and
ask them to click a button when an image has been shown twice. The data from the
experiments allows us to calculate the likelihood an image is remembered, and we
call these pseudo-likelihoods "memorability scores".
The experimental setup is shown in Figure 2-1. Experiment participants are shown
a sequence of images, some of which re-occur. When an image re-occurs, the participant clicks a button to signify he or she has seen that image before. The same
images are shown to many participants, and scores are only validated if the memorability scores for images are consistent across different groups of participants. The
experiment also contains vigilance tests, which are images that re-occur in relatively
close succession for the purpose of checking that the participant is paying attention
to the experiment. If a participant fails the vigilance tests, his or her test results are
invalidated.

Figure 2-1: The structure of the memory tests run on Amazon's Mechanical Turk.
Memory repeats test the actual memorability of the given images. Vigilance tests
check for the continual attention and responsiveness of participants.
This experimental procedure is largely similar to the experiments used by [9]
and [11]. The experiment contains many tunable parameters that affect the cost
and validity of the experiment. One major parameter is the image display time and
human click time. Each image is displayed for several hundred milliseconds, and after
each image the participant is allotted several hundred milliseconds to click the button
if he or she believes it is an image they had seen previously.
More time-intensive experiments cost more, and since we needed data on
almost 60,000 images, we wanted to find the most cost-effective method of performing
our experiments without invalidating our results. With regard to the timing, we
wanted to show each image for a long enough time for users to comprehend its content,
and we wanted to provide enough time for them to process the image and click between
images.
Our cost constraints set a fixed total time per image (display time plus the click
time), and we wanted to find which balance of timings would not detract from the
quality of our experiments. The total time was fixed to 1500 ms, and we tried several
display times ranging from 500 ms to 800 ms.
We define the quality of an experiment by the consistency of its results across different
humans and as compared to previous memorability experiments [9]. Human consistency refers to checking whether different humans find the same images memorable
or not memorable. If we were to look at two different subgroups of the participants,
the data from each subgroup should result in the same ranking of images in terms of
memorability. We check human consistency by randomly splitting the participants
into two groups, calculating the memorability scores for the images from the data
from each group, and finding the Spearman rank correlation between these two sets
of scores. Ideally, each group would find the same images more memorable than other
images, and thus would lead to a high rank correlation.
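As a concrete sketch of this check (not the thesis' actual analysis code), the split-half consistency can be computed in a few lines of Python. The hits matrix is a hypothetical stand-in for the raw Mechanical Turk responses, and for simplicity it assumes every participant saw every image:

    import numpy as np
    from scipy.stats import spearmanr

    def split_half_consistency(hits, seed=0):
        # hits: (n_participants, n_images) binary matrix; hits[p, i] = 1 if
        # participant p correctly clicked on the repeat of image i.
        rng = np.random.default_rng(seed)
        perm = rng.permutation(hits.shape[0])
        half_a, half_b = perm[:len(perm) // 2], perm[len(perm) // 2:]
        # Memorability score per image = fraction of the group hitting the repeat.
        scores_a = hits[half_a].mean(axis=0)
        scores_b = hits[half_b].mean(axis=0)
        rho, _ = spearmanr(scores_a, scores_b)
        return rho

In practice the random split would be repeated many times and the resulting correlations averaged, which smooths out the randomness of any single split.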
For our parameter selection through experimentation, we only used images that
Isola et al. used in their experiments, and thus we were able to use their results as
a baseline for our own analyses. We evaluate our consistency with their results in a
similar fashion as previously described: by calculating the rank correlation between
our memorability scores and their baseline memorability scores. Each display time
was tested with an experiment consisting of 100 images, and each image was viewed
by 80 different participants. There were two types of memorability scores we looked
at, and their formulas were as follows:
mem = hit_count / show_count

mem_FA = (hit_count - false_alarms) / show_count

where mem and mem_FA are the memorability scores, show_count is the number
of times the image was shown, hit_count is the number of times the image had been
correctly clicked on during the memory repeat, and false_alarms is the number of
times the image had been incorrectly clicked on during the image's first showing.
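In code, both scores reduce to simple counting. A minimal sketch (the argument names are illustrative, not the thesis' actual data schema):

    def memorability_scores(show_count, hit_count, false_alarms):
        # mem ignores false alarms; mem_FA penalizes them.
        mem = hit_count / show_count
        mem_fa = (hit_count - false_alarms) / show_count
        return mem, mem_fa

    # e.g. an image shown to 80 participants, with 56 hits and 4 false alarms:
    # memorability_scores(80, 56, 4) -> (0.7, 0.65)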
Each score yielded similar results, but for most of the consistency measurements,
we used the second memorability score type, which took false alarms into account.[1]

[1] We use these scores for most analyses and applications in this thesis.
Figure 2-2: Human consistency for different image display times. The total time per
image was fixed to 1450 ms while the balance between display time and click time was
altered. As visible through the intra-experiment consistency, significantly reducing
either time reduces the overall consistency.
For the experiments, we hypothesized that humans needed as much time as possible to view an image in order to process it. Thus, we expected that as the image
display time increased, the intra-experiment consistency and the consistency with the
baseline memorability scores would increase as well. However, our several experiments
revealed that the allocated click time is nearly as important as the image display time
(see Figure 2-2).
The results show that both the click time and the image display time have significant effects on human performance. Disproportionately favoring one or the other causes
a decrease in human consistency; rather, a balance of the two is shown to enable
the highest possible consistency for a given total time. This contradicted our hypothesis
and showed us that click time was more important to our experimental setup than
previously thought.
Other parameters of the experiment were chosen in a similar manner, but the
image timing was the most significant and thus the only one this thesis covers in full
detail.
2.2 The Memorability 60K Dataset
In order to increase the prediction performance and to expand the applicability of
image memorability research, we needed to gather more information on which images
humans find memorable, and how memorable they truly are. The work done by Isola
et al. [9] uses about 2000 images collected from the SUN dataset to perform its
analyses on memorability. Even from 2000 images, it draws significant insights into what
makes images memorable, but 2000 images from one particular dataset is too few to
support more generalized prediction and modification techniques. The work by Khosla
et al. [11] uses a dataset of a similar size (approximately 2000 facial images), but the
work is applied very specifically to facial memorability and modification. For their
purposes, a smaller dataset still yields great results.
However, our ultimate goal is general understanding, predictive abilities, and further applications. For these, significantly more data is required on a much more
varied set of images. Such a dataset did not exist prior to our work, so we created
the Memorability 60K (MEM60k) dataset, containing approximately 60,000 images
pulled from a variety of other existing image datasets. Using the experimental setup
described in the previous section, we gathered human data on each image which allowed us to calculate memorability scores for each image. These scores could then be
used for a variety of applications. The next chapters in this thesis discuss some of the
insights and applications we derived from the memorability scores of the MEM60k
dataset. The remainder of this section briefly describes the various datasets we pulled
images from and the initial results of our experiments.
" Aesthetic Visual Analysis Database [16]: contains images across several different
categories along with metadata regarding a quantified measure of their aesthetic
appeal to humans.
" Abnormal Image Dataset [19]: contains images of strange or abnormal objects
that dont occur in the real world
* Abstract Photo Dataset [15]: contains abstract images of designs, similar to
21
abstract textures.
* Art Photo Dataset [15]: contains artistic images and accompanying metadata
regarding regarding the specific emotions that each photograph invokes
* COCO [14]: Microsofts Common Objects in Context dataset contains images
along with annotations regarding the size and type of all objects found in each
image.
" MIRFlickr [8]: contains images a wide-variety of images from Flickr under the
Creative Commons licenses, along with metadata and labels for each image
" MIT300 Fixation Dataset [10]: contains images from Flickr that were initially
used for studies in human fixations on images. Also contains the human fixations
and saliency maps for each image.
* Object and Semantic Images and Eye-tracking dataset [22]: contains images
with object labels and human fixations data.
" SUN [21]: contains many types of images initially curated for scene understanding. Accompanying the images are scene and object annotations/data.
" Visual Sentiment Ontology [1]: contains images from Flickr along with their
view-count data, which gives insight into the popularity of each image.
We pulled images from all of the above datasets to create the MEM60k dataset.
Using the previously described experimental setup, we collected 80 labels per image,
running experiments with hundreds of individuals to do so.
The human consistency within the experiment matched fairly well with the previous experiments run by Isola et al. Our average rank correlation between the memorability scores as determined by two different randomly divided subsets of participants
was 0.68, while the Spearman rank correlation from the experiments by Isola et al.
was 0.75. Our slight decrease may simply be due to the higher volume of images we
are using, along with shorter image display and click times allotted per image, which
we needed for cost reasons.
Overall, however, the consistency shows that the data we collected was reasonable: different groups of people found the same images to be the most (or least)
memorable. Thus, we were able to utilize the memorability scores we calculated to
derive conclusions on what makes an image memorable, make predictions on what
parts of an image are memorable, and begin to modify images to make them more
memorable.
Chapter 3
Memorability Heat Maps
The Memorability 60k dataset and the memorability scores that we found through
experimentation led to a variety of insights and applications. One application developed by Aditya Khosla and others was an algorithm to find the memorable regions of
an image. For instance, given an image with a girl standing in a forest, typically the
girl will be the most memorable aspect of the image while the trees are less memorable. Essentially, Khosla developed a memorability heat map generator, which could
pinpoint which regions of the image were most memorable and which regions were
more boring and forgettable. Some examples of these heat maps can be found in
Figure 3-1.
Empirically, these heat maps seem to correctly pinpoint the most memorable parts
of an image. However, how can we tell for sure? This chapter first describes the work
done by Khosla et al. to build the memorability heat maps, and then it details how
we validated the correctness of these heat maps.
3.1 Building Memorability Heat Maps: Then and Now
In 2012, Khosla et al. used existing memorability data collected from the experiments
performed by Isola et al. to create a methodology for finding the memorability of
specific image regions and for generating memorability heat maps from those regions
[13] [9].

Figure 3-1: Memorability heat maps automatically generated for various images. The
red regions denote regions with higher memorability than the blue regions.

Khosla et al. modelled memorability as a noisy process that may add or
remove elements of an image when the human visual system converts an image from
its external representation to its internal representation. For small segments of the
image, the algorithm would extract different image features from the segment and
feed the features through a noisy memory process and into a linear regressor that
could compute the memorability of that segment. Each feature type (color, HOG,
semantic, etc.) would generate its own heat map, and the different heat maps would
be pooled together to create an overall memorability map.
The memorability heat map process described in [13] does a good job of predicting
the overall memorability of an image and of finding the memorable portions, but it was
trained and tested on a relatively small and homogeneous dataset. The Memorability
60k dataset provides a richer source of data due to its size, and in order to take full
advantage, Khosla et al updated the memorability heat map generation process to
utilize convolutional neural networks (CNNs) to predict the memorability of segments.
The CNNs find how memorable each segment of the image is, for various sizes of
segments. The different segments are blended together to build the memorability
heat maps (see Figure 3-1). See our new paper for more details on how the CNNs are
designed and trained.
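Since the CNN details are outside the scope of this chapter, the sketch below illustrates only the blending step under stated assumptions: predict_mem stands in for the trained CNN regressor, and overlapping window scores are simply averaged per pixel (the actual pooling in the full system may differ):

    import numpy as np

    def memorability_heat_map(image, predict_mem, sizes=(64, 128, 256), stride=32):
        # Score sliding windows of several sizes and average the overlapping
        # scores per pixel; predict_mem(crop) -> float is assumed given.
        h, w = image.shape[:2]
        heat = np.zeros((h, w))
        counts = np.zeros((h, w))
        for s in sizes:
            for y in range(0, max(h - s, 0) + 1, stride):
                for x in range(0, max(w - s, 0) + 1, stride):
                    score = predict_mem(image[y:y + s, x:x + s])
                    heat[y:y + s, x:x + s] += score
                    counts[y:y + s, x:x + s] += 1
        return heat / np.maximum(counts, 1)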
3.2 Validating the Heat Maps
As mentioned before, the heat maps constructed by the new process that takes advantage of the data from MEM60k seemed to make sense, empirically. However, we
needed to confirm our beliefs through experimentation.
We wanted to test whether the memorability heat maps correctly differentiated
memorable and unmemorable regions of an image. One way to do this is to create
new images that emphasize or de-emphasize those memorable regions. Images where
the memorable regions are emphasized should be more memorable than images where
those regions are de-emphasized.
This section discusses how we algorithmically created images that emphasized
memorable regions of an image and details the experimental setup and results for
evaluating the correctness of our memorability heat maps.
3.2.1 Algorithmic Details - Creating Cartoons
In 2002, DeCarlo and Santella developed a method to utilize human fixation data in
order to create artistic renderings of photographs [6]. Their procedure uses a hierarchical color segmentation and filtering scheme to make each image more cartoon-like,
and the parameters of their scheme can be adjusted to allow more or less detail per
image segment. They take advantage of saliency maps created from human fixations
to pinpoint which regions of an image are more important than others, and these regions are designated to have more detail than the rest. In a similar manner, we were
able to take advantage of our memorability heat maps to designate which segments
of the image are more important than others, and thus contain more detail.
Our algorithm works as follows. An input image is converted into several cartoonized versions, each with a different level of detail. For each cartoon, the input
image is segmented by color using the Rutgers EDISON system [3]. The color segmenter from EDISON uses a mean shift filter and has three main parameters, a range
bandwidth h_r, a spatial bandwidth h_s, and a minimum segment size M. We chose
their values based on the desired level of detail d such that as d went up, the other
parameters would decrease.

(a) d = 0.1   (b) d = 0.5   (c) d = 0.9

Figure 3-2: The cartoons automatically generated for different levels of detail d using
our cartooning algorithm. These images have the same level of detail across the image
and do not take into account the memorability heat map data.

The segmented image is assigned one color for each segment
(the average color for that segment from the input image). A Canny edge detector
finds edges in the input image, again with parameters dependent on d, such that
more edges are found as d increases. We taper each edge by using the binary image
of edges and sequentially dilating the image more and more as we come closer to the
edge's center. These tapered edges are added to the segmented image to result in our
final "cartoonized" image. See Figure 3-2 for examples of the cartoons at different
levels of detail.
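A rough Python approximation of this cartooning step is sketched below. It substitutes OpenCV's pyrMeanShiftFiltering for the EDISON segmenter, omits the edge tapering, and uses illustrative parameter mappings from d, so it only approximates the procedure described above:

    import cv2

    def cartoonize(image_bgr, d):
        # Higher detail d -> smaller mean-shift bandwidths, more Canny edges.
        sp = int(30 * (1 - d) + 5)    # stands in for the spatial bandwidth h_s
        sr = int(60 * (1 - d) + 10)   # stands in for the range bandwidth h_r
        flat = cv2.pyrMeanShiftFiltering(image_bgr, sp, sr)
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        lo = int(200 * (1 - d) + 20)  # lower Canny threshold drops as d grows
        edges = cv2.Canny(gray, lo, 2 * lo)
        out = flat.copy()
        out[edges > 0] = 0            # draw the detected edges in black
        return out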
For each image, we find the memorability of the various image segments by looking at the memorability heat map. For the most memorable segments, we extract
those regions from the high-detail cartoon, and the least memorable regions are extracted from the low-detail cartoon. The resulting image, which emphasizes the highly
memorable regions, is finally smoothed along the cut lines. To create an image which
emphasizes the least memorable regions, we extract those regions from the high-detail
cartoon and the most memorable regions from the low-detail cartoon. Finally, we also
create a baseline image which randomly selects half the segments (as measured by
area) to be assigned to high or low detail.
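The compositing step can be sketched as follows; thresholding the heat map per pixel and feathering the mask is a simplification of the per-segment assignment and seam smoothing described above:

    import cv2
    import numpy as np

    def emphasize_regions(high_detail, low_detail, heat_map, invert=False):
        # Keep high detail where the heat map is above its median (about half
        # the image area); invert=True emphasizes the least memorable regions.
        mask = (heat_map >= np.median(heat_map)).astype(np.float32)
        if invert:
            mask = 1.0 - mask
        mask = cv2.GaussianBlur(mask, (31, 31), 0)[..., None]  # soften cut lines
        return (mask * high_detail + (1 - mask) * low_detail).astype(np.uint8)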
Examples of the resulting images can be seen in Figure 3-3.
Figure 3-3: Examples of the cartoons at different memorability levels. Each row shows
the original image, the memorability heat map, and the cartoons that emphasize the
high or low memorability regions. The medium column emphasizes half the regions
randomly. The memorability score for each cartoon is included.
3.2.2 Evaluating Correctness
We expect the images for which the most memorable regions are emphasized to be
more memorable than the baseline, and images where the least memorable regions
are emphasized to be less memorable than the baseline. If this is the case, it gives
evidence that our memorability heat maps are correctly differentiating the memorable
and unmemorable regions in an image.
In order to measure the memorability of the various cartoon images, we use a
similar experimental setup to the one described in Chapter 2. We host visual experiments on Amazon's Mechanical Turk in which participants are shown a sequence of
images and told to click a button each time they see an image repeated. Each image
is shown to 80 participants, and the proportion of participants who correctly find the
repeated image relates to the memorability score for that image.
We tested 250 images from the MEM60k dataset, creating three versions of each
image: one which emphasizes the most memorable regions, one which emphasizes the
least memorable regions, and a baseline version which emphasizes a random selection
of regions. The filler images used in the experiment (images that do not repeat)
were also constructed using the same scheme outlined in the previous section. Each
exercise contained approximately 100 images, and we ensured that participants would
never see two different versions of the same image.

Figure 3-4: The memorability scores for the cartoons that emphasize the high or low
memorability regions of an image. The medium cartoons are the baseline images.
The resulting memorability scores from our experimentation are shown in Figure
3-4. As visible, the cartoons where the most memorable regions are emphasized
are more memorable than the baseline, and the cartoons where the least memorable
regions are emphasized are less memorable than the baseline, as expected. All the
differences between the memorability scores of the low, high, and baseline images
were found to be statistically significant using an alpha of 0.05.
The results from our experiments validate the memorability heat maps generated
by our CNN-based algorithm. Also, they begin to shed light on methodologies that
could modify the memorability of images.
As shown, accentuating certain aspects
of an image can significantly affect the memorability of that image.
In Chapter 5
we will further explore this topic and ways to utilize the information stored in the
memorability heat maps to modify the intrinsic memorability of images.
Chapter 4
Analysis of Memorability Data
The experimental procedure outlined in Chapter 2 was utilized to gather memorability
data for almost 60,000 images. Due to the high number of images and their diversity,
the information we gathered allows us to begin to see what makes an image memorable
and how memorability relates to different characteristics an image might have
(how popular the image is, what objects the image contains, and so on). The images
were collected from several different existing image datasets, as mentioned in the
previous chapter, and each dataset also contained further information and attributes
that we could relate the memorability data to.
4.1 Image Datasets
Our first check was to see if our different datasets were in fact unique in content
and memorability.
We hypothesized that different content in an image would lead
to different memorability scores, a hypothesis that had been confirmed in previous
smaller scale experiments [9]. Each dataset contains sets of different types of images
that contain different material, and thus we hypothesized that the memorability scores
across the datasets would be different. It is important to reiterate that the images
from the different datasets were mixed together and shown in a random order to the
human participants, so there was no bias towards any particular dataset.
Figure 4-1 shows how the memorability scores for images from different datasets
were, in fact, different.

Figure 4-1: The difference in memorability between image datasets. (a) shows the
scores of the different datasets. (b) shows which differences in memorability scores
are statistically significant. Green means the row header is greater than the column,
red means vice versa. Blue means they are equal. [Datasets: Abnormal (An), Abstract
(At), COCO (C), Fixation Flickr (FF), MIRFlickr (M), Popularity/VSO (P), SUN (S).]

Furthermore, as shown in Figure 4-1b, the differences between
the memorability scores of different datasets were statistically significant.
The p-scores shown were calculated using a two-sided t-test with an alpha cutoff of 0.05.
The statistically significant differences in memorability are shown with green boxes.
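A sketch of one such pairwise comparison, using scipy's two-sided t-test on the per-image scores of two datasets (the function name is illustrative):

    from scipy.stats import ttest_ind

    def datasets_differ(scores_a, scores_b, alpha=0.05):
        # Two-sided t-test on the memorability scores of two image datasets.
        t_stat, p_value = ttest_ind(scores_a, scores_b)
        return p_value < alpha, p_value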
The results of this comparison support some of our intuition. As Figure 4-1a
shows, the abnormal dataset, which contains strange objects and images that are not
normally seen in the real world, tends to be very memorable, while the SUN dataset,
which typically contains relatively mundane images of general scenes, has the lowest
memorability.
In addition to differences in memorability, the various datasets also differ in homogeneity.
We can see this through the false-alarm rates of the images from each
dataset. A false alarm occurs when a participant clicks on the image, believing it was
a repeat, when in fact it was not. In some sense, the false alarm rates are an inverse
metric to the memorability scores, and they give a sense of how similar an image is to
the other images in the overall set of images. We would expect that images that are
not memorable are very similar to other mundane images, and our results support
this hypothesis. As expected, the datasets with low memorability scores had high
false alarm rates, and vice versa. See Figure 4-2 for more details.

Figure 4-2: The different false alarm rates for various image datasets.
4.2 Memorability and Emotions
Two image datasets used in our Memorability 60k dataset, the Visual Sentiment Ontology set [1] and the Image Emotion/Art Photo set [15], contain metadata regarding
the specific emotions each image represents. We were able to correlate the different
emotions to the memorability scores we had gathered.
We hypothesized that vastly different emotions would yield different levels of memorability. More specifically, because our experiments mainly tested short-term visual memory, I expected more exciting emotions, like fear and amazement,
would yield higher levels of memorability than more calm emotions, like happiness.
Our use of two different datasets allowed us to cross-validate the results we received, and those results can be seen in Figure 4-3 and 4-4. These figures show how
the memorability scores compare across different emotions and show that the differences between the different emotions were, for the most part, statistically significant.
Statistical significance was determined by running two-sided t-tests with an alpha
cutoff of 0.05.

Figure 4-3: The differences in memorability for different emotions, with data from
the VSO dataset.
The results of our experiment validate our hypothesis that different emotions
yield different levels of memorability. This result was supported by both independent
datasets, allowing us to conclude this more strongly. Our more specific hypothesis
that more exciting emotions would be more memorable than calm emotions was
found invalid, though. While both datasets showed that some calm emotions are less
memorable (contentment and serenity were the least memorable emotions of the two
datasets), exciting emotions were found throughout the spectrum of memorability.
Also, classifying the emotions as exciting or calm is slightly subjective, so supporting
this hypothesis is difficult to do.
Interestingly, both datasets had disgust as the most memorable emotion, and
disgust was found to be statistically significantly higher than all other emotions for
both datasets. This may show that feelings of disgust and the images that trigger
those emotions are more readily remembered than other things, which would explain
why some marketing campaigns that rely on shocking or disgusting their recipients
into action (for instance, an environmental campaign showing the effects of oil spills
on wildlife) work so well.
Figure 4-4: The differences in memorability for different emotions, with data from
the Art Photo image dataset. (a) shows the memorability scores. (b) shows how
many of the differences are statistically significant (highlighted in green). Statistical
significance was determined with one-sided t-tests, and their resulting p-scores are in
the table cells.
4.3 Memorability and Popularity
The Visual Sentiment Ontology dataset gives information on how popular each of its
images is. It derives its images from Flickr and gives information on how many times
the image has been viewed, and when those views occur. The view count, after being
normalized for time, gives insight into how popular an image is. The normalization
process is necessary to gather any signal from the view count data, and the process for
normalization is derived from [12]. Images that have higher view counts are deemed
more popular, and in this thesis, the normalized view count is also referred to as a
popularity score.
We hypothesized that memorability would be strongly related to popularity and
that more popular images would tend to be more memorable. The intuition behind
this is simple: popular images tend to be visually striking and thus more likely to
be remembered, even if just viewed for a moment. We expected to see a strong
positive rank correlation between the popularity scores and the memorability scores.

Figure 4-5: Memorability scores vs. popularity scores. (a) shows the relationship over
all of the images, and (b) shows a specific slice of the images.

However, our data did not fully support this hypothesis. As shown in Figure 4-5,
there is a very low overall rank correlation between the memorability scores and the
popularity scores; there is no clear relation between the two attributes overall, and
they appear seemingly independent. Upon closer inspection, though, the two attributes
are related. As shown in Figure 4-6, the most memorable images had the highest
popularity scores, while the images with the lowest memorability scores also tended
to have the lowest popularity.
Figure 4-6: The popularity scores for different memorability quartiles of images. The
top quartile has the 25% most memorable images, the bottom quartile has the least
memorable images. As shown, the most memorable images tend to be the most
popular.
4.4 Memorability and Aesthetic Appeal
The Aesthetic Visual Analysis (AVA) dataset [16] gives information on how aesthetically appealing each of its images is. In a similar vein to the popularity analyses
from the previous section, we checked if there is a relationship between the aesthetic
appeal of an image and its inherent memorability.
We hypothesized that more memorable images would have higher aesthetic appeal.
We reasoned that more pleasing or tasteful images would better stick in our visual
memory and would be easier to recall. More mundane or less pleasing images
would be easily forgettable.
Once again, as with our exploration into popularity, aesthetic appeal did not show
a strong overall rank correlation with memorability. Figure 4-7 shows that the rank
correlation between the aesthetic scores from the AVA dataset and our computed
memorability scores is 0.08, indicating essentially no link between memorability and
aesthetic appeal. A look at a slice of the images, namely those with the highest memorability scores, underscores the seeming independence between the two attributes.
The distribution of aesthetic scores appears to remain the same regardless of the
memorability score.

Figure 4-7: Memorability scores vs. aesthetic scores from the AVA dataset. There is
little overall correlation, and the two attributes are seemingly independent.

As shown in Figure 4-8, however, the most memorable images do seem
to have slightly higher levels of aesthetic appeal, which supports our hypothesis.
However, empirically, we can see that the differences are less significant than those
found in the popularity analysis (see previous section).
We conclude that, while
there may be some relation between memorability and aesthetic appeal, the relation
is relatively weak, or at least weaker than initially hypothesized.
4.5 Memorability and Objects
Several of our datasets provided information on the types and sizes of objects present
in their images. We aimed to utilize this information to explore how specific objects
and their sizes affect memorability.
The datasets SUN, COCO, and MIR Flickr supplied object labels, but for our
analysis we primarily looked at the metadata from COCO [21] [14] [8]. COCO offers
a happy medium with respect to depth of information between SUN and MIR Flickr.
The two others offer either too few or too many object categories for our analyses,
so we opted to focus on the 5000 images we used from COCO.

Figure 4-8: The aesthetic scores for different memorability quartiles of images. The
top quartile has the 25% most memorable images, the bottom quartile has the least
memorable images. There seems to be little effect of memorability on aesthetic quality.

COCO supplies 80
object categories along with the object sizes of each instance in an image. This
information enabled us to see how object category, object counts, and object sizes
affect memorability.
4.5.1 Predicting Memorability
Object labels offer a rich supply of information, and it has been previously exhibited
that object metadata can provide the basis for predicting memorability [9]. We follow
a similar approach to Isola et al. and predict the memorability of images based on
several different types of object metadata.
For the labelled data types, we assemble length d feature vectors where d is the
number of object categories supplied by COCO. For object counts, we count the
instances of each object category. For sizes, we either look at the maximum size of
an instance for each object category, or we look at the sum of sizes. For both types,
we normalize our vectors and feed them through a histogram intersection kernel.
These feature vectors are fed into a support vector regression (SVR) with the
ground-truth memorability scores as their labels. To determine the correctness of
each feature vector method, we looked at the rank correlation between the predicted
memorability scores and the actual memorability scores for a test set of images (separate from the training set used). The object-counts feature vectors led to a rank
correlation of 0.39, and the object-sizes vectors led to a rank correlation of 0.41.
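A sketch of this prediction setup, using scikit-learn's SVR with a precomputed kernel; the random toy data stands in for the real COCO count vectors and memorability labels:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.svm import SVR

    def hist_intersection(A, B):
        # Histogram intersection kernel: K[i, j] = sum_k min(A[i, k], B[j, k]).
        return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

    rng = np.random.default_rng(0)
    X = rng.random((500, 80))              # 80 object-count features per image
    X /= X.sum(axis=1, keepdims=True)      # normalize each feature vector
    y = rng.random(500)                    # ground-truth memorability scores
    X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

    svr = SVR(kernel="precomputed")
    svr.fit(hist_intersection(X_tr, X_tr), y_tr)
    pred = svr.predict(hist_intersection(X_te, X_tr))
    rho, _ = spearmanr(pred, y_te)         # evaluate by rank correlation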
4.5.2 Object Categories
First, we explored the differences in memorability between different object categories
and determined which types of objects were the most memorable.
Each image contains several types of objects, so simply looking at the memorability scores of images does not give a valid comparison of the object categories. For
instance, if most images of horses tended to contain barns as well, it would be difficult
to compare the two object categories in terms of image memorability scores alone because they share the same scores. Thus, instead of only using the raw memorability
scores, we predict the memorability of the image with and without a given object and
measure how the predicted memorability changes. A more memorable object will
have a larger decrease in predicted memorability when it is removed from the image.
We predict the memorability using the method outlined in the previous section, and
we processed each image, removing each object in the image and checking to see how
its predicted memorability changed. Averaging over all the instances of an object
category allows us to rank the categories in order of their effects on memorability.
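Under the assumptions of the SVR sketch above, the per-category importance measure could look like the following; re-normalizing the count vector after removal is omitted for brevity:

    import numpy as np

    def category_importance(svr, X_train, X, cat, kernel):
        # Average drop in predicted memorability when object category `cat`
        # is removed from every image that contains it.
        has = X[:, cat] > 0
        X_with = X[has]
        X_without = X_with.copy()
        X_without[:, cat] = 0              # "remove" the object's instances
        drop = (svr.predict(kernel(X_with, X_train))
                - svr.predict(kernel(X_without, X_train)))
        return drop.mean()                 # larger -> more memorable category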
Figure 4-9 shows some results. Interestingly, many categories for smaller, handheld objects, like ties, bananas, and donuts, were very highly ranked, while categories
for non-household animals, like giraffes and zebras, ranked very low. The high ranking
of smaller object categories may be due to how people typically photograph them.
As somewhat visible in Figure 4-9, these smaller objects are often photographed in
ways that accentuate them, meaning they are large and the only object in the image.
As we will discuss in the following section, these types of images tend to be more
memorable. The animals, on the other hand, may simply be difficult to differentiate. A
person may find it tough to distinguish between two different pictures of zebras. This
reasoning is pure conjecture, however. There is no evidence to support this.
[Figure panels: tie (+0.040), scissors (+0.039), banana (+0.031); skis (-0.069),
bear (-0.069), giraffe (-0.115)]

Figure 4-9: We evaluate the importance of an object category to memorability by looking
at its effect on predicted memorability when the object is added to the image. The
top row shows the most memorable object categories, and the bottom row shows the
least. Each category shows images for which adding the object has the most/least
impact on predicted memorability.
4.5.3 Object Counts and Sizes
The COCO object annotations give information into how many and how large the
different objects are in the image. We used these pieces of information to predict
memorability and to rank the categories in terms of memorability. We also wanted
to see how they individually relate to our measured memorability scores.
First we explored how the size of an object relates to its memorability. Empirically, we saw in the previous section that larger objects in an image tended to be
more memorable. This was further supported by our experiments here. The rank
correlation between the average size of an object in an image and the image's memorability score is 0.39, and the rank correlation between the size of the largest object
in the image and the memorability scores was 0.37. These correlations are relatively
high, and they show that larger objects tend to make an image more memorable.
Also, we found that the number of objects in an image detracts from its memorability. We found that the rank correlation between the number of objects in an
image and its memorability score is -0.17. While the correlation is not too strong,
it illustrates a similar notion as the object size data does: having only a few, large
objects in an image makes it more memorable.
Figure 4-10: An image, its memorability heat map, and its human fixation saliency
map.
4.6 Memorability and Human Fixations
Human fixations refer to where people tend to look when viewing a specific image.
For instance, given a picture of a smiling girl in a forest, most people would tend
to fixate on the girl first before looking at the rest of the image. Human fixation data
gives insight into what the human visual system immediately focuses on, and we
hypothesized that human fixations would relate in some way to memorability.
The OSIE and Fixation Flickr datasets provide data on the human fixations for their
images [22] [10]. For most of our analyses, though, we simply relied on the Fixation
Flickr dataset. The two datasets are relatively similar in content, and we opted to
use the Fixation Flickr dataset due to ease of access.
From Khosla et al. [13], we have a method of determining which regions in an
image are memorable and generating a heat map to visualize these results. It is worth
noting that the experiments used in this thesis do not use the same process outlined
in [13], but instead we make use of a deep-learning model that achieves the same
memorability heat map with more precision. See Chapter 3 for details on the process
and the method in which we evaluate the accuracy of the heat maps.
We hypothesized that the most memorable regions of the image would also be
the most salient for human viewing, and we wanted to test whether humans tend to
fixate on the memorable regions of an image, and if the human saliency maps found
in studies like [10] are similar to the memorability heat maps we can generate.
To measure the relationship between fixation points and memorability, we wanted
to see whether the human fixation points tended to lie in the most memorable regions
of the image.

Figure 4-11: The number of fixation points covered by pixels either chosen at random
or in order of memorability. This shows memorability and fixations are somewhat
correlated.

Figure 4-11 shows the number of fixation points that lie above a
certain threshold of memorability. As shown, the number of points grows faster than
if we chose pixels at random, indicating that the more memorable regions of the
image are more likely to contain human fixation points. This aligns with our hypothesis and provides evidence that memorability and human fixations are positively related.
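The analysis behind Figure 4-11 can be sketched as follows; the heat map and fixation inputs here are random stand-ins, and the function names are illustrative rather than taken from the thesis code.

```python
# Sketch of the coverage analysis behind Figure 4-11, under assumed inputs:
# `mem_map` is a predicted memorability heat map and `fixations` a binary map
# of human fixation points, both HxW arrays.
import numpy as np

def fixation_coverage(order, fixations, n_steps=100):
    """Fraction of fixation points covered as pixels are added in `order`."""
    fix_flat = fixations.ravel()[order]
    cum = np.cumsum(fix_flat) / fix_flat.sum()
    idx = np.linspace(0, fix_flat.size - 1, n_steps).astype(int)
    return cum[idx]

rng = np.random.default_rng(0)
mem_map = rng.random((240, 320))            # stand-in heat map
fixations = rng.random((240, 320)) < 0.001  # stand-in fixation points

by_mem = fixation_coverage(np.argsort(-mem_map.ravel()), fixations)
by_rand = fixation_coverage(rng.permutation(mem_map.size), fixations)
# On real data, `by_mem` rising faster than `by_rand` indicates that
# fixations cluster in the most memorable regions.
```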
To further explore this relationship, we looked at the human saliency maps, which
show where human eyes tend to linger as they view a scene. While the fixation points
are specific points, the saliency maps are akin to the memorability heat maps, as
shown in Figure 4-10. We looked at the saliency maps in two ways: one, we wanted to
see how the saliency maps and the memorability heat maps were correlated, and two,
we wanted to see how the spread of the saliency map was correlated with memorability
scores.
We hypothesized that the saliency and memorability heat maps would be very
closely linked. As mentioned before and supported by our initial heat map experiments, we assumed that humans would tend to fixate on the memorable regions. To
measure the correlation between the two-dimensional maps, we vectorized each map
and found their rank correlation. The average rank correlation found between the maps was 0.10. This correlation was not strong enough to support our hypothesis.

Figure 4-12: There are statistically significant differences in human fixation consistency (a) and saliency-map spread (b) between the most memorable and least memorable images; each panel plots per-image scores for the 25% most memorable versus the 25% least memorable images.
We also hypothesized that the more spread out a saliency map was, the less memorable that image would be. The reasoning stems from our object analyses (see previous section): fewer objects and more focus in an image yield higher memorability. Thus, if a saliency map is spread out, we reasoned that the image contains many objects and humans wouldn't know where to fixate. We measured the spread of a saliency map by looking at its entropy. The rank correlation between the entropies and the memorability scores was
-0.24, and Figure 4-12 shows how the more memorable images tended to have lower
entropies in their saliency maps. These two pieces of evidence support our hypothesis
and further support the conclusions we drew from our analyses with the COCO object
data. Fewer areas in an image to fixate on lead to higher levels of memorability.
We also looked at human consistency for the fixation data. Human consistency
refers to how consistent the fixations are between different human subjects when
viewing a particular image. For instance, given our example image of a girl in a forest,
humans would tend to fixate on the girl, and most humans would tend to fixate on
similar places, like the girl's face. If most people fixate on similar regions, the image
as a whole has high fixation consistency. More details on how that consistency is
numerically calculated can be found in [10].
We hypothesized that more memorable images would be more consistent in their
fixations. The rank correlation between fixation consistencies and memorability scores for the images was found to be 0.18, and the difference in consistency between memorability quartiles is shown in Figure 4-12. The differences between these quartiles are statistically significant.
Together, these pieces of evidence show that there is some link between human fixations and memorability. More memorable images tend to be those with more defined regions to fixate on.
4.7
What Makes an Image Memorable
The various experiments outlined in the previous sections allow us to make certain
conclusions about what exactly makes an image memorable.
* Strong, shocking emotions, like disgust, make an image more memorable.

* More popular and aesthetically pleasing images tend to be more memorable.

* An image with fewer objects or areas of focus, and one that accentuates those areas (i.e., its objects take up much of the frame), tends to be more memorable.
The final conclusion is the most interesting, as it was supported by both the exploration of the object annotation data and the human fixation data. In essence, it
means that simplicity in focus leads to more memorable images. Wide, expansive
images with many things to see are not as memorable as images with a single object
that dominates most of the picture.
Chapter 5
Modifying the Memorability of
Images
Given an understanding of what makes an image memorable, and a vast database of human ground-truth memorability data, we seek to create an algorithm that can modify the underlying, intrinsic memorability of an image without significantly changing the semantic meaning of that image.
Past efforts have shown the feasibility of such algorithms. Namely, the work by Khosla et al. in the area of facial memorability has shown that it is possible to subtly modify faces and change how memorable they are to humans without changing the identity of the person [11]. In their work, they learn a function that maps facial features and image features to memorability scores. Given that function and a facial image, they apply a form of gradient descent to change the facial features and maximize (or minimize) the memorability scores.
Even in this thesis, the cartoonization approach used to validate our memorability heat maps (see Chapter 3) shows that it is possible to change the memorability of an image
without changing its semantic meaning. All cartoons were of similar scenes and only
varied in where their emphasis was placed, resulting in different memorability scores.
In this chapter, we describe a more generalized modification procedure that aims
to modify the memorability of any type of image. Unlike previous systems which
focused on specific types of images [11], we are able to apply our algorithm to general
Figure 5-1: The output of the facial modification process by Khosla et al. The center
image is the original face, and all others are synthetically rendered at different levels of
memorability. This is an example of a successful yet specific modification procedure.
images due to our larger and more diverse set of image and memorability training
data. Essentially, given an image, we plan on adding and/or removing objects from
that image without changing the meaning of the image, in order to make the image
as a whole more (or less) memorable.
In Section 5.1, we provide an overview of our algorithm, or modification pipeline.
In Sections 5.2, 5.3, and 5.4, we delve into specific pieces of the algorithm. Finally,
in Section 5.5, we discuss further possible work.
5.1
Overview of Modification Pipeline
We are approaching memorability modification from a different direction than previous explorations in the area. Khosla et al. were able to achieve significant results in the area of facial memorability through the use of well-annotated data [11].
Their approach was to subtly modify the facial attributes in the image by using the
annotations of facial features. See their paper for more details on their algorithm.
Instead of simply moving and modifying what is present in an image, we plan on
affecting memorability by actually adding/deleting objects from an image without
changing the semantic meaning of the general scene.
For instance, take our canonical example of an image of a girl in a forest. Let's assume that the sky is relatively bland, and next to the girl is an unassuming bicycle. One approach to modifying the memorability would be to replace the sky with a more memorable sky, or simply to delete the bike (fewer objects or points of focus are correlated with higher memorability; see Chapter 4). In Section 5.5, we will discuss
how to choose a high-level scheme, but at the center of our approach is a pipeline that involves identifying less memorable (or highly memorable) objects in our image and replacing them with either a more (or less) memorable version of that object, or with background.
We will be using a variation of the scene completion pipeline proposed by Efros and Hays in order to perform the modification [7]. Efros and Hays designed a method to fill in portions of a scene with semantically accurate pieces by making use of a database of approximately 1,000,000 images. Their approach is as follows: given an input image, coarsely select similar images from the database through hand-crafted image features like GIST [17], and then filter the selection by a more deliberate pixelwise color comparison. With the few remaining candidate images, choose those which fill the hole well (i.e., match the image gradients and colors) and then fill the input image's hole by a method of graph cut seam finding along the hole's edge and image blending/filtering.
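To make the coarse-matching stage concrete, here is a minimal sketch of descriptor-based candidate retrieval. A tiny downsampled color image stands in for GIST, and all names are illustrative assumptions rather than the pipeline's actual code.

```python
# A minimal sketch of the coarse candidate search in a scene-completion-style
# pipeline: rank database images by distance between low-dimensional scene
# descriptors (GIST in [7]; a tiny downsampled color image stands in here).
import numpy as np

def tiny_descriptor(img, size=8):
    """Downsample an HxWx3 image to a size x size x 3 descriptor by block means."""
    h, w, _ = img.shape
    ys = np.linspace(0, h, size + 1).astype(int)
    xs = np.linspace(0, w, size + 1).astype(int)
    return np.array([[img[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean(axis=(0, 1))
                      for j in range(size)] for i in range(size)]).ravel()

def coarse_candidates(query, database, k=50):
    """Return indices of the k database images closest to the query."""
    q = tiny_descriptor(query)
    dists = [np.linalg.norm(q - tiny_descriptor(im)) for im in database]
    return np.argsort(dists)[:k]
```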
Our approach is similar, though instead of utilizing only a database of images, we
also use a database of objects. Starting from ImageNet, we decompose each image into
its underlying objects, and those objects form the image database. This difference
possibly increases the size of our database by an order of magnitude, and a simple
GIST-based coarse initial search would prove too slow at finding semantically similar images and objects. Thus, we opt to use a hashing-based
approach to perform an approximate nearest neighbor search on our image/object
database. See Section 5.3 for more details.
Finally, given candidate images and objects, we score candidates not simply by their ability to fill in the hole in the input image, as the pipeline by Efros and Hays does, but also by their effect on the image's memorability as a whole. Variations on this scoring technique and how to test them are discussed in Section 5.5.

This versatile, object-centered approach allows us to robustly modify the memorability of a wide variety of scenes.
Figure 5-2: Example edge boxes for different images. While not all boxes find full
objects, with enough boxes, all objects would be covered. For our applications, recall
is more important than precision.
5.2
Detecting Objects in an Image
The first step of our process, both in building our database and in modifying a specific image, is detecting the objects in an image. We extract these objects to populate our object database and to identify which parts of an image we may want to replace in order to modify memorability.
For our object detection, we use the Structured Edge Boxes toolkit by Piotr Dollar and Microsoft Research (https://github.com/pdollar/edges). The toolkit provides a fast way of finding bounding boxes
around objects in an image. More information on the inner workings of the edge
box detection can be found in their paper [23]. While their paper details successful results, the detector does not always work well on our ImageNet images because of the diversity of our dataset (for time reasons, we did not train the edge detector on ImageNet data and instead used it out-of-the-box). Thus, we look at many
candidate edge boxes for each image and use them all in our object database. This
still yields proper results because our selection process is able to find the correct edge
box containing an object when we query our database. Therefore, we simply need to
make sure that we use enough edge boxes that eventually all the objects in the image
are contained in at least one edge box. Figure 5-2 shows some sample edge boxes.
While not perfect, the edge boxes work quickly and are accurate enough for our
purposes.
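For readers who want to experiment, a comparable proposal step can be sketched with the Edge Boxes implementation in OpenCV's contrib modules. This is an alternative binding, not the MATLAB toolkit we used; the model file path and image name are assumptions.

```python
# A hedged sketch of generating object proposals with Edge Boxes via the
# OpenCV contrib bindings (opencv-contrib-python). Assumes a pretrained
# structured-edge model file ("model.yml.gz", from the OpenCV extra data).
import cv2
import numpy as np

img = cv2.imread("input.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0

detector = cv2.ximgproc.createStructuredEdgeDetection("model.yml.gz")
edges = detector.detectEdges(rgb)
orientation = detector.computeOrientation(edges)
edges_nms = detector.edgesNms(edges, orientation)

edge_boxes = cv2.ximgproc.createEdgeBoxes()
edge_boxes.setMaxBoxes(200)  # favor recall: keep many candidate boxes
boxes = edge_boxes.getBoundingBoxes(edges_nms, orientation)
# Note: newer OpenCV versions return a (boxes, scores) pair here.
```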
5.3
Semantically Similar Image Retrieval
The bulk of work in our modification pipeline is done in the image retrieval phase, in
which we look for candidate images and objects to fill in a hole in our input image
(Section 5.4 details how we create that hole). During this step, we go from a database
of millions or billions of candidates to tens or hundreds of final candidates. In order to search our large database in a reasonable amount of time, we use a hashing scheme to implement an approximate nearest neighbor (ANN) search.
As the sizes of image databases continue to grow in both academia and industry, new schemes have been required to query these large databases quickly when
performing functions like a nearest neighbor search. One popular method has been
locality-sensitive hashing (LSH), which tries to hash similar images to the same bucket
[5]. Thus, when performing a search, it is possible to simply hash an image and see
what other images have been hashed into that bucket. Classic variations of LSH
utilize standard image features, like GIST [17] and HOG [4], and a method of randomized projection to create the similarity-based hashing functions [5]. While these
methods work, and several out-of-the-box ANN search methods utilize an LSH-based
scheme, the methods do not always scale well to larger or more diverse datasets.
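A minimal sketch of the classic random-projection flavor of LSH illustrates the bucketing idea; the feature dimension, bit count, and data here are arbitrary assumptions.

```python
# A minimal sketch of random-projection LSH over image feature vectors
# (e.g., GIST descriptors), illustrating the bucketing idea in [5].
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, BITS = 512, 16
planes = rng.standard_normal((BITS, DIM))  # random hyperplanes

def lsh_hash(feature):
    """Hash a feature vector to a BITS-bit bucket key via sign projections."""
    return tuple((planes @ feature) > 0)

# Index a database of feature vectors into buckets.
database = rng.standard_normal((10_000, DIM))
buckets = defaultdict(list)
for i, f in enumerate(database):
    buckets[lsh_hash(f)].append(i)

# Querying: similar vectors tend to share a bucket, so only that bucket
# needs to be scanned instead of the full database.
query = database[42] + 0.01 * rng.standard_normal(DIM)
candidates = buckets[lsh_hash(query)]
```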
Recently, there has been a push towards supervised hashing, and towards learning
the features and hashing functions rather than using a one-size-fits-all approach. In
2014, Xia et al. devised an approach named CNNH, which learns the similarity-based
hashing functions from the images and a preconstructed similarity matrix by feeding
the input data through a convolutional neural network [20]. This method has been
shown to outperform classic LSH methods in a variety of use cases and datasets, so
we opted to utilize this scheme for our method.
We constructed our similarity matrix based on the image classes from ImageNet
and from Euclidean distances between deep feature vectors learned from the ImageNet
CNN. We devised 128-bit hashing functions to be able to encode a large variety of hashes.
CNNH has two steps. First, preliminary hash codes are extracted from the original input similarity matrix. Next, the images, along with the hash codes, are fed through the CNN in order to learn the hashing function; the last two layers of the network correspond to the hashing function.
Due to the large size of our dataset, we also applied optimizations to the algorithm originally proposed by Xia et al. so as to complete Step 1 of CNNH in a reasonable amount of time. Through some parallelization, which only marginally affected the correctness of our output, we were able to speed up the hash code extraction by an order of magnitude. Also, for Step 2, we opted to use a modified version of the ImageNet CNN, rather than the network proposed in the paper, to learn the hashing functions.
As a review, CNNH takes an input similarity matrix, finds the relevant hash codes
from that matrix through a method of coordinate descent, and then learns a hash
function that could produce those hash codes with the CNN. We utilize CNNH to
efficiently perform an ANN search on our large-scale objects and image dataset and
find semantically similar images.
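As a rough illustration of Step 1, the following sketch recovers binary codes from a similarity matrix via a spectral relaxation; note that Xia et al. [20] use coordinate descent, so this stands in for, rather than reproduces, their procedure.

```python
# A hedged sketch of Step 1 of CNNH: recover q-bit hash codes H in {-1,+1}
# such that (1/q) H H^T approximates the similarity matrix S. Xia et al. [20]
# use coordinate descent; a simple spectral relaxation stands in here.
import numpy as np

def approximate_hash_codes(S, q):
    """Approximate binary codes from a symmetric similarity matrix S (n x n)."""
    vals, vecs = np.linalg.eigh(S)        # symmetric eigendecomposition
    top = np.argsort(vals)[-q:]           # top-q eigenpairs
    H_relaxed = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0) * q)
    return np.sign(H_relaxed + 1e-12)     # binarize to {-1, +1}

# Toy example: two groups of mutually similar items.
S = np.block([[np.ones((3, 3)), -np.ones((3, 3))],
              [-np.ones((3, 3)), np.ones((3, 3))]])
H = approximate_hash_codes(S, q=4)
# Items in the same group receive identical codes.
```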
5.4
Scene Completion
Given an input image, we identify its various objects using Edge Boxes and determine the memorability of each object based on the memorability heat maps we can
generate (see Chapter 3 for more information on the heat maps). Depending on the
modification plan we are following (different plans are detailed in the next section),
we select an object or objects to remove from the image and fill with information
from our image and object database. This section details how we actually remove
objects and fill in holes in the input image.
For each object we wish to remove from the image, we apply the GrabCut algorithm on the Edge Box to determine the border of the object [18]. Since we are
more interested in completely removing the object than finding its exact borders, we
choose the GrabCut parameters such that they will loosely follow the edges of the
object. Figure 5-3 shows the performance of our object removal using GrabCut in
tandem with Edge Boxes.
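A box-initialized GrabCut call of the kind described above can be sketched with OpenCV as follows; the file name and rectangle are hypothetical, and this is not the exact thesis code.

```python
# A sketch of box-initialized GrabCut with OpenCV, in the spirit of the
# object-removal step described above.
import cv2
import numpy as np

img = cv2.imread("input.jpg")
rect = (50, 40, 180, 220)  # hypothetical Edge Box: (x, y, width, height)

mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# Few iterations: we want a loose segmentation that fully covers the object
# rather than a tight, exact boundary.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 3, cv2.GC_INIT_WITH_RECT)

# Pixels marked (probable) foreground form the region to remove and fill.
object_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
hole = img.copy()
hole[object_mask == 1] = 0  # carve out the object; scene completion fills it
```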
Once the object is removed, and the candidate replacements are chosen through
Figure 5-3: Object isolation using Edge Boxes and GrabCut. The top row shows the
original image. The second row shows the image with an Edge Box in red and what
GrabCut isolates, given the Edge Box as a bounding box.
the ANN search, we further reduce the number of candidates by looking at which
replacements best match the colors and color gradients of the original image at the
hole and at the edges of the hole. This closely follows the same procedure originally defined by Efros and Hays in [7].
We apply this procedure to several different final candidates (approximately 20),
and among those we handpick the top few replacements as our final modified images.
5.5
Future Work
This thesis has described the several components of the image memorability modification pipeline and how we have built them. At this point, however, several items and extensions remain for the project.

First, and most importantly, we must merge the different components. At this time, each component is built and works separately. We must join the different processes together and scale the pipeline up to include the entire object and image dataset we have proposed.
Second, we must design and implement different modification strategies. As mentioned in previous sections, there are several different strategies we could use when
deciding which objects to replace or use as replacements.
One possible strategy
would be to find the least memorable object in an image and replace it with a more
memorable and semantically valid version of that object. Another strategy would be
to remove the least memorable objects in an image and replace them with a plain,
semantically-valid background. This would work because our exploration has shown
that fewer objects in an image can lead to higher memorability. The list of strategies
goes on, and those heuristics must be categorized and implemented.
Finally, we must measure the success of our pipeline. Using a procedure similar to the one used when gathering our data or validating our memorability heat maps (see Chapters 2 and 3), we can test our pipeline by checking whether modified images have a statistically different level of memorability than their originals. We also need to make sure that the modification does not change the semantic meaning of the overall scenes, which can be validated by having crowd-sourced annotations of our images.
These further steps will allow us to fully complete a cohesive and successful image
modification pipeline. I anticipate finishing these steps in the coming months.
Chapter 6
Conclusion
A better understanding of image memorability helps attain a deeper understanding of
how our mind works and allows for a new range of industry and academic applications.
In order to move image memorability research forward from previous works, which had impactful yet small-scale or niche results [11] [13] [9], we needed to construct a large image memorability dataset, the Mem60k dataset, and we had to collect human ground-truth memorability data on those images in a time- and money-efficient manner.
The experiments yielded a great deal of data that enabled us to do several things. First, we trained a CNN to predict the memorability of a given image. Next, we used CNNs to create memorability heat maps that predicted which regions of an image were most memorable. Through further experimentation, we validated the performance of these heat maps. Finally, since CNNs often do not yield human-interpretable results, we explored the memorability data we collected for a better understanding of what makes an image memorable. We saw that certain emotions, like disgust, are very memorable, and that fewer points of focus in an image simplify its content and make it easier for an observer to recall.
The memorability data, along with the memorability heat map work, also enabled
us to begin building a memorability modification pipeline for images, which, given
an image, would remove portions of that image and replace them with semantically
valid objects or scenes that would increase the memorability of that image.
Overall, what this thesis contributes is a push towards a better understanding of
image memorability and an impetus to exploring the vast array of applications that
image memorability research can unlock. We expect the coming years to be truly
exciting for this field.
Bibliography
[1] Damian Borth, Tao Chen, Rongrong Ji, and Shih-Fu Chang. Sentibank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the 21st ACM international conference on Multimedia,
pages 459-460. ACM, 2013.
[2] Timothy F Brady, Talia Konkle, George A Alvarez, and Aude Oliva. Visual
long-term memory has a massive storage capacity for object details. Proceedings
of the National Academy of Sciences, 105(38):14325-14329, 2008.
[3] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature
space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 24(5):603-619, 2002.
[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.
[5] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the
twentieth annual symposium on Computational geometry, pages 253-262. ACM,
2004.
[6] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs.
In ACM Transactions on Graphics (TOG), volume 21, pages 769-776. ACM,
2002.
[7] James Hays and Alexei A Efros. Scene completion using millions of photographs.
ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 2007.
[8] Mark J Huiskes and Michael S Lew. The mir flickr retrieval evaluation. In
Proceedings of the 1st ACM international conference on Multimedia information
retrieval, pages 39-43. ACM, 2008.
[9] Phillip Isola, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. What makes
an image memorable? In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 145-152, 2011.
[10] Tilke Judd, Krista Ehinger, Fredo Durand, and Antonio Torralba. Learning to
predict where humans look. In IEEE International Conference on Computer
Vision (ICCV), 2009.
[11] Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba, and Aude Oliva. Modifying the memorability of face photographs. In International Conference on
Computer Vision (ICCV), 2013.
[12] Aditya Khosla, Atish Das Sarma, and Raffay Hamid. What makes an image
popular? In International World Wide Web Conference (WWW), Seoul, Korea,
April 2014.
[13] Aditya Khosla, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Memorability
of image regions. In Advances in Neural Information Processing Systems (NIPS),
Lake Tahoe, USA, December 2012.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common
objects in context. In Computer Vision-ECCV 2014, pages 740-755. Springer,
2014.
[15] Jana Machajdik and Allan Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of the international
conference on Multimedia, pages 83-92. ACM, 2010.
[16] Naila Murray, Luca Marchesotti, and Florent Perronnin. Ava: A large-scale
database for aesthetic visual analysis. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2408-2415. IEEE, 2012.
[17] Aude Oliva and Antonio Torralba. Building the gist of a scene: The role of global
image features in recognition. Progress in brain research, 155:23-36, 2006.
[18] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive
foreground extraction using iterated graph cuts. ACM Transactions on Graphics
(TOG), 23(3):309-314, 2004.
[19] Babak Saleh, Ali Farhadi, and Ahmed Elgammal. Object-centric anomaly detection by attribute-based reasoning. In Computer Vision and Pattern Recognition
(CVPR), 2013 IEEE Conference on, pages 787-794. IEEE, 2013.
[20] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised
hashing for image retrieval via image representation learning. In Twenty-Eighth
AAAI Conference on Artificial Intelligence, 2014.
[21] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on, pages
3485-3492. IEEE, 2010.
[22] Juan Xu, Ming Jiang, Shuo Wang, Mohan S. Kankanhalli, and Qi Zhao. Predicting human gaze beyond pixels. Journal of Vision, 14(1):1-20, 2014.
[23] C Lawrence Zitnick and Piotr Dollar. Edge boxes: Locating object proposals
from edges. In Computer Vision-ECCV 2014, pages 391-405. Springer, 2014.