Measuring and Modifying the Intrinsic Memorability of Images

by Akhil Raju

S.B. EECS, MIT, 2014

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 2015

Copyright 2015 Akhil Raju. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Signature redacted. Department of Electrical Engineering and Computer Science, May 22, 2015

Certified by: Signature redacted. Antonio Torralba, Associate Professor, Thesis Advisor, May 22, 2015

Accepted by: Signature redacted. Prof. Albert Meyer, Chairman, Masters of Engineering Thesis Committee

Measuring and Modifying the Intrinsic Memorability of Images

by Akhil Raju

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 2015, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Images have intrinsic memorable properties that enable humans to recall them. In this thesis, I developed and carried out a procedure to measure the memorability of an image by running hundreds of human trials and making use of a custom-designed image dataset, the Mem60k dataset. The large store of ground-truth memorability data enabled a variety of insights and applications. The data revealed what qualities in an image (emotional content, aesthetic appeal, etc.) make it memorable. Convolutional neural networks (CNNs) trained on the data could predict an image's relative memorability with high accuracy. CNNs could also generate memorability heat maps, which pinpoint which parts of an image are memorable. Finally, with the additional use of a massive image database, I designed a pipeline that could modify the intrinsic memorability of an image. The performance of each application was tested and measured by running further human trials.

Thesis Supervisor: Antonio Torralba
Title: Associate Professor

Acknowledgments

I would like to thank my thesis supervisor, Professor Antonio Torralba, and Professor Aude Oliva for their guidance and expertise throughout my thesis. I would like to thank many others who have provided advice and assistance, like Phillip Isola. Also, I would like to give a special thanks to Aditya Khosla for his continual support and mentorship, for taking time to teach me a great deal, and for enabling me to learn and explore computer vision. This thesis would not have been possible without him. Finally, I would like to thank my friends and family for always being supportive and helpful.

Contents

1 Introduction
  1.1 Motivation
  1.2 Prior Research
  1.3 Thesis Overview
2 Measuring Image Memorability
  2.1 Experimental Setup
  2.2 The Memorability 60K Dataset
3 Memorability Heat Maps
  3.1 Building Memorability Heat Maps: Then and Now
  3.2 Validating the Heat Maps
    3.2.1 Algorithmic Details - Creating Cartoons
    3.2.2 Evaluating Correctness
4 Analysis of Memorability Data
  4.1 Image Datasets
  4.2 Memorability and Emotions
  4.3 Memorability and Popularity
  4.4 Memorability and Aesthetic Appeal
  4.5 Memorability and Objects
    4.5.1 Predicting Memorability
    4.5.2 Object Categories
    4.5.3 Object Counts and Sizes
  4.6 Memorability and Human Fixations
  4.7 What Makes an Image Memorable
5 Modifying the Memorability of Images
  5.1 Overview of Modification Pipeline
  5.2 Detecting Objects in an Image
  5.3 Semantically Similar Image Retrieval
  5.4 Scene Completion
  5.5 Future Work
6 Conclusion

List of Figures

2-1 Experimental setup on Amazon's Mechanical Turk
2-2 Experimental human consistency for different image display times
3-1 Example memorability heat maps
3-2 "Cartoonized" images at different levels of detail
3-3 Cartoons at different levels of memorability, used for heat map validation
3-4 Results for heat map validation experiments
4-1 The difference in memorability between image datasets
4-2 The difference in false alarm rates between image datasets
4-3 The differences in memorability for different emotions - VSO dataset
4-4 The differences in memorability for different emotions - Art Photo dataset
4-5 Correlation between memorability and popularity
4-6 Popularity for different memorability quartiles
4-7 Correlation between memorability and aesthetics
4-8 Aesthetics for different memorability quartiles
4-9 Most and least memorable objects from Microsoft's COCO
4-10 Example memorability heat map and human fixation saliency map
4-11 Correlation between human fixations and memorability
4-12 Comparison between highly and less memorable images for fixation consistency and saliency entropy
5-1 Example facial modification results
5-2 Example object-detecting Edge Boxes
5-3 Results of object isolation using GrabCut and Edge Boxes

Chapter 1

Introduction

A group gives you a friendly wave as they approach during a conference. Two of their faces are immediately recognizable: you remember them from a previous conference. The other two seem new.
However, once the conversation begins, you realize that you had met all four just a few months prior. What makes two of their faces so memorable, while the other two were harder to recall?

This attribute of memorability extends beyond faces as well. As we flip through the pages of a magazine or peruse the Internet, some images stick in our minds more easily than others, and upon seeing such an image again, we immediately recognize it. The human visual system can recall a wide variety of types of images, and it retains not only the semantic information from a given image but also many of the details from that image [2].

Image memorability, the study of how and why images are memorable, has become a growing field of research over the last few years. While there is some variability in how memorable an image is to each person, prior work has shown that the memorability of an image is actually somewhat intrinsic, meaning that most people find the same types of images memorable or unmemorable [9]. The intrinsic nature of memorability allows us to measure it through experimentation and exploit it through modification.

The question remains, however: what about these faces, objects, or places makes them memorable? Why can we recall certain images more readily than others? While these may seem like questions traditionally reserved for psychology and neuroscience, applying fundamentals from computer vision and machine learning allows us not only to better understand what makes an image memorable but also to predict how memorable an image is, and even to change images to make them more memorable. These are the general questions that the work in this thesis aims to answer.

Building upon previous and ongoing research at Professor Antonio Torralba's Computer Vision Group at MIT's Computer Science and Artificial Intelligence Lab (CSAIL) (see Prior Research, Section 1.2), my research serves to distinguish memorability as a distinct intrinsic image property, define certain human-interpretable characteristics that help make images memorable, and design an algorithm that enables a computer to automatically modify an input image to make it more memorable to human observers. My work accomplishes these tasks by expanding the human studies into memorability, critically analyzing what the human data tells us about memorability, and utilizing the data to train models that enable computers to automatically modify images, making them more memorable.

1.1 Motivation

Studying image memorability grants us advances in both academic and industry-driven applications. First, research into image memorability helps enrich our understanding of how the human visual memory system operates. With greater insight into the qualities that make an image memorable, we can better understand the specific visual cues that strengthen or weaken the visual recall of scenes, objects, people, etc. This understanding, in turn, can form the groundwork for therapeutic methods for strengthening a person's memory. For instance, deeper knowledge of how the mind remembers faces could assist in designing visual memory exercises targeted at enhancing one's facial recall.

From an industry-driven perspective, predicting and modifying the intrinsic memorability of an image could have applications in a variety of sectors, including education and advertising. In many educational programs, remembering images and tying images to words or phrases is a common method for learning anything from a new language to how a biological cell works.
A method to increase the memorability of those educational images would improve the effectiveness of such practices. In advertising and general commerce, images used to display a product, person, or experience are ubiquitous. Billions of dollars are spent in the United States alone to ensure that viewers will remember the images displayed, and more concrete methods to measure the memorability of different approaches would make those efforts more efficient and precise.

These examples, however, only scratch the surface of what is possible with a greater understanding of image memorability. We also want to unlock the door for others to continue work in this field in order to collectively further our knowledge of human memory. To do this, we need a strong basis of human memorability data from which others can begin to perform their own research and develop various applications. This thesis describes work that moves towards a better understanding of image memorability while also opening a platform of data for others to use for research as well.

1.2 Prior Research

As mentioned above, existing research into image memorability has shown that, despite expected human variability with respect to memorability, humans tend to find the same types of images memorable, providing evidence that memorability is intrinsic to an image [9] [11]. Moreover, past research has shown that it is possible for a computer to predict these intrinsic qualities. Phillip Isola et al. created a support-vector-machine-based regressor that could successfully (with a probability significantly higher than chance) predict, given two images, which one would be more memorable to humans [9].

Isola et al. collected human memorability data on approximately 2200 images by running experiments on Amazon's Mechanical Turk, and they used that data to train their SVR classifiers. Their research first began to illuminate that memorability could be measured by experimentation and then predicted by standard machine learning tools.

Aditya Khosla et al. yielded similarly successful results in prediction and measurement for the class of facial images [11]. Khosla et al. went further to show it was possible to subtly modify the facial features in an image in order to make the face seem more or less memorable. The modification process includes identifying facial anchor points, some of which were annotated previously and some of which were calculated on the fly, in order to create an active appearance model (AAM). Once the face is parameterized, the AAM is fed into a memorability optimization function that aims to maximize (or minimize) the face's memorability while not moving the facial features too much. While the work by Isola et al. revealed the intrinsic and measurable nature of general scene memorability, the work by Khosla et al. showed that those qualities extend well to human faces and that those qualities can be modified without changing the semantic content of an image.

In 2012, Khosla et al. also showed that there are specific regions of an image that are more memorable than others [13]. Intuitively, we understand this: as we look at images, we can identify their most memorable attributes. However, the work by Khosla et al. showed that those attributes are predictable and follow a pattern that computers can understand and learn.
The algorithm and machine learning pipeline they introduce can break down an image into a memorability heat map, distinguishing which regions are more likely to be remembered than others. As expected, sample results show that things like people are more memorable than a single tree in a picture of a forest, but the research proves that a computer system can understand and distinguish these differences in memorability.

Together, these past experiments prove the feasibility of predicting and modifying the memorability of images. However, each example utilizes a relatively limited subset of data. Both the work done by Isola et al. and by Khosla et al. use approximately 2000 images in their training and testing processes. Furthermore, Isola et al. use images solely from the SUN database, and Khosla, understandably, uses only facial images in his modification work. Thus, it is tough to generalize their findings and applications to many different types of images. Part of the motivation for the work in this thesis is to expand the number and variability of images used in memorability experiments.

1.3 Thesis Overview

Chapter 2 describes how we expanded the human studies into memorability and created the Mem60k dataset for image memorability research. Our research required further experimentation on Amazon's Mechanical Turk, and we compiled images from a wide variety of sources in order to obtain a varied and generalizable set of images. This chapter describes the experimental procedure and the details of Mem60k.

Chapter 3 describes how we find the memorable regions of an image to build memorability heat maps. My work specifically focused on how we could experimentally validate the heat maps generated by our algorithms.

Chapter 4 describes my analysis of what makes an image memorable. A large set of human memorability data corresponding to a diverse set of images enables insight into what memorability is and what factors influence it. My research compared the attribute of memorability to other traits, like aesthetics and popularity. Similarly, this thesis examines how emotions, objects, and other attributes affect the memorability of images.

Chapter 5 describes image modification algorithms which automatically make an image more (or less) memorable. We aim to go beyond faces and create a scalable solution to modify the memorability of general images while making use of an extensive image database. Chapter 5 describes the different approaches we explored.

Chapter 2

Measuring Image Memorability

In order to understand the inherent memorability of images and automatically predict how memorable images will be, we need to collect a large quantity of human ground-truth data. We need to collect the probabilities that different images will be recalled by humans, and these probabilities will give insight into how memorable the images are. More memorable images have higher likelihoods of being recalled than less memorable images. We measure these probabilities by running experiments with humans and evaluating the probabilities of image recall. This chapter describes the experimental setup and some parameter selection techniques that enabled us to run cost-effective, large-scale experiments quickly.

2.1 Experimental Setup

The type of memory we focus on in this thesis is short-term visual memory.
In order to test this, and to measure the likelihood of images being recalled, we run online experiments (on Amazon's Mechanical Turk platform) with humans and ask them to click a button when an image has been shown twice. The data from the experiments allows us to calculate the likelihood that an image is remembered, and we call these pseudo-likelihoods "memorability scores".

The experimental setup is shown in Figure 2-1. Experiment participants are shown a sequence of images, some of which re-occur. When an image re-occurs, the participant clicks a button to signify that he or she has seen that image before. The same images are shown to many participants, and scores are only validated if the memorability scores for images are consistent across different groups of participants. The experiment also contains vigilance tests, which are images that re-occur in relatively close succession for the purpose of checking that the participant is paying attention to the experiment. If a participant fails the vigilance tests, his or her test results are invalidated. This experimental procedure is largely similar to the experiments used by [9] and [11].

Figure 2-1: The structure of the memory tests run on Amazon's Mechanical Turk. Memory repeats test the actual memorability of the given images. Vigilance tests check for the continual attention and responsiveness of participants.

The experiment contains many tunable parameters that affect the cost and validity of the experiment. One major parameter is the image display time and human click time. Each image is displayed for several hundred milliseconds, and after each image the participant is allotted several hundred milliseconds to click the button if he or she believes it is an image they had seen previously. More time-intensive experiments cost more, and since we wanted to gather data on almost 60,000 images, we wanted to find the most cost-effective method of performing our experiments without invalidating our results.

With regards to the timing, we wanted to show each image for long enough for users to comprehend its content, and we wanted to provide enough time for them to process the image and click between images. Our cost constraints set a fixed total time per image (display time plus click time), and we wanted to find which balance of timings would not detract from the quality of our experiments. The total time was fixed to 1500 ms, and we tried several display times ranging from 500 ms to 800 ms.

We define the quality of an experiment by the consistency of its results across different humans and as compared to previous memorability experiments [9]. Human consistency refers to checking whether different humans find the same images memorable or not memorable. If we were to look at two different subgroups of the participants, the data from each subgroup should result in the same ranking of images in terms of memorability. We check human consistency by randomly splitting the participants into two groups, calculating the memorability scores for the images from the data from each group, and finding the Spearman rank correlation between these two sets of scores. Ideally, each group would find the same images more memorable than other images, and thus would lead to a high rank correlation.
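For concreteness, the split-half consistency check can be sketched in a few lines of Python. This is a minimal illustration rather than the code we actually ran; it assumes the raw responses are stored as a binary image-by-participant hit matrix, a layout invented for the example.

```python
import numpy as np
from scipy.stats import spearmanr

def split_half_consistency(hits, rng):
    """Estimate human consistency for one experiment.

    `hits` is an (images x participants) binary matrix where
    hits[i, j] = 1 if participant j correctly recognized the
    repeat of image i (an assumed storage layout).
    """
    n = hits.shape[1]
    perm = rng.permutation(n)
    half_a, half_b = perm[: n // 2], perm[n // 2:]

    # Per-image memorability score for each half: fraction of hits.
    scores_a = hits[:, half_a].mean(axis=1)
    scores_b = hits[:, half_b].mean(axis=1)

    # Spearman rank correlation between the two halves' rankings.
    rho, _ = spearmanr(scores_a, scores_b)
    return rho

# Averaging over many random splits gives a more stable estimate:
# rhos = [split_half_consistency(hits, np.random.default_rng(s))
#         for s in range(25)]
```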
For our parameter selection through experimentation, we only used images that Isola et al. had used in their experiments, and thus we were able to use their results as a baseline for our own analyses. We evaluate our consistency with their results in a similar fashion as previously described: by calculating the rank correlation between our memorability scores and their baseline memorability scores. Each display time was tested with an experiment consisting of 100 images, and each image was viewed by 80 different participants.

There were two types of memorability scores we looked at, and their formulas were as follows:

    mem = hit_count / show_count

    mem_FA = (hit_count - false_alarms) / show_count

where mem and mem_FA are the memorability scores, show_count is the number of times the image was shown, hit_count is the number of times the image had been correctly clicked on during the memory repeat, and false_alarms is the number of times the image had been incorrectly clicked on during the image's first showing. Each score yielded similar results, but for most of the consistency measurements, we used the second memorability score type, which took false alarms into account. (We use these scores for most analyses and applications in this thesis.)
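Expressed as code, the two scores amount to the following (a trivial sketch; the names simply mirror the formulas above):

```python
def memorability_scores(show_count, hit_count, false_alarms):
    """Both memorability score variants for a single image."""
    mem = hit_count / show_count                      # raw hit rate
    mem_fa = (hit_count - false_alarms) / show_count  # corrected for false alarms
    return mem, mem_fa
```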
For the experiments, we hypothesized that humans needed as much time as possible to view an image in order to process it. Thus, we expected that as the image display time increased, the intra-experiment consistency and the consistency with the baseline memorability scores would increase as well. However, our experiments revealed that the allocated click time is nearly as important as the image display time (see Figure 2-2). The results show that both the click time and the image display time have significant effects on human performance. Particularly favoring one or the other causes a decrease in human consistency; rather, a balance of the two is shown to enable the highest possible consistency for a given total time. This contradicted our hypothesis and showed us that click time was more important to our experimental setup than previously thought.

Figure 2-2: Human consistency for different image display times. The total time per image was fixed to 1450 ms while the balance between display time and click time was altered. As visible through the intra-experiment consistency, significantly reducing either time reduces the overall consistency.

Other parameters of the experiment were chosen in a similar manner, but the image timing was the most significant and thus the only one this thesis covers in full detail.

2.2 The Memorability 60K Dataset

In order to increase prediction performance and to expand the applicability of image memorability research, we needed to gather more information on which images humans find memorable, and how memorable they truly are. The work done by Isola et al. [9] uses about 2000 images collected from the SUN dataset to perform its analyses on memorability. Even from 2000 images, it draws significant insights into what makes images memorable, but 2000 images from one particular dataset is too few to support more generalized prediction and modification techniques. The work by Khosla et al. [11] uses a dataset of a similar size (approximately 2000 facial images), but the work is applied very specifically to facial memorability and modification. For their purposes, a smaller dataset still yields great results. However, our ultimate goal is general understanding, predictive abilities, and further applications. For these, significantly more data is required on a much more varied set of images.

Such a dataset did not exist prior to our work, so we created the Memorability 60K (MEM60k) dataset, containing approximately 60,000 images pulled from a variety of other existing image datasets. Using the experimental setup described in the previous section, we gathered human data on each image, which allowed us to calculate memorability scores for each image. These scores could then be used for a variety of applications. The next chapters in this thesis discuss some of the insights and applications we derived from the memorability scores of the MEM60k dataset. The remainder of this section briefly describes the various datasets we pulled images from and the initial results of our experiments.

- Aesthetic Visual Analysis Database [16]: contains images across several different categories, along with metadata quantifying their aesthetic appeal to humans.
- Abnormal Image Dataset [19]: contains images of strange or abnormal objects that don't occur in the real world.
- Abstract Photo Dataset [15]: contains abstract images of designs, similar to abstract textures.
- Art Photo Dataset [15]: contains artistic images and accompanying metadata regarding the specific emotions that each photograph invokes.
- COCO [14]: Microsoft's Common Objects in Context dataset contains images along with annotations regarding the size and type of all objects found in each image.
- MIRFlickr [8]: contains a wide variety of images from Flickr under Creative Commons licenses, along with metadata and labels for each image.
- MIT300 Fixation Dataset [10]: contains images from Flickr that were initially used for studies in human fixations on images. Also contains the human fixations and saliency maps for each image.
- Object and Semantic Images and Eye-tracking (OSIE) dataset [22]: contains images with object labels and human fixation data.
- SUN [21]: contains many types of images initially curated for scene understanding. Accompanying the images are scene and object annotations.
- Visual Sentiment Ontology [1]: contains images from Flickr along with their view-count data, which gives insight into the popularity of each image.

We pulled images from all of the above datasets to create the MEM60k dataset. Using the previously described experimental setup, we collected 80 labels per image, running experiments with hundreds of individuals to do so. The human consistency within the experiment matched fairly well with the previous experiments run by Isola et al. Our average rank correlation between the memorability scores as determined by two different randomly divided subsets of participants was 0.68, while the Spearman rank correlation from the experiments of Isola et al. was 0.75. Our slight decrease may simply be due to the higher volume of images we are using, along with the shorter image display and click times allotted per image, which we needed for cost reasons. Overall, however, the consistency shows that the data we collected was reasonable, as different groups of people found the same images to be the most (or least) memorable. Thus, we were able to utilize the memorability scores we calculated to derive conclusions on what makes an image memorable, make predictions on what parts of an image are memorable, and begin to modify images to make them more memorable.
Chapter 3

Memorability Heat Maps

The Memorability 60k dataset and the memorability scores that we found through experimentation led to a variety of insights and applications. One application, developed by Aditya Khosla and others, was an algorithm to find the memorable regions of an image. For instance, given an image of a girl standing in a forest, typically the girl will be the most memorable aspect of the image while the trees are less memorable. Essentially, Khosla developed a memorability heat map generator, which could pinpoint which regions of the image were most memorable and which regions were more boring and forgettable. Some examples of these heat maps can be found in Figure 3-1. Empirically, these heat maps seem to correctly pinpoint the most memorable parts of an image. However, how can we tell for sure? This chapter first describes the work done by Khosla et al. to build the memorability heat maps, and then it details how we validated the correctness of these heat maps.

3.1 Building Memorability Heat Maps: Then and Now

In 2012, Khosla et al. used existing memorability data collected from the experiments performed by Isola et al. to create a methodology for finding the memorability of specific image regions and for generating memorability heat maps from those regions [13] [9].

Figure 3-1: Memorability heat maps automatically generated for various images. The red regions denote regions with higher memorability than the blue regions.

Khosla et al. modelled memorability as a noisy process that may add or remove elements of an image when the human visual system converts an image from its external representation to its internal representation. For small segments of the image, the algorithm would extract different image features from the segment and feed the features through a noisy memory process and into a linear regressor that could compute the memorability of that segment. Each feature type (color, HOG, semantic, etc.) would generate its own heat map, and the different heat maps would be pooled together to create an overall memorability map.

The memorability heat map process described in [13] does a good job of predicting the overall memorability of an image and of finding the memorable portions, but it was trained and tested on a relatively small and homogeneous dataset. The Memorability 60k dataset provides a richer source of data due to its size, and in order to take full advantage of it, Khosla et al. updated the memorability heat map generation process to utilize convolutional neural networks (CNNs) to predict the memorability of segments. The CNNs find how memorable each segment of the image is, for various sizes of segments. The different segments are blended together to build the memorability heat maps (see Figure 3-1). See our new paper for more details on how the CNNs are designed and trained.
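While the CNN details are left to that paper, the blending scheme itself is straightforward to sketch. The outline below is illustrative only: `predict_memorability` is a hypothetical stand-in for the trained segment-level regressor, and the window sizes and stride are invented for the example.

```python
import numpy as np

def memorability_heat_map(image, predict_memorability,
                          window_fracs=(0.3, 0.5, 0.7), stride_frac=0.1):
    """Blend patch-level memorability predictions into a per-pixel map."""
    h, w = image.shape[:2]
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for frac in window_fracs:                 # several segment sizes
        wh, ww = int(h * frac), int(w * frac)
        sh = max(1, int(h * stride_frac))
        sw = max(1, int(w * stride_frac))
        for y in range(0, h - wh + 1, sh):
            for x in range(0, w - ww + 1, sw):
                score = predict_memorability(image[y:y + wh, x:x + ww])
                heat[y:y + wh, x:x + ww] += score
                counts[y:y + wh, x:x + ww] += 1.0
    # Average the overlapping predictions at each pixel.
    return heat / np.maximum(counts, 1.0)
```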
3.2 Validating the Heat Maps

As mentioned before, the heat maps constructed by the new process that takes advantage of the data from MEM60k seemed to make sense empirically. However, we needed to confirm our beliefs through experimentation. We wanted to test whether the memorability heat maps correctly differentiated memorable and unmemorable regions of an image. One way to do this is to create new images that emphasize or de-emphasize those memorable regions. Images where the memorable regions are emphasized should be more memorable than images where those regions are de-emphasized. This section discusses how we algorithmically created images that emphasize the memorable regions of an image and details the experimental setup and results for evaluating the correctness of our memorability heat maps.

3.2.1 Algorithmic Details - Creating Cartoons

In 2002, DeCarlo and Santella developed a method to utilize human fixation data in order to create artistic renderings of photographs [6]. Their procedure uses a hierarchical color segmentation and filtering scheme to make each image more cartoon-like, and the parameters of their scheme can be adjusted to allow more or less detail per image segment. They take advantage of saliency maps created from human fixations to pinpoint which regions of an image are more important than others, and these regions are designated to have more detail than the rest. In a similar manner, we were able to take advantage of our memorability heat maps to designate which segments of the image are more important than others, and should thus contain more detail.

Our algorithm works as follows. An input image is converted into several cartoonized versions, each with a different level of detail. For each cartoon, the input image is segmented by color using the Rutgers EDISON system [3]. The color segmenter from EDISON uses a mean shift filter and has three main parameters: a range bandwidth h_r, a spatial bandwidth h_s, and a minimum segment size M. We choose their values based on the desired level of detail d, such that as d went up, the other parameters would decrease. The segmented image is assigned one color per segment (the average color for that segment from the input image). A Canny edge detector finds edges in the input image, again with parameters dependent on d, such that more edges are found as d increases. We taper each edge by using the binary image of edges and sequentially dilating the image more and more as we come closer to the edge's center. These tapered edges are added to the segmented image to produce our final "cartoonized" image. See Figure 3-2 for examples of the cartoons at different levels of detail.

Figure 3-2: The cartoons automatically generated for different levels of detail d ((a) d = 0.1, (b) d = 0.5, (c) d = 0.9) using our cartooning algorithm. These images have the same level of detail across the image and do not take into account the memorability heat map data.
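A compact sketch of the cartoonization step above is shown below. It substitutes OpenCV's mean-shift filter for the EDISON segmenter and uses an invented mapping from the detail level d to the filter and edge-detector parameters, so it only approximates the procedure described in this section.

```python
import cv2
import numpy as np

def cartoonize(image_bgr, d):
    """Cartoon-like rendering of an 8-bit BGR image at detail level d in (0, 1]."""
    # Less detail -> larger mean-shift bandwidths -> coarser color regions.
    spatial_bw = int(25 * (1.0 - d) + 5)
    range_bw = int(40 * (1.0 - d) + 10)
    flat = cv2.pyrMeanShiftFiltering(image_bgr, spatial_bw, range_bw)

    # More detail -> lower Canny thresholds -> more edges kept.
    lo = int(200 * (1.0 - d) + 30)
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.dilate(cv2.Canny(gray, lo, 2 * lo),
                       np.ones((2, 2), np.uint8))  # crude edge tapering

    cartoon = flat.copy()
    cartoon[edges > 0] = 0  # overlay dark edges on the flattened colors
    return cartoon
```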
For each image, we find the memorability of the various image segments by looking at the memorability heat map. For the most memorable segments, we extract those regions from the high-detail cartoon, and the least memorable regions are extracted from the low-detail cartoon. The resulting image, which emphasizes the most memorable regions, is finally smoothed along the cut lines. To create an image which emphasizes the least memorable regions, we instead extract the least memorable regions from the high-detail cartoon and the most memorable regions from the low-detail cartoon. Finally, we also create a baseline image which randomly selects half the segments (as measured by area) to be assigned to high or low detail. Examples of the resulting images can be seen in Figure 3-3.

Figure 3-3: Examples of the cartoons at different memorability levels. Each row shows the original image, the memorability heat map, and the cartoons that emphasize the high or low memorability regions. The medium column emphasizes half the regions, chosen randomly. The memorability score for each cartoon is included.

3.2.2 Evaluating Correctness

We expect the images for which the most memorable regions are emphasized to be more memorable than the baseline, and images where the least memorable regions are emphasized to be less memorable than the baseline. If this is the case, it gives evidence that our memorability heat maps correctly differentiate the memorable and unmemorable regions in an image.

In order to measure the memorability of the various cartoon images, we use a similar experimental setup to the one described in Chapter 2. We host visual experiments on Amazon's Mechanical Turk in which participants are shown a sequence of images and told to click a button each time they see an image repeated. Each image is shown to 80 participants, and the proportion of participants who correctly find the repeated image gives the memorability score for that image.

We tested 250 images from the MEM60k dataset, creating three versions of each image: one which emphasizes the most memorable regions, one which emphasizes the least memorable regions, and a baseline version which emphasizes a random selection of regions. The filler images used in the experiment (images that do not repeat) were also constructed using the same scheme outlined in the previous section. Each exercise contained approximately 100 images, and we ensured that participants would never see two different versions of the same image.

The resulting memorability scores from our experimentation are shown in Figure 3-4. As visible, the cartoons where the most memorable regions are emphasized are more memorable than the baseline, and the cartoons where the least memorable regions are emphasized are less memorable than the baseline, as expected. All the differences between the memorability scores of the low, high, and baseline images were found to be statistically significant using α = 0.05.

Figure 3-4: The memorability scores for the cartoons that emphasize the high or low memorability regions of an image. The medium cartoons are the baseline images.

The results from our experiments validate the memorability heat maps generated by our CNN-based algorithm. Also, they begin to shed light on methodologies that could modify the memorability of images. As shown, accentuating certain aspects of an image can significantly affect the memorability of that image. In Chapter 5 we will further explore this topic and ways to utilize the information stored in the memorability heat maps to modify the intrinsic memorability of images.

Chapter 4

Analysis of Memorability Data

The experimental procedure outlined in Chapter 2 was utilized to gather memorability data for almost 60,000 images. Due to the high number of images and their diversity, the information we gathered allows us to begin to see what makes an image memorable and how memorability relates to different characteristics an image might have (how popular the image is, what objects the image contains, and so on).
The images were collected from several different existing image datasets, as mentioned in the previous chapter, and each dataset also contained further information and attributes that we could relate the memorability data to.

4.1 Image Datasets

Our first check was to see if our different datasets were in fact unique in content and memorability. We hypothesized that different content in an image would lead to different memorability scores, a hypothesis that had been confirmed in previous smaller-scale experiments [9]. Each dataset contains different types of images with different material, and thus we hypothesized that the memorability scores across the datasets would differ. It is important to reiterate that the images from the different datasets were mixed together and shown in a random order to the human participants, so there was no bias towards any particular dataset.

Figure 4-1 shows that the memorability scores for images from different datasets were, in fact, different. Furthermore, as shown in Figure 4-1b, the differences between the memorability scores of different datasets were statistically significant. The p-values shown were calculated using two-sided t-tests with an alpha cutoff of 0.05. The statistically significant differences in memorability are shown with green boxes.

Figure 4-1: The difference in memorability between image datasets (Abnormal, Popularity, MIRFlickr, COCO, Abstract, Fixation Flickr, SUN). (a) shows the memorability scores of the different datasets. (b) shows which differences in memorability scores are statistically significant. Green means the row header is greater than the column, red means vice versa. Blue means they are equal.

The results of this comparison support some of our intuition. As Figure 4-1a shows, the abnormal dataset, which contains strange objects and images that are not normally seen in the real world, tends to be very memorable, while the SUN dataset, which typically contains relatively mundane images of general scenes, has the lowest memorability.

In addition to differences in memorability, the various datasets also differ in homogeneity. We can see this through the false alarm rates of the images from each dataset. A false alarm occurs when a participant clicks on an image, believing it was a repeat, when in fact it was not. In some sense, the false alarm rates are an inverse metric to the memorability scores, and they give a sense of how similar an image is to the other images in the overall set. We would expect that images that are not memorable are very similar to other mundane images, and our results support this hypothesis. As expected, the datasets with low memorability scores had high false alarm rates, and vice versa. See Figure 4-2 for more details.

Figure 4-2: The different false alarm rates for various image datasets.
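The pairwise significance tests used here (and in the analyses that follow) all share the same pattern. A minimal sketch, assuming the per-image scores are grouped by source dataset in a dictionary (a layout invented for the example):

```python
from itertools import combinations
from scipy.stats import ttest_ind

def pairwise_dataset_tests(scores_by_dataset, alpha=0.05):
    """Two-sided t-tests between every pair of datasets.

    Returns the dataset pairs whose mean memorability scores
    differ significantly at the given alpha.
    """
    significant = []
    for a, b in combinations(sorted(scores_by_dataset), 2):
        _, p = ttest_ind(scores_by_dataset[a], scores_by_dataset[b])
        if p < alpha:
            significant.append((a, b, p))
    return significant
```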
4.2 Memorability and Emotions

Two image datasets used in our Memorability 60k dataset, the Visual Sentiment Ontology set [1] and the Image Emotion/Art Photo set [15], contain metadata regarding the specific emotions each image represents. We were able to correlate the different emotions with the memorability scores we had gathered.

We hypothesized that vastly different emotions would yield different levels of memorability. More specifically, because our experiments mainly tested short-term visual memory, I expected that more exciting emotions, like fear and amazement, would yield higher levels of memorability than calmer emotions, like happiness.

Our use of two different datasets allowed us to cross-validate the results we received, and those results can be seen in Figures 4-3 and 4-4. These figures show how the memorability scores compare across different emotions and show that the differences between the different emotions were, for the most part, statistically significant. Statistical significance was determined by running two-sided t-tests with an alpha cutoff of 0.05.

Figure 4-3: The differences in memorability for different emotions, with data from the VSO dataset.

The results of our experiment validate our hypothesis that different emotions yield different levels of memorability. This result was supported by both independent datasets, allowing us to draw the conclusion more strongly. Our more specific hypothesis that more exciting emotions would be more memorable than calm emotions was not supported, though. While both datasets showed that some calm emotions are less memorable (contentment and serenity were the least memorable emotions of the two datasets), exciting emotions were found throughout the spectrum of memorability. Also, classifying the emotions as exciting or calm is somewhat subjective, so this hypothesis is difficult to evaluate rigorously.

Interestingly, both datasets had disgust as the most memorable emotion, and disgust was found to be statistically significantly higher than all other emotions for both datasets. This may show that feelings of disgust, and the images that trigger those emotions, are more readily remembered than other things, which would explain why some marketing campaigns that rely on shocking or disgusting their audience into action (for instance, an environmental campaign showing the effects of oil spills on wildlife) work so well.

Figure 4-4: The differences in memorability for different emotions (amusement, fear, sadness, awe, excitement, anger, contentment, disgust), with data from the Art Photo image dataset. (a) shows the memorability scores. (b) shows how many of the differences are statistically significant (highlighted in green). Statistical significance was determined with one-sided t-tests, and their resulting p-values are in the table cells.

4.3 Memorability and Popularity

The Visual Sentiment Ontology dataset gives information on how popular each of its images is. It derives its images from Flickr and gives information on how many times each image has been viewed, and when those views occurred. The view count, after being normalized for time, gives insight into how popular an image is. The normalization process is necessary to extract any signal from the view count data, and the process for normalization is derived from [12]. Images that have higher view counts are deemed more popular, and in this thesis, the normalized view count is also referred to as a popularity score.

We hypothesized that memorability would be strongly related to popularity and that more popular images would tend to be more memorable. The intuition behind this is simple: popular images tend to be visually striking and thus more likely to be remembered, even if just viewed for a moment. We expected to see a strong positive rank correlation between the popularity scores and the memorability scores.
The intuition behind this is simple: popular images tend to be visually striking and thus more likely to be remembered, even if just viewed for a moment. We hypothesized to see a strong positive rank correlation between the popularity scores and the memorability scores. 35 00 0 ~b0 00~000 Toreatp bew entehwimtirandb shFw a b SosePr+ lro.COk im speifi slic of th images.0 spcii slc sltl o*hr Memorily W ity scrsv.pplrtysoe.()sos Figure 4~~-:lt veal ofthjjjgs an a0h c 0f of 0.5 anapacuofo 0.5 Fige 4-5t Mifrn embilityscys s poputlsharit scres.(a eesows owterit.Te st nh ouaiysoe hat thowest, h ovemralreliagn bee akcreainbten megtrscspopanarty)sshrostheloerto ovrlthetw Whlthr is no lal e httetoatiue twdomeoaiiysoew0a ttibte are dsfeengly independenrt ewe hs urie r r eae.Mr ttsial meorbl image tedt0emoepplr Howeveran, oufnditasid sinotfully spr deemndwthi.shw in Fw-iued 4-,tere wis tmeoblity Thereis n cle aWvrylowhr i oovrl rank correlation betweentn Mgue4-re mUoabltscswean closerrnspetion though the two attributes are related. memorable images haed th e highes popularit 36 crs h oethv h oet thoSh Popularitq Acr-s Differmnt M-mrbilit Quartiles 8.8 8.6 8.2 87.8 7,6 7.4Thr 0 0588 000i [s 2808 2500 lndex Figure 4-6: The popularity scores for different memorability quartiles of images. The top quartile has the 25% most memorable images, the bottom quartile has the least memorable images. As shown, the most memorable images tend to be the most popular. 4.4 Memorability and Aesthetic Appeal The Aesthetic Visual Analysis (AVA) dataset [16] gives information on how aesthetically appealing each of its images is. In a similar vein to the popularity analyses from the previous section, we checked if there is a relationship between the aesthetic appeal of an image and its inherent memorability. We hypothesized that more memorable images would have higher aesthetic appeal. We reasoned that more pleasing or tasteful images would better stick in our visual memory and would be more easy to recall. More mundane or less pleasing images would be easily forgettable. Once again, as with our exploration into popularity, aesthetic appeal did not show a strong overall rank correlation with memorability. Figure 4-7 shows that the rank correlation between the aesthetic scores from the AVA dataset and our computed memorability scores is 0.08, indicating essentially no link between memorability and aesthetic appeal. A look at a slice of the images, namely those with the highest memorability scores, underscores the seeming independence between the two attributes. The distribution of aesthetic scores appears to remain the same regardless of the memorability score. 37 - A-t1.t1-: AVA dt-t 8, 0,0 88,080 898 08,80, 8,6 80 . 0, , M-bOltq 8%08 18 twoottriute8areseeingl8inepenent Figre 47 eoaiiysoe saesthetic scores son nFiue -. (a)o how heowal theis littl to have slightly higher levels of aesthetic appeal, which supports our hypothesis. However, empirically, we can see that the differences are less significant than those found in the popularity analysis (see previous section). We conclude that, while there may be some relation between memorability and aesthetic appeal, the relation is relatively weak, or at least weaker than initially hypothesized. 4.5 Memorability and Objects Several of our datasets provided information of the types and sizes of objects present in their images. We aimed to utilize this information to explore how specific objects and their sizes affect memorability. 
4.5 Memorability and Objects

Several of our datasets provide information on the types and sizes of objects present in their images. We aimed to utilize this information to explore how specific objects and their sizes affect memorability. The SUN, COCO, and MIR Flickr datasets supplied object labels, but for our analysis we primarily looked at the metadata from COCO [21] [14] [8]. COCO offers a happy medium with respect to depth of information between SUN and MIR Flickr; the other two offer either too few or too many object categories for our analyses, so we opted to focus on the 5,000 images we used from COCO. COCO supplies 80 object categories along with the object sizes of each instance in an image. This information enabled us to see how object category, object counts, and object sizes affect memorability.

4.5.1 Predicting Memorability

Object labels offer a rich supply of information, and it has previously been exhibited that object metadata can provide the basis for predicting memorability [9]. We follow an approach similar to Isola et al. and predict the memorability of images based on several different types of object metadata. For the labelled data types, we assemble feature vectors of length d, where d is the number of object categories supplied by COCO. For object counts, we count the instances of each object category. For sizes, we either look at the maximum size of an instance for each object category, or we look at the sum of sizes. For both types, we normalize our vectors and feed them through a histogram intersection kernel.

These feature vectors are fed into a support vector regression (SVR) with the ground-truth memorability scores as their labels. To determine the correctness of each feature vector method, we looked at the rank correlation between the predicted memorability scores and the actual memorability scores for a test set of images (separate from the training set used). The object-count feature vectors lead to a rank correlation of 0.39, and the object-size vectors lead to a rank correlation of 0.41.
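A sketch of this predictor using scikit-learn is shown below; the feature extraction, normalization details, and regularization constant are illustrative assumptions rather than our exact training configuration.

```python
import numpy as np
from sklearn.svm import SVR

def hist_intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]) over normalized histograms."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def train_object_svr(X_train, y_train):
    """X_train: per-image object-count (or object-size) vectors over the
    80 COCO categories, normalized; y_train: memorability scores."""
    svr = SVR(kernel="precomputed", C=1.0)
    svr.fit(hist_intersection_kernel(X_train, X_train), y_train)
    return svr

def predict_object_svr(svr, X_train, X_test):
    # With a precomputed kernel, prediction needs the test-train kernel.
    return svr.predict(hist_intersection_kernel(X_test, X_train))
```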
4.5.2 Object Categories

First, we explored the differences in memorability between different object categories and determined which types of objects were the most memorable. Each image contains several types of objects, so simply looking at the memorability scores of images does not give a valid comparison of the object categories. For instance, if most images of horses tended to contain barns as well, it would be difficult to compare the two object categories in terms of image memorability scores alone, because they share the same scores. Thus, instead of only using the raw memorability scores, we predict the memorability of the image with and without a given object and measure how the predicted memorability changes. A more memorable object will produce a larger decrease in predicted memorability when it is removed from the image. We predict memorability using the method outlined in the previous section, and we processed each image, removing each object in the image and checking to see how its predicted memorability changed. Averaging over all the instances of an object category allows us to rank the categories in order of their effects on memorability. Figure 4-9 shows some results.

Interestingly, many categories for smaller or handheld objects, like ties, bananas, and donuts, were ranked very highly, while categories for non-household animals, like giraffes and zebras, ranked very low. The high ranking of smaller object categories may be due to how people typically photograph them. As somewhat visible in Figure 4-9, these smaller objects are often photographed in ways that accentuate them, meaning they are large and the only object in the image. As we will discuss in the following section, these types of images tend to be more memorable. The animals, on the other hand, may simply be difficult to differentiate; a person may find it tough to distinguish between two different pictures of zebras. This reasoning is pure conjecture, however; there is no evidence to support it.

Figure 4-9: We evaluate the importance of an object category to memorability by looking at its effect on predicted memorability when the object is added to the image. The top row shows the most memorable object categories (tie: +0.040, scissors: +0.039, banana: +0.031), and the bottom row shows the least (skis: -0.069, bear: -0.069, giraffe: -0.11). Each category shows images for which adding the object has the most/least impact on predicted memorability.

4.5.3 Object Counts and Sizes

The COCO object annotations give information on how many and how large the different objects in an image are. We used these pieces of information to predict memorability and to rank the categories in terms of memorability. We also wanted to see how they individually relate to our measured memorability scores.

First, we explored how the size of an object relates to its memorability. Empirically, we saw in the previous section that larger objects in an image tended to be more memorable. This was further supported by our experiments here. The rank correlation between the average size of an object in an image and the image's memorability score is 0.39, and the rank correlation between the size of the largest object in the image and the memorability scores is 0.37. These correlations are relatively high, and they show that larger objects tend to make an image more memorable.

Also, we found that the number of objects in an image detracts from its memorability. The rank correlation between the number of objects in an image and its memorability score is -0.17. While the correlation is not too strong, it illustrates a similar notion as the object size data does: having only a few, large objects in an image makes it more memorable.
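These correlations are easy to reproduce from the COCO annotations. A sketch using pycocotools, where the memorability lookup `mem_by_image_id` is an assumed structure:

```python
from pycocotools.coco import COCO
from scipy.stats import spearmanr

def object_stats_vs_memorability(ann_file, mem_by_image_id):
    """Correlate per-image object count and largest-object area
    with memorability scores."""
    coco = COCO(ann_file)
    counts, max_areas, mems = [], [], []
    for img_id, mem in mem_by_image_id.items():
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
        if not anns:
            continue
        counts.append(len(anns))                        # number of objects
        max_areas.append(max(a["area"] for a in anns))  # largest object
        mems.append(mem)
    return (spearmanr(counts, mems)[0],     # expected to be negative
            spearmanr(max_areas, mems)[0])  # expected to be positive
```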
4.6 Memorability and Human Fixations

Human fixations refer to where people tend to look when viewing a specific image. For instance, given a picture of a smiling girl in a forest, most people would tend to fixate on the girl first before looking at the rest of the image. Human fixation data gives insight into what the human visual system immediately focuses on, and we hypothesized that human fixations would relate in some way to memorability. The OSIE and Fixation Flickr datasets provide data on the human fixations for their images [22] [10]. For most of our analyses, though, we simply relied on the Fixation Flickr dataset; the two datasets are relatively similar in content, and we opted for Fixation Flickr due to ease of access.

Figure 4-10: An image, its memorability heat map, and its human fixation saliency map.

From Khosla et al. [13], we have a method of determining which regions in an image are memorable and generating a heat map to visualize these results. It is worth noting that the experiments used in this thesis do not use the same process outlined in [13]; instead we make use of a deep-learning model that achieves the same memorability heat map with more precision. See Chapter 3 for details on the process and the method by which we evaluate the accuracy of the heat maps.

We hypothesized that the most memorable regions of the image would also be the most salient for human viewing, and we wanted to test whether humans tend to fixate on the memorable regions of an image, and whether the human saliency maps found in studies like [10] are similar to the memorability heat maps we can generate.

To measure the relationship between fixation points and memorability, we wanted to see whether the human fixation points tended to lie in the most memorable regions of the image. Figure 4-11 shows the number of fixation points that lie above a certain threshold of memorability. As shown, the number of points grows faster than if we chose pixels at random, indicating that the more memorable regions of the image are more likely to contain human fixation points. This aligns exactly with our hypothesis, and we see clear evidence that memorability and human fixations are positively related.

Figure 4-11: The number of fixation points covered by pixels either chosen at random or in order of memorability. This shows memorability and fixations are somewhat correlated.

To further explore this relationship, we looked at the human saliency maps, which show where human eyes tend to linger as they view a scene. While the fixation points are specific points, the saliency maps are akin to the memorability heat maps, as shown in Figure 4-10. We looked at the saliency maps in two ways: first, we wanted to see how the saliency maps and the memorability heat maps were correlated, and second, we wanted to see how the spread of the saliency map was correlated with memorability scores.

We hypothesized that the saliency and memorability heat maps would be very closely linked. As mentioned before and supported by our initial heat map experiments, we assumed that humans would tend to fixate on the memorable regions. To measure the correlation between the two-dimensional maps, we vectorized each map and found their rank correlation. The average rank correlation found between the maps was 0.10. This correlation was not strong enough to support our hypothesis.

We also hypothesized that the more spread out a saliency map was, the less memorable that image would be. The reasoning for this stems from our object analyses (see previous section). As found before, fewer objects and more focus in an image yield higher memorability. Thus, if a saliency map is spread out, we reasoned that there were many objects in the image, and humans wouldn't know where to fixate. We measured the spread of a saliency map by looking at its entropy.
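Treating the normalized saliency map as a two-dimensional probability distribution, its Shannon entropy gives a single number for the spread: a concentrated map has low entropy and a diffuse map has high entropy. A minimal sketch (the normalization choice here is ours):

```python
import numpy as np

def saliency_entropy(saliency_map, eps=1e-12):
    """Shannon entropy (bits) of a saliency map viewed as a distribution."""
    p = np.asarray(saliency_map, dtype=np.float64).ravel()
    p = p / (p.sum() + eps)  # normalize to a probability distribution
    return float(-(p * np.log2(p + eps)).sum())
```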
To further explore this relationship, we looked at the human saliency maps, which show where human eyes tend to linger as they view a scene. While the fixation points are specific points, the saliency maps are akin to the memorability heat maps, as shown in Figure 4-10. We examined the saliency maps in two ways: first, we wanted to see how the saliency maps and the memorability heat maps were correlated, and second, we wanted to see how the spread of a saliency map was correlated with memorability scores.

We hypothesized that the saliency and memorability heat maps would be very closely linked. As mentioned before, and as supported by our initial heat map experiments, we assumed that humans would tend to fixate on the memorable regions. To measure the correlation between the two-dimensional maps, we vectorized each map and found their rank correlation. The average rank correlation between the maps was 0.10. This correlation was not strong enough to support our hypothesis.

We also hypothesized that the more spread out a saliency map was, the less memorable that image would be. The reasoning stems from our object analyses (see the previous section): fewer objects and more focus in an image yield higher memorability. Thus, if a saliency map is spread out, we reasoned that there were many objects in the image and that humans wouldn't know where to fixate. We measured the spread of a saliency map by looking at its entropy. The rank correlation between the entropies and the memorability scores was -0.24, and Figure 4-12 shows how the more memorable images tended to have lower entropies in their saliency maps. These two pieces of evidence support our hypothesis and reinforce the conclusions we drew from the COCO object data: fewer areas to fixate on in an image lead to higher memorability.

Figure 4-12: There are statistically significant differences in human fixation consistency and saliency between the most memorable and least memorable images.
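The spread measure used here is, in our reading, the ordinary Shannon entropy of the saliency map treated as a probability distribution over pixels. A minimal sketch, assuming a non-negative saliency map:

```python
import numpy as np

def saliency_entropy(saliency_map, eps=1e-12):
    """Shannon entropy (in bits) of a saliency map normalized to a
    probability distribution; higher entropy means the saliency is
    spread across more of the image."""
    p = saliency_map.astype(np.float64).ravel()
    p /= p.sum() + eps                  # normalize to a distribution
    return float(-np.sum(p * np.log2(p + eps)))
```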
We also looked at human consistency in the fixation data. Human consistency refers to how consistent the fixations are across different human subjects viewing a particular image. For instance, given our example image of a girl in a forest, humans would tend to fixate on the girl, and most would fixate on similar places, like the girl's face. If most people fixate on similar regions, the image as a whole has high fixation consistency. More details on how that consistency is numerically calculated can be found in [10]. We hypothesized that more memorable images would be more consistent in their fixations. The rank correlation between the fixation consistencies and the memorability scores was 0.18, and the difference in consistency between memorability quartiles is shown in Figure 4-12; the differences between these quartiles are statistically significant. Together, these pieces of evidence show that there is some link between human fixations and memorability. More memorable images tend to be those with more defined regions to fixate on.

4.7 What Makes an Image Memorable

The various experiments outlined in the previous sections allow us to draw certain conclusions about what makes an image memorable.

* Strong, shocking emotions, like disgust, make an image more memorable.
* More popular and aesthetically pleasing images tend to be more memorable.
* An image with fewer objects or areas of focus, and one that accentuates those areas (i.e., its objects take up much of the frame), tends to be more memorable.

The final conclusion is the most interesting, as it was supported by both the object annotation data and the human fixation data. In essence, it means that simplicity of focus leads to more memorable images. Wide, expansive images with many things to see are not as memorable as images with a single object that dominates most of the picture.

Chapter 5

Modifying the Memorability of Images

Given an understanding of what makes an image memorable, and a vast database of ground-truth human memorability data, we seek to create an algorithm that can modify the underlying, intrinsic memorability of an image without significantly changing that image's semantic meaning.

Past efforts have shown the feasibility of such algorithms. Namely, the work by Khosla et al. on facial memorability has shown that it is possible to subtly modify faces and change how memorable they are to humans without changing the identity of the person [11]. In their work, they learn a function that maps facial features and image features to memorability scores; given that function and a facial image, they apply a form of gradient descent to the facial features to maximize (or minimize) the memorability score. Even in this thesis, the cartoonization approach used to validate our memorability heat maps (see Chapter 3) shows it is possible to change the memorability of an image without changing its semantic meaning: all cartoons were of similar scenes and varied only in where their emphasis was placed, yet they received different memorability scores.

Figure 5-1: The output of the facial modification process by Khosla et al. The center image is the original face, and all others are synthetically rendered at different levels of memorability. This is an example of a successful yet specific modification procedure.

In this chapter, we describe a more generalized modification procedure that aims to modify the memorability of any type of image. Unlike previous systems, which focused on specific types of images [11], we are able to apply our algorithm to general images because of our larger and more diverse set of image and memorability training data. Essentially, given an image, we plan to add and/or remove objects from that image, without changing its meaning, in order to make the image as a whole more (or less) memorable. In Section 5.1, we provide an overview of our algorithm, or modification pipeline. In Sections 5.2, 5.3, and 5.4, we delve into specific pieces of the algorithm. Finally, in Section 5.5, we discuss further possible work.

5.1 Overview of Modification Pipeline

We approach memorability modification from a different direction than previous explorations of the area. Khosla et al. achieved significant results on facial memorability through the use of well-annotated data [11]: their approach was to subtly modify the facial attributes in the image using annotations of facial features (see their paper for the details of their algorithm). Instead of simply moving and modifying what is already present in an image, we affect memorability by adding or deleting objects without changing the semantic meaning of the overall scene. For instance, take our canonical example of an image of a girl in a forest. Let's assume that the sky is relatively bland and that next to the girl is an unassuming bicycle. One approach to modifying the memorability would be to replace the sky with a more memorable sky, or simply to delete the bike (fewer objects or points of focus correlate with higher memorability; see Chapter 4). In Section 5.5 we will discuss how to choose such a high-level scheme, but at the center of our approach is a pipeline that identifies less memorable (or highly memorable) objects in the image and replaces them with a more (or less) memorable version of that object, or with background.

We use a variation of the scene completion pipeline proposed by Hays and Efros to perform the modification [7]. They designed a method to fill in portions of a scene with semantically accurate pieces by drawing on a database of approximately 1,000,000 images. Their approach is as follows: given an input image, coarsely select similar images from the database using hand-crafted image features like GIST [17], and then filter the selection with a more deliberate pixel-wise color comparison. From the few remaining candidate images, choose those which fill the hole well (i.e., match the image gradients, colors, etc.), and then fill the input image's hole by graph-cut seam finding along the hole's edge followed by image blending and filtering.

Our approach is similar, though instead of using only a database of images, we also use a database of objects. Starting from ImageNet, we decompose each image into its underlying objects, and those objects form the object database. This difference potentially increases the size of our database by an order of magnitude, and a simple GIST-based coarse initial search would be too slow at finding semantically similar images and objects. We therefore use a hashing-based approach to perform an approximate nearest neighbor (ANN) search on our image/object database; see Section 5.3 for more details. Finally, given candidate images and objects, we score candidates not simply by their ability to fill the hole in the input image, as the Hays and Efros pipeline does, but also by their effect on the image's memorability as a whole. Variations on this scoring technique, and how to test them, are discussed in Section 5.5. This versatile, object-centered approach allows us to robustly modify the memorability of a wide variety of scenes.
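To make the shape of this pipeline concrete, the sketch below strings the stages together. Everything supplied through the stages object (edge_boxes, heat_map, ann_search, composite, predict_mem) is a hypothetical placeholder for the components described in Sections 5.2 through 5.4; this is a structural sketch under those assumptions, not the actual system code.

```python
def modify_memorability(image, stages, database, n_candidates=20, increase=True):
    """Structural sketch of the modification pipeline. `stages` supplies
    the concrete components (all names hypothetical):
      stages.edge_boxes(image)          -> list of (r0, r1, c0, c1) boxes
      stages.heat_map(image)            -> 2-D memorability map (Chapter 3)
      stages.ann_search(db, crop, k)    -> k candidate replacements (Sec. 5.3)
      stages.composite(img, box, cand)  -> image with cand blended in (Sec. 5.4)
      stages.predict_mem(img)           -> scalar predicted memorability
    """
    heat = stages.heat_map(image)
    boxes = stages.edge_boxes(image)                      # Section 5.2
    mean_mem = lambda b: heat[b[0]:b[1], b[2]:b[3]].mean()
    # To raise memorability, replace the *least* memorable region (and
    # vice versa); this is one of the strategies discussed in Section 5.5.
    target = min(boxes, key=mean_mem) if increase else max(boxes, key=mean_mem)
    crop = image[target[0]:target[1], target[2]:target[3]]
    candidates = stages.ann_search(database, crop, n_candidates)
    composites = [stages.composite(image, target, c) for c in candidates]
    # Rank composites by their effect on whole-image memorability.
    composites.sort(key=stages.predict_mem, reverse=increase)
    return composites[:3]   # handpick among the top few (Section 5.4)
```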
Figure 5-2: Example edge boxes for different images. While not all boxes find full objects, with enough boxes, all objects would be covered. For our applications, recall is more important than precision.

5.2 Detecting Objects in an Image

The first step of our process, both in building our database and in modifying a specific image, is detecting the objects in an image. From these detections, we can extract objects to populate our object database and identify which parts of an image we may want to replace in order to modify memorability.

For object detection, we use the Structured Edge Boxes toolkit by Piotr Dollar and Microsoft Research (https://github.com/pdollar/edges). The toolkit provides a fast way of finding bounding boxes around objects in an image; more information on the inner workings of edge box detection can be found in the accompanying paper [23]. While that paper details successful results, the detector does not always work well on our ImageNet images because of the diversity of our dataset (for time reasons, we did not retrain the edge detector on ImageNet data and instead used the detector out of the box). Thus, we consider many candidate edge boxes for each image and use them all in our object database. This still yields proper results because our selection process is able to find the correct edge box containing an object when we query the database; we simply need to use enough edge boxes that every object in the image is contained in at least one of them. Figure 5-2 shows some sample edge boxes. While not perfect, the edge boxes are fast and accurate enough for our purposes.

5.3 Semantically Similar Image Retrieval

The bulk of the work in our modification pipeline is done in the image retrieval phase, in which we look for candidate images and objects to fill a hole in the input image (Section 5.4 details how we create that hole). During this step, we go from a database of millions or billions of candidates to tens or hundreds of final candidates, and in order to search such a large database in a reasonable amount of time, we use a hashing scheme to implement an approximate nearest neighbor search.

As the sizes of image databases continue to grow in both academia and industry, new schemes have been required to query these large databases quickly for operations like nearest neighbor search. One popular method has been locality-sensitive hashing (LSH), which tries to hash similar images to the same bucket [5]. When performing a search, one can then simply hash the query image and see which other images have been hashed into the same bucket.
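As background, the classic random-projection flavor of LSH can be written down in a few lines; the sketch below hashes precomputed feature vectors (e.g., GIST descriptors) with random hyperplanes. It is illustrative only: in practice one typically uses fewer bits per table and several tables so that true neighbors collide, and, as discussed next, we replace random projections with hash functions learned by a CNN.

```python
import numpy as np
from collections import defaultdict

class RandomProjectionLSH:
    """Classic hyperplane LSH over feature vectors (e.g., GIST descriptors).
    Similar vectors tend to receive the same bit string and therefore land
    in the same bucket."""

    def __init__(self, dim, n_bits=128, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random hyperplanes
        self.buckets = defaultdict(list)

    def _hash(self, x):
        # One bit per hyperplane: which side of the plane does x fall on?
        return (self.planes @ x > 0).tobytes()

    def add(self, key, x):
        self.buckets[self._hash(x)].append(key)

    def query(self, x):
        # Approximate neighbors: everything that hashed to the same bucket.
        return self.buckets.get(self._hash(x), [])
```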
Classic variations of LSH use standard image features, like GIST [17] and HOG [4], together with randomized projections to create the similarity-based hash functions [5]. While these methods work, and several off-the-shelf ANN search packages use an LSH-based scheme, they do not always scale well to larger or more diverse datasets. Recently, there has been a push toward supervised hashing, in which the features and hash functions are learned rather than chosen by a one-size-fits-all approach. In 2014, Xia et al. devised an approach named CNNH, which learns similarity-based hash functions from the images and a preconstructed similarity matrix by feeding the input data through a convolutional neural network [20]. This method has been shown to outperform classic LSH methods across a variety of use cases and datasets, so we adopted it. We constructed our similarity matrix from the ImageNet image classes and from Euclidean distances between deep feature vectors learned by the ImageNet CNN, and we used 128-bit hash functions so as to encode a large variety of hashes.

CNNH has two steps. First, preliminary hash codes are extracted from the input similarity matrix. Next, the images, along with those hash codes, are fed through the CNN in order to learn the hash function; the last two layers of the network correspond to the hash function. Due to the large size of our dataset, we applied optimizations to the algorithm originally proposed by Xia et al. so as to complete Step 1 of CNNH in a reasonable amount of time: through some parallelization, which only marginally affected the correctness of our output, we were able to speed up the hash code extraction by an order of magnitude. Also, for Step 2, we used a modified version of the ImageNet CNN, rather than the network proposed in the paper, to learn the hash functions. As a review, CNNH takes an input similarity matrix, finds the relevant hash codes from that matrix through coordinate descent, and then uses the CNN to learn a hash function that can produce those codes. We use CNNH to efficiently perform an ANN search on our large-scale object and image dataset and find semantically similar images.

5.4 Scene Completion

Given an input image, we identify its various objects using Edge Boxes and determine the memorability of each object from the memorability heat maps we can generate (see Chapter 3 for more information on the heat maps). Depending on the modification plan we are following (different plans are detailed in the next section), we select one or more objects to remove from the image and fill the resulting holes with information from our image and object database. This section details how we actually remove objects and fill the holes in the input image.

For each object we wish to remove, we apply the GrabCut algorithm to its Edge Box to determine the border of the object [18]. Since we are more interested in completely removing the object than in finding its exact borders, we choose the GrabCut parameters so that they follow the edges of the object only loosely. Figure 5-3 shows the performance of our object removal using GrabCut in tandem with Edge Boxes.

Figure 5-3: Object isolation using Edge Boxes and GrabCut. The top row shows the original images; the second row shows each image with an Edge Box in red and what GrabCut isolates, given the Edge Box as a bounding box.
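For concreteness, OpenCV's GrabCut implementation can be initialized directly from an Edge Box rectangle. The sketch below shows the basic call; the dilation step at the end is our illustrative way of loosening the boundary so that the object is removed completely, and the specific parameter values are assumptions rather than the tuned values used in our experiments.

```python
import cv2
import numpy as np

def object_mask_from_edge_box(image, box, iters=5, dilate_px=7):
    """Isolate the object inside an Edge Box with GrabCut (image is an
    8-bit BGR array, box is (x, y, w, h)), then dilate the resulting mask
    so removal errs on the side of taking a little extra background."""
    mask = np.zeros(image.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)   # background model (internal state)
    fgd = np.zeros((1, 65), np.float64)   # foreground model (internal state)
    cv2.grabCut(image, mask, box, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)
    obj = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    return cv2.dilate(obj, kernel)        # 1 = pixels to remove, 0 = keep
```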
Once the object is removed and the candidate replacements have been chosen through the ANN search, we further reduce the number of candidates by checking which replacements best match the colors and color gradients of the original image at the hole and along the hole's edges. This closely follows the procedure originally defined by Hays and Efros in [7]. We apply this procedure to several different final candidates (approximately 20), and among those we handpick the top few replacements as our final modified images.
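One way to phrase this matching step is as a sum-of-squared-differences over colors and gradients in a thin band around the hole. The sketch below is a minimal version under the assumption that the candidate has already been warped into alignment with the original image; the band width and the equal weighting of the color and gradient terms are our illustrative choices.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def seam_cost(original, candidate, hole_mask, band_px=8):
    """Score a candidate fill by how well its colors and gradients match
    the original in a thin band just outside the hole (lower is better).
    `hole_mask` is a boolean array marking the removed pixels; `original`
    and `candidate` are aligned H x W x 3 float arrays."""
    band = binary_dilation(hole_mask, iterations=band_px) & ~hole_mask
    a = original.astype(np.float64)
    b = candidate.astype(np.float64)
    color = np.sum((a[band] - b[band]) ** 2)
    # Compare image gradients along both spatial axes within the band.
    grad = sum(np.sum((np.gradient(a, axis=ax)[band]
                       - np.gradient(b, axis=ax)[band]) ** 2)
               for ax in (0, 1))
    return color + grad
```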
5.5 Future Work

This thesis has described the several components of the image memorability modification pipeline and how we have built them. At this point, however, several items and extensions remain for the project.

First, and most importantly, we must merge the different components. At this time, each component is built and works separately; we must join the different processes together and scale them up to cover the entire object and image dataset we have proposed.

Second, we must design and implement different modification strategies. As mentioned in previous sections, there are several strategies we could use when deciding which objects to replace or to use as replacements. One possible strategy is to find the least memorable object in an image and replace it with a more memorable, semantically valid version of that object. Another is to remove the least memorable objects in an image and replace them with a plain, semantically valid background; this should work because our exploration has shown that fewer objects in an image can lead to higher memorability. The list of strategies goes on, and those heuristics must be catalogued and implemented.

Finally, we must measure the success of our pipeline. Using a procedure similar to the one used when gathering our data or validating our memorability heat maps (see Chapters 2 and 3), we can test the pipeline by checking whether modified images have a statistically different level of memorability than their originals. We also need to make sure that the modification does not change the semantic meaning of the overall scene, which can be validated with crowd-sourced annotations of our images. These further steps will allow us to complete a cohesive and successful image modification pipeline; I anticipate they will finish in the coming months.

Chapter 6

Conclusion

A better understanding of image memorability helps us attain a deeper understanding of how our mind works and opens up a new range of industry and academic applications. In order to move image memorability research forward from previous works, which had impactful yet small-scale or niche results [11] [13] [9], we needed to construct a large image memorability dataset, the Mem60k dataset, and to collect human ground-truth memorability data on those images in a time- and money-efficient manner. The experiments yielded a great deal of data that enabled us to do several things. First, we trained a CNN to predict the memorability of a given image. Next, we used CNNs to create memorability heat maps that predicted which regions of an image were most memorable, and through further experimentation we validated the performance of these heat maps. Finally, since CNNs often do not yield human-interpretable results, we explored the memorability data we collected for a better understanding of what makes an image memorable. We saw that certain emotions, like disgust, are very memorable, and that fewer points of focus in an image simplify its content and make it easier for an observer to recall. The memorability data, along with the memorability heat map work, also enabled us to begin building a memorability modification pipeline, which, given an image, removes portions of that image and replaces them with semantically valid objects or scenes that increase the image's memorability.

Overall, this thesis contributes a push toward a better understanding of image memorability and an impetus to explore the vast array of applications that image memorability research can unlock. We expect the coming years to be truly exciting for this field.

Bibliography

[1] Damian Borth, Tao Chen, Rongrong Ji, and Shih-Fu Chang. SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the 21st ACM International Conference on Multimedia, pages 459-460. ACM, 2013.

[2] Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences, 105(38):14325-14329, 2008.

[3] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.

[4] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886-893. IEEE, 2005.

[5] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pages 253-262. ACM, 2004.

[6] Doug DeCarlo and Anthony Santella. Stylization and abstraction of photographs. In ACM Transactions on Graphics (TOG), volume 21, pages 769-776. ACM, 2002.

[7] James Hays and Alexei A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 2007.

[8] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pages 39-43. ACM, 2008.

[9] Phillip Isola, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. What makes an image memorable? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 145-152, 2011.

[10] Tilke Judd, Krista Ehinger, Fredo Durand, and Antonio Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV), 2009.

[11] Aditya Khosla, Wilma A. Bainbridge, Antonio Torralba, and Aude Oliva. Modifying the memorability of face photographs. In International Conference on Computer Vision (ICCV), 2013.

[12] Aditya Khosla, Atish Das Sarma, and Raffay Hamid. What makes an image popular? In International World Wide Web Conference (WWW), Seoul, Korea, April 2014.

[13] Aditya Khosla, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Memorability of image regions. In Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, USA, December 2012.

[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014, pages 740-755. Springer, 2014.
[15] Jana Machajdik and Allan Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of the International Conference on Multimedia, pages 83-92. ACM, 2010.

[16] Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2408-2415. IEEE, 2012.

[17] Aude Oliva and Antonio Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23-36, 2006.

[18] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309-314, 2004.

[19] Babak Saleh, Ali Farhadi, and Ahmed Elgammal. Object-centric anomaly detection by attribute-based reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 787-794. IEEE, 2013.

[20] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[21] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485-3492. IEEE, 2010.

[22] Juan Xu, Ming Jiang, Shuo Wang, Mohan S. Kankanhalli, and Qi Zhao. Predicting human gaze beyond pixels. Journal of Vision, 14(1):1-20, 2014.

[23] C. Lawrence Zitnick and Piotr Dollar. Edge boxes: Locating object proposals from edges. In Computer Vision - ECCV 2014, pages 391-405. Springer, 2014.