IMMERSION: A Platform for Visualization and Temporal Analysis of Email Data

Deepak Jagdish
M.S. Human-Computer Interaction, 2010, Georgia Institute of Technology
B.Tech Information & Communication Technology, 2007, Dhirubhai Ambani Institute of Information & Communication Technology

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology, September 2014.

© Massachusetts Institute of Technology, 2014. All rights reserved.

Author: Deepak Jagdish, Program in Media Arts and Sciences

Certified by: Dr. Cesar A. Hidalgo, Assistant Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Accepted by: Dr. Pattie Maes, Interim Academic Head, Program in Media Arts and Sciences

IMMERSION: A Platform for Visualization and Temporal Analysis of Email Data

Deepak Jagdish

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on 8 August 2014, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology

Abstract

Visual narratives of our lives enable us to reflect upon our past relationships, collaborations and significant life events. Additionally, they can also serve as digital archives, thus making it possible for others to access, learn from and reflect upon our life's trajectory long after we are gone. In this thesis, I propose and develop a web-based platform called Immersion, which reveals the network of relationships woven by a person over time and also the significant events in their life. Using only metadata from a person's email history, Immersion creates a visual account of their life that they can interactively explore for self-reflection or share with others as a digital archive. In the first part of this thesis, I discuss the design, technical and privacy aspects of Immersion, lessons learnt from its large-scale deployment and the reactions it elicited from people. In the second part of this thesis, I focus on the technical anatomy of a new feature of Immersion called Storyline - an interactive timeline of significant life events detected from a person's email metadata. This feature is inspired by feedback obtained from people after the initial launch of the platform.

Thesis Supervisor: Dr. Cesar A. Hidalgo, Assistant Professor in Media Arts and Sciences, Program in Media Arts and Sciences

Thesis Reader: Dr. Sepandar Kamvar, LG Career Development Professor of Media Arts and Sciences, Massachusetts Institute of Technology

Thesis Reader: Dr. Fernanda B. Viégas, Director, 'Big Picture' Visualization Group, Google

Acknowledgments

This thesis is dedicated to my family; especially to those loved ones who have already departed, whose life stories I wish could have been recorded in more detail.
The ideas and work presented in this thesis are the result of innumerable conversations, email exchanges and data experiments with my advisor Cesar Hidalgo, who has been a treasure-trove of scientific advice and a paragon of work ethic that I have strived to learn from for the past two years. I am deeply grateful to him for all his help.

Immersion would not have been possible without the wonderful collaboration I shared with my colleague Daniel Smilkov. I have had the chance to learn much about his domain of expertise and hope to work with him again in the future. Special thanks to my thesis readers Sep Kamvar and Fernanda Viégas, who apart from providing advisory feedback, have also inspired ideas in this thesis through their past projects. I would also like to extend my gratitude to Ethan Zuckerman for his inputs right from the early days of Immersion.

My friends at the lab - including colleagues from the Macro Connections group, the Funfetti band and extended connections in other research groups - have been monumental in their support during my thesis writing stage. I'm thankful to them for keeping me grounded during the tough times and for the good memories. I'm especially indebted to my friend and fellow thesis writer Dhairya Dand for keeping me company during the final stages of this thesis. My thanks also to the Media Lab community for making every day at the lab such a pleasure, for being willing volunteers to test out Immersion, and for spreading the word about the project.

Last but not the least, my heartfelt thanks to every person who has used Immersion - we asked you to take a leap of faith with your personal data, and you did. Your feedback has been invaluable, both from a personal and professional perspective. I hope you have enjoyed your Immersion experience.

Table of Contents

INTRODUCTION
  Understanding the problem space
  Motivation
    Introducing a new perspective
    Crystallizing our life's stories
  Proposed solution
  Thesis structure
PART 1: IMMERSION
  Prior work
  Designing Immersion
  Engineering Immersion
  Launch versions
  Impact
PART 2: EVENT DETECTION
  Significance of the temporal dimension
  Prior work
  A quick overview of the technique
  A deep-dive into the Event Detection pipeline
  Results
  Future Work
CONCLUSION
BIBLIOGRAPHY

INTRODUCTION

Understanding the problem space

Networks of people such as families, friends, teams and firms are social structures that mediate the flow of information among people, while providing social support and depositing trust. Such networks leave rich digital trails of data as a by-product of our usage of new media technologies such as email and social media. These digital trails in turn have immense potential in terms of the information we can unearth about ourselves and the networks that we are part of. However, revealing useful information embedded in the underlying datasets requires specialized tools that are designed to adapt to the ontology of each dataset.

Datasets spawned by our use of technological tools include (but are not limited to): our email history, banking transactions, texting and phone call logs, GPS destination requests, location check-ins, social media postings, web browsing history, and so on. Each of these datasets has an ontology that is different from the others due to fundamental differences in the nature of the data being recorded.
For example, an email dataset contains data fields such as the people a person has exchanged emails with, the content of their conversations, and the timestamp of each email. On the other hand, a dataset of GPS request logs contains data fields such as the latitude and longitude of destinations searched for, the user ID associated with each request, the device from which each request originated, and the timestamp of each request. With the exception of timestamps, no other data fields are common between these two datasets. Hence, a one-size-fits-all approach would not be ideal if our goal is to build systems that can maximize the amount of information that can be extracted from each dataset.

This thesis focuses on mining a specific kind of dataset - email metadata - involving networks of people. This is done in the context of designing and developing a web-based platform called Immersion, which creates a visual narrative of a person's life as recorded in their email metadata. It is worthwhile to note that metadata here includes only some data fields from the whole email data corpus, namely the To, From, Cc and Timestamp fields. The decision to use only metadata is made for two reasons. One reason is to respect the privacy of the user by accessing only the data fields that are needed to support the narrative that Immersion aims to create. Secondly, accessing and processing all of the data fields associated with an email, such as the body content and attachments if any, takes much longer and makes the underlying system architecture more complex than necessary.

Motivation

This thesis is born out of a multiplicity of motivations, each fueling a different aspect of novelty that Immersion introduces to the domain of experiencing and exploring personal metadata.

Introducing a New Perspective

Mainstream email platforms more often than not represent email data as a time-ordered list of messages, one after the other, regardless of whom the emails are exchanged with. A quick visual survey of the popular email platforms over the past few decades reveals that this basic visual structure has remained the same. As a specialized tool designed to make sense of email data, Immersion is able to ask specific questions that can reveal much more than what we are currently able to learn from our email history. By using the people we exchange emails with as the primary anchor for questions, more abstract parameters such as conversation groups, conversation topics, introduction trees, evolution of relationships and so on can be calculated. The results of such calculations are then carefully crafted into interactive visualizations, thus enabling people to see and explore more of their own email dataset than was previously possible.

Crystallizing Our Life's Stories

Biographies are a common way to spread information about the life of a person. Some famous individuals are capable enough of writing their own biographies, whereas other famous individuals are inspiring or controversial enough to have others write about them. And then there are other influential people who employ ghostwriters to have their stories told. Regardless of which of these routes leads to the making of a biography, it is perplexing to note that only an infinitesimally small fraction of the human population across time has had the luxury or capability to have their stories narrated and archived. The advent of democratic forms of new media such as wikis has improved this fraction, but only marginally.
According to the most recent count from the dataset released by the Pantheon project, there are about 997,276 biographies of people in Wikipedia (Macro Connections, 2014). Even though this is a significant improvement from before the emergence of Wikipedia, comparing it to the number of people who have lived across many centuries reveals that we've barely scratched the surface. One of the primary reasons not every person publishes an account of their life, apart from the lack of public visibility due to not being famous, is that there is a massive gap between the desire to publish and the capability to actually do it. This gap arises specifically because we lack the tools to easily crystallize our life's events into a more concrete shareable format like a book or a website.

Proposed Solution

Immersion is a publicly available platform that people can use with their own email data to experience new perspectives of their email life through interactive visualizations. Even though previous projects have solved the visualization aspect of this, Immersion's goal is also to scale this approach and be able to support hundreds of thousands of users, which presents complex technical and design challenges. Moreover, Immersion aims to bridge the gap between having the desire to publish one's life story and the capability to actually do it, by providing a platform that automatically detects significant events in the timeline of a person's life using their email metadata, and annotates each detected event with the people and topics related to it.

It is important to note that Immersion does not create an entire biography of a person. That task requires more work than just creating a skeletal view of a person's life. What Immersion does is make the first step towards a personal life account easier, by automating part of the process. It achieves this by automatically creating a skeletal framework of connected significant events in a person's life, which they can later expand into a format they prefer, such as a book or a movie. The transformation from an outline of events to a richer and longer format is outside the scope of this thesis, and would require tools specifically designed to solve that problem. Immersion aspires to motivate future projects that can build on the skeletal event framework that it provides.

Thesis Structure

Beyond the introduction, this thesis is broadly divided into two parts. The first part includes a background study and detailed explanation of all the features included in the proposed solution, Immersion. The second part focuses on the implementation details of a specific new feature - detection of significant life events - that is being introduced in the next version of Immersion.

In the first part, after reviewing related work that exists in the domain of email data visualization, I explain the design and implementation of each high-level feature that was part of the first public release of Immersion. This is followed by an overview of the impact of that release, challenges faced in the large-scale deployment of the platform and system-level changes being incorporated in the next version of Immersion. I conclude this part with a compilation of observations made based on my interactions with users of Immersion.
One of these observations in particular - users' propensity to point out how significant moments in their email life are reflected in their Immersion profile - forms the motivation for the second part of this thesis.

The second part of the thesis looks at email metadata from a temporal perspective in the context of automatically detecting significant life events and presenting them through an interactive timeline. I introduce this part with a brief overview of related work in the domains of Time-Series Analysis (TSA) and Temporal Network Analysis (TNA). Among the methods reviewed, I elucidate my reasons for choosing a TSA-inspired scan-statistic method. This chosen statistical method lies at the heart of an Event Detection pipeline that is specifically designed to process time-series obtained from timestamps in email metadata. I then explain how this pipeline forms the backbone of the new Storyline feature of Immersion that enables people to explore their email history through an interactive storyline. A quick overview of the user interface of the Storyline feature is included, followed by a brief discussion about the computational infrastructure used in its development. I then conclude this thesis with some comments about future directions for this work. These include plans for the upcoming release of Immersion and identifying avenues for improving the efficiency and accuracy of the event detection feature, followed by some closing remarks.

Part 1
IMMERSION*

This part describes prior work in the domain of email data visualization, and the design and implementation of Immersion - a people-centric visualization platform for email data.

* Work done in collaboration with my colleague Daniel Smilkov

Prior Work

Email data has been used to answer research questions in various domains such as (but not limited to) computer-supported cooperative work, network science, information visualization, information retrieval, etc. In the context of the first part of the thesis, we are interested in new ways of representing personal email data. In that vein, various projects by Viégas, such as TheMail, Social Network Fragments and The Email Mountain, must be mentioned since they provide interesting visual perspectives using the same kind of data that Immersion also uses.

The Email Mountain project (Viégas, 2005), aptly named to reflect the deluge of email we experience these days, visualizes a person's email history expressed in terms of all the people that this person has communicated with over time, with each contact shown as a separate layer. The thickness of the layer is directly proportional to the 'thickness' or strength of the relationship between the layer's corresponding contact and the user.

[Figure: The Email Mountain visualization]

TheMail (Viégas, Golder, & Donath, 2006), on the other hand, is a visualization that focuses more on how the content of a user's emails has evolved over time. This provides a convenient way to quickly perceive the important conversational topics that a user was involved in with other contacts.

The Social Network Fragments project (Viégas, Boyd, Nguyen, Potter, & Donath, 2004) focuses more on the network aspect of email data, by deriving a graph of social relationships that a user is involved in. It also addresses questions about how network structures evolve over time.
This project is much closer to the visualization technique that Immersion uses, as the next section will reveal.

Designing Immersion

This section describes the design decisions that informed the interaction and visual design aspects of the various features of Immersion. I shall first provide an overview of the interface, followed by individual descriptions of each major feature.

Overview of the User Interface

One of the motivations that Immersion is founded on is to provide a new perspective for people to see their own email data. In order to do this, we have to eschew traditional forms of representation of email data, and replace them with a new approach that highlights different aspects of the same data. In a way, I liken the traditional approach to users always seeing their data framed within the constraints of a square, and Immersion to a tool that reveals to the user that what they're looking at is actually a cube containing many layers of data (Figure 4). Immersion's goal is to use the different dimensions of email data, such as the people, timestamps, conversation topics, etc. to create a new perspective.

[Figure 4: Looking at data the traditional way versus breaking the square to reveal the deeper layers of data]

It is also important that the new perspective we architect is easily understandable by the user. For that reason, there is a need to keep the user interface elements simple, meaningful and as information-laden as possible with minimal noise.

[Figures 5-9: Screenshots of successive visual iterations of the Immersion user interface]

After a number of visual iterations, as shown in the series of screenshots above (Figures 5-9), we settled on a UI layout that focuses on the Network view, Person view and Statistics view. The time range selected using the slider at the bottom of the interface filters the data that is fed to each of these views.

[Figure 10: The finalized Immersion user interface]

Network View

As shown in the screenshot of the finalized user interface (Figure 10), the Network view takes center stage in Immersion. In this view, each collaborator (a contact that passes the relevancy test described later in this thesis) is represented as a circular node, and the size of the node corresponds to the number of emails the user has exchanged with that particular collaborator. The node is larger when more emails have been exchanged, and smaller when fewer have been exchanged. This allows the user to get an idea of who their top and most connected collaborators are at a quick glance.
Two or more nodes are connected by a link if the collaborators that the nodes represent have participated in email conversations together. The link width is also modulated based on the strength of the relationship between the collaborators, as seen in the user's email dataset only.

Statistics View

On the right-hand side of the main user interface is an info pane that includes some basic aggregate statistics about the user's email usage, the number of collaborators that Immersion has detected, a ranking of the user's collaborators and also a series of histograms. These histograms (Figure 11) represent three significant series of data points. The first one shows how the number of emails sent by the user has evolved over the years. The final bar in the histogram also includes a predictive component, represented by the grey bar, which uses the average of sent emails from the previous year and the current year. The second histogram shows a similar series of data points, but for the number of emails received by the user over the years. The final histogram represents the number of new collaborators the user has emailed each year. This is calculated by detecting the first time an email is sent to any collaborator. In the sample histogram shown here, it is clear that 2013 was the year in which the user interacted with the largest number of new collaborators.

[Figure 11: Histograms of emails sent, emails received and new collaborators per year]

Rankings View

The info pane also shows the user a ranked list of collaborators based on their relationship strengths. This view is automatically filtered based on the time range selected. The user can also toggle the rankings between the time-filtered version and the rankings across all time.

Person View

Clicking on any node brings the user to the Person view (Figure 13). This view focuses the user interface elements on the collaborator of interest, and shows the other collaborators that the user is connected to via the selected individual. The network view changes into a circular view with the collaborator of interest in the center, and the second-degree connections to that individual laid out on the circumference.

[Figure 13: The Person view, focused on a selected collaborator]

The info pane on the right is also updated to reveal information about the user's email communication history with the selected collaborator. In addition to showing the user information such as the date of the first and most recent email exchanged with that collaborator, the info pane also shows a histogram of how the relationship between them has evolved over time.

The Person view also reveals information about the people that the selected collaborator has introduced the user to. Immersion calculates this by detecting the first time any collaborator(s) appears in the Cc: field of emails. Similarly, it also shows the person that introduced this collaborator to the user, by detecting the first time the selected collaborator appeared in the Cc: field of an email sent by someone else.

Snapshot Feature

Most users of Immersion feel the need to share their Immersion network with their family, friends and/or colleagues.
To facilitate this, we implemented a Snapshot feature that allows the user to save (only) the Network view as an image file. Immersion does this by converting the SVG rendering of the network into a data URI that is then used as input to a canvas element, which renders the image pixel by pixel. The link to the generated image is emailed to the user to download, and the image is subsequently deleted from Immersion's server after 30 days.

Data Ownership

Immersion takes the privacy of users very seriously, and so we give the user full control over their own metadata. At the end of their Immersion experience, every user is given the choice to either delete their metadata from Immersion, or save it on Immersion's server. The former option ensures that no personally identifiable information about the user's email history is accessible to Immersion any longer after logout, whereas the latter option saves the data to the server, thereby enabling faster access to their Immersion profile in the future.

Engineering Immersion

In order to bring the aforementioned features and user interfaces to life, there is a considerable amount of engineering involved. The engineering stack and architecture of Immersion have evolved since its birth through different iterations as we added new features and supported more users. This section describes the client-server system that Immersion is built on, its evolution over different iterations, and also sheds some light on its individual technical components.

System Architecture

The initial prototype of Immersion that was used for alpha testing within the Media Lab community was feature-driven rather than scale-driven. Most of the data processing was done on the server and the client simply received a distilled dataset that it then rendered as visualizations. We continued to use this model for the initial public launch of the website as well. However, even though this model worked well for fewer than 50 users due to low concurrency, it became very clear a few days after the launch that when more than 100 users were on the website at the same time, the server (due to its infrastructural limitations) would start to choke and fail. This eventually happened, and it led to a complete overhaul of the architecture, which I describe below.

The modified architecture, which has a distributed computing flavor, gives more responsibility to the client and only requires the server to do the data-fetching task. This model scales much better for concurrent users since intensive computational tasks like email metadata parsing and network abstraction now happen on the client side. In other words, the client is thick and the server is thin in terms of the tasks they perform. A visual representation of the architecture is shown in Figure 14.

[Figure 14: System architecture showing the Immersion data flow and each layer's data-related responsibility. Thick client (JavaScript / D3.js): email parsing, data cleaning, visualization, algorithms. Thin server (Java / MongoDB): user authentication, email extraction, database. Email server: raw emails. Source: (Smilkov, 2014)]

The client is completely JavaScript driven for computational tasks and uses SVG, HTML and CSS3 for the visual rendering of content. It receives serial batches of raw email data that the server has fetched from the email service provider after mediating the user authentication process.
The client then parses this data and pushes data fields of interest into an in-memory database. There are helper functions written in JavaScript that then create different slices and abstractions of the data, such as ranking contacts, calculating the relationship network, detecting communities, animating visualizations, etc. Upon logout, if the person wishes to save their data, it is sent back to the server to be stored in a database for future access. The following section explains in detail the implementation of the data flow pipeline involved in the communication between the client and server.

Data Flow

AUTHENTICATION

In order to fetch email data from a service provider, the user has to grant Immersion access to his or her email account. Authentication systems vary depending on the email service provider, and in Immersion we support three providers: Gmail, Yahoo! and Microsoft Exchange. Immersion keeps the process of receiving and transmitting login credentials as secure as possible. In the context of the overall system, Immersion makes use of SSL encryption to keep information secure during communication between the client (the user's browser) and the server. In the case of Gmail, users are redirected to a page hosted by Google that returns an authentication token to the Immersion server if the user's login credentials are validated. The validation process happens outside of Immersion's purview, thereby keeping the process secure.

FETCHING EMAIL HEADERS

Once the server receives the signal of authentication from the service provider, it creates a fetching task that is handled by a multi-threaded Java application. Our initial prototype used a Python fetching module, but due to scalability issues, we moved to a Java-based solution. For Gmail and Yahoo accounts, Immersion's Java fetcher fetches email data using the Internet Message Access Protocol (IMAP), and in the case of Microsoft Exchange we make use of the official Java Exchange API. The current version of the Immersion server spawns 30 threads in order to handle concurrent users.

Immersion fetches only the header information of every email. This means that it never accesses or reads the body (and attachment) content of any email, thereby providing a higher level of privacy for the user. It also means that the data fetching process is much faster since headers are quite small in size compared to the actual email. The header contains only some data fields, of which the only ones Immersion gathers are the From, To, Cc and Timestamp fields. The upcoming version of Immersion also gathers the Subject and ThreadID fields, but those too are part of the email metadata contained in the header, and do not require access to the body content of the email.

To avoid overloading the server with too much data, Immersion limits the number of email headers fetched per user to 300,000. This number is usually sufficient to gather emails spanning approximately 3-4 years for any user, thereby giving Immersion enough data to paint an informed picture through the visualizations. Since the IMAP protocol allows for fetching specific emails, Immersion executes the fetching task as a batch process of 10,000 emails each, as sketched below. This means that the user does not have to wait till all of their emails are downloaded, which can sometimes take up to 10 minutes.
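For illustration, the following is a minimal sketch of such a batched, headers-only fetch, written with Python's standard imaplib in the spirit of the original Python prototype; the production fetcher is the multi-threaded Java application described above. The host, folder name and password-based login are simplifying assumptions on my part (the real system authenticates via provider tokens).

```python
# A hypothetical sketch of a batched, headers-only IMAP fetch.
# It is not Immersion's production (Java) fetcher; the host, folder and
# password-based login are illustrative assumptions.
import imaplib

BATCH_SIZE = 10_000  # emails per batch, as described above
HEADER_FIELDS = "(BODY.PEEK[HEADER.FIELDS (FROM TO CC DATE)])"

def fetch_header_batches(host, user, password, folder="INBOX"):
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    conn.select(folder, readonly=True)

    # Collect all message sequence numbers in the folder.
    _, data = conn.search(None, "ALL")
    ids = data[0].split()

    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        # BODY.PEEK fetches only the named header fields; it never touches
        # the message body and does not mark the message as read.
        _, headers = conn.fetch(b",".join(batch).decode(), HEADER_FIELDS)
        yield headers  # hand one batch at a time to the next stage

    conn.logout()
```

Yielding one batch at a time mirrors the behavior described next: the client can start working as soon as the first batch arrives.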
Once the initial batch of 10,000 emails is received, Immersion starts to render visualizations using that data, thus providing the user with something useful to work with while the rest of the data is fetched and included in the visualizations.

PACKAGING DATA FOR THE BROWSER

For every batch of 10,000 emails fetched, the server compresses all of that metadata into a single JSON (JavaScript Object Notation) file by wrapping it in a gzip format that the client's browser is able to unpack at its end. This greatly reduces the size of each data packet sent to the browser. On average, each compressed JSON batch of emails comes to between 2 MB and 5 MB.

EXTRACTION OF METADATA & MAKING SENSE OF IT

This is a critical step in the data processing context since it helps filter out emails that have invalid values in data fields like From, To, Cc and Timestamp. There are a number of occasions where a timestamp's value is from the distant past or future due to inconsistencies in the sender's email settings. Emails whose metadata cannot be parsed correctly are removed from the dataset before visualization. Also, by checking the value of the Auto-Submitted field in the header, we are able to ignore emails sent automatically by machines.

Immersion systematically combs through every email in the Sent folder to detect if the user has multiple email addresses (aliases) associated with his or her name. We do this to make sure that all of the emails are correctly mapped to the correct sender and receiver, and that the user himself or herself is not detected as a separate contact. Immersion also does something similar for other contacts appearing in the user's email metadata. Multiple addresses for a contact are collapsed into a single identifiable entity, defined by the Firstname Lastname of that contact. For example, if John Doe is a contact appearing in a user's email history with two separate email addresses - john.doe@gmail.com and hi@johndoe.com - they will be mapped to a single entity called John Doe, as long as the name fields for both email addresses are the same. This vastly improves the clarity of the final network of contacts by removing spurious nodes for the same contact. However, in doing this we also risk wrongly combining two different people into a single entity if their first and last names are the same.

IDENTIFYING RELEVANT CONTACTS

Since the visualizations in Immersion are people-centric, it is important that the dataset does not contain contacts that are irrelevant from the perspective of the user. For example, email addresses associated with mailing lists, social network notifications, etc. are not considered relevant since they do not contribute much meaningful information. We use a simple filtering technique to remove these contacts from the network. A minimum threshold of 2 emails (sent and received) is set, and only contacts that pass this requirement are shown in the network. This effectively rules out contacts from which the user has received only one or two emails but never sent an email to.

CREATING AN IN-MEMORY DATABASE OBJECT IN THE BROWSER

Since all the data processing happens in the browser, the data needs to be organized in a way that makes it easy to retrieve the necessary bits for calculating metrics such as rankings of people, creating the underlying graph, identifying timestamps, etc. In order to facilitate this, we create an in-memory database object which stores all the metadata sent by the server. This database object has helper functions associated with it that enable the retrieval of specific data points that Immersion needs to generate visualizations; a simplified sketch of such a store is shown below.
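The real store and its helpers are written in JavaScript and run in the browser; the Python sketch below only illustrates the logic of the steps above (alias collapsing by display name, the relevancy threshold, and a simple ranking helper). The class and field names are hypothetical, and the reading of the relevancy test as requiring the threshold in each direction is an assumption.

```python
# A hypothetical Python sketch of the browser's in-memory metadata store.
# Immersion's real implementation is JavaScript; the names and the exact
# relevancy rule here are illustrative assumptions.
from collections import defaultdict

MIN_EMAILS = 2  # relevancy threshold described above

class MetadataStore:
    def __init__(self, user_addresses):
        self.user_addresses = set(user_addresses)  # the user's own aliases
        self.sent = defaultdict(int)       # display name -> emails sent to them
        self.received = defaultdict(int)   # display name -> emails received from them

    def add_email(self, sender, recipients):
        # sender and recipients are (display_name, address) pairs taken from the
        # From/To/Cc header fields. Addresses sharing a display name are collapsed
        # into one entity simply by keying on the name.
        s_name, s_addr = sender
        if s_addr in self.user_addresses:
            for name, addr in recipients:
                if addr not in self.user_addresses:
                    self.sent[name] += 1
        else:
            self.received[s_name] += 1

    def relevant_contacts(self):
        # Helper: contacts passing the relevancy test, ranked by total volume.
        # Assumption: the threshold applies in each direction (sent and received).
        names = set(self.sent) | set(self.received)
        keep = [n for n in names
                if self.sent[n] >= MIN_EMAILS and self.received[n] >= MIN_EMAILS]
        return sorted(keep, key=lambda n: self.sent[n] + self.received[n],
                      reverse=True)
```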
GENERATING THE UNDERLYING GRAPH STRUCTURE

In order to transform the disconnected pieces of information from each contact into a larger connected graph that powers the network visualization, Immersion queries the in-memory database to get the number of emails between the user and each contact. We then calculate the communication strength between the user and each contact, and also between contacts themselves from the user's perspective. The communication strength of contact i, a_i, is calculated as the generalized mean of the number of emails that the user (or one of his or her contacts) has sent to i, s_i, and the number of emails received from i, r_i:

a_i = ((s_i^p + r_i^p) / 2)^(1/p)

We empirically found p = -5 to be the value that highlights symmetric (two-way) communication over asymmetric (one-way) communication. The resulting values of a_i, reflecting the relationship strengths between the user and his or her contacts, and also between the contacts themselves in the user's ego-network, are then converted into nodes (each contact) and links (weighted by the value of a_i between contacts) to give rise to the network visualization.

Launch Versions

Interactive Exhibit: Webs of People

Soon after the development of the initial prototype, Immersion was selected to be an interactive exhibit at The Other Festival organized at the MIT Media Lab in April 2013. The exhibit was titled Webs of People, in tandem with the Webs of Matter exhibit showcasing the Silk Pavilion project from the Mediated Matter group. One of the main goals of our exhibit setup was to provide an experience where the creators of the project did not have to be physically present for a visitor to try out Immersion.

PHYSICAL SPACE

The physical space assigned to the exhibit was located near the main entrance of the lab. This meant that the margin for failure was very small, both in terms of the software and the hardware, since it was a high-visibility area for visitors. Figure 15 shows a rendering of our vision for the exhibit in that space.

[Figure 15: A rendering of our vision for the Webs of People exhibit near the Media Lab entrance]

We explored the possibility of using projectors to have a large viewable area for different views of the visualizations. But even projectors with high lumen ratings did not succeed in reproducing colors and contrast at the level we desired, made all the more difficult by the lighting situation in the exhibit space. During daytime, there was enough sunlight in that area to wash out any sort of projection on the wall, unless the projectors were kept very close to the wall. This would mean that the viewable area would become much smaller. The other option was to enclose the exhibit space using curtains, or create a kiosk/booth, but we decided against this approach because it would affect the open experience we wanted the visitor to have while experiencing Immersion.

HARDWARE & TECH SETUP

Given the complexity of using projectors, we instead decided to use three large displays, set up in an angular fashion, each display showing a different view of the exhibit. A photograph of the exhibit setup is shown below.

[Figure 16: The final Webs of People exhibit setup]

The central display (Figure 16) is the one that visitors can log in to and see their visualization.
The display on the left shows the Immersion Guestbook, and the one on the right shows the Immersion Rankings of the visitors who have saved their profile upon logout. The Guestbook and Rankings were special features added for the exhibit version.

IMMERSION GUESTBOOK

For every visitor saving his or her Immersion profile during logout, we add this person as a node to the Immersion Guestbook network on the left. The idea behind the Guestbook is to show the network that emerges from the connectedness of the Media Lab community, including its external connections such as sponsors, spouses, roommates, etc. Figure 17 shows a screenshot of the Guestbook feature.

[Figure 17: Guestbook feature showing a network view of connected guests who saved their Immersion profile at the Webs of People exhibit]

IMMERSION RANKINGS

For every visitor saving his or her profile during logout, we also calculate their collaboration rank - a heuristic that measures how many people the user has exchanged at least 3 emails with. Based on this measure, we show the top ten users in the Immersion dataset as a ranked list that is updated every time a new user's profile is saved to the server. This turned out to be a great way to motivate visitors to check out the exhibit and also save their profile for future use. Figure 18 shows the rankings as of May 2013.

[Figure 18: Immersion Rankings at the Webs of People exhibit, as of May 2013]

Public Website: Immersion v.0

On 30 June 2013, we launched the public version of Immersion as a standalone website, accessible at immersion.media.mit.edu. The website version did not include the Guestbook and Rankings features, but had extra layers of server-side functionality in order to support many hundreds of users concurrently. The website has been tested to work successfully on browsers running the WebKit rendering engine (The WebKit Open Source Project, 2014), such as Google Chrome and Safari. Support for non-WebKit based browsers is not provided since we experienced severe user experience issues because those browsers were unable to handle rendering a large number of animated SVG elements (like those appearing in the network visualization). The impact of the launch of the public website, and observations made from interactions with users of the exhibit version, are discussed in the next section.

Impact

Reactions & Quantitative Observations

After the launch of the public website, Immersion was immediately picked up by different news agencies for reporting in the context of the global discussion on metadata and privacy. It was clear that the public, informed through the media, was starting to view Immersion not just as a tool for self-reflection, but more so as a tool that reveals the potential for unethical access and misuse of personal metadata by governmental agencies. Using Google Analytics, we were able to keep track of the location, time and referral source (website) of each visitor to the website. The first news report appeared in the Boston Globe, titled 'What Your Metadata Says About You' (Riesman, 2013).
Even though this report by itself only directed about 1,000 users to the website, it prompted a number of other news agencies to write about the project. NPR published the next major news report, pitching Immersion as a tool that 'lets you spy on yourself' (Goldstein, 2013), and that article brought thousands of visitors to the website, peaking at almost 30,000 visits a day. The figure below shows the number of visits per day for the month of July 2013, right after the launch of the public website.

[Figure 19: Immersion's web traffic during the month after the launch on June 30, 2013]

Even though we had initially prepared the server for a high number of concurrent users, the deluge of visitors to the website was far beyond our initial expectations. This resulted in the server crashing on July 2nd, and consequently Immersion was offline for the next three days while we worked on a more scalable architecture. The new architecture pushed all the processing to the client, with the server only fetching the metadata from the email provider. This in effect created a distributed computing setup, and relieved the Immersion server from having to do all the processing on its own. The website was re-launched on July 4th, and the timing was perfect because by then many more news agencies and influencers on social media had started to link to the project. The peak day was July 9th, with over 200,000 users visiting the website from many different sources such as WikiLeaks (WikiLeaks, 2013), Time magazine (Groden, 2013), The Independent (Vincent, 2013), etc. As of June 2014, Immersion has received over 800,000 unique visitors spanning over 1.3 million visits, of which 43% were returning visitors.

Another data point we were tracking was the amount of time that visitors spent on the website. As shown in the figure below, Google Analytics reveals that many users spent less than 10 seconds on the website, probably due to their reluctance to share their email metadata. However, more than 210,000 users spent between one and ten minutes on the website (Figure 20), which suggests that the engagement with users who were willing to try out Immersion was substantial.

[Figure 20: Overall time spent on the Immersion website by users across multiple sessions]
0-10 seconds: 723,306 sessions
11-30 seconds: 57,394 sessions
31-60 seconds: 80,533 sessions
61-180 seconds: 103,146 sessions
181-600 seconds: 108,599 sessions
601-1800 seconds: 72,141 sessions
1801+ seconds: 36,265 sessions

Qualitative Observations

Another source of observations was the exhibit version of Immersion, which allowed us to watch visitors use Immersion in person and gather feedback from them directly. It was fascinating to observe each person's immediate emotional reaction when they saw their Immersion profile for the first time, because the visualizations clearly motivated them to share more information about themselves. The emotional reactions of people varied from person to person, and seemed to be influenced by factors such as:

- their personality: whether a person was open to discussing their Immersion experience, or whether they preferred not to elaborate on the relationships that were revealed visually,
- personal/work emails: whether the emails analyzed were of the personal and/or the professional kind,
- light or heavy usage: whether they had only a few thousand emails or many thousands of emails in their inbox, etc. The visualizations in Immersion were more insightful and richer in content if the user had at least a few thousand emails.
One common theme of reaction I noticed among most of the users was a propensity to recall and share, through informal conversation, the different 'eras' of their email life. Some sample comments made by people attempting to communicate their findings were as follows:

- "Hey, that's when I moved to Austin!"
- "I met my wife around that time..."
- "2011 was a low-point for me... I was between jobs."
- "My network becomes so dense after I started grad school."

This seemed to be a common way for them to frame and elucidate their emotional response to any significant visual finding that they perceived in their Immersion experience - for example, validating that a dip in the graph of a relationship with a particular contact coincided with the timing of their real-life separation. This particular observation motivates the goal for Part 2 of this thesis: being able to algorithmically detect significant events and time-periods in a person's email life, and present the results in the form of an interactive storyline.

Part 2
EVENT DETECTION

This part focuses on temporal analysis of email data in order to detect significant life events. The results are used to craft a new feature in Immersion, called Storyline - an interactive timeline of a person's email life.

Significance of the temporal dimension

This part of the thesis essentially boils down to the application of time-series analysis to email metadata. A time-series is a collection of values corresponding to a set of contiguous time instants generated by a process over time (Esling & Agon, 2012). Time-series data mining is a well-explored research domain because a large percentage of the data generated through processes and experiments, either manually or by machines, is anchored with specific timestamps. Time-stamped data is commonplace in scientific domains like medicine (ECGs, EEGs, heart rate, etc.), astronomy (light curve data from exoplanets, positions of heavenly bodies, etc.) and also at the consumer end of the spectrum in the form of personal data (email, text messaging, financial transactions, etc.).

The ubiquity of time-series as a form of representing a process for further analysis is brought about largely because of the convenience of using time as the independent variable in many experiments, and so it easily finds a home on the x-axis of even the most basic graphs. Representing time on a straight line is a motif that a lay person is comfortable with, and from that building block we have managed to come a long way in terms of methods to analyze and represent higher levels of time-stamped data. Apart from being able to delve deeper into the characteristics of a single time-series, the temporal dimension also helps us to compare different processes using time as the common axis. Aptly put by Gershenfeld & Weigend, "measured time-series form the basis for characterizing processes under observation, to better understand the past and at times, to predict their future behavior" (Gershenfeld & Weigend, 1993).

One of my main goals with this thesis is to bring the contextually relevant body of knowledge of time-series analysis closer to the domain of email data. Consequently, this enables people to derive more insights from their own email data through a publicly available platform, in this case Immersion.

Prior Work

There have been previous works that have analyzed email through temporally motivated questions.
The literature review revealed that while there are existing tools to explore one's own email data in the temporal dimension, there is more that could be done from the technical perspective, in terms of using better, more accurate and more efficient methods of mining email data. For this reason, in the second part of this thesis, I focus more on the technical challenge involved in detecting significant events and less on the interface aspect, since we would be able to create richer narratives if we are able to mine better stories from the underlying data.

Fischer & Dourish's work on temporal patterns in coordinated work environments (Fischer & Dourish, 2004) resulted in the creation of a tool called TellMeAbout (Figure 21), which gives a user temporal information about their individual interactions with their colleagues.

[Figure 21: TellMeAbout]

Viégas et al. have also published work relating to the temporal aspect of email communication. In their project PostHistory they aggregate data on temporal dimensions such as daily email averages, 'quality' of emails (spam or relevant contacts) and relative frequency of email exchanges with contacts in order to provide the user a way to access higher-level information about their email habits (Viégas, Boyd, Nguyen, Potter, & Donath, 2004). Immersion makes use of similar dimensions, and more, inspired by works in the field of time-series analysis that are discussed next.

Even though time-series analysis (TSA) is a fast-moving research domain, my literature review revealed that not many TSA methods are formulated specifically for email datasets. It wasn't surprising to note that prior work in this intersection mostly made use of publicly available email archive data such as the Enron email corpus (Cohen, 2009), rather than allowing people to analyze their own email data.

Approaches to temporal analysis of processes can generally be classified into two kinds:

- a signal-processing inspired Time-Series Analysis (TSA) approach,
- a network-science inspired Temporal Network Analysis (TNA) approach.

The former approach is more statistics driven and has a much larger body of work compared to the latter, especially since network science is a relatively newer field compared to regular time-series data mining. Pertaining to TSA, there are a number of useful survey papers that enabled me (as a beginner in this field) to get a good idea of the common analysis methods and representation techniques used when dealing with time-series data. The general time-series related methods outlined in these papers fall into the following categories (Ratanamahatana, 2010):

- Indexing (Query-by-Content)
- Clustering
- Classification
- Prediction
The combination of analysis techniques to accomplish the goal of the second part of this thesis is wholly inspired from the Anomaly and Event Detection family of methods in the list above. Given that the high-level data structures and visual components used in Immersion are inspired by network science, I've also reviewed technical papers in the research space of temporal aspects of networks to investigate how they can assist in the process of event detection. Holme et al in their book Temporal Networks (Holme & Saramaki, 2013) reviews past and emerging works in the field of TNA -- temporal graphs, evolving graphs, dynamic networks, etc. --- making a clear distinction of which techniques are best suited for static networks and which are best suited for dynamic networks. More often than not, TNA methods simplify the task of dynamic network analysis by sampling a dynamic network at regular time intervals, resulting in a series of static networks each forming the signature network representation for each time interval. 46 Wan et al's work on link-based event detection in email communication has shown that simply relying on analysis of variations in communication volume over time is not enough to surface every event worth detecting in a person's email dataset (Wan, Milios, Janson, & Kalyaniwalla, 2009). There are changes that can happen over time within a person's email life that need not necessarily affect the communication volume, but can radically change the underlying network of people that the person has communicated with. For example, moving to a new city might mean that a person still sends and receives the same number of emails the next month, in effect keeping the communication volume almost the same. But the set of people the person interacts with could change a whole lot in that same time frame which could be detected if one were to use the temporal network analysis approach. A combined approach (TSA+TNA) as proposed by Wan et al could be used to customfit the needs of different levels of abstraction of the data. For example, one data layer could be richly expressed as a set of time series (individual signals of relationships with each person and combined communication volume), and also as a dynamic network graph of people (vertices as people and edges representing connections between people) with whom a person has corresponded with over time. In the proposal stage for this thesis, I put forth a two-pronged approach (TSA+TNA) to tackle the problem of identifying significant events from email data. Given the time constraints for the thesis and taking into account overall goals of the project, my committee advised me to focus on one approach. Based on my literature review and execution timeline, I decided to prioritize the TSA approach and integrate it with a TNA approach in the future when time permits. The technique used in this thesis for detecting events comes from the Anomaly and Event Detection family of methods. Now is a good time to explain this set of techniques in more detail since it is important to understand the difference between 47 Anomaly Detection and Event Detection, and why I choose the latter in the context of working with email data. The logic behind event detection from a time-series is to find time intervals within the time-series that deviate from the baseline signal. This is then followed by calculation of statistical significance of each time-interval in the context of the parameter we wish to observe and create a model for. 
Separating significant time-intervals' characteristics from the baseline signal requires modeling the noise present in the time-series. Modeling of the noise is usually done by randomization testing, in which a time-series is randomly reshuffled many times and the original interval is compared to the most statistically deviant interval found in each shuffle (Neill, Moore, Pereira, & Mitchell, 2005). However, performing this randomization is a computationally intensive process, and for most practical purposes (such as in a web-based system like Immersion) it is not a viable route.

In the case of email communication, noise is ever-present in the constituent time-series. Within a given time-series the noise need not necessarily follow any periodic pattern, and for a set of time-series, there need not be a common noise profile among them either. So there is a need to take an approach that is independent of the noise characteristics. This is where the distinction between Anomaly Detection and Event Detection becomes relevant. Anomaly Detection is more suitable for periodic series with less noise (Keogh, Lin, & Fu, 2005). Event Detection, on the other hand, helps to find sub-intervals that are most statistically different from the underlying noise.

A paper by Preston, Protopapas, & Brodley (2009) provides the foundation for the Event Detection technique used in this thesis, since the problem they were trying to solve - detecting microlensing events from noisy large-scale data - closely mirrors the goal of event detection that this thesis proposes in the context of email data.

Preston et al.'s method, which was motivated by the field of astronomy where time-series analysis is very prevalent, is able to detect microlensing events from light curve time-series associated with exoplanets under observation. Microlensing events (an example is shown in Figure 23) occur when a large object (usually a heavenly body such as a planet, moon, etc.) passes in front of a light source (such as a sun, or reflected light from a planet) and acts as a gravitational lens, thus increasing the magnitude of light being observed. Their method was successfully tested with a large-scale dataset pertaining to Massive Astrophysical Compact Halo Objects (MACHO).

[Figure 23: Example of a microlensing event in a light curve from the MACHO dataset]

Each light curve has different noise characteristics, and there are millions of light curves to consider together as a set in order to produce reliable results (Preston, Protopapas, & Brodley, 2009). Interestingly, this problem space is quite similar to that of the time-series present in email communication data, where the noise contained in the time-series representing communication volume with person A is different from that of the time-series associated with person B. So how can we detect events from a time-series while also navigating the unique noise characteristics present in each series? The answer lies in the design of the Event Detection pipeline proposed in this thesis, which is detailed in the next section.

A Quick Overview of the Technique

This section explains the technique I've used for detecting events in email data, the data processing pipeline it necessitates and the challenges faced during its implementation as a feature within Immersion.
The technique relies on a scan-statistic method that uses sliding windows (adjacent, equally-spaced intervals within a time-series) to characterize sub-intervals of a given time-series. Independence from the underlying noise is achieved by converting the time-series to rank-space -- a discrete series of N points with each interval assigned a rank between 1 (lowest value) and N (highest value). This yields a uniform distribution of the dependent variable being observed -- in our case, the communication volume -- which allows us to form a probability model that does not have to account for the noise, since the distribution is known to be uniform. Once we obtain the probability model for each time-series we are interested in, it can be used to find the p-value of each interval within that time-series. Making this work for variable window sizes (1 week, 2 weeks, 1 month, 3 months, etc.) is a computationally intensive task, and so an optimization technique needs to be used to approximate model solutions for each window size without sacrificing much accuracy in the p-values.

A Deep-Dive into the Event Detection Pipeline

Framing the aforementioned scan-statistic technique in the context of the email communication dataset required some changes to the original algorithm proposed by (Preston, Protopapas, & Brodley, 2009), and this entire process is detailed in this section. The cascaded series of tasks involved in the event detection pipeline of Immersion is as follows:

Step 1: Extract and normalize timestamps
Step 2: De-trend the input time-series
Step 3: Convert the time-series to rank-space
Step 4: Calculate rolling sums using a sliding window
Step 5: Determine the probability distribution using the Monte Carlo method
Step 6: Find significant p-value intervals
Step 7: Identify people & topics relevant during each event interval
Step 8: Consolidate overlapping event intervals
Step 9: Include obvious events based on individual interactions

This processing pipeline is used for every time-series of interest, which includes the evolution of the overall communication volume (sent + received emails) over time, as well as the individual time-series representing communication volume with each person of interest. The former helps to identify events at a macro scale (a big-picture perspective), and the latter is used to profile each detected macro-scale event in depth, to accurately identify the people in a person's email correspondence history who prominently contribute to that particular time interval's statistical significance.

STEP 1: EXTRACT AND NORMALIZE TIMESTAMPS

The technology stack of Immersion has been designed in such a way that timestamp data of emails is available on the client side (browser) through an in-memory database object. A time-series of the overall communication volume for a person is calculated by aggregating emails sent to and received from that person's contacts. The resulting time-series is devoid of irrelevant and uninformative timestamps due to spam, thanks to the filtering that Immersion performs prior to the visualization step (detailed in Part 1 of this thesis). The same process is also followed to obtain the time-series of communication volume with individual contacts of that person. It is important to note that each time-series can have a different length, and so our algorithm needs to adapt to time-series of varying lengths.
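As a minimal sketch of this first step, the snippet below bins a list of email timestamps (assumed here to be Unix timestamps in milliseconds) into a daily communication-volume series. In Immersion this work is split between the browser and the server; the function and variable names here are purely illustrative.

    import pandas as pd

    def daily_volume(timestamps_ms):
        # Bin raw email timestamps (ms since epoch) into counts per calendar day.
        # Emails falling on the same day are indistinguishable at this resolution.
        days = pd.to_datetime(pd.Series(timestamps_ms), unit="ms").dt.floor("D")
        counts = days.value_counts().sort_index()
        # Re-index onto a continuous daily range so days with no email appear as 0.
        full_range = pd.date_range(counts.index.min(), counts.index.max(), freq="D")
        return counts.reindex(full_range, fill_value=0)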
Converting each timestamp into a JavaScript Date object also allows for normalization of each timestamp into a common time zone -- that of the region in which the person is currently using Immersion. The Date() object in JavaScript is flexible enough to give timestamps at resolutions ranging from milliseconds to years. For the purpose of this thesis, the finest resolution used is 1 day, and no distinction is made between emails received in the morning of a day and those received later in the day, as long as they fall within the same 24-hour window. This array of timestamps is sent as input to the Immersion server, which is built on a Python framework; all the code for the following stages of the pipeline has been written in Python.

STEP 2: DE-TREND EACH TIME-SERIES

Time-series data of email communication volume usually exhibits a trend in which the volume of emails increases over time. This observation is expected, since we are online more often thanks to the proliferation of mobile devices, and since email has become a fundamental conduit of communication in the digital age. However, the presence of a trend in a time-series can distort or obscure the relationships or phenomena we are observing. In order to ensure stationarity of the process captured by the time-series, it is necessary to de-trend it (Meko, 2013). Identification of a trend in a time-series is subjective, and can sometimes be informed by knowledge of the real-world phenomena that influence the process. For the purpose of this thesis, I have assumed that email communication volume increases linearly over time, because this was a commonly observed pattern in the dataset of people who tried out the Immersion prototype. It is quite possible that this assumption does not hold for all users, and it will be necessary to model the trend more accurately in the future. One possible approach is to use a piecewise polynomial fit instead of a linear fit, but due to time constraints this is treated in the Future Work section of this thesis.

Figure 24 shows the overall communication volume for my email data between the years 2005 and 2014, and what that time-series looks like before and after de-trending. The red line indicates the least-squares-fit straight line used to de-trend the entire time-series. This is done for every time-series that is sent for processing through the event detection pipeline.

Figure 24: Overall communication volume time-series (A) before and (B) after linear de-trending.
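The snippet below is a minimal sketch of this linear de-trending step, assuming the daily volume series produced earlier; it is illustrative rather than the actual Immersion code.

    import numpy as np

    def detrend_linear(volume):
        # Fit a least-squares straight line to the series and subtract it,
        # leaving only the fluctuations around the assumed linear trend.
        t = np.arange(len(volume))
        slope, intercept = np.polyfit(t, volume.values, 1)
        trend = slope * t + intercept
        return volume - trend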
STEP 3: CONVERT TIME-SERIES TO RANK-SPACE

In order to extricate our process from its dependence on the underlying noise, it is necessary to convert the time-series into a uniform distribution. This is achieved by converting the value in each interval of a time-series into a rank value: the ranking that the interval would have compared to the values in the other intervals of the same time-series. The range of ranks extends from 1 to the length of the time-series N, with rank 1 corresponding to the lower end of the set of values and N representing the interval(s) with the highest value. This raises the possibility that some intervals could have the same original value, and hence a decision needs to be made as to how to rank them. The heuristic used in our technique is to take the average of the ranks that each such interval would be assigned if it were not sharing a value with other intervals. In the case of email communication volume, each interval represents the number of emails sent and received within a time interval, and this can vary from one interval to the next. Moreover, it is not possible to restrict the original values to any range (apart from the minimum value, which is always 0). This step yields a uniform distribution of values ranging from 1 to N, on which we can perform statistical operations without having to worry about modeling the underlying noise. It also means that for each pair combination of sliding window w and time-series length N, there exists a distinct probability distribution of the sums of ranks within each sliding-window interval. The technique also makes the assumption that sums in the outer tails of the probability distribution are more likely to map to 'significant' events, since the probability of obtaining those sums is lower than that of the sums appearing in the mid-section of the distribution.

STEP 4: CALCULATE ROLLING SUMS USING A SLIDING WINDOW

In order to prepare the time-series for calculating the p-value associated with each interval, it is now necessary to use a sliding window and calculate the sums for each interval in the time-series. I have used the rolling-sum function in pandas (a scientific toolkit for Python), with the sliding window size pre-defined as 14 days. This number was chosen based on trial and error and on the real-life observation that a significant life event is more likely to be expressed in email communication over the duration of two weeks than over just one day or one month. In the latter case, this technique would result in two consecutive sliding-window time-intervals being deemed statistically deviant from the rest of the time-series; that multiplicity is resolved towards the end of the processing pipeline, when both are consolidated into a single significant event. If the sliding window size were as small as one day, the resulting time-series would not encode any new information, since the highest resolution we have used to bin emails into a time-series is the 24-hour period.
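A minimal sketch of Steps 3 and 4, using the present-day pandas rolling API (the thesis implementation used the rolling-sum helper available in pandas at the time); names are illustrative:

    import pandas as pd

    WINDOW = 14  # days, as chosen above

    def windowed_rank_sums(detrended):
        # Step 3: rank-transform the de-trended series; tied values receive the
        # average of the ranks they would otherwise occupy.
        ranks = detrended.rank(method="average")
        # Step 4: sum the ranks inside every 14-day sliding window. The first
        # WINDOW - 1 positions have no complete window and are dropped.
        return ranks.rolling(window=WINDOW).sum().dropna()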
STEP 5: DETERMINE PROBABILITY DISTRIBUTION USING MONTE CARLO

Given our rank-transformed time-series TR, we now need to find the statistical significance of a particular sum over an interval. In other words, we need to know the distribution of sums for a given (w, N), where w is the size of the sliding window and N is the length of the time-series. There are a couple of ways to do this:

1. An analytic method involving a combinatorial problem that yields exact probabilities for all possible sums. However, it is deeply recursive and so not viable for a practical implementation (Preston, Protopapas, & Brodley, 2009).

2. Random sampling over the possible sums using the Monte Carlo method. This approach is far less compute- and memory-intensive, and yields an approximate probability distribution curve.

Keeping in mind that the goal of this thesis is a scalable and practical solution usable on a contemporary web application's infrastructure, I chose the second approach. Applying the Monte Carlo method to our time-series context, w unique random numbers from 1 to N are repeatedly selected. These numbers are summed, and the number of times each sum θ is obtained during our trials is counted and denoted by n_θ. Clearly, the more trials we are able to perform, the closer the distribution curve (of the frequencies of the sums) will be to its analytical equivalent. The minimum number of trials needed can be tuned based on the p-value threshold α, where events with p-value < α are considered statistically significant. The expected accuracy -- in other words, the error margin of the frequency of any sum θ -- is given by:

ε ≈ 1 / √(n_θ)

In order to ensure accuracy ε for a p-value threshold α, the minimum number of samples Λ for a sum θ is given by:

Λ ≥ 1 / (α ε²)

ε and α are statistically motivated and are to be pre-defined. This involves a trial-and-error approach, checking results for different combinations, and I found the minimum acceptable number to be 1,000,000 samples. However, when the distributions resulting from fewer than 1,000,000 trials were plotted (Figure 26), it was clear that they were very close to a normal distribution, and that the mean and standard deviation of each distribution (for different trial counts) were in the same vicinity.

Figure 26: Distributions of sliding-window rank sums obtained from 100, 1,000 and 100,000 Monte Carlo trials; the means and standard deviations remain in the same vicinity across trial counts.

This means that we can make do with a probability distribution built from a far smaller trial count, by taking the mean and standard deviation for a combination of (w, N) and plugging them into the equation for a normal distribution. If x is a data point in the series, μ the mean and σ the standard deviation, the normal distribution for that combination is defined as:

f(x; μ, σ) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))

This simple optimization effectively reduces the event detection pipeline's execution time by almost 10 seconds on a regular laptop. Such a performance increase is most welcome, especially in the case of an interactive web application like Immersion. In order to make the technique versatile enough to accommodate arbitrary sliding-window lengths w, it is possible to pre-compute many different combinations of (w, N) and store them in a database for faster access in the future. This has not been implemented for the current prototype, however, since we are operating under the assumption that a window size of 14 days is sufficient to detect significant events of interest. The 'learned probability distribution' approach is left for future work, when the event detection technique in this thesis will be added to the next version of Immersion's public release.

STEP 6: IDENTIFY SIGNIFICANT P-VALUE INTERVALS

This is a straightforward step in which intervals of statistical significance -- those with an associated p-value < α -- are filtered and returned as the result. In the context of the implementation of this technique, this step yields an array of objects describing the starting date and p-value of each such interval. This array of objects is returned to the client, which then processes these intervals (using JavaScript) and sorts them by p-value to obtain a ranked list of significant events. Another set of values returned from the server to the client includes the mean μ and standard deviation σ for the (w, N) combination associated with the input time-series of length N.
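To make Steps 5 and 6 concrete, here is a minimal sketch of the Monte Carlo estimation and the p-value filtering, applied to the rank sums produced earlier. The function names, the two-sided test and the default α are illustrative assumptions, not the production Immersion code:

    import numpy as np
    from scipy.stats import norm

    def rank_sum_stats(w, N, trials=100_000, seed=0):
        # Step 5: approximate the distribution of sums of w distinct ranks drawn
        # from 1..N by Monte Carlo sampling, keeping only its mean and standard
        # deviation and relying on the observed near-normality of the distribution.
        rng = np.random.default_rng(seed)
        ranks = np.arange(1, N + 1)
        sums = np.array([rng.choice(ranks, size=w, replace=False).sum()
                         for _ in range(trials)])
        return sums.mean(), sums.std()

    def significant_intervals(rolling_sums, w, N, alpha=1e-7):
        # Step 6: p-value of every windowed rank sum under the fitted normal
        # approximation (two-sided, so both unusually high and unusually low
        # activity can surface), keeping intervals with p < alpha.
        mu, sigma = rank_sum_stats(w, N)
        pvals = 2 * norm.sf(np.abs((rolling_sums - mu) / sigma))
        hits = [(idx, p) for idx, p in zip(rolling_sums.index, pvals) if p < alpha]
        return sorted(hits, key=lambda item: item[1]), mu, sigma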
Using the mean and standard deviation, it is possible to determine whether the significance of an event interval is due to a higher level of email correspondence during that period or due to a dip in activity. For the former kind of event, it is necessary to correctly estimate the contributing factors (people and topics), whereas the latter kind can only be reported to the user as is, due to the unavailability of further related data points for annotation.

STEP 7: IDENTIFY PEOPLE & TOPICS ASSOCIATED WITH EACH EVENT INTERVAL

The in-memory database object in the browser contains helper functions that return the top contacts and top topics for any given time-interval. These functions are called for every significant time interval (now considered a significant event) to calculate the related people and topics. For each contact returned in the previous step, their individual email communication volume time-series is sent to the server for detection of micro-events of significance, using the same event detection pipeline. As expected, the server returns a set of significant events related to each contact, along with the associated mean and standard deviation for the (w, N) combination of that contact's time-series. Armed with both the macro-level events detected from the overall communication volume time-series and the micro-level events for each top contact, it is now possible to estimate which contacts really made a given event interval significant. This is done by computing the z-score associated with each contact for each significant event interval; if the z-score for a contact is above a threshold (estimated through trial and error), that individual is assumed to have played a significant contributing role in the importance of that event interval. The z-score is given by the following formula:

z = (x − μ) / σ

Another helper function associated with the in-memory database in the browser returns the list of clusters of people the user has corresponded with in any given time-interval. For any significant interval, this adds a higher level of information that can be provided to the user, ascribing the significance of the event to specific groups of people rather than just individual contacts. It also helps with finer filtering of relevant topics for a given combination of time interval and set of contacts.

STEP 8: CONSOLIDATE OVERLAPPING EVENT INTERVALS

It is possible for the output of the previous stage to include event intervals that are very close to each other. In order to avoid duplicate representations of the same event, the characteristics of adjacent events are compared with each other and a distance metric is computed. If there is much in common (for example, if the people and topics characterizing the intervals under consideration are the same), the two event intervals are assumed to describe the same event and are consolidated into a single event with an extended time-frame encompassing both intervals. This is a brute-force comparison performed between each pair of adjacent event intervals, and it has been observed to be computationally inexpensive when the number of significant events is below 100.
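A minimal sketch of the contact attribution (Step 7) and the consolidation of adjacent events (Step 8) follows. The data shapes, the z-score threshold and the overlap measure are illustrative assumptions rather than the exact Immersion logic:

    def contributing_contacts(contact_stats, z_threshold=2.0):
        # Step 7: a contact contributes to an event if their windowed volume x
        # during the event deviates strongly from their own Monte Carlo baseline
        # (mu, sigma). contact_stats maps a contact name to the tuple (x, mu, sigma).
        return [name for name, (x, mu, sigma) in contact_stats.items()
                if sigma > 0 and (x - mu) / sigma > z_threshold]

    def consolidate(events, max_gap_days=7, min_overlap=0.5):
        # Step 8: merge neighbouring events whose people mostly overlap and whose
        # intervals are close in time. 'events' is sorted by start date; each event
        # is a dict with 'start' and 'end' dates plus 'people' and 'topics' sets.
        merged = []
        for event in events:
            if merged:
                prev = merged[-1]
                close = (event["start"] - prev["end"]).days <= max_gap_days
                union = prev["people"] | event["people"]
                shared = len(prev["people"] & event["people"]) / max(len(union), 1)
                if close and shared >= min_overlap:
                    prev["end"] = max(prev["end"], event["end"])  # extend time-frame
                    prev["people"] |= event["people"]
                    prev["topics"] |= event["topics"]
                    continue
            merged.append({**event, "people": set(event["people"]),
                           "topics": set(event["topics"])})
        return merged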
STEP 9: INCLUDE OBVIOUS EVENTS FROM INDIVIDUAL INTERACTIONS

Relying only on the macro-events detected from the overall communication volume time-series would mean that we risk losing relevant micro-events, such as the introduction to a contact (the date of the first email) who then went on to become a long-time email collaborator. Similarly, a significant dip in interaction volume with any contact can also signal an event of importance in a person's life. By processing the individual communication volume time-series associated with every relevant contact, we are able to detect micro-events that add more granularity to the set of significant events detected from a person's email dataset.

Results

Developer View

In the process of developing the Event Detection pipeline, it was necessary to build a user interface that supported authenticated user login, adjustment of the p-value threshold α, and a visualization of the resulting dataset of detected events. Figure 27 shows the developer view UI, which proved to be quite useful in debugging and streamlining the results of the Event Detection pipeline based on email data provided by volunteers.

Figure 27: Developer view showing the results of event detection overlaid on a timeline (here a p-value threshold below 0.0000001 yields 430 events, reduced to 44 relevant events after consolidation).

The slider at the top allows the developer to change the value of α. The pipeline returns more events for a higher value of α, and the significant events detected are overlaid at the top of the horizontal timeline as colored ellipses. A red ellipse denotes an event due to low email activity, and a green ellipse denotes an event detected due to high email activity. Hovering over an ellipse reveals metadata about the event, such as its associated p-value, its rank relative to all other detected events, the people and topics associated with it, and its distance from the mean of the distribution obtained from Monte Carlo sampling. The line graph denotes the rolling sum obtained from the Event Detection pipeline for the time-series corresponding to the overall email volume.

Packaging the Event Detection Technique

STORYLINE FEATURE

The event detection pipeline results in a filtered, ordered list of significant event intervals and their associated people and topics. In the context of Immersion, this result is packaged as a new feature called Storyline, which functions as an interactive timeline of a person's email life.

Figure 28: The Storyline feature, showing story-cards for each detected event pinned to a scrollable timeline, with a yearly summary of contacts and significant events at the top.

As the screenshot of the application shows (Figure 28), events are pinned to a vertical scrollable timeline, and a verbal story is constructed around each event's parameters using predefined sentence constructs. Each story-card can be flipped to reveal more information about that event, including a direct link to the emails that meet the filter criteria (people and topics) for that story's time interval.

Figure 29: A story-card for a high-activity event, flipped to reveal the associated collaborators, topics and hyperlinked email subjects.
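As a small illustration of how a story is assembled from an event's parameters using a predefined sentence construct, consider the sketch below; the template wording and the filled-in values are hypothetical, loosely modeled on the card in Figure 29:

    # A hypothetical sentence construct of the kind Storyline uses (illustrative only).
    TEMPLATE = ("There was a {change:.0%} {direction} in your email activity during "
                "this time. Your main collaborators were {people}, and the top topics "
                "of your conversations included {topics}.")

    card_text = TEMPLATE.format(change=0.43, direction="increase",
                                people="Daniel Smilkov and Cesar Hidalgo",
                                topics="Immersion and version 2.0")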
Story-cards associated with events detected due to high email activity appear on the right side of the timeline, whereas story-cards for events due to low email activity appear on the left. This facilitates at-a-glance reading if a person wants to find out whether a year was particularly high or low in terms of their email activity. The story-card shown above (Figure 29) is from the Storyline based on my email data, and it has accurately detected an event associated with high activity - the launch of Immersion on June 30, 2013 - when I exchanged a lot of emails with my colleague Daniel Smilkov and my advisor César Hidalgo. The story-card is flipped to show the hyperlinked subjects of the emails we exchanged during that time period.

In the case of significant events detected due to low email communication volume, it is not very meaningful to provide contextual information such as associated people and topics. So a simple story-card notifying the user of this low-activity period is provided, along with an option to select a reason for the inactivity. Such evaluative inputs from the user can be used for categorizing other low-activity events for the same user, or across many different users, using a machine learning method once enough data has been gathered to form a training dataset. The implementation of such a learning feature is outside the scope of this thesis and is planned as a future feature.

There is also a macro-level timeline on the left side (Figure 29) that automatically highlights the story centered on the page, which the user can use to jump between stories that are temporally far apart. This view can be filtered to show only the events of one year, or to compress all years' events together into the same timeline.

Figure 31: The macro-level timeline used to jump between temporally distant stories.

ACCESS THROUGH A REST API

Apart from its integration into Immersion, the event detection pipeline is also available as a general-purpose API: people can make a RESTful POST request to a server, as long as the input data is in the prescribed format -- a JSON representation of an array of timestamps. The pipeline also supports ascribing 'weights' to each timestamp, which are taken into account when binning events at the 1-day level of resolution.
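As a hedged illustration of such a request: the endpoint URL and field names below are hypothetical (the thesis only prescribes the input shape, an array of timestamps with optional weights), so this should be read purely as a sketch.

    import json
    import urllib.request

    payload = {
        "timestamps": [1372550400000, 1372636800000, 1372723200000],  # ms since epoch
        "weights": [1, 3, 1],  # optional per-timestamp weights
    }
    request = urllib.request.Request(
        "https://example.org/api/event-detection",      # hypothetical endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        detected_events = json.loads(response.read())   # e.g. intervals with p-values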
Future Work

Revamping the Immersion platform

The upcoming version of Immersion introduces a host of new UI features, including the Storyline feature, that have been informed by feedback obtained from people who used the first version. Apart from new visualizations, we are also working on a complete overhaul of its technical stack, based on the lessons learnt from the previous release. One of the major additions from a data perspective is that we now also fetch data from the Subject field of emails. This opens up a host of new possibilities for data mining and visualization, from language analysis to a simpler frequency analysis of topics.

Figure 32: Work-in-progress screenshot of the new Network view, with filters for topics and collaborators.

Figure 33: New design of the Person view, showing conversation statistics, topics and related contacts.

At the heart of the new version is a filter-based UI and data-querying mechanism that enables people to quickly retrieve and visualize results matching a subset of input fields such as time range, people and topics. This feature has been engineered to achieve an extremely reactive UI, where a user can simply hover over a topic or a collaborator and see all associated visualizations update immediately to reflect the data filtered by the hovered item. Figures 32 and 33 show work-in-progress screenshots of the Network view and the Person view, both powered by the new filter-based querying engine. Keeping in mind that we might want to open-source parts of the project in the future, code readability and modularity have been top priorities for the new version of Immersion. For this reason, we shifted the client-side codebase from JavaScript to CoffeeScript, since the latter is well known to be developer-friendly. The entire technical stack of Immersion is now easily deployable by anyone with a basic understanding of launching processes in a UNIX environment.

Optimizing the Event Detection pipeline

There are multiple ways to improve the efficiency and accuracy of the Event Detection pipeline used in Immersion to detect significant events from email data. In its current form, the pipeline supports time-series of any length, but re-computes the mean and standard deviation for every time-series. Assuming that we are working with a fixed window size w, this process could be greatly optimized by pre-computing the mean and standard deviation for different combinations of (w, N), where N is the length of the time-series, and storing these values in a database for faster access in the future. One potential challenge with this approach is that it is difficult to predict the length of an input time-series. However, this can be overcome by chopping a long time-series of length Y into smaller chunks, each of length at most N. Each of these chunks can then be passed through the Event Detection pipeline individually, and their results (the significant events detected) can be combined at the end. It is important to note that the window size is critical here: we want the individual chunks to overlap by the window size, so as not to miss significant events around the points where the larger parent time-series was chopped.
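A minimal sketch of this chunking scheme follows; the function name and the exact overlap policy are illustrative assumptions consistent with the description above.

    def overlapping_chunks(series, max_len, window):
        # Split a long series into pieces of at most max_len points, with
        # consecutive pieces overlapping by the sliding-window size so that an
        # event straddling a cut point is still fully contained in one chunk.
        # Assumes max_len > window.
        chunks, start = [], 0
        while start < len(series):
            end = min(start + max_len, len(series))
            chunks.append(series[start:end])
            if end == len(series):
                break
            start = end - window
        return chunks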
In addition to this, the accuracy of calculating the people and topics associated with an event could be vastly improved if a TNA-based approach were also incorporated, as suggested by Wan et al (Wan, Milios, Janson, & Kalyaniwalla, 2009). This would allow for the detection of changes in network composition and structure even when the volume of emails exchanged remains the same across consecutive time windows.

Automatic Classification of Detected Events

In the current implementation, the Storyline feature detects significant events and is able to identify the associated people and topics. However, it is not able to predict the nature of the event. For example, some events could be ascribed to being on vacation, the release of a new project, the planning of an event, and so on. The current interface allows people to input a category for each event. Once we have a large enough corpus of category data associated with events, it can be used to train a machine learning algorithm to predict the nature of events without having to ask the user to fill that in. Of course, it is certainly a good idea to give the user control to point out incorrectly categorized events, since a 100% success rate in predicting categories is difficult to achieve with current computing infrastructure. But it would still add an element of predictive surprise to the platform that people could potentially enjoy.

Connecting Storylines Across Users

The Storyline feature can detect the people associated with each event for any given user. A future version could support conversations centered on each event between the people who are associated with that event. Consequently, instead of a single person reflecting upon their email history alone, this would give people the ability to collectively reflect on past events. A conversation could be between friends revisiting their memory of a favorite camping trip, or between colleagues reminiscing about a project from their past. This is admittedly one of my favorite future features for Immersion, since it has the potential to transform the platform from one that allows people to view the implicit social network created through their emails into a live social network in itself, pre-populated with people's email data.

CONCLUSION

We are at a point in time where the devices and services we use on a daily basis leave rich digital footprints from which much can be learnt about ourselves and the communities we interact with. Through the project Immersion, this thesis presents a way for people to learn about themselves and the communities they are part of based on their email history. Designed as a tool for self-reflection, Immersion has served over a million users to date, and the author hopes that it continues to provide more people with new perspectives on their email life. Moreover, the technical foundation proposed in this thesis for event detection from email data will hopefully inspire and spawn rich storytelling and biography-authoring platforms that bring people's stories to life. Democratizing the process of archiving and sharing the stories of our lives will help us look back on our pasts, and hopefully allow those who come after us to have a better understanding of what we did during our time here.

BIBLIOGRAPHY

Cohen, W. (2009). Enron Email Dataset. Retrieved July 2014, from https://www.cs.cmu.edu/~enron/

Esling, P., & Agon, C. (2012). Time-Series Data Mining. ACM Computing Surveys, 45.

Fisher, D., & Dourish, P. (2004). Social and Temporal Structures in Everyday Collaboration. CHI. ACM.

Gershenfeld, N., & Weigend, A. (1993). The Future of Time Series. Addison-Wesley.

Goldstein, J. (2013, July 1). An MIT Project That Lets You Spy On Yourself.
Retrieved from NPR: Planet Money: http://www.npr.org/blogs/money/2013/07/01/197632066/an-mit-project-that-lets-you-spy-on-yourself

Groden, C. (2013, July 5). This MIT Website Tracks Your Digital Footprint Through Gmail. Retrieved July 2014, from Time magazine: http://newsfeed.time.com/2013/07/05/this-mit-website-tracks-your-digital-footprint-through-gmail/

Holme, P., & Saramaki, J. (2013). Temporal Networks. Springer.

Keogh, E., Lin, J., & Fu, A. (2005). HOT SAX: efficiently finding the most unusual time series subsequence. International Conference on Data Mining. IEEE.

Macro Connections. (2014). Pantheon - Methods. (MIT Media Lab) Retrieved July 2014, from Pantheon - Mapping Historical Cultural Production: http://pantheon.media.mit.edu/methods

Meko, D. (2013). Detrending. Retrieved July 2014, from http://www.ltrr.arizona.edu/~dmeko/notes_7.pdf

Neill, D., Moore, A., Pereira, F., & Mitchell, T. (2005). Detecting significant multidimensional spatial clusters. Advances in Neural Information Processing Systems.

Preston, D., Protopapas, P., & Brodley, C. (2009). Event Discovery in Time Series. SIAM International Conference on Data Mining.

Ratanamahatana, C. (2010). Mining Time Series Data. In Data Mining and Knowledge Discovery Handbook (2nd ed.). Springer Science+Business Media.

Smilkov, D. (2014). Understanding email communication patterns. Master's thesis, Massachusetts Institute of Technology.

Viégas, F. (2005). Mountain. Retrieved June 2014, from Mountain: http://alumni.media.mit.edu/~fviegas/projects/mountain/index.htm

Viégas, F., Boyd, D., Nguyen, D., Potter, J., & Donath, J. (2004). Digital Artifacts for Remembering and Storytelling: PostHistory and Social Network Fragments. 37th Hawaii International Conference on System Sciences. IEEE.

Viégas, F., Golder, S., & Donath, J. (2006). Visualizing Email Content: Portraying Relationships from Conversational Histories. CHI.

Vincent, J. (2013, July 8). MIT's 'Immersion' project reveals the power of metadata. Retrieved July 2014, from The Independent: http://www.independent.co.uk/life-style/gadgets-and-tech/mits-immersion-project-reveals-the-power-of-metadata-8695195.html

Wan, X., Milios, E., Janson, J., & Kalyaniwalla, N. (2009). Link-based Event Detection in Email Communication Networks. SAC. Honolulu: ACM.

WikiLeaks. (2013, July 5). WikiLeaks - Twitter feed. Retrieved July 2014, from https://twitter.com/wikileaks/status/353287879604174848