
IMMERSION: A Platform for Visualization and Temporal
Analysis of Email Data
Deepak Jagdish
M.S. Human-Computer Interaction, 2010
Georgia Institute of Technology
B.Tech. in Information & Communication Technology, 2007
Dhirubhai Ambani Institute of Information and Communication Technology
Submitted to the
Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the Massachusetts Institute of Technology
September 2014
Massachusetts Institute of Technology, 2014. All rights reserved.
Author: Signature redacted
Deepak Jagdish
Program in Media Arts and Sciences
Certified By: Signature redacted
Dr. Cesar A. Hidalgo
Assistant Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Accepted By: Signature redacted
Dr. Pattie Maes
Interim Academic Head
Program in Media Arts and Sciences
IMMERSION: A Platform for Visualization and Temporal
Analysis of Email Data
Deepak Jagdish
Submitted to the
Program in Media Arts and Sciences,
School of Architecture and Planning,
on 8 August, 2014
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the Massachusetts Institute of Technology
Abstract
Visual narratives of our lives enable us to reflect upon our past relationships, collaborations and significant life events. Additionally, they can also serve as digital archives, thus making it possible for others to access, learn from and reflect upon our life's trajectory long after we are gone. In this thesis, I propose and develop a web-based platform called Immersion, which reveals the network of relationships woven by a person over time and also the significant events in their life. Using only metadata from a person's email history, Immersion creates a visual account of their life that they can interactively explore for self-reflection or share with others as a digital archive.
In the first part of this thesis, I discuss the design, technical and privacy aspects of
Immersion, lessons learnt from its large-scale deployment and the reactions it elicited
from people. In the second part of this thesis, I focus on the technical anatomy of a new
feature of Immersion called Storyline - an interactive timeline of significant life events
detected from a person's email metadata. This feature is inspired by feedback obtained
from people after the initial launch of the platform.
Thesis Supervisor
Dr. Cesar A. Hidalgo
Assistant Professor in Media Arts and Sciences
Program in Media Arts and Sciences
IMMERSION: A Platform for Visualization and Temporal
Analysis of Email Data
Deepak Jagdish
Thesis Supervisor: Signature redacted
Dr. Cesar A. Hidalgo
ABC Career Development Professor of Media Arts and Sciences
Massachusetts Institute of Technology
IMMERSION: A Platform for Visualization and Temporal
Analysis of Email Data
Deepak Jagdish
Thesis Reader: Signature redacted
Dr. Sepandar Kamvar
LG Career Development Professor of Media Arts and Sciences
Massachusetts Institute of Technology
IMMERSION: A Platform for Visualization and Temporal
Analysis of Email Data
Deepak Jagdish
Thesis Reader: Signature redacted
Dr. Fernanda B. Viégas
Director, 'Big Picture' Visualization Group
Google
Acknowledgments
This thesis is dedicated to my family, especially to those loved ones who have already
departed, whose life stories I wish could have been recorded in more detail.
The ideas and work presented in this thesis are the result of innumerable conversations,
email exchanges and data experiments with my advisor Cesar Hidalgo, who has been a
treasure-trove of scientific advice and a paragon of work ethic that I have strived to
learn from for the past two years. I am deeply grateful to him for all his help.
Immersion would not have been possible without the wonderful collaboration I shared
with my colleague Daniel Smilkov. I have had the chance to learn much about his
domain of expertise and hope to work with him again in the future.
Special thanks to my thesis readers Sep Kamvar and Fernanda Viegas, who apart from
providing advisory feedback, have also inspired ideas in this thesis through their past
projects. I would also like to extend my gratitude to Ethan Zuckerman for his inputs
right from the early days of Immersion.
My friends at the lab - including colleagues from the Macro Connections group, the
Funfetti band and extended connections in other research groups - have been
monumental in their support during my thesis writing stage. I'm thankful to them for
keeping me grounded during the tough times and for the good memories. I'm
especially indebted to my friend and fellow thesis writer Dhairya Dand for keeping me
company during the final stages of this thesis.
My thanks also to the Media Lab community for making every day at the lab such a
pleasure, for being willing volunteers to test out Immersion, and for spreading the word
about the project.
Last but not the least, my heartfelt thanks to every person who has used Immersion -
we asked you to take a leap of faith with your personal data, and you did. Your
feedback has been invaluable, both from a personal and professional perspective. I
hope you have enjoyed your Immersion experience.
Table of Contents
INTRODUCTION 8
Understanding the problem space 9
Motivation 11
Introducing a new perspective
Crystallizing our life's stories
Proposed solution 13
Thesis structure 14
PART 1: IMMERSION 16
Prior work 17
Designing Immersion 19
Engineering Immersion 27
Launch versions 33
Impact 37
PART 2: EVENT DETECTION 41
Significance of the temporal dimension 42
Prior work 44
A quick overview of the technique 51
A deep-dive into the Event Detection pipeline 52
Results 63
Future Work 67
CONCLUSION 72
BIBLIOGRAPHY 74
INTRODUCTION
Understanding the problem space
Networks of people such as families, friends, teams and firms are social structures that
mediate the flow of information among people, while providing social support and
depositing trust. Such networks leave rich digital trails of data as a by-product of our
usage of new media technologies such as email and social media. These digital trails in
turn have immense potential in terms of information we can unearth about ourselves
and the networks that we are part of. However, revealing useful information embedded
in the underlying datasets requires specialized tools that are designed to adapt to the
ontology of each dataset.
Datasets spawned by our use of technological tools include (but are not limited to): our
email history, banking transactions, texting and phone call logs, GPS destination
requests, location check-ins, social media postings, web browsing history, and so on.
Each of these datasets has an ontology that is different from the others due to
fundamental differences in the nature of data being recorded.
For example, an email dataset contains data fields such as the people a person has
exchanged emails with, content of their conversations, and the timestamp of each
email. On the other hand, a dataset of GPS request logs contains data fields such as the
latitude and longitude of destinations searched for, user ID associated with each
request, device from which each request originated, and the timestamps of each
request. With the exception of timestamps, no other data fields are common between
these two datasets. Hence, a one-size-fits-all approach would not be ideal if our goal is
to build systems that can maximize the amount of information that can be extracted
from each dataset.
This thesis focuses on mining a specific kind of dataset - email metadata - involving
networks of people. This is done in the context of designing and developing a web-
based platform called Immersion, which creates a visual narrative of a person's life as
recorded in their email metadata.
It is worthwhile to note that metadata here includes only some data fields from the
whole email data corpus, namely the To, From, Cc and Timestamp fields. The decision
to use only metadata is made for two reasons. One reason is to respect the privacy of
the user by accessing only the data fields that are needed to support the narrative that
Immersion aims to create. Secondly, accessing and processing all of the data fields
associated with an email, such as the body content and attachments if any, takes much
longer and makes the underlying system architecture more complex than necessary.
Motivation
This thesis is born out of a multiplicity of motivations, each fueling a different aspect of
novelty that Immersion introduces to the domain of experiencing and exploring
personal metadata.
Introducing a New Perspective
Mainstream email platforms more often than not rely on representing email data using a
time-ordered list of messages one after the other regardless of whom the emails are
exchanged with. A quick visual survey of the popular email platforms over the past few
decades reveals that the basic visual structure has remained the same.
As a specialized tool designed to make sense of email data, Immersion is able to ask
specific questions that can reveal much more than what we are currently able to learn
from our email history. By using the people we exchange emails with as the primary
anchor for questions, more abstract parameters such as conversation groups, conversation topics, introduction trees, evolution of relationships and so on can be calculated. The results of such calculations are then carefully crafted into interactive visualizations, thus enabling people to see and explore more of their own email dataset than was previously possible.
Crystallizing Our Life's Stories
Biographies are a common way to spread information about the life of a person. Some
famous individuals are capable enough of writing their own biographies, whereas other
famous individuals are inspiring or controversial enough to have others write about
them. And then there are other influential people who employ ghostwriters to have
their stories told.
Regardless of which of these routes leads to the making of a biography, it is perplexing
to note that only an infinitesimally small fraction of human population across time has
had the luxury or capability to have their stories narrated and archived. The advent of
democratic forms of new media such as wikis have improved this fraction, but only
marginally. According to the most recent count from the dataset released by the Pantheon project, about 997,276 biographies of people exist on Wikipedia (Macro Connections, 2014). Even though this is a significant improvement
from before the emergence of Wikipedia, comparing it to the number of people who
have lived across many centuries reveals that we've barely scratched the surface.
One of the primary reasons not every person publishes an account of their life, apart
from the lack of public visibility due to not being famous, is because there is a massive
gap between the desire to publish and capability to actually do it. This gap arises
specifically because we lack the tools to easily crystallize our life's events into a more
concrete shareable format like a book or a website.
Proposed Solution
Immersion is a publicly available platform that people can use with their own email
data to experience new perspectives of their email life through interactive visualizations.
Even though previous projects have addressed the visualization aspect of this,
Immersion's goal is also to scale this approach and be able to support hundreds of
thousands of users, which presents complex technical and design challenges.
Moreover, Immersion aims to bridge the gap between having the desire to publish
one's life story and the capability to actually do it, by providing a platform that
automatically detects significant events in the timeline of a person's life using their
email metadata, and annotates each detected event with the people and topics related
to it. It is important to note that Immersion does not create an entire biography of a
person. That task requires more work than just creating a skeletal view about a person's
life. However, what Immersion does is make the first step towards a personal
life account easier, by automating a part of the process. It achieves this by automatically
creating a skeletal framework of connected significant events in a person's life, which
they can then later expand into a format they prefer, such as a book or a movie. The
transformation from an outline of events to a richer and longer format is outside the
scope of this thesis, and would require tools specifically designed to solve that
problem. Immersion aspires to motivate future projects that can build on the skeletal
event framework that it provides.
Thesis Structure
Beyond the introduction, this thesis is broadly divided into two parts. The first part
includes a background study and detailed explanation of all the features included in the
proposed solution, Immersion. The second part focuses on the implementation details
of a specific new feature - detection of significant life events - that is being introduced
in the next version of Immersion.
In the first part, after reviewing related work that exists in the domain of email data
visualization, I explain the design and implementation of each high-level feature that
was part of the first public release of Immersion. This is followed by an overview of the
impact of that release, challenges faced in the large-scale deployment of the platform
and system-level changes being incorporated in the next version of Immersion. I
conclude this part with a compilation of the observations made based on my interaction
with users of Immersion. One of these observations in particular - users' propensity to
point out how significant moments in their email life are reflected through their
Immersion profile - forms the motivation for the second part of this thesis.
The second part of the thesis looks at email metadata from a temporal perspective in
the context of automatically detecting significant life events and presenting the same
through an interactive timeline. I introduce this part with a brief overview of related
work in the domains of Time-Series Analysis (TSA) and Temporal Network Analysis
(TNA). Among the methods reviewed, I elucidate my reasons for choosing a TSA-inspired scan-statistic method. This chosen statistical method lies at the heart of an
Event Detection Pipeline that is specifically designed to process time-series obtained
from timestamps in email metadata. I then explain how this pipeline forms the
backbone of the new storyline feature of Immersion that enables people to explore
their email history through an interactive storyline. A quick overview of the user interface of the storyline feature is included, followed by a brief discussion about the
computational infrastructure used in the development of the storyline feature.
I then conclude this thesis with some comments about future directions for this work.
These include plans for the upcoming release of Immersion and identifying avenues for
improving efficiency and accuracy of the event detection feature, followed by some
closing remarks.
Part 1
IMMERSION*
This part describes prior work in the domain of email data visualization, and the design and implementation of Immersion - a people-centric visualization platform for email data.
* Work done in collaboration with my colleague Daniel Smilkov
Prior Work
Email data has been used to answer research questions in various domains such as (but
not limited to) computer-supported cooperative work, network science, information
visualization, information retrieval, etc. In the context of the first part of the thesis, we
are interested in new ways of representing personal email data. In that vein, various
projects by Viégas, such as TheMail, Social Network Fragments and The Email
Mountain, must be mentioned since they provide interesting visual perspectives using
the same kind of data that Immersion also uses.
The Email Mountain project (Viégas, 2005), aptly named to reflect the deluge of email we
experience these days, visualizes a person's email history expressed in terms of all the
people that this person has communicated with over time, with each contact shown as a
separate layer. The thickness of the layer is directly proportional to the 'thickness' or
strength of the relationship between the layer's corresponding contact and the user.
[Figure 1: The Email Mountain]
TheMail (Viégas, Golder, & Donath, Visualizing Email Content: Portraying Relationships
from Conversational Histories, 2006), on the other hand, is a visualization that focuses
more on how the content of a user's emails has evolved over time. This provides a
convenient way to quickly perceive the important conversational topics that a user was
involved in with other contacts.
The Social Network Fragments (Viégas, Boyd, Nguyen, Potter, & Donath, 2004) project focuses more on the network aspect of email data, by deriving a graph of social relationships that a user is involved in. It also addresses questions about how network structures evolve over time. This project is much closer to the visualization technique that Immersion uses, as the next section will reveal.
Designing Immersion
This section describes the design decisions that informed the interaction and visual
design aspects of the various features of Immersion. I shall first provide an overview of
the interface, followed by individual descriptions of each major feature.
Overview of the User Interface
One of the motivations that Immersion is founded on is to provide a new perspective
for people to see their own email data. In order to do this, we have to eschew
traditional forms of representation of email data, and replace that with a new approach
that highlights different aspects of the same data. In a way, I liken the traditional
approach to the users always seeing their data framed within the constraints of a
square, and Immersion being a tool that reveals to the user that what they're looking at
is actually a cube containing many layers of data (Figure 4). Immersion's goal is to use
the different dimensions of email data, such as the people, timestamps, conversation
topics, etc. to create a new perspective.
[Figure 4: Looking at email data through a flat square versus breaking that square to show the deeper layers of data]
It is also important that the new perspective we architect is easily understandable by the user. For that reason, there is a need to keep the user interface
elements simple, meaningful and as information-laden as possible with minimal noise.
[Figures 5-6: Early design iterations of the Immersion user interface]
[Figures 7-9: Further design iterations of the Immersion user interface]
After a number of visual iterations as shown in the series of screenshots above (Figures
5-9), we settled on a UI layout that focuses on the Network view, Person view and
Statistics view. The time range selected using the slider at the bottom of the interface
filters the data that is fed to each of these views.
[Figure 10: The finalized Immersion user interface, showing the Network view, info pane and time-range slider for a sample account]
Network View
As shown in the screenshot of the finalized user interface (Figure 10), the Network view
takes the center stage in Immersion. In this view, each collaborator (a contact that
passes the relevancy test described later in this thesis) is represented as a circular node,
and the size of the node corresponds to the number of emails the user has exchanged
with that particular collaborator. The node is larger when more emails have been exchanged and smaller when fewer have been. This allows the user to
get an idea of who their top and most connected collaborators are at a quick glance.
Two or more nodes are connected by a link if the collaborators that the nodes
represent have participated in email conversations together. The link width is also
modulated based on the strength of the relationship between the collaborators, as seen
in the user's email dataset only.
Statistics View
On the right hand side of the main user interface is an info pane that includes some basic aggregate statistics about the user's email usage, the number of collaborators that Immersion has detected, a ranking of the user's collaborators and also a series of histograms. These histograms (Figure 11) represent three significant series of data points. The first one shows how the number of emails sent by the user has evolved over the years. The final bar in this histogram also includes a predictive component, represented by a grey bar, which uses the average of sent emails from the previous year and the current year. The second histogram shows a similar series of data points, but for the number of emails received by the user over the years. The final histogram represents the number of new collaborators the user has emailed each year. This is calculated by detecting the first email sent to each collaborator. In the sample histogram shown in Figure 11, it is clear that 2013 was the year in which the user interacted with the largest number of new collaborators.

[Figure 11: Info pane histograms of emails sent, emails received and new collaborators per year]
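To make the calculation concrete, here is a minimal JavaScript sketch of how the new-collaborators-per-year series could be derived. The `sentEmails` array and its field names are illustrative assumptions, not Immersion's actual data structures:

```javascript
// Count, per year, the contacts that receive their first email from the user.
// `sentEmails` is a hypothetical array of {date: Date, to: [contactId, ...]} objects,
// assumed to be sorted chronologically.
function newCollaboratorsPerYear(sentEmails) {
  const seen = new Set();   // contacts already emailed at least once
  const perYear = {};       // year -> count of first-time contacts
  sentEmails.forEach(function (email) {
    const year = email.date.getFullYear();
    email.to.forEach(function (contactId) {
      if (!seen.has(contactId)) {
        seen.add(contactId);
        perYear[year] = (perYear[year] || 0) + 1;
      }
    });
  });
  return perYear;           // e.g. {2011: 42, 2012: 57, 2013: 96}
}
```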
Rankings View
The info pane also shows the user a ranked list of collaborators based on their relationship strengths (Figure 12). This view is automatically filtered based on the time range selected. The user can also toggle the rankings between the time-filtered version and the rankings across all time.

[Figure 12: Ranked list of collaborators in the info pane]
Person View
Clicking on any node brings the user to the Person view (Figure 13). This view focuses
the user interface elements on the collaborator of interest, and shows the other
collaborators that this person is connected to via the selected individual. The network
view changes into a circular view with the collaborator of interest in the center, and the
second-degree connections to that individual being laid out on the circumference.
[Figure 13: Person view for a selected collaborator, showing second-degree connections laid out on a circle and an info pane with details such as first and latest email dates, sent and received counts, interaction volume and introductions]
The info pane on the right is also updated to reveal information about the user's email
communication history with the selected collaborator. In addition to showing the user
information such as the date of the first and most recent email exchanged with that collaborator,
the info pane also shows a histogram of how the relationship between them has
evolved over time.
The Person view also reveals information about the people that the selected collaborator has introduced the user to. Immersion calculates this by detecting the first time any collaborator(s) appears in the Cc: field of emails. Similarly, it also shows the person that introduced this collaborator to the user, by detecting the first time the selected collaborator appeared in the Cc: field of an email sent by someone else.
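A minimal sketch of this introduction heuristic in JavaScript; the email object shape (`from` and `cc` fields holding contact identifiers) is an assumption, and the emails are assumed to be processed in chronological order:

```javascript
// The sender of the first email in which a contact appears in the Cc: field
// is treated as the person who introduced that contact to the user.
function detectIntroductions(emails) {
  const introducedBy = {};   // contactId -> contactId of the assumed introducer
  emails.forEach(function (email) {
    (email.cc || []).forEach(function (contactId) {
      if (!(contactId in introducedBy)) {
        introducedBy[contactId] = email.from;   // first Cc: appearance wins
      }
    });
  });
  return introducedBy;
}
```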
Snapshot Feature
Most of the users of Immersion feel the need to share their Immersion network with
their family, friends and/or colleagues. To facilitate this, we implemented a Snapshot
feature that allows the user to save (only) the Network view as an image file. Immersion
does this by converting the SVG rendering of the network into a data URI that is then
used as input to a canvas element, which renders the image pixel by pixel. The link to
the generated
image is emailed to the user to download, and the image is
subsequently deleted from Immersion's server after 30 days.
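The SVG-to-image conversion can be sketched roughly as follows in client-side JavaScript; the element selector and dimensions are hypothetical, and the actual Immersion implementation may differ in detail:

```javascript
// Rasterize the SVG network view into a PNG via an intermediate data URI and a canvas.
function snapshotNetwork(svgElement, width, height, onDone) {
  const xml = new XMLSerializer().serializeToString(svgElement);
  const dataUri = 'data:image/svg+xml;charset=utf-8,' + encodeURIComponent(xml);

  const img = new Image();
  img.onload = function () {
    const canvas = document.createElement('canvas');
    canvas.width = width;
    canvas.height = height;
    canvas.getContext('2d').drawImage(img, 0, 0);   // render the SVG onto the canvas
    onDone(canvas.toDataURL('image/png'));          // PNG data URI of the snapshot
  };
  img.src = dataUri;
}

// Usage (hypothetical selector):
// snapshotNetwork(document.querySelector('#network svg'), 1200, 800, function (uri) { /* upload */ });
```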
Data Ownership
Immersion takes the privacy of users very seriously, and so we give the user full control
over their own metadata. At the end of their Immersion experience, every user is given
the choice to either delete their metadata from Immersion, or save it in Immersion's
server. The former option ensures that no personally identifiable information about the
user's email history is accessible to Immersion any longer after logout, whereas the
latter option promptly saves the data to the server, thereby enabling faster access to
their Immersion profile in the future.
Engineering Immersion
In order to bring the aforementioned features and user interfaces to life, there is a
considerable amount of engineering involved. The engineering stack and architecture
of Immersion has evolved since its birth through different iterations as we added new
features and supported more users. This section describes the client-server system that
Immersion is built on, its evolution over different iterations and also sheds some light on
its individual technical components.
System Architecture
The initial prototype of Immersion that was used for alpha testing within the Media Lab
community was feature-driven rather than scale-driven. Most of the data processing was
being done on the server and the client simply received a distilled dataset that it then
rendered as visualizations. We continued to use this model for the initial public launch
of the website as well. However, even though this model worked well for less than 50
users due to low concurrency, it became very clear a few days after the launch that
when more than 100 users were on the website at the same time, the server (due to its
infrastructural limitations) was starting to choke and fail. This eventually happened, and
it led to a complete overhaul of the architecture, which I describe below.
The modified architecture, which has a distributed computing flavor, gives more
responsibility to the client and only requires the server to do the data-fetching task. This
model scales much better for concurrent users since intensive computational tasks like
email metadata parsing and network abstraction now happen on the client side. In
other words, the client is thick and the server is thin in terms of the tasks they perform.
A visual representation of the architecture is shown in Figure 14.
[Figure 14: System architecture showing each layer's data-related responsibilities - a thick client handling email parsing, data cleaning and visualization (JavaScript / D3.js), a thin server handling user authentication, email extraction, the database and algorithms (Java / MongoDB), and the email server holding the raw emails]
The client is completely JavaScript driven for computational tasks and uses SVG, HTML
and CSS3 for the visual rendering of content. It receives serial batches of raw email data from the server, which fetches the data from the email service provider after mediating the user authentication process.
The client then parses this data and pushes data fields of interest into an in-memory
database. There are helper functions written in JavaScript that then create different
slices and abstractions of the data, such as ranking of contacts, calculating relationship
network, community detection, animating visualizations, etc. Upon logout, if the person
wishes to save their data, it is sent back to the server to be stored in a database for
future access. The following section explains in detail the implementation of the data
flow pipeline involved in the communication between the client and server.
Data Flow
AUTHENTICATION
In order to fetch email data from a service provider, the user has to grant Immersion
access to his or her email account. Authentication systems vary depending on the email
service provider, and in Immersion we support three providers: Gmail, Yahoo! and
Microsoft Exchange. Whenever possible, Immersion keeps the process of receiving and
transmitting login credentials as secure as possible. In the context of the overall system,
Immersion makes use of SSL encryption to keep information secure during
communication between the client (the user's browser) and the server. In the case of Gmail,
users are redirected to a page hosted by Google that returns an authentication token to
the Immersion server in case the user's login credentials are validated. The validation
process happens outside of Immersion's purview, thereby keeping the process secure.
FETCHING EMAIL HEADERS
Once the server receives the signal of authentication from the service provider, it
creates a fetching task that is handled by a multi-threaded Java application. Our initial
prototype used a Python fetching module, but due to scalability issues, we moved to a
Java-based solution. For Gmail and Yahoo accounts, Immersion's Java fetcher fetches
email data using the Internet Message Access Protocol (IMAP), and in the case of
Microsoft Exchange we make use of the official Java Exchange API. The current version
of the Immersion server spawns 30 threads in order to handle concurrent users.
Immersion fetches only the header information of every email. This means that it never accesses or reads the body (and attachment) content of any email, thereby
providing a higher level of privacy for the user. It also means that the data fetching
process is much faster since headers are quite small in terms of size compared to the
actual email. The header only contains some data fields, of which the only ones
Immersion gathers are the From, To, Cc and Timestamp fields. In the upcoming
version of Immersion, it also gathers the Subject and ThreadID fields, but those are
also part of email metadata contained in the header, and do not require access to the
body content of email.
To avoid overloading the server with too much data, Immersion limits the number of email headers fetched per user to 300,000 emails. This number is usually sufficient to gather emails spanning approximately 3-4 years for any user, thereby giving Immersion enough data to paint an informed picture through the visualizations. Since the IMAP protocol allows for fetching specific emails, Immersion executes the fetching task as a batch process of 10,000 emails each. This means that the user does not have to wait until all of their emails are downloaded, which can sometimes take up to 10 minutes. Once the initial batch of 10,000 emails is received, Immersion starts to render visualizations using that data, thus providing the user something useful to work with while the rest of the data is fetched and included in the visualizations.
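On the client side, this incremental behaviour amounts to a loop that processes each batch as soon as it arrives. The sketch below is one assumption of how such a loop could look; the `/api/batch` endpoint and parameter names are hypothetical, not Immersion's actual API:

```javascript
// Fetch email-header batches one after another and update the visualization
// after each batch, instead of waiting for the full history to download.
async function loadBatches(sessionId, render) {
  let batchIndex = 0;
  while (true) {
    // Hypothetical endpoint returning {emails: [...], done: boolean} per 10,000-email batch.
    const response = await fetch('/api/batch?session=' + sessionId + '&index=' + batchIndex);
    const batch = await response.json();
    render(batch.emails);     // re-render the visualizations with the data received so far
    if (batch.done) break;    // no more batches to fetch
    batchIndex += 1;
  }
}
```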
PACKAGING DATA FOR THE BROWSER
For every batch of 10,000 emails fetched, the server compresses all of that metadata
into a single JSON (JavaScript Object Notation) file by wrapping it in a gzip format that
the client's browser is able to unpack at its end. This greatly reduces the size of each
data packet sent to the browser. On average, each compressed JSON batch of emails comes to between 2 MB and 5 MB.
EXTRACTION OF METADATA & MAKING SENSE OF IT
This is a critical step in the data processing context since it helps filter out emails that
have invalid values in data fields like the From, To, Cc and Timestamp. There are a
number of occasions where a timestamp's value is from the distant past or future due to
inconsistencies in the sender's email settings. Such emails where there are issues with
parsing of the metadata are removed from the data set before visualization. Also, by
checking the value of the Auto-Submitted field in the header, we are able to ignore
emails sent automatically by machines.
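A hedged sketch of this cleaning step, assuming each parsed header is an object with hypothetical `from`, `to`, `date` and `autoSubmitted` fields:

```javascript
// Drop emails with missing or implausible metadata, as well as machine-generated mail.
function cleanHeaders(headers, firstPlausibleYear) {
  const now = new Date();
  return headers.filter(function (h) {
    if (!h.from || !h.to || h.to.length === 0) return false;        // missing sender/recipients
    const date = new Date(h.date);
    if (isNaN(date.getTime())) return false;                         // unparseable timestamp
    if (date > now || date.getFullYear() < firstPlausibleYear) return false; // distant past/future
    if (h.autoSubmitted && h.autoSubmitted !== 'no') return false;   // Auto-Submitted header set
    return true;
  });
}
```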
Immersion systematically combs through every email in the Sent folder to detect if the
user has multiple email addresses (aliases) associated with his or her name. We do this
to make sure that all of the emails are correctly mapped to the correct sender and
receiver, and that the user himself or herself is not detected as a separate contact.
Immersion also does something similar for other contacts appearing in the user's email
metadata. Multiple addresses for a contact are collapsed into a single identifiable entity defined by the Firstname Lastname of that contact. For example, if John Doe is a contact appearing in a user's email history with two separate email addresses - john.doe@gmail.com and hi@johndoe.com - they will be mapped to a single entity called John Doe, as long as the name fields for both email addresses are the same. This
vastly improves the clarity of the final network of contacts obtained by removing
spurious nodes for the same contact. However, in doing this we also risk wrongly
combining two different people into a single entity if their first and last names are the
same.
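A minimal sketch of the collapsing step; the input shape (a list of `{name, address}` pairs observed in the headers) is an assumption about the parsed data, not Immersion's actual model:

```javascript
// Map every (name, address) pair to a single entity keyed by the display name,
// so that two addresses sharing the same "Firstname Lastname" collapse into one contact.
function collapseAliases(observedContacts) {
  const entities = {};   // normalized name -> {name, addresses: Set}
  observedContacts.forEach(function (c) {
    const key = c.name.trim().toLowerCase();   // e.g. "john doe"
    if (!entities[key]) {
      entities[key] = { name: c.name.trim(), addresses: new Set() };
    }
    entities[key].addresses.add(c.address.toLowerCase());
  });
  return entities;       // note: distinct people who share a full name are merged too
}
```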
IDENTIFYING RELEVANT CONTACTS
Since the visualizations in Immersion are people-centric, it is important that the dataset
does not contain contacts that are irrelevant from the perspective of the user. For
example, email addresses associated with mailing lists, social network notifications, etc.
are not considered relevant since they do not contribute much meaningful information.
We use a simple filtering technique to remove these contacts from the network. A
minimum threshold of 2 emails (sent and received) is set, and only contacts that pass
this requirement are shown in the network. This effectively rules out inclusion of
contacts from which the user has received only one or two emails but never sent an
email to.
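The relevancy test itself amounts to a thresholding pass over per-contact counters. One plausible reading of the threshold is sketched below (the `sentCount` and `receivedCount` fields are assumptions):

```javascript
// Keep only contacts that clear the minimum-volume threshold; this removes mailing
// lists and notification addresses the user never really corresponds with.
function relevantContacts(contacts, minSent, minReceived) {
  return contacts.filter(function (c) {
    return c.sentCount >= minSent && c.receivedCount >= minReceived;
  });
}

// e.g. relevantContacts(allContacts, 2, 2) requires at least 2 sent and 2 received emails.
```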
CREATING AN IN-MEMORY DATABASE OBJECT IN THE BROWSER
Since all the data processing happens in the browser, the data needs to be organized in
a way that makes it easy to retrieve the necessary bits for calculating metrics such as
ranking of people, creating the underlying graph, identifying time-stamps, etc. In order
to facilitate this, we create an in-memory database object which stores all the metadata
sent by the server. This database object has helper functions associated with it that
enable the retrieval of specific data points that Immersion
needs to generate
visualizations.
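A rough sketch of what such an in-memory store with helper functions could look like; the shape and helper names are illustrative, not the actual Immersion code:

```javascript
// A simple in-memory store over parsed email metadata with helpers used by the views.
function createEmailStore() {
  const emails = [];
  return {
    addBatch: function (batch) { Array.prototype.push.apply(emails, batch); },
    // All emails whose timestamp falls inside the slider-selected time range.
    inRange: function (start, end) {
      return emails.filter(function (e) { return e.date >= start && e.date <= end; });
    },
    // Number of emails exchanged with a given contact, used for ranking and node sizing.
    volumeWith: function (contactId, start, end) {
      return this.inRange(start, end).filter(function (e) {
        return e.from === contactId || e.to.indexOf(contactId) !== -1;
      }).length;
    }
  };
}
```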
GENERATING THE UNDERLYING GRAPH STRUCTURE
In order to transform the disconnected pieces of information from each contact into a
larger connected graph that powers the network visualization, Immersion queries the in-memory database to get the number of emails between the user and each contact. We then calculate the communication strength between the user and each contact, and also between contacts themselves from the user's perspective. The communication strength of contact i, a_i, is calculated as the generalized mean of the number of emails that the user (or one of his or her contacts) has sent to i, s_i, and the number of emails received from i, r_i:

a_i = ((s_i^p + r_i^p) / 2)^(1/p)

We empirically found p = -5 to be the value that highlights symmetric (two-way) communication over asymmetric (one-way) communication. The resulting values of a_i, which reflect the relationship strengths between the user and his or her contacts and also between the contacts themselves in the user's ego-network, are then converted into nodes (each contact) and links (the value of a_i between contacts) to give rise to the network visualization.
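A direct transcription of this formula into JavaScript, together with a sketch of how the nodes and links could be assembled (the data structures here are assumptions, and the D3 force-layout wiring is omitted):

```javascript
// Generalized mean of sent and received counts; with p = -5 the value is dominated
// by the smaller of the two counts, rewarding symmetric (two-way) communication.
function communicationStrength(sent, received, p) {
  if (sent === 0 || received === 0) return 0;   // one-way exchange contributes nothing
  return Math.pow((Math.pow(sent, p) + Math.pow(received, p)) / 2, 1 / p);
}

// Build nodes and weighted links for the ego-network visualization.
function buildGraph(contacts, pairCounts, p) {
  const nodes = contacts.map(function (c) {
    return { id: c.id, strength: communicationStrength(c.sentCount, c.receivedCount, p) };
  });
  const links = pairCounts.map(function (pair) {   // pair = {a, b, sent, received}
    return { source: pair.a, target: pair.b,
             weight: communicationStrength(pair.sent, pair.received, p) };
  });
  return { nodes: nodes, links: links };
}
```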
Launch Versions
Interactive Exhibit: Webs of People
Soon after the development of the initial prototype, Immersion was selected to be an
interactive exhibit at The Other Festival organized at the MIT Media Lab in April
2013. The exhibit was titled Webs of People, in tandem with the Webs of Matter
exhibit showcasing the Silkworm Pavilion project from the Mediated Matter group. One
of the main goals of our exhibit setup was to provide an experience where the creators
of the project did not have to be physically present for a visitor to try out Immersion.
PHYSICAL SPACE
In terms of the physical space assigned to the exhibit, it was located near the main
entrance of the lab. This meant that the margin for failure was very small, both in terms
of the software and hardware since it was a high visibility area for visitors. Figure 15
shows a rendering of our vision for the exhibit in that space.
[Figure 15: A rendering of our vision for the Webs of People exhibit space]
We explored the possibility of using projectors to have a large viewable area for
different views of the visualizations. But even projectors with high lumen ratings did not
succeed in reproducing colors and contrast at a level we desired, made all the more
difficult by the lighting situation in the exhibit space. During daytime, there was enough
sunlight in that area to wash out any sort of projection on the wall, unless the projectors
were kept very close to the wall. This would mean that the viewable area would become
much smaller. The other option was to enclose the exhibit space using curtains, or
create a kiosk/booth, but we decided against this approach because it would affect the
open experience we wanted the visitor to have while experiencing Immersion.
HARDWARE & TECH SETUP
Given the complexity of using projectors, we instead decided to use three large
displays, set up in an angular fashion, each display showing a different view of the
exhibit. A photograph of the exhibit setup is shown below.

[Figure 16: Webs of People final exhibit setup]
The central display (Figure 16) is the one that visitors can log in to, and see their
visualization. The display on the left shows the Immersion Guestbook, and the one on
the right shows the Immersion Rankings of the visitors who have saved their profile
upon log out. The Guestbook and Rankings were special features added for the exhibit
version.
IMMERSION GUESTBOOK
For every visitor who saves his or her Immersion profile during log out, we add this person as a node to the Immersion Guestbook network on the left. The idea behind the Guestbook is to show the network that emerges from the connectedness of the Media Lab community, including its external connections such as sponsors, spouses, roommates, etc. Figure 17 shows a screenshot of the Guestbook feature.
[Figure 17: Guestbook feature showing a network view of connected guests who saved their Immersion profile at the Webs of People exhibit]
IMMERSION RANKINGS
For every visitor who saves his or her profile during log out, we also calculate their collaboration rank - a heuristic that measures how many people the user has exchanged at least 3 emails with. Based on this measure, we show the top ten users in the Immersion dataset as a ranked list that is updated every time a new user's profile is saved to the server. This turned out to be a great way to motivate visitors to check out the exhibit and also save their profile for future use. Figure 18 shows the rankings as of May 2013.

[Figure 18: Immersion Rankings at the Webs of People exhibit, May 2013]
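The collaboration rank heuristic is easy to state in code. A sketch (field names assumed, and treating "exchanged at least 3 emails" as a total over sent and received):

```javascript
// Collaboration rank: the number of contacts with whom the user has exchanged
// at least `minEmails` emails (3 in the exhibit version).
function collaborationRank(contacts, minEmails) {
  return contacts.filter(function (c) {
    return (c.sentCount + c.receivedCount) >= minEmails;
  }).length;
}
```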
Public Website: Immersion v.0
On 30 June 2013, we launched the public version of Immersion as a standalone
website, accessible at immersion.media.mit.edu. The website version did not include
the Guestbook and Ranking features, but had extra layers of server-side functionality in
order to support many hundreds of users concurrently. The website has been tested to
work successfully on browsers running the WebKit rendering engine (The WebKit Open
Source Project, 2014), such as Google Chrome and Safari. Support for non-WebKit
based browsers is not provided since we experienced severe user experience issues because those browsers were unable to handle rendering a large number of animated SVG elements (like those appearing in the network visualization). The impact
of the launch of the public website, and observations made from interactions with users
of the exhibit version are discussed in the next section.
Impact
Reactions & Quantitative Observations
After the launch of the public website, Immersion was immediately picked up by
different news agencies for reporting in the context of the global discussion on
metadata and privacy. It was clear that the public, informed through the media, was
starting to view Immersion not just as a tool for self-reflection, but more so as a tool that
reveals the potential of unethical access and misuse of personal metadata by
governmental agencies. Using Google Analytics, we were able to keep track of the
location, time and referral source (website) of each visitor to the website.
The first news report appeared in the Boston Globe, titled 'What Your Metadata Says
About You' (Riesman, 2013). Even though this report by itself only directed about
~1000 users to the website, it prompted a number of other news agencies to write about the relevance of the project. NPR published the next major news report, pitching Immersion
as a tool that 'lets you spy on yourself' (Goldstein, 2013), and that article brought
thousands of visitors to the website, peaking at almost 30,000 visits a day. The figure
below shows the number of visits per day for the month of July 2013, right after the
launch of the public website.
[Figure 19: Immersion's web traffic during the month after the launch on June 30, 2013]
Even though we had initially prepared the server for a high number of concurrent users,
the deluge of visitors to the website was far beyond our initial expectations. This
resulted in the server crashing on July
2 nd,
and consequently Immersion was offline for
the next three days while we worked on a more scalable architecture. The new
architecture pushed all the processing to the client, with the server only fetching the
metadata from the email provider. This in effect created a distributed computing
setup,
and relieved the Immersion server from having to do all the processing on its own. The
website was re-launched on July 4 th, and the timing was perfect because by then a lot
more new agencies and influences on social media started to link to the project. The
peak day was July
9 th
with over 200,000 users visiting the website from many different
sources such as Wikileaks (WikiLeaks, 2013), Time magazine (Groden, 2013), The
Independent (Vincent, 2013), etc.
As of June 2014, Immersion has received over 800,000 unique visitors spanning over
1.3 million visits, of which 43% were returning visitors. Another data point we were
tracking was the amount of time that visitors spent on the website. As shown in the
figure below, Google Analytics reveals that many users spent less than 10 seconds on
the website, probably due to their reluctance to share their email metadata.
However, more than 210,000 users spent between one and ten minutes on the website (Figure 20), which suggests that the engagement with users who were willing to try out
Immersion was substantial.
0-10 seconds: 723,306
11-30 seconds: 57,394
31-60 seconds: 80,533
61-180 seconds: 103,146
181-600 seconds: 108,599
601-1800 seconds: 72,141
1801+ seconds: 36,265

[Figure 20: Overall time spent on the website by users across multiple sessions]
Qualitative Observations
Another source for observations was the exhibit version of Immersion that allowed us to
watch visitors use Immersion in person and gather feedback from them directly. It was
fascinating to observe each person's immediate emotional reaction when they saw their
Immersion profile for the first time because the visualizations clearly motivated them to
share more information about themselves.
The emotional reactions of people varied from person to person, and seemed to be
influenced by factors such as:
- their personality: whether a person was open to discussing their Immersion experience, or whether they preferred not to elaborate on the relationships that were revealed visually,
- personal/work emails: whether the emails analyzed were of the personal and/or the
professional kind,
- light or heavy usage: whether they had only a few thousand emails or many thousands
of emails in their inbox, etc. The visualizations in Immersion were more insightful and
richer in content if the user had at least a few thousand emails.
One common theme of reaction I noticed among most of the users was a propensity to
recall and share, through informal conversation, the different 'eras' of their email life.
Some of the sample comments made by people attempting to communicate their findings were as follows:
- "Hey, that's when I moved to Austin!"
- "I met my wife around that time..."
- "2011 was a low-point for me... I was between jobs."
- "My network becomes so dense after I started grad school."
This seemed to be a common way for them to frame and elucidate their emotional
response to any significant visual finding that they perceived in their Immersion
experience. For example, validating that a dip in the graph of a relationship with a particular
contact coincided with the timing of their real-life separation.
This particular observation motivates the goal for part 2 of this thesis: being able to
algorithmically detect significant events and time-periods in a person's email life, and
present the results in the form of an interactive storyline.
Part 2
EVENT DETECTION
This part focuses on temporal analysis of email data in order to detect significant
life events. The results are used to craft a new feature in Immersion, called
Storyline - an interactive timeline of a person's email life.
Significance of the temporal dimension
This part of the thesis essentially boils down to the application of time-series analysis to email metadata. A time-series is a collection of values corresponding to a set of contiguous time instants generated by a process over time (Esling & Agon, 2012). Time-series data mining is a well-explored research domain because a large percentage of the data that is generated through processes and experiments, either manually or by machines, is anchored with specific timestamps. Time-stamped data is commonplace
in scientific domains like medicine (ECGs, EEGs, heart rate, etc.), astronomy (light curve
data from exoplanets, positions of heavenly bodies, etc.) and also at the consumer end
of the spectrum such as personal data (email, text messaging, financial transactions,
etc.).
The ubiquity of time-series as a form of representing a process for further analysis is
brought about largely because of the convenience of using time as the independent
variable in many experiments, and so it easily finds a home on the x-axis of even the
most basic graphs. Representing time on a straight line is a motif that a lay person is
comfortable with, and from that building block we have managed to come a long way
in terms of methods to analyze and represent higher levels of time-stamped data. Apart
from being able to delve deeper into the characteristics of a single time-series, the
temporal dimension also helps us to compare different processes using time as the
common axis.
Aptly put by Gershenfeld & Weigend, "measured
time-series form the basis for
characterizing processes under observation, to better understand the past and at times,
to predict their future behavior" (Gershenfeld & Weigend, 1993).
One of my main goals with this thesis is to bring the contextually relevant body of
knowledge of time-series analysis closer to the domain of email data. Consequently,
this enables people to derive more insights out of their own email data through a publicly available platform, in this case Immersion.
Prior Work
There have been previous works that have analyzed email through temporally motivated questions. The literature review exercise revealed that while there are existing tools to explore one's own email data in the temporal dimension, there is more that could be done from the technical perspective, in terms of using better, more accurate and more efficient methods of mining email data. For this reason, in the second part of this thesis, I focus more on the technical challenge involved in detecting significant events and less on the interface aspect, since we would be able to create richer narratives if we are able to mine better stories from the underlying data.
Fischer & Dourish's (Fischer & Dourish, 2004)
work on temporal patterns in coordinated work
environments resulted in the creation of a tool
called TeliMeAbout (Figure 19), which gives a
user
temporal
information
about
their
> ?.llaab--t -person bnnzkhua
messages since Mar 23 'C1,
most recently May 12 -02
especially
Mar 26 '01-Hay 7 '01,
May 28 '01-Jun 11 '01
closest connections include (gayle)
19
Figure 21: TeIIMeAbout
individual interactions with their colleagues.
Viégas et al have also published work relating to the temporal aspect of email communication. In their project PostHistory (Figure 22) they aggregate data on temporal dimensions such as daily email averages, the 'quality' of emails (spam or relevant contacts) and the relative frequency of email exchanges with contacts, in order to provide the user a way to access higher-level information about their email habits (Viégas, Boyd, Nguyen, Potter, & Donath, 2004). Immersion makes use of similar and additional dimensions inspired by works in the field of time-series analysis that are discussed next.

[Figure 22: PostHistory]
Even though time-series analysis (TSA) is a fast-moving research domain, my literature
review revealed that not many TSA methods are formulated specifically for email
datasets. It wasn't surprising to note that prior work in this intersection mostly made use
of publicly available email archive data such as the Enron email corpus (Cohen, 2009),
rather than allowing people to analyze their own email data.
Approaches to temporal analysis of processes can be generally classified into two kinds:
- the signal-processing-inspired Time-Series Analysis (TSA) approach,
- the network-science-inspired Temporal Network Analysis (TNA) approach.
The former approach is more statistics driven and has a much larger body of work
compared to the latter, especially since network science is a relatively newer field
compared to regular time-series data mining.
Pertaining to TSA, there are a number of useful survey papers that enabled me (as a
beginner in this field) to get a good idea of the common analysis methods and representation techniques used when dealing with time-series data. The general time-series-related methods outlined in these papers fall into the following categories
(Ratanamahatana, 2010):
- Indexing (Query-by-Content)
- Clustering
- Classification
- Prediction
- Summarization
- Anomaly and Event Detection
- Segmentation
Each family of methods is best suited to particular purposes. For example,
Indexing is used in the case of quantifying similarity or dissimilarity measures (Euclidean
distance) between a given set of time series signals. On the other hand, Prediction is a
family of methods that features heavily in many domains, where data points from past
time intervals are used to project values of the same process into the future.
The combination of analysis techniques to accomplish the goal of the second part of
this thesis is wholly inspired by the Anomaly and Event Detection family of methods
in the list above.
Given that the high-level data structures and visual components used in Immersion are
inspired by network science, I've also reviewed technical papers in the research space
of temporal aspects of networks to investigate how they can assist in the process of
event detection.
Holme et al, in their book Temporal Networks (Holme & Saramaki, 2013), review past
and emerging works in the field of TNA -- temporal graphs, evolving graphs, dynamic
networks, etc. --- making a clear distinction of which techniques are best suited for static
networks and which are best suited for dynamic networks.
More often than not, TNA methods simplify the task of dynamic network analysis by
sampling a dynamic network at regular time intervals, resulting in a series of static
networks each forming the signature network representation for each time interval.
Wan et al's work on link-based event detection in email communication has shown that
simply relying on analysis of variations in communication volume over time is not
enough to surface every event worth detecting in a person's email dataset (Wan, Milios,
Janson, & Kalyaniwalla, 2009). There are changes that can happen over time within a
person's email life that need not necessarily affect the communication volume, but can
radically change the underlying network of people that the person has communicated
with. For example, moving to a new city might mean that a person still sends and
receives the same number of emails the next month, in effect keeping the communication volume almost the same. But the set of people the person interacts with could change a whole lot in that same time frame, which could be detected if one were
to use the temporal network analysis approach.
A combined approach (TSA+TNA), as proposed by Wan et al, could be used to custom-fit the needs of different levels of abstraction of the data. For example, one data layer
could be richly expressed as a set of time series (individual signals of relationships with
each person and combined communication volume), and also as a dynamic network
graph of people (vertices as people and edges representing connections between
people) with whom a person has corresponded with over time.
In the proposal stage for this thesis, I put forth a two-pronged approach (TSA+TNA) to
tackle the problem of identifying significant events from email data. Given the time
constraints for the thesis and taking into account overall goals of the project, my
committee advised me to focus on one approach. Based on my literature review and
execution timeline, I decided to prioritize the TSA approach and integrate it with a TNA
approach in the future when time permits.
The technique used in this thesis for detecting events comes from the Anomaly and
Event Detection family of methods. Now is a good time to explain this set of
techniques in more detail since it is important to understand the difference between
Anomaly Detection and Event Detection, and why I chose the latter in the context of
working with email data.
The logic behind event detection from a time-series is to find time intervals within the
time-series that deviate from the baseline signal. This is then followed by calculation of
statistical significance of each time-interval in the context of the parameter we wish to
observe and create a model for. Separating significant time-intervals' characteristics
from the baseline signal requires modeling the noise present in the time-series.
Modeling of the noise is usually done by randomization testing in which a time-series is
randomly reshuffled many times and the original interval is compared to the most
statistically deviant interval found in each shuffle (Neill, Moore, Pereira, & Mitchell,
2005). However, performing this randomization is a computationally intensive process,
and for most practical purposes (such as in a web-based system like Immersion) it is not
a viable route.
In the case of email communication, noise is ever-present in the constituent time-series.
Within a given time-series the noise need not necessarily follow any periodic pattern,
and for a set of time-series, there need not be a common noise profile among them
either. So there is a need to take an approach that is independent of the noise
characteristics. This is where the distinction between Anomaly Detection and Event
Detection becomes relevant. Anomaly Detection is more suitable for periodic series
with less noise (Keogh, Lin, & Fu, 2005). Event Detection on the other hand helps to
find subintervals that are most statistically different from the underlying noise.
A paper by Preston, Protopapas, and Brodley (2009) provides the foundation for the Event Detection technique used in this thesis, since the problem they were trying to solve - detecting microlensing events from noisy large-scale data - closely mirrors the goal of event detection that this thesis proposes in the context of email data.
Preston et al.'s method, which was motivated by the field of astronomy where time-series analysis is very prevalent, is able to detect microlensing events from light-curve time-series associated with exoplanets under observation. Microlensing events (example shown in Figure 23) occur when a large object (usually a heavenly body such as a planet or moon) passes in front of a light source (such as a star, or light reflected from a planet) and acts as a gravitational lens, thus increasing the magnitude of light being observed. Their method was successfully tested with a large-scale dataset pertaining to Massive Astrophysical Compact Halo Objects (MACHO).
[Figure 23: Example of a microlensing event in a light curve from the MACHO dataset.]
Each light curve has different noise characteristics, and there are millions of light curves
to consider together as a set in order to produce reliable results (Preston, Protopapas,
& Brodley, 2009).
Interestingly, this problem space is quite similar to that of time-series present in email
communication data, where the noise contained in the time-series representing
communication volume with person A is different from that of the time-series associated
with person B. So how can we detect events from a time-series while also navigating
the unique noise characteristics present in each series?
The answer lies in the design of the Event Detection pipeline proposed in this thesis,
which is detailed in the next section.
A Quick Overview of the Technique
This section explains the technique I've used for detecting events in email data, the data-processing pipeline it necessitates, and the challenges faced during its implementation as a feature within Immersion.
The technique relies on a scan statistic method that uses sliding windows (adjacent
equally-spaced intervals within a time-series) to characterize sub-intervals of a given
time-series.
The independence from the underlying noise is achieved by converting the time-series to rank-space -- a discrete series of N points with each interval assigned a rank value between 1 (lowest value) and N (highest value). This yields a uniform distribution of the dependent variable being observed -- in our case, the communication volume -- which allows us to form a probability model that does not have to account for the noise, since the distribution is known to be uniform.
Once we obtain the probability model for each time-series we are interested in, it can be used to compute the p-value for each interval within that time-series. Doing this for variable window sizes (1 week, 2 weeks, 1 month, 3 months, etc.) is a computationally intensive task, and so an optimization technique needs to be used to approximate model solutions for each window size without sacrificing much accuracy in the p-values.
A Deep-Dive into the Event Detection
Pipeline
Framing the aforementioned scan-statistic technique in the context of the email
communication dataset required some changes to be made to the original algorithm
proposed by (Preston, Protopapas, & Brodley, 2009) and this entire process is detailed
in this section. The cascaded series of tasks involved in the event detection pipeline of
Immersion are as follows:
Step 1: Extract and normalize timestamps
Step 2: De-trend the input time-series
Step 3: Convert time-series to rank-space
Step 4: Calculate rolling-sums using a sliding window
Step 5: Determine probability distribution using the Monte Carlo method
Step 6: Find significant p-value intervals
Step 7: Identify people & topics relevant during each event interval
Step 8: Consolidate overlapping event intervals
Step 9: Include obvious events based on individual interactions
This processing pipeline is used for every time-series of interest, which includes the evolution of the overall communication volume (sent + received emails) over time, as well as individual time-series representing communication volume with each person of interest. The former helps to identify events at a macro-scale (a big-picture perspective), and the latter is used to profile each detected macro-scale event in depth, to accurately identify which people in a person's email correspondence history contribute most to that particular time interval's statistical significance.
STEP 1: EXTRACT AND NORMALIZE TIMESTAMPS
The technology stack of Immersion has been designed in such a way that timestamp
data of emails is available on the client-side (browser) through an in-memory database
object. A time-series of the overall communication volume for a person is calculated by
aggregating emails sent to and received from that person's contacts. The resulting
time-series is devoid of irrelevant and uninformative timestamps due to spam, thanks to
the filtering that Immersion performs prior to the visualization step (detailed in Part 1 of
this thesis). The same process is also followed to obtain time-series of communication volume with individual contacts of that person. It is important to note that each time-series obtained can have a different length from the others, and so our algorithm needs to adapt to time-series of different lengths.
Converting each timestamp into a JavaScript Date object also allows for normalization of each timestamp into a common time zone -- that of the region in which the person is currently using Immersion. The Date() object in JavaScript is flexible enough to give timestamps with resolution ranging from milliseconds to years. For the purpose of this thesis, the highest level of resolution we use is 1 day, and no distinction is made between emails received in the morning of a day and those received later in the day, as long as they fall within the same 24-hour time window.
This array of timestamps is sent as input to the Immersion server, which is built with a Python framework. All the code for the following stages of the pipeline has been written in Python.
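As a minimal sketch of this binning step (not the exact code used in Immersion; the function name and the use of UTC are assumptions for the example), a list of Unix-epoch timestamps could be aggregated into a daily count series as follows:

```python
from collections import Counter
from datetime import datetime, timezone

def bin_timestamps_by_day(timestamps):
    """Aggregate Unix-epoch timestamps (seconds) into a daily email-count series.

    Returns a date-sorted list of (date, count) pairs, which serves as the raw
    time-series for the rest of the pipeline.
    """
    # Truncate every timestamp to its calendar day (UTC here for illustration;
    # the real system normalizes to the user's current time zone).
    days = [datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps]
    return sorted(Counter(days).items())
```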
STEP 2: DE-TREND EACH TIME-SERIES
Time-series data of email communication volume usually exhibits a trend where the
volume of emails increases over time. This observation is expected since we are online
more often thanks to the proliferation of mobile devices, and also since email has
become a fundamental conduit of communication in the digital age.
However, the presence of a trend
in a time-series can distort or obscure the
relationships or phenomena that we are observing. In order to ensure stationarity of the
process captured by the time-series, it is necessary to de-trend the time-series (Meko,
2013).
Identification of a trend in a time-series is subjective, and can sometimes be informed by knowledge of the real-world phenomena that influence the process. For the purpose of this thesis, I have assumed that email communication volume increases linearly over time, because this was a commonly observed phenomenon in the dataset of people who tried out the Immersion prototype. It is quite possible that this assumption does not hold for all users, and it is necessary to model the trend more accurately in the future. One possible way is to use a piecewise polynomial fit instead of a linear fit, but due to time constraints this approach is left to the future work section of this thesis.
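A minimal sketch of the linear de-trending described above, assuming the daily counts are held in a NumPy array (the function name is illustrative):

```python
import numpy as np

def detrend_linear(counts):
    """Subtract a least-squares straight-line fit from a 1-D series of counts."""
    counts = np.asarray(counts, dtype=float)
    t = np.arange(len(counts))
    slope, intercept = np.polyfit(t, counts, deg=1)  # least-squares linear fit
    return counts - (slope * t + intercept)
```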
Figure 24 shows the overall communication volume for my email data between the years 2005 and 2014, and what that time-series looks like before and after de-trending. The red line indicates the least-squares-fit straight line used to de-trend the entire time-series. This is done for every time-series that is sent for processing through the event detection pipeline.
[Figure 24: (A) The overall communication volume time-series before de-trending, with the least-squares linear fit shown in red, and (B) the same time-series after de-trending.]
STEP 3: CONVERT TIME-SERIES TO RANK-SPACE
In order to extricate our process from its dependence on the underlying noise, it is
necessary to convert the time-series into a uniform distribution. This is achieved by
converting the value in each interval of a time-series into a rank-value which is the
ranking that the interval would have compared to values in other intervals in the same
time-series. The range of ranks extends from 1 to the length of the time-series N, with
rank 1 corresponding to the lower end of the set of values, and N representing the
interval(s) with the highest value. This raises the possibility that some intervals could have the same original value, and hence a decision needs to be made about how to rank them. The heuristic used in our technique is to assign each tied interval the average of the ranks they would have received had their values been distinct.
In the case of email communication volume, each interval represents the number of
emails sent and received within a time interval, and this can vary from one interval to
the other. Moreover, it is not possible to restrict the original values within any range
(apart from the minimum value which is always 0).
This step yields a uniform distribution of values ranging from 1 to N, on which we can perform statistical operations without having to worry about modeling the underlying noise. It also means that for each combination of sliding-window size w and time-series length N, there exists a distinct probability distribution of sums of ranks within each sliding-window interval. This technique also makes the assumption that sums in the outer tails of the probability distribution are more likely to map to 'significant' events, since the probability of obtaining those sums is lower than that of the sums appearing in the mid-section of the distribution.
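A minimal sketch of this conversion using pandas, whose average-rank method assigns tied values the average of the ranks they would otherwise occupy (the function name is illustrative):

```python
import pandas as pd

def to_rank_space(values):
    """Replace each value by its rank from 1 (smallest) to N (largest),
    averaging the ranks of tied values."""
    return pd.Series(values).rank(method="average")
```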
STEP 4: CALCULATE ROLLING-SUMS USING A SLIDING WINDOW
In order to prepare the time-series for calculating p-values associated with each interval,
it is now necessary to use a sliding window and calculate the sums for each interval in
the time-series.
I have used the rolling-sum function in Pandas (a scientific toolkit for Python), after having pre-defined the sliding window size to be 14 days. This number was chosen based on trial-and-error and the real-life observation that a significant life event is more likely to be expressed in email communication over the duration of two weeks than over just one day or one month. In the case of the latter, this technique would result in two consecutive sliding-window time-intervals being deemed statistically deviant from the rest of the time-series, and that multiplicity is resolved towards the end of the processing pipeline when both are consolidated into a single significant event. If the sliding window size were as small as one day, the resulting time-series would not encode any new information, since the highest resolution we have used to bin emails into a time-series is the 24-hour time period.
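A minimal sketch of this step using pandas' rolling-window API (the modern form of the rolling-sum function mentioned above); the constant and function name are illustrative:

```python
import pandas as pd

WINDOW_DAYS = 14  # sliding-window size used throughout the pipeline

def rolling_rank_sums(ranked, window=WINDOW_DAYS):
    """Sum the ranks inside every sliding window of `window` consecutive days.

    The first (window - 1) entries are NaN because those windows are incomplete.
    """
    return pd.Series(ranked).rolling(window=window).sum()
```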
STEP 5: DETERMINE PROBABILITY DISTRIBUTION USING MONTE CARLO
Given our rank-transformed time-series TR, we now need to find the statistical significance of the sum observed in a particular interval. In other words, we need to know the distribution of window sums for a given pair (w, N), where w is the size of the sliding window and N is the length of the time-series.
There are a couple of ways to do this:
1. An analytic method involving a combinatorial problem that yields exact probabilities
for all possible sums. However, it is deeply recursive and so not viable for a practical
implementation (Preston, Protopapas, & Brodley, 2009).
2. Perform random sampling of the possible sums using the Monte Carlo method. This approach is far less memory- and compute-intensive, and yields an approximate probability distribution curve.
Keeping in mind that the goal of this thesis is a scalable and practical solution usable on a contemporary web application's infrastructure, I chose the second approach. Applying the Monte Carlo method to our time-series context, w unique random numbers from 1 to N are repeatedly selected. These numbers are summed, and the number of times each sum θ is obtained during our trials is counted; this count is denoted by $n_\theta$. Clearly, the more trials we are able to perform, the closer the distribution curve (of the frequency of the sums) will be to its analytical equivalent.

The minimum number of trials needed can be tuned based on the p-value threshold α, where events with p-value < α are considered statistically significant. The expected accuracy -- in other words, the error margin of the frequency of any sum θ -- is given by:

$$\epsilon \approx \frac{1}{\sqrt{n_\theta}}$$

Since $n_\theta \approx N\alpha$, in order to ensure accuracy ε for a p-value threshold α, the minimum number of samples N is given by:

$$N \approx \frac{1}{\alpha\,\epsilon^2}$$

Both ε and α are statistically motivated and are to be pre-defined. This involves a trial-and-error approach checking results for different combinations, and I found the minimum acceptable number to be 1,000,000 samples.
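A minimal sketch of this sampling procedure (the function name is illustrative, and a production implementation would likely vectorize the loop):

```python
import random

def monte_carlo_window_sums(w, N, trials=1_000_000):
    """Approximate the distribution of sums of w distinct ranks drawn from 1..N.

    Each trial draws w unique ranks uniformly at random and records their sum;
    the list of sums approximates the null distribution used to assign p-values
    to observed window sums.
    """
    sums = []
    for _ in range(trials):
        sums.append(sum(random.sample(range(1, N + 1), w)))
    return sums
```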
However, when the resulting distributions from trial counts lower than 1,000,000 were plotted (Figure 26), it was clear that the distributions were very close to a normal distribution, and that the mean and standard deviation of each distribution (for different trial counts) were very close to one another.
[Figure 26: Distributions of window sums obtained from Monte Carlo sampling with 100, 1,000, and 100,000 trials; the means (roughly 1257 to 1288) and standard deviations (roughly 224 to 280) of the three distributions remain close to one another.]
This means that we could make do with a probability distribution built from a far smaller trial count, by taking the mean and standard deviation for a combination of (w, N) and plugging them into the equation for a normal distribution.
If x is a data point in a series, μ the mean and σ the standard deviation, the normal distribution for that combination is defined as:

$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
This simple optimization technique effectively reduces the event detection pipeline's
execution time by almost 10 seconds on a regular laptop. Such a performance increase
is most welcome especially in the case of an interactive web application like Immersion.
In order to make the technique versatile enough to accommodate arbitrary values of w (the sliding-window length), it is possible to pre-compute many different combinations of (w, N) and store them in a database for faster access in the future. This, however, has not been implemented for the current prototype, since we are operating under the assumption that a window size of 14 days will be sufficient to detect significant events of interest. The 'learned probability distribution' approach is left for future work, when the event detection technique in this thesis will be added to the next version of Immersion's public release.
STEP 6: IDENTIFY SIGNIFICANT P-VALUE INTERVALS
This is a straightforward step in which intervals of statistical significance -- those with an associated p-value < α -- are filtered and returned as the result. Framed in the context of the implementation of this technique, this step yields an array of objects that describe the starting date and p-value of each interval. This array of objects is returned to the client, which then processes (using JavaScript) these intervals, sorting them by p-value to obtain a ranked list of significant events. Another set of values returned from the server to the client includes the mean μ and standard deviation σ for the (w, N) combination associated with the input time-series of length N.
Using the mean and standard deviation, it is possible to determine whether the significance of an event interval is due to a higher level of email correspondence during that period, or whether it is due to a dip in activity. In the former case it is necessary to correctly estimate the contributing factors (people and topics), whereas the latter case can only be reported as-is to the user, due to the unavailability of any further related data points for annotation.
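A minimal sketch of this filtering step under the normal approximation from the previous step, assuming a two-sided test; the α value shown is the one used in the developer view, and the function name and data layout are illustrative:

```python
from datetime import timedelta
from scipy.stats import norm

ALPHA = 1e-7  # p-value threshold used in the developer view

def significant_intervals(window_sums, start_dates, mu, sigma,
                          window_days=14, alpha=ALPHA):
    """Return intervals whose rank-sum is unusually high or low.

    `window_sums` are the rolling sums from Step 4, `start_dates` their start
    dates, and (mu, sigma) the normal approximation to the Monte Carlo
    distribution. Intervals with a two-sided p-value below `alpha` are kept.
    """
    events = []
    for start, s in zip(start_dates, window_sums):
        p = 2 * norm.sf(abs(s - mu) / sigma)  # two-sided tail probability
        if p < alpha:
            events.append({"start": start,
                           "end": start + timedelta(days=window_days),
                           "p_value": p})
    return sorted(events, key=lambda e: e["p_value"])
```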
STEP 7: IDENTIFY PEOPLE & TOPICS ASSOCIATED WITH EACH EVENT
INTERVAL
The in-memory database object on the browser contains helper functions to return the
top contacts and top topics for any given time-interval. These functions are called for
every significant time interval (now considered as significant events) to calculate the
related people and topics.
For each contact returned in the previous step, their individual email communication
volume time-series is sent to the server for detection of micro-events of significance
using the same event detection pipeline. As expected, the server returns a set of
significant events related to each contact and also the associated mean and standard
deviation for the (w,N) combination of each contact's time-series.
Armed with both the macro-level events detected from the overall communication volume time-series and the micro-level events for each top contact, it is now possible to estimate which contacts really made a given event interval significant.
This is done by computing the z-score associated with each contact for each significant event interval; if the z-score for any contact is above a threshold (estimated through trial-and-error), it is assumed that the individual played a significant contributing role in the importance of that event interval. The z-score is given by the following formula:

$$z = \frac{x - \mu}{\sigma}$$
Another helper function associated with the in-memory database in the browser returns the list of clusters of people the user has corresponded with during any given time interval. For any significant interval, this adds a higher level of information that can be provided to the user, ascribing the cause of the event's significance to specific groups of people rather than just individual contacts. This also helps with finer filtering of relevant topics for a given combination of time interval and set of contacts.
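A minimal sketch of this attribution step; the threshold value, data structures and function name are illustrative assumptions:

```python
Z_THRESHOLD = 2.0  # illustrative cut-off; the actual value was tuned by trial and error

def contributing_contacts(event_sum_by_contact, contact_params, threshold=Z_THRESHOLD):
    """Estimate which contacts drove a significant macro-level event.

    `event_sum_by_contact` maps each contact to the rank-sum of their individual
    time-series inside the event interval, and `contact_params` maps each contact
    to the (mu, sigma) of their own Monte Carlo distribution.
    """
    contributors = []
    for contact, x in event_sum_by_contact.items():
        mu, sigma = contact_params[contact]
        z = (x - mu) / sigma
        if z > threshold:
            contributors.append((contact, z))
    return sorted(contributors, key=lambda item: item[1], reverse=True)
```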
STEP 8: CONSOLIDATE OVERLAPPING EVENT INTERVALS
It is possible that the output of the previous stage includes event intervals that are very close to each other. In order to avoid duplicate representations of the same event, the characteristics of adjacent events are compared with each other and a distance metric is computed. If there is much in common (for example, if the people and topics characterizing the intervals under consideration are the same), then the two event intervals are assumed to describe the same event and are consolidated into a single event with an extended time-frame encompassing both intervals. This is a brute-force comparison performed between each pair of adjacent event intervals, and it has been observed to be not very computationally intensive when the number of significant events is fewer than 100.
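A minimal sketch of this consolidation pass, using the overlap of associated people as the distance metric; the overlap threshold and data shapes are illustrative assumptions:

```python
def consolidate_events(events, min_overlap=0.5):
    """Merge adjacent event intervals that likely describe the same event.

    `events` is a date-sorted list of dicts with 'start', 'end', 'people' and
    'topics' keys; adjacent events whose people sets overlap (Jaccard) by at
    least `min_overlap` and whose intervals touch are merged into one event.
    """
    merged = []
    for event in events:
        people, topics = set(event["people"]), set(event["topics"])
        if merged:
            prev = merged[-1]
            union = prev["people"] | people
            overlap = len(prev["people"] & people) / len(union) if union else 0.0
            if event["start"] <= prev["end"] and overlap >= min_overlap:
                # Same underlying event: extend the interval and pool metadata.
                prev["end"] = max(prev["end"], event["end"])
                prev["people"] |= people
                prev["topics"] |= topics
                continue
        merged.append({"start": event["start"], "end": event["end"],
                       "people": people, "topics": topics})
    return merged
```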
STEP 9: INCLUDE OBVIOUS EVENTS FROM INDIVIDUAL INTERACTIONS
Relying only on the macro-events detected from the overall communication volume time-series would mean that we risk losing relevant micro-events, such as the introduction to a contact (the date of the first email) who then went on to become a long-time email collaborator. Similarly, a significant dip in interaction volume with any contact can also signal an event of importance in a person's life.
By processing the individual communication volume time-series associated with every
relevant contact, we are able to detect micro-events that can add more granularity to
the set of significant events detected from a person's email dataset.
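As one example of such an "obvious" micro-event, the date of the first email with a long-term contact can be surfaced directly; the cut-off and function name below are illustrative assumptions:

```python
def first_contact_events(timestamps_by_contact, min_total_emails=50):
    """Emit a micro-event for the first email exchanged with each lasting contact.

    `timestamps_by_contact` maps a contact to a sorted list of email timestamps;
    contacts with at least `min_total_emails` exchanged are treated as lasting.
    """
    return [{"type": "first_email", "contact": contact, "date": stamps[0]}
            for contact, stamps in timestamps_by_contact.items()
            if len(stamps) >= min_total_emails]
```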
Results
Developer View
In the process of developing the Event Detection pipeline, it was necessary to build a user interface that supported authenticated user login, adjustment of the p-value threshold α, and a visualization of the resulting dataset of detected events. Figure 27 shows the developer-view UI, which proved to be quite useful in debugging and streamlining the results of the Event Detection pipeline based on email data provided by volunteers.
[Figure 27: Developer view showing the results of the event detection pipeline overlaid on a timeline of email activity; with a p-value threshold of 0.0000001, the pipeline yields 430 events, reduced to 44 relevant events after consolidation.]
The slider at the top allows the developer to change the value of α. The pipeline returns more events for a higher value of α, and the significant events detected are overlaid at the top of the horizontal timeline as colored ellipses. A red ellipse denotes an event due to low email activity, and a green ellipse denotes an event detected due to high email activity. Hovering over an ellipse reveals metadata about each event, such as the p-value associated with it, its rank relative to all other detected events, the people and topics associated with the event, and its distance from the mean of the distribution obtained from Monte Carlo sampling. The line graph denotes the rolling sum obtained from the Event Detection pipeline for the time-series corresponding to the overall email volume.
Packaging the Event Detection Technique
STORYLINE FEATURE
The event detection pipeline results in a filtered, ordered list of significant event intervals and their associated people and topics. In the context of Immersion, this result is packaged as a new feature called Storyline, which functions as an interactive timeline of a person's email life.
[Figure 28: The Storyline feature, showing a yearly summary (e.g. "During 2013, you corresponded with 134 new contacts, a 46% increase from the previous year... There were 8 significant events during this year, 3 of which were because of high email activity") and story-cards for each detected event.]
As the screenshot of the application shows (Figure 28), events are pinned to a vertical
scrollable timeline, and a verbal story is constructed around each event's parameters
using predefined sentence constructs. Each story-card can be flipped to reveal more
information about that event, including a direct link to the emails that meet the filter
criteria (people and topics) for that story's time interval.
[Figure 29: A story-card for a high-activity event, flipped to show the hyperlinked subjects of the emails exchanged during that period; the main collaborators during this time were Daniel Smilkov and César Hidalgo.]
Story-cards associated with events detected due to high email activity appear on the right side of the timeline, whereas story-cards for events due to low email activity appear on the left. This makes it possible to see at a glance whether a year was particularly high or low in terms of email activity. The story-card shown above (Figure 29) is from the Storyline based on my email data, and it has accurately detected an event associated with high activity - the launch of Immersion on June 30, 2014 - when I exchanged a lot of emails with my colleague Daniel Smilkov and my advisor César Hidalgo. The story-card is flipped to show the hyperlinked subjects of the emails we exchanged during that time period.
In the case of significant events detected due to low email communication volume, it is not very meaningful to provide contextual information such as associated people and topics. So a simple story-card notifying the user of this low-activity period, along with an option to select a reason for the inactivity, is provided. Such evaluative inputs from the user can be used for categorizing other low-activity events for the same user, or across many different users, using a machine learning method once enough data has been gathered to form a training dataset. The implementation of such a learning feature is outside the scope of this thesis and is planned as a future feature.
There is also a macro-level timeline on the left side (Figure 29) that
automatically highlights the centered story on the page, which the
user can use to jump between stories that are temporally far from
each other. This view can be filtered to show only events of one
year, or compress all years' events together into the same timeline.
[Figure 31: Macro-event timeline.]
ACCESS THROUGH A REST API
Apart from its integration into Immersion, the event detection pipeline is also available as a general-purpose API: people can make a RESTful POST request to a server, as long as the input data is in the prescribed format -- a JSON representation of an array of timestamps. The pipeline also supports ascribing 'weights' to each timestamp, which are taken into account when binning events at the 1-day level of resolution.
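As an illustration only (the endpoint URL and field names here are assumptions, since the thesis prescribes only a JSON array of timestamps and optional per-timestamp weights), a request to such an API might look like:

```python
import json
import urllib.request

ENDPOINT = "https://example.com/event-detection"  # hypothetical endpoint

payload = {
    "timestamps": [1370044800, 1370131200, 1372636800],  # Unix-epoch seconds
    "weights": [1, 1, 2],                                # optional
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    detected_events = json.loads(response.read())
    print(detected_events)
```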
Future Work
Revamping the Immersion platform
The upcoming version of Immersion introduces a host of new UI features, including the
Storyline feature, that have been informed through feedback obtained from people
who used the first version. Apart from new visualizations, we are also working on a
complete overhaul of its technical stack based on the lessons learnt from the previous
release.
One of the major additions from a data perspective is that we now also fetch data from the Subject field of emails. This opens up a host of new possibilities for data mining and visualization, from language analysis to a simpler frequency analysis of topics.
[Figure 32: Work-in-progress screenshot of the new Network view, showing topics and collaborators extracted from email metadata and subject lines.]
[Figure 33: New design of the Person view, showing conversation statistics, top topics and related contacts for a selected person.]
At the heart of the new version is a filter-based UI and data-querying mechanism that
enables people to quickly retrieve and visualize results matching a subset of input fields
such as time-range, people and topics. This feature has been engineered to achieve an
extremely reactive UI where a user can simply hover over a topic or a collaborator and
see all associated visualizations update immediately to reflect the data set filtered
based on the hovered item. Figures 32 and 33 show work-in-progress screenshots of the Network view and the Person view, both powered by the new filter-based querying engine.
Keeping in mind that we might want to open source parts of the project in the future,
code readability and modularity have been top priorities for the new version of
Immersion. For this reason, we shifted the client-side codebase from JavaScript to
CoffeeScript, since the latter is well known to be developer-friendly. The entire
technical stack of Immersion is now easily deployable by any person who has a basic
understanding of launching processes in a UNIX environment.
Optimizing the Event Detection pipeline
There are multiple ways to improve the efficiency and accuracy of the Event Detection
pipeline used in Immersion to detect significant events from email data.
In its current form, the pipeline supports time-series of any length, but re-computes the mean and standard deviation for every time-series. Assuming that we're working with a fixed window size w, this process could be greatly optimized by pre-computing the mean and standard deviation for different combinations of (w, N), where N is the length of the time-series, and storing these values in a database for faster access in the future.
One of the potential challenges with this approach is that it is difficult to predict the
length of an input time-series. However, this can be overcome by chopping up a long
time-series of length Y into smaller chunks, each with a maximum of length N. Each of
these chunks can then be passed to the Event Detection pipeline to be processed
individually, and their results (significant events detected) can be combined at the end.
It is important to note that the window size being considered is critical here since we
want each of these individual chunks to overlap by the same amount as the window size
so as to not miss out on detecting significant events around the points where the larger
parent time-series was chopped.
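A minimal sketch of this chunking scheme, in which consecutive chunks overlap by exactly one window length so that no boundary-straddling interval is missed (names and chunk length are illustrative assumptions):

```python
def chunk_with_overlap(series, chunk_length, window):
    """Split a long series into chunks of `chunk_length` that overlap by `window`."""
    assert chunk_length > window, "chunk must be longer than the sliding window"
    step = chunk_length - window
    chunks = []
    for start in range(0, max(len(series) - window, 1), step):
        chunks.append(series[start:start + chunk_length])
    return chunks
```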
In addition to this, the accuracy of calculating people and topics associated with an
event can be vastly improved if a TNA-based approach is also incorporated, as
suggested by Wan et al (Wan, Milios, Janson, & Kalyaniwalla, 2009). This would allow
for detection of changes in the network composition and structure even if the volume of
emails exchanged remains the same across consecutive time windows.
Automatic Classification of Detected Events
In the current implementation, the Storyline feature detects significant events and is
able to identify associated people and topics. However, it is not able to predict the
nature of the event. For example, some events could be ascribed to being on vacation,
release of a new project, planning an event, etc. The current interface allows people to
input the category for each event. Once we have a large enough corpus of category data associated with events, it can be used to train a machine learning algorithm to predict the nature of events without having to ask the user to fill that in. Of
course, it is certainly a good idea to give the user control to point out incorrectly
categorized events since a 100% success rate for predicting categories is difficult to
achieve using current computing infrastructure. But it would still add an element of
predictive surprise to the platform that people could potentially enjoy.
Connecting Storylines Across Users
The Storyline feature can detect people associated with each event for any given user.
A future version could support conversations centered on each event between people
who are associated with that event. Consequently, instead of only a single person
reflecting upon their email history, this would give people the ability to collectively
reflect on past events. A conversation could be between friends revisiting their memories of a favorite camping trip, or between colleagues at work discussing a project from their past.
This is admittedly one of my favorite future features for Immersion, since it has the potential to transform the platform from one that allows people to view the implicit social network created through their emails into a living social network in its own right, pre-populated with people's email data.
CONCLUSION
We are at a point in time where the devices and services we use on a daily basis leave
rich digital footprints from which much can be learnt about ourselves and the
communities that we interact with.
Through the project Immersion, this thesis presents a way for people to learn about
themselves and the communities they are part of based on their email history. Designed
as a tool for self-reflection, Immersion has served over a million users by now and the
author hopes that it continues to provide more people with new perspectives of their
email life.
Moreover, the technical foundation proposed in this thesis for event detection from
email data will hopefully inspire and spawn rich storytelling and biography authoring
platforms that bring people's stories to life. Democratizing the process of archiving and
sharing the stories of our lives will help us look back on our pasts and also hopefully
allow those who come after us to have a better understanding of what we did during
our time here.
BIBLIOGRAPHY
Cohen, W. (2009). Enron Email Dataset. Retrieved July 2014, from https://www.cs.cmu.edu/~./enron/
Esling, P., & Agon, C. (2012). Time-Series Data Mining. ACM Computing Surveys, 45.
Fischer, D., & Dourish, P. (2004). Social and Temporal Structures in Everyday Collaboration. CHI.
ACM.
Gershenfeld, N., & Weigend, A. (1993). The Future of Time Series. Addison-Wesley.
Goldstein, J. (2013, July 1). An MIT Project That Lets You Spy On Yourself. Retrieved from NPR: Planet Money: http://www.npr.org/blogs/money/2013/07/01/197632066/an-mit-project-that-lets-you-spy-on-yourself
Groden, C. (2013, July 5). This MIT Website Tracks Your Digital Footprint Through Gmail. Retrieved July 2014, from Time magazine: http://newsfeed.time.com/2013/07/05/this-mit-website-tracks-your-digital-footprint-through-gmail/
Holme, P., & Saramaki, J. (2013). Temporal Networks. Springer.
Keogh, E., Lin, J., & Fu, A. (2005). HOT SAX: efficiently finding the most unusual time series
subsequence. International Conference on Data Mining. IEEE.
Macro Connections. (2014). Pantheon - Methods. (MIT Media Lab) Retrieved July 2014, from
Pantheon - Mapping Historical Cultural Production: http://pantheon.media.mit.edu/methods
Meko, D. (2013). Detrending. Retrieved July 2014, from http://www.ltrr.arizona.edu/~dmeko/notes_7.pdf
Neill, D., Moore, A., Pereira, F., & Mitchell, T. (2005). Detecting significant multidimensional spatial clusters. Advances in Neural Information Processing Systems.
Preston, D., Protopapas, P., & Brodley, C. (2009). Event Discovery in Time Series. SIAM
International Conference on Data Mining.
Ratanamahatana, C. (2010). Mining Time Series Data. In Data Mining and Knowledge Discovery
Handbook (2nd edition ed.). Springer Science+Business Media.
Smilkov, D. (2014). Understanding email communication patterns. Master's Thesis,
Massachusetts Institute of Technology.
Viegas, F. (2005). Mountain. Retrieved June 2014, from Mountain: http://alumni.media.mit.edu/~fviegas/projects/mountain/index.htm
Viegas, F., Boyd, D., Nguyen, D., Potter, J., & Donath, J. (2004). Digital Artifacts for
Remembering and Storytelling: PostHistory and Social Network Fragments. 37th Hawaii
International Conference on System Sciences. IEEE.
Viégas, F., Golder, S., & Donath, J. (2006). Visualizing Email Content: Portraying Relationships
from Conversational Histories. CHI.
Vincent, J. (2013, July 8). MIT's 'Immersion' project reveals the power of metadata. Retrieved July 2014, from The Independent: http://www.independent.co.uk/life-style/gadgets-and-tech/mits-immersion-project-reveals-the-power-of-metadata-8695195.html
Wan, X., Milios, E., Janson, J., & Kalyaniwalla, N. (2009). Link-based Event Detection in Email
Communication Networks. SAC. Honolulu: ACM.
WikiLeaks. (2013, July 5). WikiLeaks - Twitter feed. Retrieved July 2014, from
https://twitter.com/wikileaks/status/353287879604174848