>> We're honored today to have Nathan Eagle visiting with us and sharing some very
exciting work and exciting data collection and analysis going on that he's leading
up. Nathan is a research scientist at M.I.T. and a post doctoral fellow at the Santa
Fe Institute. He's been applying machine learning and graph theoretic analysis to
massive data sets, particularly mobile phone and land line phone communications
data, to draw out insights about human behavior and related phenomena. For example,
disease processes he's been getting into these days.
Nathan was a Fulbright scholar in 2006, when he went off to Kenya and Ethiopia and
spent time working in the mobile phone space. In fact, developing educational
curricula on how to program mobile phones that are now being taught widely throughout
Africa.
He did his Bachelor's and Master's at Stanford, his Ph.D. at M.I.T., and he's working
on founding a kind of a new area of work we've been referring to as AID, artificial
intelligence, learning and reasoning for development, kind of an active sub-field,
probably one of the more sub-fields of IC for D, information and communication
technology -- actually, ICT for D, communication technology for development. With
that, we'll hear from Nathan today about the complex social systems from
communication data. From hundreds to thousands to millions to soon billions of
nodes and their links amongst one another. Thanks.
>> Nathan Eagle: Thanks, Eric. It's a pleasure being here. Maybe a little bit
more background about kind of where I'm coming from. So I joined, you know, Sandy
Pentland's wearable computing group with [unintelligible] and
others back in 2001. And back then, wearable computing really meant strapping a
computer literally to your back and occasionally you'd see guys with head mounted
displays on their heads and they'd be walking around M.I.T. and Boston capturing
really interesting things about themselves and about their environment. Sometimes
much to the amusement of people who were watching them walk around M.I.T. and Boston.
So there was at least a trajectory where I would basically
be strapping a computer to myself and collecting similar types of data. But around
2001-2002, this kind of converged with Nokia launching their first mainstream
programmable phone. So instead of going down that wearable computing route, I
started hacking on phones, and I've been basically programming phones ever since.
I finished up my Ph.D. and had started a small company that essentially connected
socially awkward singles in Manhattan together based on their mobile phones.
I felt like there was something more -- what I really wanted to do was build an
application that had a massive impact, and this impact just wasn't happening in the
States or Finland or Korea, for that matter. But in 2005-2006, things were really
kicking off in Africa. And so I was lucky enough to get a research scientist position
at M.I.T., but instead of having my office in Boston, I lived in a small village
out on the coast of Kenya, a village called Kilifi, and there I did a variety of
different things that Eric alluded to.
Starting this kind of mobile phone programming curriculum, originally at the
University of Nairobi, but it's now being taught in ten sub-Saharan African countries
and literally thousands of African computer scientists have gone through the program
and learned how to program phones, which has resulted in hundreds of applications
designed specifically for the African market, and quite a few startups based in
Nairobi and Addis Ababa and elsewhere.
One such startup is my own, a company based in Nairobi called TxtEagle, and it enables
mobile phone subscribers to earn small amounts of money on their phone, by completing
simple tasks for corporations who pay them in either air time or M-Pesa, which is
the local mobile money. This was really just kind of a pipe dream this time last
year, but it's now a working product across East Africa and the reason why it happened
so quickly is because I've got fairly strong connections with literally every mobile
phone operator in East Africa and dozens around the world. So those relationships
are going to be what enables a lot of the analysis that you see in the latter half of
this talk when I'm talking about scaling to millions and tens of millions. And soon,
I hope to have a data set that covers over a billion people by this time next year.
So that's a little bit about who I am, and now I'm going to kind of talk a little,
kind of overview of what this talk's going to go into, and it's going to be on scales
of magnitude. So we're going to start off with what I did in my doctoral research,
where we studied 100 people and were able to start inferring the social relationships
between people based on behavioral data.
Then I'm going to talk about how that study scaled to a thousand people, we did the
same thing, but now this is a thousand people working in two office buildings in
Helsinki. From a thousand people, I'll talk about a data set that we have collected
with a company, IMMI, where this is now ten thousand randomly sampled people
across America where again it's the same type of uber spyware that's installed on
their phones that I'll go into detail about.
From 10,000 to 100,000 criminals in Philadelphia, and talking about things like whether
or not a crime wave exists. And if a crime wave does exist, can we quantify it in
terms of its magnitude and its speed?
>>: [inaudible].
>> Nathan Eagle: This is what happens if you do a Google image search for a criminal,
and what the other image shows is Bill Gates and his photo. From 100,000 criminals,
I'm going to be talking about a million mobile phone subscribers in Rwanda. And
the questions we're discussing there are how does urbanization affect people's
social support groups and the dynamics of things like slums. As Eric alluded to,
looking at disease and patterns of disease and see if we can start seeing, is there
kind of a behavioral signature associated with an outbreak of a disease in a
particular region that gets reflected in the CDR, the call data records.
From 1 million, I'm going to go to the 10 million mobile phone subscribers in Kenya,
and this is a particular data set that I find fascinating, because it's not just
communication anymore. And it's not just people's movements but now in Kenya,
people can send and receive money on their phones, and so we now have this flow of
actually economic resources throughout the country that is fascinating to study.
From the 10 million Kenyans, we're going to go into 100 million people in the U.K.,
and this is now in collaboration with British Telecom and the civil service in the
U.K. We're coupling just about every phone call made in the country, both land line
and mobile, with things like socioeconomic status data, access to healthcare,
education levels, average income and trying to figure out, basically trying to
establish causality, like are there signatures associated with wealth that are
indicative of, again, within the call data records.
And then lastly, you know what does it mean when we start comparing this data across
cultures, across continents. What kind of new research questions can be addressed
and gets inevitably to the level of billions, which it will in the very near future.
Okay. So that's kind of an outline of the talk. For people who are familiar with
social network analysis, these two graphs, that one in particular should be
fairly -- should be, you know, fairly familiar. These are -- the graph on the top
was collected back in the '30s by two guys, Roethlisberger and Dickson, and this is
a graph where these are social networks so the nodes are people and you have an edge
in this case, on the top graph if you were seen playing cards with another person
in this particular electric company, which is being studied. And so that's one
social network.
And this is a social network that was recorded maybe 40 years later, back in the
'70s, by -- and this is a famous or infamous Zachary Karate Club data. And there's
been a lot of people, you know a lot of my friends and peers and people who are much
smarter than I am who have dedicated a large fraction of their lives to studying
this particular graph. The Zachary Karate Club data has been in literally hundreds
of physics journals, where people are trying to look at graph segmentation, trying
to pull out community structure. So social network analysis has kind of veered
into -- has veered well beyond the social sciences to statisticians to physicists, and
yet typically when you read these papers, they're really only for
kind of the -- the papers have algorithms that are only applicable for small and
static graphs, despite the fact that we're now inundated with behavioral data that's
many orders of magnitude larger than these types of graphs, originally from the
Internet, but now just from everyday life.
And so a lot of the projects that I'm going to be overviewing come from problems,
you know. They come from people who need help. And those people are, you know,
anyone from the police chief in Philadelphia, urban planners in Kigali,
epidemiologists in Kenya, the civil service in the U.K., all these individuals are
practitioners who suddenly found themselves inundated with data, and they're looking
to the academic community for help and are understandably a little bit confused why
there's so many brilliant people focused on a particular Karate Club from over 30
years ago when, in reality, the data that we have, you know, the data that these
guys have is just much richer and much, in my opinion, more interesting.
But the problem is it's not small, it's not static, it's not computationally
tractable. And so fundamentally, we need a new set of tools to deal with data that
we're currently -- well, that we currently have today.
Things like betweenness centrality, you know. These are metrics that -- these are
fairly sophisticated and interesting metrics that can characterize, in centrality,
for example, how many shortest paths go through a single node. But something like
betweenness centrality is more or less intractable for any of the type of data that
I work with. And so what I'm trying to push for is, like, coming up with these types
of analogs that can do better at analyzing what we have and what we need.
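[Illustrative sketch, not from the talk: to make the cost concrete, here is a minimal Python version of betweenness centrality using Brandes' algorithm, which the speaker does not name. For an unweighted graph it runs one breadth-first search per node, so roughly O(V*E) time overall. That is trivial for a karate club and impractical for graphs with tens of millions of nodes. The function name and the toy graph are assumptions for illustration.]

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph.

    `graph` maps each node to an iterable of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search from `source`, counting shortest paths.
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        paths = {v: 0 for v in graph}     # number of shortest paths from source
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Toy path graph a - b - c: only b sits on a shortest path between others.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
centrality = betweenness(toy)
```

On the toy graph only the middle node gets a non-zero score. The per-node search is what stops this from scaling to the call graphs discussed later in the talk.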
So the real focus here is going to be shifting from, like, just thinking about theory
and thinking about, you know, building a generative model that can mimic a particular
degree distribution and going towards really focusing on outcomes. So trying to
understand what is it the process that we want to get a better handle on? And
outcomes of interest for me are things like disease. They're things like traffic
jams. They're things like access to healthcare. And all these outcomes of interest
basically can be informed by and conditioned on the petascale data that we
currently have.
But there's -- it is fundamentally two different philosophies. I'm coming at it
from the outcomes perspective, rather than looking at the data for the data's sake.
And that's how I at least try to differentiate some of this particular work.
I also don't want to be misunderstood. A lot of the results here do come from
collaborations, a long series of collaborations with academics ranging from social
scientists to particle physicists, where we do try to validate theory and come up
with better theory about the dynamics of complex social systems.
But for me, I think the harder question here really is, you know, once you've got
all this data and once you fit a particular distribution, how can we start using
this data to improve people's lives? Like how can we make the society that wherever
this data's coming from, how can we make that society better? And that's kind of
going to be an underlying theme in a lot of the work that I'm going to be talking
about.
So with that as a long-winded introduction, let me start on the first order of
magnitude here. This is N equals 100. These are individuals who are subjects
in our original study, where we gave out phones that had more or less uber spyware
on them. It started on boot, it was invisible to the end user and it literally logged
everything. Logged communication, logged proximity, logged location.
And what you can do with this type of data is some fairly, well, interesting human
modeling. Given the fact that subject 104 is on their way to Central Square and
it's 3:00 p.m. on a Tuesday, can we predict where this person's going to have dinner?
Can we predict who this person is going to have dinner with? You know, given the
proximity patterns of S22 and S18, if they're proximate by the coffee machine at
the Media Lab every Tuesday afternoon, that corresponds to one type of relationship.
But if they're proximate in downtown Boston late Saturday night at 3:00 in the
morning, every single Saturday night, it's a very different type of relationship.
Lastly, looking at the aggregate behavior, you can start seeing things like the onset
of, for example, M.I.T. finals week, by people moving faster. Or we were capturing
data when the Red Sox won the World Series for the first time. And then suddenly,
all the models broke. Everyone became unpredictable. And what happened was the
majority of the subjects went into downtown Boston, and there's a big rally down
there. And the urban planners, they were interested in looking at how people
dispersed from that particular gathering point. How many people took the subway?
How many people walked over the bridge? How many people rode their bikes?
With this type of data, we can start answering questions about how people utilize
urban infrastructure. And so that's kind of a bit of a broad overview of this type
of data set. Is there a question?
>>: So you said that the spyware was invisible. So just as a general question for
all your data --
>> Nathan Eagle: It's a privacy question.
>>: Did the subjects know they were subjects?
>> Nathan Eagle: Oh, to do this study, we can tell who is sleeping with whom, right?
I mean, so there's a massive amount of intrusion in terms of privacy. You'd never
be able to do this, at least in an academic context, without having -- signing a
massive consent form. Absolutely.
>>: [inaudible] if they know that they're being noticed, they'll behave slightly
differently.
>> Nathan Eagle: People were -- and I was concerned about that as well. And if
you saw the first week of behavior, it does deviate significantly from the rest of
the behavior. But what you can see is that people basically kind of slip into a
routine, like basically follow a routine. Whether or not that's a routine that's
influenced by the fact that they know that they have spyware on their phones, it
very well might be. But I think that it's that first week where they get nervous
and then they forget about it.
>>: And I'm just curious to see if that first week scales to, like,
different -- [unintelligible] does that number change from --
>> Nathan Eagle: I think it really depends on the demographic. The next data set
I'm going to be talking about, these are individuals who work for a large Finnish
cell phone manufacturer and these guys, like they're so used to participating in
these types of studies, where, like, I think it would be even less.
But at the end of the day, it's hard to answer that question, for obvious reasons.
One of the things that I've been trying to push with very minimal success is this
idea of characterizing an individual or a demographic with this notion of entropy.
So this idea of how much structure is in their behavior, how much information, how
much randomness. So this idea of information entropy is how much information is in
a particular time series in this case.
So this is an example of an individual who is a low entropy person, meaning, so
the white parts are him at home, and he's at work, and then he's back home again.
His day-to-day structure is dictated by dropping off his kids and picking them up
to and from school.
Whereas this is an individual who actually was my office mate at the time, who is
a grad student, much -- he's just as likely to be in the office at 3:00 a.m. as 3:00
p.m., so you'd have to really squint to see much structure here.
So high entropy, low entropy. And it should be pretty obvious,
then, that when we're starting to parameterize individuals in terms of their
behavior, in terms of predicting what they're going to do next, we can do a lot better
on the low entropy people. Like you can see the patterns. You can figure out, you
can get a pretty good guess when he's going to show up to work the next day. Whereas
for the higher entropy, it's a bit trickier.
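[Illustrative sketch, not from the talk: the exact entropy definition used in the study is not given here. This minimal Python example assumes hourly location labels such as home and work (the speaker mentions hourly windows later on) and scores a person by the Shannon entropy of where they are at each hour across days, averaged over the hours. The function names and the two made-up schedules are assumptions.]

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_hourly_entropy(days):
    """Average over hours of the entropy of the person's location at that hour.

    `days` is a list of days, each a list of one location label per hour.
    """
    hours = len(days[0])
    per_hour = [shannon_entropy([day[h] for day in days]) for h in range(hours)]
    return sum(per_hour) / hours


# Hypothetical subject with the same schedule every day.
regular = [["home"] * 8 + ["work"] * 9 + ["home"] * 7 for _ in range(4)]

# Hypothetical subject equally likely to be at home or at work at any hour.
irregular = [
    ["home"] * 24,
    ["work"] * 24,
    ["home"] * 12 + ["work"] * 12,
    ["work"] * 12 + ["home"] * 12,
]

low = mean_hourly_entropy(regular)
high = mean_hourly_entropy(irregular)
```

Under these assumptions the fixed schedule scores zero bits and the coin-flip schedule scores one bit per hour. That matches the intuition that you can guess when the first person shows up to work but not the second.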
Then you can start characterizing demographics this way too. We gave out phones
to these 18-year-olds who just arrived on campus, you know. The business school
students, admin staff, and it's, you know, and again it's not particularly
surprising, but it's the freshmen that are the most entropic.
We also had one other subject that I ended up throwing out, and I'll talk about him
a little bit more in detail later but he was off the charts in terms of entropy.
He was well beyond 50, and he was -- this is Marvin Minsky. And I like basically,
I threw him out because, like, he was messing up everything and I figured that he
would be -- he was such an outlier that it didn't matter.
I'll talk a little bit later about how we're discovering people, these individuals
across different demographics who are extremely, very entropic. And it makes me
kind of reconsider throwing him out of the data set.
>>: So there are different ways to define location. It would be visits, place of
type X within time Y and T and.
>> Nathan Eagle: That's true, yeah.
>>: And exact same place within a larger time prize [inaudible] to define sameness,
you get different [inaudible] kinds of analyses.
>> Nathan Eagle: That's totally true. And in this case, this was broken down on
hourly windows. And place, in this context, is things like home, work, elsewhere,
that kind of thing. But depending upon how you structure it, you can get I'm sure
radically different entropy scores.
But the point originally was simply to get a single metric and try to quantify how
much structure is in someone's life. And then once you do that, one of the things
that you can start doing is trying to model a particular, you know, every single
day. And in this case what we're doing is you can think of every day in a subject's
life as a single point in a really high dimensional space, right.
And the point -- one of the issues is that we don't live our lives as random number
generators, but rather these points, they're not randomly distributed through the
high dimensional space, but they're clustered, and so if you just do a simple
dimensionality reduction on this, you can actually start pulling out what the
principal components are. We're calling these eigenbehaviors, because my advisor
at the time, you know, created this eigenface literature, and so this is using
more or less the identical technique but trying to quantify these salient behaviors
associated with individuals' lives.
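[Illustrative sketch, not from the talk: a minimal pure-Python example of pulling the first principal component out of a matrix of days. It assumes each day is a vector of 0/1 values (1 meaning "at work" in that time slot), which is an encoding chosen here for illustration, and it uses power iteration as a stand-in for a full eigendecomposition of the covariance matrix. All names and the toy data are assumptions.]

```python
import math


def covariance_matrix(days):
    """Sample covariance between time slots for a days x slots matrix."""
    n, d = len(days), len(days[0])
    mean = [sum(day[j] for day in days) / n for j in range(d)]
    centered = [[day[j] - mean[j] for j in range(d)] for day in days]
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    d = len(matrix)
    v = [1.0 + 0.01 * i for i in range(d)]  # arbitrary non-zero start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Toy week with six time slots per day: five work days, two days at home.
work_day = [0, 1, 1, 1, 1, 0]
home_day = [0, 0, 0, 0, 0, 0]
days = [work_day] * 5 + [home_day] * 2

mean_day, cov = covariance_matrix(days)
first_eigenbehavior = top_eigenvector(cov)
```

In this toy week the only way days differ is "went to work or not", so the first component loads equally on the four working slots and not at all on the others.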
>>: So the PCA was on an individual's overall breakdown?
>> Nathan Eagle: Yeah. So basically, we get a matrix of these -- this time series,
and we're just taking the eigendecomposition of the covariance matrix.
>>: You're doing a separate decomposition for every person?
>> Nathan Eagle: That's correct. Two slides later, I'll show we change that.
>>: Okay.
>> Nathan Eagle: Yeah. But, you know, I just wanted -- this is something that kind
of pulls out some obvious things where you can start looking at how many eigenbehaviors
you need. How many of these vectors do you need to recreate someone's
day. And so for low entropy subjects, you don't need very many. Whereas for higher
entropy subjects, you do, to get the same type of accuracy. But then, to zoom out a
bit, originally, this was just for a single, you know, each day was a particular
point.
Now you can start characterizing individuals as a single point in a very high dimensional
space. And then characterize demographics. This is kind of a toy example. In this
case we're trying to quantify the Sloan Business School behavior space so collapsing
all the business school students down and trying to quantify their behavior on, in
this case, a 2D. So two vectors. So what we would do is try to figure out whether
someone's a business school student by measuring their Euclidean distance between
that space and where they lie, and we got something like over 90% accuracy in terms
of differentiating business school students.
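[Illustrative sketch, not from the talk: one way to read "distance between that space and where they lie" is the length of what is left over after projecting a person's behavior vector onto the group's behavior space. This minimal Python example assumes the group's space is given as a mean vector plus orthonormal basis vectors, and the decision threshold is a made-up parameter rather than a value from the study.]

```python
import math


def distance_to_subspace(x, mean, basis):
    """Euclidean distance from x to the affine space mean + span(basis).

    The basis vectors are assumed to be orthonormal.
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]
    residual = list(centered)
    for b in basis:
        coeff = sum(c * bi for c, bi in zip(centered, b))
        residual = [r - coeff * bi for r, bi in zip(residual, b)]
    return math.sqrt(sum(r * r for r in residual))


def looks_like_group(x, mean, basis, threshold):
    """Hypothetical decision rule: close enough to the group's space."""
    return distance_to_subspace(x, mean, basis) <= threshold


# Toy 3-D behavior vectors and a 2-D group space (the first two axes).
group_mean = [0.0, 0.0, 0.0]
group_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

inside = distance_to_subspace([3.0, 4.0, 0.0], group_mean, group_basis)
outside = distance_to_subspace([1.0, 1.0, 2.0], group_mean, group_basis)
```

A vector that lies in the group's space has distance zero, and one that sticks out along the third axis has a distance equal to that offset.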
So that's looking at kind of on an individual level and now I'm going to talk about
dyads, pairs of people. This is a graph here of basically proximity over the course
of a day, and this is the friendship graph.
So the research challenge is trying to figure out, how do you winnow down the edges
in this particular graph to get at what the underlying friendship graph is?
And I'll be kind of approaching this problem from a lot of angles, whether this is,
you know, proximity in this case via Bluetooth, or the operators have a wealth of
communication data. So trying to figure out -- so what the operators have is
this massive communication graph, and what they want to get at is a social network.
So trying to look at a social network through the lens of proximity or through the
lens of communication and get at what the underlying relationships are is an
important question.
What we've been able to get, though, is much richer data than what the operators have.
I mean, we can think about a graph that has multidimensional edges. You know, one
type of edge could be things like whether you're proximate on Saturday nights.
Another type of edge could be are you proximate at that person's house? Another
edge would be communication. So we've got this graph, this social network with lots
of different types of edges. And the challenge here is trying to infer where are
the friends? Can we infer that topology of the friendship graph, based on all these
higher dimensional edges?
And it turns out you can pretty trivially. I mean, this is -- my collaborator likes
to call this the relationship EKG. There's probably a better name for it. But I
haven't come up with it. This is probability of proximity at M.I.T. in this case,
and then off campus. And there's three different relationship types, right. If
I'm friends with Mike or if I say I'm friends with Mike and Mike says he's friends
with me, that's reciprocal. That's green. If I say I'm friends with Mike, but Mike
says he's not friends with me, that's an asymmetric relationship.
And then, of course, there's the symmetric non-friends. And what we're finding is
that these asymmetric friendships are generally when you're really proximate a lot
at work. Meaning if you're working with Mike a lot, you may assume that you guys
are friends, but, you know, Mike may not think that you are.
But if you look at proximity on Saturday nights, for example, you get a much more,
you know, it's, you know, if you're proximate on Saturday nights off campus,
especially, you're much more likely to be friends.
And so we can get something like a 96% accuracy in terms of inferring the edges of
the friendship network. But I think we can do a little
bit better, because the survey traditionally was just, you know, mark off the people
you're friends with, yes or no.
What we can recreate is an actual weight associated with the relationship where that
weight is characterized by, you know, how much time you spent having dinner with
that person at that person's house, having lunch, traveling with the person,
communicating over the phone, et cetera.
And so I think we can get a much, kind of richer depiction of what the true
relationship is. And what's nice about -- I mean, this is going to be coming out
in PNAS, I think in a month and a half. What's nice about this particular technique
is that we're able to kind of scale traditional social network analysis to a much
larger group of people, potentially. Meaning like, so we had to kind of get ground
truth and so we had to have people go through a survey where they read all 100 names
off and had to check the names that they are friends with.
That obviously just won't scale as you want to increase your sample size. Like
they're not going to -- you're not going to be able to get people to do that for
a thousand people or 10,000 people or the 100 million people I'll be talking about
at the end. And so you need another system to be able to kind of quantify what the
social network is without having to deal with these types of surveys.
All right. So now from 100 M.I.T. students, we're going to 1,000 individuals who
are living in Helsinki. And what we've done here is we've basically put the same
uber spyware on all of their phones, and we did one more really obnoxious thing.
Every 20 calls they make, we have a little survey that pops up and asks them to label
the relationship. Is this person a friend? Is this person an office worker,
acquaintance, colleague, no relationship?
This is the won't disclose pattern. So this is just the temporal distribution of
when individuals make calls to what type of relationship. So we not only have the
temporal, but we also have the location where those calls were made. Yeah?
>>: It's all making calls, not getting calls?
>> Nathan Eagle: This is making calls, yeah. We had the survey only go when you
make a call.
But again, this kind of leads us into this problem that the operators have, right.
Where they've got all this communication data, and they really want to get at
relationship. And the operators have come up with sneaky ways of getting at it.
Right, you've heard of like the friends and family plan, or the fave five or whatever
T-Mobile is pushing. Like they're trying to get access to this type of data
from you. And what we're trying to do is build a little classifier to see if there
are kind of discrete signatures associated with these types of relationships.
So from a thousand Finns, now this is 10,000 randomly sampled individuals across
America. And the real key component here is that this is a random sample. This is a
collaboration with IMMI. What IMMI did is the original Reality Mining study, so
that same uber spyware on people's phones, but they used one additional sensor that I
wasn't allowed to use due to COUHES constraints, as well as the fact that I don't
think anyone would sign up for it, or I didn't think so at the time. The sensor
is the microphone. They're turning the microphone on on these phones for ten seconds
every minute and recording the ambient audio, pushing it back to a server.
So have you guys heard of that -- there's these companies where you hold up the phone
to the radio, and you get a text message back about what song is being played.
So the guy who started that company, he did his Ph.D., finished it up about ten years
ago. He started a couple companies that were basically doing this acoustic
matching.
You know, both in Europe and here in the States. This is his latest company. And
he's found that this -- I think he just finished closing a $50 million round from
Draper. Nielsen is trying to hostilely acquire them, like, they're a hot commodity. And
the reason why they're a hot commodity is because his algorithm not only lets you figure
out what song is being played, but what advertising you're being exposed to.
So you can actually get a much better sense of whether, you know, what media people
are consuming. What TV show you're watching, what movie you're watching, what radio
station you're listening to in the car. And what's the really cool thing, I
think -- well, one of the really cool things is that they're able to show kind of
marketing efficacy. So if you watch the co-branded Burger King Simpsons ad, you're
actually much more likely to go into the Simpsons movie. Like they can
actually -- they can make these types of claims now and they can make them because
this is 10,000 randomly sampled individuals. And one of the major critiques with
the earlier work I talked about was, okay, so you've parameterized the lives of 100 M.I.T.
students. So what? Like how does that scale to the population at large?
Suddenly, we're able to start talking now about, well, basically about the patterns
of, you know, much more generalizable patterns about demographics. Not just M.I.T.
freshmen, but we can say now that the most entropic group that we've found in the
10,000 people turn out to be women under 30 who are making more than $60,000 and
are college educated. These are now claims that we can make because we have a random
sample.
So from 10,000 randomly sampled individuals across America, this is now 100,000
criminals. So a red car corresponds to a carjacking. The little circle corresponds
to a graffiti event. The martini glass is public drunkenness. The -- let's see,
the RX symbol is a drug bust. So what we're trying to do here, the research question
is, you know, is crime contagious? Is there such a thing as a crime wave?
Originally when I wrote the little script that would plot these things as a time
series, I was hoping to start seeing kind of, you know, spread over something like
a lattice, but it's clearly a bit more noisy than anticipated, and so this
is a project going on with a student of mine and an epidemiologist from UPenn, trying
to get at answering this question.
But there's been, in the past, a lot of literature on things like is obesity
contagious? Is smoking contagious? And using a similar technique, actually, in
the -- I think it was the New England Journal of Medicine recently, there's a paper
where they showed that height was contagious and acne was contagious.
So you have to be really careful when you're talking about contagion over a social
network when there's things such as homophily, this idea of birds of a feather flocking
together. Like disambiguating, like what is homophily and what is an actual
contagion. For me, it was a lot harder than I anticipated.
>>: [inaudible].
>> Nathan Eagle: So anyways, that's my hundred thousand.
And now I'm going to launch into CDR. So CDR stands for call data records. It's
the data that mobile operators capture about their subscribers. And just to make
sure that everyone's on the same page, this data is far richer than just a
communication graph. I mean, while we've got things like the caller and the
receiver, we've got the time, the duration of the call, the cell tower that's
associated with the call, and in this data set we've got four years of data and this
is data from the only telecommunication company in the country.
So this represents every phone call made in the country over four years. So what
that enables you to do is ask some really interesting questions about, you know,
how society in general is evolving over time. We've got things like Me2U air time
transfer so you can actually send and receive air time on these handsets, and that's
being done on average about four times a month per subscriber.
So not only do you have these multidimensional edges, where edges could be
sending a text message, making a phone call, sending air time, but you've also got
a bunch of attributes for the nodes. You know, we know what phones people are using.
We know what region they're calling. We know scratch card denomination.
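[Illustrative sketch, not from the talk: a minimal Python picture of what a call data record and a graph with multidimensional edges could look like. The field names, the three event kinds and the sample IDs are assumptions for illustration, not the operator's actual schema.]

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    caller: str       # anonymized subscriber ID
    receiver: str     # anonymized subscriber ID
    timestamp: int    # seconds since the epoch
    duration_s: int   # 0 for texts and air time transfers
    cell_tower: str   # tower that handled the event
    kind: str         # "call", "sms" or "airtime"


def build_multi_edge_graph(records):
    """Map each (caller, receiver) pair to a count of events per kind."""
    edges = defaultdict(lambda: defaultdict(int))
    for r in records:
        edges[(r.caller, r.receiver)][r.kind] += 1
    return edges


# Made-up records between three anonymized subscribers.
records = [
    Record("A", "B", 1000, 60, "tower-1", "call"),
    Record("A", "B", 2000, 30, "tower-1", "call"),
    Record("A", "B", 3000, 0, "tower-2", "sms"),
    Record("B", "C", 4000, 0, "tower-3", "airtime"),
]
graph = build_multi_edge_graph(records)
```

Each pair of subscribers ends up with one counter per edge type, so a call edge, a text edge and an air time edge can be weighted or filtered separately.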
Eric and I were talking about this earlier today. So all these markets or at least
most of the markets that I am involved in are generally prepaid. So most of the
world has prepaid markets, where you buy a scratch card at -- you know, and you can
buy scratch cards all over Rwanda and most places on earth now where -- for air time.
And so you can buy -- in Rwanda's case, you can buy 25 cents worth of air time all
the way up to cards that have a $25 denomination. So what we've been doing is looking
at how this denomination is a proxy for socioeconomic status, basically coupling
that with things like the census data that we have
from Rwanda and understanding what the distribution of wealth is in the country and
how these calling patterns are reflective of that.
The other thing I wanted to focus on before I move forward is the fact -- is the
privacy implications again. So these are individuals now who did not sign a consent
form. And originally, I thought this was going to be a real challenge to get it
through COUHES, which is the IRB at M.I.T.
But there's two things that made it a lot easier. One thing was that
I have no phone numbers. I have anonymized I.D.s that are associated with
a SIM card. And secondly, I'm not doing any data collection. This data already
exists, sitting on a server, actually sitting on a bunch of servers underground
in Kigali at the moment. So with those two caveats, this type of research becomes
viable in an academic sense.
So one of the students who is working with me at SFI right now is writing his
thesis -- well, he's getting distracted in a lot of projects, but his thesis is
about the structure of society and the stability of society: looking at this
data over a long period of time, trying to identify cliques that form and cliques
that dissolve. Are there particular characteristics of groups of
people that are correlated with the stability of their relationships?
Like, can we identify relationships that are going to be persistent? Can we identify
groups of people that are basically going to stay a group for a long period of time?
A group is defined as individuals who communicate a lot with each other.
So that's -- we've made less progress on that than I wanted to, but actually it was
inspired by a Nature paper Laszlo Barabasi wrote maybe three or four years ago, where
he was basically asking similar types of questions, but getting at it in a different
way. He was using data, I think, from on the order of about a year, but now with
four years of data and literally every phone call in the country, I think we can
do a pretty good job at this.
>>: [inaudible] does that represent?
>> Nathan Eagle: So when we started -- well, when the data started back in January
of 2005, it was 200,000 people in Rwanda had cell phones. By the time the data ended,
it was 1.4 million, which was about 15% of the country at the moment. And that number
is -- I mean, Africa is the fastest growing mobile phone market in the world. That
number is rising almost exponentially, which is exciting.
>>: I would think it was a much higher percentage of family unit level.
>> Nathan Eagle: Yeah. So if you look at number of phones per household, it's well
over 50%.
All right. So one of the reasons why I kind of keep focusing on the fact that this
is every phone call in the country is that, you know, in previous work, in Laszlo's
as well, they have been working with operators that have less than 100% market
share, generally less than 30% market share. And so we have a paper now, basically
showing what the implications are when you have that kind of sampling.
So if you have all the data from a single operator, and that operator has 20% of
market share, what it turns out is you only have 4% of the edges. And less than
a percent of the triangles. So when you start wanting to kind of build something
that diffuses over that graph, you get radically different dynamics than if you had
the full graph. And Sune Lehmann over at Barabasi's lab, he's been a big advocate
of the fact that we need to get away from this idea of discrete community structure.
You know, when people talk about community structure and graphs, generally we assume
that a node can only belong to one community.
That seems a bit silly if you think about it. If you think about your own
social networks, you're affiliated with your college community and
then your Microsoft community and your high school community, your family
community, so we've got a lot of different communities. But it turns out if you
sample a social network down to something like 20%, you do get this kind of discrete
community topology. But as you increase your sample size, you get overlapping
communities. You get a densification of the graph, and so that's kind of what we're
showing here: you know, the number of memberships increases as you increase sample
size.
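The 4% and under-1% figures quoted above are what a simple independence argument would give. This is an assumption made here for illustration, not necessarily how the paper derives them: if each subscriber is observed with probability p, an edge survives only when both endpoints are observed (p squared) and a triangle only when all three are (p cubed). The function names and the toy random graph below are hypothetical.

```python
import random
from itertools import combinations


def expected_retention(p):
    """Expected fraction of edges and triangles kept when each node is
    observed independently with probability p (illustrative assumption)."""
    return p ** 2, p ** 3


def simulate_edge_retention(n_nodes, edge_prob, p, seed=0):
    """Monte Carlo check on a small made-up random graph: keep each node
    with probability p and report the fraction of edges that survive."""
    rng = random.Random(seed)
    nodes = range(n_nodes)
    edges = [(u, v) for u, v in combinations(nodes, 2) if rng.random() < edge_prob]
    kept_nodes = {u for u in nodes if rng.random() < p}
    kept_edges = [(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes]
    return len(kept_edges) / len(edges)


# A 20% market share under the independence assumption.
edge_frac, tri_frac = expected_retention(0.20)
print(f"edges kept: {edge_frac:.1%}, triangles kept: {tri_frac:.1%}")
```

Because triangles fall off with the cube of the sampling rate, anything that depends on clustering, such as community overlap, degrades much faster than the node count suggests.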
What we were able to do is recreate all of the previous results that have been, you
know, looking at processes that diffuse over a social network that have been
previously published, at least when those authors were looking at a 20% graph. And
we could recreate those dynamics when we sampled our graph at 20%. But then as we
increased the sampling size, those dynamics, they changed quite a bit.
And so we've got a paper now in submission talking about how those
dynamics change and the fact that when you want to think about constraining an
epidemic, initially we focus on looking at the super spreaders, which are generally
the hubs. But what we found is actually membership in a large number of communities
is four times more important than degree. Meaning it's more important how many
memberships you have with these different communities than simply the fact that
you've got a high degree.
All right. So that's --
>>: [inaudible].
>> Nathan Eagle: How something spreads over a graph.
>>: Meaning what in this case?
>> Nathan Eagle: Okay. In this case, you can think of that as information. I mean,
right now, we just used a simple SI model, susceptible infected model. This
shouldn't be confused with disease, per se, but this is -- you know, people have
been using social graphs to -- as a proxy for how disease would diffuse.
>>: But just to talk details [inaudible] what are you doing, are you saying given
this graph, how could I go from one place in the graph to another?
>> Nathan Eagle: No. What we do is we look at how many people you ultimately infect.
So we run the simulation thousands of times. We take one individual out of the 1.4
million, we infect them and we watch this thing spread over a long
period of time.
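The simulation just described can be sketched roughly as follows. This is a minimal susceptible-infected model under stated assumptions: the per-contact transmission probability `beta`, the number of steps, and the toy adjacency list are all hypothetical, and the talk does not give the actual parameters.

```python
import random


def run_si(adjacency, seed_node, beta, steps, rng):
    """Susceptible-infected spread with no recovery. At each step every
    infected node infects each susceptible neighbour with probability beta.
    Infection can only travel along an existing edge."""
    infected = {seed_node}
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbour in adjacency[node]:
                if neighbour not in infected and rng.random() < beta:
                    newly_infected.add(neighbour)
        infected |= newly_infected
    return infected


def mean_outbreak_size(adjacency, beta=0.1, steps=20, runs=1000, seed=0):
    """Average final outbreak size over many runs, each seeded at one
    randomly chosen individual."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    total = 0
    for _ in range(runs):
        total += len(run_si(adjacency, rng.choice(nodes), beta, steps, rng))
    return total / runs


# Toy graph (made up): two triangles joined by a bridge, plus an isolated node.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5],
       4: [3, 5], 5: [3, 4], 6: []}
print(mean_outbreak_size(toy, beta=0.3, steps=10, runs=200))
```

Ranking nodes by the outbreak size they seed is one way to compare the effect of degree against the number of community memberships.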
>>: [Inaudible] standard contagion model and applying to the graph and [inaudible]
as [inaudible] for a probability of [inaudible] with or without the arc in place?
>> Nathan Eagle: Yes, right. Okay. So that's -- and that's basically just
talking about the implications of sampling.
>>: But that model doesn't allow for a jump without an arc?
>> Nathan Eagle: Yeah, you have to be connected in order to disseminate something
in that case.
>>: You have to be connected?
>> Nathan Eagle: Yeah, you have to have an edge.
>>: I can imagine noisier models where you can [inaudible] ambience without the
arc or --
>> Nathan Eagle: That's true.
>>: You look at the arc it's noisy. But when the arc is there, boom [inaudible].
>> Nathan Eagle: Yeah, there's all sorts of -- I mean, what we wanted to do was
do the absolute dead simplest model to show the effect of membership versus degree.
As you increase the levels of sophistication, you increase the realism but then you
add a bunch of things that may make your point less clear. In this case, this is
now another --
>>: Or clearer.
>> Nathan Eagle: Or potentially clearer, but with more variables. So more
susceptible to getting shot down by reviewers.
So another study that we wanted to -- or another question we wanted to ask was about
the effect of cities, and in this case, this is -- okay. So I should spend a little
bit of time explaining this graph.
This is -- these are all individuals who moved from a rural area into the capital, and
they moved at zero. So they're spending their time in the rural area here, and then
they move to the capital, and then the back movers, the individuals who end up moving
back, are characterized by the solid line.
And so you can start seeing, so this is their degree. So
this is the number of people that they call in the rural area. And as they move
to the capital, it goes down. But then the back movers, they move back after about
six months and then establish the same rural degree.
And the same kind of goes for -- this is their capital degree, so it goes up when they
move there, but then they never really fully integrate and then move
back and then it goes low again. Whereas the people who permanently move
are this dashed line here. For moving from a rural region to the capital, this is your
capital degree. So it's more or less kind of static and then it jumps and keeps
kind of just going up.
So the idea here is, you know, how do cities affect individuals? One notion is this
hypothesis called differential selection, meaning that you behave
differently -- individuals in the rural area who are behaving like an urban person,
the more they behave like that urban person, the more they kind of get pulled into
the capital.
So that is -- that's kind of the theory of differential selection. They're
different from their home environment and that's what's driving them to move. And
you can actually see, you know, the individuals who do move are different. This
is kind of the average rural degree. And the people who do move
have less integration, have fewer contacts within that rural area.
>>: So this is behavior described as the way they communicate with each other?
>> Nathan Eagle: This is purely looking at the degree. So is -- I mean, which is
kind of a proxy for behavior, but this is the number of people these individuals
called both in their home environment.
>>: [Inaudible] pattern and divvy those up into rural and urban and say if you've
got a communication pattern -- you're rural, but you've got a communication pattern
that's closer to the --
>> Nathan Eagle: Urban, then you're more likely to leave. And that's kind of the
theory of the differential selection. The other theory that has been in the
literature for decades is this behavioral adaptation. So meaning you move to the
city and suddenly then you behave more like an individual who is in the city.
So what we're trying to do is try to figure out how much is differential selection,
how much is behavioral adaptation and it's still very much a work in progress, but
we're seeing clear signs of both. And then the other thing that has -- we're making
progress on, but I think we're not quite there yet, is this idea of integration.
So looking at how long does it take for individuals to integrate into these different
communities, both these urban communities and these rural communities. And we're
finding that it's a lot easier for people to integrate into an urban environment
than a rural environment.
And then there's other fun things about rural environments, where you may have
lesser degree, but the tie strength is much greater. People in rural areas, they
don't have as many contacts, but their relationships are stronger, as measured by
call volume. So there's some interesting dynamics going on in urban versus rural,
and I've got another student at SFI now who is working on building a generative
model for this urban growth. She's looking at Kibera, which is the largest
slum in the world, just outside of Nairobi, and what we're able to do is look at
how this slum is changing over time. We're inferring tribe by looking at what
regions of Kenya individuals living in this area are calling. And you can start
seeing the dynamics of how -- of the kind of turnover of these different tribes moving
in and out of these neighborhoods.
And the hope here is we can start parameterizing this in a way that we can quantify
how the slum is growing over time and with which type of inhabitants. And this is
important for people doing urban planning. I mean, these guys, they need to figure
out where to put the next latrines, where to put the next pieces of infrastructure,
where is the slum going to go next. These slums, in these developing world
countries, are very organic, they're ad hoc. They grow in very much an unsupervised
manner.
But I think they -- and this is very much work in progress, but I think that it's
looking like they have particular patterns in their dynamics that can be quantified.
And if they can be quantified, we can get some insight into really what's going on.
>>: So is that data or that stuff, that's also coming from the Rwanda data?
>> Nathan Eagle: This is coming from, yeah, Rwanda and Kenya. Actually, this slum
is in Kenya, which is the ten million. I jumped ahead. So we're working with the city
planners in Kigali. Kigali has its own set of slums, but it's not the world's
biggest slum, so we're basically applying these models there, but we're using Kibera as
well right now.
>>: Okay.
>> Nathan Eagle: So I hate the term universal law of human anything, but this is
kind of -- well, this is the name of the grant that we got so it's now the title
of the slide. But the idea is looking at quantifying how mobility models scale over
different people from different cultures in different continents.
And so what Marta did earlier, last year, was basically parameterize a population's
mobility in terms of a radius of gyration. It's kind of using a standard gravity model.
She's able to find kind of the distribution of these radii. And what we're doing
now is looking at what this distribution is across a variety of different regions,
both in America, Latin America, South America, Africa, Europe, and seeing, you
know, what holds as kind of commonality and what is different across these different
regions.
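For readers unfamiliar with the term, a radius of gyration can be computed from the locations at which a subscriber was observed. The sketch below assumes, purely for illustration, that each call record has already been mapped to planar tower coordinates in kilometres; the function name and the sample points are hypothetical, and the gravity-model part of the analysis is not shown.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of the observed positions from their
    centre of mass. `points` is a list of (x_km, y_km) tuples, one per
    call record, so frequently used towers count more heavily."""
    n = len(points)
    centre_x = sum(x for x, _ in points) / n
    centre_y = sum(y for _, y in points) / n
    mean_sq = sum((x - centre_x) ** 2 + (y - centre_y) ** 2
                  for x, y in points) / n
    return math.sqrt(mean_sq)


# Made-up example: a subscriber seen equally often at two towers 2 km apart.
print(radius_of_gyration([(0.0, 0.0), (2.0, 0.0)]))
```

Computing this per subscriber and plotting the distribution across a population gives the kind of curve that can then be compared between regions.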
In terms of diffusion, when we talk about CDR data, we
not only have the communication data, but we also have this product adoption. So
what products people are adopting over time, and especially in places like Kenya,
like Rwanda, like the Dominican Republic, where there hasn't been as much mass
marketing, simply because people live in more rural areas and are less connected.
How people hear about a product, you know, basically
comes from their peers. And so what we've found -- and what's striking is this holds
true both in the U.K. as well as in Kenya, as well as in the DR. If you want to
start quantifying the probability that an individual adopts a particular product,
you know, you have a triad, so you have three people. A and B both adopted and
you want to know what the probability of C adopting is. It's significantly more.
Meaning the probability of C adopting goes up significantly when the triad is
closed, when A and B are friends. When A and B are friends, they exert more mutual
influence over C than if they're not connected. And it's striking how that result
seems to hold again across a wide range of markets and across a wide range of products.
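The comparison described here amounts to estimating two conditional probabilities: the chance that C adopts given that A and B have both adopted, split by whether A and B are themselves connected. A minimal sketch follows, assuming the triads have already been extracted into (closed, c_adopted) pairs. The input layout and the toy numbers are made up for illustration and are not results from the talk.

```python
def adoption_rate_by_closure(triads):
    """Given (closed, c_adopted) pairs for triads in which A and B have
    both adopted, return C's adoption rate for closed and open triads."""
    counts = {True: [0, 0], False: [0, 0]}  # closed? -> [adopters, total]
    for closed, c_adopted in triads:
        counts[closed][0] += int(c_adopted)
        counts[closed][1] += 1
    return {closed: (adopters / total if total else float("nan"))
            for closed, (adopters, total) in counts.items()}


# Hypothetical toy data: 4 closed triads and 4 open triads.
toy_triads = [(True, True), (True, True), (True, True), (True, False),
              (False, True), (False, False), (False, False), (False, False)]
rates = adoption_rate_by_closure(toy_triads)
print(rates)
```

A real analysis would also need to control for timing (A and B adopting before C) and for homophily, which this sketch does not attempt.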
This is now kind of work that's veering towards development economics. But I think
it should scale to a wide range of events. These are the implications of what
happened in an earthquake back in February of 2008. And what you're seeing here,
I mean, so the first thing is, you know, you type in Rwanda earthquake and you see
this massive spike and you find out that there was an earthquake in Nyamasheke region
on February 3rd.
And then you can start plotting what -- this is the behavior that's coming from
cell towers in this particular region. And on the Y axis, this is outgoing minus
incoming calls. So the net outgoing minus the net incoming. And you can see,
basically, the cell towers were -- the time series started about here and you
generally have more outgoing than incoming until this earthquake happened. And then
suddenly things changed dramatically. This is a plot that literally, we generated
about five days ago.
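The quantity on the Y axis, outgoing minus incoming calls for the towers in a region, is straightforward to compute from call records. The sketch below assumes a hypothetical record layout of (day, caller_tower, receiver_tower); the tower names and the sample calls are made up.

```python
from collections import defaultdict


def net_outgoing_by_day(calls, region_towers):
    """Return {day: outgoing - incoming} for calls touching the given
    set of towers. A call that starts and ends inside the region counts
    once as outgoing and once as incoming, so it nets to zero."""
    net = defaultdict(int)
    for day, caller_tower, receiver_tower in calls:
        if caller_tower in region_towers:
            net[day] += 1
        if receiver_tower in region_towers:
            net[day] -= 1
    return dict(net)


# Made-up records: towers "A" and "B" are in the region, "X" is outside.
sample_calls = [(1, "A", "X"), (1, "A", "X"), (1, "X", "A"),
                (2, "X", "A"), (2, "X", "B"), (2, "A", "B")]
series = net_outgoing_by_day(sample_calls, {"A", "B"})
print(series)
```

A sudden sign change in this series, such as the region switching from net outgoing to net incoming, is the kind of shift the plot shows around the earthquake.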
So there's no more kind of analysis beyond what you're seeing. But I think what
it points to is an ability to start -- ability to do things that Eric and I have
spent the morning talking about, being able to do surveillance, whether it's disease
surveillance or flooding, trying to detect flooding or crisis or market collapse.
We have this data on how people are communicating, how people are traveling, how
people are sharing air time, and the hope is that we can -- go to the next slide,
we can start parameterizing people's reactions to these catastrophes in a way that
we can use it for surveillance. You know, can we do something analogous to that
Google -- you know, that Google Flu Trends paper.
So I liked the analogy that we could do something similar with CDR. And instead
of people typing in search queries about sicknesses, we can identify key behaviors
that are indicative of an outbreak. You know, signatures that are associated with
people who are getting sick or who have recently been flooded or recently been exposed
to something like an earthquake. Do you have a question?
>>: So you touched upon that triad, that if it's a closed triad, it affects the
decision making of the third person. Have you applied that to the moving data that
you had, like who affects -- who is the influencer in where people move from rural
areas to the city?
>> Nathan Eagle: No, we haven't. That's a good idea.
>>: [inaudible] in finding, like, something to the effect of who influences things
in social graphs. That's a very recent study.
>> Nathan Eagle: Right now, we're just looking at aggregate behavior. We're not
trying to pull out individuals who are influencing more than others.
>>: Okay.
>> Nathan Eagle: But probably, at some point, we will do that. We're still
just trying to get our head around what the aggregate is doing
first.
And so anyway, so there's a lot of questions that can be addressed, whether it's,
you know, trying to identify -- so I'm working with a group of epidemiologists at
Imperial, and what they're trying to do is quantify how mobility
affects malaria, so how people's movements around East Africa change the spread of
that particular disease.
And, you know, and with Eric, we've been talking about trying to quantify people's
reactions to something like cholera. Like when there's a cholera outbreak, and
there have been dozens of cholera outbreaks during the time period where we're
looking at this data. Can we start characterizing how a region responds to something
like that. And ideally, can we characterize the onset of a cholera outbreak? Can
we see the events leading up to a potential disease outbreak? If we can quantify
that, then we can do some real good.
>>: [inaudible].
>> Nathan Eagle: Yeah. So if you can close the loop, if you can identify when a
disaster is about to happen, you know, there's a lot of organizations, both
governmental and nongovernmental, that would be very eager to help out.
Josh and I, so Josh is a student at Berkeley. He just arrived in Kigali today, and
we're conducting a phone study, where we're surveying people and asking them
a variety of both socioeconomic questions, but also about, you know, what their
livelihood is dependent on and whether or not they've been involved in an economic
crash of some sort or, like, been touched that way with the idea of trying to figure
out how behavior is correlated with things like market prices.
All right. So now from a million Rwandans, we've got essentially the same data for
Kenya. Now it's 10 million mobile phone subscribers in Kenya, but we've got an
additional edge type. Instead of just communication and air time sharing, we've
got actually how people are sending and receiving money across these different
regions. It's a product called M-Pesa, which is kind of mobile money in Swahili.
And what M-Pesa has enabled is a variety of really interesting applications that
lie on top of it. So I mentioned earlier this company, TxtEagle, based in Nairobi,
and what it simply does is enables people to earn small amounts of money on their
phone by completing really simple tasks for corporations who pay them generally in
air time or M-Pesa.
Tasks are things like transcription, translation, image tagging, surveys. We've
got -- well, I mean, right now we've got something like 15 million subscribers who
now have phones where they can start earning money. And the question is, coming
up with enough tasks to meet the demand. I mean, we soft launched in February, and
with a series of translation tasks, and we had to shut the service down after
something like six hours because we ran out of [unintelligible] tasks. We
introduced the service to some taxicab drivers and some high school students, and
just within hours, we had thousands of users.
Like there's a huge demand for being able to start making money on your phone by
completing work. And it's now just a question of figuring out how to get enough
tasks.
And I mean, it's pretty exciting. I mean, and so if people in this room have an
idea about how crowd sourcing can, you know, what particular paid tasks can be done
on a handset, and it's important to think that these are not kind of smart phones.
This is a handset from -- the handset that you had ten years ago. So initially,
we deployed via SMS, but now we've gone to USSD, which is -- I mean, so SMS is to
USSD what e-mail is to Telnet. So USSD is kind of this persistent session-based
GSM protocol that is appropriate -- that works on literally every single one of the
four billion GSM phones on the planet today. So it's a fantastic opportunity, I
think, and it's a great protocol. But the problem with USSD is that you have to
be an operator in order to deploy USSD service. And so this is done very, you know,
this is basically done in house with these operators.
So that's -- and then I kind of, I put mechanism design. So this is an example of
trying to figure out what the kind of incentive and reward structure should be so
that you can get this work done. And when we initially launched, I think we were
probably paying people too much. People were earning upwards of three dollars an
hour, and that was -- I think, you know, when we redo this again, I think really
what we're going to try to do is cap the daily amount you can
earn at maybe a dollar or even less. So that we get a wider selection of people
who are completing these tasks.
So from those, you know, 10 million East Africans, this is now 100 million individuals
in the U.K. This is a graph, actually, it's a social network now of 250 million unique
phone numbers, 12 billion edges, and we've got things like the nodes are
characterized by region, by product adoption and the edges have a time and duration
associated with them.
This represents virtually every land line phone call
and about 80% of the mobile phones for the month of August of 2005. So it's a very
large graph.
And I'm at the Santa Fe Institute now, and so one of the job requirements is to show
straight lines on log-log plots. And this is a particularly good straight line.
This shows calls made for, in this case, this is August 19th, 2005, and what you're
seeing is 10 million people in the U.K. made one call. 1 million people made ten
calls, 100,000 people made 100 calls. And it just keeps going down and down. And
the thing that kind of -- you know, one of the things that initially gets pointed
out is like who are these people who are making, you know, a hundred thousand phone
calls on that particular day? And this seems to be true across these different data
sets. Like this is -- it's a true power law all the way out to the tail and these
are clearly not human.
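The three points quoted above (10 million people making 1 call, 1 million making 10, 100,000 making 100) lie on a line of slope -1 in log-log space. A minimal least-squares sketch of that check follows; the function name is hypothetical, and ordinary least squares on log-transformed counts is used here only because it is the simplest illustration, not because it is the best estimator of a power-law exponent.

```python
import math


def loglog_slope(xs, ys):
    """Ordinary least-squares fit of log10(y) = slope * log10(x) + intercept."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    den = sum((a - mean_x) ** 2 for a in log_x)
    slope = num / den
    return slope, mean_y - slope * mean_x


# Figures as quoted in the talk: calls made versus number of people.
calls_made = [1, 10, 100]
people = [10_000_000, 1_000_000, 100_000]
slope, intercept = loglog_slope(calls_made, people)
print(slope, intercept)
```

Repeating the fit separately for each day would show the day-of-week variation in the exponent that comes up next.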
And actually, we have a little -- we're trying to build a little classifier that
tries to pull out, you know, machine-related properties. So if you behave like
a machine, we're not going to count you anymore.
So this is the unfiltered graph. And kind of the thing that's striking for me, I
think, is just the fact that out of the behavior of hundreds of millions of
idiosyncratic individuals, every single day, this is a straight line. But the
interesting thing is the slope of the line, the exponent of the power law changes
with the day of the week. So you have like a Monday, Tuesday, Wednesday, Thursday,
Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday,
Sunday. So that's calls out, and you have a mirrored exponent for calls received.
So there's something fundamental about what's going on with the dynamics there that
I think is interesting.
We've been doing a little bit of work just looking at how distance affects probability
of communication. So this idea of propinquity is this notion, the fact that
you're much more likely to be friends with people who are proximate to you. And
what we've done is kind of similar to a PNAS paper by David Liben-Nowell where he
took a LiveJournal data set and looked at the probabilities that individuals link
based on zip codes. And what we've been able to do is recreate that
distribution that he found for LiveJournal with the U.K., and it looks like Rwanda
and the DR follow a similar distribution as well.
>>: [inaudible] messenger data the month of June [inaudible] communications. We
did an additional study, how often do you talk to somebody, given the fact that you
talked with them before, and it goes up with this.
>> Nathan Eagle: Goes up with this?
>>: You see people hopefully [inaudible] talk to people far away.
>> Nathan Eagle: So your low weight ties are your proximate ones?
>>: Right.
>> Nathan Eagle: We should do that too. So in any case, we're seeing similar -- I
mean, that's a neat finding, right, whether it's instant
messenger data, whether it's LiveJournal data, whether it's communication data from
England, from the middle of a village in Rwanda. Getting these same types of
distributions repeatedly across different types of communication medium,
across very different types of demographics. I think there's something interesting
there.
But it gets more interesting once we have an outcome of interest. So
the outcome of interest for this slide is the Index of Multiple Deprivation, which
is an aggregate score that's associated with access to healthcare, education levels
and average income, and this is for the U.K. And so what we're doing here is we're
testing Granovetter's hypothesis, the strength of weak ties. Right, so Granovetter many
years ago theorized that it's going to be that acquaintance that's going to get
you your next job. It's not your best friend who is going to be able to kind of
push you forward socioeconomically, but rather it's the weaker ties. And what we've
done is we've looked at diversity.
So again, diversity now is -- well, in this case, this is a Shannon entropy metric,
but we've also used a Simpson metric for diversity. It's more or less the same, an
idea of how much entropy is in an egocentric graph, based on the geography that
they're calling. And what's striking is that socioeconomic status, it's really not
that correlated with your degree or your call volume. In other words, it's not the
number of friends you have or how much time you spend on the phone, but rather it's
the diversity score. It's whether you have contacts in a wide range of areas.
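The two diversity metrics mentioned can be sketched as below. The input is assumed, for illustration, to be the list of regions a subscriber's calls went to, one entry per call. Normalising the Shannon entropy by the log of the number of distinct regions contacted is an assumption made here so that scores fall between 0 and 1; the talk does not say how the score was normalised.

```python
import math
from collections import Counter


def shannon_diversity(called_regions, normalize=True):
    """Shannon entropy of the distribution of calls over regions.
    With normalize=True the entropy is divided by log(k), where k is the
    number of distinct regions contacted (an illustrative choice)."""
    counts = Counter(called_regions)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    if normalize and len(counts) > 1:
        entropy /= math.log(len(counts))
    return entropy


def simpson_diversity(called_regions):
    """One minus the probability that two calls drawn at random (with
    replacement) go to the same region."""
    counts = Counter(called_regions)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Made-up examples: an insular caller versus an evenly spread one.
insular = ["Leeds"] * 8
spread = ["Leeds", "London", "Cardiff", "Glasgow"] * 2
print(shannon_diversity(insular), shannon_diversity(spread))
print(simpson_diversity(insular), simpson_diversity(spread))
```

Both scores are zero for a caller whose contacts are all in one place and rise as calls spread evenly over more regions, independent of degree or call volume.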
So this is a pretty amazingly strong correlation especially in the social sciences
where we're explaining more than half of the variance in a group of people that
numbers 50 million.
So I think this is a major result. But the controversy right now is figuring out what's
really driving this, because there's another theory from
Burt about structural holes and constraints, and I could spend a little bit of time
talking about that. But basically, essentially, it's just if your friends are
friends, so if you have an insular group, what Burt theorized was that's going to
constrain you. That's going to mean that it's harder for you to change
and that constraint is going to affect your socioeconomic status.
Whereas if you have a lot of holes in your system, meaning if you have a lot of friends
in a lot of different places, then basically you're going to do -- you're going to
be better off. And so you can see how there's parallels there with diversity. And
so trying to figure out what's really the driving force is still an open question.
But at the end of the day, we're able to start validating these types of theories
because we have this amazing -- these amazing data sets. And it's not just all the
communication graph, but it's also -- through the U.K. civil service, they've done
very detailed studies about the socioeconomic status of all these regions in the
U.K., and so it's really, when combining these two things, where we can get some
interesting results.
Yeah?
>>: [Inaudible] services on top of that to address these [inaudible] based on this
data?
>> Nathan Eagle: That would be the ultimate goal. To date, no. And one of the
reasons is I can't really advocate for them to do that, because what we haven't
done is talked about causality. What we've only done -- well, we talked about
correlation. If I can establish causality, then I can actually make a case to the
government saying, like, look, this is a particular policy you need to start
adopting. We need you to, you know, create these longer edged ties across the nation
and improve everyone's socioeconomic status.
I can't make that case to the government yet, because I don't know whether or
not -- which direction it goes. My guess is it's going both ways, right. So people
who have higher, you know, who have higher diversity, they're going to be exposed
to more opportunities and so that's going to drive socioeconomic status up. But
the wealthy individuals, by the nature of being wealthy, will have these longer ties
as well. So trying to figure out which way causality works is very much an open
question.
Okay. And then I'm going to wrap it up now, I mean, with the billions. I'm
working with, like I said, dozens of mobile phone operators from around the world,
and it allows us to ask a variety of really interesting questions, whether it's
looking at the dynamics of relations between nation states, looking at how culture
affects things, disease outbreak warnings, et cetera.
So to wrap it up, I mean, I've talked about where N equals 100, where I'm starting to
infer the friendship graph based on these behavioral signatures.
When N is a thousand, trying to figure out what kind of relationships people have,
friends, office workers, colleagues, based on their communication behavior. N
equals 10,000, this is the random sample, 10,000 individuals across America, and
learning, basically, what random sampling can do for us. So the
hundred thousand criminals in Philly, trying to figure out whether or not there is
such a thing as a crime wave and whether or not crime is contagious.
To a million people where we're looking at, you know, how urbanization
affects people's social support networks, trying to understand how disease and disease
dynamics interplay with things like mobility and communication. 10 million people,
looking at how resources flow through the network, economic resources, and then this
platform TxtEagle that enables people to earn money on their phone. Then a hundred
million, trying to validate for, I believe, the first time this theory that
Granovetter had about whether or not these weak ties really are strong economically.
And then into the billions, where it's possible to start comparing across cultures
and across continents and it is striking the similarities that we're initially
finding. So this study with Marta, it seems that people in the Dominican
Republic move in similar patterns to the way people move around Rwanda, which bears
striking parallels to how people move around San Francisco and L.A. and Belgium.
There's particular types of information that spread over greater London with exactly
the same dynamics as it spreads through a village in Kenya. So we're at a time now
where we can almost -- well, I'm still pushing a little bit against it, but we can
almost start talking about universal laws. Universal laws of human behavior.
But I think kind of the bigger question, the -- in fact, like kind of the ultimate
open question here is really an engineering question, like once we've got all this
data, once we've got all these insights, once we've built this model of a system,
how can we use these insights and use this data to actively improve people's lives?
Because I'm as big a fan as anyone of plotting the aggregate behavior of 100 million
people on a log-log plot and fitting a distribution and then raising up your arms
and yelling "eureka" and claiming that you've found a new law of human behavior.
That's the easy part. I think, fundamentally, the harder part is trying to figure
out a way that we can use this data to better both the lives
of the billions of people who are generating this data continuously and the societies
in which they live.
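As a rough illustration of the "fit a distribution on a log-log plot" step mentioned above, here is a minimal sketch in Python. It is not the speaker's method: it assumes the quantity of interest follows a continuous power law above a chosen cutoff `xmin`, and it uses the standard maximum-likelihood estimate of the exponent rather than a straight-line fit to the plot. The function names and the synthetic data are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(values, xmin):
    """Estimate the exponent alpha of p(x) ~ x^(-alpha) for x >= xmin.

    Uses the continuous maximum-likelihood estimator
    alpha = 1 + n / sum(ln(x_i / xmin)) over the observations >= xmin.
    """
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0:
        raise ValueError("all observations equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, xmin, n, rng):
    """Draw n synthetic power-law samples by inverse-transform sampling."""
    # rng.random() is in [0, 1), so (1 - u) is in (0, 1] and never zero.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for something like "calls per person" (an assumption).
    data = sample_powerlaw(alpha=2.5, xmin=1.0, n=20000, rng=rng)
    print("estimated exponent:", powerlaw_alpha_mle(data, xmin=1.0))
```

Fitting the exponent is the "easy part" in the sense of the talk; the sketch says nothing about whether a power law is the right model for a given data set, which would need a separate goodness-of-fit check.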
And for that, it's no longer the social scientists who want to validate theories
or physicists who want to start messing with the degree distributions. But now is
the time for the engineers. Now is the time for the applications and I think we
have the potential to really make a difference here. This could be extremely high
impact work. And so I think it's an exciting -- it's an exciting space to play in.
And I'm playing in it with a variety of people, and I'd welcome the chance to continue to add
to this list if other people are interested.
So with that, I'll open it up for questions.
Thanks.
[applause]
>>: Should I give out my cell phone number?
Questions? CDR land here.
>> Nathan Eagle: You can call me on my cell phone. Yeah.
>>: Any questions?
>>: [inaudible] realtime or are you always getting chunks of data?
>> Nathan Eagle: Getting realtime data is possible, but I haven't -- to justify
getting realtime data, we'd have to do kind of -- we'd have to propose something
like what Eric was talking about earlier, interventions. At the moment, I can't
make the case that I have any intervention or even the ability to detect something
that the operator or the government would care about. But when we do, there's no
technical reason why that's impossible. Like we should -- and that would be the
ultimate goal, to build a little filter on the realtime CDR data to start flagging
regions that look like they might have just had a disease outbreak.
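To make the idea of "a little filter on the realtime CDR" concrete, here is a minimal sketch in Python. It is purely illustrative and not something described in the talk: it assumes the call records have already been aggregated into one number per region per time step (for example a daily call count or a mobility measure), and it flags a region when the new value sits far from that region's trailing baseline. The class name, the window sizes and the z-score threshold are all hypothetical choices.

```python
from collections import defaultdict, deque
import statistics


class RegionAnomalyFilter:
    """Flag regions whose latest value deviates from a trailing baseline."""

    def __init__(self, window=14, threshold=3.0, min_history=7):
        self.threshold = threshold      # z-score needed to raise a flag
        self.min_history = min_history  # observations required before flagging
        # One bounded history of past values per region.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, region, value):
        """Record one aggregated value; return True if it looks anomalous."""
        past = self.history[region]
        flagged = False
        if len(past) >= self.min_history:
            mean = statistics.fmean(past)
            std = statistics.pstdev(past)
            # A perfectly flat history (std == 0) never raises a flag here.
            if std > 0 and abs(value - mean) / std > self.threshold:
                flagged = True
        if not flagged:
            # Keep flagged values out of the baseline so one outbreak
            # does not become the new "normal".
            past.append(value)
        return flagged


if __name__ == "__main__":
    f = RegionAnomalyFilter()
    for v in [100, 102, 98, 101, 99, 100, 103, 101, 40]:
        print(v, f.observe("region-A", v))
```

A deviation flagged this way is only a prompt for someone to look closer; whether a drop in activity reflects illness, a holiday or a tower outage is not something the filter can decide.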
>>: [inaudible] uploads.
>>: So are they going to take the same models and apply them to other ways of communicating,
like instant messenger, Facebook messages and Twitter exchanges? But if that's a
realtime field that's open, and you can just grab data out of that, because a lot
of these models should be applicable regardless of the mode of communication.
That's the relationship in every graph. You should be able to relate something in
one to see how it applies to the other one [inaudible].
>> Nathan Eagle: I'd like -- to be honest, I've got more data, I'm drowning in data.
And, you know, earlier in my life, I was a data junkie, like there's never enough
data. I always wanted more. And I'm kind of reconsidering that position and I'm
trying to be more kind of focused on questions that you want to address rather than
just trying to build the biggest social network that has ever existed.
But, you know, that question appeals to me as well.
So yeah, I mean, I'd love to
be able to couple different types of data sets together, especially if you can
legitimately couple them. You know, you know the hashed ID of the phone number, you
know the hashed ID of the messenger account. You know the hashed ID of the bank
account. So I'm collaborating with people who have a bunch of Bank of America data.
So there's a lot of privacy implications, obviously, associated with, you know,
coupling these different data sets.
>>: So I stepped out for a minute [inaudible] when you said you started working
with Kenya as well, some of the mobile [inaudible]; is that right?
>> Nathan Eagle: Um-hmm.
>>: So are you seeing sort of the same kind of relationships with the call
structures? Are you getting data with M-Pesa? I'm just curious about, sort of,
is there symmetry or similarity in the networks between the call structure stuff
and the m-banking kind of relationship?
>> Nathan Eagle: Like this is early stages. Like I would -- I can tell you kind
of what things like the degree distribution looks like for these different networks.
And also, the air time sharing. I mean, there's kind of a lot of different -- there's
SMS, there's communication, there's air time and there's M-Pesa. And as you go down
that line, you get a sparser and sparser graph. And typically, it seems to be
correlated with volume. So if there's a lot of volume between a dyad, you're much
more likely to have the M-Pesa transaction. But we're in early stages and I really
can't -- I'm not comfortable saying more than that, until we really kind of stare
at the data more.
>>: Okay.
>>: So I was wondering if you started doing anything in terms of mass intervention,
like providing people with access to information about their own behavior on the
web or on the phone and --
>> Nathan Eagle: That would be the hope. I mean, the first thing is trying to
establish a piece of information that would be relevant to these individuals. And
I think something like the fact that there's, you know, a potential disease outbreak
would be a pretty good one. The other intervention, which is less of an experiment
and more of a commercial entity, is this idea of improving people's socioeconomic
status by providing them work. So giving them income based on them doing these tasks
and then seeing how that changes their behavior over time. I mean, that's nice
because, I mean, that really -- I mean, we've got the 15 million people in East Africa
who are participating in that. So that's about as big of an intervention as I can
imagine.
>>: What sort of restrictions do you have to work under when you're working with these
data sets? I mean, do you have to go to these locations and work in these locations?
You can take the data --
>> Nathan Eagle: Well, taking the data out is generally okay under major
constraints. But typically, I mean, a lot of this data, I've spent weeks, literally
weeks in underground basements in Kigali, in Nairobi, and it's a surreal experience,
actually. You're in this giant server farm that's just -- because you're on the
equator and you're being blasted by air conditioning and you take, like, the elevator
up, and you walk out the door and you're in the middle of Africa.
But yeah, to deal with that, like to get access to this data, I have to work very
intimately with the operators. And the operators are interested, because, you know,
we're providing them services, like TxtEagle. And then also insight into the
dynamics of their subscriber base.
At the end of the day, they don't have the computational resources, nor the human
resources to analyze this type of data. And so that's kind of the value proposition
there. And so I can take it out. I have to -- it has to, for obvious reasons, be
encrypted. There can't be any phone numbers, anything that can identify an
individual. And I want that for my own protection as well as for the privacy
protection. Like the last thing I want is to be indicted and have this data, you
know, come out and be going into a court case or something. So it needs to be
de-identified in a very rigorous way. And some operators are more -- are more
paranoid about that than other operators. But for me, that's kind of a mandatory
thing.
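One common way to strip phone numbers while keeping the call graph intact is to replace each number with a keyed hash. The sketch below, in Python, is an assumption about how that could look, not a description of what the operators or the speaker actually do. The record fields (`caller`, `callee`, `timestamp`, `duration_s`) and the function names are hypothetical. A secret key is used because phone numbers come from a small, enumerable space, so a plain unkeyed hash could be reversed by hashing every possible number.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of a call record with both phone numbers replaced.

    The same number always maps to the same pseudonym under a given key,
    so the structure of the communication graph is preserved.
    """
    cleaned = dict(record)
    cleaned["caller"] = pseudonymize(record["caller"], secret_key)
    cleaned["callee"] = pseudonymize(record["callee"], secret_key)
    return cleaned


if __name__ == "__main__":
    key = b"keep-this-key-with-the-operator"  # hypothetical; never ship it with the data
    raw = {"caller": "+254700000001", "callee": "+254700000002",
           "timestamp": "2009-01-01T12:00:00", "duration_s": 42}
    print(deidentify_record(raw, key))
```

Keyed hashing only removes the direct identifiers. It does not by itself rule out re-identification from patterns in the remaining data, so it is a first step rather than a complete de-identification scheme.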
>>: Like taking out the phone numbers is one thing. But you have a lot of
information. Even if you can't -- like if you have someone's phone number, clearly
you can identify them pretty easily. But with the type of data you have, you could
probably identify a lot of these people even without the phone number.
>> Nathan Eagle: That's possible. I mean, like I hope not. I have to make the
case to the IRB that that's not possible. But I think that's, you know, like, you
know, I mean, it's -- I don't know.
>>: A grad student proving it is possible.
>> Nathan Eagle: We were talking at lunch about the Climer paper, looking at topology
alone, you can start cracking some of the stuff. To be honest, I'm less worried
about those attacks, because those types of attacks, they let you find yourself and
maybe figure out who your friends are friends with. But it doesn't go beyond that.
So my real concern is making sure that, like -- and this goes to the fact of just
being able to -- I mean, ideally, I'd like to release some of this data to the academic
community. I mean, the biggest contribution of my Ph.D. was not these little
generative models of M.I.T. behavior, but rather it was releasing that data set
online. And it's generated literally hundreds of journal publications thousands
of times. It's been a pretty big contribution. Like having this type of data,
getting it out there in a way that's ethical would, I think, be a phenomenal
contribution, but the problem is trying to figure out a way to do that while
preserving privacy. And at the moment, I don't know the right answer to how to do
that.
>>: What sort of volume in terms of terabytes are you dealing with in the largest data
sets here?
>> Nathan Eagle: The largest data sets, I mean, these things compress phenomenally
well. But uncompressed, we're talking maybe a half a petabyte.
>>: Okay.
[applause]
Thanks very much.