
>> We're honored today to have Nathan Eagle visiting with us and sharing some very exciting work and exciting data collection and analysis that he's leading up. Nathan is a research scientist at M.I.T. and a postdoctoral fellow at the Santa Fe Institute. He's been applying machine learning and graph-theoretic analysis to massive data sets, particularly mobile phone and land line phone communications data, to gain insights about human behavior and related phenomena. For example, disease processes, which he's been getting into these days. Nathan was a Fulbright scholar in 2006, when he went off to Kenya and Ethiopia and spent time working in the mobile phone space. In fact, developing educational curricula on how to program mobile phones that are now being taught widely throughout Africa. He did his Bachelor's and Master's at Stanford, his Ph.D. at M.I.T., and he's working on founding a kind of a new area of work we've been referring to as AI-D, artificial intelligence, learning and reasoning for development, kind of an active sub-field, probably one of the more active sub-fields of IC for D, information and communication technology -- actually, ICT for D, information and communication technology for development. With that, we'll hear from Nathan today about complex social systems from communication data. From hundreds to thousands to millions to soon billions of nodes and their links amongst one another. Thanks. >> Nathan Eagle: Thanks, Eric. It's a pleasure being here. Maybe a little bit more background about kind of where I'm coming from. So I was -- so I joined the, you know, Sandy Pentland, the wearable computing group with [unintelligible] and others back in 2001. And back then, wearable computing really meant strapping a computer literally to your back, and occasionally you'd see guys with head mounted displays on their heads and they'd be walking around M.I.T. and Boston capturing really interesting things about themselves and about their environment. 
Sometimes much to the amusement of people who were watching them walk around M.I.T. and Boston. So I was -- there was some -- there was at least a trajectory where I would basically be strapping a computer to myself and collecting similar types of data. But around 2001-2002, this kind of converged with Nokia launching their first mainstream programmable phone. So instead of going down that wearable computing route, I started hacking on phones, and I've been basically programming phones ever since. I finished up my Ph.D. and had started a small company that essentially connected socially awkward singles in Manhattan together based on their mobile phones. I felt like there was something more -- what I really wanted to do was build an application that had a massive impact, and this impact just wasn't happening in the states or Finland or Korea, for that matter. But in 2005-2006, things were really kicking off in Africa. And so I was lucky enough to get a research scientist position at M.I.T., but instead of having my office in Boston, I lived in a small village out on the coast of Kenya, a village called Kilifi, and there I did a variety of different things that Eric alluded to. Starting this kind of mobile phone programming curriculum, originally at the University of Nairobi, but it's now being taught in ten sub-Saharan African countries and literally thousands of African computer scientists have gone through the program and learned how to program phones, which has resulted in hundreds of applications designed specifically for the African market, and quite a few startups based in Nairobi and Addis Ababa and elsewhere. One such startup is my own, a company based in Nairobi called TxtEagle, and it enables mobile phone subscribers to earn small amounts of money on their phone by completing simple tasks for corporations who pay them in either air time or M-Pesa, which is the local mobile money. 
This was just really kind of a pipe dream this time last year, but it's now a working product across East Africa, and the reason why it happened so quickly is because I've got fairly strong connections with literally every mobile phone operator in East Africa and dozens around the world. So those relationships are going to be what enables a lot of the analysis that you see in the latter half of this talk, when I'm talking about scaling to millions and tens of millions. And soon, I hope to have a data set that involves over a billion people by this time next year. So that's a little bit about who I am, and now I'm going to give kind of an overview of what this talk's going to go into, and it's going to be on scales of magnitude. So we're going to start off with what I did in my doctoral research, where we studied 100 people and were able to start inferring the social relationships between people based on behavioral data. Then I'm going to talk about how that study scaled to a thousand people. We did the same thing, but now this is a thousand people working in two office buildings in Helsinki. From a thousand people, I'll talk about a data set that we have collected with a company, IMMI, where this is now ten thousand randomly sampled people across America, where again it's the same type of uber spyware that's installed on their phone that I'll go into detail about. From 10,000 to 100,000 criminals in Philadelphia, and talking about things like whether or not a crime wave exists. And if a crime wave does exist, can we quantify it in terms of its magnitude and its speed? >>: [inaudible]. >> Nathan Eagle: This is what happens if you do a Google image search for a criminal, and what the other image shows is Bill Gates and his photo. From 100,000 criminals, I'm going to be talking about a million mobile phone subscribers in Rwanda. 
And the questions we're discussing there are how does urbanization affect people's social support groups and the dynamics of things like slums. As Eric alluded to, we're looking at disease and patterns of disease to see if we can start seeing whether there is kind of a behavioral signature associated with an outbreak of a disease in a particular region that gets reflected in the CDR, the call data records. From 1 million, I'm going to go to the 10 million mobile phone subscribers in Kenya, and this is a particular data set that I find fascinating, because it's not just communication anymore. And it's not just people's movements, but now in Kenya, people can send and receive money on their phones, and so we now have this flow of actual economic resources throughout the country that is fascinating to study. From the 10 million Kenyans, we're going to go to 100 million people in the U.K., and this is now in collaboration with British Telecom and the civil service in the U.K. We're coupling just about every phone call made in the country, both land line and mobile, with things like socioeconomic status data, access to healthcare, education levels, average income, and trying to figure out, basically trying to establish causality, like are there signatures associated with wealth that are visible, again, within the call data records. And then lastly, you know, what does it mean when we start comparing this data across cultures, across continents? What kind of new research questions can be addressed as this gets inevitably to the level of billions, which it will in the very near future. Okay. So that's kind of an outline of the talk. For people who are familiar with social network analysis, these two graphs, that one in particular should be fairly -- should be, you know, fairly familiar. 
These are -- the graph on the top was collected back in the '30s by two guys, Roethlisberger and Dickson, and this is a graph where these are social networks, so the nodes are people and you have an edge, in this case on the top graph, if you were seen playing cards with another person in this particular electric company, which was being studied. And so that's one social network. And this is a social network that was recorded maybe 40 years later, back in the '70s, and this is the famous or infamous Zachary Karate Club data. And there's been a lot of people, you know, a lot of my friends and peers and people who are much smarter than I am who have dedicated a large fraction of their lives to studying this particular graph. The Zachary Karate Club data has been in literally hundreds of physics journals, where people are trying to look at graph segmentation, trying to pull out community structure. So social network analysis has veered well beyond social sciences to statisticians to physicists, and yet typically, when you read these papers, the papers have algorithms that are only applicable for small and static graphs, despite the fact that we're now inundated with behavioral data that's many orders of magnitude larger than these types of graphs, originally from the Internet, but now just from everyday life. And so a lot of the projects that I'm going to be overviewing come from problems, you know. They come from people who need help. 
And those people are, you know, anyone from the police chief in Philadelphia, urban planners in Kigali, epidemiologists in Kenya, the civil service in the U.K. All these individuals are practitioners who suddenly found themselves inundated with data, and they're looking to the academic community for help and are understandably a little bit confused why there's so many brilliant people focused on a particular Karate Club from over 30 years ago when, in reality, the data that we have, you know, the data that these guys have is just much richer and, in my opinion, much more interesting. But the problem is it's not small, it's not static, it's not computationally tractable. And so fundamentally, we need a new set of tools to deal with the data that we currently have today. Things like betweenness centrality, you know. These are fairly sophisticated and interesting metrics that can characterize, in betweenness centrality, for example, how many shortest paths go through a single node. But something like betweenness centrality is more or less intractable for any of the type of data that I work with. And so what I'm trying to push for is, like, coming up with these types of analogs that can do better at analyzing what we have and what we need. So the real focus here is going to be shifting from, like, just thinking about theory and thinking about, you know, building a generative model that can mimic a particular degree distribution and going towards really focusing on outcomes. So trying to understand what is the process that we want to get a better handle on? And outcomes of interest for me are things like disease. They're things like traffic jams. They're things like access to healthcare. And all these outcomes of interest basically can be informed by and conditioned on the petascale data that we currently have. But there's -- it is fundamentally two different philosophies. 
I'm coming at it from the outcomes perspective, rather than looking at the data for the data's sake. And that's how I at least try to differentiate some of this particular work. I also don't want to be misunderstood. A lot of the results here do come from collaborations, a long series of collaborations with academics ranging from social scientists to particle physicists, where we do try to validate theory and come up with better theory about the dynamics of complex social systems. But for me, I think the harder question here really is, you know, once you've got all this data and once you fit a particular distribution, how can we start using this data to improve people's lives? Like, wherever this data's coming from, how can we make that society better? And that's kind of going to be an underlying theme in a lot of the work that I'm going to be talking about. So with that as a long-winded introduction, let me start on the first order of magnitude here. This is N equals 100. These are individuals who were subjects in our original study, where we gave out phones that had more or less uber spyware on them. It started on boot, it was invisible to the end user and it literally logged everything. Logged communication, logged proximity, logged location. And what you can do with this type of data is some fairly, well, interesting human modeling. Given the fact that subject 104 is on their way to Central Square and it's 3:00 p.m. on a Tuesday, can we predict where this person's going to have dinner? Can we predict who this person is going to have dinner with? You know, given the proximity patterns of S22 and S18, if they're proximate by the coffee machine at the Media Lab every Tuesday afternoon, that corresponds to one type of relationship. But if they're proximate in downtown Boston late Saturday night at 3:00 in the morning, every single Saturday night, it's a very different type of relationship. 
Lastly, looking at the aggregate behavior, you can start seeing things like the onset of, for example, M.I.T. finals week, by people moving faster. Or we were capturing data when the Red Sox won the World Series for the first time. And then suddenly, all the models broke. Everyone became unpredictable. And what happened was the majority of the subjects went into downtown Boston, and there was a big rally down there. And the urban planners, they were interested in looking at how people dispersed from that particular gathering point. How many people took the subway? How many people walked over the bridge? How many people rode their bikes? With this type of data, we can start answering questions about how people utilize urban infrastructure. And so that's kind of a bit of a broad overview of this type of data set. Is there a question? >>: So you said that the spyware was invisible. So just as a general question for all your data -- >> Nathan Eagle: It's a privacy question. >>: Did the subjects know they were subjects? >> Nathan Eagle: Oh, to do this study, we can tell who is sleeping with whom, right? I mean, so there's a massive amount of intrusion in terms of privacy. You'd never be able to do this, at least in an academic context, without having -- signing a massive consent form. Absolutely. >>: [inaudible] if they know that they're being noticed, they'll behave slightly differently. >> Nathan Eagle: People were -- and I was concerned about that as well. And if you saw the first week of behavior, it does deviate significantly from the rest of the behavior. But what you can see is that people basically kind of slip into a routine, like basically follow a routine. Whether or not that's a routine that's influenced by the fact that they know that they have spyware on their phones, it very well might be. But I think that it's the first week where they get nervous, and then they forget about it. 
>>: And I'm just curious to see if that first week scales to, like, different -- [unintelligible] does that number change from -- >> Nathan Eagle: I think it really depends on the demographic. The next data set I'm going to be talking about, these are individuals who work for a large Finnish cell phone manufacturer, and these guys, like, they're so used to participating in these types of studies that, like, I think it would be even less. But at the end of the day, it's hard to answer that question, for obvious reasons. One of the things that I've been trying to push with very minimal success is this idea of characterizing an individual or a demographic with this notion of entropy. So this idea of how much structure is in their behavior, how much information, how much randomness. So this idea of information entropy is how much information is in a particular time series in this case. So this is an example of an individual who is a low entropy person, meaning, so the white parts are him at home, and he's at work, and then he's back home again. His day-to-day structure is dictated by dropping off his kids at school and picking them up. Whereas this is an individual who actually was my office mate at the time, who is a grad student. He's just as likely to be in the office at 3:00 a.m. as 3:00 p.m., so you'd have to really squint to see much structure here. So high entropy, low entropy. And it should be pretty obvious, then, that when we're starting to parameterize individuals in terms of their behavior, in terms of predicting what they're going to do next, we can do a lot better on the low entropy people. Like you can see the patterns. You can get a pretty good guess when he's going to show up to work the next day. Whereas with the higher entropy, it's a bit trickier. Then you can start characterizing demographics this way too. We gave out phones to these 18-year-olds who just arrived on campus, you know. 
The business school students, admin staff, and it's, you know, and again it's not particularly surprising, but it's the freshmen that are the most entropic. We also had one other subject that I ended up throwing out, and I'll talk about him a little bit more in detail later, but he was off the charts in terms of entropy. He was well beyond 50, and he was -- this is Marvin Minsky. And basically, I threw him out because, like, he was messing up everything and I figured that he was such an outlier that it didn't matter. I'll talk a little bit later about how we're discovering these individuals across different demographics who are extremely entropic. And it makes me kind of reconsider throwing him out of the data set. >>: So there are different ways to define location. It would be visits, place of type X within time Y and T and. >> Nathan Eagle: That's true, yeah. >>: And exact same place within a larger time slice [inaudible] to define sameness, you get different [inaudible] kinds of analyses. >> Nathan Eagle: That's totally true. And in this case, this was broken down on hourly windows. And place in this context is things like home, work, elsewhere, that kind of thing. But depending upon how you structure it, you can get, I'm sure, radically different entropy scores. But the point originally was simply to get a single metric and try to quantify how much structure is in someone's life. And then once you do that, one of the things that you can start doing is trying to model a particular, you know, every single day. And in this case what we're doing is you can think of every day in a subject's life as a single point in a really high dimensional space, right. 
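A minimal sketch of the entropy score discussed above, assuming hourly windows and the place labels home, work, and elsewhere mentioned in the answer. The talk does not give the exact formula, so the choice here (Shannon entropy of the place label at each hour of the day, averaged over the 24 hours) is one plausible reading rather than the study's definition, and the two subjects are made-up data.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def hourly_entropy(days):
    """Average, over hours of the day, of the entropy of the place seen at that hour."""
    hours = len(days[0])
    per_hour = [shannon_entropy([day[h] for day in days]) for h in range(hours)]
    return sum(per_hour) / hours


# Hypothetical subjects: three days each, 24 hourly place labels per day.
routine_day = ["home"] * 8 + ["work"] * 9 + ["home"] * 7
low_entropy_subject = [routine_day] * 3  # same schedule every day

places = ["home", "work", "elsewhere"]
# At any given hour this subject is somewhere different on each of the three days.
high_entropy_subject = [[places[(h + d) % 3] for h in range(24)] for d in range(3)]

low_score = hourly_entropy(low_entropy_subject)
high_score = hourly_entropy(high_entropy_subject)
```

Under this definition the routine subject scores zero bits and the erratic one scores log2(3) bits, the maximum for three places. As noted in the exchange above, a different window size or place vocabulary would give different scores.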
And the point -- one of the issues is that we don't live our lives as random number generators, but rather these points, they're not randomly distributed through the high dimensional space, but they're clustered, and so if you just do a simple dimensionality reduction on this, you can actually start pulling out what the principal components are. We're calling these eigenbehaviors, because my advisor at the time, you know, created this eigenface literature, and so this is -- using more or less the identical technique but trying to quantify these salient behaviors associated with individuals' lives. >>: So the PCA was on an individual's overall breakdown? >> Nathan Eagle: Yeah. So basically, we get a matrix of these -- this time series, and we're just taking the eigendecomposition of the covariance matrix. >>: You're doing a separate decomposition for every person? >> Nathan Eagle: That's correct. Two slides later, I'll show we change that. >>: Okay. >> Nathan Eagle: Yeah. But, you know, I just wanted -- this is something that kind of pulls out some obvious things where you can start looking at how many eigenbehaviors you need. How many of these vectors do you need to recreate someone's day? And so for low entropy subjects, you don't need very many. Whereas for higher entropy subjects, you do, to get the same type of accuracy. But then, to your point, originally, this was just for a single, you know, each day was a particular point. Now you can start characterizing individuals as a single point in a very high dimensional space. And then characterize demographics. This is kind of a toy example. In this case we're trying to quantify the Sloan Business School behavior space, so collapsing all the business school students down and trying to quantify their behavior in, in this case, 2D. So two vectors. 
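A minimal sketch of the eigenbehavior idea described above: treat each day as a vector, take the covariance matrix across days, and use its leading eigenvector to rebuild a day. Several things here are assumptions for illustration only: each day is reduced to a 24-entry "at work this hour" indicator, the data is invented, and the leading eigenvector is found by power iteration rather than a full eigendecomposition.

```python
import math


def mean_vector(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance_matrix(rows, mu):
    """Sample covariance matrix of the rows around the mean mu."""
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    dim = len(mu)
    return [
        [sum(c[i] * c[j] for c in centered) / (len(rows) - 1) for j in range(dim)]
        for i in range(dim)
    ]


def top_eigenvector(matrix, iterations=100):
    """Power iteration: unit-length dominant eigenvector of a symmetric matrix."""
    dim = len(matrix)
    v = [float(i + 1) for i in range(dim)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def reconstruct(day, mu, component):
    """Rebuild a day from the mean plus its projection on one eigenbehavior."""
    weight = sum((x - m) * c for x, m, c in zip(day, mu, component))
    return [m + weight * c for m, c in zip(mu, component)]


# Hypothetical low entropy subject: at work 9:00-17:00 on weekdays, never on weekends.
weekday = [1 if 9 <= h < 17 else 0 for h in range(24)]
weekend = [0] * 24
days = [weekday] * 5 + [weekend] * 2

mu = mean_vector(days)
first_eigenbehavior = top_eigenvector(covariance_matrix(days, mu))
rebuilt = reconstruct(weekday, mu, first_eigenbehavior)
max_error = max(abs(a - b) for a, b in zip(weekday, rebuilt))
```

Because this toy subject only ever has two kinds of day, a single eigenbehavior rebuilds a weekday essentially exactly, which is the "low entropy subjects don't need very many" point in miniature. A less regular subject would need more components for the same accuracy.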
So what we would do is try to figure out whether someone's a business school student by measuring the Euclidean distance between that space and where they lie, and we got something like over 90% accuracy in terms of differentiating business school students. So that's looking at kind of an individual level, and now I'm going to talk about dyads, pairs of people. This is a graph here of basically proximity over the course of a day, and this is the friendship graph. So the research challenge is trying to figure out, how do you winnow down the edges in this particular graph to get at what the underlying friendship graph is? And I'll be kind of approaching this problem from a lot of angles, whether this is, you know, proximity, in this case via Bluetooth, or the operators have a lot of communication data. So trying to figure out -- so what the operators have is this massive communication graph, and what they want to get at is a social network. So trying to look at a social network through the lens of proximity or through the lens of communication and get at what the underlying relationships are is an important question. What we've been able to get, though, is much richer data than just what operators have. I mean, we can think about a graph that has multidimensional edges. You know, one type of edge could be things like whether you're proximate on Saturday nights. Another type of edge could be are you proximate at that person's house? Another edge would be communication. So we've got this graph, this social network with lots of different types of edges. And the challenge here is trying to infer where are the friends? Can we infer that topology of the friendship graph, based on all these higher dimensional edges? And it turns out you can, pretty trivially. I mean, this is -- my collaborator likes to call this the relationship EKG. There's probably a better name for it. But I haven't come up with it. This is probability of proximity at M.I.T. 
in this case, and then off campus. And there's three different relationship types, right. If I'm friends with Mike, or if I say I'm friends with Mike and Mike says he's friends with me, that's reciprocal. That's green. If I say I'm friends with Mike, but Mike says he's not friends with me, that's an asymmetric relationship. And then, of course, there's the symmetric non-friends. And what we're finding is that these asymmetric friendships are generally when you're really proximate a lot at work. Meaning if you're working with Mike a lot, you may assume that you guys are friends, but, you know, Mike may not think that you are. But if you look at proximity on Saturday nights, for example, if you're proximate on Saturday nights, off campus especially, you're much more likely to be friends. And so we can get something like a 96% accuracy in terms of inferring the edges of the friendship network. But I think we can do a little bit better, because the survey traditionally was just, you know, mark off the people you're friends with, yes or no. What we can recreate is an actual weight associated with the relationship, where that weight is characterized by, you know, how much time you spent having dinner with that person at that person's house, having lunch, traveling with the person, communicating over the phone, et cetera. And so I think we can get a much, kind of richer depiction of what the true relationship is. And what's nice about -- I mean, this is going to be coming out in PNAS, I think in a month and a half. What's nice about this particular technique is that we're able to kind of scale traditional social network analysis to a much larger group of people, potentially. Meaning, like, we had to kind of get ground truth, and so we had to have people go through a survey where they read all 100 names off and had to check the names that they are friends with. 
That obviously just won't scale as you want to increase your sample size. Like you're not going to be able to get people to do that for a thousand people or 10,000 people or the 100 million people I'll be talking about at the end. And so you need another system to be able to kind of quantify what the social network is without having to deal with these types of surveys. All right. So now from 100 M.I.T. students, we're going to 1,000 individuals who are living in Helsinki. And what we've done here is we've basically put the same uber spyware on all of their phones, and we did one more really obnoxious thing. Every 20 calls they make, we have a little survey that pops up and asks them to label the relationship. Is this person a friend? Is this person an office worker, acquaintance, colleague, no relationship? This is the won't disclose pattern. So this is just the temporal distribution of when individuals make calls to what type of relationship. So we not only have the temporal, but we also have the location where those calls were made. Yeah? >>: It's all making calls, not getting calls? >> Nathan Eagle: This is making calls, yeah. We had the survey only go when you make a call. But again, this kind of leads us into this problem that the operators have, right. Where they've got all this communication data, and they really want to get at relationship. And the operators have come up with sneaky ways of getting at it. Right, you've heard of, like, the friends and family plan, or the fave five or whatever T-Mobile is pushing. Like they're trying to get access to this type of data from you. And what we're trying to do is build a little classifier to see if there are kind of discrete signatures associated with these types of relationships. So from a thousand Finns, now this is 10,000 randomly sampled individuals across America. And the real key component here is that this is a random sample. This is a collaboration with IMMI. 
What IMMI did is the original Reality Mining study, so that same uber spyware on people's phones, but they used one additional sensor that I wasn't allowed to use due to COUHES constraints, as well as the fact that I don't think anyone would sign up for it, or I didn't think so at the time. The sensor is the microphone. They're turning the microphone on on these phones for ten seconds every minute and recording the ambient audio, pushing it back to a server. So have you guys heard of that -- there's these companies where you hold up the phone to the radio, and you get a text message back about what song is being played. So the guy who started that company, he did his Ph.D., finished it up about ten years ago. He started a couple of companies that were basically doing this acoustic matching. You know, both in Europe and here in the states. This is his latest company. And I think he just finished closing a $50 million round from Draper. Nielsen is trying to hostilely acquire them; they're like a hot commodity. And the reason why they're a hot commodity is because his algorithm not only lets you figure out what song is being played, but what advertising you're being exposed to. So you can actually get a much better sense of, you know, what media people are consuming. What TV show you're watching, what movie you're watching, what radio station you're listening to in the car. And one of the really cool things, I think, is that they're able to show kind of marketing efficacy. So if you watch the co-branded Burger King Simpsons ad, you're actually much more likely to go to the Simpsons movie. Like they can make these types of claims now, and they can make them because this is 10,000 randomly sampled individuals. And one of the major critiques of the earlier work I talked about was, okay, so you've parameterized the lives of 100 M.I.T. students. So what? 
Like how does that scale to the population at large? Suddenly, we're able to start talking now about, you know, much more generalizable patterns about demographics. Not just M.I.T. freshmen, but we can say now that the most entropic group that we've found in the 10,000 people turns out to be women under 30 who are making more than $60,000 and are college educated. These are now claims that we can make because we have a random sample. So from 10,000 randomly sampled individuals across America, this is now 100,000 criminals. So a red car corresponds to a carjacking. The little circle corresponds to a graffiti event. The martini glass is public drunkenness. The -- let's see, the RX symbol is a drug bust. So what we're trying to do here, the research question is, you know, is crime contagious? Is there such a thing as a crime wave? Originally when I wrote the little script that would plot these things as a time series, I was hoping to start seeing kind of, you know, spread over something like a lattice, but it's clearly a bit more noisy than anticipated, and so this is a project going on with a student of mine and an epidemiologist from UPenn, trying to get at answering this question. But there's been, in the past, a lot of literature on things like is obesity contagious? Is smoking contagious? And using a similar technique, actually, in the -- I think it was the New England Journal of Medicine recently, there's a paper where they showed that height was contagious and acne was contagious. So you have to be really careful when you're talking about contagion over a social network when there's such things as homophily, this idea of birds of a feather flocking together. Like disambiguating what is homophily and what is an actual contagion. For me, it was a lot harder than I anticipated. >>: [inaudible]. >> Nathan Eagle: So anyways, that's my hundred thousand. And now I'm going to launch into CDR. 
So CDR stands for call data records. It's the data that mobile operators capture about their subscribers. And just to make sure that everyone's on the same page, this data is far richer than just a communication graph. I mean, while we've got things like the caller and the receiver, we've got the time, the duration of the call, the cell tower that's associated with the call, and in this data set we've got four years of data, and this is data from the only telecommunication company in the country. So this represents every phone call made in the country over four years. So what that enables you to do is ask some really interesting questions about, you know, how society in general is evolving over time. We've got things like Me2U air time transfer, so you can actually send and receive air time on these handsets, and that's being done on average about four times a month per subscriber. So not only do you have these multidimensional edges, where edges could be sending a text message, making a phone call, sending air time, but you've also got a bunch of attributes for the nodes. You know, we know what phones people are using. We know what region they're calling. We know scratch card denomination. Eric and I were talking about this earlier today. So all these markets, or at least most of the markets that I am involved in, are generally prepaid. So most of the world has prepaid markets, where you buy a scratch card for air time, and you can buy scratch cards all over Rwanda and most places on earth now. And so in Rwanda's case, you can buy 25 cents worth of air time all the way up to cards that have a $25 denomination. 
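A minimal sketch of how the call data records described above could be held in memory and collapsed into the multi-typed edges the talk mentions. The field names, the three `kind` values, and the sample rows are all hypothetical; the talk lists what operators capture (caller, receiver, time, duration, cell tower) but not a schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    caller_id: str      # anonymized subscriber ID, not a phone number
    receiver_id: str
    timestamp: str      # ISO 8601 string, kept as text for simplicity
    duration_s: int     # 0 for events that have no duration
    cell_tower_id: str
    kind: str           # assumed values: "call", "sms", "airtime"


def typed_edge_counts(records):
    """Count events per directed (caller, receiver, kind) edge."""
    return Counter((r.caller_id, r.receiver_id, r.kind) for r in records)


# Made-up sample records for illustration only.
records = [
    CallRecord("a01", "b07", "2005-01-03T09:15:00", 62, "tower-12", "call"),
    CallRecord("a01", "b07", "2005-01-03T18:40:00", 0, "tower-12", "sms"),
    CallRecord("a01", "b07", "2005-01-05T08:05:00", 145, "tower-03", "call"),
    CallRecord("b07", "c22", "2005-01-05T12:30:00", 0, "tower-03", "airtime"),
]

edge_counts = typed_edge_counts(records)
```

Keeping the event type in the edge key preserves the distinction between calling, texting, and sending air time, so each pair of subscribers can carry several parallel edges instead of a single communication count.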
So what we've been doing is looking at how this denomination is a proxy for socioeconomic status, and looking at basically how -- like using that to couple that with things like the census data that we have from Rwanda and understanding what the distribution of wealth is in the country and how these calling patterns are reflective of that. The other thing I wanted to focus on before I move forward is the privacy implications again. So these are individuals now who did not sign a consent form. And originally, I thought this was going to be a real challenge to get it through COUHES, which is the IRB at M.I.T. But there's two things that made it a lot easier. One thing was that I have no phone numbers. I have anonymized I.D.s that are associated with a SIM card. And secondly, I'm not doing any data collection. This data already exists. It's sitting on a server, actually sitting on a bunch of servers underground in Kigali at the moment. So with those two caveats, this type of research becomes viable in an academic sense. So one of the students who is working with me at SFI right now is writing his thesis -- well, he's getting distracted in a lot of projects, but his thesis was on the structure of society and the stability of society. Looking at this data over a long period of time, trying to identify cliques that form and cliques that basically dissolve. Are there particular characteristics of groups of people that are correlated with the stability of their relationships? Like, can we identify relationships that are going to be persistent? Can we identify groups of people that are basically going to stay a group for a long period of time? A group is defined as individuals who communicate a lot with each other. 
So that's -- we've made less progress on that than I wanted to, but actually it was inspired by a Nature paper Laszlo Barabasi wrote maybe three or four years ago, where he was basically asking similar types of questions, but getting at it in a different way. He was using data, I think, from on the order of about a year, but now with four years of data and literally every phone call in the country, I think we can do a pretty good job at this. >>: [inaudible] does that represent? >> Nathan Eagle: So when we started -- well, when the data started back in January of 2005, it was 200,000 people in Rwanda had cell phones. By the time the data ended, it was 1.4 million, which was about 15% of the country at the moment. And that number is -- I mean, Africa is the fastest growing mobile phone market in the world. That number is rising almost exponentially, which is exciting. >>: I would think it was a much higher percentage at the family unit level. >> Nathan Eagle: Yeah. So if you look at number of phones per household, it's well over 50%. All right. So one of the reasons why I kind of keep focusing on the fact that this is every phone call in the country is that, you know, in previous work, in Laszlo's as well, they have been working with operators that have less than 100% market share, generally less than 30% market share. And so we have a paper now, basically showing what the implications are when you're having that kind of sampling. So if you have all the data from a single operator, and that operator has 20% of market share, what it turns out is you only have 4% of the edges. And less than a percent of the triangles. So when you start wanting to kind of build something that diffuses over that graph, you get radically different dynamics than if you had the full graph. And Sune Lehmann over at Barabasi's lab, he's been a big advocate of the fact that we need to get away from this idea of discrete community structure. 
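One way to see where sampling figures like the 4% of edges and less than a percent of triangles can come from is a back-of-the-envelope calculation. The sketch below assumes that subscribers are spread independently and uniformly across operators, and that an edge or triangle is only observed when every one of its endpoints belongs to the sampled operator. Both are simplifying assumptions for illustration and not necessarily the method used in the paper mentioned in the talk; the function name is hypothetical.

```python
def expected_fraction_observed(market_share: float, nodes_in_motif: int) -> float:
    """Chance that all nodes of a motif fall inside the sampled operator,
    assuming each node is a subscriber independently with probability
    `market_share` (an illustrative assumption)."""
    return market_share ** nodes_in_motif


share = 0.20
edge_fraction = expected_fraction_observed(share, 2)      # an edge has 2 endpoints
triangle_fraction = expected_fraction_observed(share, 3)  # a triangle has 3 nodes

print(f"edges observed:     {edge_fraction:.1%}")
print(f"triangles observed: {triangle_fraction:.1%}")
```

Under these assumptions a 20% market share gives 0.2 squared, or 4%, of the edges and 0.2 cubed, or 0.8%, of the triangles, which is in line with the numbers quoted.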
You know, when people talk about community structure and graphs, generally we assume that a node can only belong to one community. That kind of seems a bit silly if you think about it. If you think about your own social networks, you're affiliated with your college community and then your Microsoft community and your high school community, your family community, so we've got a lot of different communities. But it turns out if you sample a social network down to something like 20%, you do get this kind of discrete community topology. But as you increase your sample size, you get overlapping communities. You get a densification of the graph, and so that's kind of what we're showing here, that, you know, the number of memberships increases as you increase sample size. What we were able to do is recreate all of the previous results that have been, you know, looking at processes that diffuse over a social network, that have been previously published, at least when those authors were looking at a 20% graph. And we could recreate those dynamics when we sampled our graph at 20%. But then as we increased the sampling size, those dynamics, they changed quite a bit. And so we've got a paper now in submission talking about how those dynamics change and the fact that when you want to think about constraining an epidemic, initially we focus on looking at the super spreaders, which are generally the hubs. But what we found is actually membership in a large number of communities is four times more important than degree. Meaning it's more important how many memberships you have with these different communities than simply the fact that you've got a high degree. All right. >>: So that's -- [inaudible]. >> Nathan Eagle: How something spreads over a graph. >>: Meaning what in this case? >> Nathan Eagle: Okay. In this case, you can think of that as information. I mean, right now, we just used a simple SI model, susceptible infected model. 
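A minimal sketch of a susceptible-infected (SI) process on a graph is shown here for orientation. The talk only names the model, so the discrete time steps, the per-edge transmission probability `beta`, the function name and the toy graph are all assumptions made for illustration.

```python
import random
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency list: node -> neighbours


def simulate_si(graph: Graph, seed_node: int, beta: float,
                steps: int, rng: random.Random) -> List[int]:
    """Run one SI simulation and return the infected count after each step.

    Index 0 is the initial state (only the seed is infected). Infected
    nodes never recover, and infection only travels along existing edges.
    """
    infected = {seed_node}
    sizes = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbour in graph[node]:
                # Each infected node gets one chance per step per neighbour.
                if neighbour not in infected and rng.random() < beta:
                    newly_infected.add(neighbour)
        infected |= newly_infected
        sizes.append(len(infected))
    return sizes


# Toy graph: a path 0-1-2-3 plus an isolated node 4 that can never be reached.
toy_graph: Graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}

certain_spread = simulate_si(toy_graph, seed_node=0, beta=1.0,
                             steps=4, rng=random.Random(0))
no_spread = simulate_si(toy_graph, seed_node=0, beta=0.0,
                        steps=4, rng=random.Random(0))
```

With `beta=1.0` the infection advances one hop per step along the path and never reaches the isolated node, which mirrors the point that a node has to be connected by an edge for anything to spread to it. Repeating the run over many seeds and averaging the final size would give a per-node measure of how many people that node ultimately infects.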
This shouldn't be confused with disease, per se, but this is -- you know, people have been using social graphs as a proxy for how disease would diffuse. >>: But just to talk details [inaudible] what are you doing, are you saying given this graph, how could I go from one place in the graph to another? >> Nathan Eagle: No. What we do is we look at how many people you ultimately infect. So we run the simulation thousands of times. We take one individual out of the 1.4 million, we infect them and we watch this thing spread over time. And over a long period of time. >>: [Inaudible] standard contagion model and applying to the graph and [inaudible] as [inaudible] for a probability of [inaudible] with or without the arc in place? >> Nathan Eagle: Yes, right. Okay. So that's -- and that's basically just talking about the implications of sampling. >>: But that model does allow for a jump without an arc? You have to be connected? >> Nathan Eagle: Yeah, you have to be connected in order to disseminate something in that case. Yeah, you have to have an edge. >>: I can imagine noisier models where you can [inaudible] ambience without the arc or -- >> Nathan Eagle: That's true. >>: You look at the arc, it's noisy. But when the arc is there, boom [inaudible]. >> Nathan Eagle: Yeah, there's all sorts of -- I mean, what we wanted to do was do the absolute dead simplest model to show the effect of membership versus degree. As you increase the levels of sophistication, you increase the realism, but then you add a bunch of things that may make your point less clear. In this case, this is now another -- >>: Or clearer. >> Nathan Eagle: Or potentially clearer, but with more variables. So more susceptible to getting shot down by reviewers. So another study that we wanted to -- or another question we wanted to ask was about the effect of cities, and in this case, this is -- okay. So I should spend a little bit of time explaining this graph. 
These are all individuals who moved from a rural area into the capital, and they moved at zero. So they're spending their time in the rural area here, and then move to the capital, and then the back movers, the individuals who end up moving back, are characterized by the solid line. And so you can start seeing -- so this is their degree. So this is the number of people that they call in the rural area. And as they move to the capital, it goes down. But then the back movers, they move back after about six months and then establish the same rural degree. And the same kind of goes for -- this is their capital degree, so it goes up when they move there, but then they never really fully integrate and then move back and then it goes low again. Whereas the people who permanently move is this dashed line here. For moving from a rural region to capital, this is your capital degree. So it's more or less kind of static and then it jumps and keeps kind of just going up. So the idea here is, you know, how do cities affect individuals? One notion is this hypothesis called differential selection, meaning that you behave differently -- individuals in the rural area who are behaving like an urban person, the more they behave like that urban person, the more they kind of get pulled into the capital. So that's kind of the theory of differential selection. They're different from their home environment and that's what's driving them to move. And you can actually see, you know, the individuals who do move are different. This is kind of the average rural degree. And the people who do move kind of have less integration, have fewer contacts within that rural area. >>: So this is behavior described as the way they communicate with each other? >> Nathan Eagle: This is purely looking at the degree. 
So it's -- I mean, which is kind of a proxy for behavior, but this is the number of people these individuals called both in their home environment. >>: [Inaudible] pattern and divvy those up into rural and urban and say if you've got a communication pattern -- you're rural, but you've got a communication pattern that's closer to the -- >> Nathan Eagle: Urban, then you're more likely to leave. And that's kind of the theory of the differential selection. The other theory that has been in the literature for decades is this behavioral adaptation. So meaning you move to the city and suddenly then you behave more like an individual who is in the city. So what we're trying to do is try to figure out how much is differential selection, how much is behavioral adaptation, and it's still very much a work in progress, but we're seeing clear signs of both. And then the other thing that we're making progress on, but I think we're not quite there yet, is this idea of integration. So looking at how long does it take for individuals to integrate into these different communities, both these urban communities and these rural communities. And we're finding that it's a lot easier for people to integrate into an urban environment than a rural environment. And then there's other fun things about rural environments, whereas you may have lesser degree, but the tie strength is much greater. People in rural areas, they don't have as many contacts, but their relationships are stronger, as measured by call volume. So there's some interesting dynamics going on in urban versus rural, and I've got another student at SFI now who is working on building a generative model for this urban growth. She's looking at Kibera, which is the largest slum in the world, just outside of Nairobi, and what we're able to do is look at how this slum is changing over time. We're inferring tribe by looking at what regions of Kenya individuals living in this area are calling. 
And you can start seeing the dynamics of the kind of turnover of these different tribes moving in and out of these neighborhoods. And the hope here is we can start parameterizing this in a way that we can quantify how the slum is growing over time and with which type of inhabitants. And this is important for people doing urban planning. I mean, these guys, they need to figure out where to put the next latrines, where to put the next pieces of infrastructure, where is the slum going to go next. These slums, in these developing world countries, are very organic, they're ad hoc. They grow in very much an unsupervised manner. But I think they -- and this is very much work in progress, but I think that it's looking like they have particular patterns in their dynamics that can be quantified. And if they can be quantified, we can get some insight into really what's going on. >>: So is that data or that stuff, that's also coming from the Rwanda data? >> Nathan Eagle: This is coming from, yeah, Rwanda and Kenya. Actually, this slum is in Kenya, which is the ten million. I jumped ahead. So we're working with the city planners in Kigali. Kigali has their own sets of slums, but it's not the world's biggest slum, so we're basically applying these models, but we're using Kibera as well right now. >>: Okay. >> Nathan Eagle: So I hate the term universal law of human anything, but this is kind of -- well, this is the name of the grant that we got so it's now the title of the slide. But the idea is looking at quantifying how mobility models scale over different people from different cultures in different continents. And so what Marta did earlier, last year, was basically parameterize a population's mobility in terms of a radius of gyration. It's kind of using a standard gravity model. She's able to find kind of the distribution of these radiuses. And what we're doing now is looking at what this distribution is across a variety of different regions. 
Both in America, Latin America, South America, Africa, Europe, and seeing how, you know, what holds as kind of commonality and what is different across these different regions. In terms of diffusion, not only do we have -- so when we talk about CDR data, we not only have the communication data, but we also have this product adoption. So what products people are adopting over time, and especially in places like Kenya, like Rwanda, like the Dominican Republic, where there hasn't been as much mass marketing, simply because people live in more rural areas and are less connected. How people hear about a product, you know, basically comes from your peers. And so what we've found -- and what's striking is this holds true both in the U.K. as well as in Kenya, as well as in the DR. If you want to start quantifying the probability that an individual adopts a particular product, you know, you have a triad, so you have three people. A and B both adopted and you want to know what the probability of C adopting is. It's significantly more. Meaning the probability of C adopting goes up significantly when the triad is closed, when A and B are friends. When A and B are friends, they exert more mutual influence over C than if they're not connected. And it's striking how that result seems to hold again across a wide range of markets and across a wide range of products. This is now kind of work that's veering towards developmental economics. But I think it kind of should scale to a wide range of events. This is the implications of what happened in an earthquake back in February of 2008. And what you're seeing here, I mean, so the first thing is, you know, you type in Rwanda earthquake and you see this massive spike and you find out that there was an earthquake in Nyamasheke region on February 3rd. And then you can start plotting what -- these are the behavior that's coming from cell towers in this particular region. 
And on the Y axis, this is outgoing minus incoming calls. So the net outgoing minus the net incoming. And you can see, basically, the cell towers were -- the time series started about here and you generally have more outgoing than incoming until this earthquake happened. And then suddenly things changed dramatically. This is a plot that literally, we generated about five days ago. So there's no more kind of analysis beyond what you're seeing. But I think what it points to is an ability to start -- ability to do things that Eric and I have spent the morning talking about, being able to do surveillance, whether it's disease surveillance or flooding, trying to detect flooding or crisis or market collapse. We have this data on how people are communicating, how people are traveling, how people are sharing air time, and the hope is that we can -- go to the next slide, we can start parameterizing people's reactions to these catastrophes in a way that we can use it for surveillance. You know, can we do something analogous to that Google -- you know, that Google Flu Trends paper. So I liked the analogy that we could do something similar with CDR. And instead of people typing in search queries about sicknesses, we can identify key behaviors that are indicative of an outbreak. You know, signatures that are associated with people who are getting sick or who have recently been flooded or recently been exposed to something like an earthquake. Do you have a question? >>: So you touched upon that triad, that if it's a closed triad, it affects the decision making of the third person. Have you applied that to the moving data that you had, like who affects -- who is the influencer in where people move from urban areas to city. >> Nathan Eagle: That's a good idea. No, we haven't. >>: [inaudible] in finding, like, something to the effect of who influences things in social graphs. That's a very recent study. >> Nathan Eagle: Right now, we're just looking at aggregate behavior. 
We're not trying to pull out individuals who are influencing more than others. >>: Okay. >> Nathan Eagle: But that probably, at some point, we will do that. And we're still just trying to get our head around kind of what the aggregate is doing first. And so anyway, so there's a lot of questions that can be addressed, whether it's, you know, trying to identify -- so I'm working with a group of epidemiologists at Imperial, and what they're trying to do is quantify how mobility affects malaria, so how people's movements around East Africa change the spread of that particular disease. And, you know, and with Eric, we've been talking about trying to quantify people's reactions to something like cholera. Like when there's a cholera outbreak, and there have been dozens of cholera outbreaks during the time period where we're looking at this data. Can we start characterizing how a region responds to something like that? And ideally, can we characterize the onset of a cholera outbreak? Can we see the events leading up to a potential disease outbreak? If we can quantify that, then we can do some real good. >>: [inaudible]. >> Nathan Eagle: Yeah. So if you can close the loop, if you can identify when a disaster is about to happen, you know, there's a lot of organizations, both governmental and nongovernmental, that would be very eager to help out. Josh and I, so Josh is a student at Berkeley. He just arrived in Kigali today, and we're conducting a study, a phone study, where we're surveying people and asking them a variety of both socioeconomic questions, but also about, you know, what their livelihood is dependent on and whether or not they've been involved in an economic crash of some sort or, like, been touched that way, with the idea of trying to figure out how behavior is correlated with things like market prices. All right. So now from a million Rwandans, we've got essentially the same data for Kenya. 
Now it's 10 million mobile phone subscribers in Kenya, but we've got an additional edge type. Instead of just communication and air time sharing, we've got actually how people are sending and receiving money across these different regions. It's a product called M-Pesa, which is kind of mobile money in Swahili. And what M-Pesa has enabled is a variety of really interesting applications that lie on top of it. So I mentioned earlier this company, TxtEagle, based in Nairobi, and what it simply does is enable people to earn small amounts of money on their phone by completing really simple tasks for corporations who pay them generally in air time or M-Pesa. Tasks are things like transcription, translation, image tagging, surveys. We've got -- well, I mean, right now we've got something like 15 million subscribers who now have phones where they can start earning money. And the question is coming up with enough tasks to meet the demand. I mean, we soft launched in February with a series of translation tasks, and we had to shut the service down after something like six hours because we ran out of [unintelligible] tasks. We introduced the service to some taxicab drivers and some high school students, and just within hours, we had thousands of users. Like there's a huge demand for being able to start making money on your phone by completing work. And it's now just a question of figuring out how to get enough tasks. And I mean, it's pretty exciting. I mean, and so if people in this room have an idea about how crowd sourcing can, you know, what particular paid tasks can be done on a handset, and it's important to think that these are not kind of smart phones. This is the handset that you had ten years ago. So initially, we deployed via SMS, but now we've gone to USSD, which is -- I mean, so SMS is to USSD what e-mail is to Telnet. 
So USSD is kind of this persistent session-based GSM protocol that works on literally every single one of the four billion GSM phones on the planet today. So it's a fantastic opportunity, I think, and it's a great protocol. But the problem with USSD is that you have to be an operator in order to deploy a USSD service. And so this is basically done in house with these operators. So that's -- and then I kind of, I put mechanism design. So this is an example of trying to figure out what the kind of incentive and reward structure should be so that you can get this work done. And when we initially launched, I think we were probably paying people too much. People were earning upwards of three dollars an hour, and that was -- I think, you know, when we redo this again, I think really what we're going to try to do is cap the daily amount you can earn to maybe a dollar or even less. So that we get a wider selection of people who are completing these tasks. So from those, you know, 10 million East Africans, this is now 100 million individuals in the U.K. This is a graph, actually, it's a social network now of 250 million unique phone numbers, 12 billion edges, and we've got things like the nodes are characterized by region, by product adoption, and the edges have a time and duration associated with them. This represents virtually every land line phone call and about 80% of the mobile phones for the month of August of 2005. So it's a very large graph. And I'm at the Santa Fe Institute now, and so one of the job requirements is to show straight lines on log-log plots. And this is a particularly good straight line. This shows calls made for, in this case, this is August 19th, 2005, and what you're seeing is 10 million people in the U.K. made one call. 1 million people made ten calls, 100,000 people made 100 calls. And it just keeps going down and down. 
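The three round figures quoted above are enough to work out the slope of that straight line. The sketch below fits a least-squares line to the base-10 logarithms of those three points; it uses only the rounded numbers from the talk, not the underlying data, and the function name is chosen for illustration.

```python
import math
from typing import List, Tuple

# (calls made that day, number of people who made that many calls),
# using the rounded figures quoted in the talk.
points: List[Tuple[int, int]] = [
    (1, 10_000_000),
    (10, 1_000_000),
    (100, 100_000),
]


def loglog_fit(data: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Least-squares slope and intercept of log10(people) against log10(calls)."""
    xs = [math.log10(calls) for calls, _ in data]
    ys = [math.log10(people) for _, people in data]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = loglog_fit(points)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

For these three points the fitted slope is -1 and the intercept is 7, meaning the number of people making k calls falls off roughly as 10^7 / k. The exponent fitted on the full data set may differ from this rounded illustration.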
And the thing that kind of -- you know, one of the things that initially gets pointed out is like who are these people who are making, you know, a hundred thousand phone calls on that particular day? And this seems to be true across these different data sets. Like this is -- it's a true power law all the way out to the tail and these are clearly not human. And actually, we're trying to build a little classifier that tries to pull out, you know, what machine related properties. So if you behave like a machine, we're not going to count you anymore. So this is the unfiltered graph. And kind of the thing that's striking for me, I think, is just the fact that out of the behavior of hundreds of millions of idiosyncratic individuals, every single day, this is a straight line. But the interesting thing is the slope of the line, the exponent of the power law, changes with the day of the week. So you have like a Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. So that's calls out, and you have a mirrored exponent for calls received. So there's something fundamental about what's going on with the dynamics there that I think is interesting. We've been doing a little bit of work just looking at how distance affects probability of communication. So this idea of propinquity, this notion, the fact that you're much more likely to be friends with people who are proximate to you. And what we've done is kind of similar to a PNAS paper by David Liben-Nowell where he took a LiveJournal data set and looked at the probabilities that individuals link based on zip codes. And what we've been able to do is recreate that distribution that he found for LiveJournal with the U.K., and it looks like Rwanda and the DR follow a similar distribution as well. >>: [inaudible] messenger data the month of June [inaudible] communications. 
We did an additional study, how often do you talk to somebody, given the fact that you talked with them before, and it goes up with this. >> Nathan Eagle: Goes up with this? >>: You see people hopefully [inaudible] talk to people far away. >> Nathan Eagle: So your low weight ties are your proximate ones? >>: Right. >> Nathan Eagle: We should do that too. So in any case, we're seeing similar -- I mean, that's a neat finding, right, whether it's instant messenger data, whether it's LiveJournal data, whether it's communication data from England, from the middle of a village in Rwanda. Getting these same types of distributions repeatedly across different types of communication medium, across very different types of demographics. I think there's something interesting there. But it gets more interesting once we have that outcome of interest. So the outcome of interest for this slide is the index of multiple deprivation, which is an aggregate score that's associated with access to healthcare, education levels and average income. And this is for the U.K. And so what we're doing here is we're testing Granovetter's hypothesis, the strength of weak ties. Right, so Granovetter many years ago kind of theorized that it's going to be that acquaintance that's going to get you your next job. It's not your best friend who is going to be able to kind of push you forward socioeconomically, but rather it's the weaker ties. And what we've done is we've looked at diversity. So again, diversity now is -- well, in this case, this is a Shannon entropy metric, but we've also used the Simpson metric for diversity. It's more or less the same, an idea of how much entropy is in an egocentric graph, based on the geography that they're calling. And what's striking is that socioeconomic status, it's really not that correlated with your degree or your call volume. 
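To make the diversity idea concrete, here is a minimal sketch of a Shannon entropy score over the regions an individual calls. The talk does not spell out the exact formula or any normalization, so the function name, the use of raw call counts per region and the choice of bits as the unit are assumptions for illustration.

```python
import math
from typing import Mapping


def shannon_diversity(calls_by_region: Mapping[str, int]) -> float:
    """Shannon entropy (in bits) of one person's calls across regions.

    0.0 means every call goes to a single region; higher values mean the
    calls are spread more evenly over more regions.
    """
    total = sum(calls_by_region.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in calls_by_region.values():
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


# Made-up examples: same call volume, very different diversity.
insular = shannon_diversity({"region_a": 40})
balanced = shannon_diversity({"region_a": 10, "region_b": 10,
                              "region_c": 10, "region_d": 10})
```

Both hypothetical callers make 40 calls, but the first scores 0 bits and the second 2 bits, which illustrates how a diversity score can differ sharply between people whose call volume is identical.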
In other words, it's not the number of friends you have or how much time you spend on the phone, but rather it's the diversity score. It's whether you have contacts in a wide range of areas. So this is a pretty amazingly strong correlation, especially in the social sciences, where we're explaining more than half of the variance in a group of people that numbers 50 million. So I think this is a major result. But the controversy right now is figuring out what's really driving this, because there's another theory from Burt about structural holes and constraints, and I could spend a little bit of detail talking about that. But basically, essentially, it's just if your friends are friends, so if you have an insular group, what Burt theorized was that's going to constrain you. That's going to mean that it's harder for you to change, and that constraint is going to affect your socioeconomic status. Whereas if you have a lot of holes in your system, meaning if you have a lot of friends in a lot of different places, then basically you're going to be better off. And so you can see how there's parallels there with diversity. And so trying to figure out what's really the driving force is still an open question. But at the end of the day, we're able to start validating these types of theories because we have these amazing data sets. And it's not just all the communication graph, but it's also -- through the U.K. civil service, they've done very detailed studies about the socioeconomic status of all these regions in the U.K., and so it's really when combining these two things where we can get some interesting results. Yeah? >>: [Inaudible] services on top of that to address these [inaudible] based on this data? >> Nathan Eagle: That would be the ultimate goal. To date, no. And one of the reasons is I can't really advocate them to do that, because what we haven't done is talked about causality. 
What we've only done -- well, we talked about correlation. If I can establish causality, then I can actually make a case to the government saying, like, look, this is a particular policy you need to start adopting. We need you to, you know, create these longer edged ties across the nation and improve everyone's socioeconomic status. I can't make that case to the government yet, because I don't know which direction it goes. My guess is it's going both ways, right. So people who have higher diversity, they're going to be exposed to more opportunities and so that's going to drive socioeconomic status up. But the wealthy individuals, by the nature of being wealthy, will have these longer ties as well. So trying to figure out which way causality works is very much an open question. Okay. And then I'm going to wrap it up now, I mean with the billions. And I'm working with, like I said, dozens of mobile phone operators from around the world, and it allows us to ask a variety of really interesting questions, whether it's looking at the dynamics of relations between nation states, looking at how culture affects things, disease outbreak warnings, et cetera. So to wrap it up, I mean, I've talked about where N equals 100, I'm starting to infer the friendship graph based on these behavioral signatures. When N is a thousand, trying to figure out what kind of relationships people have, friends, office workers, colleagues, based on their communication behavior. N equals 10,000, this is the random sample, 10,000 individuals across America, and learning, basically, what random sampling can do for us. So the hundred thousand criminals in Philly, trying to figure out whether or not there is such a thing as a crime wave and whether or not crime is contagious. 
To a million people where we're looking at, you know, how urbanization affects people's social support networks, trying to understand how disease and disease dynamics interplay with things like mobility and communication. 10 million people, looking at how resources flow through the network, economic resources, and then this platform TxtEagle that enables people to earn money on their phone. Then a hundred million, trying to validate for, I believe, the first time this theory that Granovetter had about whether or not these weak ties really are strong economically. And then into the billions, where it's possible to start comparing across cultures and across continents, and it is striking, the similarities that we're initially finding. So this study with Marta, it seems like people in the Dominican Republic move in similar patterns to the way people move around Rwanda, which bears striking parallels to how people move around San Francisco and L.A. and Belgium. There's particular types of information that spread over greater London with exactly the same dynamics as it spreads through a village in Kenya. So we're at a time now where we can almost -- well, I'm still pushing a little bit against it, but we can almost start talking about universal laws. Universal laws of human behavior. But I think kind of the bigger question, the -- in fact, like kind of the ultimate open question here is really an engineering question, like once we've got all this data, once we've got all these insights, once we've built this model of a system, how can we use these insights and use this data to actively improve people's lives? Because I'm as big a fan as anyone about plotting the aggregate behavior of 100 million people on a log-log plot and fitting a distribution and then raising up your arms and yelling eureka and claiming that you've found a new law of human behavior. That's the easy part. 
I think, fundamentally, the harder part is trying to figure out a way that we can use this data to better both the lives of the billions of people who are generating this data continuously and the societies in which they live. And for that, it's no longer the social scientists who want to validate theories or physicists who want to start messing with the degree distributions. But now is the time for the engineers. Now is the time for the applications and I think we have the potential to really make a difference here. This could be extremely high impact work. And so I think it's an exciting -- it's an exciting space to play in. And I'm playing in it with a variety of people and I'm welcome to continue to add to this list if other people are interested. So with that, I'll open it up for questions. Thanks. [applause] >>: Should I give out my cell phone number? Questions? CDR land here. >> Nathan Eagle: You can call me on my cell phone. Yeah. >>: Any questions? >>: [inaudible] realtime or are you always getting chunks of data? >> Nathan Eagle: Getting realtime data is possible, but I haven't -- to justify getting realtime data, we'd have to do kind of -- we'd have to propose something like what Eric was talking about earlier, interventions. At the moment, I can't make the case that I have any intervention or even the ability to detect something that the operator or the government would care about, but when we do, there's no technical reason why that's impossible. Like we should -- and that would be the ultimate goal, would be to build a little filter on the realtime CDR to start flagging regions that look like they might have just had a disease outbreak. >>: [inaudible] uploads. >>: So are they going to take the same models and apply them to other ways of communication, like instant messenger, Facebook messages and Twitter exchanges.
But if that's a realtime field that's open, and you can just grab data out of that, because a lot of these models should be applicable regardless of the mode of communication. That's the relationship in every graph. You should be able to relate something in one to see how it applies to the other one [inaudible]. >> Nathan Eagle: I'd like -- to be honest, I've got more data, I'm drowning in data. And, you know, earlier in my life, I was a data junkie, like there's never enough data. I always wanted more. And I'm kind of reconsidering that position and I'm trying to be more kind of focused on questions that you want to address rather than just trying to build the biggest social network that has ever existed. But, you know, that question appeals to me as well. So yeah, I mean, I'd love to be able to couple different types of data sets together, especially if you can legitimately couple them. You know, you know the hashed ID of the phone number, you know the hashed ID of the messenger account. You know the hashed ID of the bank account. So I'm collaborating with people who have a bunch of Bank of America data. So there's a lot of privacy implications, obviously, associated with, you know, coupling these different data sets. >>: So I stepped out for a minute [inaudible] when you said you started working with Kenya as well, some of the mobile [inaudible]; is that right? >> Nathan Eagle: Um-hmm. >>: So are you seeing sort of the same kind of relationships with the call structures? Are you getting data with M-Pesa? I'm just curious about sort of is there symmetry or similarity in the networks between the call structure stuff and the m-banking kind of relationship? >> Nathan Eagle: Like this is early stages. Like I would -- I can tell you kind of what things like the degree distribution look like for these different networks. And also, the air time sharing.
I mean, there's kind of a lot of different -- there's SMS, there's communication, there's air time and there's M-Pesa. And as you go down that line, you get a sparser and sparser graph. And typically, it seems to be correlated with volume. So if there's a lot of volume between a dyad, you're much more likely to have the M-Pesa transaction. But we're in early stages and I really can't -- I'm not comfortable saying more than that, until we really kind of stare at the data more. >>: Okay. >>: So I was wondering if you started doing anything in terms of mass intervention, like providing people with access to information about their own behavior in the web or on the phone and -- >> Nathan Eagle: That would be the hope. I mean, the first thing is trying to establish a piece of information that would be relevant to these individuals. And I think something like the fact that there's, you know, a potential disease outbreak would be a pretty good one. The other intervention, which is less of an experiment and more of a commercial entity, is this idea of improving people's socioeconomic status by providing them work. So giving them income based on them doing these tasks and then seeing how that changes their behavior over time. I mean, that's nice because, I mean, that really -- I mean, we've got the 15 million people in East Africa who are participating in that. So that's about as big of an intervention as I can imagine. >>: What sort of restrictions do you have to work under when you're working with these data sets? I mean, do you have to go to these locations and work in these locations? You can take the data -- >> Nathan Eagle: Well, taking the data out is generally okay under major constraints. But typically, I mean, a lot of this data, I've spent weeks, literally weeks in underground basements in Kigali, in Nairobi, and it's a surreal experience, actually.
You're in this giant server farm that's just -- because you're on the equator and you're being blasted by air conditioning and you take, like, the elevator up, and you walk out the door and you're in the middle of Africa. But yeah, to deal with that, like to get access to this data, I have to work very intimately with the operators. And the operators are interested, because, you know, we're providing them services, like TxtEagle. And then also insight into the dynamics of their subscriber base. At the end of the day, they don't have the computational resources, nor the human resources to analyze this type of data. And so that's kind of the value proposition there. And so I can take it out. I have to -- it has to, for obvious reasons, be encrypted. There can't be any phone numbers, anything that can identify an individual. And I want that for my own protection as well as for the privacy protection. Like the last thing I want is to be indicted and have this data, you know, come out and be going into a court case or something. So it needs to be de-identified in a very rigorous way. And some operators are more -- are more paranoid about that than other operators. But for me, that's kind of a mandatory thing. >>: Like taking out the phone numbers is one thing. But you have a lot of information. Even if you can't -- like if you have someone's phone number, clearly you can identify them pretty easily. But with the type of data you have, you could probably identify a lot of these people even without the phone number. >> Nathan Eagle: That's possible. I mean, like I hope not. I have to make the case to the IRB that that's not possible. But I think that's, you know, like, you know, I mean, it's -- I don't know. >>: A grad student proving it is possible. >> Nathan Eagle: We were talking at lunch about the Climer paper, looking at topology alone, you can start cracking some of the stuff.
To be honest, I'm less worried about those attacks, because those types of attacks, they let you find yourself and maybe figure out who your friends are friends with. But it doesn't go beyond that. So my real concern is making sure that, like -- and this goes to the fact of just being able to -- I mean, ideally, I'd like to release some of this data to the academic community. I mean, the biggest contribution of my Ph.D. was not these little generative models of M.I.T. behavior, but rather it was releasing that data set online. And it's generated literally hundreds of journal publications thousands of times. It's been a pretty big contribution. Like having this type of data, getting it out there in a way that's ethical would, I think, be a phenomenal contribution, but the problem is trying to figure out a way to do that while preserving privacy. And at the moment, I don't know the right answer to how to do that. >>: What sort of volume in terms of terabytes are you dealing with in the largest data sets here? >> Nathan Eagle: The largest data sets, I mean, these things compress phenomenally well. But uncompressed, we're talking maybe a half a petabyte. >>: Okay. [applause] Thanks very much.
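
[Illustration] The de-identification requirement described in the Q&A (no phone numbers in the exported data, but hashed IDs that still let records for the same subscriber be linked) can be sketched roughly as below. This is a minimal sketch, not the speaker's actual pipeline: the use of HMAC-SHA256, the key, the field names, and the sample records are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key; the assumption is that it never leaves the operator.
SECRET_KEY = b"example-key-held-by-operator"


def pseudonymize(phone_number: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash (HMAC-SHA256, hex) standing in for a phone number."""
    return hmac.new(key, phone_number.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict) -> dict:
    """Replace caller and callee numbers with hashed IDs; keep other fields."""
    out = dict(record)
    out["caller"] = pseudonymize(record["caller"])
    out["callee"] = pseudonymize(record["callee"])
    return out


# Hypothetical call detail records with made-up numbers.
records = [
    {"caller": "+250700000001", "callee": "+250700000002", "duration_s": 42},
    {"caller": "+250700000001", "callee": "+250700000003", "duration_s": 7},
]
hashed = [deidentify(r) for r in records]
```

The same number always maps to the same hashed ID, so the call graph survives while the raw numbers do not. As the audience member points out, this alone does not rule out re-identification from the graph topology or from behavioral patterns.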
