Transcript of Andreas Weigend
Data Mining and E-Business: The Social Data Revolution
Stanford University, Dept. of Statistics

Andreas Weigend (www.weigend.com)
Data Mining and Electronic Business: The Social Data Revolution
STATS 252
May 4, 2009
Class 5 Facebook: (Part 2 of 2)

This transcript: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_5facebook-2_2009.05.04.doc
Corresponding audio file: http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_5facebook-2_2009.05.04.mp3
Previous Transcript: (Part 1 of 2): http://weigend.com/files/teaching/stanford/2009/recordings/audio/weigend_stanford2009_5facebook-1_2009.05.04.mp3
To see the whole series: Containing folder: http://weigend.com/files/teaching/stanford/2009/recordings/audio/
Transcript by Tamara Bentzur, http://outsourcetranscriptionservices.com/

Andreas: Are we ready for the second half? I already introduced Itamar. After our interesting discussion in the first half, now we will hear about data science at Facebook. You should know that I told them that people only call their thing science when it is not a science. [Laughs] Physics is not called Physics Science.
Itamar: The name is quite awkward. You can thank my former boss Jeff Hammerbacher for that.
Student: Are you going to charge…
Itamar: No, it’s open. You can use it, a company can use it.
Student: …
Itamar: Not to my knowledge, no; that’s a really interesting question, though. I’m not sure. So, I’m Itamar and this is Eric. He’ll continue my presentation after I’m finished with most of it. Today I want to talk with you guys about the team that I work on at Facebook, called the Data Team, or sometimes we call ourselves the Data Science Team.
First, let’s go over some of what’s involved …
Andreas: Would you be willing to share the slides with the students afterwards?
Itamar: Sure, absolutely.
Andreas: Okay, concentrate on the class and don’t worry about notes. I’ll put the slides up. Did that change from what you sent me yesterday, or could I just put up what you sent me yesterday?
Itamar: This has changed. I want to start by giving you guys a little bit of a taste for what’s involved, particularly with respect to the scale of doing data analysis and data mining at Facebook. We have a social graph, as everyone knows. There are two hundred million active users on our site. More than half of them, over a hundred million users, come to the site each day. Several hundred thousand new users join each day. Every user can be described by hundreds of dimensions of different types: numerical, categorical, and textual dimensions. The average user has 120 friends. Friendships on Facebook span many different types of relationship: coworkers, people you may have met several years ago, close friends, family members, and so on. There is also a lot of rich behavioral data that we collect. In terms of action data, users interact with hundreds of thousands of applications on and off the site, and users interact with one another directly via hundreds of different, unique types of interactions that we support on this site.
0:02:35.8 Finally, there is rich data we collect about social content: the photos, the status updates, the platform application content, the events, the posts, the videos, the notes, the groups, everything that users create and share with one another. We have hundreds of millions of sharing events along these lines every single week.
What do you do with all this data? You can’t just stick it in a MySQL database or even in an Oracle rack server. At this kind of scale, the solution that we’ve gone with, for which we really have Jeff Hammerbacher, the former manager of the team, to thank, is a distributed approach with a piece of technology called Hadoop, as well as a piece of technology that we developed on top of Hadoop called Hive. Hadoop and HDFS are a distributed approach to data warehousing and computation. It is inspired by the MapReduce programming paradigm and the distributed file system developed at Google, but Yahoo developed it in Java in a system called Hadoop, with a distributed file system called HDFS. We took these technologies and on top of that, we built something called a metastore, which enables you to do metadata management, representing data as actual database-like tables rather than as flat files. Finally, we built a [0:03:58.0 unclear] language. It’s very similar to SQL, called HiveQL, and it allows pretty much anyone with a SQL-like background to run computations on this data warehouse.
We have this Hadoop and Hive system. It’s deployed on a couple of different clusters, but our main cluster has over a petabyte of raw capacity, over 2 terabytes of uncompressed data are collected into this cluster every single day, and dozens upon dozens of terabytes of data are read and written each day via Hadoop and Hive.
Andreas: 2 terabytes, including videos and everything?
Itamar: We don’t actually put the videos in the data warehouse.
Andreas: That would be much more.
Itamar: That’s right. The photos and the videos are not in the data warehouse. The fact that a photo or a video was created or viewed is recorded in an action log, which is then imported into this Hadoop data warehouse.
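A minimal sketch, not from the talk, of the MapReduce pattern that Hadoop implements, applied to the kind of action log just described. It runs in a single Python process purely for illustration. The tab-separated log format, the action names, and the function names are all assumptions; the real log schema is not given in the lecture.

```python
from collections import defaultdict

# Hypothetical action-log records: "user_id<TAB>action<TAB>object_id".
# The real schema is not described in the talk.
LOG_LINES = [
    "u1\tphoto_view\tp9",
    "u2\tphoto_upload\tp10",
    "u1\tphoto_view\tp10",
    "u3\twall_post\tw4",
]


def map_phase(line):
    """Map step: emit an (action, 1) pair for one log record."""
    _user, action, _obj = line.split("\t")
    yield action, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts for one action type."""
    return key, sum(values)


def count_actions(lines):
    """Run map, shuffle, and reduce over all log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(count_actions(LOG_LINES))
```

In a SQL-like language such as HiveQL, the same aggregation would be a `GROUP BY` on the action column with a `COUNT(*)`; the table and column names would depend on a schema that the talk does not show.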
You can think of all this data as dimensional data about users and raw log files about the pages they view, the interactions they have, the applications they install, the content that they produce, and so forth.
0:05:07.4 Now that you understand the core technology that’s involved in terms of the infrastructure, I’d like to explain what the data science team itself does. We can really think about what the data science team does as two different things: behavioral analysis and data-driven systems.
Behavioral analysis – some examples – we are involved in discussions of formulating the key product health metrics by which we gauge our site’s success. With a mission like helping people share more information and connect to the people they care about, it’s not entirely obvious what to pose as a product health metric. You can’t simply talk about number of visits or number of users, or number of purchases as you could if we sold something.
Another thing that we do is support product launches by evaluating the effect of those launches. For example, you guys probably noticed, those of you who use Facebook, that there was a site redesign six or eight weeks ago. We came in before the launch and helped the product team answer some questions about which specific decisions to make, based on past behavior. Once the product was actually launched, we judged its effect on the overall ecosystem of the user base.
Another example is growth modeling: understanding which markets we’re growing in, what the saturation patterns are like, what the adoption curves are like, and where there might be markets that we don’t have any success in, even if we’re sending invitations and there are some key users there, because the market is a little bit resistant. These are the questions involved in growth modeling.
Another interesting problem is user churn modeling. That’s the problem of understanding, for a given user who has a certain set of characteristics and a certain usage profile, how we can model the likelihood that that user will return to the site K times in the next T periods. Doing so would enable us to label high-risk users and maybe target them with additional proactive measures to keep them on the site. We can also leverage this kind of modeling for explanatory purposes, to understand what sort of features get users to stay on versus what sort of features push users away.
There are a couple of different things here as well. There are production incentives; I’ll be talking about that later. That’s a series of studies trying to understand what gets users to produce content. There is also content diffusion, which Eric will be talking about. That’s sort of the first family of things that we do. It’s very much driven by quantitative social science analysis.
The second set of things we do we call data-driven systems. This really is a lot of machine learning, recommendation systems, and algorithmic optimization. As an example, one of the projects that we do here is ad CTR prediction for optimization of the ads that we serve. Given a user, their profile characteristics, various things they might talk about with their friends, and ads they’ve seen recently, what’s the likelihood that they’ll click on a particular ad? If we’re able to determine that likelihood, we can maximize our revenue by serving the ads that have the highest likelihood times the highest expected value.
0:08:19.0 Another example is PYMK.
PYMK is a feature on the site that suggests to you people you might know, people you might want to connect with. That’s actually a very interesting and non-trivial algorithmic problem. You have a set of friends, and we can compute maybe the second-order set of friends of friends that are connected with those friends, pick out of those people the ones that you are most likely to connect with, and actually show them to you. The problem is, if the average user has 100 friends, the size of that set is 10,000, and we can’t show you 10,000 people. We can show you one person. How do you actually determine which of the people in this candidate set are most likely to be people you know, want to connect with, and want to interact with?
Andreas: Let me interject here. It’s not only about you, but it’s really about the site as a whole. The metric is not just your likelihood of clicking on them. If it were, you would just show the cutest people and that would solve the problem. You want those people who actually will be good for the network as a whole. It might not be who is most likely to connect, but who will benefit the most. That’s the hard part about modeling. That’s why statistics comes in.
Itamar: Exactly, and as an example, consider the case of an existing user who has 500 friends on the site, who might not derive that much more value from making another connection. We’re faced with a problem of two candidates to show him; one is another user who has been on the site for a very long time, is a very close friend of his, who for some reason he may have missed.
Another person is someone he knows a little less well, who is a new user, who just came on the site and only has a few friends. In some cases, it might actually be more advantageous in terms of a network effect to recommend the new user, because making that connection will severely impact the likelihood that the new user stays on the site and derives value. That’s an example of the point that you made.
Student: …
Itamar: Absolutely, that’s right.
Student: …
Itamar: We’re evolving the algorithm to consider these sorts of situational factors. I think Andreas alluded to them earlier. Right now, it’s not very good; you’re right. You keep seeing the same recommendations over and over, but we’re definitely thinking about how to evolve that to a more optimal state in the future.
A few more examples: search ranking. When you search for someone, how do we decide, again based on your intent and also based on what’s best for the network, who to show as the first entry in the search result?
0:11:10.3 Finally, a problem that we’re just starting to work on is the highlight section. I don’t know if you guys actually know what that is. As a result of the redesign, there are now two streams on the home page that serve you information about what your friends are doing. There is the main stream that shows you everything your friends are doing at that time. There is the highlight stream on the bottom right that shows you the best content. That latter problem is a very robust and interesting quality recommendation system problem. We’ve just started tackling how to actually do that in an empirical way. The point is, we have these two sets of tasks.
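A minimal sketch, not from the talk, of the friends-of-friends candidate step described for PYMK above. Ranking candidates by their number of mutual friends is an assumed stand-in heuristic; the talk does not disclose how Facebook actually ranks candidates. The graph data and the function name `suggest` are hypothetical.

```python
from collections import Counter

# Toy undirected friendship graph (hypothetical data).
FRIENDS = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "cy", "dee"},
    "cy": {"ana", "bo", "dee", "eli"},
    "dee": {"bo", "cy"},
    "eli": {"cy"},
}


def suggest(user, friends, top_n=1):
    """Rank friends-of-friends by mutual-friend count (assumed heuristic)."""
    direct = friends[user]
    mutual = Counter()
    for friend in direct:
        for candidate in friends[friend]:
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in direct:
                mutual[candidate] += 1
    # Sort by count descending, then by name so that ties are deterministic.
    ranked = sorted(mutual.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _count in ranked[:top_n]]


if __name__ == "__main__":
    print(suggest("ana", FRIENDS))
```

A ranking like this optimizes only for the individual user. The point made in the discussion, that a suggestion may matter more to a new user with few friends, would require a different objective, for example one that also weights the benefit to the candidate.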
Unfortunately, I can’t really talk about the specifics of the data-driven systems part because it’s pretty algorithmic. It’s a lot of the secret sauce of what we’re working on. In the behavioral analysis part, we can really explore a bunch of interesting problems that reveal the nature of user behavior, the nature of social norms, and the nature of how products affect user experience on this site.
I think a final point I want to make here is there are also some interesting problems at the intersection of these two focuses. Most of those problems are tool-based in nature or have to do with infrastructure and tables: making sure you have everything instrumented properly, making sure you have tables that express the metrics you really care about, and making sure you have tools that allow you to run statistical experiments and machine learning algorithms like random forests, logistic regressions, or linear regressions on this huge scale of data that you have.
Now, after that longwinded slide, I will quickly show you who is on the team. We have ten people and a manager. The backgrounds are pretty varied. Some people are fresh undergrads. Some people have PhDs in artificial intelligence. Other people on the team have PhDs in computational sociology. One person has a PhD in biomedical informatics. We have a self-proclaimed algorithmic designer on the team, who specializes in visualizing data.
The first thing I want to talk about is a pretty simple exercise in descriptive statistics that I think yielded some very interesting results. The question we were tackling with this particular study is whether Facebook is increasing the size of people’s personal networks, the networks of other people that people are interacting with and keeping up with on the site. The question is very difficult to answer.
The first thing we wanted to do was a descriptive task to explore the different types of relationships that people maintain on this site, and the relative sizes of these different groups. You can think of at least four different types of relationships that might exist in the world that we might care about on Facebook.
0:14:14.3 One set is the people you just know throughout your life, and in Facebook, it’s fair to say those are really your Facebook friends. Your Facebook friends for many users are people they’re actually friends with, but it’s actually a superset of that. It’s a set of people you’ve met at some point in your life that have some sort of relevance to you. Researchers have estimated this number to be somewhere between 300 and 3,000. Malcolm Gladwell did an interesting experiment where he gave people the phonebook, went through common names, and asked people to tell him how many people of a given name they knew. By using that information, compared to the distribution of names in the actual population, he was able to sort of extrapolate a guess as to how many people you know throughout your life.
The second set of people who might be interesting for us is your communication network. These are people with whom you directly communicate on a regular basis. It probably indicates, at the very least, your core support network, the people you really rely on for emotionally significant events and support, and social researchers have estimated the size of this group to be as low as 3 people.
Duncan Watts and a few of his students actually dug into this point a bit and looked at email communication patterns across various universities. They found that in a month-long period, people generally communicate with between 10 and 20 people over email, if email is something that they’re using.
The final set of relationships that is interesting for us is this notion of maintained relationships. Social technologies like the Facebook Newsfeed or RSS readers allow you to keep up with people in your life and just know what they’re doing without actually having to directly communicate with them. You see your friends do something and it pops up on your newsfeed, or you have an RSS reader that gives you information from your friends’ blogs that you are subscribing to. You can actually consume the content that they’re producing in a passive way. This really is a form of relationship management, we’re claiming, because you might see your friend do something in a newsfeed, like post a photo of the fact that they’re engaged, and this leads you to reach out to that friend and congratulate them on their engagement, even if you hadn’t talked to them in many months.
It’s also important to know that even though technologies are making these types of relationships much easier to access, it’s something that you could have done in the past as well. You might have been at a party and, rather than directly talking to someone you didn’t have a direct communicative relationship with, you asked your good friends, “What is Joe doing these days?”
We’d like to measure the size of these different types of relationships on Facebook. What we did was examine the relationships of a random sample of a couple hundred thousand users, over thirty days, on this site. We defined the networks of these users in four ways:
0:17:16.9 The first definition is all their friends. This is the largest representation of a person’s network on this site.
It’s the best proxy we have for how many people they know.
The second group of people we wanted to measure was reciprocal communications. This is the number of people that you, as a user, reach out to via direct communication on this site, and who also reach back to you, via messages, wall posts, and comments. We think this provides a measure of your core network, the set of people that you really are actively engaging with in the most meaningful way on this site.
The third relationship we wanted to define was one-way communication. This is a little bit more of a broad notion, just the people that you’re reaching out to. Of your friends, whose wall are you writing on, who are you sending messages to, and who are you communicating with?
Finally, there is this special notion that we’re focusing on here, called maintained relationships. That’s the number of friends that you’re tracking on this site: the number of friends whose newsfeed stories you click on, and the number of friends you view at least twice. We require at least twice to control for random people who pop up in your newsfeed that you might just check out once out of curiosity, or people whose friendships you accept and then check out once and never care about again.
Here are the findings that I wanted to discuss. We graph, as a function of the number of friends a user has, the median size of these various relationships: reciprocal communication is in red, one-way communication is in green, and maintained relationships are in blue.
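A minimal sketch, not from the talk, of how the four network definitions above could be counted for one user over an observation window. The input format is an assumption: `messages` stands in for all direct communication (messages, wall posts, comments), and `tracking_events` stands in for newsfeed story clicks and profile views. "Maintained" is read here as at least two tracking events on the same friend, which is one interpretation of the "at least twice" rule. All names and data are hypothetical.

```python
from collections import Counter


def network_sizes(user, friends, messages, tracking_events):
    """Count the four network definitions for one user.

    friends:         set of the user's friends
    messages:        list of (sender, recipient) direct communications
    tracking_events: list of (viewer, viewed) clicks or profile views
    """
    sent_to = {r for s, r in messages if s == user and r in friends}
    received_from = {s for s, r in messages if r == user and s in friends}
    views = Counter(
        viewed for viewer, viewed in tracking_events
        if viewer == user and viewed in friends
    )
    # Require two or more events to filter out one-off curiosity visits.
    maintained = {friend for friend, n in views.items() if n >= 2}
    return {
        "all_friends": len(friends),
        "one_way": len(sent_to),
        "reciprocal": len(sent_to & received_from),
        "maintained": len(maintained),
    }


# Hypothetical 30-day sample for a user "u" with five friends.
FRIENDS_OF_U = {"a", "b", "c", "d", "e"}
MESSAGES = [("u", "a"), ("a", "u"), ("u", "b"), ("c", "u"), ("u", "z")]
TRACKING = [("u", "a"), ("u", "a"), ("u", "c"), ("u", "c"),
            ("u", "d"), ("u", "e"), ("u", "e"), ("u", "e")]

if __name__ == "__main__":
    print(network_sizes("u", FRIENDS_OF_U, MESSAGES, TRACKING))
```

Repeating this per user and taking the median of each count within buckets of friend count would give curves of the kind described on the slide.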
What we see here is that, as a function of the number of friends that a user has, the median user is passively engaging, in this maintained relationship sort of way, with 2 to 2.5 times more people than the number of people they directly communicate with.
Let’s see what this actually does for the structure of your network. Here we take a random user from our sample, and we graph their social graph according to these different definitions. At the top left, you have the graph that defines all of their friends. Next to it, you see the graph of their maintained relationships. These are the people that they’re keeping up with. On the bottom left, you see the one-way communication, and on the bottom right you see mutual communication. Just observe the stark contrast on the right, between the top right graph and the bottom right graph. This is really visual evidence of what technologies like newsfeed are enabling people to do. With cell phone, or email, or SMS, you have to have this reciprocal sort of mutual communication relationship with people to really learn about what they’re doing. As you can see, that network is much sparser than the maintained relationships up top.
0:20:10.2 Student: …enabled a lot of … between the actual…
Itamar: Is your question whether the people that you have maintained relationships with are a disjoint set from the people that you have mutual communication with? It’s usually a superset, but what’s interesting is if you then define this notion of maintained relationship in a much more subtle way, as the people you actually have a reciprocal maintained relationship with, the people whose activity you are consuming and who are consuming your activity, that set of people does not largely overlap with the set of people you have mutual communication with. That’s fairly interesting. There is this group of people who are having reciprocal interactions in a direct way and a group of people that are having reciprocal interactions in this passive way. There is not complete overlap between the two.
Before I go to this next study, the point of that one is just to show that with simple descriptive statistics, we get very interesting initial insights about our product’s effect on users’ experiences and users’ social lives.
The next example is much more experimental in nature. This project aimed to answer questions about content production among new users. We are really trying to get a sense for the incentives that guide new users to produce content on the site. The reason we’re interested in this comes directly out of our mission; our mission is to give people the power to share and make the world more open and connected. One of our core objectives is to get people to actually share content with their friends on this site.
As a bit of background, content production among new users looks sort of like this: a little less than half of new users upload a photo in their first two weeks. A little bit more use a third-party app, although that’s not really content production. Less than a third send a private message. About a quarter compose a status update, and about a fifth write on a friend’s wall. These numbers are kind of artificially inflated because there is a large population of new users who sign up for the site and then never come back because they’re just checking it out.
They may have heard about it in the press and they’re just not receptive users, so after their first week, they don’t actually return to the site. Controlling for these users, these numbers tend to go up a little bit, but the same sort of observations hold. Uploading a photo seems to be the most prevalent content production activity among new users. Within this study, for that reason, we chose to focus on new users producing photos. The goal of this study is really to model what leads new users to produce more photos during the first three months on Facebook.
0:23:16.2 Let’s talk about some hypotheses around what the core production incentives are. These are hypotheses that we came up with based on some social science literature, HCI literature, and so on.
Student: ….
Itamar: Yes, that’s a great point. When I’m citing this statistic here of 45% who upload a photo, that doesn’t include profile pictures. If you control for profile pictures, that number jumps down to I think about a third.
Student: …
Itamar: No, it’s still the most common behavior, except for using a third-party app, which we’re not considering here as content production. This does have some of that bias there. The one thing to note with that issue is that a lot of users tend to use their profile pictures as a place for uploading photos and not just representing themselves on the profile.
Let’s talk about hypotheses. The first hypothesis is one focused on feedback. The hypothesis is that newcomers who receive more feedback on their initial content will go on to contribute more content later on. The second hypothesis is about distribution.
Newcomers whose initial content receives greater distribution, in terms of the number of people who actually see it, will go on to produce more content. The third hypothesis is social learning. Newcomers whose friends share more content, and who see their friends share more content, will go on to produce more content themselves. Finally, we have a hypothesis around singling out. New users who are singled out in content that their friends produce will go on to produce more of that content themselves. Do the hypotheses make sense?
Student: I don’t understand the…
Itamar: Really, what we’re talking about here, with respect to photos, is that if you as a new user are tagged in a photo, you somehow become aware of this content production mechanism on Facebook. You go on to produce more content. Another example is someone might refer to you by name in a status update. Because you were referred to by name, you would attend a bit more carefully to it and think about what status updates are, and then you go and produce status updates.
Student: What about the …
0:26:01.8 Itamar: That’s right; our hunch was that this hypothesis was the weakest hypothesis because there is no real data that the user receives about how many people are seeing their content. Nevertheless, I think it’s a pretty classic consideration to assume that users somehow want to maximize the amount of distribution they have, and maybe they have indirect ways of judging that based on the buzz they hear or messages they receive from their friends.
I think the other valuable aspect of this is that if you put a sort of weak hypothesis, a dummy hypothesis, in the model, you really want to make sure that your expectations are met and that the model shows you no effect of that hypothesis. That’s one of the reasons why it’s there.
Here is the method that we followed. We did a quantitative study that I’ll be talking about here, and we also did a qualitative study, which for the sake of time I won’t mention. The quantitative study involved taking two cohorts of new users, one from November 5, 2007, and one from March 3, 2008. We observed their activity in the first two weeks and we predicted, based on features constructed from their activity in their first two weeks, how many photos they uploaded between their third week and their fifteenth week on Facebook.
Here are the features we considered. For the independent variables, first let’s talk about the variables that actually capture the hypotheses we posed.
For the first hypothesis, feedback, the variables that we looked at were whether users received comments on the content that they produced, and how many comments they received.
For the second hypothesis, distribution, the variables we looked at were the number of times that content the new user produced was viewed by their friends in newsfeed, and the number of distinct friends who viewed the content in newsfeed.
The third hypothesis, social learning, is captured by the number of friends’ photos that the new user saw in their first two weeks.
The final hypothesis, singling out, was captured by the number of times the new user was tagged in photos, as well as a binary indicator of whether the new user was tagged at all.
There were some controls.
As control variables, we included the user’s age; gender, in the form of whether or not they specified it, in addition to the actual gender; the number of friends that the user had; the total pages that the user viewed in the first two weeks on the site; and engagement with photos: the number of photos they uploaded in their first two weeks, the number of photos they viewed, the number of photo tags they authored, and the number of photo comments that they wrote on other people’s photos.
0:28:46.1 Student: …
Itamar: That’s why we ran two cohorts. One is from one period of the year and the other is from another period of the year.
We took these features that I discussed in the previous slide; again, we’re predicting the number of photos that a user produces between the third and fifteenth week. We put them in an OLS linear regression model, and here are the results that we found.
We really had two models here. The first model was only able to test two of the hypotheses, singling out and social learning. The reason the first model was only able to test those two hypotheses is because there is a certain set of people who just don’t produce photos on the site. If they don’t produce photos in the first two weeks, they’re not going to get any distribution or feedback, and those hypotheses aren’t testable. I misspoke; the first model just focuses on early uploaders, people who actually produced photos during their first two weeks, for whom all the hypotheses apply, and the second model tested on everyone, where in some cases only the first two hypotheses apply.
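A minimal sketch, not from the talk, of fitting an ordinary least squares model of the kind described, using only the Python standard library. The data are synthetic and generated from made-up coefficients purely to exercise the code; none of the numbers are Facebook’s. The three predictors are a small hypothetical subset of the variables listed, and expressing each coefficient as a percentage of the intercept is only one possible reading of how effect sizes were reported.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic two-week features for 5,000 hypothetical new users.
random.seed(0)
rows, y = [], []
for _ in range(5000):
    got_comment = random.randint(0, 1)    # indicator: any comment received
    stories_seen = random.randint(0, 10)  # friends' photo stories seen
    friend_count = random.randint(0, 60)  # control variable
    rows.append([1.0, got_comment, stories_seen, friend_count])
    # Assumed data-generating process, used only to test the fit.
    y.append(10.0 + 0.6 * got_comment + 0.1 * stories_seen
             + 0.02 * friend_count + random.gauss(0, 1))

names = ["intercept", "got_comment", "stories_seen", "friend_count"]
fitted = dict(zip(names, ols(rows, y)))
# Each coefficient as a percentage of the intercept (an assumed convention).
pct_change = {k: 100 * v / fitted["intercept"]
              for k, v in fitted.items() if k != "intercept"}

if __name__ == "__main__":
    print(fitted)
    print(pct_change)
```

Coding feedback as an indicator (`got_comment`) rather than a count mirrors the transformation discussed in the talk, where the presence of any comment mattered but the number of comments did not.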
If you see here, I’ve listed the coefficients of the model and the percent change from the Y intercept. At the bottom there, you have the independent variables that are actually testing for the hypotheses. What I haven’t included here was an observation that we made initially, which is that the number of comments received on photos had absolutely no statistically significant effect on the number of photos produced later on, which was a very surprising result. It basically says that, controlling for all these other factors, the number of comments you receive on photos as a new user is not in any way leading you to produce photos later on. When you change that to a binary variable that indicates whether or not you received any comments on your photos as a new user, we do see a positive effect. There is a positive coefficient there that leads to a 6.2% increase in the volume of photos that you produce. That’s the feedback hypothesis.
Student: …
Itamar: It’s an indicator variable. We transformed that variable into an indicator variable that expresses whether the user received any comments on the photos that they produced. The second hypothesis, distribution, is captured by that independent variable there, photo views received. What we see here is that there is a significant effect; the three stars mean that the coefficient is significant at the .01 level, but it’s pretty modest. Photo views received leads to only a 2.6% increase in the volume of photos that you produce later on. 0:32:09.7
To test the hypothesis of social learning, we have an independent variable there on the left, photo stories seen, which is the number of times that you see your friends actually produce photos. We see a pretty sizable effect here, an increase of 6.1% in the volume of photos that you produce later on.
The final hypothesis, around singling out, is captured by this binary variable of photo tags received, and we found no significant effect there. Turning to the second model, which is testing on everyone and is only able to test the two hypotheses around social learning and singling out, we see pretty similar effects. Again, let’s look at the independent variable columns here. Photo stories seen is the social learning variable; as an interaction with early uploaders, we see a very significant effect, an increase of 10.7%. For photo stories seen interacted with non-early uploaders, we see a much milder effect. For photo tags received as an interaction with early uploaders, we see no significant effect. For photo tags received interacted with non-early uploaders, we see a positive effect. I think the story that the right column especially is telling is that users are going to produce more content later on, as users who are engaged with the photo app, if in their first two weeks they see their friends doing this activity at all, if they’re able to discover the feature. If you see here, receiving a photo tag, if you’re not an early uploader of photos, has a pretty large effect, which signals to me that this is the event through which the non-early uploader discovers the photo feature.
So, to summarize the results: for the feedback hypothesis, we found support among early uploaders, and the hypothesis is not applicable among non-early uploaders. For the distribution hypothesis, we found modest support, a pretty small coefficient, among early uploaders. Again, it’s not applicable among non-early uploaders.
We found significant support across both groups of users for the social learning hypothesis and mixed results in terms of singling out: no support among early uploaders, probably because they don’t need to discover the feature, they’ve already discovered it, and large support for the non-early uploaders, probably because singling them out with a photo tag is leading them to discover the feature. The conclusions here really seem to suggest that we learn about social utilities from our friends. If our friends are doing something on the site and we’re able to witness them doing that, we’re probably going to start mimicking them. Social learning, as a result of this study, appears to be the main lever for content production on this site. We have very rich means of pulling this lever. Back in the old newsfeed system, we could play with the weights and perhaps show you a story of a particular content type that you hadn’t produced before, so if you’ve never produced videos, we show you a video story and see if you actually produce more. 0:35:18.8
Now, with highlights, the area of the site that shows you the top stories that are going on, we still have that ability. If you are a new user and you have never become a fan of a page or you’ve never created an event, we might want to expose you to that sort of feature so you can learn about it and actually mimic the behavior yourself.
For the final study that we’re discussing, we’ll be talking about modeling contagion through newsfeed, where contagion is a theory about how content diffuses among a social network of people who learn about trends or adopt technologies. For that, we have Eric Sun.
Eric: This is a research project that I started last summer. The main goal is to figure out how ideas spread through Facebook. Obviously, on Facebook, the biggest thing that causes the spread of ideas is newsfeed. This is important for many reasons. First, it’s important just as an academic study, but it’s also important for making money on this site because, for example, Facebook pages is the main product where Facebook can interface with advertisers: most advertisers will put up a page on Facebook and you can become a fan of the page. We wanted to figure out how these ideas can spread on Facebook. We compared these results with existing models of diffusion, and ideally we would want to show that advertising on Facebook is way better than advertising anywhere else, so why would you bother spending money anywhere else. I started this last summer, so it’s based on the old Facebook, in particular, the old newsfeed, where not everything is shown but certain types of stories are picked and shown to each user.
Just to give a little bit of background, there are two competing theories about how ideas spread through the population. The old theory is that it’s all about the influentials. This is what Malcolm Gladwell talks about in The Tipping Point, and the idea is that if you reach a tiny group of very influential people, then they’ll talk to their friends and their friends will talk, and eventually you’ll reach everyone for free, after you get these very important people at the top. Therefore, a billion dollars a year is spent on word-of-mouth campaigns, trying to find out which of these people are actually influential, and this amount is just growing every year. This 36% figure was from 2006. I’m not aware of how it’s changed since then. That’s the influentials theory. The competing theory was recently developed by Duncan Watts, who is a sociologist at Columbia, and I think he’s still at Yahoo. His idea is that anyone can be an influencer.
For ideas to become very popular, you don’t need to get the influential people; rather, all you need is a very susceptible population. So, I think he coined a term called “sheeple.” They’re not people and they’re not sheep, they’re “sheeple.” If you have a lot of people who are easily convinced, that’s even better than just getting the people who are very influential at the top. 0:38:56.1
We’d like to test these things on Facebook. Most of you are probably aware, in case you’ve forgotten the old Facebook already, that the old Facebook had a newsfeed where stories were selected based on how receptive you were likely to be to them. The way you interact with the pages product is, for example, any retailer might have a page and Alice might decide to fan a page. Once she does that, with some probability, her friends will see an item on their newsfeed saying Alice fanned a page. Once you see that, you can either ignore it or you can also click a link saying, “I want to fan this page, as well.” If you do this, we call this a chain of length one. If you continue this process throughout the whole population, the result is that you get a huge cloud of large connected trees.
This is a really interesting example that we found. This is a diffusion chain for Stripy, which is a European cartoon. We noticed that when we visualized this graph, three clusters immediately popped out. One theory was that maybe these users were differentiated in some way. It turns out that it was indeed true, although [0:40:27.5 unclear] are from Bosnia. All the Slovenian nodes are yellow, and the Croatian nodes are green.
The Bosnian and Slovenian nodes are connected by a few users, but the Croatian cloud has not been connected yet.
Student: I don’t understand what the links represent. They represent actual links, in terms of friend relationships? Links being that I saw one of my friends fan this page and directly fan … within 24 hours….
Eric: This is just an extension of this action, right here. In order to have a link to another node, you need to first be a friend of that person, and then you have to see that that person has fanned a certain page, and you also have to fan it, as well, within 24 hours.
Student: In other words it’s the influence, not just the ….
Eric: It turns out that if you just draw these links from user to user, often the vast majority of fans can be connected into one single cluster. For a new page that becomes very popular, if you draw these links, sometimes you can get over 90% of the fans connected in some way. That just speaks to how connected Facebook is and how addicted Facebook users are to Facebook. For example, on August 21, 2008, which was right in the middle of the Beijing Olympics, 71,000 of 96,000 fans of the [0:42:16.5 unclear] page views of the American gymnast were in one connected cluster. For pages created after July 1, 2008, with the data measured on August 19, 2008, the median page has almost 70% of its fans in one connected cluster.
Student: … 0:42:42.1
Eric: Right now, we’re just looking at the clusters. Right now, the yellow and blue nodes would be connected because there is at least one link between those two clouds, but this cloud on the bottom right would not be connected. The 70% figure we get is just by taking all the nodes in the top two and dividing by all of them, for example. In this case, there are also a lot of other nodes that are just random, that were not as interesting to show.
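[Editor’s sketch, not part of the talk.] The “share of fans in one connected cluster” statistic described above can be computed by treating each diffusion link as an undirected edge and finding the largest connected component. The following is a minimal union-find illustration; the fan names and links are invented, and the function name `largest_cluster_share` is hypothetical.

```python
from collections import Counter

def largest_cluster_share(fans, links):
    """Fraction of fans that fall in the largest connected cluster,
    where `links` are (earlier_fan, later_fan) diffusion links."""
    parent = {f: f for f in fans}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)      # merge the two clusters

    sizes = Counter(find(f) for f in fans)
    return max(sizes.values()) / len(fans)

# Invented example: 10 fans; A..G form one cluster, H and I another, J is isolated.
fans = list("ABCDEFGHIJ")
links = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "G"), ("H", "I")]
share = largest_cluster_share(fans, links)
print(share)
```

Two clouds joined by even a single link count as one cluster here, which matches the speaker’s description of the yellow and blue clouds being counted together.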
Student: A cluster means that every person in that cluster fanned a page as a result of seeing someone else’s cluster?
Student: …
Eric: That’s something we’d like to test, so I’ll get to that in a few minutes. First, we would like to figure out how these large clusters come about in the first place. Are these large clusters started by one guy, as the influentials theory predicts, or are they formed when long chains of diffusion are merged together? It turns out that across all pages of any meaningful size, which we define rather arbitrarily at 1,000, 14.8% of the fans in the biggest cluster were start points. By that, I mean 15% of the fans searched for the page and fanned it, whereas 85% of the fans found the page by seeing on their newsfeed that someone fanned the page, and then clicking the link to “also fan the page” within 24 hours. This 15% figure is actually very stable, especially as the number of fans of that page increases. This happens because the average node in the biggest cluster is connected to almost 3 others. It’s in both directions, so if I’m somewhere in this big cloud of users, on average, I either have 2 parents and 1 child, or 2 children and 1 parent. I think the median behavior was 2 parents and 1 child. I don’t recall.
Just to compare diffusion chains on Facebook versus real life, obviously it’s a little bit different, but the connected nature of Facebook makes long diffusion chains very easily possible. Just to contrast this with a word-of-mouth study in real life, there was a paper in 1987 where they tried to track the propagation of piano teachers in a neighborhood.
When they drew a similar link graph, they found that 38% of the paths involved at least 4 individuals. On Facebook, where things are very easily found, and you log onto Facebook and you see a newsfeed with a lot of items that say your friends have fanned a page, and it’s really easy to also click that link, it turns out that 86.4% of paths of page diffusions involve at least 4 individuals. It’s already a huge difference.
Another thing we wanted to do is to figure out how these long diffusion chains are created, because this obviously has a lot of implications for advertisers and it is also just interesting as a sociological experiment. We wanted to test whether the influentials theory or the contagion theory is more applicable to Facebook. To do this, we tried to predict the length of the diffusion chain that someone will create when they fan a page. 0:46:50.9
Basically, the way it works is that if I fan a page, using my characteristics, maybe we’ll be able to predict the length of the diffusion chain I create. If I fan a page, maybe Itamar will see it and he’ll fan the page, and someone he’s friends with that I’m not will also fan it. The process continues. If it stops there, then my chain will be of length 3, but maybe there are some characteristics that will be more amenable to having someone create a very long chain. If we can figure out what those characteristics are, then that would be great. To do this, we used a sample of 10 pages that we found pretty randomly. For these pages, we got the graph of every single link between the actor and the follower.
To make sure these are good pages, we made sure they’re sufficiently old and also sufficiently large. In our prediction model, the response variable is maximum chain length. For each user who is a chain starter, someone who becomes a fan of a page without ever seeing any of their friends fan the page in the last 24 hours, the maximum chain length is a count variable for how long their chain was. Predictors were gender; their age, which is how long they’ve been a member of Facebook; and their feed exposure, which controls for the number of friends who saw their newsfeed story. In the old Facebook, this was not just everyone. We also controlled for their friend count, for how active this person was on Facebook, and there is also a nebulous concept called popularity. This controls for newsfeed exposure through an algorithm that puts stories on your friends’ newsfeeds. With some probability, if I’m a friend of yours, I’m going to see your story when you fanned a page, but that algorithm is pretty complex.
Student: I was wondering about….
Eric: In graph terminology, sometimes they call it “diameter,” so if you take this person at the top and you plot their entire tree, it’s the maximum length of the tree, the number of levels of the tree, just the number of levels.
Students: …
Eric: Yes, but that makes it very difficult because a lot of these chains will merge, so it turns out that you won’t get a very interesting regression because most of these people – if 90% of these users are actually in one chain, then the width will just be 90%. Everyone will have the same number.
Student: To me, what you’re telling me is …
Eric: There are definitely a lot of things you can do with this data. I’m just presenting one little analysis. 0:51:01.8
I’m going to skip over the technical details, but we’ll post the paper on the wiki later, if you want to take a look.
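[Editor’s sketch, not part of the talk.] The response variable described above, maximum chain length, can be read as the number of levels in the diffusion tree rooted at a chain starter. The following is a minimal illustration under stated assumptions: the `followers` mapping and the names in it are invented, levels are counted so that a starter with no followers has length 1 (counting links instead would give one less), and links are assumed to point forward in time so the structure has no cycles.

```python
from functools import lru_cache

def max_chain_length(starter, followers):
    """Number of levels in the diffusion tree below `starter`.
    `followers` maps a user to the users who fanned after seeing
    that user's newsfeed story. Assumes no cycles."""
    @lru_cache(maxsize=None)
    def levels(user):
        # One level for this user plus the deepest chain among followers.
        return 1 + max((levels(f) for f in followers.get(user, ())), default=0)

    return levels(starter)

# Invented example mirroring the talk: Eric fans, Itamar follows,
# then a friend of Itamar follows; Ann follows Eric but nobody follows her.
followers = {"Eric": ["Itamar", "Ann"], "Itamar": ["Friend"]}
length = max_chain_length("Eric", followers)
print(length)
```

The memoization means a user reached through several parents, as happens when chains merge, is only evaluated once.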
We ran a negative binomial regression and we found that the only consistently significant coefficient is on this feed exposure variable, which controls for the number of friends who saw your newsfeed story. This coefficient hovers around one, which implies that if we tweak it such that newsfeed publishes a user’s action to 1% more people, we can actually expect a 1% longer max chain. It also turns out that the friend count variable, which is just the number of Facebook friends you have, is not realistically meaningful after the controls, which is a surprising finding. It says that after controlling for distribution and popularity, your demographic characteristics, such as age and gender, and your number of Facebook friends do not play an important role in the prediction of your maximum diffusion chain length. This is one small piece of evidence for the contagion theory that Duncan Watts puts out.
Student: …
Eric: If you have no friends, then you will not diffuse. You are right, there will be an excess number of zeros, so we did a correction. It’s called a zero inflation correction, in the negative binomial procedure. You can take a look at the paper if you’re interested.
In conclusion, Facebook newsfeed enables long-lasting chains of diffusion that may reach a lot more people than real-life chains. This is made possible by Facebook, which is very connected, and ideas that have good receptiveness will attract long, wide chains of clusters. The main influence on these long chains is basically distribution, not anything related to the users. That’s why it may not be so important to find the actually influential people. I think that’s it.
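[Editor’s sketch, not part of the talk.] The reading “1% more exposure, 1% longer max chain” for a coefficient near one is what one would get from a count model with a log link and a log-transformed exposure predictor, in which the coefficient acts as an elasticity. That model form is an assumption here; the talk does not spell it out, and the function name `expected_chain_ratio` is hypothetical.

```python
def expected_chain_ratio(beta, exposure_increase):
    """Multiplicative change in expected max chain length when feed
    exposure grows by `exposure_increase` (0.01 means 1%), assuming
    log E[chain] = ... + beta * log(exposure)."""
    return (1 + exposure_increase) ** beta

# With a coefficient of 1, a 1% rise in exposure scales the expected chain by about 1.01.
ratio_at_one = expected_chain_ratio(1.0, 0.01)
# A smaller hypothetical coefficient of 0.5 would give a weaker response.
ratio_at_half = expected_chain_ratio(0.5, 0.01)
print(round(ratio_at_one, 4), round(ratio_at_half, 4))
```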
Student: … clusters with the …
Eric: We assemble clusters just by drawing links between all these users. Our data set is so big that if you looked hard enough, you can find clusters that look like anything. Am I misunderstanding your question?
Student: … nodes… it’s connected by…are those small clusters generate… essentially you can draw at the edges… to find how big your cluster….
Eric: Yeah, but that would require a certain structure of the graph that may not be present, because there is also a time aspect where it’s really not a graph; it’s really a tree that goes throughout time. I think a tree might be a better way to think about it.
Student: … assigned any weight to the connections…
Eric: That’s not something we’ve done so far. Right now, for this, all we care about is that they’ve seen someone fan a page and they also fanned the page. 0:55:07.0
Student: I thought that… do you think there are power chains that are more… than the others?
Eric: What do you mean by power chains?
Student: The diffusion of information … nodes…
Eric: What do you mean by consistently valuable?
Student: If the information travels through one chain, consistently creating a larger effect than…
Eric: So you’re saying if we take a lot of different kinds of content and it always propagates through this one chain, is that meaningful?
Student: Yes
Eric: Maybe, we haven’t looked into that. I’d expect that something like that could happen very easily.
Student: … people versus
Eric: Yes, I didn’t cover that in the presentation but it’s in the paper if you want to take a look.
Student: … “sheeple,” and you used the cartoon to propagate.
Does this have anything to say about the complexity of the information and its ability to propagate through a long chain, as something that’s easily transmittable and … pass it on?
Eric: We think that what’s very important is that people are receptive, so for example, with the American gymnast in the Olympics example, it happened during the Olympics and a lot of people were very receptive to the idea. All they needed was some stimulus, for someone to point out that this page exists. Once someone points it out, then people are very receptive and they’ll go ahead and click the link.
Andreas: I’ll see you next week, and let’s thank both speakers very much. Thank you.
[Applause]