>> Paul Bennett: It's my pleasure to welcome Jure Leskovec today. So many of you know Jure. He's an assistant professor at Stanford. Before that he did a postdoc turn at Cornell and did his Ph.D. at Carnegie Mellon where he came from -- I think I would butcher it -- with [inaudible]. >> Jure Leskovec: Yeah, before that, yes. >> Paul Bennett: And so Jure's been all over. He was intern for -- intern here as well. So I think Jure's covered a lot of the world in a short amount of time. Jure's work has been recognized not only by a number of best paper awards but through the Microsoft Research Faculty Fellowship, through the Sloan Fellowship, through a number of different types of things. And today he's going to be talking about when information cascades can be predicted. >> Jure Leskovec: Okay. Good. Thank you. Okay. Great. Thanks for having me. It's always great to be back. Yeah. Great. So about this talk. So this talk is about information cascades. And this is joint work with two of my students, Justin and Hema [phonetic], a postdoc in my group, Julian, and then collaborators at Facebook, Lada and Alex, and collaborators at Cornell with Jon Kleinberg. So that's basically the team, and I'll show a few different pieces of work. So here is kind of how to motivate this thing. So we as humans, we are embedded in networks, we are interacting with each other, we are talking with each other, and basically these networks are the fundamental place in which we as humans exist. And one of the most important tasks that networks allow us to do is they allow us to get access to information. And here is what I mean by that. So as humans, we learn from one another, we make decisions by talking to one another, we gain trust and safety from one another. So this all kind of happens through our connections in the network. You can say that this kind of happens based on the influence of our neighbors or some input from our neighbors in the network.
And there have been a number of studies, everything from what people do before purchasing electronics -- and actually, maybe surprisingly, more people go and talk to their friends about what camera to buy rather than do research online. So we get a lot of information from our networks. So this way you can now think about the network as providing you a skeleton for access to information. And you can think about now this notion of a contagion, like a piece of information or a piece of behavior or something that spreads over the edges of a network like a virus. Okay. So there are many examples of contagions in networks, everything from information in terms of news, opinions, rumors, and so on, to, you know, word of mouth, product adoptions, and so on, to political mobilization, infectious diseases and so on. So you can think of this notion of a contagion as something very, very general. It can be anything from a particular rumor to a decision to sign up for a particular pension plan or buy an iPhone and so on and so forth. So these contagions can be in some sense many different things. So how do we think about this? We think of a network where circles are people and we have social connections between them, and now we can think of these contagions as, let's say, some color that spreads throughout the network over the nodes of the graph. So the idea is this blue contagion started here, and then it kind of propagates like a virus throughout the network. And I could have a different contagion also kind of spreading throughout the network. So that's kind of one way to think about the adoption of a particular contagion, how it is spreading throughout the network. And now that I have this notion of a contagion and I have a notion that this contagion spreads from node to node over the edges of the network, the next thing I need to define is the information cascade. So as the contagion spreads, it creates a cascade.
So what I mean by this is basically I have a network, I have this rare contagion starting here, and then the contagion kind of spreads throughout the network. It doesn't spread everywhere. It spreads over a subset of the network. And kind of the structure over which it spread, in some sense this tree, I call an information cascade. So that's basically the object we'll be studying in this talk. And just to give you an example of how these real cascades look, here is some work we did kind of a while back. This was done when we had really good data about how people send and receive product recommendations. So the idea is the following. You are a person. You buy a particular product on a Web site, you make an explicit recommendation to another person, another person can now go buy the product and further make recommendations. So now this notion of a contagion is the decision to buy a product, and what is now spreading throughout the network is basically this behavior of purchasing a given book in this case. So this is something that happens in real life all the time. But the problem is how do I get -- how do I get to see it. And what is nice here is that we worked with a large online retailer that gave us data from around 2001 to 2003 where we have 16 million of these explicit recommendations between customers on half a million products among 4 million customers. So we really see who makes a recommendation to whom, which of these recommendations are successful, and how this decision to purchase a given product propagates over this underlying implicit social network. Okay. So the first question is, if this is now the contagion that I'm studying, how do these information or product purchasing cascades look. Just to show you an example, this is a case of a cascade about a particular DVD, a movie DVD title that people were buying, and we have three types of nodes.
We have the blue nodes, which are the people that make a purchase and send recommendations; we have black nodes, which are the people that receive recommendations but don't buy; and then we have red nodes, which are the ones that receive a recommendation and make a purchase. And what do you see in this case? You see that for most of the nodes, basically, the recommendations are received but they never result in a purchase or further propagation. What you also see is that I have this kind of one big cascade here, but then I have tons of these very small isolated tiny, tiny cascades, and this is kind of a good intuition or a good picture to have in mind. So something big, and then lots and lots of these small isolated pieces that, you know, happened all over the network. So this is kind of one example of a cascade that happens in real life. Another case where cascades happen a lot is in social media. So by this I mean basically that we have users who are resharing content with one another. And this resharing happens on Twitter through retweeting; happens on Facebook by resharing posts; happens on other websites as well. So another kind of source of cascade data is online social media. Here is an example. From Facebook. My collaborator goes, writes a post, and what people can do now, they can reshare this post to their friends, basically by this reshare button. So you can go and reshare this post. This post was reshared 25 times. So now how do we think about this? We think of this post as a contagion that is kind of spreading over this 1 billion node graph, and this contagion kind of infected or spread over 25 nodes in the graph. Okay? So that's kind of the object we'll be studying, is this kind of resharing behavior of information across the Facebook network. There are many kinds of posts that spread a lot. Here is kind of a very useful post. This post teaches you how to compute the volume of pizza you eat.
So if you have a pizza with radius z and thickness a, then the amount of pizza you eat is pi*z*z*a. Okay? And, you know, this got reshared 27,000 times. So now this thing kind of spread over 27,000 nodes in the Facebook -- in the Facebook network. Okay. And basically, even though information cascades have been happening for a long time, we are only now kind of able to see them at this granular, individual-propagation level. And this has kind of led to a lot of research in this area. So, for example, people have been studying information cascades [inaudible] in a sense that, you know, how do people adopt hashtags, so now you can study how the adoption of a given hashtag is spreading throughout the network. Another kind of contagion that is spreading throughout the network is the retweeting behavior, so how are people basically retweeting pieces of information, what is going to be reshared. In some other social networks people also reshare photos, and now basically individual photos tend to propagate throughout networks like Flickr and so on. So given that a lot has been done, kind of the question that hasn't necessarily been well addressed is this question about how is a cascade going to grow in the future, so how are things kind of going to be reshared in the future and how much of this behavior can be predicted. And this is kind of what I'll try to talk about. Before I tell you how to formulate the problem and how to think about this prediction problem of how a cascade is going to grow in the future, let me kind of show you a few examples why this is -- why this is nontrivial or why this could be hard. So the first reason why this could be hard is the following graph that plots cascade size, which is the number of reshares, versus the probability of seeing a cascade of size at least X. So this is the complementary CDF -- the complementary cumulative distribution function.
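As a concrete aside, the complementary CDF being described here is easy to compute from a list of cascade sizes. A minimal sketch; the sizes below are invented to mimic the skew, not actual Facebook numbers:

```python
import numpy as np

def ccdf(sizes):
    """P(cascade size >= x) for each distinct observed size x."""
    s = np.sort(np.asarray(sizes))
    xs = np.unique(s)
    # searchsorted(..., side="left") counts elements strictly below x,
    # so 1 - count/n is the fraction of cascades with size >= x
    ps = 1.0 - np.searchsorted(s, xs, side="left") / len(s)
    return xs, ps

# hypothetical skewed sizes: most cascades tiny, one huge outlier
sizes = [1] * 900 + [2] * 60 + [5] * 30 + [50] * 9 + [10_000]
xs, ps = ccdf(sizes)
# plotted on log-log axes, (xs, ps) shows the heavy tail
```

Plotted on log-log axes, the points fall roughly on a straight line for heavy-tailed data, which is the shape of the Facebook plot in the talk.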
So for every X, I'm asking how many cascades have size at least X. And basically what do I see? I see that most cascades are super small. Actually, on Facebook, less than 1 percent of all the images that get uploaded get shared more than one or two times. So basically if I would want to predict whether something will get reshared, then my baseline would be no, it won't, and I'll be accurate around 99 percent of the time. So basically the thing is these kind of large cascades are extremely, extremely rare, and predicting rare events is hard. So that's one reason why this would be a nontrivial prediction problem. Another reason why this would be a nontrivial prediction problem is that even though I can have exactly the same piece of content, I can get widely different numbers of reshares. So here I get three reshares. In this case the same image, the same content got 10,000 reshares. So I have the same content but kind of widely different popularity. And it's not a priori clear why here only three and here 10,000. I can explain to you why this is. It's actually delicious. What you do is you take a banana, put peanut butter on it and cover it with chocolate, put it in the fridge, and you have this kind of great, great dessert. You should try it. It's really good. Okay. So that's right. And I know this part of the network appreciates good desserts, and I know this part of the network doesn't. Or whatever. So kind of a big difference. So now all this kind of accumulated evidence is nicely summarized by this quote from one of Duncan Watts's papers, where basically what they were saying is that increasing the strength of social influence increases both the inequality and unpredictability of success of the content. So kind of success in our case is how many reshares did you get, how big of an information cascade did you create. And here are two kind of interesting questions.
One is about unpredictability and the other is about inequality. So I'll first focus on the unpredictability part, and then I'll kind of go to the inequality part and touch on that one a bit also. Okay? So here is kind of the outline for my talk. What I will do is first I'll tell you about this work we just published at WWW using Facebook data on whether cascades can be predicted, basically whether this resharing behavior and the size of the cascade on Facebook can be predicted. And then the other thing I'll talk about later is not only can we predict cascades but can we kind of create them, can we kind of create engaging content and automatically post it online. And this would be something we did with Reddit. So Reddit data. So these are kind of the two parts of my talk. Okay. So here is kind of the take-home message: Are the cascades predictable? Yes, they are, very much. The trick is kind of to define the prediction problem well. And this is what we call the cascade growth prediction problem. What's interesting is, when I say cascades are predictable, it's not only that the size is predictable but also that the structure can be predicted, and we can also take the content into account and differentiate between cascades of different sizes for the same piece of content. So we can actually predict a lot of things about the cascade. And what I want to do now is kind of guide you through how you can do this. Okay? So before you even go and kind of start training your machine learning classifiers, the first question is how do I even go and formulate the problem. So the idea is the following. I will observe a bit of a cascade, maybe the first six reshares or whatever, and my goal is to predict what will happen at the end of time, kind of what will happen in the future, what will happen to this cascade after some time. And there are many different ways I could formulate this as a kind of clean machine learning problem.
So here's one potential formulation. So what I could do is to say I will do binary classification where I'll observe a bit of a cascade, and what I will try to do, then, is to answer the question will this cascade grow beyond size K or not. For some arbitrary K. Maybe K is 1,000. 100. Whatever. Right? So basically I have this binary prediction problem. The problem if you formulate the task this way is that most cascades are small. I showed you before, the cascade size distribution is heavily skewed, so I will have this huge class imbalance. And of course I can then go and subsample and things like that. But this causes problems. So if I do it this way, the problem is I have huge class imbalance and most of the cascades are super small, so there's kind of a trivial predictor that does almost perfectly well, and it's very hard to beat. Okay? So that's one way. Another way I could go and do this is to say, look, this is not a classification problem, this is a regression problem. So all I want to do is to say you give me a bit of a cascade and ask me to predict how big this cascade will be in the future; I just predict the exact number of reshares. This also has a problem, and the problem is kind of similar in the sense that here the cascade size distribution is heavily skewed. So if you plot it, it follows a power-law distribution, which means it's kind of a heavy-tailed distribution, which means that outliers will skew your error. One way to fix this would be to not predict the raw number of reshares but to predict the log number of reshares; then kind of regression works better. But then you're not predicting the number, you're just kind of predicting the order of magnitude. So, again, not clear. And then the last way you could go and formulate this problem would be to say, okay, let me only look at cascades that reached some minimum size, maybe 10, and then out of this let me try to predict future growth.
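Both pitfalls just described, the trivial majority-class baseline and the outlier-dominated regression error, can be seen numerically. A toy sketch; the size counts below are invented for illustration:

```python
import numpy as np

# hypothetical skewed cascade sizes: 985 tiny, 5 small, 10 huge
sizes = np.array([1] * 985 + [3] * 5 + [500] * 10)

# Formulation 1: classify "will it exceed K?" -- hugely imbalanced
K = 100
y = sizes >= K
baseline_acc = max(y.mean(), 1 - y.mean())  # always guess the majority class
print(baseline_acc)  # near-perfect accuracy, yet completely useless

# Formulation 2: regress on the raw size -- the outliers dominate the error
guess = np.median(sizes)                      # a sensible-looking point estimate
mse_raw = np.mean((sizes - guess) ** 2)       # driven almost entirely by the 500s
mse_log = np.mean((np.log(sizes) - np.log(guess)) ** 2)  # far better behaved
print(mse_raw, mse_log)
```

The log transform tames the skew, but as noted in the talk, you are then only predicting the order of magnitude of the cascade size.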
So kind of let's ignore small cascades, they're not interesting, let's just focus on the beefy ones and let's try to predict those. If you would go about it that way, then you have this huge selection bias in the sense that you are now defining your prediction problem only over a small subset of data. Okay? So these are kind of three ways you could go and formulate the problem, but this is not how we formulate it. So the way we formulate it is the following. So we call this the cascade growth prediction problem. And what we are asking is basically: will a cascade reach the median size. Okay? It's a binary classification problem with this question. So let me now explain what I mean by this. So the idea is the following. The idea is that I observe the cascade up to size K, and now I want to predict whether a cascade that I observed up to size K will grow beyond the median size for those cascades or not. So the idea is that I take all the cascades that had, let's say, at least five reshares; this gives me some population of cascades. This population has a median. And now I'm asking: are you going to grow beyond that median or stay below it? Okay? So for a given K, there is some median cascade size for cascades that reach size at least K, and now I'm asking are you less than the median or more than the median. What is cool about this prediction problem is that the median is the 50th percentile. So by definition, guessing always gives me 50 percent accuracy. My classification problem is balanced. Another feature of this is that for every K I get a different prediction problem, because for every K the median changes, because I'm conditioning on a given cascade size. So now the next question is how do the median values for different K relate to each other. So here is -- here's the data. This is Facebook data. This is not fake.
This is kind of really a straight line. It's kind of amazingly straight. And what this is showing is the number of reshares that I observed, K, versus the median cascade size for cascades that reach size at least K. Okay? And what this is basically saying is that, for example, the median size of cascades that reach size 10 is 20. Okay? So basically asking whether you will grow above the median size is the same as asking will you double. And this is not magic. Actually you can prove that because the cascade size distribution is a power law, asking this question about the median is the same as asking will you double. Okay? So now -- yes. >>: Quick question. So this median data is in [inaudible] for the real data? >> Jure Leskovec: This is the real data. >>: And this serves as your ground truth, like the median of [inaudible] that's kind of the ground truth? >> Jure Leskovec: That's the question. Yes. >>: Okay. >> Jure Leskovec: That's the question. Exactly. So what -- yes. >>: Have you looked at the rate at which something doubles? Does it stay relatively constant until some saturation point for different events? So, you know, if it's size five, it took me a day to double at size 10 [inaudible] double until I reach some overall saturation? >> Jure Leskovec: That's a great point. So we were not trying to predict time to double, but we use time to tell us is this going to double at all. So we were not saying, given that it will double, how long will it take, but we were asking what does the time tell me about whether something is going to double or not. So we were interested in predicting the size, not predicting the speed. But that's an excellent point. So I'll show you what I mean. Okay? Yeah. >>: I'm not real clear on that. So you kind of fix a point of like for ten days -- >> Jure Leskovec: No, no, no. I say I fix how much of a cascade I see. >>: Yes.
>> Jure Leskovec: And now I want to predict whether the cascade is going to double or not. >>: Double using a time window or -- >> Jure Leskovec: Some time window. Long enough. These things are not -- the cascades die in a couple of days. The lifespan of these things is so short that basically you say -- like we look at two months of data or one month, and that's basically -- we have enough. So there's no end-of-time effects, if that's what you're worried about. Okay. So what's my prediction problem? My prediction problem is I observe the cascade for the first K steps, and I want to predict will the cascade reach the size of twice K or not. Okay? What are cool properties of this problem? So the prediction problem is: given that the cascade has obtained K reshares, will it grow beyond the median size, which basically means will the cascade double. The two cool properties of this prediction problem are that it's naturally balanced. The reason it's balanced is because the median is the 50th percentile, so exactly half of the cascades by definition won't double and half of them will double. And the other one is that this is not a single prediction problem but a series of prediction problems, in the sense that for every K I get a different prediction problem. So what this means is I can now track the cascade growth over time. So now that I told you kind of what we are trying to do, here's the data. We are using complete anonymized trace data from June 2013, where we spent quite a lot of effort to actually recreate these cascades based on click and impression data and so on. On Facebook we get around 350 million photos uploaded a day, and only around 150,000 photos get at least five reshares. So most of this stuff never spreads anywhere. And now that I have this data, I'll do kind of very simple things.
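The equivalence between "reach the conditional median" and "double" can be checked by simulation. For a Pareto distribution with tail exponent alpha, the median size conditioned on reaching K is K * 2^(1/alpha), so alpha near 1 (an assumption roughly consistent with heavy-tailed reshare data, not a value measured from Facebook) gives about 2K:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0  # tail exponent; ~1 is an illustrative assumption
sizes = rng.pareto(alpha, 2_000_000) + 1  # Pareto samples with minimum size 1

for K in (5, 10, 20, 40):
    # median size among cascades that reached size at least K
    med = np.median(sizes[sizes >= K])
    print(f"K={K:2d}  conditional median={med:7.1f}  (about {med / K:.2f} * K)")
```

Each printed ratio comes out near 2, which is exactly the "straight line" relationship in the Facebook plot; a different tail exponent would change the multiplier but not the linearity.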
I'll just go extract some features from the cascades, and I'll, you know, evaluate the performance of a given machine learning classifier. And what I'll show you is just the performance of a very simple logistic regression classifier. Kind of our goal was not to get the best predictive accuracy; our goal was more to understand the problem. So I'm sure you can do much better than what we did; it shouldn't be too hard to boost the performance by another 5, 10 percent or something. But what we are more interested in is kind of exploring the space. So let me just show you what we did. So the first thing we wanted to understand is what are the factors that kind of influence predictability. And there are kind of four things, if you think about them, that matter. One is what is the content that is trying to spread. In our case we are interested in images that spread. So the question is what's the content of the image, does it have text overlaid, is it a meme, things like that. The other thing that is also important is who's the user that is spreading this. For example, you know, what's the number of followers that this user has. If Justin Bieber is kind of trying to post something, that will have a very different effect than if me, with my 500 followers, am going to spread something. Another important thing is not just what is the structure of the individual user but also what is the structure of the network in which the cascade is trying to spread. And then the last thing that is also important, or that gives us information, is how quickly or what are the temporal characteristics of the thing spreading, how quickly is it spreading and so on. And this kind of goes towards both questions. So just to show you how well you can do, this is the question of will a cascade double, will you go beyond the median or not, for a cascade of size 5. This is classification accuracy. [inaudible] look very similar, the guessing gives you 50 percent.
We can do around 80. Time kind of by itself gives you the most signal. Who's the user gives you relatively a lot of signal here. Kind of the structure of the network around it also gives you some, and kind of the content gives you the least. That's how I would interpret this. But kind of the bottom line is that it doesn't really matter too much what kind of features you are using. You can do around 80 percent. And now that I kind of showed you that this is not a hopeless prediction problem, now I kind of want to show you some kind of next level of questions and say, okay, what's going on, can we understand things better. So here's kind of the first of a series of these kinds of questions. So the idea is to ask the following: how does the predictability, how does the performance change with K? And what I mean by this is the following. Here's one prediction problem. I see the first five reshares, and I have to predict whether the cascade will grow beyond 10, versus I see 20 reshares and I need to predict whether the cascade will grow beyond 40. Right? And the question is which of these two prediction problems is easier. Right? And if you think about it, kind of there are two opposing arguments. You could say that the second prediction problem is easier. Why would this one be easier? Just because here I have more data; here I see 20 reshares, so I can compute better features. I see more of the cascade. So predicting this should be easier. Predicting whether this doubles or not should be easier. On the other hand, I could make kind of the opposite argument that says no, no, this is harder. Why would this be harder? Because here I'm predicting 20 steps into the future and here I'm predicting just five steps into the future. And kind of predicting further ahead into the future should be harder. So the question is which one is it? And so the graph that I'll show you will have the K on the X axis and accuracy on the Y axis.
And here's the graph. So this is predicting whether a cascade will double or not, how much of a cascade we see. And we are predicting whether it's doubling. And then the lines -- the line shows the accuracy. And what's interesting is that basically predicting small -- whether small cascades will double is harder than predicting whether big cascades are going to double. So kind of predicting bigger cascades is easier. Which is maybe a bit -- a bit surprising. So that's kind of the first result. Here's the second result, so kind of second variation of a similar question. So now imagine that I fixed the minimum cascade size, and I want to ask how does predictability change with K. So now what I fix is I fix what I'm predicting. I'm predicting whether a cascade is going to grow to size 40 or not. So I keep what I'm predicting, but I'm changing how much of the cascade I observe. So in this case I just observe five reshares and I'm trying to predict whether it's going to grow beyond 40, where here I observe 20 and predict whether it's going to grow to 40. And if you think now about these two prediction problems, it's kind of clear that this one is easier. I get more information, I'm predicting the same quantity. So here this thing is easier than that thing. But the question in this case is there any sweet spot, in a sense, is there just a case, I don't know, something between 5 and 20 where I observe just enough information that I can make -- solve this prediction problem well. Okay? So to give you an answer to this one, here is what we did. So we took all the cascades that reached size of at least 100, and we are saying now predicting whether a cascade will reach size 200 or not. And now what I'm doing is I'm observing more and more of the cascade and asking what is my accuracy. And of course the more of the cascade I observe, the better my accuracy. That's not surprising. 
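The experiment just described, fix the target at size 200 and reveal more and more of each cascade, can be sketched end to end. Everything below is a toy generative model of my own (cascades with a latent "virality" rate), not the actual Facebook pipeline; it assumes scikit-learn, and it only shows the structure of the sweep, not the linear accuracy growth, which is an empirical finding:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

def make_cascade():
    rate = rng.lognormal(0.0, 1.0)            # latent "virality" of the cascade
    size = 100 + rng.poisson(100 * rate)      # final size; always at least 100
    gaps = rng.exponential(1.0 / rate, 100)   # first 100 inter-reshare times
    return gaps, size

cascades = [make_cascade() for _ in range(3000)]
y = np.array([size >= 200 for _, size in cascades])  # will it reach 200?

accs = {}
for k in (5, 25, 50, 100):
    # temporal feature computed from only the first k observed reshares
    X = np.log([[gaps[:k].mean()] for gaps, _ in cascades])
    accs[k] = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print(f"observe {k:3d} reshares -> accuracy {accs[k]:.2f}")
```

In this toy model accuracy also rises as more of the cascade is revealed, because the speed estimate sharpens; whether the per-reshare gain is constant, as in the Facebook data, depends on the generative model.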
But what's maybe surprising is that as I observe more and more of the cascade, my accuracy is basically linearly increasing. So what is interesting in this case is that each individual reshare gives me a constant boost in my performance. So basically what this means is that if I observe 25 reshares and I get a 26th one, my accuracy will increase by the same amount as if I observed 99 reshares and I see the 100th one. Right? Which is a bit surprising. You would imagine that having one more reshare early on kind of helps you more than getting one more reshare later on. But it seems that each reshare provides you kind of a constant amount of information regardless of what you have already seen or what you have already gotten. You get this kind of linear increase in performance, and that was kind of surprising to us. >>: [inaudible]. >> Jure Leskovec: Do I have an intuition why it behaves this way. So we did this experiment several different ways, and we always kind of found a bit of a knee very early on, and then this kind of straight line. And we definitely did not expect this. We were more thinking that this would have this kind of diminishing returns. Because here at the end, once you have a lot of information, having one more reshare shouldn't matter -- shouldn't matter that much. Why could that be? We think we could blame it on the network, because kind of every new reshare kind of opens up a new space or a new part of the network, and that may be one of kind of the intuitions why this is going on. But I would say we don't really understand this one well enough. >>: Yeah, because there would seem to be -- I mean, if you're [inaudible] already lots of friends have seen him, so some of them share hoping that new people will see, right? >> Jure Leskovec: Exactly. >>: [inaudible].
>> Jure Leskovec: It could be something that basically every new reshare kind of opens a new part of the network that hasn't seen the content yet, and that kind of gives you more information than just small pieces. >>: So does this study conduct a fixed group of networks, or it's various different group of [inaudible]? >> Jure Leskovec: This is conducted on the complete full Facebook network. So it's one group of people, 1.2 billion of them. >>: I would probably add something which is simpler, which is to go back to the time thing, right, time to -- for every individual gives you an estimate on the time to reshare. And as long as that time to reshare is not getting longer, you're going to continue doubling, right? Each point offers you that same observation on time to be shared estimate. And if you -- it'd be worth checking that, the time to reshare factor and where it's sort of changing. Because it's when that just started dropping off that you should see a network dive. >> Jure Leskovec: Sure, sure. So we have in the paper -- I show some examples about the time to reshare and what's important. But I'll show -- I'll go one level deeper than this, and there is more in the paper [inaudible]? >>: Yeah, I was [inaudible] where this line intersects a hundred percent accuracy, if you allowed null reshares in your model, then you could stretch this out to see when K actually approached 200, which for a cascade of size, that's your ->> Jure Leskovec: Why would it mean size a hundred? So you say if I see -- because then I'm not -- then I'm changing the prediction problem. My prediction is always are you going to reach 200 or not, so it was this are you going to double. But now I -- I reveal more and more of a cascade to you. >>: So is your prediction problem always whether you're reaching 200 or whether you're reaching a hundred? >> Jure Leskovec: No, no, here is always are you reaching 200. 
So I have the same set, but now I reveal more and more of a cascade to you, and I know that all these cascades that reach size 25, I know kind of they will reach size 100. And I'm predicting whether they will reach size 200. Because I'm kind of revealing more and more of a cascade to you. I'm revealing more and more of a cascade to the learning algorithm and asking how good are you at predicting whether you are reaching 200 or not. And it's the same dataset as I'm revealing more and more of a cascade. >>: But the 25 of you still trained on the entire global data? >> Jure Leskovec: Uh -- >>: [inaudible] trivial [inaudible]. >> Jure Leskovec: No, no, I show this much of a cascade and I'm training this classifier here. >>: No, no, what -- you trained on cascades that were less than a hundred as well. >> Jure Leskovec: Of course, of course, of course. Of course. >>: And so saying that you have a linear model, it will intersect with a hundred percent line, is it before 200 or it's -- like if I give you all the data of like 200 cascades, you obviously know everything, but since you have a linear model, is that point before the 200 where you -- before you hit the 200 point or it's right on the 200 point? So saying that you -- now you have all the data of 200 cascade, then I'm only using a hundred cascade to do the predictions, and I have 0.8 something. If I increase, it will increase [inaudible]. >> Jure Leskovec: These are all the cascades that reach size 100. Okay? >>: Yes. >> Jure Leskovec: Half of them -- half of them now -- out of [inaudible] size 100, half of them [inaudible] reach size greater than 200 and half of them less than 200. So basically between 100 and 200, right? >>: Yes. >> Jure Leskovec: And now what I'm doing here is I show you -- so this is all the cascades of size 100 or more; half of them by definition are between 100 and 200, and half of them are above 200.
What I'm doing in this experiment is showing you that I have these cascades that are growing, and I know that they will all grow to at least size 100; some will grow to 200, so 50 percent will be in here and 50 percent will be in here. And I show you this much of the cascade and I'm asking: will the cascade be here or there? So I'm revealing more and more of the cascade to you, but I'm still posing the same prediction problem. And what this is saying is that as I reveal more of the cascade to you, the problem gets easier. That's not surprising. But the interesting thing is that it gets linearly easier. So that's how to think about this. >>: Plus it's [inaudible] it will increase to a point that you only need to know like 150 of the cascade to successfully -- >> Jure Leskovec: No, no, maybe we should talk offline or I'll explain later exactly. Okay? >>: [inaudible] feature dimension are the same feature dimension? >> Jure Leskovec: The features are always the same. Okay? Now, going to some of the questions Paul was asking, the first one is that we can also go and look at the individual features -- how much do they matter, right? So, for example, here is how much the original poster's [inaudible] matters. And we see that early on, when cascades are small, your degree as the initial poster is very important. As the cascades get bigger, that gets less and less important, which makes sense. We also see similar things with properties such as whether a photo was a meme, in the sense of whether it has text overlaid or not. In terms of time -- this was the question -- what we see is that for the cascades to double, they need to have a lot of unique viewers per unit time, so you want to get lots of people exposed to the content. But that's not enough. What you also want is high conversion rates.
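The doubling experiment described here can be sketched in a few lines. This is a minimal illustration, not the actual pipeline from the paper; the two features (time to the k-th reshare and the implied reshare rate) are stand-ins for the full temporal feature set discussed in the talk.

```python
def doubling_dataset(cascades, k):
    """Build the balanced 'will it double?' task: reveal the first k
    reshares of each cascade that reached size k, and label whether
    the cascade eventually reached size 2k."""
    X, y = [], []
    for times in cascades:              # each cascade: sorted reshare timestamps
        if len(times) < k:
            continue                    # never reached size k, excluded from the task
        prefix = times[:k]
        elapsed = prefix[-1] - prefix[0]     # time to the k-th reshare
        rate = k / max(elapsed, 1e-9)        # reshares per unit time (illustrative)
        X.append([elapsed, rate])
        y.append(len(times) >= 2 * k)        # did it double?
    return X, y

# Two toy cascades: one that doubles, one that stalls at size 3.
X, y = doubling_dataset([[0, 1, 2, 3, 4, 5, 6, 7], [0, 2, 4]], k=3)
```

Any off-the-shelf classifier can then be trained on (X, y); the point of the experiment is that accuracy improves roughly linearly as k grows.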
So you want lots of those exposed people to actually reshare things. Having these two together is a good sign that the cascade is going to double. Another thing we were looking at is to ask: if I have a cascade -- let's say I observe the first three reshares of a cascade and I know that the cascade spread in a given pattern -- what does this mean for the future growth of the cascade? So if I show you that a cascade spread in this pattern and ask you whether the cascade is going to double or not, this one versus that one, which one would you guess is more likely to double: this one that kind of doesn't want to die, or this one that seems to be spreading in a breadth-first-search type of way? And what's interesting is, if you sort these cascade structures and plot the probability of doubling, it turns out that the long and narrow cascades are very unlikely to double, while the wide cascades are more likely to double. And actually the wide cascades that spread beyond level 1 are the most likely to double. The intuition being that here you don't know whether this is something that is very viral or just being posted by a very high-degree person, while here you see that this is going to propagate in a star-like fashion. So that's the intuition that comes from here. Rather than just asking whether I can predict the cascade size, you can also ask whether I can predict the cascade structure, in the sense that I see a bit of a cascade and I want to say what the structure of the cascade will be in the future -- is it going to be more this kind of breadth-first-search style thing or this kind of very long and narrow type of structure? The way we do this is to say that we will compute a structural [inaudible] feature of every cascade. This is called the Wiener index, and it's simply the mean shortest path length between all nodes in the cascade.
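The Wiener index just defined is straightforward to compute with a breadth-first search from every node. Here is a small sketch; the adjacency-dict representation of the cascade tree is my own choice, not something from the talk.

```python
from collections import deque

def wiener_index(adj):
    """Mean shortest-path length over all node pairs.
    `adj` maps each node to a list of its neighbours (the cascade tree)."""
    nodes = list(adj)
    total, pairs = 0, 0
    for src in nodes:
        # BFS from src to get hop distances to every other node.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(nodes) - 1
    return total / pairs   # each unordered pair is counted twice in both sums

# Star-like cascade (one poster, three direct reshares) vs. a long chain.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Consistent with the talk's intuition, the star scores lower (shallow, broad spread) than the chain (long, narrow spread).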
So the idea is that in this case, if I have a star-like cascade, the Wiener index will be small. If I have these big tree cascades, the Wiener index will be high. So now the question is how well we can predict the Wiener index. And we can do the same thing as before: we take the cascades up to a given size, compute the Wiener index of all those cascades, and predict whether the Wiener index is above or below the median index for that population of cascades. A random prediction would give you around 50 percent accuracy; we can do around 72. The interesting thing here is that of course structural features tell you a lot about how the cascade is going to grow in the future in terms of structure, but time also gives you a lot of information. So basically cascades that spread fast have a different structure than cascades that spread slowly. Cascades that spread slowly can be this kind of long and narrow, while cascades that spread fast tend to be wide and broad. And this is what comes out of this. So this was what I wanted to talk about on predictability. Let me just quickly touch on inequality. And what I mean by inequality is basically this question: I have the same piece of content; in some cases this content got very few reshares and in other cases it got lots of reshares. So the question is, can I differentiate between the cascades of the same content? What we did here is, because we worked with photos, we were able to go and identify clusters of identical photos. We have around a thousand clusters. Overall they contained 38,000 photos, and they were reshared 13 million times. And what we did is, for every cluster, we select 10 random photos, or 10 random cascades, and the question is: can you predict which one of these 10 is going to be the largest?
Because this is a one-out-of-10, or a 10 [inaudible], problem, random guessing would give me 10 percent accuracy, if you like. Okay? So how well can we do? In predicting, out of 10 cascades of the same photo, which one will be the largest, our classification accuracy is around 50 percent, so about five times better than random. If you think of this as a ranking problem, we are at around .66 in terms of mean reciprocal rank, which basically means about half of the time we put the largest cascade as the correct prediction and about half of the time we put it in the second spot. What's also interesting in terms of inequality of content: you can compute the Gini index, where 0 would mean perfect equality and 1 would mean perfect inequality. We see that the Gini index of these cascades is around .8. The skewness, or the Gini index, of the human world wealth distribution is .66. Okay? So the point being that online content popularity is less equal than human wealth. And we think of human wealth as being super skewed. This is even more skewed. But that's not a problem -- it's very skewed, but very much predictable. So with this, now you can start asking, right: okay, you can make these predictions as a cascade is growing, you can trace the growth of a cascade and predict what you think will happen in the future. But the question is, can you go and make your posts viral, can you go and optimize a post so that it gets a lot of attention? So here is what I would like to do in this part. I would like to ask how I can go and maximize the probability of success of my content.
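The Gini index mentioned here (roughly .8 for cascade sizes versus .66 for world wealth) can be computed from a list of cascade sizes with a standard formula. A minimal sketch, assuming all sizes are non-negative:

```python
def gini(values):
    """Gini coefficient of non-negative values:
    0 = perfect equality, 1 = maximal inequality."""
    xs = sorted(values)
    n = len(xs)
    # Rank-weighted sum form of the Gini coefficient.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * sum(xs)) - (n + 1) / n

equal = gini([100] * 4)        # everyone identical
skewed = gini([0, 0, 0, 1])    # one cascade takes everything
```

On real cascade-size data this is exactly the quantity whose value of about .8 is quoted in the talk.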
So in some sense I have a piece of content and I would like to make it successful, which means I would like people to like it, comment on it, and reshare it. And if this is what I want, the question is how do I get it -- what can I do? And the factors that I have influence over [inaudible] are, for example: where do I post -- if I'm posting to a given Web site, usually these Web sites are organized into groups or communities, so to which community am I posting; at what time am I posting; what is the text that I'm using to convey my message; who is the user that is posting this, what's their reputation or popularity. You know, maybe this content has already been published in the same community -- how well did it do previously, and so on and so forth. So these are the kinds of things that I want to understand. And what we would like to do is basically take these individual factors and tease them apart. So what we require is, in some sense, a dataset that will allow us to do this. And the idea is similar to what I showed you on Facebook, in the sense that I want to look at resubmissions of the same content across multiple communities -- communities with varying characteristics -- from various users. And this basically means that I have this kind of natural experiment that will allow me to figure out how these factors relate to each other. So the dataset we'll be using is the following one. It's from Reddit. And Reddit is this kind of Web site where people post stories or content, other people can upvote and downvote, and they can comment on these stories. Okay? It's not a social network, it's not Facebook; it's a Web site where people post content and then others can comment, vote, and so on. Okay? And what we did is we were able to go and identify posts where the same content got posted multiple times. So we worked with images.
We have around 130,000 Reddit submissions from 16,000 original images, or different pieces of content, and around seven resubmissions per image, or per piece of content. And the data is available on our Web site, so you can download it and play with it. Okay? So now, given this, the question is how popular the content is. So here's a question. This is a given image that was submitted to Reddit. Here it is. And this image was submitted 26 times. And here is one way to quantify the attention the content gets. This is what we call score. It's simply the number of upvotes minus the number of downvotes. What you see here is that, for example, the first time this image was submitted it got some number of upvotes. Maybe the fifth one got a bit more, but then it wasn't until the tenth one that some serious number of upvotes actually came. And then the next few ones were really unlucky, and then the 16th one was, again, very lucky, and so on. So you see here that basically we are reposting the same piece of content, but the popularity of that piece of content differs very widely -- between a difference of around 800 between upvotes and downvotes versus just a small amount of attention. >>: Does the popularity relate to who posts this content? >> Jure Leskovec: I will -- I will show you. >>: Okay. >> Jure Leskovec: Yeah? Okay. But that's basically what we would like to do. So in some sense we'd like to play this game: we'd like to go and repost content on Reddit and get more upvotes than what the original post did. Okay. So that's what we want to do. So if I now want to go on Reddit and repost content, what can I play with? I can play with the title, I can play with when I submit. I could potentially play with who's the user that is submitting, but we didn't do that. I can decide which category to submit to, and then there are several ways to measure my success.
One is how many upvotes or downvotes I get, how many comments I get, and so on. Okay. The content that I'm posting is fixed -- I have this image, it has already been posted; I'm just trying to repost it and get more attention than what the original poster did. Okay. So the way we do this is basically to build a model that tries to predict the popularity of a given piece of content. And the model has two components. One is about the community -- who are the people that I'm talking to -- and the other one is about the language -- in what way am I talking to that set of people. The community model will basically try to capture the choice of community, at what time I am submitting, and how well previous submissions did, and the language model is all about how the title and the language of the submission relate to the language of the community. So let me show you a few things for the community model. For example, here is how much time matters. This is time versus score for submissions in six different Reddit communities, or Reddit categories. And you see that posting at 4:00 a.m. may not be as good as posting around lunchtime. Okay? The effect is actually surprisingly big. So it matters when you post. Another thing that matters in terms of time, because you are reposting things, is how much you are penalized for reposting the same content. So here is, again for different categories, the resubmission number of the same piece of content versus popularity. And you see that generally the first time the community sees a given piece of content you get a lot of attention, and with every subsequent submission you get less and less attention. And that makes lots of sense. However, what can save you is to wait. So basically, people forget.
So what this is trying to show is the probability that your current submission is less popular than the previous submission of the same content, versus how much time passed between the two consecutive submissions. And if you are willing to wait four months, then basically people have forgotten about that piece of content, at least in the atheism community. In other communities, like pics, you are still unlikely to be more successful than the original post, even if you wait a long time. So these guys are forgetting more slowly than the atheists are, if you like. Okay. So this is in terms of temporal effects. And there are also these interesting effects between communities. So what I'm trying to show here is all the different Reddit categories, and the idea is how likely a piece of content is to be successful in one category given that it has been posted in another one. And basically the diagonal tells you that resubmitting to the same community, to the same category, is always bad; and the other thing it tells you is that if something has been submitted to one of these big categories, then it will do uniformly badly in all other ones. So the way to think about it is that you have these huge categories, and all the other, smaller ones are kind of subsets of these few huge ones. The small ones don't overlap with each other, but they overlap with the big ones. So now, given these pieces of signal, you can simply put all this together into a big regression model where you say: I want to model the amount of attention a piece of content got. I want to estimate this latent, inherent content popularity term, I want to model forgetfulness, I want to model whether this has been submitted multiple times to the same community, I want to model how well this piece of content did previously in other communities, and so on and so forth.
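A least-squares sketch of fitting such a model. The feature choices below (an inherent-popularity proxy and a resubmission count) are illustrative assumptions, not the exact terms of the paper's model; the point is only that once the factors are encoded as columns, the coefficients fall out of an ordinary least-squares fit.

```python
import numpy as np

def fit_popularity(features, scores):
    """Fit linear coefficients (plus an intercept) by ordinary least squares."""
    X = np.hstack([np.asarray(features, float),
                   np.ones((len(features), 1))])   # intercept column
    beta, *_ = np.linalg.lstsq(X, np.asarray(scores, float), rcond=None)
    return beta

# Synthetic check: score = 3*inherent_popularity - 2*resubmission_no + 1.
rows = [[1.0, 1], [2.0, 1], [1.0, 3], [4.0, 2], [0.5, 5]]
ys = [3 * a - 2 * b + 1 for a, b in rows]
beta = fit_popularity(rows, ys)
```

Because the synthetic data is exactly linear, the fit recovers the generating coefficients, which is a quick sanity check before applying the same machinery to real submission data.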
And I can basically go and train this and figure out the coefficients. So that's how to model time and community effects. Now I also want to model language effects. And in terms of language, a few interesting things. What this plot is showing is what we call community specificity. It's simply how much the language of the title matches the language of other posts in the community. And what this is saying is the following. If you are talking to gamers or atheists, the more similar your language is to the language of the community, the better. While for other communities, like gifs and pics, the interesting thing is that if you are too similar to the language of the community, that's actually bad for you. So it's good to be here in the middle. Right? The way to understand this: if you are using words that the community is not using, nobody understands you. If you are using words everyone else is using, you are kind of [inaudible]. But if you are here in the middle, at least for these three communities, that seems to be the best, while for gaming and atheism, the more standard the vocabulary you use, the better. And the second part is showing what we call title originality, which is simply: I have my current submission with a current title -- how does this title relate to the titles of the same piece of content previously submitted? And what this is saying is that recycling old titles is always bad; coming up with new titles that haven't yet been used is good. Okay? Then we actually went even a step further and asked in what kind of grammatical structures you should formulate the titles to get lots of attention, and depending on which community you talk to, you may want to use different sentence structures. And this matters quite a bit. Okay.
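One plausible way to operationalize the two language features above. These are hypothetical scoring functions of my own, not the paper's exact definitions: specificity as the mean log-probability of the title's words under the community's unigram language model, and originality as one minus the best word-overlap with earlier titles for the same content.

```python
import math
from collections import Counter

def specificity(title, community_counts, total_tokens):
    """Mean add-one-smoothed log-probability of the title's words under
    the community's unigram model. Higher = more community-like language."""
    words = title.lower().split()
    vocab = len(community_counts)
    return sum(math.log((community_counts[w] + 1) / (total_tokens + vocab))
               for w in words) / len(words)

def originality(title, previous_titles):
    """1 minus the largest Jaccard word-overlap with any earlier title."""
    words = set(title.lower().split())
    overlaps = (len(words & set(t.lower().split())) /
                len(words | set(t.lower().split()))
                for t in previous_titles)
    return 1.0 - max(overlaps, default=0.0)

# Toy community corpus standing in for the community's past posts.
community = Counter("cat meme cat cat meme lol".split())
total = sum(community.values())
```

With these two scores per candidate title, the non-monotonic "best in the middle" effect the talk describes would show up as an interior optimum of specificity rather than a monotone trend.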
So now that I have these things and can put them together into a big regression model, I can ask how well this is doing. I'm just measuring the R squared between our predicted popularity and the true popularity. The community model, in terms of the score -- so upvotes minus downvotes -- gives around .5 R squared; the language model by itself gives you around .1. But what's interesting is that putting the two together, their performance basically adds up and you are at around .62. Okay? So this is in terms of the score or rating. You can also ask how much attention I'll get, simply in terms of upvotes and downvotes: you can get an R squared of around .58. And for engagement, in terms of number of comments, you can do around .7. So it seems to be working quite well. But the second part of the evaluation we did was more fun. Here the idea was that we generated 85 pairs of posts where we basically resubmit old content. Okay? And the idea was that for one of the posts a human would go and submit it, and for the other one we'd use our system to tell us what title we should use, when we should submit it, and so on. And on average we got around three times as much attention, or activity, on the posts submitted by our model as on the posts submitted by humans. And out of these 85, five of them reached the front pages of their corresponding communities, and two of them reached the front page of Reddit/all, which is kind of interesting -- we were just reposting old content but still got on the front page of Reddit. >>: [inaudible] is to play with the perceived popularity. So rather than changing content, actually change the number of votes that the post currently has, if you have a way of doing that. If somebody sees that something has been upvoted a thousand times, they may be much more likely to [inaudible]. It could be just like with citations -- after a while, people just start citing a paper because the paper is highly cited.
>> Jure Leskovec: So there has been some work on this. There was a Science paper by [inaudible] where basically they showed, as far as I know for some Russian Web site that is kind of Reddit-like, that if you have a piece of content and you just quickly upvote it, it will get many more upvotes than if you don't do anything. But then I've talked to a few other people and they were not able to reproduce that result. So that effect seems to be very community specific. Here we did not mess with that. We also created brand-new users with [inaudible] karma score, so all these things were -- those are other things that you could play with; we didn't do that. We just played with the title, the time, and where we post. Just to show you an example, here's one of the images we were posting. This is something that our system said would be a good title. Once we posted this image, which has been posted before, we got 7,000 upvotes, 5,000 downvotes, and around 500 comments. Here is something that our system said would be a bad title, so we also submitted this. We got around 300 votes and nine comments. And if you look at why our system thinks this is a good title: it's original in terms of [inaudible] the previous titles, it's a good length, and it has an interesting sentence structure, while this one is not original -- generic, short, and kind of flat and boring. And you see, at least in this case, a big difference. In this case the difference was around 3X. So I'll show this one last time. Okay. Good. So if this is what we wanted to do, then I think we can declare success. Basically we were able to repost old content, just changing the title and things like that, and get much more attention than what the original posting got. So what are the conclusions of this work? One is about, you know, can cascades be predicted. Yes, they can be.
They are quite predictable, even though large cascades are rare. And to some extent cascades can not only be predicted, they can kind of be engineered, if you like. Now, what did we learn about when posts go viral, or which posts are engaging? If you want to do this, what do you need? The content by itself is important, right -- if you submit images, people favor memes, things that are interesting and popular. What's definitely important on Facebook is to have lots of followers -- to be popular. It's important who the originator is. What's more interesting is that it's not enough just to have lots of friends: you have to know what your friends like, and even that's not enough -- you have to know what your friends like, what their friends like, and so on and so forth. So you really need to know your network. And posting at the right time plays quite a big role. What are some interesting further questions? This is something from Facebook. This is the same photo, four different cascades. One started here, one here, one there, and some are here. Right? And now I have these four different cascades that are merging and interacting with each other. So understanding how these information cascades interact and how they clash together is something that I think would be a very interesting piece of future work. And another thing that would be interesting is not just trying to predict how the content spreads, but also how the content mutates and how the sentiment toward that content changes as the content is spreading. So not just asking how the information spreads, but how the attitude or something attached to that information changes as the information is spreading.
So in terms of maybe attitude or sentiment, or maybe even understanding how the thing that is spreading mutates as it spreads throughout the network. And then the last thing I wanted to say, taking a step back: what does this mean? In some sense the motivation comes from this idea that today messages are spreading through networks, and the way we consume information today is in these small increments, over networks, in real time -- very differently from what we used to do. And this really requires us to think differently about information consumption, search, and so on. And there are several interesting feedback effects happening in this information network. The first one is the feedback loop that arises because we are using our social connections. What I mean by this is that, because of this information being spread over the links, some links get strengthened and others get weakened, or some get broken and others get created. So understanding this process of how information sharing and link strengthening and weakening relate would be interesting. And then there is also another feedback loop, which is that we as users now see our position in the network and we see how people react to the content we are sharing. So understanding how that affects us and our content sharing and creation is kind of another uncontrolled experiment that is happening, and understanding this one would also be something that would be really cool. So with this, I'm done. Data and papers and more papers. Thank you. [applause]. >> Paul Bennett: We have time for some questions. >> Jure Leskovec: Yeah. >>: I had a question about the -- especially the Facebook study at the beginning.
To what degree, or how, do you convince yourself that you're learning something about the people's behavior versus the algorithm -- >> Jure Leskovec: The feed ranking algorithm? >>: Yeah. Or does it just not matter? >> Jure Leskovec: No, it's a good point. So we were worrying about this quite a lot, especially in these systems, because everything is kind of driven by the recommendation engines or some rankers. It's hard to know, am I just reverse engineering the ranker or is there anything more? So we worried about this quite a bit. We were able to look -- because A/B tests are running constantly, some people have ranking turned off, so we were able to see that there is little difference between those and so on. So we tried to convince ourselves that we are not studying the ranker, and we think we are not. But the best thing would be to turn off the ranking for the whole of Facebook for a month, and they didn't want to do that. Which is not surprising. But it's a good point. We worried as much as we could about it, but in the end there is a bit of it in here. >>: I don't exactly remember what the features were [inaudible], but have you considered [inaudible], because if it gets lots of likes then someone will [inaudible]? >> Jure Leskovec: Sure. In the paper we have this one page of all the features we could come up with. And there is a bunch about likes. We did a lot of image analysis to capture the structure of the content, we did a lot of network analysis to understand the network, we did a lot in terms of time -- how quickly it is spreading, who's trying to spread it, is the speed accelerating or decelerating, things like that. So we tried to be as exhaustive as possible. And I think the number of likes was, yes, one of the features. And the likes kind of matter more and more.
As cascades get bigger, it's more important to have likes, because a small number of likes doesn't tell you too much -- most content has very few likes early on. Yeah. >>: So [inaudible] network structure and the information that's spreading, particularly the feedback of what information you're trying to spread versus [inaudible] versus how the network responds. For these kinds of benign cascades, benign information, it probably doesn't change the network structure much. But if people disagree or agree with much more controversial pieces of information, [inaudible] to change the structure. >> Jure Leskovec: I think -- so we are looking at this a bit right now, not on Facebook, but using Twitter. What's interesting about Twitter is that it's like a super dynamic network, in the sense that edges get created all the time and edges get dropped all the time. Even if you do nothing, you have this constant churn of new followers and a constant churn of people unfollowing you. And then what we see is that because of your tweeting activity, and because of retweeting, you get these spikes or bursts of new followers coming, people unfollowing you, and so on. The biggest case of people unfollowing someone is usually professional athletes when they change teams. Then you see basically somebody says, you know, it was great playing for you guys, but now I move here, and these links get cut, those links get added. That was the biggest thing we saw. And then you see this when politicians, and Donald Trump and so on, say things they shouldn't say. >>: Right. More controversial people. >> Jure Leskovec: Exactly. So -- >>: Good or bad, I presume. >> Jure Leskovec: But what we are able to do, for example, is we have these models that are able to predict when you are going to get new followers. And it turns out that, for example, you get new followers when you get a new retweet cascade. But that's not enough.
What you need is to have a retweet cascade, but if the retweet cascade goes in the direction of people who already know about you, you won't get new followers -- in the sense that they have been exposed to your tweets in the past. What you'd like is a retweet cascade that goes into a part of the network that is not yet aware of you, that hasn't seen your tweets in the past, and that will generate the links towards you. So that's what we see on Twitter. Now, understanding the breaking of links -- that's harder, because it's much more about content, while link creation is more about new people being exposed to you and saying, oh, this is interesting, let me try it for a week. So it seems that in these systems, at least on Twitter, users come, they follow a few celebrities, and then later on they discover what they like. They almost learn what their place in the network is, what content they like, what content they don't like. So it's this kind of news feed optimization by creating and dropping links. And there's a lot of that. And there is no social cost to dropping a link, whereas on Facebook you don't unfriend someone. Right? >>: Right. Which brings me to my second question. Twitter links are [inaudible] personal, so they don't really reflect the actual social network and the breaking and forming of social links, where there is a cost to both forming and breaking. These are easy to study; the other ones are hard. Any thoughts? >> Jure Leskovec: Sure. In some sense Twitter is easier because it's almost more honest, in the sense that it's much more about the information. Even though there I'd imagine that there are some links that are social and some that are purely informational.
But comparing the two, the Facebook links seem to be much more socially driven, while most Twitter links, I'd imagine, are information driven, in the sense that you follow someone because you like the feed that they are creating. >>: With your colleagues at Facebook, it would eventually be interesting to look at this after changes of relationship status -- divorces and other types of dating breakups -- and what happens in the social [inaudible]. That should be common enough, and certainly friends seem to split after that, so you get a lot of -- >> Jure Leskovec: There have been at least two papers from Facebook on this topic: what happens when people move, what happens when they break up. There was another paper trying to say, can I identify your wife in your network -- how do you identify a significant other. And they can predict the relationship length based on how well you integrated your significant other into your network, and things like that. >>: [inaudible] same patterns as what you see in the public network? >> Jure Leskovec: That's a good question. I don't have an answer. >>: And on that note, have you compared to [inaudible] over time? >> Jure Leskovec: Yeah, no. I mean, there's a lot to say here. Yeah. >> Paul Bennett: Other questions? >> Jure Leskovec: Anything else? >> Paul Bennett: Okay. Let's thank the speaker. >> Jure Leskovec: Okay. Good. Thank you, guys.