>> Paul Bennett: It's my pleasure to welcome Jure Leskovec today. So many of you
know Jure. He's an assistant professor at Stanford. Before that he did a postdoc stint at
Cornell and did his Ph.D. at Carnegie Mellon, where he came from -- I think I would
butcher it -- with [inaudible].
>> Jure Leskovec: Yeah, before that, yes.
>> Paul Bennett: And so Jure's been all over. He was an intern here as well.
So I think Jure's covered a lot of the world in a short amount of time.
Jure's work has been recognized not only by a number of best paper awards but through
Microsoft Research Faculty Fellowship, through the Sloan Fellowship, through a number
of different types of things. And today he's going to be talking about whether
information cascades can be predicted.
>> Jure Leskovec: Okay. Good. Thank you. Okay. Great. Thanks for having me. It's
always great to be back. Yeah. Great. So about this talk. So this talk is about
information cascades. And this is joint work with two of my students, Justin and Hema
[phonetic], a postdoc in my group, Julian, and then collaborators at Facebook, Lada and
Alex, and collaborators at Cornell with Jon Kleinberg. So that's basically the
team, and I'll show a few different pieces of work.
So here is kind of how to motivate this. We as humans are embedded in networks, we
are interacting with each other, we are talking with each other, and these networks are
basically the fundamental place in which we as humans exist.
And one of the most important tasks that networks allow us to do is they allow us to get
access to information. And here is what I mean by that. As humans, we learn from one
another, we make decisions by talking to one another, we gain trust and safety from one
another. So this all kind of happens through our connections in the network. You can
say that this happens based on the influence of our neighbors, or some input from our
neighbors in the network.
And there have been a number of studies on everything from what people do before
purchasing electronics -- maybe surprisingly, more people go and talk to their friends
about what camera to buy than do research online. So we get a lot of
information from our networks. So this way you can now think about the network as
providing you a skeleton for access to information. And you can think about now this
notion of a contagion, like a piece of information or a piece of behavior or something that
spreads over the edges of a network like a virus.
Okay. So there are many examples of contagions in networks, everything like
information in terms of news, opinions, rumors, and so on, to, you know, word of mouth,
product adoptions, and so on, to political mobilization, infectious diseases and so on. So
this -- you can think of this notion of a contagion as something very, very general. It can
be anything from a particular rumor to a decision to sign to a particular pension plan or
buy an iPhone and so on and so forth. So these contagions can be in some sense many
different things.
So how do we think about this? We think of a network where circles are people and we
have social connections between them, and now we can think of these contagions as, let's
say, some color that spreads throughout the network over the nodes of the graph. So the
idea is this blue contagion started here, and then it kind of propagates like a virus
throughout the network.
And I could have a different contagion also kind of spreading throughout the network.
So that's one way to think about the adoption of a particular contagion, how it is
spreading throughout the network.
And now that I have this notion of a contagion and I have a notion that this contagion
spreads from a node to node over the edges of the network, the next thing I need to define
are the information cascades. So as the contagion spreads, it creates a cascade. So what
do I mean by this? Basically, I have a network, I have this rare contagion starting here,
and then the contagion kind of spreads throughout the network.
It doesn't spread everywhere. It spreads over a subset of the network. And the structure
over which it spread -- in some sense, this tree -- is what I call an information cascade. So
that's basically the object we'll be studying in this talk.
And just to give you an example of what these real cascades look like, here is some work
we did a while back. This was done when we had really good data about how people
send and receive product recommendations. So the idea is the following. You are a
person. You buy a particular product on the retailer's Web site, you make an explicit
recommendation to another person, that person can now go buy the product and further
make a recommendation. So this notion of a contagion is the decision to buy a product,
and what is spreading throughout the network is basically this behavior of purchasing a
given book, in this case.
So this is something that happens in real life all the time. But the problem is how do I
get to see it. And what is nice here is that we worked with a large online retailer that
gave us data from around 2001 to 2003, where we have 16 million of these explicit
recommendations on half a million products among 4 million customers. So we really
see who makes a recommendation to whom, which of these recommendations are
successful, and how this decision to purchase a given product propagates over this
underlying implicit social network.
Okay. So the first question is, if this is the contagion that I'm studying, what do these
information cascades -- these product-purchasing cascades -- look like?
Just to show you an example, this is a cascade for a particular DVD, a movie title that
people were buying, and we have three types of nodes. We have the blue nodes, which
are the people that make a purchase and send recommendations; we have black nodes,
which are the people that receive recommendations but don't buy; and then we have red
nodes, which are the ones that receive a recommendation and make a purchase.
And what do you see in this case? You see that most of the nodes are black; basically,
most of the recommendations are received but never result in a purchase or further
propagation. What you also see is that I have this one big cascade here, but then I have
tons of these very small, isolated, tiny cascades, and this is a good intuition or a good
picture to have in mind: something big, and then lots and lots of these small isolated
pieces that, you know, happened all over the network.
So this is one example of a cascade that happens in real life. Another case where
cascades happen a lot is in social media. By this I mean that we have users who are
resharing content with one another. This resharing happens on Twitter through
retweeting; it happens on Facebook by resharing posts; and it happens on other websites
as well. So another source of cascade data is online social media.
Here is an example from Facebook. My collaborator goes and writes a post, and what
people can do now is reshare this post to their friends via this reshare button. So you can
go and reshare this post. This post was reshared 25 times. So now how do we think
about this? We think of this post as a contagion that is spreading over this 1-billion-node
graph, and this contagion infected or spread over 25 nodes in the graph. Okay? So that's
the object we'll be studying: this resharing behavior of information across the Facebook
network.
There are many kinds of posts that spread a lot. Here is a very useful post. This post
teaches you how to compute the volume of the pizza you eat. So if you have a pizza with
radius z and thickness a, then the amount of pizza you eat is called Pi(z*z)a. Okay?
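Spelled out as math -- this is just the volume of a cylinder, where the radius is z and the height is a:

```latex
V = \pi r^2 h = \pi \cdot z \cdot z \cdot a
```

which reads "pi z z a," i.e., pizza.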
And, you know, this got reshared 27,000 times. So now this thing spread over 27,000
nodes in the Facebook network. Okay. And basically, even though information cascades
have been happening for a long time, we are only now able to see them at this granular,
individual-propagation level.
And this has led to a lot of research in this area. So, for example, people have been
studying information cascades [inaudible] in the sense that, you know, how do people
adopt hashtags; so now you can study how the adoption of a given hashtag spreads
throughout the network.
Another kind of contagion that spreads throughout the network is retweeting behavior,
so how people are basically retweeting pieces of information, and what is going to be
reshared. In some other social networks people also reshare photos, and individual
photos tend to propagate throughout networks like Flickr and so on.
So given that a lot has been done, the question that hasn't necessarily been well
addressed is how a cascade is going to grow in the future -- how things are going to be
reshared in the future, and how much of this behavior can be predicted. And this is what
I'll try to talk about.
Before I tell you how to formulate the problem and how to think about this prediction
problem of how a cascade is going to grow in the future, let me kind of show you a few
examples why this is -- why this is nontrivial or why this could be hard.
So the first reason why this could be hard is the following graph, which plots cascade
size, which is the number of reshares, versus the probability of seeing a cascade of size at
least X. So this is the complementary CDF, the complementary cumulative distribution
function: for every X, I'm asking what fraction of cascades have size at least X.
And basically what do I see? I see that most cascades are super small. Actually, on
Facebook, less than 1 percent of all the images that get uploaded get reshared more than
one or two times. So basically, if I wanted to predict whether something will get
reshared, my baseline would be "no, it won't be," and I'd be accurate around 99 percent
of the time.
So basically the thing is these kind of large cascades are extremely, extremely rare, and
predicting rare events is hard. So that's one reason why this kind of would be a nontrivial
prediction problem.
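As an aside, here is a minimal sketch of how you would compute such a complementary CDF from a list of cascade sizes; the sizes here are made up, not the Facebook data:

```python
import numpy as np

# Hypothetical cascade sizes: most cascades are tiny, a few are huge.
sizes = np.array([1, 1, 1, 2, 1, 3, 1, 1, 7, 1, 2, 150, 1, 4, 1])

# Complementary CDF: for each x, the fraction of cascades with size >= x.
xs = np.sort(np.unique(sizes))
ccdf = np.array([(sizes >= x).mean() for x in xs])

for x, p in zip(xs, ccdf):
    print(f"P(size >= {x}) = {p:.2f}")
```

On log-log axes, a plot of these points is what the graph in the talk shows: a heavy-tailed, roughly straight line.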
Another reason why this would be a nontrivial prediction problem is that even with
exactly the same piece of content, I can get widely different numbers of reshares. So
here I get three reshares. In this case the same image, the same content, got 10,000
reshares. So I have the same content but widely different popularity. And it's not a
priori clear why only three here and 10,000 there.
I can explain to you why this is. It's actually delicious. What you do is you take a
banana, put peanut butter on it, cover it with chocolate, put it in the fridge, and you have
this great, great dessert. You should try it. It's really good. Okay. And I know this part
of the network appreciates good desserts, and I know this part of the network doesn't.
Or whatever. So kind of a big difference.
So all this little accumulated evidence is nicely summarized by this quote from one of
Duncan Watts's papers, where what they were saying is that increasing the strength of
social influence increases both the inequality and the unpredictability of success of the
content. Success in our case is how many reshares you got, how big of an information
cascade you created.
And here are two kind of interesting questions. One is about unpredictability and the
other is about inequality. So I'll first focus on the unpredictability part, and then I'll kind
of go to the inequality part and touch on that one a bit also. Okay?
So here is the outline for my talk. First I'll tell you about this work we just published at
WWW using Facebook data on whether cascades can be predicted -- basically whether
this resharing behavior and the size of the cascade on Facebook can be predicted. And
then the other thing I'll talk about later is not only can we predict cascades, but can we
create them -- can we create engaging content and automatically post it online. And this
is something we did with Reddit data. So these are the two parts of my talk.
Okay. So here is the take-home message: Are cascades predictable? Yes, they are, very
much. The trick is to define the prediction problem well. And this is what we call the
cascade growth prediction problem.
What's interesting when I say cascades are predictable is that not only is the size
predictable, but the structure can be predicted as well, and we can also take the content
into account and differentiate between cascades of different sizes for the same piece of
content. So we can actually predict a lot of things about the cascade.
And what I want to do now is kind of guide you through how you can do this. Okay? So
before you even go and kind of start training your machine learning classifiers, the first
question is how do I even go and formulate the problem. So the idea is the following. I
will observe a bit of a cascade, maybe the first six reshares or whatever, and my goal is to
predict what will happen at the end of time, kind of what will happen in the future, what
will happen to this cascade after some time.
And there are many different ways I could formulate this as a clean machine learning
problem. So here's one potential formulation. What I could do is binary classification,
where I observe a bit of a cascade and then try to answer the question: will this cascade
grow beyond size K or not, for some arbitrary K. Maybe K is 1,000. Or 100. Whatever.
Right? So basically I have this binary prediction problem.
The problem if you formulate the task this way is that most cascades are small. As I
showed you before, the cascade size distribution is heavily skewed, so I will have this
huge class imbalance. And of course I can then go and subsample and things like that,
but this causes problems. So if I do it this way, the problem is I have a huge class
imbalance and most of the cascades are super small, so there is a trivial predictor that
does almost perfectly well, and it's very hard to beat. Okay? So that's one way.
Another way I could do this is to say, look, this is not a classification problem, this is a
regression problem. So all I want to do is: you give me a bit of a cascade and ask me to
predict how big this cascade will be in the future, and I just predict the exact number of
reshares.
This also has a problem, and the problem is similar, in the sense that the cascade size
distribution is heavily skewed. If you plot it, it follows a power-law distribution, which
means it's a heavy-tailed distribution, which means that outliers will skew your error.
One way to fix this would be to not predict the raw number of reshares but the log
number of reshares; then regression works better. But then you're not predicting the
number, you're just predicting the order of magnitude. So, again, not clear.
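To see why the log transform helps, here is a tiny illustration with toy numbers (not from the talk): a single huge cascade dominates squared error on the raw scale but not on the log scale.

```python
import numpy as np

# Toy heavy-tailed targets: most cascades tiny, one huge outlier.
y = np.array([2.0, 3.0, 2.0, 5.0, 4.0, 3.0, 10000.0])

# Squared error of the best constant predictor (the mean) on each scale.
raw_mse = np.mean((y - y.mean()) ** 2)                   # dominated by the outlier
log_mse = np.mean((np.log(y) - np.log(y).mean()) ** 2)   # much better behaved

print(raw_mse, log_mse)
```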
And then the last way you could formulate this problem would be to say, okay, let me
only look at cascades that reached some minimum size, maybe 10, and then out of these
let me try to predict future growth. So let's ignore small cascades, they're not
interesting; let's just focus on the beefy ones and try to predict those.
If you go about it that way, then you have this huge selection bias, in the sense that you
are now defining your prediction problem only over a small subset of the data.
Okay? So these are three ways you could formulate the problem, but this is not how we
formulate it. The way we formulate it is the following. We call this the cascade growth
prediction problem. And what we are asking is basically: will a cascade reach the
median size. Okay? It's a binary classification problem with this question. So let me
now explain to you what I mean by this.
So the idea is the following. I observe the cascade up to size K, and now I want to
predict whether a cascade that I observed up to size K will grow beyond the median size
for those cascades or not. So the idea is that I take all the cascades that had, let's say, at
least five reshares; this gives me some population of cascades. This population has a
median. And now I'm asking: are you going to grow beyond that median or stay below
it. Okay?
So for a given K, there is some median cascade size for cascades that reach size at least
K, and now I'm asking are you less than median or more than median.
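A minimal sketch of this labeling scheme, with made-up sizes (the real pipeline obviously works on reconstructed Facebook cascades):

```python
import numpy as np

def label_cascades(final_sizes, k):
    """Condition on cascades that reached at least k reshares, then label
    each one by whether its final size exceeds the median for that k.
    By construction the two classes are (nearly) balanced."""
    final_sizes = np.asarray(final_sizes)
    eligible = final_sizes[final_sizes >= k]
    median = np.median(eligible)
    return median, (eligible > median).astype(int)

# Hypothetical final sizes; on the Facebook data the median for k=10 is ~20.
median, labels = label_cascades([12, 10, 35, 18, 11, 22, 90, 14], k=10)
print(median, labels)
```

Note that each choice of k yields a different median and therefore a different, self-balancing prediction problem.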
What is cool about this prediction problem is that the median is the 50th percentile. So
by definition, guessing always gives me 50 percent accuracy. My classification problem
is balanced.
Another feature of this is that for every K I get a different prediction problem, because
for every K the median changes, since I'm conditioning on a given cascade size.
So the next question is how the median values for different K relate to each other. So
here is the data. This is Facebook data. This is not fake. This is really a straight line.
It's amazingly straight. And what this is showing is the number of reshares I observed,
K, versus the median cascade size for cascades that reach size at least K. Okay?
And what this is basically saying is that, for example, the median size of cascades that
reach size 10 is 20. Okay? So basically, asking whether you will grow above the median
size is the same as asking: will you double.
And this is not magic. You can actually prove that because the cascade size distribution
is a power law, asking this question about the median is the same as asking will you
double. Okay?
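A sketch of the argument, assuming a pure power-law tail with exponent alpha (the talk only says the distribution is a power law):

```latex
P(X \ge x \mid X \ge k) = \left(\frac{x}{k}\right)^{-\alpha} = \frac{1}{2}
\quad\Longrightarrow\quad
m_k = k \cdot 2^{1/\alpha}
```

So the conditional median m_k is always a fixed multiple of k, which is the straight line on the plot; and if alpha is close to 1, that multiple is close to 2, so the median question is exactly the doubling question.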
So now -- yes.
>>: Quick question. So this median data is in [inaudible] for the real data?
>> Jure Leskovec: This is the real data.
>>: And this serves as your ground truth like the median of [inaudible] that's kind of the
ground truth?
>> Jure Leskovec: That's the question. Yes.
>>: Okay.
>> Jure Leskovec: That's the question. Exactly. So what -- yes.
>>: Have you looked at the rate at which something doubles? Does it stay relatively
constant until some saturation point for different events? So, you know, if it's size five, it
took me a day to double at size 10 [inaudible] double until I reach some overall
saturation?
>> Jure Leskovec: That's a great point. So we were not trying to predict time to double,
but we use time to tell us is this going to double at all. So we were not saying given the
distributed double how long will it take, but we were asking what does the time tell me
about whether something is going to double or not.
So we were interested in predicting the size, not predicting the speed. But that's an
excellent point. So I'll show you what I mean. Okay? Yeah.
>>: I'm not real clear on that. So you kind of fix a point of like for ten days --
>> Jure Leskovec: No, no, no. I say I fix how much of a cascade I see.
>>: Yes.
>> Jure Leskovec: And now I want to predict whether the cascade is going to double or
not.
>>: Double using a time window or --
>> Jure Leskovec: Some time window. Long enough. These things are not -- the
cascades die in a couple of days. The lifespan of these things is so short that basically,
like, we look at two months of data or one month, and that's basically -- we have
enough. So there are no end-of-time effects, if that's what you're worried about.
Okay. So what's my prediction problem? My prediction problem is: I observe the
cascade for the first K steps, and I want to predict whether the cascade will reach the size
of twice K or not. Okay?
What are the cool properties of this problem? So the prediction problem is: given that
the cascade has obtained K reshares, will it grow beyond the median size, which
basically means will the cascade double.
The two cool properties of this prediction problem is that it's naturally balanced. The
reason it's balanced is because median is the 50th percentile, so exactly half of the
cascades by definition won't double and half of them will double. And the other one is
that this is not a single prediction problem but a series of prediction problems, in the
sense that for every K I get a different prediction problem. So what this means is I can
now track the cascade growth over time.
So now that I've told you what we are trying to do, here's the data. We are using
complete anonymized trace data from June 2013, where we spent quite a lot of effort to
actually reconstruct these cascades based on click and impression data and so on. On
Facebook around 350 million photos get uploaded a day, and only around 150,000
photos get at least five reshares. So most of this stuff never spreads anywhere.
And now that I have this data, I'll do very simple things. I'll just go extract some
features from the cascades, and I'll evaluate the performance of a given machine learning
classifier. And what I'll show you is just the performance of a very simple logistic
regression classifier. Our goal was not to get the best predictive accuracy; our goal was
more to understand the problem. So I'm sure you can do much better than what we did;
it shouldn't be too hard to boost the performance of this by another 5, 10 percent or
something.
But what we are more interested in is kind of exploring the space. So let me just show
you what we did. So the first thing we wanted to understand is what are the factors that
influence predictability. And there are four things, if you think about it, that matter.
One is the content that is trying to spread. In our case we are interested in images that
spread. So the question is what's the content of the image, does it have text overlaid, is
it a meme, things like that.
The other thing that is also important is who's the user that is spreading this. For
example, you know, what's the number of followers that this user has. If Justin Bieber
posts something, that will have a very different effect than me, with my 500 followers,
spreading something.
Another important thing is not just what is the structure of the individual user but also
what is the structure of the network in which the cascade is trying to spread.
And then the last thing that is also important or that gives us information is how quickly
or what are the temporal characteristics of the things spreading, how quickly is it
spreading and so on. And this kind of goes towards both questions.
So just to show you how well you can do: this is the question of whether a cascade will
double -- will you go beyond the median or not -- for a cascade of size 5. This is
classification accuracy. [inaudible] look very similar; guessing gives you 50 percent.
We can do around 80. Time by itself gives you the most signal. Who's the user gives
you relatively a lot of signal here. The structure of the network around it also gives you
some, and the content gives you the least.
That's how I would interpret this. But kind of the bottom line is that it doesn't really
matter too much what kind of features you are using. You can do around 80 percent.
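To make the setup concrete, here is a minimal sketch of the kind of per-feature-group comparison described here, using scikit-learn; the feature columns and values are invented placeholders, not the paper's actual features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# Hypothetical feature groups for cascades observed at k = 5 reshares.
groups = {
    "temporal":   rng.random((n, 3)),  # e.g., time between reshares, views/hour
    "user":       rng.random((n, 2)),  # e.g., original poster's follower count
    "structural": rng.random((n, 2)),  # e.g., depth and breadth of observed tree
    "content":    rng.random((n, 2)),  # e.g., is it a meme, has text overlay
}
y = rng.integers(0, 2, n)  # 1 if the cascade doubled, 0 otherwise

# Cross-validated accuracy of each feature group on its own.
for name, X in groups.items():
    acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print(f"{name:>10}: {acc:.2f}")
```

With these random placeholders every group prints roughly 0.5; on the real data, per the talk, the temporal group comes out strongest, then user and structural features, with content last.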
And now that I kind of showed you that this is kind of not a hopeless prediction problem,
now I kind of want to show you some kind of next level of questions and saying, okay,
what's going on, can we understand things better.
So here's kind of the first of a series of these kinds of questions. So the idea is to ask the
following, like how does the predictability, how does the performance change with K?
And what I mean by this is to say the following. Here's one prediction problem. I see the
first five reshares, and I have to predict whether the cascade will grow beyond 10 versus I
see 20 reshares and I need to predict whether the cascade will grow beyond 40. Right?
And the question is which of these two prediction problems is easier. Right?
And if you think about it, there are two opposing arguments. You could say that the
second prediction problem is easier. Why would this one be easier? Just because here I
have more data -- here I see 20 reshares, so I can compute better features. I see more of
the cascade. So predicting this should be easier. Predicting whether this doubles or not
should be easier.
On the other hand, I could make the opposite argument that says no, no, this is harder.
Why would this be harder? Because here I'm predicting 20 steps into the future and here
I'm predicting just five steps into the future. And predicting further ahead into the future
should be harder.
So the question is which one is it?
And so the graph that I'll show you has K on the X axis and accuracy on the Y axis.
And here's the graph. So this is predicting whether a cascade will double or not, as a
function of how much of the cascade we see. And the line shows the accuracy.
And what's interesting is that predicting whether small cascades will double is harder
than predicting whether big cascades are going to double. So predicting bigger cascades
is easier. Which is maybe a bit surprising. So that's the first result.
Here's the second result, a second variation of a similar question. So now imagine that I
fix the minimum cascade size, and I want to ask how predictability changes with K. So
now what I fix is what I'm predicting: whether a cascade is going to grow to size 40 or
not. I keep what I'm predicting fixed, but I'm changing how much of the cascade I
observe. So in this case I observe just five reshares and try to predict whether it's going
to grow beyond 40, whereas here I observe 20 and predict whether it's going to grow to
40.
And if you think now about these two prediction problems, it's kind of clear that this one
is easier. I get more information, I'm predicting the same quantity. So here this thing is
easier than that thing.
But the question in this case is: is there any sweet spot -- is there a point, I don't know,
somewhere between 5 and 20, where I observe just enough information to solve this
prediction problem well? Okay?
So to give you an answer to this one, here is what we did. We took all the cascades that
reached size at least 100, and we are predicting whether a cascade will reach size 200 or
not. And now what I'm doing is observing more and more of the cascade and asking
what my accuracy is. And of course, the more of the cascade I observe, the better my
accuracy. That's not surprising.
But what's maybe surprising is that as I observe more and more of the cascade, my
accuracy is basically linearly increasing. So what is interesting in this case is that each
individual reshare gives me a constant boost in my performance. So basically what this
means is that if I observe 25 reshares and I get a 26th one, my accuracy will increase by
the same amount as if I observed 99 reshares and I see the 100th one. Right? Which is a
bit surprising.
You would imagine that having one more reshare early on kind of helps you more than
getting one more reshare later on. But it seems that kind of each reshare provides you
kind of a constant amount of information regardless of what you have already seen or
what you have already gotten. You get this kind of linear increase in performance, and
that was kind of surprising to us.
>>: [inaudible].
>> Jure Leskovec: Do I have an intuition why it behaves this way? So we did this
experiment several different ways, and we always found a bit of a knee very early on,
and then this kind of straight line. And we definitely did not expect this. We were
expecting more of a diminishing-returns behavior, because here at the end, once you
have a lot of information, having one more reshare shouldn't matter that much.
Why could that be? We think we could blame it on the network, because every new
reshare kind of opens up a new space or a new part of the network, and that may be one
of the intuitions for why this is going on. But I'm not sure -- I would say we don't really
understand this one well enough.
>>: Yeah, because there would seem to be -- I mean, if you're [inaudible] already lots of
friends have seen it, so some of them share hoping that new people will see it, right?
>> Jure Leskovec: Exactly.
>>: [inaudible].
>> Jure Leskovec: It could be something that basically every new reshare kind of opens
a new part of the network that hasn't seen the content yet, and that kind of gives you more
information than just small pieces.
>>: So does this study conduct a fixed group of networks, or it's various different group
of [inaudible]?
>> Jure Leskovec: This is conducted on the complete full Facebook network. So it's one
group of people, 1.2 billion of them.
>>: I would probably add something simpler, which is to go back to the time thing,
right: every individual gives you an estimate of the time to reshare. And as long as that
time to reshare is not getting longer, you're going to continue doubling, right? Each
point offers you that same observation on the time-to-reshare estimate. It'd be worth
checking that time-to-reshare factor and where it starts changing, because it's when that
starts dropping off that you should see a network dive.
>> Jure Leskovec: Sure, sure. So in the paper I show some examples about the time to
reshare and what's important. But I'll go one level deeper than this, and there is more in
the paper [inaudible]?
>>: Yeah, I was [inaudible] where this line intersects a hundred percent accuracy. If you
allowed null reshares in your model, then you could stretch this out to see when K
actually approached 200, which for a cascade of size, that's your --
>> Jure Leskovec: Why would it mean size a hundred? So you say if I see -- because
then I'm not -- then I'm changing the prediction problem. My prediction is always are
you going to reach 200 or not, so it was this are you going to double. But now I -- I
reveal more and more of a cascade to you.
>>: So is your prediction problem always whether you're reaching 200 or whether you're
reaching a hundred?
>> Jure Leskovec: No, no, here it is always: are you reaching 200. So I have the same
set, but now I reveal more and more of a cascade to you. I know that all these cascades
that reached size 25 will reach size 100, and I'm predicting whether they will reach size
200. I'm revealing more and more of the cascade to the learning algorithm and asking:
how good are you at predicting whether you reach 200 or not. And it's the same dataset
as I reveal more and more of a cascade.
>>: But at the 25, do you still train on the entire global data?
>> Jure Leskovec: Uh --
>>: [inaudible] trivial [inaudible].
>> Jure Leskovec: No, no, I'm -- I show this much of a cascade and train this classifier
here.
>>: No, no, what -- you trained on cascades that were less than a hundred as well.
>> Jure Leskovec: Over -- of course, of course, of course. Of course.
>>: And so saying that you have a linear model, it will intersect with the hundred
percent line -- is that before 200 or right on 200? Like if I give you all the data of a
200-size cascade, you obviously know everything, but since you have a linear model, is
that point before you hit the 200 point or right on the 200 point?
So say that you now have all the data of a 200-size cascade; then I'm only using a
hundred of the cascade to do the predictions, and I have 0.8 something. If I increase, it
will increase [inaudible].
>> Jure Leskovec: These are all the cascades that reach size 100. Okay?
>>: Yes.
>> Jure Leskovec: Half of them -- half of them now -- out of [inaudible] size 100, half of
them [inaudible] reach size greater than 200 and half of them less than 200. So basically
between 100 and 200, right?
>>: Yes.
>> Jure Leskovec: And now what I'm doing here is I show you -- so this is all the
cascades of size 100 or more; half of them by definition are between 100 and 200, and
the rest are above 200.
What I'm doing in this experiment is I'm showing you that I have these cascades that are
growing, and I know that they will all grow to at least size 100; some will grow to 200;
50 percent will be in here, 50 percent will be in here. And I show you this much of a
cascade and I'm asking you: will the cascade be here or there.
So I'm kind of revealing more and more of a cascade to you, but I'm still having the same
prediction problem. And what this is saying is that as I reveal more of a cascade to you,
the problem gets easier. That's not surprising. But the interesting thing is that it gets
linearly easier. So that's how to think about this.
>>: Plus it's [inaudible] it will increase to a point that you only need to know like 150
of the cascade to successfully --
>> Jure Leskovec: No, no, maybe we should talk offline or I'll explain later exactly.
Okay?
>>: [inaudible] feature dimension are the same feature dimension?
>> Jure Leskovec: The same features -- the features are always the same. Okay? Now,
going to some of the questions Paul was asking, the first one is, you know, we can also
go and look at the individual features: how much do they matter, right? So, for example,
here is how much the original poster's [inaudible] matters. And we see that early on,
when cascades are small, it's important what your degree is as the initial poster. As the
cascades get bigger, that gets less and less important, which makes sense.
We also see similar things for properties like whether a photo is a meme -- in the sense,
does it have text overlaid or not.
In terms of time -- this was the question -- what we see is that for cascades to double,
they need to have a lot of unique viewers per unit time, so you want to get lots of people
exposed to it. But that's not enough. You also want to have high conversion rates: you
want lots of those exposed people to actually reshare. And having these two together is
a good sign that a cascade is going to double.
Another thing we were looking at is to ask: if I observe, let's say, the first three reshares
of a cascade and I know that the cascade spread in a given pattern, what does this mean
for the future growth of the cascade? So if I show you that a cascade spread in this
pattern and ask you whether the cascade is going to double or not -- this one versus that
one -- which one would you guess is more likely to double: this one that kind of doesn't
want to die, or this one that seems to be spreading in this breadth-first-search type of
way?
And what's interesting is that if you sort these cascade structures and plot the probability
of doubling, it turns out that these long and narrow cascades are very unlikely to double,
while these wide cascades are more likely to double. And actually, the wide cascades
that spread beyond level 1 are the most likely to double.
The intuition being that here you don't know whether this is something that is very viral
or is just being posted by a very high-degree person, while here you see that, you know,
this is going to propagate in a star-like fashion. So that's the intuition that comes from
here.
Rather than just asking can I predict the cascade size, you can also ask can I predict the
cascade structure, in the sense that I see a bit of a cascade and I want to say what the
structure of the cascade will be in the future -- is it going to be more of this
breadth-first-search style thing or this very long and narrow type of structure.
The way we do this is to compute a structural [inaudible] feature of every cascade. This
is called the Wiener index, and it's simply the mean shortest-path length between all
nodes in the cascade. The idea being that if I have a star-like cascade, the Wiener index
will be small; if I have these big tree-like cascades, the Wiener index will be high.
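With the definition used here (mean shortest-path length over all pairs), the Wiener index is easy to compute; a sketch with networkx on two toy cascades:

```python
import networkx as nx

def wiener_index(cascade_edges):
    """Mean shortest-path length between all pairs of nodes in a cascade,
    which is the definition used in the talk (the classic Wiener index
    is the sum rather than the mean)."""
    g = nx.Graph(cascade_edges)
    return nx.average_shortest_path_length(g)

# A star-like cascade (shallow, broad) vs. a path-like cascade (long, narrow).
star = [(0, i) for i in range(1, 6)]
path = [(i, i + 1) for i in range(5)]
print(wiener_index(star))  # small: most pairs are only 2 hops apart
print(wiener_index(path))  # large: average pairwise distance grows with depth
```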
So now the question is how well we can predict the Wiener index. And we can do the
same thing as we did before: we take the cascades up to a given size, compute the
Wiener index of all those cascades, and predict whether the Wiener index is above or
below the median index for that population of cascades.
What's interesting is that basically a random prediction would give you 50 percent
accuracy; we can do around 72. The interesting thing here is that, of course, structural
features tell you a lot about how the cascade is going to grow in the future in terms of
structure, but time also gives you a lot of information.
So basically, cascades that spread fast have a different structure than cascades that
spread slow. Cascades that spread slow can be long and narrow, while cascades that
spread fast tend to be wide and broad.
And this is what comes out of this.
So this was what I wanted to talk about on predictability. Let me just quickly touch on
inequality. And what I mean by inequality is basically this question: I have the same
piece of content; in some cases this content got very few reshares, and in other cases it
got lots of reshares.
So the question is: can I differentiate between cascades of the same content. What we
did here, because we worked with photos, is we were able to go and identify clusters of
identical photos. We have around a thousand clusters. Overall they contained 38,000
photos, and these were reshared 13 million times.
And what we did is, for every cluster, we select 10 random photos, or 10 random
cascades, and the question is: can you predict which one of these 10 is going to be the
largest. Because this is a one-out-of-10 [inaudible] problem, random guessing would
give me 10 percent accuracy, if you like.
Okay? So how well can we do? Given 10 cascades of the same photo, predicting which
one of those will be the largest, our classification accuracy is around 50 percent -- five
times better than random. If you think of this as a ranking problem, we are at around .66
in terms of mean reciprocal rank, which basically means about half of the time we put
the largest cascade in the first spot and about half of the time we put it in the second
spot.
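For concreteness, a sketch of this evaluation -- top-1 accuracy and mean reciprocal rank over clusters of 10 cascades -- with made-up scores and sizes:

```python
import numpy as np

def evaluate_clusters(pred_scores, true_sizes):
    """For each cluster of 10 cascades of the same photo: did the model
    rank the truly largest cascade first, and at what reciprocal rank?"""
    top1, rr = [], []
    for scores, sizes in zip(pred_scores, true_sizes):
        order = np.argsort(scores)[::-1]   # predicted ranking, best first
        largest = int(np.argmax(sizes))    # index of the largest cascade
        rank = int(np.where(order == largest)[0][0]) + 1
        top1.append(rank == 1)
        rr.append(1.0 / rank)
    return float(np.mean(top1)), float(np.mean(rr))

# Toy example: two clusters of 10 cascades each.
rng = np.random.default_rng(1)
pred = [rng.random(10) for _ in range(2)]
true = [rng.integers(1, 1000, 10) for _ in range(2)]
print(evaluate_clusters(pred, true))
```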
What's also interesting in terms of inequality of content: you can compute the Gini
index, where 0 would mean perfect equality and 1 would mean perfect inequality. We
see that the Gini index of these cascade sizes is around .8. The Gini index of the human
world wealth distribution is .66. Okay? So the point being that online content popularity
is distributed less equally than human wealth. And we think of human wealth as being
super skewed. This is even more skewed. But that's not a problem: it's still predictable.
It's very skewed, but very much predictable.
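The Gini index mentioned here can be computed from a list of cascade sizes with a standard formula (generic code, not from the paper):

```python
import numpy as np

def gini(values):
    """Gini coefficient: 0 means perfect equality, 1 means maximal inequality."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    # Standard closed form over the sorted values.
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * np.sum(x)) - (n + 1) / n

# A heavy-tailed toy distribution is far more unequal than a near-uniform one.
print(gini([1] * 95 + [1000] * 5))  # close to 1
print(gini([10, 11, 12, 9, 10]))    # close to 0
```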
So with this, you can now start asking: okay, you can make these predictions as a
cascade is growing, you can trace the growth of a cascade and make predictions about
what you think will happen in the future.
But the question is: can you go and make your posts viral, can you go and optimize
posts so that they will get a lot of attention.
So here is kind of what I would like to do in this part. I would like to say how can I go
and maximize the probability of success of my content. So in some sense, I have a piece
of content, and I would like to make it successful, which means I would like people to
like it, I would like people to comment on it, I'd like people to reshare it.
And if this is what I want, the question is how do I get it -- what can I do. And the
things that I have influence over are, for example, [inaudible]: where do I post -- if I'm
posting to a given Web site, usually these Web sites are organized into groups or
communities, so which community am I posting to; at what time am I posting; what is
the text that I'm using to convey my message; who is the user that is posting this;
what's their reputation or popularity.
You know, maybe this content has already been published in the same community; how
well did it do previously, and so on and so forth. So these are the kinds of things that I
want to understand.
And what we would like to do is basically take these individual factors and kind of tease
them apart. So what we require is in some sense a dataset that will allow us to do this.
And the idea is similar to what I showed you on Facebook, right, in the sense that I want
to look at the resubmissions of the same content across multiple communities,
communities with varying characteristics from various users.
And this basically means that I have this kind of natural experiment that will allow me to
figure out how these factors relate to each other.
So the dataset we'll be using is the following one. It's from Reddit. And Reddit is this
Web site where people post stories or content, and other people can upvote and
downvote and comment on these stories. Okay? It's not a social network, it's not
Facebook; it's a Web site where people post content and then others can comment, vote,
and so on. Okay?
And what we did is we were able to identify posts where the same content gets posted
multiple times. We worked with images. We have around 130,000 Reddit submissions
from 16,000 original images, or different pieces of content, with around seven
resubmissions per image, or per piece of content. And the data is available on our Web
site, so you can download it and play with it. Okay?
So now, given this, the question is how popular the content is. So here's a question.
This is a given image that was submitted to Reddit. Here it is. And this image was
submitted 26 times. And here is one way to quantify the attention the content gets.
This is what we call score: it's simply the number of upvotes minus the number of
downvotes.
What you see here is that, for example, the first time this image was submitted it got
some number of upvotes. You know, maybe the fifth submission got a bit more, but it
wasn't until the tenth one that a really serious number of upvotes came. And then, you
know, the next few were really unlucky, and then the 16th one was, again, very lucky,
and so on.
So you see here that we are reposting the same piece of content, but the popularity of
that piece of content differs very widely -- between a difference of around 800 between
upvotes and downvotes versus, you know, just a small amount of attention.
>>: Does the popularity relate to who posts this content?
>> Jure Leskovec: I will -- I will show you.
>>: Okay.
>> Jure Leskovec: Yeah? Okay. But that's basically what we would like to do. So in
some sense, we'd like to play this game: we'd like to go and repost content on Reddit
and get more upvotes than what the original post did. Okay. So that's what we want to
do.
So if I now want to go on Reddit and repost content, what can I play with? I can play
with the title, I can play with when I submit. I could potentially play with who's the user
that is submitting, but we didn't do that. I can decide which category to submit to. And
then there are several ways to measure my success: one is how many upvotes or
downvotes I get, how many comments I get, and so on. Okay.
The content that I'm posting is fixed: I have this image, it has already been posted; I'm
just trying to repost it and get more attention than what the original poster did.
Okay. So the way we do this is basically to build a model that tries to predict the
popularity of a given piece of content. And the model has two components. One is
about the community -- who are the people that I'm talking to -- and the other one is
about the language -- in what way am I talking to that set of people.
And the community model will basically try to model the choice of community, at what
time I am submitting, and how well previous submissions did, and then the language
model is all about how the title and the language of the submission relate to the language
of the community.
So let me show you a few things from the community model. For example, here is how
much time matters. This is time versus score for submissions in six different Reddit
communities, or Reddit categories. And you see that posting at 4:00 a.m. may not be as
good as posting around lunchtime. Okay? The effect is actually surprisingly big. So it
matters when you post.
Another thing that matters in terms of time, because you are reposting things, is how
much you are penalized for reposting the same content. So here is, again for different
categories, the resubmission number of the same piece of content versus popularity.
And you see that generally the first time the community sees a given piece of content
you get a lot of attention, and with every subsequent submission you get less and less
attention. And that makes a lot of sense.
However, what can save you is to wait. Basically, people forget. So what this is
showing is the probability that your current submission is less popular than the previous
submission of the same content, versus how much time passed between the two
consecutive submissions. And if you are willing to wait four months, then basically
people have forgotten about that piece of content, at least in the atheism community. In
other communities, like pics, you are still unlikely to be more successful than the
original post, even if you wait a long time. So these guys are forgetting slower than the
atheists are forgetting, if you like.
Okay. So this is in terms of temporal effects. And there are also these interesting effects
between communities. What I'm showing here is all the different Reddit categories, and
the idea is how likely a piece of content is to be successful in one category given that it
has been posted in another one. And basically the diagonal tells you that resubmitting to
the same community, the same category, is always bad. And the other thing it tells you
is that if something has been submitted to one of these big categories, then it will do
uniformly badly in all the other ones.
So basically the way to think about it, you have these huge categories and then all other
smaller ones are kind of subsets of these few huge ones. And the small ones don't
overlap with each other, but they overlap with the big one.
So now, given these pieces of signal, you can simply put all of this together into a big
regression model, where you say: I want to model the amount of attention a piece of
content got; I want to estimate this latent, inherent content popularity term; I want to
model forgetfulness; I want to model whether this has been submitted multiple times to
the same community; I want to model how well this piece of content did previously in
other communities; and so on and so forth.
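A toy sketch of the shape of such a regression (the feature names are invented, and the paper's actual model additionally estimates a latent per-content popularity term, which this sketch omits):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n = 500

# Hypothetical community-model features for each (re)submission.
X = np.column_stack([
    rng.integers(0, 24, n),   # hour of day of the submission
    rng.integers(1, 10, n),   # resubmission number within the community
    rng.random(n) * 120,      # days since the previous submission (forgetting)
    rng.random(n) * 500,      # best score this content got in other communities
])
score = rng.random(n) * 800   # toy target: upvotes minus downvotes

model = Ridge().fit(X, score)
print("coefficients:", model.coef_)
print("R^2 on training data:", model.score(X, score))
```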
And I can basically go and train and figure out these coefficients. So that's how to
model time and community effects. Now I also want to model language effects. And in
terms of language, a few interesting things. So what this plot is showing is what we call
community specificity. It's simply how much the language of the title matches the
language of other posts in the community. And basically what this is saying is the
following. If you are talking to gamers or atheists, the more similar your language is to
the language of the community, the better.
While for other communities, like gifs and pics, the interesting thing is that if you are
too similar to the language of the community, that's actually bad for you. So it's good to
be here in the middle. Right?
The way to understand this: if you are using words that the community is not using,
nobody understands you. If you are using words everyone else is using, you are kind of
[inaudible]. But if you are here in the middle, at least for these three communities, that
seems to be the best, while for gaming and atheism, the more standard the vocabulary
you use, the better.
And then the second part is showing what we call title originality, which is simply: I
have my current submission with a current title; how does this title relate to the titles of
the same piece of content previously submitted?
And what this is saying is that reusing old titles is always bad; coming up with new titles
that haven't been used yet is good. Okay?
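Both quantities boil down to text-similarity scores. A toy sketch using simple word overlap -- the paper uses proper language models, so treat this only as an illustration:

```python
def jaccard(a, b):
    """Word-overlap similarity between two texts, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

community_text = "cat does a thing cats doing things funny cat gif"
title = "my cat doing a funny thing"
previous_titles = ["funny cat does a thing", "cat thing"]

# Community specificity: how close is the title to the community's language?
print("specificity:", jaccard(title, community_text))

# Title originality: one minus the best match against earlier titles
# used for the same piece of content.
print("originality:", 1 - max(jaccard(title, t) for t in previous_titles))
```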
Then we actually go a step further and ask what kinds of grammatical structures I should
formulate my titles in to get lots of attention. And depending on which community you
are talking to, you may want to use different sentence structures. This matters quite a
bit.
Okay. So now that I have these pieces and can put them together into a big regression
model, I can ask how well it is doing. I'm just measuring the R squared between our
predicted popularity and the true popularity. The community model, in terms of the
score (upvotes minus downvotes), gives around .5 R squared; the language model by
itself gives you around .1. But what's interesting is that when you put the two together,
their performance basically adds up and you are at around .62. Okay?
So this is in terms of the score or rating. You can also look at how much attention you'll
get, simply as upvotes or downvotes; there you can do an R squared of around .52 --
sorry, .58. And for engagement, in terms of number of comments, you can do
around .7.
So it seems to be working quite well. But the second part of the evaluation we did was
more fun. Here the idea was: we generated 85 pairs of posts where we are basically
resubmitting old content. Okay? And the idea was that one of the posts we just go and
submit, and for the other one we use our system to tell us what title we should use, when
we should submit it, and so on.
And on average we got around three times as much attention or activity on the posts
submitted by our model as on the posts submitted by humans.
And out of these 85, five of them reached the front pages of their corresponding
communities, and two of them reached the front page of Reddit's /r/all -- which is kind
of interesting: we were just reposting old content, but we still got on the front page of
Reddit.
>>: [inaudible] is to play with perceived popularity -- so rather than changing content,
actually change the number of votes that the post currently has, if you have a way of
doing that. So if somebody sees that something has been upvoted a thousand times, they
may be much more likely to [inaudible]. It could be just like for people citing papers:
after a while, people just start citing a paper because it is already highly cited.
>> Jure Leskovec: So there has been some work on this. There was a Science paper by
[inaudible] where basically they showed, as far as I know for some Russian Web site
that is kind of Reddit-like, that if you have a piece of content and you just quickly
upvote it, then it will get many more upvotes than if you don't do anything. But then
I've talked to a few other people and they were not able to reproduce that result. So it
seems that that effect is very community specific.
So here we did not mess with that. We also created brand-new users with [inaudible]
karma score. So those are other things that you could play with; we didn't do that. We
just played with the title, the time, and where we post.
Just to show you an example, here's one of the images we were posting. This is
something that our system said would be a good title. Once we posted this image, which
had been posted before, we got 7,000 upvotes, 5,000 downvotes, and around 500
comments. Here is something that our system said would be a bad title, so we also
submitted this. We got around 300 votes and nine comments. And if you look at why
our system thinks this is a good title: it's original in terms of [inaudible] the previous
titles, it's a good length, and it has an interesting sentence structure, while, you know,
this one is not original -- generic, short, and kind of flat and boring.
And you see, at least in this case, a big difference. In this case the difference was around
3X. So I'll show this one last time. Okay. Good. So if this is what we wanted to do,
then I think we can declare success. We were able to repost old content, just changing
the title and things like that, and get much more attention than what the original posting
got.
So what are the conclusions of this work? One is about, you know, can cascades be
predicted. Yes, they can be. They are quite predictable, even though large cascades are
rare. And to some extent cascades can not only be predicted, but they can also be
engineered, if you like.
Now, what did we learn about viral posts, or posts that are engaging? If you want to do
this, what do you need? The content by itself is important, right: if you submit images,
people favor memes and such, things that are interesting and popular. What's definitely
important on Facebook is to have lots of followers -- to be popular. It's important who's
the originator.
What's more interesting is this observation that it's not enough to just have lots of
friends; you have to know what your friends like. And even that's not enough: you have
to know what your friends like, what their friends like, and so on and so forth. So you
really need to know your network.
And posting at the right time plays quite a bit of role.
What are some kind of interesting further questions? This is something from Facebook.
So this is the same photo, four different cascades. One kind of started here, here, there,
and some are here. Right? And now I have basically these four different cascades that
are kind of merging and interacting with each other. So kind of understanding how these
information cascades interact and how they kind of clash together, that is something that I
think would be a very interesting piece for future work.
And another thing that would be interesting is not just kind of trying to predict how the
content spreads, but also how the content mutates and how the sentiment toward that
content changes as the content is spreading. So kind of not just saying how does the
information spread but how does the attitude or something attached to that information,
how does that change as information is spreading. So in terms of maybe attitude or
sentiment, or maybe even understanding of how the thing that is spreading mutates as it is
spreading throughout the network.
And then the last thing I wanted to say, taking a step back: what does this mean? In
some sense, the motivation comes from this idea that today messages are spreading
through networks, and the way we consume information today is in these small
increments, over networks, in real time -- very differently from what we used to do.
And this really requires us to think differently about information consumption, search,
and so on.
And there are several kind of interesting feedback effects happening in this information
network. So the first one is the feedback loop that arises because we are using our social
connections. What I mean by this is that, because information is being spread over the
links, some links get strengthened and others weakened, or some get broken and others
get created. So understanding this process of how information sharing and link
strengthening and weakening relate would be interesting.
And then there is also another feedback loop, which is that we as users now see our
position in the network and we see how people react to the content we are sharing. How
that affects us and our content sharing and creation is kind of another uncontrolled
experiment that is happening, and understanding that one would also be something that
would be really cool.
So with this, I'm done. Data and papers and more papers. Thank you.
[applause].
>> Paul Bennett: We have time for some questions.
>> Jure Leskovec: Yeah.
>>: I had a question about the -- especially on the Facebook study at the beginning. To
what degree or how do you convince yourself that you're learning something about
people's behavior versus the algorithm --
>> Jure Leskovec: The feed ranking algorithm?
>>: Yeah. Or does it just not matter?
>> Jure Leskovec: No, it's a good point. So we were worrying about this quite a lot,
especially in these systems, because everything is kind of driven by recommender
engines or some rankers. It's kind of hard to know whether I'm just reverse engineering
the ranker or whether there is anything more.
So we worried about this quite a bit. We were able to look, because A/B tests are
running constantly and some people have ranking turned off, so we were able to see that
there is little difference between those groups and so on. So we tried to convince
ourselves that we are not studying the ranker, and we think we are not. But the best thing
would be to turn off the ranking for the whole of Facebook for a month, and they didn't
want to do that. Which is not surprising. But it's a good point. We worried as much as
we could about it, but in the end there is a bit of it in here.
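A minimal sketch of that kind of sanity check, on simulated data: compare cascade sizes
between a ranking-on cohort and a ranking-off cohort and test whether the distributions
differ. The lognormal data and the two-sample test here are stand-ins, not the actual
internal analysis.

    # Compare cascade-size distributions between A/B cohorts. Simulated data;
    # a large p-value would suggest the ranker isn't driving the result.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    sizes_ranked   = rng.lognormal(1.0, 1.2, 5000)  # cascade sizes, ranking on
    sizes_unranked = rng.lognormal(1.0, 1.2, 5000)  # cascade sizes, ranking off

    stat, p = stats.ks_2samp(sizes_ranked, sizes_unranked)
    print(f"KS statistic={stat:.3f}, p={p:.3f}")  # little difference here by construction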
>>: I don't exactly remember what were the features [inaudible] but have you considered
[inaudible] because if it gets lots of likes then someone will [inaudible]?
>> Jure Leskovec: Sure. Like in our -- in the paper, we have this one page of kind of all
the features we could come up with. And there is a bunch about likes and the -- we did a
lot of kind of image analysis to capture the structure of the content, we did a lot of
network analysis to understand the network, we did a lot in terms of time: how quickly
it is spreading, who's trying to spread it, is the speed accelerating or decelerating, things
like that.
So we tried to be as exhaustive as possible. I think the number of likes was, yes, one of
the features. And likes kind of matter more and more as cascades get bigger, because
small like counts don't tell you too much; most of the content has very few likes early
on. Yeah.
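For the temporal features specifically, here is a small sketch of what "speed" and
"accelerating or decelerating" could look like when computed from the timestamps of the
first few reshares. The exact definitions are illustrative guesses, not the paper's.

    # Temporal cascade features from reshare timestamps: average speed, and
    # whether inter-reshare gaps are shrinking (accelerating). Illustrative only.
    import numpy as np

    def temporal_features(timestamps):
        """timestamps: sorted reshare times (seconds since the original post)."""
        t = np.asarray(timestamps, dtype=float)
        gaps = np.diff(t)                  # inter-reshare gaps
        speed = 1.0 / gaps.mean()          # reshares per second, on average
        first, second = np.array_split(gaps, 2)
        # Accelerating if recent gaps are shorter than early gaps.
        acceleration = first.mean() - second.mean()
        return {"speed": speed, "accelerating": acceleration > 0}

    print(temporal_features([10, 30, 45, 55, 62, 66]))  # gaps shrink: accelerating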
>>: So [inaudible] network structure and the information that's spreading, particularly
the feedback of what information you're trying to spread versus [inaudible] versus how
the network responds. For these kinds of benign cascades, benign information, it
probably doesn't change the network structure much. But if I disagree or agree with
much more controversial pieces of information, [inaudible] to change the structure.
>> Jure Leskovec: I think -- so we are looking at this a bit right now, not on Facebook,
but using Twitter. What's interesting about Twitter is it's like a super dynamic network,
in the sense that edges get created all the time and edges get dropped all the time.
Even if you do nothing, you have this kind of constant churn of new followers and a
constant churn of people unfollowing you.
And then what we see is that because of your tweeting activity and because of retweeting
you get these kind of spikes or bursts of new followers coming, people unfollowing you
and so on.
The biggest case when people unfollow someone is -- usually it's kind of professional
athletes when they change teams, then you see basically somebody says, you know, it
was great playing for you guys, but now I move here, and it's just like these links get cut,
these links get added. That was the biggest thing we saw.
And then you see this when politicians and Donald Trumps and so on say things they
shouldn't say.
>>: Right. More controversial people.
>> Jure Leskovec: Exactly. So --
>>: Good or bad, I presume.
>> Jure Leskovec: But what we are able to do is we have these models that are able to
predict when you are going to get new followers. And it turns out, for example, that you
get new followers when you get a new retweet cascade. But that's not enough. You
need to have a retweet cascade, but if the retweet cascade goes in the direction of people
who already know about you, you won't get the followers, in the sense that they have
already been exposed to your tweets in the past.
What you'd like to do is kind of have a retweet cascade that goes into one part of the
network that is not yet aware of you, that kind of hasn't seen your tweets in the past and
that will generate the links towards you.
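A tiny sketch of that "fresh audience" idea: from a cascade's audience and a log of who
has previously been exposed to your tweets, compute the fraction of the audience that is
new. The sets here are hypothetical stand-ins for real exposure logs.

    # Fraction of a cascade's audience that has never seen your tweets before;
    # higher values should correspond to more new followers. Toy data.

    def fresh_audience_fraction(cascade_audience, previously_exposed):
        """Fraction of the cascade's audience that is new to you."""
        fresh = cascade_audience - previously_exposed
        return len(fresh) / len(cascade_audience) if cascade_audience else 0.0

    audience = {"u1", "u2", "u3", "u4", "u5"}
    exposed  = {"u1", "u2"}                    # users who saw your tweets before
    print(fresh_audience_fraction(audience, exposed))  # 0.6: likely to gain followers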
So that's what we see on Twitter. Now understanding the breaking of links, that's harder.
Because it's more -- it's much more about content. While in link creation, it's more about
kind of new people being exposed to you and saying, oh, this is interesting, let me try it
for a week. So it seems, at least on Twitter, that users, you know, they come, they follow
a few celebrities, and then later on they discover what they like. They almost learn what
is their place in the network, what is the content they like, what is the content they don't
like. So it's this kind of news feed optimization by creating and dropping links. And
there's a lot of that. And there is no social cost to dropping a link, which on Facebook
there is; you don't unfriend someone. Right?
>>: Right. Which brings me to my second question. Links on Twitter or [inaudible] are
not really personal, so they don't really reflect the actual social network and the breaking
and forming of social links, where there is a cost to both forming and breaking. These
are easy to study; the other ones are hard. Any thoughts?
>> Jure Leskovec: Sure. I mean, in some sense Twitter is easier because it's almost
more honest, in the sense that it's much more about the information. Even though there
I'd imagine that there are some links that are social and some that are purely
informational. But comparing the two, the Facebook links seem to be much more
socially driven, while most of the Twitter links, I'd imagine, are information driven, in
the sense that you follow someone because you like the feed that they are creating.
>>: With your colleagues at Facebook, it'd eventually be interesting to look at this after
a change of relationship status -- divorces and other types of dating breakups -- and what
happens in social [inaudible]. That should be common enough, and certainly friends
seem to split after that, so you get a lot of --
>> Jure Leskovec: There have been at least two papers from Facebook on this
topic: what happens when people move, what happens when they break up.
There was another paper trying to say, can I identify your wife in your network -- kind
of, how do you identify a significant other. And they can predict the relationship length
based on how well you have integrated your significant other into your network and
things like that.
>>: [inaudible] same patterns as what you see in the public network?
>> Jure Leskovec: That's a good question. I don't have an answer.
>>: And on that note, have you compared to [inaudible] over time?
>> Jure Leskovec: Yeah, no. I mean, there's a lot to say here. Yeah. Yeah. Yeah.
>> Paul Bennett: Other questions?
>> Jure Leskovec: Anything else?
>> Paul Bennett: Okay. Let's thank the speaker.
>> Jure Leskovec: Okay. Good. Thank you, guys.