>> Devavrat Shah: Thanks. And thanks, Rico for making this possible. I
really appreciate it. So just so that there's a bigger purpose behind the
visit, and one of the things that we would really like to do at LIDS -- LIDS,
which is a research lab at MIT. So many of you who might know how MIT works, the
departments are there and the research labs are there, and both are important
and everybody belongs to both. And at least one lab, if not more.
At LIDS, we primarily play the role of a bridge between EE and CS, so things
like what I'm going to tell you about, which is a nice bridge between
statistical inference and machine learning.
Things people do in optimization, which by design is between EE and CS: people
do continuous optimization and discrete optimization, control and robotics,
computer vision and signal processing, and the list goes on.
And these are the type of things we do there, and we would really like to
engage broadly speaking. And with that prelude, let me start telling you about
what kind of stuff I do and I'll be happy to talk to you about myself or
about others, if you are interested in our [indiscernible].
All right. So what am I going to talk about? I'm going to talk about a few
questions in the context of processing social data. These are concrete
questions that we've looked at over the past three to four years, and there are
some nice solutions that we have, and I believe there is much more to be done
here than just thinking about big data as building big systems.
So with that in mind, what do I really mean by
social data? It's data that is generated by us. It's all sorts of
transactions, electronic transactions that you have made, or the restaurant
rating that you left on Yelp, or movie ratings that you left on Netflix, the
tweets you have sent out if you participate, whether as an employee or an
employer on Mechanical Turk, which is the crowd sourcing system. Or your
Facebook posts, your cookie data and so on. Which I'm sure everybody uses
all the time.
Here are a few options. Of course, we can do better business. In the operations
world in business school, we talk about this as figuring out how to manage
our revenue. It's useful for pricing. At Bing when we think of how to do
advertising. If you are thinking a little more generally, thinking about
myself, I would like good recommendations available so that I can go and eat at
good places, for example, or watch the right movie. It would be useful for
policy making, deciding whether to add a road or not or a school or not. And
if Congress has access to what people like and dislike, it would be very
And more generally, it could change the course of societies. I mean, at some
level, this was a well understood example of that. At some level, crowd sourcing
is changing the way labor exchanges are organized, right? I mean, oDesk
is one of those examples where people can get quick employment that was
not possible before. Mechanical Turk style micro crowd sourcing is helping
connect people across the world in an interesting way. Things like news
reports are useful to get quickly.
So all of these are ways in which we can do something. What is the basic
challenge? At least from my view, at the core of it, we've got a lot of data
which is unstructured and highly noisy, and there's a lot of it, which means what
I want to do is I want to eventually make decisions from this data. In order
to make these decisions, at some level, I need to solve this statistical
challenge; that is, I need to understand what is basic structure there, and if
there's a lot of data, I need to figure out how am I going to do it in
computationally efficient manner that can scale with it.
These kinds of questions have, at some scale, been observed across a range of
domains over the years, but now it has become really acute, given the scale and
given the amount of uncertainty that we have.
Now, for -- this is a billing challenge. I won't be able to solve it at all.
But what I would be able to do, I would be able to tell you a few
concrete questions in which we have taken this two-pronged approach, where
it's thinking of the right statistical inference framework, along with simple
algorithms and try to get meaningful answers. So effectively, I'm thinking of
data as forming an appropriate statistical model. Now, the data is generated
from an appropriate statistical model.
Once I understand that, I could think of taking decisions by coming up with
optimal inference algorithm and while I have finished most of the important
part, the question would be that optimal algorithms are not easy to implement,
especially at scale, so what I would like to do is I would like to develop
meaningful approximations that take me from data to decisions. Really, the
model is only helping me think through it.
So at the end of the day, I will take the data, I apply simple algorithms, I
will get some meaningful answers and these answers are interesting for two
reasons. One is they solve the problem. And second, if I would as an academic
what these are really doing, our statistical model to argue about those
Now, this is at a high level, very useful plan, only if I can execute. Let me
say proof is in the pudding. So I will show you three examples. One in the
context of decision making/recommendation, here the question is we've got a lot
of data that's telling something about people's choice. A little bit about
people's choice and how it stitches together. Second is question of crowd
sourcing. Most of you must know this, and if you haven't, I'll explain
precisely what it is. What I want to do is I want to build meaningful answers
from different small, small answers that I obtain from people which are noisy
and stitch them together. Finally, related to understanding trending,
advertising and Twitter.
Now, I understand that talk is going to last for 40 or more minutes, and this
is too much to go through. So as we go through this way, the information
content will decrease, but hopefully I will be able to convey what's the
question I'm looking at and hopefully the algorithms that we're looking at.
And feel free to stop me here. The audience is small and interaction would be
very useful.
>> Rico Malvar: You have more than 40 minutes.
>> Devavrat Shah: Okay. So with that broad layout in mind, let's start with
this one simple question. I should give credit to my co-authors here. This
has been part of a longer program that has been going on for four to five
years. Started with my former student, [indiscernible] Jagabathula, who is now
at NYU, with [indiscernible] at Sloan School of Management [indiscernible]
Farias. A student, Ammar, who is at LIDS with my post-doc, Sahan Negahban and
[indiscernible] Oh is now at Urbana-Champaign. This is the first part of the
The second part is with [indiscernible] and David Karger. David is a colleague
of mine in the computer science part of our department. And the last one is
with my former student, [indiscernible], who is now at MIT as a faculty member
and a student who spent six months with me; because of his work, he's now at
Twitter. I wish he had remained with me. When we get there, you'll see
Okay. So first part is recommendation. So at a high level, question is
something like this. I've got lots of partial preference information from
various sets of people, somebody telling me I like this restaurant so much.
Somebody telling me I like this movie so much and so on. And then from that,
what I really want to do is I want to somehow put this partial information
together and stitch them and provide some kind of global ranking.
At the end of the day, it's not just ranking, it's also intensity that
matters. So here's some scenarios. Let's say I've got a bunch of movie
watchers. Somebody like yourself telling me that you really like Inside Job.
And based on what other people have told me and what you have liked, I
might suggest that you might want to watch this movie.
Or let's say there might be hiring decisions that you must be doing all the
time, and a candidate is interviewed, and different people give scores
differently. Somebody gives 8 out of 10. Maybe Rico gives eight out of ten.
[indiscernible] gives nine out of ten, Phil gives seven out of ten. And then
at the end of it, you'll be you don't want to hire the Microsoft CEO, but
somebody else is hired. So it's a decision making question.
Now, these kinds of questions show up everywhere. Microsoft TrueSkill: if you
have played and called yourself true skilled, then you will have a score.
People are playing games. There are -- not everybody is playing with
everybody. So only subsets of people are playing with each other and based on
that I want to assign scores to everybody. Recommendation we just went
through, as academics, we think about this all the time. We submit papers to
conferences and then conferences have only ten papers or 15 papers or 25 papers
that can be accepted out of 200 or 500 or thousand. And question is which ones
to accept.
Similar question for us in graduate admissions shows up. Every winter, we have
students to admit and I'm sure you have intern problems of similar type, which
interns to hire and not. Okay. So in all of these questions, really at the
end of it, there are two types of questions one wants to answer. One is how
should I get input from people if I can make it feasible. Things like in
conferences or admissions or hiring, I can tell people that give me in some
form, five stars or this or that. That's a design question. But given
whatever input I have, for example, in games, somebody wins over somebody
else, or maybe if you are playing a cricket match for five days, then that can
lead to draws too. But disregarding the draws, you get pairwise winning results
coming out.
Now, there are all sorts of heterogeneous ways in which data is coming. How
would you look at it from one lens and stitch them together to get answer?
It's really two questions. One is what should I do to design it if you had a
choice. And if you didn't have a choice, you got all sorts of partial data
coming in, preferences coming in, how do I stitch them together, okay.
So let's just look at some of the popular approaches. One would be
like/dislike or star rating. Okay? This is easy to input, basically whether you
like or not like. This is a little bit complicated, because what does four
star mean? But again, no matter what, these are very simple aggregation
problems. Once I have got input, I will average number of likes you have or
take total number here. For example, I've got, let's say, one like for here,
one like for here and one dislike. So plus two minus one and then plus one and
just sorting that out.
And similarly, I can do the same thing for stars. So it's easy to aggregate
once I've got input.
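As a rough illustration of the like/dislike aggregation just described, here is a minimal sketch. The item names, the `aggregate_likes` helper, and the vote data are hypothetical; the only thing taken from the talk is the idea of adding one per like, subtracting one per dislike, and sorting.

```python
from collections import defaultdict

def aggregate_likes(votes):
    """Sum +1 for each like and -1 for each dislike, then sort by net score."""
    totals = defaultdict(int)
    for item, liked in votes:
        totals[item] += 1 if liked else -1
    # Highest net score first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical input: item_a gets two likes and one dislike, item_b one dislike.
votes = [("item_a", True), ("item_a", True), ("item_a", False),
         ("item_b", False)]
ranking = aggregate_likes(votes)
```

The same loop works for star ratings if the +1/-1 is replaced by the number of stars, which is exactly why the coarseness of the scale becomes the issue rather than the aggregation itself.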
The problem is that in this case, this is arbitrary scale, because I don't know
what four stars means. Could be mood dependent. And at the end of the day,
these are coarse, right, because as it happens in our MIT admission system,
once you attend the first round, we are left with roughly half of the students,
all of them starred four. So now what to do. Well, now we have got three sets
of day-long meetings through which we actually talk to each other, fight it
out. Maybe there should be a little better way of doing that. And that's
really the issue of coarseness of the scale.
Again, as Nietzsche said, there is something beyond good and evil. We should
think beyond this. Now, answer I think is in the simple game, and I think this
is the right time for me to entertain you as well. It's morning. It's before
11:00. So let's see. I give you this blue color, and ask you, tell me how
blue is it? Don't worry, I've given you the code too. It's like I'm going to
my orthopedist with my back pain, an orthopedic guy. So let's start, how bad is
your pain? If I didn't have pain which is bad, I wouldn't show up here.
I have better things to do.
But then a good optometrist asks me the right question. Is this vision better
or is that vision better, right? And it's basically about comparisons. And in
this case, I might say that the answer is this is more blue than that. So
really, the answer lies in comparisons. And whichever way you look at it,
whether it's sports, with win or losses, if I'm a cricket fan and I like, my
inclination is this way, so India beats Australia, then say comparison and put
that India beats Australia. India is liked over Australia more. Or these are
two restaurants that some of you might recognize, right? These are two really
nice French restaurants in Seattle. Apparently, this one has better reviews as
far as people's writing goes. That's what I found.
So if you want to try, if you haven't tried either, maybe the suggestion is try
this one before that. Or if let us suppose I was writing a paper about ranking
and there was my paper versus other paper, then definitely, it would mean, even
though you might have come up with ratings as per scores, I would convert it
into comparisons. Bottom line is whichever way you provided me partial
preferences, I could view all of them as bags of -- or bunch of pairwise
comparisons, okay?
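The conversion from scores to a bag of pairwise comparisons can be sketched as follows. The function name, the paper labels, and the choice to drop ties are assumptions made for illustration; the talk only says that any partial preference can be viewed as pairwise comparisons.

```python
from itertools import combinations

def ratings_to_comparisons(ratings):
    """Turn one person's scores into (winner, loser) pairs; ties are skipped."""
    pairs = []
    for (a, score_a), (b, score_b) in combinations(ratings.items(), 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
    return pairs

# Hypothetical reviewer who scored three papers.
reviewer = {"paper_1": 9, "paper_2": 5, "paper_3": 5}
comparisons = ratings_to_comparisons(reviewer)
```

Note that a nine-versus-five and a six-versus-five would both produce the same single pair, which is the loss of precision discussed next.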
In the process, one might say that, well, aren't you losing precision here?
That is, nine versus five and eight versus five have more information than just
a comparison. Yes, you are right. You are losing that information, but it's
not clear if that information is really absolute and meaningful. So it's a
debatable topic.
But definitely, it is absolute information that I've got
So question boils down to the following situation. I've got --
>>: Can I ask a question? Not only there's the question of losing
information, there's the question of [indiscernible], because if it was six
versus five, and you say greater, it is as important as nine versus five.
Whereas six versus five might not even be statistically significant.
>> Devavrat Shah: Excellent point. You diminish noise. And in some sense --
>>: No, because I'm not saying this says that something that the right
conclusion might be equal, you're now assigning it's bigger when something that
was nine to five was the same bigger as the six to five.
>> Devavrat Shah: So in some sense, what you're saying is that --
>>: It's all within the loss of precision.
>> Devavrat Shah: Loss of precision. And if I had more comparisons between
two things, then I should put more confidence over A versus B, rather than just
treating that as answer once and for all. And in some sense that's exactly
what we try to answer using the model that we will build, okay. That's a
great question, yeah.
All right. So at the end of it, we are left with this kind of setting, right?
I've got a bunch of objects. There are edges between them representing that
they've been compared by one or more people, let's say. Here, A12, for
example, reflects that when one and two are compared, one and two played games
with each other, out of those many, let's say, this plus this many games, these
many times, one defeated two, and these many times, two defeated one. So
frankly, I've got this kind of a nice weighted graph. And given this, I want
to, from these comparisons, I want to assign ranking or, more specifically,
scores to each of the objects to be meaningful based on these observations.
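A minimal sketch of the weighted comparison graph described here, assuming outcomes arrive as (winner, loser) pairs. The `wins` structure and the sample outcomes are hypothetical; each edge simply carries the two counts, how often one side defeated the other.

```python
from collections import defaultdict

def build_comparison_graph(outcomes):
    """wins[i][j] = number of times object i defeated object j."""
    wins = defaultdict(lambda: defaultdict(int))
    for winner, loser in outcomes:
        wins[winner][loser] += 1
    return wins

# Hypothetical games: 1 beat 2 twice, 2 beat 1 once, 2 beat 3 once.
outcomes = [(1, 2), (1, 2), (2, 1), (2, 3)]
wins = build_comparison_graph(outcomes)

# The edge between 1 and 2 carries both counts.
edge_12 = (wins[1][2], wins[2][1])
```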
And in some cases, I will have this kind of data. In other cases, I might even
have a choice of designing the graph. That is, which pairs to compare and
which pairs not to compare. If I were thinking of designing a conference paper
reviewing system, if I assigned Lynn, let's say, papers four and five, then he
actually -- let's say four and three, then he actually compared four and three
depending on what scores he assigned.
So I could decide who gets what. Because really, these are two questions. In
some cases this is possible. In some cases, you're left with only this one.
And we would like to answer these two questions. Again, in order to answer
this question, I have to give you an algorithm, and before I give an algorithm,
at least so that I can think concretely, I would like to think of a statistical
So first thing I would do is I would like to tell you about a statistical
model. And the model that I'm going to put as a background is that there's
underlying distribution over permutations that is at the ground truth. And the
observations are coming out of -- the pairwise observations are coming out from
this. So let's see. What do I mean by that? Here's a simple caricature.
Suppose I've got -- I've seen -- got three objects, A, B and C. I've got these
kind of data points, four data points. A bigger than B, B bigger than C and so
Really, I'm thinking in the background that A greater than B possibly is
coming from A greater than B greater than C as a permutation over all objects.
B greater than C possibly is coming from B greater than C greater than A as a
permutation over all objects. And so on. Now, what are these permutations
representing? Well, the permutations are representing a choice model, a choice
model of the population.
I'm thinking he has some ordering of all papers in mind and I asked him only
about a subset of them. And when I asked him about that subset, he revealed
the answer, the orders. When I tell you about two restaurants, I have
inherently in my mind, in the restaurant case, not just one
ordering, but a bunch of orderings, because some days I might prefer Chinese
over Mexican and some days I might prefer Mexican over Chinese. It's a
question of over what fraction of these I prefer this over this versus that
over this. And that is effectively capturing this choice model, as one would
call it, or distribution over permutations of the objects. And that's the
ground truth. And I'm sampling these data points, really I'm getting snippets
of that from this ground truth.
Now, suppose given this data, I learned what the ground truth is that's
consistent with it. In this case, let's say here is one such consistent ground
truth. It's 75 percent of the population believes this ordering. 25 percent
believes in this ordering. Then maybe this might be the reasonable answer. Okay?
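To make the caricature concrete, a choice model can be stored as a list of orderings with population weights, and any pairwise observation is then a marginal of it. In this sketch the two orderings and the helper name are assumptions; only the 75/25 split comes from the talk.

```python
def pairwise_preference(choice_model, a, b):
    """Fraction of the population that ranks a above b."""
    return sum(weight for ranking, weight in choice_model
               if ranking.index(a) < ranking.index(b))

# Hypothetical ground truth: 75% hold A > B > C, 25% hold B > C > A.
choice_model = [(("A", "B", "C"), 0.75), (("B", "C", "A"), 0.25)]

p_a_over_b = pairwise_preference(choice_model, "A", "B")
p_b_over_c = pairwise_preference(choice_model, "B", "C")
```

Under this assumed model, an observation "A greater than B" shows up three quarters of the time, while "B greater than C" shows up always, since both orderings agree on it.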
Again, this is all caricature, so the question is how do we execute in this
But this is roughly the plan. Any questions? Yes, please.
>>: So assume you could have some features for each node, and those features
are ordered and those features may happen to make better [indiscernible]?
>> Devavrat Shah: So, for example, by features, what do you mean by that?
>>: For example, let's say for the [indiscernible], some movie has 90 minutes,
one movie has 120 minutes and maybe some other, you know, much action within
that movie, something like that. So if you have a feature for each node and
this feature can help you to do better job in the ranking?
>> Devavrat Shah: Sure. Okay. So there are two ways to think of that
feature. One is you could just forget -- at some level here I'm thinking of
each movie as a separate node. But you could say, well, I don't want to learn
everything in detail about each movie, but I want to categorize them through
some features. I will have fewer options over which I'm going to do ranking.
And then again, I will convert data into that feature space and that's how it
will happen.
So at that level, I will have more aggregation of data. There is more
confidence in some sense. But I will be losing some kind of precision, because
now I'm comparing two movies, one a super hit and one not so, with the same
feature set. Yes, please?
>>: Does your model take into account that, say, some pairs might have been
rated several times?
>> Devavrat Shah: Yes, of course.
>>: And they do the same way and have more confidence?
>> Devavrat Shah: That is correct. I would like to design an algorithm
effectively that is taking that into account, which is related to a question of
[indiscernible] where I say that, well, if you have one pair compared once,
which is like six and five, ideally, and it's because of noise, one was this
other, if compared many times, maybe it would even out in terms of my
So I would like to have an algorithm that does better, as more and more
information I have. And also, the algorithm shouldn't rely too much on one
comparison. When there's one comparison, it should bias your information only
so much. Excellent question. Yeah.
But at some level, this is my ground truth, and I want to build guarantees
through that kind of thing.
>>: Can I ask a quick question. What if the sampling process is done by
people making [indiscernible] or sensors of any kinds and the samplers
themselves have biases and what if all the samplers could be in a category of
three kinds of samplers and then there's a bunch of samplers with positive
[indiscernible], there's a bunch of neutral samplers and a bunch of
[indiscernible] samplers. That could actually skew your assumption that you
have a ground truth distribution, right?
>> Devavrat Shah: In some sense, it's saying that there are sorts of categories
of people, and for each one of them, I would think of choice model one, choice
model two, choice model three. I want some kind of hierarchical classified
things. I'm doing this for one version of them.
>>: It could be an extension of their thinking to add a bias to one or a bias,
>> Devavrat Shah: That's exactly what we're trying to do right now. There are
some conjectures we have, and I'm happy to, yeah, excellent point, yeah.
All right. So here's very, very brief history. This is a great question.
Everybody has been fascinated by it, including myself. Goes centuries back,
where it says decades back. So here's one of the celebrated results,
Arrow's impossibility result, where he said, well, suppose I don't have
pairwise, but I've got ranking, complete rankings available, and then I want to
aggregate things together.
Then how will I decide winner? So say I've got three objects, A, B and C and
different people have given me their permutations. This is what people would
call ranked elections. Now, in 1851, Tom Hare, a British intellectual, he came
up with this algorithm called Hare's ranking or proportional ranking. It's
been used in all Commonwealth countries, including now currently in the American
Psychological Association, where this is how you elect your president. So if
you want to elect a president, let's say there are four candidates in running,
you will rank all of them. And then at the end of it, we will come up with
some algorithm.
Again, Arrow said, well, if you say that your ranking algorithm satisfies these
sets of properties, which are reasonable properties, then there's no such
ranking algorithm possible. And this is very nice impossibility result, led to
two decades of other impossibility results. And then very recently, function
analysts have got into it and they're saying, well, impossibility result of
Arrow is not just one counter example, but it's actually present in a very
broad sense. So really, this is very hard.
In some sense, we are making it harder, because now I'm giving you just
pairwise comparisons. So in axiomatic sense, there's no way to solve this
problem. So this is a fact. Condorcet had his own criteria, and Cynthia
Dwork was at Microsoft in Silicon Valley. They had an interesting
2-approximation algorithm for what's called Condorcet criteria. There's an
algorithm called Borda count algorithm, which I'll quickly mention in a second.
Young in 1974, an economist, showed that the algorithm has nice
axiomatic properties. Of course, it does not contradict this result. But it
has some nice axiomatic properties.
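For reference, the classical Borda count on complete rankings can be sketched in a few lines. This is the textbook form, assumed here for illustration (each ballot lists all candidates best first, and a candidate in position p out of n earns n - 1 - p points); the ballots are hypothetical.

```python
def borda_count(ballots):
    """Each ballot ranks all candidates, best first; the top one gets n-1 points."""
    scores = {}
    for ballot in ballots:
        n = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return scores

# Hypothetical ranked election with three voters.
ballots = [("A", "B", "C"), ("A", "C", "B"), ("B", "C", "A")]
scores = borda_count(ballots)
```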
So that's a lot of stuff from [indiscernible]. From choice model, there's
Thurstone, who was a behavioral scientist. In '27, he said I think people
[indiscernible] like this. Everybody has some kind of TrueSkill, as it's
called in TrueSkill sense and there's some noise in their performance every
time they play, let's say.
So I'd say I and Lynn play a game and I have my own TrueSkill, he has his own
TrueSkill, and when we play, random variable is drawn, which is adding to
[indiscernible] his TrueSkill. And depending on the final answer being bigger
than one or the other, one wins. That's the type of thing that's used in Elo
ranking. It's the type of thing used here.
There's a whole family of things, depending on how you model the
distribution of noise that leads to different things. This is a Gaussian.
This is what McFadden popularized for policy making which is called multinomial
logit model. It's an extreme value distribution and so on, and that's also
popularized in the business school world.
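For the multinomial logit case restricted to pairs, the win probability has a simple closed form: with a positive weight for each player, player i beats player j with probability w_i / (w_i + w_j). The sketch below assumes that standard form; the weights, function names, and seed are hypothetical.

```python
import random

def win_probability(w_i, w_j):
    """Pairwise MNL chance that the player with weight w_i beats w_j."""
    return w_i / (w_i + w_j)

def simulate_games(w_i, w_j, n_games, seed=0):
    """Count how many of n_games player i wins under the pairwise MNL model."""
    rng = random.Random(seed)
    p = win_probability(w_i, w_j)
    return sum(rng.random() < p for _ in range(n_games))

p = win_probability(3.0, 1.0)
wins_i = simulate_games(3.0, 1.0, n_games=1000)
```

With weights 3 and 1 the stronger player wins three games in four on average, and the simulated count fluctuates around that, which is the noise in observed comparisons that any ranking algorithm has to cope with.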
So it's a long history. There are lots of exciting things that have happened.
What I will do is I will relate this class of models to an algorithm, but
first let me tell you the algorithm. It's an algorithm that plays with what
you observe that matters. And then after that, we'll see how well it works.
So that's a very quick overview of lots of things.
So here is my algorithm. It's what we call rank centrality, for the reason that
it's like a random walk. So remember, we have a graph where each edge has these
kinds of numbers, telling how often one [indiscernible]. I'm going to create a
random walk as follows. Okay.
So for each edge that's present there, I'm going to put probability like this.
In fact, what this probability is reflecting is that fraction of time the other
player defeated me. So intuition behind this random walk is following. I'm
going to have a random walk following on this graph, and the stationary
distribution of that walk is going to assign me scores.
So let's suppose that I'm always defeating everybody. Then the stationary
distribution of this random walk should be pretty strong on me, right? It
should be high. That means if such is the case and if I'm defeating everybody,
as a part of random walk, I should hardly go to the other nodes. And, of
course, I should go to other nodes sometimes if I have defeated others only
once. If I have defeated others only once, I don't have too much confidence in
the data, so while I should have a bias towards me, but not too much.
But if I have defeated others all the time, like for over 100 times, it really
is a strong bias, in which case I should make sure that I go to other nodes as
a part of my random walk with little probability. That's the style of design
that we are doing here.
When I'm at node I, I'm going to go to node J with probability that's
proportional to how often J has defeated me in the normalized sense and these
plus ones are taking care of the finite error correction. So let's
suppose that Rico and I have played only once, and Rico defeated me once. Then
I will go to him with probability 2 over 3. So there is some bias I'm giving,
but not too much. But then if it was 100 games and 100 to zero, then it would
be 101 over 102.
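Written out, the transition rule with the plus-one correction looks roughly as follows. The notation is assumed here, not taken from the slides: $a_{ij}$ counts the games in which $i$ defeated $j$, and the proportionality constant is left unspecified, as in the talk.

```latex
\[
  P(i \to j) \;\propto\; \frac{a_{ji} + 1}{a_{ij} + a_{ji} + 2},
  \qquad
  \frac{1 + 1}{0 + 1 + 2} = \frac{2}{3},
  \qquad
  \frac{100 + 1}{0 + 100 + 2} = \frac{101}{102}.
\]
```

The two worked values are the one-game and hundred-game cases from the talk: a single loss pulls the walk toward the winner with probability 2/3, while a hundred straight losses pull it there almost surely.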
Okay. And if this is connected, which is the minimum you need to have any
reasonable ordering, because if there are two sets of things I have never
compared, then there is nothing meaningful that I can do between them, then
there will be a well defined stationary distribution for this random walk, and
that will give me the scores.
Okay. And if I want to learn this, run this algorithm or just do power
[indiscernible], for example.
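Putting the pieces together, the random walk and its stationary distribution can be sketched in plain Python. Several details are assumptions made so the sketch runs, not statements from the talk: transition probabilities are divided by the maximum number of opponents `d_max` so each row sums to at most one, the leftover mass becomes a self-loop, and the stationary distribution is approximated by repeatedly multiplying by the transition matrix for a fixed number of iterations.

```python
def rank_centrality(wins, n, iterations=1000):
    """Scores from the stationary distribution of the comparison random walk.

    wins[i][j] is the number of times item i defeated item j.
    """
    # Two items are neighbors if they were compared at least once.
    compared = [[wins[i][j] + wins[j][i] > 0 for j in range(n)] for i in range(n)]
    d_max = max(sum(row) for row in compared)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and compared[i][j]:
                # Smoothed fraction of games in which j defeated i.
                frac = (wins[j][i] + 1) / (wins[i][j] + wins[j][i] + 2)
                P[i][j] = frac / d_max
        P[i][i] = 1.0 - sum(P[i])  # stay put with the leftover probability
    # Repeatedly apply pi <- pi P, starting from the uniform distribution.
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical data: item 0 mostly beats 1 and 2, item 1 mostly beats 2.
wins = [[0, 8, 9],
        [2, 0, 8],
        [1, 2, 0]]
scores = rank_centrality(wins, 3)
```

In this toy example the walk spends most of its time at item 0, less at item 1, and least at item 2, matching the intuition that the walk drifts toward whoever wins.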
>>: Is there a typo there? Is that AIJ plus AJI?
>> Devavrat Shah: This one. You're right, of course. Thank you. Yeah, I was
just focusing on top. Thank you. Okay. That's very good. I'm conveying the
details also. All right. So that's an algorithm. Any questions about it?
Okay. Another question: how well does it do? This is just to roughly tell
you, this is the type of recursion you're seeing, which is in fact capturing
the essence that says, well, I have a high score if I have defeated lots of
people, or a high score if I play only with one, but that one person had a
very high score. A heavyweight championship, right. The winner only comes into
play at the end. And while you're trying to build your score up, you will play
a lot of times.
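The relation described here can be read off the stationarity condition of the walk. A sketch, with $\pi$ the stationary distribution and $P_{ij}$ the probability of stepping from $i$ to $j$ (symbols assumed for illustration):

```latex
\[
  \pi_i \sum_{j \neq i} P_{ij} \;=\; \sum_{j \neq i} \pi_j P_{ji}
  \quad\Longleftrightarrow\quad
  \pi_i \;=\; \frac{\sum_{j \neq i} \pi_j P_{ji}}{\sum_{j \neq i} P_{ij}} .
\]
```

Since $P_{ji}$ is large when $i$ has often defeated $j$, the score $\pi_i$ is high either when $i$ has beaten many opponents or when it has beaten a single opponent whose own score $\pi_j$ is high.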
All right. This has a close relation to Borda count. And this is an
iterative version of Borda count.
Now, if one considers this MNL model, which is a Thurstone-style model, what
this model says is that each node has some kind of TrueSkill or parameters
associated with it, W and W1. So ideally, one would like that ranking to be of
that order, and those parameters reflecting the scores. In this case one could
design the maximum likelihood estimator, and our algorithm matches the
performance there. This is in terms of simulations.
What this says is the other algorithms, after some time, stop learning well,
even though they're axiomatically supposed to be very good. More formally,
mathematically, here's how the result looks. If my graph of comparisons
is a random graph, then this is the standardized error, which scales with
effectively parameter k, the confidence that I have, that is, how many times
each pair has played with each other, and d, the degree of the graph. This is
how it scales down, and you cannot do better than that. This is a
fundamental lower bound, and this algorithm is effectively getting close to
And again, this is capturing that, well, if you have only so
many comparisons, you can learn it only so well. And this is capturing the fact
that if your graph has bounded degree, there's only so much you
can learn. Random graph here seems to suggest that, well, with random graph
structure, you can essentially get as good as the best algorithm can ever get.
And sometimes it is captured well, because if you had any [indiscernible]
graph, you didn't have a choice. If I had a choice, maybe I would use random
graph. If I did not have a choice, if I had an arbitrary graph, then it's the
Laplacian of that graph will play an important role in this case. In
particular, it will show up like this.
So if you have a graph which is not well connected in terms of this gap of
Laplacian being small, then it would blow up. The error would be very high.
So if I had a line graph, I connecting to him, he connecting to him and so on,
that would lead to very poor performance.
But if it's a well connected graph, then it will be very good. Okay. And this
is also related to a random walk, a natural random walk on the original graph
and its Laplacian.
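The contrast between a line graph and a well connected graph can be checked numerically. The sketch below assumes the plain combinatorial Laplacian L = D - A and measures its second-smallest eigenvalue; the bound in the talk may be stated for a normalized variant, so treat the numbers as qualitative. The eigenvalue is found by power iteration on a shifted matrix after removing the all-ones direction, an approach chosen here only to avoid external libraries.

```python
import math

def path_graph(n):
    """Adjacency matrix of a line graph: i is linked only to i - 1 and i + 1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

def complete_graph(n):
    """Adjacency matrix where every pair of distinct nodes is linked."""
    return [[1 if i != j else 0 for j in range(n)] for i in range(n)]

def apply_laplacian(adj, deg, x):
    """Return L x for L = D - A."""
    n = len(adj)
    return [deg[i] * x[i] - sum(adj[i][j] * x[j] for j in range(n))
            for i in range(n)]

def laplacian_gap(adj, iterations=3000):
    """Second-smallest eigenvalue of L = D - A for a connected graph."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    shift = 2 * max(deg)  # no eigenvalue of L exceeds twice the max degree
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the all-ones eigenvector
        lx = apply_laplacian(adj, deg, x)
        y = [shift * a - b for a, b in zip(x, lx)]  # y = (shift * I - L) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient of the unit vector x.
    lx = apply_laplacian(adj, deg, x)
    return sum(a * b for a, b in zip(x, lx))

gap_path = laplacian_gap(path_graph(10))
gap_complete = laplacian_gap(complete_graph(10))
```

For ten nodes the line graph has a gap roughly a hundred times smaller than the complete graph, which is the sense in which a chain of comparisons gives poor error guarantees.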
So the take-away message here is that if I were to design a graph, I would
choose a graph subject to constraints. But if it's [indiscernible] system,
there will be conflicts. I mean, if I have a conflict with somebody else, I
cannot be assigned that paper. But subject to those constraints, I would like
to choose a graph so that delta is maximized. The spectral gap is maximized.
And then if I wanted to maximize that actually as the thing that sort of in
these pieces that are shown, it's a nice [indiscernible] optimization problem.
And because of that, it could be solved reasonably well. Question?
>>: [indiscernible] based on answers you obtain in earlier rounds?
>> Devavrat Shah: Excellent point. Let us suppose that we allow our
[indiscernible] graph a priori, and then you choose your [indiscernible],
because this information theoretic lower bound applies to that case also.
Maybe the best gain you can get is [indiscernible] factor. And I believe log
factor is necessary. It's just we can't prove it.
So my sense would be sure, maybe you might be able to improve it, but only up
to constant factor.
Okay. So that was a quick run-through of one type of question. That
is ranking. The second question is related to crowd sourcing. Again, ranking
could be thought of as crowd sourcing because we are getting information from
people. And this is like the world of crowd sourcing. It's the Netflix Prize
or sending a man to the moon or micro tasking, which is five cents per
task. And I'm going to talk about this five cents one, because I can't talk
about such a big amount of money, all right.
So here is a quick motivation why one might want to do this. If you have a
biological lab, let's say, and you're coming up with all sorts of these
interesting images of experiment and you want somebody to count how many red
cells are there. If you hire an undergrad intern, he or she might deal with
the 300 images per hour diligently, and it will cost you something like
this. If you put it out on Mechanical Turk, maybe people will
quickly count things. You get slightly noisy answers, but you will get a lot
more done.
So the issue would be high versus low reliability. This is essentially the
type of experiment that was done by Susan Holmes [indiscernible] at Stanford.
What you want to do is you want to bring this reliability high but keep this
number high too. And, frankly, this is the type of thing you are aiming for.
So more out of your money.
Okay. And question is that how do we do that? Well, we know that one way to
do that is to have structured [indiscernible] built in. So while we will have
noisy answers coming in, we will be able to denoise it if we have structured
[indiscernible]. And that's sort of what we're trying to do.
So here's a quick example just to set everything, problem and notations right.
The example was related -- actually, these images are the type of thing
happened in 2008, a plane crashed in Nevada, and people are looking for where
the plane is and, of course, image processing is only so much advanced so you
wanted humans to do it.
They released on the order of 50,000-plus images and lots of people
volunteered. And then people started looking at images. So let's say I see
these three images and I say looks like there's a plane. May be a little
noisy, but there's a plane here. There's no plane debris. There's no plane
debris here. Somebody else looks at some other images and gives the answers
there. And so on.
So you get different people looking at different subsets of images. You get their
answers on that whether plane debris is there or not. Finally, you decide
these places might be plane debris. Let me send people up to look at it.
Of course, if you look at this, you will say okay, no plane is there. It's
very high likelihood that a plane is there. But then things like this and
this, you don't know. So what you want to do is you want to sort of build a
confidence somehow from the answers that you got from things, which ones are
more likely and which ones are less likely.
If I knew who is a person who is giving me answers and how truthful or how not
truthful that person is, it would be a really easy problem, because I bias a
person's answers that way and then aggregate things. Problem is I don't know.
A standard Mechanical Turk platform, I put out my task, people take on the task
and they answer. And that's that. Really, I'm not really learning about them.
I don't have a choice about them. I do get some kind of information from the
platform about how they have performed in the past, but that's only so
much limited. The question is how am I going to integrate this thing in a
meaningful way?
So again, I want to really solve this problem. Label estimation with minimum
cost. Cost is just number of edges in this kind of bipartite graph, because
each cost is like -- each edge is like one person performing this task.
And operational question I want to answer are task assignments. And once you
have answers, how to infer the best answer. Again, it's measuring the same
sets of questions that we had seen before, correct. How am I going to -- who's
going to compare which things and then once I've got comparison, how am I going
to infer answers. Similarly here, how am I going to allocate tasks to different
people. And once I have answers to task, how am I going to infer them.
Again, need to tell you [indiscernible] model is to build the algorithm and
then understand it. And here's very simplistic model. In the kind of example
I give you, the task will be binary, lots of plus 1s and minus 1s. You might
have K-ary tasks. In case of images you will have seven cells or ten cells or
20 cells or so on. And each person has some kind of latent reliability as per
which the person will answer.
In this case, you say this person has probability half. So it's random. With
probability half, that person will answer correctly or incorrectly. And that's
how I'll see the answers. This person is completely correct and truthful. So
all the answers are given correctly. And I would assume that I've got
reasonable positive bias, because if I did not have that, then I would not be
able to differentiate all the pluses from all the minuses.
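The worker model described above can be sketched in a few lines. This is an illustrative simulation, not the speaker's code. The names `simulate_answers`, `true_labels`, and `reliabilities` are assumptions, and each worker answers a binary task correctly with probability equal to that worker's latent reliability.

```python
import random

def simulate_answers(true_labels, reliabilities, assignment, rng):
    """Return a dict mapping (task, worker) to the worker's +1/-1 answer.

    true_labels:   list of +1/-1 ground-truth labels, one per task
    reliabilities: list of probabilities of answering correctly, one per worker
    assignment:    iterable of (task, worker) pairs
    rng:           a random.Random instance
    """
    answers = {}
    for task, worker in assignment:
        # With probability p the worker reports the truth, otherwise the opposite.
        correct = rng.random() < reliabilities[worker]
        answers[(task, worker)] = true_labels[task] if correct else -true_labels[task]
    return answers

true_labels = [+1, -1, +1]
reliabilities = [0.5, 1.0]  # worker 0 answers at random, worker 1 is always truthful
assignment = [(t, w) for t in range(3) for w in range(2)]
print(simulate_answers(true_labels, reliabilities, assignment, random.Random(0)))
```

A worker with reliability 0.5 carries no information, which is why the talk assumes a positive bias on average.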
Okay. So here's with that probabilistic model, here's a quick preview of
results. In that probabilistic model, this is a simulation. You will be able
to -- this is the best performance you will be able to do. This is the amount
of redundancy and the reliability. Higher the reliability, of course, higher
the amount of redundancy you need, and this is the best rate you can achieve.
This log versus linear scale suggests that error probability with reliability
goes down exponentially.
This is what you would get for majority voting. That is, you look at answers
and look at the majority answer, which is the natural thing to do. This is
what a popular inference algorithm called expectation maximization will do, and
this is what our algorithm will do. And as you can see, there is a similar
slope, but a little offset.
So something interesting is happening there. Looking at one way that I want to
achieve, let's say, 90 percent accuracy or 10 percent error in this simulation
our approach would require amount of redundancy which is eight, versus, let's
say, existing algorithm would require 12, versus majority, which is 17. And if
you're really investing money in it, this is the factor loss you are incurring
or gain you are incurring.
So really, good inference is very useful. It goes a long way. Now, I will
tell you about answers, right, because I told you the model. I gave you the
results in terms of a graph. Now let me tell you the algorithm for task
assignment and the inference. Like before, the best task assignment would be
random regular graph. There, it was [indiscernible] graph as per choosing
comparisons. Here, random regular graph, which is saying that if I have a
budget that each task should be assigned to L things and each person can
perform at most R tasks, then subject to those constraints, I will choose the
random graph.
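One common way to draw such a random regular assignment is the configuration model: give each task L stubs and each worker R stubs, then match them at random. The sketch below is an assumption about how one might implement it, not the speaker's code. It gives every worker exactly R tasks rather than at most R, and it can occasionally pair the same task and worker twice. The name `random_regular_assignment` is hypothetical.

```python
import random

def random_regular_assignment(n_tasks, l, r, rng):
    """Assign each task to l worker slots and each worker to r task slots.

    Returns (edges, n_workers), where edges is a list of (task, worker) pairs.
    """
    if (n_tasks * l) % r != 0:
        raise ValueError("n_tasks * l must be divisible by r")
    n_workers = n_tasks * l // r
    # One stub per edge endpoint on each side of the bipartite graph.
    task_stubs = [t for t in range(n_tasks) for _ in range(l)]
    worker_stubs = [w for w in range(n_workers) for _ in range(r)]
    rng.shuffle(worker_stubs)  # random matching of task stubs to worker stubs
    return list(zip(task_stubs, worker_stubs)), n_workers

edges, n_workers = random_regular_assignment(n_tasks=6, l=3, r=2, rng=random.Random(0))
print(n_workers, edges)
```

The number of edges, n_tasks times L, is the cost the talk refers to, since each edge is one person performing one task.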
Okay. And the inference algorithm would be like this. So let's just build
intuition towards algorithm. So if I have a task like that, which is plus
minus minus, majority voting would say minus, and that's it. But, well, if I
knew how trustworthy these people were, then I would like to incorporate
that information into my answers, and an oracle who knew this
would just add the log likelihood to that, okay.
And, of course if everybody's equally trusted, then best answer would be equal
to majority voting. It sort of makes sense too. But, of course, we won't have
that. There will be uncertainty, so we would like to understand if we can
learn these weights.
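The difference between majority voting and the oracle can be made concrete with the plus, minus, minus example. This sketch assumes the oracle weights each answer by the log likelihood ratio log(p / (1 - p)) of the worker's reliability p, with p strictly between 0 and 1. The function names and the reliability values are illustrative, and ties are broken toward +1.

```python
import math

def majority_vote(answers):
    """answers: list of +1/-1 values for one task."""
    return 1 if sum(answers) >= 0 else -1

def oracle_vote(answers, reliabilities):
    """Weight each answer by log(p / (1 - p)), where p is the known reliability."""
    score = sum(a * math.log(p / (1.0 - p)) for a, p in zip(answers, reliabilities))
    return 1 if score >= 0 else -1

answers = [+1, -1, -1]
print(majority_vote(answers))                    # unweighted vote
print(oracle_vote(answers, [0.95, 0.6, 0.6]))    # one very reliable worker
print(oracle_vote(answers, [0.7, 0.7, 0.7]))     # everybody equally trusted
```

When one worker is far more trustworthy than the other two, the weighted vote can overturn the majority. With equal reliabilities the two rules agree.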
Now I don't know the weights. I know only the answers. If I know that, well,
these answers are given by him or almost all of them are correct, then I should
give him very high P. Now how do I know that? Maybe his answers agree with
other person. So somehow, I want to sort of stitch this intuition together.
One way to do that is to do iteration. Here is what iterative algorithm will
do. It will reliably learn this estimate for these log likelihoods and the way
it will do it is as follows.
This is as natural as it gets, right. Let's start with giving everybody
this equal likelihood. Everybody is equally weighted, say, one. Now I'm going
to assign likelihood for a given task and initially, these are just one so I'm
just going to sum up all the answers that I've got. Here I can sum up because
they're plus and minus one. And in the end, I will get the answer that, let's
say there are seven people who have answered this out of which six of them have
answered plus one and one of minus one. So my likelihood of being plus is plus
five, which is pretty strong.
Now, okay, so I got this kind of likelihood for all tasks. Now I go and try to
assign the reliability for each worker. Well, for different tasks, there are
different likelihoods that I'll obtain from other tasks. Now I want to look at
my answer to this task and see does it compare well with the reliability I've
obtained. If this is plus one and my answer is also plus one, that's good,
because I'm matching. If it's plus one, my answer is minus one, that's really
detrimental because I'm going against what everybody believes is true.
Okay. So I just sum that up and I iterate this. So if I did not exclude this
kind of in my iteration, answers coming from me from previous tasks, then it
would be like a power iteration of this matrix A transpose. So I'm just
excluding it because it's actually very important for information aggregation,
and when we do that, that's when the best performance comes out. Before I said
I'll give you the precise theorem
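The back-and-forth update described above can be written as a short message-passing loop. This is a minimal sketch under stated assumptions, not the speaker's implementation. Worker weights start at one, task-to-worker and worker-to-task messages are plain sums of signed answers, and each message leaves out the contribution of the edge it is sent along. The name `iterative_inference` and the small example are hypothetical.

```python
from collections import defaultdict

def iterative_inference(answers, num_iters=10):
    """answers: dict mapping (task, worker) to +1/-1. Returns dict task -> +1/-1."""
    workers_of = defaultdict(list)
    tasks_of = defaultdict(list)
    for task, worker in answers:
        workers_of[task].append(worker)
        tasks_of[worker].append(task)

    # y[(task, worker)]: worker's reliability message toward that task, all ones at start.
    y = {edge: 1.0 for edge in answers}
    for _ in range(num_iters):
        # Task update: weighted sum of answers, excluding the receiving worker.
        x = {}
        for task, ws in workers_of.items():
            total = sum(answers[(task, w)] * y[(task, w)] for w in ws)
            for w in ws:
                x[(task, w)] = total - answers[(task, w)] * y[(task, w)]
        # Worker update: agreement with task likelihoods, excluding the receiving task.
        new_y = {}
        for worker, ts in tasks_of.items():
            total = sum(answers[(t, worker)] * x[(t, worker)] for t in ts)
            for t in ts:
                new_y[(t, worker)] = total - answers[(t, worker)] * x[(t, worker)]
        y = new_y

    estimates = {}
    for task, ws in workers_of.items():
        score = sum(answers[(task, w)] * y[(task, w)] for w in ws)
        estimates[task] = 1 if score >= 0 else -1
    return estimates

# All true labels are +1. Workers "g1".."g3" always answer +1, "b1" and "b2" always -1.
answers = {(0, "g1"): 1, (0, "b1"): -1, (0, "b2"): -1,
           (1, "g1"): 1, (1, "g2"): 1, (1, "g3"): 1, (1, "b1"): -1,
           (2, "g1"): 1, (2, "g2"): 1, (2, "g3"): 1, (2, "b2"): -1}
print(iterative_inference(answers, num_iters=0))  # zero iterations is majority voting
print(iterative_inference(answers, num_iters=10))
```

With zero iterations the weights stay at one, so the output is plain majority voting. In this small example majority voting gets task 0 wrong, while the iterated version learns from tasks 1 and 2 that the two dissenting workers are consistently wrong and corrects it.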
We have the room until noon, but it's audience attention.
>> Devavrat Shah: Okay. I'll end in five minutes and I've got more pictures
now. So we thought, well, we've got simulations, we've got theorem results in
a second. What about real world? I mean, maybe this is meaningful, maybe this
is not meaningful. So we thought we'll do experiments and this is a great
place where you can do experiments, because I can sort of load up my tasks on
Mechanical Turk and I can run experiments.
First, we thought, well, maybe we should do something like this.
Which of these ties are similar. But then similarities is in mind, right. So
it's very hard to make it objective. So we said, well, what about things like
this. Which tie goes well with this shirt. Again, this is all subjective. So
subjective things are very hard to evaluate. Finally, we ended up with this
thing. That is, which color is similar. And that is because there are these
metrics that exist that actually do cognitive similarity metric. And that
seemed to work extremely well, actually.
So we showed people these kind of colors, randomly generated and over them we
did all sorts of experiments and finally here is the type of performance we
see. Iterative algorithm starts doing much better after some threshold, and
there's a reason. And the threshold is effective before I tell you the
theorem. Say like if information is too noisy, iterating actually increases
the noise.
Okay. But if you're in low noise; that is where you can actually do correct,
then iteration actually helps. And that's what comes out in terms of theorem,
this kind of qualitative result, and that's also what we saw in the experiment.
This is just one instance of that, but this is what we see on all sorts of
data, on data we collected.
There's a team at MIT which primarily designs crowd sourcing interfaces led by
Rob Miller and his colleagues. And on all of their data, also similar
performance looks. Yes?
>>: It's not a matter of initial condition or anything like that?
>> Devavrat Shah:
>>: So after it converges -- so you could initialize it with majority voting
or something, and it's still going to --
>> Devavrat Shah: Yes. So you can start with majority voting and it will
become worse. Again, it's because in model, you can prove it why it happens.
In reality, that's what we observed too.
And sort of it makes sense because of this reason. And then is random graph
really useful? Again, there are theorems about them. But in practice also,
you can see that performance with graphs with small spectral gap becomes worse.
All right. So this is some parameters you can assign which we called quality
of crowd. It's effectively just a quadratic norm of the latent things and that
is precisely what determines the performance. This is precise theorem. Let's
just look at this one. It says that suppose I want to obtain an error of
reliability one minus epsilon. How much amount of redundancy do I need? I
need redundancy that scales over that parameter as 1 over Q. No matter what
algorithm you use, you need this much. And if you use majority, it would be
quadratically off because of this exponent being Q square. And Q is usually
small, right. So one over Q square is really bad, versus one over Q.
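To see how large the gap between 1 over Q and 1 over Q square can be, here is a small numeric sketch. It assumes the quality of the crowd is the average of (2p - 1) squared over the workers' reliabilities p, which is one reading of "quadratic norm of the latent things". That definition and the function name `crowd_quality` are assumptions, and constants and any dependence on epsilon are ignored.

```python
def crowd_quality(reliabilities):
    """Average of (2p - 1)^2 over workers; 0 for pure guessers, 1 for perfect workers."""
    return sum((2.0 * p - 1.0) ** 2 for p in reliabilities) / len(reliabilities)

# Half the crowd guesses at random, half is always correct.
q = crowd_quality([0.5, 0.5, 1.0, 1.0])
best_order = 1.0 / q            # scaling of redundancy for the best algorithm
majority_order = 1.0 / q ** 2   # scaling of redundancy for majority voting
print(q, best_order, majority_order)
```

As Q gets smaller, the ratio between the two scalings grows like 1 over Q, which is why majority voting becomes expensive for a noisy crowd.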
Now, again, this is another place where you can ask a question. What about
adaptive. Does it help. For example, I have answers like this. These things
are well understood. Maybe I should only focus my energy on things like this
and this. Surprisingly, it does not and what this says is that it only
improves up to constant factor. There is again, in this case, also adaptation
does not help. So in both of these class of problems there's a question of
what sorts of graph structure you need for task assignment, what sorts of
inference algorithm you need, and what sorts of qualitative and quantitative
results come out.
Makes sense that you need graph which is reasonably well connected. Simple
iterative algorithms do very well. And they get you as good as a performance
as you want, and adaptivity is not much of use.
Okay. Now I think with that, here's a lot more information you can get out of
these results. Like you have a bunch of crowds, which one I should employ
depending on their quality and amount of money that they're asking me. I can
calculate that and decide which one too. So these are very useful
So that brings me to the end of my talk, and I think it's 11:30. It's time to
end. So really, there's a lot of data we have, it's a great opportunity, but
to realize that we need to process it at scale. And in these examples, what I
showed you is this: start with thinking about a reasonable model, come up with
right algorithm, right algorithm helps you solve the problem well. The model
helps you understand mathematically why things are -- why the algorithms are
useful. But the algorithms are model independent, so they are useful just on
their own.
Okay. And I didn't show you this, right. I should show you that. The
question is can I predict trend on Twitter before it becomes trending. So here
is quick review. I should stop. But I should show you this. So that's Miss
Rhode Island, who became Miss USA this June. So naturally, what would you
expect when she becomes Miss USA? Things would trend on Twitter. And so
that's the perfect time to start predicting whether it becomes trending or not.
So here is the real signal in terms of volume of tweets that are happening. So
for various reasons, through MIT we had access to the firehose of Twitter. So
this was a real signal we were tracking then. And it becomes -- Twitter
announces it's trending at time that's zero. We had our estimator running at
that same time. And our estimator said this would become trending at time
minus two hours. And this is -- this happened in this particular case. But
this is not atypical. This is very typical. In particular, this is the
ROC curve over a large number of samples that we've done. Basically, the point
to take away here is that 95 percent of the time, we can predict something
trending correctly before it becomes trending. Four percent of the time, we
make an error, because, you know, you make an error.
And when we are ahead on average, we are ahead by one hour, 40 minutes. And so
that example was not atypical. All right. I think this is where I should
>>: Here you're not predicting the future. You're predicting what will become
trend on Twitter?
>> Devavrat Shah:
>>: For example, Google did this work on flu symptoms. They are trying to
predict flu in the future and they figure out if there are searches for flu
symptoms, and they can pin them down geographically, then probably there will
be a flu breakout in this area at some, you know, close point in time.
>> Devavrat Shah: Excellent point. So sometimes there the point was that flu
searches are getting out information about something that's going to happen.
It will be recorded more massively on a public scale. If you, based on what we
observe with all sorts of these -- these are, in fact, time series. These time
series do have a very simple structure, and for many of them, actually, the
information about effective, the information about them becoming popular is
already there. It's just Twitter is doing it on volume-based. So it takes a
while for them to announce it.
But if you are just a little clever about it, then actually you can get in, do
that prediction. Of course, you might be wrong sometimes, but looks like you
might not be wrong too often. It's a great analogy, actually, yeah. Yes,
>>: So in the second half of your talk wasn't anything you were talking about.
It was extracting information of the reliability of the people giving you
>> Devavrat Shah:
>>: In the first half of your talk, one example you talked about was
conference reviewing. We've got individual reviewers making comparisons.
Have you thought about combining those, because some reviewers are going to be
better than other reviewers, assessing the compared quality of papers.
>> Devavrat Shah: So what you're asking, putting it the other way, is that
there's a choice model. See, one way to think of it is there's a crowd that I
have, and the crowd is modeled by one distribution over permutations. In fact,
going back to Rico's last question is there are people who are reliable, which
means that there's a well separated distribution over permutations and there
are people who are not reliable, which is sort of mixed distribution over
The question is how can I put them together. One way to go about it is to
think of answers coming from multiple choice models and somehow combine them.
And that's something that we're trying to do right now.
So I don't have any meaningful answer. I have some conjecture I can tell you
about. Yes?
>>: First part, there was a stationary distribution. Why is that a meaningful
answer to this question?
>> Devavrat Shah: Okay. So at some level here's what's happening in both
cases. You've got some signal that you want to learn. You're observing
aspects of the signal through some of these, let's call it random matrices. And if
you look at some form of a [indiscernible] approximation of these random
matrices, they turn out to be closely related to the signal.
And in both cases, really what we are trying to do is through iterations, we're
trying to learn some form of rank one approximation. In the first case, the
rank one approximation turns out to be the stationary distribution. In the
second case, that turns out to be the approximation of that chopped-off matrix.
>>: In real life, people talk to each other and make sort of, let's say people
talk to each other in pairs and they make some kind of comparison, and they
talk, you know, I'll talk to you and then you talk. Is there any way to study
how these decisions can be made in a distributed way among people?
>> Devavrat Shah: Great. So there are two things. There is a dynamics part
and there is a decision making eventually. Let's suppose that one way, I mean,
one ideal way I would model people's behavior. I mean, this is, I don't know
how meaningful it is, but it's still useful to think in the ideal world is that
everybody has a choice model of their own that's in their mind. Implicitly or
explicitly, over the objects of interest. And every time I interact with
somebody else, that information changes my choice model and your choice model.
So over time, it's evolving.
So while we interact, it evolves. And also, at
the same time, this whole evolution could, in principle, lead to some kind of
global decision making, if you're extracting those informations out. How to
think about that meaningful way, I don't know.
But it seems like maybe reasonable way to go about.
It's too messy.
It's too complicated.
>>: First, I want to say I enjoyed the talk very much.
>> Devavrat Shah: Thank you very much.
>>: Basically, [indiscernible] the fact you're bringing this [indiscernible]
models into practical systems. It's very interesting for the last part of the
work. Now, for the first part, where you tried to create a partial order of
[indiscernible] in the basically space study, one thing I basically noticed is
in practical world, many times there's more than one orders, because of
So in that case, maybe there is something like a context, for the restaurant
model, basically. I mean, not everyone had the same preference. And that may
be cause of background. [indiscernible] of Chinese versus basically U.S.,
maybe it's a different partial orders you have and very few like spiciness, not
like spiciness. Like wine, you do not drink wine. You may have different
However, this hidden [indiscernible], is it possible to be applied into the
order. Let's say a year, I mean, say any particular user when you basically
[indiscernible] is it possible to take these hidden context into consideration?
>> Devavrat Shah: Okay. So I think you bring up again a very interesting
point in which both of them have brought up. Is that thinking of entire world
as one choice model is not the right thing, because people are heterogeneous.
Now question is one way to go about dealing with this is saying well, I've got
mixture of these choice models is B1 fraction is this type, B2 is this type and
so on. How many types? And second is what are those Bs and third is what are
those associated choice models.
This is a hard question. Again, I have some conjectures and there's some
interesting things I can say about, but not with 100 percent confidence.
People have tried studying this kind of learning over distributions from
partial information, including myself. There are some sparse choice model
approximations that we know. But they're not, I think, practically useful, at
least in my mind.
So what people do in the world of, for example, revenue management, it's a
business school world, people put some kind of structured mixture of the
multinomial logit models and then try to learn those parameters related to that
structure. But again, that's ad hoc and it's [indiscernible] I do not think
about it.
So it's great sets of questions, which some of you should answer.
And --
We can talk.
>> Devavrat Shah:
>>: So I have been working on application really like what you did for first
part [indiscernible] first part and your second part. I want to get some
opinion from you. This application is how to evaluate stroke patients'
movement quality. And how to do that, we need to run all the stroke patients
movement. And how we can do that, we pick two of them and ask the therapist
who did this better.
>> Devavrat Shah:
>>: So then this is a pairwise comparison and we can compare all of the stroke
patient. The problem is, so because the time for the therapist is constant.
So we should ask the therapists these questions. So which is your second part.
There is the crowd sourcing. You have a lot of therapists and I have a lot of
stroke patients friends I want to ask as minimum as possible number of
therapists for this pairwise comparison.
>> Devavrat Shah: So again, I think I've been through this very quickly, but
again what you would like to do is you'd like to do exactly same thing for
first setting. That is, you want to maximize the -- there's a comparison graph
that you are creating, right? Effectively.
>> Devavrat Shah: And you want that comparison graph to have as large a
spectral gap as possible subject to your constraints. For example, there's a
therapist who cannot rank two patients. Then you can't ask that question. So
effectively, you've got this huge [indiscernible] matrix and you're going to
assign each therapist or one therapist, I don't know how your setting is, to
each one of the entries to ask them question. Question, how would you choose
those entries.
One option is you choose them at random as per [indiscernible] graph or random
regular graph. Another option is if you could do each -- you do structured
expander, for example. And whichever way you would do it, at least this result
would say that you're getting most amount of information out of it. I mean,
I'll be happy --
>>: On your second part, I want to ask as less time as possible.
>> Devavrat Shah: Yes, so this will also, again, it will say that if you
have -- this is roughly how your uncertainty would scale. So this is how many
times you're asking a given pair and this is how your structured graph would
look. So if this multiplication is large enough for your metric of interest,
then --
>>:
>> Devavrat Shah: Okay. If you have more questions, please send me an email.
I'll be happy to -- yeah.
>>: One quick one. When you were estimating the [indiscernible], I don't know
if you can bring the slide that does that, you had one slide where you had the
summations that related [indiscernible] to tees and tees to --
>> Devavrat Shah: The next one.
Yeah, you're right.
>> Devavrat Shah: Just something that -- yeah, exactly.
>>: So basically, you have a set of the, on the left side, you have an assumed
set of values and from that, you update the tees and then on the other side,
given the tees, you update the values?
>> Devavrat Shah: Now I'll go back and update tees.
So you keep going back and forth until it converges?
>> Devavrat Shah: Until it converges, actually, I would stop after some number
of iterations, which is -- so I will stop after K, which scales like
effectively order one over log Q. One over log Q.
>>: My question is that these things are kind of in the class of coordinate
descent algorithms where you're estimating such spaces and those algorithms have
very bad convergence properties, including that if some strong assumptions are
not satisfied, they converge to something that's not even a stationary point.
>> Devavrat Shah:
>>: Do you run into issues like that?
>> Devavrat Shah: No. It's a great point. Because effectively, if I want to
think about that in coordinate descent, this is unconstrained. So there is one
reason why it's not happening. If I want to think of in more classical linear
algebraic way, effectively I'm doing a power iteration. I'm trying to compute
the largest singular vector of a matrix. And these are very well conditioned
matrices. And there is a reason we are not running into that. It's an
excellent point, yeah.
>> Rico Malvar: Well, thank you very much.