>>: So let's get started. Rico and I are delighted to host Devavrat Shah today, taking advantage of his sabbatical on the west coast to invite him here. Professor Shah is an associate professor at MIT in the Department of Electrical Engineering and Computer Science. He's with the Laboratory for Information and Decision Systems and also the Operations Research Center. His interests are generally in large complex networks and the stochastics and algorithms that underlie them. He is an associate editor at the IEEE Transactions on Information Theory and also of Operations Research and Queueing Systems, and we'll hear what he has to say today. Thanks. >> Devavrat Shah: Thanks. And thanks, Rico, for making this possible. I really appreciate it. There's a bigger purpose behind the visit, and it relates to one of the things we would really like to do at LIDS, which is a research lab at MIT. Many of you might know how MIT works: there are departments and there are research labs, both are important, and everybody belongs to both, to at least one lab if not more. LIDS primarily plays the role of a bridge within EECS, so things like what I'm going to tell you about, which is a nice bridge between statistical inference and machine learning. There is the work people do in optimization, which by design sits between EE and CS, both continuous and discrete optimization, and control, robotics, computer vision, signal processing, and the list goes on. These are the types of things we do there, and we would really like to engage broadly. With that prelude, let me start telling you about the kind of stuff I do, and I'll be happy to talk to you about myself or about others if you are interested in our [indiscernible]. All right. So what am I going to talk about? I'm going to talk about a few questions in the context of processing social data.
These are concrete questions that we've looked at over the past three to four years. There are some nice solutions that we have, and I believe there is much more to be done here than just thinking about big data as building big systems. So with that in mind, what do I really mean by social data? It's data that is generated by us: all sorts of electronic transactions you have made, the restaurant rating you left on Yelp, the movie ratings you left on Netflix, the tweets you have sent out, your participation, whether as an employee or an employer, on Mechanical Turk, which is the crowdsourcing system, or your Facebook posts, your cookie data and so on, which I'm sure everybody uses all the time. Now, it's well understood that there is a tremendous amount of information here. It's like our gene is encoded in it. So if we can do something with it, maybe we can get some meaningful, useful information out of it. Here are a few options. Of course, we can do better business. In the operations world, in the business school, we talk about this as figuring out how to manage revenue. It's useful for pricing, and at Bing, for thinking about how to do advertising. Thinking a little more generally, thinking about myself, I would like good recommendations available so that I can go and eat at good places, for example, or watch the right movie. It would be useful for policy making, deciding whether to add a road or a school or not. And if Congress had access to what people like and dislike, that would be very useful. More generally, it could change the course of societies. At some level, crowdsourcing is a well understood example of that: it is changing the way labor exchanges and work are organized, right?
I mean, oDesk is one of those examples, where people can get quick employment that was not possible before. Mechanical Turk style micro crowdsourcing is helping connect people across the world in an interesting way. Things like news reports are useful to get quickly. So all of these are ways in which we can do something. What is the basic challenge? At least from my view, at the core of it, we've got a lot of data which is unstructured and highly noisy, and there's a lot of it. What I want to do is eventually make decisions from this data. In order to make these decisions, at some level, I need to solve a statistical challenge; that is, I need to understand what basic structure is there. And since there's a lot of data, I need to figure out how I'm going to do it in a computationally efficient manner that can scale with it. These kinds of questions have been observed at some scale across a variety of domains over the years, but now the problem has become really acute, given the scale and given the amount of uncertainty that we have. Now, this is a big challenge, and I won't be able to solve all of it. But what I will be able to do is tell you about a few concrete questions in which we have taken a two-pronged approach: thinking of the right statistical inference framework, along with simple algorithms, and trying to get meaningful answers. So effectively, I'm thinking of the data as generated from an appropriate statistical model. Once I understand that, I can think of making decisions by coming up with an optimal inference algorithm. And while that finishes the most important part conceptually, the catch is that optimal algorithms are not easy to implement, especially at scale. So what I would like to do is develop meaningful approximations that take me from data to decisions. Really, the model is only helping me think through it.
So at the end of the day, I will take the data, I will apply simple algorithms, and I will get some meaningful answers. These answers are interesting for two reasons. One is they solve the problem. And second, if I'm asked as an academic what these algorithms are really doing, I can use the statistical model to argue about those answers. Now, this is, at a high level, a very useful plan, but only if I can execute it. The proof is in the pudding, so I will show you three examples. One is in the context of decision making and recommendation; here the question is, we've got a lot of data that's telling us a little bit about people's choices, and how do we stitch it together? Second is a question about crowdsourcing. Most of you must know what it is, and if you haven't seen it, I'll explain precisely what it is. What I want to do there is build meaningful answers from the many small, noisy answers that I obtain from people, by stitching them together. Finally, there's a question related to understanding trending and advertising on Twitter. Now, I understand that the talk is going to last for 40 or more minutes, and this is too much to go through. So as we go along, the information content will decrease, but hopefully I will be able to convey what question I'm looking at and the algorithms that we're looking at. And feel free to stop me; the audience is small and interaction would be very useful. >> Rico Malvar: You have more than 40 minutes. >> Devavrat Shah: Okay. So with that broad layout in mind, let's start with one simple question. I should give credit to my co-authors here. This has been part of a longer program that has been going on for four to five years. It started with my former student, [indiscernible] Jagabathula, who is now at NYU, and with [indiscernible] Farias at the Sloan School of Management; then a student, Ammar, who is at LIDS, my post-doc, Sahand Negahban, and [indiscernible] Oh, who is now at Urbana-Champaign. This is the first part of the talk.
The second part is with [indiscernible] and David Karger. David is a colleague of mine on the computer science side of the department. And the last one is with my former student, [indiscernible], who is now at MIT as a faculty member, and a student who spent six months with me and, because of this work, is now at Twitter. I wish he had remained with me; when we get there, you'll see why. Okay. So the first part is recommendation. At a high level, the question is something like this. I've got lots of partial preference information from various sets of people: somebody telling me I like this restaurant so much, somebody telling me I like this movie so much, and so on. From that, what I really want to do is somehow put this partial information together, stitch it, and provide some kind of global ranking. At the end of the day, it's not just the ranking; the intensity also matters. So here are some scenarios. Let's say I've got a bunch of movie watchers, somebody like yourself telling me that you really liked Inside Job. Based on what other people have told me and what you have liked, I might suggest another movie that you might want to watch. Or there might be hiring decisions that you must be making all the time: a candidate is interviewed, and different people give different scores. Maybe Rico gives eight out of ten, [indiscernible] gives nine out of ten, Phil gives seven out of ten. And at the end of it, you decide; maybe you don't hire this candidate as the Microsoft CEO, but somebody else is hired. So it's a decision-making question. Now, these kinds of questions show up everywhere. Microsoft TrueSkill: if you have played and want to call yourself truly skilled, then you will have a score. People are playing games, but not everybody is playing with everybody. Only subsets of people are playing with each other, and based on that, I want to assign scores to everybody.
Recommendation we just went through. As academics, we think about this all the time. We submit papers to conferences, and then conferences have only ten or 15 or 25 papers that can be accepted out of 200 or 500 or a thousand, and the question is which ones to accept. A similar question shows up for us in graduate admissions; every winter, we have students to admit. And I'm sure you have intern problems of a similar type: which interns to hire and which not. Okay. So in all of these questions, at the end of it, there are two types of questions one wants to answer. One is, how should I get input from people, if I can design that? In conferences or admissions or hiring, I can tell people to give me input in some form, five stars or this or that. That's a design question. But then there's the question of working with whatever input I have. For example, in games, somebody wins over somebody else; or if you are playing a cricket match for five days, it can also lead to draws, but disregarding the draws, you've got pairwise win results coming out. Now, there are all sorts of heterogeneous ways in which data is coming in. How would you look at it through one lens and stitch it together to get an answer? So it's really two questions. One, what should I do to design it, if I had the choice? And two, if I didn't have a choice, and I've got all sorts of partial data, partial preferences, coming in, how do I stitch them together? Okay. So let's look at some of the popular approaches. One would be like/dislike, or star ratings. Like/dislike is easy to input, basically whether you like it or not. Star ratings are a little more complicated, because what does four stars mean? But either way, these lead to very simple aggregation problems. Once I have the input, I will average the number of likes, or take the total. For example, let's say I've got one like here, one like here and one dislike: so plus two minus one, then plus one, and I just sort that out.
And similarly, I can do the same thing for stars once I've got the input. So it's easy to aggregate. The problem is that this is an arbitrary scale, because I don't know what four stars means; it could be mood dependent. And at the end of the day, these scales are coarse, right? As it happens in our MIT admission system, once we finish the first round, we are left with roughly half of the students, all of them with four stars. So now what to do? Well, we have three sets of day-long meetings in which we actually talk to each other and fight it out. Maybe there should be a little better way of doing that. And that's really the issue of coarseness of the scale. As Nietzsche said, there is something beyond good and evil; we should think beyond this. Now, the answer, I think, lies in a simple game, and I think this is the right time for me to entertain you as well; it's morning, it's before 11:00. So let's see. I give you this blue color and ask you, tell me, how blue is it? Don't worry, I've given you the color code too. It's like when I go to see an orthopedist with my back pain, and he starts with, how bad is your pain? Well, if I didn't have pain, I wouldn't show up; I have better things to do. But a good optometrist asks me the right question: is this vision better, or is that vision better, right? It's basically about comparisons. And in this case, I might say that the answer is, this one is more blue than that one. So really, the answer lies in comparisons. And whichever way you look at it, it works. In sports, with wins and losses: if I'm a cricket fan, my inclination is this way, so when India beats Australia, I record the comparison that India is preferred over Australia. Or take these two restaurants that some of you might recognize, two really nice French restaurants in Seattle. Apparently, this one has better reviews as far as people's writing goes; that's what I found.
So if you haven't tried either, maybe the suggestion is to try this one before that one. Or suppose I was writing a paper about ranking and there was my paper versus another paper; then definitely, even though reviewers might have come up with ratings as scores, I would convert them into comparisons. The bottom line is, whichever way you provide me partial preferences, I can view all of them as bags, or bunches, of pairwise comparisons, okay? In the process, one might say, well, aren't you losing precision here? Nine versus five and eight versus five carry more information than just a comparison. Yes, you're right, you are losing that information, but it's not clear if that information is really absolute and meaningful. So it's a debatable topic. But definitely, what I've got from comparisons is absolute information. So the question boils down to the following situation. I've got -- >>: Can I ask a question? Beyond the question of losing information, there's the question of [indiscernible], because if it was six versus five and you say greater, it becomes as important as nine versus five. Whereas six versus five might not even be statistically significant. >> Devavrat Shah: Excellent point. >>: [indiscernible]. >>: You diminish noise. And in some sense -- >>: No, because I'm saying that when the right conclusion might be that they are equal, you're now assigning it's bigger; something that was nine to five gets the same bigger as the six to five. >> Devavrat Shah: So in some sense, what you're saying is that -- >>: It's all within the loss of precision. >> Devavrat Shah: Loss of precision. And if I had more comparisons between two things, then I should put more confidence on A versus B, rather than treating one answer as final once and for all. And in some sense, that's exactly what we try to answer using the model that we will build in, okay? That's a great question, yeah. All right.
So at the end of it, we are left with this kind of setting, right? I've got a bunch of objects. There are edges between them representing that they've been compared by one or more people, let's say. Here, A12, for example, reflects that when one and two are compared, when one and two played games with each other, out of this many plus this many games, this many times one defeated two and this many times two defeated one. So effectively, I've got this kind of nice weighted graph. And given this, from these comparisons, I want to assign a ranking or, more specifically, scores to each of the objects that are meaningful based on these observations. In some cases, I will just be handed this kind of data. In other cases, I might even have a choice in designing the graph, that is, which pairs to compare and which pairs not to compare. If I were designing a conference paper reviewing system, and I assigned Lynn, let's say, papers four and three, then he actually compared four and three, depending on what scores he assigned. So I could decide who gets what. These are really the two questions: in some cases the design is possible, in some cases you're left with only the given data, and we would like to answer both. Now, in order to answer these questions, I have to give you an algorithm, and before I give an algorithm, at least so that I can think concretely, I would like to think of a statistical model. So the first thing I will do is tell you about a statistical model. The model I'm going to put in the background is that there is an underlying distribution over permutations that is the ground truth, and the pairwise observations are coming out of this. So let's see what I mean by that. Here's a simple caricature. Suppose I've got three objects, A, B and C, and I've got four data points.
A bigger than B, B bigger than C and so on. I'm thinking that, in the background, the observation A greater than B possibly comes from A greater than B greater than C as a permutation over all objects; B greater than C comes from B greater than C greater than A as a permutation over all objects; and so on. Now, what are these permutations representing? The permutations are representing a choice model, a choice model of the population. I'm thinking that he has some ordering of all papers in mind, and I asked him about only a subset of them, and for that subset he revealed the order. When I talk about two restaurants, I inherently have in my mind, in the restaurant case, not just one ordering but a bunch of orderings, because some days I might prefer Chinese over Mexican and some days I might prefer Mexican over Chinese. It's a question of over what fraction of days I prefer this over that versus that over this. And that is effectively what this choice model, as one would call it, or distribution over permutations of the objects, is capturing. That's the ground truth, and in these data points I'm really getting snippets of that ground truth. Now, suppose from this data I learn a ground truth that is consistent with it. In this case, let's say, here is one such consistent ground truth: 75 percent of the population believes this ordering and 25 percent believes that ordering. Then maybe this might be a reasonable answer. Okay? Again, this is all a caricature, so the question is how we execute in this context, but this is roughly the plan. Any questions? Yes, please. >>: Suppose you could have some features for each node, and those features are ordered, and those features may happen to make better [indiscernible]? >> Devavrat Shah: So, for example, by features, what do you mean by that?
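The caricature above, where a 75/25 mixture over two full orderings generates the pairwise snippets, can be made concrete with a small sketch (Python; the function name and the exact split are illustrative, not from the talk): given a distribution over permutations, compute the probability that one object is preferred over another.

```python
from itertools import combinations

def pairwise_marginals(choice_model):
    """P(x preferred over y) under a distribution over rankings.

    `choice_model` maps a ranking (tuple, best first) to its
    probability mass; the marginal for (x, y) sums the mass of all
    rankings that place x ahead of y.
    """
    items = next(iter(choice_model))
    probs = {}
    for x, y in combinations(sorted(items), 2):
        p = sum(w for ranking, w in choice_model.items()
                if ranking.index(x) < ranking.index(y))
        probs[(x, y)] = p
    return probs

# Illustrative split from the talk's caricature.
model = {("A", "B", "C"): 0.75, ("B", "C", "A"): 0.25}
print(pairwise_marginals(model))
# {('A', 'B'): 0.75, ('A', 'C'): 0.75, ('B', 'C'): 1.0}
```

This is the forward direction; the inference problem in the talk goes the other way, from observed pairwise snippets back to a consistent choice model.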
>>: For example, let's say for the [indiscernible], some movie is 90 minutes, one movie is 120 minutes, and maybe some other has, you know, much action within that movie, something like that. So if you have a feature for each node, can this feature help you do a better job in the ranking? >> Devavrat Shah: Sure. Okay. So there are two ways to think of that feature. At some level, here I'm thinking of each movie as a separate node. But you could say, well, I don't want to learn everything in detail about each movie; I want to categorize the movies through some features. Then I will have fewer options over which I'm doing the ranking, and I will convert the data into that feature space, and that's how it will happen. At that level, I will have more aggregation, more batching of data, and so more confidence in some sense. But I will be losing some precision, because now I'm comparing two movies, one a super hit and one not, that have the same feature set. Yes, please? >>: Does your model take into account that, say, some pairs might have been rated several times? >> Devavrat Shah: Yes, of course. >>: And it treats those the same way, but with more confidence? >> Devavrat Shah: That is correct. I would like to design an algorithm that effectively takes that into account, which is related to the earlier question of [indiscernible]: if you have one pair compared once, which is like the six and five, and it's because of noise that one beat the other, then if they were compared many times, it would even out across the comparisons. So I would like to have an algorithm that does better the more information I have. And also, the algorithm shouldn't rely too much on one comparison; when there's only one comparison, it should bias your conclusion only so much. Excellent question. Yeah. But at some level, this is my ground truth, and I want to build guarantees through that kind of thing. >>: Can I ask a quick question.
What if the sampling process is done by people making [indiscernible] or sensors of any kind, and the samplers themselves have biases? What if all the samplers fall into, say, three categories: a bunch of samplers with positive [indiscernible], a bunch of neutral samplers, and a bunch of [indiscernible] samplers? That could actually skew your assumption that you have a ground truth distribution, right? >> Devavrat Shah: Right. If there are such categories of people, then for each one of them I would think of a choice model one, choice model two, choice model three; I would want some kind of hierarchical, clustered model. Here I'm doing this for one version of them. >>: It could be an extension of this thinking to add a bias to one, or a bias PDF. >> Devavrat Shah: That's exactly what we're trying to do right now. There are some conjectures we have, and I'm happy to discuss; yeah, excellent point. All right. So here's a very, very brief history. This is a great question; everybody has been fascinated by it, including myself. It goes centuries back, not just decades back. Here's one of the celebrated results, Arrow's impossibility result, where he said, well, suppose I don't have pairwise comparisons but complete rankings available, and I want to aggregate them together. Then how will I decide the winner? Say I've got three objects, A, B and C, and different people have given me their permutations. This is what people would call ranked elections. Now, in 1851, Thomas Hare, a British intellectual, came up with an algorithm called Hare's ranking, or proportional ranking. It's been used in all the commonwealth countries, and it's currently used in the American Psychological Association; this is how they elect their president. So if you want to elect a president, let's say there are four candidates running, you rank all of them, and at the end of it, an algorithm produces the winner.
Again, Arrow said, well, if you require that your ranking algorithm satisfies a certain set of reasonable properties, then no such ranking algorithm is possible. This is a very nice impossibility result, and it led to two decades of other impossibility results. Very recently, functional analysts have gotten into it, and they're saying, well, the impossibility result of Arrow is not just one counterexample; it's actually present in a very broad sense. So really, this is very hard, and in some sense we are making it harder, because now I'm giving you just pairwise comparisons. So in an axiomatic sense, there's no way to solve this problem; that is a fact. Condorcet had his own criterion, and Cynthia Dwork, who was at Microsoft in Silicon Valley, and her co-authors had an interesting 2-approximation algorithm for what's called the Condorcet criterion. There's an algorithm called the Borda count, which I'll quickly mention in a second; Young, an economist, showed in 1974 that this algorithm has nice axiomatic properties. Of course, it does not contradict Arrow's result, but it has some nice axiomatic properties. So that's a lot of stuff from [indiscernible]. On the choice model side, there is Thurstone, who was a behavioral scientist. In '27, he said, I think people behave like this: everybody has some kind of TrueSkill, in the TrueSkill sense, and there's some noise in their performance every time they play, let's say. So say Lynn and I play a game; I have my own TrueSkill, he has his own TrueSkill, and when we play, a random variable is drawn, which is added to my TrueSkill and to his TrueSkill, and depending on which final value is bigger, one or the other wins. That's the type of thing used in Elo ranking; it's the type of thing used here. There's a whole family of models, depending on how you model the distribution of the noise, and that leads to different things. This one is a Gaussian.
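Since the Borda count just mentioned comes up again as a baseline, here is a minimal sketch of the classic rule (Python; the function name is illustrative): on each ballot, an item scores one point per opponent ranked below it, and items are sorted by total score.

```python
def borda_count(ballots):
    """Aggregate full rankings by the classic Borda count.

    Each ballot is a list ordered best-first; an item at position
    `pos` on a ballot of length n earns n - 1 - pos points, i.e.
    one point per opponent ranked below it on that ballot.
    """
    scores = {}
    for ballot in ballots:
        n = len(ballot)
        for pos, item in enumerate(ballot):
            scores[item] = scores.get(item, 0) + (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
print(borda_count(ballots))  # ['A', 'B', 'C']
```

With pairwise data instead of full ballots, the analogous score is simply each item's fraction of wins, which is the sense in which Rank Centrality later relates to an iterative Borda count.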
This is what McFadden popularized for policy making, which is called the multinomial logit model; it uses an extreme value distribution, and so on, and it's also popular in the business school world. So it's a long history, and there are lots of exciting things that have happened. What I will do is relate this class of models to an algorithm, but first let me tell you the algorithm. It's an algorithm in which only what you observe matters. And after that, we'll see how well it works. So that was a very quick overview; there are a lot of things I'm leaving out here. So here is my algorithm, what we call Rank Centrality, for the reason that it's based on a random walk. Remember, we have a graph in which each edge has these kinds of numbers, telling how often one [indiscernible]. I'm going to create a random walk as follows. For each edge that's present, I'm going to put a probability like this. What this probability is reflecting is the fraction of time the other player defeated me. The intuition behind this random walk is the following. I'm going to have a random walk running on this graph, and the stationary distribution of that walk is going to assign the scores. Let's suppose that I'm always defeating everybody. Then the stationary distribution of this random walk should put a lot of mass on me, right? It should be high. That means, if such is the case and I'm defeating everybody, then as part of the random walk, I should hardly go to the other nodes. And, of course, I should still go to other nodes sometimes if I have defeated others only once; if I have defeated others only once, I don't have too much confidence in the data, so while there should be a bias towards me, it should not be too much. But if I have defeated others all the time, like over 100 games, that really is a strong bias, in which case I should make sure that I go to other nodes with only a little probability.
That's the style of design we are doing here. When I'm at node i, I'm going to go to node j with probability proportional to how often j has defeated me, in a normalized sense, and these plus-ones are taking care of the finite-sample correction. So let's suppose that Rico and I have played only once, and Rico defeated me. Then I will go to him with probability two over three. So there is some bias I'm giving him, but not too much. But if it was 100 games, 100 to zero, then it would be 101 [indiscernible] 102. Okay. And if this graph is connected, which is the minimum you need to have any reasonable ordering, because if there are two sets of things that have never been compared, then there is nothing meaningful I can do between them, then there will be a well defined stationary distribution for this random walk, and that will give me the scores. Okay. And if I want to compute it, I run this algorithm, or just do power [indiscernible], for example. >>: Is there a typo there? Is that AIJ plus AJI? >> Devavrat Shah: This one? You're right, of course. Thank you. I was just focusing on the top. Okay, that's very good; I'm conveying the details also. All right. So that's the algorithm. Any questions about it? Okay. Next question: how well does it do? Just to give you a rough sense, this is the type of recursion you're seeing, which is in fact recapturing the essence: I have a high score if I have defeated lots of people, or I have a high score if I played with only one person but that person had a very high score. A heavyweight championship, right? The contenders only come to play the champion at the end, and while you're trying to build your score up, you play a lot. All right. This has a close relation to the Borda count; it's like an iterative version of the Borda count. Now, if one considers the MNL model, which is a Thurstone style model, this model says that each node has some kind of TrueSkill, or parameters associated with it, W1 and so on.
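The random walk just described can be sketched as follows (Python; the +1 smoothing matches the talk, while the function name and the max-degree normalization used to make rows stochastic are my own filling-in of details not spelled out on the slides):

```python
import numpy as np

def rank_centrality(A, iters=2000):
    """Scores from pairwise-comparison counts via a random walk.

    A[i][j] = number of times i defeated j. The transition i -> j is
    proportional to (A[j][i] + 1) / (A[i][j] + A[j][i] + 2), i.e. how
    often j defeated i, with +1 smoothing; rows are scaled by the
    maximum out-rate so leftover mass stays on the self-loop, and the
    stationary distribution of the walk gives the scores.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and (A[i, j] + A[j, i]) > 0:
                P[i, j] = (A[j, i] + 1.0) / (A[i, j] + A[j, i] + 2.0)
    d_max = max(P.sum(axis=1).max(), 1.0)
    P /= d_max
    P[np.arange(n), np.arange(n)] = 1.0 - P.sum(axis=1)
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):           # power iteration
        pi = pi @ P
    return pi
```

For example, with `A = [[0, 5, 5], [0, 0, 5], [0, 0, 0]]` (node 0 beats everyone, node 1 beats node 2), the stationary distribution puts the most mass on node 0 and the least on node 2, which is the intuition from the talk: a node that loses all its games is left quickly and visited rarely.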
So ideally, one would like the ranking to be in the order of those parameters, with the parameters reflecting the scores. In this case, one can design the maximum likelihood estimator, and our algorithm matches its performance. This is in terms of simulations; what this shows is that the other algorithms, after some time, stop learning well, even though axiomatically they're supposed to be very good. More formally, here is how the result looks mathematically. If my graph of comparisons is a random graph, then this is the standardized error, which scales effectively with the parameter k, the confidence that I have, that is, how many times each pair has played with each other, and with d, the degree of the graph. This is how it scales down, and you cannot do better than that: this is a fundamental lower bound, and this algorithm effectively gets close to it. Again, this captures the fact that if you have only so many comparisons, you can only learn so well, and that with a bounded degree graph, there's only so much you can learn. The random graph result seems to suggest that, with a random graph structure, you can essentially do as well as the best algorithm can ever do. And that makes sense, because with an arbitrary given graph, you didn't have a choice; if I had a choice, maybe I would use a random graph. If I did not have a choice, if I had an arbitrary graph, then the Laplacian of that graph plays an important role. In particular, it shows up like this. If you have a graph which is not well connected, in the sense that the spectral gap of the Laplacian is small, then the error blows up; it would be very high. If I had a line graph, me connecting to him, him connecting to the next, and so on, that would lead to very poor performance. But if it's a well connected graph, then it will be very good. Okay.
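The line-graph-versus-well-connected-graph point can be checked numerically with a small sketch (Python; this uses the normalized graph Laplacian, a common choice, though the exact quantity in the actual bound may differ): the spectral gap, the second-smallest Laplacian eigenvalue, is much smaller for a path than for a complete graph on the same nodes.

```python
import numpy as np

def spectral_gap(adj):
    """Second-smallest eigenvalue of the normalized Laplacian."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    return np.sort(np.linalg.eigvalsh(lap))[1]

n = 8
path = np.zeros((n, n))                 # poorly connected line graph
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1
complete = np.ones((n, n)) - np.eye(n)  # well connected graph
print(spectral_gap(path) < spectral_gap(complete))  # True
```

A small gap means the random walk mixes slowly and errors along the chain compound, which is why the line graph performs poorly in the bound.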
And this is also related to a natural random walk on the original graph and its Laplacian. So the take-away message here is that if I were to design the graph, I would choose it subject to constraints. In a real system, there will be conflicts; I mean, if I have a conflict of interest with somebody, I cannot be assigned that paper. But subject to those constraints, I would like to choose the graph so that delta, the spectral gap, is maximized. And maximizing that, in the pieces that are shown here, turns out to be a nice [indiscernible] optimization problem, and because of that it can be solved reasonably well. Question? >>: [indiscernible] based on answers you obtain in earlier rounds? >> Devavrat Shah: Excellent point. Let us suppose that we allow the [indiscernible] graph a priori, and then you choose your [indiscernible] adaptively; this information-theoretic lower bound applies to that case also. Maybe the best gain you can get is a [indiscernible] factor, and I believe the log factor is necessary; it's just that we can't prove it. So my sense would be, sure, maybe you might be able to improve it, but only up to a constant factor. Okay. So that was a quick run-through of one type of question, which is ranking. The second question is related to crowdsourcing. Again, ranking could be thought of as crowdsourcing, because we are getting information from people. And the world of crowdsourcing is broad: it's the Netflix Prize, or sending a man to the moon, or micro-tasking at five cents per task. I'm going to talk about the five cents one, because I can't talk about such big amounts of money, all right. So here is a quick motivation for why one might want to do this.
If you have a biological lab, let's say, and you're coming up with all sorts of these interesting images from experiments and you want somebody to count how many red cells are there, if you hire an undergrad intern, they might deal with, say, 300 images per hour diligently, and it will cost you something like this. If instead you put it out on Mechanical Turk, maybe people will quickly count things. You'll get somewhat noisier answers, but you will get a lot more done. So the issue is high versus low reliability. This is essentially the type of experiment that was done by Susan Holmes [indiscernible] at Stanford. What you want to do is bring this reliability high but keep this number high too. And, frankly, this is the type of thing you are aiming for: more out of your money. Okay. And the question is, how do we do that? Well, we know that one way to do that is to have structured [indiscernible] built in. So while we will have noisy answers coming in, we will be able to denoise them if we have structured [indiscernible]. And that's sort of what we're trying to do. So here's a quick example, just to set the problem and notation right. The example is related to -- actually, these images are the type of thing that happened in 2008. A plane crashed in Nevada, and people were looking for where the plane was, and, of course, image processing is only so advanced, so you wanted humans to do it. They released on the order of 50,000-plus images and lots of people volunteered. And then people started looking at images. So let's say I see these three images and I say, looks like there's a plane. Maybe a little noisy, but there's a plane here. There's no plane debris here. There's no plane debris here. Somebody else looks at some other images and gives their answers there. And so on. So you get different people looking at different subsets of images, and you get their answers on whether plane debris is there or not.
Finally, you decide these places might have plane debris; let me send people up to look at them. Of course, if you look at this, you will say, okay, no plane is there; here, it's very high likelihood that a plane is there. But then for things like this and this, you don't know. So what you want to do is build a confidence somehow, from the answers that you got, about which ones are more likely and which ones are less likely. If I knew who the person giving me answers is and how truthful or not truthful that person is, it would be a really easy problem, because I'd bias that person's answers accordingly and then aggregate things. The problem is, I don't know. On a standard Mechanical Turk platform, I put out my task, people take on the task and they answer. And that's that. Really, I'm not learning about them. I don't have a choice about them. I do get some kind of information from the platform about how these performers have done in the past, but that's only so much; it's limited. The question is, how am I going to integrate all this in a meaningful way? So again, I want to solve this problem: label estimation with minimum cost. Cost is just the number of edges in this kind of bipartite graph, because each edge is like one person performing one task. And the operational questions I want to answer are task assignment and, once you have answers, how to infer the best answer. Again, it's mirroring the same sets of questions that we had seen before, correct? Who's going to compare which things, and once I've got comparisons, how am I going to infer answers. Similarly here, how am I going to allocate tasks to different people, and once I have answers to tasks, how am I going to infer from them. Again, I need to tell you a [indiscernible] model to build the algorithm and then understand it. And here's a very simplistic model. In the kind of example I gave you, the tasks will be binary, lots of plus 1s and minus 1s.
You might have K-ary tasks. In the case of images you will have seven cells or ten cells or 20 cells and so on. And each person has some kind of latent reliability as per which the person will answer. In this case, say this person has probability half. So it's random: with probability half, that person will answer correctly or incorrectly. And that's how I'll see the answers. This person is completely correct and truthful, so all the answers are given correctly. And I would assume that there's a reasonable positive bias, because if I did not have that, then I would not be able to differentiate all the pluses from all the minuses. Okay. So with that probabilistic model, here's a quick preview of results. In that probabilistic model, this is a simulation. This is the best performance you will be able to achieve. This axis is the amount of redundancy, and this, the reliability. The higher the reliability you want, of course, the higher the amount of redundancy you need, and this is the best rate you can achieve. This log-versus-linear scale suggests that the error probability goes down exponentially with redundancy. This is what you would get for majority voting; that is, you look at the answers and take the majority answer, which is the natural thing to do. This is what a popular inference algorithm called expectation maximization will do, and this is what our algorithm will do. And as you can see, there is a similar slope, but a little offset. So something interesting is happening there. Looking at it one way: if I want to achieve, let's say, 90 percent accuracy, or 10 percent error, in this simulation our approach would require an amount of redundancy of eight, versus, let's say, an existing algorithm which would require 12, versus majority, which is 17. And if you're really investing money in it, this is the factor loss or gain you are incurring. So really, good inference is very useful. It goes a long way.
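The exponential decay of error with redundancy that the log-linear plot shows can be reproduced, for the majority-voting baseline at least, with a small simulation. This is a sketch under an assumed homogeneous crowd in which every worker is correct with probability 0.7; the talk's model has heterogeneous latent reliabilities:

```python
import random

def majority_error(redundancy, p_correct=0.7, trials=20000, seed=0):
    """Empirical error rate of majority voting on a binary task when
    `redundancy` independent workers each answer correctly with
    probability p_correct (assumed homogeneous crowd)."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        votes = sum(1 if rng.random() < p_correct else -1
                    for _ in range(redundancy))
        if votes <= 0:          # a tie counts as an error
            wrong += 1
    return wrong / trials

for l in (1, 5, 9, 13):         # error drops roughly geometrically in l
    print(l, majority_error(l))
```

Plotting the printed error rates on a log scale against redundancy gives a roughly straight line, which is the exponential decay referred to above.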
Now I will tell you the answers, right? I told you the model and I gave you the results in terms of a graph. Now let me tell you the algorithm for task assignment and the inference. Like before, the best task assignment would be a random regular graph. There, it was a [indiscernible] graph for choosing comparisons. Here it's a random regular graph, which is saying that if I have a budget -- each task should be assigned to L workers and each person can perform at most R tasks -- then subject to those constraints, I will choose the random graph. Okay. And the inference algorithm would be like this. Let's just build intuition towards the algorithm. So if I have a task like that, which is plus minus minus, majority voting would say minus, and that's it. But, well, if I knew how trustworthy these people were, then I would like to incorporate that information into my answers, and an oracle who knew these reliabilities would just add the log-likelihood weights, okay. And, of course, if everybody's equally trusted, then the best answer would be equal to majority voting. It sort of makes sense too. But, of course, we won't have that. There will be uncertainty, so we would like to understand whether we can learn these weights. Now, I don't know the weights. I know only the answers. If I know that, well, these answers are given by him and almost all of them are correct, then I should give him a very high weight. Now how do I know that? Maybe his answers agree with other persons'. So somehow, I want to stitch this intuition together. One way to do that is to iterate. Here is what the iterative algorithm will do. It will iteratively learn estimates for these log likelihoods, and the way it does it is as follows. This is as natural as it gets, right? Let's start by giving everybody equal likelihood. Everybody is equally weighted, say, one.
Now I'm going to assign a likelihood for a given task, and initially these weights are just one, so I'm just going to sum up all the answers that I've got. Here I can sum them up because they're plus and minus one. And in the end, I will get the answer that, let's say, there are seven people who have answered this, out of which six have answered plus one and one minus one. So my likelihood of it being plus is plus five, which is pretty strong. Now, okay, so I got this kind of likelihood for all tasks. Next I go and try to assign the reliability for each worker. Well, for the different tasks, there are different likelihoods that I've obtained from the other tasks. Now I want to look at my answer to this task and see, does it compare well with the likelihood I've obtained? If this is plus one and my answer is also plus one, that's good, because I'm matching. If it's plus one and my answer is minus one, that's really detrimental, because I'm going against what everybody believes is true. Okay. So I just sum that up and I iterate this. If I did not exclude, in my iteration, the answers coming from me on previous tasks, then it would be like a power iteration of this matrix A transpose A. I'm excluding them because that's actually very important for information aggregation, and when we do that, that's when the best performance comes out. As before, I'll give you the precise theorem -- and, let's see, have I got five more minutes, or -- >>: We have the room until noon, but it's audience attention. >> Devavrat Shah: Okay. I'll end in five minutes, and I've got more pictures now. So we thought, well, we've got simulations, we've got theorem results in a second. What about the real world? I mean, maybe this is meaningful, maybe this is not meaningful. So we thought we'd do experiments, and this is a great place where you can do experiments, because I can load up my tasks on Mechanical Turk and run experiments. First, we thought, well, maybe we should do something like this.
Which of these ties are similar? But then similarity is in the mind, right? So it's very hard to make it objective. So we said, well, what about things like this: which tie goes well with this shirt? Again, this is all subjective. So subjective things are very hard to evaluate. Finally, we ended up with this: which colors are similar? And that is because there are metrics that exist that actually capture cognitive color similarity. And that seemed to work extremely well, actually. So we showed people these kinds of colors, randomly generated, and over them we did all sorts of experiments, and here is the type of performance we see. The iterative algorithm starts doing much better after some threshold, and there's a reason. The threshold is, effectively -- before I tell you the theorem -- that if the information is too noisy, iterating actually increases the noise. Okay. But if you're in the low-noise regime, that is where you can actually do well, then iteration actually helps. And that's what comes out in the theorem, this kind of qualitative result, and that's also what we saw in the experiment. This is just one instance of that, but this is what we see on all sorts of data, on data we collected. There's a team at MIT which primarily designs crowd-sourcing interfaces, led by Rob Miller and his colleagues. And on all of their data, also, the performance looks similar. Yes? >>: It's not a matter of initial condition or anything like that? >> Devavrat Shah: No. >>: So after it converges -- so you could initialize it with majority voting or something, and it's still going to -- >> Devavrat Shah: Yes. So you can start with majority voting and it will become worse. Again, in the model, you can prove why it happens. In reality, that's what we observed too. And sort of it makes sense because of this reason. And then, is the random graph really useful? Again, there are theorems about that.
But in practice also, you can see that with graphs with a small spectral gap, performance becomes worse. All right. So there is a parameter you can assign which we called the quality of the crowd. It's effectively just a quadratic norm of the latent reliabilities, and that is precisely what determines the performance. This is the precise theorem. Let's just look at this one. It says, suppose I want to obtain a reliability of one minus epsilon. How much redundancy do I need? I need redundancy that scales, in that parameter, as 1 over Q. No matter what algorithm you use, you need this much. And if you use majority, it would be quadratically off, because of this exponent being Q squared. And Q is usually small, right? So one over Q squared is really bad, versus one over Q. Now, again, this is another place where you can ask a question: what about adaptivity? Does it help? For example, I have answers like this. These things are well understood. Maybe I should only focus my energy on things like this and this. Surprisingly, it does not, and what this says is that it only improves up to a constant factor. So again, in this case also, adaptation does not help. So in both of these classes of problems there's a question of what sort of graph structure you need for task assignment, what sort of inference algorithm you need, and what sorts of qualitative and quantitative results come out. It makes sense that you need a graph which is reasonably well connected. Simple iterative algorithms do very well, and they get you as good a performance as you can hope for, and adaptivity is not of much use. Okay. Now, with that, there's a lot more information you can get out of these results. Like, you have a bunch of crowds; which one should I employ, depending on their quality and the amount of money they're asking of me? I can calculate that and decide which one, too. So these are very useful operationally. And that brings me to the end of my talk, and I think it's 11:30. It's time to end.
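The iteration described earlier -- task likelihoods as weighted sums of answers, worker weights from agreement with those likelihoods -- can be sketched as follows. For simplicity this version keeps the self-messages, so it is the plain power-iteration variant rather than the exact message-passing algorithm from the talk, which excludes them; the toy answer matrix and true labels are assumptions for illustration:

```python
import numpy as np

def iterative_labels(A, iters=10):
    """A: tasks-by-workers matrix with entries +1/-1 (the worker's
    answer) and 0 where that worker did not do the task.  Alternate
    between per-task likelihoods (weighted sums of answers) and
    per-worker weights (agreement with those likelihoods)."""
    w = np.ones(A.shape[1])             # start: everyone equally trusted
    for _ in range(iters):
        x = A @ w                       # per-task log-likelihood estimates
        w = A.T @ x                     # reward agreement, punish dissent
        w /= np.abs(w).max() + 1e-12    # keep the scale bounded
    return np.sign(A @ w)               # final label estimates

# Toy crowd (assumed): two reliable workers, one adversary, and four
# tasks whose true labels are all +1.
A = np.array([[+1, +1, -1],
              [+1, +1, -1],
              [+1, -1, -1],
              [-1, +1, -1]])
print(iterative_labels(A))  # plain majority voting gets the last two tasks wrong
```

Note that the adversary ends up with a negative weight, so its answers are effectively flipped before aggregation, which is how the scheme beats majority voting on this toy instance.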
So really, there's a lot of data we have; it's a great opportunity, but to realize it we need to process it at scale. And in these examples, what I showed you is: start by thinking about a reasonable model, come up with the right algorithm, and the right algorithm helps you solve the problem well. The model helps you understand mathematically why the algorithms are useful. But the algorithms are model independent, so they are useful just on their own. Okay. And I didn't show you this, right? I should show you that. The question is, can I predict a trend on Twitter before it becomes trending? So here is a quick review. I should stop, but I should show you this. So that's Miss Rhode Island, who became Miss USA this June. So naturally, what would you expect when she becomes Miss USA? Things would trend on Twitter. And so that's the perfect time to start predicting whether it will become trending or not. So here is the real signal in terms of the volume of tweets that are happening. Through MIT, for various reasons, we had access to the firehose of Twitter. So this was a real signal we were tracking. Twitter announces it's trending at time zero. We had our estimator running at the same time, and our estimator said this would become trending at time minus two hours. And this happened in this particular case, but this is not atypical; this is very typical. In particular, this is the ROC curve over a large number of samples that we've done. Basically, the point to take away here is that 95 percent of the time, we can predict something trending, correctly, before it becomes trending. Four percent of the time, we make an error, because, you know, you make errors. And when we are ahead, on average we are ahead by one hour, 40 minutes. So that example was not atypical. All right. I think this is where I should stop. >>: Here you're not predicting the future. You're predicting what will become a trend on Twitter? >> Devavrat Shah: Yes. >>: For example, Google did this work on flu symptoms. They are trying to predict flu in the future, and they figured out that if there are searches for flu symptoms, and they can pin them down geographically, then probably there will be a flu outbreak in this area at some, you know, close point in time. >> Devavrat Shah: Excellent point. So there, the point was that flu searches are getting out information about something that's going to happen: it will be recorded more massively on a public scale. Based on what we observe with all sorts of these -- these are, effectively, time series. These time series do have a very simple structure, and for many of them, actually, the information about them becoming popular is already there. It's just that Twitter is doing it volume-based, so it takes a while for them to announce it. But if you are just a little clever about it, then actually you can get in and do that prediction. Of course, you might be wrong sometimes, but it looks like you might not be wrong too often. It's a great analogy, actually, yeah. Yes, please? >>: So in the second half of your talk, wasn't everything you were talking about extracting information about the reliability of the people giving you information? >> Devavrat Shah: Yes. >>: In the first half of your talk, one example you talked about was conference reviewing. We've got individual reviewers making comparisons. Have you thought about combining those, because some reviewers are going to be better than other reviewers at assessing the comparative quality of papers? >> Devavrat Shah: So what you're asking, putting it the other way, is that there's a choice model. See, one way to think of it is there's a crowd that I have, and the crowd is modeled by one distribution over permutations.
In fact, going back to Rico's last question, there are people who are reliable, which means that there's a well separated distribution over permutations, and there are people who are not reliable, which is a sort of mixed distribution over permutations. The question is how I can put them together. One way to go about it is to think of answers coming from multiple choice models and somehow combine them. And that's something that we're trying to do right now. So I don't have any meaningful answer; I have some conjectures I can tell you about. Yes? >>: In the first part, there was a stationary distribution. Why is that a meaningful answer to this question? >> Devavrat Shah: Okay. So at some level, here's what's happening in both cases. You've got some signal that you want to learn. You're observing aspects of the signal through some of these -- let's call them random matrices. And if you look at some form of a [indiscernible] approximation of these random matrices, they turn out to be closely related to the signal. In both cases, really, what we are trying to do is, through iterations, learn some form of rank-one approximation. In the first case, the rank-one approximation turns out to be the stationary distribution. In the second case, it turns out to be the approximation of that chopped-off matrix. >>: In real life, people talk to each other and make, sort of -- let's say people talk to each other in pairs and they make some kind of comparison, and they talk; you know, I'll talk to you and then you talk. Is there any way to study how these decisions can be made in a distributed way among people? >> Devavrat Shah: Great. So there are two things. There is a dynamics part and there is the decision making, eventually. Let's suppose that one way -- I mean, one ideal way -- I would model people's behavior.
I mean, I don't know how meaningful it is, but it's still useful to think, in the ideal world, that everybody has a choice model of their own in their mind, implicitly or explicitly, over the objects of interest. And every time I interact with somebody else, that information changes my choice model and your choice model. So over time, it's evolving. While we interact, it evolves. And also, at the same time, this whole evolution could, in principle, lead to some kind of global decision making, if you're extracting that information out. How to think about that in a meaningful way, I don't know. But it seems like maybe a reasonable way to go about it. It's too messy. >>: It's too complicated. First, I want to say I enjoyed the talk very much. >> Devavrat Shah: Thank you very much. >>: Basically, [indiscernible] the fact that you're bringing these [indiscernible] models into practical systems. It's very interesting, for the last part of the work. Now, for the first part, where you tried to create a partial order of [indiscernible] in the basically space study, one thing I noticed is that in the practical world, many times there's more than one order, because of preference. So in that case, maybe there is something like a context, for the restaurant model, basically. I mean, not everyone has the same preference. And that may be because of background, [indiscernible] of Chinese versus basically U.S.; maybe it's different partial orders you have. And a few like spiciness, others do not like spiciness. Like wine: if you do not drink wine, you may have different models. However, this hidden [indiscernible] -- is it possible for it to be applied to the order? Let's say, for any particular user, when you basically [indiscernible], is it possible to take this hidden context into consideration? >> Devavrat Shah: Okay. So I think you bring up, again, a very interesting point, which both of them have brought up.
That is, thinking of the entire world as one choice model is not the right thing, because people are heterogeneous. Now, one way to go about dealing with this is saying, well, I've got a mixture of these choice models: a B1 fraction is this type, B2 is this type and so on. How many types? And second, what are those Bs? And third, what are the associated choice models? This is a hard question. Again, I have some conjectures and there are some interesting things I can say, but not with 100 percent confidence. People have tried studying this kind of learning of distributions from partial information, including myself. There are some sparse choice model approximations that we know. But they're not, I think, practically useful, at least in my mind. So what people do in the world of, for example, revenue management -- it's a business school world -- is put in some kind of structured mixture of multinomial logit models and then try to learn the parameters related to that structure. But again, that's ad hoc, and it's [indiscernible] I do not think about it. So it's a great set of questions, which some of you should answer. >>: We can talk. >> Devavrat Shah: Sure. Yes? >>: So I have been working on an application really like what you did, for both your first part and your second part. I want to get some opinion from you. This application is how to evaluate stroke patients' movement quality. And to do that, we need to rank all the stroke patients' movements. And how we can do that is we pick two of them and ask the therapist who did this better. >> Devavrat Shah: Okay. >>: So then this is a pairwise comparison, and we can compare all of the stroke patients. The problem is that the time of the therapists is constrained. So we should ask the therapists these questions -- which is your second part. There is the crowd sourcing.
You have a lot of therapists, and I have a lot of stroke patient friends. I want to ask as small a number of therapists as possible for these pairwise comparisons. >> Devavrat Shah: So again -- I went through this very quickly -- what you would like to do is exactly the same thing as in the first setting. That is, you want to maximize the -- there's a comparison graph that you are creating, right? Effectively. >>: Yeah. >> Devavrat Shah: And you want that comparison graph to have as large a spectral gap as possible, subject to your constraints. For example, if there's a therapist who cannot rank two patients, then you can't ask that question. So effectively, you've got this huge [indiscernible] matrix, and you're going to assign each therapist -- or one therapist, I don't know how your setting is -- to each one of the entries, to ask them the question. The question is, how would you choose those entries? One option is you choose them at random, as per a [indiscernible] graph or a random regular graph. Another option is you do a structured expander, for example. And whichever way you do it, at least this result would say that you're getting the most information out of it. I mean, I'll be happy -- >>: For your second part, I want to ask as few times as possible. >> Devavrat Shah: Yes, so this will also, again, say that if you have -- this is roughly how your uncertainty would scale. This is how many times you're asking a given pair, and this is how your structured graph would look. So if this multiplication is large enough for your metric of interest, then -- >>: Thanks. >> Devavrat Shah: Okay. If you have more questions, please send me an email. I'll be happy to -- yeah. >>: One quick one. When you were estimating the [indiscernible] -- I don't know if you can bring up the slide that does that -- you had one slide where you had the summations that related [indiscernible] to the tees and the tees to -- >> Devavrat Shah: -- yeah, exactly. >>: The next one. >> Devavrat Shah: Yeah, you're right. Just something that -- yes. >>: So basically, on the left side, you have an assumed set of values, and from that you update the tees, and then on the other side, given the tees, you update the values? >> Devavrat Shah: And now I'll go back and update the tees. >>: So you keep going back and forth until it converges? >> Devavrat Shah: Until it converges -- actually, I would stop after some number of iterations, K, which scales like, effectively, order one over log Q. >>: My question is that these things are kind of in the class of coordinate descent algorithms, where you're estimating subspaces, and those algorithms have very bad convergence properties, including that if some strong assumptions are not satisfied, they converge to something that's not even a stationary point. >> Devavrat Shah: Excellent. >>: Do you run into issues like that? >> Devavrat Shah: No. It's a great point. Effectively, if I want to think about it as coordinate descent, this is unconstrained, so that is one reason why it's not happening. If I want to think of it in a more classical linear-algebraic way, effectively I'm doing a power iteration: I'm trying to compute the largest singular vector of a matrix. And these are very well conditioned matrices. That is the reason we are not running into that. It's an excellent point, yeah. >> Rico Malvar: Well, thank you very much.