>> Andrey Kolobov: Hi everyone. Our guest speaker today is Professor Dan Weld from the University of Washington, across the bridge. Over his career Dan has made contributions to many areas of computer science. His… broadly speaking, his interests lie in making computers easier and more effective to use, which includes various aspects of HCI, crowdsourcing. It used to also include sequential decision making under uncertainty. I don't know if that's true anymore, but anyway that's how I got my PhD under his supervision. Dan is also an active entrepreneur; he's a member of the Madrona Venture Group; he has started several startups. And with that I'm going to hand over the stage to him. >> Daniel Weld: Thanks so much. So it's great to have a small audience, make it really interactive. Trying to cover a reasonably broad expanse of material, some of which maybe people have seen in which case we can go right through that, and some of which I know you haven't seen 'cause I didn't see it until yesterday, and so we can spend more time talking about that. As Andrey said, one of my biggest interests is making computers easier to use and especially personalized, whether it's different kinds of interfaces on phones or tablets or whether it's the agents in Cortana or Amazon Echo or what have you, speech-based agents. And I think of the crowdsourcing work as fitting into that frame. The other half of my life is machine reading and information extraction, and unfortunately I won't really be able to talk about that at all, although you'll see I snuck it in part way through. So thinking about AI and crowdsourcing, if you… there's been lots and lots of work on AI models applied to crowdsourcing, but almost all of it in this notion of collective assessment, making sense of the results that come back from the workers. And this is a very, very passive view, very much focused on the state estimation problem in AI and how do we sort of track what's going on, track the skills of workers, track what the right answers are, think of them as hidden variables, and so on. And the overall thesis of today's talk is that there's lots more to AI and there's lots of other ways of applying it to crowdsourcing. So we can use AI methods to take and optimize simple crowdsourcing workflows, trying to reduce redundancy where we don't need it and put more where we do. We can look at complicated tasks and use AI methods to optimize large-scale workflows, and this subsumes A-B testing, which is really common in crowdsourcing today, as well as even more complex workflows. We can think about ways of routing the right tasks to the right people and we can also think about ways of making the crowdsource workers themselves more skilled and… which includes both testing them and also teaching them. So all these things are, as Andrey pointed out, sequential decision making problems and so we can use our favorite AI planning algorithms to control most of these tasks. So that's gonna be the thesis: that intelligent control, and in particular sequential decision making, is essential to get the most out of crowdsourcing. And so you're gonna see a variant on this slide over and over again. The basic idea: some sort of task comes in, an AI system, an intelligent controller, decides on exactly what to do next, generates a task, sends it off to the crowd, interprets the result with some probabilistic reasoning and then either returns the result or else maybe chooses another task. Yeah?
>>: Do you see a one-to-one mapping of task to job, or like, is it the task comes in and needs to get done and then… or does that task get broken into pieces? >> Daniel Weld: For most… I mean, I don't think it has to be one-to-one; for most of the things I'll talk about today it is one-to-one. Okay, so let's start out with the simple base case and I'm assuming everybody knows about crowdsourcing so I'll try to spare you all those kinds of slides, but do we have any bird watchers in here? So is this actually an Indigo Bunting? Yes or No? [laughter] No it's not. This is actually a bluebird. But the point is obviously you can send jobs out to Mechanical Turk and actually the workers are pretty good and people who—you know—know lots about birds actually will probably do better than you guys, but the results that come back are gonna be noisy. Hopefully though, on average, they're going to be right. Typically the way that people deal with this is by asking a number of different workers, assuming the majority is right, using some form of expectation maximization to come up with better estimates for the hidden variables. Those hidden variables are both, "What's the correct answer to the job?" and also, "How accurate are the different workers?" And this is great. There have been lots and lots and lots of papers about this, but it's very passive, as I said earlier. So it doesn't address things like, "Well, how many workers should we ask?" Typically, systems ask the same number of workers. Hard problems may need more, easy problems may need less, really hard problems are maybe so hard that there's not really any point in asking too many questions; better to skip them. So our objective is to maximize the quality of the data minus the cost that we've paid to the crowdsource workers. So in the simplest case here we've got some sort of yes/no question. We decide, "Should we ask another worker?" If so, we send it off to the crowd, we get a result. Now we have to do some probabilistic reasoning, usually with EM, to update our posteriors. Now we come back again. Do we need to ask—you know—still more workers? If not, we return the job—you know—our current estimate, the maximum probability estimate. And you can view this as decision making under uncertainty in a partially observable Markov decision process. In fact, this one's a belief MDP or a simple kind of POMDP, but that's the basic idea. For those of you… how… let me just go through the POMDP slide. So a POMDP is simply a set of states, a set of actions, in this case generating a ballot job or submitting the best answer, a set of transitions with probabilities to take us to the different states, and an observation model, so that each time we take a transition we get some sort of noisy observation. The cost is the money spent per job we send out to Mechanical Turk or a labor market, and the reward is some sort of user-defined penalty on the accuracy. And again, we're trying to maximize the expected reward minus the cost. And you solve these using Bellman's equations and dynamic programming, and I'll skip that stuff. So in action, if we see this we send a ballot job out that allows us to update our posterior. We send another ballot job out, we update our posterior. At this point we've got enough confidence and so we just return our result. So this model's actually very similar to some of the work that HA has done and, like her work on Galaxy Zoo, we've found that this kind of intelligent control approach really does much, much better.
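Just to make that base case concrete, here is a minimal sketch, assuming binary yes/no jobs: EM that jointly estimates the hidden answer of each job and each worker's accuracy, plus a greedy stand-in for the POMDP policy that keeps buying ballots while the expected error penalty outweighs the ballot cost. The priors, the error penalty, and the ballot cost are made-up illustration parameters, not numbers from the talk, and this is not the actual system's code.

```python
def em_aggregate(votes, iters=20):
    """votes: list of (job_id, worker_id, answer) with answer in {0, 1}."""
    jobs = {j for j, _, _ in votes}
    workers = {w for _, w, _ in votes}
    p_true = {j: 0.5 for j in jobs}    # P(true answer of job j is 1)
    acc = {w: 0.75 for w in workers}   # worker accuracy, optimistic starting point
    for _ in range(iters):
        # E-step: posterior over each job's answer given current worker accuracies.
        for j in jobs:
            p1 = p0 = 0.5              # uniform prior over the two answers
            for jj, w, a in votes:
                if jj == j:
                    p1 *= acc[w] if a == 1 else 1 - acc[w]
                    p0 *= acc[w] if a == 0 else 1 - acc[w]
            p_true[j] = p1 / (p1 + p0)
        # M-step: re-estimate each worker's accuracy against the soft labels.
        for w in workers:
            agree = total = 0.0
            for j, ww, a in votes:
                if ww == w:
                    agree += p_true[j] if a == 1 else 1 - p_true[j]
                    total += 1
            acc[w] = agree / total
    return p_true, acc

def next_action(p_answer, error_penalty=10.0, ballot_cost=0.005):
    """Ask another worker while the expected penalty for returning the current
    maximum-probability answer still outweighs the price of one more ballot."""
    expected_penalty = min(p_answer, 1 - p_answer) * error_penalty
    return "ask_another_worker" if expected_penalty > ballot_cost else "submit"

# Example: three workers vote on one job; decide whether to buy a fourth ballot.
posteriors, accuracies = em_aggregate(
    [("job1", "w1", 1), ("job1", "w2", 1), ("job1", "w3", 0)])
print(posteriors, next_action(posteriors["job1"]))
```

The real controller computes the value of information through the POMDP rather than a fixed threshold, but the structure of the loop is the same: aggregate, update posteriors, then decide to ask again or submit.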
So we get much more accurate answers when we're controlling for cost than if you used a static workflow even with EM. Okay, so that's simple tasks, and now what I want to do is talk about slightly more complicated tasks. So how do we put those building blocks together? So… yes? >>: Sorry. Can you go back to that graph? I'm just… I just want to get a sense of how much difference it makes—or one more back. This is by… not by how many answers you ask for but like the value of a correct answer. I don't really understand what that means 'cause… [indiscernible]. >> Daniel Weld: So what's the value? So consider there's sort of a trade-off between… to make sense of this, obviously, you have to pay, say, one dollar for each job you send to Mechanical Turk or one cent, or what have you. The user gets to specify a utility function saying how much, you know, what's the cost of false positives, basically? And so this x axis is basically saying, as you vary that penalty then you get… in fact, as you increase your interest in accuracy then you can do much, much better at the same… for the same price. >>: Oh, so you're paying the same amount. >> Daniel Weld: You're paying the same amount in the two different workflows. >>: Okay. Alright, cool. >> Daniel Weld: But, and again, I'm sort of going… >>: No, I know you're trying to go past this. [indiscernible] to get a feeling of how much difference the POMDPs make. It's pretty significant. >> Daniel Weld: It makes a big, big difference. >>: Yeah. >> Daniel Weld: And again, HA's found similar kinds of things in the Galaxy Zoo, in the Galaxy Zoo work. >>: Sorry. I have a problem understanding the experiments. So how did you do the experiments on Mechanical Turk? >> Daniel Weld: I'm sorry, say… I'm… >>: No, I use apps on Mechanical Turk, but I don't know how to do a sequential crowdsourcing app on Mechanical Turk; it seems [indiscernible] support sequential crowdsourcing. >> Daniel Weld: So the way we do most of these jobs is by putting up one job on Mechanical Turk and—you know—maybe many instances of that, but it's when the worker actually comes and clicks on it that we generate a dynamic page and decide exactly what we're going to ask that worker at that time. Once they actually click we know a little bit about them in terms of their—you know—our history with them. >>: I see, it's a dynamic page. I see. >> Daniel Weld: It's all dynamic. Yeah, if you use the Amazon interface you… it's much harder to do these things. >>: Have you used TurKit in… during these experiments? >> Daniel Weld: I'm sorry. Say again. >>: There is the TurKit toolkit for sequentially posting tasks. Lydia worked on it and… >> Daniel Weld: Yeah, yeah, Michael Tumen's thing, and Chris and Jonathan are also trying to build up a toolkit which is taking parts—they've used that too. It's taking parts of that but then trying to put more of the sense in. So one of our longer-term objectives is actually to put out a much better set of tools that has this kind of POMDP stuff built in. Okay, so let's go to more complicated workflows, and here I'm gonna show what's called an iterative improvement workflow developed by Greg Little along with Lydia and Rob Miller a number of years ago. And the basic idea is you take some sort of initial artifact—I'll make this more concrete in a second—you ask a worker if they can improve it. Now you've got the original and the improved.
Now you ask a different worker, "Which one of these two is better?" You take the better one—maybe there's many votes—you take the better one and you come back and you try to improve it again. So here's an example, and the task is: given a picture, create a description, an English language description, of the picture. So here's the first version and after going around this iterative improvement workflow with three votes at each iteration, going around it eight times, that's the description that comes out, which is a pretty fantastic description. It's probably better than any one person could have written if you'd paid them the whole sum. In fact, Greg showed that it was, so it sort of demonstrates the wisdom of the crowds, and it's really, really cool. And he showed that it works on many other things including deciphering handwriting, which is incomprehensible gibberish if I look at it, but the crowd got it. But there's a whole bunch of questions that aren't answered in this and one is like, "Well, how many times… how'd you know to go around eight times?" Another one is, "Why ask three people?" In fact what he did was he asked two people and if they agreed he figured that he was done, and if they disagreed he asked a third person to disambiguate. Why is that the right strategy for making the decision about which of these two things is better? So one thing that we did, with Peng Dai, a former PhD student of Mausam and myself, is basically, again, frame it as a POMDP, and here is—you know—do we actually need to improve the description? If so, we send out a job saying, "Please improve it." Otherwise we basically say, "Well, are we sure which one is better?" If not, then maybe we should ask some people to judge it, and we go around that loop a number of different times. That's basically the same simple decision-making we saw a few slides ago. And then we come back again and see, "Can we… do we need to improve it more or should we just send it back?" And again, for coming up with descriptions of images, when controlling for cost, we see that the POMDP generates much better descriptions than the hand-coded policy, which was the one out of Greg's thesis. And if you instead, quote, control for quality, it costs thirty percent less labor to generate descriptions of the same quality. What's even more interesting is why. So if you look at Greg's policy, you see that as it goes through the iterations it asks about two and a half questions per iteration, right? 'Cause it asks two people and if they agree then it just decides that that one's better; if they disagree it gets a third person to disambiguate. What the POMDP does is this, and when I first saw that I—you know—was like, "There's a bug. Go back; fix the POMDP solver." And it was only after we followed it a little bit that we realized actually it's doing exactly the right thing, because in the earlier iterations, like, even a chimpanzee can make the description better. And then after it's gone around a couple times, then sometimes the workers actually make it worse, not better. And so you want to spend more of your resources deciding whether you've actually made progress or whether you've moved backwards. And furthermore, you can now go round the loop one more time with the same budget. So the POMDP is actually doing something that's pretty smart. >>: Can you say something about the… how to decide whether you need more improvement? How does the model know that? >> Daniel Weld: So there's lots and lots of details here.
It's modeling the quality of the descriptions with a number between zero and one, which sort of maps to the probability it thinks that the next worker is likely to actually make it better versus making it worse. And again, what is driving the whole thing is the utility function the user has to specify, which basically says, "How much do I value different quality levels?" So that's an input and that's what the system is using to trade off whether it's better to keep trying to improve it or whether it's better to go on to the next image. Yep? >>: In this version was there an offline data collection to build to learn to [indiscernible]… >> Daniel Weld: In this version there was. We actually experimented with a number and Chris later… so this is using largely supervised learning to learn the model. In some follow-up work that Chris did, both on this workflow and then for most of the workflows that are coming, we do reinforcement learning. But this earlier work used supervised learning to learn the models. More questions? Okay. So what I want to do now is talk about complex tasks. I talked about iterative improvement and now I want to talk about taxonomy generation, and I'm going to describe some of the work that Lydia Chilton did and so it will be a brief… a couple slides away from POMDPs and then we'll get back to POMDPs in a minute. So what Lydia was interested in is how do you take a large collection of data and actually make it easier for people to make sense of and to browse? So for example, if you've got a whole bunch of pictures, it would be really great if we could organize those into a taxonomy like the one here. So we've got our tiger pictures which are animal pictures and we've got our pictures of workers which are people. How do we both generate this taxonomy and also populate it with the different data items? Similarly, you might have textual responses from some QA website and we'd like to, again, taxonomize. And these are about air travel—you know—tips for saving time at the airport, and some of them have to do with packing ahead of time, some of them have to do with getting through security. How do we come up with that taxonomy to make it easier for people to browse the data? So generating a taxonomy is really, really hard for the crowd because a good taxonomy means you need to look at all the data, but then how do you actually parallelize it? So Lydia tried a couple things. The first thing she tried was iterative improvement where there's sort of a taxonomy on one side and some items on the other side, and it didn't really work at all because the taxonomy became overwhelming to the workers and the workers got confused about what they were supposed to do and she couldn't make it work. >>: So were people… everyone was seeing the same taxonomy growing over time? >> Daniel Weld: Yes. >>: Everyone could [indiscernible]? >> Daniel Weld: Everyone… every… I mean, in the implementation she tried the workers could do it. They could add to it, they could edit the taxonomy and they could place things in it and then the next worker would see it and try to improve it. And people would judge—you know—is this taxonomy better than that taxonomy? And it was just very hard for the workers to do those tasks. So the lesson was: these tasks are too big and complicated; you need to decompose them. So the next thing she tried was asking workers about different possible nodes in the taxonomy.
"Is this one more general than that one?" This was sort of decomposing into smaller problems, but the workers really found it hard to make those judgements because without context they didn't really know whether one was a subclass of the other or whether it went the other way around. So the takeaway was you don't actually want to ask workers about these abstractions, you need to ask them about concrete things. And she tried a couple other things but ended up with an algorithm called "Cascade," where she uses the crowd to generate the category names, to select the best categories, to place the data into the categories, and then uses the machine to turn that into a taxonomy. So I'll illustrate this on colors since that's the easiest way. First thing to do is to subsample the data into a small set and then generate the categories. So for each color you would ask the workers, "Well, what's a category that might be good for this?" And so you get a bunch of categories out, and these are sort of the candidates. The second thing is to go through and for each color say, "Well, for this color, which of these categories looks like the best one to describe it?" And that allows you to get rid of vague categories and to get rid of noise in the data. And then the third thing is to go through each color and category and basically say, "Okay, which categories does this one fit into?" So this color's both green and greenish. And then you sum up those things, and then the final machine step basically looks at this and is able to eliminate duplicates and look at the sort of subset/superset relationships, and output the categories. So it outputs the whole taxonomy. So that's… yeah? >>: So all the categories are done initially and then they're fixed going forward. So like, do you ever go back and revise the categories later? >> Daniel Weld: You know… no… well, yes—yes and no. No, I'll say no first and then I'll come back and say why yes. No you don't; this is the whole process. Now in fact, what I think Lydia should have done is to now put an iterative improvement step on top of this afterwards to sort of prune up the categories and make sure that sibling nodes are—you know—are appropriately—you know—it makes sense… sibling relationships make sense, and so on. But this… and I think there's lots of follow-on work some other people have done, but this is the way she left it. So that's the "no, but she should have" answer, and then the other answer is… the "yes" answer is actually… the final step is to recurse, 'cause we started by subsampling so now we need to go back and take all the rest of the data and try to put that into the taxonomy, and sometimes when you do that you realize that some of these new items don't have a place to fit and you have to actually add new elements to the taxonomy and see where they go. So the recursion is a little bit more complicated than what I just said. >>: It's probably [indiscernible] you just said, so I was wondering if, when you say category, it's actually a flat thing, or is that already a hierarchy [indiscernible] thing? >> Daniel Weld: So here we see light blue as a subcategory of blue so it's building a deep hierarchy. That said, the individual workers are only making judgements about, "Does this fit in this particular category?" So that substructure of the hierarchy comes because some categories overlap very strongly with another category.
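To make that final machine step a little more concrete, here is a rough sketch of the kind of global computation described: nest one category under another when its items are (approximately) a subset of the other's, after dropping duplicate categories. This is a simplified reconstruction for illustration, not Lydia Chilton's actual Cascade implementation, and the 0.75 overlap threshold and the function names are assumptions.

```python
def build_taxonomy(item_cats, subset_thresh=0.75):
    """item_cats: dict mapping item -> set of category names the crowd put it in.
    Returns dict: category -> parent category (None means it hangs off the root)."""
    # Invert to category -> set of items.
    cat_items = {}
    for item, cats in item_cats.items():
        for c in cats:
            cat_items.setdefault(c, set()).add(item)
    # Drop exact-duplicate categories (same item set).
    seen, cats = set(), []
    for c, items in cat_items.items():
        key = frozenset(items)
        if key not in seen:
            seen.add(key)
            cats.append(c)
    # Parent = the smallest larger category whose items mostly contain ours.
    parent = {}
    for c in cats:
        best, best_size = None, None
        for d in cats:
            if d == c or len(cat_items[d]) <= len(cat_items[c]):
                continue
            overlap = len(cat_items[c] & cat_items[d]) / len(cat_items[c])
            if overlap >= subset_thresh and (best is None or len(cat_items[d]) < best_size):
                best, best_size = d, len(cat_items[d])
        parent[c] = best
    return parent

# Example: light blue ends up nested under blue; tiger and elephant under animal.
votes = {"img1": {"blue", "light blue"}, "img2": {"blue"},
         "img3": {"animal", "tiger"}, "img4": {"animal", "elephant"}}
print(build_taxonomy(votes))
```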
>>: Probably, I have my… what if you have these flat, and then what if you have too many categories and it's actually hard for older [indiscernible]? >> Daniel Weld: Yeah, and this gets into the whole problem of evaluation and—you know—what's the purpose of the taxonomy? Some taxonomies, maybe you don't want them to be too deep; if you're using it for another reason maybe you do want it to be very deep. It's actually really hard to evaluate how good a taxonomy is, and it's a very squishy thing and was actually the big thorn in Lydia's side and one of the reasons why she decided not to keep working on this problem afterwards. Instead, as I mentioned earlier, she's working on joke telling, which seems squishy too but actually it's pretty easy to measure whether a joke is funny or not. And so that's what she is doing next. Yep? >>: So what would happen if you had pictures of tigers and then pictures of elephants and pictures of some very disjoint classes so there's no subset/superset relationship and everyone very clearly labels it as a tiger or an elephant and they all agree on that. So then, would the hierarchy be flat or would it [indiscernible]…? >> Daniel Weld: The hierarchy would be totally flat, but hopefully somebody would've proposed… so I mean there would be a single root, I guess, but hopefully somebody would've proposed "animal" as being a good thing and all those things would've fit into animal. And so we'd get a substructure where animal's at the top, and tiger and elephant are down below. Again, I should just point out, this is a single workflow—I think it is a very cool workflow—it worked much better than the other things she tried. It's definitely not perfect and I think there's a number of ways that one could improve it. So like one thing that it doesn't do well is it doesn't make sure that siblings are parallel in any way, but there's other things it doesn't necessarily guarantee either. Still, it works pretty well. How well does it work? Okay, so it's very difficult to evaluate this but what she did was get a number of different, in this case, textual descriptions and then get a bunch of people from the information school and ask them to build taxonomies and then look at the overlap between one human and the other humans and the computer versus the other humans. And she found that actually the inter-annotator agreement was pretty much as good or almost as good between the computer Cascade algorithm and the other humans. One human versus the other humans was slightly better, but I think the quality overall is really pretty good. The other thing she did was look at how much it cost, and unfortunately here as the jobs got bigger it ended up costing quite a bit more to use Cascade than it did to just hire a single person to do it. And, you know, the flip side though is actually it's really hard to hire these people to do it, especially if you need experts. Sometimes it's basically impossible to do it, and so it takes a long time; that's why we don't have a lot of these hierarchies, and the crowdsource workflow can be parallelized and actually produce these outputs very quickly. So that's kind of cool. Just as a segue, at the Madrona Venture Group we saw a company, we actually didn't invest in it, but it got funded, it's pretty cool, and what it does… it tackles the problem of evaluating surgeon performance.
So they videotape a surgeon doing an operation, possibly on a simulator, and then they have three expert surgeons watch and see how competent that resident is. And the problem is the expert surgeons are really busy and they don't want to do it and they just don't do it, and there's this huge backlog; you can't get the feedback back to the residents. What they did is they put this out on Mechanical Turk and they were able, by asking enough Turkers, to get expert-level estimates, sort of indistinguishable from the surgeons', about how well the residents were doing, and they got these very, very quickly. And so now they're selling that to hospitals. >>: So, Mechanical Turk, I mean, who are the people who are giving feedback, just completely…? >> Daniel Weld: Could be you. >>: Yeah, but like, who are they actually? I mean, I understood it could be me, but… >> Daniel Weld: They're just ordinary Mechanical Turkers. >>: So you don't know if, I mean… 'cause they could've been doctors. I don't know why, I mean… >> Daniel Weld: They could've been doctors. Yeah. >>: You don't know. You don't know… >> Daniel Weld: I don't know, but I'm willing to bet that they're not doctors. I'm guessing that they're—you know—housewives and househusbands and… I mean, they have to go through a little training phase. But actually, you know, if you look at the video [laughter] like, one person, like, drops something and then—you know—it's like their hand is kind of like my hand, whereas the expert is very smooth and fluid and the motions are… it's actually… it makes sense. Yeah, go ahead. >>: So Dan, I was just going to say, like, just to make sure I understand your procedure correctly. There seems to be a pretty interesting connection between this and what's done in cognitive science in terms of norming. That's where the task is, if you give me a word I'm first supposed to think of what properties this word has. Like, if it's a dog, it has fur, it has four legs, it's a mammal, et cetera. I'm asked to do that. Then there is the second stage which is exactly paralleling this. Given that, I'm asked, like, does the dog have… does this thing have fur? Now I write… I check a lot of objects as having that property or not. The end result is a matrix, and then that's given to a clustering algorithm to do whatever we want. It seems like that's somewhat close to this. >> Daniel Weld: I think, yeah, I think there's a lot of overlap. Okay. I'm looking at time. I want to keep going. So Lydia was very disappointed by that cost result, but I was overjoyed 'cause remember we haven't talked about POMDPs in a while. So the natural thing to do is to put decision-theoretic control on top of this. And the first step is, like, why is this workflow expensive? And the reason is because of this SelectBest step where you have to ask lots and lots and lots and lots of SelectBest questions, because it's one question for each color or for each element and for each possible category. The categorize step doesn't take very many. Here she was asking five workers each one of those questions and there's sort of a—you know—squared number of questions. Do you really need all those questions? No, you don't really need all those questions. And furthermore, if you optimize the order of the questions you could do much better.
So again, you can frame this as a POMDP and here you're basically saying, you know, "What questions exactly should we ask?" And then we're gonna do an update on what are the probabilities of the labels for this particular element; what is the label co-occurrence model; and then also what's the accuracy of the worker that we're talking to, based on agreement with other workers? We're gonna update all three of those models, and we looked at a number of different graphical models to represent the co-occurrence model, which is the big one. And the net effect is, as we look at doing this multiclass categorization, the more questions you ask the higher the performance you get. And so this is a model which is doing joint inference with a sort of simple probabilistic model and is greedily asking the best possible question. And what happens is we get the same percentage, the same accuracy that Lydia was able to get, but we only have to ask thirteen percent as many questions. And all of a sudden now it's much cheaper than asking an expert to do the same thing. So that's pretty cool. Okay. Let's go on to the next question which is task routing, and that is really the following model: let's assume that workers have different levels of skill and that the questions we have are of different difficulties; how should we match those up? And actually I could have Andrey give this part of the talk since he was a coauthor on the paper. But basically, at each time step we're gonna want to assign jobs to workers and intuitively what we're probably gonna want to do is assign the really hard questions to the really skilled workers and the easy questions to the not so skilled workers. Although, maybe if a question's so hard that nobody's gonna get it right—you know—we shouldn't ask it of anybody, or give it to an easy worker and save—you know—a medium difficulty question for the skilled worker. There's many different variations on this problem depending on whether you know the worker's skill, whether you know the question difficulty, whether you don't know either one. And in fact, it's a difficult problem even if you know the skill and the difficulty, and of course it's even harder if you have to learn those along the way. But you can frame this, again, as a POMDP—you've seen that diagram before—and here are some results that we got… yes? >>: Is there an assumption that you get to choose what worker you're giving the next task to, or... >> Daniel Weld: Yes. >>: … is the model that the worker comes and then…? >> Daniel Weld: The model is that the worker comes, but once the worker comes we have our past relationship with that worker so we know how… we have some information about how good they are, and then we can give them any question that we want. We've got a—you know—we have expectations about which workers might be coming as well. >>: Okay. You don't get to control that though… >> Daniel Weld: No we don't get… >>: … and say like, "I want more of this worker…" >> Daniel Weld: Again, there's different models where you could do that and there are some platforms where you could do that, but certainly on Mechanical Turk you can't do that… yeah, you get the workers that come. Okay, so this was looking at problems doing named entity linking in natural language: for a noun phrase, figure out which Wikipedia entry it's talking about. And round-robin is a pretty natural baseline that people use and the decision-theoretic controller did quite a bit better.
The curve may not look that much higher, but to get to ninety-five percent of sort of the asymptotic accuracy it takes only half as much labor if you actually do that assignment correctly. Okay, this was supposed to say "interlude." At the beginning of the talk I said half of my life is natural language processing. And in particular, what we're interested in doing is information extraction, going from news text or web text to a big knowledge base, which is schematically drawn there. And what we've done in our NLP papers is look at different kinds of distant supervision that allow us to train these information extractors without any human labeled data. And one of them is sort of aligning a background knowledge base to a corpus heuristically, and we've got a whole bunch of papers on how to do that. And another kind is looking at newswire text and using some cool co-occurrence information to automatically identify events, and then we automatically learn extractors for those events. Some really cool papers on that. But these are like two separate camps and I had one group working on crowdsourcing and another group working on information extraction, and so the obvious thing is how come we're not using crowdsourcing to get some supervised labels which could make our information extraction do better? And then we could train using a mixture of distant supervision and supervised data. And so that's what we've been working on lately, and so let me give a couple observations. The first is that we're not the only people to think of this. So Chris Callison-Burch at UPenn spends two hundred and fifty thousand dollars a year—this is an academic researcher—spends two hundred and fifty thousand dollars a year on Mechanical Turk. The Linguistic Data Consortium has forty-four full-time annotators. All they do is create supervised training data for the rest of academia. So generating this training data is a big deal—and I know you guys here at Microsoft do a lot of training data generation as well. So a number of people have tried adding this kind of crowdsourced semi-supervision to distant supervision for relation extraction, and they've reported that it doesn't work at all. So in contrast to some previous things, actually they're getting really crappy data out. And Chris Ré's group said, "Getting the labels from Mechanical Turk, it's worse than distant supervision. It just doesn't really help at all." Of course, he's a database researcher and he'd been scaling up distant supervision to, like, tera-scale corpora, so maybe he had an incentive to say that his database techniques would work better 'cause he's actually not really a crowdsourcing researcher even though he did the paper. And then a reasonably recent paper by Chris Manning's group at Stanford says—you know—with some fancy active learning we can actually do a little bit better, but fundamentally there's a bunch of negative results, which just struck me as… like, it's wrong. They should be able to do much, much better. So what's going wrong? One thing is that they're not using the latest crowdsourcing techniques. Another thing is that they're actually not training their workers very well. And I think the dirty little secret of crowdsourcing is the instructions really, really, really matter.
And like, nobody can write a paper saying, "We improved the instructions of our crowdsource job and we got a fifty percent improvement in the quality of the results." But in fact, that's the truth and… so I'm still trying to figure out how we can get that instruction optimization into a research paper and how we can automate that process. Sometimes it's just iterative design, but there's got to be something in there. Another thing is they're not qualifying and testing the workers very well. And a final thing is, generally speaking, they're thinking about data quality, not the quality of the classifier. In fact, for all of the different things I've shown before we've been trying to get high quality data out, not trying to generate a high quality classifier, and those two things are not necessarily the same. So anyway, we've been pushing on this. And so the rest of what I'm gonna talk about today is hot off the press; actually it's not even on the press. It's hot out of the experiments and none of this stuff is published and some of it's subject to change, but I hope I'll get feedback. >>: Yeah, I was wondering, given what you said about instructions, I mean, how then can we trust the results in the crowdsourcing literature if instructions matter that much? I mean, I understand for simple tasks, for primitive tasks, okay, they're kind of self-evident with [indiscernible] for something more complicated. I mean when one paper says that they can improve the accuracy by this much and another paper says they improve the accuracy by this much, I mean, how…? >> Daniel Weld: Well I think you have to look very carefully at what their testing methodology is and—you know—what are they using to test, you know? And if the instructions that they're giving to the workers don't match the instructions they're giving to the graders, then that presumably is just gonna cause their performance to be bad. Or if they put instructions out to the workers but they're not actually making sure that the workers are reading those instructions, that's gonna lead to bad performance. I don't think you have to worry about the performance being better than expected. I think the question is, if you're getting bad performance, why are you getting bad performance? >>: Yeah, but when you're comparing two papers where one says—you know—they perform at this level and the other one says they perform at this level, it could be that the difference is explained away by [indiscernible]. >> Daniel Weld: Absolutely. Absolutely. And in a lot of those papers they never actually said how good the data they were getting out of the crowdsource workers was. They got data from the crowdsource workers and they put it into the learning algorithm and they said, "Gee, it's not working very well." Did you check whether or not the annotations were good? We got some of the data; the annotations are not good, so they're not doing quality control right. Yeah? >>: Sometimes people use instructions to convert a subjective task to an objective task. >> Daniel Weld: Right. >>: For example, maybe people have different opinions about a task but they don't know how to handle subjectivity very well in quality management, so I've seen people writing very detailed instructions saying, "This is how you should judge." And at that point you actually turn a subjective task into an objective task. Is this an issue in these kinds of tasks as well?
For example, if you had experts, would they completely agree—is there an objective way of extracting information from these news articles, or is there a subjectivity piece in here as well? >> Daniel Weld: All of the tasks that we've looked at are objective tasks pretty much, if there is such a thing, and I think you're absolutely right, it's much, much harder if you're dealing with subjective tasks as well. But, I mean, it's tricky; for example in this information extraction, right: Joe Blow was born in Barcelona, Spain. Does that give positive evidence that he lived in Barcelona, Spain? You know, it's a corner case. In fact, the annotation guidelines say, "No, that does not mean that he lived there." I like… my thinking about topology is, well, there's like a half-open interval where he lived… anyway, but the bottom line is—you know—if that's the annotation guideline that you're going to be grading against, you better make sure the workers know that 'cause it's possibly counterintuitive. Okay, so I don't want to talk about this NLP stuff since we haven't published it yet, but we're actually able to do much, much better and now the question is how can we automate that? So let's talk about how we automate that. First thing I want to talk about is making sure you've got skilled workers. So here, the work that Jonathan Bragg is working on right now is basically: suppose that at every time step you've got a choice. Should I give the worker some training? Should I give the worker a test to see how good they are? Or should I actually have the worker do some work for me? Keeping in mind—you know—that just because I trained them they won't necessarily get much better, and furthermore they might leave at any time, so if I train them too much and then they quit I've wasted a lot of money. So how should I optimize that? And there's been some related work, for example, by Emma Brunskill and Zoran Popović at UW on using POMDP models to teach better, which I think is really, really cool. But this is a different problem because we actually don't care so much whether they're learning, we care about getting the most work done, and… so it's a different objective. So it's a POMDP and, yeah, that's what it looks like. And let's go on. So here's some of… this is just looking at the POMDP model and the behavior we get out; this hasn't got teaching, it's just got testing. You can see what the system's doing: it's doing a lot of testing up front and then it's firing people. It's a log scale which is why it looks a little funny. And then it's getting some work done and then it's a little bit worried that maybe they've forgotten, so it's testing a little bit more and it's backing off the testing slowly as it goes on. >>: So is the radius to fire people? >> Daniel Weld: Sorry, what's the question? >>: What was the… where's the teaching part? Or is [indiscernible]… >> Daniel Weld: This does not have teaching. >>: This is without teaching? >> Daniel Weld: This is with just… sorry. He's done some work on the teaching too but the results I have to show today are actually just where there's three actions. You can either ask somebody to do some work; you can test to see how good they are; or you can fire them and take the next worker. And then the workers also drop off on their own. And this is assuming we've got two classes of workers: ninety percent accurate and sixty percent accurate. And this is a simulation experiment; he's starting to run it on the real workflow, but I don't have those results yet.
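For concreteness, here is a toy version of that kind of simulation setup: two classes of workers at ninety and sixty percent accuracy, and the simple gold-question baseline policy rather than the POMDP, sprinkling in roughly seventeen percent gold questions and firing anyone whose gold accuracy drops below eighty percent. The number of workers, tasks per worker, and the minimum of five gold questions before firing are assumptions for illustration, not Jonathan Bragg's actual experiment.

```python
import random

def simulate_baseline(n_workers=200, tasks_per_worker=60,
                      gold_rate=0.17, fire_below=0.80, min_gold=5, seed=0):
    rng = random.Random(seed)
    good_work = 0          # correctly answered non-gold (real work) tasks
    paid = 0               # total tasks paid for, gold or not
    for _ in range(n_workers):
        accuracy = 0.9 if rng.random() < 0.5 else 0.6   # two worker classes
        gold_seen = gold_right = 0
        for _ in range(tasks_per_worker):
            paid += 1
            correct = rng.random() < accuracy
            if rng.random() < gold_rate:                # this task is a gold test
                gold_seen += 1
                gold_right += correct
                if gold_seen >= min_gold and gold_right / gold_seen < fire_below:
                    break                               # fire this worker
            else:                                       # this task is real work
                good_work += correct
    return good_work, paid

work, cost = simulate_baseline()
print(f"correct work items: {work}, tasks paid for: {cost}")
```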
But on the simulation studies, what you see is that the POMDP—the red curve—does really very, very well. Here's a baseline which corresponds to current practice, which is basically to insert random gold questions—you know—like between ten and twenty percent of the time; he does seventeen percent gold questions, and if the worker ever gets less than eighty percent of the gold questions right they get fired. And that does pretty well. Another baseline, which is stupid, is to just ask people to work all the time, and that way you can't get rid of the bad workers. And then the purple and the yellow curves are two reinforcement learning models: one which is learning just the class ratio; another one which is learning both the class ratio and a model of the worker behavior. And if we're learning just the class ratio, where we've got the true parameters for the other things, the system does super, super well. It starts out with an exploration policy, which is the green baseline, and it very quickly goes up and does as well as the optimal policy. And what we see is that when we're trying to do reinforcement learning of more parameters, it's not working quite as well yet. And that may just mean that we need to figure out a better exploration policy, which is what we're working on; we need to go farther. What you can see is it's learning some of the parameters. This is the error estimate. It's learning two of the parameters really well and for the other two parameters it's actually got a big error; it actually can't distinguish between the two classes. So we're still trying to get that to work better. And so right now what Jonathan's doing is actually trying this on real world domains as well as introducing the testing actions and the teaching actions together, something he did earlier and it was working well without the reinforcement learning, but he was having trouble with the reinforcement learning. Yeah? >>: Then so the testing is on gold data, right? >> Daniel Weld: That's right. >>: You have the right answer? So is there some sort of, like, testing budget or something where you have to decide when you're gonna… like you wanted someone else… >> Daniel Weld: What the system is doing is it's trying to figure out, if all I care about is how much good work gets done, what's the right testing methodology? >>: Does it assume I have an endless pool of tests I can give, or… >> Daniel Weld: It does. >>: Okay. >> Daniel Weld: Although, actually… it doesn't end up asking all that many tests, so we've got more than that amount of data right now. There is a challenge that people have found—you know—that the test questions maybe get identified and shared amongst the workers—that's an issue. So in real life one needs to be careful about how many test questions you have and about reusing test questions. >>: But so… but ignoring that issue of them cheating, you've never really had an issue then where the number of tests it wants to ask is more than the size of your test set… your gold tests [indiscernible]. >> Daniel Weld: We haven't had that problem. >>: Okay. >> Daniel Weld: But as I said, people at CrowdFlower have had that problem and have had to come up with ways of automatically generating new test questions. And you can also generate new test questions by asking one question to a whole bunch of workers who you know are good and then looking at a consensus answer and then taking that as a test question.
And—you know—in general, a test question where there is very little disagreement about it, so there's not a whole lot of nuance, is gonna be a better test question than one… actually, the educational literature knows lots about creating good test questions and there's a whole methodology for doing that which I think one could plug in. >>: Right, so I forgot… so the test questions are… it's per user, so you only need enough test questions to satisfy all the tests you're gonna ask one user. You don't need to create enough for all the tests [indiscernible]… >> Daniel Weld: That is correct, assuming that they're not sharing information. >>: Assuming that they're not sharing, but like that's why you don't have to make that… okay. >> Daniel Weld: That's right. >>: Yeah. >> Daniel Weld: Okay, so the final part of the talk is actually addressing the other issue that we sort of come to when we're trying to generate training data. And as I mentioned, pretty much everybody's focused on, "How do we come up with a crowdsource workflow that gets the highest quality data that we can come up with?" But maybe a better question is, "If we've got a fixed budget and all we care about is generating a really good machine learning classifier, maybe focusing on the data quality isn't the right thing to do." In particular there's a tradeoff. If we've got a budget of nine and a whole bunch of images we want to label, we could get one label for each one of these images saying, you know, "Is it a bird? Yes or no?" And the workers are gonna make lots of mistakes, so assume the workers are seventy-five percent accurate. Or what we could do is only show three images and get three labels for each of those images and then do majority vote or expectation maximization, and in that case we get results that are eighty-four percent accurate. Or—you know—we could ask all nine workers to label a single image and get something that's ninety-eight percent accurate. What's the right thing to do? In practice, what everybody does is something like two-three relabeling, which is: get two people to label an image and if they agree then we're done, otherwise get a third person to disambiguate. That's what LDC does, that's what lots of people do. For most of these crowdsource studies that's pretty much what everybody does 'cause it just seems we can't trust these workers. But in fact, unilabeling with really crummy labels might be a much, much better strategy. So in a really nice paper by Chris Lin last year, he identified the aspects of the learning problem that affect this decision. So in particular, if the inductive bias is really weak, if you've got a very expressive representational language that you're trying to learn, for example many, many, many features, then relabeling is more important; otherwise unilabeling is more likely to be a good strategy. If your workers are really, really good, obviously you don't want to relabel. If your workers are really, really bad, you also don't want to relabel. If the workers are sort of at seventy-five percent, that's when relabeling gives the maximum benefit. If your budget is really large you might think you should just definitely relabel everything, 'cause why not? Your budget's really large. In fact, if your budget's really large you're much better off unilabeling and allowing the learning system to deal with the noise.
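The arithmetic behind that budget-of-nine example is easy to sketch, assuming independent workers and plain majority vote; the exact figures in the talk may come from a different aggregation model (for instance EM with worker weights), so treat this as a back-of-the-envelope check rather than a reproduction of the slide.

```python
from math import comb

def majority_accuracy(p, n):
    """P(the majority of n independent votes is correct), each vote correct w.p. p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

budget, p = 9, 0.75
for votes_per_item in (1, 3, 9):
    items = budget // votes_per_item
    acc = majority_accuracy(p, votes_per_item)
    print(f"{items} items x {votes_per_item} votes: per-item accuracy ~ {acc:.2%}")
```

Under these assumptions one vote gives 75%, three votes give roughly 84%, and nine votes give a bit over 95%: more accurate per item, but far fewer labeled items for the same budget.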
What Chris has been working on more recently, and just submitted a paper to AAAI on, is the problem of reactive learning. And that's basically like active learning: you're trying to pick what's the next data point to label. Reactive learning is, not just what's the next data point to label, but maybe should I go back and relabel something as well? I'm gonna trade off those two different kinds of actions. So a natural thing to do is take a good active learning algorithm like uncertainty sampling, and it turns out it doesn't work. It oftentimes loops infinitely and starves out all the other data points. And with expected error reduction, developed here, we actually see the same kind of behavior—looping behavior. Again… yeah? >>: I was wondering what loop means here? >> Daniel Weld: Loop means the system says, "I'm really uncertain about data point x so I want to get a label for that." Now it has the label, or maybe it's a second label, and it says, "Hmm, what am I most uncertain about? Actually I'm still really uncertain about x." And then from then on it only asks about x, 'cause at a certain point that label doesn't change the classifier and so its uncertainty remains on that point. And the basic idea is you need to take into account not just the classifier uncertainty, but also how likely a new label is to change the classifier. So that is the idea behind impact sampling; and the idea is: think about a new label, how likely is that going to be to change the classifier? And we want to pick as the next point to label the one that has the highest expected impact on the classifier. And there's actually many different variations on this idea. The first one is, should we actually look for the expected impact, or should we be optimistic and choose the data point which has the largest possible impact? Another one is, sometimes if you've labeled a point twice and you've gotten two positives, a third label is not gonna change your belief in that data point. But maybe… you still want the system maybe to go back. So if you do look-ahead search then you could realize, well, if we got two or three labels and they were all negative that would change our classifier, but any one label isn't gonna change the classifier. So in general, look-ahead search is much better than greedy search, but—you know—it's also much more expensive. So what Chris looked at was something called pseudo look-ahead where he calculates how many data points you would need to execute to actually come up with a change, and then divides that impact by the number of data points. Kind of a quick heuristic that gives you the benefits of look-ahead without having to actually do anything other than myopic search. And then of course, there's many, many data points and so to make any of these things practical you need to sample the data points that you're going to consider for the expected impact. And so here's a set of studies. The first are on Gaussian datasets with ninety features and we look at a number of different strategies, including passive sampling, expected error reduction, different kinds of uncertainty sampling that have been fixed to avoid the infinite loops, and impact sampling, which does much, much better on the simulated datasets. Another thing we've done is take a number of UC Irvine datasets, so real world datasets, and then use a noisy annotator model to generate simulated labels since all we have are the gold labels.
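Here is a minimal sketch of the impact-sampling idea as described above; it is a reconstruction for illustration, not Chris Lin's implementation. It scores a sampled set of candidate points by the expected fraction of predictions that would flip if one more label arrived and retraining happened, using a scikit-learn logistic regression purely as a stand-in classifier; in the reactive setting the already-labeled points would be scored the same way, and pseudo look-ahead would replace the single-label retrain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def impact_query(X_lab, y_lab, X_pool, n_candidates=20, seed=0):
    """Return the index (into X_pool) of the candidate with the highest expected
    impact: the label-probability-weighted fraction of pool predictions that
    would change if that point were added with that label and we retrained."""
    rng = np.random.default_rng(seed)
    clf = LogisticRegression().fit(X_lab, y_lab)
    base_pred = clf.predict(X_pool)
    proba = clf.predict_proba(X_pool)
    candidates = rng.choice(len(X_pool), size=min(n_candidates, len(X_pool)),
                            replace=False)
    best_idx, best_impact = None, -1.0
    for i in candidates:
        impact = 0.0
        for label, p_label in zip(clf.classes_, proba[i]):
            # Imagine receiving this label for point i and retraining.
            X_new = np.vstack([X_lab, X_pool[i]])
            y_new = np.append(y_lab, label)
            new_pred = LogisticRegression().fit(X_new, y_new).predict(X_pool)
            impact += p_label * np.mean(new_pred != base_pred)
        if impact > best_impact:
            best_idx, best_impact = int(i), impact
    return best_idx

# Tiny usage example on synthetic data with a handful of seed labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
labeled = np.concatenate([np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]])
pool = np.setdiff1d(np.arange(200), labeled)
print("next point to query:", impact_query(X[labeled], y[labeled], X[pool]))
```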
And here we see that on the internet ads we actually do quite a bit better and on arrhythmia we also do quite a bit better; there are some other datasets where, I should honestly say, we don't do much better. So that's too bad. And then finally, HA was nice enough to give us some Galaxy Zoo data and we've also tested on the relation extraction data, and we see that the benefits are slimmer unfortunately here with the real annotators. We get a definite improvement in relation extraction for some of the relations; on the Galaxy Zoo data it's indistinguishable from some of the other methods. And so that's where we are on that. Yes? >>: How do you… I mean, how do you even run this, like if your algorithm needs to, say, get another label for this exact point? >> Daniel Weld: Yeah. So the way we do this is we go out and we get a really large number of labels from the real annotators, and then we have the decision-theoretic controller, which doesn't actually go out to the crowd and instead goes to the database and says, "Give me what a person would have said if I'd gone to the crowd." That way we can repeat the process. >>: [indiscernible] >> Daniel Weld: Yeah, and that way we can also do a number of runs, randomize the workers, and get these confidence intervals. >>: What do you do for the arrival time of the worker in your model? >> Daniel Weld: So we're not modeling the time-varying behavior of the workers here and we are assuming that we're getting one label and then we're doing some thinking about what the best label is to get again and then we're going out to get another label. So if we actually ran this on Mechanical Turk in a live experiment there is a practicality challenge that it would be very slow because at any one time there would be only one Turker working on the task. There's ways of fixing that by—you know—asking, like, ten questions at once, you know, batching things up, that would make it more practical, but actually that's an important point that I should have made: we're taking a pretty optimistic case for our system 'cause it's assuming that the labels are very, very expensive and trying to do the best thing. Also, like expected error reduction, this is a pretty computationally intensive method. So if your labels are really, really cheap and your compute power is expensive, maybe you're better off with a simpler strategy. >>: Have you compiled the annotation error rate for these datasets and tried to see if the trends are changing with respect to the kind of error rate? >> Daniel Weld: We haven't and that's another key thing. In the earlier experiments with the simulated data we're telling the POMDP what the error rate is; in these experiments we're not telling it and it doesn't know about the error rate. So there's a couple reasons why we're not doing as well here; three possible reasons. One is there's correlations between the human annotator errors which we're not modeling in the simulation studies; another one, uh… I've just forgotten; and the third one is that it's something we don't know. So… rats, that's embarrassing. Anyway… >>: You know the reason I'm asking that question is when we were running this kind of analysis on the Galaxy Zoo data… >> Daniel Weld: Right. >>: … we actually saw that—you know—even in terms of solving the POMDP, the look-ahead has to be much larger than on the other datasets you've seen before to actually reason about the value of getting another label. >> Daniel Weld: Right.
>>: Because the error rate was so high that you have to reason with a long look-ahead to be able to really see the value of getting another annotation. So I'm wondering if the same kind of pattern is emerging for this study as well, that the error rate of workers is so high that even the look-ahead of the impact policy may not be, you know, enough. I'm talking [indiscernible]… >> Daniel Weld: Yeah, I know, I remember what you're saying. We haven't seen that yet and I think it's because of the pseudo look-ahead, but again, these experiments are ones that have just been run in like the last day or two, so… >>: You can take it offline. I'm [indiscernible]… >> Daniel Weld: Yeah. It'd be fun to talk more about it. >>: In those experiments the POMDP is learning what the error... the label error is as it's gathering labels. Is that… [indiscernible]? >> Daniel Weld: It is learning how often the workers disagree. Okay, so you guys probably believed in POMDPs when you came in, but hopefully I've argued that they're great for controlling crowdsourcing and that you can apply them to all sorts of different kinds of decision problems where you're trying to decide what to ask, who to ask, when to teach, how often to test, and so on. And we've looked at two different things: one is how do we get high quality data, which is especially good if you're looking for a test set. And then more recently—you know—forget about the data, what we really want is a good classifier; maybe more bad data is better than really good data. And—you know—these are—you know—our first steps towards a bigger vision about how do we build mixed-initiative systems where we combine human intelligence and machine intelligence, and wouldn't that be cool? Yes, it would definitely be cool. So let me just end by thanking my many, many collaborators, including Mausam and James who have co-supervised many of these people, and then also really, really wonderful funding agencies. And thanks very much. [applause] I see we're over time and you've already asked lots of questions, so if people need to leave you should certainly leave; if not, I love questions. Okay, we can take the mic off and then you can ask questions. [laughter] >>: So…