>> Andrey Kolobov: Hi everyone. Our guest speaker today is Professor Dan Weld from the University of
Washington, across the bridge. Over his career Dan has made contributions to many areas of computer
science. His… broadly speaking, his interests lie in making computers easier and more effective to use,
which includes various aspects of HCI, crowdsourcing. It used to also include sequential decision
making under uncertainty. I don’t know if that’s true anymore, but anyway that’s how I got my PhD
under his supervision. Dan is also an active entrepreneur; he’s a member of the Madrona Venture
Group; he has started several startups. And with that I’m going to hand over the stage to him.
>> Daniel Weld: Thanks so much. So it’s great to have a small audience, make it really interactive.
Trying to cover a reasonably broad expanse of material, some of which maybe people have seen, in which
case we can go right through that, and some of which I know you haven’t seen ‘cause I didn’t see it until
yesterday, and so we can spend more time talking about that. As Andrey said, one of my biggest
interests is making computers easier to use and especially personalized, whether it’s different kinds of interfaces on phones or tablets or whether it’s the agents in Cortana or Amazon Echo or what have
you, speech-based agents. And I think of the crowdsourcing work as fitting into that frame. The other
half of my life is machine reading and information extraction, and unfortunately I won’t really be able to
talk about that at all, although you’ll see I snuck it in part way through.
So thinking about AI and crowdsourcing, if you… there’s been lots and lots of work on AI models applied
to crowdsourcing, but almost all of it in this notion of collective assessment, making sense of the results
that come back from the workers. And this is a very, very passive view, very much focused on the state
estimation problem in AI and how do we sort of track what’s going on, track the skills of workers, track
what the right answers are, think of them as hidden variables, and so on. And the overall thesis of
today’s talk is that there’s lots more to AI and there are lots of other ways of applying it to crowdsourcing.
So we can use AI methods to optimize simple crowdsourcing workflows, trying to reduce
redundancy where we don’t need it and put more where we do. We can look at complicated tasks and
use AI methods to optimize large-scale workflows, and this subsumes A/B testing, which is really common
in crowdsourcing today as well as even more complex workflows. We can think about ways of routing
the right tasks to the right people and we can also think about ways of making the crowdsource workers
themselves more skilled and… which includes both testing them and also teaching them. So all these
things are, as Andrey pointed out, sequential decision making problems and so we can use our favorite
AI planning algorithms to control most of these tasks. So that’s gonna be the thesis: that intelligent control, in particular sequential decision making, is essential to get the most out of crowdsourcing.
And so you’re gonna see a variant on this slide over and over again. The basic idea, some sort of task
comes in, an AI system intelligent controller decides on exactly what to do next, generates a task, sends
it off to the crowd, interprets the result with some probabilistic reasoning and then either returns the
result or else maybe chooses another task. Yeah?
>>: Do you see a one-to-one mapping to task to job, or like, is it the task comes in and needs to get
done and then… or does that task get broken into pieces?
>> Daniel Weld: For most… I mean, I don’t think it has to be a one-to-one; for most of the things I’ll talk
about today it is one-to-one.
Okay, so let’s start out with the simple base case and I’m assuming everybody knows about
crowdsourcing so I’ll try to spare you all those kinds of slides, but do we have any bird watchers in here?
So is this actually an Indigo Bunting? Yes or No? [laughter] No it’s not. This is actually a bluebird. But
the point is obviously you can send jobs out to Mechanical Turk and actually the workers are pretty good
and people who—you know—know lots about birds actually will probably do better than you guys, but
the results that come back are gonna be noisy. Hopefully though, on average, they’re going to be right.
Typically the way that people deal with this is by asking a number of different workers, assuming the
majority is right, using some form of expectation maximization to come up with better estimates for the
hidden variables. Those hidden variables are both, “What’s the correct answer to the
job?” and also, “How accurate are the different workers?” And this is great. There have been lots and
lots and lots of papers about this, but it’s very passive, as I said earlier. So it doesn’t address things like,
“Well, how many workers should we ask?” Typically, systems ask the same number of workers. Hard
problems may need more, easy problems may need less, really hard problems are maybe so hard that
there’s not really any point in asking too many questions; better to skip. So our objective is to maximize
the quality of the data minus the cost that we’ve paid to the crowdsource workers.
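As a rough illustration of the aggregation step described above, here is a minimal Dawid-Skene-style EM sketch in Python; the binary-label model, the 0.75 prior on worker accuracy, and all of the names are illustrative assumptions rather than the actual system:

    from collections import defaultdict

    def em_aggregate(votes, n_iters=20):
        """votes: dict item -> list of (worker, label) pairs, labels in {0, 1}.
        Hidden variables: P(true label = 1) per item and an accuracy per worker."""
        p_true = {item: 0.5 for item in votes}
        accuracy = defaultdict(lambda: 0.75)          # prior guess for unseen workers
        for _ in range(n_iters):
            # E-step: posterior over each item's true label given worker accuracies.
            for item, ballots in votes.items():
                odds = 1.0                             # uniform prior over {0, 1}
                for worker, label in ballots:
                    a = accuracy[worker]
                    odds *= a / (1 - a) if label == 1 else (1 - a) / a
                p_true[item] = odds / (1 + odds)
            # M-step: re-estimate each worker's accuracy from expected agreement.
            agree, total = defaultdict(float), defaultdict(float)
            for item, ballots in votes.items():
                for worker, label in ballots:
                    agree[worker] += p_true[item] if label == 1 else 1 - p_true[item]
                    total[worker] += 1
            for worker in total:
                accuracy[worker] = min(max(agree[worker] / total[worker], 0.01), 0.99)
        return p_true, dict(accuracy)

Calling it on, say, {"img1": [("w1", 1), ("w2", 1), ("w3", 0)]} returns the posterior that img1 is a positive example along with per-worker accuracy estimates.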
So in the simplest case here we’ve got some sort of yes/no question. We decide, “Should we ask
another worker?” If so, we send it off to the crowd, we get a result. Now we have to do some
probabilistic reasoning, usually with EM to update our posteriors. Now we come back again. Do we
need to ask—you know—still more workers? If not, we return the job—you know—our current
estimate, the maximum probability estimate. And you can view this as decision making under
uncertainty in a partially observable Markov decision process. In fact, this one’s a belief MDP or a
simple kind of POMDP, but that’s the basic idea.
For those of you… how… let me just go through the POMDP slide. So a POMDP is simply a set of states; a set of actions, in this case generating a ballot job or submitting the best answer; a set of transitions with probabilities to take us to the different states; and an observation model, so that each time we take a transition we get some sort of noisy observation. The cost is the money spent per job we send out to Mechanical Turk or a labor market, and the reward is some sort of user-defined penalty on the accuracy.
And again, we’re trying to maximize the expected reward minus the cost. And you solve these using
Bellman’s equations and dynamic programming, and I’ll skip that stuff. So in action if we see this we
send a ballot job out that allows us to update our posterior. We send another ballot job out, we update
our posterior. At this point we’ve got enough confidence and so we just return our result. So this
model’s actually very similar to some of the work that HA has done and like her work on Galaxy Zoo
we’ve found that this kind of intelligent control approach really does much, much better. So we get
much more accurate answers when we’re controlling for cost than if you used a static workflow even
with EM.
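A minimal sketch of the kind of decision the controller is making, assuming a single known worker accuracy and a user-specified penalty for returning a wrong answer; this is a myopic one-step lookahead, not the full POMDP policy from the talk:

    def should_ask_another_ballot(p_correct, penalty_for_error, cost_per_ballot,
                                  worker_accuracy=0.75):
        """Is one more ballot worth its cost, looking just one step ahead?"""
        def expected_loss(p):
            # Loss from returning the maximum-probability answer at belief p.
            return (1 - max(p, 1 - p)) * penalty_for_error

        a = worker_accuracy
        # Probability the next vote agrees with our current best answer.
        p_agree = p_correct * a + (1 - p_correct) * (1 - a)
        # Bayes update of the belief under each possible observation.
        p_if_agree = p_correct * a / p_agree
        p_if_disagree = p_correct * (1 - a) / (1 - p_agree)

        u_stop = -expected_loss(p_correct)
        u_ask = -cost_per_ballot - (p_agree * expected_loss(p_if_agree)
                                    + (1 - p_agree) * expected_loss(p_if_disagree))
        return u_ask > u_stop

The real controller solves the belief MDP with Bellman backups, so it can also reason about several future ballots at once, but the structure of the tradeoff is the same.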
Okay, so that’s simple tasks, and now what I want to do is talk about slightly more complicated tasks. So
how do we put those building blocks together? So… yes?
>>: Sorry. Can you go back to that graph? I’m just… I just want to get a sense of how much difference it
makes—or one more back. This is by… not by how many answers you ask for but like the value of
correct answer. I don’t really understand what that means ‘cause… [indiscernible].
>> Daniel Weld: So what’s the value? So consider there’s sort of a trade-off between… to make sense of
this, obviously, you have to pay, say, one dollar for each job you send to Mechanical Turk or one cent, or
what have you. The user gets to specify a utility function saying how much, you know, what’s the cost of
false positives, basically? And so this x axis is basically saying, “As you vary that penalty then you get… in
fact, as you increase your interest in accuracy then you can do much, much better at the same… for the
same price.”
>>: Oh, so you’re paying the same amount.
>> Daniel Weld: You’re paying the same amount to the two different workflows.
>>: Okay. Alright, cool.
>> Daniel Weld: But, and again, I’m sort of going…
>>: No, I know you’re trying to go past this. [indiscernible] to get a feeling of how much difference the
POMDP’s make. It’s pretty significant.
>> Daniel Weld: It makes a big, big difference.
>>: Yeah.
>> Daniel Weld: And again, HA’s found similar kinds of things in the Galaxy Zoo, in the Galaxy Zoo work.
>>: Sorry. I have a problem understanding the experiments. So how did you do the experiments on Mechanical Turk?
>> Daniel Weld: I’m sorry, say … I’m …
>>: No, I use apps on Mechanical Turk, but I don’t know how to do a sequential crowdsourcing app on Mechanical Turk; it seems [indiscernible] the support for sequential crowdsourcing.
>> Daniel Weld: So the way we do most of these jobs is by putting up one job on Mechanical Turk and
if—you know—maybe many instances of that, but it’s when the worker actually comes and clicks on it
we generate a dynamic page and decide exactly what we’re going to ask that worker at that time. Once
they actually click we know a little bit about them in terms of their—you know—our history with them.
>>: I see, it’s a typical page. I see.
>> Daniel Weld: It’s all dynamic. Yeah, if you use the Amazon interface you… it’s much harder to do
these things.
>>: Have you used TurKit in… during these experiments?
>> Daniel Weld: I’m sorry. Say again.
>>: There is the TurKit toolkit for sequentially posting tasks. Lydia worked on it and…
>> Daniel Weld: Yeah, yeah, Michael Tumen’s thing, and Chris and Jonathan are also trying to build up a toolkit which is taking parts—they’ve used that too. It’s taking parts of that but then trying to put more of the sense in. So one of our longer-term objectives is actually to put out a much better set of tools that have this kind of POMDP stuff built in.
Okay, so let’s go to more complicated workflows, and here I’m gonna show what’s called an iterative
improvement workflow developed by Greg Little along with Lydia and Rob Miller a number of years ago.
And the basic idea is you take some sort of initial artifact—I’ll make this more concrete in a second—you
ask a worker if they can improve it. Now you’ve got the original and the improved. Now you ask a
different worker, “Which one of these two is better?” You take the better one—maybe there’s many
votes—you take the better one and you come back and you try to improve it again. So here’s an
example, and the task is: given a picture, create a description—English language description—of the
picture. So here’s the first version and after going around this iterative improvement workflow with
three votes at each iteration, going around it eight times, that’s the description that comes out, which is
a pretty fantastic description. It’s probably better than any one person could have written if you’d paid
them the whole sum. In fact, Greg showed that it was, so it sort of demonstrates the wisdom of the
crowds, and it’s really, really cool. And he showed that it works on many other things including
deciphering handwriting and… which is incomprehensible gibberish if I look at it, but the crowd got it.
But there’s a whole bunch of questions that aren’t answered in this and one is like, “Well, how many
times… how’d you know to go around eight times?” Another one is, “Why ask three people?” In fact
what he did was he asked two people and if they agreed he figured that he was done, and if they
disagreed he asked a third person to disambiguate. Why is that the right strategy for making the
decision about which of these two things is better? So one thing that we did, with Peng Dai, a former PhD student of Mausam and myself, is basically, again, frame it as a POMDP, and here is—you know—do we
actually need to improve the description? If so, we send out a job saying, “Please improve it.”
Otherwise we basically say, “Well, are we sure which one is better?” If not, then maybe we should ask
some people to judge it, and we go around that loop a number of different times. That’s basically the
same simple decision-making we saw a few slides ago. And then we come back again and see, “Can
we… do we need to improve it more or should we just send it back?” And again, for coming up with descriptions of images, when controlling for cost, we see that the POMDP generates much better descriptions than the hand-coded policy, which was the one out of Greg’s thesis. And if you instead, quote, control for quality, it costs thirty percent less labor to generate descriptions of the same
quality. What’s even more interesting is why. So if you look at Greg’s policy, you see that as it goes
through the iterations it asks about two and a half questions per iteration, right? ‘Cause it asks two
people; if they agree, then it just decides that that one’s better; if they disagree it gets a third person to
disambiguate. What the POMDP does is this, and when I first saw that I—you know—was like, “There’s a bug. Go back; fix the POMDP solver.” And it was only after we followed that a little bit that we realized
actually it’s doing exactly the right thing, because in the earlier iterations, like, even a chimpanzee can
make the description better. And then after it’s gone around a couple times, then sometimes the
workers actually make it worse, not better. And so you want to spend more of your resources deciding
whether you’ve actually made progress or whether you’ve moved backwards. And furthermore, you
can now go round the loop one more time with the same budget. So POMDP is actually doing
something that’s pretty smart.
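A hedged sketch of the overall loop, with the crowd calls and the controller left as placeholders; improve_fn, vote_fn, and policy are assumed interfaces, and the policy object stands in either for Greg’s fixed two-or-three-vote rule or for the POMDP controller:

    def iterative_improvement(artifact, improve_fn, vote_fn, policy):
        """improve_fn(artifact) -> one worker's improved candidate.
        vote_fn(a, b) -> True if a worker judged candidate b better than a.
        policy decides whether to improve, gather another vote, or stop."""
        while policy.wants_improvement():
            candidate = improve_fn(artifact)
            votes_for, votes_total = 0, 0
            while policy.wants_another_vote(votes_for, votes_total):
                votes_for += int(vote_fn(artifact, candidate))
                votes_total += 1
            if policy.believes_better(votes_for, votes_total):
                artifact = candidate
        return artifact

The interesting behavior described above, spending more votes in later iterations, lives entirely inside the policy; the workflow skeleton itself doesn’t change.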
>>: Can you say something about the… how to decide whether you need more improvement? How does the model know that?
>> Daniel Weld: So there’s lots and lots of details here. It’s modeling the quality of the descriptions
with a number between zero and one, which sort of maps to the probability it thinks that the next
worker is likely to actually make it better versus making it worse. And again, what is driving the whole
thing is the utility function the user has to specify, which basically says, “How much do I value different
quality levels?” So that’s an input and that’s what the system is using to trade off whether it’s better to
go… to keep trying to improve it or whether it’s better to go on to the next image. Yep?
>>: In this version was there an offline data collection to learn the [indiscernible]…
>> Daniel Weld: In this version there was. We actually experimented with a number and Chris later… so
this is using largely supervised learning to learn the model. In some follow-up work that Chris did, both
on this workflow and then for most of the workflows that are coming, we do reinforcement learning.
But this earlier work used supervised learning to learn the models. More questions?
Okay. So what I want to do now is talk about complex tasks and I talked about iterative improvement
and now I want to talk about taxonomy generation, and I’m going to describe some of the work that
Lydia Chilton did and so it will be a brief… a couple slides away from POMDPs and then we’ll get back to
POMDPs in a minute. So what Lydia was interested in is how do you take a large collection of data and
actually make it easier for people to make sense of and to browse? So for example, you’ve got a whole bunch of pictures; it would be really great if we could organize those into a taxonomy like the one here. So
we’ve got our tiger pictures which are animal pictures and we’ve got our pictures of workers which are
people. How do we both generate this taxonomy and also populate it with the different data items?
Similarly, you might have textual responses from some QA website and we’d like to, again, taxonomize.
And these are about air travel—you know—tips for saving time at the airport, and some of them have to do with packing ahead of time, some of them have to do with getting through security. How do we come up
with that taxonomy to make it easier for people to browse the data? So generating a taxonomy is really,
really hard for the crowd because a good taxonomy means you need to look at all the data, but then
how do you actually parallelize it? So Lydia tried a couple things. The first thing she tried was iterative
improvement where there’s sort of a taxonomy on one side and some items on the other side, and it
didn’t really work at all because the taxonomy became overwhelming to the workers and the workers
got confused about what they were supposed to do and she couldn’t make it work.
>>: So were people… everyone was seeing the same taxonomy growing over time?
>> Daniel Weld: Yes.
>>: Everyone could [indiscernible]?
>> Daniel Weld: Everyone… every… I mean, in the implementation she tried the workers could do it. They could add to it, they could edit the taxonomy and they could place things in it, and then the next worker would see it and try to improve it. And people would judge—you know—is this taxonomy better than that taxonomy? And it was just very hard for the workers to do those tasks. So the lesson was: these tasks are too big and complicated; you need to decompose them.
So the next thing she tried was asking workers about different possible nodes in the taxonomy. “Is this
one more general than that one?” This was sort of decomposing it into smaller problems, but the workers really
found it hard to make those judgements because without context they didn’t really know whether one
was a subclass of the other or whether it went the other way around. So the take away was you don’t
actually want to ask workers about these abstractions, you need to ask them about concrete things.
And she tried a couple other things but ended up with an algorithm called “Cascade,” where she uses
the crowd to generate the category names, to select the best categories, to place the data into the
categories, and then uses the machine to then turn that into a taxonomy. So I’ll illustrate this on colors
since that’s the easiest way. First thing to do is to subsample the data into a small set and then generate
the categories. So for each color you would ask the workers, “Well, what’s a category that might be
good for this?” And so you get a bunch of categories out, and these are sort of the candidates. The
second thing is to go through and for each color go through and say, “Well, for this color, which of these
categories looks like the best one to describe it?” And that allows you to get rid of vague categories and
to get rid of noise in the data. And then the third thing is to go through each color and basically say, “Okay, which categories does this one fit into?” So this color’s both green and greenish. And then you sum up those things, and then the final machine step basically looks at this and is able to eliminate duplicates and look at the sort of subset/superset relationships, and it outputs the categories. So it outputs the whole taxonomy. So that’s… yeah?
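A small sketch of just that final machine step as described, assuming the crowd steps have already produced, for each surviving category, the set of items placed in it; duplicate elimination and the recursion over the remaining data are left out:

    def build_taxonomy(membership):
        """membership: dict category -> set of item ids the crowd put in it.
        Nest each category under the smallest category whose item set strictly
        contains it; categories with no such superset become roots."""
        parents = {}
        for cat, items in membership.items():
            supersets = [(len(other_items), other)
                         for other, other_items in membership.items()
                         if other != cat and items < other_items]
            parents[cat] = min(supersets)[1] if supersets else None
        return parents   # category -> parent category (None for roots)

On the color example this would put light blue under blue because every item the crowd called light blue was also called blue.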
>>: So all the categories are done initially and then they’re fixed going forward. So like, do you ever go
back and revise the category later?
>> Daniel Weld: You know… no… well, yes—yes and no. No, I’ll say no first and then I’ll come back and
say why yes. No you don’t; this is the whole process. Now in fact, what I think Lydia should have done is
now put an iterative improvement step on top of this afterwards to sort of prune-up the categories and
make sure that sibling nodes are—you know—are appropriately—you know—it makes sense… sibling
relationships make sense, and so on. But this… and I think there’s lots of follow-on work some other
people have done, but this is the way she left it. So that’s the “no, but she should have” answer, and
then the other answer is… the “yes” answer is actually… the final step is to recurse, ‘cause we started by
subsampling so now we need to go back and take all the rest of the data and try to put that into the
taxonomy, and sometimes when you do that you realize that some of these new items don’t have a
place to fit and you have to actually add new elements of the taxonomy and see where they go. So the
recursion is a little bit more complicated than what I just said.
>>: It’s probably [indiscernible] you just said, so I was wondering if, when you say category, it’s actually a flat thing, or is that already a hierarchy [indiscernible] thing?
>> Daniel Weld: So here we see light blue as a subcategory of blue so it’s building a deep hierarchy.
That said, the individual workers are only making judgements about, “Does this fit in this particular
category?” So that substructure of the hierarchy comes because some categories overlap very strongly
with another category.
>>: Probably, I have my … what if you have these flat and then what if you have too many categories
and it’s actually hard for older [indiscernible]?
>> Daniel Weld: Yeah, and this gets in to the whole problem of evaluation and—you know—what’s the
purpose of the taxonomy? Some taxonomies, maybe you don’t want them to be too deep, if you’re
using it for another reason maybe you do want it to be very deep. That’s actually really hard to evaluate
how good a taxonomy is, and it’s a very squishy thing and was actually the big thorn in Lydia’s side and
one of the reasons why she decided not to keep working on this problem afterwards. Instead, as I
mentioned earlier, she’s working on joke telling, which seems squishy too but actually it’s pretty easy to
measure whether a joke is funny or not. And so that’s what she is doing next. Yep?
>>: So what would happen if you had pictures of tigers and then pictures of elephants and pictures of
some very disjoint classes so there’s no subset/superset relationship and everyone very clearly labels it
as a tiger or an elephant and they all agree on that. So then, would the hierarchy be flat or would it
[indiscernible]…?
>> Daniel Weld: The hierarchy would be totally flat, but hopefully somebody would’ve proposed… so I
mean there would be a single root, I guess, but hopefully somebody would’ve proposed “animal” as
being a good thing and all those things would’ve fit into animal. And so we’d get a substructure where
animal’s at the top, tiger, elephant are down below. Again, I should just point out, this is a single
workflow—I think it is a very cool workflow—it worked much better than the other things she tried. It’s
definitely not perfect and I think there’s a number of ways that one could improve it. So like one thing
that it doesn’t do well is it doesn’t make sure that siblings are parallel in any way, but there’s other
things it doesn’t necessarily guarantee either. Still, it works pretty well.
How well does it work? Okay, so it’s very difficult to evaluate this, but what she did was get a number of different, in this case, textual descriptions and then get a bunch of people from the Information School and ask them to build taxonomies, and then look at the overlap between one human and the other humans and the computer versus the other humans. And she found that actually the inter-annotator agreement was pretty much as good, or almost as good, between the Cascade algorithm and
the other humans. One human versus the other humans was slightly better, but I think the quality
overall is really pretty good.
The other thing she did was look at how much it cost and unfortunately here as the jobs got bigger it
ended up costing quite a bit more to use Cascade than it did to just hire a single person to do it. And,
you know, the flip side though is actually it’s really hard to hire these people to do it, especially if you
need experts. Sometimes it’s basically impossible to do it and this can not only… and so it takes a long
time, that’s why we don’t have a lot of these hierarchies, and the crowdsource workflow can be
parallelized and actually produce these outputs very quickly. So that’s kind of cool. Just as a segue, at
the Madrona Venture Group we saw a company, we actually didn’t invest in it, but it got funded, it’s
pretty cool, and what it does… it tackles the problem of evaluating surgeon performance. So they
videotape a surgeon doing an operation, possibly on a simulator, and then they have three expert
surgeons watch and see how competent is that resident? And the problem is the expert surgeons are
really busy and they don’t want to do it and they just don’t do it, and there’s this huge backlog, you can’t
get the feedback back to the residents. What they did is they put this out on Mechanical Turk and they
were able, by asking enough Turkers, to get expert-level estimates, sort of indistinguishable from the surgeons’, about how well the residents were doing, and they got these very, very quickly. And so now
they’re selling that to hospitals.
>>: So, Mechanical Turk, I mean, who are the people who are giving feedback, just completely…?
>> Daniel Weld: Could be you.
>>: Yeah, but like, who are they actually? I mean, I understood it could be me, but…
>> Daniel Weld: They’re just ordinary Mechanical Turkers.
>>: So you don’t know if, I mean… ‘cause they could’ve been doctors. I don’t know why, I mean…
>> Daniel Weld: They could’ve been doctors. Yeah.
>>: You don’t know. You don’t know…
>> Daniel Weld: I don’t know, but I’m willing to bet that they’re not doctors. I’m guessing that they’re—
you know—housewives and househusbands and… I mean, they have to go through a little training
phase. But actually, you know, if you look at the video [laughter] like, one person, like, drops something
and then—you know—it’s like their hand is kind of like my hand and whereas the expert is like very
smooth and fluid and the motions are… it’s actually… it makes sense. Yeah, go ahead.
>>: So Dan, I was just going to say, like, just to make sure I understand your procedure correctly. There
seems to be a pretty interesting connection between this and what’s done in cognitive science in terms
of norming. That’s where the task is: if you give me a word I’m first supposed to think of what properties this word has. Like, if it’s a dog, it has fur, it has four legs, it’s a mammal, et cetera. I’m asked to do that. Then there is the second stage, which is exactly paralleling this. Given that, I’m asked, like, does the dog have… does this thing have fur? Now I write… I check a lot of objects as having that property or not. The end result is a matrix, and then that’s given to a clustering algorithm to do whatever we want. It seems like that’s somewhat close to this.
>> Daniel Weld: I think, yeah, I think there’s a lot of overlap.
Okay. I’m looking at time. I want to keep going. So Lydia was very disappointed by that cost result, but
I was overjoyed ‘cause remember we haven’t talked about POMDPs in a while. So the natural thing to
do is to put decision-theoretic control on top of this. And the first step is like why is this workflow
expensive? And the reason is because of this SelectBest step where you have to ask lots and lots and
lots and lots of SelectBest questions, because it’s one question for each color or for each element and
for each possible category. The categorize step doesn’t take very many. In here she was asking five workers each one of those questions, and there’s sort of a—you know—squared number of questions. Do you really need all those questions? No, you don’t really need all those questions. And furthermore, if
you optimize the order of the questions you could do much better. So again, you can frame this as a
POMDP and here you’re basically saying, you know, “What questions exactly should we ask?” And then
we’re gonna do an update on: what are the probabilities of the labels for this particular element; what is the label co-occurrence model; and then also what’s the accuracy of the worker that we’re talking to, based on agreement with other workers? We’re gonna update all three of those models; we looked at a number of different graphical models to represent the co-occurrence model, which is the big one. And the net effect is, as we look at doing this multiclass categorization, the more questions you ask the higher the performance you get. And so this is a model which is doing joint
inference with a sort of simple probabilistic model and is greedily asking the best possible question. And
what happens is we get the same percentage, the same accuracy that Lydia was able to get, but we only
have to ask thirteen percent as many questions. And all of a sudden now it’s much cheaper than asking
an expert to do the same thing. So that’s pretty cool.
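A minimal sketch of the greedy question-selection idea, using plain entropy as the score; the actual system scores questions under the joint model with label co-occurrence and worker accuracy, which isn’t reproduced here:

    import math

    def pick_next_question(p_member):
        """p_member: dict (item, category) -> current probability, under the
        model, that the item belongs to that category."""
        def entropy(p):
            p = min(max(p, 1e-9), 1 - 1e-9)
            return -(p * math.log(p) + (1 - p) * math.log(1 - p))
        # Greedily ask about the (item, category) pair the model is least sure of.
        return max(p_member, key=lambda pair: entropy(p_member[pair]))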
Okay. Let’s go on to the next question which is task routing and that is really the following model: let’s
assume that workers have different levels of skill and that the questions we have are of different difficulties; how should we match those up? And actually I should have Andrey give this part of the talk since he
was a coauthor on the paper. But basically, at each time step we’re gonna want to assign jobs to
workers and intuitively what we’re probably gonna want to do is assign the really hard questions to the
really skilled workers and the easy questions to the not so skilled workers. Although, maybe if a
question’s so hard that nobody’s gonna get it right—you know—we shouldn’t ask it of anybody, or we should give it to a less skilled worker and save—you know—a medium-difficulty question for the skilled worker. There’s
many different variations on this problem depending on whether you know the worker’s skill, whether
you know the question difficulty, whether you don’t know either one. And in fact, it’s a difficult problem
even if you know the skill and the difficulty, and of course it’s even harder if you have to learn those
along the way. But you can frame this, again, as a POMDP—you’ve seen that diagram before—and here
are some results that we got… yes?
>>: Is there an assumption that you get to choose what worker you’re giving the next task to, or...
>> Daniel Weld: Yes.
>>: … is the model that the worker comes and then …?
>> Daniel Weld: The model is that the worker comes, but once the worker comes we have our past
relationship with that worker so we know how… we have some information about how good they are,
and then we can give them any question that we want. We’ve got a—you know—we have expectations
about which workers might be coming as well.
>>: Okay. You don’t get to control that though…
>> Daniel Weld: No we don’t get…
>>: … and say like, “I want more of this worker…”
>> Daniel Weld: Again, there’s different models where you could do that and there are some platforms
where you could do that, but certainly Mechanical Turk you can’t do that to… yeah, you get the workers
that come.
Okay, so this was looking at problems doing named entity linking in natural language: for a noun phrase, figure out which Wikipedia entry it’s talking about. And round-robin is a pretty natural baseline that people use, and the decision-theoretic controller did quite a bit better. The curve may not look that much higher, but to get to ninety-five percent of, sort of, the asymptotic accuracy it takes only half as much labor if you actually do that assignment correctly.
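A very crude sketch of the routing intuition under the assumption that skills and difficulties are already known; the controller in the paper also handles the case where these have to be learned, which this ignores:

    def route_question(worker_skill, open_questions):
        """open_questions: dict question -> estimated difficulty (higher is harder).
        Give the arriving worker the hardest question they are still expected to
        answer correctly, falling back to the easiest question otherwise.
        A crude stand-in for the decision-theoretic router."""
        answerable = {q: d for q, d in open_questions.items() if d <= worker_skill}
        if answerable:
            return max(answerable, key=answerable.get)
        return min(open_questions, key=open_questions.get)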
Okay, this was supposed to say “interlude.” At the beginning of the talk I said half of my life is natural
language processing. And in particular, what we’re interested in doing is information extraction, going
from news text or web text to a big knowledge base, which is schematically drawn there. And what
we’ve done in our NLP papers is look at different kinds of distant supervision that allow us to train these
information extractors without any human labeled data. And one of them is sort of aligning a
background knowledge base to a corpus heuristically and we got a whole bunch of papers on how to do
that. And another kind is looking at newswire text and using some cool co-occurrence information to
automatically identify events and then we automatically learn extractors for those events. Some really
cool papers on that. But these are like two separate camps and I had one group working on
crowdsourcing and another group working on information extraction, and so the obvious thing is how
come we’re not using informa… crowdsourcing to get some supervised labels which could make our information extraction do better? And then we could train using a mixture of distant supervision and
supervised data. And so that’s what we’ve been working on lately, and so let me give a couple
observations. The first is that we’re not the only people to think of this. So Chris Callison-Burch at UPenn spends two hundred and fifty thousand dollars a year—this is an academic researcher—spends two hundred and fifty thousand dollars a year on Mechanical Turk. The Linguistic Data Consortium has forty-four full-time annotators. All they do is create supervised training data for the rest of academia. So
generating this training—and I know you guys here at Microsoft do a lot of training data generation as
well. So a number of people have tried doing this, combining distant supervision and crowdsourced supervision for relation extraction, and they’ve reported that it doesn’t work at all. So in contrast to some previous things, actually they’re getting really crappy data out. And Chris Ré’s group said, “Getting the labels from Mechanical Turk, it’s worse than distant supervision. Just doesn’t really help at all.” Of course, he’s a database researcher and he’d been scaling up distant supervision to, like, terabyte-scale corpora, so
maybe he had an incentive to say that his database techniques would work better ‘cause he’s actually
not really a crowdsourcing researcher even though he did the paper. And then a reasonably recent
paper by Chris Manning’s group at Stanford says—you know—with some fancy active learning we can
actually do a little bit better, but fundamentally there’s a bunch of negative results, which just struck me
as a… like, it’s wrong. They should be able to do much, much better.
So what’s going wrong? One thing is that they’re not using the latest crowdsourcing. Another thing is
that they’re actually not training their workers very well. And I think the dirty little secret of
crowdsourcing is the instructions really, really, really matter. And like, nobody can write a paper saying,
“We improved the instructions of our crowdsource job and we got a fifty percent improvement in the
quality of the results.” But in fact, that’s the truth and… so I’m still trying to figure out how we can get
that instruction optimization into a research paper and how we can automate that process. Sometimes
it’s just iterative design, but there’s got to be something in there. Another thing is they’re not qualifying
and testing the workers very well. And a final thing is, generally speaking, they’re thinking about data
quality, not the quality of the classifier. In fact, for all of the different things I’ve shown before we’ve
been trying to get high quality data out, not trying to generate a high quality classifier, and those two
things are not necessarily the same. So anyway, we’ve been pushing on this. And so the rest of what
I’m gonna talk about today is hot off the press, actually it’s not even on the press. It’s hot out of the
experiments and none of the stuff is published and some of it’s subject to change, but I hope I’ll get
feedback.
>>: Yeah, I was wondering that, given what you said about instructions, I mean, how then can we trust
the results in crowdsourcing literature if instructions matter that much? I mean, I understand for simple
tasks, for primitive tasks, okay, they’re kind of self-evident with [indiscernible] for something more
complicated. I mean when somebody says that they can improve the accuracy by this much and now
that paper says you improve the accuracy by this much, I mean, how…?
>> Daniel Weld: Well I think you have to look very carefully at how… what their testing methodology is
and—you know—are they… what are they using to test, you know? And if the instructions that they’re
giving to the workers don’t match the instructions they’re giving to the graders, then that presumably is
just gonna cause their performance to be bad. Or if they put instructions out to the workers but they’re
not actually making sure that the workers are reading those instructions, that’s gonna lead to bad
performance. I don’t think you have to worry about the performance being better than expected. I
think the question is if you’re getting bad performance, why are you getting bad performance?
>>: Yeah, but when you’re comparing two papers, where one says—you know—they perform at this level and the other one says we perform at this level, it could be that the difference is explained
away by [indiscernible].
>> Daniel Weld: Absolutely. Absolutely. And a lot of those papers they never actually said how good
was the data they were getting out of the crowdsource workers. They got data from the crowdsource
workers and they put into the learning algorithm and they said, “Gee, it’s not working very well.” Did
you see whether or not the annotations were good or were the annotations not good? We got some of
the data, the annotations are not good, so they’re not doing quality control right. Yeah?
>>: Sometimes people use instructions to convert the subjective task to an objective task.
>> Daniel Weld: Right.
>>: For example, maybe people have different opinions about a task but they don’t know how to handle
subjectivity very well in quality management, so I’ve seen people writing very detailed instructions
saying, “This is how you should judge.” And at that point you actually turn a subjective task into an
objective task. Is this an issue in these kinds of tasks as well? For example, if you had experts, would they completely agree there’s an objective way of extracting information from these news articles, or is there a subjectivity piece in here as well?
>> Daniel Weld: All of the tasks that we’ve looked at are objective tasks pretty much, if there is such a
thing, and I think you’re absolutely right, it’s much, much harder if you’re dealing with subjective tasks
as well. But, I mean, like… it’s tricky; for example, in this information extraction, right: Joe Blow was born in Barcelona, Spain. Does that give positive evidence that he lived in Barcelona, Spain? You know, it’s a
corner case. In fact, the annotation guidelines say, “No, that does not mean that he lived there.” I like…
my thinking about topology is well, there’s like a half open interval where he lived… anyway, but the
bottom line is—you know—if that’s the annotation guideline that you’re going to be grading against,
you better make sure the workers know that ‘cause it’s possibly counterintuitive.
Okay, so I don’t want to talk about this, it’s NLP stuff and we haven’t published it yet, but we’re actually able to do much, much better, and now the question is how can we automate that? So let’s talk about
how we automate that. First thing I want to talk about is making sure you’ve got skilled workers. So
here, the work that Jonathan Bragg is working on right now is basically, suppose that at every time step
you’ve got a choice: should I give the worker some training? Should I give the worker a test to see how
good they are? Or should I actually have the worker do some work for me? Keeping in mind—you
know—that just because I trained them they won’t necessarily get much better, and furthermore they
might leave at any time so if I train them too much and then they quit I’ve wasted a lot of money. So
how should I optimize that? And there’s been some related work, for example, by Emma Brunskill and
Zoran Popović at UW on using POMDP models to teach better, which I think is really, really cool. But this is a different problem because we actually don’t care so much whether they’re learning, we care about
getting the most work done, and… so it’s a different objective. So it’s a POMDP and, yeah, that’s what it
looks like. And let’s go on.
So here’s some of… this is just looking at the POMDP model and the behavior we get out; this hasn’t got teaching, it’s just got testing. You can see what the system’s doing: it’s doing a lot of testing up front
and then it’s firing people. It’s a log scale which is why it looks a little funny. And then it’s getting some
work done and then it’s a little bit worried that maybe they’ve forgotten, so it’s testing a little bit
more and it’s backing off the testing slowly as it goes on.
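A minimal sketch of the three-action decision with the two worker classes from the simulation (ninety percent and sixty percent accurate); the thresholds and the purely myopic structure are illustrative stand-ins for the solved POMDP:

    def update_after_test(p_good, answered_correctly, good_acc=0.9, bad_acc=0.6):
        """Bayes update of the belief that this worker is from the accurate class,
        after seeing whether they got a gold question right."""
        like_good = good_acc if answered_correctly else 1 - good_acc
        like_bad = bad_acc if answered_correctly else 1 - bad_acc
        numer = p_good * like_good
        return numer / (numer + (1 - p_good) * like_bad)

    def choose_action(p_good, fire_below=0.2, work_above=0.8):
        """Threshold policy over the belief: fire, keep testing, or give real work."""
        if p_good < fire_below:
            return "fire"          # replace with the next worker
        if p_good < work_above:
            return "test"          # ask a gold question with a known answer
        return "work"              # ask a real question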
>>: So is the radius to fire people?
>> Daniel Weld: Sorry, what’s the question?
>>: What was the… where’s the teaching part? Or is [indiscernible]…
>> Daniel Weld: This does not have teaching.
>>: This is without teaching?
>> Daniel Weld: This is with just… sorry. He’s done some work on the teaching too but the results I
have to show today are actually just where there’s three actions. You can either ask somebody to do
some work; you can test to see how good they are; or you can fire them and take the next worker. And
then the workers also drop off on their own. And this is assuming we’ve got two classes of workers:
ninety percent accurate and sixty percent accurate. And this is a simulation experiment he’s starting to
run on the real workflow, but I don’t have those results yet. But on the simulation studies, what you see
is that the POMDP—the red curve—does really very, very well. Here’s a baseline which corresponds to
current practice, which is basically to insert at random—you know—like between ten and twenty percent gold questions; he does seventeen percent, and if the worker ever gets less than eighty percent of the gold questions right, they get fired. And that does pretty well. Another baseline, which is stupid, is just to ask
people to work all the time and that way you can’t get rid of the bad workers. And then the purple and
the yellow curves are two reinforcement learning models. One which is learning just the class ratio;
another one which is learning both the class ratio and a model of the worker behavior. And if we’re
learning just the class ratio where we’ve got the true parameters for the other things the system does
super, super well. It starts out with an exploration policy which is the green baseline and it very quickly
goes up and does as well as the optimal policy. And what we see is that the… when we’re trying to do
reinforcement learning of more parameters, it’s not working quite as well yet. And that may just mean
that we need to figure out a better exploration policy, which is what we’re working on… we need to go farther. What you can see is it’s learning some of the parameters. This is the error estimate. It’s
learning two of the parameters really well and the other two parameters it’s actually got a big error,
actually can’t distinguish between the two classes. So we’re still trying to get that to work better. And
so right now what Jonathan’s doing is actually trying this on real world domains as well as introducing
the testing actions and the teaching actions together, something he did earlier and it was working well
without the reinforcement learning, but he was having trouble with the reinforcement learning. Yeah?
>>: Then so the testing is on gold data, right?
>> Daniel Weld: That’s right.
>>: You have the right answer? So is there some sort of like testing budget or something where you
have to decide when you’re gonna… like you wanted someone else…
>> Daniel Weld: That’s what the system is doing: it’s trying to figure out, if all I care about is how much
good work gets done, what’s the right testing methodology?
>>: Does it assume I have an endless pool of tests I can give, or…
>> Daniel Weld: It does.
>>: Okay.
>> Daniel Weld: Although, we actually… it doesn’t end up asking all that many tests, so we’ve got more
than that amount of data right now. There is a challenge that people have found—you know—that the
test questions maybe get identified and shared amongst the workers—that’s an issue. So in real life one
needs to be careful about how many test questions you have and reusing test questions.
>>: But so… but ignoring that issue of them cheating, the… you’ve never really had an issue then where
the number of tests it wants to ask is more than the size of your tests… that your gold tests
[indiscernible].
>> Daniel Weld: We haven’t had that problem.
>>: Okay.
>> Daniel Weld: But as I said, people at CrowdFlower have had that problem and have had to come up with ways of generating… automatically generating new test questions. And you can also generate new test
questions by asking one question to a whole bunch of workers who you know are good and then looking
at a consensus answer and then taking that as a test question. And—you know—in general, a test
question where there is very little disagreement about it so there’s not a whole lot of nuance is gonna
be a better test question than one… actually, so the educational literature knows lots about creating
good test questions and there’s a whole methodology for doing that which I think one could plug in.
>>: Right, so I forgot… so the test questions are… it’s per user so you only need enough test questions to
satisfy all the tests you’re gonna ask one user. You don’t need to create enough for all the tests
[indiscernible]…
>> Daniel Weld: That is correct, assuming that they’re not sharing information.
>>: Assuming that they’re not sharing, but like that’s why you don’t have to make that… okay.
>> Daniel Weld: That’s right.
>>: Yeah.
>> Daniel Weld: Okay, so the final part of the talk is actually addressing the other issue and… that we
sort of come to when we’re trying to generate training data. And as I mentioned, pretty much
everybody’s focused on, “How do we get the highest quality… come up with a crowdsource workflow
that gets the highest quality data that we can come up with?” But maybe a better question is, “If we’ve
got a fixed budget and all we care about is generating a really good machine learning classifier, maybe
focusing on the data quality isn’t the right thing to do.” In particular there’s a tradeoff. If we’ve got a
budget of nine and a whole bunch of images we want to label, we could get one label for each one of
these images saying, you know, “Is it a bird? Yes or no?” Or we could… and the workers are gonna
make lots of mistakes, so assume the workers are seventy-five percent accurate. Or what we could do is
ask…only show three images and get three labels for each of those images and then do majority vote or
expectation maximization, and in that case we get results that are eighty-four percent accurate. Or—
you know—we could ask all nine workers to label a single image and get something that’s ninety-eight
percent accurate. What’s the right thing to do? In practice, what everybody does is they do something
like two-three relabeling which is get two people to label an image and if they agree then we’re done,
otherwise get a third person to disambiguate. That’s what LDC does, that’s what lots of people do. For
most of these crowdsource studies that’s pretty much what everybody does ‘cause it just seems we can’t
trust these workers. But in fact, unilabeling with really crummy labels might be a much, much better
strategy.
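The eighty-four percent figure above is just the chance that a majority of three seventy-five-percent-accurate labels is correct; here is that arithmetic as a small sketch, under the assumption that workers are independent:

    from math import comb

    def majority_vote_accuracy(n_labels, worker_accuracy):
        """Probability that a majority of n independent labels is correct (odd n)."""
        p = worker_accuracy
        return sum(comb(n_labels, k) * p**k * (1 - p)**(n_labels - k)
                   for k in range(n_labels // 2 + 1, n_labels + 1))

    print(majority_vote_accuracy(3, 0.75))   # ~0.84, the figure quoted above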
So in a really nice paper by Chris Lin last year, he identified the sort of aspects of the learning problem
that affect this decision. So in particular, if the inductive bias is really weak, if you’ve got a very
expressive representational language that you’re trying to learn, for example, many, many, many
features, then relabeling is more important, otherwise unilabeling is more likely to be a good strategy. If
your workers are really, really, good, obviously you don’t want to relabel. If your workers are really,
really, bad, you also don’t want to relabel. If the workers are sort of at seventy-five percent, that’s when
relabeling gives the maximum benefit. If your budget is really large you might think you should just
definitely relabel everything, ‘cause why not? Your budget’s really large. In fact, if your budget’s really
large you’re much better off unilabeling and allowing the learning system to deal with the noise.
What Chris has been working on more recently, and just submitted a paper to AAAI on, is the problem
of reactive learning. And that’s basically like active learning, you’re trying to pick what’s the next data
point to label. Reactive learning is, not just what’s the next data point to label, but maybe should I go
back and relabel something as well? I’m gonna trade off those two different kinds of actions. So a
natural thing to do is take a good active learning algorithm like uncertainty sampling and it turns out it
doesn’t work. It oftentimes loops infinitely and starves out all the other data points. And with expected
error reduction developed here, we actually see the same kind of behavior—looping behavior. Again…
yeah?
>>: I was wondering what loop means here?
>> Daniel Weld: Loop means the system says, “I’m really uncertain about data point x so I want to get a
label for that.” Now it has the label, or maybe it’s a second label and it said, “Hmm, what am I most
uncertain about? Actually I’m still really uncertain about x.” And then from then on it only asks about x, ‘cause at a certain point that label doesn’t change the classifier and so its uncertainty remains on that point. And the basic idea is you need to take into account not just the classifier uncertainty,
but also how likely is a new label going to change the classifier? So that is the idea behind impact
sampling; and the idea is: think about a new label, how likely is that going to be to change the classifier?
And we want to pick, to label next, the point that has the highest expected impact on the classifier. And
there’s actually many different variations on this idea. The first one is, should we actually look for the
expected impact, or should we be optimistic and choose the data point which has the largest possible
impact? Another one is, sometimes if you’ve labeled a point twice and you’ve gotten two positives, a
third label is not gonna change your belief in that data point. But maybe… you still want the system
maybe to go back. So if you do look-ahead search then you could realize, well, if we got two or three
labels and they are all negative that would change our classifier, but any one label isn’t gonna change
the classifier. So in general, look-ahead search is much better than greedy search, but—you know—it’s
also much more expensive. So what Chris looked at was something called pseudo look-ahead, where he calculates how many data points you would need to label to actually come up with a change, and then divides that impact by the number of data points. Kind of a quick heuristic that gives you the
benefits of look-ahead without having to actually do anything other than myopic search. And then of
course, there’s many, many data points and so to make any of these things practical you need to sample
the data points that you’re going to consider for the expected impact. And so here’s a set of studies.
The first are on Gaussian datasets with ninety features and we look at a number of different strategies,
including passive sampling, expected error reduction, different kinds of uncertainty sampling that have
been fixed to avoid the infinite loops, and impact sampling which does much, much better on the
simulated datasets. Another thing we’ve done is take a number of UC Irvine datasets, so real world
datasets, and then use a noisy annotator model to generate simulated labels since all we have are the gold
labels. And here we see that on the internet ads we actually do quite a bit better and on arrhythmia we
also do quite a bit better; there are some other datasets where, I should honestly say, we don’t do much better. So that’s too bad.
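A small sketch of the core impact-sampling idea as described, with the classifier’s train and predict steps passed in as placeholder functions; the optimistic and pseudo-look-ahead variants are left out:

    def expected_impact(point, pool, labels, train_fn, predict_fn, p_positive):
        """How much would one more label on `point` be expected to change the
        classifier's predictions over the pool? p_positive is the model's current
        estimate that the new label comes back positive."""
        current = predict_fn(train_fn(labels), pool)
        impact = 0.0
        for label, prob in ((1, p_positive), (0, 1 - p_positive)):
            new_preds = predict_fn(train_fn(labels + [(point, label)]), pool)
            impact += prob * sum(a != b for a, b in zip(current, new_preds))
        return impact

    def pick_next_point(candidates, pool, labels, train_fn, predict_fn, posterior):
        """Greedy impact sampling over a (sub)sample of candidate points."""
        return max(candidates,
                   key=lambda x: expected_impact(x, pool, labels, train_fn,
                                                 predict_fn, posterior(x)))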
And then finally, HA was nice enough to give us some Galaxy Zoo data and we’ve also tested on the
relation extraction data and we see that the benefits are slimmer unfortunately here with the real
annotators. We get a definite improvement in relation extraction for some of the relations, in the
Galaxy Zoo Data it’s indistinguishable from some of the other methods. And so that’s where we are on
that. Yes?
>>: How do you… I mean, how do you even run this, like if your algorithm needs to, say, get another
label for this exact point?
>> Daniel Weld: Yeah. So the way we do this is we go out and we get like a really large number of labels
and… from the real annotators, and then we have the decision-theoretic controller which doesn’t
actually go out to the crowd and instead goes to the database and says, “Give me what a person would
have said if I’d gone to the crowd.” That way we can repeat the process.
>>: [indiscernible]
>> Daniel Weld: Yeah, and that way we can also do a number, randomize the workers and get these
confidence intervals.
>>: What do you do for the arrival time of the worker in your model?
>> Daniel Weld: So we’re not modeling the time varying behavior of the workers here and we are
assuming that we’re getting one label and then we’re doing some thinking about what the best label is
to get again and then we’re going out to get another label. So if we actually ran this on Mechanical Turk
in a live experiment there is a practicality challenge that it would be very slow because at any one time
there would be only one Turker working on the task. There’s ways of fixing that by—you know—asking,
like, ten questions at once, kind… you know, batching things up that would make it more practical, but
actually that’s an important point that I should have made, that we’re taking a pretty optimistic case for
our system ‘cause it’s assuming that the labels are very, very expensive and trying to do the best thing.
Also, like expected error reduction, this is pretty computationally intensive method. So if your labels are
really, really cheap and your compute power is expensive, maybe you’re better off with a simpler
strategy.
>>: Have you compiled the annotation error rate for these data sets and tried to see if the trends are
changing with respect to the kind of error rate?
>> Daniel Weld: We haven’t and that’s another key thing. In the earlier experiments with the simulated
data we’re telling the POMDP what the error rate is; in these experiments we’re not telling it and it
doesn’t know about the error rate. So there’s a couple reasons why we’re not doing as well here; three
possible reasons. One is there are correlations in the human annotator errors, which we’re not
modeling in the simulation studies; another one, uh…. I’ve just forgotten; and the third one is that it’s
something we don’t know. So… rats, that’s embarrassing. Anyway…
>>: You know, the reason I’m asking that question is when we were running this kind of analysis on
the Galaxy Zoo data…
>> Daniel Weld: Right.
>>: … we actually saw that—you know—even in terms of solving the POMDP, the look-ahead has to be
much larger than the other datasets you’ve seen before to actually reason about the value of getting
another label.
>> Daniel Weld: Right.
>>: Because the error rate was so high that you have to reason about a long look-ahead to be able to
really see the value of getting another annotation. So I’m wondering if the same kind of pattern is emerging
for this study as well, that the error rate of workers is so high that even the look-ahead of impact policy
may not be, you know, enough. I’m talking [indiscernible]…
>> Daniel Weld: Yeah, I know, I remember what you’re saying. We haven’t seen that yet and I think it’s
because of the pseudo look-ahead, but again, these experiments are ones that have just been run in like
the last day or two, so…
>>: You can take it offline. I’m [indiscernible]…
>> Daniel Weld: Yeah. It’d be fun to talk more about it.
>>: In those experiments the POMDP is learning what the error... the label error is as it’s gathering
labels. Is that… [indiscernible]?
>> Daniel Weld: It is learning how often the workers disagree.
Okay, so you guys probably believed in POMDPs when you came in, but hopefully I’ve argued that
they’re great for controlling crowdsourcing and that you can apply them to all sorts of different kinds of
decision problems where you’re trying to decide what to ask, who to ask, when to teach, how often to
test, and so on. And we’ve looked at two different things: one is how do we get high-quality data, which is especially good if you’re looking for a test set. And then more recently—you know—forget about
the data, what we really want is a good classifier; maybe more bad data is better than really good data.
And—you know—these are—you know—our first steps towards a bigger vision about how do we build
mixed initiative systems where we combine human intelligence and machine intelligence, and wouldn’t
that be cool? Yes, it would definitely be cool.
So let me just end by thanking my many, many collaborators, including Mausam and James, who have co-supervised many of these people, and then also really, really wonderful funding agencies. And thanks very
much. [applause].
I see we’re over time and you’ve already asked lots of questions, so if people need to leave you should
certainly leave, if not I’m… love questions. Okay, we can take the mic off and then you can ask
questions. [laughter]
>>: So…