>> Ofer Dekel: The last talk of this session will be given by Ece
Kamar from Microsoft Research. And the title is: Combining human and
machine intelligence in large scale crowdsourcing.
>> Ece Kamar: Thank you. And also thank you for the earlier
advertisement for this talk. I will be diverging a little bit from the
general theme today, because this talk is not only about machine
learning; it is really about the synergy between machine learning
approaches and decision-making approaches for solving a real-world
problem.
So I'm a researcher in the Adaptive Systems Group, and in our group
what we are working on is developing systems that can function in a
stochastically changing world. And these systems need to interact with
this environment and also with the different actors that exist in this
world.
So such an adaptive system needs to have a set of capabilities. One of
the most important capabilities is making sense of the world through
noisy and partial observations about the world; and, of course, machine
learning approaches are crucial for having this capability.
But usually having this capability is not enough. These systems also
need to act effectively in the world. And for having this capability,
you need to have good decision-making models on top of these models of
the world. These could be POMDP-like algorithms, and that's what I'll
be talking about today, but there could also be some reinforcement
learning algorithms, too.
And in our group we are working on a set of domains, ranging from
Outlook systems to mobile systems to other assistants for users, but
today the particular domain that I want
to talk about is crowdsourcing. And other speakers today have been
talking about crowdsourcing, but just as a recap: in recent years,
crowdsourcing has become really popular for providing programmatic
access to human intelligence. But current crowdsourcing applications
have one particular drawback, which is that it is very difficult to
manage quality in crowdsourcing.
The responsibility of managing these crowdsourcing tasks and assuring
quality falls on the task owners. For example, as a person going to
Mechanical Turk, you need to say: I want to hire three workers for this
task, I want to pay this much, and this is my consensus rule, this is
how I will define my ground truth afterwards. And this is not the best
way of doing this, because as a task owner going to Mechanical Turk for
the first time, you don't know which parameters are the best ones to
use.
So this is not the best use of people's resources, and also at the end
of the day we're not using our worker resources in the best possible
way. Today I want to talk about CrowdSynth, a system we're designing
for managing these crowdsourcing tasks.
CrowdSynth has two main capabilities. The first is its machine learning
capabilities: it has a set of machine learning models for learning
about tasks and workers, and also for fusing machine analysis with the
noisy worker reports we are getting through crowdsourcing.
In addition to this, it has decision-theoretic planning capabilities
for optimizing efficiency as we are solving these consensus tasks.
CrowdSynth is designed for a special subset of crowdsourcing tasks that
we call consensus tasks.
In consensus tasks, the task owner has a question in mind. It doesn't
know what the correct answer to the question is, but it can go to the
crowdsourcing system. The crowdsourcing system can hire multiple
workers, and the idea is that if it can ask sufficiently many people,
the consensus of these workers will give us the correct answer to the
question.
And in this system we are paying money or we are spending time with
each worker. So hiring each worker is costly. We see examples of
consensus tasks in many crowdsourcing applications: in games with a
purpose, in paid crowdsourcing platforms like Mechanical Turk, or in
citizen science applications like Galaxy Zoo.
However, solving consensus tasks effectively is not trivial. First of
all, in addition to having these noisy workers providing us input, we
could also have some automated task analysis. For example, we could
have a machine learning model, in addition to these workers, that
provides hypotheses about what the correct answer is. So there's a
question about how to fuse this automated analysis with the human input
to get to the correct answer as cheaply as possible.
Also, as some other speakers pointed out today, not all workers or all
tasks in these crowdsourcing systems are equal. Some workers are
better than others and some tasks are easier than others.
So there's a question about how do we aggregate these people's reports
together to get to the answer quickly. And finally, each worker that we
hire provides some evidence, but each worker is costly. So there's a
question, and I think somebody in the audience pointed this out before:
how many of these worker reports are enough? When is it fine to say I'm
done, I'm just hiring this many people and this is the answer that I
want to give to the task owner?
So CrowdSynth is designed to address these questions. It has access to
some automated machine analysis, but it also has access to workers in a
crowdsourcing system. When a consensus task comes into the CrowdSynth
system, it can decide to hire workers, and in return it gets a worker
report. At each step of the process it needs to make a decision between
hiring an additional worker, paying more money, versus stopping,
getting a prediction for the correct answer, and giving it to the task
owner.
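To make that loop concrete, here is a minimal Python sketch of the
hire-or-stop control flow just described. It is an illustration only,
assuming hypothetical stand-ins (predict_answer, should_hire, get_vote)
rather than the actual CrowdSynth components.

import random

def run_consensus_task(task_features, predict_answer, should_hire, get_vote,
                       max_workers=90):
    # Hire-or-stop loop: keep collecting votes until the planner says stop
    # or the worker budget runs out, then report the most likely answer.
    votes = []
    while len(votes) < max_workers:
        belief = predict_answer(task_features, votes)  # P(answer | evidence so far)
        if not should_hire(task_features, votes, belief):
            break                                      # confident enough, stop paying
        votes.append(get_vote())                       # pay for one more noisy report
    belief = predict_answer(task_features, votes)
    return max(belief, key=belief.get), votes

# Toy usage with stand-in models: a fixed belief and a planner that stops after 3 votes.
answer, votes = run_consensus_task(
    task_features={"radius": 2.1},
    predict_answer=lambda f, v: {"elliptical": 0.4, "spiral": 0.4, "merger": 0.2},
    should_hire=lambda f, v, b: len(v) < 3,
    get_vote=lambda: random.choice(["elliptical", "spiral", "merger"]))
print(answer, votes)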
And CrowdSynth has multiple components in it. It has task and worker
databases that store historical information about the tasks: what their
correct answers and consensus answers have been, and what workers have
said for these different tasks. And then it has a feature generation
component that takes the automated task analysis, the sequence of
worker reports collected for the task, and the historical information
coming from the databases, and generates a set of features describing
the current state of the consensus task.
In addition to these features, now we need to have some predictive
models for making sense of these features. So we have two particular
models, answer models for predicting the correct answer of the
consensus task at any point in time, as more worker reports are coming
in, and also for assessing the confidence of the system about the
correct answer.
And we also have vote models for predicting the future execution, for
saying: if I hire another worker, what will that worker tell me? These
vote models help us predict the future and say what I am expecting to
happen in the future, and whether that improvement will be enough to
compensate for my additional costs. So they enable us to make that kind
of analysis.
And we have a decision-theoretic planner on top of these predictive
models that makes these hiring decisions in a way that optimizes the
final utility of the system.
We are performing this study of the CrowdSynth ideas on a well-known
crowdsourcing platform called Galaxy Zoo. Galaxy Zoo is a project
introduced by the Zooniverse team. The Zooniverse team has been
phenomenal at introducing a number of very influential citizen science
applications to the Web, and they have actually energized a large
community of citizen scientists. By citizen scientists they mean
regular people in the world who have Internet connections, want to make
a difference in science, and want to contribute to science. So they
have these platforms for engaging regular people in the world in
scientific research.
So this Galaxy Zoo system is one of their most successful applications.
The goal of Galaxy Zoo is to ask regular people in the world, these
citizen scientists, about the types of galaxies. When a citizen
scientist comes to the platform, the Galaxy Zoo system shows an image
of a galaxy. There are millions of these galaxies, but there are not
enough experts in the world to classify all of them. So the citizen
scientist looks at the picture and says: I think the correct
classification of this galaxy is elliptical, or spiral, or merger. And
the idea is that if the system can ask enough people for these
classifications, we can classify the galaxies correctly, and actually
they have experiments showing that the citizen scientists'
classifications match expert opinion, so it has been very successful.
And since the launch of the Galaxy Zoo project, hundreds of thousands
of people have contributed to Galaxy Zoo. They've successfully
classified millions of galaxies, and this has resulted in discovering
new galaxy types and new astronomical objects, and it has led to
numerous research papers that increase our knowledge about how galaxies
evolve over time.
So we want to thank the Galaxy Zoo team for giving us access to this
beautiful dataset for our studies. In this dataset, there are more than
34 million worker votes collected for nearly 900,000 galaxies, from
100,000 unique workers. The first step we take is generating a set of
features that describe these consensus tasks.
So the first set of features we have is task features. These task
features are the result of automated task analysis, which uses computer
vision algorithms to look at these Galaxy Zoo images and gives us
different features representing the images: for example, how much noise
an image has, or what the radius of the galaxy in the image is. These
features come from the Sloan Digital Sky Survey, and we have 453 visual
features like this. This represents the automated task analysis
component.
In addition to this, we defined a set of vote features that take the
sequence of votes collected for a task so far and represent what these
votes look like: what the distribution is, the number of votes for each
class, the entropy of the distribution, the mode class, and many
others.
And we also defined a set of worker features which say, based on the
historical data that we have, what the usual time is that these workers
spend on each task, how much experience they have, and what their
accuracy has been for different kinds of tasks. And we also define some
worker-vote features, which use the vote features in addition to the
worker features to generate better descriptions.
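As a rough illustration of such features (my own simplification, not
the exact feature set used in the system), the vote and worker features
could be computed along these lines in Python:

import math
from collections import Counter

CLASSES = ["elliptical", "spiral", "merger"]

def vote_features(votes):
    # Summarize the votes collected so far: per-class distribution, entropy, mode.
    counts = Counter(votes)
    total = max(len(votes), 1)
    dist = {c: counts.get(c, 0) / total for c in CLASSES}
    entropy = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    mode = max(dist, key=dist.get) if votes else None
    return {"num_votes": len(votes), "dist": dist, "entropy": entropy, "mode": mode}

def worker_features(history):
    # Summarize a worker's history: experience, accuracy, and mean time per task.
    n = max(len(history), 1)
    return {
        "experience": len(history),
        "accuracy": sum(1 for h in history if h["vote"] == h["truth"]) / n,
        "mean_seconds": sum(h["seconds"] for h in history) / n,
    }

print(vote_features(["spiral", "spiral", "elliptical"]))
print(worker_features([{"vote": "spiral", "truth": "spiral", "seconds": 12.0}]))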
So now that we know how we are doing the feature generation, I will
talk about building the predictive models. In consensus tasks, there
are a number of different uncertainties. For example, the system is
always uncertain about what the correct answer is. And also it is
uncertain about what the next worker would say if it hired an
additional one. So we are building these predictive models for
representing these uncertainties: answer models predict the correct
answer, and vote models predict the next vote. The figure on top
presents one typical model we built for predicting the answer and the
next vote, and as you can see these models can bring these different
features together to make the predictions. And we use supervised
learning with Bayesian model selection to train these models.
For predicting the correct answer, we took two general approaches. The
first one is building a generative model that makes an inference based
on a prior answer model, which uses just the automated task analysis
features, and a vote model, which predicts the next vote based on the
correct answer, the task features, and the vote features.
Here we implemented two very well-known generative approaches: a Bayes
model, which makes an independence assumption about the votes coming in
for a task, and an iterative Bayes model that does not make this
assumption. And then we also tried a discriminative approach, which
learns a direct model from all the features available for a task.
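To illustrate the independence-assuming generative approach, here is a
minimal sketch of Bayes-style vote aggregation. The prior and the
confusion model below are made-up numbers for illustration; in the
actual system these would be learned, and the vote model also
conditions on task and vote features.

CLASSES = ["elliptical", "spiral", "merger"]

# Hypothetical prior from automated task analysis, and a single shared
# confusion model P(vote | true class); a learned model could be per-worker.
PRIOR = {"elliptical": 0.5, "spiral": 0.4, "merger": 0.1}
P_VOTE_GIVEN_TRUE = {
    "elliptical": {"elliptical": 0.80, "spiral": 0.15, "merger": 0.05},
    "spiral":     {"elliptical": 0.10, "spiral": 0.85, "merger": 0.05},
    "merger":     {"elliptical": 0.20, "spiral": 0.20, "merger": 0.60},
}

def posterior(votes):
    # Posterior over the true class, treating votes as conditionally independent.
    scores = dict(PRIOR)
    for v in votes:
        for c in CLASSES:
            scores[c] *= P_VOTE_GIVEN_TRUE[c][v]
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior(["spiral", "spiral", "merger"]))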
In this graph we can see the different accuracies of these models as
the number of votes increases from zero to 60. One important lesson
from this: as the number of votes increases, we can see a benefit from
using a discriminative approach, because the votes come together in
synergistic ways and the discriminative model is able to represent
this.
But in addition to this, these consensus tasks are really difficult. We
may need a very large number of worker reports to be able to predict
the correct answer accurately.
And in this figure we can see two discriminative models, one trained
for the case where we have a small number of votes, and the second one
trained when there is a large number of votes available. As you can
see, when only a few votes are available, the task features play a
really important role in predicting the answer and the next vote. But
as the number of votes increases, the vote features become the best
predictors for the correct answer.
So now that I've described how these answer and vote models are
trained, I want to talk about how we will make decisions by using
these predictive models. So the goal of the decision-making component
is optimizing hiring decisions to maximize expected utility for solving
these consensus tasks. And here we are defining the utility as the
final accuracy of the answers produced by the system minus the payment
we are giving to workers.
So we don't want to hire a lot of workers, because they're costly; we
can't hire infinitely many people. But at the same time we want to get
to a good accuracy, and the system needs to make a good trade-off for
different cost values.
And we are modeling consensus tasks as a finite-horizon Markov decision
process with partial observability. The state here includes the task
features and the votes. At each step the system keeps a belief about
what the correct answer is.
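As a small worked example of that utility, assuming a fixed per-vote
cost and a 0/1 accuracy reward (a simplification of whatever reward
shaping the actual system uses):

def utility(predicted_answer, true_answer, num_votes, cost_per_vote):
    # Final accuracy of the produced answer minus the total payment to workers.
    accuracy_reward = 1.0 if predicted_answer == true_answer else 0.0
    return accuracy_reward - cost_per_vote * num_votes

# A correct answer reached after 12 votes at a cost of 0.01 per vote:
print(utility("spiral", "spiral", num_votes=12, cost_per_vote=0.01))  # 0.88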
Here the system has two high-level actions. The first action is hiring
a random worker and continuing the task. The second one is not hiring a
worker and terminating the task.
The reward for taking a hiring action depends on the cost of a worker,
and the reward for not hiring any more workers depends on the
confidence of the system in its answer prediction, and the stochastic
transitions of the model are determined by the next vote models. So as
you can see here, the predictive models become inputs to our
decision-theoretic planner.
And the decision-theoretic planner makes a decision about whether to
hire new workers or not.
And this decision about which action to take is guided by a
value-of-information analysis. So for each state the system decides:
what is my value for taking action H versus action not-H, and I should
take the action which provides more expected value.
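A myopic, one-step version of that value-of-information comparison
might look like the sketch below. The real planner looks much farther
ahead; the vote model and belief update here are toy stand-ins, not the
trained models from the talk.

def value_of_stopping(belief):
    # Value of terminating now: the system's confidence in its best answer.
    return max(belief.values())

def value_of_hiring(belief, vote_model, update_belief, cost):
    # One-step lookahead: expected confidence after one more vote, minus its cost.
    expected = sum(p_vote * value_of_stopping(update_belief(belief, vote))
                   for vote, p_vote in vote_model(belief).items())
    return expected - cost

def choose_action(belief, vote_model, update_belief, cost):
    hire = value_of_hiring(belief, vote_model, update_belief, cost)
    return "hire" if hire > value_of_stopping(belief) else "stop"

# Toy stand-ins: the next vote mirrors the current belief, and the update
# multiplies in a fixed 0.8 / 0.1 accuracy assumption (illustrative only).
def toy_vote_model(belief):
    return dict(belief)

def toy_update(belief, vote):
    scores = {c: p * (0.8 if c == vote else 0.1) for c, p in belief.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(choose_action({"elliptical": 0.55, "spiral": 0.45},
                    toy_vote_model, toy_update, cost=0.02))  # -> "hire"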
However, just using regular MDP or POMDP approaches for solving these
tasks is not possible, because the horizon of these consensus tasks can
be very large. And we know that when there's partial observability, the
complexity of solving these models grows exponentially with the
horizon, which makes exact solution approaches infeasible. For some
tasks we have up to 90 votes, so we would need to plan over a horizon
of 90, which is not feasible.
So what is our solution for solving these tasks efficiently? The
solution is exploiting the special structure, because consensus tasks
have a special structure. There are two high-level actions: hire a
worker or do not hire a worker. The task terminates when the not-hire
action is taken. Each hire action adds new evidence, and answer
predictions are most accurate at the horizon, when all votes are
collected.
So by looking at this special structure, we designed a new Monte Carlo
planning algorithm that can solve consensus tasks efficiently.
So here's what this algorithm does. It starts from a state s_i, the
current state of the consensus task, and it generates samples. Each
sample represents one possible future execution of the system, if the
system continues to hire as many workers as possible until reaching the
horizon. Starting from the state s_i, we say: okay, I'm hiring a
worker, let's sample what that worker would say; maybe that worker will
say elliptical or spiral. Update the state, take another hire action,
sample another worker vote, and update the state, until reaching the
horizon.
At the horizon, we have a lot of evidence about the task. At that
point, from the belief at that state, we can sample a correct answer,
which is A-bar. And now we will use this sampled correct answer to
backpropagate and evaluate the value of the previous states for
terminating at any point. So we backpropagate this A-bar to a state s_t
and say: if you terminated at that state, if you took action not-H,
what would that state say the correct answer is, and does it agree with
the sampled correct answer? If it agrees, there's a good reward for
you. If it doesn't agree, there's a bad reward for you. And we do this
for all the states encountered on the sample.
So at the end we can assign some utility to these different states
we've seen on the sample. And then later, after we generate many
samples like this, we can bring these samples together in a partial
search tree like this. Here the branches represent the stochastic votes
that we can get from workers. Starting from the bottom of the tree, we
can evaluate, for each state on the partial search tree, what the value
is for hiring workers or not hiring workers, and make decisions like
this. So what are the properties of this algorithm? It is an anytime
algorithm. With each sample we can explore states at different depths
of the horizon, which is a good property when the horizon is so long.
And a single sample can evaluate values for multiple action sequences.
So by using these properties we can solve consensus tasks efficiently.
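Here is a stripped-down sketch of that sampling idea: roll out to the
horizon, sample a correct answer from the final belief, and credit
every state visited along the way with the reward it would have
received for stopping there. It is a simplification (no search tree, no
anytime control) and uses toy stand-ins for the vote model and belief
update, not the actual algorithm.

import random

def sample_from(dist):
    # Draw one outcome from a {outcome: probability} dictionary.
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def sample_rollout(belief, vote_model, update_belief, horizon):
    # Simulate hiring workers all the way to the horizon; return the beliefs
    # visited and an answer A-bar sampled from the final, most informed belief.
    beliefs = [belief]
    for _ in range(horizon):
        vote = sample_from(vote_model(belief))
        belief = update_belief(belief, vote)
        beliefs.append(belief)
    return beliefs, sample_from(belief)

def estimate_stop_values(belief, vote_model, update_belief, horizon, cost,
                         num_samples=1000):
    # For each depth t, estimate the value of stopping after t more votes:
    # reward for agreeing with the sampled answer minus the votes paid for.
    totals = [0.0] * (horizon + 1)
    for _ in range(num_samples):
        beliefs, answer = sample_rollout(belief, vote_model, update_belief, horizon)
        for t, b in enumerate(beliefs):
            reward = 1.0 if max(b, key=b.get) == answer else 0.0
            totals[t] += reward - cost * t
    return [v / num_samples for v in totals]

# Toy usage: mirror-the-belief vote model and a fixed-accuracy belief update.
def toy_vote_model(b):
    return dict(b)

def toy_update(b, v):
    s = {c: p * (0.8 if c == v else 0.1) for c, p in b.items()}
    z = sum(s.values())
    return {c: x / z for c, x in s.items()}

print(estimate_stop_values({"elliptical": 0.6, "spiral": 0.4},
                           toy_vote_model, toy_update, horizon=10, cost=0.01))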
Now I will show you results from an empirical evaluation with the
Galaxy Zoo data. We compare CrowdSynth to two baselines: a no-hire
baseline, which hires no workers and just uses the automated task
analysis, and a hire-all baseline, which is the original Galaxy Zoo
system. And we try different CrowdSynth variations: a limited-lookahead
approach which looks 16 steps ahead, UCT, which is an upper confidence
bound Monte Carlo tree search algorithm, and our novel sampling-based
algorithm. In this graph you can see the utility of the system as the
cost of hiring a worker for a label decreases from 1 to 0.01 cents. And
as you can see here, when the cost of a worker is not very high, our
algorithm and CrowdSynth can do much better than all the baselines and
all the Monte Carlo planning
algorithms. Why do we see this? Because our system can successfully
trade off the expected benefit of a worker against its cost. As you can
see here, the blue line is the accuracy of CrowdSynth and the red bars
represent the percentage of votes collected. And as you can see, when
the cost is really high, the system does not hire many workers, but it
can still increase the accuracy above the baseline. As the cost
decreases, it increases the number of workers it hires, and gradually
the accuracy increases.
And, finally, at the end, when the cost is really low, CrowdSynth can
reach the maximum accuracy of the original system by only hiring
47 percent of the workers. We did a similar analysis for the case when
there's no cost for a worker but there's a fixed budget, and we showed
that CrowdSynth can do much better than the other baselines.
So I know I'm out of time, so I will just conclude with the future
work, or current work, slide. What are the things we are working on
now? This version of CrowdSynth works when there's a large set of
historical data. We're interested in how we can use the same ideas when
such a dataset is not available, so we are interested in the design of
a system that can learn and act simultaneously. We want to go beyond
consensus tasks; for example, the picture on the right is another
example from a Zooniverse platform, where it's no longer a consensus
task, it's a discovery task. And we're interested in other prediction
problems, like predicting the engagement of workers. And we have an
ongoing collaboration with the Zooniverse team; they're very interested
in trying these ideas in their platform, so we hope you'll see these
ideas come true in their systems.
So thank you very much and I'd like to have your questions.
[applause]
>>: It sounds like your system really is beneficial if the task is
hard and you really need multiple judgments, and it also assumes sort
of one judgment at a time. It has collected three workers, three
judgments, and decides whether to go for
the fourth one or fifth one. In practice, you usually cannot really
get only one because you have the problem with Watson style that you
have to keep returning to. So you have to [inaudible]
>> Ece Kamar: How do you need to get --
>>: Righteous [phonetic]. Because you have to keep looking -- you have
to keep filtering the box. I guess you can change it -- the question is
what would change if you either have a very small budget, so we really
cannot get that many labels, and also what would change if we cannot
request just one label, we have to request like five labels or three
labels.
>> Ece Kamar: So the ideas would be pretty much the same. Say you
cannot hire 50 but you can hire five: your decision-making problem will
get much simpler, because you just need to think about five steps ahead
at most.
So you don't have many of the challenges and problems we've seen. So
actually the task becomes simpler when you can hire at most five
workers.
And as you've seen here, even for the Galaxy Zoo task, there were times
when the system was confident enough to say I'm done after hiring one
worker, because the automated task analysis was already confident that
it was elliptical, the first worker the system hired said it's
elliptical, and the system said: yes, with these costs I don't think I
can improve much by paying more workers, so I'm just stopping now after
one worker.
So even when the number of workers you need to hire is not that large,
you can still generate benefits by using an approach like this. And
coming back to the question about batches, such a system would be more
beneficial when you can actually dynamically adjust what you want to do
after each vote, because you have more freedom about the granularity of
workers you want to hire.
So the same idea would still work: instead of hiring one worker you
would hire a couple of them, and that would still work. But at the end
you may not achieve all the benefits you would like to get out of the
system, just because your decisions are not that granular.
>>: It looks like the action space you're looking at is hire or not
hire, and that's fine. But in practice often I have a fixed budget and
I do the best I can. Have you looked at, instead, trying to determine
which tasks to give more workers to, a crowdsourcing active learning
type of approach? It looks like similar techniques could work, but that
didn't seem to be in scope at all here.
>> Ece Kamar: Here, because of the limitations of the dataset and the
Galaxy Zoo platform, we can't really target individual workers. This is
why we've performed our simulations like this. But in separate
simulations that I haven't presented today, we built next vote models
that were personalized. So we were actually predicting what this
particular worker would say for this galaxy, and then customized our
decision-making model by also making decisions about which worker I
should hire if I can choose among these 30. And you can actually see
benefits in that.
>>: You would hire whichever is closer. But it's not about which tasks;
it's about what actions, what questions do I give you.
>> Ece Kamar: That is like the analogous problem, looking at it from
the task side. The same ideas would be applicable for performing that
kind of analysis, too.
>> Ofer Dekel: Let's thank the speaker.
[Applause]