>> Ofer Dekel: So the first speaker in the next session is Dengyong
Zhou from Microsoft Research, and the title of his talk is Learning
From the Wisdom of Crowds by Minimax Entropy.
>> Dengyong Zhou: Good afternoon, everyone. So I'll be talking about
crowdsourcing. It's joint work with Sumit Basu, Yi Mao and John
Platt. The paper will appear in NIPS this year.
These days, crowdsourcing has become a popular way to generate labels.
Many companies provide crowdsourcing services; here I just listed a few
of them. Perhaps the most popular one is Amazon Mechanical Turk.
Generally, labeling one item, for example an image, costs very little,
maybe even just one cent. So it's very easy to get a large number of
training examples at low cost.
That sounds great. Unfortunately, there is no free lunch. Here, we
submitted a subset of data to Amazon Mechanical Turk. The Turkers are
not experts; they had to classify the dogs into four breeds. The
average accuracy is 68 percent. So a fundamental challenge in
crowdsourcing is how to infer the ground truth from the noisy labels
given by non-experts.
Now let me define the problem rigorously. We have a set of workers,
indexed by i from 1 to m. We also have a set of items, indexed by j
from 1 to n. The items could be images or documents, or something
else. We also have categories, indexed by k from 1 to c. If c is equal
to 2, that is just binary labeling; for example, we may want to label
an email as spam or not spam. If c is larger than 2, we have more
classes. Then we have a response matrix, denoted by Z, which represents
the labels given by the workers. If the entry z_ijk is 1, that means
worker i labeled item j as category k. If the entry is 0, that means
worker i labeled item j as some other category. If a worker didn't
label an item, the entry is unknown. Our goal is to estimate the ground
truth, that is, the true label of each item.
Here perhaps I need to point out that the crowdsourcing problem is
totally different from collaborative filtering. In collaborative
filtering, one needs to estimate the unknown entries; here we need to
estimate the unknown ground truth.
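A minimal sketch of this setup in code (illustrative only; the sizes,
the fill rate, and the use of all-zero slices for unlabeled items are
my own assumptions, not from the talk):

```python
import numpy as np

# Illustrative sizes: m workers, n items, c categories.
m, n, c = 5, 6, 2

# Response tensor: z[i, j, k] = 1 if worker i labeled item j as category k.
# If worker i did not label item j, the slice z[i, j, :] is left all zero
# here to stand in for an "unknown" entry.
rng = np.random.default_rng(0)
z = np.zeros((m, n, c))
for i in range(m):
    for j in range(n):
        if rng.random() < 0.8:            # each worker labels ~80% of items
            z[i, j, rng.integers(c)] = 1  # a random label, for the sketch
```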
Perhaps the simplest approach to crowdsourcing is majority voting: we
simply count the votes for each class. For example, let's look at item
four. Three workers labeled item four as class one, and two workers
labeled item four as class two. According to majority voting, the true
label of item four should be class one.
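Continuing the sketch above, majority voting is just a vote count
followed by an argmax (again illustrative, not the talk's code; it
assumes every item received at least one label):

```python
# Majority voting over the response tensor from the sketch above.
votes = z.sum(axis=0)          # shape (n, c): votes per (item, category)
y_mv = votes.argmax(axis=1)    # estimated true label for each item
print(votes)
print(y_mv)
```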
So we want to improve on majority voting, because more skilled workers
should have more weight in voting. This problem was beautifully
addressed by Dawid and Skene. They assume each worker is associated
with a confusion matrix, a c by c matrix, where c is the number of
categories. Each entry of the confusion matrix is the probability of
confusing two classes.
For example, the entry p_kl means that the true category is k but the
worker gives category l. They then assume that, for any labeling task,
the label given by a worker is generated according to his confusion
matrix.
The rest is then standard. We have two sets of parameters: a confusion
matrix for each worker, and the ground truth that we need to estimate.
Both can be jointly estimated by maximum likelihood estimation, and the
implementation is an EM algorithm.
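Here is a compact sketch of that EM procedure (a simplified
reconstruction under common assumptions, not the authors' actual
implementation; for brevity it uses an integer label matrix with -1 for
missing entries rather than the indicator tensor above):

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """EM for the Dawid-Skene model (simplified sketch).

    labels[i, j] is worker i's label for item j, an integer in
    0..n_classes-1, or -1 if the worker skipped the item.
    Assumes every item received at least one label.
    Returns (hard label estimates, per-worker confusion matrices).
    """
    n_workers, n_items = labels.shape
    observed = labels >= 0

    # Initialize the posterior over true labels by soft majority voting.
    y = np.zeros((n_items, n_classes))
    for i in range(n_workers):
        for j in range(n_items):
            if observed[i, j]:
                y[j, labels[i, j]] += 1.0
    y /= y.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker.
        prior = y.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for i in range(n_workers):
            for j in range(n_items):
                if observed[i, j]:
                    conf[i, :, labels[i, j]] += y[j]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true label.
        log_y = np.tile(np.log(prior + 1e-12), (n_items, 1))
        for i in range(n_workers):
            for j in range(n_items):
                if observed[i, j]:
                    log_y[j] += np.log(conf[i, :, labels[i, j]])
        y = np.exp(log_y - log_y.max(axis=1, keepdims=True))
        y /= y.sum(axis=1, keepdims=True)

    return y.argmax(axis=1), conf
```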
Let me illustrate the confusion matrix with an example. Assume we know
the true labels for each item: the first three items belong to class
one and the remaining three items belong to class two. If we look at
worker four, he classified the first three items correctly, so the
first row of his confusion matrix is 1 and 0. But he misclassified
items four and five as class one, so that entry is 2 over 3. And he
correctly classified item six as class two, so the last entry is 1
over 3.
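In matrix form, worker four's estimated confusion matrix in this
example would be (my rendering of the slide's numbers):

```latex
\hat{P}^{(4)} =
\begin{pmatrix}
  1   & 0   \\
  2/3 & 1/3
\end{pmatrix}
```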
To understand this approach better, let's look at a simplified case in
which each worker is associated with a single number, p_i, which is
actually his accuracy: with probability p_i the worker gives the right
answer, and with probability 1 minus p_i he gives a wrong answer.
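In the binary case, this simplified model (often called the one-coin
model) corresponds to a confusion matrix of the form:

```latex
P^{(i)} =
\begin{pmatrix}
  p_i     & 1 - p_i \\
  1 - p_i & p_i
\end{pmatrix}
```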
The limitation of this approach is also obvious: under the generative
model, the labels given by a worker are generated in the same way
regardless of the task. Actually, that's not true.
Let's look at these two images, both from the Stanford dataset.
Clearly the image on the left side should be easier than the image on
the right side. So the method in this talk generalizes the Dawid-Skene
approach such that the task difficulty is also considered.
Let's assume there is a labeling distribution for each pair of worker
and item. We don't assume any particular form for this distribution.
Note that this distribution pi_ijk depends on i, on j, and also on k.
That's different from the Dawid-Skene approach.
First, just assume that we know the true labels, and we want to
estimate this distribution. A well-known approach for this kind of
estimation is maximum entropy: with maximum entropy, if we have no
constraints, we just get the uniform distribution. Later I will give a
more detailed explanation of the constraints, but first let me give a
high-level description.
You can imagine the response matrix as an empirical distribution. Our
goal is then to estimate the unknown distribution so that the two match
each other at the first moment. Let me first explain the constraints on
the item side. Going back to the problem: for each item, we count the
number of workers who labeled it as class one (this is binary labeling
here), and we also count the number of workers who labeled it as class
two. Then we require the expected numbers of workers to be equal to the
empirical counts. The other set of constraints is per worker. Just as
in the classic Dawid and Skene approach, we count misclassifications:
the number of misclassifications from class one to class two, and the
number of misclassifications from class two to class one. On the model
side we have the expected numbers of misclassifications, and again we
require the expected numbers of misclassifications to be equal to the
empirical counts.
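Schematically, with y_jl the indicator that item j's true label is l,
the maximum entropy problem matches these two sets of first moments (my
rendering, following the paper's formulation as I understand it):

```latex
\max_{\pi}\; -\sum_{i,j,k} \pi_{ijk}\ln\pi_{ijk}
\quad\text{s.t.}\quad
\sum_i \pi_{ijk} = \sum_i z_{ijk}\;\;\forall j,k;
\qquad
\sum_j y_{jl}\,\pi_{ijk} = \sum_j y_{jl}\,z_{ijk}\;\;\forall i,k,l;
\qquad
\sum_k \pi_{ijk} = 1,\;\; \pi_{ijk}\ge 0.
```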
So far I have introduced how to use the maximum entropy approach to
estimate the labeling distribution. But in our crowdsourcing problem
there is no ground truth, so how do we estimate the ground truth? One
might consider simultaneously estimating the ground truth Y and the
labeling distribution by maximizing over both pi and Y in the same
formulation. But through a little bit of mathematical derivation, we
found that this kind of formulation leads to the uniform distribution
over Y. Of course, that would not be desirable.
So let's revisit the maximum entropy formulation. In this formulation
the true labels appear as variables: if the true labels take different
values, we get different values of the entropy. So we may have an idea:
why not choose the labels such that the resulting entropy is the
smallest? Intuitively, entropy can be understood as uncertainty. If the
entropy is small, that means that, on average, the workers are
confident about the labels. Mathematically, this idea can be
implemented by wrapping the maximum entropy problem in an outer
minimization. That's all we need to do; there are no other steps.
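In other words, again schematically:

```latex
\min_{Y}\;\max_{\pi}\; -\sum_{i,j,k} \pi_{ijk}\ln\pi_{ijk}
\quad\text{subject to the same moment-matching constraints as above.}
```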
So far we have solved the problem of estimating the ground truth. But
at the beginning I talked about task difficulty and worker expertise:
we want to generalize Dawid and Skene's approach so that we can
consider both. In our primal problem there are no such variables. But
if we consider the dual form of the minimax formulation and use the KKT
conditions, we can show the optimal solution has to look like the
following form.
Now we have two Lagrange multipliers. The first is tau; we can
understand tau as task difficulty. We also have another one, sigma;
sigma can be understood as worker expertise. Those are naturally
introduced as Lagrange multipliers; we don't have to introduce those
variables explicitly.
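Roughly, the dual solution takes an exponential-family form (my
transcription of the model, not an exact quote from the slides): if
item j's true label is c, then

```latex
\pi_{ijk} \;\propto\; \exp\!\left(\sigma_{ick} + \tau_{jk}\right),
```

where sigma_i is a per-worker matrix (expertise) and tau_j is a
per-item vector (difficulty).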
Let me point out that the task difficulty here is a vector, not just a
single number; it represents how confusable the classes are for this
labeling task. This labeling model has a nice property we call the
objectivity principle. First, objectivity of worker expertise: the
difference in expertise between two workers is independent of the item
being labeled. We also have objectivity of task difficulty: the
difference in difficulty between two items is independent of the
worker.
Let me use an example to illustrate the objectivity principle. Here are
the two images that I showed before. No matter who labels these two
images, the right one is intrinsically more difficult than the left
one, because the right one is not very clear. And we have a nice result
on the connection between the labeling model and the objectivity
principle: this labeling model uniquely satisfies the objectivity
principle. So if we want to objectively evaluate workers and items, we
have to use this model.
In real-world crowdsourcing problems, we generally have many missing
entries. That means we cannot have each item labeled many, many times:
if we labeled each item that many times, it would be expensive, and
hiring experts would be a better strategy. Also, generally each worker
will not label all of the items; they have no time to label all of
them, and perhaps they would get bored with the labeling task. Due to
the finite data, we cannot insist that the constraints hold exactly.
So we need to relax the constraints: we put slack variables in such
that we allow a little bit of fluctuation.
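Schematically, the relaxed constraints look like this (my
reconstruction; xi and zeta denote the slack variables, alpha and beta
the regularization weights):

```latex
\sum_i \pi_{ijk} - \sum_i z_{ijk} = \xi_{jk},
\qquad
\sum_j y_{jl}\,(\pi_{ijk} - z_{ijk}) = \zeta_{ikl},
\qquad
\text{penalty: } \frac{\alpha}{2}\sum_{j,k}\xi_{jk}^2
  + \frac{\beta}{2}\sum_{i,k,l}\zeta_{ikl}^2 .
```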
Of course, we put regularization on these two newly introduced
variables. These two terms are just squared L2 norms, which means we
don't allow the slacks to be far away from zero. One might prefer other
norms, like the L1 norm or the trace norm or other matrix norms; we use
the L2 norm only for simplicity of computation.
Now let me show experimental results. The first result is on data from
web search.
We have a set of query and URL pairs, and we had many workers label the
pairs. The rating scale is perfect, good, fair, and bad. For example,
if the query is Microsoft and the web page is microsoft.com, that would
be a perfect match. Each pair was labeled by six different workers. If
we check the average worker accuracy, that is, we compute the accuracy
of each worker and take the average, it's just 37 percent. That's
amazingly low, given that random guessing gets 25 percent. Majority
voting gets an accuracy around 80 percent. Dawid and Skene's approach,
which doesn't take task difficulty into account, gets 84 percent.
Minimax entropy achieves the best result, around 89 percent.
The other experiment is on image labeling. We submitted a sample of
around 800 images, with four breeds. Each image was labeled by five
different workers. For this labeling task, the challenge is much
easier: the average worker performance is around 68 percent. Majority
voting achieves 80 percent, and the Dawid-Skene approach 82 percent.
Our approach does a little better by considering the task difficulty.
Now let me conclude the talk. We proposed a minimax entropy principle
to infer the ground truth from labels given by a crowd. From it we can
derive a labeling model which includes task difficulty and worker
expertise, and we show an equivalence between the objectivity principle
and the labeling model. There are many other things that I didn't cover
here; you can look at the paper. I didn't talk about how to implement
this approach: for the minimax entropy formulation, one needs to design
an algorithm to solve that kind of program. We also have a theoretical
justification of why we use minimax entropy; here I only talked about
the intuition. That's all. Thanks.
[applause]
Any questions?
>>: Minimax models usually suffer from instability if you don't have
enough judgments per item. Like if only two or three judges judge each
item -- in fact, if only two, then it will surely break; it's usually
very unstable. Can you comment on --
>> Dengyong Zhou: That's why we need regularization. We put a prior
on these parameters.
>>: But a prior on the parameters will simply make it converge to the
most simple model [inaudible].
>> Dengyong Zhou: If it's just two, that's a difficulty, right. We
need at least a minimum number of labels [inaudible], something like
that.
>>: I see.
>> Dengyong Zhou: We can talk more offline.
>>: I have some more questions.
>> Dengyong Zhou: Sure.
>>: This is actually related. In the real world you might have an
infinite number of tasks but can afford only a finite number of
ratings. Does this suggest the best way to allocate ratings to that
infinite set of tasks, like three per task picked randomly, or can you
do something smarter?
>> Dengyong Zhou: I don't know whether this would be the best way or
not. So far that's the only solution I have figured out. Actually, for
crowdsourcing problems, something smarter really seems quite possible.
>>: Could you say quickly how big this can scale to -- how many tasks
and workers can you handle computationally? Does it get hard?
>> Dengyong Zhou: Computationally, we just solve the dual problem with
coordinate descent. Each subproblem is convex [inaudible].
>>: That probably means that you are not going to be [inaudible] random
workers; the number of judgments [inaudible], probably [inaudible],
like if you have too many labels.
>> Ofer Dekel: Okay. So let's thank the speaker again. [applause]