>> Ofer Dekel: So the first speaker in the next session is Dengyong Zhou from Microsoft Research, and the title of his talk is Learning From the Wisdom of Crowds by Minimax Entropy.

>> Dengyong Zhou: Good afternoon, everyone. I'll be talking about crowdsourcing. It's joint work with Sumit Basu, Yi Mao and John Platt. The paper will appear in NIPS this year.

These days, crowdsourcing has become a popular way to generate labels. Many companies provide crowdsourcing services; here I just listed a few of them. Perhaps the most popular one is Amazon Mechanical Turk. Generally, labeling one item, for example an image, costs [inaudible], even just one cent. So it's very easy to get a large number of training examples at low cost.

That sounds great. Unfortunately, there is no free lunch. Here, we submitted a subset of data to Amazon Mechanical Turk. The Turkers are not experts. They needed to label the items into four classes, and the average accuracy was 68 percent. So a fundamental challenge in crowdsourcing is how to infer the ground truth from the noisy labels given by non-experts.

Now let me define the problem rigorously. We have a set of workers indexed by i from 1 to m, and a set of items indexed by j from 1 to n. The items could be images, documents, or something else. We also have categories indexed by k from 1 to c. If c is equal to 2, that is just binary labeling, for example labeling a document as spam or not spam. If c is larger than 2, it's multiclass. We have a response matrix denoted by Z, which represents the labels given by the workers. If the entry z_ijk is 1, that means worker i labeled item j as category k; if the entry is 0, worker i labeled item j as some other category. If a worker didn't label that item, the entry is unknown. Our goal is to estimate the ground truth, the true label, for each item.

Perhaps I should point out that the crowdsourcing problem is totally different from collaborative filtering. In collaborative filtering, one needs to estimate the unknown entries. Here we need to estimate the unknown ground truth.

Perhaps the simplest approach for crowdsourcing is majority voting: simply count the votes for each class. For example, let's look at item four. Three workers labeled item four as class one, and two workers labeled it as class two. According to majority voting, the true label of item four should be class one.

But we want to improve on majority voting, because more skilled workers should have more weight in voting. This problem is beautifully addressed by Dawid and Skene. They assume each worker is associated with a confusion matrix, a c-by-c matrix, where c is the number of categories. Each entry of the confusion matrix is the probability of confusion between two classes: p_kl means the true category is k but the worker gives category l. They then assume that for any labeling task, the label given by a worker is generated according to his confusion matrix. Now we have two sets of parameters, a confusion matrix for each worker and the ground truth, and both can be jointly estimated by maximum likelihood. The implementation is an EM algorithm (a sketch follows below).
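This is a minimal sketch of the Dawid-Skene EM procedure just described, not the authors' implementation. It assumes the responses come as a worker-by-item integer matrix with -1 marking items a worker did not label, and that every item gets at least one label; the soft-count initialization doubles as the majority voting baseline.

```python
import numpy as np

def dawid_skene(z, n_classes, n_iters=50, smooth=0.01):
    """EM for the Dawid-Skene model (minimal sketch).

    z: (n_workers, n_items) integer array with entries in {0..c-1},
       and -1 where a worker did not label an item.
    Returns an (n_items, c) posterior over true labels and the
    (n_workers, c, c) estimated confusion matrices.
    """
    m, n = z.shape
    c = n_classes

    # Initialize the label posterior with soft majority voting.
    y = np.zeros((n, c))
    for i in range(m):
        for j in range(n):
            if z[i, j] >= 0:
                y[j, z[i, j]] += 1.0
    y /= y.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: class prior and confusion matrices, where
        # conf[i, k, l] = P(worker i answers l | true class is k).
        prior = y.mean(axis=0)
        conf = np.full((m, c, c), smooth)  # small smoothing avoids log(0)
        for i in range(m):
            for j in range(n):
                if z[i, j] >= 0:
                    conf[i, :, z[i, j]] += y[j]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true label, combining the
        # prior with each labeler's estimated reliability.
        log_y = np.tile(np.log(prior + 1e-12), (n, 1))
        for i in range(m):
            for j in range(n):
                if z[i, j] >= 0:
                    log_y[j] += np.log(conf[i, :, z[i, j]])
        log_y -= log_y.max(axis=1, keepdims=True)
        y = np.exp(log_y)
        y /= y.sum(axis=1, keepdims=True)

    return y, conf
```

Iterating the two steps gives skilled workers more weight in the vote: the E-step trusts workers whose estimated confusion matrices are closer to the identity.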
Let me illustrate the confusion matrix. For this example, assume we knew the true labels for each item: the first three items belong to class one and the remaining three items belong to class two. If we look at worker four, he classified the first three items correctly, so the first row of his confusion matrix is 1 and 0. But he misclassified items four and five as class one, so we have 2 over 3 there, and he correctly classified item six as class two, so the last entry is 1 over 3.

To understand this approach better, let's look at a simplified case in which each worker is associated with a single number, p_i, which is actually his accuracy: with probability p_i the worker gives the right answer, and with probability 1 minus p_i he gives a wrong answer.

The limitation of this approach is also obvious. Under the generative model, the labels given by a worker are distributed identically across tasks. Actually, that's not true. Look at these two images, also from the Stanford dataset. Clearly the image on the left side is easier than the image on the right side. So the method in this talk generalizes the Dawid-Skene approach such that task difficulty is also considered.

Let's assume there is a labeling distribution for each pair of worker and item. We don't assume any particular form for this distribution, so the distribution pi_ijk can depend on the worker i, the item j, and the category k. That's different from the Dawid-Skene approach.

First, just assume that we know the true labels and we want to estimate this distribution. A well-known approach for this kind of estimation is maximum entropy: if we have no constraints, maximum entropy just gives the uniform distribution. Later I will give a more detailed explanation of the constraints, but first a high-level description. You can view the response matrix as an empirical distribution, and our goal is to estimate the unknown distribution so that the two match each other at the first moment.

Let me first explain the constraints on one side. Going back to the problem, for each item we count the number of workers who labeled it as class one (this example is just binary labeling), and we also count the number of workers who labeled it as class two. Then we require the expected numbers of workers to be equal to the empirical counts. We also have another set of constraints, one for each worker. Just as in the classical Dawid and Skene approach, we count misclassifications: the number of misclassifications from class one to class two, and the number from class two to class one. On the other side are the expected numbers of misclassifications. Again, we require the expected numbers of misclassifications to be equal to the empirical counts.

Now I have introduced how to use a maximum entropy approach to estimate the labeling distribution. But in our crowdsourcing problem there is no ground truth, so how do we estimate it? One option is to simultaneously estimate the ground truth Y and the labeling distribution by maximizing over both pi and Y, with the same constraints. But through a little bit of mathematical checking, we found that this formulation leads to a uniform distribution over Y. Of course, that is not desirable. So let's look again at the maximum entropy formulation, in which the true labels appear in the constraints, as sketched below.
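In symbols, the program just described can be reconstructed as follows, where z_ijk is the observed response tensor, y_jl indicates that item j truly belongs to class l, and pi_ijk is the labeling distribution to be estimated. This is a reconstruction from the talk, so the paper's exact formulation may differ in detail.

```latex
\begin{align*}
\max_{\pi}\quad & -\sum_{i,j,k} \pi_{ijk} \ln \pi_{ijk} \\
\text{s.t.}\quad & \sum_{i} \pi_{ijk} = \sum_{i} z_{ijk}
    && \forall\, j,k \quad \text{(per-item vote counts)} \\
& \sum_{j} y_{jl}\, \pi_{ijk} = \sum_{j} y_{jl}\, z_{ijk}
    && \forall\, i,k,l \quad \text{(per-worker confusion counts)} \\
& \sum_{k} \pi_{ijk} = 1, \qquad \pi_{ijk} \ge 0.
\end{align*}
```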
That means that as the true labels take different values, we get different values of the maximum entropy. So why not choose the labels such that the resulting entropy is the smallest? Intuitively, entropy can be understood as uncertainty: if the entropy is small, that means that on average the workers are confident about the labels. Mathematically, this idea is implemented by wrapping the maximum entropy problem in a minimization over the labels. That's all we need to do: an outer minimization around the inner maximum entropy problem.

So far we have solved the problem of estimating the ground truth. But at the beginning I said we wanted to generalize Dawid and Skene's approach so that both task difficulty and worker expertise are considered. In our primal problem there are no such variables. But if we consider the dual form of the minimax formulation and use the KKT conditions, we can show the optimal solution has to take a particular form. Now we have two Lagrange multipliers. The first is tau; tau can be understood as task difficulty. We also have sigma; sigma can be understood as worker expertise. Those are naturally introduced as Lagrange multipliers; we don't have to explicitly introduce those variables. I should point out that the task difficulty here is a vector, not just a single number: it represents how confusable the classes are in this labeling task.

This labeling model has a nice property we call the objectivity principle. First, objectivity of worker expertise: the difference in expertise between two workers is independent of the item being labeled. We also have objectivity of task difficulty: the difference in difficulty between two items is independent of the worker. Let me use an example to illustrate the objectivity principle. Here are the two images that I showed before. No matter who labels these two images, the right one is intrinsically more difficult than the left one, because the right one is not very clear. And we have a nice result on the connection between the labeling model and the objectivity principle: the labeling model uniquely satisfies the objectivity principle. If we want to objectively evaluate workers and items, we have to use this model.

In real-world crowdsourcing problems, we generally have many missing entries. That means we cannot have each item labeled many, many times, because if we labeled each item that many times, it would be [inaudible] and hiring experts would be a better strategy. Also, generally each worker will not label all of the items; they have no time to label all of them, and perhaps they get bored with the labeling task. Due to the finite data, we cannot expect the constraints to hold exactly, so we relax them by putting slack variables in, allowing a little bit of fluctuation. Of course, we put regularization on these two newly introduced variables. These two terms are just squared L2 norms, meaning we don't allow the slacks to move far away from zero. One might prefer other penalties, like the L1 norm, the trace norm, or other matrix norms; we use the L2 norm only for simplicity of computation. The resulting model is sketched below.
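Reconstructing this part in symbols, again following the talk's description rather than the paper's exact notation: the minimax program and the dual-form labeling model are

```latex
\begin{align*}
& \min_{y}\ \max_{\pi}\ -\sum_{i,j,k} \pi_{ijk} \ln \pi_{ijk}
  \qquad \text{subject to the moment-matching constraints above,} \\
& \text{and at the optimum (by the KKT conditions), with } l \text{ the true class of item } j, \\
& \pi_{ijk} \;=\; \frac{\exp\big(\tau_{jk} + \sigma_{ilk}\big)}
                       {\sum_{k'} \exp\big(\tau_{jk'} + \sigma_{ilk'}\big)},
\end{align*}
```

where tau_j is a per-item vector of multipliers (task difficulty) and sigma_i a per-worker c-by-c matrix of multipliers (expertise); the regularized variant relaxes the two constraint sets with slacks and subtracts squared L2 penalties on them. As a loose, hypothetical illustration of the alternating computation the speaker alludes to (and mentions again in the question session), one could fit the dual parameters by regularized maximum likelihood with the labels fixed, then re-pick the labels, and repeat. The hyperparameter values below are illustrative, and this is not the authors' algorithm.

```python
import numpy as np

def log_softmax(v):
    """Numerically stable log-softmax of a 1-D array."""
    v = v - v.max()
    return v - np.log(np.exp(v).sum())

def minimax_entropy_labels(z, n_classes, n_rounds=10, n_grad=50, lr=0.05, reg=0.01):
    """Loose sketch of alternating optimization for a minimax-entropy-style
    model: p(k) = softmax_k(tau[j, k] + sigma[i, y_j, k]).

    z: (n_workers, n_items) labels in {0..c-1}, with -1 for missing.
    """
    m, n = z.shape
    c = n_classes

    # Initialize the labels with majority voting.
    y = np.array([np.bincount(z[:, j][z[:, j] >= 0], minlength=c).argmax()
                  for j in range(n)])

    tau = np.zeros((n, c))       # per-item parameters ("task difficulty")
    sigma = np.zeros((m, c, c))  # per-worker parameters ("expertise")

    for _ in range(n_rounds):
        # Step 1: with y fixed, gradient ascent on the L2-regularized
        # log-likelihood of the observed responses.
        for _ in range(n_grad):
            g_tau, g_sigma = -reg * tau, -reg * sigma
            for i in range(m):
                for j in range(n):
                    if z[i, j] < 0:
                        continue
                    p = np.exp(log_softmax(tau[j] + sigma[i, y[j]]))
                    err = -p
                    err[z[i, j]] += 1.0          # observed minus expected
                    g_tau[j] += err
                    g_sigma[i, y[j]] += err
            tau += lr * g_tau
            sigma += lr * g_sigma

        # Step 2: with tau and sigma fixed, set each label to the class
        # under which the observed responses are most likely.
        for j in range(n):
            scores = np.zeros(c)
            for l in range(c):
                for i in range(m):
                    if z[i, j] >= 0:
                        scores[l] += log_softmax(tau[j] + sigma[i, l])[z[i, j]]
            y[j] = scores.argmax()

    return y, tau, sigma
```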
Now let me show the experimental results. The first result is on data from web search. We have a set of query-URL pairs, and many workers jointly labeled the pairs on a scale of perfect, excellent, good, fair, and bad. For example, if the query is Microsoft and the web page is microsoft.com, it would be a perfect match. Each pair was labeled by six workers, of course six different workers. If we check the average worker accuracy, that is, the accuracy of each worker averaged over workers, it's just 37 percent. That's amazingly low, though random guessing is 25 percent. If you use majority voting, the accuracy is around 80 percent. If you use Dawid and Skene's approach, which doesn't take the task difficulty into account, you get an accuracy of 84 percent. Minimax entropy achieves the best result, around 89 percent.

Here is the experiment on image labeling. We submitted a sample of the dataset, around 800 tasks with four classes, and each image was labeled by five different workers. We can see that this labeling task is much easier: the average worker performance is around 68 percent. Majority voting achieves 80 percent, and the Dawid-Skene approach 82 percent. Our approach does a little better by considering the task difficulty.

Now let me conclude the talk. We have a minimax entropy principle to infer the ground truth from labels given by a crowd. We can derive a labeling model which includes task difficulty and worker expertise, and we show the equivalence between the objectivity principle and the labeling model. There are many other things I did not cover here; you can look at the paper. I didn't talk about how to implement this approach: we have a minimax entropy formulation, so we need to design an algorithm to solve that kind of program. We also have a theoretical justification for why we use minimax entropy; here I just talked about the intuition. That's all. Thanks.

[applause]

Any questions?

>>: Minimax models usually suffer from instability if you don't have enough judgments per item. Like if only two or three judges judge each item; in fact, if only two, then it will for sure break. It's usually very unstable. Can you comment on --

>> Dengyong Zhou: That's why we need regularization. We put a prior on these parameters.

>>: But a prior on the parameters will simply converge to the most simple model [inaudible].

>> Dengyong Zhou: If it's just two, that's a difficulty, right. We need at least a minimum number of labels per [inaudible], something like that.

>>: I see.

>> Dengyong Zhou: We can talk more offline.

>>: I have some more questions.

>> Dengyong Zhou: Sure.

>>: This is actually related. In the real world you might have an infinite number of tasks but you can afford only a finite number of ratings. Does this formulation suggest what the best way to allocate ratings to that infinite set of tasks is, like three per task picked randomly, or can you do something smarter?

>> Dengyong Zhou: I don't know whether this would be the best way or not. So far that's the only solution I have figured out. For crowdsourcing problems, that really is quite positive.

>>: Could you say quickly how big this can scale to, how many tasks and workers can you handle computationally? Does it get hard?

>> Dengyong Zhou: Computationally, we just solve the dual problem with coordinate descent. Each subproblem is convex and [inaudible].

>>: That probably means that you are not going to [inaudible] random workers, the number of judgments [inaudible], like if you have too many labels.

>> Ofer Dekel: Okay.
So let's thank the speaker again. [applause]