>> Dengyong Zhou: So let's start. Today our Microsoft visiting speaker is Qiang Liu. Qiang is a PhD student at UC Irvine in the computer science department. His research interest is in machine learning and its applications. Qiang received an MSR PhD Fellowship in 2011, and that same year he did an internship with John Platt and Chris Meek. Today he'll be talking about crowdsourcing. Thank you.
>> Qiang Liu: Thanks Denny for the introduction. Can you hear me? Can you hear me? Okay.
So this is joint work with my friend Jian Peng and my advisor Alex Ihler. It was presented at NIPS last year. Crowdsourcing is the process of outsourcing problems you want to solve to a crowd of people, and recently it has become a very powerful approach for solving problems that computers cannot solve alone, by harvesting human intelligence. It's also very powerful for gathering information and data, because crowds of people can do a lot of powerful things. However, crowdsourcing can also backfire if you don't treat the labels carefully, because humans tend to be unreliable and very different from each other, so different people may have different opinions. The problem is how to de-noise the labels from the crowd and how to aggregate the opinions of different people. So today I'm going to focus on consensus algorithms for this crowdsourcing setup. To be more specific, assume we have a set of images with unknown true binary labels. For example, in this case maybe you want to identify whether there are ducks in the image. Assume there are too many images, so we go on one of the crowdsourcing platforms, such as Amazon Mechanical Turk, and hire a set of workers. Now each worker is assigned a subset of the images, and each image is in turn labeled by multiple workers.
Doing this gives us redundant information to help the accuracy. But because the workers are unreliable, they may give different labels even for the same image. The problem is how to aggregate these noisy labels into an estimate of the unknown true label zi. The criterion here is to minimize the bitwise error rate, which is just the expected number of mistakes, the number of wrongly labeled images. Because the workers are very diverse, I measure the diversity using a reliability qj, which is just the accuracy, the probability that the worker gives the correct answer. We have experts, who have very high accuracy, qj close to one. We have spammers, lazy people who just give random answers ignoring the problem itself, so they have qj approximately 1/2. And we have adversaries, who give the opposite answers for all sorts of reasons. There are many algorithms for this problem. A naïve algorithm is just majority voting; here, because the labels are +1 and -1, I write it as a sign of a sum. But this is definitely not very good at accounting for the diversity of the workers, because it treats all the workers and all the labels uniformly. In this plot I show the result of one of my experiments. The black line is an oracle lower bound that you can get if you exactly know the accuracy of each worker, so this is what you can get with oracle knowledge. And majority voting is here, so on this error rate there's a very large gap that you can potentially close by accounting for the diversity of different workers.
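As a concrete reference point, here is a minimal sketch of the majority voting baseline just described; the array layout (0 marking an unlabeled task) is an assumption of this sketch, not something from the talk:

```python
import numpy as np

def majority_vote(L):
    # L: (n_tasks, n_workers) array over {-1, 0, +1}, where 0 means the worker
    # did not label that task. Ties (sum == 0) break toward +1.
    s = L.sum(axis=1)
    return np.where(s >= 0, 1, -1)
```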
>>: [inaudible] spammers, adversaries and so on, right? The fact that the oracle, the oracle's [inaudible]
>> Qiang Liu: It assumes that qj is exactly known to you.
>>: But you also picked a distribution over the q’s?
>> Qiang Liu: So once you know the exact values, the distribution doesn't matter.
>>: But doesn't this plot change when the distribution changes?
>> Qiang Liu: Sure. Sure. This is one of the examples. This is just a demonstration.
>>: So this is like empirical or this is just some like…
>> Qiang Liu: This is an artificial simulation to illustrate the concept. I will show more experiments later.
>>: Do you know what the actual, what the true oracle is when the number of workers…
>> Qiang Liu: I know in this case.
>>: Okay.
>> Qiang Liu: I will definitely go back to this plot later. So another algorithm is an iterative algorithm proposed by David Karger, Sewoong Oh, and Devavrat Shah. This is a quite interesting algorithm; it's basically a weighted version of majority voting. Here each label Lij is weighted by a confidence level yji, a real value, and these yji are iteratively calculated by two linear update rules. Intuitively, you can interpret these linear updates as passing messages between the workers and the tasks. I will refer to this algorithm as KOS for convenience in this talk. It's a very interesting algorithm; people can actually show some optimality properties for its sample complexity in the asymptotic case. However, it sometimes doesn't perform very well in practice.
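To make the update rules concrete, here is a sketch of a KOS-style iteration as I understand it from this description; the dense-array formulation and the Gaussian random initialization are assumptions of the sketch:

```python
import numpy as np

def kos(L, n_iters=20, seed=0):
    # L: (n_tasks, n_workers) over {-1, 0, +1}, 0 = not labeled.
    rng = np.random.default_rng(seed)
    mask = (L != 0)
    # y[i, j] holds the worker-to-task message y_{j->i}
    y = rng.normal(1.0, 1.0, size=L.shape) * mask
    for _ in range(n_iters):
        # task -> worker: x_{i->j} = sum_{j' != j} L_{ij'} * y_{j'->i}
        x = ((L * y).sum(axis=1, keepdims=True) - L * y) * mask
        # worker -> task: y_{j->i} = sum_{i' != i} L_{i'j} * x_{i'->j}
        y = ((L * x).sum(axis=0, keepdims=True) - L * x) * mask
    # decode: weighted majority vote with the final confidences
    return np.where((L * y).sum(axis=1) >= 0, 1, -1)
```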
You may be wondering how people derived this algorithm. Actually there's no derivation: in the paper they just write the algorithm down, following some intuition from belief propagation, but there's no formal connection there. So we were curious to understand what this algorithm is exactly doing, so that we can improve it and understand its advantages and disadvantages. In this talk I'm going to present more general algorithms that actually unify majority voting and KOS, so both of them will be extreme cases of our algorithm under special assumptions. And by smoothing between majority voting and KOS we can actually get much better performance. So the blue line is our algorithm and the red line is KOS; I will show more details on this part later. And our algorithm has a principled derivation.
Another set of algorithms is the EM algorithms; there are a lot of them, starting from 1979 up to very recent papers. What people usually do is build a generative probabilistic model of how the labels are generated, estimate the parameters of the model by maximum likelihood using EM, treating the true labels as hidden variables, and then go back and estimate the labels with the parameters they just estimated. The problem for EM is that there are so many models you can use, from very simple models that use only a Bernoulli distribution or a confusion matrix, as in the 1979 work, to very complicated recent models that account for all sorts of factors and use multidimensional representations. So the real question is how to choose among these models: we have simple models and complex models, which one should we use? There is a trade-off here. Also, given a model, how should you do inference and decode the labels? EM has this two-step approach where you estimate the parameters and then go back to estimate the labels. Is this optimal in the sense of minimizing the bitwise error rate? And if it's not, how much improvement can we get from using a different, more efficient algorithm, even if the model is the same? And what is the connection between the EM algorithms and algorithms like majority voting and KOS that I showed before? This work tries to address these problems by rewriting the crowdsourcing problem as an inference problem on graphical models and using inference techniques such as belief propagation and mean field. Some background on graphical models: a graphical model is a special high-dimensional distribution whose probability is a product of many local factors, each of which depends only on a subset of variables. This structure can be represented using a factor graph, where the circles are variables.
The squares are factors, and a variable and a factor are connected if the variable is involved in the factor. So this is a very simple example. Now, given a graphical model, a usual problem is to calculate the marginal probabilities of single variables, which requires us to sum over, or marginalize over, all of the other variables. This is a difficult problem, usually NP-hard, because you have to sum over exponentially many terms; assume the z here are binary variables. There are many approximation algorithms for doing this, including belief propagation and mean field. Belief propagation, proposed by Pearl and developed by many other people, works by approximating the marginal distributions with a product of a set of functions, and these functions can be interpreted as messages because they are calculated iteratively by two updates: one you can understand as passing messages from variables to factors, and the other as passing messages from factors to variables. The details of this algorithm are not important for this talk; just remember that it has this iterative message-passing style, which is similar to the KOS algorithm I introduced earlier. Another set of algorithms is the mean field algorithms, which also approximate the marginal distributions, but work by approximating the joint distribution p(z) with a fully independent model, by minimizing the KL divergence; usually this can be solved using coordinate descent, and it's very efficient. There is more background in Wainwright and Jordan's book or Koller and Friedman's book. Now we need to build the graphical model for our crowdsourcing problem. To do this we start from the simplest thing we can do: we assume the labels of the workers are generated by a very simple Bernoulli distribution. Basically, Lij, the label worker j gave to image i, is correct, that is, equal to the true label zi, with probability qj; otherwise it's wrong, because the labels are binary. Here qj is the accuracy of worker j, and I assume it's drawn from some prior distribution, so all of the workers have the same prior distribution but different values of qj. So now we can calculate the joint posterior distribution of the true labels z and the reliabilities q, which is proportional to the product of the prior distributions times the Bernoulli likelihood terms. Here dj is the total number of images labeled by worker j, and cj is the number of correctly labeled images among all of the dj images labeled by worker j.
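Written out, the posterior just described is, as far as I can reconstruct it (with a uniform prior on z):

$$p(z, q \mid L) \;\propto\; \prod_{j} p(q_j)\, q_j^{\,c_j}\, (1 - q_j)^{\,d_j - c_j}, \qquad c_j \;=\; \sum_{i \in \partial j} \mathbb{1}[\, L_{ij} = z_i \,],$$

where $\partial j$ is the set of images labeled by worker $j$ and $d_j = |\partial j|$.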
>>: [inaudible] NP hard [inaudible]
>> Qiang Liu: Yeah. We assume equal difficulty for all images. It's the simplest model you can use.
Now this cj is the number of correct labels, so it's actually a function of the true labels zi and also the noisy labels Lij. So now, given this model, you can figure out the optimal estimator for the true labels z: you should maximize the marginal posterior distribution of each zi individually, which exactly minimizes the expected bitwise error rate. Doing this requires you to calculate the marginal distribution, in which you have to marginalize over the reliabilities and then sum over all of the other true labels. This is a difficult inference problem because, as I mentioned, it requires integration over a high-dimensional space. This is where we are going to use inference algorithms. Any questions on this slide?
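In symbols, the estimator being described is, in my reconstruction,

$$\hat z_i \;=\; \arg\max_{z_i \in \{\pm 1\}} p(z_i \mid L), \qquad p(z_i \mid L) \;\propto\; \sum_{z_{-i}} \int p(z, q \mid L)\, dq,$$

where $z_{-i}$ denotes all the true labels except $z_i$; maximizing each marginal separately is what minimizes the expected number of wrongly labeled images.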
>>: [inaudible]
>> Qiang Liu: Sorry.
>>: [inaudible] something [inaudible] labels.
>> Qiang Liu: So the hyperparameters are hidden in the prior of the qj, so you can treat the qj as having a prior. You can also add priors on the zi, but I didn't, just for simplicity; right now it's uniform. So now we can actually first integrate over the continuous variables q, the reliabilities, so we get a marginal distribution over the true labels only. The true labels are discrete, which is good for the belief propagation algorithm. And this integration can actually be calculated exactly by pushing the integration inside the product; then the integrations are only one-dimensional and easy to calculate, either by numerical methods or in closed form. So we can define each of these one-dimensional integrals as a local factor ψj, which is a function of cj; but remember cj is a function of the true labels zi of the images that connect to worker j. So overall we can rewrite the posterior distribution of z as a product of local factors, which has exactly the graphical model form.
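With a conjugate Beta prior the one-dimensional integral even has a closed form; here is a small sketch (the function name and the log-space convention are mine):

```python
import numpy as np
from scipy.special import betaln

def log_factor(c, d, alpha=1.0, beta=1.0):
    # psi_j(c) = integral of p(q) q^c (1-q)^(d-c) dq with a Beta(alpha, beta)
    # prior on q: the integral is B(alpha + c, beta + d - c) / B(alpha, beta).
    return betaln(alpha + c, beta + d - c) - betaln(alpha, beta)

# With a flat Beta(1, 1) prior and d = 10 labels, the factor is symmetric and
# convex in c: largest at c = 0 and c = 10, smallest at the random point c = 5.
vals = log_factor(np.arange(11), 10)
```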
>>: [inaudible]
>> Qiang Liu: For what?
>>: For the prior, you don't need [inaudible]
>> Qiang Liu: You don't need to. So yeah, you can set an arbitrary prior and still integrate over it, yeah. Now we can actually transform this bipartite assignment graph of crowdsourcing into a standard factor graph representation, where the variables are the tasks, or images, and the factors are the workers. The idea here is that each worker actually introduces some correlation in the posterior distribution over the images he labeled: if the worker is very good, then all of the true labels should be consistent with his labels, so it's that sort of correlation. So now we can run standard belief propagation on this posterior distribution p(z). We have these local factors ψj here, so let's see what shape they have.
So the factor has this integral form that depends on the prior of qj. If qj has a flat prior, so the reliability is uniform between zero and one, the factor has this symmetric and convex shape: at both ends, when the workers are either perfectly wrong or perfectly correct, they provide a lot of information, so the factor has a high value, and when a worker gets half of the labels correct, he's basically random and provides no information, so the factor has a low value.
Any questions? So here are some other priors. In this case you have larger probability of q being larger than .5, so there are more experts, and in this case all-correct carries more information than all-wrong. Here's another prior, where we have half spammers and half experts. And here's the case where the workers' reliability is deterministically equal to some q larger than .5; in this case the factor is actually a straight line.
>>: [inaudible] you expect [inaudible] the one you didn't show, which is the [inaudible] around
.5 [inaudible]
>> Qiang Liu: Oh, you mean just this one with this?
>>: [inaudible]
>> Qiang Liu: So actually you can guarantee that for an arbitrary choice of prior on qj this factor always has a nonnegative second derivative, second in the sense of finite differences, so you always have this…
>>: [inaudible]
>> Qiang Liu: Yeah, yeah. You can actually show that. It could be flat, but yeah. So fundamentally, if you look at the form of this shape, it's something close to a negative entropy: when you have deterministic values it's more informative. Sorry, oh yeah.
>>: Did you [inaudible] when you have a spammer, what's that? You have no experts, [inaudible] just the [inaudible] red.
>> Qiang Liu: Remember I have this thing: basically the value of q decides the slope of the line, so if it's in the middle it's just flat. So now you can just run the standard belief propagation algorithm. This is just the form, don't bother with the details, but essentially we decode the true labels zi by maximizing the marginals we get from the algorithm.
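For concreteness, here is a sketch of what such a loopy BP decoder could look like for this factor graph, with the reliability prior integrated out as above. The implementation details (message parameterization, schedule, the Beta(2, 1) default) are my own assumptions; the counting trick uses the fact that each factor depends on its neighbors only through cj:

```python
import numpy as np
from scipy.special import betaln

def log_psi(c, d, a=2.0, b=1.0):
    # local factor psi_j(c) = B(a + c, b + d - c) / B(a, b), in log space
    return betaln(a + c, b + d - c) - betaln(a, b)

def bp_decode(L, a=2.0, b=1.0, n_iters=30):
    # L: (n_tasks, n_workers) over {-1, 0, +1}; one variable per task,
    # one factor per worker. Messages are stored as P(z_i = L_ij).
    n, m = L.shape
    nbr_j = [np.flatnonzero(L[:, j]) for j in range(m)]    # tasks of worker j
    nbr_i = [np.flatnonzero(L[i, :]) for i in range(n)]    # workers of task i
    u = {(i, j): 0.5 for j in range(m) for i in nbr_j[j]}  # task -> worker
    v = dict(u)                                            # worker -> task
    for _ in range(n_iters):
        for j in range(m):  # factor (worker) updates
            d = len(nbr_j[j])
            lp = log_psi(np.arange(d + 1), d, a, b)
            psi = np.exp(lp - lp.max())  # shared rescaling, for stability
            for i in nbr_j[j]:
                P = np.array([1.0])  # distribution of #matches among others
                for i2 in nbr_j[j]:
                    if i2 != i:
                        P = np.convolve(P, [1 - u[(i2, j)], u[(i2, j)]])
                w1 = (P * psi[1:d + 1]).sum()  # z_i agrees with L_ij
                w0 = (P * psi[:d]).sum()       # z_i disagrees
                v[(i, j)] = w1 / (w1 + w0)
        for i in range(n):  # variable (task) updates, in log odds form
            for j in nbr_i[i]:
                s = sum(np.log(v[(i, j2)] / (1 - v[(i, j2)]))
                        * (1 if L[i, j2] == L[i, j] else -1)
                        for j2 in nbr_i[i] if j2 != j)
                u[(i, j)] = 1.0 / (1.0 + np.exp(-s))
    z = np.empty(n, dtype=int)  # decode by maximizing approximate marginals
    for i in range(n):
        s = sum(np.log(v[(i, j)] / (1 - v[(i, j)])) * L[i, j] for j in nbr_i[i])
        z[i] = 1 if s >= 0 else -1
    return z
```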
>>: But it's not [inaudible], right? You're doing loopy propagation.
>> Qiang Liu: Yeah, it's approximation.
>>: So you don't know whether it will converge?
>> Qiang Liu: Actually, yeah, I don't know. This problem is somewhat easy, because if you have a lot of workers the distribution actually concentrates around the true value, so that's not a problem here. But yes, it's always a concern for loopy propagation. So now, what's interesting is that because the labels are binary you can actually transform the binary distributions into log odds ratios, and the same for the messages. Now both the messages and the marginals are written as real values, and we can transform the whole BP algorithm into this log odds form, and we get something very similar to the KOS algorithm: we have the same formula for estimating the labels, again a weighted majority voting, and the same linear update from tasks to workers, but a different message update from workers to tasks. For KOS this is a linear update, but for our algorithm we have this nonlinear sigma function, whose exact form I will not define in this talk; basically it's a symmetric and monotonically increasing function that saturates at large values. And you can calculate it very efficiently, in O(dj (log dj)^2) where dj is the number of images labeled by worker j, so it's only slightly worse than KOS, whose complexity is O(dj). The idea here is that using this sigma function is somehow more robust: it doesn't go to infinity like the linear update, and this is actually important.
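My reconstruction of the update structure in this log odds form, with $\partial i$ the set of workers who labeled task $i$ (the talk does not spell out $\sigma$):

$$\hat z_i = \operatorname{sign}\Big(\sum_{j \in \partial i} L_{ij}\, y_{j \to i}\Big), \qquad x_{i \to j} = \sum_{j' \in \partial i \setminus j} L_{ij'}\, y_{j' \to i}, \qquad y_{j \to i} = \sigma\big(\{x_{i' \to j}, L_{i'j}\}_{i' \in \partial j \setminus i}\big),$$

and KOS is recovered exactly when $\sigma$ is the linear map $y_{j \to i} = \sum_{i' \in \partial j \setminus i} L_{i'j}\, x_{i' \to j}$.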
>>: [inaudible] values [inaudible]
>> Qiang Liu: Ah, this one. Oh yeah. So it's actually complicated, because sigma is a function of many variables. Sigma is a symmetric function, so this plot is just a slice where you treat all of the inputs as equal, a special case. In general you can prove it's symmetric and monotonically increasing; in general it has this shape.
>>: So that means the law [inaudible] is now defined as the -- can you go back a page?
>> Qiang Liu: So actually…
>>: Saturate. This doesn't say that the [inaudible] put anything on the marginals. It's just that the size of the messages is saturated.
>> Qiang Liu: Yeah. So this algorithm actually comes from a proper prior, and because of the prior the probabilities are bounded away from zero. Now it's interesting to look at the algorithm when we take some special priors. If the workers have this deterministic prior, with q equal to some value larger than .5, then my algorithm reduces to majority voting, which is not surprising, and it's not good because this prior doesn't account for the diversity of the workers or for adversaries.
>>: Sorry. [inaudible]
>> Qiang Liu: Yeah, any.
>>: I see.
>> Qiang Liu: Once it's larger than .5. If it's smaller than .5 then you get some sort of like…
>>: [inaudible]
>> Qiang Liu: Yeah, yeah.
>>: So what is the, what is the function in this case? Is it like a step function?
>> Qiang Liu: So actually sigma is a constant function in that case: every time you update, it sets the message to a constant, and then when you go back to this update you get majority voting. Now, what's interesting is that if you take a special prior called the Haldane prior, where the workers' reliability equals either zero or one, each with half probability, then my algorithm reduces to the KOS algorithm, in which case the sigma function reduces to the straight line. This prior is actually very special: it's the limit of the Beta(ε, ε) prior as ε goes to zero. In objective Bayesian statistics many people discuss this prior, because it has very nice properties that connect Bayesian statistics to frequentist statistics; for example, if you do Bayesian inference with this prior you exactly recover the maximum likelihood estimator. However, this prior is also not reasonable in practice, because it assumes too many adversaries: basically you have almost as many adversaries as good workers, and you don't have anything in between. This is very extreme. And because my algorithm can take an arbitrary prior, you can think about taking more reasonable priors like this one, where we have a reasonable number of adversaries but they are not dominant. In practice this works way better than KOS.
>>: [inaudible] can you estimate the priors?
>> Qiang Liu: I actually tried that. We can actually…
>>: Is that the sort of curve [inaudible] or…
>> Qiang Liu: So it depends, actually. But yes, in general it has this shape, though sometimes it has some peaks, so it's not as smooth as the ones I tried, because the real data sets are not that large, so you cannot see the full effect. The point is that I tried different priors, and I even tried to learn the priors; actually the choice of prior doesn't influence the accuracy a lot, once it has some general shape like this. I will talk about this in the next few minutes.
>>: [inaudible]
>> Qiang Liu: So for the multiclass case: basically here this is a beta prior; for multiclass you have a confusion matrix, and then you have a Dirichlet prior.
>>: But then the factors -- oh, but it's conjugate so you don't need to do the integral.
>> Qiang Liu: Yeah, you don't have to. The integral would be high-dimensional if it were not conjugate.
>>: Right, but if it's [inaudible]
>> Qiang Liu: Yeah, yeah, sure. So now I have talked about how to use belief propagation to do the inference. We can also use a mean field algorithm here, and it actually connects to the EM algorithm. What we do here is slightly different. We have this joint posterior distribution over z, the true labels, and the reliabilities q. Now we just approximate the whole distribution with a special fully independent model, a model whose probability is just a product of local probabilities μi over zi and νj over qj; both μ and ν are probabilities, and we minimize the KL divergence. We can use a coordinate descent algorithm to solve this: we update μ with ν fixed, and update ν with μ fixed. What's special is that ν is a distribution over a continuous variable, so in general you cannot represent the update exactly, but you can actually show that at the optimal point ν is always a beta distribution, so each of these updates just adjusts the parameters of a beta distribution, and that's exactly the optimal solution. Then what you can do is approximate the update of ν with a first-order approximation. The reason I want to do this approximation is that I then get a much simpler update: after the approximation we have updates for μ and q, and for the q distribution you only need to keep track of the mean value of qj. Alpha and beta are the coefficients of the prior; here I assume the reliability has a beta prior. Now, what's interesting is that if you go back and derive the EM algorithm on the same model, you find that you get almost the same update, except that in your update for qj the alpha is replaced by alpha - 1. That's only one small difference, but it actually makes a lot of sense, because it essentially adds add-one smoothing, which makes the algorithm more robust. In general, if you take a uniform prior, in which case alpha and beta are equal to one, then the EM update can set qj exactly to zero or one; if that happens, you can check that the EM algorithm will stop at that point and never update again, so in many cases add-one smoothing is actually a good thing to do. And the difference between alpha and alpha - 1 is intuitively just the difference between marginalizing the beta distribution and maximizing it. Our algorithm can also have several different extensions, because it is derived from a general generative model, so you can think about different models. Before, I used this very simple Bernoulli model where the labels are generated from a single coin, so qj is the accuracy, but you can actually extend it to a more complicated model that has a confusion matrix structure: depending on the value of zi, the probability of correctness is different; one is the sensitivity and the other is the specificity. This model is actually very important in many practical data sets, because often there is an asymmetry where, if the true label is +1, the accuracy could be very high or very low, and this model can capture that. I will show you experiments on that later. And also, since we are able to marginalize over the parameters, we can actually do model selection by just comparing marginal likelihoods: if we have these two models and we want to decide which one fits the data better, we calculate the marginal likelihood conditioned on model one and on model two and compare which is larger. And we can also incorporate item features and expert labels and all sorts of things that you can think about. So now I'm going to present some experiments, starting from some simulated examples and then moving to the real data.
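Before the experiments, here is a minimal sketch contrasting the two updates just described for the one-coin model; the vectorized layout and the clipping safeguard are my own additions:

```python
import numpy as np

def coin_model(L, alpha=2.0, beta=1.0, n_iters=50, use_mean=True):
    # L: (n_tasks, n_workers) over {-1, 0, +1}. The only difference between the
    # approximate mean field update and EM is whether q_j is set to the
    # posterior mean (uses alpha) or the posterior mode (uses alpha - 1).
    n, m = L.shape
    obs = (L != 0)
    mu = np.full(n, 0.5)  # mu_i = current belief that z_i = +1
    for _ in range(n_iters):
        # expected number of correct labels per worker under the beliefs mu
        c = (np.where(L == 1, mu[:, None], 1 - mu[:, None]) * obs).sum(axis=0)
        d = obs.sum(axis=0)
        if use_mean:
            q = (alpha + c) / (alpha + beta + d)          # mean field
        else:
            q = (alpha - 1 + c) / (alpha + beta - 2 + d)  # EM (mode)
        # with alpha = beta = 1, EM can hit q = 0 or 1 exactly and get stuck,
        # the instability mentioned in the talk; clip to keep the sketch safe
        q = np.clip(q, 1e-12, 1 - 1e-12)
        w = np.log(q) - np.log(1 - q)  # log odds weight of each worker
        mu = 1.0 / (1.0 + np.exp(-(L * w[None, :]).sum(axis=1)))
    return np.where(mu >= 0.5, 1, -1), q
```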
So this is the figure I showed you earlier. The assignment graph is a random bipartite graph, and the workers' reliabilities are drawn from this very simple prior where you have half experts and half spammers. Then I increase the number of workers per image, and as it increases the error rate generally decays. KOS is generally better than majority voting, but it actually performs badly when the number of workers is small. Now, if you use the BP algorithm with a uniform prior, you get a slight improvement; remember KOS is also a BP algorithm, but with a Beta(0,0) prior, so the only difference is the choice of prior, and the same holds for majority voting. Now, if you use BP with a Beta(2,1) prior you do much better. What this means is basically that the true prior doesn't have any adversaries, so if you assume fewer adversaries it's reasonable that you get much better results. And here is what I get if I use the true prior, because I know it, in my BP algorithm, so in some sense this is the best you can do; in practice you never know the true distribution. And actually BP with the Beta(2,1) prior is very close to the true distribution. So, as I said before, I tried different priors and I even tried to learn the prior, but it doesn't matter much in this case, because Beta(2,1) is already very close to the true distribution. So okay. Oh, sorry, I have something else. I also ran the EM algorithm and the approximate mean field algorithm; their performance is slightly worse than BP, but not very significantly, so only a minor difference. Here is the same data, but now I change the number of images per worker with a fixed number of workers per image. If you fix the number of workers per image and only increase the number of images per worker, you get more and more accurate estimates of the workers' reliabilities. Because majority voting doesn't use any reliability information it's always flat, and all the other algorithms actually get better when you have more images, basically because you observe each worker more. KOS is worse, and the other algorithms are close to each other, also close to the true-prior curve.
>>: So is the [inaudible] here actually the [inaudible] but without the [inaudible]?
>> Qiang Liu: This one?
>>: The EM just in general, so [inaudible] model but minus the [inaudible]
>> Qiang Liu: Yeah, yeah, I forget which paper, but this is the one-coin model, without difficulty but with the prior. So here is what happens when I change the data prior. Before I had this prior; now I gradually increase the percentage of adversaries, so now I actually have adversaries. As the adversaries increase, majority voting gets worse because, you know, it's majority voting, but all of the other algorithms actually get better, because adversaries actually carry some information: if you can detect them, you can flip their labels and recover the correct label, and that's what these algorithms do. But what's interesting is that the EM algorithm, which is this dark curve, is actually much worse than the approximate mean field algorithm, which differs from the EM algorithm only by the add-one smoothing. And EM actually gets worse when there is a large number of adversaries, because of this numerical stability issue.
>>: [inaudible]
>> Qiang Liu: Yeah, I didn't say.
>>: [inaudible] generally it's better to run [inaudible]
>> Qiang Liu: Yeah, okay. Yeah. So here's a real data set. It's an image data set with a set of birds; we have two types of birds here and we want to identify which is which, so it's a binary classification problem: 108 images and 39 workers. Because the original assignment graph is fully connected, we subsample the number of workers. What's interesting is that all of these algorithms, EM, BP, KOS, are actually worse than majority voting in this case. And they are even worse than the algorithm proposed in the same paper as the data set, Welinder et al. 2010, which uses a multidimensional representation model. That model is sort of complicated, because it has difficulty and all sorts of things, but it does much better than majority voting. So what's the reason for this? For these three curves I'm actually using the one-coin model. If I use the two-coin model and run the same inference algorithms, I immediately get results as good as their algorithm.
>>: Excuse me, what is the two-coin model?
>> Qiang Liu: The two-coin model is the confusion matrix model, so depending on -- yeah, sorry.
>>: The error [inaudible] asymmetric.
>> Qiang Liu: Yeah, so this is the nature of this data set: I forget which, but one of the birds is more difficult to identify than the other one, and they also talk about this in their paper. So basically, for this data set, the most important factor is this two-coin assumption. Other than that, things like multidimensional representations or image difficulty don't seem to help, at least according to this result.
>>: [inaudible] result [inaudible]?
>> Qiang Liu: Yeah, I have. I think it's around the same. There are just too many priors; I did not want to show them all.
>>: [inaudible] because [inaudible] must be the [inaudible] learned [inaudible] a label
[inaudible] easy.
>> Qiang Liu: So you are saying which part is unreasonable? This?
>>: [inaudible] thing to have [inaudible]
>> Qiang Liu: Oh, yeah, sure, sure.
>>: So if you [inaudible] easy then it must be [inaudible]; you would think that the judges would learn as they judge.
>>: But they are not getting any feedback, are they? [multiple speakers] [inaudible]
>>: Are they getting feedback on the data set? Or was the data set just bulk labeled without feedback to the [inaudible]?
>> Qiang Liu: Ah, what do you mean by feedback? You mean the workers get some sort of feedback on their results…
>>: Other than instructions, were the workers given any error correction information as they were labeling?
>> Qiang Liu: I don't know, so it's just a data set that they gave me.
>>: [inaudible] classifier that's operating at the Bayes error rate. Let's assume people are. Then it's not necessarily true that the optimal confusion matrix is in fact symmetric; depending on the distributions, the humans could be Bayes optimal and still have an asymmetric confusion [inaudible]
>>: But presumably if they get feedback they have incentive to be Bayes optimal, right?
>>: Yeah, or even if they're trying to be, just being generous, right?
>> Qiang Liu: So you can actually plot what the empirical confusion matrix looks like for this data set, and it is actually asymmetric. And then we tried another natural language data set. In this case we have a set of sentences, and we want to rank the temporal order of two verbs; in this case, "John fell; Sam pushed him." The push happened first, so that's the answer.
And in this data set the story is actually different, because if you try the one-coin and two-coin models, you get mostly the same performance, while KOS and majority voting are much worse. So yeah. Sorry?
>>: There were features, right?
>> Qiang Liu: No features. Just the sentences, just the problems themselves, no features.
>>: [inaudible] label. The interesting thing is [inaudible] there is a paper from the [inaudible] their approach is the best. So actually it's the worst.
>> Qiang Liu: Yeah, in theory it's the best, but the theory assumes a lot of things: they assume the number of workers and the number of images are very large, the graph is generated randomly, and, more importantly, that the labels come from the one-coin model, which is not realistic. And even when the assumptions are all correct, it still performs worse in the case where you have a small number of workers; they call this phenomenon a phase transition, and they actually discuss this problem. So yeah, I think fundamentally it's because they use this Haldane prior; they don't add any smoothness into the algorithm. That's their problem. To summarize, I have used this graphical model method to solve the crowdsourcing problem, generalized KOS and majority voting, and connected the EM algorithm with mean field methods. Some insights here: first of all, the choice of prior is critical; majority voting, KOS, and BP are just the same algorithm with different priors, and sometimes for the EM algorithm you can also see improvements from using different priors. Another thing: the choice of model is really critical; the one-coin versus the two-coin model can sometimes make a real difference, so the fundamental question here is how to do model selection for this problem. And the inference algorithm sometimes matters and sometimes doesn't; BP is sometimes slightly better than EM, but sometimes the same, so I wouldn't say it's a critical factor, but it's something you should be careful about. And finally, the related work on the minimax entropy principle by Denny Zhou [inaudible] and [inaudible] is another approach that also uses a very simple unifying principle, so it's interesting to see.
>>: [inaudible] because your system [inaudible] doesn't have a [inaudible]
>> Qiang Liu: We don't have.
>>: [inaudible]
>> Qiang Liu: Which, which?
>>: Your whole approach does not have a measure of difficulty.
>> Qiang Liu: Yeah, we don't have.
>>: Except [inaudible]
>>: Even that doesn't have it. It's just the same coin, the same two coins, for everyone.
>>: Yeah.
>> Qiang Liu: Yeah. Sorry, I didn't hear you.
>>: [inaudible] hard to compare [inaudible] [multiple speakers] [inaudible]
>>: [inaudible] different tasks.
>>: Actually [inaudible] created data set and [inaudible] data set so you could [inaudible] queries [inaudible] because it's [inaudible] approach is much worse than majority voting.
[inaudible] are very strong but [inaudible] very bad, is much worse than majority voting.
>> Qiang Liu: That's possible, actually. If the workers have very close reliabilities, then you can actually show that majority voting is the best thing you can do, so probably that's the reason.
>>: [inaudible] the workers have very different approach.
>>: Very different [inaudible] [laughter]
>>: [inaudible] very strong. The [inaudible] workers [inaudible] and the stronger [inaudible] would carry more weight. [inaudible] the labels.
>> Qiang Liu: Uh-huh. I'm not sure whether, if you add some prior to smooth over all the workers, you can get something better…
>>: [inaudible] is, can you integrate Denny's model back into your model?
>> Qiang Liu: I think yes. No, I don't have a model; I just have like a [inaudible].
>>: Well is there some way to take his thing and chop out like per item difficulty and chop out only having two classes? Does it match up in some way?
>>: If you match them you [inaudible] framework [inaudible] our approach [inaudible] approach.
>>: [inaudible]
>>: Yes. [inaudible] that's bad about that approach. Than our approach is [inaudible]
>>: But there's no way to make it reduce to [inaudible]
>> Qiang Liu: So, so…
>>: Zhou's work is pretty strong [inaudible] bit of a [inaudible] prior [inaudible] prior on the parameters.
>>: I see.
>>: That's a difference.
>>: Which doesn't naturally happen in [inaudible].
>>: Huh?
>>: Which doesn't naturally happen come out of [inaudible], the [inaudible]
>> Qiang Liu: I think in your framework you have this regularization, which is sort of like a prior, but it's a different prior; it's like putting a Gaussian prior over the exponential family parameters instead of a beta…
>>: You've already penalized the [inaudible] over empirical [inaudible]
>> Qiang Liu: Yeah, I think the difference is that here I'm saying that, even with the same model assumptions, you can use different algorithms to decode the solution. EM is one of them, and you can do something different. I think this is…
>>: [inaudible] has this [inaudible] very, very nice [inaudible] and is at least consistent with them. Is it unique? Does it follow, or is it an accident that you happened to [inaudible]? So the question is what happens when [inaudible]
>> Qiang Liu: What do you mean by accident?
>>: [inaudible]
>>: Objectivity.
>>: [inaudible] objectivity sense because [inaudible]
>>: [inaudible] doesn't depend on the [inaudible]
>>: Actually the models [inaudible]
>> Qiang Liu: Sure.
>>: And so the other things follow from those two principles, right, or at least they are consistent. So the question is, if you try to set the parameters to yours, does it also [inaudible] [multiple speakers]
>> Qiang Liu: So they have this first principle, and then basically it's used to construct a model, right? So in the minimax formulation you have to choose the statistics to match, and this exactly corresponds to some exponential family model, so I think it's basically selecting the model, but in a more intuitive way, yeah.
>>: [inaudible] actually [inaudible] the work [inaudible] confusion matrix model, but if you put a prior on the [inaudible] matrix [inaudible] prior on the [inaudible] inference using [inaudible] propagation and [inaudible] and before [inaudible] work and other work just did a [inaudible] estimation [inaudible] see the difference.
>>: And given that little data you have, it makes sense to integrate.
>>: [inaudible] prior uniform [inaudible] assume that most guys are good guys. [inaudible] that's why this approach [inaudible] work.
>> Qiang Liu: Okay. Sure. So basically I tried different priors. Once the prior has this decreasing shape, it's generally okay; it doesn't matter much whether you use Beta(2,1) or Beta(3,1) or Beta(10,1).
>> Dengyong Zhou: Okay. Let's thank the speaker.