
AAAI Technical Report SS-12-06
Wisdom of the Crowd
Improving Crowd Labeling through Expert Evaluation
Faiza Khan Khattak
Department of Computer Science, Columbia University
New York, NY 10027
fk2224@columbia.edu

Ansaf Salleb-Aouissi
Center for Computational Learning Systems, Columbia University
New York, NY 10115
ansaf@ccls.columbia.edu
Abstract
We propose a general scheme for quality-controlled labeling of large-scale data using multiple labels from the
crowd and a “few” ground truth labels from an expert
of the field. Expert-labeled instances are used to assign
weights to the expertise of each crowd labeler and to the
difficulty of each instance. Ground truth labels for all
instances are then approximated through those weights
and the crowd labels. We argue that injecting a little expertise into the labeling process will significantly improve the accuracy of the labeling task. Our empirical evaluation demonstrates that our methodology is efficient and effective, as it gives better-quality labels than
majority voting and other state-of-the-art methods even
in the presence of a large proportion of low-quality labelers in the crowd.
Introduction
Recently, there has been an increasing interest in Learning from Crowds (Raykar et al. 2010), where people's intelligence and expertise can be gathered, combined and leveraged in order to solve different kinds of problems. Specifically, crowd-labeling emerged from the need to label large-scale and complex data, often a tedious, expensive and time-consuming task. While cheap-and-fast human labeling (e.g., using Mechanical Turk) is becoming widely used by researchers and practitioners, the quality and integration of the different labels remain an open problem. This is particularly true when the labelers participating in the task are of unknown or unreliable expertise and are generally motivated by the financial reward of the task (Fort, Adda, and Cohen 2011).
We propose a framework called Expert Label Injected Crowd Estimation (ELICE) for accurate labeling of data by mixing multiple labels from the crowd with a few labels from the experts of the field. We assume that experts always provide ground truth labels without making any mistakes. The level of expertise of the crowd labelers is determined by comparing their labels to the expert labels. Similarly, the expert labels are also used to judge the level of difficulty of the instances labeled by the experts. Hence, expert-labeled instances are used to assign weights to the expertise of each crowd labeler and to the difficulty of each instance. True labels for all instances are then approximated through those weights along with the crowd labels. We provide an efficient and effective methodology demonstrating that injecting a little expertise into the labeling process significantly improves the accuracy of the labeling task, even in the presence of a large proportion of low-quality labelers in the crowd.
The contributions of this paper are as follows:
1. We argue and show that the crowd-labeling task can benefit tremendously from including a few expert-labeled instances, if available.
2. We show empirically that our approach is effective and efficient compared to other methods: it performs as well as other state-of-the-art methods when most of the crowd labelers give high-quality labels, and it continues to perform well even in the extreme case when labelers make more than 80% mistakes, a case in which all other state-of-the-art methods fail.
This paper is organized as follows: we first describe the related work, then we present our Expert Label Injected Crowd Estimation (ELICE) framework along with its clustering-based variant. Experiments demonstrating the efficiency and accuracy of our methodology are provided next. We finally conclude with a summary and future work.
Related Work
Quite a lot of recent work has addressed the topic of learning from crowds (Raykar et al. 2010). Whitehill et al. (Whitehill et al. 2009) propose a probabilistic model of the labeling process which they call GLAD (Generative model of Labels, Abilities, and Difficulties). Expectation-maximization (EM) is used to obtain maximum-likelihood estimates of the unobserved variables, and the model is shown to outperform majority voting. Whitehill et al. also propose a variation of GLAD that uses clamping, in which the true labels of some instances are assumed to be known and are "clamped" into the EM algorithm by setting the prior probability of the true label very high for one class and very low for the other. A probabilistic framework is also proposed by Yan et al. (Yan et al. 2010) to model annotator expertise and build classification models in a multiple-label setting. Raykar et al. (Raykar et al. 2009) propose a Bayesian framework to estimate the
ground truth and learn a classifier. The main novelty of this
work is the extension of the approach from binary to categorical, ordinal and continuous labels. Another related line
of research is to show that using multiple, noisy labelers is
as good as using fewer expert labelers. Work in that context
includes Sheng et al. (Sheng, Provost, and Ipeirotis 2008),
Snow et al. (Snow et al. 2008) and Sorokin and Forsyth
(Sorokin and Forsyth 2008).
In (Donmez, Carbonell, and Schneider 2009), the authors propose a method to increase the accuracy of crowd labeling using the concept of active learning, by choosing the most informative labels. This is done by constructing a confidence interval, called the "Interval Estimate Threshold," for the reliability of each labeler. More recently, (Yan et al. 2011) have also developed a probabilistic method based on the idea of active learning to use the best possible labels from the crowd. In (Druck, Settles, and McCallum 2009), an active learning approach is adopted for labeling based on features instead of instances, and it is shown that this approach outperforms the corresponding passive learning approach based on features.
In our framework, expert-labeled instances are used to evaluate the crowd expertise and the instance difficulty in order to improve the labeling accuracy.
Our Framework
Expert Label Injected Crowd Estimation (ELICE)
Consider a dataset of N instances, which is to be labeled as positive (label = +1) or negative (label = -1). An expert of the domain labels a subset of n instances. There are M crowd labelers who label all N instances. The label given to the i-th instance by the j-th labeler is denoted by l_ij. The expertise level of labeler j is denoted by α_j, which can take a value between -1 and 1, where 1 is the score for a labeler who labels all instances correctly and -1 is the score for a labeler who labels everything incorrectly. This is because the expertise of a crowd labeler is decremented by 1 when he makes a mistake and incremented by 1 when he labels correctly; at the end, the sum is divided by n. Similarly, β_i denotes the level of difficulty of instance i, which is calculated by adding 1 whenever a crowd labeler labels that particular instance correctly; the sum is normalized by dividing by M. The β_i's take values between 0 and 1, where 0 corresponds to a difficult instance and 1 to an easy one. The formulas for computing the α_j's and β_i's are as follows:

\alpha_j = \frac{1}{n} \sum_{i=1}^{n} \left[ \mathbf{1}(L_i = l_{ij}) - \mathbf{1}(L_i \neq l_{ij}) \right]    (1)

\beta_i = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}(L_i = l_{ij})    (2)

where j = 1, ..., M, i = 1, ..., n, L_i is the expert (ground truth) label of instance i, and \mathbf{1}(\cdot) is the indicator function.
We infer the remaining (N - n) β's by using the α's. As the true labels of the remaining instances are not available, we first compute a temporary approximation of the true labels, which we call the expected labels (EL):

s_i = \frac{1}{M} \sum_{j=1}^{M} \alpha_j \, l_{ij}

EL_i = \begin{cases} +1 & \text{if } s_i \ge 0 \\ -1 & \text{if } s_i < 0 \end{cases}    (3)

These expected labels are used to approximate the remaining (N - n) β's:

\beta_i = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}(EL_i = l_{ij})    (4)
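The original implementation is not published with the paper; the following minimal NumPy sketch of equations (1)-(4) is our own, assuming the crowd labels are stored as an N x M matrix of +1/-1 entries and that a hypothetical `expert_idx` array gives the indices of the n expert-labeled instances.

```python
import numpy as np

def elice_weights(crowd_labels, expert_labels, expert_idx):
    """Equations (1)-(4): estimate labeler expertise (alpha) and
    instance difficulty (beta) from a few expert-labeled instances.

    crowd_labels  : (N, M) array of +1/-1 crowd labels
    expert_labels : (n,) array of +1/-1 ground-truth labels
    expert_idx    : (n,) indices of the expert-labeled instances
    """
    N, M = crowd_labels.shape

    # Eq. (1): +1 for each correct label, -1 for each mistake, averaged over n
    agree = crowd_labels[expert_idx] == np.asarray(expert_labels)[:, None]  # (n, M)
    alpha = (2.0 * agree - 1.0).mean(axis=0)                                # in [-1, 1]

    # Eq. (2): fraction of labelers that got each expert-labeled instance right
    beta = np.zeros(N)
    beta[expert_idx] = agree.mean(axis=1)                                   # in [0, 1]

    # Eq. (3): temporary "expected labels" from an alpha-weighted vote
    s = crowd_labels @ alpha / M
    expected = np.where(s >= 0, 1, -1)

    # Eq. (4): difficulty of the remaining (N - n) instances, via expected labels
    rest = np.setdiff1d(np.arange(N), expert_idx)
    beta[rest] = (crowd_labels[rest] == expected[rest][:, None]).mean(axis=1)
    return alpha, beta, expected
```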
Next, a logistic function is used to calculate the score associated with the correctness of a label, based on the level of expertise of the crowd labeler and the difficulty of the instance. This score gives us the approximation of the true labels, called the inferred labels (IL), using the following formulas:

P_i = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{1 + \exp(-\alpha_j \beta_i)} \, l_{ij}    (5)

IL_i = \begin{cases} +1 & \text{if } P_i \ge 0 \\ -1 & \text{if } P_i < 0 \end{cases}    (6)
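Under the same assumptions as the previous sketch (and again as our own illustration, not the authors' code), equations (5) and (6) reduce to a logistic-weighted vote followed by thresholding:

```python
import numpy as np

def elice_infer(crowd_labels, alpha, beta):
    """Equations (5)-(6): aggregate crowd labels into inferred labels,
    weighting labeler j on instance i by 1 / (1 + exp(-alpha_j * beta_i))."""
    M = crowd_labels.shape[1]
    weights = 1.0 / (1.0 + np.exp(-np.outer(beta, alpha)))   # (N, M) logistic scores
    P = (weights * crowd_labels).sum(axis=1) / M             # Eq. (5)
    return np.where(P >= 0, 1, -1)                           # Eq. (6)

# Typical use together with the previous sketch:
# alpha, beta, _ = elice_weights(crowd_labels, expert_labels, expert_idx)
# inferred = elice_infer(crowd_labels, alpha, beta)
```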
Expert Label Injected Crowd Estimation with clustering
ELICE with clustering is a variation of the above method, with the only difference that, instead of picking the instances to be labeled by the expert at random from the whole dataset, clusters are first formed and an equal number of instances is then chosen from each cluster and given to the expert to label.
In the following, we present an empirical evaluation of both ELICE and ELICE with clustering on different benchmark datasets.
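The paper does not specify which clustering algorithm is used. Purely as an illustration, the sketch below uses k-means (via scikit-learn) to pick an equal number of instances per cluster for expert labeling; the function name and its parameters are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_expert_instances(X, n_expert_labels, n_clusters=4, seed=0):
    """Select instances for expert labeling by drawing an equal number
    of instances from each cluster (ELICE with clustering variant)."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    per_cluster = max(1, n_expert_labels // n_clusters)
    rng = np.random.default_rng(seed)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(clusters == c)
        take = min(per_cluster, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(chosen, dtype=int)
```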
Experiments
We conducted experiments on several datasets from the UCI repository (Frank and Asuncion 2010), including Mushroom, Chess, Tic-Tac-Toe, Breast Cancer and IRIS (the latter restricted to 2 classes). Table 1 compares the accuracy of our methods with that of the other methods across these datasets for different rates of bad labelers. Bad labelers are assumed to make more than 80% mistakes, while the rest of the labelers are assumed to be good and to make at most 20% mistakes. It can be observed from the table that our methods outperform majority voting, GLAD and GLAD with clamping even when the percentage of bad labelers rises above 70%. The corresponding graph for the Mushroom dataset is shown in Figure 1 (top). Our method also has the advantage of a runtime that is significantly lower than that of GLAD and GLAD with clamping, as shown in Figure 1 (bottom). In all these experiments the number of crowd labelers is 20.
We have also shown that increasing the number of expert labels increases the accuracy of the result up to a certain level, as shown in Figure 2 for the Mushroom dataset. In this case we have assumed that the bad labelers are adversarial and make 90% mistakes. Results for the other datasets are given in Table 2 and Table 3.
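The paper does not release its simulation code; purely as an illustration of the protocol described above (20 crowd labelers, good labelers making at most 20% mistakes, bad labelers making 80-90% mistakes), here is a sketch of how such a crowd could be simulated. The function name, default error rates and seed are our assumptions.

```python
import numpy as np

def simulate_crowd(true_labels, n_labelers=20, bad_fraction=0.5,
                   good_error=0.2, bad_error=0.9, seed=0):
    """Return an (N, n_labelers) matrix of +1/-1 crowd labels in which
    a bad_fraction of the labelers flip the true label with probability
    bad_error and the remaining labelers with probability good_error."""
    rng = np.random.default_rng(seed)
    y = np.asarray(true_labels)[:, None]                     # (N, 1) true labels
    n_bad = int(round(bad_fraction * n_labelers))
    error = np.array([bad_error] * n_bad + [good_error] * (n_labelers - n_bad))
    flip = rng.random((len(y), n_labelers)) < error          # broadcast per labeler
    return np.where(flip, -y, y)
```

Varying `bad_fraction` from 0 to 1 corresponds to the x-axis of Figure 1 (top).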
                          Mushroom   Chess    Tic-Tac-Toe   Breast Cancer   IRIS
Instances                 8124       3196     958           569             100
Expert labels             20         8        8             8               4

Bad labelers: less than 30%
  Majority Voting         0.9922     0.9882   0.9987        0.9978          1.000
  GLAD                    1.000      1.000    1.000         1.000           1.000
  GLAD with clamping      1.000      1.000    1.0000        1.000           1.000
  ELICE                   0.9998     1.000    1.000         0.9996          1.0000
  ELICE with clustering   0.9997     1.000    1.000         1.000           1.000

Bad labelers: 30 to 70%
  Majority Voting         0.3627     0.3702   0.3732        0.4332          0.3675
  GLAD                    0.5002     0.5001   0.2510        0.5011          0.2594
  GLAD with clamping      0.5002     0.5001   0.2510        0.5011          0.5031
  ELICE                   0.9917     0.9932   0.9961        0.9966          0.9931
  ELICE with clustering   0.9947     0.9944   0.9968        0.9979          0.9931

Bad labelers: more than 70%
  Majority Voting         0.0002     0.0014   0.0010        0.0023          0.0100
  GLAD                    0.0002     0.0039   0.0013        0.0022          0.0125
  GLAD with clamping      0.0002     0.0039   0.0013        0.0022          0.0125
  ELICE                   0.5643     0.5884   0.5242        0.5599          0.5451
  ELICE with clustering   0.7001     0.6481   0.6287        0.6249          0.5777

Table 1: Accuracy of Majority Voting, GLAD (with and without clamping) and ELICE (with and without clustering) for different datasets and different rates of bad labelers.

Figure 1: (Top) Number of bad labelers vs. accuracy. (Bottom) Number of instances vs. time with 20 expert labels for ELICE and ELICE with clustering.

Figure 2: (Top) Number of expert labels vs. accuracy for ELICE. (Bottom) Number of expert labels vs. accuracy for ELICE with clustering (curves for 50%, 70% and 90% bad labelers).

Bad labelers   Expert labels   Mushroom   Chess    Tic-Tac-Toe   Breast Cancer   IRIS
                               (8124)     (3196)   (958)         (569)           (100)
50%            2               0.9974     0.9987   0.9945        0.9953          0.9932
               6               0.9985     0.9986   1.0000        1.000           1.000
               10              0.9990     0.9997   0.9981        0.9986          1.000
               14              0.9979     0.9984   0.9981        0.9938          1.000
               18              0.9987     0.9986   0.9971        0.9985          1.0000
70%            2               0.9655     0.9651   0.9575        0.9464          0.9367
               6               0.9835     0.9829   0.9819        0.9808          0.9691
               10              0.9847     0.9856   0.9830        0.9793          0.9789
               14              0.9861     0.9877   0.9824        0.9813          0.9733
               18              0.9854     0.9851   0.9886        0.9826          0.9896
90%            2               0.3849     0.2606   0.4370        0.3767          0.3643
               6               0.7188     0.6583   0.6145        0.6733          0.6463
               10              0.6955     0.7119   0.6595        0.7006          0.6861
               14              0.7245     0.7208   0.6504        0.7386          0.6622
               18              0.7226     0.7184   0.6741        0.7311          0.6780

Table 2: Accuracy of ELICE as the number of expert labels increases (instance counts in parentheses).

Bad labelers   Expert labels   Mushroom   Chess    Tic-Tac-Toe   Breast Cancer   IRIS
                               (8124)     (3196)   (958)         (569)           (100)
50%            2               0.9951     0.9966   0.9964        0.9965          0.9959
               6               0.9994     0.9984   1.0000        1.000           1.000
               10              0.9997     0.9981   1.0000        1.0000          1.0000
               14              0.9991     0.9987   1.0000        1.0000          1.0000
               18              0.9995     0.9987   1.0000        0.9983          1.0000
70%            2               0.9352     0.9101   0.9494        0.9511          0.9582
               6               0.9871     0.9884   0.9881        0.9885          0.9920
               10              0.9899     0.9898   0.9917        0.9918          0.9811
               14              0.9936     0.9881   0.9917        1.0000          1.0000
               18              0.9914     0.9887   0.9948        0.9962          1.0000
90%            2               0.3891     0.2918   0.3577        0.5720          0.3020
               6               0.8632     0.8298   0.8194        0.8347          0.8106
               10              0.8536     0.8503   0.8588        0.8648          0.8544
               14              0.8793     0.8573   0.9047        0.9122          0.9058
               18              0.8966     0.8628   0.9236        0.9234          0.9067

Table 3: Accuracy of ELICE with clustering as the number of expert labels increases (instance counts in parentheses).
Discussion & Future Work
We propose a new method, named ELICE, demonstrating
that injecting a “few” expert labels in a multiple crowd labeling setting improves the estimation of the reliability of
the crowd labelers as well as the estimation of the actual labels.
ELICE is efficient and effective in achieving this task in any setting where the quality of the crowd labelers is heterogeneous, that is, a mix of good and bad labelers. Bad labelers can be either adversarial, in the extreme case, or labelers who mislabel a large number of instances because of their lack of expertise or the difficulty of the task.
Furthermore, it turned out that our estimation methodology remains reliable even in the presence of a large proportion of such bad labelers in the crowd. We show through
extensive experiments that our method is robust even when
other state-of-the-art methods fail to estimate the true labels.
Identifying adversarial labelers is not new, and is tackled through an a priori identification of those labelers before the labeling task starts (e.g., (Paolacci, Chandler, and Ipeirotis 2010)). ELICE has the ability to handle not only adversarial labelers but also bad labelers in general, in an integrated way, instead of identifying them separately.
So far, we have assumed that crowd labelers are assigned all instances to label, which may be impractical and a waste of resources. Instead, one can think of dividing the instances into equal-sized subsets and assigning a fixed number of subsets to each crowd labeler. An equal number of instances can then be drawn from each subset to be labeled by the expert. Recently, an elaborate approach to instance assignment was proposed in (Karger, Oh, and Shah 2011b; 2011a; 2011c). A bipartite graph is used to model the instances and labelers, and the edges between them describe which labeler labels which instance.
The graph is assumed to be regular but not complete, so that not every instance is labeled by every labeler.
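The construction of Karger, Oh, and Shah is more elaborate than this; as a simple illustration only of a sparse, non-complete assignment, the sketch below gives every instance the same fixed number of labelers, chosen at random (names and defaults are our assumptions).

```python
import numpy as np

def assign_tasks(n_instances, n_labelers, labels_per_instance, seed=0):
    """Sparse bipartite assignment: each instance is given to exactly
    labels_per_instance randomly chosen labelers, so no labeler has to
    label the entire dataset."""
    rng = np.random.default_rng(seed)
    assignment = np.zeros((n_instances, n_labelers), dtype=bool)
    for i in range(n_instances):
        chosen = rng.choice(n_labelers, size=labels_per_instance, replace=False)
        assignment[i, chosen] = True
    return assignment
```

Making the graph exactly regular on the labeler side as well would require a more careful construction, such as the random regular bipartite graphs used by Karger, Oh, and Shah.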
Another extension of our method is to acquire expert labels for carefully chosen instances. We propose a variant of
ELICE using clustering as one way to achieve this. By clustering instances, and having the expert label an equal number
of instances from each cluster, we hope to obtain labels for instances representative of the whole set. Another possibility is to ask the expert to identify and label difficult instances on which the crowd will likely make the most mistakes.
We also plan on investigating the integration of inter-annotator agreement as an additional measure to assess the
quality of crowd labeling. This can be achieved through
modeling systematic differences across annotators and annotations as suggested in (Passonneau, Salleb-Aouissi, and
Ide 2009; Bhardwaj et al. 2010).
Finally, our next goal is to study theoretically the balance between the number of expert-labeled instances to acquire and the number of crowd labelers, in order to find a good compromise between accuracy and cost.
Acknowledgments
This project is partially funded by a Research Initiatives in
Science and Engineering (RISE) grant from the Columbia
University Executive Vice President for Research.
References
Bhardwaj, V.; Passonneau, R. J.; Salleb-Aouissi, A.; and Ide,
N. 2010. Anveshan: A framework for analysis of multiple
annotators’ labeling behavior. In Proceedings of the Fourth
Linguistic Annotation Workshop (LAWIV).
Donmez, P.; Carbonell, J. G.; and Schneider, J. 2009. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD, 259–268.
Druck, G.; Settles, B.; and McCallum, A. 2009. Active learning by labeling features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 81–90. ACL Press.
Fort, K.; Adda, G.; and Cohen, K. B. 2011. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics 37:413–420.
Frank, A., and Asuncion, A. 2010. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
Karger, D. R.; Oh, S.; and Shah, D. 2011a. Budget-optimal crowdsourcing using low-rank matrix approximations. In Proceedings of the Allerton Conference on Communication, Control and Computing, Monticello, IL.
Karger, D. R.; Oh, S.; and Shah, D. 2011b. Budget-optimal task allocation for reliable crowdsourcing systems. CoRR abs/1110.3564.
Karger, D. R.; Oh, S.; and Shah, D. 2011c. Iterative learning for reliable crowdsourcing systems. In Neural Information Processing Systems (NIPS), Granada, Spain.
Paolacci, G.; Chandler, J.; and Ipeirotis, P. G. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5(5):411–419.
Passonneau, R. J.; Salleb-Aouissi, A.; and Ide, N. 2009. Making sense of word sense variation. In Proceedings of the NAACL-HLT 2009 Workshop on Semantic Evaluations.
Raykar, V. C.; Yu, S.; Zhao, L. H.; Jerebko, A.; Florin, C.; Valadez, G. H.; Bogoni, L.; and Moy, L. 2009. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 889–896. New York, NY, USA: ACM.
Raykar, V. C.; Yu, S.; Zhao, L. H.; Valadez, G. H.; Florin, C.; Bogoni, L.; and Moy, L. 2010. Learning from crowds. Journal of Machine Learning Research 11:1297–1322.
Sheng, V. S.; Provost, F.; and Ipeirotis, P. G. 2008. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 614–622. New York, NY, USA: ACM.
Snow, R.; O'Connor, B.; Jurafsky, D.; and Ng, A. Y. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 254–263. Morristown, NJ, USA: Association for Computational Linguistics.
Sorokin, A., and Forsyth, D. 2008. Utility data annotation with Amazon Mechanical Turk. In IEEE Computer Vision and Pattern Recognition Workshops.
Whitehill, J.; Ruvolo, P.; Wu, T.-F.; Bergsma, J.; and Movellan, J. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Bengio, Y.; Schuurmans, D.; Lafferty, J.; Williams, C. K. I.; and Culotta, A., eds., Advances in Neural Information Processing Systems 22, 2035–2043.
Yan, Y.; Rosales, R.; Fung, G.; Schmidt, M.; Hermosillo, G.; Bogoni, L.; Moy, L.; and Dy, J. G. 2010. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.
Yan, Y.; Rosales, R.; Fung, G.; and Dy, J. 2011. Active learning from crowds. In Getoor, L., and Scheffer, T., eds., Proceedings of the 28th International Conference on Machine Learning (ICML-11), 1161–1168. New York, NY, USA: ACM.