AAAI Technical Report SS-12-06: Wisdom of the Crowd

Improving Crowd Labeling through Expert Evaluation

Faiza Khan Khattak
Department of Computer Science, Columbia University, New York, NY 10027
fk2224@columbia.edu

Ansaf Salleb-Aouissi
Center for Computational Learning Systems, Columbia University, New York, NY 10115
ansaf@ccls.columbia.edu

Copyright (c) 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We propose a general scheme for quality-controlled labeling of large-scale data that combines multiple labels from the crowd with a "few" ground-truth labels from a domain expert. The expert-labeled instances are used to assign weights to the expertise of each crowd labeler and to the difficulty of each instance; ground-truth labels for all instances are then approximated from those weights together with the crowd labels. We argue that injecting a little expertise into the labeling process significantly improves the accuracy of the labeling task. Our empirical evaluation demonstrates that our methodology is efficient and effective: it yields better-quality labels than majority voting and other state-of-the-art methods, even when a large proportion of the crowd consists of low-quality labelers.

Introduction

Recently, there has been an increasing interest in learning from crowds (Raykar et al. 2010), where people's intelligence and expertise are gathered, combined, and leveraged to solve different kinds of problems. In particular, crowd-labeling emerged from the need to label large-scale and complex data, a task that is often tedious, expensive, and time-consuming. While cheap and fast human labeling (e.g., using Mechanical Turk) is becoming widely used by researchers and practitioners, the quality and integration of the resulting labels remain an open problem. This is particularly true when the labelers participating in the task are of unknown or unreliable expertise and are generally motivated by the financial reward of the task (Fort, Adda, and Cohen 2011).

We propose a framework called Expert Label Injected Crowd Estimation (ELICE) for accurate labeling of data by mixing multiple labels from the crowd with a few labels from experts of the field. We assume that experts always provide ground-truth labels without making any mistakes. The level of expertise of each crowd labeler is determined by comparing his or her labels to the expert labels. Similarly, the expert labels are used to judge the level of difficulty of the instances labeled by the expert. Hence, expert-labeled instances are used to assign weights to the expertise of each crowd labeler and to the difficulty of each instance. True labels for all instances are then approximated through those weights along with the crowd labels. We provide an efficient and effective methodology demonstrating that injecting a little expertise into the labeling process significantly improves the accuracy of the labeling task, even in the presence of a large proportion of low-quality labelers in the crowd.

The contributions of this paper are as follows:

1. We argue and show that a crowd-labeling task can benefit tremendously from including a few expert-labeled instances, when they are available.

2. We show empirically that our approach is effective and efficient compared to other methods: it performs as well as other state-of-the-art methods when most of the crowd labelers provide high-quality labels, but it also performs well in the extreme case when labelers make more than 80% mistakes, where the other state-of-the-art methods fail.

This paper is organized as follows: we first describe the related work, then we present our Expert Label Injected Crowd Estimation (ELICE) framework along with its clustering-based variant. Experiments demonstrating the efficiency and accuracy of our methodology are provided next. We finally conclude with a summary and future work.
Related work

Quite a lot of recent work has addressed the topic of learning from crowds (Raykar et al. 2010). Whitehill et al. (2009) propose a probabilistic model of the labeling process, which they call GLAD (Generative model of Labels, Abilities, and Difficulties). Expectation-maximization (EM) is used to obtain maximum-likelihood estimates of the unobserved variables, and the model is shown to outperform majority voting. Whitehill et al. (2009) also propose a variation of GLAD that uses clamping: the true labels of some instances are assumed to be known and are "clamped" into the EM algorithm by setting the prior probability of the true label very high for one class and very low for the other. A probabilistic framework is also proposed by Yan et al. (2010) to model annotator expertise and build classification models in a multiple-label setting. Raykar et al. (2009) propose a Bayesian framework to estimate the ground truth and learn a classifier; the main novelty of that work is the extension of the approach from binary to categorical, ordinal, and continuous labels.

Another related line of research shows that using multiple noisy labelers can be as good as using fewer expert labelers. Work in that context includes Sheng et al. (2008), Snow et al. (2008), and Sorokin and Forsyth (2008). Donmez, Carbonell, and Schneider (2009) propose a method to increase the accuracy of crowd labeling using active learning to choose the most informative labels; this is done by constructing a confidence interval, called the "Interval Estimate Threshold," for the reliability of each labeler. More recently, Yan et al. (2011) developed a probabilistic method, also based on active learning, to obtain the best possible labels from the crowd. In (Druck, Settles, and McCallum 2009), an active learning approach is adopted for labeling based on features instead of instances, and it is shown to outperform the corresponding passive learning approach based on features.

In our framework, expert-labeled instances are used to evaluate the crowd expertise and the instance difficulty in order to improve the labeling accuracy.

Table 1: Accuracy of majority voting, GLAD (with and without clamping), and ELICE (with and without clustering) for different datasets and different rates of bad labelers.

                          Mushroom   Chess    Tic-Tac-Toe  Breast Cancer  IRIS
  Instances               8124       3196     958          569            100
  Expert labels           20         8        8            8              4

  Bad labelers: less than 30%
  Majority Voting         0.9922     0.9882   0.9987       0.9978         1.0000
  GLAD                    1.0000     1.0000   1.0000       1.0000         1.0000
  GLAD with clamping      1.0000     1.0000   1.0000       1.0000         1.0000
  ELICE                   0.9998     1.0000   1.0000       0.9996         1.0000
  ELICE with clustering   0.9997     1.0000   1.0000       1.0000         1.0000

  Bad labelers: 30% to 70%
  Majority Voting         0.3627     0.3702   0.3732       0.4332         0.3675
  GLAD                    0.5002     0.5001   0.2510       0.5011         0.2594
  GLAD with clamping      0.5002     0.5001   0.2510       0.5011         0.5031
  ELICE                   0.9917     0.9932   0.9961       0.9966         0.9931
  ELICE with clustering   0.9947     0.9944   0.9968       0.9979         0.9931

  Bad labelers: more than 70%
  Majority Voting         0.0002     0.0014   0.0010       0.0023         0.0100
  GLAD                    0.0002     0.0039   0.0013       0.0022         0.0125
  GLAD with clamping      0.0002     0.0039   0.0013       0.0022         0.0125
  ELICE                   0.5643     0.5884   0.5242       0.5599         0.5451
  ELICE with clustering   0.7001     0.6481   0.6287       0.6249         0.5777
Our Framework

Expert Label Injected Crowd Estimation (ELICE)

Consider a dataset of N instances that is to be labeled as positive (label = +1) or negative (label = -1). A domain expert labels a subset of n instances. There are M crowd labelers, each of whom labels all N instances. The label given to the i-th instance by the j-th labeler is denoted l_{ij}.

The expertise level of labeler j is denoted alpha_j and takes values between -1 and 1, where 1 is the score of a labeler who labels all instances correctly and -1 is the score of a labeler who labels everything incorrectly: the expertise of a crowd labeler is incremented by 1 for each correct label and decremented by 1 for each mistake, and the sum is divided by n. Similarly, beta_i denotes the level of difficulty of instance i; it is computed by adding 1 whenever a crowd labeler labels that particular instance correctly and normalizing the sum by M. The beta_i take values between 0 and 1, where 0 corresponds to a difficult instance and 1 to an easy one. The formulas for the alpha_j and beta_i are

\alpha_j = \frac{1}{n} \sum_{i=1}^{n} \left[ \mathbb{1}(L_i = l_{ij}) - \mathbb{1}(L_i \neq l_{ij}) \right]    (1)

\beta_i = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}(L_i = l_{ij})    (2)

where j = 1, ..., M, i = 1, ..., n, L_i is the expert (ground-truth) label of instance i, and \mathbb{1}(.) is the indicator function.

We infer the remaining (N - n) values of beta using the alpha's. Since the true labels of the remaining instances are not available, we first compute a temporary approximation of the true labels, which we call the expected label (EL):

s_i = \frac{1}{M} \sum_{j=1}^{M} \alpha_j \, l_{ij}, \qquad EL_i = \begin{cases} +1 & \text{if } s_i \geq 0 \\ -1 & \text{if } s_i < 0 \end{cases}    (3)

These expected labels are used to approximate the remaining (N - n) values of beta:

\beta_i = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}(EL_i = l_{ij})    (4)

Next, a logistic function is used to compute a score for the correctness of a label, based on the expertise level of the crowd labeler and the difficulty of the instance. This score gives the approximation of the true labels, called the inferred labels (IL):

P_i = \frac{1}{M} \sum_{j=1}^{M} \frac{l_{ij}}{1 + \exp(-\alpha_j \beta_i)}    (5)

IL_i = \begin{cases} +1 & \text{if } P_i \geq 0 \\ -1 & \text{if } P_i < 0 \end{cases}    (6)
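To make these steps concrete, the following NumPy sketch implements Equations (1)-(6). The function name elice, the array layout, and the final step of retaining the expert's ground-truth labels for the n expert-labeled instances are our own illustrative choices rather than part of the description above.

import numpy as np

def elice(crowd_labels, expert_idx, expert_labels):
    """Sketch of ELICE for binary labels in {-1, +1}.

    crowd_labels  : (N, M) array; entry (i, j) is the label l_ij.
    expert_idx    : indices of the n expert-labeled instances.
    expert_labels : (n,) array of ground-truth labels for those instances.
    """
    crowd_labels = np.asarray(crowd_labels)
    expert_idx = np.asarray(expert_idx)
    expert_labels = np.asarray(expert_labels)
    N, M = crowd_labels.shape

    # Eq. (1): expertise alpha_j in [-1, 1]: +1 per correct label on an
    # expert-labeled instance, -1 per mistake, averaged over the n instances.
    correct = (crowd_labels[expert_idx, :] == expert_labels[:, None])   # (n, M)
    alpha = 2.0 * correct.mean(axis=0) - 1.0                            # (M,)

    # Eq. (2): difficulty beta_i in [0, 1] for the expert-labeled instances.
    beta = np.empty(N)
    beta[expert_idx] = correct.mean(axis=1)

    # Eq. (3): expected labels for the remaining instances,
    # an alpha-weighted vote over the crowd.
    rest = np.setdiff1d(np.arange(N), expert_idx)
    s = crowd_labels[rest, :] @ alpha / M
    EL = np.where(s >= 0, 1, -1)

    # Eq. (4): approximate beta for the remaining instances using EL.
    beta[rest] = (crowd_labels[rest, :] == EL[:, None]).mean(axis=1)

    # Eqs. (5)-(6): inferred labels, each crowd label weighted by a logistic
    # function of the labeler's expertise times the instance's difficulty.
    P = (crowd_labels / (1.0 + np.exp(-np.outer(beta, alpha)))).mean(axis=1)
    IL = np.where(P >= 0, 1, -1)
    IL[expert_idx] = expert_labels   # assumption: keep the expert's labels where known
    return IL, alpha, beta

Applied to an (N, M) matrix of crowd labels in {-1, +1} together with the indices and ground-truth labels of the expert-labeled subset, the sketch returns the inferred labels alongside the estimated alpha and beta values.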
Expert Label Injected Crowd Estimation with clustering

ELICE with clustering is a variation of the above method; the only difference is that, instead of picking the instances to be labeled by the expert at random from the whole dataset, the instances are first clustered and an equal number of instances is then chosen from each cluster and given to the expert to label. In the following, we present an empirical evaluation of both ELICE and ELICE with clustering on several benchmark datasets.
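The paper does not fix a particular clustering algorithm for this selection step; the sketch below uses k-means from scikit-learn purely as an example, with one cluster per expert label. Both the function name pick_expert_instances and the one-instance-per-cluster choice are our assumptions. The returned indices would be handed to the expert, and the resulting labels passed to the elice routine sketched earlier.

import numpy as np
from sklearn.cluster import KMeans

def pick_expert_instances(X, n_expert, seed=0):
    """Sketch of the selection step of ELICE with clustering: cluster the
    feature vectors and draw an equal number of instances (here, one) from
    each cluster for the expert to label."""
    rng = np.random.default_rng(seed)
    n_clusters = n_expert                    # assumption: one cluster per expert label
    assignments = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(X)

    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(assignments == c)
        chosen.append(rng.choice(members))   # pick one instance from this cluster
    return np.array(chosen)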
Experiments

We conducted experiments on several datasets from the UCI repository (Frank and Asuncion 2010), including Mushroom, Chess, Tic-Tac-Toe, Breast Cancer, and IRIS (the latter restricted to two classes). Table 1 compares the accuracy of our methods with that of the other approaches across these datasets for different rates of bad labelers. Bad labelers are assumed to make more than 80% mistakes, while the remaining labelers are assumed to be good and to make at most 20% mistakes. The table shows that our methods outperform majority voting, GLAD, and GLAD with clamping even when the percentage of bad labelers rises above 70%. The corresponding curves for the Mushroom dataset are shown in Figure 1 (top). Our method also has the advantage of a runtime that is significantly lower than that of GLAD and GLAD with clamping, as shown in Figure 1 (bottom). In all these experiments the number of crowd labelers is 20.

We have also shown that increasing the number of expert labels increases the accuracy of the result, up to a certain level, as shown in Figure 2 for the Mushroom dataset. In this case we assume that the bad labelers are adversarial and make 90% mistakes. Results for the other datasets are given in Table 2 and Table 3.
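For reference, the simulation protocol described above (20 crowd labelers, bad labelers making more than 80% mistakes, good labelers making at most 20%) can be reproduced along the following lines. The exact error rates of 0.85 and 0.15, the independence of label flips, and the function name simulate_crowd are our own assumptions for illustration, not details taken from the paper.

import numpy as np

def simulate_crowd(true_labels, n_labelers=20, frac_bad=0.7,
                   bad_error=0.85, good_error=0.15, seed=0):
    """Generate synthetic crowd labels: each labeler independently flips the
    true label of every instance with an error rate that depends on whether
    the labeler is 'bad' or 'good'."""
    rng = np.random.default_rng(seed)
    y = np.asarray(true_labels)                       # labels in {-1, +1}
    n_bad = int(round(frac_bad * n_labelers))
    error = np.array([bad_error] * n_bad + [good_error] * (n_labelers - n_bad))

    flip = rng.random((y.size, n_labelers)) < error   # broadcast per labeler
    return np.where(flip, -y[:, None], y[:, None])    # (N, n_labelers) crowd labels

A majority-voting baseline on such simulated crowd labels is then simply np.sign(crowd.sum(axis=1)) (with ties broken arbitrarily), against which ELICE and ELICE with clustering can be compared.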
[Figure 1: (Top) Percentage of bad labelers vs. accuracy on the Mushroom dataset for Majority Voting, GLAD, GLAD with clamping, ELICE, and ELICE with clustering. (Bottom) Number of instances vs. runtime, with 20 expert labels for ELICE and ELICE with clustering.]

Table 2: Accuracy of ELICE when increasing the number of expert labels.

  Bad labelers  Expert labels  Mushroom  Chess   Tic-Tac-Toe  Breast Cancer  IRIS
                               (8124)    (3196)  (958)        (569)          (100)
  50%           2              0.9974    0.9987  0.9945       0.9953         0.9932
                6              0.9985    0.9986  1.0000       1.0000         1.0000
                10             0.9990    0.9997  0.9981       0.9986         1.0000
                14             0.9979    0.9984  0.9981       0.9938         1.0000
                18             0.9987    0.9986  0.9971       0.9985         1.0000
  70%           2              0.9655    0.9651  0.9575       0.9464         0.9367
                6              0.9835    0.9829  0.9819       0.9808         0.9691
                10             0.9847    0.9856  0.9830       0.9793         0.9789
                14             0.9861    0.9877  0.9824       0.9813         0.9733
                18             0.9854    0.9851  0.9886       0.9826         0.9896
  90%           2              0.3849    0.2606  0.4370       0.3767         0.3643
                6              0.7188    0.6583  0.6145       0.6733         0.6463
                10             0.6955    0.7119  0.6595       0.7006         0.6861
                14             0.7245    0.7208  0.6504       0.7386         0.6622
                18             0.7226    0.7184  0.6741       0.7311         0.6780

[Figure 2: Number of expert labels vs. accuracy on the Mushroom dataset for ELICE (top) and ELICE with clustering (bottom), with 50%, 70%, and 90% bad labelers.]

Table 3: Accuracy of ELICE with clustering when increasing the number of expert labels.

  Bad labelers  Expert labels  Mushroom  Chess   Tic-Tac-Toe  Breast Cancer  IRIS
                               (8124)    (3196)  (958)        (569)          (100)
  50%           2              0.9951    0.9966  0.9964       0.9965         0.9959
                6              0.9994    0.9984  1.0000       1.0000         1.0000
                10             0.9997    0.9981  1.0000       1.0000         1.0000
                14             0.9991    0.9987  1.0000       1.0000         1.0000
                18             0.9995    0.9987  1.0000       0.9983         1.0000
  70%           2              0.9352    0.9101  0.9494       0.9511         0.9582
                6              0.9871    0.9884  0.9881       0.9885         0.9920
                10             0.9899    0.9898  0.9917       0.9918         0.9811
                14             0.9936    0.9881  0.9917       1.0000         1.0000
                18             0.9914    0.9887  0.9948       0.9962         1.0000
  90%           2              0.3891    0.2918  0.3577       0.5720         0.3020
                6              0.8632    0.8298  0.8194       0.8347         0.8106
                10             0.8536    0.8503  0.8588       0.8648         0.8544
                14             0.8793    0.8573  0.9047       0.9122         0.9058
                18             0.8966    0.8628  0.9236       0.9234         0.9067

Discussion & Future Work

We propose a new method, named ELICE, demonstrating that injecting a "few" expert labels into a crowd-labeling setting improves the estimation of the reliability of the crowd labelers as well as the estimation of the actual labels. ELICE is efficient and effective in any setting where the quality of the crowd labelers is heterogeneous, that is, a mix of good and bad labelers. Bad labelers can be adversarial in the extreme case, or simply labelers who mislabel a large fraction of instances because of their lack of expertise or the difficulty of the task. Furthermore, our estimation methodology remains reliable even in the presence of a large proportion of such bad labelers in the crowd. We show through extensive experiments that our method is robust even when other state-of-the-art methods fail to estimate the true labels. Identifying adversarial labelers is not new; it is usually tackled through an a priori identification of those labelers before the labeling task starts (e.g., Paolacci, Chandler, and Ipeirotis 2010). ELICE handles not only adversarial but also bad labelers in general in an integrated way, instead of identifying them separately.

So far, we have assumed that every crowd labeler is assigned all instances to label, which may be impractical and a waste of resources. Instead, one can divide the instances into equal-sized subsets and assign a fixed number of subsets to each crowd labeler; an equal number of instances can then be drawn from each subset to be labeled by the expert. Recently, an elaborate approach to instance assignment was proposed in (Karger, Oh, and Shah 2011a; 2011b; 2011c): a bipartite graph is used to model the instances and labelers, with edges describing which labeler labels which instance. The graph is assumed to be regular but not complete, so not every instance is labeled by every labeler.

Another extension of our method is to acquire expert labels for carefully chosen instances. We propose a variant of ELICE using clustering as one way to achieve this: by clustering the instances and having the expert label an equal number of instances from each cluster, we hope to obtain labels for instances that are representative of the whole set. Another possibility is to ask the expert to identify and label difficult instances on which the crowd will likely make the most mistakes.

We also plan to investigate the integration of inter-annotator agreement as an additional measure to assess the quality of crowd labeling. This can be achieved by modeling systematic differences across annotators and annotations, as suggested in (Passonneau, Salleb-Aouissi, and Ide 2009; Bhardwaj et al. 2010). Finally, our next goal is to study theoretically the balance between the number of expert-labeled instances to acquire and the number of crowd labelers, in order to find a good compromise between accuracy and cost.

Acknowledgments

This project is partially funded by a Research Initiatives in Science and Engineering (RISE) grant from the Columbia University Executive Vice President for Research.
References

Bhardwaj, V.; Passonneau, R. J.; Salleb-Aouissi, A.; and Ide, N. 2010. Anveshan: A framework for analysis of multiple annotators' labeling behavior. In Proceedings of the Fourth Linguistic Annotation Workshop (LAW IV).

Donmez, P.; Carbonell, J. G.; and Schneider, J. 2009. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD, 259-268.

Druck, G.; Settles, B.; and McCallum, A. 2009. Active learning by labeling features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 81-90. ACL Press.

Fort, K.; Adda, G.; and Cohen, K. B. 2011. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics 37:413-420.

Frank, A., and Asuncion, A. 2010. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

Karger, D. R.; Oh, S.; and Shah, D. 2011a. Budget-optimal crowdsourcing using low-rank matrix approximations. In Proceedings of the Allerton Conference on Communication, Control and Computing, Monticello, IL.

Karger, D. R.; Oh, S.; and Shah, D. 2011b. Budget-optimal task allocation for reliable crowdsourcing systems. CoRR abs/1110.3564.

Karger, D. R.; Oh, S.; and Shah, D. 2011c. Iterative learning for reliable crowdsourcing systems. In Neural Information Processing Systems (NIPS), Granada, Spain.

Paolacci, G.; Chandler, J.; and Ipeirotis, P. G. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5(5):411-419.

Passonneau, R. J.; Salleb-Aouissi, A.; and Ide, N. 2009. Making sense of word sense variation. In Proceedings of the NAACL-HLT 2009 Workshop on Semantic Evaluations.

Raykar, V. C.; Yu, S.; Zhao, L. H.; Jerebko, A.; Florin, C.; Valadez, G. H.; Bogoni, L.; and Moy, L. 2009. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 889-896. New York, NY, USA: ACM.

Raykar, V. C.; Yu, S.; Zhao, L. H.; Valadez, G. H.; Florin, C.; Bogoni, L.; and Moy, L. 2010. Learning from crowds. Journal of Machine Learning Research 11:1297-1322.

Sheng, V. S.; Provost, F.; and Ipeirotis, P. G. 2008. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 614-622. New York, NY, USA: ACM.

Snow, R.; O'Connor, B.; Jurafsky, D.; and Ng, A. Y. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 254-263. Morristown, NJ, USA: Association for Computational Linguistics.

Sorokin, A., and Forsyth, D. 2008. Utility data annotation with Amazon Mechanical Turk. In Computer Vision and Pattern Recognition Workshops.

Whitehill, J.; Ruvolo, P.; Wu, T.; Bergsma, J.; and Movellan, J. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Bengio, Y.; Schuurmans, D.; Lafferty, J.; Williams, C. K. I.; and Culotta, A., eds., Advances in Neural Information Processing Systems 22, 2035-2043.

Yan, Y.; Rosales, R.; Fung, G.; Schmidt, M.; Hermosillo, G.; Bogoni, L.; Moy, L.; and Dy, J. G. 2010. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.

Yan, Y.; Rosales, R.; Fung, G.; and Dy, J. 2011. Active learning from crowds. In Getoor, L., and Scheffer, T., eds., Proceedings of the 28th International Conference on Machine Learning (ICML-11), 1161-1168. New York, NY, USA: ACM.