Human Computation: Papers from the 2011 AAAI Workshop (WS-11-11)

Beyond Independent Agreement: A Tournament Selection Approach for Quality Assurance of Human Computation Tasks

Yu-An Sun* and Shourya Roy* (*Xerox Innovation Group, 800 Philips Road, Webster, NY; IIT Madras Research Park, Chennai, India; {yuan.sun; shourya.roy}@xerox.com), and Greg D. Little (MIT CSAIL, 32 Vassar Street, Cambridge, MA; glittle@gmail.com)

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Quality assurance remains a key topic in human computation research. Prior work indicates that independent agreement is effective for low-difficulty tasks, but it has limitations. This paper addresses the problem by proposing a tournament selection based quality control process. Our experimental results show that humans are better at identifying correct answers than at producing them themselves. After a discussion of related work, we describe our approach and the experiments we ran on our solution. We then discuss the results, including some high-level insights from this work that suggest future research topics.

Introduction

Human computation is a growing research field. It holds the promise of humans and computers working seamlessly together to implement powerful systems, but it requires quality assurance to identify the correct results produced by the crowd. A common quality control method is redundancy (independent agreement): let multiple people work on the same task and take the majority response as the correct result. However, this method does not address the situation where most people produce the same incorrect answer while a small percentage of better quality work is mixed in. As observed in [12], the chance of obtaining independent agreement decreases as a question becomes more difficult or has too many possible correct answers.

This paper addresses this problem by proposing a tournament selection based quality control process. Tournament selection is commonly used as part of genetic algorithms to mimic a survival-of-the-fittest process. Imposing a tournament selection based quality control process on human computation tasks allows a minority of good quality work to prevail over a majority that is incorrect in the same way. In this paper we focus on tasks such as translation and problem solving, but the approach is generalizable to other difficult tasks that potentially have more than one correct answer.
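For readers unfamiliar with the operator we borrow from genetic algorithms, the short Python sketch below illustrates classical binary tournament selection over a population scored by a numeric fitness function; the population and the fitness function here are toy placeholders and are not part of our method.

    import random

    def tournament_select(population, fitness, rounds):
        # Classical binary tournament selection: repeatedly draw two
        # candidates at random and keep the fitter one.
        next_generation = []
        for _ in range(rounds):
            a, b = random.sample(population, 2)
            next_generation.append(a if fitness(a) >= fitness(b) else b)
        return next_generation

    # Toy usage: numeric individuals with identity fitness.
    print(tournament_select([3, 7, 1, 9, 4, 4, 2], fitness=lambda x: x, rounds=7))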
Related Work

Online labour marketplaces like Amazon Mechanical Turk (AMT) allow companies or individuals to post jobs for a variety of tasks, e.g. review collection [1], image labeling [2], user studies [3], word-sense disambiguation [4], machine translation evaluation [5], and EDA simulation [6], attracting millions of users from all over the world. While most of these attempts have been successful in generating large volumes of data, quality assurance remains a central challenge [10]. Poor judgments from well-meaning workers and wrong responses from dishonest workers or spammers produce a lot of junk responses in crowdsourcing tasks. In a well-written blog post (http://www.crowdsourcing.org/editorial/Mechanical+Turk,+Low+Wages,+and+the+Market+for+Lemons/3681), the author compared AMT to a "market for lemons", following the terminology of a seminal paper by the economist George Akerlof. In such a market, owing to the lack of quality assurance, overall payment remains low under the assumption that an additional framework is needed to filter good input from junk. In fact, owing to the abundance of spammers in marketplaces, most requesters rely on redundancy followed by plurality voting schemes to ensure quality. For example, the authors of [7] reported that majority voting led to the best results in a task categorizing search results into four categories, viz. off topic, spam, matching, and not-matching. Similar observations were reported in [8] for natural language evaluation tasks. However, the performance of voting based algorithms can be affected by collusion among participants. In [9], the authors presented a novel algorithm that separates worker bias from errors using hidden gold standard answers and used it in post-processing to estimate the true worker error rate.

Most of the quality assurance work in the context of crowdsourcing has been done for tasks with discrete answers and one clear correct answer. Marketplaces like AMT are best suited for tasks in which there is a bona fide answer, as otherwise users would be able to "game" the system and provide nonsense answers in order to decrease their time spent and thus increase their rate of pay [3]. However, there are plenty of tasks, such as translation, which do not fall into this category: due to variability in natural languages, there can be multiple different but perfectly valid translations of a given sentence. Similarly, for puzzle solving there can be multiple and possibly different ways of solving a puzzle. For such tasks, we envision an ecosystem where the crowd itself creates and chooses the best answers without any external intervention, albeit at somewhat higher cost and time. In the absence of a bona fide answer, and to avoid involving an Oracle, this is a novel mechanism for identifying the best answers. Techniques relying on aggregation, or on acceptance/rejection based on responses to gold data, are typically not applicable in these cases. Prior work [11] reported voting using fuzzy matching techniques for translation tasks, but we are apprehensive about the generalizability of that approach owing to the many types of variation in translations.

Our Approach

Our approach includes the following steps for quality assurance:

1. Start with a single human computation task.
2. Ask n people to do it independently. One person can only work on one task in this set of tasks of size n. n is a variable we can optimize for different types of human computation tasks.
3. Collect all n results and discard duplicates, leaving k unique answers (k ≤ n).
4. Tournament Selection: randomly pick two results from the k answers and ask one person to give an up-or-down vote on the comparison; the result that gets the up vote goes into the pool of the "next generation".
5. Repeat step 4 n times; this generates a new pool of answers of size n.
6. Stop Condition: repeat steps 2-5 for m rounds, where m is a stop-condition variable to optimize based on the type of human computation task.
7. After the stop condition is met, a pool of final selected results is generated, and the majority answer from this pool is identified as the final answer.

Our approach is based on the hypothesis that humans are better at comparing results to pick the correct one than at producing the correct results themselves. In all the experiments described in this paper, the variable n is set to 30 and m is set to 5.
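A minimal Python sketch of these steps follows. The callbacks ask_crowd_to_solve and ask_crowd_to_compare are hypothetical stand-ins for the crowdsourced microtasks (they are not a real Crowdflower API), and the loop re-selects from the current pool in each of the m rounds, following the description in the Tournament Selection experiments later in the paper.

    import random
    from collections import Counter

    def tournament_quality_control(task, ask_crowd_to_solve, ask_crowd_to_compare,
                                   n=30, m=5):
        # Steps 1-3: collect n independent answers and keep the unique ones.
        answers = [ask_crowd_to_solve(task) for _ in range(n)]
        pool = list(set(answers))                     # k unique answers, k <= n

        # Steps 4-6: m rounds of crowd-driven tournament selection.
        for _ in range(m):
            next_pool = []
            for _ in range(n):
                a, b = random.choice(pool), random.choice(pool)
                # Identical pairs advance without a vote (see the Cost section).
                next_pool.append(a if a == b else ask_crowd_to_compare(a, b))
            pool = next_pool

        # Step 7: the majority answer of the final pool is the result.
        return Counter(pool).most_common(1)[0][0]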
Experiments

In this section, we provide details of the experimental setup and the results of our experiments. We consider two tasks, viz. (i) translation of Hindi and Chinese idioms to English and (ii) solving brain teasers. Both tasks produce results that are difficult to verify automatically. We consider 5 Hindi idioms (Fig 1) and 5 Chinese idioms (Fig 2) for translation. The surface meanings of both sets of idioms are different from their true interpretations. For another set of experiments, we used 9 brain teasers, which are shown along with their correct solutions in Table 1 in the Appendix. We performed these experiments on Crowdflower (http://crowdflower.com/), which facilitates the completion of online micro-tasks through a number of labor channels including AMT.

Figure 1: Hindi idioms with corresponding ground truth.

Figure 2: Chinese idioms with corresponding ground truth.

We performed a series of experiments to validate the hypotheses explained in the "Our Approach" section. The rest of this section provides details of the corresponding experimental setup and the obtained results.

Experimental Setup for Hypothesis Testing

Phase 1: CREATE. In the first set of experiments we asked crowd members to translate idioms and to solve puzzles. For translations, specific instructions were given to capture the true interpretation in English and not to produce a literal translation. One unit of crowd work was to translate two idioms, and each idiom was translated 30 times (i.e. by 30 different crowd members). Similarly, each puzzle was solved 30 times and each task involved solving two puzzles. We deliberately kept redundancy high to bring out the effect of the large number of wrong answers in crowd responses. Each translation and puzzle task was paid $0.03. There was no automatic or manual validation of the crowd input.

Phase 2: COMPARE. In this set of experiments, each crowd member had to compare two options and pick the better translation of an idiom (or the correct solution of a puzzle). One of the options was the desired translation and the other was one of the 30 responses obtained in Phase 1. The objective was to test our hypothesis that it is relatively easy for people to decide on the correct answer even when they cannot come up with it themselves.
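As an illustration of how the Phase 2 decisions can be tallied, the sketch below pairs the ground truth with responses sampled from Phase 1; collect_decision is a hypothetical stand-in for the Crowdflower comparison task, not an actual API.

    import random

    def phase2_accuracy(ground_truth, phase1_responses, collect_decision, voters=30):
        # Each voter sees the ground truth next to one Phase 1 response
        # and picks the option they consider better.
        correct = 0
        for _ in range(voters):
            options = [ground_truth, random.choice(phase1_responses)]
            random.shuffle(options)                  # avoid positional bias
            correct += (collect_decision(options[0], options[1]) == ground_truth)
        return correct / voters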
Figure 3: Distribution of correct and wrong responses in Phase 1 for translation of Hindi idioms. (The dotted box shows the number of responses exactly matching the output of translate.google.com.)

Figure 4: Distribution of correct and wrong responses in Phase 1 for translation of Chinese idioms.

Figure 5: Distribution of correct and wrong responses in Phase 1 for solving brain teasers.

Figure 6: Distribution of correct and wrong decisions in Phase 2 for translation of Hindi idioms.

Figure 7: Distribution of correct and wrong decisions in Phase 2 for translation of Chinese idioms.

Figure 8: Distribution of correct and wrong decisions in Phase 2 for solving brain teasers.

Result Analysis

Figure 3 shows the distribution of correct and wrong responses obtained for the Hindi idioms at the end of Phase 1. We manually validated all responses obtained and identified the correct ones. We were lenient in this process and also marked approximately correct responses as correct. For example, for idiom number 1 in Fig 1, we considered "Celebrate your joy", "to prosper or to feel extremely jubilant", "to feel extremely jubilant", "celebrating an event" and "to rejoice" as correct, even though only one of them matches the corresponding ground truth. In spite of such leniency, only a small fraction of the translations were found to be correct, as is evident from Fig 3. A large fraction (in the dotted box) of the wrong responses were literal translations from a common automatic translation service, Google Translate, which obviously failed to capture the interpretation of the idioms (for idiom number 1, the response was "burning of fuel"). Similar observations were made for the Chinese translations in Fig 4. For the brain teasers, only a few of the responses were correct, as shown in Fig 5. As Crowdflower publishes these tasks to multiple platforms, there was slight variation in the total number of responses obtained for each puzzle, which we believe was due to synchronization issues between the platforms.

This result not only shows the poor quality of the responses obtained but also points to the fact that a plurality voting scheme would lead to wrong results. For all 10 idioms, the translation receiving the maximum number of responses came from the automatic translation service, and in most cases it was far ahead of the second highest. We observed the same for the brain teasers; for example, for puzzle no. 6, the majority answer produced by the crowd was "eleven", while the correct answer is "nine". In such difficult tasks an unconstrained crowd produces many more poor quality results. This is due not only to cognitive errors but also to deliberate ones, as spammers figure out in no time that there is no automatic or manual validation in a task. Hence, it becomes difficult to identify the hidden minority gems of correct entries.

However, we believe it is easier for crowd members to pick the better translation when the ground truth is presented alongside a wrong translation. The objective of the Phase 2 experiments was to validate this hypothesis: it is relatively easy for people to decide on the correct answer. We presented the ground truth (Fig 1) and one wrong response from Phase 1 for each idiom to 30 crowd members; the results are shown in Figure 6. For example, for idiom 1 the options were "Celebrate your joy" and "Light the lamp". As expected, the responses are much better: a significantly higher number of crowd members were able to decide on the right answer. Similar observations were made for the set of Chinese idioms (Fig 7), though in comparison to the Hindi idioms the number of correct decisions for the Chinese idioms was lower. On further analysis, we found that most of the crowd members who responded to both experiments were from India and Pakistan, who can be expected to be less familiar with Chinese and hence more likely to rely on automatic translation services. This was evident from the higher number of correct responses for Chinese idiom 1, where the correct response was very similar to the output of a common automatic translation service (Google Translate).

For the puzzles as well, the proportion of correct decisions increased from Phase 1 to Phase 2, but the average number of correct decisions was much lower than for the translation tasks. Analysis revealed that for some puzzles (e.g. puzzle 7) there was not much difference in effort between the two stages, because people had to solve the puzzle in order to decide between the given responses. For others (e.g. puzzle 8), however, it was easy to tell whether an answer was correct just by looking at the options.
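To make the contrast with plurality voting concrete, here is a toy Python illustration; the counts are hypothetical (they are not our experimental data), though the response strings echo the Hindi idiom 1 examples above.

    from collections import Counter

    # Hypothetical Phase 1 tally for one idiom: the literal machine translation
    # dominates, while the correct interpretation is a small minority.
    responses = (["burning of fuel"] * 18        # literal translation, wrong
                 + ["light the lamp"] * 5        # literal translation, wrong
                 + ["celebrate your joy"] * 4    # correct interpretation
                 + ["no idea"] * 3)              # junk
    winner, votes = Counter(responses).most_common(1)[0]
    print(winner, votes)                         # plurality picks the wrong answer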
Experiments: Tournament Selection

The observations from Phase 2 of the previous set of experiments were encouraging, but the presence of ground truth was a limitation: for most real-world tasks, the presence of ground truth is an impractical assumption. However, we had observed that there are a few responses, albeit a minority, that are acceptable as correct at the end of Phase 1 when the initial pool size is large enough (30 in our experiments).

Tournament Selection

For the experiments with the Tournament Selection approach, we consider one Chinese idiom and the unique translations obtained for it in Phase 1, run through 5 rounds of selection. In the first round, a uniform distribution over the unique translations is used, and randomly selected pairs are presented to crowd members, who decide on the better one, 30 times. The selected translations become the seed for the next round of selection, and the same process is repeated as described in the "Our Approach" section. The objective is to validate that our approach facilitates the emergence of the correct translation as a clear winner even if it was a small minority at the initial stage (the output of Phase 1).

Experiment Results

We ran the multi-round Tournament Selection algorithm, as explained in the "Our Approach" section, on Chinese idiom no. 2 in Figure 2. We started with the distinct Phase 1 output shown below, with entry number 4 being the correct translation:

1. Like a dog fails to draw a tiger
2. Who are you?
3. none
4. Attempting something beyond one's ability and fail
5. painted tiger anti-dog
6. to try to draw a tiger and end up with the likeness of a dog

The initial set of pairs was chosen by selecting uniformly at random from this pool and presented to the crowd to choose the better one. Each chosen translation is put into a new pool, and the pairs for the next round are chosen by selecting uniformly at random from this new pool. This process is repeated. At the end of each round, the proportion of each of the above entries was computed, and the corresponding plot is shown in Figure 9. The correct translation was a minority to start with, but it gradually surpassed all the other candidates and emerged as a clear winner within five rounds. This happened without any Oracle's intervention or presence of ground truth; it was done completely by the crowd.

Figure 9: Result of Tournament Selection for one Chinese idiom. The correct translation emerges as the clear winner within 5 rounds.
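A small simulation, under the simplifying assumption that a crowd member presented with one correct and one wrong candidate picks the correct one with a fixed probability p, reproduces the qualitative behaviour of Figure 9. The candidate pool, p, and the printed proportions are illustrative and are not fitted to our data.

    import random

    def simulate_rounds(candidates, correct, p=0.8, pool_size=30, rounds=5):
        # Toy model of the multi-round selection: mixed pairs are decided in
        # favour of the correct candidate with probability p, pairs of two
        # wrong candidates are decided at random, identical pairs advance.
        pool = list(candidates)                  # round 0: uniform over unique answers
        history = []
        for _ in range(rounds):
            new_pool = []
            for _ in range(pool_size):
                a, b = random.choice(pool), random.choice(pool)
                if a == b:
                    new_pool.append(a)
                elif correct in (a, b):
                    wrong = b if a == correct else a
                    new_pool.append(correct if random.random() < p else wrong)
                else:
                    new_pool.append(random.choice((a, b)))
            pool = new_pool
            history.append(pool.count(correct) / pool_size)
        return history                           # proportion of the correct answer per round

    print(simulate_rounds(["cand-1", "cand-2", "cand-3", "correct", "cand-5", "cand-6"],
                          correct="correct"))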
Observation and Future Work

Convergence

From Fig 9 it is clear that after 3 rounds the correct answer has emerged as the clear winner, and it takes only 5 rounds for it to reach 90% of the population. We observe that translation is a type of human computation task well suited to our tournament selection method for post-process quality control, because it is relatively easy to pick the right translation when you see it. Further work is needed to test the translation of longer paragraphs, since we only experimented with idiom translation.

Cost

It is important to note that in our experiment, whenever two identical answers are randomly picked from the population, the pair does not need to be sent out for a human up-or-down vote. This reduces the additional cost significantly in the later rounds. For example, in round 5, only 14 human computation voting tasks were sent to Crowdflower for a vote, while the rest advanced automatically because they were identical answers. More work can be done to identify what other cost-saving variations can be added to this approach.
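A sketch of this shortcut, assuming answers are compared by simple string equality: identical pairs advance for free and only mixed pairs are sent out as paid voting tasks. The pool below is hypothetical and only illustrates how the number of paid votes shrinks once one answer dominates.

    import random

    def split_pairs_for_voting(pool, pool_size=30):
        # Sample the pairs for one round and separate those that need a paid
        # up-or-down vote from identical pairs that advance automatically.
        paid, free = [], []
        for _ in range(pool_size):
            a, b = random.choice(pool), random.choice(pool)
            (free if a == b else paid).append((a, b))
        return paid, free

    late_round_pool = ["dominant answer"] * 25 + ["rare answer"] * 5
    paid, free = split_pairs_for_voting(late_round_pool)
    print(len(paid), "paid votes,", len(free), "automatic advances")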
References

[1] Su, Q., Pavlov, D., Chow, J., and Baker, W.C. 2007. Internet-scale collection of human-reviewed data. In Proc. of the 16th International Conference on World Wide Web (WWW-2007).
[2] Sorokin, A. and Forsyth, D. 2008. Utility data annotation with Amazon Mechanical Turk. In Proc. of the First IEEE Workshop on Internet Vision at CVPR-2008.
[3] Kittur, A., Chi, E. H., and Suh, B. 2008. Crowdsourcing user studies with Mechanical Turk. In Proc. of the 26th Annual ACM Conference on Human Factors in Computing Systems (CHI-2008).
[4] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A.Y. 2008. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proc. of EMNLP-2008.
[5] Callison-Burch, C. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In Proc. of EMNLP-2009.
[6] DeOrio, A. and Bertacco, V. 2009. Human computing for EDA. In Proc. of the 46th Annual Design Automation Conference (DAC-2009).
[7] Le, J., Edmonds, A., Hester, V., and Biewald, L. 2010. Ensuring quality in crowdsourced search relevance evaluation. In Workshop on Crowdsourcing for Search Evaluation, ACM SIGIR 2010.
[8] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A.Y. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP-2008, pages 254-263. Association for Computational Linguistics.
[9] Ipeirotis, P., Provost, F., and Wang, J. 2010. Quality management on Amazon Mechanical Turk. In KDD-HCOMP 2010.
[10] Kazai, G. and Milic-Frayling, N. 2009. On the evaluation of the quality of relevance assessments collected through crowdsourcing. In Proc. of the SIGIR 2009 Workshop on the Future of IR Evaluation, pages 21-22.
[11] Ambati, V. and Vogel, S. 2010. Can crowds build parallel corpora for machine translation systems? In Proc. of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 62-65.
[12] Little, G. and Sun, Y. 2011. Human OCR: Insights from a Complex Human Computation Process. In Workshop on Crowdsourcing and Human Computation, Services, Studies and Platforms, ACM CHI 2011.

Appendix

Table 1: Puzzles used and their correct solutions.

Puzzle 1: If you had an infinite supply of water and a 5 quart and a 3 quart pail, how would you measure exactly 4 quarts?
Solution: Let A and B denote the 5 and 3 quart pails. (1) Fill A. (2) Pour 3 quarts from A to B. (3) Empty B. (4) Pour the remaining 2 quarts from A to B. (5) Fill A. (6) Pour enough (1 quart) from A to fill B. (7) A now has 4 quarts.

Puzzle 2: You've got someone working for you for seven days and a gold bar to pay them. The gold bar is segmented into seven connected pieces. You must give them a piece of gold at the end of every day. If you are only allowed to make two breaks in the gold bar, how do you pay your worker?
Solution: Cut it into 3 parts of 1, 2 and 4 segments. Day 1: give the 1-segment bar. Day 2: give the 2-segment bar and ask for the 1-segment bar back. Day 3: give the 1-segment bar. Day 4: give the 4-segment bar and ask for the 1- and 2-segment bars back. Day 5: give the 1-segment bar. Day 6: give the 2-segment bar and ask for the 1-segment bar back. Day 7: give the 1-segment bar.

Puzzle 3: You have 5 jars of pills. Each pill weighs 10 grams, except for the contaminated pills contained in one jar, where each pill weighs 9 grams. Given a scale, how could you tell which jar has the contaminated pills in just one measurement?
Solution: (1) Mark the jars with numbers 1, 2, 3, 4 and 5. (2) Take 1 pill from jar 1, 2 pills from jar 2, 3 pills from jar 3, 4 pills from jar 4 and 5 pills from jar 5. (3) Put all of them on the scale at once and take the measurement. (4) Subtract the measurement from 150 (1*10 + 2*10 + 3*10 + 4*10 + 5*10). (5) The result gives the number of the jar with the contaminated pills.

Puzzle 4: You have a bucket of jelly beans. Some are red, some are blue, and some are green. With your eyes closed, pick out 2 of a like color. How many do you have to grab to be sure you have 2 of the same?
Solution: Four. If you end up picking one of each color the first 3 times, the 4th one guarantees you have 2 of some color.

Puzzle 5: Given a rectangular cake with a rectangular piece removed (any size or orientation), how would you cut the remainder of the cake into two equal halves with one straight cut of a knife?
Solution: Cut along the line joining the center of the cake and the center of the removed rectangular piece.

Puzzle 6: There's this unusual little bar where you can get a free beer if you know the secret code. The secret code works like this: you sit down at the bar, the bartender tells you a number, and you tell him another number. If it's right, you get a free beer. For example, a customer goes up to the bar and the bartender says, "Six." The customer says, "Three," and he gets his free beer. The second fellow goes up to the bar, and the bartender says, "Twelve." The customer says, "Six," and he gets his free beer. A third customer sits at the bar, and the bartender says, "Fourteen." The customer says, "Eight." He gets a free beer. You are sitting there. The bartender turns to you and says, "Twenty-two." What do you say?
Solution: Nine, because there are nine letters in "twenty-two".

Puzzle 7: A chicken and a half can lay an egg and a half in a day and a half. How long will it take for two chickens to lay 32 eggs?
Solution: 24 days. If 1.5 chickens lay 1.5 eggs in 1.5 days, then 2 chickens can lay 2 eggs in 1.5 days. 32 eggs, or 16 * 2 eggs, can be laid in 16 * 1.5 days, or 24 days.

Puzzle 8: "This is an unusual paragraph. I'm curious as to just how quickly you can find out what is so unusual about it. It looks so ordinary and plain that you would think nothing was wrong with it. In fact nothing is wrong with it. It is highly unusual though, study it and think about it but you still may not find anything odd. But if you work at it a bit you might find out. Try to do so without any coaching." What is unique about the paragraph?
Solution: There is no "e" in the entire paragraph.

Puzzle 9: You have a four-ounce glass and a nine-ounce glass. You have an endless supply of water. You can fill or dump either glass. It turns out you can measure six ounces of water using these two glasses. The question is: what is the fewest number of steps in which you can measure six ounces?
Solution: 6 steps. Fill the nine-ounce glass; pour into the four-ounce glass and throw this away; pour into the four-ounce glass again and throw this away; pour the remaining one ounce into the four-ounce glass; fill the nine-ounce glass; pour from it on top of the one ounce in the four-ounce glass. What is left in the nine-ounce glass is six ounces.
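As a quick check of the arithmetic behind the solution to puzzle 3, the short sketch below confirms that the weight deficit from 150 grams equals the index of the contaminated jar (the function name is ours, added only for illustration).

    def contaminated_jar_from_deficit(contaminated):
        # Take i pills from jar i; good pills weigh 10 g, contaminated pills 9 g.
        measured = sum(i * (9 if i == contaminated else 10) for i in range(1, 6))
        return 150 - measured        # 150 g is the weight if no jar were contaminated

    assert all(contaminated_jar_from_deficit(j) == j for j in range(1, 6))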