Beyond Independent Agreement: A Tournament Selection Approach for
Quality Assurance of Human Computation Tasks
Yu-An Sun* and Shourya Roy*
*Xerox Innovation Group, 800 Philips Road, Webster, NY; IIT Madras Research Park, Chennai, India
{yuan.sun; shourya.roy}@xerox.com

Greg D. Little
MIT CSAIL, 32 Vassar Street, Cambridge, MA
glittle@gmail.com

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
Quality assurance remains a key topic in the human computation research field. Prior work indicates that independent agreement is effective for low difficulty tasks, but has limitations. This paper addresses this problem by proposing a tournament selection based quality control process. The experimental results in this paper show that humans are better at identifying correct answers than at producing them themselves. After a discussion of related work, this paper describes our approach and the experiments we ran on our solution. We then spend the rest of the paper discussing the results, including some high level insights for future research topics that came from this work.
Introduction
Human computation is a growing research field. It holds the promise of humans and computers working seamlessly together to implement powerful systems, but it requires quality assurance to identify the correct results produced by the crowd. A common quality control method is redundancy (independent agreement): let multiple people work on the same task and take the majority answer as the correct result. However, this method does not address the situation where most people produce the same incorrect answer while a small percentage of better quality work is mixed in. As observed in [12], the chance of obtaining independent agreement decreases as a question becomes more difficult or has too many possible correct answers.
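To make this failure mode concrete, the following minimal sketch (Python; the response strings and counts are hypothetical, loosely modeled on the translation results reported later in this paper) shows how plurality voting over redundant responses selects the most frequent answer, even when that answer is a wrong literal translation shared by most workers:

```python
from collections import Counter

def plurality_vote(responses):
    """Return the most frequent answer among redundant worker responses."""
    return Counter(responses).most_common(1)[0][0]

# Hypothetical response counts for one idiom-translation task: most workers
# paste the same literal (wrong) machine translation, while only a minority
# produce the intended interpretation.
responses = ["burning of fuel"] * 18 + ["celebrate your joy"] * 7 + ["light the lamp"] * 5
print(plurality_vote(responses))   # -> "burning of fuel": the shared wrong answer wins
```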
This paper addresses this problem by proposing a tournament selection based quality control process. The tournament selection method is commonly used as part of genetic algorithms to mimic the survival-of-the-fittest process. By imposing a tournament selection based quality control process on human computation tasks, we allow the minority of good quality work to prevail over a majority that is incorrect in the same way. In this paper, we focus on tasks like translation and problem solving, but the approach is generalizable to other difficult tasks which potentially have more than one correct answer.

Related Work

Online labour marketplaces like Amazon Mechanical Turk (AMT) allow companies or individuals to post jobs for a variety of tasks, e.g. review collection [1], image labeling [2], user studies [3], word-sense disambiguation [4], machine translation evaluation [5], and EDA simulation [6], attracting millions of users from all over the world. While most of these attempts have been successful in generating large volumes of data, quality assurance has remained one central challenge [10]. Poor judgments from ethical workers and wrong responses from dishonest workers or spammers produce a lot of junk responses in crowdsourcing tasks. In a well known blog post, AMT has been compared to a "market for lemons", following the terminology of a seminal paper by the economist George Akerlof (http://www.crowdsourcing.org/editorial/Mechanical+Turk,+Low+Wages,+and+the+Market+for+Lemons/3681). In such a market, owing to the lack of quality assurance, overall payment remains low, under the assumption that an additional framework is needed to filter good input from junk. In fact, owing to the abundance of spammers in marketplaces, most requestors rely on redundancy followed by plurality voting schemes to ensure quality. For example, the authors of [7] reported that majority voting led to the best results in categorizing search results into four categories, viz. off-topic, spam, matching, and not-matching.
Similar observations were reported in [8] for natural language evaluation tasks. However, the performance of voting based algorithms can be affected by collusion among participants. In [9], the authors presented a novel algorithm that separates worker bias from error using hidden gold standard answers, used in post-processing to estimate the true worker error rate.

Most of the quality assurance work in the context of crowdsourcing has been for tasks with discrete answers and one clear correct answer. Marketplaces like AMT are best suited for tasks in which there is a bona fide answer, as otherwise users would be able to "game" the system and provide nonsense answers in order to decrease their time spent and thus increase their rate of pay [3]. However, there are plenty of tasks, such as translation, which do not fall in this category: due to the variability of natural languages, there can be multiple different but perfectly valid translations of a given sentence. Similarly, for puzzle solving tasks there can be multiple and possibly different ways of solving a puzzle. For such tasks, we envision an ecosystem where the crowd itself creates and chooses the best answers without any oracle intervention, albeit at a somewhat higher cost and time. In the absence of a bona fide answer, and to avoid the involvement of oracles, this is a novel mechanism for identifying the best answers. Techniques relying on aggregation, or on acceptance/rejection based on responses to gold data, are not typically applicable in these cases. Prior work [11] has reported voting using fuzzy matching techniques for translation tasks, but we are apprehensive about the generalizability of that approach owing to the several types of variation in translations.
Our Approach

Our approach includes the following steps for quality assurance (a code sketch of the process is given at the end of this section):

1. Start with a single human computation task.
2. Ask n people to do it independently. One person can only work on one task in this set of tasks of size n. n is a variable we can optimize for different types of human computation tasks.
3. Collect all n results and discard duplicates. Only k unique answers are left (k ≤ n).
4. Tournament selection: randomly pick two results from the k answers and ask one person to give an up-or-down vote on the comparison; the result that gets the up vote goes into the pool of the "next generation".
5. Repeat step 4 n times; this generates a new pool of answers of size n.
6. Stop condition: repeat steps 2-5 for m rounds, where m is the stop condition variable to optimize based on the type of human computation task.
7. After the stop condition is met, a pool of final selected results is generated; the majority answer from this pool is identified as the final answer.

Our approach is based on the hypothesis that humans are better at comparing results to pick the correct one than at producing the correct results themselves. In all the experiments described in this paper, n is set to 30 and m is set to 5.
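The following is a minimal sketch of this process in Python. The pairwise crowd judgment of step 4 is stood in for by a crowd_prefers callback (a hypothetical interface, not any particular marketplace API); in a real deployment it would post an up-or-down voting task to a platform such as CrowdFlower. Matching the experiments described later, each round here repeats only the pairwise selection on the evolving pool.

```python
import random
from collections import Counter

def tournament_select(answers, crowd_prefers, m=5):
    """Sketch of the tournament-selection quality control process above.

    answers       -- the n independently produced answers (step 2)
    crowd_prefers -- callback taking two answers and returning the one a
                     crowd member up-votes (step 4); hypothetical interface
    m             -- number of selection rounds, the stop condition (step 6)
    """
    n = len(answers)
    pool = list(set(answers))              # step 3: keep k unique answers (k <= n)
    for _ in range(m):                     # step 6: repeat the selection for m rounds
        if len(set(pool)) == 1:
            break                          # the pool has already converged
        next_pool = []
        for _ in range(n):                 # step 5: n comparisons per round
            a, b = random.sample(pool, 2)  # step 4: pick two results at random
            # identical picks need no human vote (see the Cost section)
            winner = a if a == b else crowd_prefers(a, b)
            next_pool.append(winner)       # the winner joins the "next generation"
        pool = next_pool
    # step 7: the majority answer of the final pool is the final answer
    return Counter(pool).most_common(1)[0][0]
```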
Experiments

In this section, we provide details of the experimental setup and the results of our experiments. We consider two tasks, viz. (i) translation of Hindi and Chinese idioms into English and (ii) solving brain teasers. Both tasks produce results which are difficult to verify automatically. We consider 5 Hindi idioms (Fig 1) and 5 Chinese idioms (Fig 2) for translation. The surface meanings of both sets of idioms are different from their true interpretations. For another set of experiments, we used 9 brain teasers, which are shown along with their correct solutions in Table 1 in the Appendix. We performed these experiments on CrowdFlower (http://crowdflower.com/), which facilitates the completion of online micro-tasks across a number of labor channels including AMT.

Figure 1: Hindi Idioms with Corresponding Ground Truth

Figure 2: Chinese Idioms with Corresponding Ground Truth
We performed a series of experiments to validate the hypothesis explained in the "Our Approach" section. The rest of this section provides details of the corresponding experimental setup and the obtained results.
Experimental Setup for Hypothesis Testing
Phase 1: CREATE
In the first set of experiments, we asked crowd members to translate idioms and solve puzzles. For translations, specific instructions were given to capture the true interpretations in English and not to produce literal translations. One unit of crowd work was to translate two idioms, and each idiom was translated 30 times (i.e. by 30 different crowd members). Similarly, each puzzle was solved 30 times, and each task involved solving two puzzles. We deliberately kept redundancy high to bring out the effect of a large number of wrong answers in the crowd response. Each translation and puzzle task was paid $0.03. There was no automatic or manual validation of crowd input.
Phase 2: COMPARE
In this set of experiments, each crowd member had to compare two options and pick the better translation of an idiom (or the correct solution of a puzzle). One of the options is the desired translation and the other is one of the 30 responses obtained in Phase 1. The objective was to test our hypothesis that it is relatively easy for people to decide on the correct answer even when they cannot come up with the correct answer themselves.
Figure 3: Distribution of correct and wrong responses in Phase 1 for translation of Hindi idioms. (The dotted box shows the number of responses exactly matching the output of translate.google.com.)

Figure 4: Distribution of correct and wrong responses in Phase 1 for translation of Chinese idioms.

Figure 5: Distribution of correct and wrong responses in Phase 1 for solving brain teasers.

Figure 6: Distribution of correct and wrong decisions in Phase 2 for translation of Hindi idioms.

Figure 7: Distribution of correct and wrong decisions in Phase 2 for translation of Chinese idioms.

Figure 8: Distribution of correct and wrong decisions in Phase 2 for solving brain teasers.
Result Analysis
Figure 3 shows the distribution of correct and wrong responses obtained for the Hindi idioms at the end of Phase 1. We manually validated all responses obtained and identified the correct ones. We were lenient in this process, and approximately correct responses were also marked as correct. For example, for idiom number 1 in Fig 1, we consider Celebrate your joy, to prosper or to feel extremely jubilant, to feel extremely jubilant, celebrating an event and to rejoice as correct, though only one of them matches the corresponding ground truth. In spite of such leniency, only a small fraction of translations were found to be correct, as evident from Fig 3. A large fraction (in the dotted box) of wrong responses were literal translations from a common automatic translation service, Google Translate, which obviously failed to capture the interpretation of the idioms (for idiom number 1, the response was burning of fuel). Similar observations were made for the Chinese translations in Fig 4. As for solving brain teasers, only a few of the responses were correct, as shown in Fig 5. As CrowdFlower publishes these tasks to multiple platforms, there was a slight variation in the total number of responses obtained for each puzzle, which we believe was due to synchronization issues between the platforms.

On further analysis, we found that most of the crowd members who responded to both experiments were from India and Pakistan, who are expected to be less familiar with Chinese and hence were probably taking help from automatic translation services. This was evident from the higher number of correct responses for Chinese idiom 1, where the correct response is very similar to the output of a common automatic translation service (Google Translate).
These results not only show the poor quality of the responses obtained but also point to the fact that a plurality voting scheme would lead to wrong results. For all 10 idioms, the translation receiving the maximum number of responses was the one from the automatic translation service, in most cases well ahead of the second most frequent response. We observed the same for the brain teasers; for example, for puzzle no. 6, the majority answer produced by the crowd was eleven, while the correct answer is nine.

In such difficult tasks an unconstrained crowd produces many more poor quality results. This is due not only to cognitive errors but also to deliberate ones, as spammers figure out in no time that there is no automatic or manual validation in a task. Hence, it becomes difficult to identify the hidden minority gems of correct entries. However, we believe it is easier for crowd members to pick the better translation when the ground truth is presented alongside a wrong translation. The objective of the Phase 2 experiments is to validate this hypothesis: it is relatively easy for people to decide on the correct answer. We presented the ground truth (Fig 1) and one wrong response from Phase 1 for each idiom to 30 crowd members; the results are shown in Figure 6. For example, for idiom 1 the options were Celebrate your joy and Light the lamp. As expected, the responses were much better, as a significantly higher number of crowd members were able to decide on the right answer. Similar observations were made on the set of Chinese idioms (Fig 7), though in comparison to Hindi idioms, the number of correct decisions for Chinese idioms was lower.

For puzzles as well, the proportion of correct responses increased from Phase 1 to Phase 2, but the average number of correct decisions was much lower than for the translation tasks. Analysis revealed that, for some puzzles (e.g. puzzle 7), there was not much difference in effort between the two stages, because people had to solve the puzzle to make a decision between the given responses. However, for others (e.g. puzzle 8) it was easy to tell whether an answer was correct just by looking at the options.

Experiments: Tournament Selection

The observations from Phase 2 of the previous set of experiments were encouraging, but the reliance on ground truth was a limitation. For most real world tasks, the presence of ground truth is an impractical assumption. However, we had observed that there are a few, albeit minority, responses which are acceptable as correct at the end of Phase 1 when the initial pool size is large enough (30 in our experiments).

Tournament Selection

For the experiments on the Tournament Selection approach, we consider one Chinese idiom and its unique translations obtained from Phase 1, run over 5 rounds of selection. In the first round, a uniform distribution over the unique translations is assumed; randomly selected pairs are presented to crowd members, who decide on the better one, 30 times. The selected translations become the seed for the next round of selection, and the same process is repeated as described in the Our Approach section. The objective is to validate that our approach facilitates the emergence of the correct translation as a clear winner even if it was a small minority at the initial stage (the output of Phase 1).
Experiment Results
We ran the multi-round Tournament Selection algorithm, as explained in the "Our Approach" section, on Chinese idiom no. 2 in Figure 2. We started with the distinct outputs of Phase 1 as shown below, with entry number 4 being the correct translation:

1. Like a dog fails to draw a tiger
2. Who are you?
3. none
4. Attempting something beyond one's ability and fail
5. painted tiger anti-dog
6. to try to draw a tiger and end up with the likeness of a dog
The initial set of pairs was chosen by selecting uniformly at random from this pool and presented to the crowd to choose the better one. Each chosen translation is put into a new pool, and the set of pairs for the next round is chosen by selecting uniformly at random from this new pool. This process is repeated. At the end of each round, the proportion of each of the above entries was computed; the corresponding plot is shown in Figure 9. The correct translation was a minority to start with, but it gradually surpassed all other candidates and emerged as a clear winner within five rounds. This happened without any oracle intervention or presence of ground truth and was done entirely by the crowd.

Figure 9: Result of Tournament Selection for one Chinese idiom. The correct translation emerges as the clear winner within 5 rounds.
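The per-round proportions plotted in Figure 9 can be computed directly from the recorded selection pools. A minimal sketch, assuming pools is a hypothetical list holding the answer pool produced at the end of each round:

```python
from collections import Counter

def round_proportions(pools):
    """For each round's pool, return the fraction of the pool held by each unique answer."""
    return [
        {answer: count / len(pool) for answer, count in Counter(pool).items()}
        for pool in pools
    ]
```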
Cost
It is important to note that in our experiment, whenever two identical answers are randomly picked from the population, the pair does not need to be sent out for a human up-or-down vote. This significantly reduces additional cost in the later rounds. For example, in round 5, only 14 human computation voting tasks are sent to CrowdFlower for a vote, while the rest advance automatically since they are identical answers. More work can be done to identify what other cost-saving variations can be added to this approach.
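A small sketch of that bookkeeping, consistent with the tournament_select sketch given earlier (the pairing logic is the same; only non-identical pairs are counted as paid voting tasks):

```python
import random

def paid_votes_in_round(pool, n):
    """Count how many of n random pairings actually require a human up-or-down vote."""
    paid = 0
    for _ in range(n):
        a, b = random.sample(pool, 2)   # pick two entries from the current pool
        if a != b:
            paid += 1                   # distinct answers: send out a voting task
        # identical answers advance automatically at no cost
    return paid
```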
Observation and Future Work
Convergence
From Fig 9, it is clear that after 3 rounds the correct answer has emerged as the clear winner. It takes only 5 rounds for the clear winner to reach 90% of the population. We observe that translation is a type of human computation task well suited to our tournament selection method for post-process quality control, because it is relatively easy to pick the right translation when you see it. Further work needs to be done to test the translation of longer paragraphs, since we only experimented with idiom translation.
References

[1] Su, Q., Pavlov, D., Chow, J., and Baker, W.C. 2007. Internet-scale collection of human-reviewed data. In Proc. of the 16th International Conference on World Wide Web (WWW-2007).
[2] Sorokin, A. and Forsyth, D. 2008. Utility data annotation with Amazon Mechanical Turk. In Proc. of the First IEEE Workshop on Internet Vision at CVPR-2008.
[3] Kittur, A., Chi, E.H., and Suh, B. 2008. Crowdsourcing user studies with Mechanical Turk. In Proc. of the 26th Annual ACM Conference on Human Factors in Computing Systems (CHI-2008).
[4] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A.Y. 2008. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proc. of EMNLP-2008.
[5] Callison-Burch, C. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In Proc. of EMNLP-2009.
[6] DeOrio, A. and Bertacco, V. 2009. Human computing for EDA. In Proc. of the 46th Annual Design Automation Conference (DAC-2009).
[7] Le, J., Edmonds, A., Hester, V., and Biewald, L. 2010. Ensuring quality in crowdsourced search relevance evaluation. In Workshop on Crowdsourcing for Search Evaluation, ACM SIGIR 2010.
[8] Snow, R., O'Connor, B., Jurafsky, D., and Ng, A.Y. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254-263. Association for Computational Linguistics.
[9] Ipeirotis, P., Provost, F., and Wang, J. 2010. Quality management on Amazon Mechanical Turk. In KDD-HCOMP '10.
[10] Kazai, G. and Milic-Frayling, N. 2009. On the evaluation of the quality of relevance assessments collected through crowdsourcing. In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, pages 21-22.
[11] Ambati, V. and Vogel, S. 2010. Can crowds build parallel corpora for machine translation systems? In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data With Amazon's Mechanical Turk, pages 62-65.
[12] Little, G. and Sun, Y. 2011. Human OCR: Insights from a Complex Human Computation Process. In Workshop on Crowdsourcing and Human Computation, Services, Studies and Platforms, ACM CHI 2011.
Appendix
Table 1: Puzzles used and the correct solutions

Puzzle 1: If you had an infinite supply of water and a 5 quart and 3 quart pail, how would you measure exactly 4 quarts?
Solution: A and B denote the 5 and 3 quart pails. Here are the steps: (1) Fill A. (2) Pour 3 quarts from A to B. (3) Empty B. (4) Pour the remaining 2 from A to B. (5) Fill A. (6) Pour enough (1 quart) to fill B. (7) A now has 4 quarts.

Puzzle 2: You've got someone working for you for seven days and a gold bar to pay them. The gold bar is segmented into seven connected pieces. You must give them a piece of gold at the end of every day. If you are only allowed to make two breaks in the gold bar, how do you pay your worker?
Solution: Cut it into 3 parts where the parts are 1, 2 and 4 segments long. Day 1 - give the 1 segment bar. Day 2 - give the 2 segment bar, ask for the 1 segment bar back. Day 3 - give the 1 segment bar. Day 4 - give the 4 segment bar, ask for the 1 + 2 back. Day 5 - give the 1 segment bar. Day 6 - give the 2 segment bar, ask for the 1 segment bar back. Day 7 - give the 1 segment bar.

Puzzle 3: You have 5 jars of pills. Each pill weighs 10 grams, except for contaminated pills contained in one jar, where each pill weighs 9 grams. Given a scale, how could you tell which jar had the contaminated pills in just one measurement?
Solution: (1) Mark the jars with numbers 1, 2, 3, 4, and 5. (2) Take 1 pill from jar 1, 2 pills from jar 2, 3 pills from jar 3, 4 pills from jar 4 and 5 pills from jar 5. (3) Put all of them on the scale at once and take the measurement. (4) Now subtract the measurement from 150 (1*10 + 2*10 + 3*10 + 4*10 + 5*10). (5) The result gives you the number of the jar which has the contaminated pills.

Puzzle 4: You have a bucket of jelly beans. Some are red, some are blue, and some green. With your eyes closed, pick out 2 of a like color. How many do you have to grab to be sure you have 2 of the same?
Solution: Four - if you end up picking one of each color the first 3 times, the 4th one will guarantee you have 2 of some color.

Puzzle 5: Given a rectangular cake with a rectangular piece removed (any size or orientation), how would you cut the remainder of the cake into two equal halves with one straight cut of a knife?
Solution: Cut at the line joining the center of the cake and the center of the rectangular piece removed.

Puzzle 6: There's this unusual little bar, where you can get a free beer if you know the secret code. The secret code works like this: You sit down at the bar. The bartender tells you a number. And you tell him another number. If it's right, you get a free beer. For example, a customer goes up to the bar and the bartender says, "Six." The customer says, "Three," and he gets his free beer. The second fellow goes up to the bar, and the bartender says, "Twelve." The customer says, "Six," and he gets his free beer! A third customer sits at the bar, and the bartender says, "Fourteen." The customer says, "Eight." He gets a free beer. You are sitting there. The bartender turns to you and says, "Twenty-two." What do you say?
Solution: 9, because there are nine letters in twenty-two.

Puzzle 7: A chicken and a half can lay an egg and a half in a day and a half. How long will it take for two chickens to lay 32 eggs?
Solution: 24 days. If 1.5 chickens lay 1.5 eggs in 1.5 days, then 2 chickens can lay 2 eggs in 1.5 days. 32 eggs, or 16 * 2 eggs, can be laid in 16 * 1.5 days, or 24 days.

Puzzle 8: "This is an unusual paragraph. I'm curious as to just how quickly you can find out what is so unusual about it. It looks so ordinary and plain that you would think nothing was wrong with it. In fact nothing is wrong with it. It is highly unusual though, study it and think about it but you still may not find anything odd. But if you work at it a bit you might find out. Try to do so without any coaching." What is unique about the paragraph?
Solution: There is no "e" in that entire paragraph.

Puzzle 9: You have a four-ounce glass and a nine-ounce glass. You have an endless supply of water. You can fill or dump either glass. It turns out, you can measure six ounces of water using these two glasses. The question is: what is the fewest number of steps in which you can measure six ounces?
Solution: 6 steps. Fill the 9 ounce glass; pour into the four ounce glass and throw this away; pour into the four ounce glass and throw this away; pour the remaining one ounce into the four ounce glass; fill the nine ounce glass and pour it on top of the one ounce in the four ounce glass. What is left in the 9 ounce glass is 6 ounces.