PPTX - Panos Ipeirotis

Rewarding Crowdsourced Workers
Panos Ipeirotis
New York University
and
Google
Twitter: @ipeirotis
Joint work with: Jing Wang, Foster Provost,
Josh Attenberg, and Victor Sheng;
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.com
Example: Build a Web Page Classifier
• Need a large number of labeled sites for training
• Get people to look at sites and label them as:
G (general audience), PG (parental guidance), R (restricted), X (porn)
Cost/Speed Statistics
• Undergrad intern: 200 websites/hr, cost: $15/hr
• Mechanical Turk: 2500 websites/hr, cost: $12/hr
Challenges
• We do not know the true category for the objects
– Available only after (costly) manual inspection
• We do not know the quality of the workers
• We want to label objects with true categories
• We want (need?) to know the quality of the workers
Expectation Maximization Estimation
Iterative process to estimate worker error rates
1. Initialize “correct” label for each object (e.g., use majority vote)
2. Estimate error rates for workers (using “correct” labels)
3. Estimate “correct” labels (using error rates, weight worker votes according to quality)
4. Go to Step 2 and iterate until convergence
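The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function name `em_labels`, the `(worker, object, label)` vote format, the fixed iteration count (in place of a convergence check), and the small `smoothing` constant are all hypothetical choices made for the example.

```python
def em_labels(votes, classes, iterations=20, smoothing=0.01):
    """votes: list of (worker, obj, label) tuples.
    Returns (soft_labels, confusion), where confusion[w][true][assigned]
    estimates P[true -> assigned] for worker w."""
    objects = sorted({o for _, o, _ in votes})
    workers = sorted({w for w, _, _ in votes})

    # Step 1: initialize labels from vote fractions (a soft majority vote).
    soft = {o: {c: 0.0 for c in classes} for o in objects}
    for _, o, label in votes:
        soft[o][label] += 1.0
    for o in objects:
        total = sum(soft[o].values())
        soft[o] = {c: v / total for c, v in soft[o].items()}

    conf = {}
    for _ in range(iterations):  # Step 4: iterate (fixed count, an assumption)
        # Step 2: estimate each worker's error rates from the current labels.
        conf = {w: {t: {a: smoothing for a in classes} for t in classes}
                for w in workers}
        for w, o, label in votes:
            for t in classes:
                conf[w][t][label] += soft[o][t]
        for w in workers:
            for t in classes:
                row_total = sum(conf[w][t].values())
                for a in classes:
                    conf[w][t][a] /= row_total

        # Class priors implied by the current labels.
        prior = {c: sum(soft[o][c] for o in objects) / len(objects)
                 for c in classes}

        # Step 3: re-estimate labels, weighting each vote by worker quality.
        new_soft = {o: dict(prior) for o in objects}
        for w, o, label in votes:
            for t in classes:
                new_soft[o][t] *= conf[w][t][label]
        for o in objects:
            total = sum(new_soft[o].values())
            soft[o] = {c: v / total for c, v in new_soft[o].items()}
    return soft, conf


# Hypothetical toy data: two careful workers and one who always answers "G".
truth = {"site1": "G", "site2": "X", "site3": "G", "site4": "X"}
votes = []
for site, label in truth.items():
    votes += [("a", site, label), ("b", site, label), ("s", site, "G")]
soft_labels, confusion = em_labels(votes, ["G", "X"])
```

On toy data like this, the always-"G" worker ends up with a large estimated P[X → G], which is the kind of confusion matrix the following slides analyze.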
Challenge: From Confusion Matrices to Quality Scores
Confusion matrix for spammer worker
P[X → X]=0.847% P[X → G]=99.153%
P[G → X]=0.053% P[G → G]=99.947%
Confusion matrix for good worker
P[X → X]=99.847% P[X → G]=0.153%
P[G → X]=4.053% P[G → G]=95.947%
How to check if a worker is a spammer
using the confusion matrix?
(hint: error rate not enough)
Challenge 1: Spammers are lazy and smart!
Confusion matrix for spammer
P[X → X]=0% P[X → G]=100%
P[G → X]=0% P[G → G]=100%
Confusion matrix for good worker
P[X → X]=80% P[X → G]=20%
P[G → X]=20% P[G → G]=80%

Spammers figure out how to fly under the radar…

In reality, we have 85% G sites and 15% X sites

Error rate of spammer = 0% * 85% + 100% * 15% = 15%
Error rate of good worker = 85% * 20% + 15% * 20% = 20%

False negatives: Spam workers pass as legitimate
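The error-rate arithmetic on this slide weights each row of the confusion matrix by the class prior. A small sketch makes that explicit; the function name `error_rate` and the nested-dictionary layout are assumptions made for the example.

```python
def error_rate(confusion, priors):
    """Prior-weighted probability that the assigned label differs from the truth.
    confusion[true][assigned] = P[true -> assigned]."""
    return sum(
        priors[true] * sum(p for assigned, p in confusion[true].items()
                           if assigned != true)
        for true in priors
    )

priors = {"G": 0.85, "X": 0.15}
spammer = {"G": {"G": 1.0, "X": 0.0}, "X": {"G": 1.0, "X": 0.0}}
good_worker = {"G": {"G": 0.8, "X": 0.2}, "X": {"G": 0.2, "X": 0.8}}

spammer_error = error_rate(spammer, priors)   # 15% on this slide's numbers
good_error = error_rate(good_worker, priors)  # 20% on this slide's numbers
```

With skewed priors, always answering the majority class yields a lower error rate than honest but imperfect work, which is why the error rate alone cannot identify spammers.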
Challenge 2: Humans are biased!
Error rates for legitimate (but biased) employee
P[G → G]=20.0% P[G → P]=80.0% P[G → R]=0.0% P[G → X]=0.0%
P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P → X]=0.0%
P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R → X]=0.0%
P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
• We have 85% G sites, 5% P sites, 5% R sites, 5% X sites
• Error rate of spammer (all G) = 0% * 85% + 100% * 15% = 15%
• Error rate of biased worker = 80% * 85% + 100% * 5% = 73%
False positives: Legitimate workers appear to be spammers
(important note: bias is not just a matter of “ordered” classes)
Solution: Fix bias first, compute error rate afterwards
Error Rates for legitimate (but biased) employee
P[G → G]=20.0% P[G → P]=80.0% P[G → R]=0.0% P[G → X]=0.0%
P[P → G]=0.0% P[P → P]=0.0% P[P → R]=100.0% P[P → X]=0.0%
P[R → G]=0.0% P[R → P]=0.0% P[R → R]=100.0% P[R → X]=0.0%
P[X → G]=0.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=100.0%
When biased worker says G, it is 100% G
When biased worker says P, it is 100% G
When biased worker says R, it is 50% P, 50% R
When biased worker says X, it is 100% X
Small ambiguity for “R-rated” votes but other than that, fine!
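"Fixing the bias" here amounts to applying Bayes' rule: given the label a worker assigned, compute the probability of each true class. The sketch below assumes the class priors from the earlier slide (85% G, 5% each for P, R, X); the function name `soft_label` is hypothetical.

```python
def soft_label(confusion, priors, assigned):
    """P(true class | worker assigned `assigned`), via Bayes' rule.
    Returns None when the worker never assigns that label."""
    joint = {true: priors[true] * confusion[true][assigned] for true in priors}
    total = sum(joint.values())
    if total == 0:
        return None
    return {true: value / total for true, value in joint.items()}

priors = {"G": 0.85, "P": 0.05, "R": 0.05, "X": 0.05}
biased = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}

says_p = soft_label(biased, priors, "P")  # all mass on G
says_r = soft_label(biased, priors, "R")  # split evenly between P and R
```

A systematic bias is invertible in this sense: the corrected labels are nearly unambiguous even though the raw labels are mostly wrong.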
Solution: Fix bias first, compute error rate afterwards
Error Rates for spammer
P[G → G]=100.0% P[G → P]=0.0% P[G → R]=0.0% P[G → X]=0.0%
P[P → G]=100.0% P[P → P]=0.0% P[P → R]=0.0% P[P → X]=0.0%
P[R → G]=100.0% P[R → P]=0.0% P[R → R]=0.0% P[R → X]=0.0%
P[X → G]=100.0% P[X → P]=0.0% P[X → R]=0.0% P[X → X]=0.0%
When spammer says G, it is 25% G, 25% P, 25% R, 25% X
When spammer says P, it is 25% G, 25% P, 25% R, 25% X
When spammer says R, it is 25% G, 25% P, 25% R, 25% X
When spammer says X, it is 25% G, 25% P, 25% R, 25% X
[note: assume equal priors]
The results are highly ambiguous. No information provided!
Expected Misclassification Cost
• High cost: probability spread across classes
• Low cost: probability mass concentrated in one class
Assigned Label | Corresponding “Soft” Label | Label Cost
Spammer: G | <G: 25%, P: 25%, R: 25%, X: 25%> | 0.75
Good worker: P | <G: 100%, P: 0%, R: 0%, X: 0%> | 0.0
[***Assume misclassification cost equal to 1; the solution generalizes to arbitrary misclassification costs across categories]
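One way to reproduce the label costs in this table, offered as a sketch rather than the authors' exact definition: treat the soft label as both the distribution of the true class and the distribution of the class we would report, and sum the probability of every mismatched pair. With unit costs this reduces to one minus the sum of squared probabilities. The function name `expected_cost` and the optional `cost` matrix argument are assumptions.

```python
def expected_cost(soft, cost=None):
    """Expected misclassification cost of a soft label.
    cost[i][j] is the cost of reporting j when the truth is i (default: 1)."""
    total = 0.0
    for i, p_i in soft.items():
        for j, p_j in soft.items():
            if i != j:
                total += p_i * p_j * (1.0 if cost is None else cost[i][j])
    return total

spammer_label = {"G": 0.25, "P": 0.25, "R": 0.25, "X": 0.25}
good_label = {"G": 1.0, "P": 0.0, "R": 0.0, "X": 0.0}

spammer_cost = expected_cost(spammer_label)  # spread-out mass: high cost
good_cost = expected_cost(good_label)        # concentrated mass: zero cost
```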
Quality Score: A scalar measure of quality
• A spammer is a worker who assigns labels randomly, regardless of what the true class is.
QualityScore(Worker) = 1 - Cost(Worker) / Cost(Prior)
• Quality score useful for ranking workers
• Unaffected by systematic biases
• Scalar, so no need to examine confusion matrices
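A sketch of how such a score could be computed from a confusion matrix and class priors. The slides give only the ratio of costs; the aggregation used here (averaging the soft-label cost over the labels a worker assigns, weighted by how often each is assigned) and all function names are assumptions made for illustration.

```python
def expected_cost(soft):
    """Unit-cost expected misclassification cost of a soft label."""
    return sum(p_i * p_j for i, p_i in soft.items()
               for j, p_j in soft.items() if i != j)

def quality_score(confusion, priors):
    """1 - Cost(Worker) / Cost(Prior), with Cost(Worker) averaged over the
    labels the worker assigns (assumed aggregation)."""
    worker_cost = 0.0
    for assigned in priors:
        joint = {t: priors[t] * confusion[t][assigned] for t in priors}
        p_assigned = sum(joint.values())
        if p_assigned == 0:
            continue  # the worker never uses this label
        soft = {t: v / p_assigned for t, v in joint.items()}
        worker_cost += p_assigned * expected_cost(soft)
    return 1.0 - worker_cost / expected_cost(priors)

classes = ["G", "P", "R", "X"]
priors = {"G": 0.85, "P": 0.05, "R": 0.05, "X": 0.05}
spammer = {t: {a: (1.0 if a == "G" else 0.0) for a in classes} for t in classes}
biased = {
    "G": {"G": 0.2, "P": 0.8, "R": 0.0, "X": 0.0},
    "P": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "R": {"G": 0.0, "P": 0.0, "R": 1.0, "X": 0.0},
    "X": {"G": 0.0, "P": 0.0, "R": 0.0, "X": 1.0},
}

spammer_score = quality_score(spammer, priors)
biased_score = quality_score(biased, priors)
```

Under these assumptions the all-G spammer scores zero, because its labels carry no information beyond the prior, while the biased worker scores well above zero despite a 73% raw error rate.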
Quality Score
Question:
How to pay workers?
Challenges
• Thresholding has wrong incentive structure:
• Decent (but still useful) workers remain unused
• If you are above the threshold, no need to improve
• Uncertainty: The quality score is not really a fixed number
• Fluctuations in payment are puzzling for workers
• Best to have only increases in payment
Two Types of Workers
• Divide workers into two groups
– Qualified Workers
• The quality satisfies the target quality level
– Unqualified Workers
• The quality fails to meet the target quality level
A Simple Pricing Model for Qualified Workers
– p: the price paid to all qualified workers
– fW(w): the pdf of the worker reservation wage
– FW(w): the cdf of the worker reservation wage
– R: the fixed price paid by external client for each qualified object
Example
fW(w): LogNormal(3,1), Selling price R=50
Optimal Worker Salary = 21
[Figure: revenue curve and fW(w)]
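The objective being maximized is not spelled out in the extracted text. A plausible reading, used in the sketch below as an explicit assumption, is that a worker accepts a task only if the price p is at least their reservation wage, so expected profit per offered task is (R - p) * FW(p). The function names and the grid-search step are illustrative choices.

```python
import math

def lognormal_cdf(w, mu=3.0, sigma=1.0):
    """FW(w) for a LogNormal(mu, sigma) reservation-wage distribution."""
    if w <= 0:
        return 0.0
    z = (math.log(w) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_profit(p, R=50.0):
    # ASSUMPTION: profit per offered task = margin (R - p) times the
    # fraction of workers whose reservation wage is at most p.
    return (R - p) * lognormal_cdf(p)

def optimal_price(R=50.0, step=0.01):
    """Grid search for the price that maximizes expected profit."""
    candidates = [i * step for i in range(1, int(R / step))]
    return max(candidates, key=lambda p: expected_profit(p, R))

best_price = optimal_price()
```

Under this assumed objective the maximizer falls between 21 and 22, in line with the optimal salary of 21 reported on the slide.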
The Value of Unqualified Workers
• Binary classification (1:1)
• Worker confusion matrix
• “Accept” classification cost <=0.1
Number of Workers:    1      3      5      7      9      11
Classification Cost:  0.300  0.216  0.163  0.126  0.099  0.079
We need ~9 workers to achieve the required quality
Value: 1/9
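The worker's confusion matrix did not survive extraction. The first table entries are consistent with a symmetric worker who is correct 70% of the time, combined by majority vote; the sketch below takes that as an assumption (the 0.7 accuracy is inferred, not stated in the slides).

```python
from math import comb

def majority_vote_cost(n_workers, accuracy=0.7):
    """Probability that a majority of n (odd) independent workers is wrong."""
    error = 1.0 - accuracy
    needed = n_workers // 2 + 1  # wrong votes needed to flip the majority
    return sum(
        comb(n_workers, k) * error**k * accuracy**(n_workers - k)
        for k in range(needed, n_workers + 1)
    )

def workers_needed(target_cost=0.1, accuracy=0.7):
    """Smallest odd number of workers whose majority vote meets the target."""
    n = 1
    while majority_vote_cost(n, accuracy) > target_cost:
        n += 2
    return n

k = workers_needed()          # workers needed to reach cost <= 0.1
value_per_worker = 1.0 / k    # each counts as 1/k of a qualified worker
```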
A Pricing Model for Workers
– p: the price paid to all qualified workers
– fW(w): the pdf of the worker reservation wage
– FW(w): the cdf of the worker reservation wage
– Adjust for the presence of “unqualified” workers; each unqualified worker “counts” as 1/k of a qualified one
– R: the fixed price paid by external client for each qualified object
Optimal Prices
p*
p*/3
p*/9
Bayesian Estimates for Uncertainty
Worker A
P[0 → 0]=Beta(2,1) P[0 → 1]=Beta(1,2)
P[1 → 0]=Beta(1,2) P[1 → 1]=Beta(2,1)
Worker B
P[0 → 0]=Beta(101,1) P[0 → 1]=Beta(1,101)
P[1 → 0]=Beta(1,101) P[1 → 1]=Beta(101,1)
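The two workers above have the same kind of record (no observed mistakes) but very different amounts of evidence. A short sketch shows how the Beta parameters encode that; it assumes a uniform Beta(1,1) prior and that Worker A was seen answering correctly once per class and Worker B 100 times, which is one set of counts that yields the parameters shown.

```python
def beta_posterior(successes, failures, prior=(1.0, 1.0)):
    """Posterior Beta parameters after observing successes and failures."""
    a0, b0 = prior
    return a0 + successes, b0 + failures

def beta_mean(a, b):
    return a / (a + b)

def beta_std(a, b):
    return (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5

worker_a = beta_posterior(1, 0)     # Beta(2, 1)
worker_b = beta_posterior(100, 0)   # Beta(101, 1)

# Both estimates lean toward "accurate", but A's is far less certain.
a_mean, a_std = beta_mean(*worker_a), beta_std(*worker_a)
b_mean, b_std = beta_mean(*worker_b), beta_std(*worker_b)
```

The wide spread for Worker A is the uncertainty that a payment scheme has to account for before committing to a high piece rate.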
Real-Time Payment and Reimbursement
Example of the piece-rate payment of a worker
# Tasks:                     10   20   30   40   Infinity
Piece-rate Payment (cents):  11   18   21   23   40
[Figure: piece-rate payment vs. number of tasks, built up over checkpoints at 10, 20, 30, and 40 tasks. Labeled elements: Fair Payment, Potential “Bonus”, Payment (at each checkpoint), Reimbursement (at checkpoints 20, 30, and 40)]
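One reading of the payment and reimbursement figure, stated as an assumption since the mechanism is only shown graphically: at each checkpoint the worker is paid the current piece rate for the newly completed tasks, and is reimbursed the rate increase for all earlier tasks, so payments only ever go up. The function name `payment_schedule` is hypothetical.

```python
def payment_schedule(checkpoints, rates):
    """For each checkpoint return (tasks_done, payment, reimbursement) in cents.
    ASSUMPTION: new tasks are paid at the current rate; earlier tasks are
    topped up by the difference between the current and previous rate."""
    schedule = []
    prev_tasks, prev_rate = 0, 0
    for tasks, rate in zip(checkpoints, rates):
        payment = (tasks - prev_tasks) * rate
        reimbursement = prev_tasks * (rate - prev_rate)
        schedule.append((tasks, payment, reimbursement))
        prev_tasks, prev_rate = tasks, rate
    return schedule

schedule = payment_schedule([10, 20, 30, 40], [11, 18, 21, 23])
total_paid = sum(payment + reimbursement for _, payment, reimbursement in schedule)
```

Under this reading, the cumulative amount paid at any checkpoint equals the number of tasks completed times the current piece rate.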
Synthetic Experiment Setup
• N=10,000 tasks
• R=200 cents
• Cost <= 0.01
• Labeling Process: Workers
– Arrival frequency: every 600 seconds
– Workers per arrival: 10
– Submitting speed: 30 seconds per task
The evaluation criterion is Unit Time Profit:
(Synthetic) Experimental Setup
Scatter Plot of Confusion Matrix, Reservation Wage
Quality level and reservation wage independently distributed
Quality level and reservation wage positively correlated
Experimental Results
[Figure: bar chart of Average Profit per Second (axis 0 to 35) for two settings, No Correlation and Positive Correlation. Series: Quality-Based Price; Uniform Price: 4.4 (30%-quantile); Uniform Price: 5.7 (40%-quantile); Uniform Price: 7.4 (50%-quantile); Uniform Price: 9.5 (60%-quantile); Uniform Price: 12.5 (70%-quantile). Annotations: 24.6%, 159.6%]
Workers reacting to bad rewards/scores
Score-based feedback leads to strange interactions:
The “angry, has-been-burnt-too-many-times” worker:
• “F*** YOU! I am doing everything correctly and you know it! Stop trying to reject me with your stupid ‘scores’!”
The overachiever worker:
• “What am I doing wrong?? My score is 92% and I want to have 100%”
National Academy of Sciences
Dec 2010 “Frontiers of Science” conference
An unexpected connection…
“Your workers behave like my mice!”
“Eh?”
“Your workers want to use only their motor skills, not their cognitive skills”
The Biology Fundamentals
• Brain functions are biologically expensive (20% of total energy consumption in humans)
• Motor skills are more energy efficient than cognitive skills (e.g., walking)
• Brain tends to delegate easy tasks to part of the neural system that handles motor skills
An unexpected connection at the NAS “Frontiers of Science” conf.
“Your workers want to use only their motor skills, not their cognitive skills”
“Makes sense”
“And here is how I train my mice to behave…”
The Mice Experiment
Cognitive
– Solve maze
– Find pellet
Motor
– Push lever three times
– Pellet drops
How to Train the Mice?
Confuse motor skills!
Reward cognition!
“I should try this the moment that I get back to my room”
Punishing Worker’s Motor Skills
• Punish bad answers with frustration of motor skills (e.g., add delays between tasks)
– “Loading image, please wait…”
– “Image did not load, press here to reload”
– “404 error. Return the HIT and accept again”
→ Make this probabilistic to keep feedback implicit
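A minimal sketch of the probabilistic intervention described above. How an answer is judged to be bad and what probability to use are not given in the slides; the `answer_looks_bad` flag, the default `probability`, and the function name `maybe_intervene` are placeholders for illustration.

```python
import random

# Messages taken from the slide; shown to the worker instead of the next task.
INTERVENTIONS = [
    "Loading image, please wait…",
    "Image did not load, press here to reload",
    "404 error. Return the HIT and accept again",
]

def maybe_intervene(answer_looks_bad, probability=0.5, rng=random):
    """Return a frustrating message with the given probability after a bad
    answer, otherwise None. Randomizing keeps the feedback implicit."""
    if answer_looks_bad and rng.random() < probability:
        return rng.choice(INTERVENTIONS)
    return None

message = maybe_intervene(answer_looks_bad=True, probability=0.5)
```

Because the delay only appears some of the time, a worker cannot easily tie it to a specific answer, which is the point of keeping the feedback implicit.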
Rewarding (?) Cognitive Effort
• Reward good answers by rewarding the cognitive part of the brain
– Introduce variety
– Introduce novelty
– Give new tasks fast
– Show score improvements faster (but not the opposite)
– Show optimistic score estimates
Experiments
• Web page classification
• Image tagging
• Email & URL collection
Experimental Summary (I)
• Spammer workers quickly abandon
– No need to display scores, or ban
– Low quality submissions from ~60% to ~3%
– Half-life of low-quality from 100+ HITs to less than 5
• Good workers unaffected
– No significant effect on participation of workers with good performance
– Lifetime of participants unaffected
– Longer response times (after removing the “intervention delays”; that was puzzling)
Experimental Summary (II)
• Remember, scheme was for training the mice…
• 15%-20% of the spammers start submitting good work!
????
Two key questions
• Why was response time slower for some good workers?
• Why do some low-quality workers start working well?
????
• System 1: “Automatic” actions
• System 2: “Intelligent” actions
System 1 Tasks
System 2 Tasks
[Diagram: worker states and transitions]
Status: Usage of System 1 (“Automatic”)
Status: Usage of System 2 (“Intelligent”)
Not Performing Well? → Disrupt and Engage System 2
Performing Well? → Out
Performing Well? → Check if System 1 can Handle, remove System 2 stimuli
Not Performing Well? → Hell/Slow ban
Thanks!
Q & A?