Amazon Mechanical Turk Requester Round Table

WELCOME!
Amazon Mechanical Turk
New York City Meet Up
September 1, 2009
© 2009 Amazon.com, Inc. or its Affiliates.
AGENDA
• Welcoming Statements
• Introductions
• Dolores Labs – Video Directory Use Case
• Knewton – Adaptive Learning Use Case
• FreedomOSS – Enterprise Integration
• New York University – Worker Quality Solution
• Panel Questions and Answers
Amazon Mechanical Turk
Requester Meetup
Howie Liu
Dolores Labs
Dolores Labs Introduction
• Founded in 2008 by Lukas Biewald – Senior Scientist at Powerset (MSFT); previously Yahoo! Search and the Stanford AI Lab
  – Recognized the enormous potential of the AMT platform
• Dolores Labs develops quality control technology (CrowdControl™) to make AMT more accessible and reliable
Case Study
A large video directory needed to select relevant
thumbnails for 200k+ videos
Why Mechanical Turk?
• The size of the project and the turnaround speed required made MTurk the obvious solution
  – Given the needs of the client, traditional outsourcing or hiring employees was not an option
  – However, the client was concerned about the quality of results
• Inherent variability of Mechanical Turk workers
  – Unlike other Amazon marketplaces, workers are not a perfect commodity
  – Significant variations in quality (accuracy)
  – Need to ensure workers diligently completed the work
  – Need to intelligently aggregate multiple responses to find the single best thumbnail for each video
3-Step Process for Optimizing the Task
1. Baseline performance – create a custom interactive UI: 74% result accuracy
2. CrowdControl™ – apply statistical quality control: 90% result accuracy
3. CrowdControl™ + second pass – Turkers verify the first-pass results: 98% result accuracy
High Quality on Mechanical Turk:
Best Practices
[Chart: CrowdControl™ vs Baseline Result Accuracy – result accuracy (70–100%) for Baseline Performance, CrowdControl™, and CrowdControl™ + Custom Solutions]
• Statistical inference algorithms to dynamically assess quality
  – …of each worker, of each result
  – …while the task is live
  – Smart allocation of worker resources: blindly increasing redundancy is expensive
• Aggregating all responses from workers with varying quality into a single "best" answer (a minimal weighted-voting sketch follows below)
• White paper with the Stanford AI Lab about quality on AMT: http://bit.ly/DLpaper
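
Dolores Labs' CrowdControl™ algorithms are proprietary, but the aggregation idea can be illustrated with a minimal sketch: instead of counting every vote equally, weight each worker's vote by an estimated accuracy. The worker IDs, accuracy values, and the weighted_vote helper below are illustrative assumptions, not CrowdControl™ itself.

from collections import defaultdict

def weighted_vote(labels, worker_accuracy):
    """Pick a single 'best' answer for one item by weighting each worker's
    vote with an estimate of that worker's accuracy (hypothetical values)."""
    scores = defaultdict(float)
    for worker, answer in labels.items():
        # A more accurate worker contributes more weight to their chosen answer.
        scores[answer] += worker_accuracy.get(worker, 0.5)
    return max(scores, key=scores.get)

# Hypothetical example: three workers pick a thumbnail for one video.
labels = {"w1": "thumb_A", "w2": "thumb_B", "w3": "thumb_A"}
accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.7}
print(weighted_vote(labels, accuracy))   # -> thumb_A

The point of weighting is cost: blindly adding redundancy multiplies spend, while letting better workers carry more of the decision reaches the same accuracy with fewer assignments.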
Other Insights
• Clear task instructions are crucial for good results
  – Garbage in, garbage out
• An intuitive and efficient task interface makes the task faster (read: cheaper) and more fun!
• Mechanical Turk is an unprecedented, hyper-efficient labor marketplace
  – You need to understand its dynamics through experience in order to harness its power
Amazon Mechanical Turk
Requester Meetup
Dahn Tamir, Knewton Inc.
Knewton - Introduction
• Live online GMAT and LSAT prep courses customized for each student, powered by the world's most advanced adaptive learning engine.
• Selected to the 2009 AlwaysOn Global 250 List. Named Category Winner in the Digital Education field.
How we use MTurk
• Calibration for computer-adaptive testing
• Quality assurance
• Focus groups and surveys
• Database building
• Marketing
Why MTurk?
• Speed
• Cost
• Appropriate worker population for each task
• Quality
What We Learned
• Turkers are a diverse and capable population
• Use qualification tests
• Invest in building good HITs
• Hesitate to reject work (but not cheaters)
• Meet Turker Nation
Thank you!
Questions?
dahn@knewton.com
978-KNEWTON
Amazon Mechanical Turk
Requester Meet-up
(Max Yankelevich, Chief Architect – Freedom OSS)
Freedom OSS - Introduction
• Freedom OSS is a professional services organization with a focus on practical implementations using Cloud Computing & Open Source technologies
• International firm
  – US offices: PA, NYC, GA, KC, NV, WA, NC
  – 4 large solution centers in Eastern Europe (Russia, Belarus, Ukraine, and Lithuania)
• Practical approach to Cloud Computing – most successfully completed Enterprise Cloud Computing projects in the industry
• Key Cloud Computing partnerships
  – Top Amazon AWS Enterprise System Integrator
  – Top Eucalyptus Enterprise Partner
• Key Open Source partnerships
  – Top Red Hat Advanced Business Partner
  – #1 JBoss Advanced Business Partner in the US
• 2008 "JBoss SOA Innovation" Award winner
• 2007-08 "Practical SOA" Award winner
• 2008 "Red Hat Extensive Ecosystem" Award winner
• Leading technology partner for many Fortune 2000 companies
• Freedom is a privately held corporation
MTurk and Enterprise Integration
• Most legacy systems are not architected to include human intervention
• Provide a technological interface that maintains the workflow while inserting human intelligence, building self-adjudicating business flows
• Leverage Mechanical Turk programmatically in your everyday systems
• Freedom OSS has leveraged the power of an Enterprise Service Bus (ESB) and practical Service Oriented Architecture (SOA) to make on-boarding and managing MTurk workers a rapid and cost-effective process
• Using its professional open source ESB, freeESB, Freedom has developed many powerful connectors for some of the most used enterprise systems and technologies, such as SAP, Mainframe, Siebel, Java/J2EE, Oracle, IBM MQ, etc.
Master Data Cleansing & Validation Use Case
• Keeping the Master Customer Data File (Master Data Management)
  – Record de-duping
  – Contact information validation
• Traditional MDM tactics
  – Expensive software
  – "Big bang" approach
  – Invasive code changes to legacy applications
• Clean and consistent customer data
[Architecture diagram: Business applications get real-time access and real-time events through an API into the AWS Cloud, where freeESB (routing, transformation, connectivity, QoS) sits alongside the master data store. Business process orchestration & workflow and a business rules engine route records through three Turk tasks – a first task for simple data checking, a second task for deeper data checking, and a third trusted task for data edits – and connect to legacy applications (Mainframe, Client-Server, Oracle, .NET, SAP, Siebel, etc.). A minimal sketch of this three-pass flow follows below.]
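
The freeESB connectors and the orchestration shown in the diagram are Freedom OSS products; purely as an illustration of the flow, here is a minimal Python sketch of the three-pass adjudication described above. The post_hit helper is a hypothetical stand-in for publishing a HIT through the MTurk API (normally routed via an ESB connector) and is simulated here.

import random

def post_hit(task_name, record, redundancy):
    """Publish one record as a HIT and return `redundancy` worker answers.
    Hypothetical placeholder: simulated instead of calling the real MTurk API."""
    if task_name == "data-edit-trusted":
        return [dict(record, validated=True)]   # the trusted worker's edited record
    return [random.choice(["valid", "invalid"]) for _ in range(redundancy)]

def majority(answers):
    return max(set(answers), key=answers.count)

def cleanse_record(record):
    # Pass 1: cheap, redundant check of the contact data.
    if majority(post_hit("simple-data-checking", record, redundancy=3)) == "valid":
        return record
    # Pass 2: deeper check by fewer, better-qualified workers.
    if majority(post_hit("deeper-data-checking", record, redundancy=2)) == "valid":
        return record
    # Pass 3: a single trusted worker edits the record.
    return post_hit("data-edit-trusted", record, redundancy=1)[0]

print(cleanse_record({"name": "ACME Corp", "phone": "212-555-0100"}))

Each pass escalates only the records the previous pass could not confirm, which is what keeps operational cost low while preserving accuracy.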
Outcome
• Low operational costs
• Non-invasive data integration
• High degree of accuracy due to multi-task distribution
• Some best practices when integrating MTurk within an enterprise
  – Deliver value incrementally
  – Inversion of control
Thank you!
Questions?
Amazon Mechanical Turk
Requester Meetup
(Panos Ipeirotis – New York University)
Panos Ipeirotis - Introduction
• New York University, Stern School of Business
• "A Computer Scientist in a Business School"
  http://behind-the-enemy-lines.blogspot.com/
• Email: panos@nyu.edu
Example: Build an Adult Web Site Classifier
• Need a large number of hand-labeled sites
• Get people to look at sites and classify them as:
  G (general), PG (parental guidance), R (restricted), X (porn)

Cost/Speed Statistics
• Undergrad intern: 200 websites/hr, cost: $15/hr
• MTurk: 2500 websites/hr, cost: $12/hr
Bad news: Spammers!
Worker ATAMRO447HWJQ
labeled X (porn) sites as G (general audience)
Improve Data Quality through Repeated Labeling
• Get multiple, redundant labels from multiple workers
• Pick the correct label based on majority vote
  – 1 worker: 70% correct
  – 11 workers: 93% correct
• Probability of correctness increases with the number of workers (a quick sanity-check sketch follows below)
• Probability of correctness increases with the quality of workers
© 2009 Amazon.com, Inc. or its Affiliates.
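
A quick sanity check on these numbers: if each worker is independently correct with probability 0.70, the chance that a strict majority of 11 workers is correct comes out near the 93% quoted above. A minimal sketch (binary labels and independent, identically accurate workers are simplifying assumptions):

from math import comb

def p_majority_correct(n_workers, p_correct):
    """Probability that a strict majority of independent workers is correct,
    assuming a binary label and identical per-worker accuracy."""
    return sum(comb(n_workers, k) * p_correct**k * (1 - p_correct)**(n_workers - k)
               for k in range(n_workers // 2 + 1, n_workers + 1))

print(p_majority_correct(1, 0.70))    # 0.70  -> the single-worker baseline
print(p_majority_correct(11, 0.70))   # ~0.92 -> close to the 93% on the slide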
But Majority Voting is Expensive
Single-Vote Statistics
• MTurk: 2500 websites/hr, cost: $12/hr
• Undergrad: 200 websites/hr, cost: $15/hr
11-Vote Statistics
• MTurk: 227 websites/hr (throughput divided by the redundancy factor of 11), cost: $12/hr
• Undergrad: 200 websites/hr, cost: $15/hr
Using redundant votes, we can infer worker quality
• Look at our spammer friend ATAMRO447HWJQ together with the other 9 workers
• We can compute error rates for each worker (a rough estimation sketch follows below)

Error rates for ATAMRO447HWJQ:
  P[X → X] = 9.847%     P[X → G] = 90.153%
  P[G → X] = 0.053%     P[G → G] = 99.947%

Our "friend" mainly marked sites as G. Obviously a spammer…
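
The slide does not spell out the estimator; a rough first approximation is to score each worker against the majority vote of the other workers on each item (the get-another-label tool mentioned at the end of this talk outputs these error rates directly). The error_rates helper and the data layout below are illustrative assumptions.

from collections import Counter, defaultdict

def error_rates(labels_by_item):
    """Rough per-worker confusion rates, using the majority vote of the *other*
    workers on each item as a stand-in for the true label.

    labels_by_item: {item_id: {worker_id: label}}
    Returns {worker_id: {(true_label, assigned_label): rate}}"""
    counts = defaultdict(Counter)
    for item, votes in labels_by_item.items():
        for worker, assigned in votes.items():
            others = [label for w, label in votes.items() if w != worker]
            if not others:
                continue
            true = Counter(others).most_common(1)[0][0]
            counts[worker][(true, assigned)] += 1
    rates = {}
    for worker, pair_counts in counts.items():
        totals = Counter()
        for (true, _), n in pair_counts.items():
            totals[true] += n
        rates[worker] = {(t, a): n / totals[t] for (t, a), n in pair_counts.items()}
    return rates

# Tiny hypothetical example: the third worker labels an X site as G.
votes = {"site1": {"w1": "G", "w2": "G", "w3": "G"},
         "site2": {"w1": "X", "w2": "X", "w3": "G"}}
print(error_rates(votes)["w3"])   # shows an X -> G error rate of 1.0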
Rejecting Spammers and Benefits
• Random answers have an error rate of 50%
• Average error rate for ATAMRO447HWJQ (error rates as above): 45.2%
• Action: REJECT and BLOCK (a minimal decision-rule sketch follows below)
• Results:
  – Over time you block all spammers
  – Spammers learn to avoid your HITs
  – You can decrease redundancy, as worker quality is higher
© 2009 Amazon.com, Inc. or its Affiliates.
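
A minimal sketch of the reject-and-block rule the slide implies: compare a worker's average per-class error rate against the 50% a random guesser would score. The is_spammer helper and the 0.45 cutoff are illustrative assumptions, not values from the talk.

def is_spammer(confusion, threshold=0.45):
    """Flag a worker whose average error rate is close to random guessing (50%).
    confusion: {(true_label, assigned_label): rate}"""
    true_classes = {true for (true, _) in confusion}
    per_class_error = [sum(rate for (t, a), rate in confusion.items() if t == tc and a != tc)
                       for tc in true_classes]
    return sum(per_class_error) / len(per_class_error) >= threshold

atamro = {("X", "X"): 0.09847, ("X", "G"): 0.90153,
          ("G", "X"): 0.00053, ("G", "G"): 0.99947}
print(is_spammer(atamro))   # True -> reject and block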
After rejecting spammers, quality goes up
• Spam keeps quality down
• Without spam, workers are of higher quality
• Need less redundancy for same quality
• Same quality of results for lower cost

With spam:     1 worker → 70% correct,  11 workers → 93% correct
Without spam:  1 worker → 80% correct,   5 workers → 94% correct
Correcting biases
• Classifying sites as G, PG, R, X
• Sometimes workers are careful but biased

Error rates for worker ATLJIK76YH1TF (rows: true class, columns: assigned label):
        → G      → P      → R      → X
  G     20.0%    80.0%     0.0%     0.0%
  P      0.0%     0.0%   100.0%     0.0%
  R      0.0%     0.0%   100.0%     0.0%
  X      0.0%     0.0%     0.0%   100.0%

The worker classifies G → P and P → R.
Average error rate for ATLJIK76YH1TF: 45.0%
Is ATLJIK76YH1TF a spammer?
Correcting biases
Error rates for worker ATLJIK76YH1TF: same matrix as above.
• For ATLJIK76YH1TF, we simply need to compute the "non-recoverable" error rate (technical details omitted; a hedged sketch of the bias-correction intuition follows below)
• Non-recoverable error rate for ATLJIK76YH1TF: 9%
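
The exact "non-recoverable" error-rate computation is omitted above, but the intuition it rests on is that a consistent bias can largely be inverted with Bayes' rule using the worker's confusion matrix and the class priors, so only the genuinely ambiguous part of the worker's output is lost. A minimal sketch of that correction idea (uniform priors and the posterior helper are illustrative assumptions, not the omitted computation itself):

def posterior(assigned, confusion, priors):
    """P(true class | this worker assigned `assigned`), via Bayes' rule.
    confusion: {true: {assigned: P[true -> assigned]}}; priors: {true: P[true]}"""
    joint = {t: priors[t] * confusion[t].get(assigned, 0.0) for t in priors}
    z = sum(joint.values()) or 1.0
    return {t: p / z for t, p in joint.items()}

# ATLJIK76YH1TF's rates from the slide above, with uniform priors for illustration.
conf = {"G": {"G": 0.2, "P": 0.8}, "P": {"R": 1.0},
        "R": {"R": 1.0}, "X": {"X": 1.0}}
priors = {"G": 0.25, "P": 0.25, "R": 0.25, "X": 0.25}
print(posterior("P", conf, priors))   # -> G with probability 1.0: this bias is fully correctable
print(posterior("R", conf, priors))   # -> split 50/50 between P and R: partly non-recoverable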
Too much theory?
Open source implementation available at:
http://code.google.com/p/get-another-label/
• Input:
  – Labels from Mechanical Turk
  – Cost of incorrect labelings (e.g., X → G costlier than G → X)
• Output:
  – Corrected labels
  – Worker error rates
  – Ranking of workers according to their quality
• Alpha version, more improvements to come!
• Suggestions and collaborations welcomed!
Thank you!
Questions?
“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: panos@nyu.edu