Task and Workflow Design I
KSE 801
Uichin Lee

TurKit: Human Computation Algorithms on Mechanical Turk
Greg Little, Lydia B. Chilton, Rob Miller, and Max Goldman (MIT CSAIL)
UIST 2010

Workflow in M-Turk
[Figure: the requester posts HIT groups to Mechanical Turk; data is collected in a CSV file; data is exported for use]

Workflow: Pros & Cons
• Easy to run simple, parallelized tasks.
• Not so easy to run tasks in which turkers improve on or validate each other’s work.
• TurKit to the rescue!

The TurKit Toolkit
• Arrows indicate the flow of information.
• Programmer writes 2 sets of source code:
– HTML files for web servers
– JavaScript executed by TurKit
• Output is retrieved via a JavaScript database.
[Figure labels: Turkers, Mechanical Turk, Web Server, *.html, TurKit, JavaScript Database, *.js, Programmer]

Crash-and-rerun programming model
• Observation: local computation is cheap, but external calls cost money.
• Managing state over a long-running program is challenging.
– Examples: Computer restarts? Errors?
• Solution: store state in the database (just in case).
• If an error happens, just crash the program and re-run it by following the history in the DB.
– Throw a “crash” exception; the script is automatically re-run.
• New keyword “once”:
– Removes non-determinism
– No need to re-execute an expensive operation on a re-run
• But why should we re-run???

Example: quicksort

Parallelism
• The first time the script runs, HITs A and C will be created.
• For a given forked branch, if a task fails (e.g., HIT A), TurKit crashes that forked branch (and re-runs it).
• Synchronization with join()

MTurk Functions
• Prompt(message, # of people)
– mturk.prompt("What is your favorite color?", 100)
• Voting(message, options)
• Sort(message, items)

VOTE()

SORT()

TurKit: Implementation
• TurKit: written in Java, using Rhino to interpret JavaScript code and E4X to handle XML results from MTurk
• IDE: Google App Engine (GAE)

Online IDE

Exploring Iterative and Parallel Human Computation Processes
Greg Little, Lydia B. Chilton, Max Goldman, Robert C. Miller
HCOMP 2010

HC Task Model
• Dimensions:
– Dependent (iterative) or independent (parallel) tasks
– Creation and decision tasks
• Task model examples
– Creation tasks (creating new content): e.g., writing ideas, imagery solutions, etc.
– Decision tasks (voting/rating): e.g., rating the quality of a description of an image

HC Task Model
• Combining tasks: iterative and parallel tasks
– Iterative pattern: a sequence of creation tasks where the result of each task feeds into the next one, followed by a comparison task
– Parallel pattern: a set of creation tasks executed in parallel, followed by a task of choosing the best

Experiment: Writing Image Description
• Iterative vs. parallel; each with 6 creation tasks ($0.02), followed by rating tasks (1-10 scale, $0.01)

Experiment: Writing Image Description
• Turkers in the iterative condition gave better descriptions; the parallel condition always shows an empty text area.

Experiment: Writing Image Description
• Average rating after n iterations
– After six iterations: 7.9 (iterative) vs. 7.4 (parallel); t-test: t(29) = 2.1, p = 0.04
[Figure: average rating by iteration; series: iterative, parallel]

Experiment: Writing Image Description
• The two outliers (circled) represent instances of text copied from the Internet that is only superficially related to the image.
• Length vs. rating: positive correlation
[Figure: Rating vs. Length (characters)]

Experiment: Writing Image Description
• Work Quality:
– 31% mainly append content at the end, and make only minor modifications (if any) to existing content;
– 27% modify/expand existing content, but it is evident that they use the provided description as a basis;
– 17% seem to ignore the provided description entirely and start over;
– 13% mostly trim or remove content;
– 11% make very small changes (adding a word, fixing a misspelling, etc.);
– 1% copy-paste superficially related content found on the internet.
• Creating vs. improving: both take about the same time (avg. 211 seconds)

Experiment: Brainstorming
• Iterative work: higher average rating
– Biased thinking: e.g., tech -> xxtech -> yytech
• Parallel work: diversity, higher deviation (rating)
– No iteration for brainstorming
[Figure labels: Avg. Rating, Rating, Iteration; series: iterative, parallel]

Example: Blurry Text Recognition
• Iterative performs better than parallel
[Figure labels: Accuracy, Iteration]

Summary
• TurKit: a flexible programming tool for M-Turk
• Various workflows can be designed, e.g., iterative, parallel, and hybrid
• Iterative performs better than parallel in several cases (e.g., image description, brainstorming, text recognition)
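
Sketch: crash-and-rerun with “once”
A minimal JavaScript sketch of the crash-and-rerun idea from the TurKit slides, under stated assumptions. The names CrashSignal, makeRuntime, and runUntilDone, and the in-memory db object, are hypothetical and are not TurKit's actual API; the HIT and the worker's answer are simulated locally. It is meant to show how results recorded by once are replayed on each re-run, so a costly step executes only one time even though the script restarts from the top.

```javascript
// Hypothetical illustration only: none of these names come from TurKit itself.

// Signal used to abort the current run when a result is not ready yet.
class CrashSignal extends Error {}

// Build a per-run runtime that replays recorded results from db.
function makeRuntime(db) {
  let step = 0; // position in the recorded history for this run
  return {
    // Execute fn only if this step has no recorded result; otherwise replay it.
    once(fn) {
      const key = step++;
      if (!(key in db)) db[key] = fn(); // a crash inside fn records nothing
      return db[key];
    },
    // Abort the current run; the driver re-runs the script from the top.
    crash() {
      throw new CrashSignal("result not ready");
    },
  };
}

// Re-run the script until it completes without crashing (bounded by maxRuns).
function runUntilDone(script, db, maxRuns = 10) {
  for (let run = 1; run <= maxRuns; run++) {
    try {
      return { result: script(makeRuntime(db)), runs: run };
    } catch (e) {
      if (!(e instanceof CrashSignal)) throw e; // real errors still propagate
    }
  }
  throw new Error("script did not finish within maxRuns");
}

// Simulated external world (assumption): posting costs money,
// and the worker's answer becomes available on the third poll.
let paidCalls = 0;
let polls = 0;

function script(rt) {
  // Costly external call: recorded the first time, replayed afterwards.
  const hitId = rt.once(() => {
    paidCalls++;
    return "HIT-A";
  });
  // Waiting for a worker: crash until the answer exists, then record it.
  const answer = rt.once(() => {
    polls++;
    if (polls < 3) rt.crash();
    return "blue";
  });
  return `${hitId}: ${answer}`;
}

const db = {}; // stands in for the persistent database
const outcome = runUntilDone(script, db);
console.log(outcome.result, "after", outcome.runs, "runs;", paidCalls, "paid call");
```

In this sketch the script is executed three times, but the simulated paid call happens once: each re-run skips straight past the steps already recorded in db.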