Crowd Algorithms Talk

Crowd Algorithms
Scoop — The Stanford – Santa Cruz Project for Cooperative
Computing with Algorithms, Data, and People
Hector Garcia-Molina, Stephen Guo,
Aditya Parameswaran, Hyunjung Park,
Alkis Polyzotis, Petros Venetis, Jennifer Widom
Stanford and UC Santa Cruz
The Goal
Design Fundamental Algorithms for Human Computation
Latency, Uncertainty, Cost
• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?
The Problems
Crowd-Sort / Max: Difficult!
Crowd-GraphSearch: Difficult!
Crowd-Categorize: Difficult!
Crowd-Filter: Difficult!
[VLDB 2011]
Summaries of the rest
Progress!
The focus of this talk.
Latency, Uncertainty, Cost
Filters
Dataset of Items → Predicate 1 → Predicate 2 → …… → Predicate k → Filtered Dataset
Example predicates:
— Is this image that of Bytes Café?
— Is the image blurry?
— Does it show people’s faces?
 Given:
— Error Probability (FP/FN) & Selectivity for each predicate
— Desired Overall Error Probability
 To: Compose a filtering strategy
• Which questions do I ask?
• When do I ask the questions?
• When do I stop?
• How do I combine the answers?
— Minimize Overall Cost (# of questions)
Single Filter
 Surprisingly difficult!
 Need to meet an overall error threshold
— Say, up to 10% of my images may be wrongly filtered
 Minimize overall expected number of questions
 Boils down to the following:
— Take one item
— Ask some questions
• Results in a certain number of (Y, N) for a given item
— Do I stop (if so, what do I return), or do I continue asking?
Dataset of Items → Predicate 1 → Filtered Dataset
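The per-item loop described above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the names `filter_item`, `ask`, `decide`, and `max_questions` are hypothetical, `ask` stands in for posting one question to the crowd, and the majority fallback when the budget runs out is an assumption.

```python
from typing import Callable, Tuple


def filter_item(ask: Callable[[], bool],
                decide: Callable[[int, int], str],
                max_questions: int = 100) -> Tuple[bool, int]:
    """Ask questions about one item until the strategy says stop.

    ask()           -> True for a YES answer, False for a NO answer
    decide(yes, no) -> "pass", "fail", or "continue"
    Returns (passed, number_of_questions_asked).
    """
    yes = no = 0
    while yes + no < max_questions:
        decision = decide(yes, no)
        if decision != "continue":
            return decision == "pass", yes + no
        if ask():
            yes += 1
        else:
            no += 1
    # Assumed fallback: budget exhausted, return the majority (ties fail).
    return yes > no, yes + no
```

For example, with a strategy that passes on 3 YES answers and fails on 2 NO answers, the answer sequence YES, NO, YES, YES ends in "pass" after 4 questions.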
Hasn’t this been done before?
 Solutions from statistics guarantee the same error per item
— Important in contexts like:
• Automobile testing
• Diagnosis
 We’re worried about aggregate error over all items: a uniquely data-oriented problem
— I don’t care if every image is perfect as long as the overall error threshold is met.
— As we will see, results in $$$ savings
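One way to state the difference, in notation that is assumed here rather than taken from the talk: let τ be the error threshold, and for each stopping point p of a strategy let q(p) be the probability that an item stops at p and err(p) the probability that the answer returned at p is wrong.

```latex
\begin{align*}
\text{Per-item guarantee:}\quad  & \mathrm{err}(p) \le \tau \quad \text{for every stopping point } p \\
\text{Aggregate guarantee:}\quad & \sum_{p} q(p)\,\mathrm{err}(p) \le \tau,
  \qquad \text{where } \sum_{p} q(p) = 1
\end{align*}
```

Any strategy that meets the per-item guarantee also meets the aggregate one, but not the reverse. The aggregate version therefore allows stopping early at some points with err(p) > τ, as long as other points compensate.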
Strategies
[Figure: grid of points, with YES Answers on one axis and NO Answers on the other. Start at the point with no questions asked.]
Example points:
— YES = 5, NO = 6: Return “Passed”
— YES = 3, NO = 5: Continue
— YES = 3, NO = 7: Return “Failed”
Reformulated Task:
— For each point in grid: Return Pass/Fail/Cont.
— Equivalently, find the best shape and color it!
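Once every grid point is labeled Pass/Fail/Continue, the expected cost and overall error of a strategy can be computed by pushing probability mass forward through the grid. The sketch below assumes a simple model that the slides only imply: a fraction `s` of items truly satisfy the predicate (selectivity), workers answer independently, `e0` is the false-positive rate and `e1` the false-negative rate. The function name `evaluate` and the cap `m` on questions are hypothetical.

```python
from collections import defaultdict


def evaluate(decide, s, e0, e1, m):
    """Return (expected #questions, overall error) of a grid strategy.

    decide(yes, no) -> "pass", "fail", or "continue"
    s  : probability that an item truly satisfies the predicate
    e0 : P(worker says YES | item does not satisfy)  (false positive)
    e1 : P(worker says NO  | item satisfies)         (false negative)
    m  : maximum questions; at yes + no == m a decision is forced
    """
    # reach[(yes, no)] = [P(reach point, item false), P(reach point, item true)]
    reach = defaultdict(lambda: [0.0, 0.0])
    reach[(0, 0)] = [1.0 - s, s]
    cost = error = 0.0
    for total in range(m + 1):
        for yes in range(total + 1):
            no = total - yes
            r_false, r_true = reach[(yes, no)]
            if r_false + r_true == 0.0:
                continue  # point is never reached
            d = decide(yes, no)
            if d == "continue" and total == m:
                d = "pass" if yes > no else "fail"  # assumed forced majority
            if d == "continue":
                reach[(yes + 1, no)][0] += r_false * e0
                reach[(yes + 1, no)][1] += r_true * (1.0 - e1)
                reach[(yes, no + 1)][0] += r_false * (1.0 - e0)
                reach[(yes, no + 1)][1] += r_true * e1
            else:
                cost += total * (r_false + r_true)
                # Passing a false item or failing a true item is an error.
                error += r_false if d == "pass" else r_true
    return cost, error
```

Under this model, asking exactly three questions and taking the majority with `e0 = e1 = 0.1` gives cost 3 and error 3(0.1)²(0.9) + (0.1)³ = 0.028, which is a quick sanity check for the recursion.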
Common Strategies
 Always ask X questions, return most likely answer
— The triangle shape
 If you get X YES, return “Pass” or Y NO, return “Fail”, else keep asking.
— Rectangular shape
 Ask until |#YES - #NO| > X, or at most Y questions
— Chopped off rectangle
— Anhai’s work on MOBS
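The three shapes above can be written as decision functions over the (YES, NO) counts. This is a sketch: the function names are hypothetical, and "most likely answer" is approximated by the majority answer, which is an assumption that fits symmetric error rates and breaks ties toward "fail".

```python
def fixed_count(x):
    """Always ask x questions, then return the majority answer (triangle)."""
    def decide(yes, no):
        if yes + no < x:
            return "continue"
        return "pass" if yes > no else "fail"
    return decide


def thresholds(x, y):
    """Pass on x YES answers, fail on y NO answers, else keep asking (rectangle)."""
    def decide(yes, no):
        if yes >= x:
            return "pass"
        if no >= y:
            return "fail"
        return "continue"
    return decide


def lead(x, y):
    """Ask until |#YES - #NO| > x, or at most y questions (chopped-off rectangle)."""
    def decide(yes, no):
        if abs(yes - no) > x or yes + no >= y:
            return "pass" if yes > no else "fail"
        return "continue"
    return decide
```

Each function returns "pass", "fail", or "continue" for a grid point, so the same code that walks or evaluates a grid can be reused across all three shapes.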
Summary of Results
 A characterization of which “shapes” are optimal
 An optimal PTIME “probabilistic” approach
— LP leveraging the inherent DP structure
— Optimal: Strategy with minimum overall cost
• for given parameters and requirements
— Probabilistic: each point gets a probability of “Pass”, “Fail”, or “Continue”
Empirical Results
[Figure: Generate Parameters feeds Brute Force Deterministic, Other Algorithms, and Optimal Probabilistic; the resulting costs compare as COST1 >> COST2 >> COST3.]
 Evaluation on 10,000 synthetic scenarios
 Tested:
— Optimal, Brute Force, Statistical, 5 Heuristic Algorithms
 Optimal Probabilistic issues fewer questions overall (translates to $$$ for many items!!)
— 15% savings on average compared to brute force
• 32% savings when optimal wins
— 22% savings on average compared to the statistics approach
• 49% savings when optimal wins
Crowd-Max/Sort
 The problem(s):
— Find a strategy for sorting n items
• Given: Probability of error for a comparison
• Given: Desired threshold on error, #questions, #rounds
— Example strategies (decreasing parallelism, more accuracy):
• Ask all pairs a total of 2k/n times
• Tournament, with k repetitions at each level
• One question in each round
 Sorting automatically given evidence
— NP-Hard even for a simple probability of error model
— Related work in the area of voting theory, economics
 Which r questions do we ask next?
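The tournament strategy mentioned above can be sketched as a single-elimination bracket in which each match is decided by a majority over k crowd comparisons. The names `tournament_max` and `compare` are hypothetical; `compare(a, b)` stands in for one crowd question and returns whichever item the worker judged larger. Giving the last item a bye on odd-sized levels, and breaking vote ties toward the second item, are assumptions of this sketch.

```python
def tournament_max(items, k, compare):
    """Return (winner, questions_asked) of a k-repetition tournament.

    Each level pairs up the surviving items, so all matches in a level can be
    posted to the crowd in parallel (one round per level).
    """
    current = list(items)
    if not current:
        raise ValueError("need at least one item")
    questions = 0
    while len(current) > 1:
        winners = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            votes_a = sum(1 for _ in range(k) if compare(a, b) == a)
            questions += k
            # a advances only with a strict majority of the k votes.
            winners.append(a if 2 * votes_a > k else b)
        if len(current) % 2 == 1:
            winners.append(current[-1])  # odd one out gets a bye
        current = winners
    return current[0], questions
```

With 8 items and k = 3 this asks 3 · (4 + 2 + 1) = 21 questions over 3 rounds. Raising k buys accuracy per match at a linear cost in questions without adding rounds.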
Crowd-GraphSearch
Image Categorization Example
To attach: image of a honda car
[Figure: taxonomy with nodes vehicle, car, nissan, maxima, sentra, honda, toyota; target node = intended category]
Question form: Is the image one of X? (= Is the target node reachable from X?)
— Is image one of vehicle? YES!
— Is image one of toyota? NO!
— Is image one of honda? YES!
Find the target node by asking the minimum number of search questions.
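As a baseline for what "search questions" look like in code, here is a naive top-down descent. It is not the minimum-question algorithm the slide asks for. The names `find_target`, `children`, and `ask` are hypothetical, `ask(x)` stands in for the crowd question "Is the image one of x?", answers are assumed error-free, and the taxonomy edges in the usage note are a guess at the slide's figure.

```python
def find_target(children, root, ask):
    """Walk down from root, following the first child that answers YES.

    children : dict mapping a node to a list of its child nodes
    ask(x)   : True if the target node is reachable from x
    Returns (target_node, questions_asked).
    """
    node, questions = root, 0
    while True:
        for child in children.get(node, []):
            questions += 1
            if ask(child):
                node = child
                break
        else:
            # No child leads to the target, so the current node is the target.
            return node, questions
```

For instance, with `children = {"vehicle": ["car"], "car": ["nissan", "toyota", "honda"], "nissan": ["maxima", "sentra"]}` and a target of honda, the descent asks about car, nissan, toyota, and honda, for 4 questions. A smarter strategy would choose which node to ask about so that each answer rules out as many candidate targets as possible.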
Crowd-Categorize
Dataset of n items, to be placed into k buckets
Categorize every item, overall error < threshold
For k = 1, same as the filters problem
Two versions:
— Discrete
• Independent (like in the filters case)
• Dependent buckets (e.g., colors, GraphSearch)
— Continuous (e.g., age)
Questions?