Get Another Label?
Improving Data Quality and Data Mining
Using Multiple, Noisy Labelers
Victor Sheng
Foster Provost
Panos Ipeirotis
Stern School of Business, New York University
Outsourcing KDD preprocessing

Traditionally, data mining teams have invested substantial internal resources in data formulation, information extraction, cleaning, and other preprocessing
– Raghu, from his Innovation Lecture: “the best you can expect are noisy labels”

Now, we can outsource preprocessing tasks, such as labeling, feature extraction, verifying information extraction, etc.
– using Mechanical Turk, Rent-a-Coder, etc.
– quality may be lower than expert labeling (much?)
– but low costs can allow massive scale

The ideas may apply also to focusing user-generated tagging, crowdsourcing, etc.
ESP Game (by Luis von Ahn)
Other “free” labeling schemes

Open Mind initiative (www.openmind.org)

Other GWAP games
– Tag a Tune
– Verbosity (tag words)
– Matchin (image ranking)

Web 2.0 systems?
– Can/should tagging be directed?
Noisy labels can be problematic

Many tasks rely on high-quality labels for objects:
– learning predictive models
– searching for relevant information
– finding duplicate database records
– image recognition/labeling
– song categorization

Noisy labels can lead to degraded task performance.

Here, labels are values for the target variable.
Quality and Classification Performance

Labeling quality increases ⇒ classification quality increases

[Figure: classifier accuracy (50–100%) vs. number of training examples (Mushroom data set), one curve per labeling quality P = 1.0, 0.8, 0.6, 0.5]
Summary of results

Repeated labeling can improve data quality and model quality (but not always)

When labels are noisy, repeated labeling can be preferable to single labeling even when labels aren’t particularly cheap

When labels are relatively cheap, repeated labeling can do much better (omitted)

Round-robin repeated labeling does well

Selective repeated labeling improves substantially
Majority Voting and Label Quality

Ask multiple labelers, keep majority label as “true” label

Quality is probability of being correct

[Figure: integrated quality vs. number of labelers (1–13), one curve per labeler accuracy P = 0.4 … 1.0; P is the probability of an individual labeler being correct]
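The curves above follow from a short binomial-tail calculation: with independent labelers of individual accuracy p and an odd number of labelers n, the majority label is correct whenever more than half of the labelers are correct. A minimal sketch (assuming independent labelers, odd n to avoid ties):

```python
from math import comb

def majority_quality(p: float, n: int) -> float:
    """Probability that the majority vote of n independent labelers,
    each correct with probability p, yields the correct label (n odd)."""
    assert n % 2 == 1, "use an odd number of labelers to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Integrated quality for the settings shown in the plot
for p in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(p, [round(majority_quality(p, n), 3) for n in (1, 3, 5, 7, 9, 11, 13)])
```

Note that for p > 0.5 the integrated quality climbs toward 1 as labelers are added, while for p < 0.5 majority voting actually makes things worse, which is exactly the spread of curves in the figure.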
Tradeoffs for Modeling

Get more labels ⇒ improve label quality ⇒ improve classification

Get more examples ⇒ improve classification

[Figure: classifier accuracy vs. number of training examples (Mushroom data set), one curve per labeling quality P = 1.0, 0.8, 0.6, 0.5]
Basic Labeling Strategies

Single Labeling
– Get as many data points as possible, one label each

Round-robin Repeated Labeling
– Fixed Round Robin (FRR): keep labeling the same set of points
– Generalized Round Robin (GRR): repeatedly label data points, giving the next label to the point with the fewest labels so far (a minimal sketch follows)
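A minimal sketch of GRR under these definitions: keep a label count per example and always send the next labeling request to an example with the fewest labels, using a heap to make the selection cheap. `get_label` is a hypothetical stand-in for one call to the (noisy) labeling source.

```python
import heapq

def generalized_round_robin(examples, budget, get_label):
    """Spend `budget` label acquisitions, always labeling the example
    with the fewest labels so far (Generalized Round Robin)."""
    labels = {i: [] for i in range(len(examples))}
    heap = [(0, i) for i in range(len(examples))]  # (label count, example index)
    heapq.heapify(heap)
    for _ in range(budget):
        count, i = heapq.heappop(heap)
        labels[i].append(get_label(examples[i]))   # one noisy label acquisition
        heapq.heappush(heap, (count + 1, i))
    return labels
```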
Fixed Round Robin vs. Single Labeling

[Figure: learning curves for FRR (100 examples) vs. SL; labeling quality p = 0.6, #examples = 100]

With high noise, repeated labeling is better than single labeling.

Fixed Round Robin vs. Single Labeling

[Figure: learning curves for FRR (50 examples) vs. single labeling; labeling quality p = 0.8, #examples = 50]

With low noise, more (single-labeled) examples is better.
Gen. Round Robin vs. Single Labeling

[Figure: accuracy vs. data acquisition cost (Mushroom, p = 0.6); GRR with k = 5 labels per example vs. SL; P: labeling quality, k: #labels]

Repeated labeling is better than single labeling.
Tradeoffs for Modeling (repeated as a transition)

Get more labels ⇒ improve label quality ⇒ improve classification

Get more examples ⇒ improve classification

[Figure repeated: classifier accuracy vs. number of training examples (Mushroom data set), one curve per labeling quality P = 1.0, 0.8, 0.6, 0.5]
Selective Repeated-Labeling

We have seen:
– With enough examples and noisy labels, getting multiple labels is better than single-labeling
– When we consider costly preprocessing, the benefit is magnified (omitted -- see paper)

Can we do better than the basic strategies?

Key observation: we have additional information to guide selection of data for repeated labeling
– the current multiset of labels
– Example: {+,-,+,+,-,+} vs. {+,+,+,+}
Natural Candidate: Entropy

Entropy is a natural measure of label uncertainty:

E(S) = -\frac{|S^+|}{|S|} \log_2 \frac{|S^+|}{|S|} - \frac{|S^-|}{|S|} \log_2 \frac{|S^-|}{|S|}

where |S^+| is the number of positive labels and |S^-| the number of negative labels in the multiset S.

– E({+,+,+,+,+,+}) = 0
– E({+,-,+,-,+,-}) = 1

Strategy: get more labels for examples with high-entropy label multisets.
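A small sketch of this entropy score for a binary label multiset, reproducing the examples above:

```python
from math import log2

def label_entropy(pos: int, neg: int) -> float:
    """Entropy of a label multiset with `pos` positive and `neg` negative labels."""
    total = pos + neg
    entropy = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            entropy -= p * log2(p)
    return entropy

print(label_entropy(6, 0))                 # E({+,+,+,+,+,+}) = 0.0
print(label_entropy(3, 3))                 # E({+,-,+,-,+,-}) = 1.0
print(round(label_entropy(3, 2), 2))       # ~0.97, as in the examples below
print(round(label_entropy(600, 400), 2))   # same value as (3+, 2-): scale invariant
```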
What Not to Do: Use Entropy

[Figure: labeling quality vs. number of labels (waveform, p = 0.6), ENTROPY vs. GRR]

Entropy-based selection improves at first, but hurts in the long run.
Why not Entropy

In the presence of noise, entropy will be high even with many labels.

Entropy is scale invariant: (3+, 2-) has the same entropy as (600+, 400-).
Estimating Label Uncertainty (LU)

Observe +’s and –’s and compute Pr{+|obs} and Pr{-|obs}

Label uncertainty = tail of beta distribution

[Figure: Beta probability density function over [0.0, 1.0], with the tail area S_LU shaded below 0.5]
Label Uncertainty: examples (p = 0.7)

– 5 labels (3+, 2-): Entropy ~ 0.97, CDF_b = 0.34
– 10 labels (7+, 3-): Entropy ~ 0.88, CDF_b = 0.11
– 20 labels (14+, 6-): Entropy ~ 0.88, CDF_b = 0.04
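These CDF_b values can be reproduced from the Beta tail. Assuming a uniform Beta(1,1) prior (an assumption, but one that reproduces the numbers on these slides), the posterior over the probability of the positive class after observing pos positives and neg negatives is Beta(pos+1, neg+1), and the uncertainty score is the posterior mass on the side of 0.5 that opposes the observed majority. A sketch using SciPy:

```python
from scipy.stats import beta

def label_uncertainty(pos: int, neg: int) -> float:
    """Posterior probability (uniform prior) that the true majority class
    differs from the currently observed majority label."""
    posterior = beta(pos + 1, neg + 1)
    tail = posterior.cdf(0.5)      # mass below 0.5, i.e. P(true Pr(+) < 0.5 | obs)
    return min(tail, 1 - tail)     # mass on the side opposing the observed majority

print(round(label_uncertainty(3, 2), 2))    # 0.34
print(round(label_uncertainty(7, 3), 2))    # 0.11
print(round(label_uncertainty(14, 6), 2))   # 0.04
```

Unlike entropy, this score keeps shrinking as consistent labels accumulate, which is exactly the contrast the examples above illustrate.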
Label Uncertainty vs. Round Robin

[Figure: labeling quality vs. number of labels (waveform, p = 0.6), LU vs. GRR]

Similar results across a dozen data sets.
Recall: Gen. Round Robin vs. Single Labeling

[Figure repeated: accuracy vs. data acquisition cost (Mushroom, p = 0.6); GRR with k = 5 labels per example vs. SL; P: labeling quality, k: #labels]

Multi-labeling is better than single labeling.
Label Uncertainty vs. Round Robin (repeated as a transition)

[Figure repeated: labeling quality vs. number of labels (waveform, p = 0.6), LU vs. GRR; similar results across a dozen data sets]
Another strategy: Model Uncertainty (MU)

Learning a model of the data provides an alternative source of information about label certainty

Model uncertainty: get more labels for instances that cannot be modeled well

Intuition?
– for data quality: low-certainty “regions” may be due to incorrect labeling of the corresponding instances
– for modeling: why improve training-data quality where the model already is certain?

[Figure: scatter of + and – instances around a decision boundary, with “?” marking regions the model cannot classify with certainty]
Yet another strategy: Label & Model Uncertainty (LMU)

Label and model uncertainty (LMU): avoid examples where either strategy is certain

S_{LMU} = \sqrt{S_{LU} \times S_{MU}}
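A hedged sketch of how the two scores could be combined, reusing `label_uncertainty` from the earlier sketch. The model-uncertainty score here is simply 1 minus the model's confidence in its predicted class for the instance; this single-model version is only an illustration, not the deck's exact MU computation, which is derived from models learned on the current training data. The combination is the geometric mean from the formula above, so the score is low whenever either source is certain.

```python
from math import sqrt

def model_uncertainty(prob_positive: float) -> float:
    """Uncertainty of a model's prediction for one instance:
    0 when the model is certain, 0.5 when it predicts 50/50.
    (Illustrative stand-in for the deck's model-based MU score.)"""
    return 1 - max(prob_positive, 1 - prob_positive)

def lmu_score(pos: int, neg: int, prob_positive: float) -> float:
    """Geometric mean of label and model uncertainty: low if either is certain."""
    return sqrt(label_uncertainty(pos, neg) * model_uncertainty(prob_positive))

# Example: a noisy label multiset (3+, 2-), but the model is already quite certain,
# so the combined score is pulled down relative to label uncertainty alone.
print(round(lmu_score(3, 2, prob_positive=0.95), 3))
```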
Comparison

[Figure: labeling quality vs. number of labels (waveform, p = 0.6); curves for GRR, MU, LU, and LMU. Callouts: “Model Uncertainty alone also improves quality”; curve labels “Label Uncertainty”, “Label & Model Uncertainty”]

Across 12 domains, LMU is always better than GRR. LMU is statistically significantly better than LU and MU.
Comparison: Model Quality

[Figure: accuracy vs. number of labels (spambase, p = 0.6); curves for GRR, MU, LU, and LMU, with “Label & Model Uncertainty” the top curve]
Summary of results

Micro-task outsourcing (e.g., MTurk, Rent-a-Coder, the ESP game) has changed the landscape for data formulation

Repeated labeling can improve data quality and model quality (but not always)

When labels are noisy, repeated labeling can be preferable to single labeling even when labels aren’t particularly cheap

When labels are relatively cheap, repeated labeling can do much better (omitted)

Round-robin repeated labeling can do well

Selective repeated labeling improves substantially

Opens up many new directions…
– Strategies using “learning-curve gradient”
– Estimating the quality of each labeler
– Example-conditional quality
– Increased compensation vs. labeler quality
– Multiple “real” labels
– Truly “soft” labels
– Selective repeated tagging
Thanks!
Q & A?