Learning Ensembles of First-Order Clauses That Optimize
Precision-Recall Curves
Mark Goadrich
Computer Sciences Department
University of Wisconsin - Madison
Ph.D. Defense
August 13th, 2007
Biomedical Information Extraction
*image courtesy of SEER Cancer Training Site
Structured Database
Biomedical Information Extraction
http://www.geneontology.org
Biomedical Information Extraction
NPL3 encodes a nuclear protein with an RNA
recognition motif and similarities to a family of
proteins involved in RNA metabolism.
ykuD was transcribed by SigK RNA polymerase from
T4 of sporulation.
Mutations in the COL3A1 gene have been implicated
as a cause of type IV Ehlers-Danlos syndrome, a
disease leading to aortic rupture in early adult life.
Outline
 Biomedical Information Extraction
 Inductive Logic Programming
 Gleaner
 Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting (time permitting)
Inductive Logic Programming
 Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on train set and then measure performance on test set
 In ILP, data are Objects …
– person, block, molecule, word, phrase, …
 … and Relations between them
– grandfather, has_bond, is_member, …
Seeing Text as Relational Objects
 Word: verb(…), alphanumeric(…), internal_caps(…)
 Phrase: noun_phrase(…), phrase_child(…, …), phrase_parent(…, …)
 Sentence: long_sentence(…)
Protein Localization Clause
prot_loc(Protein,Location,Sentence) :-
    phrase_contains_some_alphanumeric(Protein,E),
    phrase_contains_some_internal_cap_word(Protein,E),
    phrase_next(Protein,_),
    different_phrases(Protein,Location),
    one_POS_in_phrase(Location,noun),
    phrase_contains_some_arg2_10x_word(Location,_),
    phrase_previous(Location,_),
    avg_length_sentence(Sentence).
ILP Background
 Seed Example
– A positive example that our clause must cover
 Bottom Clause
– All predicates which are true about the seed example
[figure: refinement lattice for the seed, e.g.]
prot_loc(P,L,S)
prot_loc(P,L,S) :- alphanumeric(P)
prot_loc(P,L,S) :- alphanumeric(P), leading_cap(L)
Clause Evaluation
 Prediction is Positive or Negative; True or False marks whether it matches the Actual label

                actual +    actual −
predicted +       TP          FP
predicted −       FN          TN

 Focus on positive examples

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 Score = 2PR / (P + R)
Protein Localization Clause
prot_loc(Protein,Location,Sentence) :phrase_contains_some_alphanumeric(Protein,E),
phrase_contains_some_internal_cap_word(Protein,E),
phrase_next(Protein,_),
different_phrases(Protein,Location),
one_POS_in_phrase(Location,noun),
phrase_contains_some_arg2_10x_word(Location,_),
phrase_previous(Location,_),
avg_length_sentence(Sentence).
0.15 Recall
0.51 Precision
0.23 F1 Score
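As a concrete check of these metrics, a minimal Python sketch (mine, not from the thesis) with confusion-matrix counts invented to approximate the clause above:

def precision_recall_f1(tp, fp, fn):
    # Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2PR/(P+R)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative: 15 of 100 positives covered, 14 false positives
print(precision_recall_f1(tp=15, fp=14, fn=85))   # ≈ (0.52, 0.15, 0.23)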
Aleph (Srinivasan ‘03)
 Aleph learns theories of clauses
– Pick positive seed example
– Use heuristic search to find best clause
– Pick new seed from uncovered positives and repeat until a threshold of positives is covered
 Sequential learning is time-consuming
 Can we reduce time with ensembles?
 And also increase quality?
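A schematic of that sequential covering loop in Python; search_best_clause and covers are hypothetical stand-ins for Aleph's clause search and coverage test, not its real implementation:

def aleph(positives, negatives, search_best_clause, covers, coverage=0.95):
    # Sequential covering: seed -> best clause -> remove covered -> repeat.
    theory, uncovered = [], set(positives)
    stop_at = (1 - coverage) * len(positives)
    while len(uncovered) > stop_at:
        seed = next(iter(uncovered))               # pick a positive seed
        clause = search_best_clause(seed, uncovered, negatives)
        theory.append(clause)
        # Assumes a learned clause covers its own seed, so the set shrinks.
        uncovered = {p for p in uncovered if not covers(clause, p)}
    return theory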
Outline
 Biomedical Information Extraction
 Inductive Logic Programming
 Gleaner
 Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting
Gleaner (Goadrich et al. ‘04, ‘06)
 Definition of Gleaner
– One who gathers grain left behind by reapers
 Key Ideas of Gleaner
– Use Aleph as underlying ILP clause engine
– Search clause space with Rapid Random Restart
– Keep wide range of clauses usually discarded
– Create separate theories for diverse recall
Gleaner - Learning
 Create B Bins
 Generate Clauses
 Record Best per Bin
[figure: generated clauses in precision-recall space, binned by recall]
Gleaner - Learning
 Repeat for Seed 1, Seed 2, Seed 3, … Seed K
[figure: one binned clause set per seed along the recall axis]
Gleaner - Ensemble
 Clauses from bin 5 score each example (Pos and Neg) by the number of clauses that match it

ex1: prot_loc(…)    12
ex2: prot_loc(…)    47
ex3: prot_loc(…)    55
…
ex598: prot_loc(…)    5
ex599: prot_loc(…)   14
ex600: prot_loc(…)    2
ex601: prot_loc(…)   18
Gleaner - Ensemble
 Sort examples by score; each prefix of the ranking gives one precision-recall point

Example                 Score   Precision   Recall
pos3:   prot_loc(…)       55      1.00       0.05
neg28:  prot_loc(…)       52      0.50       0.05
pos2:   prot_loc(…)       47      0.66       0.10
…
neg4:   prot_loc(…)       18      0.12       0.85
neg475: prot_loc(…)       17
pos9:   prot_loc(…)       17      0.13       0.90
neg15:  prot_loc(…)       16      0.12       0.90
…

[figure: resulting PR curve, precision vs. recall, both axes 0 to 1.0]
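The points above come from sweeping a threshold down the score-sorted list. A small sketch of that sweep, with example data made up to match the top of the table and 20 positives assumed overall:

def pr_points(ranked_labels, total_pos):
    # One (recall, precision) point per prefix of the score-sorted list.
    tp = fp = 0
    points = []
    for label in ranked_labels:          # labels in decreasing-score order
        if label == 'pos':
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

# pos3 (55), neg28 (52), pos2 (47):
print(pr_points(['pos', 'neg', 'pos'], total_pos=20))
# -> (0.05, 1.00), (0.05, 0.50), (0.10, 0.67) after rounding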
Gleaner - Overlap
 For each bin, take the topmost curve
[figure: per-bin PR curves overlaid; the upper envelope is kept]
How to Use Gleaner (Version 1)
 Generate Tuneset Curve
 User Selects Recall Bin
 Return Testset Classifications Ordered By Their Score
[figure: tuneset PR curve; selecting the bin at Recall = 0.50 gives Precision = 0.70]
Gleaner Algorithm
 Divide space into B bins
 For K positive seed examples
– Perform RRR search with precision × recall heuristic
– Save best clause found in each bin b
 For each bin b
– Combine clauses in b to form theory_b
– Find "L of K" threshold for theory_b which performs best in bin b on tuneset
 Evaluate thresholded theories on testset
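A high-level sketch of this loop; rrr_search, recall_of, and score_of are hypothetical callables abstracting Aleph's RRR search and clause evaluation, and the L-of-K tuning step is omitted:

def gleaner_learn(seeds, B, rrr_search, recall_of, score_of):
    # best[b] maps seed index -> highest-scoring clause seen in recall bin b.
    best = [dict() for _ in range(B)]
    for k, seed in enumerate(seeds):
        for clause in rrr_search(seed):    # every clause met during search
            b = min(int(recall_of(clause) * B), B - 1)
            if k not in best[b] or score_of(clause) > score_of(best[b][k]):
                best[b][k] = clause
    # theory_b = the up-to-K clauses gleaned for bin b.
    return [list(per_seed.values()) for per_seed in best]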
Aleph Ensembles (Dutra et al ‘02)
 Compare to ensembles of theories
 Ensemble Algorithm
– Use K different initial seeds
– Learn K theories each containing C rules
– Rank examples by the number of theories that cover them
YPD Protein Localization
 Hand-labeled dataset (Ray & Craven ’01)
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations
 1,810 positive & 279,154 negative
 1.6 GB of background knowledge
– Structural, Statistical, Lexical and Ontological
– In total, 200+ distinct background predicates
 Performed five-fold cross-validation
Evaluation Metrics
 Area Under Precision-Recall Curve (AUC-PR)
– All curves standardized to cover full recall range
– Averaged AUC-PR over 5 folds
 Number of clauses considered
– Rough estimate of time
[figure: PR curve with shaded area under it, axes 0 to 1.0]
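Since the related work cited later (Goadrich et al. ‘04, ‘06) shows that straight lines between PR points overestimate area, here is a sketch of AUC-PR that instead interpolates in TP/FP counts; the two input points are invented for the example:

def auc_pr(points, total_pos):
    # points: (tp, fp) pairs for successive PR-curve corners, sorted by tp.
    # Interpolate FP linearly in TP, then sum precision over unit recall steps.
    area = 0.0
    for (tp_a, fp_a), (tp_b, fp_b) in zip(points, points[1:]):
        if tp_b == tp_a:
            continue
        slope = (fp_b - fp_a) / (tp_b - tp_a)   # extra FPs per extra TP
        for x in range(1, tp_b - tp_a + 1):
            tp, fp = tp_a + x, fp_a + slope * x
            area += (tp / (tp + fp)) / total_pos
    return area

# From (recall 0.2, precision 1.0) to (recall 1.0, precision 0.5):
print(auc_pr([(20, 0), (100, 100)], total_pos=100))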
PR Curves - 100,000 Clauses
Protein Localization Results
Other Relational Datasets
 Genetic Disorder (Ray & Craven ’01)
– 233 positive & 103,959 negative
 Protein Interaction (Bunescu et al ‘04)
– 799 positive & 76,678 negative
 Advisor (Richardson and Domingos ‘04)
– Students, Professors, Courses, Papers, etc.
– 113 positive & 2,711 negative
Genetic Disorder Results
Protein Interaction Results
Advisor Results
Gleaner Summary
 Gleaner makes use of clauses that are not the highest-scoring ones, for improved speed and quality
 Issues with Gleaner
– Output is a PR curve, not a probability
– Redundant clauses across seeds
– L of K clause combination
Outline
 Biomedical Information Extraction
 Inductive Logic Programming
 Gleaner
 Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting
Estimating Probabilities - SRL
 Given highly skewed relational datasets
 Produce accurate probability estimates
 Gleaner only produces PR curves
[figure: PR curve]
GleanerSRL Algorithm (Goadrich ‘07)
 Divide space into B bins
 For K positive seed examples
– Perform RRR search with precision × recall heuristic
– Save best clause found in each bin b
 For each bin b, create propositional feature-vectors from the binned clauses
 Learn scores with an SVM or other propositional learning algorithms
 Calibrate scores into probabilities
 Evaluate probabilities with Cross Entropy
GleanerSRL Algorithm
 Learning with Gleaner, as before
– Generate Clauses
– Create B Bins
– Record Best per Bin
– Repeat for K seeds
[figure: clause bins in precision-recall space]
Creating Feature Vectors
 Clauses from bin 5, one kept per seed
 Each example gets K Boolean features (did each seed's clause match?) plus 1 binned feature from its match count

ex1: prot_loc(…)  match count 12  →  features [1, 1, …, 0]
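A sketch of this construction; covers is a hypothetical coverage test, and the raw match count stands in for the binned count feature GleanerSRL actually discretizes:

def feature_vector(example, bin_clauses, covers):
    # K Boolean features: did the clause kept for each seed match?
    booleans = [1 if covers(c, example) else 0 for c in bin_clauses]
    # Plus one count feature (GleanerSRL bins this count; kept raw here).
    return booleans + [sum(booleans)]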
Learning Scores via SVM
Calibrating Probabilities
 Use Isotonic Regression (Zadrozny & Elkan ‘03) to transform SVM scores into probabilities

Score:        -2   -0.4   0.2   0.4   0.5   0.9   1.3   1.7    15
Class:         0     0     1     0     1     1     0     1     1
Probability: 0.00  0.00  0.50  0.50  0.66  0.66  0.66  1.00  1.00
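The probability row is what pool-adjacent-violators isotonic regression produces on those labels. A sketch of standard PAV (not the thesis's code):

def pav(labels):
    # Fit a non-decreasing sequence to 0/1 labels sorted by SVM score.
    blocks = []                                  # each block: [mean, count]
    for y in labels:
        blocks.append([float(y), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, c2 = blocks.pop()                # merge adjacent violators
            v1, c1 = blocks.pop()
            blocks.append([(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2])
    return [v for v, c in blocks for _ in range(c)]

# Classes from the table, in increasing-score order:
print(pav([0, 0, 1, 0, 1, 1, 0, 1, 1]))
# -> 0.0, 0.0, 0.5, 0.5, 0.66.., 0.66.., 0.66.., 1.0, 1.0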
GleanerSRL Results for Advisor
(Davis et al. 05)
(Davis et al. 07)
Outline
 Biomedical Information Extraction
 Inductive Logic Programming
 Gleaner
 Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting
Diversity of Gleaner Clauses
Negative Salt
 Seed Example
– A positive example that our clause must cover
 Salt Example
– A negative example that our clause should avoid
[figure: clause lattice for prot_loc(P,L,S), with seed and salt marked]
Gleaner Algorithm
 Divide space into B bins
 For K positive seed examples
– Select Negative Salt example
– Perform RRR search with salt-avoiding heuristic
– Save best clause found in each bin b
 For each bin b
– Combine clauses in b to form theory_b
– Find "L of K" threshold for theory_b which performs best in bin b on tuneset
 Evaluate thresholded theories on testset
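The slide does not spell out the salt-avoiding heuristic, so this is only one plausible reading: keep the usual RRR score but discount any clause that covers the salt example (covers, base_score, and the penalty factor are all assumptions):

def salt_avoiding_score(clause, salt, covers, base_score, penalty=0.5):
    # Hypothetical: discount clauses that cover the negative salt example,
    # pushing each seed's search toward a different region of clause space.
    score = base_score(clause)
    return score * penalty if covers(clause, salt) else score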
Diversity of Negative Salt
Effect of Salt on Theory_b Choice
Negative Salt AUC-PR
Outline
 Biomedical Information Extraction
 Inductive Logic Programming
 Gleaner
 Extensions to Gleaner
– GleanerSRL
– Negative Salt
– F-Measure Search
– Clause Weighting
Gleaner Algorithm
 Divide space into B bins
 For K positive seed examples
– Perform RRR search with F-Measure heuristic (in place of precision × recall)
– Save best clause found in each bin b
 For each bin b
– Combine clauses in b to form theory_b
– Find "L of K" threshold for theory_b which performs best in bin b on tuneset
 Evaluate thresholded theories on testset
RRR Search Heuristic
 Heuristic function directs RRR search
 Can provide direction through the F_β Measure

F_β = (1 + β²) · P · R / (β² · P + R)

 Low values for β encourage Precision
 High values for β encourage Recall
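The standard F_β, for reference; the two prints show the precision-leaning and recall-leaning extremes searched on the next slides:

def f_beta(precision, recall, beta):
    # beta < 1 weights precision; beta > 1 weights recall.
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

print(f_beta(0.8, 0.2, beta=0.01))   # ~0.80: tracks precision
print(f_beta(0.8, 0.2, beta=100))    # ~0.20: tracks recall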
F0.01 Measure Search
F1 Measure Search
F100 Measure Search
F Measure AUC-PR Results
[figures: Protein Localization and Genetic Disorder]
Weighting Clauses
 Alter the L of K combination in Gleaner
 Within Single Theory
– Cumulative weighting schemes successful
– Precision is the highest-scoring scheme
 Within Gleaner
– Precision beats Equal-Weighted and Naïve Bayes
– Significant results on genetic-disorder dataset
Weighting Clauses
 Clauses from bin 5, weighting example ex1: prot_loc(…)
 Cumulative
– ∑(precision of each matching clause)
– ∑(recall of each matching clause)
– ∑(F1 measure of each matching clause)
 Ranked List
– max(precision of each matching clause)
 Weighted Vote
– ave(precision of each matching clause)
 Naïve Bayes and TAN
– learn probability for example
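A sketch of the cumulative, ranked-list, and weighted-vote schemes above; covers and per-clause tuneset precision are assumed given, and the Naïve Bayes/TAN option is omitted:

def example_weight(example, clauses, covers, precision_of, scheme='sum'):
    matched = [precision_of(c) for c in clauses if covers(c, example)]
    if not matched:
        return 0.0
    if scheme == 'sum':                    # cumulative
        return sum(matched)
    if scheme == 'max':                    # ranked list
        return max(matched)
    if scheme == 'ave':                    # weighted vote
        return sum(matched) / len(matched)
    raise ValueError(scheme)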
Dominance Results
 Table cell (i, j) marks statistically significant dominance of scheme i over scheme j
 Precision is never dominated
 Naïve Bayes competitive with cumulative schemes
Weighting Gleaner Results
Conclusions and Future Work
 Gleaner is a flexible and fast ensemble algorithm for highly skewed ILP datasets
 Other Work
– Proper interpolation of PR Space (Goadrich et al. ‘04, ‘06)
– Relationship of PR and ROC Curves (Davis and Goadrich ‘06)
 Future Work
– Explore Gleaner on propositional datasets
– Learn heuristic function for diversity (Oliphant and Shavlik ‘07)
Acknowledgements
 USA DARPA Grant F30602-01-2-0571
 USA Air Force Grant F30602-01-2-0571
 USA NLM Grant 5T15LM007359-02
 USA NLM Grant 1R01LM07050-01
 UW Condor Group
 Jude Shavlik, Louis Oliphant, David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Patricia Brennan, AnHai Doan, Jesse Davis, Frank DiMaio, Ameet Soni, Irene Ong, Laura Goadrich, all 6th Floor MSCers