Learning to Construct and Reason with a Large KB of Extracted Information

William W. Cohen
Machine Learning Dept
and Language Technology Dept
joint work with:
Tom Mitchell, Ni Lao, William Wang, Kathryn Rivard Mazaitis,
Richard Wang, Frank Lin, Estevam Hruschka, Jr., Burr
Settles, Partha Talukdar, Derry Wijaya, Edith Law, Justin
Betteridge, Jayant Krishnamurthy, Bryan Kisiel, Andrew
Carlson, Weam Abu Zaki, Bhavana Dalvi, Malcolm Greaves,
Lise Getoor, Jay Pujara, Hui Miao, …
Outline
• Background: information extraction and NELL
• Key ideas in NELL
– Coupled learning
– Multi-view, multi-strategy learning
• Inference in NELL
– Inference as another learning strategy
• Learning in graphs
• Path Ranking Algorithm
• ProPPR
– Promotion as inference
• Conclusions & summary
But first… some backstory
…and an unrelated project…
…called SimStudent…
SimStudent learns rules to solve a problem
step-by-step, and then guides a student through
solving problems step-by-step.
[Figure: the rule learner builds on Quinlan's FOIL.]
Summary of SimStudent
• Possible for a human author (e.g., a middle-school
teacher) to build an ITS system
– by building a GUI, then demonstrating problem solving and
having the system learn how to solve problems from the examples
• The rules learned by SimStudent can be used to
construct a “student model”
– with parameter tuning this can predict how well individual
students will learn
– better than state-of-the-art in some cases!
• AI problem solving with a cognitively predictive
model … and ILP is a key component!
Information Extraction
• Goal:
– Extract facts about the world automatically by
reading text
– IE systems are usually based on learning how to
recognize facts in text
• .. and then (sometimes) aggregating the results
• Latest-generation IE systems do not necessarily require
large amounts of training data
• … and IE does not necessarily require subtle analysis of
any particular piece of text
Never Ending Language Learning
(NELL)
• NELL is a broad-coverage IE system
– Simultaneously learning 500-600 concepts and relations
(person, celebrity, emotion, acquiredBy, locatedIn,
capitalCityOf, …)
– Starting point: containment/disjointness relations between
concepts, types for relations, and O(10) examples per
concept/relation
– Uses 500M web page corpus + live queries
– Running (almost) continuously for over three years
– Has learned over 50M beliefs, over 1M high-confidence ones
• about 85% of high-confidence beliefs are correct
Demo
• http://rtw.ml.cmu.edu/rtw/
[NELL screenshots]
More examples of what NELL knows
Outline
• Background: information extraction and NELL
• Key ideas in NELL
– Coupled learning
– Multi-view, multi-strategy learning
• Inference in NELL
– Inference as another learning strategy
• Learning in graphs
• Path Ranking Algorithm
• ProPPR
– Promotion as inference
• Conclusions & summary
Bootstrapped SSL learning
of lexical patterns
Given: four seed examples of the class "city": Paris, Pittsburgh, Seattle, Cupertino. Task: extract more cities.
[Figure: the bootstrapping loop alternates between promoting patterns ("mayor of arg1", "live in arg1", "arg1 is home of", "traits such as arg1") and promoting instances: correct ones like San Francisco, Austin, and Berlin, but also drift like denial, anxiety, and selfishness.]
it's underconstrained!! (a toy sketch of the failure mode follows)
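Here is a toy sketch of the bootstrapping loop and its failure mode, on a hypothetical mini-corpus: a single ambiguous context, "live in denial", promotes a non-city, which in turn promotes a bad pattern.

```python
# Minimal bootstrapped pattern/instance learner (hypothetical toy corpus).
# Illustrates semantic drift: nothing constrains "city" except co-occurrence.

corpus = [
    ("mayor of arg1", "Pittsburgh"), ("mayor of arg1", "Berlin"),
    ("live in arg1", "Paris"), ("live in arg1", "Seattle"),
    ("arg1 is home of", "Cupertino"), ("arg1 is home of", "Austin"),
    ("traits such as arg1", "selfishness"), ("traits such as arg1", "denial"),
    ("live in arg1", "denial"),   # "live in denial": the ambiguous bridge
]

seeds = {"Paris", "Pittsburgh", "Seattle", "Cupertino"}

def bootstrap(corpus, instances, rounds=3):
    for _ in range(rounds):
        # 1. Promote every pattern that matches a known instance.
        patterns = {p for p, x in corpus if x in instances}
        # 2. Promote every instance matched by a promoted pattern.
        instances = instances | {x for p, x in corpus if p in patterns}
    return instances

print(bootstrap(corpus, seeds))
# Once "live in denial" promotes 'denial', the pattern
# "traits such as arg1" is promoted too, dragging in 'selfishness'.
```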
One Key to Accurate Semi-Supervised Learning
[Figure: classifying NPs in "Krzyzewski coaches the Blue Devils." for a single concept, coach(NP), in isolation is a hard (underconstrained) semi-supervised learning problem. Learning many coupled concepts and relations at once (person, athlete, coach, team, sport, playsForTeam(a,t), playsSport(a,s), teamPlaysSport(t,s), coachesTeam(c,t)) yields a much easier (more constrained) semi-supervised learning problem.]
1. It is easier to learn many interrelated tasks than one isolated task
2. It is also easier to learn using many different types of information (a toy illustration follows)
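A toy illustration of why coupling constrains the problem (all data hypothetical): candidate facts for one relation can be filtered against the concept labels and type signatures being learned for the others.

```python
# Toy illustration of coupled semi-supervised learning (hypothetical data).
# A candidate instance of one relation is accepted only if it is consistent
# with the concepts being learned at the same time.

concept = {
    "Krzyzewski": "coach", "HinesWard": "athlete",
    "Blue Devils": "team", "anxiety": "emotion",
}
# Argument-type signature for the relation being learned.
signature = {"coachesTeam": ("coach", "team")}

def accept(rel, x, y):
    """Type check: keep a candidate rel(x, y) only if its arguments
    carry the concept labels required by rel's signature."""
    dom, rng = signature[rel]
    return concept.get(x) == dom and concept.get(y) == rng

candidates = [("Krzyzewski", "Blue Devils"), ("anxiety", "Blue Devils")]
print([c for c in candidates if accept("coachesTeam", *c)])
# [('Krzyzewski', 'Blue Devils')]: the ill-typed candidate is rejected,
# which an isolated learner for coachesTeam could not do.
```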
Outline
• Background: information extraction and NELL
• Key ideas in NELL
– Coupled learning
– Multi-view, multi-strategy learning
• Inference in NELL
– Inference as another learning strategy
• Learning in graphs
• Path Ranking Algorithm
• ProPPR
– Promotion as inference
• Conclusions & summary
Another key idea: use multiple types of information
[Figure: the NELL architecture. Four coupled learners (CBL: text extraction patterns; SEAL: HTML extraction patterns; Morph: a morphology-based extractor; PRA: learned inference rules) read the Web and the current KB; their proposals are combined by evidence integration, which updates the ontology and populated KB.]
Outline
• Background: information extraction and NELL
• Key ideas in NELL
– Coupled learning
– Multi-view, multi-strategy learning
• Inference in NELL
– Inference as another learning strategy
• Background: Learning in graphs
• Path Ranking Algorithm
• ProPPR
– Promotion as inference
• Conclusions & summary
Background: Personal Info Management
as Similarity Queries on a Graph
Einat Minkov, Univ Haifa [SIGIR 2006, EMNLP 2008, TOIS 2010]
[Figure: a graph built from email, with nodes for messages (dated 6/17/07 and 6/18/07), people (William, einat@cs.cmu.edu), and terms ("NSF", "graph", "proposal", "CMU"), connected by edges such as SentTo and TermInSubject.]
Learning about graph similarity
• Personalized PageRank aka Random Walk with Restart:
– Similarity measure for nodes in a graph, analogous
to TFIDF for text in a WHIRL database
– natural extension to PageRank
– amenable to learning parameters of the walk
(gradient search, w/ various optimization metrics):
• Toutanova, Manning & Ng, ICML 2004; Nie et al.,
WWW 2005; Xi et al., SIGIR 2005
– or: reranking, etc
– queries:
Given type t* and node x, find y:T(y)=t* and y~x
Given type t* and nodes X, find y:T(y)=t* and y~X
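A minimal power-iteration sketch of personalized PageRank / RWR on a toy graph (data hypothetical; real systems use sparse matrices, and the learned variants make transition probabilities depend on edge features):

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.15, iters=100):
    """Random walk with restart: at each step, with probability alpha
    jump back to the seed distribution, else follow a random out-edge."""
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    r = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iters):
        r = alpha * seed + (1 - alpha) * r @ P
    return r

# Toy 4-node graph (hypothetical): adjacency matrix, seed on node 0.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
seed = np.array([1.0, 0, 0, 0])
scores = personalized_pagerank(A, seed)
# For a query "given type t* and node x, find y": restrict candidates y
# to nodes of type t* and rank them by scores[y].
print(scores)
```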
Many tasks can be reduced to similarity queries
• Person name disambiguation: query [term "andy", file msgId], answer type "person"
• Threading: what are the adjacent messages in this thread? (a proxy for finding "more messages like this one"); query [file msgId], answer type "file"
• Alias finding: what are the email-addresses of Jason? query [term Jason], answer type "email-address"
• Meeting attendees finder: which email-addresses (persons) should I notify about this meeting? query [meeting mtgId], answer type "email-address"
Learning about graph similarity:
the next generation
• Personalized PageRank aka Random Walk with Restart:
– Given type t* and nodes X, find y:T(y)=t* and y~X
• Ni Lao’s thesis (2012): New, better learning methods
– richer parameterization
– faster PPR inference
– structure learning
• Other tasks:
– relation-finding in parsed text
– information management for biologists
– inference in large noisy knowledge bases
Lao: A learned random-walk strategy is a weighted set
of random-walk "experts", each of which is a walk
constrained by a path (i.e., a sequence of relations).
Example task: recommending papers to cite in a paper being prepared. Some of the learned experts, by rank:
1) papers co-cited with on-topic papers
6) approximately standard IR retrieval
7,8) papers cited during the past two years
12-13) papers published during the past two years
(a sketch of how one path expert is scored follows)
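A minimal sketch of how one path expert contributes a feature value (toy KB; names hypothetical, and PRA computes these walk probabilities with dynamic programming over the real KB):

```python
from collections import defaultdict

# PRA-style path features: a path is a sequence of relations, and its
# feature value for (s, t) is the probability that a random walk that
# follows exactly that relation sequence from s ends at t.

edges = [
    ("HinesWard", "onTeam", "Steelers"),
    ("Steelers", "teamPlaysSport", "football"),
    ("Steelers", "memberOf", "NFL"),
]
index = defaultdict(list)
for h, r, t in edges:
    index[(h, r)].append(t)

def path_prob(source, path):
    """Distribution over end nodes of a walk constrained to `path`."""
    dist = {source: 1.0}
    for rel in path:
        nxt = defaultdict(float)
        for node, p in dist.items():
            outs = index[(node, rel)]
            for t in outs:
                nxt[t] += p / len(outs)   # uniform choice among rel-edges
        dist = nxt
    return dist

# Score athletePlaysSport(s, t) as a weighted sum over path experts.
paths = {("onTeam", "teamPlaysSport"): 0.9}
score = sum(w * path_prob("HinesWard", p).get("football", 0.0)
            for p, w in paths.items())
print(score)  # 0.9
```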
Another study:
learning inference rules for a noisy KB
(Lao, Cohen & Mitchell, 2011; Lao et al., 2012)
[Figure: a NELL knowledge-graph fragment with edges AthletePlaysForTeam(HinesWard, Steelers) and TeamPlaysInLeague(Steelers, NFL), plus IsA / isa⁻¹ and PlaysIn edges through concepts such as "American", and the query AthletePlaysInLeague(HinesWard, ?). Notes: the random-walk interpretation is crucial, i.e., 10-15 extra points in MRR; useful paths can pass through synonyms of the query team.]
Another key idea: use multiple types of information
[Figure repeated: the NELL architecture, with CBL, SEAL, Morph, and PRA feeding evidence integration into the ontology and populated KB over the Web; PRA's learned inference rules are one of the coupled views.]
Outline
• Background: information extraction and NELL
• Key ideas in NELL
• Inference in NELL
– Inference as another learning strategy
• Background: Learning in graphs
• Path Ranking Algorithm
• PRA + FOL: ProPPR and joint learning for inference
– Promotion as inference
• Conclusions & summary
How can you extend PRA to
• Non-binary predicates?
• Paths that include constants?
• Recursive rules?
• … ?
• Current direction: using ideas from PRA in a
general first-order logic: ProPPR
A limitation
• Paths are learned separately for each relation
type, and one learned rule can’t call another
• PRA can learn this….
athletePlaysSportViaRule(Athlete,Sport) ←
onTeamViaKB(Athlete,Team), teamPlaysSportViaKB(Team,Sport)
teamPlaysSportViaRule(Team,Sport) ←
memberOfViaKB(Team,Conference),
hasMemberViaKB(Conference,Team2),
playsViaKB(Team2,Sport).
teamPlaysSportViaRule(Team,Sport) ←
onTeamViaKB(Athlete,Team), athletePlaysSportViaKB(Athlete,Sport)
A limitation
• Paths are learned separately for each relation
type, and one learned rule can’t call another
• But PRA can’t learn this…..
athletePlaysSport(Athlete,Sport) ←
onTeam(Athlete,Team), teamPlaysSport(Team,Sport)
athletePlaysSport(Athlete,Sport) ← athletePlaysSportViaKB(Athlete,Sport)
teamPlaysSport(Team,Sport) ←
memberOf(Team,Conference),
hasMember(Conference,Team2),
plays(Team2,Sport).
teamPlaysSport(Team,Sport) ←
onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport)
teamPlaysSport(Team,Sport) ← teamPlaysSportViaKB(Team,Sport)
Solution: a major extension of PRA to include a
large subset of Prolog
athletePlaysSport(Athlete,Sport) ←
onTeam(Athlete,Team), teamPlaysSport(Team,Sport)
athletePlaysSport(Athlete,Sport) ← athletePlaysSportViaKB(Athlete,Sport)
teamPlaysSport(Team,Sport) ←
memberOf(Team,Conference),
hasMember(Conference,Team2),
plays(Team2,Sport).
teamPlaysSport(Team,Sport) ←
onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport)
teamPlaysSport(Team,Sport) ← teamPlaysSportViaKB(Team,Sport)
Sample ProPPR program…
[Figure: Horn rules annotated with rule features (variables from the head are allowed). D'oh! This is a graph! The proof search space is itself a graph.]
• Score for a query solution (e.g., "Z=sport" for "about(a,Z)")
depends on the probability of reaching a ☐ (solution) node*
• learn transition probabilities based on the features of the rules
• implicit "reset" transitions, with probability p ≥ α, back to the query node
• net effect: we look for answers supported by many short proofs
• "Grounding" size is O(1/αε), i.e., independent of DB size → fast
approximate incremental inference (Andersen, Chung & Lang, 2008; sketch below)
• Learning: a supervised variant of personalized PageRank
(Backstrom & Leskovec, 2011)
*Exactly as in Stochastic Logic Programs [Cussens, 2001]
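The O(1/αε) bound comes from a push-style approximate PPR computation, as in Andersen, Chung & Lang's PageRank-Nibble. A minimal sketch on a toy graph (parameter values illustrative):

```python
from collections import defaultdict

# PageRank-Nibble-style approximate PPR (a sketch of the idea behind
# ProPPR's grounding): push mass until every node's residual is below
# eps * degree(node). Touches O(1/(alpha*eps)) nodes, independent of
# the total graph/DB size.

def approx_ppr(neighbors, seed, alpha=0.2, eps=1e-4):
    p = defaultdict(float)                 # approximate PPR scores
    r = defaultdict(float, {seed: 1.0})    # residual mass still to push
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(neighbors[u])
        if r[u] < eps * deg:
            continue
        mass, r[u] = r[u], 0.0
        p[u] += alpha * mass               # keep the alpha fraction here
        for v in neighbors[u]:             # spread the rest to neighbors
            r[v] += (1 - alpha) * mass / deg
            if r[v] >= eps * len(neighbors[v]):
                queue.append(v)
    return p

# Toy graph (hypothetical): only nodes near the seed are ever touched.
g = {"q": ["a", "b"], "a": ["q", "b"], "b": ["q", "a"]}
print(dict(approx_ppr(g, "q")))
```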
Sample Task: Citation Matching
• Task:
• citation matching (Alchemy: Poon & Domingos).
• Dataset:
• CORA dataset, 1295 citations of 132 distinct papers.
• Training set: sections 1-4.
• Test set: section 5.
• ProPPR program:
• translated from corresponding Markov logic network
(dropping non-Horn clauses)
• # of rules: 21.
Task: Citation Matching
Time: Citation Matching vs Alchemy
[Chart: inference time; "grounding" is independent of DB size.]
Accuracy: Citation Matching
[Chart: AUC scores (0.0 = low, 1.0 = high) for our rules vs. the UW rules; w=1 is before learning.]
It gets better…
• Learning uses many example queries
• e.g.: sameCitation(c120,X) with
X=c123 (positive), X=c124 (negative), …
• Each query is grounded to a separate
small graph (for its proofs)
• Goal is to tune weights on these edge
features to optimize RWR on the query graphs
• Can do SGD and run RWR separately
on each query graph
• Graphs do share edge features, so
some synchronization is needed
Learning can be parallelized by splitting on the separate "groundings" of each query (a toy sketch follows)
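A toy sketch of this scheme (the groundings are hypothetical, and the gradient is a crude surrogate; real ProPPR differentiates an RWR-based loss):

```python
import random

# Each query is grounded to its own small proof graph; SGD over the shared
# edge-feature weights can then process the graphs independently, with
# synchronization needed only on the shared weights.

# One "grounding" per query: proof-graph edges tagged with rule features,
# plus positive/negative labeled solution nodes (hypothetical data).
groundings = [
    {"edges": [("q1", "s1", ["f_onTeam"]), ("q1", "s2", ["f_viaKB"])],
     "pos": {"s1"}, "neg": {"s2"}},
    {"edges": [("q2", "s3", ["f_onTeam"])], "pos": {"s3"}, "neg": set()},
]

w = {}  # shared feature weights

def grad_for(g):
    """Toy surrogate gradient: push features on edges reaching positive
    solutions up, and those reaching negative solutions down."""
    grad = {}
    for _, dst, feats in g["edges"]:
        sign = -1.0 if dst in g["pos"] else (1.0 if dst in g["neg"] else 0.0)
        for f in feats:
            grad[f] = grad.get(f, 0.0) + sign
    return grad

for epoch in range(10):
    random.shuffle(groundings)      # SGD over query graphs; each grad_for
    for g in groundings:            # call is independent, so the graphs
        for f, v in grad_for(g).items():   # could be sharded over workers
            w[f] = w.get(f, 0.0) - 0.1 * v
print(w)  # f_onTeam goes up, f_viaKB goes down (toy behavior)
```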
Back to NELL…
[Figure repeated: the NELL architecture, with CBL, SEAL, Morph, and PRA feeding evidence integration into the ontology and populated KB over the Web.]
Experiment:
• Take the top K paths for each predicate learned by Lao's PRA
• (I don't know how to do structure learning for ProPPR yet)
• Convert them to a mutually recursive ProPPR program
• Train weights on the entire program
athletePlaysSport(Athlete,Sport) ←
onTeam(Athlete,Team), teamPlaysSport(Team,Sport)
athletePlaysSport(Athlete,Sport) ← athletePlaysSportViaKB(Athlete,Sport)
teamPlaysSport(Team,Sport) ←
memberOf(Team,Conference),
hasMember(Conference,Team2),
plays(Team2,Sport).
teamPlaysSport(Team,Sport) ←
onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport)
teamPlaysSport(Team,Sport) ← teamPlaysSportViaKB(Team,Sport)
More details
• Train on NELL’s KB as of iteration 713
• Test on new facts from later iterations
• Try three "subdomains" of NELL
– pick a seed entity S
– pick the top M entities in a (simple, untyped
RWR) from S
– project the KB to just these M entities
– look at three subdomains, six values of M
Outline
• Background: information extraction and NELL
• Key ideas in NELL
– Coupled learning
– Multi-view, multi-strategy learning
• Inference in NELL
– Inference as another learning strategy
• Learning in graphs
• Path Ranking Algorithm
• ProPPR
– Promotion as inference
• Conclusions & summary
More detail on NELL
• For iteration i = 1, …, 715, …:
– For each view (lexical patterns, …, PRA):
• Distantly train for that view using KBi
• Propose new "candidate beliefs" based on the learned
view-specific classifier
– Heuristically find the "best" candidate beliefs and
"promote" them into KBi+1
Not obvious how to promote in a principled way … (a toy sketch of the loop follows)
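A toy sketch of this outer loop; every component below is a hypothetical stub standing in for the real view-specific learners and promotion heuristics:

```python
# Sketch of NELL's outer loop as described above (stub components only).

def distant_train(view, kb):
    """Train the view's extractor, using current KB beliefs as labels.
    Stub: returns a classifier that proposes fixed scored candidates."""
    return lambda: [("Berlin", "city", 0.9), ("denial", "city", 0.4)]

def promote(candidates, threshold=0.8):
    """Heuristic promotion: keep only high-confidence candidates."""
    return {(e, c) for e, c, conf in candidates if conf >= threshold}

kb = {("Paris", "city")}
views = ["lexical patterns", "HTML patterns", "morphology", "PRA"]

for iteration in range(3):
    candidates = []
    for view in views:
        classifier = distant_train(view, kb)   # train from the current KB
        candidates += classifier()             # propose candidate beliefs
    kb |= promote(candidates)                  # heuristic promotion step
print(kb)  # {('Paris', 'city'), ('Berlin', 'city')}
```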
Promotion: identifying new correct
extractions from a pool of noisy extractions
• Many types of noise are possible:
• co-referent entities
• missing or spurious labels
• missing or spurious relations
• violations of ontology (e.g., an athlete that is not a person)
• Identifying true extractions requires joint reasoning, e.g.
• Pooling information about co-referent entities
• Enforcing mutual exclusion of labels and relations
Problem: How can we integrate extractions from multiple sources in
the presence of ontological constraints at the scale of millions of
extractions?
An example
Sample Extractions:
Lbl(Kyrgyzstan, bird)
Lbl(Kyrgyzstan, country)
Lbl(Kyrgyz Republic, country)
Rel(Kyrgyz Republic, Bishkek, hasCapital)
Ontology:
Dom(hasCapital, country)
Mut(country, bird)
Entity Resolution:
SameEnt(Kyrgyz Republic, Kyrgyzstan)
[Figure: a knowledge-graph view of NELL's extractions, with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country, and bird, and a SameEnt edge linking Kyrgyzstan to Kyrgyz Republic.]
What you want
[Figure: the clean knowledge graph: Lbl(Kyrgyzstan, country), Lbl(Kyrgyz Republic, country), Rel(Kyrgyz Republic, Bishkek, hasCapital). Graph identification maps the noisy extraction graph to this knowledge graph.]
Representation as a noisy knowledge graph
[Figure: Kyrgyzstan and Kyrgyz Republic linked by SameEnt, with candidate labels country and bird, and candidate capital Bishkek.]
Lise Getoor, Jay Pujara,
and Hui Miao @ UMD
After Knowledge Graph Identification
[Figure: only the consistent facts remain: Lbl(Kyrgyzstan, country), Lbl(Kyrgyz Republic, country), Rel(Kyrgyz Republic, Bishkek, hasCapital).]
Graph Identification as Joint Reasoning:
Probabilistic Soft Logic (PSL)
• Templating language for hinge-loss MRFs; much more scalable than Markov logic!
• Model specified as a collection of logical formulas
– Formulas are grounded by substituting literal values
– Truth values of atoms are relaxed to the [0,1] interval
– Truth values of formulas are derived from the Lukasiewicz t-norm
• Each ground rule r has a weighted potential ϕr, corresponding
to its distance to satisfaction
• PSL defines a probability distribution over atom truth-value
assignments I:
$$P(I) = \frac{1}{Z}\exp\Big(-\sum_{r \in R} w_r\,\phi_r(I)^p\Big)$$
• Most probable explanation (MPE) inference is convex
• Running time scales linearly with the number of ground rules (|R|)
(a toy sketch of these semantics follows)
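A minimal sketch of these semantics on one hypothetical ground rule (truth values and the weight are hand-set for illustration; this is not the PSL implementation):

```python
import math

# Toy sketch of PSL's soft logic: truth values live in [0,1], conjunction
# uses the Lukasiewicz t-norm, and each ground rule contributes a hinge
# "distance to satisfaction" to the (convex) MPE objective.

def lukasiewicz_and(a, b):
    # Lukasiewicz t-norm: conjunction over [0,1] truth values.
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body, head):
    # A ground rule body -> head is satisfied when I(body) <= I(head);
    # otherwise it is penalized by this hinge.
    return max(0.0, body - head)

# Ground rule: SameEnt(x,y) AND Lbl(x,country) -> Lbl(y,country), weight 2.0
I = {"SameEnt(x,y)": 0.9, "Lbl(x,country)": 0.8, "Lbl(y,country)": 0.3}
body = lukasiewicz_and(I["SameEnt(x,y)"], I["Lbl(x,country)"])  # 0.7
phi = distance_to_satisfaction(body, I["Lbl(y,country)"])       # 0.4

# Unnormalized density: P(I) proportional to exp(-sum_r w_r * phi_r(I)^p),
# with p in {1, 2}; MPE inference minimizes the exponent over I.
w, p = 2.0, 1
print(phi, math.exp(-w * phi ** p))
```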
PSL Representation of Heuristics for Promotion
[Rule formulas shown on slide:] promote any candidate; promote "hints" (the old promotion strategy); be consistent about labels for duplicate entities
PSL Representation of Ontological Rules
[Rule formulas shown on slide:] be consistent with constraints from the ontology (too expressive for ProPPR; adapted from Jiang et al., ICDM 2012)
Datasets & Results
• Evaluation on NELL dataset from iteration 165:
• 1.7M candidate facts
• 70K ontological constraints
• Predictions on 25K facts from a 2-hop neighborhood around
test data
• Beats other methods, runs in just 10 seconds!
Method            F1     AUC
Baseline          .828   .873
NELL              .673   .765
MLN (Jiang, 12)   .836   .899
KGI-PSL           .853   .904
Summary
• Background: information extraction and NELL
• Key ideas in NELL
– Coupled learning
– Multi-view, multi-strategy learning
• Inference in NELL
– Inference as another learning strategy
• Learning in graphs
• Path Ranking Algorithm
• ProPPR
– Promotion as inference
• Conclusions & summary