Statistical Relational Learning
for Knowledge Extraction
from the Web
Hoifung Poon
Dept. of Computer Science & Eng.
University of Washington
1
“Drowning in Information,
Starved for Knowledge”
WWW
2
Great Vision:
Knowledge Extraction from Web
Craven et al., “Learning to Construct Knowledge Bases from
the World Wide Web,” Artificial Intelligence, 1999.

Also need:



Knowledge representation and reasoning
Close the loop: Apply knowledge to extraction
Machine reading [Etzioni et al., 2007]
3
Machine Reading: Text → Knowledge
……
4
Rapidly Growing Interest




AAAI-07 Spring Symposium on Machine Reading
DARPA Machine Reading Program (2009-2014)
NAACL-10 Workshop on Learning By Reading
Etc.
5
Great Impact







Scientific inquiry and commercial applications
Literature-based discovery, robot scientists
Question answering, semantic search
Drug design, medical diagnosis
Break the knowledge acquisition bottleneck for
AI and natural language understanding
Automatically semantify the Web
Etc.
6
This Talk



Statistical relational learning offers promising
solutions to machine reading
Markov logic is a leading unifying framework
A success story: USP


Unsupervised, end-to-end machine reading
Extracts five times as many correct answers as
the state of the art, with the highest accuracy of 91%
7
USP: Question-Answer Example
Interestingly, the DEX-mediated IkappaBalpha induction
was completely inhibited by IL-2, but not IL-4, in Th1
cells, while the reverse profile was seen in Th2 cells.
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
8
Overview





Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
9
Key Challenges




Complexity
Uncertainty
Pipeline accumulates errors
Supervision is scarce
10
Languages Are Structural
governments
lm$pxtm
(Hebrew: according to their families)
IL-4 induces CD11B
Involvement of p70(S6)-kinase
activation in IL-10 up-regulation
in human monocytes by gp41......
George Walker Bush was the 43rd President of the United States.
……
Bush was the eldest son of President G. H. W. Bush and Barbara Bush.
……
In November 1977, he met Laura Welch at a barbecue.
11
Languages Are Structural
govern-ment-s
l-m$px-t-m (Hebrew: according to their families)
[Morphological segmentation of the examples above]
IL-4 induces CD11B
[Parse tree: S → NP (IL-4) VP → V (induces) NP (CD11B)]
Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41……
[Nested event structure with Theme, Cause, and Site roles over: involvement, up-regulation (Theme: IL-10, Cause: gp41), activation (Theme: p70(S6)-kinase), human monocyte]
George Walker Bush was the 43rd President of the United States. …… Bush was the eldest son of President G. H. W. Bush and Barbara Bush. …… In November 1977, he met Laura Welch at a barbecue.
12
Knowledge Is Heterogeneous

Individuals
E.g.: Socrates is a man

Types
E.g.: Man is mortal

Inference rules
E.g.: Syllogism

Ontological relations
E.g.: HUMAN ISA MAMMAL; EYE ISPART FACE

Etc.
13
Complexity




Can handle using first-order logic
Trees, graphs, dependencies, hierarchies, etc.
easily expressed
Inference algorithms (satisfiability testing,
theorem proving, etc.)
But … logic is brittle with uncertainty
14
Languages Are Ambiguous
I saw the man with the telescope
(PP attachment: [with the telescope] can modify [the man] or [saw])
Microsoft buys Powerset
Microsoft acquires Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
……
Here in London, Frances Deek is a retired teacher …
In the Israeli town …, Karen London says …
Now London says …
London → PERSON or LOCATION?
G. W. Bush ……
…… Laura Bush ……
Mrs. Bush ……
Which one?
15
Knowledge Has Uncertainty



We need to model correlations
Our information is always incomplete
Our predictions are uncertain
16
Uncertainty

Statistics provides the tools to handle this








Mixture models
Hidden Markov models
Bayesian networks
Markov random fields
Maximum entropy models
Conditional random fields
Etc.
But … statistical models assume i.i.d. data
(independently and identically distributed):
objects → feature vectors
17
Pipeline is Suboptimal

E.g., NLP pipeline:
Tokenization → Morphology → Chunking → Syntax → …


Accumulates and propagates errors
Wanted: Joint inference


Across all processing stages
Among all interdependent objects
18
Supervision is Scarce


Tons of text … but most is not annotated
Labeling is expensive (Cf. Penn-Treebank)
⇒ Need to leverage indirect supervision
19
Redundancy




Key source of indirect supervision
State-of-the-art systems depend on this
E.g., TextRunner [Banko et al., 2007]
But … Web is heterogeneous: Long tail
Redundancy is only present in the head regime
20
Overview





Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
21
Statistical Relational Learning





Burgeoning field in machine learning
Offers promising solutions for machine reading
Unifies statistical and logical approaches
Replaces the pipeline with joint inference
Principled framework to leverage both
direct and indirect supervision
22
Machine Reading: A Vision
Challenge: Long tail
23
Machine Reading: A Vision
24
Challenges in Applying
Statistical Relational Learning



Learning is much harder
Inference becomes a crucial issue
Greater complexity for user
25
Progress to Date



Probabilistic logic [Nilsson, 1986]
Statistics and beliefs [Halpern, 1990]
Knowledge-based model construction
[Wellman et al., 1992]





Stochastic logic programs [Muggleton, 1996]
Probabilistic relational models [Friedman et al., 1999]
Relational Markov networks [Taskar et al., 2002]
Markov logic [Domingos & Lowd, 2009]
Etc.
26
Progress to Date



Probabilistic logic [Nilsson, 1986]
Statistics and beliefs [Halpern, 1990]
Knowledge-based model construction
[Wellman et al., 1992]





Stochastic logic programs [Muggleton, 1996]
Probabilistic relational models [Friedman et al., 1999]
Relational Markov networks [Taskar et al., 2002]
Markov logic [Domingos & Lowd, 2009] ← Leading unifying framework
Etc.
27
Overview





Machine reading
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
28
Markov Networks

Undirected graphical models
Smoking
Cancer
Asthma

Cough
Log-linear model:
P(x) = (1/Z) exp( Σᵢ wᵢ fᵢ(x) )
wᵢ: weight of feature i;  fᵢ: feature i
f₁(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, 0 otherwise
w₁ = 1.5
29
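As a toy sketch of this log-linear model (illustrative code, not from any MLN package), the following Python enumerates the worlds of a two-variable network with the single feature above and weight w₁ = 1.5:

```python
import itertools
import math

def f1(smoking, cancer):
    # Feature: 1 if (not Smoking) or Cancer, else 0
    return 1.0 if (not smoking) or cancer else 0.0

w1 = 1.5
worlds = list(itertools.product([False, True], repeat=2))  # (Smoking, Cancer)
scores = {x: math.exp(w1 * f1(*x)) for x in worlds}
Z = sum(scores.values())                                   # partition function
P = {x: s / Z for x, s in scores.items()}
# The only world violating the feature, (Smoking=True, Cancer=False),
# gets the lowest probability.
```

Each unit of weight multiplies a satisfying world's odds by e, which is exactly how Markov logic will soften hard constraints.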
First-Order Logic



Constants, variables, functions, predicates
E.g.: Anna, x, MotherOf(x), Friends(x,y)
Grounding: Replace all variables by constants
E.g.: Friends(Anna, Bob)
World (model, interpretation):
Assignment of truth values to all ground
predicates
30
Markov Logic




Intuition: Soften logical constraints
Syntax: Weighted first-order formulas
Semantics: Feature templates for Markov
networks
A Markov Logic Network (MLN) is a set of
pairs (Fi, wi) where
  Fi is a formula in first-order logic
  wi is a real number
P(x) = (1/Z) exp( Σᵢ wᵢ nᵢ(x) )
nᵢ(x): number of true groundings of Fᵢ in x
31
Example: Friends & Smokers
Smoking causes cancer.
Friends have similar smoking habits.
32
Example: Friends & Smokers
∀x Smokes(x) ⇒ Cancer(x)
∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
33
Example: Friends & Smokers
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
34
Example: Friends & Smokers
1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
Two constants: Anna (A) and Bob (B)
[Ground Markov network over: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)]
Probabilistic graphical models and first-order logic are special cases
35
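A toy grounding of this MLN can be sketched in Python (illustrative code, not the Alchemy API): enumerate the 2⁸ worlds over the eight ground atoms, count the true groundings nᵢ(x) of each formula, and weight each world by exp(Σᵢ wᵢ nᵢ(x)):

```python
import itertools
import math

constants = ["A", "B"]
w1, w2 = 1.5, 1.1  # weights of the two formulas

atoms = ([("Smokes", c) for c in constants] + [("Cancer", c) for c in constants]
         + [("Friends", a, b) for a in constants for b in constants])

def n1(world):
    # True groundings of: Smokes(x) => Cancer(x)
    return sum((not world[("Smokes", x)]) or world[("Cancer", x)]
               for x in constants)

def n2(world):
    # True groundings of: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not world[("Friends", x, y)])
               or (world[("Smokes", x)] == world[("Smokes", y)])
               for x in constants for y in constants)

worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
weights = [math.exp(w1 * n1(w) + w2 * n2(w)) for w in worlds]
Z = sum(weights)  # partition function over all 256 worlds
```

Brute-force enumeration only works at this toy scale; real MLN inference avoids it, which is what the algorithms on the next slide are for.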
MLN Algorithms:
The First Three Generations
Problem            | First generation        | Second generation | Third generation
MAP inference      | Weighted satisfiability | Lazy inference    | Cutting planes
Marginal inference | Gibbs sampling          | MC-SAT            | Lifted inference
Weight learning    | Pseudo-likelihood       | Voted perceptron  | Scaled conj. gradient
Structure learning | Inductive logic progr.  | ILP + PL (etc.)   | Clustering + pathfinding
36
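To make the "Gibbs sampling" entry concrete, here is a hedged toy sketch (not Alchemy code) that estimates P(Cancer(A) | Smokes(A)=true) in the friends-and-smokers MLN by resampling each non-evidence ground atom from its conditional distribution:

```python
import math
import random

constants = ["A", "B"]
w1, w2 = 1.5, 1.1  # weights from the example MLN

def log_weight(w):
    # n1: true groundings of Smokes(x) => Cancer(x)
    n1 = sum((not w[("Smokes", x)]) or w[("Cancer", x)] for x in constants)
    # n2: true groundings of Friends(x,y) => (Smokes(x) <=> Smokes(y))
    n2 = sum((not w[("Friends", x, y)]) or (w[("Smokes", x)] == w[("Smokes", y)])
             for x in constants for y in constants)
    return w1 * n1 + w2 * n2

atoms = ([("Smokes", c) for c in constants] + [("Cancer", c) for c in constants]
         + [("Friends", a, b) for a in constants for b in constants])
evidence = {("Smokes", "A"): True}

random.seed(0)
state = {a: evidence.get(a, random.random() < 0.5) for a in atoms}
count = total = 0
for it in range(5000):
    for a in atoms:
        if a in evidence:
            continue
        state[a] = True
        lw_t = log_weight(state)
        state[a] = False
        lw_f = log_weight(state)
        # P(a = true | rest) = sigmoid(lw_t - lw_f)
        state[a] = random.random() < 1.0 / (1.0 + math.exp(lw_f - lw_t))
    if it >= 500:  # discard burn-in sweeps
        count += state[("Cancer", "A")]
        total += 1
p_cancer_a = count / total
```

With Smokes(A) fixed true, the exact answer is e^1.5 / (1 + e^1.5) ≈ 0.82, and the chain's estimate should land near it; plain Gibbs like this mixes poorly with near-deterministic weights, which is the problem MC-SAT addresses.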
Efficient Inference




Logical or statistical inference already hard
But … can do approximate inference
Suffices to perform well in most cases
Combine ideas from both camps
E.g., MC-SAT = MCMC + SAT solver
More: Poon & Domingos, “Sound and Efficient Inference with
Probabilistic and Deterministic Dependencies”, in Proc. AAAI-2006.

Can also leverage sparsity in relational domains
More: Poon, Domingos & Sumner, “A General Method for Reducing
the Complexity of Relational Inference and its Application to MCMC”,
in Proc. AAAI-2008.
37
Weight Learning




Probability model P(X)
X: Observable in training data
Maximize likelihood of observed data
Regularization to prevent overfitting
38
Weight Learning

Gradient descent
Requires inference

log P ( x)  ni ( x)  Ex  ni ( x) 
wi
No. of times clause i is true in data
Expected no. times clause i is true according to MLN


Use MC-SAT for inference
Can also leverage second-order information
[Lowd & Domingos, 2007]
39
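A minimal sketch of this gradient ascent (illustrative only: a one-clause, one-constant model; exact expectations by enumeration where a real MLN would use MC-SAT; an L2 penalty stands in for the slide's "regularization to prevent overfitting"):

```python
import itertools
import math

# Toy MLN: one clause Smokes(x) => Cancer(x), one constant.
# A world is (smokes, cancer); n(world) = no. of true groundings (0 or 1).
def n(world):
    s, c = world
    return 1.0 if (not s) or c else 0.0

worlds = list(itertools.product([False, True], repeat=2))

def expected_n(w):
    # E_x[n(x)] under the current weight, computed exactly by enumeration
    scores = [math.exp(w * n(x)) for x in worlds]
    Z = sum(scores)
    return sum(s * n(x) for s, x in zip(scores, worlds)) / Z

observed = (True, True)   # single training world; the clause holds in it
w, lr, l2 = 0.0, 0.5, 0.1
for _ in range(500):
    grad = n(observed) - expected_n(w) - l2 * w   # d/dw log P(x) - penalty
    w += lr * grad
# w settles where the regularized gradient vanishes
```

Without the penalty, w would grow without bound here (the clause is never violated in the data), which is why regularization matters.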
Unsupervised Learning: How?


I.I.D. learning: Sophisticated model requires
more labeled data
Statistical relational learning: Sophisticated
model may require less labeled data




Ambiguities vary among objects
Joint inference → Propagate information from
unambiguous objects to ambiguous ones
One formula is worth a thousand labels
Small amount of domain knowledge →
large-scale joint inference
40
Unsupervised Weight Learning




Probability model P(X,Z)
X: Observed in training data
Z: Hidden variables
E.g., clustering with mixture models



Z: Cluster assignment
X: Observed features
P(X, Z) = P(Z) · P(X | Z)
Maximize likelihood of observed data by
summing out hidden variables Z
41
Unsupervised Weight Learning

Gradient descent

log P ( x)  Ez| x  ni ( x, z )   Ex , z  ni ( x, z ) 
wi
Sum over z, conditioned on observed x
Summed over both x and z


Use MC-SAT to compute both expectations
May also combine with contrastive estimation
More: Poon, Cherry & Toutanova, “Unsupervised Morphological
Segmentation with Log-Linear Models”, in Proc. NAACL-2009. (Best Paper Award)
42
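A hedged toy illustration of the two expectations in this gradient (hypothetical single-feature model linking an observed x to a hidden z; exact sums stand in for MC-SAT):

```python
import itertools
import math

# Toy model over (x, z), z hidden: one feature n(x, z) = [z => x], weight w.
def n(x, z):
    return 1.0 if (not z) or x else 0.0

def gradient(w, x_obs=True):
    # d/dw log P(x) = E_{z|x}[n(x,z)] - E_{x,z}[n(x,z)]
    score = lambda x, z: math.exp(w * n(x, z))
    pairs = list(itertools.product([False, True], repeat=2))
    Zx = sum(score(x_obs, z) for z in [False, True])
    e_cond = sum(score(x_obs, z) * n(x_obs, z) for z in [False, True]) / Zx
    Z = sum(score(x, z) for x, z in pairs)
    e_full = sum(score(x, z) * n(x, z) for x, z in pairs) / Z
    return e_cond - e_full
```

The clamped expectation conditions on the observed x; the free expectation sums over both x and z, exactly mirroring the formula on the slide.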
Markov Logic

Unified inference and learning algorithms
Can handle millions of variables, billions of features,
tens of thousands of parameters


Easy-to-use software: Alchemy
Many successful applications
E.g.: Information extraction, coreference
resolution, semantic parsing, ontology induction
43
Pipeline → Joint Inference

Combine segmentation and entity resolution
for information extraction
More: Poon & Domingos, “Joint Inference for Information
Extraction”, in Proc. AAAI-2007.

Extract complex and nested bio-events from
PubMed abstracts
More: Poon & Vanderwende, “Joint Inference for Knowledge
Extraction from Biomedical Literature”, in Proc. NAACL-2010.
44
Unsupervised Learning: Example
Coreference resolution: Accuracy comparable
to previous supervised state of the art
More: Poon & Domingos, “Joint Unsupervised Coreference
Resolution with Markov Logic”, in Proc. EMNLP-2008.
45
Overview





Machine reading: Challenges
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
46
Unsupervised Semantic Parsing

USP [Poon & Domingos, EMNLP-09] (Best Paper Award)
First unsupervised approach for semantic parsing
End-to-end machine reading system
Read text, answer questions
OntoUSP = USP + Ontology Induction
[Poon & Domingos, ACL-10]

Encoded in a few Markov logic formulas
47
Semantic Parsing
Goal
Microsoft buys Powerset → BUY(MICROSOFT,POWERSET)
Challenge
Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
48
Limitations of Existing Approaches



Manual grammar or supervised learning
Applicable to restricted domains only
For general text



Not clear what predicates and objects to use
Hard to produce consistent meaning annotation
Also, often learn both syntax and semantics


Fail to leverage advanced syntactic parsers
Make semantic parsing harder
49
USP: Key Idea # 1


Target predicates and objects can be learned
Viewed as clusters of syntactic or lexical variations
of the same meaning
BUY(-,-)
{ buys, acquires, ’s purchase of, … }
→ Cluster of various expressions for acquisition
MICROSOFT
{ Microsoft, the Redmond software giant, … }
→ Cluster of various mentions of Microsoft
50
USP: Key Idea # 2


Relational clustering → Cluster relations with the same objects
USP → Recursively cluster arbitrary expressions with similar subexpressions
Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
51
USP: Key Idea # 2


Relational clustering → Cluster relations with the same objects
USP → Recursively cluster arbitrary expressions with similar subexpressions
Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
Cluster same forms at the atom level
52
USP: Key Idea # 2


Relational clustering → Cluster relations with the same objects
USP → Recursively cluster arbitrary expressions with similar subexpressions
Microsoft buys Powerset
Microsoft acquires semantic search engine Powerset
Powerset is acquired by Microsoft Corporation
The Redmond software giant buys Powerset
Microsoft’s purchase of Powerset, …
Cluster forms in composition with same forms
53
USP: Key Idea # 3




Start directly from syntactic analyses
Focus on translating them to semantics
Leverage rapid progress in syntactic parsing
Much easier than learning both
56
Joint Inference in USP



Forms canonical meaning representation by
recursively clustering synonymous expressions
Text → Logical form in this representation
Induces ISA hierarchy among clusters and
applies hierarchical smoothing (shrinkage)
57
USP: System Overview





Input: Dependency trees for sentences
Converts dependency trees into quasi-logical
forms (QLFs)
Starts with QLF clusters at atom level
Recursively builds up clusters of larger forms
Output:


Probability distribution over QLF clusters and
their composition
MAP semantic parses of sentences
58
Generating Quasi-Logical Forms
buys
nsubj
Microsoft
dobj
Powerset
Convert each node into a unary atom
59
Generating Quasi-Logical Forms
buys(n1)
nsubj
Microsoft(n2)
dobj
Powerset(n3)
n1, n2, n3 are Skolem constants
60
Generating Quasi-Logical Forms
buys(n1)
nsubj
Microsoft(n2)
dobj
Powerset(n3)
Convert each edge into a binary atom
61
Generating Quasi-Logical Forms
buys(n1)
nsubj(n1,n2)
Microsoft(n2)
dobj(n1,n3)
Powerset(n3)
Convert each edge into a binary atom
62
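The two conversion steps above (node → unary atom, edge → binary atom) can be sketched as follows (hypothetical helper, not USP's actual code; atom strings are illustrative):

```python
# Convert a dependency tree into a quasi-logical form (QLF):
# one unary atom per node, one binary atom per edge.
def to_qlf(tokens, edges):
    """tokens: {node_id: word}; edges: [(head, label, dependent)]."""
    atoms = [f"{word}({node})" for node, word in sorted(tokens.items())]
    atoms += [f"{label}({head},{dep})" for head, label, dep in edges]
    return atoms

qlf = to_qlf({"n1": "buys", "n2": "Microsoft", "n3": "Powerset"},
             [("n1", "nsubj", "n2"), ("n1", "dobj", "n3")])
# → ['buys(n1)', 'Microsoft(n2)', 'Powerset(n3)', 'nsubj(n1,n2)', 'dobj(n1,n3)']
```

The node identifiers n1, n2, n3 play the role of the Skolem constants on the slides.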
A Semantic Parse
buys(n1)
nsubj(n1,n2)
Microsoft(n2)
dobj(n1,n3)
Powerset(n3)
Partition QLF into subformulas
63
A Semantic Parse
buys(n1)
nsubj(n1,n2)
Microsoft(n2)
dobj(n1,n3)
Powerset(n3)
Subformula → Lambda form:
Replace Skolem constant not in unary atom
with a unique lambda variable
64
A Semantic Parse
buys(n1)
λx2.nsubj(n1,x2)
Microsoft(n2)
λx3.dobj(n1,x3)
Powerset(n3)
Subformula → Lambda form:
Replace Skolem constant not in unary atom
with a unique lambda variable
65
A Semantic Parse
Core form
buys(n1)
Argument form
λx2.nsubj(n1,x2)
Microsoft(n2)
Argument form
λx3.dobj(n1,x3)
Powerset(n3)
Core form: No lambda variable
Argument form: One lambda variable
66
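The "replace Skolem constants not in a unary atom with lambda variables" step could look like this (illustrative regex-based helper; the variable-naming convention n2 → x2 is an assumption matching the slides):

```python
import re

def to_lambda_form(atom):
    """Binary QLF atom -> argument form with a unique lambda variable;
    unary atoms are core forms and pass through unchanged."""
    m = re.fullmatch(r"(\w+)\((\w+),(\w+)\)", atom)
    if m is None:
        return atom                      # unary atom: core form
    pred, head, dep = m.groups()
    var = "x" + dep.lstrip("n")          # n2 -> x2 (naming assumed)
    return f"λ{var}.{pred}({head},{var})"
```

Applied to the running example, nsubj(n1,n2) becomes λx2.nsubj(n1,x2), while buys(n1) stays a core form.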
A Semantic Parse
buys(n1), λx2.nsubj(n1,x2), λx3.dobj(n1,x3) → BUY
Microsoft(n2) → MICROSOFT
Powerset(n3) → POWERSET
Assign each subformula to an object cluster
67
Object Cluster: BUY
Core forms: buys(n1) 0.1, acquires(n1) 0.2, ……
One formula in the MLN: learn a weight for each pair of cluster and core form
→ Distribution over core forms
68
Object Cluster: BUY
Core forms: buys(n1) 0.1, acquires(n1) 0.2, ……
Property clusters: BUYER, BOUGHT, PRICE, ……
May contain a variable number of property clusters
69
Property Cluster: BUYER
Argument forms: λx2.nsubj(n1,x2) 0.5, λx2.agent(n1,x2) 0.4, ……
Argument clusters: MICROSOFT 0.2, GOOGLE 0.1, ……
Number: Zero 0.1, One 0.8, ……
Three MLN formulas → distributions over argument forms, argument clusters, and argument number
70
Probabilistic Model
Exponential prior on number of parameters
Cluster mixtures:
Object Cluster: BUY — buys 0.1, acquires 0.4, …
Property Cluster: BUYER — nsubj 0.5, agent 0.4, …; MICROSOFT 0.2, GOOGLE 0.1, …; Zero 0.1, One 0.8, …
71
Probabilistic Model
Exponential prior on number of parameters
Cluster mixtures with hierarchical smoothing:
Object Cluster: BUY — buys 0.1, acquires 0.4, …
Property Cluster: BUYER — nsubj 0.5, agent 0.4, …; MICROSOFT 0.2, GOOGLE 0.1, …; Zero 0.1, One 0.8, …
E.g., picking MICROSOFT as the BUYER argument depends not only on BUY, but also on its ISA ancestors
72
Abstract Lambda Form
buys(n1) ∧ λx2.nsubj(n1,x2) ∧ λx3.dobj(n1,x3)
↓
BUYS(n1) ∧ λx2.BUYER(n1,x2) ∧ λx3.BOUGHT(n1,x3)
Final logical form is obtained via lambda reduction
73
Challenge: State Space Too Large
Potential number of clusterings ~ exp(number of tokens)
But meaning units and clusters are often small
⇒ Use combinatorial search
74
Inference: Find MAP Parse
Initialize with the dependency tree, then greedily apply search operators
[Figure: dependency tree of “IL-4 protein induces CD11B” (edges nsubj, dobj, nn); a search operator composes the “IL-4 protein” subtree via lambda reduction]
75
Learning:
Greedily Maximize Posterior
Initialize with atomic clusters, then apply search operators:
MERGE: { induces 1.0 } + { enhances 1.0 } → { induces 0.2, enhances 0.8 }
COMPOSE: { amino 1.0 } + { acid 1.0 } → { amino acid 1.0 }
……
76
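The MERGE step above can be mimicked with counts (toy sketch; the 0.2/0.8 split assumes "induces" was seen 2 times and "enhances" 8 times, an illustrative assumption, not data from the paper):

```python
from collections import Counter

def merge(counts1, counts2):
    """MERGE two clusters: pool their core-form counts into one distribution."""
    total = Counter(counts1) + Counter(counts2)
    n = sum(total.values())
    return {form: c / n for form, c in total.items()}

merged = merge({"induces": 2}, {"enhances": 8})
# → {"induces": 0.2, "enhances": 0.8}
```

In USP the decision to apply an operator is driven by the change in posterior, not just the pooled counts; this only shows the bookkeeping.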
Operator: ABSTRACT
[Figure: rather than MERGE INDUCE (induces 0.6, enhances 0.2, up-regulates …) with INHIBIT (inhibits 0.4, suppresses 0.2, …) into a single REGULATE cluster, ABSTRACT makes both ISA children of a new parent cluster]
Captures substantial similarities
77
Experiments




Apply to machine reading:
Extract knowledge from text and answer questions
Evaluation: Number of answers and accuracy
GENIA dataset: 1,999 PubMed abstracts
Use simple factoid questions, e.g.:


What does anti-STAT1 inhibit?
What regulates MIP-1 alpha?
78
Results: Total and Correct Answers
USP extracted five times as many correct answers as TextRunner, with the highest precision of 91%
[Bar chart, 0–500 answers: KW-SYN, TextRunner, RESOLVER, DIRT, USP]
79
Qualitative Analysis


Resolves many nontrivial variations
Argument forms that mean the same, e.g.,
expression of X ↔ X expression
X stimulates Y ↔ Y is stimulated with X



Active vs. passive voices
Synonymous expressions
Etc.
80
Clusters And Compositions

Clusters in core forms
{ investigate, examine, evaluate, analyze, study, assay }
{ diminish, reduce, decrease, attenuate }
{ synthesis, production, secretion, release }
{ dramatically, substantially, significantly }
……

Compositions
amino acid, t cell, immune response, transcription factor,
initiation site, binding site …
81
Question-Answer Example
Interestingly, the DEX-mediated IkappaBalpha induction
was completely inhibited by IL-2, but not IL-4, in Th1
cells, while the reverse profile was seen in Th2 cells.
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
82
Overview





Machine reading
Statistical relational learning
Markov logic
USP: Unsupervised Semantic Parsing
Research directions
83
Web-Scale Joint Inference


Challenge: Efficiently identify the relevant knowledge
Key: Induce and leverage an ontology
Ontology → Capture essential properties &
abstract away unimportant variations
Upper-level nodes → Skip irrelevant branches
Wanted: Combine the following


Probabilistic ontology induction (e.g., USP)
Coarse-to-fine learning and inference
[Felzenszwalb & McAllester, 2007; Petrov, Ph.D. Thesis]
84
Knowledge Reasoning


Most facts/rules are not explicitly stated
“Dark matter” in the natural language universe
kale contains calcium ∧ calcium prevents osteoporosis
⇒ kale prevents osteoporosis

Keys:



Induce generic reasoning patterns
Incorporate reasoning in extraction
Additional sources of indirect supervision
85
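The kale example instantiates one generic transitivity-style pattern; a minimal (hypothetical) forward-chaining sketch over triples:

```python
# Facts plus one induced reasoning pattern:
# contains(A, B) and prevents(B, C) => prevents(A, C)
facts = {("contains", "kale", "calcium"),
         ("prevents", "calcium", "osteoporosis")}

def forward_chain(facts):
    """Apply the pattern until no new facts are derived (fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = {("prevents", a, c)
               for (r1, a, b) in derived if r1 == "contains"
               for (r2, b2, c) in derived if r2 == "prevents" and b2 == b}
        if not new <= derived:
            derived |= new
            changed = True
    return derived
```

A real system would attach weights to such induced patterns and reason probabilistically, as in the MLN framework above; this shows only the hard-logic skeleton.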
Harness Social Computing

Bootstrap online community
Knowledge Base
86
Harness Social Computing
Bootstrap online community
Incorporate human & end tasks in the loop
“Tell me everything about dicer applied to synapse …”
Knowledge Base
87
Harness Social Computing
Bootstrap online community
Incorporate human & end tasks in the loop
“Your extraction from my paper is correct except for blah …”
Knowledge Base
88
Harness Social Computing



Bootstrap online community
Incorporate human & end tasks in the loop
Form positive feedback loop
Knowledge Base
89
Acknowledgments


Pedro Domingos, Colin Cherry, Kristina
Toutanova, Lucy Vanderwende, Oren Etzioni,
Dan Weld, Matt Richardson, Parag Singla,
Stanley Kok, Daniel Lowd, Marc Sumner
ARO, AFRL, ONR, DARPA, NSF
90
Summary


Statistical relational learning offers
promising solutions for machine reading
Markov logic provides a language for this



Syntax: Weighted first-order logical formulas
Semantics: Feature templates of Markov nets
Open-source software: Alchemy
alchemy.cs.washington.edu

A success story: USP
alchemy.cs.washington.edu/papers/poon09

Three key research directions
91