Extending Bayesian Logic Programs for Plan Recognition and Machine Reading

Sindhu V. Raghavan
Advisor: Raymond Mooney
PhD Proposal
May 12th, 2011
1
Plan Recognition in
Intelligent User Interfaces
$ cd my-dir
$ cp test1.txt test-dir
$ rm test1.txt
What task is the user performing?
move-file
Which files and directories are involved?
test1.txt and test-dir
Can the task be performed more efficiently?
Data is relational in nature - several files and directories
and several relations between them
2
Characteristics of Real World Data
Relational or structured data
 Several entities in the domain
 Several relations between entities
 Do not always follow the i.i.d. assumption
Presence of noise or uncertainty
 Uncertainty in the types of entities
 Uncertainty in the relations
Traditional approaches like first-order logic or
probabilistic models can handle either structured
data or uncertainty, but not both.
3
Statistical Relational Learning (SRL)
Integrates first-order logic and probabilistic
graphical models [Getoor and Taskar, 2007]
– Combines strengths of both first-order logic and
probabilistic models
SRL formalisms
– Stochastic Logic Programs (SLPs) [Muggleton, 1996]
– Probabilistic Relational Models (PRMs) [Friedman et al., 1999]
– Bayesian Logic Programs (BLPs) [Kersting and De Raedt, 2001]
– Markov Logic Networks (MLNs) [Richardson and Domingos, 2006]
4
Bayesian Logic Programs (BLPs)
Integrate first-order logic and Bayesian networks
Why BLPs?
 Efficient grounding mechanism that includes only those
variables that are relevant to the query
 Easy to extend by incorporating any type of logical
inference to construct networks
 Well suited for capturing causal relations in data
6
Objectives
BLPs for Plan Recognition
– Plan recognition involves predicting the top-level plan of an agent based on its actions
BLPs for Machine Reading
– Machine Reading involves automatic extraction of knowledge from natural language text
7
Outline
 Motivation
Background
 First-order logic
 Logical Abduction
 Bayesian Logic Programs (BLPs)
Completed Work
 Part 1 – Extending BLPs for Plan Recognition
 Part 2 – Extending BLPs for Machine Reading
Proposed Work
Conclusions
10
First-order Logic
Terms
 Constants – individual entities like anna, bob
 Variables – placeholders for objects like X, Y
Predicates
 Relations over entities like worksFor, capitalOf
Literal – predicate or its negation applied to terms
 Atom – Positive literal like worksFor(X,Y)
 Ground literal – literal with no variables like
worksFor(anna,bob)
Clause – disjunction of literals
 Horn clause has at most one positive literal
 Definite clause has exactly one positive literal
11
First-order Logic
Quantifiers
 Universal quantification (∀) - true for all objects in the domain
 Existential quantification (∃) - true for some objects in the domain
Logical Inference
 Forward Chaining – For every implication p ⇒ q, if p is true, then q is concluded to be true
 Backward Chaining – For a query literal q, if an implication p ⇒ q is present and p is true, then q is concluded to be true; otherwise backward chaining tries to prove p
12
Logical Abduction
Abduction
 Process of finding the best explanation for a set of
observations
Given
 Background knowledge, B, in the form of a set of (Horn)
clauses in first-order logic
 Observations, O, in the form of atomic facts in first-order
logic
Find
 A hypothesis, H, a set of assumptions (atomic facts) that logically entail the observations given the theory: B ∪ H ⊨ O
 Best explanation is the one with the fewest assumptions
13
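To make the abduction procedure concrete, the following is a minimal Python sketch of Stickel-style backchaining; the tuple encoding of atoms, the capitalized-string variable convention, and the abduce helper are illustrative assumptions, not the actual BALP implementation.

# Minimal logical-abduction sketch (illustrative; not the released BALP code).
# Atoms are tuples (predicate, arg1, ...); variables are capitalized strings.

def substitute(atom, theta):
    return (atom[0],) + tuple(theta.get(a, a) for a in atom[1:])

def unify(x, y, theta):
    """Unify two atoms; return extended bindings or None on failure."""
    if x[0] != y[0] or len(x) != len(y):
        return None
    for a, b in zip(x[1:], y[1:]):
        a, b = theta.get(a, a), theta.get(b, b)
        if a == b:
            continue
        if a[0].isupper():
            theta = {**theta, a: b}
        elif b[0].isupper():
            theta = {**theta, b: a}
        else:
            return None
    return theta

def abduce(goal, rules, assumptions, theta=None):
    """Backchain on goal; literals no clause head unifies with are assumed."""
    theta = theta or {}
    proved = False
    for head, body in rules:  # note: clause variables not standardized apart
        t = unify(substitute(goal, theta), head, {})
        if t is None:
            continue
        proved = True
        for lit in body:
            abduce(lit, rules, assumptions, t)
    if not proved:
        assumptions.add(substitute(goal, theta))

rules = [(("alarm", "X"), [("burglary", "X")]),
         (("burglary", "X"), [("neighborhood", "X")])]
assumed = set()
abduce(("alarm", "james"), rules, assumed)
print(assumed)  # {('neighborhood', 'james')}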
Bayesian Logic Programs (BLPs)
[Kersting and De Raedt, 2001]
Set of Bayesian clauses a | a1, a2, ..., an
 Definite clauses that are universally quantified
 Head of the clause - a
 Body of the clause - a1, a2, ..., an
 Range-restricted, i.e., variables(head) ⊆ variables(body)
 Associated conditional probability table (CPT)
o P(head | body)
Bayesian predicates a, a1, a2, ..., an have finite domains
 Combining rule like noisy-or for mapping multiple CPTs into a single CPT
14
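As a concrete picture of a single Bayesian clause, here is one possible plain-data encoding; the representation and the CPT values are invented for illustration.

# One Bayesian clause with its CPT as plain data (encoding and numbers are
# illustrative assumptions, not the BLP system's format).
# Clause: alarm(X) | lives(X,Y), tornado(Y), with boolean-valued predicates.
clause = {
    "head": ("alarm", "X"),
    "body": [("lives", "X", "Y"), ("tornado", "Y")],
    # P(head = true | body): one entry per joint state of the body atoms.
    "cpt": {
        (True, True): 0.9,
        (True, False): 0.01,
        (False, True): 0.01,
        (False, False): 0.0,
    },
}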
Logical Inference in BLPs
SLD Resolution
BLP
lives(james,yorkshire).
lives(stefan,freiburg).
neighborhood(james).
tornado(yorkshire).
burglary(X) | neighborhood(X).
alarm(X) | burglary(X).
alarm(X) | lives(X,Y), tornado(Y).
Query
alarm(james)
Proof
Backchaining on the query yields two SLD proofs:
alarm(james) ← burglary(james) ← neighborhood(james)
alarm(james) ← lives(james,Y), tornado(Y) ← lives(james,yorkshire), tornado(yorkshire)
Example from Ngo and Haddawy, 1997
15
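A self-contained SLD-resolution sketch of the example above, in Python; the clause encoding is an illustrative assumption (facts are clauses with empty bodies, variables are capitalized, and clause variables are not standardized apart, which suffices for this KB).

# SLD-resolution sketch for the alarm example (illustrative encoding).
CLAUSES = [
    (("lives", "james", "yorkshire"), []),
    (("lives", "stefan", "freiburg"), []),
    (("neighborhood", "james"), []),
    (("tornado", "yorkshire"), []),
    (("burglary", "X"), [("neighborhood", "X")]),
    (("alarm", "X"), [("burglary", "X")]),
    (("alarm", "X"), [("lives", "X", "Y"), ("tornado", "Y")]),
]

def subst(atom, theta):
    return (atom[0],) + tuple(theta.get(a, a) for a in atom[1:])

def unify(x, y, theta):
    if x[0] != y[0] or len(x) != len(y):
        return None
    for a, b in zip(x[1:], y[1:]):
        a, b = theta.get(a, a), theta.get(b, b)
        if a == b:
            continue
        if a[0].isupper():
            theta = {**theta, a: b}
        elif b[0].isupper():
            theta = {**theta, b: a}
        else:
            return None
    return theta

def prove(goals, theta):
    """Yield one binding per SLD proof of the goal list."""
    if not goals:
        yield theta
        return
    goal = subst(goals[0], theta)
    for head, body in CLAUSES:
        t = unify(goal, head, dict(theta))
        if t is not None:
            yield from prove(body + goals[1:], t)

for theta in prove([("alarm", "james")], {}):
    print(theta)
# Two proofs: one through burglary(james), one through lives/tornado.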
Bayesian Network Construction
Nodes: neighborhood(james), burglary(james), lives(james,yorkshire), tornado(yorkshire), alarm(james)
(lives(stefan,freiburg) appears in no proof of the query, so it is excluded ✖ from the network)
Each ground atom becomes a node (random variable) in the Bayesian network
Edges are added from ground atoms in the body of a clause to the ground atom in the head
Specify probabilistic parameters using the CPTs associated with Bayesian clauses
Use combining rule to combine multiple CPTs into a single CPT
25
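The construction on this slide is mechanical once the ground proofs are available; a small sketch (the ground-clause list is transcribed from the example, the code itself is illustrative):

# Build the network skeleton from the ground clauses in the proofs:
# one node per ground atom, one edge per (body atom -> head atom) pair.
ground_clauses = [
    ("burglary(james)", ["neighborhood(james)"]),
    ("alarm(james)", ["burglary(james)"]),
    ("alarm(james)", ["lives(james,yorkshire)", "tornado(yorkshire)"]),
]
nodes, edges = set(), set()
for head, body in ground_clauses:
    nodes.add(head)
    nodes.update(body)
    edges.update((b, head) for b in body)
print(sorted(nodes))
print(sorted(edges))
# alarm(james) ends up with two parent sets, one per clause, so a combining
# rule such as noisy-or merges the two clause CPTs into its single CPT.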
Probabilistic Inference and Learning
Probabilistic inference
 Marginal probability given evidence
 Most Probable Explanation (MPE) given evidence
Learning [Kersting and De Raedt, 2008]
 Parameters
o Expectation Maximization
o Gradient-ascent based learning
 Structure
o Hill climbing search through the space of possible
structures
29
Part 1
Extending BLPs for Plan Recognition
[Raghavan and Mooney, 2010]
30
Plan Recognition
Predict an agent’s top-level plans based on the
observed actions
Abductive reasoning involving inference of
cause from effect
Applications
 Story Understanding
o Recognize character’s motives or plans based on its
actions to answer questions about the story
 Strategic planning
o Predict other agents’ plans so as to work co-operatively
 Intelligent User Interfaces
o Predict the task that the user is performing so as to
provide valuable tips to perform the task more efficiently
31
Related Work
First-order logic based approaches [Kautz and Allen, 1986;
Ng and Mooney, 1992]
 Knowledge base of plans and actions
 Default reasoning or logical abduction to predict the best
plan based on the observed actions
 Unable to handle uncertainty in data or estimate likelihood
of alternative plans
Probabilistic graphical models [Charniak and Goldman, 1989;
Huber et al., 1994; Pynadath and Wellman, 2000; Bui, 2003; Blaylock and Allen, 2005]
 Encode the domain knowledge using Bayesian networks,
abstract hidden Markov models, or statistical n-gram
models
 Unable to handle relational/structured data
32
Related Work
Markov Logic based approaches [Kate and Mooney, 2009;
Singla and Mooney, 2011]
 Logical inference in MLNs is deductive in nature
 MLN-PC [Kate and Mooney, 2009]
o Add reverse implications to handle abductive reasoning
o Does not scale to large domains
 MLN-HC [Singla and Mooney, 2011]
o Improves over MLN-PC by adding hidden causes
o Still does not scale to large domains
 MLN-HCAM [Singla and Mooney, 2011] uses logical abduction to
constrain grounding
o Incorporates ideas from the BLP approach for plan
recognition [Raghavan and Mooney, 2010]
o MLN-HCAM performs better than MLN-PC and MLN-HC
33
BLPs for Plan Recognition
Why BLPs ?
 Directed model captures causal relationships well
 Efficient grounding process results in smaller networks,
unlike in MLNs
SLD resolution is deductive inference, used for
predicting observed actions from top-level plans
Plan recognition is abductive in nature and
involves predicting the top-level plan from
observed actions
BLPs cannot be used as is for plan recognition
34
Extending BLPs for Plan Recognition
BLPs + Logical Abduction = BALPs
BALPs – Bayesian Abductive Logic Programs
35
Logical Abduction in BALPs
Given
 A set of observation literals O = {O1, O2,….On} and a
knowledge base KB
Compute a set of abductive proofs of O using Stickel’s abduction algorithm [Stickel, 1988]
 Backchain on each Oi until it is proved or assumed
 A literal is said to be proved if it unifies with a fact or the
head of some rule in KB, otherwise it is said to be
assumed
Construct a Bayesian network using the
resulting set of proofs as in BLPs.
36
Example – Intelligent User Interfaces
Top-level plan predicates
 copy-file, move-file, remove-file
Action predicates
 cp, rm
Knowledge Base (KB)
 cp(filename,destdir) | copy-file(filename,destdir)
 cp(filename,destdir) | move-file(filename,destdir)
 rm(filename) | move-file(filename,destdir)
 rm(filename) | remove-file(filename)
Observed actions
 cp(Test1.txt, Mydir)
 rm(Test1.txt)
37
Abductive Inference
Backchaining on each observed action builds the proof graph:
 cp(Test1.txt,Mydir) matches cp(filename,destdir) | copy-file(filename,destdir): assume the literal copy-file(Test1.txt,Mydir)
 cp(Test1.txt,Mydir) also matches cp(filename,destdir) | move-file(filename,destdir): assume the literal move-file(Test1.txt,Mydir)
 rm(Test1.txt) matches rm(filename) | move-file(filename,destdir): unifies with the existing assumption move-file(Test1.txt,Mydir)
 rm(Test1.txt) matches rm(filename) | remove-file(filename): assume the literal remove-file(Test1.txt)
38
Structure of Bayesian network
copy-file(Test1.txt,Mydir) move-file(Test1.txt,Mydir) remove-file(Test1.txt)
cp(Test1.txt,Mydir)
rm(Test1.txt)
42
Probabilistic Inference
Specifying probabilistic parameters
 Noisy-and
o Specify the CPT for combining the evidence from
conjuncts in the body of the clause
 Noisy-or
o Specify the CPT for combining the evidence from
disjunctive contributions from different ground clauses
with the same head
o Models “explaining away”
 Noisy-and and noisy-or models reduce the number of
parameters learned from data
43
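For concreteness, the noisy-or combination can be written in a few lines; the leak term and the parameter values below are illustrative assumptions.

# Noisy-or sketch: P(effect) = 1 - (1 - leak) * prod over active causes (1 - p_i).
def noisy_or(active_cause_probs, leak=0.0):
    p = 1.0 - leak
    for pi in active_cause_probs:
        p *= 1.0 - pi
    return 1.0 - p

# If copy-file and move-file are both active causes of cp(Test1.txt,Mydir)
# with made-up parameters 0.8 and 0.6:
print(noisy_or([0.8, 0.6]))  # 0.92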
Probabilistic Inference
Plan nodes: copy-file(Test1.txt,Mydir), move-file(Test1.txt,Mydir), remove-file(Test1.txt)
Action nodes: cp(Test1.txt,Mydir) with noisy-or parameters θ1, θ2 (2 parameters); rm(Test1.txt) with noisy-or parameters θ3, θ4 (2 parameters)
Noisy models require a number of parameters linear in the number of parents
44
Probabilistic Inference
Most Probable Explanation (MPE)
 For multiple plans, compute MPE, the most likely
combination of truth values to all unknown literals given
this evidence
Marginal Probability
 For single top level plan prediction, compute marginal
probability for all instances of plan predicate and pick the
instance with maximum probability
 When exact inference is intractable, SampleSearch [Gogate and Dechter, 2007], an approximate inference algorithm for graphical models with deterministic constraints, is used
46
Probabilistic Inference
Evidence: the observed actions cp(Test1.txt,Mydir) and rm(Test1.txt)
Query variables: the plan instances copy-file(Test1.txt,Mydir), move-file(Test1.txt,Mydir), and remove-file(Test1.txt), each connected to the actions through noisy-or
MPE assignment: move-file(Test1.txt,Mydir) = TRUE; copy-file(Test1.txt,Mydir) = FALSE; remove-file(Test1.txt) = FALSE
47
Parameter Learning
Learn noisy-or/noisy-and parameters using the EM algorithm adapted for BLPs [Kersting and De Raedt, 2008]
Partial observability
 In the plan recognition domain, data is partially observable
 Evidence is present only for observed actions and top-level plans; sub-goals, noisy-or, and noisy-and nodes are not observed
Simplify the learning problem
 Learn noisy-or parameters only
 Use logical-and instead of noisy-and to combine evidence from conjuncts in the body of a clause
52
Experimental Evaluation
Monroe (Strategic planning)
Linux (Intelligent user interfaces)
Story Understanding (Narrative text)
53
Monroe and Linux
[Blaylock and Allen, 2005]
Task
 Monroe involves recognizing top level plans in an
emergency response domain (artificially generated using
HTN planner)
 Linux involves recognizing top-level plans based on linux
commands
 Single correct plan in each example
Data

                                   Monroe   Linux
No. examples                       1000     457
Avg. observations / example        10.19    6.1
Total top-level plan predicates    10       19
Total observed action predicates   30       43
54
Monroe and Linux
Methodology
 Manually encoded the knowledge base
 Learned noisy-or parameters using EM
 Computed marginal probability for plan instances
Systems compared
 BALPs
 MLN-HCAM [Singla and Mooney, 2011]
o MLN-PC and MLN-HC do not run on Monroe and Linux due to scaling issues
 Blaylock and Allen’s system [Blaylock and Allen, 2005]
Performance metric
 Convergence score - measures the fraction of examples
for which the plan schema was predicted correctly
55
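Written out, the convergence score is just an exact-match rate over predicted plan schemas; the data representation in this sketch is assumed.

# Convergence score sketch: fraction of examples whose predicted plan schema
# matches the gold schema (representation is an illustrative assumption).
def convergence_score(predicted_schemas, gold_schemas):
    correct = sum(p == g for p, g in zip(predicted_schemas, gold_schemas))
    return 100.0 * correct / len(gold_schemas)

print(convergence_score(["move-file", "copy-file"],
                        ["move-file", "remove-file"]))  # 50.0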
Convergence Score
[Bar chart: results on Monroe for BALPs, MLN-HCAM, and Blaylock & Allen; the only value label recovered is 94.2 *]
* - Differences are statistically significant wrt BALPs
56
Convergence Score
[Bar chart: results on Linux for BALPs, MLN-HCAM, and Blaylock & Allen; the only value label recovered is 36.1 *]
* - Differences are statistically significant wrt BALPs
57
Story Understanding
[Charniak and Goldman, 1991; Ng and Mooney, 1992]
Task
 Recognize character’s top-level plans based on actions described in narrative text
 Multiple top-level plans in each example
Data
 25 examples in development set and 25 examples in test
set
 12.6 observations per example
 8 top-level plan predicates
58
Story Understanding
Methodology
 Knowledge base was created for ACCEL [Ng and Mooney, 1992]
 Parameters set manually
o Insufficient number of examples in the development set
to learn parameters
 Computed MPE to get the best set of plans
Systems compared
 BALPs
 MLN-HCAM [Singla and Mooney, 2011]
o Best performing MLN model
 ACCEL-Simplicity [Ng and Mooney, 1992]
 ACCEL-Coherence [Ng and Mooney, 1992]
o Specific for Story Understanding
59
Results on Story Understanding
[Bar charts comparing BALPs with MLN-HCAM, ACCEL-Simplicity, and ACCEL-Coherence; values marked * differ statistically significantly wrt BALPs]
60
Summary of BALPs
for Plan Recognition
Extend BLPs for plan recognition by employing
logical abduction to construct Bayesian networks
Automatic learning of model parameters using
EM
Empirical results on all benchmark datasets
demonstrate advantages over existing methods
62
Part 2
Extending BLPs for Machine Reading
63
Machine Reading
Machine reading involves automatic extraction
of knowledge from natural language text
Information extraction (IE) systems extract
factual information like entities and relations
between entities that occur in text [Cohen, 1999; Craven et
al., 2000; Bunescu and Mooney, 2007; Etzioni et al, 2008]
 Extracted information can be used for answering queries
automatically
64
Limitations of IE Systems
Extract information that is stated explicitly in text
 Natural language text is not necessarily complete
 Commonsense information is not explicitly stated in text
 Well known facts are omitted from the text
Missing information cannot be inferred from text
 Human brain performs deeper inference using
commonsense knowledge
 IE systems have no access to commonsense knowledge
Errors in the extraction process result in some facts not being extracted
65
Our Objective
Improve performance of IE
 Learn general knowledge or commonsense information in
the form of rules using the facts extracted by the IE system
 Infer additional facts that are not stated explicitly in text
using the learned rules
66
Example
“Malaysian Prime Minister Mahathir Mohamad Wednesday announced for the first time that he has appointed his deputy Abdullah Ahmad Badawi as his successor.”
Extracted facts
nationState(malaysian)
Person(mahathir-mohamad)
isLedBy(malaysian,mahathir-mohamad)
The query isLedBy(X,Y) can be answered directly from the extracted facts: X = malaysian, Y = mahathir-mohamad
The query citizenOf(mahathir-mohamad,Y) cannot be answered (Y = ?): the fact is implied by the text but never stated explicitly, so it is not extracted
67
Related Work
Learning propositional rules [Nahm and Mooney, 2000]
 Learn propositional rules from the output of an IE system
on computer-related job postings
 Perform deductive inference to infer new facts
Learning first-order rules [Carlson et al., 2010; Schoenmackers et al., 2010; Doppa et al., 2010]
 Carlson et al. and Doppa et al. modify existing rule
learners like FOIL and FARMER to learn probabilistic rules
o Perform deductive inference to infer additional facts
 Schoenmackers et al. develop a new first-order rule
learner based on statistical relevance
o Use MLN based approach to infer additional facts
73
Our Approach
Use an off-the-shelf IE system to extract facts
Learn commonsense knowledge from the extracted facts in the form of first-order rules
Infer additional facts based on the learned rules using BLPs
 Pure logical deduction results in inferring a large number of facts
 Inference using BLPs is probabilistic in nature, i.e., inferred facts are assigned probabilities, which can be used to separate high-confidence facts from the rest
 Efficient grounding mechanism in BLPs enables our approach to scale to natural language text
74
Extending BLPs for Machine Reading
SLD resolution does not result in inference of new facts
 Given a query, SLD resolution will generate only deductive proofs of the query
Employ forward chaining to infer additional facts (sketched below)
 Forward chain on extracted facts to infer new facts
Use the deductive proofs from forward chaining to construct a Bayesian network for probabilistic inference
75
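A minimal forward-chaining sketch over the extracted facts; the encoding and the rule-matching helper are illustrative assumptions (rule bodies contain only variables here), not the system's actual inference code.

# Forward chaining: repeatedly apply rules until no new ground facts appear.
FACTS = {("nationState", "malaysian"),
         ("isLedBy", "malaysian", "mahathir-mohamad")}
# Rule: nationState(B) ^ isLedBy(B,A) => citizenOf(A,B)
RULES = [([("nationState", "B"), ("isLedBy", "B", "A")],
          ("citizenOf", "A", "B"))]

def match(body, facts, theta):
    """Yield variable bindings under which every body literal is a known fact."""
    if not body:
        yield theta
        return
    pred = body[0]
    for fact in facts:
        if fact[0] != pred[0] or len(fact) != len(pred):
            continue
        t = dict(theta)
        if all(t.setdefault(v, c) == c for v, c in zip(pred[1:], fact[1:])):
            yield from match(body[1:], facts, t)

changed = True
while changed:
    changed = False
    for body, head in RULES:
        derived = {(head[0],) + tuple(theta[v] for v in head[1:])
                   for theta in match(body, FACTS, {})}
        if not derived <= FACTS:
            FACTS |= derived
            changed = True
print(sorted(FACTS))  # now includes ('citizenOf', 'mahathir-mohamad', 'malaysian')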
Learning First-order Rules
“Barack Obama is the current President of
USA……. Obama was born on August 4,
1961, in Hawaii, USA…….’’
Extracted facts
nationState(USA)
Person(BarackObama)
isLedBy(USA,BarackObama)
hasBirthPlace(BarackObama,USA)
citizenOf(BarackObama,USA)
76
Learning Patterns from Extracted Facts
nationState(USA) ∧ isLedBy(USA,BarackObama) ⇒ citizenOf(BarackObama,USA)
nationState(USA) ∧ isLedBy(USA,BarackObama) ⇒ hasBirthPlace(BarackObama,USA)
hasBirthPlace(BarackObama,USA) ⇒ citizenOf(BarackObama,USA)
...
77
Generalizing Patterns to Rules
nationState(Y) ∧ isLedBy(Y,X) ⇒ citizenOf(X,Y)
nationState(Y) ∧ isLedBy(Y,X) ⇒ hasBirthPlace(X,Y)
hasBirthPlace(X,Y) ⇒ citizenOf(X,Y)
78
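The generalization step amounts to replacing each distinct constant with a consistently chosen variable; a sketch under that reading (the representation is an illustrative assumption):

# Variabilization sketch: map each distinct constant in a ground pattern to a
# fresh variable, reusing the same variable for repeated constants.
def generalize(pattern):
    names = iter("XYZUVW")
    mapping = {}
    out = []
    for pred, *args in pattern:
        new_args = []
        for a in args:
            if a not in mapping:
                mapping[a] = next(names)
            new_args.append(mapping[a])
        out.append((pred, *new_args))
    return out

ground = [("nationState", "USA"),
          ("isLedBy", "USA", "BarackObama"),
          ("citizenOf", "BarackObama", "USA")]
print(generalize(ground))
# [('nationState', 'X'), ('isLedBy', 'X', 'Y'), ('citizenOf', 'Y', 'X')]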
Inductive Logic Programming (ILP)
[Muggleton, 1992]
Positive instances
citizenOf(BarackObama,USA)
citizenOf(GeorgeBush,USA)
citizenOf(IndiraGandhi,India)
...
Negative instances
citizenOf(BarackObama,India)
citizenOf(GeorgeBush,India)
citizenOf(IndiraGandhi,USA)
...
Target relation
citizenOf(X,Y)
KB
hasBirthPlace(BarackObama,USA)
person(BarackObama)
nationState(USA)
nationState(India)
...
The ILP rule learner takes these as input and produces rules such as
nationState(Y) ∧ isLedBy(Y,X) ⇒ citizenOf(X,Y)
...
79
Example
“Malaysian Prime Minister Mahathir Mohamad
Wednesday announced for the first time that he has
appointed his deputy Abdullah Ahmad Badawi as his
successor.’’
Extracted facts
nationState(malaysian)
Person(mahathir-mohamad)
isLedBy(malaysian,mahathir-mohamad)
Learned rule
nationState(B) ∧ isLedBy(B,A) ⇒ citizenOf(A,B)
80
Inference of Additional Facts
Applying the learned rule nationState(B) ∧ isLedBy(B,A) ⇒ citizenOf(A,B) to the extracted facts nationState(malaysian) and isLedBy(malaysian,mahathir-mohamad) infers the new fact citizenOf(mahathir-mohamad,malaysian)
81
Experimental Evaluation
Data
 DARPA’s intelligent community (IC) data set
 Consists of news articles on politics, terrorism, and other international events
 2000 documents, split into training and test in the ratio 9:1
 Approximately 30,000 sentences
Information Extraction
 Information extraction using SIRE, IBM’s information extraction system [Florian et al., 2004]
82
Experimental Evaluation
Learning first-order rules using LIME [McCreath and
Sharma, 1998]
 Learns rules from only positive instances
 Handles noise in data
Models
 BLP-Pos-Only
o Learn rules using only positive instances
 BLP-Pos-Neg
o Learn rules using both positive and negative instances
o Apply closed world assumption to automatically
generate negative instances
 BLP-All
o Includes rules learned from BLP-Pos-Only and BLP-Pos-Neg
83
Experimental Evaluation
Specifying probabilistic parameters
 Manually specified parameters
 Logical-and model to combine evidence from conjuncts in
the body of the clause
 Set noisy-or parameters to 0.9
 Set priors to maximum likelihood estimates
84
Experimental Evaluation
Methodology
 Learned rules for 13 relations including hasCitizenship,
hasMember, attendedSchool
 Baseline
o Perform pure logical deduction to infer all possible
additional facts
 BLP-0.9
o Facts inferred using the BLP approach with marginal probability greater than or equal to 0.9
 BLP-0.95
o Facts inferred using the BLP approach with marginal probability greater than or equal to 0.95
85
Performance Evaluation
Precision
 Evaluate if the inferred facts are correct, i.e., if they can be inferred from the document
o No ground truth information available
o Perform manual evaluation
o Sample 20 documents randomly from the test set and
evaluate inferred facts manually for all models
86
Precision
(preliminary results)

Model                               BLP-ALL   BLP-Pos-Neg   BLP-Pos-Only
Baseline (purely logical)
  No. facts inferred                3813      75            1849
  Precision                         24.94     16.00         31.80
BLP-0.9
  No. facts inferred                1208      66            1158
  Precision                         34.68     13.63         35.49
BLP-0.95
  No. of facts inferred             56        1             0
  Precision                         91.07     100           Nil
87
Performance Evaluation
Estimated recall
 For each target relation, eliminate instances of it from the
extracted facts in the test set
 Try to infer the eliminated instances correctly based on the
remaining facts using the BLP approach
o For each threshold point (0.1 to 1), compute fraction of
eliminated instances that were inferred correctly
o True recall cannot be calculated because of lack of
ground truth
90
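As pseudocode made concrete, the estimated-recall protocol looks like this; infer_facts is a hypothetical stand-in for BLP inference and the fact representation is assumed.

# Estimated-recall sketch: hide all instances of a target relation, re-infer
# from the remaining facts, and score at each confidence threshold.
def estimated_recall(extracted, relation, infer_facts, thresholds):
    hidden = {f for f in extracted if f[0] == relation}
    remaining = extracted - hidden
    inferred = infer_facts(remaining)  # hypothetical: {fact: marginal prob}
    recall = {}
    for t in thresholds:
        recovered = {f for f, p in inferred.items() if p >= t and f in hidden}
        recall[t] = len(recovered) / len(hidden) if hidden else 0.0
    return recall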
Estimated Recall
[Charts: estimated recall (preliminary results) plotted against confidence threshold]
91
First-order Rules from LIME
 politicalParty(A) ∧ employs(A,B) ⇒ hasMemberPerson(A,B)
 building(B) ∧ eventLocation(A,B) ∧ bombing(A) ⇒ thingPhysicallyDamaged(A,B)
 employs(A,B) ⇒ hasMember(A,B)
 citizenOf(A,B) ⇒ hasCitizenship(A,B)
 nationState(B) ∧ person(A) ∧ employs(B,A) ⇒ hasBirthPlace(A,B)
95
Proposed Work
Learning parameters automatically from data
 EM or gradient-ascent based algorithm adapted for BLPs
by Kersting and De Raedt [2008]
Evaluation
 Human evaluation for inferred facts using Amazon
Mechanical Turk [Callison-Burch and Dredze, 2010]
 Use inferred facts to answer queries constructed for
evaluation in the DARPA sponsored machine reading
project
 Comparison to existing approaches like MLNs
Develop a structure learner for incomplete and
uncertain data
96
Future Extensions
Multiple predicate learning
 Learn rules for multiple relations at the same time
Discriminative parameter learning for BLPs
 EM and gradient-ascent based algorithms for learning BLP
parameters optimize likelihood of the data
 No algorithm that learns parameters discriminatively for
BLPs has been developed to date
Comparison of BALPs to other probabilistic
logics for plan recognition
 Comparison to PRISM [Sato, 1995], Poole’s Horn Abduction [Poole, 1993], Abductive Stochastic Logic Programs [Tamaddoni-Nezhad et al., 2006]
97
Conclusions
Extended BLPs to abductive plan recognition
 Enhance BLPs with logical abduction, resulting in
Bayesian Abductive Logic Programs (BALPs)
 BALPs outperform state-of-the-art approaches to plan
recognition on several benchmark data sets
Extended BLPs to machine reading
 Preliminary results on the IC data set are encouraging
 Perform an extensive evaluation in immediate future
 Develop a first-order rule learner that learns from
incomplete and uncertain data in future
98
Questions
99
Backup
100
Timeline
 Additional experiments comparing MLN-HCAM and
BALPs for plan recognition [Target completion – August
2011]
 Use Sample Search to perform inference in MLN-HCAM
 Evaluation of BLPs for Machine Reading [Target
completion – November 2011]
 Learn parameters automatically
 Evaluation using Amazon Mechanical Turk
 Evaluation of BLPs for answering queries
 Comparison of BLPs with MLNs for machine reading
 Developing structure learner from positive and uncertain
examples [Target completion – May 2012]
 Proposed defense and graduation [August 2012]
101
Completeness in First-order Logic
Completeness - if a sentence is entailed by a KB, then it is possible to find a proof that it is entailed
Entailment in first-order logic is semi-decidable: if a sentence is entailed by a KB, a proof will eventually be found, but if it is not, the search may never terminate
Resolution is refutation-complete in first-order logic
 If a set of sentences is unsatisfiable, then it is possible to derive a contradiction
102
Forward chaining
For every implication p ⇒ q, if p is true, then q is concluded to be true
Results in addition of a new fact to the KB
Efficient, but incomplete
Inference can explode and forward chaining may never terminate
 Addition of new facts might result in more rules being satisfied
It is data-driven, not goal-driven
 Might result in irrelevant conclusions
103
Backward chaining
For a query literal q, if an implication p ⇒ q is present and p is true, then q is concluded to be true; otherwise backward chaining tries to prove p
Efficient, but not complete
May never terminate; might get stuck in an infinite loop
Exponential in the worst case
Goal-driven
104
Herbrand Model Semantics
Herbrand universe
 All constants in the domain
Herbrand base
 All ground atoms over the Herbrand universe
Herbrand interpretation
 A set of ground atoms from Herbrand base that are true
Herbrand model
 Herbrand interpretation that satisfies all clauses in the
knowledge base
105
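A tiny sketch enumerating a Herbrand base from constants and predicate arities (the encoding is an illustrative assumption):

# Herbrand base sketch: all ground atoms over the domain's constants.
from itertools import product

constants = ["james", "yorkshire"]
predicates = {"neighborhood": 1, "lives": 2}

herbrand_base = [(p,) + args
                 for p, arity in predicates.items()
                 for args in product(constants, repeat=arity)]
print(herbrand_base)
# [('neighborhood', 'james'), ('neighborhood', 'yorkshire'),
#  ('lives', 'james', 'james'), ('lives', 'james', 'yorkshire'), ...]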
Advantages of SRL models
over vanilla probabilistic models
Compactly represent domain knowledge in first-order logic
Employ logical inference to construct ground
networks
Enables parameter sharing
106
Parameter sharing in SRL
Without sharing, each grounding has its own parameter:
father(john) → parent(john)  [θ1]
father(mary) → parent(mary)  [θ2]
father(alice) → parent(alice)  [θ3]
107
Parameter sharing in SRL
With the first-order rule father(X) ⇒ parent(X), all groundings share a single parameter θ:
father(john) → parent(john)  [θ]
father(mary) → parent(mary)  [θ]
father(alice) → parent(alice)  [θ]
108
Noisy-and Model
Several causes ci have to occur simultaneously if event e has to occur
 ci fails to trigger e with probability pi
 inh accounts for some unknown cause due to which e has failed to trigger
P(e) = (1 – inh) Πi (1 – pi)^(1 – ci)
109
Noisy-or Model
Several causes ci can each cause event e to occur
 ci independently triggers e with probability pi
 leak accounts for some unknown cause due to which e has triggered
P(e) = 1 – [(1 – leak) Πi (1 – pi)^ci]
Models explaining away
 If there are several causes of an event, and if there is evidence for one of the causes, then the probability that the other causes have caused the event goes down
110
Noisy-and And Noisy-or Models
Ground network for the alarm example:
neighborhood(james) → burglary(james) → alarm(james)
lives(james,yorkshire), tornado(yorkshire) → alarm(james)
111
Noisy-and And Noisy-or Models
Each clause body is first combined with noisy/logical-and into a dummy node (dummy1 for the burglary clause, dummy2 for the lives/tornado clause); the dummy nodes are then combined with noisy-or to produce alarm(james)
112
Inference in BLPs
[Kersting and De Raedt, 2001]
Logical inference
 Given a BLP and a query, SLD resolution is used to construct proofs for the query
Bayesian network construction
 Each ground atom is a random variable
 Edges are added from ground atoms in the body to the ground atom in head
 CPTs specified by the conditional probability distribution for the corresponding clause
 P(X) = Πi P(Xi | Pa(Xi))
Probabilistic inference
 Marginal probability given evidence
 Most Probable Explanation (MPE) given evidence
113
Learning in BLPs
[Kersting and De Raedt, 2008]
Parameter learning
 Expectation Maximization
 Gradient-ascent based learning
 Both approaches optimize likelihood
Structure learning
 Hill climbing search through the space of possible
structures
 Initial structure obtained from CLAUDIEN [De Raedt and
Dehaspe, 1997]
 Learns from only positive examples
114
Expectation Maximization for BLPs/BALPs
• Perform logical inference to construct a ground Bayesian network for each example
• Let r denote a rule, X denote a node, and Pa(X) denote the parents of X
• E Step: compute expected counts for each rule (update equation, from the SRL tutorial at ECML 07, shown as a figure); the inner sum is over all groundings of rule r
• M Step: re-estimate the parameters from the expected counts (equation shown as a figure)
115
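The E- and M-step equations on this slide were images; for tied parameters, the standard EM updates for Bayesian networks have the following shape (a reconstruction under that assumption, not a transcription of the slide):

E step:  ec_r(x, u) = Σ_examples Σ_{g ∈ groundings(r)} P(X_g = x, Pa(X_g) = u | evidence)
M step:  θ_r(x | u) = ec_r(x, u) / Σ_{x'} ec_r(x', u)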
Decomposable Combining Rules
Express different influences using separate
nodes
These nodes can be combined using a
deterministic function
116
Combining Rules
Each clause body is combined with logical-and into a dummy node (dummy1, dummy2); the dummy nodes are combined with noisy-or into alarm(james)
117
Decomposable Combining Rules
Each influence is first expressed in a separate node (one noisy-or node per dummy node, dummy1-new and dummy2-new); the separate nodes are then combined with a deterministic logical-or into alarm(james)
118
BLPs vs. PLPs
Differences in representation
 In BLPs, Bayesian atoms take values from a finite domain, but in PLPs each atom is logical in nature and can only take the values true or false
 Instead of having neighborhood(x) = bad, in PLPs we have neighborhood(x,bad)
 To compute the probability of a query alarm(james), PLPs have to construct one proof tree for all possible values of all predicates
 Inference is cumbersome
BLPs subsume PLPs
119
BLPs vs. Poole's Horn Abduction
Differences in representation
 For example, if P(x) and R(x) are two competing hypotheses, then either P(x) could be true or R(x) could be true
 Prior probabilities of P(x) and R(x) should sum to 1
 Restrictions of this kind are not present in BLPs
 BLPs are more flexible and have a richer representation
120
BLPs vs. PRMs
BLPs subsume PRMs
PRMs use entity-relationship models to
represent knowledge and they use KBMC-like
construction to construct a ground Bayesian
network
 Each attribute becomes a random variable in the ground
network and relations over the entities are logical
constraints
 In BLP, each attribute becomes a Bayesian atom and
relations become logical atoms
 Aggregator functions can be transformed into combining
rules
121
BLPs vs. RBNs
BLPs subsume RBNs
 In RBNs, each node in BN is a predicate and
probability formulae are used to specify
probabilities
 Combining rules can be used to represent these
probability formulae in BLPs.
122
BALPs vs. BLPs
Like BLPs, BALPs use logic programs as
templates for constructing Bayesian networks
Unlike BLPs, BALPs use logical abduction instead of deduction to construct the network
123
Monroe
[Blaylock and Allen, 2005]
Task
 Recognize top level plans in an emergency response
domain (artificially generated using HTN planner)
 Plans include set-up-shelter, clear-road-wreck, provide-medical-attention
 Single correct plan in each example
 Domain consists of several entities and sub-goals
 Test the ability to scale to large domains
Data
 Contains 1000 examples
124
Monroe
Methodology
 Knowledge base constructed based on the domain
knowledge encoded in planner
 Learn noisy-or parameters using EM
 Compute marginal probability for instances of top level
plans and pick the one with the highest marginal
probability
 Systems compared
o BALPs
o MLN-HCAM [Singla and Mooney, 2011]
o Blaylock and Allen’s system [Blaylock and Allen, 2005]
 Convergence score - measures the fraction of examples
for which the plan schema was predicted correctly
125
Learning Results - Monroe

            MW     MW-Start  Rand-Start
Conv Score  98.4   98.4      98.4
Acc-100     79.16  79.16     79.86
Acc-75      46.06  44.63     44.73
Acc-50      20.67  20.26     19.7
Acc-25      7.2    7.33      10.46
126
Linux
Task
[Blaylock and Allen, 2005]
 Recognize top level plans based on Linux commands
 Human users asked to perform tasks in Linux and
commands were recorded
 Top-level plans include find-file-by-ext, remove-file-by-ext,
copy-file, move-file
 Single correct plan in each example
 Tests the ability to handle noise in data
o Users indicate success even when they have not achieved the task correctly
o Some top-level plans like find-file-by-ext and find-file-by-name have identical actions
Data
 Contains 457 examples
127
Linux
Methodology
 Knowledge base constructed based on the knowledge of
Linux commands
 Learn noisy-or parameters using EM
 Compute marginal probability for instances of top level
plans and pick the one with the highest marginal
probability
 Systems compared
o BALPs
o MLN-HCAM [Singla and Mooney, 2011]
o Blaylock and Allen’s system [Blaylock and Allen, 2005]
 Convergence score - measures the fraction of examples
for which the plan schema was predicted correctly
128
Learning Results - Linux
Partial Observability

            MW     MW-Start  Rand-Start
Conv Score  39.82  46.6      41.57
129
Story Understanding
[Charniak and Goldman, 1991; Ng and Mooney, 1992]
Task
 Recognize character’s top-level plans based on actions described in narrative text
 Logical representation of actions provided as literals
 Top-level plans include shopping, robbing, restaurant
dining, partying
 Multiple top-level plans in each example
 Tests the ability to predict multiple plans
Data
 25 development examples
 25 test examples
130
Story Understanding
Methodology
 Knowledge base constructed for ACCEL by Ng and
Mooney [1992]
 Insufficient number of examples to learn parameters
o Noisy-or parameters set to 0.9
o Noisy-and parameters set to 0.9
o Priors tuned on development set
 Compute MPE to get the best set of plans
 Systems compared
o BALPs
o MLN-HCAM [Singla and Mooney, 2011]
o ACCEL-Simplicity [Ng and Mooney, 1992]
o ACCEL-Coherence [Ng and Mooney, 1992]
– Specific for Story Understanding
131
Other Applications of BALPs
Medical diagnosis
Textual entailment
Computational biology
 Inferring gene relations based on the output of micro-array
experiments
Any application that requires abductive
reasoning
132
ACCEL
[Ng and Mooney, 1992]
First-order logic based system for plan
recognition
Simplicity metric selects explanations that have
the least number of assumptions
Coherence metric selects explanations that
connect maximum number of observations
 Measures explanatory coherence
 Specific to text interpretation
133
System by Blaylock and Allen
[2005]
Statistical n-gram models to predict plans based
on observed actions
Performs plan recognition in two phases
 Predicts the plan schema first
 Predicts arguments based on the predicted schema
134
Machine Reading
Machine reading involves automatic extraction
of knowledge from natural language text
Approaches to machine reading
 Extract factual information like entities and relations
between entities that occur in text [Cohen, 1999; Craven et al., 2000;
Bunescu and Mooney, 2007; Etzioni et al, 2008]
 Extract commonsense knowledge about the domain [Nahm and Mooney, 2000; Carlson et al., 2010; Schoenmackers et al., 2010; Doppa et al., 2010]
135
Natural Language Text
Natural language text is not necessarily
complete
 Commonsense information is not explicitly stated in text
 Well known facts are omitted from the text
Missing information has to be inferred from text
 Human brain performs deeper inference using
commonsense knowledge
 IE systems have no access to commonsense knowledge
IE systems are limited to extracting information stated
explicitly in text
136
IC ontology
 57 entities
 79 relations
137
Estimated Precision
(preliminary results)

Model                               BLP-ALL       BLP-Pos-Neg  BLP-Pos-Only
Baseline
  No. facts inferred                38,151        1,510        18,501
  Estimated precision               24.94         16.00        31.80
                                    (951/3813)*   (12/75)*     (588/1849)*
BLP-0.9
  No. facts inferred                12,823        880          11,717
  Estimated precision               34.68         13.63        35.49
                                    (419/1208)*   (9/66)*      (411/1158)*
BLP-0.95
  No. of facts inferred             826           119          49
  Estimated precision               91.07         100          Nil
                                    (51/56)*      (1/1)*       (0/0)*

Total facts extracted – 12,254
* - x out of y inferred facts were correct; these inferred facts are from the 20 documents randomly sampled from the test set for manual evaluation
138
Estimated Recall
“Barack Obama is the current President of
USA……. Obama was born on August 4, 1961, in
Hawaii, USA……."
Extracted facts
nationState(USA)
hasBirthPlace(BarackObama,USA)
Person(BarackObama)
citizenOf(BarackObama,USA)
isLedBy(USA,BarackObama)
Rule
hasBirthPlace(A,B) ⇒ citizenOf(A,B)
141
Estimated Recall for citizenOf
Extracted facts
nationState(USA)
hasBirthPlace(BarackObama,USA)
Person(BarackObama)
citizenOf(BarackObama,USA)
isLedBy(USA,BarackObama)
hasBirthPlace(A,B) ⇒ citizenOf(A,B)
hasBirthPlace(BarackObama,USA) ⇒ citizenOf(BarackObama,USA)
142
Estimated Recall for hasBirthPlace
Extracted facts
nationState(USA)
hasBirthPlace(BarackObama,USA)
Person(BarackObama)
citizenOf(BarackObama,USA)
isLedBy(USA,BarackObama)
hasBirthPlace(A,B) ⇒ citizenOf(A,B)
?
143
Proposed Work
Learn first-order rules from incomplete and
uncertain data
 Facts extracted from natural language text are incomplete
o Existing rule/structure learners assume that data is
complete
 Extractions from IE system have associated confidence
scores
o Existing structure learners cannot use these extraction
probabilities
 Absence of negative instances
o Most rule learners require both positive and negative
instances
o Closed world assumption does not hold for machine
reading
144