Discriminative Learning for
Markov Logic Networks
Tuyen N. Huynh
Adviser: Prof. Raymond J. Mooney
PhD Proposal
October 9th, 2009
Some slides are taken from [Domingos, 2007], [Mooney, 2008]
Motivation
 Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors.
 Much real-world data is not i.i.d. and also cannot be effectively represented as feature vectors:
 Biochemical data
 Social network data
 Multi-relational data
 …
Biochemical data: predicting mutagenicity [Srinivasan et al., 1995]
Web-KB dataset [Slattery & Craven, 1998]
Characteristics of these structured data
 Contain multiple objects/entities and relationships among them
 There is a lot of uncertainty in the data:
 Uncertainty about the attributes of an object
 Uncertainty about the type of an object
 Uncertainty about the relationships between objects
Statistical Relational Learning (SRL)
 SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data.
 Some proposed SRL models:
 Stochastic Logic Programs (SLPs) [Muggleton, 1996]
 Probabilistic Relational Models (PRMs) [Friedman et al., 1999]
 Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001]
 Relational Markov Networks (RMNs) [Taskar et al., 2002]
 Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]
Discriminative learning
 Generative learning: learn a joint model over all variables.
 Discriminative learning: learn a conditional model of the output variables given the input variables.
 Directly learns a model for predicting the outputs
 Generally has better predictive performance on the outputs
 Most problems in structured/relational data are discriminative: make predictions based on some evidence (observable data).
  Discriminative learning is more suitable
Discriminative Learning for Markov Logic Networks
Outline
 Motivation
 Background
 Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008]
 Max-margin weight learning for MLNs [Huynh & Mooney, 2009]
 Future work
 Conclusion
First-Order Logic
 Constants: Anna, Bob
 Variables: x, y
 Functions: fatherOf(x)
 Predicates: Boolean-valued functions over objects, e.g., Smoke(x), Friends(x,y)
 Literals: a predicate or its negation
 Grounding: replace all variables by constants, e.g., Friends(Anna, Bob)
 World (model, interpretation): an assignment of truth values to all ground literals
First-Order Clauses
 Clause: a disjunction of literals
   ¬Smoke(x) v Cancer(x)
 Can be rewritten as a set of implication rules:
   Smoke(x) => Cancer(x)
   ¬Cancer(x) => ¬Smoke(x)
Markov Networks [Pearl, 1988]
 Undirected graphical models
 (Figure: a network over the variables Smoking, Cancer, Asthma, and Cough)
 Potential function: a function defined over a clique (a complete sub-graph), e.g. over {Smoking, Cancer}:

   Smoking | Cancer | Φ(S,C)
   False   | False  | 4.5
   False   | True   | 4.5
   True    | False  | 2.7
   True    | True   | 4.5

   P(x) = \frac{1}{Z} \prod_c \phi_c(x_c), \qquad Z = \sum_x \prod_c \phi_c(x_c)
Markov Networks [Pearl, 1988]
 Undirected graphical models
 (Figure: the same network over Smoking, Cancer, Asthma, and Cough)
 Log-linear model:
   P(x) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(x) \right)
   where w_i is the weight of feature i and f_i(x) is feature i, e.g.
   f_1(Smoking, Cancer) = 1 if ¬Smoking v Cancer, and 0 otherwise;  w_1 = 1.5
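To make the log-linear form concrete, here is a minimal Python sketch (my own illustration, not from the proposal) that evaluates the single feature f_1 above with weight w_1 = 1.5 and normalizes over the four (Smoking, Cancer) configurations:

import itertools
import math

# Minimal sketch: a single feature f1 over (Smoking, Cancer) with weight 1.5,
# as on the slide; the other variables are ignored for simplicity.
def f1(smoking, cancer):
    # 1 if the clause (not Smoking) or Cancer holds, 0 otherwise
    return 1.0 if (not smoking) or cancer else 0.0

w1 = 1.5

def unnormalized(smoking, cancer):
    return math.exp(w1 * f1(smoking, cancer))

# Partition function Z sums over every configuration of the two variables.
Z = sum(unnormalized(s, c) for s, c in itertools.product([False, True], repeat=2))

for s, c in itertools.product([False, True], repeat=2):
    print(f"Smoking={s}, Cancer={c}: P = {unnormalized(s, c) / Z:.3f}")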
Markov Logic Networks [Richardson & Domingos, 2006]
 A set of weighted first-order clauses.
 A larger weight indicates a stronger belief that the clause should hold.
 The clauses are called the structure of the MLN.
 MLNs are templates for constructing Markov networks for a given set of constants.
 MLN example: Friends & Smokers
   1.5  ∀x Smokes(x) => Cancer(x)
   1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))
Example: Friends & Smokers
 1.5  ∀x Smokes(x) => Cancer(x)
 1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))
 Two constants: Anna (A) and Bob (B)
 Ground atoms: Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B)
 (Figure: the ground Markov network constructed over these atoms)
Probability of a possible world
   P(X = x) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(x) \right), \qquad Z = \sum_x \exp\left( \sum_i w_i n_i(x) \right)
 where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in the possible world x
 A possible world becomes exponentially less likely as the total weight of the ground clauses it violates increases (see the sketch below).
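To make w_i and n_i(x) concrete, the following Python sketch (my own illustration, not part of the proposal) enumerates the tiny Friends & Smokers domain with constants A and B, counts the true groundings of the two clauses in each world, and normalizes by brute force:

import itertools
import math

CONSTANTS = ["A", "B"]
W1, W2 = 1.5, 1.1   # clause weights from the Friends & Smokers example

def n1(world):
    # number of true groundings of: Smokes(x) => Cancer(x)
    return sum((not world["Smokes", x]) or world["Cancer", x] for x in CONSTANTS)

def n2(world):
    # number of true groundings of: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not world["Friends", x, y]) or (world["Smokes", x] == world["Smokes", y])
               for x, y in itertools.product(CONSTANTS, repeat=2))

def score(world):
    return W1 * n1(world) + W2 * n2(world)

atoms = ([("Smokes", c) for c in CONSTANTS] + [("Cancer", c) for c in CONSTANTS] +
         [("Friends", x, y) for x, y in itertools.product(CONSTANTS, repeat=2)])
worlds = [dict(zip(atoms, values))
          for values in itertools.product([False, True], repeat=len(atoms))]

Z = sum(math.exp(score(w)) for w in worlds)   # brute-force partition function
best = max(worlds, key=score)                 # most probable world
print(math.exp(score(best)) / Z)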
Inference in MLNs
 MAP/MPE inference: find the most likely state of all unknown ground atoms given the evidence
   y_{MAP} = \arg\max_{y \in Y} P(y \mid x) = \arg\max_{y \in Y} \sum_i w_i n_i(x, y)
 MaxWalkSAT algorithm [Kautz et al., 1997] (sketched below)
 Cutting Plane Inference algorithm [Riedel, 2008]
 Computing the marginal conditional probability of a set of ground atoms, P(Y = y | x):
 MC-SAT algorithm [Poon & Domingos, 2006]
 Lifted first-order belief propagation [Singla & Domingos, 2008]
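The slide only names the inference algorithms; below is a rough MaxWalkSAT-style sketch in Python (my paraphrase of the idea in [Kautz et al., 1997], not the authors' implementation), where a clause is a weight paired with a list of signed ground atoms:

import random

# A clause is (weight, [(atom, is_positive), ...]); a state maps atom -> bool.
def satisfied(lits, state):
    return any(state[atom] == positive for atom, positive in lits)

def unsat_weight(clauses, state):
    return sum(w for w, lits in clauses if not satisfied(lits, state))

def max_walk_sat(clauses, atoms, max_flips=10000, p=0.5):
    state = {a: random.random() < 0.5 for a in atoms}   # random initial world
    best = dict(state)
    for _ in range(max_flips):
        unsat = [lits for w, lits in clauses if not satisfied(lits, state)]
        if not unsat:
            break
        lits = random.choice(unsat)          # pick an unsatisfied clause
        if random.random() < p:              # random-walk step
            atom = random.choice(lits)[0]
        else:                                # greedy step: best flip within the clause
            atom = min((a for a, _ in lits),
                       key=lambda a: unsat_weight(clauses, {**state, a: not state[a]}))
        state[atom] = not state[atom]
        if unsat_weight(clauses, state) < unsat_weight(clauses, best):
            best = dict(state)
    return best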
Existing structure learning methods for MLNs
 Top-down approach: MSL [Kok & Domingos, 2005], [Biba et al., 2008]
 Start from unit clauses and search for new clauses
 Bottom-up approach: BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009]
 Use the data to generate candidate clauses
Existing weight learning methods for MLNs
 Generative: maximize the (pseudo) log-likelihood [Richardson & Domingos, 2006]
 Discriminative: maximize the conditional log-likelihood (CLL)
 [Singla & Domingos, 2005]: structured perceptron [Collins, 2002]
 [Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL; found that Preconditioned Scaled Conjugate Gradient (PSCG) performs best
Outline
 Motivation
 Background
 Discriminative learning for MLNs with non-recursive clauses
 Max-margin weight learning for MLNs
 Future work
 Conclusion
Drug design for Alzheimer's disease
 Compare different analogues of the drug Tacrine for Alzheimer's disease on four biochemical properties:
 Maximization of the inhibition of amine re-uptake
 Minimization of toxicity
 Maximization of acetylcholinesterase inhibition
 Maximization of the reversal of scopolamine-induced memory impairment
 (Figure: the Tacrine drug and the template for the proposed drugs)
Inductive Logic Programming
 Uses first-order logic to represent background knowledge and examples
 Automated learning of logical rules from examples and background knowledge
Inductive Logic Programming systems
 GOLEM [Muggleton and Feng, 1992]
 FOIL [Quinlan, 1993]
 PROGOL [Muggleton, 1995]
 CHILLIN [Zelle and Mooney, 1996]
 ALEPH [Srinivasan, 2001]
Inductive Logic Programming example [King et al., 1995]
Results with existing learning methods for MLNs
 Average accuracy:

 Data set          | MLN1*       | MLN2**      | ALEPH
 Alzheimer amine   | 50.1 ± 0.5  | 51.3 ± 2.5  | 81.6 ± 5.1
 Alzheimer toxic   | 54.7 ± 7.4  | 51.7 ± 5.3  | 81.7 ± 4.2
 Alzheimer acetyl  | 48.2 ± 2.9  | 55.9 ± 8.7  | 79.6 ± 2.2
 Alzheimer memory  | 50.0 ± 0.0  | 49.8 ± 1.6  | 76.0 ± 4.9

 *MLN1: MSL + PSCG    **MLN2: BUSL + PSCG

 What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate
  New discriminative learning methods for MLNs
Proposed approach
 Step 1: Discriminative structure learning (a clause learner generates candidate clauses)
 Step 2: Discriminative weight learning (selecting good clauses)
Discriminative structure learning
 Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses:
 Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause (a sketch follows below)
 Keep all the clauses having an m-estimate greater than a pre-defined threshold (0.6), instead of keeping only the final theory produced by ALEPH
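A hedged sketch of the kind of m-estimate score this step could compute (the value of m and the prior are my assumptions; the proposal only cites [Dzeroski, 1991] and the 0.6 threshold):

def m_estimate(pos_covered, neg_covered, m=2.0, prior=0.5):
    """Bayesian estimate of a clause's accuracy from the positive and negative
    examples it covers; with m=2 and prior=0.5 this is the Laplace estimate."""
    return (pos_covered + m * prior) / (pos_covered + neg_covered + m)

# Example: a candidate clause covering 45 positive and 5 negative examples.
print(m_estimate(45, 5))   # about 0.885, above the 0.6 threshold, so it is kept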
Facts:
 r_subst_1(A1,H)   r_subst_1(B1,H)   r_subst_1(D1,H)
 x_subst(B1,7,CL)   x_subst(HH1,6,CL)   x_subst(D1,6,OCH3)
 polar(CL,POLAR3)   polar(OCH3,POLAR2)   great_polar(POLAR3,POLAR2)
 size(CL,SIZE1)   size(OCH3,SIZE2)   great_size(SIZE2,SIZE1)
 alk_groups(A1,0)   alk_groups(B1,0)   alk_groups(D1,0)   alk_groups(HH1,1)
 flex(CL,FLEX0)   flex(OCH3,FLEX1)
 less_toxic(A1,D1)   less_toxic(B1,D1)   less_toxic(HH1,A1)

   ↓ ALEPH++

Candidate clauses:
 x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
 alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
 x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
 ….

They are all non-recursive clauses.
Discriminative weight learning
 Maximize the CLL with L1-regularization:
 Use exact inference instead of approximate inference
 Use L1-regularization instead of L2-regularization
Exact inference
 Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, i.e. the probability of a query atom being true or false depends only on the evidence (see the sketch below).
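A minimal sketch of what this buys us, assuming every learned clause has the form Body => QueryAtom with the query predicate appearing exactly once: only the groundings whose body is satisfied by the evidence change the true-grounding counts between the two states of a query atom, so its conditional probability is just a logistic function and no approximate inference is needed:

import math

def prob_query_true(clause_weights, satisfied_body_counts):
    """clause_weights[i]: weight of clause i (Body_i => query atom).
    satisfied_body_counts[i]: number of groundings of clause i containing this
    query atom whose body is satisfied by the evidence."""
    s = sum(w * n for w, n in zip(clause_weights, satisfied_body_counts))
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid of the weighted counts

# Example with the two non-zero weighted clauses shown two slides below
# (the counts here are assumed for illustration).
print(prob_query_true([0.34487, 2.70323], [1, 0]))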
L1-regularization
 Put a Laplacian prior with zero mean on each weight w_i:
   P(w_i) = \frac{b}{2} \exp(-b |w_i|)
 L1-regularization ignores irrelevant features by setting their weights to zero [Ng, 2004]
 A larger value of the regularization parameter b corresponds to a smaller variance of the prior distribution
CLL with L1-regularization
 This is a convex but non-smooth optimization problem (the objective is sketched below)
 Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve the optimization problem
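For concreteness, the objective being optimized is, as implied by the Laplacian prior on the previous slide (my reconstruction; b is the regularization parameter):

\max_{\mathbf{w}} \;\; \log P(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \;-\; b \sum_i |w_i|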
Facts:
 r_subst_1(A1,H)   r_subst_1(B1,H)   r_subst_1(D1,H)
 x_subst(B1,7,CL)   x_subst(HH1,6,CL)   x_subst(D1,6,OCH3)
 …

Candidate clauses:
 alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
 x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
 x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
 ….

   ↓ L1 weight learner

Weighted clauses:
 0        x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)
 0.34487  alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
 2.70323  x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
 ….
Experiments
Datasets

 Dataset           | # Examples | % Pos. examples | # Predicates
 Alzheimer amine   | 686        | 50%             | 30
 Alzheimer toxic   | 886        | 50%             | 30
 Alzheimer acetyl  | 1326       | 50%             | 30
 Alzheimer memory  | 642        | 50%             | 30
Methodology
 10-fold cross-validation
 Metric: average predictive accuracy over the 10 folds
Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods?
 (Figure: average accuracy)
Q2: The effect of L1-regularization
 (Figure: number of clauses learned with and without L1-regularization)
Q2: The effect of L1-regularization (cont.)
 (Figure: average accuracy with and without L1-regularization)
Q3: The benefit of collective inference
 Adding a transitive clause with infinite weight to the learned MLNs:
   less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c).
 (Figure: average accuracy with and without the transitive clause)
Q4: The performance of our approach against other “advanced ILP” methods
 (Figure: average accuracy)
Outline
 Motivation
 Background
 Discriminative learning for MLNs with non-recursive clauses
 Max-margin weight learning for MLNs
 Future work
 Conclusion
Motivation
 All of the existing training methods for MLNs learn a model that produces good predictive probabilities
 In many applications, the actual goal is to optimize some application-specific performance measure, such as F1 score (the harmonic mean of precision and recall)
 Max-margin training methods, especially Structural Support Vector Machines (SVMs), provide a framework for optimizing such application-specific measures
  Train MLNs under the max-margin framework
Generic Structural SVMs [Tsochantaridis et al., 2004]
 Learn a discriminant function f: X × Y → R:
   f(x, y; w) = w^T \Phi(x, y)
 Predict for a given input x:
   h(x; w) = \arg\max_{y \in Y} w^T \Phi(x, y)
 Maximize the separation margin:
   \gamma(x, y; w) = w^T \Phi(x, y) - \max_{y' \in Y \setminus y} w^T \Phi(x, y')
 Can be formulated as a quadratic optimization problem
Generic Structural SVMs (cont.)
 [Joachims et al., 2009] proposed the 1-slack formulation of the Structural SVM:
   \min_{w, \xi \ge 0} \; \frac{1}{2} w^T w + C \xi
   \text{s.t.} \; \forall (\bar{y}_1, \ldots, \bar{y}_n) \in Y^n: \;
   \frac{1}{n} w^T \sum_{i=1}^{n} [\Phi(x_i, y_i) - \Phi(x_i, \bar{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi
 Makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] run faster and scale better
Cutting plane algorithm for solving the 1-slack Structural SVM
 Structural SVM problem:
 Exponentially many constraints
 Most are dominated by a small set of “important” constraints
 Cutting plane algorithm (a sketch follows below):
 Repeatedly finds the next most violated constraint…
 …until no new constraint can be found
 *Slide credit: Yisong Yue
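A rough Python sketch of the 1-slack cutting-plane loop (the helpers solve_qp and separation_oracle are hypothetical placeholders: the first optimizes w and ξ over the current working set of constraints, the second returns the most violated joint constraint and the amount by which it is violated):

def cutting_plane(examples, solve_qp, separation_oracle, C, eps, max_iters=1000):
    working_set = []                        # current set of "important" constraints
    w, xi = None, 0.0
    for _ in range(max_iters):
        w, xi = solve_qp(working_set, C)    # QP restricted to the working set
        constraint, violation = separation_oracle(w, examples)
        if violation <= xi + eps:           # nothing violated by more than eps: done
            break
        working_set.append(constraint)      # add the most violated constraint
    return w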
Applying the generic Structural SVM to a new problem
 Representation: Φ(x, y)
 Loss function: Δ(y, y')
 Algorithms to compute:
 Prediction:
   \hat{y} = \arg\max_{y \in Y} w^T \Phi(x, y)
 Most violated constraint: the separation oracle [Tsochantaridis et al., 2004] or loss-augmented inference [Taskar et al., 2005]:
   \hat{y} = \arg\max_{\bar{y} \in Y} \{ w^T \Phi(x, \bar{y}) + \Delta(y, \bar{y}) \}
Max-margin Markov Logic Networks
 Maximize the ratio:
   \frac{P(y \mid x)}{P(\hat{y} \mid x)} = \frac{\exp\left(\sum_i w_i n_i(x, y)\right)}{\exp\left(\sum_i w_i n_i(x, \hat{y})\right)}, \qquad \hat{y} = \arg\max_{\bar{y} \in Y \setminus y} P(\bar{y} \mid x)
 Equivalent to maximizing the separation margin:
   \gamma(x, y; w) = w^T n(x, y) - w^T n(x, \hat{y}) = w^T n(x, y) - \max_{\bar{y} \in Y \setminus y} w^T n(x, \bar{y})
 Joint feature: Φ(x, y) = n(x, y)
 Can be formulated as a 1-slack Structural SVM (sketched below)
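Written out for a single training example (x, y), the resulting 1-slack program looks roughly like this (my reconstruction, using the true-grounding counts n(x, y) as the joint feature map):

\min_{\mathbf{w}, \xi \ge 0} \;\; \tfrac{1}{2}\,\mathbf{w}^\top \mathbf{w} + C\,\xi
\quad \text{s.t.} \quad \forall \bar{y} \in \mathcal{Y}:\;
\mathbf{w}^\top \big[ n(x, y) - n(x, \bar{y}) \big] \;\ge\; \Delta(y, \bar{y}) - \xi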
Problems that need to be solved
 MPE inference:
   \hat{y} = \arg\max_{y' \in Y} w^T n(x, y')
 Loss-augmented MPE inference:
   \hat{y} = \arg\max_{y' \in Y} [\Delta(y, y') + w^T n(x, y')]
 Problem: exact MPE inference in MLNs is intractable
 Solution: approximate inference via relaxation methods [Finley et al., 2008]
Relaxation MPE inference for MLNs
 There is much work on approximating Weighted MAX-SAT via Linear Programming (LP) relaxation [Goemans and Williamson, 1994], [Asano and Williamson, 2002], [Asano, 2006]:
 Convert the problem into an Integer Linear Programming (ILP) problem
 Relax the integer constraints to linear constraints
 Round the LP solution by some randomized procedure
 These methods assume the weights are finite and positive
Relaxation MPE inference for MLNs (cont.)
 Translate the MPE inference in a ground MLN into an ILP problem:
 Convert all the ground clauses into clausal form
 Assign a binary variable y_i to each unknown ground atom and a binary variable z_j to each non-deterministic ground clause
 Translate each ground clause into linear constraints over the y_i's and z_j's
Relaxation MPE inference for MLNs (cont.)
 Ground MLN:
   3    InField(B1,Fauthor,P01)
   0.5  InField(B1,Fauthor,P01) v InField(B1,Fvenue,P01)
   -1   InField(B1,Ftitle,P01) v InField(B1,Fvenue,P01)
        !InField(B1,Fauthor,P01) v !InField(B1,Ftitle,P01).
        !InField(B1,Fauthor,P01) v !InField(B1,Fvenue,P01).
        !InField(B1,Ftitle,P01) v !InField(B1,Fvenue,P01).
 Translated ILP problem, with y1, y2, y3 standing for the Fauthor, Fvenue, and Ftitle atoms:
   \max_{y, z} \; 3 y_1 + 0.5 z_1 + z_2
   \text{s.t.} \; y_1 + y_2 \ge z_1
   (1 - y_2) \ge z_2, \quad (1 - y_3) \ge z_2
   (1 - y_1) + (1 - y_2) \ge 1
   (1 - y_1) + (1 - y_3) \ge 1
   (1 - y_2) + (1 - y_3) \ge 1
   y_i, z_j \in \{0, 1\}
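As a concrete illustration (my own sketch, not the authors' implementation), the LP relaxation of the ILP above can be solved with scipy; the variables are ordered [y1, y2, y3, z1, z2] and the integrality constraints are replaced by the box [0, 1]:

import numpy as np
from scipy.optimize import linprog

# linprog minimizes, so negate the objective max 3*y1 + 0.5*z1 + 1*z2.
c = -np.array([3.0, 0.0, 0.0, 0.5, 1.0])

# A_ub @ x <= b_ub encodes the clause constraints and the hard mutual exclusions.
A_ub = np.array([
    [-1, -1,  0, 1, 0],   # z1 <= y1 + y2
    [ 0,  1,  0, 0, 1],   # z2 <= 1 - y2
    [ 0,  0,  1, 0, 1],   # z2 <= 1 - y3
    [ 1,  1,  0, 0, 0],   # y1 + y2 <= 1
    [ 1,  0,  1, 0, 0],   # y1 + y3 <= 1
    [ 0,  1,  1, 0, 0],   # y2 + y3 <= 1
], dtype=float)
b_ub = np.array([0, 1, 1, 1, 1, 1], dtype=float)

# Relax the integrality constraints {0,1} to the box [0,1].
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 5, method="highs")
print(result.x)   # fractional components are then rounded (see the next slide)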
Relaxation MPE inference for MLNs (cont.)
 LP-relaxation: relax the integer constraints {0,1} to linear constraints [0,1] (as in the LP sketch above)
 Adapt the ROUNDUP procedure [Boros and Hammer, 2002] to round the solution of the LP problem:
 Pick a non-integral component and round it in each step (a sketch follows below)
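A rough rounding sketch in the spirit of that step (the objective argument is a hypothetical helper that scores a complete assignment; this is my simplification, not the exact ROUNDUP procedure):

def round_lp_solution(x, objective, tol=1e-6):
    """Repeatedly pick a fractional component and fix it to whichever of 0 or 1
    gives the higher objective, leaving all other components unchanged."""
    x = list(x)
    while True:
        fractional = [i for i, v in enumerate(x) if tol < v < 1.0 - tol]
        if not fractional:
            return [int(round(v)) for v in x]
        i = fractional[0]
        x_zero, x_one = list(x), list(x)
        x_zero[i], x_one[i] = 0.0, 1.0
        x = x_one if objective(x_one) >= objective(x_zero) else x_zero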
Loss-augmented LP-relaxation MPE inference
 Represent the loss function as a linear function of the y_i's:
   \Delta_{Hamming}(y^T, y) = \sum_{i: y_i^T = 0} y_i + \sum_{i: y_i^T = 1} (1 - y_i)
 Add the loss term to the objective of the LP-relaxation: the problem is still an LP and can be solved by the previous algorithm (a sketch follows below)
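A small sketch of how the Hamming loss turns into extra linear terms in the LP objective (y_true is the gold assignment over the ILP's y variables; my own illustration):

def hamming_loss_coeffs(y_true):
    """Return (constant, coeffs) such that loss(y) = constant + sum_i coeffs[i] * y[i]."""
    constant = sum(1 for t in y_true if t == 1)            # from the (1 - y_i) terms
    coeffs = [(-1.0 if t == 1 else 1.0) for t in y_true]   # -y_i for true atoms, +y_i for false ones
    return constant, coeffs

# These coefficients are simply added to the LP objective of the previous slides.
const, coeffs = hamming_loss_coeffs([1, 0, 0])
print(const, coeffs)   # 1 [-1.0, 1.0, 1.0]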
Experiments
Collective multi-label webpage classification
 WebKB dataset [Slattery and Craven, 1998] [Lowd and Domingos, 2007]
 4,165 web pages and 10,935 web links from 4 departments
 Each page is labeled with a subset of 7 categories: Course, Department, Faculty, Person, Professor, Research Project, Student
 MLN [Lowd and Domingos, 2007]:
   Has(+word,page) => PageClass(+class,page)
   ¬Has(+word,page) => PageClass(+class,page)
   PageClass(+c1,p1) ^ Linked(p1,p2) => PageClass(+c2,p2)
Collective multi-label webpage classification (cont.)
 Largest ground MLN for one department:
 8,876 query atoms
 174,594 ground clauses
Citation segmentation
 Citeseer dataset [Lawrence et al., 1999] [Poon and Domingos, 2007]
 1,563 citations, divided into 4 research topics
 Each citation is segmented into 3 fields: Author, Title, Venue
 Used the simplest MLN in [Poon and Domingos, 2007]
 Largest ground MLN for one topic:
 37,692 query atoms
 131,573 ground clauses
Experimental setup
 4-fold cross-validation
 Metric: F1 score
 Compare against the Preconditioned Scaled Conjugate Gradient (PSCG) algorithm
 Train with 5 different values of C (1, 10, 100, 1000, 10000) and test with the one that performs best on the training set
 Use Mosek to solve the QP and LP problems
F1 scores on WebKB
F1 scores on WebKB (cont.)
F1 scores on Citeseer
Sensitivity to the tuning parameter
Outline
 Motivation
 Background
 Discriminative learning for MLNs with non-recursive clauses
 Max-margin weight learning for MLNs
 Future work
 Conclusion
More efficient MPE inference
 Goal: avoid grounding the whole MLN
 Current solutions:
 Lazy inference [Singla & Domingos, 2006] exploits the sparsity of the domain
 Cutting Plane Inference [Riedel, 2008] exploits redundancies in the ground network
 Lifted inference [Singla & Domingos, 2008; Kersting et al., 2009] exploits symmetries in the ground network
 Challenge: combine the advantages of these algorithms to obtain a more efficient algorithm
More efficient weight learning
 Proposed approach: online max-margin weight learning
 Subgradient method [Ratliff & Zinkevich, 2007]
 Convert the problem into an unconstrained optimization problem
 Use the LP-relaxation MPE inference algorithm to compute the subgradients
 Passive-aggressive algorithms [Crammer et al., 2006]
 k-best MIRA algorithm [Crammer et al., 2005]
 Problem that needs to be solved: finding the k-best MPE solutions
Discriminative structure revision
 Goal: revise bad clauses in the model
 Problems that need to be solved:
 Detect bad clauses: clauses whose weights are near zero
 Diagnose bad clauses: use techniques from RTAMAR [Mihalkova et al., 2007]
 Revise bad clauses: use top-down beam search, stochastic local search [Paes et al., 2007], or a bottom-up approach [Duboc et al., 2008]
 Score candidate clauses in an efficient way
  Online structure learning and revision
Joint learning in NLP
 Jointly recognizing entities and relations in sentences [Roth & Yih, 2002]
 Construct an MLN with:
 Clauses that express the correlation between lexical and syntactic information and entity types
 Clauses that express the correlation between lexical and syntactic information and relation types
 Clauses that express the relationships among entities, among relations, and between entities and relations
 Challenge: learning the weights and doing inference with this complicated MLN
Joint learning in computer vision
 Problem: recognize both the objects and the scene of an image (e.g., an image containing Sky, Athlete, Tree, Horse, and Grass whose scene class is Polo [Li & Fei-Fei, 2007])
 Proposed approach:
 Use MLNs to combine the outputs of scene and object classifiers
 Learn an MLN that can detect both the objects and the scene
Conclusion
 We have presented two different discriminative learning methods for MLNs:
 A discriminative structure and weight learning method for MLNs with non-recursive clauses
 Max-margin weight learners
 We propose to:
 Develop more effective inference and weight learning methods
 Revise the clauses to improve the predictive accuracy
 Apply the system to joint learning problems in NLP and computer vision
Questions?
Thank you!