Discriminative Learning for Markov Logic Networks
Tuyen N. Huynh
Adviser: Prof. Raymond J. Mooney
PhD Proposal, October 9th, 2009
Some slides are taken from [Domingos, 2007] and [Mooney, 2008].

Motivation
- Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors.
- Most real-world data are not i.i.d. and cannot be effectively represented as feature vectors:
  - biochemical data
  - social network data
  - multi-relational data
  - ...

Biochemical data
- Predicting mutagenicity [Srinivasan et al., 1995]

Web-KB dataset [Slattery & Craven, 1998]

Characteristics of these structured data
- They contain multiple objects/entities and relationships among them.
- There is a lot of uncertainty in the data:
  - uncertainty about the attributes of an object
  - uncertainty about the type of an object
  - uncertainty about the relationships between objects

Statistical Relational Learning (SRL)
- SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data.
- Some proposed SRL models:
  - Stochastic Logic Programs (SLPs) [Muggleton, 1996]
  - Probabilistic Relational Models (PRMs) [Friedman et al., 1999]
  - Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001]
  - Relational Markov Networks (RMNs) [Taskar et al., 2002]
  - Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]

Discriminative learning
- Generative learning: learn a joint model over all variables.
- Discriminative learning: learn a conditional model of the output variables given the input variables:
  - directly learns a model for predicting the outputs
  - generally has better predictive performance on the outputs
- Most problems in structured/relational data are discriminative: make predictions based on some evidence (observable data).
- Discriminative learning is therefore more suitable for these problems.

Outline
- Motivation
- Background
- Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008]
- Max-margin weight learning for MLNs [Huynh & Mooney, 2009]
- Future work
- Conclusion

First-Order Logic
- Constants: Anna, Bob
- Variables: x, y
- Functions: fatherOf(x)
- Predicates: Boolean-valued functions, e.g. Smokes(x), Friends(x,y)
- Literal: a predicate or its negation
- Grounding: replace all variables by constants, e.g. Friends(Anna, Bob)
- World (model, interpretation): an assignment of truth values to all ground literals

First-Order Clauses
- A clause is a disjunction of literals: ¬Smokes(x) v Cancer(x)
- It can be rewritten as a set of implication rules:
  - Smokes(x) => Cancer(x)
  - ¬Cancer(x) => ¬Smokes(x)

Markov Networks [Pearl, 1988]
- Undirected graphical models (example graph: Smoking, Cancer, Asthma, Cough)
- Potential function: a function defined over a clique (a complete subgraph):
  P(x) = \frac{1}{Z} \prod_c \Phi_c(x_c),   Z = \sum_x \prod_c \Phi_c(x_c)
- Example potential over the clique {Smoking, Cancer}:

  Smoking  Cancer  Φ(S,C)
  False    False   4.5
  False    True    4.5
  True     False   2.7
  True     True    4.5

Markov Networks: log-linear model
  P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)
  where w_i is the weight of feature i and f_i is feature i.
- Example: f_1(Smoking, Cancer) = 1 if ¬Smoking v Cancer, and 0 otherwise; w_1 = 1.5

Markov Logic Networks [Richardson & Domingos, 2006]
- A set of weighted first-order clauses; a larger weight indicates a stronger belief that the clause should hold.
- The clauses are called the structure of the MLN.
- MLNs are templates for constructing Markov networks for a given set of constants.
- Example (Friends & Smokers):
  1.5  ∀x Smokes(x) => Cancer(x)
  1.1  ∀x,y Friends(x,y) => (Smokes(x) <=> Smokes(y))

Example: Friends & Smokers
- Two constants: Anna (A) and Bob (B).
- Grounding the predicates yields the nodes of the ground Markov network: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B).
- Each ground clause becomes a feature connecting the ground atoms it contains.

Probability of a possible world
  P(X = x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right),   Z = \sum_{x'} \exp\left(\sum_i w_i n_i(x')\right)
  where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in world x.
- A possible world becomes exponentially less likely as the total weight of the ground clauses it violates increases.
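To make this definition concrete, the following is a minimal sketch (our own toy code, not part of Alchemy or this proposal) that grounds the Friends & Smokers MLN above for the constants Anna and Bob and computes P(X = x) by brute-force enumeration of all 2^8 possible worlds. Only the two clause weights (1.5 and 1.1) come from the example; all function names and the chosen example world are illustrative assumptions.

```python
# Brute-force illustration of P(X = x) = (1/Z) exp(sum_i w_i n_i(x)) for the
# Friends & Smokers MLN with constants Anna and Bob.  Toy sketch only: real
# MLN systems never enumerate all possible worlds.
from itertools import product
from math import exp

CONSTS = ["Anna", "Bob"]
W1, W2 = 1.5, 1.1   # clause weights from the example

# Ground atoms of the ground Markov network (8 atoms -> 2^8 worlds).
ATOMS = ([("Smokes", a) for a in CONSTS] + [("Cancer", a) for a in CONSTS] +
         [("Friends", a, b) for a, b in product(CONSTS, repeat=2)])

def n1(world):  # true groundings of  Smokes(x) => Cancer(x)
    return sum((not world[("Smokes", a)]) or world[("Cancer", a)] for a in CONSTS)

def n2(world):  # true groundings of  Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not world[("Friends", a, b)]) or
               (world[("Smokes", a)] == world[("Smokes", b)])
               for a, b in product(CONSTS, repeat=2))

def unnormalized(world):
    return exp(W1 * n1(world) + W2 * n2(world))

worlds = [dict(zip(ATOMS, vals)) for vals in product([False, True], repeat=len(ATOMS))]
Z = sum(unnormalized(w) for w in worlds)          # partition function

x = {atom: True for atom in ATOMS}                # example world: everything true
print("P(x) =", unnormalized(x) / Z)
```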
Inference in MLNs
- MAP/MPE inference: find the most likely state of all unknown ground literals given the evidence:
  y_{MAP} = \arg\max_{y \in Y} P(y \mid x) = \arg\max_{y \in Y} \sum_i w_i n_i(x, y)
  - MaxWalkSAT algorithm [Kautz et al., 1997]
  - Cutting Plane Inference algorithm [Riedel, 2008]
- Computing the marginal conditional probability of a set of ground literals, P(Y = y | x):
  - MC-SAT algorithm [Poon & Domingos, 2006]
  - Lifted first-order belief propagation [Singla & Domingos, 2008]

Existing structure learning methods for MLNs
- Top-down approach: MSL [Kok & Domingos, 2005], [Biba et al., 2008]
  - Start from unit clauses and search for new clauses.
- Bottom-up approach: BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009]
  - Use the data to generate candidate clauses.

Existing weight learning methods for MLNs
- Generative: maximize the (pseudo) log-likelihood [Richardson & Domingos, 2006]
- Discriminative: maximize the conditional log-likelihood (CLL) [Singla & Domingos, 2005]
  - Structured perceptron [Collins, 2002]
  - [Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL; found that Preconditioned Scaled Conjugate Gradient (PSCG) performs best.

Drug design for Alzheimer's disease
- Compare different analogues of the drug Tacrine for Alzheimer's disease on four biochemical properties:
  - maximization of the inhibition of amine re-uptake
  - minimization of toxicity
  - maximization of acetyl cholinesterase inhibition
  - maximization of the reversal of scopolamine-induced memory impairment
(Figures: the Tacrine drug and the template for the proposed drugs)

Inductive Logic Programming (ILP)
- Uses first-order logic to represent background knowledge and examples.
- Automatically learns logic rules from examples and background knowledge.
- ILP systems: GOLEM [Muggleton & Feng, 1992], FOIL [Quinlan, 1993], PROGOL [Muggleton, 1995], CHILLIN [Zelle & Mooney, 1996], ALEPH [Srinivasan, 2001]
(Figure: an ILP example from [King et al., 1995])

Results with existing learning methods for MLNs (average accuracy)

  Data set           MLN1*        MLN2**       ALEPH
  Alzheimer amine    50.1 ± 0.5   51.3 ± 2.5   81.6 ± 5.1
  Alzheimer toxic    54.7 ± 7.4   51.7 ± 5.3   81.7 ± 4.2
  Alzheimer acetyl   48.2 ± 2.9   55.9 ± 8.7   79.6 ± 2.2
  Alzheimer memory   50.0 ± 0.0   49.8 ± 1.6   76.0 ± 4.9

  *MLN1: MSL + PSCG   **MLN2: BUSL + PSCG

- What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate.
- => New discriminative learning methods for MLNs are needed.

Proposed approach
- Step 1: discriminative structure learning (a clause learner generates candidate clauses).
- Step 2: discriminative weight learning (selects good clauses by weighting them).

Discriminative structure learning
- Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses:
  - Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause (a small illustrative sketch of this scoring rule follows below).
  - Keep all clauses with an m-estimate greater than a predefined threshold (0.6), instead of only the final theory produced by ALEPH.
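Here is a minimal sketch of m-estimate filtering. It is our own illustration, not the ALEPH++ code: the value of m, the class prior, and the toy clauses with their coverage counts are assumptions; only the 0.6 threshold comes from the text above.

```python
# m-estimate of clause accuracy: (p + m * prior) / (p + n + m), where p and n
# are the numbers of positive and negative examples the clause covers.
# The values of m and the class prior below are illustrative assumptions.
def m_estimate(p, n, prior=0.5, m=2.0):
    return (p + m * prior) / (p + n + m)

def keep_candidates(clauses, threshold=0.6):
    """Keep every clause whose m-estimate exceeds the threshold, rather than
    only the clauses in ALEPH's final theory."""
    return [c for c in clauses if m_estimate(c["pos"], c["neg"]) > threshold]

# Hypothetical coverage counts for two candidate clauses.
candidates = [
    {"clause": "alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)", "pos": 80, "neg": 20},
    {"clause": "x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)", "pos": 15, "neg": 30},
]
print([c["clause"] for c in keep_candidates(candidates)])
```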
Discriminative structure learning: example
- Facts: r_subst_1(A1,H), r_subst_1(B1,H), r_subst_1(D1,H), x_subst(B1,7,CL), x_subst(HH1,6,CL), x_subst(D1,6,OCH3), polar(CL,POLAR3), polar(OCH3,POLAR2), great_polar(POLAR3,POLAR2), size(CL,SIZE1), size(OCH3,SIZE2), great_size(SIZE2,SIZE1), alk_groups(A1,0), alk_groups(B1,0), alk_groups(D1,0), alk_groups(HH1,1), flex(CL,FLEX0), flex(OCH3,FLEX1), less_toxic(A1,D1), less_toxic(B1,D1), less_toxic(HH1,A1)
- ALEPH++ produces candidate clauses such as:
  x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
  alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
  x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
  ...
- All of the candidate clauses are non-recursive.

Discriminative weight learning
- Maximize the CLL with L1-regularization:
  - use exact inference instead of approximate inference
  - use L1-regularization instead of L2-regularization

Exact inference
- Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, so the probability of a query atom being true or false depends only on the evidence and can be computed exactly (see the sketch at the end of this subsection).

L1-regularization
- Place a zero-mean Laplacian prior on each weight: P(w_i) = (b/2) \exp(-b |w_i|)
- L1-regularization ignores irrelevant features by setting their weights to zero [Ng, 2004].
- A larger value of the regularization parameter b corresponds to a smaller variance of the prior distribution.

CLL with L1-regularization
- This is a convex but non-smooth optimization problem.
- Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve it.

Discriminative weight learning: example
- The facts and candidate clauses above are given to the L1 weight learner, which returns weighted clauses such as:
  0        x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)
  0.34487  alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
  2.70323  x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
  ...
- Clauses whose weights are driven to zero are effectively discarded.
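The sketch below illustrates why exact inference is cheap with non-recursive clauses and what the L1-regularized objective looks like: each query ground atom appears in its own set of ground clauses, so P(y_j = 1 | x) reduces to a logistic function of the difference in true-grounding counts, and the CLL decomposes over query atoms. This is our own toy illustration (the count matrix, labels, and regularization value are made up); the actual system optimizes this objective with OWL-QN [Andrew & Gao, 2007] rather than merely evaluating it.

```python
import numpy as np

def conditional_probs(w, dn):
    """P(y_j = 1 | x) for each query atom j, where dn[j, i] is
    n_i(x, y_j = 1) - n_i(x, y_j = 0), the change in the number of true
    groundings of clause i when atom j is flipped from false to true."""
    return 1.0 / (1.0 + np.exp(-(dn @ w)))

def neg_cll_l1(w, dn, y, b):
    """L1-regularized negative conditional log-likelihood (convex, non-smooth)."""
    p = conditional_probs(w, dn)
    cll = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return -cll + b * np.sum(np.abs(w))

# Tiny made-up instance: 3 query atoms, 2 candidate clauses.
dn = np.array([[1.0, 0.0],
               [1.0, 1.0],
               [0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0])
print(neg_cll_l1(np.array([0.3, -0.1]), dn, y, b=1.0))
```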
Experiments

Datasets

  Dataset            #Examples   %Pos. examples   #Predicates
  Alzheimer amine    686         50%              30
  Alzheimer toxic    886         50%              30
  Alzheimer acetyl   1326        50%              30
  Alzheimer memory   642         50%              30

Methodology
- 10-fold cross-validation
- Metric: average predictive accuracy over the 10 folds

Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods?
(Figure: average accuracy)

Q2: What is the effect of L1-regularization?
(Figures: number of clauses and average accuracy)

Q3: What is the benefit of collective inference?
- Add a transitive clause with infinite weight to the learned MLNs:
  less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c).
(Figure: average accuracy)

Q4: How does our approach perform against other "advanced ILP" methods?
(Figure: average accuracy)

Max-margin weight learning for MLNs: motivation
- All existing training methods for MLNs learn a model that produces good predictive probabilities.
- In many applications, the actual goal is to optimize an application-specific performance measure such as the F1 score (the harmonic mean of precision and recall).
- Max-margin training methods, especially structural support vector machines (SVMs), provide a framework for optimizing such application-specific measures.
- => Train MLNs under the max-margin framework.

Generic structural SVMs [Tsochantaridis et al., 2004]
- Learn a discriminant function f: X × Y → R:  f(x, y; w) = w^T \Phi(x, y)
- Prediction for a given input x:  h(x; w) = \arg\max_{y \in Y} w^T \Phi(x, y)
- Maximize the separation margin:
  \gamma(x, y; w) = w^T \Phi(x, y) - \max_{y' \in Y \setminus y} w^T \Phi(x, y')
- This can be formulated as a quadratic optimization problem.

Generic structural SVMs (cont.)
- [Joachims et al., 2009] proposed the 1-slack formulation of the structural SVM:
  \min_{w, \xi \ge 0} \frac{1}{2} w^T w + C \xi
  s.t. \forall (\hat{y}_1, \dots, \hat{y}_n) \in Y^n:
       \frac{1}{n} w^T \sum_{i=1}^{n} [\Phi(x_i, y_i) - \Phi(x_i, \hat{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi
- It makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] faster and more scalable.

Cutting-plane algorithm for solving the 1-slack structural SVM (*slide credit: Yisong Yue)
- The structural SVM problem has exponentially many constraints, but most of them are dominated by a small set of "important" constraints.
- The cutting-plane algorithm repeatedly finds the next most violated constraint and adds it to the working set, until no new (sufficiently violated) constraint can be found; a toy sketch of this loop is given below.
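The following is a self-contained toy sketch of that loop, written for a trivial structured problem (3-class classification with 0/1 loss) so that the loss-augmented argmax is easy to compute. It is not the thesis implementation, which uses the MLN joint feature n(x,y) and LP-relaxation inference; the dataset, the scipy-based restricted-QP solver, and all constants here are assumptions made for illustration.

```python
# 1-slack cutting-plane training [Joachims et al., 2009] on a toy problem:
# Phi(x, y) places x in the block of class y, Delta is 0/1 loss.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])
Y = np.array([0, 0, 1, 1, 2])
N_CLASSES, D, C, EPS = 3, 2, 10.0, 1e-3

def phi(x, y):
    f = np.zeros(N_CLASSES * D)
    f[y * D:(y + 1) * D] = x
    return f

def most_violated(w, x, y):
    # Loss-augmented inference: argmax_y' [ Delta(y, y') + w . Phi(x, y') ]
    return int(np.argmax([(yp != y) + w @ phi(x, yp) for yp in range(N_CLASSES)]))

def solve_restricted_qp(cuts):
    # min_{w, xi >= 0} 0.5 ||w||^2 + C*xi   s.t.  w . dphi >= delta - xi  for each cut
    obj = lambda v: 0.5 * v[:-1] @ v[:-1] + C * v[-1]
    cons = [{'type': 'ineq', 'fun': lambda v, d=d, dl=dl: v[:-1] @ d - dl + v[-1]}
            for d, dl in cuts]
    cons.append({'type': 'ineq', 'fun': lambda v: v[-1]})        # xi >= 0
    sol = minimize(obj, np.zeros(N_CLASSES * D + 1), constraints=cons).x
    return sol[:-1], sol[-1]

w, xi, cuts = np.zeros(N_CLASSES * D), 0.0, []
for _ in range(50):
    y_hat = [most_violated(w, x, y) for x, y in zip(X, Y)]
    # Aggregate the per-example terms into a single "1-slack" constraint.
    dphi = np.mean([phi(x, y) - phi(x, yh) for x, y, yh in zip(X, Y, y_hat)], axis=0)
    delta = np.mean([float(yh != y) for y, yh in zip(Y, y_hat)])
    if w @ dphi >= delta - xi - EPS:          # no sufficiently violated constraint left
        break
    cuts.append((dphi, delta))
    w, xi = solve_restricted_qp(cuts)

preds = [int(np.argmax([w @ phi(x, yp) for yp in range(N_CLASSES)])) for x in X]
print("training predictions:", preds, " true labels:", Y.tolist())
```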
Applying the generic structural SVM to a new problem
- Define the representation Φ(x, y) and the loss function Δ(y, y').
- Provide algorithms to compute:
  - the prediction:  \hat{y} = \arg\max_{y \in Y} w^T \Phi(x, y)
  - the most violated constraint, via a separation oracle [Tsochantaridis et al., 2004] or loss-augmented inference [Taskar et al., 2005]:
    \hat{y} = \arg\max_{y \in Y} [\Delta(y_i, y) + w^T \Phi(x, y)]

Max-margin Markov Logic Networks
- Maximize the ratio
  \frac{P(y \mid x)}{P(\hat{y} \mid x)} = \frac{\exp(\sum_i w_i n_i(x, y))}{\exp(\sum_i w_i n_i(x, \hat{y}))},   where \hat{y} = \arg\max_{y' \in Y \setminus y} P(y' \mid x)
- This is equivalent to maximizing the separation margin:
  \gamma(x, y; w) = w^T n(x, y) - w^T n(x, \hat{y}) = w^T n(x, y) - \max_{y' \in Y \setminus y} w^T n(x, y')
- The joint feature is Φ(x, y) = n(x, y), so the problem can be formulated as a 1-slack structural SVM.

Problems that need to be solved
- MPE inference:  \hat{y} = \arg\max_{y' \in Y} w^T n(x, y')
- Loss-augmented MPE inference:  \hat{y} = \arg\max_{y' \in Y} [\Delta(y, y') + w^T n(x, y')]
- Problem: exact MPE inference in MLNs is intractable.
- Solution: approximate inference via relaxation methods [Finley et al., 2008].

Relaxation MPE inference for MLNs
- There is much work on approximating weighted MAX-SAT via linear programming (LP) relaxation [Goemans & Williamson, 1994; Asano & Williamson, 2002; Asano, 2006]:
  - convert the problem into an integer linear programming (ILP) problem
  - relax the integer constraints to linear constraints
  - round the LP solution by some randomized procedure
- These methods assume the weights are finite and positive.

Relaxation MPE inference for MLNs (cont.)
- Translate MPE inference in a ground MLN into an ILP problem:
  - convert all the ground clauses into clausal form
  - assign a binary variable y_i to each unknown ground atom and a binary variable z_j to each non-deterministic ground clause
  - translate each ground clause into linear constraints over the y_i's and z_j's

Example: ground MLN
  3    InField(B1,Fauthor,P01)
  0.5  InField(B1,Fauthor,P01) v InField(B1,Fvenue,P01)
  -1   InField(B1,Ftitle,P01) v InField(B1,Fvenue,P01)
  !InField(B1,Fauthor,P01) v !InField(B1,Ftitle,P01).
  !InField(B1,Fauthor,P01) v !InField(B1,Fvenue,P01).
  !InField(B1,Ftitle,P01) v !InField(B1,Fvenue,P01).

Translated ILP problem (with y_1 = InField(B1,Fauthor,P01), y_2 = InField(B1,Fvenue,P01), y_3 = InField(B1,Ftitle,P01)):
  \max_{y,z} \; 3 y_1 + 0.5 z_1 - z_2
  s.t.  y_1 + y_2 \ge z_1
        (1 - y_2) + z_2 \ge 1,   (1 - y_3) + z_2 \ge 1
        (1 - y_1) + (1 - y_2) \ge 1
        (1 - y_1) + (1 - y_3) \ge 1
        (1 - y_2) + (1 - y_3) \ge 1
        y_i, z_j \in \{0, 1\}

Relaxation MPE inference for MLNs (cont.)
- LP relaxation: relax the integer constraints y_i, z_j ∈ {0,1} to the linear constraints 0 ≤ y_i, z_j ≤ 1 (a runnable sketch of this relaxation applied to the example above follows below).
- Adapt the ROUNDUP procedure [Boros & Hammer, 2002] to round the solution of the LP problem: in each step, pick a non-integral component and round it.

Loss-augmented LP-relaxation MPE inference
- Represent the loss function as a linear function of the y_i's, e.g. the Hamming loss:
  \Delta_{Hamming}(y^T, y) = \sum_{i: y^T_i = 0} y_i + \sum_{i: y^T_i = 1} (1 - y_i)
- Add the loss term to the objective of the LP relaxation; the problem is still an LP and can be solved by the same algorithm.
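To make the relaxation concrete, here is a minimal sketch (our own illustration, with scipy.optimize.linprog standing in for a commercial solver such as Mosek) that solves the LP relaxation of the small translated ILP above. The variable ordering and the conversion of the ">=" constraints to linprog's "<=" form are our choices; the weights and clauses are the ones in the example.

```python
# LP-relaxation MPE inference on the small ground MLN above.
# Variables are [y1, y2, y3, z1, z2]; linprog minimizes, so we negate the
# objective 3*y1 + 0.5*z1 - 1*z2, and write every "expr >= b" as "-expr <= -b".
import numpy as np
from scipy.optimize import linprog

c = -np.array([3.0, 0.0, 0.0, 0.5, -1.0])      # maximize 3 y1 + 0.5 z1 - z2

A_ub = np.array([
    [-1, -1,  0,  1,  0],   # y1 + y2 >= z1          (soft clause, weight 0.5)
    [ 0,  1,  0,  0, -1],   # z2 >= y2               (soft clause, weight -1)
    [ 0,  0,  1,  0, -1],   # z2 >= y3
    [ 1,  1,  0,  0,  0],   # (1-y1) + (1-y2) >= 1  ->  y1 + y2 <= 1   (hard)
    [ 1,  0,  1,  0,  0],   # y1 + y3 <= 1                             (hard)
    [ 0,  1,  1,  0,  0],   # y2 + y3 <= 1                             (hard)
])
b_ub = np.array([0, 0, 0, 1, 1, 1])

# The LP relaxation replaces y_i, z_j in {0,1} with 0 <= y_i, z_j <= 1.
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 5)
print("LP solution [y1, y2, y3, z1, z2]:", np.round(res.x, 3))

# A ROUNDUP-style pass would now pick each non-integral component in turn and
# round it without decreasing the objective; on this tiny example the LP
# optimum is already integral (y1 = 1 satisfies the unit clause of weight 3).
```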
Experiments

Collective multi-label webpage classification
- WebKB dataset [Slattery & Craven, 1998], as used in [Lowd & Domingos, 2007]:
  - 4,165 web pages and 10,935 web links from 4 departments
  - each page is labeled with a subset of 7 categories: Course, Department, Faculty, Person, Professor, Research Project, Student
- MLN [Lowd & Domingos, 2007]:
  Has(+word,page) => PageClass(+class,page)
  ¬Has(+word,page) => PageClass(+class,page)
  PageClass(+c1,p1) ^ Linked(p1,p2) => PageClass(+c2,p2)
- Largest ground MLN for one department: 8,876 query atoms and 174,594 ground clauses.

Citation segmentation
- CiteSeer dataset [Lawrence et al., 1999], as used in [Poon & Domingos, 2007]:
  - 1,563 citations, divided into 4 research topics
  - each citation is segmented into 3 fields: Author, Title, Venue
- We used the simplest MLN in [Poon & Domingos, 2007].
- Largest ground MLN for one topic: 37,692 query atoms and 131,573 ground clauses.

Experimental setup
- 4-fold cross-validation; metric: F1 score.
- Compare against the Preconditioned Scaled Conjugate Gradient (PSCG) algorithm.
- Train with 5 different values of C (1, 10, 100, 1000, 10000) and test with the one that performs best on the training data.
- Use Mosek to solve the QP and LP problems.

(Figures: F1 scores on WebKB, F1 scores on CiteSeer, and sensitivity to the tuning parameter C)

Future work

More efficient MPE inference
- Goal: avoid grounding the whole MLN.
- Current solutions:
  - lazy inference [Singla & Domingos, 2006] exploits the sparsity of the domain
  - CPI [Riedel, 2008] exploits redundancies in the ground network
  - lifted inference [Singla & Domingos, 2008; Kersting et al., 2009] exploits symmetries in the ground network
- Challenge: combine the advantages of these algorithms into a more efficient algorithm.

More efficient weight learning
- Proposed approach: online max-margin weight learning.
  - Subgradient method [Ratliff et al., 2007]: convert the problem into an unconstrained optimization and use the LP-relaxation MPE inference algorithm to compute the subgradients.
  - Passive-aggressive algorithms [Crammer et al., 2006] and the k-best MIRA algorithm [Crammer et al., 2005].
  - Problem that needs to be solved: finding the k-best MPE states.

Discriminative structure revision
- Goal: revise bad clauses in the model.
- Problems that need to be solved:
  - Detect bad clauses, e.g. clauses whose weights are near zero.
  - Diagnose bad clauses, e.g. using the techniques in RTAMAR [Mihalkova et al., 2007].
  - Revise bad clauses, using top-down beam search, stochastic local search [Paes et al., 2007], or a bottom-up approach [Duboc et al., 2008].
  - Score candidate clauses in an efficient way.
  - Online structure learning and revision.

Joint learning in NLP
- Jointly recognizing entities and relations in sentences [Roth & Yih, 2002].
- Construct an MLN with:
  - clauses that express the correlation between lexical and syntactic information and entity types
  - clauses that express the correlation between lexical and syntactic information and relation types
  - clauses that express the relationships among entities, among relations, and between entities and relations
- Challenge: learning weights and performing inference with this complicated MLN.

Joint learning in computer vision
(Figure: an image whose regions are labeled Sky, Athlete, Tree, Horse, Grass; scene class: Polo [Li & Fei-Fei, 2007])
- Problem: recognize both the objects and the scene of an image.
- Proposed approach:
  - use MLNs to combine the outputs of scene and object classifiers
  - learn an MLN that can detect both the objects and the scene

Conclusion
- We have presented two different discriminative learning methods for MLNs:
  - a discriminative structure and weight learning method for MLNs with non-recursive clauses
  - max-margin weight learners
- We propose to:
  - develop more effective inference and weight learning methods
  - revise the clauses to improve predictive accuracy
  - apply the system to joint learning problems in NLP and computer vision

Questions? Thank you!