Statistical Relational Learning for Knowledge Extraction from the Web
Hoifung Poon
Dept. of Computer Science & Eng., University of Washington

"Drowning in Information, Starved for Knowledge"
- The World Wide Web

Great Vision: Knowledge Extraction from the Web
- Craven et al., "Learning to Construct Knowledge Bases from the World Wide Web," Artificial Intelligence, 1999
- Also need: knowledge representation and reasoning
- Close the loop: apply knowledge to extraction
- Machine reading [Etzioni et al., 2007]

Machine Reading: Text → Knowledge

Rapidly Growing Interest
- AAAI-07 Spring Symposium on Machine Reading
- DARPA Machine Reading Program (2009-2014)
- NAACL-10 Workshop on Learning By Reading
- Etc.

Great Impact
- Scientific inquiry and commercial applications: literature-based discovery, robot scientists, question answering, semantic search, drug design, medical diagnosis
- Breach the knowledge acquisition bottleneck for AI and natural language understanding
- Automatically semantify the Web
- Etc.

This Talk
- Statistical relational learning offers promising solutions to machine reading
- Markov logic is a leading unifying framework
- A success story: USP
  - Unsupervised, end-to-end machine reading
  - Extracts five times as many correct answers as the state of the art, with the highest accuracy (91%)

USP: Question-Answer Example
- "Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells."
- Q: What does IL-2 control?
- A: The DEX-mediated IkappaBalpha induction

Overview
- Machine reading: Challenges
- Statistical relational learning
- Markov logic
- USP: Unsupervised Semantic Parsing
- Research directions

Key Challenges
- Complexity
- Uncertainty
- Pipeline accumulates errors
- Supervision is scarce

Languages Are Structural
- Morphology: govern-ment-s; l-m$px-t-m (Hebrew: "according to their families")
- Syntax: parse trees, e.g. "IL-4 induces CD11B" (S → NP VP, VP → V NP)
- Semantics: "Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 …" yields a nested event structure: involvement(Theme: up-regulation, Cause: activation), where up-regulation(Theme: IL-10, Cause: gp41, Site: human monocyte) and activation(Theme: p70(S6)-kinase)
- Discourse: "George Walker Bush was the 43rd President of the United States. … Bush was the eldest son of President G. H. W. Bush and Barbara Bush. … In November 1977, he met Laura Welch at a barbecue."

Knowledge Is Heterogeneous
- Individuals, e.g.: Socrates is a man
- Types, e.g.: man is mortal
- Inference rules, e.g.: syllogism
- Ontological relations, e.g.: HUMAN ISA MAMMAL, EYE ISPART FACE
- Etc.

Complexity
- Can be handled using first-order logic
- Trees, graphs, dependencies, hierarchies, etc. are easily expressed
- Inference algorithms exist (satisfiability testing, theorem proving, etc.)
- But … logic is brittle under uncertainty

Languages Are Ambiguous
- Syntactic ambiguity: "I saw the man with the telescope" (does the PP attach to the noun or the verb?)
- Paraphrase: "Microsoft buys Powerset" / "Microsoft acquires Powerset" / "Powerset is acquired by Microsoft Corporation" / "The Redmond software giant buys Powerset" / "Microsoft's purchase of Powerset, …"
- Entity ambiguity: "Here in London, Frances Deek is a retired teacher …" / "In the Israeli town …, Karen London says …" / "Now London says …": is "London" a PERSON or a LOCATION?
- Coreference: "G. W. Bush … Laura Bush … Mrs. Bush …": which one?

Knowledge Has Uncertainty
- We need to model correlations
- Our information is always incomplete
- Our predictions are uncertain

Uncertainty
- Statistics provides the tools to handle this: mixture models, hidden Markov models, Bayesian networks, Markov random fields, maximum entropy models, conditional random fields, etc.
- But … statistical models assume i.i.d. (independently and identically distributed) data, with objects represented as feature vectors

Pipeline Is Suboptimal
- E.g., the NLP pipeline: tokenization, morphology, chunking, syntax, …
- Accumulates and propagates errors
- Wanted: joint inference across all processing stages and among all interdependent objects

Supervision Is Scarce
- Tons of text … but most of it is not annotated
- Labeling is expensive (cf. the Penn Treebank)
- Need to leverage indirect supervision

Redundancy
- A key source of indirect supervision
- State-of-the-art systems depend on it, e.g. TextRunner [Banko et al., 2007]
- But … the Web is heterogeneous, with a long tail
- Redundancy is only present in the head regime

Overview: Machine reading: Challenges · Statistical relational learning · Markov logic · USP: Unsupervised Semantic Parsing · Research directions

Statistical Relational Learning
- A burgeoning field in machine learning
- Offers promising solutions for machine reading:
  - Unify statistical and logical approaches
  - Replace the pipeline with joint inference
  - Principled framework to leverage both direct and indirect supervision

Machine Reading: A Vision
- Challenge: the long tail

Challenges in Applying Statistical Relational Learning
- Learning is much harder
- Inference becomes a crucial issue
- Greater complexity for the user

Progress to Date
- Probabilistic logic [Nilsson, 1986]
- Statistics and beliefs [Halpern, 1990]
- Knowledge-based model construction [Wellman et al., 1992]
- Stochastic logic programs [Muggleton, 1996]
- Probabilistic relational models [Friedman et al., 1999]
- Relational Markov networks [Taskar et al., 2002]
- Markov logic [Domingos & Lowd, 2009]: the leading unifying framework
- Etc.
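The "pipeline accumulates errors" challenge above can be made concrete with a toy calculation (my illustration, not from the talk): if each stage of an NLP pipeline were independently correct with some probability, the chance that every stage is correct is the product of the per-stage accuracies, which decays quickly with pipeline depth.

```python
def end_to_end_accuracy(stage_accuracies):
    """Probability that every pipeline stage is correct, under the
    (simplifying) assumption that stage errors are independent."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

# Five stages (tokenization, morphology, chunking, syntax, semantics),
# each 95% accurate in isolation: only ~77% of inputs survive the whole
# pipeline with no error anywhere.
print(round(end_to_end_accuracy([0.95] * 5), 3))  # 0.774
```

Joint inference avoids this compounding by letting later evidence revise earlier decisions instead of freezing each stage's output.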
Overview: Machine reading · Statistical relational learning · Markov logic · USP: Unsupervised Semantic Parsing · Research directions

Markov Networks
- Undirected graphical models, e.g. over Smoking, Cancer, Asthma, Cough
- Log-linear model: P(x) = (1/Z) exp( sum_i w_i f_i(x) ), where w_i is the weight of feature i
- Example feature: f_1(Smoking, Cancer) = 1 if Smoking ⇒ Cancer, 0 otherwise; w_1 = 1.5

First-Order Logic
- Constants, variables, functions, predicates, e.g.: Anna, x, MotherOf(x), Friends(x,y)
- Grounding: replace all variables by constants, e.g.: Friends(Anna, Bob)
- World (model, interpretation): an assignment of truth values to all ground predicates

Markov Logic
- Intuition: soften logical constraints
- Syntax: weighted first-order formulas
- Semantics: feature templates for Markov networks
- A Markov logic network (MLN) is a set of pairs (F_i, w_i), where F_i is a formula in first-order logic and w_i is a real number
- P(x) = (1/Z) exp( sum_i w_i n_i(x) ), where n_i(x) is the number of true groundings of F_i in x

Example: Friends & Smokers
- Smoking causes cancer; friends have similar smoking habits:
  1.5  ∀x Smokes(x) ⇒ Cancer(x)
  1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))
- With two constants, Anna (A) and Bob (B), grounding yields a Markov network over Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
- Probabilistic graphical models and first-order logic are special cases

MLN Algorithms: The First Three Generations

Problem             First generation         Second generation  Third generation
MAP inference       Weighted satisfiability  Lazy inference     Cutting planes
Marginal inference  Gibbs sampling           MC-SAT             Lifted inference
Weight learning     Pseudolikelihood         Voted perceptron   Scaled conj. gradient
Structure learning  Inductive logic progr.   ILP + PL (etc.)    Clustering + pathfinding

Efficient Inference
- Logical or statistical inference is already hard
- But … approximate inference suffices to perform well in most cases
- Combine ideas from both camps, e.g. MC-SAT = MCMC + SAT solver
- More: Poon & Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies," in Proc. AAAI-2006
- Can also leverage sparsity in relational domains
- More: Poon, Domingos & Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC," in Proc. AAAI-2008

Weight Learning
- Probability model P(X), where X is observable in the training data
- Maximize the likelihood of the observed data, with regularization to prevent overfitting
- Gradient descent (requires inference):
  ∂ log P(x) / ∂w_i = n_i(x) − E[n_i(x)]
  i.e. the number of times clause i is true in the data, minus the expected number of times it is true according to the MLN
- Use MC-SAT for inference
- Can also leverage second-order information [Lowd & Domingos, 2007]

Unsupervised Learning: How?
- I.I.D. learning: a more sophisticated model requires more labeled data
- Statistical relational learning: a more sophisticated model may require less labeled data
  - Ambiguities vary among objects; joint inference propagates information from unambiguous objects to ambiguous ones
- One formula is worth a thousand labels
- A small amount of domain knowledge + large-scale joint inference

Unsupervised Weight Learning
- Probability model P(X, Z): X observed in the training data, Z hidden variables
- E.g. clustering with mixture models: Z is the cluster assignment, X the observed features; P(X, Z) = P(Z) P(X | Z)
- Maximize the likelihood of the observed data by summing out the hidden variables Z
- Gradient descent:
  ∂ log P(x) / ∂w_i = E_{z|x}[n_i(x,z)] − E_{x,z}[n_i(x,z)]
  where the first expectation sums over z conditioned on the observed x, and the second sums over both x and z
- Use MC-SAT to compute both expectations
- May also combine with contrastive estimation
- More: Poon, Cherry & Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models," in Proc. NAACL-2009 (Best Paper Award)

Markov Logic
- Unified inference and learning algorithms
- Can handle millions of variables, billions of features, tens of thousands of parameters
- Easy-to-use software: Alchemy
- Many successful applications, e.g.: information extraction, coreference resolution, semantic parsing, ontology induction

Pipeline → Joint Inference
- Combine segmentation and entity resolution for information extraction
- More: Poon & Domingos, "Joint Inference for Information Extraction," in Proc. AAAI-2007
- Extract complex and nested bio-events from PubMed abstracts
- More: Poon & Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature," in Proc. NAACL-2010

Unsupervised Learning: Example
- Coreference resolution: accuracy comparable to the previous supervised state of the art
- More: Poon & Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic," in Proc. EMNLP-2008
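To make the MLN semantics concrete, here is a minimal brute-force sketch (mine, not from the talk, and nothing like Alchemy's optimized inference) of the Friends & Smokers network with the two constants Anna and Bob: enumerate all 2^8 truth assignments to the eight ground atoms, weight each world by exp( sum_i w_i n_i(x) ), and read off a conditional probability.

```python
import math
from itertools import product

CONSTANTS = ["Anna", "Bob"]
W_SMOKING_CANCER = 1.5   # Smokes(x) => Cancer(x)
W_FRIENDS_ALIKE = 1.1    # Friends(x,y) => (Smokes(x) <=> Smokes(y))

def worlds():
    """Enumerate all truth assignments to the 8 ground atoms."""
    pairs = [(c, d) for c in CONSTANTS for d in CONSTANTS]
    for bits in product([False, True], repeat=8):
        smokes = dict(zip(CONSTANTS, bits[:2]))
        cancer = dict(zip(CONSTANTS, bits[2:4]))
        friends = dict(zip(pairs, bits[4:]))
        yield smokes, cancer, friends

def weight(world):
    """exp(sum_i w_i * n_i(x)), where n_i counts true groundings of formula i."""
    smokes, cancer, friends = world
    n1 = sum((not smokes[c]) or cancer[c] for c in CONSTANTS)
    n2 = sum((not friends[(c, d)]) or (smokes[c] == smokes[d])
             for c in CONSTANTS for d in CONSTANTS)
    return math.exp(W_SMOKING_CANCER * n1 + W_FRIENDS_ALIKE * n2)

Z = sum(weight(w) for w in worlds())
p_smokes = sum(weight(w) for w in worlds() if w[0]["Anna"]) / Z
p_both = sum(weight(w) for w in worlds() if w[0]["Anna"] and w[1]["Anna"]) / Z

print(round(p_both / p_smokes, 3))  # P(Cancer(Anna) | Smokes(Anna)) = 0.818
```

The softened implication shows up in the answer: P(Cancer(Anna) | Smokes(Anna)) = e^1.5 / (1 + e^1.5) ≈ 0.818, likely but not certain, whereas the hard logical rule would force it to 1.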
Overview: Machine reading: Challenges · Statistical relational learning · Markov logic · USP: Unsupervised Semantic Parsing · Research directions

Unsupervised Semantic Parsing
- USP [Poon & Domingos, EMNLP-09] (Best Paper Award)
- First unsupervised approach for semantic parsing
- End-to-end machine reading system: read text, answer questions
- OntoUSP = USP + ontology induction [Poon & Domingos, ACL-10]
- Encoded in a few Markov logic formulas

Semantic Parsing
- Goal: Microsoft buys Powerset → BUY(MICROSOFT, POWERSET)
- Challenge: many surface variations, e.g. "Microsoft buys Powerset" / "Microsoft acquires semantic search engine Powerset" / "Powerset is acquired by Microsoft Corporation" / "The Redmond software giant buys Powerset" / "Microsoft's purchase of Powerset, …"

Limitations of Existing Approaches
- Manual grammars or supervised learning: applicable to restricted domains only
- For general text, it is not clear what predicates and objects to use, and it is hard to produce consistent meaning annotation
- Also, existing systems often learn both syntax and semantics, so they fail to leverage advanced syntactic parsers and make semantic parsing harder

USP: Key Idea #1
- Target predicates and objects can be learned
- Viewed as clusters of syntactic or lexical variations of the same meaning:
  - BUY(-,-) = {buys, acquires, 's purchase of, …}: a cluster of various expressions for acquisition
  - MICROSOFT = {Microsoft, the Redmond software giant, …}: a cluster of various mentions of Microsoft

USP: Key Idea #2
- Relational clustering: cluster relations with the same objects
- USP: recursively cluster arbitrary expressions with similar subexpressions (as in the acquisition sentences above)
- First cluster the same forms at the atom level, then cluster forms that appear in composition with the same forms

USP: Key Idea #3
- Start directly from syntactic analyses and focus on translating them to semantics
- Leverage rapid progress in syntactic parsing
- Much easier than learning both

Joint Inference in USP
- Forms a canonical meaning representation by recursively clustering synonymous expressions
- Text → logical form in this representation
- Induces an ISA hierarchy among clusters and applies hierarchical smoothing (shrinkage)

USP: System Overview
- Input: dependency trees for sentences
- Converts dependency trees into quasi-logical forms (QLFs)
- Starts with QLF clusters at the atom level; recursively builds up clusters of larger forms
- Output: a probability distribution over QLF clusters and their compositions; MAP semantic parses of sentences

Generating Quasi-Logical Forms
- Dependency tree: buys −nsubj→ Microsoft, buys −dobj→ Powerset
- Convert each node into a unary atom: buys(n1), Microsoft(n2), Powerset(n3), where n1, n2, n3 are Skolem constants
- Convert each edge into a binary atom: nsubj(n1,n2), dobj(n1,n3)

A Semantic Parse
- Partition the QLF into subformulas
- Subformula → lambda form: replace each Skolem constant not in a unary atom with a unique lambda variable: buys(n1), λx2.nsubj(n1,x2), Microsoft(n2), λx3.dobj(n1,x3), Powerset(n3)
- Core form: no lambda variable (e.g. buys(n1)); argument form: one lambda variable (e.g. λx2.nsubj(n1,x2))
- Assign each subformula to an object cluster: buys(n1) → BUY, Microsoft(n2) → MICROSOFT, Powerset(n3) → POWERSET

Object Cluster: BUY
- A distribution over core forms, e.g. buys(n1): 0.1, acquires(n1): 0.2, …
- One formula in the MLN: learn a weight for each pair of cluster and core form
- May contain a variable number of property clusters, e.g. BUYER, BOUGHT, PRICE, …

Property Cluster: BUYER
- Distributions over argument forms, argument clusters, and argument number (three MLN formulas), e.g.:
  - Argument forms: λx2.nsubj(n1,x2): 0.5, λx2.agent(n1,x2): 0.4, …
  - Argument clusters: MICROSOFT: 0.2, GOOGLE: 0.1, …
  - Number: zero: 0.1, one: 0.8, …

Probabilistic Model
- Exponential prior on the number of parameters
- Cluster mixtures: object clusters (e.g. BUY: buys 0.1, acquires 0.4, …) paired with property clusters (e.g. BUYER, as above)
- Hierarchical smoothing over the cluster mixtures: e.g. picking MICROSOFT as the BUYER argument depends not only on BUY, but also on its ISA ancestors

Abstract Lambda Form
- buys(n1), λx2.nsubj(n1,x2), λx3.dobj(n1,x3) → BUY(n1), λx2.BUYER(n1,x2), λx3.BOUGHT(n1,x3)
- The final logical form is obtained via lambda reduction

Challenge: State Space Too Large
- The potential number of clusters is exponential in the number of tokens
- Also, meaning units and clusters are often small
- Use combinatorial search

Inference: Find the MAP Parse
- Initialize with the atom-level QLF from the dependency tree (e.g. for "IL-4 induces CD11B protein")
- Apply lambda reduction and search operators to build larger meaning units and reach the MAP parse

Learning: Greedily Maximize the Posterior
- Initialize with atom-level clusters
- Search operators:
  - MERGE: e.g. merge the clusters {induces: 1.0} and {enhances: 1.0} into {induces: 0.2, enhances: 0.8}
  - COMPOSE: e.g. compose the clusters {amino: 1.0} and {acid: 1.0} into {amino acid: 1.0}

Operator: ABSTRACT
- Should INDUCE (induces 0.6, up-regulates 0.2, …) be MERGEd with INHIBIT (inhibits 0.4, suppresses 0.2, …)?
- Better: ABSTRACT both under a common ISA parent (e.g. REGULATE)
- Captures substantial similarities between the clusters

Experiments
- Apply to machine reading: extract knowledge from text and answer questions
- Evaluation: number of answers and accuracy
- GENIA dataset: 1999 PubMed abstracts
- Simple factoid questions, e.g.: What does anti-STAT1 inhibit? What regulates MIP-1 alpha?

Results: Total and Correct Answers
- USP extracted five times as many correct answers as TextRunner, with the highest precision (91%)
- [Bar chart: total and correct answers for KW-SYN, TextRunner, RESOLVER, DIRT, and USP]

Qualitative Analysis
- Resolves many nontrivial variations:
  - Argument forms that mean the same, e.g. "expression of X" vs. "X expression", "X stimulates Y" vs. "Y is stimulated with X"
  - Active vs. passive voice
  - Synonymous expressions
  - Etc.
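The QLF construction and the core/argument-form partition described above can be sketched in a few lines of Python (the function names and data layout here are my own; the real USP system builds a probabilistic model over these forms rather than just constructing them):

```python
def to_qlf(nodes, edges):
    """nodes: {id: word}; edges: [(label, head_id, dependent_id)].
    Each node becomes a unary atom over a Skolem constant n<id>;
    each edge becomes a binary atom over two Skolem constants."""
    unary = [f"{word}(n{i})" for i, word in sorted(nodes.items())]
    binary = [f"{lab}(n{h},n{d})" for lab, h, d in edges]
    return unary + binary

def partition(nodes, edges):
    """Partition the QLF into one subformula per node: the unary atom is
    the core form; each outgoing edge becomes an argument form, with the
    dependent's Skolem constant replaced by a lambda variable."""
    parts = {}
    for i, word in sorted(nodes.items()):
        args = [f"lambda x{d}.{lab}(n{h},x{d})" for lab, h, d in edges if h == i]
        parts[f"{word}(n{i})"] = args
    return parts

nodes = {1: "buys", 2: "Microsoft", 3: "Powerset"}
edges = [("nsubj", 1, 2), ("dobj", 1, 3)]
print(to_qlf(nodes, edges))
# ['buys(n1)', 'Microsoft(n2)', 'Powerset(n3)', 'nsubj(n1,n2)', 'dobj(n1,n3)']
print(partition(nodes, edges)["buys(n1)"])
# ['lambda x2.nsubj(n1,x2)', 'lambda x3.dobj(n1,x3)']
```

Clustering then operates over these core and argument forms rather than over raw tokens.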
Clusters and Compositions
- Clusters of core forms:
  - investigate, examine, evaluate, analyze, study, assay
  - diminish, reduce, decrease, attenuate
  - synthesis, production, secretion, release
  - dramatically, substantially, significantly
  - …
- Compositions: amino acid, t cell, immune response, transcription factor, initiation site, binding site, …

Question-Answer Example
- "Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells."
- Q: What does IL-2 control?
- A: The DEX-mediated IkappaBalpha induction

Overview: Machine reading · Statistical relational learning · Markov logic · USP: Unsupervised Semantic Parsing · Research directions

Web-Scale Joint Inference
- Challenge: efficiently identify what is relevant
- Key: induce and leverage an ontology
  - Capture essential properties and abstract away unimportant variations
  - Use upper-level nodes to skip irrelevant branches
- Wanted: combine probabilistic ontology induction (e.g. USP) with coarse-to-fine learning and inference [Felzenszwalb & McAllester, 2007; Petrov, Ph.D. Thesis]

Knowledge Reasoning
- Most facts and rules are not explicitly stated: the "dark matter" of the natural language universe
- E.g.: kale contains calcium; calcium prevents osteoporosis; therefore kale prevents osteoporosis
- Keys: induce generic reasoning patterns; incorporate reasoning in extraction
- Additional sources of indirect supervision

Harness Social Computing
- Bootstrap an online community around the knowledge base
- Incorporate humans and end tasks in the loop, e.g.:
  - "Tell me everything about dicer applied to synapse …"
  - "Your extraction from my paper is correct except for blah …"
- Form a positive feedback loop

Acknowledgments
- Pedro Domingos, Colin Cherry, Kristina Toutanova, Lucy Vanderwende, Oren Etzioni, Dan Weld, Matt Richardson, Parag Singla, Stanley Kok, Daniel Lowd, Marc Sumner
- ARO, AFRL, ONR, DARPA, NSF

Summary
- Statistical relational learning offers promising solutions for machine reading
- Markov logic provides a language for this
  - Syntax: weighted first-order logical formulas
  - Semantics: feature templates of Markov networks
- Open-source software: Alchemy (alchemy.cs.washington.edu)
- A success story: USP (alchemy.cs.washington.edu/papers/poon09)
- Three key research directions