Natural Language Processing via Learning and Global Inference with Constraints
Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign
November 2007
With thanks to:
Collaborators: Ming-Wei Chang, Vasin Punyakanok, Lev Ratinov, Mark Sammons, Scott Yih, Dav Zimak
Funding: ARDA, under the AQUAINT program; NSF ITR IIS-0085836, ITR IIS-0428472, ITR IIS-0085980; Beckman Institute; a DOI grant under the Reflex program; DASH Optimization (Xpress-MP)
Page 1

Nice to Meet You
Page 2

Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
(Learned) models/classifiers for different sub-problems.
Incorporate the classifiers' information, along with constraints, in making coherent decisions: decisions that respect the local models as well as domain- and context-specific constraints.
Global inference for the best assignment to all variables of interest.
Page 3

Inference
Page 4

Comprehension
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
Page 5

What We Know: Stand-Alone Ambiguity Resolution
Spelling/sense: "Illinois' bored of education" [board].
Word sense: "...Nissan Car and truck plant is..." vs. "...divide life into plant and animal kingdom".
Part of speech: "(This Art) (can N) (will MD) (rust V)": the words can, will, and rust could also be tagged V, N, N.
Coreference: "The dog bit the kid. He was taken to a veterinarian / a hospital."
Learn a function f: X -> Y that maps observations in a domain to one of several categories.
Page 6

Classification is Well Understood
Theoretically: generalization bounds. How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?
Algorithmically: good learning algorithms for linear representations. They can deal with very high dimensionality (10^6 features) and are very efficient in terms of computation and number of examples; on-line algorithms are available.
Key issues remaining:
Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; sequences; adaptation.
What are the features? No good theoretical understanding here.
Programming systems that have multiple classifiers.
Page 7
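To make the "good learning algorithms for linear representations" point on Page 7 concrete, here is a minimal sketch of a mistake-driven online linear learner of the Perceptron family (the kind of classifier used by SNoW later in the talk). The sparse-feature interface and the toy data are illustrative assumptions, not the actual SNoW implementation.

```python
import numpy as np

def train_perceptron(examples, n_features, epochs=5):
    """Minimal online Perceptron over sparse Boolean features.

    `examples` is assumed to be a list of (active_feature_indices, label)
    pairs with labels in {-1, +1}; an illustrative sketch, not SNoW.
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for feats, y in examples:
            score = w[feats].sum()      # dot product with a sparse Boolean vector
            if y * score <= 0:          # mistake-driven additive update
                w[feats] += y
    return w

# Toy usage: two "documents" described by their active feature ids.
toy = [([0, 3, 7], +1), ([1, 2, 7], -1)]
print(train_perceptron(toy, n_features=10))
```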
Comprehension (revisited)
[The same passage and questions as on Page 5.]
This is an Inference Problem.
Page 8

This Talk
Global inference over local models/classifiers + expressive constraints.
Model: generality of the framework.
Training paradigms: global vs. local training; semi-supervised learning.
Examples: semantic parsing, information extraction, pipeline processes.
Page 9

Sequential Constraint Structure
Allows for dynamic-programming-based inference.
Three models for sequential inference with classifiers [Punyakanok & Roth, NIPS'01]:
HMM; HMM with classifiers.
Conditional models (PMM): by far the most popular in applications; allow direct modeling of states as a function of the input; the classifiers may vary: SNoW (Winnow, Perceptron), MEMM (MaxEnt), SVM-based.
Constraint satisfaction models (CSCL: more general constraints): the inference problem is modeled as weighted 2-SAT; with sequential constraints it is shown to have an efficient solution.
[Figure: a chain of states s1...s6 over observations o1...o6; callout: "sufficient for easy problems".]
Recent work views this as multi-class classification, with an emphasis on global training [Collins'02; CRFs; SVMs]; efficiency and performance issues.
What if the structure of the problem/constraints is not sequential?
Page 10

Pipeline
Most problems are not single classification problems.
[Diagram: Raw Data -> POS Tagging -> Phrases -> Semantic Entities -> Relations, together with Parsing, WSD, and Semantic Role Labeling stages.]
Pipelining is a crude approximation: interactions occur across levels, and downstream decisions often interact with previous decisions. This leads to propagation of errors; occasionally later-stage problems are easier, but upstream mistakes will not be corrected.
Looking for: global inference over the outcomes of different (learned) predictors as a way to break away from this paradigm [between pipeline and fully global]. This allows a flexible way to incorporate linguistic and structural constraints.
Page 11

Inference with General Constraint Structure
"Dole's wife, Elizabeth, is a native of N.C."  Entities: E1 = Dole, E2 = Elizabeth, E3 = N.C.; relations: R12 between E1 and E2, R23 between E2 and E3.
Entity scores (per / loc / other): E1: 0.85 / 0.10 / 0.05; E2: 0.60 / 0.30 / 0.10; E3: 0.50 / 0.45 / 0.05.
Relation scores (spouse_of / born_in / irrelevant): R12: 0.45 / 0.50 / 0.05; R23: 0.05 / 0.85 / 0.10.
Improvement over no inference: 2-5%.
Page 12

Problem Setting
Random variables Y: y1, ..., y8 over the observations, with constraints such as C(y1, y4) and C(y2, y3, y6, y7, y8) linking subsets of them.
Conditional distributions P (learned by models/classifiers).
Constraints C: any Boolean function defined on partial assignments (possibly with weights W).
Goal: find the "best" assignment, the assignment that achieves the highest global performance.
This is an Integer Programming Problem: Y* = argmax_Y P_Y (+ W·C), subject to the constraints C.
Other, more general ways to incorporate soft constraints are possible [ACL'07].
Page 13
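The Problem Setting above, and the "Dole's wife" example on Page 12, can be made concrete with a tiny brute-force version of the constrained best-assignment search. The numbers below follow my reading of the score table on Page 12 and should be treated as illustrative, and the constraint set (spouse_of needs two people, born_in needs a person and a location) is the standard one for this example rather than a quote from the slide.

```python
from itertools import product

# Local (classifier) scores, approximately as on Page 12; illustrative only.
entity_scores = {
    "E1": {"per": 0.85, "loc": 0.10, "other": 0.05},   # Dole
    "E2": {"per": 0.60, "loc": 0.30, "other": 0.10},   # Elizabeth
    "E3": {"per": 0.50, "loc": 0.45, "other": 0.05},   # N.C.
}
relation_scores = {
    ("E1", "E2"): {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05},
    ("E2", "E3"): {"spouse_of": 0.05, "born_in": 0.85, "irrelevant": 0.10},
}

def consistent(ents, rels):
    """Hard Boolean constraints C over a (partial) assignment."""
    for (a, b), r in rels.items():
        if r == "spouse_of" and not (ents[a] == "per" and ents[b] == "per"):
            return False
        if r == "born_in" and not (ents[a] == "per" and ents[b] == "loc"):
            return False
    return True

def best_assignment():
    """Exhaustive search for the highest-scoring constraint-satisfying assignment."""
    e_names, r_names = list(entity_scores), list(relation_scores)
    best, best_score = None, float("-inf")
    for e_labels in product(*(entity_scores[e] for e in e_names)):
        for r_labels in product(*(relation_scores[r] for r in r_names)):
            ents = dict(zip(e_names, e_labels))
            rels = dict(zip(r_names, r_labels))
            if not consistent(ents, rels):
                continue
            score = sum(entity_scores[e][ents[e]] for e in e_names) + \
                    sum(relation_scores[r][rels[r]] for r in r_names)
            if score > best_score:
                best, best_score = (ents, rels), score
    return best, best_score

print(best_assignment())
```

Running it, the jointly best consistent assignment labels E3 as a location even though its local score slightly favors "per", because R23 = born_in forces its second argument to be a location; that is exactly the behavior the slide illustrates.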
A General Inference Setting
Essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth'99; Collins'02; Lafferty et al.'02].
Linear objective functions can also be derived from a probabilistic perspective: Markov Random Fields [standard; Kleinberg & Tardos], the Metric Labeling optimization problem [Chekuri et al.'01], linear programming problems, inference as constrained optimization [Yih & Roth, CoNLL'04; Punyakanok et al., COLING'04], ...
The probabilistic perspective supports finding the most likely assignment, which is not necessarily what we want.
Our integer linear programming (ILP) formulation allows the incorporation of more general cost functions, supports a general (non-sequential) constraint structure, better exploits hard constraints computationally, and can find the optimal solution if desired.
Page 14

Formal Model
score(y) = w · φ(x, y) - Σ_i ρ_i · d(y, 1_{C_i}), subject to constraints.
The first term uses the weight vector for the "local" models: a collection of classifiers, log-linear models (HMM, CRF), or a combination.
The second term is the (soft) constraints component: ρ_i is the penalty for violating constraint C_i, and d(y, 1_{C_i}) measures how far y is from a "legal" assignment.
How to solve? This is an Integer Linear Program; solving it with ILP packages gives an exact solution, and search techniques are also possible.
How to train? Should constraints be incorporated in the learning process?
Page 15

Example: Semantic Role Labeling
Who did what to whom, when, where, why, ...
I left my pearls to my daughter in my will.
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC.
A0: leaver; A1: things left; A2: benefactor; AM-LOC: location.
Structural constraints: no overlapping arguments; if A2 is present, A1 must also be present.
Special case (structured output problem): here all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. This has implications on training paradigms.
Page 16

Semantic Role Labeling (2/2)
PropBank [Palmer et al. '05] provides a large human-annotated corpus of semantic verb-argument relations. It adds a layer of generic semantic labels to Penn Treebank II; (almost) all the labels are on constituents of the parse trees.
Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files.
13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.
Page 17

Algorithmic Approach
Example sentence: I left my nice pearls to her.
Identify argument candidates (identify the vocabulary): pruning [Xue & Palmer, EMNLP'04]; argument identifier: binary classification (SNoW).
Classify argument candidates: argument classifier: multi-class classification (SNoW) over the candidate arguments.
Inference (over the old and new vocabulary): use the estimated probability distribution given by the argument classifier, use structural and linguistic constraints, and infer the optimal global output.
[Figure: candidate argument spans bracketed over the example sentence at each stage.]
Page 18
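A minimal sketch of the "infer the optimal global output" step, written as the integer linear program spelled out on the following slides. The PuLP modelling library, the candidate list, and the scores are my own illustrative choices (the actual system used Xpress-MP), and only the "one label per candidate" and "no duplicate core argument" constraints are encoded here.

```python
# pip install pulp  (PuLP ships with the CBC solver)
from pulp import LpProblem, LpVariable, LpMaximize, lpSum

candidates = ["a1", "a2", "a3"]                 # hypothetical argument candidates
labels = ["A0", "A1", "A2", "AM-LOC", "null"]
# score[a][t] ~ P(a = t) from the argument classifier; made-up numbers.
score = {
    "a1": {"A0": 0.7, "A1": 0.1, "A2": 0.1, "AM-LOC": 0.05, "null": 0.05},
    "a2": {"A0": 0.5, "A1": 0.4, "A2": 0.05, "AM-LOC": 0.025, "null": 0.025},
    "a3": {"A0": 0.1, "A1": 0.2, "A2": 0.3, "AM-LOC": 0.3, "null": 0.1},
}

prob = LpProblem("srl_inference", LpMaximize)
x = {(a, t): LpVariable(f"x_{a}_{t}", cat="Binary") for a in candidates for t in labels}

# Objective: maximize the expected number of correct arguments.
prob += lpSum(score[a][t] * x[a, t] for a in candidates for t in labels)

# Each candidate gets exactly one label (possibly "null").
for a in candidates:
    prob += lpSum(x[a, t] for t in labels) == 1

# No duplicate core arguments: at most one A0 (and similarly A1, A2).
for t in ["A0", "A1", "A2"]:
    prob += lpSum(x[a, t] for a in candidates) <= 1

prob.solve()
print({a: t for a in candidates for t in labels if x[a, t].value() > 0.5})
```

Note how the "no duplicate" rule becomes a single linear inequality over the Boolean indicator variables, which is the pattern used for all the constraints on Page 21.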
al 04, Roth & Yih 04;05] Input: The probability estimation (by the argument classifier) Structural and linguistic constraints Allows incorporating expressive (non-sequential) constraints on the variables (the arguments types). Page 19 Integer Linear Programming Inference For each argument ai Set up a Boolean variable: ai,t indicating whether ai is classified as t Goal is to maximize i score(ai = t ) ai,t Subject to the (linear) constraints Any Boolean constraints can be encoded as linear constraint(s). If score(ai = t ) = P(ai = t ), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints. Page 20 Constraints No duplicate argument classes a POTARG x{a = A0} 1 Any Boolean rule can be encoded as a linear constraint. R-ARG If there is an R-ARG phrase, there is an ARG Phrase a2 POTARG , a POTARG x{a = A0} x{a2 = R-A0} C-ARG a2 POTARG , If there is an C-ARG phrase, there is an ARG before it (a POTARG) (a is before a2 ) x{a = A0} x{a2 = C-A0} Many other possible constraints: Unique labels No overlapping or embedding Relations between number of arguments If verb is of type A, no argument of type B Universally quantified rules Joint inference can be used also to combine different SRL Systems. Page 21 ILP as a Unified Algorithmic Scheme Consider a common model for sequential inference: HMM/CRF Inference in this model is done via the Viterbi Algorithm. s So far, shown the use of only (deterministic) constraints. Can be used with statistical constraints. A B C A B C A B C A B C A B C t y y1 y2 y3 y4 y5 x x1 x2 x3 x4 x5 Viterbi is a special case of the Linear Programming based Inference. Viterbi is a shortest path problem, which is a LP, with a canonical matrix that is totally unimodular. Therefore, you can get integrality constraints for free. One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix The extension reduces to a polynomial scheme under some conditions (e.g., when constraints are sequential, when the solution space does not change, etc.) Not necessarily increases complexity and very efficient in practice [Roth&Yih, ICML’05] Learn a rather simple model; make decisions with a more expressive model Page 22 Extracting Relations via Semantic Analysis Screen shot from a CCG demo http://L2R.cs.uiuc.edu/~cogcomp Semantic parsing reveals several relations in the sentence along with their arguments. This approach produces a very good semantic parser. F1~90% Easy and fast: ~7 Sent/Sec (using Xpress-MP) Top ranked system in CoNLL’05 shared task Key difference is the Inference Page 23 This Talk Global Inference over Local Models/Classifiers + Expressive Constraints Model Generality of the framework Training Paradigms Global vs. Local training Semi-Supervised Learning Examples Semantic Parsing Information Extraction Pipeline processes Page 24 Phrasal verb paraphrasing [Connor&Roth’07] Textual Entailment Given: Entity matching [Li et. al, Q: Who acquired Overture? AAAI’04, NAACL’04] Determine: A: Eyeing the huge market potential, currently Semantic Role Labeling led by Google, Yahoo took over search company Overture Services Inc last year. Is it true that? market(Textual Entailment) Eyeing the huge potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year Yahoo acquired Overture Overture is a search company Google is a search company Google owns Overture ………. 
Training Paradigms that Support Global Inference
Incorporating general constraints (the algorithmic approach): allow both statistical and expressive declarative constraints; allow non-sequential constraints (generally difficult).
Coupling vs. decoupling training and inference: incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
Issues: modularity, efficiency and performance, availability of training data. This may not be relevant in some problems.
Page 26

Phrase Identification Problem
Use the classifiers' outcomes to identify phrases: the final outcome is determined by optimizing the classifiers' outcomes together with the constraints.
[Figure: input tokens o1...o10 with open- and close-bracket predictions from Classifier 1 and Classifier 2, and the inferred bracketing.]
Did this classifier make a mistake? How should it be trained?
Page 27

Training in the Presence of Constraints
General training paradigm, an objective with two terms: the first term learns from data; the second term guides the model by the constraints.
One can choose whether constraints are included in training or only in evaluation.
Page 28

L+I: Learning plus Inference: training without constraints; testing: inference with constraints.
IBT: Inference-Based Training: learning the components together!
[Cartoon: local models f1(x)...f5(x) over inputs x1...x7 and outputs y1...y5; each model can be more complex and may have a view on a set of output variables.]
Page 29

Perceptron-Based Global Learning
[Figure: local models f1(x)...f5(x) over inputs x1...x7; the true global labeling Y = (-1, 1, -1, -1, 1) is compared with the labeling Y' obtained from the local predictions after applying the constraints.]
Which one is better? When and why?
Page 30

Claims
When the local classification problems are "easy", L+I outperforms IBT. In many applications the components are identifiable and easy to learn (e.g., argument, open-close, PER).
Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a larger number of training examples.
When data is scarce, the problems are not easy, and constraints are available, they can be used, along with a "weak" model, to label unlabeled data and improve the model.
Experimental results and theoretical intuition support these claims: L+I is computationally cheaper and modular; IBT is better in the limit and in other extreme cases; combinations (L+I first, then IBT) are possible.
Page 31

Bound Prediction
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
Local: error ≤ ε_opt + ((d log m + log 1/δ) / m)^(1/2), where ε_opt is an indication of the hardness of the problem.
Global: error ≤ 0 + ((c·d log m + c²·d + log 1/δ) / m)^(1/2).
[Plot: bounds and simulated data for ε_opt = 0, 0.1, 0.2.]
Page 32

Relative Merits: SRL
In some cases problems are hard due to lack of training data (semi-supervised learning); there L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
[Plot: performance as the difficulty of the learning problem (number of features) ranges from easy to hard.]
Page 33
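As a concrete rendering of the Perceptron-based global learning (IBT) scheme on Pages 29-31: predict a full structure with constrained inference, compare it against the true global labeling, and update the weights on the difference. This is a generic structured-perceptron sketch under my own interface assumptions (`joint_features`, `constrained_inference` and the toy constraint are placeholders), not the exact training code behind the experiments.

```python
import numpy as np

def ibt_train(examples, n_features, joint_features, constrained_inference,
              epochs=10, lr=1.0):
    """Inference-Based Training (IBT) with a global Perceptron update.

    `examples`: list of (x, y_gold) pairs, y_gold being the true global labeling.
    `joint_features(x, y)`: assumed to return a length-n_features numpy vector.
    `constrained_inference(x, w)`: assumed to return argmax_y w . phi(x, y)
    over the assignments that satisfy the constraints (e.g. via ILP or Viterbi).
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = constrained_inference(x, w)   # global, constraint-respecting prediction
            if y_pred != y_gold:
                # Promote the gold structure, demote the predicted one.
                w += lr * (joint_features(x, y_gold) - joint_features(x, y_pred))
    return w

# Toy usage: two binary output variables with the constraint y1 <= y2.
labels = [(0, 0), (0, 1), (1, 1)]                  # constraint-satisfying assignments
phi = lambda x, y: np.array([x[0] * y[0], x[1] * y[1], y[0], y[1]], dtype=float)
infer = lambda x, w: max(labels, key=lambda y: w.dot(phi(x, y)))
data = [((1.0, 2.0), (0, 1)), ((3.0, 1.0), (1, 1))]
print(ibt_train(data, 4, phi, infer))
```

The L+I alternative would train each local component independently on its own labels and only call the constrained inference at test time.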
Information Extraction with Background Knowledge (Constraints)
Citation: Lars Ole Andersen. Program analysis and specialization for the C programming language. PhD thesis. DIKU, University of Copenhagen, May 1994.
Prediction result of a trained HMM: the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] are assigned to the wrong segments of the citation; the prediction violates lots of constraints!
Page 34

Examples of Constraints
Each field must be a consecutive list of words, and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words "pp." and "pages" correspond to PAGE.
Four digits starting with 20xx or 19xx are a DATE.
Quotations can appear only in TITLE.
...
Page 35

Information Extraction with Constraints
Adding the constraints, we get the correct result:
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
If the constraints are incorporated into the training, better results mean better feedback!
Page 36

Semi-Supervised Learning with Constraints
In traditional semi-supervised learning the model can drift away from the correct one. Constraints can be used:
at decision time, to bias the objective function towards favoring constraint satisfaction;
at training time, to improve the labeling of unlabeled data (and thus improve the model).
[Diagram: the model, the constraints, the unlabeled data, and decision-time constraints.]
Page 37

Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL'07]
Ingredients: any supervised learning algorithm parametrized by θ; any inference algorithm (with constraints); the training set is augmented with feedback.
θ = learn(Tr)
For N iterations do:
  T = ∅
  For each x in the unlabeled dataset:
    y = Inference(x, C, θ)
    T = T ∪ {(x, y)}
Page 38

Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL'07]
The full algorithm uses top-K inference and re-estimates the model from the new training data, weighting the supervised and unsupervised models (Nigam 2000):
θ = learn(Tr)
For N iterations do:
  T = ∅
  For each x in the unlabeled dataset:
    {y1, ..., yK} = Top-K-Inference(x, C, θ)
    T = T ∪ {(x, yi)}, i = 1...K
  θ = γ·θ + (1 - γ)·learn(T)
Page 39

[Plot: token-based accuracy (inference with constraints) vs. labeled set size (5 to 300), comparing supervised learning, weighted EM, and CODL; accuracy axis from 60 to 100.]
Page 40

Semi-Supervised Learning with Constraints
Objective function: learning without constraints vs. learning with constraints.
Constraints are used to bootstrap a semi-supervised learner: a poor model plus constraints are used to annotate unlabeled data, which in turn is used to keep training the model.
[Plot: performance vs. the number of available labeled examples; learning without constraints is shown with 300 examples.]
Page 41

Conclusions
Discussed a general paradigm for learning and inference in the context of natural language understanding tasks:
a general constraint optimization approach for integrating learned models with additional (declarative or statistical) expressivity;
a paradigm for making machine learning practical: allow domain/task-specific constraints.
How to train? Learn locally and make use globally (via global inference) [Punyakanok et al., IJCAI'05].
Domain knowledge and constraints can be used to drive supervision [Klementiev & Roth, ACL'06; Chang, Ratinov, Roth, ACL'07].
LBJ (Learning Based Java), http://L2R.cs.uiuc.edu/~cogcomp: a modeling language that supports programming along with building learned models, and allows incorporating constraints and performing inference with them.
Page 42

Problem Setting (backup)
[Repeat of the Problem Setting slide (Page 13), stating the goal as the assignment that achieves the highest global accuracy: Y* = argmax_Y P_Y (+ W·C) subject to the constraints C.]
Page 43
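To connect the CODL pseudocode on Pages 38-39 to something executable, here is a minimal sketch under my own interface assumptions: the model is a plain weight vector, and `learn` and `top_k_constrained_inference` are user-supplied callables (both names are placeholders, not the paper's API). γ plays the role of the weighting between the supervised and the self-trained model, as in the Nigam-style interpolation on Page 39.

```python
import numpy as np

def codl(labeled, unlabeled, learn, top_k_constrained_inference,
         n_iters=10, k=5, gamma=0.9):
    """Sketch of Constraint-Driven Learning (CODL) following the slides:
    train a supervised model, then repeatedly (i) label the unlabeled data
    with constraint-aware top-k inference and (ii) retrain, interpolating
    the new weights with the previous ones.

    `learn(dataset)` is assumed to return a numpy weight vector;
    `top_k_constrained_inference(x, w, k)` is assumed to return the k best
    constraint-satisfying labelings of x under the current weights w.
    """
    w = learn(labeled)                               # theta = learn(Tr)
    for _ in range(n_iters):
        t = []                                       # T = {}
        for x in unlabeled:
            for y in top_k_constrained_inference(x, w, k):
                t.append((x, y))                     # T = T u {(x, y_i)}, i = 1..k
        w = gamma * w + (1.0 - gamma) * learn(t)     # weight supervised vs. self-trained
    return w
```

Using the top-k constraint-satisfying labelings, rather than only the single best one, is what lets a weak model plus constraints provide useful (soft) feedback on the unlabeled data.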