Constraints Driven Structured Learning with Indirect Supervision
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
Collaborators: Ming-Wei Chang, James Clarke, Dan Goldwasser, Lev Ratinov, Vivek Srikumar, and many others
Carnegie Mellon University, April 2010
Funding: NSF (ITR IIS-0085836, SoD-HCER-0613885); DHS; DARPA (Bootstrap Learning and Machine Reading programs); DASH Optimization (Xpress-MP)
Page 1

Nice to Meet You
Page 2

Comprehension
Comprehension is a process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is the title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
Answering these questions is an inference problem.
Page 3

Coherency in Semantic Role Labeling
Predicate-argument structures should be consistent across phenomena (verbs, nominalizations, prepositions).
"The touchdown scored by Bettis cemented the victory of the Steelers."
Verb (score): A0: Bettis (scorer); A1: the touchdown (points scored)
Nominalization (win): A0: the Steelers (winner)
Prepositions: Sense(by): 1(1); Sense(of): 11(6)
Linguistic constraint: the object of the preposition is the object of the underlying verb of the nominalization. This ties A0: Bettis to Sense(by): 1(1), and A0: the Steelers to Sense(of): 11(6).
Page 4

Semantic Parsing
X: "What is the largest state that borders New York and Maryland?"
Y: largest( state( next_to( state(NY)) AND next_to( state(MD))))
Successful interpretation involves multiple decisions:
Which entities appear in the interpretation? Does "New York" refer to a state or a city?
How are fragments composed together? state(next_to()) vs. next_to(state())
Page 5

Learning and Inference
Natural language decisions are structured: global decisions in which several local decisions play a role, with mutual dependencies among their outcomes.
It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
But: learning structured models requires annotating structures.
Page 6

Constrained Conditional Models (aka ILP Inference)
y* = argmax_y  w^T φ(x,y) − Σ_i ρ_i d_{C_i}(x,y)
w^T φ(x,y): weight vector for "local" models over features; the local models may be classifiers or log-linear models (HMM, CRF), or a combination.
ρ_i: penalty for violating constraint C_i (the soft-constraints component).
d_{C_i}(x,y): how far y is from a "legal" assignment under constraint C_i.
CCMs can be viewed as a general interface for easily combining domain knowledge with data-driven statistical models.
How to solve? This is an Integer Linear Program; solving it with ILP packages gives an exact solution.
Search techniques are also possible.
How to train? Training is learning the objective function.
How to exploit the structure to minimize supervision?
Page 7

Example: Semantic Role Labeling
Who did what to whom, when, where, why, ...
I left my pearls to my daughter in my will.
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC
A0: Leaver; A1: Things left; A2: Benefactor; AM-LOC: Location
Constraints: no overlapping arguments; if A2 is present, A1 must also be present.
Page 8

Semantic Role Labeling (2/2)
PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
It adds a layer of generic semantic labels to the Penn Treebank II; (almost) all the labels are on constituents of the parse trees.
Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files.
13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.
Page 9

Algorithmic Approach
Identify argument candidates: pruning [Xue & Palmer, EMNLP'04] followed by an argument identifier (binary classification, SNoW). This identifies the vocabulary of candidate arguments.
Classify argument candidates: an argument classifier (multi-class classification, SNoW) scores each candidate.
Inference over the (old and new) vocabulary: use the estimated probability distribution given by the argument classifier, together with structural and linguistic constraints, to infer the optimal global output.
[Figure: candidate argument brackets over "I left my nice pearls to her" at each stage.]
Page 10

Semantic Role Labeling (SRL)
I left my pearls to my daughter in my will.
[Figure: each candidate argument receives a score distribution over the possible argument labels (A0, A1, A2, AM-LOC, ...).]
One inference problem is solved for each verb predicate.
Pages 11-13

Integer Linear Programming Inference
For each argument a_i and type t, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.
Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
If score(a_i = t) = P(a_i = t), the objective finds the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
Note: the Constrained Conditional Model is completely decomposed during training.
Page 14

Constraints
Any Boolean rule can be encoded as a (collection of) linear constraints; the rules below are universally quantified.
No duplicate argument classes: Σ_{a ∈ POTARG} x_{a = A0} ≤ 1
R-ARG: if there is an R-ARG phrase, there is an ARG phrase: ∀ a2 ∈ POTARG, Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0}
C-ARG: if there is a C-ARG phrase, there is an ARG phrase before it: ∀ a2 ∈ POTARG, Σ_{a ∈ POTARG, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}
Many other constraints are possible: unique labels; no overlapping or embedding; relations between the number of arguments; order constraints; if the verb is of type A, no argument of type B.
LBJ allows a developer to encode constraints in first-order logic; these are compiled into linear inequalities automatically.
Joint inference can also be used to combine different SRL systems.
Page 15
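The ILP formulation on slides 14-15 can be made concrete with a small sketch. The talk's systems used the Xpress-MP solver acknowledged on slide 1; the sketch below uses the open-source PuLP/CBC library instead, and the candidate arguments, scores, and the particular constraints encoded are illustrative stand-ins rather than the full constraint set described above.

```python
# Sketch of CCM inference as an ILP for one verb predicate. Scores, candidates,
# and the constraint set are illustrative; a real system would plug in the
# argument classifier's distributions and the full list of linguistic constraints.
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, value

LABELS = ["A0", "A1", "A2", "AM-LOC", "NONE"]
# scores[i][t] ~ P(a_i = t) from the argument classifier, one row per candidate
scores = [
    {"A0": 0.6, "A1": 0.1, "A2": 0.1, "AM-LOC": 0.1, "NONE": 0.1},  # "I"
    {"A0": 0.1, "A1": 0.6, "A2": 0.1, "AM-LOC": 0.1, "NONE": 0.1},  # "my pearls"
    {"A0": 0.1, "A1": 0.2, "A2": 0.5, "AM-LOC": 0.1, "NONE": 0.1},  # "to my daughter"
    {"A0": 0.1, "A1": 0.1, "A2": 0.2, "AM-LOC": 0.5, "NONE": 0.1},  # "in my will"
]
n = len(scores)

prob = LpProblem("srl_ccm_inference", LpMaximize)
# Boolean variable x[i][t] == 1 iff candidate a_i is labeled t
x = [{t: LpVariable(f"x_{i}_{t}", cat=LpBinary) for t in LABELS} for i in range(n)]

# Objective: expected number of correctly labeled arguments
prob += lpSum(scores[i][t] * x[i][t] for i in range(n) for t in LABELS)

# Each candidate receives exactly one label (possibly NONE)
for i in range(n):
    prob += lpSum(x[i][t] for t in LABELS) == 1

# No duplicate argument classes: each core argument appears at most once
for t in ["A0", "A1", "A2"]:
    prob += lpSum(x[i][t] for i in range(n)) <= 1

# Declarative constraint from slide 8: if A2 is present, A1 must be present
prob += lpSum(x[i]["A1"] for i in range(n)) >= lpSum(x[i]["A2"] for i in range(n))

prob.solve()
print([next(t for t in LABELS if value(x[i][t]) > 0.5) for i in range(n)])
```

Adding or removing a declarative constraint is one line of model code, which is the point of the CCM interface: the statistical scores stay untouched while the feasible output space changes.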
Learning and Inference
Natural language decisions are structured: global decisions in which several local decisions play a role, with mutual dependencies among their outcomes.
It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
But: learning structured models requires annotating structures.
Page 16

Information Extraction without Prior Knowledge
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Prediction result of a trained HMM: the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] are segmented incorrectly over "Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 ."
The prediction violates lots of natural constraints!
Page 17

Examples of Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words "pp." and "pages" correspond to PAGE.
Four digits starting with 20xx or 19xx are a DATE.
Quotations can appear only in TITLE.
...
These are easy-to-express pieces of "knowledge": non-propositional, and they may use quantifiers.
Page 18

Information Extraction with Constraints
Adding constraints, we get correct results, without changing the model:
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
Page 19

Guiding Semi-Supervised Learning with Constraints
In traditional semi-supervised learning, the model can drift away from the correct one.
Constraints can be used to generate better training data:
At training time, to improve the labeling of unlabeled data (and thus improve the model).
At decision time, to bias the objective function towards favoring constraint satisfaction.
[Figure: the model and the constraints jointly label the unlabeled data; constraints are also applied at decision time.]
Page 20

Constraints Driven Learning (CoDL) [Chang, Ratinov, Roth, ACL'07; ICML'08; Long'10]
(w0, ρ0) = learn(L)   // supervised learning algorithm parameterized by (w, ρ)
For N iterations do:
  T = ∅
  For each x in the unlabeled dataset:
    h ← argmax_y  w^T φ(x,y) − Σ_k ρ_k d_{C_k}(x,y)   // inference with constraints
    T = T ∪ {(x, h)}   // augment the training set
  (w, ρ) = γ (w0, ρ0) + (1 − γ) learn(T)   // learn from the new training data; weigh the supervised and unsupervised models
Learning can be justified as an optimization procedure for an objective function.
Excellent experimental results show the advantages of using constraints, especially with small amounts of labeled data [Chang et al., others].
Page 21

Constraints Driven Learning (CoDL) [Chang et al. '07, '08; others]
A semi-supervised learning paradigm that uses constraints to bootstrap from a small number of examples.
[Figure: objective-function value vs. the number of available labeled examples; a poor model plus 10 constraints reaches the performance that learning without constraints needs about 300 labeled examples to reach.]
Constraints are used to:
Bootstrap a semi-supervised learner.
Correct the weak model's predictions on unlabeled data, which in turn are used to keep training the model.
Page 22
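The CoDL loop on slide 21 fits in a few lines of code. The sketch below is a generic rendition, not the authors' implementation: `learn`, `phi`, and the constraint penalty `d_C` are task-specific placeholders, ρ is held fixed for simplicity (the slide also re-estimates it), and the toy instantiation at the bottom exists only to make the sketch runnable.

```python
# Minimal sketch of Constraints-Driven Learning (CoDL). The learner, feature
# map, and constraint penalty are placeholders supplied by the task.
import numpy as np

def codl(labeled, unlabeled, learn, phi, labels, rho, d_C, gamma=0.9, iters=5):
    w0 = learn(labeled)                 # (w0, rho0) = learn(L) on the labeled data
    w = w0.copy()
    for _ in range(iters):
        T = []
        for x in unlabeled:
            # Inference with constraints: argmax_y w.phi(x,y) - rho * d_C(x,y)
            y_hat = max(labels, key=lambda y: w @ phi(x, y) - rho * d_C(x, y))
            T.append((x, y_hat))        # augment the training set
        # Weigh the supervised model against the constraint-labeled one
        w = gamma * w0 + (1 - gamma) * learn(T)
    return w

if __name__ == "__main__":
    labels, dim = [0, 1], 2
    def phi(x, y):                      # joint feature map: x copied into block y
        v = np.zeros(dim * len(labels)); v[y * dim:(y + 1) * dim] = x; return v
    def learn(data):                    # toy learner: a few perceptron passes
        w = np.zeros(dim * len(labels))
        for _ in range(10):
            for x, y in data:
                p = max(labels, key=lambda l: w @ phi(x, l))
                if p != y:
                    w += phi(x, y) - phi(x, p)
        return w
    def d_C(x, y):                      # toy constraint: label 1 requires x[0] >= 0
        return 1.0 if (y == 1 and x[0] < 0) else 0.0
    rng = np.random.default_rng(0)
    labeled = [(rng.normal(size=dim) + (2 * y - 1), y) for y in labels for _ in range(3)]
    unlabeled = [rng.normal(size=dim) for _ in range(50)]
    print(codl(labeled, unlabeled, learn, phi, labels, rho=5.0, d_C=d_C))
```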
Learning and Inference
Natural language decisions are structured: global decisions in which several local decisions play a role, with mutual dependencies among their outcomes.
It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
But: learning structured models requires annotating structures.
Interdependencies among decision variables should be exploited in learning.
Goal: use minimal, indirect supervision, and amplify it using the interdependencies among variables.
Page 23

Two Ideas
Idea 1: Simple, easy-to-supervise binary decisions often depend on the structure you care about. Learning to do well on the binary task can drive the structure learning.
Idea 2: Global inference can be used to amplify the minimal supervision.
Idea 2½: There are several settings where a binary label can be used to replace a structured label. Perhaps the most intriguing is when you use the world's response to the model's actions.
Page 24

Outline
Inference
Semi-supervised training paradigms for structures: Constraints Driven Learning
Indirect supervision training paradigms for structure:
Indirect supervision training with latent structure [NAACL'10]: transliteration; textual entailment; paraphrasing
Training structure predictors by inventing (easy to supervise) binary labels [ICML'10]: POS tagging; information extraction tasks
Driving the supervision signal from the World's response [CoNLL'10]: semantic parsing
Page 25

Textual Entailment
"Former military specialist Carpenter took the helm at FictitiousCom Inc. after five years as press official at the United States embassy in the United Kingdom."  =>  "Jim Carpenter worked for the US Government."
Entailment requires an intermediate representation (an alignment between the two sentences) and alignment-based features.
Given the intermediate features, learn a decision: Entail / Does not Entail.
But only positive entailments are expected to have a meaningful intermediate representation.
[Figure: an alignment graph between constituents x1, ..., x7 of the two sentences.]
Page 26

Paraphrase Identification
Given an input x ∈ X, learn a model f: X → {-1, 1}.
Consider the following sentences:
S1: Druce will face murder charges, Conte said.
S2: Conte said Druce will be charged with murder.
Are S1 and S2 paraphrases of each other? There is a need for an intermediate representation to justify this decision: latent variables that explain why this is a positive example.
So instead, given an input x ∈ X, learn a model f: X → H → {-1, 1}.
Page 27

Algorithms: Two Conceptual Approaches
Two-stage approach (typically used for TE and paraphrase identification):
Learn the hidden variables and fix them; this requires supervision for the hidden layer (or heuristics).
For each example, extract features over x and the (fixed) h, and learn a binary classifier.
Proposed approach: joint learning.
Drive the learning of h from the binary labels; find the best h(x).
An intermediate structure representation is good to the extent it supports better final prediction.
Algorithm?
Page 28

Learning with Constrained Latent Representation (LCLR): Intuition
If x is positive, there must exist a good explanation (intermediate representation): ∃ h, w^T φ(x,h) ≥ 0, or equivalently max_h w^T φ(x,h) ≥ 0.
If x is negative, no explanation is good enough to support the answer: ∀ h, w^T φ(x,h) ≤ 0, or equivalently max_h w^T φ(x,h) ≤ 0.
The chosen h selects a representation: a new feature vector for the final decision.
Altogether, this can be combined into an objective function:
min_w  λ/2 ||w||² + C Σ_i ℓ(1 − z_i max_{h ∈ C} w^T Σ_s h_s φ_s(x_i))
Why does inference help? Inference finds the best h subject to the constraints C.
Page 29
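To make the max over latent structures concrete, here is a small sketch of the LCLR decision rule and objective above. It is not the paper's implementation: the latent structure is simplified to "select at most one candidate piece per group", which makes the inner maximization solvable in closed form instead of with an ILP solver, and all feature vectors are invented for illustration.

```python
# Minimal sketch of the LCLR score max_h w.Phi(x,h) and the objective on slide 29,
# under a simplified constraint set (at most one selected piece per group).
import numpy as np

def best_latent_score(w, groups):
    """max_h  w . sum_s h_s phi_s(x)  s.t. at most one piece is selected per group."""
    total, h = 0.0, []
    for pieces in groups:                       # each group: list of feature vectors
        scores = [w @ phi for phi in pieces]
        best = int(np.argmax(scores))
        if scores[best] > 0:                    # selecting nothing is also allowed
            total += scores[best]; h.append(best)
        else:
            h.append(None)
    return total, h

def lclr_objective(w, examples, lam=0.1, C=1.0):
    """lambda/2 ||w||^2 + C * sum_i hinge(1 - z_i * max_h w.Phi(x_i,h))"""
    reg = 0.5 * lam * np.dot(w, w)
    hinge = sum(max(0.0, 1.0 - z * best_latent_score(w, groups)[0])
                for groups, z in examples)
    return reg + C * hinge

# Toy usage: one example with a good explanation (positive) and one without (negative)
w = np.array([1.0, -0.5])
pos = ([[np.array([2.0, 0.0]), np.array([0.0, 1.0])]], +1)
neg = ([[np.array([0.0, 3.0])]], -1)
print(lclr_objective(w, [pos, neg]))
```

The asymmetry discussed on the next slide is visible here: only positive examples need some h with a positive score, while negatives must score non-positively under every h, which is exactly what the max enforces.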
Optimization
The objective is non-convex, due to the maximization term inside the global minimization problem.
In each iteration:
Find the best feature representation h* for all positive examples (with an off-the-shelf ILP solver).
Having fixed the representation for the positive examples, update w by solving the convex optimization problem (not the standard SVM/LR: it needs inference).
Asymmetry: only positive examples require a good intermediate representation that justifies the positive label.
Consequently, the objective function decreases monotonically.
Page 30

Iterative Objective Function Learning
[Flow: initial objective function → generate features / inference for h subject to C → prediction with the inferred h → training with respect to the binary decision label → update the weight vector → repeat.]
Formalized as a structured SVM with a constrained hidden structure.
LCLR: Learning Constrained Latent Representation.
Page 31

Learning with Constrained Latent Representation (LCLR): Framework
LCLR provides a general inference formulation that allows the use of expressive constraints, and it is flexibly adapted for many tasks that require latent representations.
Paraphrasing, as a declarative model: represent the input as graphs, with vertices V(G1), V(G2) and edges E(G1), E(G2).
Hidden variables: h_{v1,v2} for possible vertex mappings; h_{e1,e2} for possible edge mappings.
Constraints:
Each vertex in G1 can be mapped to a single vertex in G2 or to null.
Each edge in G1 can be mapped to a single edge in G2 or to null.
An edge mapping is active iff the corresponding node mappings are active.
Page 32

Experimental Results
[Figures: results for Transliteration, Recognizing Textual Entailment, and Paraphrase Identification.]
Page 33

Outline
Inference
Semi-supervised training paradigms for structures: Constraints Driven Learning
Indirect supervision training paradigms for structure:
Indirect supervision training with latent structure: transliteration; textual entailment; paraphrasing
Training structure predictors by inventing (easy to supervise) binary labels: POS tagging; information extraction tasks
Driving the supervision signal from the World's response: semantic parsing
Page 34

Structured Prediction
Before, the structure was at the intermediate level: we cared about the structured representation only to the extent it helped the final binary decision, and the binary decision variable was given as supervision.
What if we care about the structure itself? Information extraction; relation extraction; POS tagging; many others.
Invent a companion binary decision problem!
Parse citations: "Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 ."
Companion: given a citation, does it have a legitimate parse?
POS tagging. Companion: given a word sequence, does it have a legitimate POS tagging sequence?
Page 35

Predicting Phonetic Alignment (for Transliteration)
Target task. Input: an English named entity and its Hebrew transliteration. Output: the phonetic alignment (a character-sequence mapping). This is a structured output prediction task (with many constraints) that is hard to label.
Companion task. Input: an English named entity and a Hebrew named entity. Output: do they form a transliteration pair (yes/no)?
Why is it a companion task? It is a binary output problem that is easy to label, and negative examples are free, given positive examples.
[Figure: character-level alignments between "Italy" and "Illinois" and their Hebrew counterparts.]
Page 36
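One way to see why the companion task's negative examples are "free": given only positive transliteration pairs, any mismatched pairing already serves as a negative binary example. The sketch below illustrates this; the name pairs and spellings are illustrative placeholders, not the data used in the experiments.

```python
# Generating binary companion examples from positive transliteration pairs alone:
# matched pairs are positives, and every mismatched pairing is a free negative.
import itertools

positive_pairs = [
    ("Italy", "איטליה"),
    ("Illinois", "אילינוי"),
    ("London", "לונדון"),
]

binary_data = [((en, he), +1) for en, he in positive_pairs]
# Negatives: pair each English name with a Hebrew name from a different pair
binary_data += [((en1, he2), -1)
                for (en1, _), (_, he2) in itertools.permutations(positive_pairs, 2)]

for (en, he), z in binary_data:
    print(z, en, he)
```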
Companion Task Binary Label as Indirect Supervision
The two tasks are related, just like the binary and structured tasks discussed earlier:
Positive transliteration pairs must have "good" phonetic alignments.
Negative transliteration pairs cannot have "good" phonetic alignments.
We are in the same setting as before: all positive examples must have a good structure, and negative examples cannot have a good structure.
Binary labeled examples are easier to obtain, and we can take advantage of this to help learn a structured model.
Here: combine binary learning and structured learning.
Page 37

Learning Structure with Indirect Supervision
In this case we care about the predicted structure, so we use both structural learning and binary learning.
Negative examples cannot have a good structure: they restrict the space of hyperplanes supporting the decisions for x.
[Figure: predicted vs. correct structures, and the feasible structures of an example.]
Page 38

Joint Learning with Indirect Supervision (J-LIS)
Joint learning: if available, make use of both supervision types (the structured target task and the binary companion task).
Loss function: L_B as before for the binary task; L_S for structural learning. The key is that the same parameter vector w is shared by both components:
min_w  (1/2) w^T w + C1 Σ_{i ∈ S} L_S(x_i, y_i; w) + C2 Σ_{i ∈ B} L_B(x_i, z_i; w)
The first sum is the loss on the target task; the second is the loss on the companion task.
Page 39

Experimental Results
Very little direct (structured) supervision, plus a large amount of (almost free) binary indirect supervision.
[Figures: results on the target structured tasks as the amounts of direct and indirect supervision vary.]
Pages 40-41

Relations to Other Frameworks
If B = ∅ and ℓ is the (squared) hinge loss: Structural SVM.
If S = ∅: LCLR. Related to the Structural Latent SVM (Yu & Joachims) and to Felzenszwalb et al.
If S = ∅, it is also conceptually related to Contrastive Estimation, but: there is no "grouping" of good examples and bad neighbors; max vs. sum: we do not marginalize over the hidden structure space; and it allows complex, domain-specific constraints.
Related to some semi-supervised approaches, but can use negative examples (Sheffer et al.).
Page 42

Outline
Inference
Semi-supervised training paradigms for structures: Constraints Driven Learning
Indirect supervision training paradigms for structure:
Indirect supervision training with latent structure: transliteration; textual entailment; paraphrasing
Training structure predictors by inventing (easy to supervise) binary labels: POS tagging; information extraction tasks
Driving the supervision signal from the World's response: semantic parsing
Page 43

Connecting Language to the World
"Can I get a coffee with no sugar and just a bit of milk?" → Semantic Parser → MAKE(COFFEE, SUGAR=NO, MILK=LITTLE) → Great! (or: Arggg)
Can we rely on this interaction to provide supervision?
Page 44

Real World Feedback: Supervision = Expected Response
Traditional approach: learn from logical forms and gold alignments. EXPENSIVE!
Our approach: use only the responses.
x: NL query: "What is the largest state that borders NY?"
y: logical query: largest( state( next_to( const(NY))))
r: query response (from the interactive computer system): Pennsylvania
Binary supervision: check whether the predicted response equals the expected response.
Expected: Pennsylvania, Predicted: Pennsylvania → positive response.
Expected: Pennsylvania, Predicted: NYC → negative response.
Semantic parsing is a structured prediction problem: identify the mapping from text to a meaning representation.
Train a structured predictor with this binary supervision!
Page 45
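The supervision signal above reduces to a single equality check against the world. The sketch below makes that concrete with a toy geography "world"; the database, the executor, and the logical-form handling are illustrative stand-ins, not the query system used in the experiments.

```python
# Minimal sketch of response-driven supervision: the learner never sees gold
# logical forms, only whether executing its predicted meaning representation
# yields the expected answer. Database contents and the parser are toys.
GEO_DB = {"borders": {"NY": ["PA", "NJ", "CT", "MA", "VT"]},
          "area":    {"PA": 46055, "NJ": 8723, "CT": 5543, "MA": 10554, "VT": 9616}}

def execute(logical_form):
    """Evaluate a toy logical form like largest(state(next_to(const(NY))))."""
    if logical_form.startswith("largest(state(next_to(const("):
        s = logical_form.split("const(")[1].split(")")[0]
        neighbors = GEO_DB["borders"].get(s, [])
        return max(neighbors, key=lambda st: GEO_DB["area"][st], default=None)
    return None

def binary_feedback(predicted_form, expected_response):
    """+1 if the predicted logical form yields the expected response, else -1."""
    return +1 if execute(predicted_form) == expected_response else -1

print(binary_feedback("largest(state(next_to(const(NY))))", "PA"))   # +1
print(binary_feedback("largest(state(next_to(const(NJ))))", "PA"))   # -1
```

This ±1 signal is exactly what Protocols I and II on the next slides consume.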
Learning Structures with a Binary Signal
Protocol I: direct learning with binary supervision.
Uses the predicted structures as examples for learning a binary decision.
Inference is used to predict the query: (y,z) = argmax_{y,z} w^T φ(x,y,z)
Positive feedback: add a positive binary example. Negative feedback: add a negative binary example.
The learned parameters form the objective function; iterate.
[Figure: e.g., "What is the largest state that borders NY?" paired with largest(state(next_to(const(NJ)))) receives negative feedback; paired with largest(state(next_to(const(NY)))) it receives positive feedback; "What is the smallest state?" paired with state(next_to(const(NY))) receives negative feedback.]
Page 46

Learning Structures with a Binary Signal
Protocol II: aggressive learning with binary supervision.
Positive feedback is given iff the structure is correct: (y,z) = argmax_{y,z} w^T φ(x,y,z)
Positive feedback: add a positive structured example, and train a structured predictor from these structures.
Iterate until no new structures are found.
[Figure: "What is the largest state that borders NY?" → largest( state( next_to( const(NY)))) → the interactive computer system returns the correct response, Pennsylvania.]
Page 47

Empirical Evaluation
Key question: can we learn from this type of supervision?
Algorithm | # training structures | Test set accuracy
No learning: initial objective fn | 0 | 22.2%
Binary signal: Protocol I | 0 | 69.2%
Binary signal: Protocol II | 0 | 73.2%
WM* 2007 (fully supervised; uses gold structures) | 310 | 75%
*[WM] Y.-W. Wong and R. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. ACL.
Page 48

Summary
Constrained Conditional Models: a computational framework for global inference and a vehicle for incorporating knowledge.
Direct supervision for structured NLP tasks is expensive; indirect supervision is cheap and easy to obtain.
We suggested learning protocols for indirect supervision:
Make use of simple, easy-to-get binary supervision, and use it to learn structure.
Done in the context of Constrained Conditional Models; inference is an essential part of propagating the simple supervision.
Learning structures from real-world feedback:
Obtain binary supervision from "real world" interaction; indirect supervision replaces direct supervision.
Page 49

Features Versus Constraints
Features: φ_i : X × Y → R. Constraints: C_i : X × Y → {0,1}, with d : X × Y → R. Mathematically, soft constraints are features.
In principle, constraints and features can encode the same properties; in practice, they are very different.
Features: local, short-distance properties, to support tractable inference; propositional (grounded), e.g., true if "the" followed by a noun occurs in the sentence. If φ(x,y) = φ(x), constraints provide an easy way to introduce a dependence on y.
Constraints: global properties; quantified, first-order logic expressions, e.g., true iff all y_i in the sequence y are assigned different values.
Page 50

Constraints As a Way To Encode Prior Knowledge
Consider encoding the knowledge that entities of type A and B cannot occur simultaneously in a sentence.
The "feature" way: needs more training data and requires larger models.
The constraints way: an effective way to inject knowledge with a small set of constraints; it keeps the model simple (add expressive constraints directly), allows one to learn simpler models, and allows for decision-time incorporation of constraints.
We can use constraints as a way to replace training data.
Page 51
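As a final illustration of the "soft constraints are features" point on slide 50, the sketch below scores an assignment with the CCM objective w·φ(x,y) − Σ_i ρ_i d_i(x,y), treating each constraint's violation count d_i as just another feature whose weight is a penalty. The feature map and the constraint are toy examples, not part of the original systems.

```python
# A soft constraint d_i(x,y) enters the CCM score exactly like a feature,
# except its weight is a fixed penalty rho_i rather than a learned parameter.
import numpy as np

def ccm_score(w, phi, x, y, constraints, penalties):
    violation = sum(rho * d(x, y) for d, rho in zip(constraints, penalties))
    return float(w @ phi(x, y)) - violation

# Toy instantiation: y is a label sequence, phi counts label occurrences, and
# the constraint "all labels distinct" is measured by the number of repeats in y.
labels = ["A0", "A1", "A2"]
phi = lambda x, y: np.array([y.count(t) for t in labels])
all_distinct = lambda x, y: len(y) - len(set(y))      # d(x,y) = number of repeats
w = np.array([0.5, 0.3, 0.2])

print(ccm_score(w, phi, None, ["A0", "A1"], [all_distinct], [2.0]))        # no violation
print(ccm_score(w, phi, None, ["A0", "A0", "A1"], [all_distinct], [2.0]))  # penalized
```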