Global Inference in Learning for Natural Language Processing
Vasin Punyakanok
Department of Computer Science, University of Illinois at Urbana-Champaign
Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak

Story Comprehension

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?

Stand-Alone Ambiguity Resolution
- Context-Sensitive Spelling Correction: Illinois' bored of education
- Word Sense Disambiguation: ...Nissan Car and truck plant is ... / ...divide life into plant and animal kingdom
- Part-of-Speech Tagging: (This DT) (can N) (will MD) (rust V); candidate tags: DT, MD, V, N
- Coreference Resolution:
  The dog bit the kid. He was taken to a hospital.
  The dog bit the kid. He was taken to a veterinarian.

Textual Entailment
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
=> Yahoo acquired Overture.
Question Answering: Who acquired Overture?

Inference and Learning
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
- Learned classifiers for different sub-problems.
- Incorporate the classifiers' information, along with constraints, in making coherent decisions: decisions that respect the local classifiers as well as domain- and context-specific constraints.
- Global inference for the best assignment to all variables of interest.

Text Chunking
x = The guy standing there is so tall
[figure: chunking of x into phrases labeled NP, VP, ADVP, ADJP]

Full Parsing
x = The guy standing there is so tall
[figure: full parse tree y over x, with nodes S, NP, VP, ADVP, ADJP]

Outline
- Semantic Role Labeling Problem
- Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs. Local Training
  - Utility of Constraints in the Inference
- Conclusion

Semantic Role Labeling
I left my pearls to my daughter in my will.
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
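For concreteness, the labeled example above can be written down as token spans. The following is a hypothetical encoding (the token offsets, variable names, and layout are assumptions for illustration, not the system's internal representation):

```python
# Hypothetical span-based encoding of the example; (start, end) are token
# offsets with end exclusive.
tokens = "I left my pearls to my daughter in my will .".split()
predicate = 1  # "left"
arguments = [
    (0, 1, "A0"),       # [I]              Leaver
    (2, 4, "A1"),       # [my pearls]      Things left
    (4, 7, "A2"),       # [to my daughter] Benefactor
    (7, 10, "AM-LOC"),  # [in my will]     Location
]
```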
Semantic Role Labeling
- PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
  - It adds a layer of generic semantic labels to Penn Treebank II.
  - (Almost) all the labels are on constituents of the parse trees.
- Core arguments: A0-A5 and AA
  - different semantics for each verb
  - specified in the PropBank Frame files
- 13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type

The Approach
Running example: I left my nice pearls to her
1. Pruning: use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04]).
2. Argument Identification: use a binary classifier to identify arguments.
3. Argument Classification: use a multiclass classifier to classify arguments.
4. Inference: infer the final output satisfying the linguistic and structural constraints.

Learning
- Both the argument identifier and the argument classifier are trained as phrase-based classifiers.
- Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]
- Learning algorithm: SNoW, a sparse network of linear functions whose weights are learned by a regularized Winnow multiplicative update rule, with averaged weight vectors.
- Probability conversion is done via softmax: p_i = exp(act_i) / Σ_j exp(act_j)

Inference
- The output of the argument classifier often violates some constraints, especially when the sentence is long.
- Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
- Input: the probability estimates (from the argument classifier), plus structural and linguistic constraints.
- This allows incorporating expressive (non-sequential) constraints on the variables (the argument types).

Integer Linear Programming Inference
- For each argument a_i and type t (including null), set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.
- Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
- Any Boolean constraint can be encoded this way.
- If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments while satisfying the constraints.

Linear Constraints
- No overlapping or embedding arguments: if a_i and a_j overlap or embed, then a_{i,null} + a_{j,null} ≥ 1.

Constraints
- No overlapping or embedding arguments.
- No duplicate argument classes for A0-A5.
- Exactly one V argument per predicate.
- If there is a C-V, there must be a V-A1-C-V pattern.
- If there is an R-arg, there must be an arg somewhere.
- If there is a C-arg, there must be an arg somewhere before it.
- Each predicate can take only core arguments that appear in its frame file; more specifically, we check only the minimum and maximum ids.
(A solver sketch encoding some of these constraints follows this list.)
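To make the formulation concrete, here is a minimal sketch of the ILP inference step. It assumes the PuLP solver library purely for illustration (the talk does not commit to a particular solver), and it encodes only the first two constraints from the list; the candidate list, label set, and overlap pairs are hypothetical inputs.

```python
# A minimal sketch of the ILP inference step, assuming the PuLP library.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

CORE = ("A0", "A1", "A2", "A3", "A4", "A5")

def infer(scores, labels, overlapping_pairs):
    """scores[i][t]: estimated probability that candidate i has type t.
    labels: argument types, including "null".
    overlapping_pairs: (i, j) index pairs of candidates that overlap/embed."""
    n = len(scores)
    prob = LpProblem("srl_inference", LpMaximize)
    # Boolean indicator a[i][t] == 1 iff candidate i is labeled t
    a = [{t: LpVariable(f"a_{i}_{t}", cat="Binary") for t in labels}
         for i in range(n)]
    # Objective: the expected number of correct arguments
    prob += lpSum(scores[i][t] * a[i][t] for i in range(n) for t in labels)
    # Each candidate receives exactly one label (possibly "null")
    for i in range(n):
        prob += lpSum(a[i].values()) == 1
    # No overlapping or embedding arguments: a_{i,null} + a_{j,null} >= 1
    for i, j in overlapping_pairs:
        prob += a[i]["null"] + a[j]["null"] >= 1
    # No duplicate argument classes for A0-A5
    for t in CORE:
        if t in labels:
            prob += lpSum(a[i][t] for i in range(n)) <= 1
    prob.solve()
    return [max(labels, key=lambda t: a[i][t].value()) for i in range(n)]
```

Each additional Boolean constraint from the list above is added the same way, as one more linear inequality over the indicator variables.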
SRL Results (CoNLL-2005)
- Training: sections 02-21. Development: section 24. Test WSJ: section 23. Test Brown: from the Brown corpus (very small).

                          Precision   Recall     F1
  Development   Collins     73.89      70.11    71.95
                Charniak    75.40      74.13    74.76
  Test WSJ      Collins     77.09      72.00    74.46
                Charniak    78.10      76.15    77.11
  Test Brown    Collins     68.03      63.34    65.60
                Charniak    67.15      63.57    65.31

Inference with Multiple SRL Systems
- Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
- Any Boolean constraint can be encoded this way.
- Combine the systems' scores: score(a_i = t) = Σ_k f_k(a_i = t).
- If system k has no opinion on a_i, use a prior instead.

Results with Multiple Systems (CoNLL-2005)
[figure: bar chart of F1 for the Collins-based system, the Charniak-based systems (Char, Char-2 through Char-5), and the combined system; only the combined scores are legible]
- Combined F1: Devel 77.35, WSJ 79.44, Brown 67.75

Outline
- Semantic Role Labeling Problem
- Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs. Local Training
  - Utility of Constraints in the Inference
- Conclusion

Learning and Inference
- Training: with or without constraints?
  - Train the local classifiers without constraints, then test with inference over the constraints (L+I).
  - IBT (Inference-Based Training): learning the components together!
[figure: local classifiers f_1(x)..f_5(x) over inputs x_1..x_7 producing outputs y_1..y_5 in Y, connected through the constraints]
- Which one is better? When and why?

Comparisons of Learning Approaches
- Coupling (IBT): optimizes the true global objective function (this should be better in the limit).
- Decoupling (L+I): more efficient; reusability of classifiers; modularity in training; no global examples required.

Claims
- When the local classification problems are "easy", L+I outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.
- We will show experimental results and theoretical intuition to support these claims.

Perceptron-based Global Learning
[figure: the true global labeling Y = (-1, 1, -1, -1, 1) versus the prediction Y' = (-1, 1, 1, -1, 1) obtained by applying constraints to the local predictions of f_1..f_5 over x_1..x_7]
(A code sketch of this training loop follows the Summary slide below.)

Simulation
- There are 5 local binary linear classifiers.
- The global classifier is also linear: h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- Constraints are randomly generated.
- The hypothesis is linearly separable at the global level, given that the constraints are known.
- The separability level at the local level is varied.

Bound Prediction
- L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
- Local: err ≤ ε_opt + ((d log m + log 1/δ) / m)^{1/2}
- Global: err ≤ 0 + ((cd log m + c²d + log 1/δ) / m)^{1/2}
[figure: predicted bounds versus simulated data for ε_opt = 0, 0.1, 0.2]

Relative Merits: SRL
- L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
[figure: performance versus difficulty of the learning problem (# features), from easy to hard]

Summary
- When the local classification problems are "easy", L+I outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it then needs a large enough number of training examples.
- Why does inference help at all?
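As promised above, a minimal sketch of the simulation's constrained inference and the IBT update. Everything here is an illustrative assumption (the dimensionality, the particular constraint, the data): it shows the shape of the computation, not the experiment's actual code.

```python
# Illustrative sketch of h(x) = argmax_{y in C(Y)} sum_i f_i(x, y_i) with
# 5 local linear classifiers, plus one pass of inference-based training.
import itertools
import numpy as np

def global_predict(W, x, constraint):
    """Enumerate all 2^5 label vectors y in {-1,+1}^5, keep those in C(Y),
    and return the one maximizing sum_i y_i * (w_i . x)."""
    scores = W @ x
    feasible = (y for y in itertools.product((-1, 1), repeat=W.shape[0])
                if constraint(y))
    return max(feasible, key=lambda y: float(np.dot(scores, y)))

def ibt_epoch(W, data, constraint):
    """One IBT pass: constrained inference is run inside the training loop,
    and a perceptron update is applied only to the local classifiers that
    disagree with the true global labeling. (L+I would instead train each
    w_i on its own bit, ignoring the constraints.)"""
    for x, y_true in data:
        y_pred = global_predict(W, x, constraint)
        for i, (p, t) in enumerate(zip(y_pred, y_true)):
            if p != t:
                W[i] += t * x
    return W

# Tiny usage example with an arbitrary rule standing in for C(Y):
rng = np.random.default_rng(0)
W = np.zeros((5, 8))
constraint = lambda y: sum(v > 0 for v in y) >= 2
data = [(rng.normal(size=8), tuple(rng.choice((-1, 1), size=5)))
        for _ in range(20)]
W = ibt_epoch(W, data, constraint)
```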
About Constraints
- We always assume that global coherency is good.
- Constraints do help in real-world applications.
- However, performance is usually measured on the local predictions.
- Depending on the performance metric, constraints can hurt.

Results: Contribution of Expressive Constraints [Roth & Yih 05]
Basic: learning with statistical constraints only; the additional constraints are added at evaluation time (for efficiency).

  F1       basic (Viterbi)   + no dup         + cand           + argument       + verb pos       + disallow
  CRF-ML       66.46         67.10 (+0.64)    71.78 (+4.68)    71.71 (-0.07)    71.72 (+0.01)    71.94 (+0.22)
  CRF-D        69.14         69.74 (+0.60)    73.64 (+3.90)    73.71 (+0.07)    73.78 (+0.07)    73.91 (+0.13)

Assumptions
- y = (y_1, ..., y_l)
- Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.
- Inference is a linear summation:
  h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
  h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- C(Y) always contains the correct outputs.
- No assumption on the structure of the constraints.

Performance Metrics
- Zero-one loss: mistakes are counted globally; y is wrong if any y_i is wrong.
- Hamming loss: mistakes are counted locally.

Zero-One Loss
- Constraints cannot hurt: they never alter a correct global output, since the correct output is always feasible and remains the argmax over the smaller feasible set.
- This is not true for Hamming loss.

Boolean Cube
[figure: the cube of 4-bit binary outputs, from 0000 (0 mistakes) to 1111 (4 mistakes), arranged by Hamming loss against score]

Hamming Loss
[figure: the same cube with the unconstrained prediction h_un marked, showing its Hamming loss and score]

Best Classifiers
[figure: Hamming loss against score for the best classifiers on the Boolean cube]

When Constraints Cannot Hurt
- δ_i: the distance between the correct label and the 2nd-best label.
- γ_i: the distance between the predicted label and the correct label.
- F_correct = { i | f_i is correct }, F_wrong = { i | f_i is wrong }.
- Constraints cannot hurt if ∀ i ∈ F_correct: δ_i > Σ_{i ∈ F_wrong} γ_i.
(A small checker for this condition is sketched after the Utility of Constraints slide below.)

An Empirical Investigation
- SRL system, CoNLL-2005 WSJ test set.

  Local accuracy without constraints: 82.38%
  Local accuracy with constraints:    84.08%

An Empirical Investigation
                           Number   Percentage
  Total predictions         19822      100.00
  Correct predictions       16330       82.38
  Incorrect predictions      3492       17.62
  Violate the condition      1833        9.25

Good Classifiers
[figure: Hamming loss against score on the Boolean cube for good classifiers]

Bad Classifiers
[figure: Hamming loss against score on the Boolean cube for bad classifiers]

Average Distance vs. Gain in Hamming Loss
- For good classifiers: high loss implies low score (low gain).
[figure]

Average Gain in Hamming Loss vs. Distance
- For good classifiers: high score implies low loss (high gain).
[figure]

Utility of Constraints
- Constraints improve the performance because the classifiers are good:
  - When the classifier is correct, it allows a large margin between the correct label and the 2nd-best label.
  - When the classifier is wrong, the correct label is not far from the predicted one.
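The condition from the "When Constraints Cannot Hurt" slide can be written as a few lines of code. This is only a sketch of one reading of that condition; the function name and data layout are assumptions.

```python
def constraints_cannot_hurt(deltas, gammas, is_correct):
    """Check the sufficient condition above.
    deltas[i]  (delta_i): distance between the correct label and the
               2nd-best label under classifier f_i.
    gammas[i]  (gamma_i): distance between f_i's predicted label and the
               correct label.
    is_correct[i]: whether f_i predicted correctly.
    True if every correct classifier's margin exceeds the total distance
    accumulated over the wrong classifiers."""
    wrong_total = sum(g for g, ok in zip(gammas, is_correct) if not ok)
    return all(d > wrong_total for d, ok in zip(deltas, is_correct) if ok)
```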
Conclusions
- Showed how global inference can be used: Semantic Role Labeling.
- Tradeoffs between coupling vs. decoupling of learning and inference.
- Investigation of the utility of constraints.
- The analyses are very preliminary. Future directions:
  - average-case analysis of the tradeoffs between coupling vs. decoupling of learning and inference
  - better understanding of how to use constraints
  - more interactive classifiers
  - different performance metrics, e.g. F1
  - relation with margin