CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are used in this lecture

Outline
– Multi-Class classification
– Structured Prediction
– Models for Structured Prediction and Classification
  • Example: POS tagging

Multiclass problems
– Most of the machinery we discussed before was focused on binary classification problems
  – e.g., the SVMs we discussed so far
– However, most problems we encounter in NLP are either:
  • MultiClass: e.g., text categorization
  • Structured Prediction: e.g., predict the syntactic structure of a sentence
– How to deal with them?

Binary linear classification

Multiclass classification

Perceptron

Structured Perceptron
• Joint feature representation:
• Algorithm:

Perceptron

Binary Classification Margin

Generalize to MultiClass

Converting to MultiClass SVM

Max margin = Min Norm
• As before, these are equivalent formulations:

Problems:
• Requires separability
• What if we have noise in the data?
• What if we have only a simple, small feature space?

Non-separable case

Non-separable case

Compare with MaxEnt

Loss Comparison

Multiclass -> Structured
• So far, we considered multiclass classification
  • 0-1 losses l(y, y')
• What if what we want to predict is:
  • sequences of POS tags
  • syntactic trees
  • translations

Predicting word alignments

Predicting Syntactic Trees

Structured Models

Parsing

Max Margin Markov Networks (M3Ns)
• Taskar et al., 2003; similar: Tsochantaridis et al., 2004

Max Margin Markov Networks (M3Ns)

Solving MultiClass with binary learning
• MultiClass classifier
  – Function f : R^d -> {1, 2, 3, ..., k}
• Decompose into binary problems
• Not always possible to learn
• Different scale
• No theoretical justification

Real Problem

MultiClass Classification: Learning via One-Versus-All (OvA) Assumption
• Find v_r, v_b, v_g, v_y in R^n such that
  – v_r.x > 0 iff y = red
  – v_b.x > 0 iff y = blue
  – v_g.x > 0 iff y = green
  – v_y.x > 0 iff y = yellow
• H = R^(kn)
• Classifier f(x) = argmax_i v_i.x
• Individual Classifiers; Decision Regions

MultiClass Classification: Learning via All-Versus-All (AvA) Assumption
• Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy in R^d such that
  – v_rb.x > 0 if y = red, < 0 if y = blue
  – v_rg.x > 0 if y = red, < 0 if y = green
  – ... (for all pairs)
• H = R^(kkn)
• How to classify?
• Individual Classifiers; Decision Regions

MultiClass Classification: Classifying with AvA
• Tree
• Majority Vote: 1 red, 2 yellow, 2 green -> ?
• Tournament
• All are post-learning and might cause weird stuff
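A minimal sketch of the One-Versus-All decomposition described above, assuming simple mistake-driven (perceptron-style) updates for each binary subproblem. The class count, data, and function names are illustrative toy choices, not taken from the slides.

```python
# One-Versus-All (OvA) multiclass classification: learn one weight vector v_i
# per class so that v_i.x > 0 iff y == i, then classify with f(x) = argmax_i v_i.x.
# Toy data and perceptron-style updates; purely illustrative.
import numpy as np

def train_ova(X, y, num_classes, epochs=10):
    """Learn one weight vector per class with mistake-driven updates."""
    V = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            for c in range(num_classes):
                target = 1 if label == c else -1     # class c versus the rest
                if target * V[c].dot(x) <= 0:        # mistake on this binary problem
                    V[c] += target * x
    return V

def predict(V, x):
    """f(x) = argmax_i v_i.x over the k one-vs-rest scores."""
    return int(np.argmax(V.dot(x)))

# Toy usage: 3 classes in a 2-dimensional feature space
X = np.array([[2.0, 0.1], [1.8, -0.2], [-0.1, 2.1],
              [0.2, 1.7], [-2.0, -1.9], [-1.6, -2.2]])
y = np.array([0, 0, 1, 1, 2, 2])
V = train_ova(X, y, num_classes=3)
print([predict(V, x) for x in X])  # ideally recovers [0, 0, 1, 1, 2, 2]
```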
POS Tagging
• English tags

POS Tagging: examples from WSJ (from McCallum)

POS Tagging
• Ambiguity: not a trivial task
• A useful task: important features for other steps are based on POS
  • E.g., use POS as input to a parser

But still, why so popular?
– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers: both sequence models and independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than treebanks (for other languages)

HMM (reminder)

HMM (reminder) – transitions

Transition Estimates

Emission Estimates

MaxEnt (reminder)

Decoding: HMM vs MaxEnt

Accuracies overview

Accuracies overview

SVMs for tagging
– We can use SVMs in a similar way as MaxEnt (or other classifiers)
– We can use a window around the word
– 97.16% on WSJ

SVMs for tagging (from Jimenez & Marquez)

No sequence modeling

CRFs and other global models

CRFs and other global models

Compare
• HMMs
• MEMMs
  – Note: after each step t, the remaining probability mass cannot be reduced; it can only be distributed among the possible state transitions
• CRFs – no local normalization

Label Bias
(based on a slide from Joe Drish)

Label Bias
• Recall transition-based parsing: Nivre's algorithm (with beam search)
• At each step we can observe only local features (limited look-ahead)
• If we later see that the following word is impossible, we can only distribute probability uniformly across all (im)possible decisions
• If there is only a small number of such decisions, we cannot decrease the probability dramatically
• So, label bias is likely to be a serious problem if:
  • there are non-local dependencies
  • states have a small number of possible outgoing transitions

POS Tagging Experiments
– "+" is an extended feature set (hard to integrate in a generative model)
– oov – out-of-vocabulary

Supervision
– We considered the supervised case before: the training set is labeled
– However, we can try to induce word classes without supervision: unsupervised tagging
– We will later discuss the EM algorithm
– We can also do it in a partly supervised way:
  – Seed tags
  – Small labeled dataset
  – Parallel corpus
  – ...

Why not predict POS tags and parse trees simultaneously?
– It is possible and often done this way
– Doing tagging internally often benefits parsing accuracy
– Unfortunately, parsing models are less robust than taggers
  – e.g., non-grammatical sentences, different domains
– It is more expensive and does not help...

Questions
• Why is there no label-bias problem for a generative model (e.g., an HMM)?
• How would you integrate word features into a generative model (e.g., HMMs for POS tagging)?
  • e.g., if the word has the suffixes -ing, -s, -ed, -d, -ment, ... or the prefixes post-, de-, ...

"CRFs" for more complex structured output problems
• We considered sequence labeling problems
• Here, the structure of dependencies is fixed
• What if we do not know the structure but would like to have interactions respecting the structure?

"CRFs" for more complex structured output problems
• Recall, we had the MST algorithm (McDonald and Pereira, 05)

"CRFs" for more complex structured output problems
• Complex inference
  • E.g., arbitrary 2nd-order dependency parsing models are not tractable (non-projective); NP-complete (McDonald & Pereira, EACL 06)
• Recently, conditional models for constituent parsing:
  • (Finkel et al., ACL 08)
  • (Carreras et al., CoNLL 08)
  • ...
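The HMM tagging slides above (transition estimates, emission estimates, decoding) rely on Viterbi decoding to find the most probable tag sequence. A minimal sketch, assuming a bigram HMM; the tag set and the probabilities are made-up toy values, not the slides' estimates.

```python
# Viterbi decoding for a bigram HMM tagger. Toy transition/emission
# probabilities are hypothetical; real values would come from corpus counts
# as in the transition/emission estimate slides.
import math

def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence under P(tags) * P(words | tags)."""
    # delta[t] = best log-probability of any tag sequence ending in tag t
    delta = {t: math.log(start[t]) + math.log(emit[t].get(words[0], 1e-6)) for t in tags}
    backptrs = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tags:
            # pick the best previous tag for the current tag t
            prev = max(tags, key=lambda p: delta[p] + math.log(trans[p][t]))
            new_delta[t] = (delta[prev] + math.log(trans[prev][t])
                            + math.log(emit[t].get(w, 1e-6)))
            pointers[t] = prev
        delta = new_delta
        backptrs.append(pointers)
    # follow back-pointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    path = [best]
    for pointers in reversed(backptrs):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy model with two tags and hypothetical probabilities
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"time": 0.6, "flies": 0.3}, "V": {"flies": 0.7, "like": 0.4}}
print(viterbi(["time", "flies"], tags, start, trans, emit))  # -> ['N', 'V']
```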
Back to MultiClass
– Let us review how to decompose a multiclass problem into binary classification problems

Summary
• Margin-based methods for multiclass classification and structured prediction
• CRFs vs HMMs vs MEMMs for POS tagging

Conclusions
• All approaches use a linear representation
• The differences are:
  – Features
  – How to learn weights
  – Training paradigms:
    • Global training (CRF, Global Perceptron)
    • Modular training (PMM, MEMM, ...)
      – These approaches are easier to train, but may require additional mechanisms to enforce global constraints.
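As a concrete illustration of the "Global training (Global Perceptron)" item above, here is a minimal sketch of a structured perceptron for sequence labeling. It assumes a joint feature map phi(x, y) and a stand-in greedy decoder (exact Viterbi decoding would normally be used); all names and the toy data are hypothetical, not definitions from the slides.

```python
# Structured (global) perceptron: update on whole sequences,
# w += phi(x, gold) - phi(x, predicted), whenever the predicted sequence is wrong.
from collections import Counter

def phi(words, tags):
    """Joint feature representation: emission and transition indicator features."""
    feats, prev = Counter(), "<s>"
    for w, t in zip(words, tags):
        feats[("emit", t, w)] += 1
        feats[("trans", prev, t)] += 1
        prev = t
    return feats

def decode(words, tagset, w):
    """Stand-in decoder: greedy left-to-right argmax under the current weights."""
    tags, prev = [], "<s>"
    for word in words:
        best = max(tagset, key=lambda t: w[("emit", t, word)] + w[("trans", prev, t)])
        tags.append(best)
        prev = best
    return tags

def structured_perceptron(data, tagset, epochs=5):
    """Global training: compare the full predicted sequence against the gold one."""
    w = Counter()
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, tagset, w)
            if pred != gold:
                w.update(phi(words, gold))       # add gold-sequence features
                w.subtract(phi(words, pred))     # subtract predicted-sequence features
    return w

# Toy usage with a hypothetical two-sentence training set
data = [(["the", "dog", "barks"], ["D", "N", "V"]),
        (["the", "cat", "sleeps"], ["D", "N", "V"])]
w = structured_perceptron(data, tagset=["D", "N", "V"])
print(decode(["the", "dog", "sleeps"], ["D", "N", "V"], w))  # ideally ['D', 'N', 'V']
```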