Max-margin sequential learning methods
William W. Cohen
CALD

Announcements
• Upcoming assignments:
  – Wed 3/3: project proposal due (personnel + 1-2 pages)
  – Spring break next week, no class
  – You will get feedback on project proposals by the end of break
  – Write-ups for the "Distance Metrics for Text" week are due Wed 3/17, not the Monday after spring break

Collins' paper
• Notation:
  – a label (y) is a "tag" t
  – an observation (x) is a word w
  – a history h is a 4-tuple <t_i, t_{i-1}, w[1:n], i>
  – phi_s(h, t) is a feature of h and t

Collins' paper (cont'd)
• Notation, continued:
  – Phi is the sum of phi over all positions i
  – alpha_s is the weight given to phi_s

Collins' paper: the theory
• Claim 1: the algorithm is an instance of this perceptron variant.
• Claim 2: the arguments in the mistake-bound classification results of Freund & Schapire (1999) (F&S99) extend immediately to this ranking task as well.

F&S99 algorithm

F&S99 result

Collins' result

Results
• Two experiments:
  – POS tagging, using Adwait Ratnaparkhi's features
  – NP chunking (Start, Continue, Outside tags)
  – plus NER on a special AT&T dataset (from another paper)

Features for NP chunking

Results

More ideas
• The dual version of a perceptron:
  – w is built up by repeatedly adding examples, so w is a weighted sum of the examples x_1, ..., x_m
  – the inner product <w, x> can therefore be rewritten as
      <w, x> = sum_{j=1..m} alpha_j <x_j, x>,  where each alpha_j is 0, +1, or -1
    since w = sum_{j=1..m} alpha_j x_j
    (sketched in code at the end of these notes)

Dual version of perceptron ranking
• alpha_{i,j}, where i ranges over examples and j over correct/incorrect tag sequences

NER features for re-ranking MAXENT tagger output

NER features

NER results

Altun et al. paper
• Starting point:
  – the dual version of Collins' perceptron algorithm
  – the final hypothesis is a weighted sum of inner products with a subset of the examples
  – this is a lot like an SVM, except that the weights are set by the perceptron algorithm rather than by quadratic optimization

SVM optimization
• Notation:
  – y_i is the correct tag sequence for x_i
  – y is an incorrect tag sequence
  – F(x_i, y_i) are the features
• Optimization problem:
  – find weights w on the examples that maximize the minimal margin, subject to ||w|| = 1, or equivalently
  – minimize ||w||^2 such that every margin is >= 1

SVMs for ranking

SVMs for ranking
• Proposition: optimization problems (14) and (15) are equivalent.

SVMs for ranking
• This is a binary classification problem, with (x_i, y_i) as the positive example and each (x_i, y') as a negative example, except that theta_i varies per example. Why? Because we are ranking.

SVMs for ranking
• Altun et al. give the remaining details
• As in perceptron learning, "negative" data is found by running Viterbi with the learned weights and looking for errors (sketched in code at the end of these notes)
  – each mistake is a possible new support vector
  – the data must be iterated over repeatedly
  – convergence could take exponential time if the support vectors are dense...

Altun et al. results
• NER on 300 sentences from the CoNLL-2002 shared task (Spanish)
  – four entity types, nine labels (beginning-T, intermediate-T, other)
• POS tagging on 300 sentences from the Penn TreeBank
• 5-fold cross-validation, window of size 3, simple features

Altun et al. results

Altun et al. results
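
Code sketch: the dual perceptron. This is a minimal illustration of the rewriting on the "More ideas" slide, assuming a plain binary-classification setting with dense NumPy feature vectors; the function and variable names are hypothetical and not taken from Collins' paper.

```python
import numpy as np

def dual_perceptron(examples, labels, epochs=1):
    """Dual (kernel-style) perceptron: w is never stored explicitly.

    examples: list of np.ndarray feature vectors x_1..x_m
    labels:   list of +1 / -1 labels
    Returns alpha, where w = sum_j alpha[j] * examples[j].
    After a single pass (epochs=1) each alpha[j] is in {-1, 0, +1},
    matching the statement on the slide.
    """
    m = len(examples)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            x, y = examples[i], labels[i]
            # <w, x> rewritten as a weighted sum of inner products with examples
            score = sum(alpha[j] * np.dot(examples[j], x)
                        for j in range(m) if alpha[j] != 0)
            if y * score <= 0:      # mistake: add (or subtract) this example
                alpha[i] += y
    return alpha
```

The point of the dual form is that the classifier only ever touches the data through inner products <x_j, x>, which is what lets the same machinery be kernelized or compared to an SVM.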
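Code sketch: mining "negative" data with Viterbi. This is an assumed, simplified outline of the loop described on the last "SVMs for ranking" slide; viterbi_decode is a placeholder supplied by the caller, and none of these names come from Altun et al.'s paper. Each returned triple would become a margin constraint of the form <w, F(x, y_gold) - F(x, y_hat)> >= 1 for the optimizer.

```python
def mine_negative_examples(sentences, gold_tags, weights, viterbi_decode):
    """One pass of mistake-driven constraint generation.

    sentences:      list of observation sequences x
    gold_tags:      list of correct tag sequences y
    weights:        current model parameters
    viterbi_decode: function (x, weights) -> best tag sequence under weights
    Returns (x, y_gold, y_hat) triples where decoding was wrong; each one
    is a candidate support vector / violated margin constraint.
    """
    violations = []
    for x, y_gold in zip(sentences, gold_tags):
        y_hat = viterbi_decode(x, weights)   # best sequence under current weights
        if y_hat != y_gold:                  # a mistake => a violated constraint
            violations.append((x, y_gold, y_hat))
    return violations
```

In the full algorithm this loop alternates with re-solving the quadratic program over the accumulated constraints, which is why the slides note that the data must be iterated over repeatedly.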