STRUCTURED PERCEPTRON
Alice Lai and Shi Zhi

Presentation Outline
• Introduction to Structured Perceptron
• ILP-CRF Model
• Averaged Perceptron
• Latent Variable Perceptron

Motivation
• An algorithm to learn weights for structured prediction
• Alternative to POS tagging with MEMM and CRF (Collins 2002)
• Convergence guarantees under certain conditions, even for inseparable data
• Generalizes to new examples and to other sequence labeling problems

POS Tagging Example
• Example sentence: "the man saw the dog", with candidate tags {D, N, A, V} at each position (tag lattice figure)
• Gold labels: the/D man/N saw/V the/D dog/N
• Prediction: the/D man/N saw/N the/D dog/N
• Parameter update:
  • Add 1: α_{D,N,V}, α_{N,V,D}, α_{V,D,N}, α_{V,saw}
  • Subtract 1: α_{D,N,N}, α_{N,N,D}, α_{N,D,N}, α_{N,saw}

MEMM Approach
• Conditional model: probability of the current state given the previous state and the current observation
• For the tagging problem, define local features for each tag in context
• Features are often indicator functions
• Learn parameter vector α with Generalized Iterative Scaling or gradient descent

Global Features
• Local features are defined only for a single label
• Global features are defined for an observed sequence and a possible label sequence
• Simple version: global features are local features summed over an observation-label sequence pair
• Compared to the original perceptron algorithm, we predict a vector of labels instead of a single label
• Which of the possible incorrect label vectors do we use as the negative example in training?

Structured Perceptron Algorithm
Input: training examples (x_i, y_i)
Initialize parameter vector α = 0
For t = 1…max_iter:
  For i = 1…n:
    y* = argmax_{y ∈ GEN(x_i)} Φ(x_i, y) · α
    If y* ≠ y_i then update: α = α + Φ(x_i, y_i) − Φ(x_i, y*)
Output: parameter vector α
GEN(x_i) enumerates the possible label sequences y for observed sequence x_i.

Properties
• Convergence
  • Data {(x_i, y_i)} is separable with margin δ > 0 if there is some vector U with ‖U‖ = 1 such that
    ∀i, ∀y ∈ GEN(x_i) − {y_i}:  U · Φ(x_i, y_i) − U · Φ(x_i, y) ≥ δ
  • For data separable with margin δ, the number of mistakes made in training is bounded by R²/δ², where R is a constant such that
    ∀i, ∀y ∈ GEN(x_i) − {y_i}:  ‖Φ(x_i, y_i) − Φ(x_i, y)‖ ≤ R
• Inseparable case
  • Number of mistakes ≤ min_{U,δ} (R + 2·D_{U,δ})² / δ²
• Generalization
• Theorems and proofs are from Collins 2002

Global vs. Local Learning
• Global learning (IBT): constraints are used during training
• Local learning (L+I): classifiers are trained without constraints; constraints are applied later to produce the global output
• Example: ILP-CRF model [Roth and Yih 2005]

Perceptron IBT
• This is structured perceptron!
Input: training examples (x_i, y_i)
Initialize parameter vector α = 0
For t = 1…max_iter:
  For i = 1…n:
    y* = argmax_{y ∈ GEN(x_i)} F(x_i, y) · α
    If y* ≠ y_i then update: α = α + F(x_i, y_i) − F(x_i, y*)
Output: parameter vector α
GEN(x_i) enumerates the possible label sequences for observed sequence x_i; F is the global feature (scoring) function.
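
To make the pseudocode concrete, here is a minimal Python sketch of the training loop, assuming simple tag-bigram and tag-word indicator features and Viterbi decoding as the argmax over GEN(x). The feature templates, tagset, and toy data are illustrative assumptions, not Collins' exact setup.

```python
# Minimal structured perceptron sketch (after Collins 2002), with assumed
# tag-bigram + tag-word indicator features and Viterbi as the argmax over GEN(x).
from collections import defaultdict

TAGS = ["D", "N", "A", "V"]

def features(words, tags):
    """Global feature vector Phi(x, y): local indicator features summed over the sequence."""
    phi = defaultdict(int)
    prev = "<s>"
    for w, t in zip(words, tags):
        phi[("tag-word", t, w)] += 1
        phi[("tag-bigram", prev, t)] += 1
        prev = t
    return phi

def viterbi(words, alpha):
    """argmax_{y in GEN(x)} Phi(x, y) . alpha, computed by dynamic programming."""
    score = {t: alpha[("tag-bigram", "<s>", t)] + alpha[("tag-word", t, words[0])] for t in TAGS}
    back = []  # back[i][t] = best predecessor tag of t at position i+1
    for w in words[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: score[p] + alpha[("tag-bigram", p, t)])
            new_score[t] = score[best_prev] + alpha[("tag-bigram", best_prev, t)] + alpha[("tag-word", t, w)]
            pointers[t] = best_prev
        score, back = new_score, back + [pointers]
    t = max(TAGS, key=lambda t: score[t])  # best final tag, then follow back-pointers
    tags = [t]
    for pointers in reversed(back):
        t = pointers[t]
        tags.append(t)
    return list(reversed(tags))

def train(examples, max_iter=10):
    alpha = defaultdict(float)
    for _ in range(max_iter):
        for words, gold in examples:
            pred = viterbi(words, alpha)
            if pred != gold:  # additive update: alpha += Phi(x, y_gold) - Phi(x, y_pred)
                for f, v in features(words, gold).items():
                    alpha[f] += v
                for f, v in features(words, pred).items():
                    alpha[f] -= v
    return alpha

if __name__ == "__main__":
    data = [("the man saw the dog".split(), ["D", "N", "V", "D", "N"])]
    alpha = train(data)
    print(viterbi(data[0][0], alpha))  # expected: ['D', 'N', 'V', 'D', 'N']
```

The POS example in the slides uses tag trigram features; this sketch uses bigrams only to keep the Viterbi code short, but the additive update α = α + Φ(x, y_gold) − Φ(x, y_pred) is the same.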
Perceptron L+I
• Decomposition: y* = argmax_y [α · f(x, y) + β · Φ(x, y)], where f are local features and Φ are global features
• Prediction: y* = argmax_{y ∈ GEN(x_i)} f(x_i, y) · α
• If y* ≠ y_i then update: α = α + f(x_i, y_i) − f(x_i, y*)
• Either learn a parameter vector β for the global features Φ, or do inference only at evaluation time

ILP-CRF Introduction [Roth and Yih 2005]
• ILP-CRF model for Semantic Role Labeling as a sequence labeling problem
• Viterbi inference for CRFs can include constraints
  • Cannot handle long-range or general constraints
• Viterbi is a shortest path problem that can be solved with ILP
• Use integer linear programming to express general constraints during inference
• Allows expressive constraints, including long-range constraints between distant tokens that cannot be handled by Viterbi
(Figure: Viterbi inference as a shortest path from source s to sink t through a lattice with label nodes A, B, C at each position)

ILP-CRF Models
• Global training (IBT): CRF trained with max log-likelihood; CRF trained with voted perceptron
• Local training (L+I): perceptron, winnow, voted perceptron, voted winnow

ILP-CRF Results
(Table: results comparing sequential models, locally trained models with and without inference (L+I), and IBT models; see Roth and Yih 2005)

ILP-CRF Conclusions
• Local learning models perform poorly on their own, but their performance improves dramatically when constraints are added at evaluation
• Performance is then comparable to IBT methods
• The best models for global and local training show comparable results
• L+I vs. IBT: L+I requires fewer training examples, is more efficient, and outperforms IBT in most situations (unless the local problems are difficult to solve) [Punyakanok et al., IJCAI 2005]

Variations: Voted Perceptron
• For iteration t = 1,…,T:
  • For example i = 1,…,n:
    • Given parameters α^{t,i}, get the label sequence for the example by Viterbi decoding:
      best_tags^{t,i} = argmax_{tags} α^{t,i} · Φ(words, tags)
• Each parameter vector α^{t,i} defines a tagging sequence for the example
• The voted perceptron takes the most frequently occurring output in the set {best_tags^{1,1}, …, best_tags^{T,n}}

Variations: Voted Perceptron
• Averaged algorithm (Collins '02): an approximation of the voted method that uses the averaged parameters γ instead of the final parameters α^{T,n}:
  γ = Σ_{t=1,…,T; i=1,…,n} α^{t,i} / (nT)
• Performance:
  • Higher F-measure, lower error rate
  • Greater stability: less variance in its scores
• Variation: a modified averaged algorithm for the latent perceptron

Variations: Latent Structure Perceptron
• Model definition:
  y' = argmax_{y ∈ Y} max_{h ∈ H} w · Φ(x, h, y)
• w is the perceptron parameter vector; Φ(·) is the feature encoding function mapping (x, h, y) to a feature vector
• In the NER task, x is the word sequence, y is the named-entity type sequence, and h is the hidden (latent) variable sequence
• Features: unigrams and bigrams over words, POS tags, and orthography (prefixes, upper/lower case)
• Why latent variables? They capture latent dependencies (i.e. hidden sub-structure)

Variations: Latent Structure Perceptron
• Purely latent structure perceptron (Connor's)
• Training: structured perceptron with margin
  • C: margin
  • α: learning rate
• Variation: modified parameter-averaging method (Sun's): re-initialize the parameters with the averaged parameters every k iterations
  • Advantage: reduces overfitting of the latent perceptron
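
For concreteness, below is a minimal Python sketch of a purely latent structured perceptron with a margin update, in the spirit of the slides above, assuming the label space Y and latent space H are small enough to enumerate by brute force. The feature templates, helper names (phi, best), margin C, learning rate alpha, and toy data are illustrative assumptions, not the exact setup of Connor's or Sun's systems.

```python
# Sketch of a purely latent structured perceptron with a margin update.
# Assumes tiny, enumerable Y and H so every argmax/max is brute-forced.
from collections import defaultdict
from itertools import product

def dot(w, feats):
    return sum(w[f] * v for f, v in feats.items())

def phi(x, h, y):
    """Joint feature vector Phi(x, h, y): toy indicators tying words, latent tags, and labels."""
    feats = defaultdict(float)
    for word, hi, yi in zip(x, h, y):
        feats[("word-latent", word, hi)] += 1
        feats[("latent-label", hi, yi)] += 1
    return feats

def best(x, w, Y, H, exclude_y=None):
    """Highest-scoring (y, h) pair; optionally skip one label sequence (the gold one)."""
    n = len(x)
    candidates = ((y, h) for y in product(Y, repeat=n) if y != exclude_y
                  for h in product(H, repeat=n))
    return max(candidates, key=lambda yh: dot(w, phi(x, yh[1], yh[0])))

def train(examples, Y, H, C=1.0, alpha=0.5, epochs=10):
    """Perceptron with margin: if the gold structure does not beat the best wrong
    structure by at least C, move w toward the gold features (scaled by alpha)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in examples:
            # best latent assignment for the gold labels: max_h w . Phi(x, h, y_gold)
            h_gold = max(product(H, repeat=len(x)),
                         key=lambda h: dot(w, phi(x, h, y_gold)))
            # best competing (y, h) with y != y_gold
            y_neg, h_neg = best(x, w, Y, H, exclude_y=y_gold)
            if dot(w, phi(x, h_gold, y_gold)) - dot(w, phi(x, h_neg, y_neg)) < C:
                for f, v in phi(x, h_gold, y_gold).items():
                    w[f] += alpha * v
                for f, v in phi(x, h_neg, y_neg).items():
                    w[f] -= alpha * v
    return w

if __name__ == "__main__":
    # toy NER-style data: x = words, y = entity labels, h is left latent
    data = [(("john", "runs"), ("PER", "O")), (("acme", "hires"), ("ORG", "O"))]
    w = train(data, Y=("PER", "ORG", "O"), H=("h0", "h1"))
    print(best(("john", "hires"), w, ("PER", "ORG", "O"), ("h0", "h1"))[0])
```

In practice the brute-force enumeration in best would be replaced by Viterbi-style dynamic programming or search over the latent structure, and Sun's variant would additionally re-initialize w with its running average every k iterations.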
Variations: Latent Structure Perceptron
• Disadvantage of the purely latent perceptron: h* is found and then forgotten for each x
• Solution: Online Latent Classifier (Connor's)
• Two classifiers:
  • a latent classifier with parameter vector u
  • a label classifier with parameter vector w
• Joint prediction:
  (y*, h*) = argmax_{y ∈ Y, h ∈ H} [w · Φ(x, h, y) + u · Φ_u(x, h)]

Variations: Latent Structure Perceptron
• Online Latent Classifier training (Connor's)
  (training algorithm figure)

Variations: Latent Structure Perceptron
• Experiments: Bio-NER with the purely latent perceptron
  (Figure: training time and F-measure for different settings; cc = cut-off, Odr = order of dependency, including high-order models)

Variations: Latent Structure Perceptron
• Experiments: Semantic Role Labeling with argument/predicate structure as the latent structure
  • X: "She likes yellow flowers" (sentence)
  • Y: agent / predicate / (none) / patient (roles)
  • H: predicate (exactly one) and arguments (at least one) (latent structure)
• Optimization for (h*, y*): search over all possible argument/predicate structures; for more complex data, other methods are needed
  (Table: results on the test set)

Summary
• Structured perceptron definition and motivation
• IBT vs. L+I
• Variations of the structured perceptron

References:
• M. Collins. Discriminative Training for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002.
• X. Sun, T. Matsuzaki, D. Okanohara, and J. Tsujii. Latent Variable Perceptron Algorithm for Structured Classification. IJCAI 2009.
• D. Roth and W. Yih. Integer Linear Programming Inference for Conditional Random Fields. ICML 2005.
• M. Connor, C. Fisher, and D. Roth. Online Latent Structure Training for Language Acquisition. IJCAI 2011.