Conditional Random Fields

Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• For example, POS tagging:
    DT  NN  VBD  IN  DT  NN  .
    The cat sat  on  the mat .

Sequence Labeling: The Problem
• Another example, partial parsing (aka chunking):
    B-NP  I-NP  B-VP  B-PP  B-NP  I-NP
    The   cat   sat   on    the   mat

Sequence Labeling: The Problem
• Another example, relation extraction:
    B-Arg  I-Arg  B-Rel  I-Rel  B-Arg  I-Arg
    The    cat    sat    on     the    mat

The CRF Equation
• A CRF model consists of
  – F = <f_1, …, f_k>, a vector of "feature functions"
  – θ = <θ_1, …, θ_k>, a vector of weights for each feature function.
• Let O = <o_1, …, o_T> be an observed sentence.
• Let X = <x_1, …, x_T> be the latent variables.

    P(X = x \mid O) = \frac{\exp(\theta \cdot F(x, O))}{\sum_{x'} \exp(\theta \cdot F(x', O))}

• This is the same as the Maximum Entropy equation!

CRF Equation, standard format
• Note that the denominator depends on O, but not on x (it marginalizes over x).
• Typically, we write

    P(X = x \mid O) = \frac{1}{Z(O)} \exp(\theta \cdot F(x, O)),  where  Z(O) = \sum_{x'} \exp(\theta \cdot F(x', O))

Making Structured Predictions

Structured prediction vs. Text Classification
Recall maximum entropy for text classification:

    \arg\max_c P(A = c \mid O = doc)
      = \arg\max_c \frac{1}{Z(doc)} \exp(\theta \cdot F(c, doc))
      = \arg\max_c \theta \cdot F(c, doc)

CRFs for sequence labeling:

    \arg\max_y P(A = y \mid O)
      = \arg\max_y \frac{1}{Z(O)} \exp(\theta \cdot F(y, O))
      = \arg\max_y \theta \cdot F(y, O)

What's the difference?

Structured prediction vs. Text Classification
Two (related) differences, both for the sake of efficiency:
1) Feature functions in CRFs are restricted to graph parts (described later).
2) We can't do brute force to compute the argmax. Instead, we do Viterbi.

Finding the Best Sequence
The best sequence is

    \arg\max_x P(X = x \mid O) = \arg\max_x \frac{1}{Z(O)} \exp(\theta \cdot F(x, O)) = \arg\max_x \theta \cdot F(x, O)

Recall from the HMM discussion: if there are K possible states for each x_i variable, and N total x_i variables, then there are K^N possible settings for x. So brute force can't find the best sequence. Instead, we resort to a Viterbi-like dynamic program.

Viterbi Algorithm
[Trellis figure: states X_1 … X_{t-1}, X_t = h_j over observations o_1 … o_T]

    \delta_j(t) = \max_{x_1 \ldots x_{t-1}} \theta \cdot F(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = h_j, o_t)

δ_j(t) is the score of the state sequence that maximizes the score of seeing the observations to time t-1, landing in state h_j at time t, and seeing the observation at time t.

Viterbi Algorithm
[Trellis figure: states X_1 … X_T over observations o_1 … o_T]

    \hat{X}_T = \arg\max_i \delta_i(T)
    \hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)

The score of the best sequence is \max_i \delta_i(T). Compute the most likely state sequence by working backwards (ψ is the backpointer recorded during the recursion).

Viterbi Algorithm
[Trellis figure: states X_1 … X_{t-1}, X_t = h_j, X_{t+1} over observations o_1 … o_T]

    \delta_j(t) = \max_{x_1 \ldots x_{t-1}} \theta \cdot F(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = h_j, o_t)
    \delta_j(t+1) = \max_i \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}        ??!
    \psi_j(t+1) = \arg\max_i \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}      ??!

Recursive computation? Not quite: a CRF has no transition probabilities a_{ij} or emission probabilities b_{jo}. To get a recursion for the CRF score, we need feature functions that decompose over the graph, which is where graph parts come in.

Feature functions and Graph parts
To make efficient computation (dynamic programs) possible, we restrict the feature functions to:
• Graph parts (or just parts): a feature function that counts how often a particular configuration occurs for a clique in the CRF graph.
• Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.

Clique Example
The cliques in a linear chain CRF are the set of individual nodes and the set of pairs of consecutive nodes.
[Figure: linear-chain CRF with nodes X1 … X6 over observations o1 … o6]
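To make this concrete, here is a minimal sketch (not from the slides) that enumerates both kinds of cliques for a six-node chain like the one in the figure; the node names simply follow the figure's labels.

```python
# Minimal sketch (illustrative only): enumerate the cliques of a linear-chain
# CRF over six positions, using the node names X1..X6 from the figure.
nodes = [f"X{i}" for i in range(1, 7)]

# Individual-node cliques: one singleton clique per position.
node_cliques = [(n,) for n in nodes]

# Pair-of-node cliques: one clique per pair of consecutive positions.
pair_cliques = [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]

print(node_cliques)   # six singleton cliques ('X1',) ... ('X6',)
print(pair_cliques)   # five consecutive pairs ('X1','X2') ... ('X5','X6')
```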
Clique Example
Individual node cliques: each single node X_i is a clique.
[Figure: linear-chain CRF with nodes X1 … X6 over o1 … o6, individual nodes highlighted]

Clique Example
Pair-of-node cliques: each pair of consecutive nodes (X_i, X_{i+1}) is a clique.
[Figure: the same chain with the pairs (X1,X2), (X2,X3), …, (X5,X6) highlighted]

Clique Example
For non-linear-chain CRFs (something we won't normally consider in this class), you can get larger cliques.
[Figure: a CRF with an additional node X5', giving larger cliques]

Graph part as Feature Function Example
Graph parts are feature functions f(x, o) that count how many cliques have a particular configuration. For example, f(x, o) = count of [x_i = Noun].

    x:  D    N    V    D    A    N
    o:  o1   o2   o3   o4   o5   o6

Here, x_2 and x_6 are both Nouns, so f(x, o) = 2.

Graph part as Feature Function Example
For a pair-of-nodes example, f(x, o) = count of [x_i = Noun, x_{i+1} = Verb].

    x:  D    N    V    D    A    N
    o:  o1   o2   o3   o4   o5   o6

Here, x_2 is a Noun and x_3 is a Verb, so f(x, o) = 1.

Features can depend on the whole observation
In a CRF, each feature function can depend on o, in addition to a clique in x. Normally, we draw a CRF like this:
[Figure: an HMM, and a CRF drawn with each X_i connected only to its own o_i]

Features can depend on the whole observation
But really, it's more like this:
[Figure: a CRF with each X_i connected to the entire observation o_1 … o_6]
This would cause problems for a generative model, but in a conditional model o is always a fixed constant. So we can still calculate the relevant algorithms, like Viterbi, efficiently.

Graph part as Feature Function Example
An example part including x and o: f(x, o) = count of [x_i = A or D, x_{i+1} = N, o_2 = cat].

    x:  D    N    V       D    A     N
    o:  The  cat  chased  the  tiny  fly

Here, x_1 is a D and x_2 is an N, plus x_5 is an A and x_6 is an N, plus o_2 = cat, so f(x, o) = 2.
Notice that the clique (x_5, x_6) is allowed to depend on o_2.

Graph part as Feature Function Example
A more usual example including x and o: f(x, o) = count of [x_i = A or D, x_{i+1} = N, o_{i+1} = cat].

    x:  D    N    V       D    A     N
    o:  The  cat  chased  the  tiny  fly

Here, x_1 is a D and x_2 is an N, plus o_2 = cat, so f(x, o) = 1.

The CRF Equation, with Parts
• A CRF model consists of
  – P = <p_1, …, p_k>, a vector of parts
  – θ = <θ_1, …, θ_k>, a vector of weights for each part.
• Let O = <o_1, …, o_T> be an observed sentence.
• Let X = <x_1, …, x_T> be the latent variables.

    P(X = x \mid O) = \frac{\exp(\theta \cdot P(x, O))}{Z(O)}
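The following is a minimal sketch (not from the slides) that counts the example parts above on the running sentence and computes the unnormalized score exp(θ · P(x, o)); the part functions mirror the slide examples, but the weights are made up for illustration.

```python
import math

# Running example from the slides: tags over "The cat chased the tiny fly".
o = ["The", "cat", "chased", "the", "tiny", "fly"]
x = ["D", "N", "V", "D", "A", "N"]

def f_node_noun(x, o):
    # count of [x_i = N]
    return sum(1 for xi in x if xi == "N")

def f_pair_noun_verb(x, o):
    # count of [x_i = N, x_{i+1} = V]
    return sum(1 for i in range(len(x) - 1) if x[i] == "N" and x[i + 1] == "V")

def f_det_or_adj_then_noun_cat(x, o):
    # count of [x_i in {A, D}, x_{i+1} = N, o_{i+1} = "cat"]
    return sum(1 for i in range(len(x) - 1)
               if x[i] in ("A", "D") and x[i + 1] == "N" and o[i + 1] == "cat")

parts = [f_node_noun, f_pair_noun_verb, f_det_or_adj_then_noun_cat]
theta = [0.5, 1.2, 2.0]   # made-up weights, for illustration only

# Unnormalized score exp(theta . P(x, o)); dividing by Z(o), a sum of such
# scores over every possible label sequence, would give P(x | o).
score = math.exp(sum(w * f(x, o) for w, f in zip(theta, parts)))
print([f(x, o) for f in parts])   # [2, 1, 1], matching the slide examples
print(score)
```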
Viterbi Algorithm – 2nd Try
[Trellis figure: states X_1 … X_{t-1}, X_t = h_j, X_{t+1} over observations o_1 … o_T]

    \delta_j(t) = \max_{x_1 \ldots x_{t-1}} \theta \cdot P(x_1 \ldots x_{t-1}, x_t = h_j, o)

    \delta_j(t+1) = \max_i \big[ \delta_i(t) + \theta_{one} \cdot p_{one}(x_{t+1} = h_j, o) + \theta_{pair} \cdot p_{pair}(x_t = h_i, x_{t+1} = h_j, o) \big]
    \psi_j(t+1) = \arg\max_i \big[ \delta_i(t) + \theta_{one} \cdot p_{one}(x_{t+1} = h_j, o) + \theta_{pair} \cdot p_{pair}(x_t = h_i, x_{t+1} = h_j, o) \big]

Recursive computation: because the parts are restricted to single-node and consecutive-pair cliques, the best score ending in h_j at time t+1 depends only on δ_i(t) and on the parts that touch positions t and t+1.

Supervised Parameter Estimation

Conditional Training
• Given a set of observations o and the correct labels x for each, determine the best θ:

    \arg\max_\theta P(x \mid o, \theta)

• Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:
  – Determine the gradient
  – Step in the direction of the gradient
  – Repeat until convergence

Recall: Training a ME model
Training is an optimization problem: find the value of λ that maximizes the conditional log-likelihood of the training data:

    CLL(Train) = \sum_{\langle c,d \rangle \in Train} \log P(c \mid d) = \sum_{\langle c,d \rangle \in Train} \Big[ \sum_j \lambda_j f_j(c,d) - \log Z(d) \Big]

Recall: Training a ME model
Optimization is normally performed using some form of gradient ascent:
0) Initialize λ^0 to 0.
1) Compute the gradient: ∇CLL.
2) Take a step in the direction of the gradient: λ^{i+1} = λ^i + α ∇CLL.
3) Repeat until CLL doesn't improve: stop when |CLL(λ^{i+1}) - CLL(λ^i)| < ε.

Recall: Training a ME model
Computing the gradient:

    \frac{\partial}{\partial \lambda_i} CLL(Train)
      = \frac{\partial}{\partial \lambda_i} \sum_{\langle c,d \rangle \in Train} \Big[ \sum_j \lambda_j f_j(c,d) - \log Z(d) \Big]
      = \frac{\partial}{\partial \lambda_i} \sum_{\langle c,d \rangle \in Train} \Big[ \sum_j \lambda_j f_j(c,d) - \log \sum_{c'} \exp \sum_j \lambda_j f_j(c',d) \Big]
      = \sum_{\langle c,d \rangle \in Train} \Big[ f_i(c,d) - \frac{\sum_{c'} f_i(c',d) \exp \sum_j \lambda_j f_j(c',d)}{\sum_{c''} \exp \sum_j \lambda_j f_j(c'',d)} \Big]
      = \sum_{\langle c,d \rangle \in Train} \Big[ f_i(c,d) - \sum_{c'} P_\lambda(c' \mid d) f_i(c',d) \Big]
      = \sum_{\langle c,d \rangle \in Train} f_i(c,d) - E_{P_\lambda}[f_i(c,d)]

In other words, the observed count of feature i minus its expected count under the model. The expectation involves a sum over all possible classes c'.

Recall: Training a ME model: Expected feature counts
• In ME models, each document d is classified independently.
• The sum \sum_{c'} P_\lambda(c' \mid d) f_i(c',d) involves as many terms as there are classes c'.
• Very doable.

Training a CRF
Writing λ for the CRF weights, to match the ME notation:

    \frac{\partial}{\partial \lambda_i} CLL(Train)
      = \frac{\partial}{\partial \lambda_i} \sum_{\langle x,o \rangle \in Train} \Big[ \sum_{j,t} \lambda_j f_j(x_t, x_{t+1}, o_t) - \log Z(o) \Big]
      = \frac{\partial}{\partial \lambda_i} \sum_{\langle x,o \rangle \in Train} \Big[ \sum_{j,t} \lambda_j f_j(x_t, x_{t+1}, o_t) - \log \sum_{x'} \exp \sum_{j,t} \lambda_j f_j(x'_t, x'_{t+1}, o_t) \Big]
      = \sum_{\langle x,o \rangle \in Train} \Big[ \sum_t f_i(x_t, x_{t+1}, o_t) - \frac{\sum_{x'} \big( \sum_t f_i(x'_t, x'_{t+1}, o_t) \big) \exp \sum_{j,t} \lambda_j f_j(x'_t, x'_{t+1}, o_t)}{\sum_{x''} \exp \sum_{j,t} \lambda_j f_j(x''_t, x''_{t+1}, o_t)} \Big]
      = \sum_{\langle x,o \rangle \in Train} \Big[ \sum_t f_i(x_t, x_{t+1}, o_t) - \sum_{x'} P_\lambda(x' \mid o) \sum_t f_i(x'_t, x'_{t+1}, o_t) \Big]

The last term is the hard part for CRFs.

Training a CRF: Expected feature counts
• For CRFs, the term \sum_{x'} P_\lambda(x' \mid o) \sum_t f_i(x'_t, x'_{t+1}, o_t) involves an exponential sum: one term per possible label sequence x'.
• The solution again involves dynamic programming, very similar to the Forward algorithm for HMMs.
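To make the dynamic-programming claim concrete, here is a minimal sketch (not from the slides) of a forward-style computation of log Z(o) for a linear-chain CRF; the node and pair scoring functions are hypothetical stand-ins for θ_one · p_one and θ_pair · p_pair, and the check against brute force uses made-up scores. Expected feature counts are obtained from the same forward pass together with a matching backward pass.

```python
import math
from itertools import product

def log_sum_exp(vals):
    # Numerically stable log(sum(exp(v) for v in vals)).
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_Z(T, labels, node_score, pair_score):
    """Forward algorithm for a linear-chain CRF (sketch).

    node_score(t, k):     score of the single-node part for x_t = k given o
    pair_score(t, k, k2): score of the pair part for x_t = k, x_{t+1} = k2 given o
    Returns log Z(o) = log of the sum over all label sequences of exp(total score).
    """
    # alpha[k] = log of the summed exp-scores of all prefixes ending in label k at time t
    alpha = {k: node_score(0, k) for k in labels}
    for t in range(1, T):
        alpha = {k2: log_sum_exp([alpha[k] + pair_score(t - 1, k, k2) for k in labels])
                      + node_score(t, k2)
                 for k2 in labels}
    return log_sum_exp(list(alpha.values()))

# Tiny check against brute-force enumeration, with made-up scores.
labels = ["D", "N", "V"]
T = 4
node = lambda t, k: 0.1 * t + (0.5 if k == "N" else 0.0)
pair = lambda t, k, k2: 1.0 if (k, k2) == ("D", "N") else 0.0

brute = log_sum_exp([sum(node(t, x[t]) for t in range(T)) +
                     sum(pair(t, x[t], x[t + 1]) for t in range(T - 1))
                     for x in product(labels, repeat=T)])
print(round(log_Z(T, labels, node, pair), 6), round(brute, 6))  # the two should match
```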
CRFs vs. HMMs

Generative (Joint Probability) Models
• HMMs are generative models: that is, they can compute the joint probability P(sentence, hidden-states).
• From a generative model, one can compute
  – two conditional models: P(sentence | hidden-states) and P(hidden-states | sentence)
  – the marginal models P(sentence) and P(hidden-states)
• For sequence labeling, we want P(hidden-states | sentence).

Discriminative (Conditional) Models
• Most often, people are most interested in the conditional probability P(hidden-states | sentence). For example, this is the distribution needed for sequence labeling.
• Discriminative (also called conditional) models directly represent the conditional distribution P(hidden-states | sentence).
  – These models cannot tell you the joint distribution, marginals, or other conditionals.
  – But they're quite good at this particular conditional distribution.

Discriminative vs. Generative
• Marginal, or language model, P(sentence):
  – HMM (generative): Forward or Backward algorithm, linear in the length of the sentence.
  – CRF (discriminative): Can't do it.
• Find optimal label sequence:
  – HMM: Viterbi, linear in the length of the sentence.
  – CRF: Viterbi, linear in the length of the sentence.
• Supervised parameter estimation:
  – HMM: Bayesian learning; easy and fast.
  – CRF: Convex optimization; can be slow-ish (multiple passes through the data).
• Unsupervised parameter estimation:
  – HMM: Baum-Welch (non-convex optimization); slow but doable.
  – CRF: Very difficult, and requires making extra assumptions.
• Feature functions:
  – HMM: Parents and children in the graph. Restrictive!
  – CRF: Arbitrary functions of a latent state and any portion of the observed nodes.

CRFs vs. HMMs, a closer look
It's possible to convert an HMM into a CRF:
• Set p_{prior,state}(x, o) = count[x_1 = state] and θ_{prior,state} = log P_HMM(x_1 = state) = log π_state
• Set p_{trans,state1,state2}(x, o) = count[x_i = state1, x_{i+1} = state2] and θ_{trans,state1,state2} = log P_HMM(x_{i+1} = state2 | x_i = state1) = log A_{state1,state2}
• Set p_{obs,state,word}(x, o) = count[x_i = state, o_i = word] and θ_{obs,state,word} = log P_HMM(o_i = word | x_i = state) = log B_{state,word}

CRF vs. HMM, a closer look
• If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities. Therefore, they will all be between -∞ and 0.
• Notice: CRF parameters can be anywhere between -∞ and +∞.
• So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)?
  – HMMs have more bias
  – CRFs have more variance

Comparing feature functions
• The biggest advantage of CRFs over HMMs is that they can handle overlapping features.
• For example, for POS tagging, using words as features (like x_i = "the" or x_j = "jogging") is quite useful. However, it's often also useful to use "orthographic" features, like "the word ends in -ing" or "the word starts with a capital letter." These features overlap: some words end in "ing", some don't.
• Generative models have trouble handling overlapping features correctly.
• Discriminative models don't: they can simply use the features.

CRF Example

A CRF POS Tagger for English

Vocabulary
• We need to determine the set of possible word types V.
• Let V = {all types in 1 million tokens of Wall Street Journal text, which we'll use for training} ∪ {UNKNOWN} (for word types we haven't seen).

L = Label Set
Standard Penn Treebank tagset:

     1. CC    Coordinating conjunction
     2. CD    Cardinal number
     3. DT    Determiner
     4. EX    Existential there
     5. FW    Foreign word
     6. IN    Preposition or subordinating conjunction
     7. JJ    Adjective
     8. JJR   Adjective, comparative
     9. JJS   Adjective, superlative
    10. LS    List item marker
    11. MD    Modal
    12. NN    Noun, singular or mass
    13. NNS   Noun, plural
    14. NNP   Proper noun, singular
    15. NNPS  Proper noun, plural
    16. PDT   Predeterminer
    17. POS   Possessive ending
    18. PRP   Personal pronoun
    19. PRP$  Possessive pronoun
    20. RB    Adverb
    21. RBR   Adverb, comparative
    22. RBS   Adverb, superlative
    23. RP    Particle
    24. SYM   Symbol
    25. TO    to
    26. UH    Interjection
    27. VB    Verb, base form
    28. VBD   Verb, past tense
    29. VBG   Verb, gerund or present participle
    30. VBN   Verb, past participle
    31. VBP   Verb, non-3rd person singular present
    32. VBZ   Verb, 3rd person singular present
    33. WDT   Wh-determiner
    34. WP    Wh-pronoun
    35. WP$   Possessive wh-pronoun
    36. WRB   Wh-adverb
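Before listing the feature templates, here is a minimal sketch (not from the slides) of how the vocabulary and label set above might be represented; the wsj_tokens list is a placeholder standing in for the 1 million WSJ training tokens.

```python
# Minimal sketch of the tagger's vocabulary and label set.
wsj_tokens = ["The", "cat", "sat", "on", "the", "mat", "."]  # placeholder for ~1M WSJ tokens

UNKNOWN = "UNKNOWN"
V = set(wsj_tokens) | {UNKNOWN}      # word types seen in training, plus UNKNOWN

# Penn Treebank tagset, from the table above.
L = ["CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
     "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
     "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
     "VBZ", "WDT", "WP", "WP$", "WRB"]

def lookup(word):
    # Map unseen word types to UNKNOWN, as described in the Vocabulary slide.
    return word if word in V else UNKNOWN
```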
CRF Features
• Prior (one feature per label k): x_i = k
• Transition (per label pair k, k'): x_i = k and x_{i+1} = k'
• Word (per label k and word w, or pair of words w, w'):
  – x_i = k and o_i = w
  – x_i = k and o_{i-1} = w
  – x_i = k and o_{i+1} = w
  – x_i = k and o_i = w and o_{i-1} = w'
  – x_i = k and o_i = w and o_{i+1} = w'
• Orthography: Suffix (per label k and suffix s in {"ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity", …}): x_i = k and o_i ends with s
• Orthography: Punctuation (per label k):
  – x_i = k and o_i is capitalized
  – x_i = k and o_i is hyphenated
  – x_i = k and o_i contains a period
  – x_i = k and o_i is ALL CAPS
  – x_i = k and o_i contains a digit (0-9)
• …
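To illustrate how templates like these might be instantiated, here is a minimal sketch (not from the slides); the feature-name strings, the prev_label convention, and the truncated suffix list are illustrative choices, not part of the original tagger.

```python
def token_features(o, i, prev_label, label):
    """Sketch: instantiate some of the feature templates above for position i
    of sentence o, with candidate labels (prev_label, label). Returns a set of
    fired feature names; a real tagger would map these to indices of theta."""
    w = o[i]
    feats = {
        f"prior:{label}",                      # Prior template
        f"trans:{prev_label}>{label}",         # Transition template
        f"word:{label}:{w.lower()}",           # Word template (current word)
    }
    if i > 0:
        feats.add(f"prevword:{label}:{o[i-1].lower()}")
    if i + 1 < len(o):
        feats.add(f"nextword:{label}:{o[i+1].lower()}")
    for s in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if w.lower().endswith(s):
            feats.add(f"suffix:{label}:{s}")   # Orthography: Suffix
    if w[0].isupper():
        feats.add(f"capitalized:{label}")      # Orthography: Punctuation group
    if "-" in w:
        feats.add(f"hyphenated:{label}")
    if "." in w:
        feats.add(f"has_period:{label}")
    if w.isupper():
        feats.add(f"all_caps:{label}")
    if any(ch.isdigit() for ch in w):
        feats.add(f"has_digit:{label}")
    return feats

print(sorted(token_features(["The", "cat", "sat"], 0, "<s>", "DT")))
```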