Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando Pereira

Goal: Sequence segmentation and labeling
• Computational biology
• Computational linguistics
• Computer science

Overview
• HMM -> MEMM -> CRF
• RF -> MRF -> CRF

Task Comparison
[Figure: graphical-model icons for the two columns. Generative: HMM, with state transition s' -> s and emission of o. Discriminative / Conditional: MEMM, CRF, with s' -> s conditioned on the observation o.]
• Evaluation
  – HMM: find P(o^T | M)
• Decoding = prediction
  – HMM: find s^T such that P(o^T | s^T, M) is maximized
  – MEMM, CRF: find s^T such that P(s^T | o^T, M) is maximized
• Learning
  – HMM: given o, find M such that P(o | M) is maximized (needs EM because s is unknown)
  – MEMM, CRF: given o and s, find M such that P(s | o, M) is maximized (a simpler maximum-likelihood problem)

CWS Example
• Chinese word segmentation of 我 是 歌 手 ("I am a singer") with labels S S B E (S = single-character word, B = word beginning, E = word end)
• [Figure: MEMM and CRF lattices over 我 是 歌 手. The MEMM side shows local factors such as P(S|S), P(B|S), P(B|E) and P(我|S), P(是|S), P(歌|B), P(手|E); the CRF side shows observation-dependent transition features such as P(B|S, 歌).]

Generative Models
• HMMs and stochastic grammars
• Assign a joint probability to paired observation and label sequences
• Parameters are trained to maximize the joint likelihood of training examples

Generative Models
• Need to enumerate all possible observation sequences
• To keep inference tractable, must make strong independence assumptions (i.e., conditional independence given the labels)

Example: MEMMs
• Maximum entropy Markov models
• Each source state has an exponential model that takes the observation features as input and outputs a distribution over possible next states
• Weakness: the label bias problem

Label Bias Problem
• Per-state normalization of transition scores implies "conservation of score mass"
• Bias towards states with fewer outgoing transitions
• A state with a single outgoing transition effectively ignores its observation

Label Bias Example (after Wallach '02)
• Tag set: D determiner, N noun, V verb, A adjective, R adverb
• [Figure: a lattice from start state s to end state e with two paths: The/D robot/N wheels/N are/V round/A and Fred/N wheels/V round/R]
• Observation: "The robot wheels are round."
• But if P(V | N, wheels) > P(N | N, wheels), the upper path (tagging "wheels" as a verb) is chosen regardless of the rest of the observation

Solving Label Bias
• Start with a fully connected model and let the training procedure figure out a good structure
• Drawback: precludes the use of prior structural knowledge

Overview
• Generative models
• Conditional models
• Label bias problem
• Conditional random fields
• Experiments

Conditional Random Fields
• Undirected graph (random field)
• Construct a conditional model p(Y | X)
• Does not explicitly model the marginal p(X)
• Assumption: the graph is fixed

Normalization
• MEMMs normalize transition scores per state; a CRF has a single global normalizer Z(x) for the whole observation sequence

CRFs: Distribution
• For a chain-structured CRF:
  p_λ(y | x) = (1 / Z(x)) exp( Σ_{i,k} λ_k f_k(y_{i-1}, y_i, x) + Σ_{i,k} μ_k g_k(y_i, x) )

CRFs: Parameter Estimation
• Maximize the log-likelihood objective function over the N sentences in the training corpus:
  O(λ) = Σ_{i=1}^{N} log p_λ(y^{(i)} | x^{(i)}) ∝ Σ_{x,y} p̃(x, y) log p_λ(y | x)

CRF Parameter Estimation
• Iterative scaling: maximize the likelihood O(λ) by iteratively updating λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k
• Define an auxiliary function A(λ', λ) such that A(λ', λ) ≤ O(λ') − O(λ)
• Initialize each λ_k
• Do until convergence:
  – Solve dA(λ', λ)/dδλ_k = 0 for each δλ_k
  – Update each parameter: λ_k ← λ_k + δλ_k

CRF Parameter Estimation
• For a chain CRF, setting dA(λ', λ)/dδλ_k = 0 gives
  Ẽ[f_k] ≡ Σ_{x,y} p̃(x, y) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)
         = Σ_{x,y} p̃(x) p_λ(y | x) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) exp( δλ_k T(x, y) )
• T(x, y) = Σ_{i,k} f_k(y_{i-1}, y_i, x) + Σ_{i,k} g_k(y_i, x) is the total feature count
• Unfortunately, T(x, y) is a global property of (x, y)
• Dynamic programming would have to sum over sequences with potentially varying T, making the exponentiated sum inefficient to compute
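As a concrete aside on the objective O(λ): for a chain CRF, evaluating log p_λ(y | x) only requires the score of the labeled path and the global normalizer Z(x), which a forward recursion computes efficiently. The sketch below is a minimal illustration under assumed conventions (the `scores` array layout, the dummy start-state trick, and the function name are inventions here, not anything from the paper).

```python
import numpy as np

def chain_log_likelihood(scores, labels):
    """Conditional log-likelihood log p(y | x) for one linear-chain sequence.

    scores[i, y_prev, y] is assumed to hold the summed, weighted feature values
    sum_k lambda_k * f_k(y_prev, y, x) + sum_k mu_k * g_k(y, x) at position i;
    by convention here, position 0 uses state 0 as a dummy "start" predecessor.
    labels is the gold label sequence y_1 ... y_n, with n = scores.shape[0].
    """
    n = scores.shape[0]

    # Score of the gold label sequence: sum of its per-position feature scores.
    gold, prev = 0.0, 0
    for i in range(n):
        gold += scores[i, prev, labels[i]]
        prev = labels[i]

    # Forward recursion in log space: alpha[y] is the log-sum of exp(score)
    # over all label prefixes of length i+1 that end in label y.
    alpha = scores[0, 0, :].copy()               # out of the dummy start state
    for i in range(1, n):
        # logsumexp over the previous label for every current label
        alpha = np.logaddexp.reduce(alpha[:, None] + scores[i], axis=0)
    log_Z = np.logaddexp.reduce(alpha)           # global normalizer log Z(x)

    return gold - log_Z                          # log p(y | x), always <= 0

# Tiny usage example: 3 positions, 2 labels, random feature scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 2, 2))
print(chain_log_likelihood(scores, [0, 1, 1]))   # log-probability of that labeling
```

The gradient of O(λ) additionally needs the feature expectations under p_λ(y | x), which is where the forward and backward variables of the following slides come in.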
Algorithm S (Generalized Iterative Scaling)
• Introduce a global slack feature so that the total feature count becomes a constant S for every (x, y):
  s(x, y) ≡ S − Σ_{i,k} f_k(y_{i-1}, y_i, x) − Σ_{i,k} g_k(y_i, x)
• Define forward and backward variables:
  α_i(y | x) = Σ_{y'} α_{i-1}(y' | x) exp( Σ_k λ_k f_k(y', y, x) + Σ_k μ_k g_k(y, x) )
  β_i(y | x) = Σ_{y'} β_{i+1}(y' | x) exp( Σ_k λ_k f_k(y, y', x) + Σ_k μ_k g_k(y', x) )

Algorithm S
• The update equations become
  δλ_k = (1/S) log( Ẽ[f_k] / E[f_k] ),   δμ_k = (1/S) log( Ẽ[g_k] / E[g_k] )
• where the model expectations E[f_k] and E[g_k] are computed from α and β, just as posterior transition and state probabilities are computed in the HMM forward-backward algorithm
• Note: the rate of convergence is governed by the slack constant S

Algorithm T (Improved Iterative Scaling)
• The equation we want to solve,
  Ẽ[f_k] = Σ_{x,y} p̃(x) p_λ(y | x) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) exp( δλ_k T(x, y) ),
  is polynomial in exp(δλ_k), so it can be solved with Newton's method
• Define T(x) ≡ max_y T(x, y); replacing T(x, y) by T(x) and grouping sequences by t = T(x), the model-expectation side becomes
  Σ_{t=0}^{T_max} a_{k,t} ( exp(δλ_k) )^t,   where T_max = max_x T(x)
• a_{k,t} = Σ_{x,y} p̃(x) p_λ(y | x) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) δ(t, T(x)), with b_{k,t} defined analogously for g_k
• Update: solve the polynomial Σ_t a_{k,t} ( exp(δλ_k) )^t = Ẽ[f_k] for exp(δλ_k) with Newton's method (and similarly for δμ_k using b_{k,t})

Overview
• Generative models
• Conditional models
• Label bias problem
• Conditional random fields
• Experiments

Experiment 1: Modeling Label Bias
• Generate data from a simple HMM that encodes a noisy version of the label-bias network above
• Each state emits its designated symbol with probability 29/32
• 2,000 training and 500 test samples
• MEMM error: 42%; CRF error: 4.6%

Experiment 2: More Synthetic Data
• Five labels: a–e
• 26 observation values: A–Z
• Generate data from a mixed-order HMM
• Randomly generate models; for each model, generate a sample of 1,000 sequences of length 25
• [Figure: pairwise scatter plots of per-model error rates: MEMM vs. CRF, MEMM vs. HMM, CRF vs. HMM]

Experiment 3: Part-of-Speech Tagging
• Each word is to be labeled with one of 45 syntactic tags
• 50%-50% train-test split
• Out-of-vocabulary (oov) words: words not observed in the training set

Part-of-Speech Tagging
• Second set of experiments: add a small set of orthographic features (whether the word is capitalized, whether it ends in -ing, -ogy, -ed, -s, -ly, ...)
• Overall error rate reduced by 25% and oov error reduced by around 50%

Part-of-Speech Tagging
• Usually start training with the zero parameter vector (corresponding to the uniform distribution)
• Alternative: use the optimal MEMM parameter vector as the starting point for training the corresponding CRF
• MEMM+ trained to convergence in around 100 iterations; CRF+ took an additional 1,000 iterations
• When starting from the uniform distribution, CRF+ had not converged after 2,000 iterations

Further Aspects of CRFs
• Automatic feature selection
• Start from feature-generating rules and evaluate the benefit of the generated features automatically on the data

Conclusions
• CRFs do not suffer from the label bias problem!
• Parameter estimation is guaranteed to find the global optimum
• Limitation: slow convergence of the training algorithm

CRFs: Example Features
• With HMM-like transition and emission features, the corresponding parameters λ and μ are similar to the (logarithms of the) HMM parameters p(y'|y) and p(x|y)
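To make that correspondence concrete, here is a minimal sketch (assumed feature names, weight dictionary, and helper functions; not code from the paper) of HMM-like binary features for a chain CRF: a transition indicator that fires on the label bigram and an emission indicator that fires on the (label, word) pair, so their weights can be read as, or initialized from, log HMM transition and emission probabilities.

```python
from collections import defaultdict

def transition_features(y_prev, y, x, i):
    """Names of transition features f_k(y_{i-1}, y_i, x) firing at position i.

    x and i are available so richer, observation-dependent transition features
    could be added; the HMM-like version uses only the label bigram.
    """
    return [("trans", y_prev, y)]

def emission_features(y, x, i):
    """Names of emission features g_k(y_i, x) firing at position i."""
    return [("emit", y, x[i])]

def position_score(weights, y_prev, y, x, i):
    """sum_k lambda_k f_k + sum_k mu_k g_k at position i (unseen features score 0)."""
    feats = transition_features(y_prev, y, x, i) + emission_features(y, x, i)
    return sum(weights[f] for f in feats)

# Usage: with one weight per feature, the per-position score mirrors an HMM's
# log transition probability (y_prev -> y) plus log emission probability (word | y).
weights = defaultdict(float)
weights[("trans", "N", "V")] = -0.1    # roughly log p(V | N) in an HMM
weights[("emit", "V", "are")] = -0.5   # roughly log p("are" | V) in an HMM

x = ["The", "robot", "wheels", "are", "round"]
print(position_score(weights, "N", "V", x, 3))   # score for tagging "are" as V after N
```

Because a CRF can also include features an HMM cannot express, such as observation-dependent transitions like the P(B|S, 歌) feature in the CWS example, this construction is a starting point rather than an equivalence.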