Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando Pereira

Goal: Sequence segmentation and labeling
• Computational biology
• Computational linguistics
• Computer science

Overview
• HMM -> MEMM -> CRF
• RF -> MRF -> CRF

Task Comparison
[Figure: graphical-model icons for the two columns. Generative: HMM, with state transition s' -> s and emission of o. Discriminative / Conditional: MEMM, CRF, with s' -> s conditioned on the observation o.]
• Evaluation
  – HMM: find P(o^T | M)
• Decoding = prediction
  – HMM: find s^T such that P(o^T | s^T, M) is maximized
  – MEMM, CRF: find s^T such that P(s^T | o^T, M) is maximized
• Learning
  – HMM: given o, find M such that P(o | M) is maximized (needs EM because s is unknown)
  – MEMM, CRF: given o and s, find M such that P(s | o, M) is maximized (a simpler maximum-likelihood problem)

CWS Example
• Chinese word segmentation of 我 是 歌 手 ("I am a singer") with labels S S B E (S = single-character word, B = word beginning, E = word end)
• [Figure: MEMM and CRF lattices over 我 是 歌 手. The MEMM side shows local factors such as P(S|S), P(B|S), P(B|E) and P(我|S), P(是|S), P(歌|B), P(手|E); the CRF side shows observation-dependent transition features such as P(B|S, 歌).]

Generative Models
• HMMs and stochastic grammars
• Assign a joint probability to paired observation and label sequences
• Parameters are trained to maximize the joint likelihood of training examples

Generative Models
• Need to enumerate all possible observation sequences
• To keep inference tractable, must make strong independence assumptions (i.e., conditional independence given the labels)

Example: MEMMs
• Maximum entropy Markov models
• Each source state has an exponential model that takes the observation features as input and outputs a distribution over possible next states
• Weakness: the label bias problem

Label Bias Problem
• Per-state normalization of transition scores implies "conservation of score mass"
• Bias towards states with fewer outgoing transitions
• A state with a single outgoing transition effectively ignores its observation

Label Bias Example (after Wallach '02)
• Tag set: D determiner, N noun, V verb, A adjective, R adverb
• [Figure: a lattice from start state s to end state e with two paths: The/D robot/N wheels/N are/V round/A and Fred/N wheels/V round/R]
• Observation: "The robot wheels are round."
• But if P(V | N, wheels) > P(N | N, wheels), the upper path (tagging "wheels" as a verb) is chosen regardless of the rest of the observation

Solving Label Bias
• Start with a fully connected model and let the training procedure figure out a good structure
• Drawback: precludes the use of prior structural knowledge

Overview
• Generative models
• Conditional models
• Label bias problem
• Conditional random fields
• Experiments

Conditional Random Fields
• Undirected graph (random field)
• Construct a conditional model p(Y | X)
• Does not explicitly model the marginal p(X)
• Assumption: the graph is fixed

Normalization
• MEMMs normalize transition scores per state; a CRF has a single global normalizer Z(x) for the whole observation sequence

CRFs: Distribution
• For a chain-structured CRF:
  p_λ(y | x) = (1 / Z(x)) exp( Σ_{i,k} λ_k f_k(y_{i-1}, y_i, x) + Σ_{i,k} μ_k g_k(y_i, x) )

CRFs: Parameter Estimation
• Maximize the log-likelihood objective function over the N sentences in the training corpus:
  O(λ) = Σ_{i=1}^{N} log p_λ(y^{(i)} | x^{(i)}) ∝ Σ_{x,y} p̃(x, y) log p_λ(y | x)

CRF Parameter Estimation
• Iterative scaling: maximize the likelihood O(λ) by iteratively updating λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k
• Define an auxiliary function A(λ', λ) such that A(λ', λ) ≤ O(λ') − O(λ)
• Initialize each λ_k
• Do until convergence:
  – Solve dA(λ', λ)/dδλ_k = 0 for each δλ_k
  – Update each parameter: λ_k ← λ_k + δλ_k

CRF Parameter Estimation
• For a chain CRF, setting dA(λ', λ)/dδλ_k = 0 gives
  Ẽ[f_k] ≡ Σ_{x,y} p̃(x, y) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)
         = Σ_{x,y} p̃(x) p_λ(y | x) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) exp( δλ_k T(x, y) )
• T(x, y) = Σ_{i,k} f_k(y_{i-1}, y_i, x) + Σ_{i,k} g_k(y_i, x) is the total feature count
• Unfortunately, T(x, y) is a global property of (x, y)
• Dynamic programming would have to sum over sequences with potentially varying T, making the exponentiated sum inefficient to compute
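As a concrete aside on the objective O(λ): for a chain CRF, evaluating log p_λ(y | x) only requires the score of the labeled path and the global normalizer Z(x), which a forward recursion computes efficiently. The sketch below is a minimal illustration under assumed conventions (the `scores` array layout, the dummy start-state trick, and the function name are inventions here, not anything from the paper).

```python
import numpy as np

def chain_log_likelihood(scores, labels):
    """Conditional log-likelihood log p(y | x) for one linear-chain sequence.

    scores[i, y_prev, y] is assumed to hold the summed, weighted feature values
    sum_k lambda_k * f_k(y_prev, y, x) + sum_k mu_k * g_k(y, x) at position i;
    by convention here, position 0 uses state 0 as a dummy "start" predecessor.
    labels is the gold label sequence y_1 ... y_n, with n = scores.shape[0].
    """
    n = scores.shape[0]

    # Score of the gold label sequence: sum of its per-position feature scores.
    gold, prev = 0.0, 0
    for i in range(n):
        gold += scores[i, prev, labels[i]]
        prev = labels[i]

    # Forward recursion in log space: alpha[y] is the log-sum of exp(score)
    # over all label prefixes of length i+1 that end in label y.
    alpha = scores[0, 0, :].copy()               # out of the dummy start state
    for i in range(1, n):
        # logsumexp over the previous label for every current label
        alpha = np.logaddexp.reduce(alpha[:, None] + scores[i], axis=0)
    log_Z = np.logaddexp.reduce(alpha)           # global normalizer log Z(x)

    return gold - log_Z                          # log p(y | x), always <= 0

# Tiny usage example: 3 positions, 2 labels, random feature scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 2, 2))
print(chain_log_likelihood(scores, [0, 1, 1]))   # log-probability of that labeling
```

The gradient of O(λ) additionally needs the feature expectations under p_λ(y | x), which is where the forward and backward variables of the following slides come in.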
Algorithm S (Generalized Iterative Scaling)
• Introduce a global slack feature so that the total feature count becomes a constant S for every (x, y):
  s(x, y) ≡ S − Σ_{i,k} f_k(y_{i-1}, y_i, x) − Σ_{i,k} g_k(y_i, x)
• Define forward and backward variables:
  α_i(y | x) = Σ_{y'} α_{i-1}(y' | x) exp( Σ_k λ_k f_k(y', y, x) + Σ_k μ_k g_k(y, x) )
  β_i(y | x) = Σ_{y'} β_{i+1}(y' | x) exp( Σ_k λ_k f_k(y, y', x) + Σ_k μ_k g_k(y', x) )

Algorithm S
• The update equations become
  δλ_k = (1/S) log( Ẽ[f_k] / E[f_k] ),   δμ_k = (1/S) log( Ẽ[g_k] / E[g_k] )
• where the model expectations E[f_k] and E[g_k] are computed from α and β, just as posterior transition and state probabilities are computed in the HMM forward-backward algorithm
• Note: the rate of convergence is governed by the slack constant S

Algorithm T (Improved Iterative Scaling)
• The equation we want to solve,
  Ẽ[f_k] = Σ_{x,y} p̃(x) p_λ(y | x) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) exp( δλ_k T(x, y) ),
  is polynomial in exp(δλ_k), so it can be solved with Newton's method
• Define T(x) ≡ max_y T(x, y); replacing T(x, y) by T(x) and grouping sequences by t = T(x), the model-expectation side becomes
  Σ_{t=0}^{T_max} a_{k,t} ( exp(δλ_k) )^t,   where T_max = max_x T(x)
• a_{k,t} = Σ_{x,y} p̃(x) p_λ(y | x) Σ_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) δ(t, T(x)), with b_{k,t} defined analogously for g_k
• Update: solve the polynomial Σ_t a_{k,t} ( exp(δλ_k) )^t = Ẽ[f_k] for exp(δλ_k) with Newton's method (and similarly for δμ_k using b_{k,t})

Overview
• Generative models
• Conditional models
• Label bias problem
• Conditional random fields
• Experiments

Experiment 1: Modeling Label Bias
• Generate data from a simple HMM that encodes a noisy version of the label-bias network above
• Each state emits its designated symbol with probability 29/32
• 2,000 training and 500 test samples
• MEMM error: 42%; CRF error: 4.6%

Experiment 2: More Synthetic Data
• Five labels: a–e
• 26 observation values: A–Z
• Generate data from a mixed-order HMM
• Randomly generate models; for each model, generate a sample of 1,000 sequences of length 25
• [Figure: pairwise scatter plots of per-model error rates: MEMM vs. CRF, MEMM vs. HMM, CRF vs. HMM]

Experiment 3: Part-of-Speech Tagging
• Each word is to be labeled with one of 45 syntactic tags
• 50%-50% train-test split
• Out-of-vocabulary (oov) words: words not observed in the training set

Part-of-Speech Tagging
• Second set of experiments: add a small set of orthographic features (whether the word is capitalized, whether it ends in -ing, -ogy, -ed, -s, -ly, ...)
• Overall error rate reduced by 25% and oov error reduced by around 50%

Part-of-Speech Tagging
• Usually start training with the zero parameter vector (corresponding to the uniform distribution)
• Alternative: use the optimal MEMM parameter vector as the starting point for training the corresponding CRF
• MEMM+ trained to convergence in around 100 iterations; CRF+ took an additional 1,000 iterations
• When starting from the uniform distribution, CRF+ had not converged after 2,000 iterations

Further Aspects of CRFs
• Automatic feature selection
• Start from feature-generating rules and evaluate the benefit of the generated features automatically on the data

Conclusions
• CRFs do not suffer from the label bias problem!
• Parameter estimation is guaranteed to find the global optimum
• Limitation: slow convergence of the training algorithm

CRFs: Example Features
• With HMM-like transition and emission features, the corresponding parameters λ and μ are similar to the (logarithms of the) HMM parameters p(y'|y) and p(x|y)
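To make that correspondence concrete, here is a minimal sketch (assumed feature names, weight dictionary, and helper functions; not code from the paper) of HMM-like binary features for a chain CRF: a transition indicator that fires on the label bigram and an emission indicator that fires on the (label, word) pair, so their weights can be read as, or initialized from, log HMM transition and emission probabilities.

```python
from collections import defaultdict

def transition_features(y_prev, y, x, i):
    """Names of transition features f_k(y_{i-1}, y_i, x) firing at position i.

    x and i are available so richer, observation-dependent transition features
    could be added; the HMM-like version uses only the label bigram.
    """
    return [("trans", y_prev, y)]

def emission_features(y, x, i):
    """Names of emission features g_k(y_i, x) firing at position i."""
    return [("emit", y, x[i])]

def position_score(weights, y_prev, y, x, i):
    """sum_k lambda_k f_k + sum_k mu_k g_k at position i (unseen features score 0)."""
    feats = transition_features(y_prev, y, x, i) + emission_features(y, x, i)
    return sum(weights[f] for f in feats)

# Usage: with one weight per feature, the per-position score mirrors an HMM's
# log transition probability (y_prev -> y) plus log emission probability (word | y).
weights = defaultdict(float)
weights[("trans", "N", "V")] = -0.1    # roughly log p(V | N) in an HMM
weights[("emit", "V", "are")] = -0.5   # roughly log p("are" | V) in an HMM

x = ["The", "robot", "wheels", "are", "round"]
print(position_score(weights, "N", "V", x, 3))   # score for tagging "are" as V after N
```

Because a CRF can also include features an HMM cannot express, such as observation-dependent transitions like the P(B|S, 歌) feature in the CWS example, this construction is a starting point rather than an equivalence.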