Graphical Models for Part-of-Speech Tagging
Indian Institute of Technology Bombay and IBM India Research Lab
December 2005

Different Models for POS Tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields

POS Tagging: A Sequence Labeling Problem
Input and output
– Input sequence $x = x_1 x_2 \ldots x_n$
– Output sequence $y = y_1 y_2 \ldots y_n$: the labels of the input sequence, a semantic representation of the input
Other applications
– Automatic speech recognition
– Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.

Hidden Markov Models
Doubly stochastic models
[Figure: a four-state HMM (S1–S4) with transition probabilities between states and emission probabilities for the symbols A and C at each state]
Efficient dynamic programming algorithms exist for
– Computing Pr(S)
– Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi)
– Training the model (Baum-Welch algorithm)

Hidden Markov Model (HMM): Generative Modeling
The source model P(Y) generates the tag sequence y; the noisy channel P(X|Y) generates the observation x from y.
E.g., a 1st-order Markov chain:
  $P(\mathbf{y}) = \prod_i P(y_i \mid y_{i-1})$,  $P(\mathbf{x} \mid \mathbf{y}) = \prod_i P(x_i \mid y_i)$
Parameter estimation: maximize the joint likelihood of the training examples,
  $\sum_{(\mathbf{x},\mathbf{y}) \in T} \log_2 P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})$

Dependency (1st order)
[Figure: the first-order dependency graph of the HMM — each tag $Y_k$ depends on the previous tag through $P(Y_k \mid Y_{k-1})$, and each observation $X_k$ depends on its own tag through $P(X_k \mid Y_k)$]
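To make the Viterbi decoding on the slides above concrete, here is a minimal Python sketch for a toy first-order HMM tagger. The tag set, start/transition probabilities, and emission tables are invented for illustration (they are not from the slides); a real tagger would estimate them from counts or with Baum-Welch.

```python
import numpy as np

# Toy first-order HMM: start[i] = P(tag_i | start), trans[i, j] = P(tag_j | tag_i),
# emit[i][word] = P(word | tag_i).  All values here are illustrative only.
tags = ["DT", "NN", "VB"]
start = np.array([0.6, 0.3, 0.1])
trans = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.4, 0.5, 0.1]])
emit = [{"the": 0.9, "dog": 0.05, "barks": 0.05},
        {"the": 0.05, "dog": 0.8, "barks": 0.15},
        {"the": 0.05, "dog": 0.15, "barks": 0.8}]

def viterbi(words):
    """Return the tag sequence maximizing Pr(words, tags) under the toy HMM."""
    n, m = len(words), len(tags)
    delta = np.zeros((n, m))              # best log-probability ending in tag j at position k
    back = np.zeros((n, m), dtype=int)    # back-pointers for path recovery
    delta[0] = np.log(start) + np.log([emit[j].get(words[0], 1e-6) for j in range(m)])
    for k in range(1, n):
        for j in range(m):
            scores = delta[k - 1] + np.log(trans[:, j])
            back[k, j] = scores.argmax()
            delta[k, j] = scores.max() + np.log(emit[j].get(words[k], 1e-6))
    path = [int(delta[-1].argmax())]       # best final tag, then follow back-pointers
    for k in range(n - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return [tags[j] for j in reversed(path)]

print(viterbi(["the", "dog", "barks"]))    # e.g. ['DT', 'NN', 'VB']
```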
Different Models for POS Tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields

Disadvantage of HMMs (1)
No rich feature information
– Rich information is required when $x_k$ is complex or when data for $x_k$ is sparse
Example: POS tagging
– How do we evaluate $P(w_k \mid t_k)$ for an unknown word $w_k$?
– Useful features: suffixes (e.g., -ed, -tion, -ing), capitalization

Disadvantage of HMMs (2)
Generative model
– Parameter estimation maximizes the joint likelihood of the training examples,
  $\sum_{(\mathbf{x},\mathbf{y}) \in T} \log_2 P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})$
Better approach
– A discriminative model that models P(y | x) directly
– Maximize the conditional likelihood of the training examples,
  $\sum_{(\mathbf{x},\mathbf{y}) \in T} \log_2 P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$

Maximum Entropy Markov Model
Discriminative sub-models
– Unify the two parameters of the generative model into one conditional model
  The generative model has two parameters: the source-model parameter $P(y_k \mid y_{k-1})$ and the noisy-channel parameter $P(x_k \mid y_k)$
  The unified conditional model is $P(y_k \mid x_k, y_{k-1})$
– Employ the maximum entropy principle
Maximum Entropy Markov Model:
  $P(\mathbf{y} \mid \mathbf{x}) = \prod_i P(y_i \mid y_{i-1}, x_i)$

General Maximum Entropy Model
Model
– Model the distribution P(Y | X) with a set of features $\{f_1, f_2, \ldots, f_l\}$ defined on X and Y
Idea
– Collect information about the features from the training data
– Assume nothing about the distribution P(Y | X) other than the collected information; maximize the entropy as the criterion

Features
– 0-1 indicator functions: 1 if (x, y) satisfies a predefined condition, 0 if not
Example: POS tagging
  $f_1(x, y) = 1$ if x ends with -tion and y is NN, 0 otherwise
  $f_2(x, y) = 1$ if x starts with a capital letter and y is NNP, 0 otherwise

Constraints
Empirical information
– Statistics from the training data T:
  $\hat{P}(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y)$
Expected value
– From the distribution P(Y | X) we want to model:
  $P(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D(Y)} P(Y = y' \mid X = x)\, f_i(x, y')$
Constraints: $\hat{P}(f_i) = P(f_i)$

Maximum Entropy: Objective
Entropy
  $I = -\frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y'} P(Y = y' \mid X = x) \log_2 P(Y = y' \mid X = x) = -\sum_x \hat{P}(x) \sum_y P(Y = y \mid X = x) \log_2 P(Y = y \mid X = x)$
Maximization problem
  $\max_{P(Y \mid X)} I$  subject to  $\hat{P}(f_i) = P(f_i)$

Dual Problem
– Conditional model:
  $P(Y = y \mid X = x) \propto \exp\!\left( \sum_{i=1}^{l} \lambda_i f_i(x, y) \right)$
– Maximum likelihood of the conditional data:
  $\max_{\lambda_1, \ldots, \lambda_l} \sum_{(x,y) \in T} \log_2 P(Y = y \mid X = x)$
Solution
– Improved iterative scaling (IIS) (Berger et al. 1996)
– Generalized iterative scaling (GIS) (McCallum et al. 2000)

Maximum Entropy Markov Model
Use the maximum entropy approach to model
– 1st order: $P(Y_k = y_k \mid X_k = x_k, Y_{k-1} = y_{k-1})$
Features
– Basic features (like the parameters of the HMM): a bigram (1st order) or trigram (2nd order) feature in the source model, and the state-output pair feature $(X_k = x_k, Y_k = y_k)$
– Advantage: other, richer features on $(x_k, y_k)$ can be incorporated

HMM vs MEMM (1st order)
[Figure: in the HMM, $Y_{k-1} \rightarrow Y_k$ via $P(Y_k \mid Y_{k-1})$ and $Y_k \rightarrow X_k$ via $P(X_k \mid Y_k)$; in the MEMM, $Y_{k-1}$ and $X_k$ jointly determine $Y_k$ via $P(Y_k \mid X_k, Y_{k-1})$]
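As a concrete illustration of the per-state maximum entropy model, the Python sketch below computes the local conditional $P(y_k \mid x_k, y_{k-1})$ from 0-1 indicator features of the kind shown earlier (a -tion suffix feature, a capitalization feature, and an HMM-like tag-bigram feature). The feature set and the weights (the $\lambda$'s) are hypothetical, not values learned with IIS/GIS.

```python
import math

TAGS = ["NN", "NNP", "VB"]

def features(word, prev_tag, tag):
    """0-1 indicator features on (x_k, y_{k-1}, y_k); illustrative only."""
    return {
        ("suffix=tion", tag): 1.0 if word.endswith("tion") and tag == "NN" else 0.0,
        ("capitalized", tag): 1.0 if word[0].isupper() and tag == "NNP" else 0.0,
        ("prev", prev_tag, tag): 1.0,           # HMM-like tag-bigram feature
    }

weights = {                                     # hypothetical lambdas
    ("suffix=tion", "NN"): 2.0,
    ("capitalized", "NNP"): 2.5,
    ("prev", "DT", "NN"): 1.0,
}

def local_conditional(word, prev_tag):
    """P(y_k | x_k = word, y_{k-1} = prev_tag): a softmax normalized per state."""
    scores = {t: sum(weights.get(name, 0.0) * v
                     for name, v in features(word, prev_tag, t).items())
              for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())   # per-state normalizer
    return {t: math.exp(s) / z for t, s in scores.items()}

print(local_conditional("taxation", "DT"))      # probability mass shifts toward NN
```

A chain of such locally normalized distributions multiplied together is exactly the MEMM $P(\mathbf{y} \mid \mathbf{x}) = \prod_i P(y_i \mid y_{i-1}, x_i)$ above, and this per-state normalization is what gives rise to the label bias problem discussed below.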
Performance in POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (e.g., -ed, -tion, -s, -ing)
Results (Lafferty et al. 2001)
– 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

Different Models for POS Tagging
– HMM
– Maximum Entropy Markov Models
– Conditional Random Fields

Disadvantage of MEMMs (1)
Complex algorithms for the maximum entropy solution
– Both IIS and GIS are difficult to implement
– They require many tricks in implementation
Slow training
– Time-consuming when the data set is large, especially for the MEMM

Disadvantage of MEMMs (2)
Maximum Entropy Markov Model
– Uses the maximum entropy model as a sub-model
– Entropy is optimized in the sub-models, not in the global model
Label bias problem
– Conditional models with per-state normalization
– The effect of the observations is weakened for states with fewer outgoing transitions

Label Bias Problem
Training data (X:Y): rib:123, rib:123, rib:123, rob:456, rob:456
[Figure: a finite-state model with an upper path r→1, i→2, b→3 and a lower path r→4, o→5, b→6]
Parameters estimated from the training data:
  $P(1 \mid r) = 0.6$, $P(4 \mid r) = 0.4$,
  $P(2 \mid i, 1) = P(2 \mid o, 1) = 1$, $P(5 \mid i, 4) = P(5 \mid o, 4) = 1$,
  $P(3 \mid b, 2) = P(6 \mid b, 5) = 1$
New input: rob
  $P(123 \mid rob) = P(1 \mid r)\, P(2 \mid o, 1)\, P(3 \mid b, 2) = 0.6 \times 1 \times 1 = 0.6$
  $P(456 \mid rob) = P(4 \mid r)\, P(5 \mid o, 4)\, P(6 \mid b, 5) = 0.4 \times 1 \times 1 = 0.4$
Although the input is rob, the path 1-2-3 learned from rib wins: states 1 and 4 each have a single outgoing transition, so per-state normalization forces them to ignore the observation (i vs. o), and the decision is made entirely by the first transition.

Solution
Global optimization
– Optimize the parameters of a global model simultaneously, not in separate sub-models
Alternatives
– Conditional random fields
– Application of the perceptron algorithm

Conditional Random Field (CRF) (1)
Let
– $G = (V, E)$ be a graph such that Y is indexed by the vertices of G: $\mathbf{Y} = (Y_v)_{v \in V}$
Then (X, Y) is a conditional random field if
  $P(Y_v \mid \mathbf{X}, Y_w, w \neq v) = P(Y_v \mid \mathbf{X}, Y_w, (w, v) \in E)$
i.e., the model is conditioned globally on X.

Conditional Random Field (CRF) (2)
Exponential model
– $G = (V, E)$ is a tree (or, more specifically, a chain) whose cliques are its edges and vertices:
  $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \propto \exp\!\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\,k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$
  The edge features $f_k$ are determined by the state transitions; the vertex features $g_k$ are determined by the states.
Parameter estimation
– Maximize the conditional likelihood of the training examples,
  $\sum_{(\mathbf{x},\mathbf{y}) \in T} \log_2 P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$
– IIS or GIS

MEMM vs CRF
Similarities
– Both employ the maximum entropy principle
– Both incorporate rich feature information
Differences
– Conditional random fields are always globally conditioned on X, resulting in a globally optimized model

Performance in POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (e.g., -ed, -tion, -s, -ing)
Results (Lafferty et al. 2001)
– 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

Comparison of the Three Approaches to POS Tagging
Results (Lafferty et al. 2001)
– 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy
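To contrast the MEMM's per-state normalization with the CRF's single global normalizer, here is a small Python sketch of a linear-chain CRF score. The transition (edge) and observation (vertex) weights are illustrative, and the partition function Z(x) is computed by brute-force enumeration over all tag sequences; a real implementation would use the forward algorithm.

```python
import itertools, math

# Linear-chain CRF sketch:
#   score(y, x) = sum_k [ w_trans[y_{k-1}, y_k] + w_obs[x_k, y_k] ]
#   P(y | x)    = exp(score(y, x)) / Z(x),  Z(x) summed over ALL tag sequences.
# One global normalizer per sentence, unlike the MEMM's per-state softmax.
TAGS = ["DT", "NN", "VB"]
w_trans = {("<s>", "DT"): 1.0, ("DT", "NN"): 1.5, ("NN", "VB"): 1.2}    # edge weights (illustrative)
w_obs = {("the", "DT"): 2.0, ("dog", "NN"): 2.0, ("barks", "VB"): 2.0}  # vertex weights (illustrative)

def score(words, tags):
    """Unnormalized log-score of a whole tag sequence."""
    s, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        s += w_trans.get((prev, t), 0.0) + w_obs.get((w, t), 0.0)
        prev = t
    return s

def prob(words, tags):
    """P(tags | words) with a single global partition function Z(x)."""
    z = sum(math.exp(score(words, y))                                   # brute-force Z(x) here;
            for y in itertools.product(TAGS, repeat=len(words)))        # forward algorithm in practice
    return math.exp(score(words, tags)) / z

words = ["the", "dog", "barks"]
best = max(itertools.product(TAGS, repeat=len(words)), key=lambda y: score(words, y))
print(best, round(prob(words, best), 3))    # ('DT', 'NN', 'VB') takes the largest share of Z(x)
```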
References
A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.