Uploaded by Jai Mahajan

# 07-02-06

```
Indian Institute of Technology, Bombay
and
Research Division, India Research Lab
Graphical models for part-of-speech tagging
December 2005
IIT Bombay and IBM India Research Lab

Different Models for POS tagging
• HMM
• Maximum Entropy Markov Models
• Conditional Random Fields

POS tagging: A Sequence Labeling Problem
• Input and Output
– Input sequence x = x_1 x_2 ... x_n
– Output sequence y = y_1 y_2 ... y_m
  Labels of the input sequence
  Semantic representation of the input
• Other Applications
– Automatic speech recognition
– Text processing, e.g., tagging, named entity recognition, summarization by exploiting layout structure of text, etc.

Hidden Markov Models
• Doubly stochastic models
[Figure: example HMM with four states S1 to S4; each state emits the symbols A and C with its own probabilities, and the edges carry transition probabilities]
• Efficient dynamic programming algorithms exist for
– Finding Pr(S)
– The highest probability path P that maximizes Pr(S, P) (Viterbi)
• Training the model
– (Baum-Welch algorithm)

Hidden Markov Model (HMM): Generative Modeling
Source Model P(Y) -> y -> Noisy Channel P(X|Y) -> x
e.g., 1st order Markov chain
  P(y) = \prod_i P(y_i | y_{i-1})
  P(x | y) = \prod_i P(x_i | y_i)
Parameter estimation: maximize the joint likelihood of training examples
  \sum_{(x,y) \in T} \log_2 P(x, y)

Dependency (1st order)
[Figure: chain Y_{k-2} -> Y_{k-1} -> Y_k -> Y_{k+1}; each transition is labeled P(Y_k | Y_{k-1}), and each state Y_k emits X_k with probability P(X_k | Y_k)]

Different Models for POS tagging
• HMM
• Maximum Entropy Markov Models
• Conditional Random Fields

Disadvantage of HMMs (1)
• No Rich Feature Information
– Rich information is required
  When x_k is complex
  When data of x_k is sparse
• Example: POS Tagging
– How to evaluate P(w_k | t_k) for unknown words w_k?
– Useful features
  Suffix, e.g., -ed, -tion, -ing, etc.
  Capitalization

Disadvantage of HMMs (2)
• Generative Model
– Parameter estimation: maximize the joint likelihood of training examples
  \sum_{(x,y) \in T} \log_2 P(X = x, Y = y)
• Better Approach
– Discriminative model which models P(y|x) directly
– Maximize the conditional likelihood of training examples
  \sum_{(x,y) \in T} \log_2 P(Y = y | X = x)

Maximum Entropy Markov Model
• Discriminative Sub Models
– Unify two parameters in generative model into one conditional model
  Two parameters in generative model: parameter in source model P(y_k | y_{k-1}) and parameter in noisy channel P(x_k | y_k)
  Unified conditional model P(y_k | x_k, y_{k-1})
– Employ maximum entropy principle
Maximum Entropy Markov Model
  P(y | x) = \prod_i P(y_i | y_{i-1}, x_i)

General Maximum Entropy Model
• Model
– Model distribution P(Y|X) with a set of features {f_1, f_2, ..., f_l} defined on X and Y
• Idea
– Collect information of features from training data
– Assume nothing on distribution P(Y|X) other than the collected information
  Maximize the entropy as a criterion

Features
• Features
– 0-1 indicator functions
  1 if (x, y) satisfies a predefined condition
  0 if not
• Example: POS Tagging
  f_1(x, y) = 1 if x ends with -tion and y is NN; 0 otherwise
  f_2(x, y) = 1 if x starts with a capital letter and y is NNP; 0 otherwise

Constraints
• Empirical Information
– Statistics from training data T
  \hat{P}(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y)
• Expected Value
– From the distribution P(Y|X) we want to model
  P(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D(Y)} P(Y = y' | X = x) f_i(x, y')
• Constraints
  \hat{P}(f_i) = P(f_i)

Maximum Entropy: Objective
• Entropy
  I = -\frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D(Y)} P(Y = y' | X = x) \log_2 P(Y = y' | X = x)
    = -\sum_x \hat{P}(x) \sum_y P(Y = y | X = x) \log_2 P(Y = y | X = x)
• Maximization Problem
  \max_{P(Y|X)} I
  s.t. \hat{P}(f_i) = P(f_i), i = 1, ..., l

Dual Problem
• Dual Problem
– Conditional model
  P(Y = y | X = x) \propto \exp( \sum_{i=1}^{l} \lambda_i f_i(x, y) )
– Maximum likelihood of conditional data
  \max_{\lambda_1, ..., \lambda_l} \sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
• Solution
– Improved iterative scaling (IIS) (Berger et al. 1996)
– Generalized iterative scaling (GIS) (McCallum et al. 2000)

Maximum Entropy Markov Model
• Use Maximum Entropy Approach to Model
– 1st order
  P(Y_k = y_k | X_k = x_k, Y_{k-1} = y_{k-1})
• Features
– Basic features (like parameters in HMM)
  Bigram (1st order) or trigram (2nd order) in source model
  State-output pair feature (X_k = x_k, Y_k = y_k)
– Advantage: incorporate other advanced features on (x_k, y_k)

HMM vs MEMM (1st order)
[Figure, left panel "HMM": Y_{k-1} -> Y_k with P(Y_k | Y_{k-1}), and Y_k -> X_k with P(X_k | Y_k)]
[Figure, right panel "Maximum Entropy Markov Model (MEMM)": Y_{k-1} and X_k both point to Y_k, with P(Y_k | X_k, Y_{k-1})]

Performance in POS Tagging
• POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
• Results (Lafferty et al. 2001)
– 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy

Different Models for POS tagging
• HMM
• Maximum Entropy Markov Models
• Conditional Random Fields

Disadvantage of MEMMs (1)
• Complex Algorithm of Maximum Entropy Solution
– Both IIS and GIS are difficult to implement
– Require many tricks in implementation
• Slow in Training
– Time consuming when data set is large
  Especially for MEMM

Disadvantage of MEMMs (2)
• Maximum Entropy Markov Model
– Maximum entropy model as a sub model
– Optimization of entropy on sub models, not on global model
• Label Bias Problem
– Conditional models with per-state normalization
– Effects of observations are weakened for states with fewer outgoing transitions

Label Bias Problem
Training Data (X:Y)
  rib:123
  rib:123
  rib:123
  rob:456
  rob:456
[Figure: model with two label paths; r, i, b lead through states 1, 2, 3 and r, o, b lead through states 4, 5, 6]
Parameters
  P(1 | r) = 0.6, P(4 | r) = 0.4,
  P(2 | i, 1) = P(2 | o, 1) = 1,
  P(5 | i, 4) = P(5 | o, 4) = 1,
  P(3 | b, 2) = P(6 | b, 5) = 1
New input: rob
  P(123 | rob) = P(1 | r) P(2 | o, 1) P(3 | b, 2) = 0.6 × 1 × 1 = 0.6
  P(456 | rob) = P(4 | r) P(5 | o, 4) P(6 | b, 5) = 0.4 × 1 × 1 = 0.4
The model therefore prefers 123 for rob, although rob was always labeled 456 in the training data.

Solution
• Global Optimization
– Optimize parameters in a global model simultaneously, not in sub models separately
• Alternatives
– Conditional random fields
– Application of perceptron algorithm

Conditional Random Field (CRF) (1)
• Let
– G = (V, E) be a graph such that Y is indexed by the vertices: Y = (Y_v)_{v \in V}
• Then
– (X, Y) is a conditional random field if
  P(Y_v | X, Y_w, w \neq v) = P(Y_v | X, Y_w, (w, v) \in E)
– Conditioned globally on X

Conditional Random Field (CRF) (2)
• Exponential Model
– G = (V, E): a tree (or more specifically, a chain) with cliques as edges and vertices
  P(Y = y | X = x) \propto \exp( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) )
  The first sum is determined by state transitions; the second sum is determined by states.
• Parameter Estimation
– Maximize the conditional likelihood of training examples
  \sum_{(x,y) \in T} \log_2 P(Y = y | X = x)
– IIS or GIS

MEMM vs CRF
• Similarities
– Both employ maximum entropy principle
– Both incorporate rich feature information
• Differences
– Conditional random fields are always globally conditioned on X, resulting in a globally optimized model

Performance in POS Tagging
• POS Tagging
– Data set: WSJ
– Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
• Results (Lafferty et al. 2001)
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

Comparison of the three approaches to POS Tagging
• Results (Lafferty et al. 2001)
– 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
– 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
– Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

References
• A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
• J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.
```
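
The label bias example in the slides can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not the presenters' code. It estimates per-state-normalized probabilities P(next state | observation, current state) by counting over the five training pairs (rib:123 three times, rob:456 twice), then scores both label paths for the new input rob. The start-state id `START` and the helper names `train_memm`, `transition_prob`, and `path_prob` are hypothetical. For an (observation, state) pair never seen in training, the sketch assumes a back-off to the state's successor counts; since states 1 and 4 each have a single successor, per-state normalization gives that successor probability 1 whatever the observation is.

```python
from collections import Counter, defaultdict

START = 0  # hypothetical start-state id; the slides do not name it


def train_memm(data):
    """Count-based estimates for P(next | obs, cur), normalized per state."""
    by_obs = defaultdict(Counter)    # (cur, obs) -> counts of next states
    by_state = defaultdict(Counter)  # cur -> counts of next states (back-off)
    for word, labels in data:
        cur = START
        for obs, nxt in zip(word, labels):
            by_obs[(cur, obs)][nxt] += 1
            by_state[cur][nxt] += 1
            cur = nxt
    return by_obs, by_state


def transition_prob(model, cur, obs, nxt):
    by_obs, by_state = model
    # Assumption: an unseen (state, observation) pair backs off to the
    # successor counts of the state, which still normalize to 1 per state.
    counts = by_obs.get((cur, obs)) or by_state[cur]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0


def path_prob(model, word, labels):
    """Product of local transition probabilities along one label path."""
    prob, cur = 1.0, START
    for obs, nxt in zip(word, labels):
        prob *= transition_prob(model, cur, obs, nxt)
        cur = nxt
    return prob


training = [("rib", (1, 2, 3))] * 3 + [("rob", (4, 5, 6))] * 2
model = train_memm(training)
p123 = path_prob(model, "rob", (1, 2, 3))
p456 = path_prob(model, "rob", (4, 5, 6))
print(f"P(123 | rob) = {p123:.1f}")
print(f"P(456 | rob) = {p456:.1f}")
```

Under these assumptions the first transition alone decides the outcome: once the model enters state 1, the observation o cannot move probability mass elsewhere, so the path 123 outscores 456 for rob even though rob was only ever labeled 456 in training.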