Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando Pereira
Goal: Sequence segmentation and labeling
• Computational biology
• Computational linguistics
• Computer science
Overview
Diagram: relationships among the models HMM, MEMM, (M)RF, and CRF (arrows show how each leads to the next).
Task comparison: Generative (HMM) vs. Discriminative / Conditional (MEMM, CRF)
(The slide shows the graphical structure over states s', s and observation o for each model.)

Evaluation
• Generative (HMM): find P(o^T | M)

Decoding = Prediction (see the Viterbi sketch below)
• Generative (HMM): find s^T such that P(o^T | s^T, M) is maximized
• Conditional (MEMM, CRF): find s^T such that P(s^T | o^T, M) is maximized

Learning
• Generative (HMM): given o, find M such that P(o | M) is maximized (needs EM because s is unknown)
• Conditional (MEMM, CRF): given o and s, find M such that P(s | o, M) is maximized (a simpler maximum-likelihood problem)
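To make the decoding row concrete, here is a minimal Viterbi sketch for an HMM; the parameter arrays `pi`, `A`, and `B` are hypothetical inputs, not anything defined in the slides.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for an HMM: argmax_s P(s, o | M).

    obs: observation indices, length T
    pi:  initial state distribution, shape (S,)
    A:   transitions, A[s_prev, s] = P(s | s_prev), shape (S, S)
    B:   emissions, B[s, o] = P(o | s), shape (S, V)
    """
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))               # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)     # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[s_prev, s]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Trace back the best path
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```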
CWS Example (Chinese word segmentation)
Figure: the sentence 我是歌手 ("I am a singer") is labeled character by character as 我/S 是/S 歌/B 手/E, shown twice: once as an MEMM with factors such as P(S|S), P(我|S), P(B|S), P(B|E), P(是|S), P(歌|B), P(手|E), and once as a CRF, where a transition feature can also look at the observation, e.g. P(B|S, 歌). (A label-to-word conversion sketch follows below.)
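A minimal sketch (my own, not from the slides) of how such S/B/E labels, with M for word-internal characters in the common BMES convention, map back to segmented words:

```python
def labels_to_words(chars, labels):
    """Turn per-character segmentation labels (S/B/M/E) into words."""
    words, current = [], ""
    for ch, lab in zip(chars, labels):
        if lab == "S":               # single-character word
            words.append(ch)
        elif lab == "B":             # word begins
            current = ch
        elif lab == "M":             # word continues
            current += ch
        else:                        # "E": word ends
            words.append(current + ch)
            current = ""
    return words

print(labels_to_words("我是歌手", ["S", "S", "B", "E"]))  # ['我', '是', '歌手']
```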
Generative Models
• HMMs and stochastic grammars
• Assign a joint probability to paired observation and label sequences (factorization below)
• Parameters are trained to maximize the joint likelihood of training examples
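For a first-order HMM, for instance, this joint probability factorizes as (with $y_0$ a designated start label):

$$p(\mathbf{x}, \mathbf{y} \mid M) \;=\; \prod_{i=1}^{n} p(y_i \mid y_{i-1})\, p(x_i \mid y_i)$$

so maximizing the joint likelihood of the training pairs amounts to choosing the transition distributions $p(y \mid y')$ and emission distributions $p(x \mid y)$.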
Generative Models
• Representing rich, interacting features or long-range dependencies of the observations would require enumerating all possible observation sequences
• To keep inference tractable, must therefore make strong independence assumptions (i.e., conditional independence of the observations given the labels)
Example: MEMMs
• Maximum entropy Markov models
• Each source state has an exponential model that takes the observation features as input and outputs a distribution over possible next states (sketch below)
• Weakness: the label bias problem
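A minimal sketch of such a per-state exponential model; the weight vector `w`, the feature function `feats`, and the state set are hypothetical stand-ins, not the paper's notation:

```python
import numpy as np

def memm_transition_probs(w, feats, s_prev, obs, states):
    """P(s | s_prev, obs): one exponential (softmax) model per source state s_prev.

    w:     weight vector (hypothetical)
    feats: feats(s_prev, s, obs) -> feature vector of the same length as w
    """
    scores = np.array([w @ feats(s_prev, s, obs) for s in states])
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                # per-state (local) normalization
```

The per-state normalization in the last line is what the next slide's label bias problem is about: each source state must hand all of its probability mass to its successors, no matter what the observation says.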
Label Bias Problem
• Per-state normalization of transition scores implies "conservation of score mass"
• Bias towards states with fewer outgoing transitions
• A state with a single outgoing transition effectively ignores its observation

Example (after Wallach '02):
Figure: a tagging lattice (D: determiner, N: noun, V: verb, A: adjective, R: adverb) that branches after "The/D robot/N wheels": the upper path tags wheels/V Fred/N round/R, the lower path tags wheels/N are/V round/A.
Observation: "The robot wheels are round."
But if P(V | N, wheels) > P(N | N, wheels), then the upper path is chosen regardless of the rest of the observation. (A toy numeric sketch follows below.)
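A toy numeric sketch of the effect (the numbers are invented for illustration, not taken from the slides):

```python
# Per-state (MEMM-style) normalization: each state's outgoing probabilities sum
# to 1 regardless of how well the observation fits, so a state with a single
# successor passes on all of its mass unchanged.

# Upper path: after tagging "wheels" as V, suppose that state has only one
# outgoing transition, so its probability is forced to 1.0 for ANY observation.
upper_path_score = 0.55 * 1.0        # P(V | N, wheels) * forced next transition

# Lower path: the N state has several successors, so even an observation that
# fits the continuation very well gets only part of the mass.
lower_path_score = 0.45 * 0.9        # P(N | N, wheels) * P(V | N, are)

print(upper_path_score > lower_path_score)   # True: "are round" cannot change the outcome
```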
Solving Label Bias (cont'd)
• Start with a fully-connected model and let the training procedure figure out a good structure
• Drawback: precludes the use of prior structural knowledge
Overview
• Generative models
• Conditional models
• Label bias problem
• Conditional random fields
• Experiments
Conditional Random Fields
• Undirected graph (random field)
• Construct a conditional model p(Y | X)
• Does not explicitly model the marginal p(X)
• Assumption: the graph is fixed
CRFs: Distribution
For a chain-structured CRF with transition features $f_k$ and per-state features $g_k$:

$$p_\lambda(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{i}\sum_{k} \lambda_k f_k(y_{i-1}, y_i, x) \;+\; \sum_{i}\sum_{k} \mu_k g_k(y_i, x) \Big)$$

Normalization: $Z(x)$ is a single per-sequence constant, unlike the per-state normalization of MEMMs. (A forward-recursion sketch for $Z(x)$ follows below.)
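A minimal sketch of this globally normalized distribution for a linear chain, with log Z(x) computed by a forward recursion in log space; the potential arrays `unary` and `trans` are hypothetical inputs holding the weighted feature sums:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(unary, trans, y):
    """log p(y | x) for a linear-chain CRF with global normalization Z(x).

    unary: shape (n, S); unary[i, s] = sum_k mu_k * g_k(s, x) at position i
    trans: shape (S, S); trans[a, b] = sum_k lambda_k * f_k(a, b, x)
           (kept position-independent here only to shorten the sketch)
    y:     label index sequence of length n
    """
    n, S = unary.shape
    # Unnormalized score of the given labeling
    score = unary[np.arange(n), y].sum()
    score += sum(trans[y[i - 1], y[i]] for i in range(1, n))
    # log Z(x) by the forward recursion
    log_alpha = unary[0].copy()
    for i in range(1, n):
        log_alpha = logsumexp(log_alpha[:, None] + trans, axis=0) + unary[i]
    return score - logsumexp(log_alpha)
```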
CRFs: Parameter Estimation
• Maximize the log-likelihood objective function
• Over the N sentences $(x^{(i)}, y^{(i)})$ in the training corpus
CRF Parameter Estimation
• Iterative scaling: maximizes the likelihood
$$O(\lambda) = \sum_{i=1}^{N} \log p_\lambda(y^{(i)} \mid x^{(i)}) \;\propto\; \sum_{x,y} \tilde{p}(x, y) \log p_\lambda(y \mid x)$$
  by iteratively updating $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$
• Define an auxiliary function $A(\lambda', \lambda)$ such that $A(\lambda', \lambda) \le O(\lambda') - O(\lambda)$
• Initialize each $\lambda_k$
• Do until convergence (a loop skeleton is sketched below):
  - Solve $\partial A(\lambda', \lambda) / \partial\, \delta\lambda_k = 0$ for each $\delta\lambda_k$
  - Update the parameters: $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$
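A skeletal sketch of that loop, under the assumption that the expectation and step-size computations are supplied separately; the names `model_expectations` and `solve_delta` are placeholders for the Algorithm S or Algorithm T steps described below:

```python
import numpy as np

def iterative_scaling(lmbda, mu, emp_f, emp_g,
                      model_expectations, solve_delta,
                      max_iter=1000, tol=1e-6):
    """Generic iterative-scaling training loop for a chain CRF (sketch only).

    emp_f, emp_g:        empirical expectations E~[f_k], E~[g_k] (arrays)
    model_expectations:  (lmbda, mu) -> model expectations E[f_k], E[g_k]
    solve_delta:         expectations -> step sizes (delta_lmbda, delta_mu)
    """
    for _ in range(max_iter):
        Ef, Eg = model_expectations(lmbda, mu)
        d_lmbda, d_mu = solve_delta(emp_f, emp_g, Ef, Eg)
        lmbda = lmbda + d_lmbda
        mu = mu + d_mu
        if max(np.abs(d_lmbda).max(), np.abs(d_mu).max()) < tol:
            break
    return lmbda, mu
```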
CRF Parameter Estimation
• For a chain CRF, setting $\partial A(\lambda', \lambda) / \partial\, \delta\lambda_k = 0$ gives
$$\tilde{E}[f_k] \;\equiv\; \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) \;=\; \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\!\big(\delta\lambda_k\, T(x, y)\big)$$
• $T(x, y) = \sum_{i,k} f_k(y_{i-1}, y_i, x) + \sum_{i,k} g_k(y_i, x)$ is the total feature count
• Unfortunately, $T(x, y)$ is a global property of $(x, y)$
• Dynamic programming would have to sum over sequences with potentially varying $T$, making the exponential sum inefficient to compute
Algorithm S (Generalized Iterative Scaling)
• Introduce a global slack feature so that $T(x, y)$ becomes a constant $S$ for all $(x, y)$:
$$s(x, y) \;\equiv\; S - \sum_{i,k} f_k(y_{i-1}, y_i, x) - \sum_{i,k} g_k(y_i, x)$$
• Define forward and backward variables:
$$\alpha_i(y \mid x) = \sum_{y'} \alpha_{i-1}(y' \mid x)\, \exp\Big( \sum_k \lambda_k f_k(y', y, x) + \sum_k \mu_k g_k(y, x) \Big)$$
$$\beta_i(y \mid x) = \sum_{y'} \beta_{i+1}(y' \mid x)\, \exp\Big( \sum_k \lambda_k f_k(y, y', x) + \sum_k \mu_k g_k(y', x) \Big)$$
Algorithm S
• The update equations become
$$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E[f_k]}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E}[g_k]}{E[g_k]}$$
  where the model expectations $E[f_k]$ and $E[g_k]$ are computed with the forward and backward variables (a sketch of these step sizes follows below)
• Note: $\alpha_i \beta_i$ acts like a posterior, as in an HMM
• The rate of convergence is governed by $S$ (step sizes scale as $1/S$)
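A minimal sketch of those closed-form step sizes, assuming the empirical and model expectation arrays have already been computed (e.g. with the forward and backward variables above); this plays the role of the `solve_delta` step in the loop skeleton earlier, with the slack constant S fixed in advance:

```python
import numpy as np

def algorithm_s_delta(emp_f, emp_g, model_f, model_g, S):
    """Algorithm S step sizes: delta = (1/S) * log(empirical / model expectation)."""
    d_lmbda = np.log(emp_f / model_f) / S
    d_mu = np.log(emp_g / model_g) / S
    return d_lmbda, d_mu
```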
Algorithm T (Improved Iterative Scaling)
• The equation we want to solve,
$$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\!\big(\delta\lambda_k\, T(x, y)\big),$$
  is polynomial in $\exp(\delta\lambda_k)$ (since $T(x, y)$ takes integer values), so it can be solved with Newton's method
• Define $T(x) \equiv \max_y T(x, y)$ and accumulate feature expectations by the value of $T(x)$:
$$a_{k,t} = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \delta\big(t, T(x)\big)$$
  with $b_{k,t}$ defined analogously for $g_k$ (roughly, $a_{k,t}$ and $b_{k,t}$ are the expectations of $f_k$ and $g_k$ restricted to $T(x) = t$)
• UPDATE: with $\beta_k = \exp(\delta\lambda_k)$, solve the polynomial equation $\sum_t a_{k,t}\, \beta_k^{\,t} = \tilde{E}[f_k]$ for its positive root by Newton's method and set $\delta\lambda_k = \log \beta_k$ (similarly for $\mu_k$); a Newton sketch follows below
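A minimal sketch of that Newton step for one feature k, with the coefficient array `a_kt` assumed to have been precomputed:

```python
import numpy as np

def solve_delta_lambda(a_kt, emp_fk, beta0=1.0, n_iter=50):
    """Solve sum_t a_kt[t] * beta**t = emp_fk for beta > 0 by Newton's method,
    then return delta_lambda_k = log(beta)."""
    t = np.arange(len(a_kt), dtype=float)
    beta = beta0
    for _ in range(n_iter):
        residual = (a_kt * beta ** t).sum() - emp_fk
        derivative = (a_kt[1:] * t[1:] * beta ** (t[1:] - 1)).sum()
        beta -= residual / derivative
    return np.log(beta)
```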
Overview
• Generative models
• Conditional models
• Label bias problem
• Conditional random fields
• Experiments
Experiment 1: Modeling Label Bias
• Generate data from a simple HMM that encodes a noisy version of the network
• Each state emits its designated symbol with probability 29/32
• 2,000 training and 500 test samples
• MEMM error: 42%; CRF error: 4.6%
Experiment 2: More synthetic data
• Five labels: a–e
• 26 observation values: A–Z
• Generate data from a mixed-order HMM
• Randomly generate models; for each model, generate a sample of 1,000 sequences of length 25
• Scatter plots compare per-model error rates: MEMM vs. CRF, MEMM vs. HMM, CRF vs. HMM
Experiment 3: Part-of-speech Tagging
• Each word is to be labeled with one of 45 syntactic tags
• 50%-50% train-test split
• Out-of-vocabulary (oov) words: words not observed in the training set
Part-of-speech Tagging
• Second set of experiments: add a small set of orthographic features (whether the word is capitalized, whether it ends in -ing, -ogy, -ed, -s, -ly, …); a sketch of such features follows below
• Overall error rate reduced by 25% and oov error reduced by around 50%
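A minimal sketch of such orthographic feature functions (the paper's full feature set is longer; the names here are illustrative):

```python
def orthographic_features(word):
    """Binary orthographic features of the kind added in the second POS experiment."""
    return {
        "is_capitalized": word[:1].isupper(),
        "ends_ing": word.endswith("ing"),
        "ends_ogy": word.endswith("ogy"),
        "ends_ed": word.endswith("ed"),
        "ends_s": word.endswith("s"),
        "ends_ly": word.endswith("ly"),
    }

print(orthographic_features("Running"))
```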
Part-of-speech Tagging
• Usually start training with the zero parameter vector (corresponds to the uniform distribution)
• Use the optimal MEMM parameter vector as the starting point for training the corresponding CRF
• MEMM+ trained to convergence in around 100 iterations; CRF+ took an additional 1,000 iterations
• When starting from the uniform distribution, CRF+ had not converged after 2,000 iterations
Further Aspects of CRFs
• Automatic feature selection: start from feature-generating rules and evaluate the benefit of the generated features automatically on data
Conclusions
• CRFs do not suffer from the label bias problem!
• Parameter estimation is guaranteed to find the global optimum
• Limitation: slow convergence of the training algorithm
CRFs: Example Features
• Boolean indicator features over label pairs (y', y) and label-observation pairs (y, x); a sketch follows below
• The corresponding parameters λ and μ are then similar to the (logarithms of the) HMM parameters p(y'|y) and p(x|y)
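A minimal sketch of such indicator features; the factory-style helpers are my own construction for illustration:

```python
def transition_feature(y_prev_val, y_val):
    """f_{y',y}: 1 if y_{i-1} = y' and y_i = y, else 0; its weight lambda
    plays the role of log p(y | y') in an HMM."""
    def f(y_prev, y_cur, x, i):
        return 1.0 if (y_prev == y_prev_val and y_cur == y_val) else 0.0
    return f

def emission_feature(y_val, x_val):
    """g_{y,x}: 1 if y_i = y and x_i = x, else 0; its weight mu
    plays the role of log p(x | y) in an HMM."""
    def g(y_cur, x, i):
        return 1.0 if (y_cur == y_val and x[i] == x_val) else 0.0
    return g
```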