Supervised Sequence Labeling

Conditional Random Fields
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign
appropriate labels to each word.
• For example, POS tagging:
DT NN VBD IN DT NN .
The cat sat on the mat .
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign
appropriate labels to each word.
• Another example, partial parsing (aka
chunking):
B-NP I-NP B-VP B-PP B-NP I-NP
The cat sat on the mat
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign
appropriate labels to each word.
• Another example, relation extraction:
B-Arg I-Arg B-Rel I-Rel B-Arg I-Arg
The cat sat on the mat
The CRF Equation
• A CRF model consists of
– F = <f1, …, fk>, a vector of “feature functions”
– θ = < θ1, …, θk>, a vector of weights for each feature
function.
• Let O = < o1, …, oT> be an observed sentence
• Let X = <x1, …, xT> be the latent variables.
$$P(X = x \mid O) \;=\; \frac{\exp\big(\theta \cdot F(x, O)\big)}{\sum_{x'} \exp\big(\theta \cdot F(x', O)\big)}$$
• This is the same as the Maximum Entropy equation!
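To make the equation concrete, here is a minimal Python sketch (not from the slides) that computes P(X = x | O) by brute-force enumeration over all label sequences for a toy problem; the feature functions, weights, and sentence are made up purely for illustration.

```python
import itertools
import math

# Toy label set and sentence (hypothetical example, not from the slides).
LABELS = ["D", "N", "V"]
sentence = ["the", "cat", "sat"]

# A made-up feature vector F(x, O): counts of a few simple configurations.
def features(x, o):
    return [
        sum(1 for xi, oi in zip(x, o) if xi == "D" and oi == "the"),   # f1
        sum(1 for xi in x if xi == "N"),                               # f2
        sum(1 for a, b in zip(x, x[1:]) if a == "N" and b == "V"),     # f3
    ]

theta = [2.0, 1.0, 1.5]  # made-up weights

def score(x, o):
    return sum(t * f for t, f in zip(theta, features(x, o)))

# Denominator: sum over all K^T label sequences (only feasible at toy sizes).
def Z(o):
    return sum(math.exp(score(xp, o))
               for xp in itertools.product(LABELS, repeat=len(o)))

def prob(x, o):
    return math.exp(score(x, o)) / Z(o)

print(prob(("D", "N", "V"), sentence))  # P(X = <D, N, V> | O)
```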
CRF Equation, standard format
• Note that the denominator depends on O, but not on x (it marginalizes over x).
• Typically, we write
$$P(X = x \mid O) \;=\; \frac{1}{Z(O)} \exp\big(\theta \cdot F(x, O)\big)$$
where
$$Z(O) \;=\; \sum_{x'} \exp\big(\theta \cdot F(x', O)\big)$$
Making Structured Predictions
Structured prediction vs.
Text Classification
Recall: max. ent. for text classification:
$$\arg\max_c P(A = c \mid O = \text{doc}) \;=\; \arg\max_c \frac{1}{Z(\text{doc})} \exp\big(\theta \cdot F(c, \text{doc})\big) \;=\; \arg\max_c \theta \cdot F(c, \text{doc})$$
CRFs for sequence labeling:
$$\arg\max_y P(A = y \mid O) \;=\; \arg\max_y \frac{1}{Z(O)} \exp\big(\theta \cdot F(y, O)\big) \;=\; \arg\max_y \theta \cdot F(y, O)$$
What's the difference?
Structured prediction vs.
Text Classification
Two (related) differences, both for the sake of
efficiency:
1) Feature functions in CRFs are restricted to
graph parts (described later)
2) We can’t do brute force to compute the
argmax. Instead, we do Viterbi.
Finding the Best Sequence
Best sequence is
$$\arg\max_x P(X = x \mid O) \;=\; \arg\max_x \frac{1}{Z(O)} \exp\big(\theta \cdot F(x, O)\big) \;=\; \arg\max_x \theta \cdot F(x, O)$$
Recall from the HMM discussion:
If there are K possible states for each xi variable, and N total xi variables, then there are K^N possible settings for x.
(For example, with the 36-tag label set used later and a 20-word sentence, that is 36^20 ≈ 10^31 candidate sequences.)
So brute force can't find the best sequence.
Instead, we resort to a Viterbi-like dynamic program.
Viterbi Algorithm
[Trellis diagram: hidden states X1, …, Xt-1, Xt = hj over observations o1, …, oT]
$$\delta_j(t) \;=\; \max_{x_1 \ldots x_{t-1}} \theta \cdot F(x_1 \ldots x_{t-1},\, o_1 \ldots o_{t-1},\, x_t = h_j,\, o_t)$$
δj(t) is the best score over state sequences that account for the observations up to time t-1, land in state hj at time t, and account for the observation at time t.
Viterbi Algorithm
[Trellis diagram: hidden states x1, …, xT over observations o1, …, oT]
$$\hat{X}_T \;=\; \arg\max_i \delta_i(T)$$
$$\hat{X}_t \;=\; \psi_{\hat{X}_{t+1}}(t+1)$$
$$P(\hat{X}) \;\propto\; \max_i \delta_i(T)$$
Compute the most likely state sequence by working backwards.
Viterbi Algorithm
[Trellis diagram: hidden states X1, …, Xt-1, Xt = hj, Xt+1, … over observations o1, …, oT]
$$\delta_j(t) \;=\; \max_{x_1 \ldots x_{t-1}} \theta \cdot F(x_1 \ldots x_{t-1},\, o_1 \ldots o_{t-1},\, x_t = h_j,\, o_t)$$
$$\delta_j(t+1) \;=\; \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}} \quad ??!$$
$$\psi_j(t+1) \;=\; \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}} \quad ??!$$
Recursive Computation
(The HMM-style recursion, with transition probabilities $a_{ij}$ and emission probabilities $b_{j o_{t+1}}$, doesn't carry over directly to a CRF's feature-based scores; the "2nd try" below redoes the recursion in terms of parts.)
Feature functions and Graph parts
To make efficient computation (dynamic programs)
possible, we restrict the feature functions to:
Graph parts (or just parts): A feature function that
counts how often a particular configuration occurs
for a clique in the CRF graph.
Clique: a set of completely connected nodes in a graph.
That is, each node in the clique has an edge
connecting it to every other node in the clique.
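As a quick illustration (a sketch, not part of the slides), the cliques of a linear chain over T hidden nodes can be enumerated directly; the function name is mine.

```python
def linear_chain_cliques(T):
    """Cliques of a linear-chain CRF with hidden nodes X1..XT:
    every individual node, plus every pair of consecutive nodes."""
    singles = [(i,) for i in range(1, T + 1)]   # {Xi}
    pairs = [(i, i + 1) for i in range(1, T)]   # {Xi, Xi+1}
    return singles + pairs

print(linear_chain_cliques(6))
# [(1,), (2,), ..., (6,), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
```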
Clique Example
The cliques in a linear chain CRF are the set of
individual nodes, and the set of pairs of
consecutive nodes.
[Linear-chain CRF diagram: hidden nodes X1–X6 above observations o1–o6]
Clique Example
The cliques in a linear chain CRF are the set of
individual nodes, and the set of pairs of
consecutive nodes.
Individual node cliques
[Same CRF diagram, with the individual-node cliques {X1}, …, {X6} highlighted]
Clique Example
The cliques in a linear chain CRF are the set of
individual nodes, and the set of pairs of
consecutive nodes.
Pair-of-node cliques
[Same CRF diagram, with the pair-of-node cliques {X1, X2}, …, {X5, X6} highlighted]
Clique Example
For non-linear-chain CRFs (something we won’t
normally consider in this class), you can get
larger cliques:
[CRF diagram as above, with an extra hidden node X5' that forms a larger clique with its neighbors]
Graph part as Feature Function
Example
Graph parts are feature functions f(x,o) that
count how many cliques have a particular
configuration.
For example, f(x,o) = count of [xi = Noun].
[CRF diagram: x1=D, x2=N, x3=V, x4=D, x5=A, x6=N above observations o1–o6]
Here, x2 and x6 are both Nouns, so f(x,o) = 2.
Graph part as Feature Function
Example
For a pair-of-nodes example,
f(x,o) = count of [xi = Noun, xi+1 = Verb]
[CRF diagram: x1=D, x2=N, x3=V, x4=D, x5=A, x6=N above observations o1–o6]
Here, x2 is a Noun and x3 is a Verb, so f(x,o) = 1.
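A minimal sketch of both kinds of graph parts as counting feature functions, checked against the tag sequence D N V D A N from these examples; the function names are hypothetical.

```python
# Tag sequence from the example: x1..x6 = D N V D A N
x = ["D", "N", "V", "D", "A", "N"]

def f_node_noun(x, o=None):
    """Individual-node part: count of cliques with xi = N(oun)."""
    return sum(1 for xi in x if xi == "N")

def f_pair_noun_verb(x, o=None):
    """Pair-of-nodes part: count of cliques with xi = N and xi+1 = V."""
    return sum(1 for a, b in zip(x, x[1:]) if a == "N" and b == "V")

print(f_node_noun(x))       # 2  (x2 and x6 are Nouns)
print(f_pair_noun_verb(x))  # 1  (x2 = N, x3 = V)
```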
Features can depend on the whole observation
In a CRF, each feature function can depend on o, in addition to a clique in x
[HMM diagram: hidden states X1–X6, each emitting its own observation o1–o6]
Normally, we draw a CRF like this:
[Linear-chain CRF diagram: hidden nodes X1–X6 above observations o1–o6]
Features can depend on the whole observation
In a CRF, each feature function can depend on o, in addition to a clique in x
[HMM diagram: hidden states X1–X6, each emitting its own observation o1–o6]
But really, it's more like this:
[CRF diagram: hidden nodes X1–X6, where each Xi can connect to the entire observation sequence o1–o6]
This would cause problems for a generative model, but in a conditional model, o is always a fixed constant. So we can still run the relevant algorithms, like Viterbi, efficiently.
Graph part as Feature Function
Example
An example part including x and o:
f(x,o) = count of [xi = A or D, xi+1 = N, o2 = cat]
[CRF diagram: x1=D, x2=N, x3=V, x4=D, x5=A, x6=N above the words "The cat chased the tiny fly"]
Here, x1 is a D and x2 is an N, plus x5 is an A and x6 is an N, plus o2 = cat, so f(x,o) = 2.
Notice that the clique x5-x6 is allowed to depend on o2.
Graph part as Feature Function
Example
A more typical example including x and o:
f(x,o) = count of [xi = A or D, xi+1 = N, oi+1 = cat]
[CRF diagram: x1=D, x2=N, x3=V, x4=D, x5=A, x6=N above the words "The cat chased the tiny fly"]
Here, x1 is a D and x2 is an N, plus o2 = cat, so f(x,o) = 1.
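The same idea with the observation included, checked against the example sentence; again just a sketch, with a made-up function name.

```python
x = ["D", "N", "V", "D", "A", "N"]
o = ["The", "cat", "chased", "the", "tiny", "fly"]

def f_det_adj_noun_cat(x, o):
    """Pair part that also looks at o:
    count of positions i with xi in {A, D}, xi+1 = N, and oi+1 = 'cat'."""
    return sum(1 for i in range(len(x) - 1)
               if x[i] in ("A", "D") and x[i + 1] == "N"
               and o[i + 1].lower() == "cat")

print(f_det_adj_noun_cat(x, o))  # 1  (x1 = D, x2 = N, o2 = "cat")
```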
The CRF Equation, with Parts
• A CRF model consists of
– P = <p1, …, pk>, a vector of parts
– θ = < θ1, …, θk>, a vector of weights for each part.
• Let O = < o1, …, oT> be an observed sentence
• Let X = <x1, …, xT> be the latent variables.
$$P(X = x \mid O) \;=\; \frac{\exp\big(\theta \cdot P(x, O)\big)}{Z(O)}$$
Viterbi Algorithm – 2nd Try
[Trellis diagram: hidden states X1, …, Xt-1, Xt = hj, Xt+1, … over observations o1, …, oT]
$$\delta_j(t) \;=\; \max_{x_1 \ldots x_{t-1}} \theta \cdot P(x_1 \ldots x_{t-1},\, x_t = h_j,\, o)$$
$$\delta_j(t+1) \;=\; \max_i \Big[\, \delta_i(t) \;+\; \theta_{\text{one}} \cdot P_{\text{one}}(x_{t+1} = h_j,\, o) \;+\; \theta_{\text{pair}} \cdot P_{\text{pair}}(x_t = h_i,\, x_{t+1} = h_j,\, o) \,\Big]$$
$$\psi_j(t+1) \;=\; \arg\max_i \Big[\, \delta_i(t) \;+\; \theta_{\text{one}} \cdot P_{\text{one}}(x_{t+1} = h_j,\, o) \;+\; \theta_{\text{pair}} \cdot P_{\text{pair}}(x_t = h_i,\, x_{t+1} = h_j,\, o) \,\Big]$$
Recursive Computation
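A minimal sketch of this recursion for a linear-chain CRF, assuming the model is exposed as per-position node scores (θ_one · P_one) and pair scores (θ_pair · P_pair); the scoring callables and their names are placeholders, not the slides' notation.

```python
def crf_viterbi(node_score, pair_score, labels, T):
    """node_score(t, j): score for label j at position t (1-indexed).
    pair_score(t, i, j): score for label i at position t and j at t+1.
    Returns the highest-scoring label sequence of length T."""
    # delta[t][j] = best score of any prefix ending in label j at position t+1
    delta = [{j: node_score(1, j) for j in labels}]
    psi = [{}]  # backpointers
    for t in range(1, T):
        d, b = {}, {}
        for j in labels:
            best_i = max(labels,
                         key=lambda i: delta[-1][i] + pair_score(t, i, j))
            d[j] = delta[-1][best_i] + pair_score(t, best_i, j) + node_score(t + 1, j)
            b[j] = best_i
        delta.append(d)
        psi.append(b)
    # Backtrace from the best final label
    last = max(labels, key=lambda j: delta[-1][j])
    seq = [last]
    for t in range(T - 1, 0, -1):
        seq.append(psi[t][seq[-1]])
    return list(reversed(seq))

# Tiny usage with made-up scores that prefer D N V on a 3-word sentence.
labels = ["D", "N", "V"]
node = {(1, "D"): 2.0, (2, "N"): 2.0, (3, "V"): 2.0}
pair = {("D", "N"): 1.0, ("N", "V"): 1.0}
print(crf_viterbi(lambda t, j: node.get((t, j), 0.0),
                  lambda t, i, j: pair.get((i, j), 0.0),
                  labels, 3))   # ['D', 'N', 'V']
```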
Supervised Parameter Estimation
Conditional Training
• Given a set of observations o and the correct labels x
for each, determine the best θ:
$$\arg\max_{\theta} P(x \mid o, \theta)$$
• Because the CRF equation is just a special form of the
maximum entropy equation, we can train it exactly the
same way:
– Determine the gradient
– Step in the direction of the gradient
– Repeat until convergence
Recall: Training a ME model
Training is an optimization problem:
find the value for λ that maximizes the conditional
log-likelihood of the training data:
$$CLL(\text{Train}) \;=\; \log \prod_{\langle c,d \rangle \in \text{Train}} P(c \mid d) \;=\; \sum_{\langle c,d \rangle \in \text{Train}} \Big[ \sum_i \lambda_i f_i(c, d) \;-\; \log Z(d) \Big]$$
Recall: Training a ME model
Optimization is normally performed using some form of gradient ascent (often loosely called gradient descent), since we are maximizing the CLL:
0) Initialize λ0 to 0
1) Compute the gradient: ∇CLL
2) Take a step in the direction of the gradient:
λi+1 = λi + α ∇CLL
3) Repeat until CLL doesn’t improve:
stop when |CLL(λi+1) – CLL(λi)| < ε
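A sketch of that loop in code; `compute_cll_and_gradient` is a placeholder for the computation derived on the next slide.

```python
import numpy as np

def train_maxent(compute_cll_and_gradient, num_features,
                 alpha=0.1, eps=1e-6, max_iters=1000):
    """Generic gradient-ascent loop for maximizing conditional log-likelihood.
    compute_cll_and_gradient(lam) -> (CLL value, gradient vector)."""
    lam = np.zeros(num_features)                    # 0) initialize lambda to 0
    prev_cll = -np.inf
    for _ in range(max_iters):
        cll, grad = compute_cll_and_gradient(lam)   # 1) compute the gradient
        lam = lam + alpha * grad                    # 2) step uphill
        if abs(cll - prev_cll) < eps:               # 3) stop when CLL stops improving
            break
        prev_cll = cll
    return lam
```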
Recall: Training a ME model
Computing the gradient:
$$\frac{\partial}{\partial \lambda_i} CLL(\text{Train}) \;=\; \frac{\partial}{\partial \lambda_i} \sum_{\langle c,d \rangle \in \text{Train}} \Big[ \sum_i \lambda_i f_i(c, d) \;-\; \log Z(d) \Big]$$
$$=\; \sum_{\langle c,d \rangle \in \text{Train}} \Big[ f_i(c, d) \;-\; \frac{\partial}{\partial \lambda_i} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c', d) \Big]$$
$$=\; \sum_{\langle c,d \rangle \in \text{Train}} \Big[ f_i(c, d) \;-\; \frac{\sum_{c'} f_i(c', d)\, \exp \sum_i \lambda_i f_i(c', d)}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'', d)} \Big]$$
$$=\; \sum_{\langle c,d \rangle \in \text{Train}} \Big[ f_i(c, d) \;-\; \sum_{c'} P_\lambda(c' \mid d)\, f_i(c', d) \Big] \;=\; \sum_{\langle c,d \rangle \in \text{Train}} \Big[ f_i(c, d) \;-\; E_{P_\lambda}\big[ f_i(c, d) \big] \Big]$$
Involves a sum over all possible classes
Recall: Training a ME model:
Expected feature counts
• In ME models, each document d is classified independently.
• The sum $\sum_{c'} P_\lambda(c' \mid d)\, f_i(c', d)$ involves as many terms as there are classes c'.
• Very doable.
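In code, that expected count is just a short sum over the classes (a sketch; the argument names are mine, and `features` stands for the full list of feature functions).

```python
import math

def expected_count(f_i, d, classes, lam, features):
    """Sum over classes c' of P_lambda(c'|d) * f_i(c', d),
    with P_lambda(c'|d) proportional to exp(sum_j lam_j * f_j(c', d))."""
    # Unnormalized log scores for each class under the current weights.
    scores = {c: sum(l * f(c, d) for l, f in zip(lam, features)) for c in classes}
    z = sum(math.exp(s) for s in scores.values())   # Z(d)
    return sum(math.exp(scores[c]) / z * f_i(c, d) for c in classes)
```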
Training a CRF
$$\frac{\partial}{\partial \lambda_i} CLL(\text{Train}) \;=\; \frac{\partial}{\partial \lambda_i} \sum_{\langle x,o \rangle \in \text{Train}} \Big[ \sum_{i,t} \lambda_i f_i(x_t, x_{t+1}, o_t) \;-\; \log Z(o) \Big]$$
$$=\; \sum_{\langle x,o \rangle \in \text{Train}} \Big[ \sum_t f_i(x_t, x_{t+1}, o_t) \;-\; \frac{\partial}{\partial \lambda_i} \log \sum_{x'} \exp \sum_{i,t} \lambda_i f_i(x'_t, x'_{t+1}, o_t) \Big]$$
$$=\; \sum_{\langle x,o \rangle \in \text{Train}} \Big[ \sum_t f_i(x_t, x_{t+1}, o_t) \;-\; \frac{\sum_{x'} \sum_t f_i(x'_t, x'_{t+1}, o_t)\, \exp \sum_{i,t} \lambda_i f_i(x'_t, x'_{t+1}, o_t)}{\sum_{x''} \exp \sum_{i,t} \lambda_i f_i(x''_t, x''_{t+1}, o_t)} \Big]$$
$$=\; \sum_{\langle x,o \rangle \in \text{Train}} \Big[ \sum_t f_i(x_t, x_{t+1}, o_t) \;-\; \sum_{x'} P_\lambda(x' \mid o) \sum_t f_i(x'_t, x'_{t+1}, o_t) \Big]$$
The hard part for CRFs
Training a CRF:
Expected feature counts
• For CRFs, the term $\sum_{x'} P_\lambda(x' \mid o) \sum_t f_i(x'_t, x'_{t+1}, o_t)$ involves an exponentially large sum: one term for every possible label sequence x'.
• The solution again involves dynamic
programming, very similar to the Forward
algorithm for HMMs.
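A sketch of that Forward-style recursion for the partition function Z(O) of a linear-chain CRF, using the same hypothetical node/pair scoring interface as the Viterbi sketch above; the max is simply replaced by a log-sum-exp.

```python
import math

def crf_log_partition(node_score, pair_score, labels, T):
    """log Z(O) for a linear-chain CRF: like Viterbi, but summing
    (in log space) over predecessors instead of maximizing."""
    def logsumexp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))

    alpha = {j: node_score(1, j) for j in labels}       # forward scores at position 1
    for t in range(1, T):
        alpha = {j: node_score(t + 1, j) +
                    logsumexp([alpha[i] + pair_score(t, i, j) for i in labels])
                 for j in labels}
    return logsumexp(list(alpha.values()))
```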
CRFs vs. HMMs
Generative (Joint Probability) Models
• HMMs are generative models: That is, they can
compute the joint probability
P(sentence, hidden-states)
• From a generative model, one can compute
– Two conditional models:
• P(sentence | hidden-states) and
• P(hidden-states | sentence)
– Marginal models P(sentence) and P(hidden-states)
• For sequence labeling, we want
P(hidden-states | sentence)
Discriminative (Conditional) Models
• Most often, people are most interested in the conditional
probability
P(hidden-states | sentence)
For example, this is the distribution needed for sequence
labeling.
• Discriminative (also called conditional) models directly
represent the conditional distribution
P(hidden-states | sentence)
– These models cannot tell you the joint distribution, marginals, or other
conditionals.
– But they’re quite good at this particular conditional distribution.
Discriminative vs. Generative
• Marginal, or language model, P(sentence):
– HMM (generative): Forward algorithm or Backward algorithm, linear in the length of the sentence.
– CRF (discriminative): Can't do it.
• Find the optimal label sequence:
– HMM: Viterbi, linear in the length of the sentence.
– CRF: Viterbi, linear in the length of the sentence.
• Supervised parameter estimation:
– HMM: Bayesian learning, easy and fast.
– CRF: Convex optimization, can be slow-ish (multiple passes through the data).
• Unsupervised parameter estimation:
– HMM: Baum-Welch (non-convex optimization), slow but doable.
– CRF: Very difficult, and requires making extra assumptions.
• Feature functions:
– HMM: Parents and children in the graph. Restrictive!
– CRF: Arbitrary functions of a latent state and any portion of the observed nodes.
CRFs vs. HMMs, a closer look
It’s possible to convert an HMM into a CRF:
Set p_prior,state(x, o) = count[x1 = state]
Set θ_prior,state = log P_HMM(x1 = state) = log π_state
Set p_trans,state1,state2(x, o) = count[xi = state1, xi+1 = state2]
Set θ_trans,state1,state2 = log P_HMM(xi+1 = state2 | xi = state1) = log A_state1,state2
Set p_obs,state,word(x, o) = count[xi = state, oi = word]
Set θ_obs,state,word = log P_HMM(oi = word | xi = state) = log B_state,word
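A sketch of this conversion in code, assuming the HMM is given as an initial distribution pi, transition matrix A, and emission matrix B stored as dictionaries (purely illustrative data structures).

```python
import math

def hmm_to_crf_weights(pi, A, B):
    """Map HMM parameters to CRF weights: one weight per part,
    each set to the log of the corresponding HMM probability."""
    theta = {}
    for s, p in pi.items():                     # prior parts: [x1 = s]
        theta[("prior", s)] = math.log(p)
    for (s1, s2), p in A.items():               # transition parts: [xi = s1, xi+1 = s2]
        theta[("trans", s1, s2)] = math.log(p)
    for (s, w), p in B.items():                 # observation parts: [xi = s, oi = w]
        theta[("obs", s, w)] = math.log(p)
    return theta
```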
CRF vs. HMM, a closer look
If we convert an HMM to a CRF, all of the CRF parameters
θ will be logs of probabilities.
Therefore, they will all be between –∞ and 0
Notice: CRF parameters can be between –∞ and +∞.
So, how do HMMs and CRFs compare in terms of bias and
variance (as sequence labelers)?
– HMMs have more bias
– CRFs have more variance
Comparing feature functions
The biggest advantage of CRFs over HMMs is that they can handle
overlapping features.
For example, for POS tagging, using words as features (like oi = "the" or oi = "jogging") is quite useful.
However, it’s often also useful to use “orthographic” features, like “the
word ends in –ing” or “the word starts with a capital letter.”
These features overlap with the word features: for example, "jogging" triggers both its word feature and the "ends in -ing" feature.
• Generative models have trouble handling overlapping features
correctly
• Discriminative models don’t: they can simply use the features.
CRF Example
A CRF POS Tagger for English
Vocabulary
We need to determine the set of possible word
types V.
Let V =
{all word types in 1 million tokens of Wall Street Journal text, which we'll use for training}
∪ {UNKNOWN} (for word types we haven't seen)
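A sketch of building V this way; the toy token list below stands in for the roughly 1 million WSJ tokens.

```python
def build_vocab(training_tokens):
    """V = all word types seen in training, plus UNKNOWN for everything else."""
    vocab = set(training_tokens) | {"UNKNOWN"}

    def lookup(word):
        return word if word in vocab else "UNKNOWN"

    return vocab, lookup

# Usage on a toy token list (stand-in for ~1M tokens of WSJ text):
vocab, lookup = build_vocab(["The", "cat", "sat", "on", "the", "mat", "."])
print(lookup("cat"), lookup("aardvark"))  # cat UNKNOWN
```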
L = Label Set
Standard Penn Treebank tagset
1. CC: Coordinating conjunction
2. CD: Cardinal number
3. DT: Determiner
4. EX: Existential there
5. FW: Foreign word
6. IN: Preposition or subordinating conjunction
7. JJ: Adjective
8. JJR: Adjective, comparative
9. JJS: Adjective, superlative
10. LS: List item marker
11. MD: Modal
12. NN: Noun, singular or mass
13. NNS: Noun, plural
14. NNP: Proper noun, singular
15. NNPS: Proper noun, plural
16. PDT: Predeterminer
17. POS: Possessive ending
L = Label Set
18. PRP: Personal pronoun
19. PRP$: Possessive pronoun
20. RB: Adverb
21. RBR: Adverb, comparative
22. RBS: Adverb, superlative
23. RP: Particle
24. SYM: Symbol
25. TO: to
26. UH: Interjection
27. VB: Verb, base form
28. VBD: Verb, past tense
29. VBG: Verb, gerund or present participle
30. VBN: Verb, past participle
31. VBP: Verb, non-3rd person singular present
32. VBZ: Verb, 3rd person singular present
33. WDT: Wh-determiner
34. WP: Wh-pronoun
35. WP$: Possessive wh-pronoun
36. WRB: Wh-adverb
CRF Features
• Prior: for each tag k, [xi = k]
• Transition: for each pair of tags k, k', [xi = k and xi+1 = k']
• Word: for each tag k and word w (or word pair w, w'):
– [xi = k and oi = w]
– [xi = k and oi-1 = w]
– [xi = k and oi+1 = w]
– [xi = k and oi = w and oi-1 = w']
– [xi = k and oi = w and oi+1 = w']
• Orthography, suffix: for each s in {"ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity", …} and each tag k, [xi = k and oi ends with s]
• Orthography, punctuation and shape: for each tag k:
– [xi = k and oi is capitalized]
– [xi = k and oi is hyphenated]
– [xi = k and oi contains a period]
– [xi = k and oi is ALL CAPS]
– [xi = k and oi contains a digit (0-9)]
– …
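A sketch of how a few of these templates could be instantiated as active features at one position i of a tagged sentence; the tuple encodings and the exact template subset are illustrative, not a full re-implementation of the table.

```python
import re

SUFFIXES = ["ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"]

def position_features(x, o, i):
    """Active (binary) features for tag x[i] over observation sequence o."""
    k, w = x[i], o[i]
    feats = {("prior", k), ("word", k, w.lower())}
    if i + 1 < len(x):
        feats.add(("trans", k, x[i + 1]))           # transition template
        feats.add(("word+1", k, o[i + 1].lower()))  # next-word template
    if i > 0:
        feats.add(("word-1", k, o[i - 1].lower()))  # previous-word template
    for s in SUFFIXES:                              # orthography: suffix
        if w.lower().endswith(s):
            feats.add(("suffix", k, s))
    if w[0].isupper():       feats.add(("capitalized", k))
    if "-" in w:             feats.add(("hyphenated", k))
    if "." in w:             feats.add(("has_period", k))
    if w.isupper():          feats.add(("all_caps", k))
    if re.search(r"\d", w):  feats.add(("has_digit", k))
    return feats

print(sorted(position_features(["DT", "NN"], ["The", "cat"], 0)))
```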