Conditional Random Fields Dietrich Klakow

Warning
• I might have to leave during the lecture
for a meeting
Overview
• Sequence Labeling
• Bayesian Networks
• Markov Random Fields
• Conditional Random Fields
• Software example
Background Reading
Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21, Department of Computer and Information Science, University of Pennsylvania, 2004.
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
Sequence Labeling Tasks
Sequence: a sentence
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
POS Labels
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
Chunking
Task: find phrase boundaries:
Pierre/B-NP Vinken/I-NP ,/O 61/B-NP years/I-NP old/B-ADJP ,/O will/B-VP join/I-VP the/B-NP board/I-NP as/B-PP a/B-NP nonexecutive/I-NP director/I-NP Nov./B-NP 29/I-NP ./O
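The B-, I-, O prefixes mark the beginning, inside, and outside of a phrase. As a rough illustration (not from the slides; the tag names are reused from the example above), a few lines of Python recover phrase spans from such a label sequence:

    # Minimal sketch: recover chunk spans from BIO labels (illustrative only).
    def bio_to_spans(tokens, labels):
        spans, start, kind = [], None, None
        for i, label in enumerate(labels + ["O"]):           # "O" sentinel flushes the last chunk
            if label.startswith("B-") or label == "O":
                if start is not None:
                    spans.append((kind, tokens[start:i]))    # close the open chunk
                start, kind = (i, label[2:]) if label.startswith("B-") else (None, None)
        return spans

    tokens = ["Pierre", "Vinken", ",", "61", "years"]
    labels = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
    print(bio_to_spans(tokens, labels))   # [('NP', ['Pierre', 'Vinken']), ('NP', ['61', 'years'])]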
Named Entity Tagging
Pierre/B-PERSON Vinken/I-PERSON ,/O 61/B-DATE:AGE years/I-DATE:AGE old/I-DATE:AGE ,/O will/O join/O the/O board/B-ORG_DESC:OTHER as/O a/O nonexecutive/O director/B-PER_DESC Nov./B-DATE:DATE 29/I-DATE:DATE ./O
Supertagging
Pierre        N/N
Vinken        N
,             ,
61            N/N
years         N
old           (S[adj]\NP)\NP
,             ,
will          (S[dcl]\NP)/(S[b]\NP)
join          ((S[b]\NP)/PP)/NP
the           NP[nb]/N
board         N
as            PP/NP
a             NP[nb]/N
nonexecutive  N/N
director      N
Nov.          ((S\NP)\(S\NP))/N[num]
29            N[num]
.             .
Hidden Markov Model
HMM: just an Application of a Bayes Classifier
$(\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_N) = \arg\max_{\pi_1, \pi_2, \ldots, \pi_N} P(x_1, x_2, \ldots, x_N, \pi_1, \pi_2, \ldots, \pi_N)$
$x_1, x_2, \ldots, x_N$: observation/input sequence
$\pi_1, \pi_2, \ldots, \pi_N$: label sequence
Decomposition of Probabilities
$P(x_1, x_2, \ldots, x_N, \pi_1, \pi_2, \ldots, \pi_N) = \prod_{i=1}^{N} P(x_i \mid \pi_i)\, P(\pi_i \mid \pi_{i-1})$
$P(\pi_i \mid \pi_{i-1})$: transition probability
$P(x_i \mid \pi_i)$: emission probability
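To make the decomposition concrete, the sketch below multiplies out the factors for a two-token sequence; the transition and emission tables are toy values invented for illustration, not estimates from any corpus.

    # Illustrative only: P(x, pi) = prod_i P(x_i | pi_i) * P(pi_i | pi_{i-1})
    transition = {("START", "NNP"): 0.4, ("NNP", "VB"): 0.3}      # P(pi_i | pi_{i-1})
    emission   = {("NNP", "Vinken"): 0.01, ("VB", "join"): 0.05}  # P(x_i | pi_i)

    def hmm_joint(tokens, labels):
        p, prev = 1.0, "START"
        for x, pi in zip(tokens, labels):
            p *= emission[(pi, x)] * transition[(prev, pi)]
            prev = pi
        return p

    print(hmm_joint(["Vinken", "join"], ["NNP", "VB"]))           # 0.4*0.01 * 0.3*0.05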
Graphical view HMM
[Figure: HMM as a graphical model — observation sequence $X_1, X_2, \ldots, X_N$ on top, label sequence $\pi_1, \pi_2, \ldots, \pi_N$ below; each $X_i$ is linked to its label $\pi_i$, and consecutive labels are linked]
Criticism
• HMMs model only limited dependencies
→ come up with more flexible models
→ come up with a graphical description
Bayesian Networks
Example for Bayesian Network
[Figure: Bayesian network over C (Cloudy), S (Sprinkler), R (Rain), W (WetGrass); from Russell and Norvig (1995), AI: A Modern Approach]
Corresponding joint distribution:
$P(C, S, R, W) = P(W \mid S, R)\, P(S \mid C)\, P(R \mid C)\, P(C)$
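The factorization can be checked directly in code. The conditional probability tables below are placeholder numbers chosen only to make the sketch runnable; they are not taken from the slide.

    # Illustrative only: joint P(C,S,R,W) = P(W|S,R) * P(S|C) * P(R|C) * P(C)
    P_C = {True: 0.5, False: 0.5}                                  # P(C)
    P_S_given_C = {True: 0.1, False: 0.5}                          # P(S=True | C)
    P_R_given_C = {True: 0.8, False: 0.2}                          # P(R=True | C)
    P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                    (False, True): 0.9, (False, False): 0.0}       # P(W=True | S, R)

    def joint(c, s, r, w):
        p_s = P_S_given_C[c] if s else 1 - P_S_given_C[c]
        p_r = P_R_given_C[c] if r else 1 - P_R_given_C[c]
        p_w = P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
        return p_w * p_s * p_r * P_C[c]

    print(joint(c=True, s=False, r=True, w=True))                  # 0.9 * 0.9 * 0.8 * 0.5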
Naïve Bayes
Observations $x_1, \ldots, x_D$ are assumed to be conditionally independent given the class $z$:
$P(x_1, \ldots, x_D \mid z) = \prod_{i=1}^{D} P(x_i \mid z)$
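A minimal sketch of how this independence assumption is used for classification (the classes and word probabilities below are invented for illustration):

    # Illustrative only: pick the class z maximising P(z) * prod_i P(x_i | z)
    priors = {"spam": 0.3, "ham": 0.7}
    likelihood = {("spam", "free"): 0.2, ("spam", "meeting"): 0.01,
                  ("ham", "free"): 0.01, ("ham", "meeting"): 0.1}

    def score(z, words):
        p = priors[z]
        for w in words:
            p *= likelihood[(z, w)]
        return p

    print(max(priors, key=lambda z: score(z, ["free", "meeting"])))   # the higher-scoring class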
Markov Random Fields
• Undirected graphical model
• New terms:
  • clique in an undirected graph: a set of nodes such that every node is connected to every other node
  • maximal clique: a clique to which no node can be added without destroying the clique property
Example
[Figure: undirected graph with two highlighted node sets — the green and blue sets are cliques; the blue set is a maximal clique]
Factorization
$x$: all nodes $x_1, \ldots, x_N$
$x_C$: the nodes in clique $C$
$C_M$: the set of all maximal cliques
$\Psi_C(x_C)$: potential function ($\Psi_C(x_C) \geq 0$)
Joint distribution described by the graph:
$p(x) = \frac{1}{Z} \prod_{C \in C_M} \Psi_C(x_C)$
Normalization
$Z = \sum_{x} \prod_{C \in C_M} \Psi_C(x_C)$
Z is sometimes called the partition function.
Example
[Figure: undirected graph over nodes $x_1, \ldots, x_5$]
What are the maximal cliques?
Write down the joint probability described by this graph
→ white board
Energy Function
Define
$\Psi_C(x_C) = e^{-E(x_C)}$
Insert into the joint distribution:
$p(x) = \frac{1}{Z}\, e^{-\sum_{C \in C_M} E(x_C)}$
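A brute-force sketch for a tiny graph (the clique structure and energies are made up for illustration): each maximal clique contributes a potential $\Psi_C = e^{-E}$, and the product is normalized by the partition function $Z$.

    import itertools, math

    # Illustrative only: binary variables x1, x2, x3 with maximal cliques {x1,x2} and {x2,x3}.
    def psi_12(x1, x2):                      # potential written via an energy: exp(-E) >= 0
        return math.exp(-(0.0 if x1 == x2 else 1.0))

    def psi_23(x2, x3):
        return math.exp(-(0.0 if x2 == x3 else 1.5))

    def unnormalized(x):
        x1, x2, x3 = x
        return psi_12(x1, x2) * psi_23(x2, x3)

    states = list(itertools.product([0, 1], repeat=3))
    Z = sum(unnormalized(x) for x in states)                 # partition function
    p = {x: unnormalized(x) / Z for x in states}
    print(sum(p.values()))                                   # 1.0 (up to rounding): a proper distribution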
Conditional Random Fields
Definition
Markov random field where each random variable $y_i$ is conditioned on the complete input sequence $x_1, \ldots, x_n$
[Figure: linear-chain CRF — label sequence $y = (y_1, \ldots, y_n)$ forms a chain $y_1 - y_2 - \ldots - y_n$, and every $y_i$ is connected to the complete observation $x = (x_1, \ldots, x_n)$]
Distribution
$p(y \mid x) = \frac{1}{Z(x)}\, e^{-\sum_{i=1}^{n} \sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i)}$
$\lambda_j$: parameters to be trained
$f_j(y_{i-1}, y_i, x, i)$: feature function
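A brute-force sketch of this definition (following the slides' negative-exponent convention; the label set, feature, and weight are invented for illustration). $Z(x)$ is obtained by enumerating all label sequences, which is only feasible for tiny examples.

    import itertools, math

    LABELS = ["NNP", "IN", "OTHER"]                          # illustrative label set

    def score(y, x, feature_functions, lambdas):
        # sum_{i=1..n} sum_j lambda_j * f_j(y_{i-1}, y_i, x, i), with y_0 = START
        y = ["START"] + list(y)
        return sum(lam * f(y[i - 1], y[i], x, i)
                   for i in range(1, len(y))
                   for f, lam in zip(feature_functions, lambdas))

    def p_y_given_x(y, x, feature_functions, lambdas):
        Z = sum(math.exp(-score(yp, x, feature_functions, lambdas))     # Z(x): all label sequences
                for yp in itertools.product(LABELS, repeat=len(x)))
        return math.exp(-score(y, x, feature_functions, lambdas)) / Z

    # one invented feature; with the negative exponent a negative weight makes it favourable
    f_cap = lambda y_prev, y_i, x, i: 1 if x[i - 1][0].isupper() and y_i == "NNP" else 0
    print(p_y_given_x(["NNP", "NNP"], ["Pierre", "Vinken"], [f_cap], [-2.0]))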
Example feature functions
Modeling transitions:
$f_1(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{else} \end{cases}$
Modeling emissions:
$f_2(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_i = \text{NNP and } x_i = \text{September} \\ 0 & \text{else} \end{cases}$
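These two indicators translate directly into code; a small sketch (x is a 0-indexed token list, while the position index i runs from 1 to n as in the formulas):

    # The two example feature functions as plain Python (illustrative).
    def f1(y_prev, y_i, x, i):               # transition feature
        return 1 if y_prev == "IN" and y_i == "NNP" else 0

    def f2(y_prev, y_i, x, i):               # emission-style feature
        return 1 if y_i == "NNP" and x[i - 1] == "September" else 0

    x = ["in", "September"]
    print(f1("IN", "NNP", x, 2), f2("IN", "NNP", x, 2))      # 1 1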
Training
• Like in maximum entropy models: generalized iterative scaling (GIS)
• Convergence:
  the conditional log-likelihood is concave in the parameters $\lambda_j$ → unique maximum
  convergence of GIS is slow
  improved algorithms exist (e.g. quasi-Newton methods such as L-BFGS)
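The maximum-entropy connection can be made explicit with the gradient of the conditional log-likelihood (not spelled out on the slide; derived from the definition above with its negative exponent):

$\frac{\partial \log p(y \mid x)}{\partial \lambda_j} = \sum_{y'} p(y' \mid x) \sum_{i=1}^{n} f_j(y'_{i-1}, y'_i, x, i) \;-\; \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)$

At the optimum this is zero, i.e. the model-expected feature counts match the observed feature counts — the same fixed point that iterative scaling converges to.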
Decoding: Auxiliary Matrix
Define an additional start symbol $y_0 = \text{START}$ and a stop symbol $y_{n+1} = \text{STOP}$.
Define matrices $M_i(x)$ such that
$[M_i(x)]_{y_{i-1} y_i} = M^i_{y_{i-1} y_i}(x) = e^{-\sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i)}$
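A sketch of building these matrices with NumPy (the label set, feature, and weight are invented for illustration; START and STOP get rows and columns of their own so that the matrix products on the next slides line up):

    import numpy as np

    LABELS = ["START", "NNP", "IN", "STOP"]                  # illustrative, incl. START/STOP

    def build_M(x, i, feature_functions, lambdas):
        # M_i(x)[a, b] = exp(-sum_j lambda_j * f_j(label_a, label_b, x, i))
        M = np.zeros((len(LABELS), len(LABELS)))
        for a, y_prev in enumerate(LABELS):
            for b, y_i in enumerate(LABELS):
                s = sum(lam * f(y_prev, y_i, x, i)
                        for f, lam in zip(feature_functions, lambdas))
                M[a, b] = np.exp(-s)
        return M

    # invented feature, guarded so it is also defined at the STOP position i = n+1
    f_cap = lambda y_prev, y_i, x, i: (1 if i <= len(x) and y_i == "NNP"
                                       and x[i - 1][0].isupper() else 0)
    M1 = build_M(["Pierre", "Vinken"], 1, [f_cap], [-2.0])
    print(M1.shape)                                          # (4, 4): one row/column per label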
Reformulate Probability
With that definition we have
$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M^i_{y_{i-1} y_i}(x)$
with
$Z(x) = \sum_{y_1} \sum_{y_2} \sum_{y_3} \cdots \sum_{y_n} M^1_{y_0 y_1}(x)\, M^2_{y_1 y_2}(x) \cdots M^{n+1}_{y_n y_{n+1}}(x)$
Use Matrix Properties
Use the matrix product
$[M^1(x)\, M^2(x)]_{y_0 y_2} = \sum_{y_1} M^1_{y_0 y_1}(x)\, M^2_{y_1 y_2}(x)$
with
$Z(x) = [M^1(x)\, M^2(x) \cdots M^{n+1}(x)]_{y_0 = \text{START},\, y_{n+1} = \text{STOP}}$
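A small numerical check of this identity (illustrative only: three "labels" START, A, STOP for a two-token input; the matrix entries are arbitrary non-negative numbers standing in for $e^{-\sum_j \lambda_j f_j}$). Summing over the intermediate labels is exactly a chain of matrix products, read off at the (START, STOP) entry.

    import numpy as np

    # M_1, M_2, M_3 over 3 indices (0 = START, 1 = A, 2 = STOP)
    M = [np.array([[1.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0]]) for _ in range(3)]
    START, STOP = 0, 2

    Z_matrix = np.linalg.multi_dot(M)[START, STOP]           # [M_1 M_2 M_3]_{START, STOP}
    Z_brute = sum(M[0][START, y1] * M[1][y1, y2] * M[2][y2, STOP]
                  for y1 in range(3) for y2 in range(3))     # explicit sums over y_1, y_2
    print(Z_matrix, Z_brute)                                 # identical: 6.0 6.0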
Viterbi Decoding
• The matrices M replace the product of transition and emission probabilities
• Decoding can be done in Viterbi style
• Effort:
  • linear in the length of the sequence
  • quadratic in the number of labels
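A compact sketch of Viterbi-style decoding over the matrices $M_i(x)$, done in log space; the matrices below are placeholders as in the previous sketch. The shape of the score array makes the effort visible: one pass over the sequence, and all (previous label, next label) pairs at each step.

    import numpy as np

    def viterbi(Ms, start, stop):
        # best label sequence y_1..y_n for matrices M_1..M_{n+1}; O(n * L^2) time
        n_labels = Ms[0].shape[0]
        logM = [np.log(M + 1e-300) for M in Ms]              # avoid log(0)
        delta = np.full(n_labels, -np.inf)
        delta[start] = 0.0
        back = []
        for M in logM:                                       # forward pass
            scores = delta[:, None] + M                      # score of every (prev, next) pair
            back.append(scores.argmax(axis=0))
            delta = scores.max(axis=0)
        path, state = [], stop                               # backtrack from STOP
        for ptr in reversed(back):
            state = int(ptr[state])
            path.append(state)
        return list(reversed(path))[1:]                      # drop START, keep y_1..y_n

    Ms = [np.array([[0.0, 2.0, 0.0],
                    [0.0, 1.0, 1.0],
                    [0.0, 0.0, 1.0]]) for _ in range(3)]
    print(viterbi(Ms, start=0, stop=2))                      # [1, 1] for this toy case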
Software
CRF++
• See http://crfpp.sourceforge.net/
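For orientation, a typical CRF++ workflow looks roughly like the following (a sketch; the exact options and template syntax should be checked against the documentation on the page above). Training data has one token per line with the label in the last column and blank lines between sentences; features are declared in a template file; training and testing are separate command-line tools.

    # train.data (one token per line, label last, blank line between sentences)
    Pierre   NNP   B-NP
    Vinken   NNP   I-NP

    # template (feature templates over the input columns; B adds bigram/transition features)
    U00:%x[0,0]
    U01:%x[-1,0]
    B

    # training and decoding
    crf_learn template train.data model
    crf_test -m model test.data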
Summary
• Sequence labeling problems
• CRFs are
  • flexible
  • expensive to train
  • fast to decode