CS 388:
Natural Language Processing:
Discriminative Training and
Conditional Random Fields (CRFs)
for Sequence Labeling
Raymond J. Mooney
University of Texas at Austin
Joint Distribution
• The joint probability distribution for a set of random variables,
X1,…,Xn gives the probability of every combination of values (an n-dimensional array with v^n values if all variables are discrete with v values; all v^n values must sum to 1): P(X1,…,Xn)
positive:
          circle   square
  red      0.20     0.02
  blue     0.02     0.01

negative:
          circle   square
  red      0.05     0.30
  blue     0.20     0.20
• The marginal probability of all possible conjunctions (assignments of
values to some subset of variables) can be calculated by summing the
appropriate subset of values from the joint distribution.
P(red ∧ circle) = 0.20 + 0.05 = 0.25
P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
• Therefore, all conditional probabilities can also be calculated.
P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
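To make the arithmetic concrete, here is a minimal Python sketch (a teaching aid, with the values copied from the tables above) that stores the joint distribution as a dictionary and recovers the same marginal and conditional probabilities by summation:

```python
# Joint distribution P(category, color, shape) from the tables above.
joint = {
    ("positive", "red",  "circle"): 0.20,
    ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02,
    ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05,
    ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20,
    ("negative", "blue", "square"): 0.20,
}
VARS = ("category", "color", "shape")

def marginal(**fixed):
    """Sum the joint entries consistent with the given variable assignments."""
    return sum(p for values, p in joint.items()
               if all(values[VARS.index(var)] == val for var, val in fixed.items()))

p_red_circle = marginal(color="red", shape="circle")                      # 0.25
p_pos_given = marginal(category="positive", color="red", shape="circle") / p_red_circle
print(p_red_circle, p_pos_given)                                          # 0.25 0.8
```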
Probabilistic Classification
• Let Y be the random variable for the class which takes
values {y1,y2,…ym}.
• Let X be the random variable describing an instance
consisting of a vector of values for n features
<X1,X2,…,Xn>; let xk be a possible vector value for X and
xij a possible value for Xi.
• For classification, we need to compute P(Y=yi | X=xk)
for i = 1…m
• Could be done using joint distribution but this requires
estimating an exponential number of parameters.
Bayesian Categorization
• Determine category of xk by determining for each yi
P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
• P(X=xk) can be determined since categories are
complete and disjoint.
Σ_{i=1..m} P(Y=yi | X=xk) = Σ_{i=1..m} P(Y=yi) P(X=xk | Y=yi) / P(X=xk) = 1

Therefore: P(X=xk) = Σ_{i=1..m} P(Y=yi) P(X=xk | Y=yi)
Bayesian Categorization (cont.)
• Need to know:
– Priors: P(Y=yi)
– Conditionals: P(X=xk | Y=yi)
• P(Y=yi) are easily estimated from data.
– If ni of the examples in D are in yi then P(Y=yi) = ni / |D|
• Too many possible instances (e.g. 2^n for binary
features) to estimate all P(X=xk | Y=yi).
• Still need to make some sort of independence
assumptions about the features to make learning
tractable.
Naïve Bayes Generative Model
[Figure: the generative model as bags of values — a Category bag containing pos/neg labels, and for each class (Positive, Negative) separate Size (sm/med/lg), Color (red/blue/grn), and Shape (circ/sqr/tri) bags from which feature values are drawn.]
Naïve Bayes Inference Problem
[Figure: the same generative model, now used for inference — given the test instance <lg, red, circ>, which category (pos or neg) most probably generated it?]
Naïve Bayesian Categorization
• If we assume the features of an instance are independent given
the category (conditionally independent), then:
P(X | Y) = P(X1, X2, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)
• Therefore, we then only need to know P(Xi | Y) for each
possible pair of a feature-value and a category.
• If Y and all Xi are binary, this requires specifying only 2n
parameters:
– P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
– P(Xi=false | Y) = 1 – P(Xi=true | Y)
• Compared to specifying 2^n parameters without any
independence assumptions.
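As a rough illustration of how few parameters are then needed, the sketch below classifies a binary-feature instance from class priors P(Y) and the 2n conditionals P(Xi=true | Y); all parameter values here are made up.

```python
# Hypothetical naive Bayes parameters: binary class Y and three binary features.
prior = {True: 0.5, False: 0.5}          # P(Y)
p_xi_true = {                            # P(Xi = true | Y), one entry per feature
    True:  [0.8, 0.1, 0.6],
    False: [0.3, 0.5, 0.2],
}

def p_y_true_given_x(x):
    """P(Y=true | X=x) under the conditional independence assumption."""
    score = {}
    for y in (True, False):
        s = prior[y]
        for xi, p in zip(x, p_xi_true[y]):
            s *= p if xi else (1.0 - p)  # P(Xi=false | Y) = 1 - P(Xi=true | Y)
        score[y] = s                     # proportional to P(Y) * prod_i P(Xi | Y)
    return score[True] / (score[True] + score[False])

print(p_y_true_given_x([True, False, True]))
```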
Generative vs. Discriminative Models
• Generative models are not directly designed to
maximize the performance of classification. They model the
complete joint distribution P(X,Y).
• Classification is then done using Bayesian inference given
the generative model of the joint distribution.
• But a generative model can also be used to perform any other
inference task, e.g. P(X1 | X2, …Xn, Y)
– “Jack of all trades, master of none.”
• Discriminative models are specifically designed and trained
to maximize performance of classification. They only model
the conditional distribution P(Y | X).
• By focusing on modeling the conditional distribution, they
generally perform better on classification than generative
models when given a reasonable amount of training data.
Logistic Regression
• Assumes a parametric form for directly estimating
P(Y | X). For binary concepts, this is:
P(Y=1 | X) = 1 / (1 + exp(w0 + Σ_{i=1..n} wi Xi))

P(Y=0 | X) = 1 − P(Y=1 | X)
           = exp(w0 + Σ_{i=1..n} wi Xi) / (1 + exp(w0 + Σ_{i=1..n} wi Xi))
• Equivalent to a one-layer backpropagation neural net.
– Logistic regression is the source of the sigmoid function
used in backpropagation.
– Objective function for training is somewhat different.
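A one-function sketch of the parametric form above, keeping the slide's sign convention (so P(Y=1 | X) gets the 1/(1+exp(·)) form); the weights in the example are arbitrary:

```python
import math

def p_y1_given_x(x, w0, w):
    """P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i * X_i)), matching the slide's form."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

x = [1.0, 0.0, 2.0]                      # arbitrary feature vector
p1 = p_y1_given_x(x, w0=-0.5, w=[0.3, -1.2, 0.8])
print(p1, 1.0 - p1)                      # P(Y=1|X) and P(Y=0|X)
```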
Logistic Regression as a Log-Linear Model
• Logistic regression is basically a linear model, which
is demonstrated by taking logs.
Assign label Y=0 iff 1 < P(Y=0 | X) / P(Y=1 | X)
               iff 1 < exp(w0 + Σ_{i=1..n} wi Xi)
               iff 0 < w0 + Σ_{i=1..n} wi Xi
or equivalently, w0 > −Σ_{i=1..n} wi Xi
• Also called a maximum entropy model (MaxEnt)
because it can be shown that standard training for
logistic regression gives the distribution with maximum
entropy that is consistent with the training data.
Logistic Regression Training
• Weights are set during training to maximize the
conditional data likelihood:

W = argmax_W Π_{d∈D} P(Y^d | X^d, W)
where D is the set of training examples and Yd and
Xd denote, respectively, the values of Y and X for
example d.
• Equivalently viewed as maximizing the
conditional log likelihood (CLL)
W = argmax_W Σ_{d∈D} ln P(Y^d | X^d, W)
Logistic Regression Training
• Like neural nets, we can use standard gradient
descent to find the parameters (weights) that
optimize the CLL objective function (see the sketch after the list below).
• Many other more advanced training
methods are possible to speed convergence:
– Conjugate gradient
– Generalized Iterative Scaling (GIS)
– Improved Iterative Scaling (IIS)
– Limited-memory quasi-Newton (L-BFGS)
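A minimal sketch of per-example (stochastic) gradient ascent on the CLL objective. It uses the more common convention P(Y=1|X) = 1/(1+exp(−(w0 + Σ wi Xi))), and the toy data set and learning rate are made up; the advanced optimizers listed above would replace this simple update loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, lr=0.1, epochs=500):
    """Gradient ascent on sum_d ln P(Y_d | X_d, W) for binary logistic regression.
    data is a list of (feature_vector, label) pairs with labels in {0, 1}."""
    n = len(data[0][0])
    w0, w = 0.0, [0.0] * n
    for _ in range(epochs):
        for x, y in data:
            p1 = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
            err = y - p1                                   # gradient of the CLL w.r.t. w0
            w0 += lr * err
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w0, w

# Tiny made-up training set: the first feature indicates class 1, the second class 0.
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.1, 1.0], 0), ([0.0, 0.8], 0)]
print(train_logistic(data))
```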
Preventing Overfitting in Logistic Regression
• To prevent overfitting, one can use regularization
(a.k.a. smoothing) by penalizing large weights by
changing the training objective:
W = argmax_W Σ_{d∈D} ln P(Y^d | X^d, W) − (λ/2) ||W||^2

where λ is a constant that determines the amount of smoothing.
• This can be shown to be equivalent to MAP
parameter estimation assuming a Gaussian prior
for W with zero mean and a variance related to
1/λ.
Multinomial Logistic Regression
(MaxEnt)
• Logistic regression can be generalized to multi-class
problems (where Y has a multinomial distribution).
• Create feature functions for each combination of a
class value y´ and each feature Xj and another for the
“bias weight” of each class.
– f y´, j (Y, X) = Xj if Y= y´ and 0 otherwise
– f y´ (Y, X) = 1 if Y= y´ and 0 otherwise
• The final conditional distribution is:
P(Y | X) = (1/Z(X)) exp(Σ_{k=1..K} λk fk(Y, X))        (λk are weights)

Z(X) = Σ_Y exp(Σ_{k=1..K} λk fk(Y, X))                 (normalizing constant)
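A short sketch of this multinomial form, with the weights λk stored in a dictionary keyed by the (class, feature) pairs that index the feature functions; the class names, features, and weights are all made up.

```python
import math

classes = ["pos", "neg"]                      # hypothetical class values y'

def features(y, x):
    """f_{y',j}(Y,X) = X_j if Y = y', plus a per-class bias feature f_{y'}(Y,X) = 1."""
    feats = {(y, j): xj for j, xj in enumerate(x)}
    feats[(y, "bias")] = 1.0
    return feats

def p_y_given_x(x, lam):
    """P(Y|X) = exp(sum_k lambda_k f_k(Y,X)) / Z(X)."""
    scores = {y: math.exp(sum(lam.get(k, 0.0) * v for k, v in features(y, x).items()))
              for y in classes}
    z = sum(scores.values())                  # Z(X): sum over all class values
    return {y: s / z for y, s in scores.items()}

lam = {("pos", 0): 1.5, ("pos", "bias"): -0.2, ("neg", 1): 0.7}   # made-up weights
print(p_y_given_x([1.0, 2.0], lam))
```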
Graphical Models
• If no assumption of independence is made, then an
exponential number of parameters must be estimated for
sound probabilistic inference.
– No realistic amount of training data is sufficient to estimate so many
parameters.
• If a blanket assumption of conditional independence is made,
efficient training and inference are possible, but such a strong
assumption is rarely warranted.
• Graphical models use directed or undirected graphs over a
set of random variables to explicitly specify variable
dependencies and allow for less restrictive independence
assumptions while limiting the number of parameters that
must be estimated.
– Bayesian Networks: Directed acyclic graphs that indicate causal
structure.
– Markov Networks: Undirected graphs that capture general
dependencies.
Bayesian Networks
• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences
[Figure: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls]
Conditional Probability Tables
• Each node has a conditional probability table (CPT) that
gives the probability of each of its values given every possible
combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior
probabilities.
P(B) = .001     P(E) = .002

[Figure: the same network — Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls — annotated with these CPTs]

B  E  P(A)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

A  P(J)
T  .90
F  .05

A  P(M)
T  .70
F  .01
Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines a joint
distribution.
P(x1, x2, …, xn) = Π_{i=1..n} P(xi | Parents(Xi))
• Example
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
  = P(J | A) P(M | A) P(A | ¬B ∧ ¬E) P(¬B) P(¬E)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062
Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net
[Figure: Y with directed edges to X1, X2, …, Xn]
• Priors P(Y) and conditionals P(Xi|Y) for
Naïve Bayes provide CPTs for the network.
Markov Networks
• Undirected graph over a set of random
variables, where an edge represents a
dependency.
• The Markov blanket of a node, X, in a
Markov Net is the set of its neighbors in the
graph (nodes that have an edge connecting
to X).
• Every node in a Markov Net is
conditionally independent of every other
node given its Markov blanket.
Distribution for a Markov Network
• The distribution of a Markov net is most compactly described
in terms of a set of potential functions (a.k.a. factors,
compatibility functions), φk, for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique
k, φk assigns a non-negative real value that represents the
compatibility of these values.
• The joint distribution of a Markov network is then defined by:
P(x1, x2, …, xn) = (1/Z) Π_k φk(x{k})

where x{k} represents the joint assignment of the variables
in clique k, and Z is a normalizing constant that makes the
joint distribution sum to 1:

Z = Σ_x Π_k φk(x{k})
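A brute-force sketch of this definition on a tiny hypothetical network (a chain A — B — C with two pairwise cliques and made-up potential values); real networks need smarter inference than enumerating every assignment.

```python
from itertools import product

# Hypothetical potentials for a tiny chain A - B - C (cliques {A,B} and {B,C}).
phi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}
phi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def unnormalized(a, b, c):
    return phi_ab[(a, b)] * phi_bc[(b, c)]      # product of potentials over cliques

Z = sum(unnormalized(a, b, c) for a, b, c in product((0, 1), repeat=3))

def prob(a, b, c):
    return unnormalized(a, b, c) / Z            # P(x) = (1/Z) * prod_k phi_k(x_{k})

print(Z, prob(1, 1, 1))
```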
Sample Markov Network
[Figure: the same nodes as the burglary network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls), now with undirected edges, each carrying a potential function]

φ1(B, A): (T,T)=100  (T,F)=1   (F,T)=1   (F,F)=200
φ2(E, A): (T,T)=50   (T,F)=10  (F,T)=1   (F,F)=200
φ3(J, A): (T,T)=75   (T,F)=10  (F,T)=1   (F,F)=200
φ4(M, A): (T,T)=50   (T,F)=1   (F,T)=10  (F,F)=200
Logistic Regression as a Markov Net
• Logistic regression is a simple Markov Net
[Figure: Y connected by undirected edges to X1, X2, …, Xn]
• But only models the conditional distribution,
P(Y | X) and not the full joint P(X,Y)
• Same as a discriminatively trained naïve
Bayes.
Generative vs. Discriminative
Sequence Labeling Models
• HMMs are generative models and are not directly
designed to maximize the performance of sequence
labeling. They model the joint distribution P(O,Q).
• HMMs are trained to have an accurate probabilistic
model of the underlying language, and not all
aspects of this model benefit the sequence labeling
task.
• Conditional Random Fields (CRFs) are
specifically designed and trained to maximize
performance of sequence labeling. They model the
conditional distribution P(Q | O)
Classification
[Figure: Naïve Bayes (generative) — Y with directed edges to X1, X2, …, Xn — versus Logistic Regression (conditional/discriminative) — Y with undirected edges to X1, X2, …, Xn]
Sequence Labeling
[Figure: HMM (generative) — directed chain Y1 → Y2 → … → YT with each Yt → Xt — versus Linear-chain CRF (conditional/discriminative) — undirected chain Y1 — Y2 — … — YT with each Yt connected to Xt]
Simple Linear Chain CRF Features
• The conditional distribution is modeled
similarly to multinomial logistic
regression.
• Create feature functions fk(Yt, Yt−1, Xt)
– Feature for each state transition pair i, j
• fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j and 0 otherwise
– Feature for each state observation pair i, o
• fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o and 0 otherwise
• Note: number of features grows quadratically
in the number of states (i.e. tags).
Conditional Distribution for
Linear Chain CRF
• Using these feature functions for a simple
linear chain CRF, we can define:
P(Y | X) = (1/Z(X)) exp(Σ_{t=1..T} Σ_{k=1..K} λk fk(Yt, Yt−1, Xt))

Z(X) = Σ_Y exp(Σ_{t=1..T} Σ_{k=1..K} λk fk(Yt, Yt−1, Xt))
Adding Token Features to a CRF
• Can add token features Xi,j
[Figure: linear chain Y1 — Y2 — … — YT, with each Yt connected to its own vector of token features Xt,1, …, Xt,m]
• Can add additional feature functions for
each token feature to model conditional
distribution.
Features in POS Tagging
• For POS Tagging, use lexicographic
features of tokens.
– Capitalized?
– Starts with a numeral?
– Ends in given suffix (e.g. “s”, “ed”, “ly”)?
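A small sketch of extracting such lexicographic token features as binary indicators; the feature names are illustrative.

```python
def token_features(word):
    """Lexicographic features of a single token, as binary indicators."""
    feats = {
        "capitalized": word[:1].isupper(),
        "starts_with_digit": word[:1].isdigit(),
    }
    for suffix in ("s", "ed", "ly"):             # suffixes mentioned above
        feats["ends_with_" + suffix] = word.endswith(suffix)
    return feats

print(token_features("Quickly"))   # capitalized, ends with "ly"
```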
Enhanced Linear Chain CRF
(standard approach)
• Can also condition transition on the current
token features.
[Figure: linear chain Y1 — Y2 — … — YT in which each transition (Yt−1, Yt) is also connected to the token features Xt,1, …, Xt,m]
• Add feature functions:
• fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1,
and 0 otherwise
Supervised Learning
(Parameter Estimation)
• As in logistic regression, use the L-BFGS
optimization procedure to set the λ weights to
maximize the CLL of the supervised training
data.
• See paper for details.
Sequence Tagging
(Inference)
• A variant of the Viterbi algorithm can be used to
efficiently, in O(TN^2) time, determine the globally
most probable label sequence for a given
token sequence using a given log-linear
model of the conditional probability P(Y | X).
• See paper for details.
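A compact sketch of such a Viterbi variant: the caller supplies a function returning the local log-linear score Σk λk fk(yt, yt−1, xt) for each position and transition, and the nested loops give the O(TN^2) behavior noted above; the tag set, weights, and score function in the usage example are made up.

```python
def viterbi(x_seq, tags, local_score):
    """Most probable tag sequence under a linear-chain log-linear model.
    local_score(prev_tag, tag, x_seq, t) = sum_k lambda_k f_k(tag, prev_tag, x_t)."""
    T = len(x_seq)
    # delta[t][y]: best total score of any tag sequence ending in tag y at position t.
    delta = [{y: local_score("<start>", y, x_seq, 0) for y in tags}]
    back = []                                    # backpointers for positions 1..T-1
    for t in range(1, T):
        delta.append({})
        back.append({})
        for y in tags:
            best_prev = max(tags, key=lambda yp: delta[t - 1][yp] + local_score(yp, y, x_seq, t))
            back[-1][y] = best_prev
            delta[-1][y] = delta[t - 1][best_prev] + local_score(best_prev, y, x_seq, t)
    # Follow backpointers from the best final tag.
    path = [max(tags, key=lambda y: delta[-1][y])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Made-up usage: the same toy weights as in the earlier CRF sketch.
lam = {("obs", "N", "dog"): 2.0, ("obs", "V", "barks"): 2.0, ("trans", "N", "V"): 1.0}
def local_score(prev, y, x_seq, t):
    return lam.get(("trans", prev, y), 0.0) + lam.get(("obs", y, x_seq[t]), 0.0)

print(viterbi(["dog", "barks"], ["N", "V"], local_score))   # expected ['N', 'V']
```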
Skip-Chain CRFs
• Can model some long-distance dependencies (i.e. the
same word appearing in different parts of the text) by
including long-distance edges in the Markov model.
[Figure: linear chain over the tokens "Michael Dell said … Dell bought", with an extra skip edge connecting Y2 and Y100, the labels of the two occurrences of "Dell"]
• Additional links make exact inference intractable,
so must resort to approximate inference to try to
find the most probable labeling.
CRF Results
• Experimental results verify that CRFs have superior
accuracy on various sequence labeling tasks:
– Part-of-speech tagging
– Noun phrase chunking
– Named entity recognition
– Semantic role labeling
• However, CRFs are much slower to train and do
not scale as well to large amounts of training data.
– Training for POS on full Penn Treebank (~1M words)
currently takes “over a week.”
• Skip-chain CRFs improve results on IE.
CRF Summary
• CRFs are a discriminative approach to sequence
labeling whereas HMMs are generative.
• Discriminative methods are usually more accurate
since they are trained for a specific performance task.
• CRFs also easily allow adding additional token features
without making additional independence assumptions.
• Training time is increased since a complex
optimization procedure is needed to fit supervised
training data.
• CRFs are a state-of-the-art method for sequence
labeling.