Classification for NLP

Natural Language Processing
COMPSCI 423/723
Rohit Kate
1
Classification for NLP:
Naïve Bayes Model
Maximum Entropy Model
Some of these slides have been adapted from Raymond Mooney’s slides
from his NLP and Machine Learning courses at UT Austin.
References:
- Sections 6.6 & 6.7 from the Jurafsky & Martin book
- Naïve Bayes portions from the Word Sense Disambiguation chapters in the
Jurafsky & Martin and Manning & Schutze books
- Generative and Discriminative Classifiers: Naive Bayes and Logistic
Regression by Tom Mitchell
http://www.cs.cmu.edu/~tom/NewChapters.html
- A Maximum Entropy Approach to Natural Language Processing
by Adam L. Berger, Stephen A. Della Pietra and Vincent J. Della Pietra,
Computational Linguistics, Vol. 22, No. 1 (1996), pp. 39-71.
2
Naïve Bayes Model
3
Classification in NLP
• Several NLP problems can be formulated as
classification problems, a few examples:
– Information Extraction
• Given an entity, is it a person name or not?
• Given two protein names, does the sentence say they
interact or not?
– Word Sense Disambiguation
• I am out of money. I am going to the bank.
– Document Classification
• Given a document, which category does it belong to?
– Sentiment Analysis
• Given a passage (e.g. product or movie review), is it
saying positive things or negative things?
– Textual Entailment
• Given two sentences, can the second sentence be
inferred from the first?
4
Classification
• Usually the classification output variable is
denoted by Y and the input variables by Xs
Y: {river bank, money bank, verb bank}
X1: Previous word
X2: Next word
X3: Part-of-speech of previous word
X4: Part-of-speech of next word
• Xs are usually called features in NLP
• Coming up with good feature sets for NLP
problems is a skill: feature engineering
– Requires linguistic insights
– Requires a grasp of the theory behind the
classification method
(A toy feature-extraction sketch follows this slide.)
5
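Below is a toy sketch (not from the slides) of how the four features X1–X4 above might be extracted for the word "bank". The function name, the example sentence, and the POS tags are illustrative assumptions only.

```python
# A toy sketch of the feature set on this slide, assuming a tokenized,
# POS-tagged sentence. Names and tags are illustrative only.

def bank_features(tokens, pos_tags, i):
    """Extract X1-X4 for the token at position i (e.g., 'bank')."""
    prev_word = tokens[i - 1] if i > 0 else "<S>"                     # X1: previous word
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</S>"      # X2: next word
    prev_pos = pos_tags[i - 1] if i > 0 else "<S>"                    # X3: POS of previous word
    next_pos = pos_tags[i + 1] if i + 1 < len(pos_tags) else "</S>"   # X4: POS of next word
    return {"X1": prev_word, "X2": next_word, "X3": prev_pos, "X4": next_pos}

# Example: "I am going to the bank"
tokens = ["I", "am", "going", "to", "the", "bank"]
pos_tags = ["PRP", "VBP", "VBG", "TO", "DT", "NN"]
print(bank_features(tokens, pos_tags, tokens.index("bank")))
# {'X1': 'the', 'X2': '</S>', 'X3': 'DT', 'X4': '</S>'}
```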
Probabilistic Classification
• Often it is useful to know the
probabilities of different output values
and not just the best output value
– To have confidence in the output (0.9-0.1
vs 0.6-0.4)
– These probabilities may be useful for the
next stage of NLP processing
• Conditional probability: P(Y|X1,X2,..,Xn)
6
Probabilistic Classification
• If the joint probability distribution
P(Y,X1,X2,..,Xn) is given then the
conditional probability distribution can
be easily estimated
7
Estimating Conditional
Probabilities
X1, X2, Y                  P(Y, X1, X2)
Circle, Red, Positive      0.20
Circle, Red, Negative      0.05
Circle, Blue, Positive     0.02
Circle, Blue, Negative     0.20
Square, Red, Positive      0.02
Square, Red, Negative      0.30
Square, Blue, Positive     0.01
Square, Blue, Negative     0.20

P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle)
                           = 0.20 / 0.25 = 0.80

Similarly estimate P(Y|X1,X2) for the remaining values.
(A small Python sketch of this computation follows this slide.)
8
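A minimal sketch of the computation above, using the slide's joint probability table; the dictionary layout and function name are my own illustration.

```python
# Computing P(Y | X1, X2) from the joint distribution P(Y, X1, X2) above.
joint = {
    ("circle", "red", "positive"): 0.20, ("circle", "red", "negative"): 0.05,
    ("circle", "blue", "positive"): 0.02, ("circle", "blue", "negative"): 0.20,
    ("square", "red", "positive"): 0.02, ("square", "red", "negative"): 0.30,
    ("square", "blue", "positive"): 0.01, ("square", "blue", "negative"): 0.20,
}

def p_y_given_x(y, x1, x2):
    # P(Y=y | X1=x1, X2=x2) = P(y, x1, x2) / sum over y' of P(y', x1, x2)
    marginal = sum(p for (a, b, _), p in joint.items() if (a, b) == (x1, x2))
    return joint[(x1, x2, y)] / marginal

print(p_y_given_x("positive", "circle", "red"))   # 0.20 / 0.25 = 0.8
```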
Estimating Joint Probability
Distributions Not Easy :-(
• Assuming Y and all Xi are binary, we need
2^(n+1) − 1 entries (parameters) to specify the
joint probability distribution
• This is impossible to accurately estimate from
a reasonably-sized training set
• Note that P(Y|X1,X2,..,Xn) requires fewer
entries (2^n), why? But they are still too
many for even a small n
9
Estimating Joint Probability
Distributions
• Simplification assumptions are made about
the joint probability distribution to reduce the
number of parameters to estimate
• Let the random variables be nodes of a
graph; there are two major types of
simplifications, represented as
– Directed probabilistic graphical models
• Simplest: Naïve Bayes model
• More complex: Hidden Markov Model (HMM)
– Undirected probabilistic graphical models
• Simplest: Maximum entropy model
• More complex: Conditional Random Fields (CRFs)
10
Directed Graphical Models
• Simplification assumption: Some
random variables are conditionally
independent of others given values for
some other random variables
11
Conditional Independence
• Two random variables A and B are
conditionally independent given C if
P(A ∩ B | C) = P(A|C) P(B|C)
Rain and Thunder are not independent
(given there was Rain, it increases the
probability that there was Thunder).
But given that there was Lightning (or no
Lightning) they are independent.
P(Rain ∩ Thunder | Lightning) =
P(Rain | Lightning) P(Thunder | Lightning)
12
Directed Graphical Models
• Also known as Bayesian networks
• Simplification assumption: Some
random variables are conditionally
independent of others given values for
some other random variables
• Simplest directed graphical model:
Naïve Bayes
• Naïve Bayes assumption: The features
are conditionally independent given the
category
13
Naïve Bayes Assumption
• Features are conditionally independent
given the category
P(X1, X2, ..., Xn | Y) = Π_{i=1..n} P(Xi | Y)
• How do we estimate P(Y|X1,X2,..,Xn)
from this?
• Recall Bayes' theorem: it lets us
calculate P(B|A) in terms of P(A|B)
14
Bayes’ Theorem
P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional
probability:

P(H | E) = P(H ∩ E) / P(E)        (def. of cond. prob.)
P(E | H) = P(H ∩ E) / P(H)        (def. of cond. prob.)

Therefore P(H ∩ E) = P(E | H) P(H)

QED: P(H | E) = P(E | H) P(H) / P(E)
15
Naïve Bayes Model
From Bayes' theorem:

P(Y | X1, X2, ..., Xn) = P(X1, X2, ..., Xn | Y) P(Y) / P(X1, X2, ..., Xn)

Computing the marginal using the definition of conditional probability:

P(X1, X2, ..., Xn) = Σ_y P(X1, X2, ..., Xn | Y=y) P(Y=y)

Therefore:

P(Y | X1, X2, ..., Xn) = P(X1, X2, ..., Xn | Y) P(Y) / Σ_y P(X1, X2, ..., Xn | Y=y) P(Y=y)

Applying the naïve Bayes assumption P(X1, X2, ..., Xn | Y) = Π_{i=1..n} P(Xi | Y):

P(Y | X1, X2, ..., Xn) = P(Y) Π_{i=1..n} P(Xi | Y) / Σ_y P(Y=y) Π_{i=1..n} P(Xi | Y=y)
16
Naïve Bayes Model
• Only need to estimate P(Y) and P(Xi|Y) for all
i; with the naïve Bayes assumption this
specifies the entire joint probability
distribution
• Assuming Y and all Xis are binary, only 2n+1
parameters instead of 2^(n+1) − 1 parameters: a
dramatic reduction
[Figure: node Y (with P(Y)) has directed edges to X1, X2, X3, ..., Xn,
each edge labeled with P(X1|Y), P(X2|Y), ..., P(Xn|Y). The Lightning
example has the same shape: Lightning is the parent of Rain and Thunder.]
Directed graphical model representation
17
Naïve Bayes Example
P(Label | Size, Color, Shape)

Probability        positive   negative
P(Y)               0.5        0.5
P(small | Y)       0.4        0.4
P(medium | Y)      0.1        0.2
P(large | Y)       0.5        0.4
P(red | Y)         0.9        0.3
P(blue | Y)        0.05       0.3
P(green | Y)       0.05       0.4
P(square | Y)      0.05       0.4
P(triangle | Y)    0.05       0.3
P(circle | Y)      0.9        0.3

Test Instance: <medium, red, circle>
18
Naïve Bayes Example
Probability        positive   negative
P(Y)               0.5        0.5
P(medium | Y)      0.1        0.2
P(red | Y)         0.9        0.3
P(circle | Y)      0.9        0.3

Test Instance: <medium, red, circle>

P(positive | medium, red, circle)
= P(positive) * P(medium | positive) * P(red | positive) * P(circle | positive) / P(medium, red, circle)
= 0.5 * 0.1 * 0.9 * 0.9 / P(medium, red, circle)
= 0.0405 / P(medium, red, circle)
= 0.0405 / 0.0495 = 0.8182

P(negative | medium, red, circle)
= P(negative) * P(medium | negative) * P(red | negative) * P(circle | negative) / P(medium, red, circle)
= 0.5 * 0.2 * 0.3 * 0.3 / P(medium, red, circle)
= 0.009 / P(medium, red, circle)
= 0.009 / 0.0495 = 0.1818

(A small Python sketch reproducing this computation follows this slide.)
19
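A small sketch reproducing the posterior computation above with the slide's probability tables; the function and dictionary names are my own illustration.

```python
# Naïve Bayes posterior for the test instance <medium, red, circle>.
priors = {"positive": 0.5, "negative": 0.5}
likelihoods = {                      # P(feature value | Y) from slide 18
    "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},
    "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},
}

def naive_bayes_posterior(instance, priors, likelihoods):
    # Unnormalized score per class: P(Y) * product over i of P(Xi | Y)
    scores = {}
    for y, prior in priors.items():
        score = prior
        for value in instance:
            score *= likelihoods[y][value]
        scores[y] = score
    z = sum(scores.values())          # P(X1, ..., Xn)
    return {y: s / z for y, s in scores.items()}

print(naive_bayes_posterior(["medium", "red", "circle"], priors, likelihoods))
# {'positive': 0.818..., 'negative': 0.181...}
```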
Estimating Probabilities
• Normally, probabilities are estimated based on
observed frequencies in the training data.
• If D contains nk examples in category yk, and nijk of
these nk examples have the jth value for feature Xi, xij,
then:
        P(Xi = xij | Y = yk) = nijk / nk
• However, estimating such probabilities from small
training sets is error-prone.
• If, due only to chance, a rare feature Xi is always false
in the training data, then for every yk: P(Xi = true | Y = yk) = 0.
• If Xi = true then occurs in a test example X, the result
is that for every yk: P(X | Y = yk) = 0, and hence for every yk: P(Y = yk | X) = 0.
20
Probability Estimation
Example
Training data:

Ex   Size    Color   Shape      Category
1    small   red     circle     positive
2    large   red     circle     positive
3    small   red     triangle   negative
4    large   blue    circle     negative

Test Instance X: <medium, red, circle>

Estimated probabilities:

Probability        positive   negative
P(Y)               0.5        0.5
P(small | Y)       0.5        0.5
P(medium | Y)      0.0        0.0
P(large | Y)       0.5        0.5
P(red | Y)         1.0        0.5
P(blue | Y)        0.0        0.5
P(green | Y)       0.0        0.0
P(square | Y)      0.0        0.0
P(triangle | Y)    0.0        0.5
P(circle | Y)      1.0        0.5
P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 / P(X) = 0
P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 / P(X) = 0
21
Smoothing
• To account for estimation from small samples,
probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that
each feature is given a prior probability, p, that is
assumed to have been previously observed in a
“virtual” sample of size m.
P(Xi = xij | Y = yk) = (nijk + m·p) / (nk + m)
• For binary features, p is simply assumed to be 0.5.
22
Laplace Smoothing Example
• Assume training set contains 10 positive examples:
– 4: small
– 0: medium
– 6: large
• Estimate parameters as follows (if m = 1, p = 1/3):
– P(small | positive)  = (4 + 1/3) / (10 + 1) = 0.394
– P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
– P(large | positive)  = (6 + 1/3) / (10 + 1) = 0.576
– P(small or medium or large | positive) = 1.0
(A small Python sketch of this estimate follows this slide.)
23
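A minimal sketch of the m-estimate from slide 22, reproducing the numbers above; the function name is my own.

```python
# Smoothed estimate: P(Xi = xij | Y = yk) = (n_ijk + m*p) / (n_k + m)
def m_estimate(n_ijk, n_k, m=1.0, p=1.0 / 3):
    return (n_ijk + m * p) / (n_k + m)

counts = {"small": 4, "medium": 0, "large": 6}     # the 10 positive examples above
smoothed = {v: m_estimate(n, 10) for v, n in counts.items()}
print(smoothed)               # {'small': 0.394, 'medium': 0.030, 'large': 0.576} (approx.)
print(sum(smoothed.values())) # 1.0
```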
Naïve Bayes Model is a
Generative Model
• Models the joint probability distribution
P(Y,X1,X2,..,Xn) using P(Y) and P(Xi|Y)
• An assumed generative process: First
generate Y according to P(Y) then
generate X1,X2,..,Xn independently
according to P(X1|Y), P(X2|Y), ..,
P(Xn|Y) respectively
24
Naïve Bayes Generative
Model
[Figure: the generative process as two "bags" of labeled balls, one for the
Positive class and one for the Negative class. A Category (pos/neg) is drawn
first, then Size (sm/med/lg), Color (red/blue/grn), and Shape (circ/sqr/tri)
are each drawn independently from that class's distributions.]
25
Naïve Bayes Inference
Problem
[Figure: the same picture as the previous slide, but now a test instance
<lg, red, circ> is observed and its Category is unknown (??): which class,
Positive or Negative, is more likely to have generated it?]
26
Some Comments on Naïve
Bayes Model
• Tends to work well despite strong (or
naïve) assumption of conditional
independence
• Experiments show it to be quite
competitive with other classification
methods on standard UCI datasets
• Although it does not produce accurate
probability estimates when its
independence assumptions are
violated, it may still pick the correct
maximum-probability class in many
cases
27
Maximum Entropy Model
28
Maximum Entropy Models
• Very popular in NLP
• Several ways to look at them:
– Exponential or log-linear classifiers or multinomial
logistic regression
– Assume a parametric form for conditional
distribution
– Maximize entropy of the joint distribution given the
constraints
– Discriminative models instead of generative
(directly estimates P(Y|X1,..,Xn) instead of via
P(Y,X1,..,Xn))
29
Linear Regression
• Classification: Predict a discrete value
• Regression: Predict a real value
• Linear Regression: Predict a real value
using a linear combination of inputs
Y = W0 + W1*X1 + W2*X2 + … + Wn*Xn
Ws are the weights associated with the
features Xs
Example:
price = 16550 - 4900*(# vague adjectives)
30
Estimating Weights in Linear
Regression
• Find the Ws that minimize the sum-squared
error for the given M training examples
W* = argmin_W cost(W)

cost(W) = Σ_j (Y_predicted^(j) − Y_observed^(j))^2,
summed over the M training examples

• Statistical packages are available that solve
this fast
(A small NumPy sketch follows this slide.)
31
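A minimal sketch of least-squares weight estimation, assuming NumPy is available; the tiny dataset is made up so that it exactly matches the price example on slide 30.

```python
import numpy as np

X = np.array([[1.0, 0.0],    # each row: [bias term X0 = 1, feature X1 = # vague adjectives]
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([16550.0, 11650.0, 6750.0, 1850.0])   # consistent with price = 16550 - 4900*X1

# Solve for the weights that minimize the sum-squared error.
W, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print(W)   # approximately [16550., -4900.]
```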
Logistic Regression
• But we are interested in probabilistic
classification, that is in predicting
P(Y|X1,..,Xn)
• Can we modify linear regression to do
that?
– Nothing constrains it to be between [0,1]
which is required for a legal probability
• Predict odds (assume Y is binary)
instead of the probability
P(Y=true | X1, X2, ..., Xn) / (1 − P(Y=true | X1, X2, ..., Xn)) = Σ_{i=0..n} Wi * Xi
32
Logistic Regression
• But the LHS lies between 0 and infinity, while
the RHS can range from -infinity to
infinity
• Take the log of the LHS (known as the logit
function):

ln( P(Y=true | X1, X2, ..., Xn) / (1 − P(Y=true | X1, X2, ..., Xn)) ) = Σ_{i=0..n} Wi * Xi

Solving for the probability gives the logistic
function:

P(Y=true | X1, X2, ..., Xn) = 1 / (1 + e^(−Σ_{i=0..n} Wi * Xi))

(A small Python sketch of the logistic function follows this slide.)
33
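A small sketch of the logistic function above: turning a weighted sum of features into P(Y = true | X). The weights and feature values are arbitrary illustrations.

```python
import math

def logistic(weights, features):
    # features[0] is assumed to be the constant 1, so weights[0] is the intercept W0.
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

print(logistic([0.5, 2.0, -1.0], [1.0, 1.0, 0.0]))   # sigmoid(2.5) ≈ 0.924
```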
Logistic Regression as a Log-Linear Model
• Logistic regression is basically a linear model,
which can be seen by taking logs of the decision rule:

Assign label Y = true iff
  P(Y=true | X1, ..., Xn) / P(Y=false | X1, ..., Xn) > 1

Since this ratio equals exp(Σ_{i=0..n} wi * Xi), the rule holds iff
  1 < exp(Σ_{i=0..n} wi * Xi)
i.e., iff
  0 < Σ_{i=0..n} wi * Xi
34
Logistic Regression Training
• Weights are set during training to maximize
the conditional data likelihood:

W* = argmax_W Π_{d∈D} P(Y^d | X1^d, ..., Xn^d)

where D is the set of training examples and
Y^d and Xi^d denote, respectively, the values of
Y and Xi for example d.
• Equivalently viewed as maximizing the
conditional log likelihood (CLL):

W* = argmax_W Σ_{d∈D} ln P(Y^d | X1^d, ..., Xn^d)
35
Logistic Regression Training
• Use standard gradient descent to find
the parameters (weights) that optimize
the CLL objective function
• Many other more advanced training
methods are available
– Conjugate gradient
– Generalized Iterative Scaling (GIS)
– Improved Iterative Scaling (IIS)
– Limited-memory quasi-Newton (L-BFGS)
• Software packages are available that implement
these methods (a minimal gradient-ascent sketch
follows this slide)
36
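A minimal sketch of plain batch gradient ascent on the CLL for binary logistic regression, assuming Y is encoded as 0/1 and a bias feature X0 = 1. The toy data, learning rate, and epoch count are illustrative; in practice one of the optimizers listed above, from an existing package, would be used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, Y, lr=0.1, epochs=1000):
    # X: feature vectors with X[d][0] == 1 (bias); Y: 0/1 labels.
    n = len(X[0])
    w = [0.0] * n
    for _ in range(epochs):
        # Gradient of the CLL: sum over d of X_i^d * (Y^d - P(Y=1 | X^d))
        grad = [0.0] * n
        for x, y in zip(X, Y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(n):
                grad[i] += x[i] * (y - p)
        w = [wi + lr * gi for wi, gi in zip(w, grad)]   # ascend the gradient
    return w

X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
Y = [0, 0, 1, 1]                      # the label follows the third feature
print(train_logistic(X, Y))           # weight on the third feature dominates
```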
Preventing Overfitting in Logistic
Regression
• To prevent overfitting, one can use regularization
(smoothing) by penalizing large weights by changing
the training objective:
W* = argmax_W [ Σ_{d∈D} ln P(Y^d | X1^d, ..., Xn^d, W) − (λ/2) ||W||^2 ]

where λ is a constant that determines the amount of smoothing
(the gradient of this objective is given after this slide)

• This can be shown to be equivalent to assuming a
Gaussian prior for W with zero mean and a
variance related to 1/λ.
37
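The slide stops at the objective; for reference, a sketch of its partial derivative (the standard result in the Tom Mitchell chapter listed in the references, assuming Y is encoded as 0/1) is:

```latex
\frac{\partial}{\partial w_i}\Big[\sum_{d \in D} \ln P(Y^d \mid X^d, W) - \frac{\lambda}{2}\lVert W\rVert^2\Big]
  \;=\; \sum_{d \in D} X_i^d \,\bigl(Y^d - \hat{P}(Y^d = 1 \mid X^d, W)\bigr) \;-\; \lambda\, w_i
```

Gradient ascent with this expression is the update from the sketch after slide 36 plus the extra −λ·wi penalty term.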
Generative vs. Discriminative
Models
• Generative models are not directly designed to
maximize the performance of classification. They
model the complete joint distribution P(Y,X1,...,Xn).
• But a generative model can also be used to perform
any other inference task, e.g. P(X1 | X2, …Xn, Y)
– “Jack of all trades, master of none.”
• Discriminative models are specifically designed and
trained to maximize performance of classification. They
only model the conditional distribution P(Y | X1, …Xn).
• By focusing on modeling the conditional distribution,
they generally perform better on classification than
generative models when given a reasonable amount of
training data.
– Master of one trade: Classification P(Y|X1,.. Xn)
38
Multinomial Logistic Regression
(Maximum Entropy or MaxEnt)
• So far Y was binary; a generalization is needed when
Y takes multiple values (classes)
• Make the weights dependent on the class c:
Wci instead of Wi
P(c | X1, ..., Xn) = exp(Σ_{i=0..N} Wci * Xi) / Σ_{c'∈Classes} exp(Σ_{i=0..N} Wc'i * Xi)

The denominator is a normalization term (Z) so that the probabilities sum to 1.
(A small Python sketch of this softmax form follows this slide.)
39
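A small sketch of the multinomial (softmax) form above, with one weight vector per class; the classes match the "bank" example and the numbers are arbitrary illustrations.

```python
import math

def maxent_probs(weights, x):
    # weights: {class: [W_c0, ..., W_cN]}, x: [X_0 = 1, X_1, ..., X_N]
    scores = {c: math.exp(sum(wc_i * xi for wc_i, xi in zip(wc, x)))
              for c, wc in weights.items()}
    z = sum(scores.values())                 # normalization term Z
    return {c: s / z for c, s in scores.items()}

weights = {"river bank": [0.1, 1.2], "money bank": [0.3, -0.5], "verb bank": [-0.2, 0.1]}
print(maxent_probs(weights, [1.0, 1.0]))     # the probabilities sum to 1
```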
Multinomial Logistic Regression
(Maximum Entropy or MaxEnt)
• Usually features take binary values in
NLP
• Introduce indicator functions (0 or 1
output) that depend on the input and
output class
• Call the input X; the features are fi(c, x)

P(c | X) = exp(Σ_{i=0..N} Wci * fi(c, x)) / Σ_{c'∈Classes} exp(Σ_{i=0..N} Wc'i * fi(c', x))
40
A Small MaxEnt Example
• Word Sense Disambiguation:
Y: {river bank, money bank, verb bank}
X: Entire Sentence
Features:
f1(river bank,X) = 1 if “river” is in the sentence, 0 otherwise
f2(river bank,X) = 1 if “water” is in the sentence, 0 otherwise
f3(money bank,X) = 1 if “money” is in the sentence, 0 otherwise
f4(money bank,X) = 1 if “deposit” is in the sentence, 0 otherwise
f5(verb bank,X) = 1 if previous word was “to”, 0 otherwise
• Obtain examples of feature values and Y from
annotated training data
• Compute weights Wci to maximize the conditional
log-likelihood of the training data
• For a test example, predict Y using the MaxEnt equation
(A small Python sketch of these feature functions follows this slide.)
41
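A sketch of the five indicator features f1–f5 from this slide plugged into the MaxEnt equation from slide 40. The weights are made-up illustrations; in practice they are learned from annotated data as described above.

```python
import math

def features(c, words, position_of_bank):
    prev_word = words[position_of_bank - 1] if position_of_bank > 0 else ""
    return [
        1 if c == "river bank" and "river" in words else 0,      # f1
        1 if c == "river bank" and "water" in words else 0,      # f2
        1 if c == "money bank" and "money" in words else 0,      # f3
        1 if c == "money bank" and "deposit" in words else 0,    # f4
        1 if c == "verb bank" and prev_word == "to" else 0,      # f5
    ]

def maxent(words, position_of_bank, W):
    # P(c | X) = exp(sum_i Wci * fi(c, x)) / sum_c' exp(sum_i Wc'i * fi(c', x))
    classes = ["river bank", "money bank", "verb bank"]
    scores = {c: math.exp(sum(w * f for w, f in zip(W[c], features(c, words, position_of_bank))))
              for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

W = {"river bank": [1.5, 1.0, 0, 0, 0],      # illustrative weights, not learned
     "money bank": [0, 0, 1.8, 1.2, 0],
     "verb bank":  [0, 0, 0, 0, 2.0]}
words = "i am out of money i am going to the bank".split()
print(maxent(words, words.index("bank"), W))   # "money bank" gets the highest probability
```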
Why is it Called Maximum
Entropy Model?
• Entropy of a random variable Y:

H(Y) = − Σ_Y P(Y) log2(P(Y))

• The more uniform the distribution, the higher
the entropy (see the sketch after this slide)
• It can be shown that standard training for
logistic regression gives the distribution
with maximum entropy that is empirically
consistent with the training data
42
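A tiny sketch of the entropy formula above, showing that the uniform distribution has the highest entropy for a fixed number of outcomes; the distributions are arbitrary illustrations.

```python
import math

def entropy(dist):
    # H(Y) = -sum_y P(y) * log2(P(y)), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([1/3, 1/3, 1/3]))   # 1.585 (maximum for 3 outcomes)
print(entropy([0.8, 0.1, 0.1]))   # 0.922
print(entropy([1.0, 0.0, 0.0]))   # 0.0
```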
Undirected Graphical Model
• Also called Markov Network, Random Field
• Undirected graph over a set of random
variables, where an edge represents a
dependency
• The Markov blanket of a node, X, in a
Markov Net is the set of its neighbors in the
graph (nodes that have an edge connecting
to X)
• Every node in a Markov Net is conditionally
independent of every other node given its
Markov blanket
• Simplest Markov Network: MaxEnt model
43
Relation with Naïve Bayes
[Figure: the same graph over Y and X1, X2, ..., Xn shown twice. Read
generatively (Y generating the Xs) it is Naïve Bayes; read conditionally /
discriminatively it is Logistic Regression.]
44
Simplification Assumption for
MaxEnt
• The probability P(Y|X1..Xn) can be factored
as:
P(c | X1, ..., Xn) = exp(Σ_{i=0..N} Wci * Xi) / Σ_{c'∈Classes} exp(Σ_{i=0..N} Wc'i * Xi)
• Note there is no product term that has two or
more Xis
45
Naïve Bayes and MaxEnt
• Naïve Bayes can be extended to work with
continuous inputs X (like logistic regression)
• Both make the conditional independence
assumption
• MaxEnt is not rigidly tied to it because it
tries to maximize the conditional likelihood of
the data even when the data disobeys the
assumption
• It has been observed that with scarce training
data Naïve Bayes performs better and with
sufficient data MaxEnt performs better
46
Classification in General
• Several other classifiers are also available:
perceptron, neural networks, support vector
machines, k-nearest neighbors, decision trees…
• Naïve Bayes and MaxEnt are based on probabilities
– Can’t handle combinations of features as single features
– If the right features are engineered they work very well
• Are widely used for tasks other than NLP tasks
• All this was for single-label classification (there was
only one Y); extensions exist to handle multi-label
classification, e.g. sequence labeling with HMMs or
CRFs
47
HW 2
• Write Naïve Bayes (P(Y|f1,f2,f3,f4,f5))
and MaxEnt (P(Y|X)) equations for the
example shown on slide #41.
48
References for Next Class
• Chapter 5 (part-of-speech tagging) of Jurafsky &
Martin book; Chapter 10 of Manning and Schutze
book
• An Introduction to Conditional Random Fields for
Relational Learning By Charles Sutton and Andrew
McCallum, Book chapter in Introduction to Statistical
Relational Learning. Edited by Lise Getoor and Ben
Taskar. MIT Press. 2006
49