Log-Linear Models in NLP Applications

Daniel Marcu
marcu@isi.edu
ACKNOWLEDGMENTS AND ATTRIBUTIONS: These lecture notes reproduce
significant segments of the Tutorial Notes of Manning and Klein (2003) and Lecture
Notes of Collins (2003).
1. Example problem: POS tagging
The/DT Canadian/JJ government/NN plans/VBZ to/TO auction/VB on/IN Tuesday/NNP
750/CD million/CD Canadian/JJ dollars/NN of/IN bonds/NNS.
Assume that we want to build a system that automatically assigns POS tags to input
sentences. Assume that we traverse the text left to right and, at any given word, we
decide which tag to assign. We would like to make informed decisions that depend on
the context/history accumulated during the tagging process. When tagging, we would
like to consider answering questions like:
•	What is the word that we are tagging right now? (If it is “the”, we suspect the
	probability of labeling it as DT is high.)
•	What was the tag we assigned to the previous word in the sequence? (If it was
	DT, we are probably more likely to assign to the current word a JJ or NN than a
	VBD tag.)
We would like a mechanism for enumerating all these “questions/features” that we
assume to be useful for tagging, and a mechanism for automatically learning the weights
we should assign to these features when multiple features fire. We would like to be able
to use an arbitrarily large number of features and to impose few or no constraints on the
types of questions that the features model.
It is usually infeasible to memorize all the steps we take during the tagging process, so
we usually work with a finite history/context:
Example
Input domain: X = <t_{i-3}, t_{i-2}, t_{i-1}, w[1:n], i>  // a history consisting of the previous three
tags, all the words in the sequence to be tagged, and the position i of the word to be tagged.
Output labels: Y = {DT, NN, JJ, IN, NNS, …}
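For concreteness, here is a minimal Python sketch of one way to represent such a history
(the class and field names are illustrative, not part of the original notes):

    from typing import NamedTuple, Tuple

    # A history <t_{i-3}, t_{i-2}, t_{i-1}, w[1:n], i>: the previous three tags,
    # the full word sequence, and the position of the word being tagged.
    class History(NamedTuple):
        t_minus_3: str
        t_minus_2: str
        t_minus_1: str
        words: Tuple[str, ...]   # w[1:n], the whole sentence
        i: int                   # 0-based index of the word to tag

    x = History("DT", "JJ", "NN", ("The", "Canadian", "government", "plans"), 3)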
2. Generic conditional models
Given an input domain X, a finite set of labels Y, and many training examples <x, y>, learn a
conditional probability P(y | x) for any x ∈ X and y ∈ Y.
Example training set
W./NNP Ed/NNP Tyler/NNP ,/, 37/CD years/NNS old/JJ ,/, was/VBD elected/VBN
president/NN ./.
Stoltzman/NN has/VBZ taken/VBN a/DT gentler/JJR ,/, more/RBR audience-friendly/JJ
approach/NN ./.
…
We want to be able to label any other sentence with POS tags, e.g.:
Mary has a little lamb .
3. Log-linear models
Given
•	An input domain X and a finite set of labels Y
•	A set of m feature functions φ_k : X × Y → ℝ (very often these are indicator/binary
	functions: φ_k : X × Y → {0,1})
•	The feature vectors Φ(x,y) ∈ ℝ^m induced by the feature functions φ_k for any x ∈ X
	and y ∈ Y
learn a conditional probability P(y | x, W), where
•	W is a parameter vector of weights (W ∈ ℝ^m)
•	P(y | x, W) = e^{W·Φ(x,y)} / ∑_{y'∈Y} e^{W·Φ(x,y')}
log P(y | x, W) = W·Φ(x,y) − log ∑_{y'∈Y} e^{W·Φ(x,y')}  [subtraction between a linear term
and a normalization term]
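As an illustration, here is a minimal Python sketch of this conditional probability (not the
API of any particular package; the function and argument names are made up):

    import math

    def log_linear_prob(x, y, labels, phi, W):
        """P(y | x, W) as defined above; phi(x, label) returns the feature
        vector Phi(x, label) as a list of m numbers, W is a list of m weights."""
        def score(label):
            return sum(w * f for w, f in zip(W, phi(x, label)))
        scores = {label: score(label) for label in labels}
        z = sum(math.exp(s) for s in scores.values())   # normalization term
        return math.exp(scores[y]) / z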
Examples of feature functions for POS tagging:
φ_1(x,y) = 1 if the current word w_i is “the” and y = DT; 0 otherwise
φ_2(x,y) = 1 if the current word w_i ends in “ing” and y = VBG; 0 otherwise
φ_3(x,y) = 1 if <t_{i-2}, t_{i-1}, y> = <DT, JJ, Vt>; 0 otherwise
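A minimal Python sketch of these three indicators, assuming a history object x with the
illustrative fields words, i, t_minus_2, and t_minus_1 from the earlier sketch:

    def phi_1(x, y):
        return 1 if x.words[x.i] == "the" and y == "DT" else 0

    def phi_2(x, y):
        return 1 if x.words[x.i].endswith("ing") and y == "VBG" else 0

    def phi_3(x, y):
        return 1 if (x.t_minus_2, x.t_minus_1, y) == ("DT", "JJ", "Vt") else 0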
It is natural to come up with as many feature functions (<history, output> pairs) as we
can.
Learning in this framework then amounts to finding the weights W_ML that
maximize the likelihood of the training corpus:
W_ML = argmax_{W ∈ ℝ^m} L(W) = argmax_{W ∈ ℝ^m} ∏_{i=1..n} P(y_i | x_i)
L(W) = ∑_{i=1..n} log P(y_i | x_i)
     = ∑_{i=1..n} W·Φ(x_i, y_i) − ∑_{i=1..n} log ∑_{y'∈Y} e^{W·Φ(x_i, y')}
Note: Finding the parameters that maximize the likelihood/probability of some training
corpus is a “universal” machine learning trick.
Summary: we have cast the learning problem as an optimization problem.
Several solutions exist for solving this problem:
•	Gradient ascent (sketched below)
•	Conjugate gradient methods
•	Iterative scaling
•	Improved iterative scaling
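A minimal Python sketch of the first option, batch gradient ascent on L(W) (all names are
illustrative; a dense list-of-floats feature representation is assumed). It uses the standard
fact that the gradient of L(W) with respect to W_k is the empirical count of feature k minus
its expected count under the current model:

    import math

    def train(data, labels, phi, m, iterations=100, learning_rate=0.1):
        """data is a list of (x, y) pairs; phi(x, y) returns a list of m feature values."""
        W = [0.0] * m
        for _ in range(iterations):
            grad = [0.0] * m
            for x, y in data:
                # empirical feature counts
                for k, f in enumerate(phi(x, y)):
                    grad[k] += f
                # expected feature counts under the current P(y' | x, W)
                scores = [sum(w * f for w, f in zip(W, phi(x, yp))) for yp in labels]
                z = sum(math.exp(s) for s in scores)
                for yp, s in zip(labels, scores):
                    p = math.exp(s) / z
                    for k, f in enumerate(phi(x, yp)):
                        grad[k] -= p * f
            W = [w + learning_rate * g for w, g in zip(W, grad)]
        return W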
4. Maximum Entropy Models
An equivalent approach to learning conditional probability models is this:
•	There are lots of conditional distributions out there, most of them very spiked,
	overfit, etc. Let Q be the set of distributions that can be specified in log-linear
	form:
	Q = { p : p(y | x_i) = e^{W·Φ(x_i,y)} / ∑_{y'∈Y} e^{W·Φ(x_i,y')} }
•	We would like to learn a distribution that is as uniform as possible without
	violating any of the requirements imposed by the training data. Let P be the set of
	distributions consistent with the data:
	P = { p : ∑_{i=1..n} Φ(x_i, y_i) = ∑_{i=1..n} ∑_{y∈Y} p(y | x_i) Φ(x_i, y) }
	(empirical count = expected count)
	Here p is an n × |Y| vector defining p(y | x_i) for all i, y.
	Note that a distribution satisfying the above equality always exists:
	p(y | x_i) = 1 if y = y_i; 0 otherwise
•	Because uniformity equates to high entropy, we can search for distributions that are
	both consistent with the requirements imposed by the data and have high entropy.
•	Entropy of a vector p:
	H(p) = −∑_x p_x log p_x
•	Entropy is uncertainty, but also non-commitment.
•	What do we want from a distribution p?
	o	Minimize commitment = maximize entropy
	o	Resemble some reference distribution (the data)
•	Solution: maximize entropy H, subject to constraints f.
•	Adding constraints (features):
	o	Lowers maximum entropy
	o	Raises maximum likelihood
	o	Brings the distribution further from uniform
	o	Brings the distribution closer to a target distribution
•	Let's say we have the following event space:

	NN     NNS    NNP    NNPS   VBZ    VBD

•	… and the following empirical data (counts):

	NN     NNS    NNP    NNPS   VBZ    VBD
	3      5      11     13     3      1

•	Maximize H:

	NN     NNS    NNP    NNPS   VBZ    VBD
	1/e    1/e    1/e    1/e    1/e    1/e

•	… but we wanted probabilities: E[NN, NNS, NNP, NNPS, VBZ, VBD] = 1

	NN     NNS    NNP    NNPS   VBZ    VBD
	1/6    1/6    1/6    1/6    1/6    1/6

•	This is probably too uniform given the empirical data. We notice that N* tags are
	more common than V* tags in the real data, so we introduce a feature
	f_N = {NN, NNS, NNP, NNPS}, with E[f_N] = 32/36:

	NN     NNS    NNP    NNPS   VBZ    VBD
	8/36   8/36   8/36   8/36   2/36   2/36

•	… and proper nouns are more frequent than common nouns, so we add
	f_P = {NNP, NNPS}, with E[f_P] = 24/36:

	NN     NNS    NNP    NNPS   VBZ    VBD
	4/36   4/36   12/36  12/36  2/36   2/36

•	… we could keep refining the model, for example by adding a feature to
	distinguish singular vs. plural nouns, or verb types.
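As a quick numerical check, here is a minimal sketch using numpy and scipy (which the
notes do not assume; the variable names are made up) that recovers the last distribution
by maximizing entropy directly, subject to the constraints ∑p = 1, E[f_N] = 32/36, and
E[f_P] = 24/36:

    import numpy as np
    from scipy.optimize import minimize

    labels = ["NN", "NNS", "NNP", "NNPS", "VBZ", "VBD"]
    f_N = np.array([1, 1, 1, 1, 0, 0], dtype=float)   # noun feature
    f_P = np.array([0, 0, 1, 1, 0, 0], dtype=float)   # proper-noun feature

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)       # avoid log(0)
        return np.sum(p * np.log(p))     # minimizing -H(p) maximizes entropy

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
        {"type": "eq", "fun": lambda p: p @ f_N - 32 / 36},
        {"type": "eq", "fun": lambda p: p @ f_P - 24 / 36},
    ]
    result = minimize(neg_entropy, np.full(6, 1 / 6), method="SLSQP",
                      bounds=[(0.0, 1.0)] * 6, constraints=constraints)
    print(dict(zip(labels, np.round(result.x * 36, 2))))
    # roughly {NN: 4, NNS: 4, NNP: 12, NNPS: 12, VBZ: 2, VBD: 2} (in 36ths)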
Fundamental theorem: It turns out that finding the maximum-likelihood solution to the
optimization problem in Section 3 is the same as finding the maximum-entropy solution
to the problem in Section 4.
•	The maximum-entropy solution can be written in log-linear form.
•	Finding the maximum-likelihood solution also gives the maximum-entropy
	solution.
5. Comparison of Naïve-Bayes and Log-linear models

Naïve-Bayes:
•	Trained to maximize the joint likelihood of data and classes.
•	Features are assumed to supply independent evidence.
•	Feature weights can be estimated independently.
•	Features must be of the conjunctive form Φ(x) ∧ (y = y_i).

Log-linear models:
•	Trained to maximize the conditional likelihood of the classes given the data.
•	Feature weights take feature dependence into account.
•	Feature weights must be estimated jointly.
•	Features need not be in conjunctive form, but they usually are.
6. NLP-specific issues
Lots of features
•	Current software packages can handle millions of features.
•	Storing feature arrays can nevertheless be expensive memory-wise.
Lots of sparsity
•	Overfitting is likely to happen – need smoothing!
•	Many features seen in training will never occur at test time – use feature
	selection!
Feature weight optimization
•	Learning can take a long time.
Feature interaction
•	Maxent models handle overlapping features well, but do not automatically model
	feature interactions.
•	In order to enable interaction between features, one needs to explicitly add
	features that model such interactions (e.g., f1, f2, and the conjunction f1 ∧ f2);
	see the sketch below.
ME classifiers are widely used in a large variety of NLP applications.
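A minimal sketch of explicitly adding such an interaction feature (the particular feature
definitions are made-up examples, not taken from the notes):

    def f1(x, y):
        # e.g., the current word is capitalized and the proposed tag is NNP
        return 1 if x.words[x.i].istitle() and y == "NNP" else 0

    def f2(x, y):
        # e.g., the previous tag is DT and the proposed tag is NNP
        return 1 if x.t_minus_1 == "DT" and y == "NNP" else 0

    def f1_and_f2(x, y):
        # the conjunction fires only when both component features fire
        return f1(x, y) * f2(x, y)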
7. Using conditional probability models in tagging applications
•	TRAIN(D) is a training algorithm that takes as input a data set D (histories,
	outputs, feature vectors) and returns a parameter vector W. Many software
	packages implement this function.
•	PROB returns a probability distribution over possible outputs y given a history x
	and a parameter vector W: P(y | x, W) (see [Collins, 2003, Lecture 7] for
	information on how to compute this value).
Question: how do we implement a TAGGER(S, W) that tags an entire sequence S?
Solution: dynamic programming.
Log-Linear Taggers: Independence Assumptions
•	The input sentence S, with length n = S.length, has |T|^n possible tag sequences.
•	Each tag sequence T has a conditional probability
	P(T | S) = ∏_{j=1..n} P(T(j) | S, j, T(1) … T(j−1))	(chain rule)
	         = ∏_{j=1..n} P(T(j) | S, j, T(j−2), T(j−1))	(independence assumptions)
•	We have a black-box PROB which can compute the P(T(j) | S, j, T(j−2), T(j−1)) terms.
•	How do we compute TAGGER(S, W), where
	TAGGER(S, W) = argmax_{T ∈ T^n} P(T | S) = argmax_{T ∈ T^n} log P(T | S) ?
The Viterbi Algorithm
•	Define n = S.length, i.e., the length of the sentence.
•	Define a dynamic programming table
	π[i, t_{-2}, t_{-1}] = maximum log probability of a tag sequence ending
	in tags t_{-2}, t_{-1} at position i.
•	Our goal is to calculate max_{t_{-2}, t_{-1} ∈ T} π[n, t_{-2}, t_{-1}].
The Viterbi Algorithm: Recursive Definitions
•	Base case:
	π[0, *, *] = log 1 = 0
	π[0, t_{-2}, t_{-1}] = log 0 = −∞ for all other t_{-2}, t_{-1}
•	Recursive case: for i = 1 … S.length, for all t_{-2}, t_{-1}:
	π[i, t_{-2}, t_{-1}] = max_{t ∈ T ∪ {*}} { π[i−1, t, t_{-2}] + log P(t_{-1} | S, i, t, t_{-2}) }
	Backpointers allow us to recover the max-probability sequence:
	BP[i, t_{-2}, t_{-1}] = argmax_{t ∈ T ∪ {*}} { π[i−1, t, t_{-2}] + log P(t_{-1} | S, i, t, t_{-2}) }
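A minimal Python sketch of this recursion (the names are illustrative; prob stands in for
the black-box PROB and is assumed to return P(t_{-1} | S, i, t, t_{-2}) > 0; tags is the tag
set T and "*" is the padding tag):

    import math

    def possible(tags, k):
        # tags allowed at position k; positions 0 and -1 carry the padding tag "*"
        return ["*"] if k <= 0 else tags

    def viterbi(S, tags, prob):
        n = len(S)
        # pi[(i, u, v)]: max log probability of a sequence ending in tags u, v at position i
        pi = {(0, "*", "*"): 0.0}
        bp = {}
        for i in range(1, n + 1):
            for u in possible(tags, i - 1):
                for v in possible(tags, i):
                    score, best_t = max(
                        (pi[(i - 1, t, u)] + math.log(prob(v, S, i, t, u)), t)
                        for t in possible(tags, i - 2)
                    )
                    pi[(i, u, v)] = score
                    bp[(i, u, v)] = best_t
        # pick the best final tag pair, then follow backpointers
        _, u, v = max((pi[(n, u, v)], u, v)
                      for u in possible(tags, n - 1) for v in possible(tags, n))
        sequence = [u, v]
        for i in range(n, 2, -1):
            sequence.insert(0, bp[(i, sequence[0], sequence[1])])
        return sequence[-n:]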
The Viterbi Algorithm: Running Time
•	O(n|T|³) time to calculate log P(t_{-1} | S, i, t, t_{-2}) for all i, t, t_{-2}, t_{-1}.
•	n|T|² entries in π to be filled in.
•	O(|T|) time to fill in one entry
	(assuming O(1) time to look up log P(t_{-1} | S, i, t, t_{-2})).
•	⇒ O(n|T|³) time overall.
8. Available software packages
•	Zhang Le: http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
•	Chris Manning and Dan Klein, Stanford University, Java-based, several
	optimization algorithms: http://nlp.stanford.edu/downloads/classifier.shtml
•	Franz Och, gcc, 132 lines of C: http://www.fjoch.com/YASMET.html
•	Jason Baldridge, University of Edinburgh, Java-based:
	http://maxent.sourceforge.net/
•	Rob Malouf, UCSD, several optimization algorithms:
	http://wwwrohan.sdsu.edu/~malouf/pubs.html (no longer seems to be available)
•	Hugo WL, Perl: http://search.cpan.org/~terdoest/
References
•	Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and
	Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL
	2003. (http://nlp.stanford.edu/downloads/classifier.shtml)
•	Michael Collins. 2003. Lecture Notes 6 and 7 for course 6.891, Machine
	Learning Approaches for Natural Language Processing, MIT.
	(http://www.ai.mit.edu/courses/6.891-nlp/)
•	Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language
	Ambiguity Resolution. Ph.D. Dissertation, University of Pennsylvania.
	(http://www.cis.upenn.edu/~adwait/statnlp.html)