Max-margin sequential learning methods
William W. Cohen
CALD
Announcements
• Upcoming assignments:
– Wed 3/3: project proposal due:
• personnel + 1-2 pages
– Spring break next week, no class
– You will get feedback on project proposals by the end of break
– Write-ups for the “Distance Metrics for Text” week are due Wed 3/17
• not the Monday after spring break
Collins’ paper
• Notation:
– label (y) is a “tag” t
– observation (x) is a word w
– history h is a 4-tuple <t_i, t_{i-1}, w[1:n], i>
– φ_s(h,t) is a feature of h and t
Collins’ paper
• Notation cont’d:
– Φ_s is the sum of φ_s over all positions i
– α_s is the weight given to Φ_s
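
Spelled out (a hedged restatement of Collins’ notation, not a quote from the paper):

\Phi_s(w_{[1:n]}, t_{[1:n]}) \;=\; \sum_{i=1}^{n} \phi_s(h_i, t_i)

t^{*}_{[1:n]} \;=\; \arg\max_{t_{[1:n]}} \sum_{s} \alpha_s \, \Phi_s(w_{[1:n]}, t_{[1:n]})

i.e., decoding picks the tag sequence whose weighted global feature score is highest (computed with Viterbi).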
Collins’ paper
The theory
Claim 1: the algorithm is an instance of this perceptron variant:
Claim 2: the arguments in the mistake-bound classification
results of F&S99 extend immediately to this ranking task as
well.
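
For concreteness, a minimal Python sketch of a Collins-style structured perceptron consistent with Claim 1; the feature map and brute-force decoder are illustrative stand-ins (Collins uses Viterbi and richer features):

from collections import defaultdict
from itertools import product

def phi(words, tags):
    """Global feature vector Phi(x, t): counts of (tag, word) and
    (prev_tag, tag) events. This feature map is an illustrative stand-in."""
    feats = defaultdict(float)
    prev = "<START>"
    for w, t in zip(words, tags):
        feats[("tag-word", t, w)] += 1.0
        feats[("tag-bigram", prev, t)] += 1.0
        prev = t
    return feats

def decode(words, weights, tagset):
    """Exhaustive argmax over tag sequences; a real implementation would
    use Viterbi, but brute force keeps the sketch short."""
    best, best_score = None, float("-inf")
    for tags in product(tagset, repeat=len(words)):
        s = sum(weights[k] * v for k, v in phi(words, tags).items())
        if s > best_score:
            best, best_score = tags, s
    return list(best)

def train(data, tagset, epochs=5):
    """data: list of (words, gold_tags). On each mistake the update is
    w += Phi(x, gold) - Phi(x, predicted)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = decode(words, w, tagset)
            if pred != list(gold):
                for k, v in phi(words, gold).items():
                    w[k] += v
                for k, v in phi(words, pred).items():
                    w[k] -= v
    return w

For example, train([(["the", "dog", "barks"], ["D", "N", "V"])], tagset=["D", "N", "V"]) learns weights on a single toy sentence.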
F&S99 algorithm
F&S99 result
Collins’ result
Results
• Two experiments
– POS tagging, using Adwait Ratnaparkhi’s features
– NP chunking (Start, Continue, Outside tags)
– NER on a special AT&T dataset (from another paper)
Features for NP chunking
Results
More ideas
• The dual version of a perceptron:
– w is built up by repeatedly adding examples => w is a
weighted sum of the examples x1,...,xm
– the inner product <w,x> can be rewritten:
w = \sum_{j=1}^{m} \alpha_j x_j, \quad \text{where } \alpha_j \text{ is } 0, 1, \text{ or } -1

so

\langle w, x \rangle = \Big\langle \sum_{j=1}^{m} \alpha_j x_j, \; x \Big\rangle = \sum_{j=1}^{m} \alpha_j \langle x_j, x \rangle
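A minimal sketch of this dual form for binary classification with labels ±1, assuming a plain dot-product kernel; all names are illustrative:

def dot(u, v):
    """Inner product <u, v> between two example vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_perceptron(examples, labels, epochs=5):
    """Dual perceptron: instead of storing w, store one coefficient
    alpha_j per example; w is implicitly sum_j alpha_j * x_j. After a
    single pass each alpha_j is 0, +1, or -1 (integer counts accumulate
    over further epochs)."""
    alpha = [0] * len(examples)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(examples, labels)):
            # <w, x> = sum_j alpha_j * <x_j, x>
            score = sum(a * dot(xj, x)
                        for a, xj in zip(alpha, examples) if a)
            if y * score <= 0:   # mistake: add (or subtract) example i
                alpha[i] += y
    return alpha

Because w is never formed explicitly, replacing dot with any kernel function k(x_j, x) kernelizes the algorithm.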
Dual version of perceptron ranking
α_{i,j}: i ranges over examples, j ranges over the (correct and incorrect) tag sequences for example i
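
Written out (a hedged restatement, following the dual form of Collins’ ranking perceptron):

w = \sum_{i,j} \alpha_{i,j} \big( \Phi(x_i, y_i) - \Phi(x_i, y_{i,j}) \big)

\langle w, \Phi(x, y) \rangle = \sum_{i,j} \alpha_{i,j} \big( \langle \Phi(x_i, y_i), \Phi(x, y) \rangle - \langle \Phi(x_i, y_{i,j}), \Phi(x, y) \rangle \big)

so, as before, only inner products between feature vectors are needed.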
NER features for re-ranking
MAXENT tagger output
NER features
NER results
Altun et al paper
• Starting point – dual version of Collins’
perceptron algorithm
– the final hypothesis is a weighted sum of inner
products with a subset of the examples
– this is a lot like an SVM – except that the
perceptron algorithm is used to set the weights
rather than quadratic optimization
SVM optimization
• Notation:
– y_i is the correct tag sequence for x_i
– y is an incorrect tag sequence
– F(x_i, y_i) are features
• Optimization problem:
– find weights w on the examples that maximize the minimal margin, subject to ||w|| = 1, or
– minimize ||w||^2 such that every margin >= 1
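
In symbols (a hedged writing-out, with the margin of example i against an incorrect sequence y defined via the feature difference):

\max_{w : \|w\|=1} \; \min_{i, \, y \neq y_i} \; \langle w, F(x_i, y_i) - F(x_i, y) \rangle

\text{or equivalently} \quad \min_{w} \|w\|^2 \quad \text{s.t.} \quad \langle w, F(x_i, y_i) - F(x_i, y) \rangle \geq 1 \;\; \forall i, \; \forall y \neq y_i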
SVMs for ranking
Proposition: (14) and (15) are equivalent.

Here p = F(x_i, y_i) is the feature vector of the correct sequence and n_1, n_2, ... are the feature vectors of incorrect sequences. The ranking constraints

\langle w, p - n_1 \rangle \geq 1, \qquad \langle w, p - n_2 \rangle \geq 1, \; \ldots

hold iff there is a threshold \theta (take \theta = \langle w, p \rangle - \tfrac{1}{2}) such that

\langle w, p \rangle \geq \theta + \tfrac{1}{2}, \qquad \langle w, n_1 \rangle \leq \theta - \tfrac{1}{2}, \qquad \langle w, n_2 \rangle \leq \theta - \tfrac{1}{2}, \; \ldots
SVMs for ranking
A binary classification problem – with (x_i, y_i) the positive
example and each (x_i, y') a negative example, except that θ_i varies
for each example. Why? Because we’re ranking.
SVMs for ranking
• Altun et al give the remaining details
• As in perceptron learning, “negative”
data is found by running Viterbi with the
learned weights and looking for errors
– Each mistake is a possible new support vector
– Need to iterate over the data repeatedly
– Could take exponential time before convergence
if the support vectors are dense...
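
A rough sketch of that outer loop, under stated assumptions: decode stands in for Viterbi decoding, and refit uses subgradient steps on the working set’s hinge losses as a stand-in for the QP solver the real method uses; all names here are hypothetical:

def score(weights, feats):
    """Sparse dot product <w, Phi>."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def refit(support, phi, steps=100, lr=0.1):
    """Stand-in for the QP solve: subgradient descent on the hinge
    losses of the working set (the real method solves a quadratic
    program over the alphas here)."""
    w = {}
    for _ in range(steps):
        for x, gold, wrong in support:
            g, b = phi(x, gold), phi(x, wrong)
            if score(w, g) - score(w, b) < 1.0:  # hinge still active
                for k, v in g.items():
                    w[k] = w.get(k, 0.0) + lr * v
                for k, v in b.items():
                    w[k] = w.get(k, 0.0) - lr * v
    return w

def working_set_training(data, phi, decode, rounds=10):
    """data: list of (x, gold_y). Decode with the current weights;
    every margin violation found becomes a candidate support vector,
    then the weights are re-fit on the accumulated working set."""
    weights, support = {}, []
    for _ in range(rounds):
        violations = 0
        for x, gold in data:
            pred = decode(x, weights)  # Viterbi stand-in
            margin = (score(weights, phi(x, gold))
                      - score(weights, phi(x, pred)))
            if pred != gold and margin < 1.0:
                support.append((x, gold, pred))  # possible support vector
                violations += 1
        if violations == 0:
            break  # no violated constraints left
        weights = refit(support, phi)
    return weights, support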
Altun et al results
• NER on 300 sentences from the CoNLL2002
shared task
– Spanish
– Four entity types, nine labels (beginning-T,
intermediate-T, other)
• POS tagging on 300 sentences from the Penn
TreeBank
• 5-fold CV, window of size 3, simple features
Altun et al results