CS546: Machine Learning and Natural Language
Multi-Class and Structured Prediction Problems
Slides from Taskar and Klein are
used in this lecture
1
Outline
– Multi-Class classification:
– Structured Prediction
– Models for Structured Prediction and
Classification
• Example of POS tagging
2
Multiclass problems
– Most of the machinery we talked before was
focused on binary classification problems
– e.g., SVMs we discussed so far
– However most problems we encounter in NLP
are either:
• MultiClass: e.g., text categorization
• Structured Prediction: e.g., predict syntactic structure
of a sentence
– How to deal with them?
3
Binary linear classification
4
Multiclass classification
5
Perceptron
6
Structured Perceptron
• Joint feature representation:
• Algorithm:
Perceptron
8
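A minimal sketch of the structured perceptron referenced above, assuming a joint feature representation f(x, y) returned as a sparse dict and an enumerable candidate set; the names feature_map and candidates are illustrative placeholders (in practice the argmax is computed by Viterbi or a parser):

```python
from collections import defaultdict

def score(w, feats):
    # dot product between weight vector and sparse feature dict
    return sum(w[k] * v for k, v in feats.items())

def predict(w, x, candidates, feature_map):
    # argmax over candidate outputs y' of w . f(x, y')
    return max(candidates(x), key=lambda y: score(w, feature_map(x, y)))

def train(data, candidates, feature_map, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(w, x, candidates, feature_map)
            if y_hat != y:                          # mistake-driven update
                for k, v in feature_map(x, y).items():
                    w[k] += v                       # promote gold features
                for k, v in feature_map(x, y_hat).items():
                    w[k] -= v                       # demote predicted features
    return w
```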
Binary Classification Margin
9
Generalize to MultiClass
10
Converting to MultiClass SVM
11
Max margin = Min Norm
• As before, these are equivalent formulations:
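A sketch of the standard equivalence for the separable multiclass case, written with the joint feature map f(x, y); the exact formulation on the original slide may differ:

```latex
% max-margin form (separable case)
\max_{\gamma,\ \|w\|=1} \ \gamma
\quad \text{s.t.}\quad
w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ge \gamma
\quad \forall i,\ \forall y \ne y_i

% equivalent min-norm form
\min_{w} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad
w \cdot f(x_i, y_i) - w \cdot f(x_i, y) \ge 1
\quad \forall i,\ \forall y \ne y_i
```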
12
Problems:
• Requires separability
• What if we have noise in the data?
• What if we only have a simple (limited) feature space?
13
Non-separable case
14
Non-separable case
15
Compare with MaxEnt
16
Loss Comparison
17
Multiclass -> Structured
• So far, we considered multiclass classification
• 0-1 losses l(y,y’)
• What if we want to predict instead:
• sequences of POS tags
• syntactic trees
• translations
18
Predicting word alignments
19
Predicting Syntactic Trees
20
Structured Models
21
Parsing
22
Max Margin Markov Networks (M3Ns)
Taskar et al, 2003; similar
Tsochantaridis et al, 2004
23
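One common way to write the structured max-margin objective behind M3Ns, as a sketch (the exact slide formulation may differ): Δ is a structured loss such as Hamming distance over the sequence, ξ_i are slack variables, and C trades off margin and slack:

```latex
\min_{w,\ \xi \ge 0} \ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
w \cdot f(x_i, y_i) \ \ge\ w \cdot f(x_i, y) + \Delta(y_i, y) - \xi_i
\quad \forall i,\ \forall y
```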
Max Margin Markov Networks (M3Ns)
24
Solving MultiClass with binary
learning
• MultiClass classifier
– Function f : R^d → {1,2,3,...,k}
• Decompose into binary problems
• Not always possible to learn
• Different scale
• No theoretical justification
25
Real Problem
MultiClass Classification
Learning via One-Versus-All (OvA) Assumption
• Find vr, vb, vg, vy ∈ R^n such that
– vr.x > 0 iff y = red
– vb.x > 0 iff y = blue
– vg.x > 0 iff y = green
– vy.x > 0 iff y = yellow
• H = R^kn

• Classifier f(x) = argmax vi.x
(Figure: individual classifiers and the resulting decision regions)
26
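A minimal one-vs-all sketch of the scheme above, assuming X is a NumPy array of feature vectors and y an array of labels; train_binary is an assumed placeholder for any binary linear learner (perceptron, linear SVM, ...):

```python
import numpy as np

def train_ova(X, y, labels, train_binary):
    # one binary scorer per label: +1 for that label, -1 for all others
    return {c: train_binary(X, np.where(y == c, 1, -1)) for c in labels}

def predict_ova(weights, x):
    # f(x) = argmax_c v_c . x
    return max(weights, key=lambda c: weights[c] @ x)
```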
MultiClass Classification
Learning via All-Versus-All (AvA) Assumption
• Find vrb, vrg, vry, vbg, vby, vgy ∈ R^d such that
– vrb.x > 0 if y = red, < 0 if y = blue
– vrg.x > 0 if y = red, < 0 if y = green
– ... (for all pairs)
• H = R^kkn
How to classify?
(Figure: individual pairwise classifiers and the resulting decision regions)
27
MultiClass Classification
Classifying with AvA
Options: Tree, Majority Vote (e.g., 1 red, 2 yellow, 2 green votes), Tournament
All of these are applied after learning and can behave inconsistently
28
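A minimal all-vs-all sketch with majority-vote classification, matching the pairwise setup above; train_binary is again an assumed placeholder for any binary linear learner:

```python
from collections import Counter
from itertools import combinations
import numpy as np

def train_ava(X, y, labels, train_binary):
    # one binary classifier per label pair (a, b), trained only on their examples
    clf = {}
    for a, b in combinations(labels, 2):
        mask = (y == a) | (y == b)
        clf[(a, b)] = train_binary(X[mask], np.where(y[mask] == a, 1, -1))
    return clf

def predict_ava(clf, x):
    # each pairwise classifier casts one vote; the label with most votes wins
    votes = Counter()
    for (a, b), v in clf.items():
        votes[a if v @ x > 0 else b] += 1
    return votes.most_common(1)[0][0]
```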
POS Tagging
• English tags
29
POS Tagging, examples from WSJ
From McCallum
30
POS Tagging
• Ambiguity: not a trivial task
• Useful for other tasks:
• important features for later steps are based on POS
• E.g., use POS as input to a parser
31
But still, why is POS tagging so popular?
– Historically the first statistical NLP problem
– Easy to apply arbitrary classifiers:
– both for sequence models and just
independent classifiers
– Can be regarded as a finite-state problem
– Easy to evaluate
– Annotation is cheaper to obtain than
TreeBanks (other languages)
32
HMM (reminder)
33
HMM (reminder) - transitions
34
Transition Estimates
35
Emission Estimates
36
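A minimal sketch of count-based transition and emission estimates for an HMM tagger; add-one smoothing is an illustrative choice, not necessarily the one assumed in the lecture:

```python
from collections import Counter

def estimate_hmm(tagged_sents):
    # tagged_sents: iterable of sentences, each a list of (word, tag) pairs
    trans, emit = Counter(), Counter()
    prev_count, tag_count, vocab = Counter(), Counter(), set()
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1        # tag bigram counts
            prev_count[prev] += 1
            emit[(tag, word)] += 1         # word emission counts
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
        trans[(prev, "</s>")] += 1         # sentence-final transition
        prev_count[prev] += 1

    n_tags = len(tag_count) + 1            # +1 for </s>

    def p_trans(tag, prev):                # P(tag | prev), add-one smoothed
        return (trans[(prev, tag)] + 1) / (prev_count[prev] + n_tags)

    def p_emit(word, tag):                 # P(word | tag), add-one smoothed
        return (emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab) + 1)

    return p_trans, p_emit
```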
MaxEnt (reminder)
37
Decoding: HMM vs MaxEnt
38
Accuracies overview
39
Accuracies overview
40
SVMs for tagging
– We can use SVMs in a similar way as MaxEnt
(or other classifiers)
– We can use a window around the word
– 97.16 % on WSJ
41
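A minimal sketch of window-based features for a classifier tagger like the one referenced above; the exact templates (window size, affixes, previous tag) are illustrative and not the published feature set:

```python
def window_features(words, i, prev_tags, size=2):
    # features for tagging words[i]: a +/- size word window plus simple shape cues
    feats = {"bias": 1.0}
    for d in range(-size, size + 1):
        j = i + d
        w = words[j] if 0 <= j < len(words) else "<pad>"
        feats[f"w[{d}]={w.lower()}"] = 1.0
    feats[f"suffix3={words[i][-3:].lower()}"] = 1.0
    feats[f"prefix2={words[i][:2].lower()}"] = 1.0
    feats[f"is_cap={words[i][0].isupper()}"] = 1.0
    if prev_tags:                           # left-context tags already predicted
        feats[f"t[-1]={prev_tags[-1]}"] = 1.0
    return feats
```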
SVMs for tagging
from Jimenez & Marquez
42
No sequence modeling
43
CRFs and other global models
44
CRFs and other global models
45
Compare (figures: graphical models over words W and tags T)
– HMMs
– MEMMs - Note: after each step t the remaining probability
mass cannot be reduced; it can only be redistributed among
the possible state transitions
– CRFs - no local normalization
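To make the note concrete, a sketch of the two normalizations for a tag sequence t_1..t_n over words w_1..w_n, with weights θ and local features f (this notation is assumed here, not taken from the slide):

```latex
% MEMM: locally normalized at every position, so the outgoing probability mass
% at each step always sums to 1 (the root of the label bias problem)
P_{\mathrm{MEMM}}(t_{1:n} \mid w_{1:n}) \;=\; \prod_{i=1}^{n}
  \frac{\exp\!\big(\theta \cdot f(t_i, t_{i-1}, w_{1:n}, i)\big)}
       {\sum_{t'} \exp\!\big(\theta \cdot f(t', t_{i-1}, w_{1:n}, i)\big)}

% CRF: a single global normalization over all tag sequences
P_{\mathrm{CRF}}(t_{1:n} \mid w_{1:n}) \;=\;
  \frac{\exp\!\big(\sum_{i} \theta \cdot f(t_i, t_{i-1}, w_{1:n}, i)\big)}
       {\sum_{t'_{1:n}} \exp\!\big(\sum_{i} \theta \cdot f(t'_i, t'_{i-1}, w_{1:n}, i)\big)}
```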
Label Bias
based on a slide from Joe Drish
47
Label Bias
• Recall transition-based parsing: Nivre's algorithm (with
beam search)
• At each step we can observe only local features (limited
look-ahead)
• If we later see that the following word is impossible, we
can only distribute the probability uniformly across all (im)possible decisions
• If there are only a few such decisions, we cannot
decrease the probability dramatically
• So, label bias is likely to be a serious problem if:
• there are non-local dependencies
• states have a small number of possible outgoing
transitions
48
POS Tagging Experiments
– “+” is an extended feature set (hard to
integrate in a generative model)
– oov – out-of-vocabulary
49
Supervision
– So far we have considered the supervised case
– Training set is labeled
– However, we can try to induce word classes
without supervision
– Unsupervised tagging
– We will later discuss the EM algorithm
– It can also be done in a partly supervised way:
– Seed tags
– Small labeled dataset
– Parallel corpus
– ....
50
Why not predict POS + parse trees
simultaneously?
– It is possible and often done this way
– Doing tagging internally often benefits
parsing accuracy
– Unfortunately, parsing models are less robust
than taggers
– e.g., non-grammatical sentences, different
domains
– It is more expensive and does not help...
51
Questions
• Why is there no label-bias problem for a
generative model (e.g., an HMM)?
• How would you integrate word features into a
generative model (e.g., HMMs for POS tagging)?
• e.g., if word has:
• -ing, -s, -ed, -d, -ment, ...
• post-, de-,...
52
“CRFs” for more complex structured
output problems
• We considered sequence labeling problems
• Here, the structure of dependencies is fixed
• What if we do not know the structure but would
like to have interactions respecting the structure?
53
“CRFs” for more complex structured
output problems
• Recall, we had the MST algorithm (McDonald and
Pereira, 05)
54
“CRFs” for more complex structured
output problems
• Complex inference
• E.g., arbitrary 2nd-order dependency parsing
models are not tractable in the non-projective case;
NP-complete (McDonald & Pereira, EACL 06)
• Recently conditional models for constituent
parsing:
• (Finkel et al, ACL 08)
• (Carreras et al, CoNLL 08)
• ...
55
Back to MultiClass
– Let us review how to decompose a multiclass
problem into binary classification problems
56
Summary
• Margin-based method for multiclass classification
and structured prediction
• CRFs vs HMMs vs MEMMs for POS tagging
57
Conclusions
• All approaches use linear representation
• The differences are
– Features
– How to learn weights
– Training Paradigms:
• Global Training (CRF, Global Perceptron)
• Modular Training (PMM, MEMM, ...)
– These approaches are easier to train, but may require
additional mechanisms to enforce global constraints.
58