Global Inference in Learning for
Natural Language Processing
Vasin Punyakanok
Department of Computer Science
University of Illinois at Urbana-Champaign
Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak
Story Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in
England. He is the same person that you read about in the book, Winnie the
Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When
Chris was three years old, his father wrote a poem about him. The poem was
printed in a magazine for others to read. Mr. Robin then wrote a book. He
made up a fairy tale land where Chris lived. His friends were animals. There
was a bear called Winnie the Pooh. There was also an owl and a young pig,
called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin
made them come to life with his words. The places in the story were all near
Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to
read about Christopher Robin and his animal friends. Most people don't know
he is a real person who is grown now. He has written two books of his own.
They tell what it is like to be famous.
1. Who is Christopher Robin?
2. What did Mr. Robin do when Chris was three years old?
3. When was Winnie the Pooh written?
4. Why did Chris write two books of his own?
Page 2
Stand Alone Ambiguity Resolution
- Context Sensitive Spelling Correction
  Illinois' bored of education → board
- Word Sense Disambiguation
  ...Nissan car and truck plant is ...
  ...divide life into plant and animal kingdom
- Part of Speech Tagging
  (This DT) (can N) (will MD) (rust V)
  each word is ambiguous among the tags DT, MD, V, N
- Coreference Resolution
  The dog bit the kid. He was taken to a hospital.
  The dog bit the kid. He was taken to a veterinarian.
Page 3
Textual Entailment
  Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
  ⇒ Yahoo acquired Overture.

Question Answering
- Who acquired Overture?
Page 4
Inference and Learning
- Global decisions in which several local decisions play a role, with mutual dependencies among their outcomes.
- Learned classifiers for different sub-problems.
- Incorporate the classifiers' information, along with constraints, in making coherent decisions: decisions that respect the local classifiers as well as domain- and context-specific constraints.
- Global inference for the best assignment to all variables of interest.
Page 5
Text Chunking
[Figure: the sentence x = "The guy standing there is so tall" chunked into phrases y labeled NP, VP, ADVP, ADJP]
Page 6
Full Parsing
[Figure: the full parse tree y for x = "The guy standing there is so tall", with nonterminals S, NP, VP, ADVP, ADJP]
Page 7
Outline
- Semantic Role Labeling Problem
  - Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs Local Training
  - Utility of Constraints in the Inference
- Conclusion
Page 8
Semantic Role Labeling
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
Page 9
Semantic Role Labeling
- PropBank [Palmer et al. '05] provides a large human-annotated corpus of semantic verb-argument relations.
  - It adds a layer of generic semantic labels to Penn Treebank II.
  - (Almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA
  - different semantics for each verb
  - specified in the PropBank frame files
- 13 types of adjuncts labeled as AM-arg
  - where arg specifies the adjunct type
Page 10
Semantic Role Labeling
Page 11
The Approach
I left my nice pearls to her
- Pruning
  - Use heuristics to reduce the number of candidates (modified from [Xue & Palmer '04])
- Argument Identification
  - Use a binary classifier to identify arguments
- Argument Classification
  - Use a multiclass classifier to classify arguments
- Inference
  - Infer the final output satisfying the linguistic and structural constraints
Page 12
Learning
- Both the argument identifier and the argument classifier are trained as phrase-based classifiers.
- Features (some examples):
  - voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse]
- Learning algorithm: SNoW
  - Sparse network of linear functions
  - Weights learned by a regularized Winnow multiplicative update rule, with averaged weight vectors
  - Probability conversion is done via softmax: p_i = exp{act_i} / Σ_j exp{act_j} (sketched in code below)
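A minimal sketch of that softmax conversion in Python; the activations below are made-up stand-ins for SNoW's raw outputs:

    import math

    def softmax(activations):
        # Shift by the max activation for numerical stability,
        # then normalize: p_i = exp(act_i) / sum_j exp(act_j).
        m = max(activations)
        exps = [math.exp(a - m) for a in activations]
        z = sum(exps)
        return [e / z for e in exps]

    # e.g. raw activations for the candidate types of one phrase
    probs = softmax([2.0, 0.5, -1.0])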
Page 13
Inference
- The output of the argument classifier often violates some constraints, especially when the sentence is long.
- Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
- Input:
  - The probability estimates (from the argument classifier)
  - Structural and linguistic constraints
- Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 14
Integer Linear Programming Inference
- For each argument a_i and type t (including null), set up a Boolean variable a_{i,t} indicating whether a_i is classified as t.
- Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t} subject to the (linear) constraints.
  - Any Boolean constraint can be encoded this way.
- If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments while satisfying the constraints.
A hedged sketch of this program follows.
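A minimal sketch of the formulation using the PuLP ILP library; the candidates, label set, and scores below are hypothetical illustrations, not the system's actual values:

    import pulp

    # Hypothetical candidate arguments, label types, and classifier scores.
    args = ["a1", "a2", "a3"]
    types = ["A0", "A1", "A2", "null"]
    score = {("a1", "A0"): 0.7, ("a1", "A1"): 0.2, ("a1", "A2"): 0.05, ("a1", "null"): 0.05,
             ("a2", "A0"): 0.1, ("a2", "A1"): 0.6, ("a2", "A2"): 0.2,  ("a2", "null"): 0.1,
             ("a3", "A0"): 0.3, ("a3", "A1"): 0.3, ("a3", "A2"): 0.2,  ("a3", "null"): 0.2}

    prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)
    x = {(a, t): pulp.LpVariable(f"x_{a}_{t}", cat="Binary")
         for a in args for t in types}

    # Objective: the expected number of correctly labeled arguments.
    prob += pulp.lpSum(score[a, t] * x[a, t] for a in args for t in types)

    # Each candidate receives exactly one label (possibly null).
    for a in args:
        prob += pulp.lpSum(x[a, t] for t in types) == 1

    # One example constraint: no duplicate core argument classes.
    for t in ["A0", "A1", "A2"]:
        prob += pulp.lpSum(x[a, t] for a in args) <= 1

    prob.solve()
    print({a: t for a in args for t in types if x[a, t].value() == 1})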
Page 15
Linear Constraints
- No overlapping or embedding arguments: if a_i and a_j overlap or embed, then a_{i,null} + a_{j,null} ≥ 1 (see the sketch below).
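Continuing the hypothetical PuLP sketch from the previous slide, this rule becomes one inequality per conflicting pair; overlapping_pairs would be computed from the candidates' spans:

    # If two candidates overlap or embed, at least one must be labeled null.
    overlapping_pairs = [("a1", "a2")]  # hypothetical pair
    for ai, aj in overlapping_pairs:
        prob += x[ai, "null"] + x[aj, "null"] >= 1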
Page 16
Constraints
- No overlapping or embedding arguments
- No duplicate argument classes for A0-A5
- Exactly one V argument per predicate
- If there is a C-V, there must be a V-A1-C-V pattern
- If there is an R-arg, there must be an arg somewhere (linearized in the sketch below)
- If there is a C-arg, there must be an arg somewhere before it
- Each predicate can take only core arguments that appear in its frame file.
  - More specifically, we check only the minimum and maximum ids.
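As a sketch of how one of these Boolean rules is linearized, again extending the hypothetical PuLP example and assuming the label set types (and hence x) has been extended with the reference label "R-A0":

    # "If there is an R-arg, there must be an arg somewhere":
    # each R-A0 indicator is bounded by the total number of A0 labels.
    for a in args:
        prob += x[a, "R-A0"] <= pulp.lpSum(x[b, "A0"] for b in args)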
Page 17
SRL Results (CoNLL-2005)
- Training: sections 02-21
- Development: section 24
- Test WSJ: section 23
- Test Brown: from the Brown corpus (very small)

             Parser    Precision  Recall  F1
Development  Collins   73.89      70.11   71.95
             Charniak  75.40      74.13   74.76
Test WSJ     Collins   77.09      72.00   74.46
             Charniak  78.10      76.15   77.11
Test Brown   Collins   68.03      63.34   65.60
             Charniak  67.15      63.57   65.31
Page 18
Inference with Multiple SRL Systems
- Goal: maximize Σ_{i,t} score(a_i = t) · a_{i,t} subject to the (linear) constraints.
  - Any Boolean constraint can be encoded this way.
- score(a_i = t) = Σ_k f_k(a_i = t)
- If system k has no opinion on a_i, use a prior instead (a minimal sketch follows).
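A minimal sketch of that score combination; system_scores is a hypothetical list of per-system score dictionaries keyed by (candidate, type):

    def combined_score(a, t, system_scores, prior=0.05):
        # Sum each system's score for labeling candidate a with type t;
        # if a system has no opinion on a, fall back to the prior.
        return sum(s.get((a, t), prior) for s in system_scores)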
Page 19
Results with Multiple Systems (CoNLL-2005)
[Chart: F1 (60-90 scale) of the individual systems Col, Char, Char-2, Char-3, Char-4, Char-5, and the combined system, on Devel, WSJ, and Brown]
Combined system F1: Devel 77.35, WSJ 79.44, Brown 67.75.
Page 20
Outline
- Semantic Role Labeling Problem
  - Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
  - Global vs Local Training
  - Utility of Constraints in the Inference
- Conclusion
Page 21
Learning and Inference
Two training/testing regimes:
- L+I (Learning plus Inference): training without constraints; testing: inference with constraints.
- IBT (Inference-based Training): learning the components together!
[Figure: local classifiers f1(x) … f5(x) mapping inputs x1 … x7 (X) to outputs y1 … y5 (Y)]
Which one is better? When and why?
Page 22
Comparisons of Learning Approaches
- Coupling (IBT)
  - Optimizes the true global objective function (this should be better in the limit)
- Decoupling (L+I)
  - More efficient
  - Reusability of classifiers
  - Modularity in training
  - No global examples required
Page 23
Claims
- When the local classification problems are "easy", L+I outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
- We will show experimental results and theoretical intuition to support these claims.
Page 24
Perceptron-based Global Learning
[Figure: local predictions from f1(x) … f5(x) over inputs x1 … x7; applying the constraints yields a global output Y' (e.g. (-1, 1, 1, -1, 1)) that is compared against the true global labeling Y (e.g. (-1, 1, -1, -1, 1))]
A rough sketch of this training regime follows.
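A rough Python sketch of this style of global learning as a structured perceptron; infer (the constrained global argmax) and Phi (a joint feature map returning a numeric vector, e.g. a NumPy array) are hypothetical stand-ins for the actual system:

    def ibt_perceptron(data, w, Phi, infer, epochs=10):
        # Inference-based training: predict the whole structure under the
        # constraints, and update only when the global labeling is wrong.
        for _ in range(epochs):
            for x, y in data:
                y_hat = infer(w, x)  # argmax over the constrained space C(Y)
                if y_hat != y:
                    # Promote the true labeling, demote the prediction.
                    w = w + Phi(x, y) - Phi(x, y_hat)
        return w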
Page 25
Simulation
- There are 5 local binary linear classifiers.
- The global classifier is also linear:
  h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- Constraints are randomly generated.
- The hypothesis is linearly separable at the global level, given that the constraints are known.
- The separability level at the local level is varied.
(A brute-force version of h is sketched below.)
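A brute-force sketch of h for a simulation this small (feasible only for a handful of classifiers); fs is a list of local scoring functions and constraint implements membership in C(Y), both hypothetical:

    from itertools import product

    def h(x, fs, constraint):
        # Enumerate every global labeling y, keep those satisfying the
        # constraints, and return the one with the highest summed local score.
        best, best_score = None, float("-inf")
        for y in product([-1, +1], repeat=len(fs)):
            if not constraint(y):
                continue
            s = sum(f(x, yi) for f, yi in zip(fs, y))
            if s > best_score:
                best, best_score = y, s
        return best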
Page 26
Bound Prediction
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
- Local: err ≤ err_opt + ((d log m + log 1/δ) / m)^{1/2}
- Global: err ≤ 0 + ((cd log m + c²d + log 1/δ) / m)^{1/2}
[Plot: the bounds and the simulated data; curves for err_opt = 0, 0.1, and 0.2]
Page 27
Relative Merits: SRL
[Plot: performance of L+I vs. IBT against the difficulty of the learning problem (# features), from easy to hard]
L+I is better. When the problem is artificially made harder, the tradeoff is clearer.
Page 28
Summary
- When the local classification problems are "easy", L+I outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a large enough number of training examples.
- Why does inference help at all?
Page 29
About Constraints
- We always assume that global coherency is good.
- Constraints do help in real-world applications.
- Performance is usually measured at the level of local predictions.
- Depending on the performance metric, constraints can hurt.
Page 30
Results: Contribution of Expressive Constraints [Roth & Yih '05]
- Basic: learning with statistical constraints only; additional constraints added at evaluation time (for efficiency).

F1               CRF-ML  diff   CRF-D  diff
basic (Viterbi)  66.46          69.14
+ no dup         67.10   +0.64  69.74  +0.60
+ cand           71.78   +4.68  73.64  +3.90
+ argument       71.71   -0.07  73.71  +0.07
+ verb pos       71.72   +0.01  73.78  +0.07
+ disallow       71.94   +0.22  73.91  +0.13
Page 31
Assumptions
- y = ⟨y_1, …, y_l⟩
- Non-interactive classifiers f_i(x, y_i): each classifier does not use the outputs of other classifiers as inputs.
- Inference is linear summation:
  - h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
  - h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- C(Y) always contains the correct outputs.
- No assumption on the structure of the constraints.
Page 32
Performance Metrics
- Zero-one loss
  - Mistakes are counted globally: y is wrong if any y_i is wrong.
- Hamming loss
  - Mistakes are counted locally.
(Both are sketched in code below.)
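A plain-Python rendering of the two metrics; y and y_hat are equal-length global labelings:

    def zero_one_loss(y, y_hat):
        # 1 if the global output is wrong anywhere, 0 otherwise.
        return int(any(a != b for a, b in zip(y, y_hat)))

    def hamming_loss(y, y_hat):
        # The number of local mistakes.
        return sum(a != b for a, b in zip(y, y_hat))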
Page 33
Zero-One Loss
- Constraints cannot hurt: C(Y) always contains the correct output, so an unconstrained prediction that is correct also maximizes the score over C(Y); constraints never break a correct global output.
- This is not true for Hamming loss.
Page 34
Boolean Cube
[Figure: the Boolean cube of 4-bit binary outputs, from 0000 (0 mistakes) to 1111 (4 mistakes), arranged by Hamming loss on one axis and score on the other]
Page 35
Hamming Loss
[Figure: the same Boolean cube of 4-bit outputs, with the unconstrained prediction h_un marked; Hamming loss vs. score]
Page 36
Best Classifiers
[Figure: the Boolean cube for the best classifiers, annotated with the score distances between labelings (e.g. δ1, δ2, δ1 + δ2); Hamming loss vs. score]
Page 37
When Constraints Cannot Hurt
- δ_i: the distance between the correct label and the 2nd-best label
- ε_i: the distance between the predicted label and the correct label
- F_correct = { i | f_i is correct }
- F_wrong = { i | f_i is wrong }
- Constraints cannot hurt if ∀ i ∈ F_correct : δ_i > Σ_{j ∈ F_wrong} ε_j
(A direct check of this condition is sketched below.)
Page 38
An Empirical Investigation
- SRL system
- CoNLL-2005 WSJ test set

Local accuracy: 82.38% without constraints, 84.08% with constraints.
Page 39
An Empirical Investigation

                        Number  Percentage
Total Predictions       19822   100.00
Correct Predictions     16330   82.38
Incorrect Predictions   3492    17.62
Violate the condition   1833    9.25
Page 40
Good Classifiers
[Figure: the Boolean cube for good classifiers; labelings with high Hamming loss receive low scores]
Page 41
Bad Classifiers
[Figure: the Boolean cube for bad classifiers; labelings with high Hamming loss can still receive high scores]
Page 42
Average Distance vs. Gain in Hamming Loss
[Plot: average distance against the gain in Hamming loss]
- Good: high loss → low score (low gain)
Page 43
Good Classifiers
[Figure: the Boolean cube for good classifiers; labelings with high Hamming loss receive low scores]
Page 44
Bad Classifiers
[Figure: the Boolean cube for bad classifiers; labelings with high Hamming loss can still receive high scores]
Page 45
Average Gain in Hamming Loss vs. Distance
[Plot: average gain in Hamming loss against distance]
- Good: high score → low loss (high gain)
Page 46
Utility of Constraints
- Constraints improve the performance because the classifiers are good.
- Good classifiers:
  - When the classifier is correct, there is a large margin between the correct label and the 2nd-best label.
  - When the classifier is wrong, the correct label is not far from the predicted one.
Page 47
Conclusions
- Showed how global inference can be used
  - Semantic Role Labeling
- Tradeoffs between coupling vs. decoupling learning and inference
- Investigation of the utility of constraints
- The analyses are very preliminary; future directions include:
  - Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference
  - Better understanding of using constraints
    - More interactive classifiers
    - Different performance metrics, e.g. F1
    - Relation with margin
Page 48