Parsing German
with Latent Variable Grammars
Slav Petrov and Dan Klein
UC Berkeley
The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
  - Parent annotation [Johnson ’98] (see the sketch below)
  - Head lexicalization [Collins ’99, Charniak ’00]
  - Automatic clustering?
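To make the first bullet concrete, here is a minimal sketch of parent annotation on a toy tree; the Node class, the "^" separator, and the recursive helper are hypothetical illustrations, not part of the original slides or of any particular parser.

```python
# Hypothetical toy illustration of parent annotation: every label is refined
# with its parent's (original) label, so an NP under S becomes NP^S while an
# NP under VP becomes NP^VP, letting the grammar distinguish the two contexts.
class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def parent_annotate(node, parent_label="ROOT"):
    original = node.label
    node.label = f"{original}^{parent_label}"
    for child in node.children:
        parent_annotate(child, original)

tree = Node("S", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
parent_annotate(tree)   # subject NP -> "NP^S", object NP -> "NP^VP"
```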
Previous Work: Manual Annotation [Klein & Manning ’03]
- Manually split categories
  - NP: subject vs. object
  - DT: determiners vs. demonstratives
  - IN: sentential vs. prepositional
- Advantages:
  - Fairly compact grammar
  - Linguistic motivations
- Disadvantages:
  - Performance leveled out
  - Manually annotated

  Model                    | F1
  Naïve Treebank Grammar   | 72.6
  Klein & Manning ’03      | 86.3
Previous Work: Automatic Annotation Induction [Matsuzaki et al. ’05, Prescher ’05]
- Label all nodes with latent variables; same number k of subcategories for all categories
- Advantages:
  - Automatically learned
- Disadvantages:
  - Grammar gets too large
  - Most categories are oversplit while others are undersplit

  Model                    | F1
  Klein & Manning ’03      | 86.3
  Matsuzaki et al. ’05     | 86.7
Overview
- Learning [Petrov, Barrett, Thibaux & Klein in ACL’06]:
  - Hierarchical Training
  - Adaptive Splitting
  - Parameter Smoothing
- Inference [Petrov & Klein in NAACL’07]:
  - Coarse-To-Fine Decoding
  - Variational Approximation
- German Analysis
Learning Latent Annotations
EM algorithm:
- Brackets are known
- Base categories are known
- Only induce subcategories
Just like Forward-Backward for HMMs.
[Figure: fixed parse tree over "He was right", with latent subcategory variables X1–X7 at its nodes and forward/backward passes over the tree]
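Because the brackets and base categories are fixed, the E-step is just two passes over each training tree, directly analogous to forward-backward for an HMM. The sketch below is a minimal illustration under assumed data structures (the Node class and the rule_probs / lex_probs dictionaries are hypothetical, not the Berkeley parser's actual code): an inside (upward) pass scores each node's subtree per subcategory, an outside (downward) pass scores the rest of the tree, and their product gives the subcategory posteriors used to accumulate expected rule counts.

```python
import numpy as np

K = 2  # number of latent subcategories per treebank category (illustration only)

class Node:
    """Hypothetical tree node: either a binary constituent or a word."""
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word

# Assumed parameter tables:
#   rule_probs[(A, B, C)]: (K, K, K) array, P(A_x -> B_y C_z | A_x)
#   lex_probs[(A, w)]:     (K,) array,      P(A_x -> w | A_x)

def inside(node, rule_probs, lex_probs):
    """Upward pass: probability of the subtree under each subcategory of node."""
    if node.word is not None:
        node.in_score = lex_probs[(node.label, node.word)].copy()
    else:
        left, right = node.children
        inside(left, rule_probs, lex_probs)
        inside(right, rule_probs, lex_probs)
        R = rule_probs[(node.label, left.label, right.label)]
        # sum over the child subcategories y and z
        node.in_score = np.einsum('xyz,y,z->x', R, left.in_score, right.in_score)
    return node.in_score

def outside(node, rule_probs, out_score=None):
    """Downward pass: probability of everything outside the subtree."""
    node.out_score = np.ones(K) if out_score is None else out_score
    if node.word is None:
        left, right = node.children
        R = rule_probs[(node.label, left.label, right.label)]
        outside(left,  rule_probs, np.einsum('xyz,x,z->y', R, node.out_score, right.in_score))
        outside(right, rule_probs, np.einsum('xyz,x,y->z', R, node.out_score, left.in_score))

# The E-step statistics for any node are proportional to
# node.in_score * node.out_score.
```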
Starting Point
[Chart: parsing accuracy (F1, roughly 60–90) vs. total number of grammar symbols (roughly 50–1650) for k = 1, 2, 4, 8 and 16 subcategories per category; accuracy grows with k, while grammar size approaches the limit of computational resources]
Refinement of the DT tag
[Figure: DT split directly into DT-1, DT-2, DT-3, DT-4]
Hierarchical Refinement of the DT tag
[Figure: DT split in two, with each subcategory split again in the next round, reaching DT-1 through DT-4 via a binary hierarchy]
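A minimal sketch of one such split round under an assumed parameter layout (the rule_probs dictionary, the restriction to binary rules, and the 1% noise level are illustrative assumptions): every subcategory is split in two, the children inherit the parent's rule probabilities, and a little randomness is added to break the symmetry before EM is rerun.

```python
import numpy as np

def split_in_two(rule_probs, noise=0.01, rng=np.random.default_rng(0)):
    """rule_probs[(A, B, C)]: (kA, kB, kC) array P(A_x -> B_y C_z | A_x).
    Lexical rules are omitted for brevity."""
    new = {}
    for (A, B, C), P in rule_probs.items():
        # duplicate every subcategory of A, B and C; each (y, z) pair's mass is
        # shared among its four new pairs, so the distribution is preserved
        Q = np.repeat(np.repeat(np.repeat(P, 2, axis=0), 2, axis=1), 2, axis=2) / 4.0
        Q *= 1.0 + noise * rng.uniform(-1.0, 1.0, Q.shape)  # break symmetry
        new[(A, B, C)] = Q
    # renormalize so that, for each parent subcategory A_x, the probabilities
    # of all rules rewriting A_x sum to 1 again after the noise
    totals = {}
    for (A, B, C), Q in new.items():
        totals[A] = totals.get(A, 0.0) + Q.sum(axis=(1, 2))
    for (A, B, C), Q in new.items():
        new[(A, B, C)] = Q / totals[A][:, None, None]
    return new
```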
Hierarchical Estimation Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols, comparing flat (baseline) and hierarchical training]

  Model                  | F1
  Baseline               | 87.3
  Hierarchical Training  | 88.4
Refinement of the , tag
- Splitting all categories the same amount is wasteful:
The DT tag revisited
- Oversplit?
Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, roll back splits which were least useful
Adaptive Splitting
- Evaluate the loss in likelihood from removing each split:
  loss(split) = (data likelihood with split reversed) / (data likelihood with split)
- No loss in accuracy when 50% of the splits are reversed (see the sketch below).
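A minimal sketch of the merge step under an assumed interface (the function name, the split identifiers, and the way the merged-grammar likelihoods are supplied are all hypothetical): rank every split by how little the data likelihood drops when it is reversed, and roll back the least useful half.

```python
def select_splits_to_reverse(splits, loglik_with_split, loglik_split_reversed,
                             fraction=0.5):
    """
    splits:                list of split identifiers, e.g. ("DT", 3)
    loglik_with_split:     log data likelihood of the current, fully split grammar
    loglik_split_reversed: dict mapping a split to the (approximate) log data
                           likelihood of the grammar with that split merged back
    Returns the `fraction` of splits whose removal costs the least likelihood.
    """
    loss = {s: loglik_with_split - loglik_split_reversed[s] for s in splits}
    ranked = sorted(splits, key=lambda s: loss[s])
    return ranked[: int(fraction * len(splits))]
```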
Adaptive Splitting Results
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, and 50% merging]

  Model             | F1
  Previous          | 88.4
  With 50% Merging  | 89.5
Number of Phrasal Subcategories
[Bar chart: number of subcategories (0–35) learned for each phrasal category: VP, NP, PP, AP, S, CNP, AVP, PN, CAP, CS, CVP, VZ, CCP, NM, CPP, MTA, CVZ, AA, ISU, VROOT, CAVP, CAC, CH, CO, DL, ROOT; frequent categories such as VP and NP receive the most subcategories]
Number of Lexical Subcategories
[Bar chart: number of subcategories (0–35) learned for each part-of-speech tag: NE, VVFIN, ADJA, NN, ADV, ADJD, VVPP, APPR, VVINF, CARD, ART, PIS, PIAT, PPER, KON, $[, PROAV, VAFIN, PDS, APPRART, PPOSAT, $., PDAT, PRELS, PTKVZ, VVIZU, VAINF, KOUS, VMFIN, FM, VAPP, KOKOM, PWAV, PWS, KOUI, TRUNC, XY, PTKZU, PWAT, VVIMP, NNE, PRELAT, PTKNEG, APZR]
Smoothing
- Heavy splitting can lead to overfitting
- Idea: smoothing allows us to pool statistics
Linear Smoothing
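A minimal sketch of linear smoothing under an assumed parameter layout (the array shape and the value of alpha are only illustrative): each subcategory's rule distribution is interpolated with the average over all subcategories of the same original category, which pools statistics across the splits.

```python
import numpy as np

def linear_smooth(P, alpha=0.01):
    """P: (kA, kB, kC) array P(A_x -> B_y C_z | A_x).  Each subcategory's
    distribution is mixed with the mean over all of A's subcategories, so
    rarely seen subcategories share statistics with their siblings."""
    pooled = P.mean(axis=0, keepdims=True)       # shared distribution over (B_y, C_z)
    return (1.0 - alpha) * P + alpha * pooled    # rows remain normalized
```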
Result Overview
[Chart: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, 50% merging, and 50% merging with smoothing]

  Model           | F1
  Previous        | 89.5
  With Smoothing  | 90.7
Coarse-to-Fine Parsing [Goodman ’97, Charniak & Johnson ’05]
[Figure: from the treebank, a coarse grammar over bare symbols (NP, VP, ...) prunes the chart before parsing with a refined grammar, whose symbols are either lexicalized (NP-apple, NP-dog, NP-cat, NP-eat, VP-run, ...) or latent-variable refinements (NP-1, NP-17, NP-12, VP-6, VP-31, ...)]
Hierarchical Pruning
Consider the span 5 to 12:
  coarse:          ... QP  NP  VP ...
  split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
  split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
  split in eight:  ... QP1 ... QP8 NP1 ... NP8 VP1 ... VP8 ...
Symbols pruned at one level are never refined at the next (see the sketch below).
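A minimal sketch of one pruning stage under assumed data structures (the dictionaries, the threshold value, and the function name are hypothetical): for a given span, only the refinements of symbols whose posterior in the current pass clears a threshold are allowed in the next, finer pass.

```python
THRESHOLD = 1e-4   # example value only, not the actual tuned threshold

def allowed_next_level(posteriors, refinements, threshold=THRESHOLD):
    """
    posteriors:  dict mapping each symbol kept in this chart cell at the current
                 level (e.g. "NP1") to its posterior probability of spanning the
                 cell, computed in the current pass
    refinements: dict mapping each current-level symbol to its refined children
                 in the next grammar, e.g. {"NP1": ["NP1", "NP2"]}
    Returns the set of symbols allowed in this cell during the next pass.
    """
    allowed = set()
    for symbol, posterior in posteriors.items():
        if posterior > threshold:
            allowed.update(refinements[symbol])
    return allowed
```

For the span 5 to 12 above, if QP is pruned in the coarse pass, none of QP1 ... QP8 is ever built in the finer passes.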
Intermediate Grammars
Learning produces a sequence of increasingly refined grammars: X-Bar = G0 → G1 → G2 → G3 → G4 → G5 → G6 = G
[Figure: the DT tag along the sequence, e.g. DT in G0, DT1 DT2 in G1, DT1 DT2 DT3 DT4 in G2, ..., DT1 through DT8 at a later stage]
State Drift (DT tag)
[Figure: across EM rounds the words most strongly associated with each DT subcategory drift; the examples shown include "the", "that", "this", "That", "This", "some" and "these"]
Projected Grammars
[Figure: the learning sequence X-Bar = G0 → G1 → G2 → G3 → G4 → G5 → G6 = G, with each intermediate grammar replaced for pruning by a projection πi(G) of the final grammar G: π0(G), π1(G), π2(G), π3(G), π4(G), π5(G)]
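A minimal sketch of such a projection under an assumed parameter layout (the dictionaries and the way the subcategory weights are supplied are hypothetical; in the talk's setting the weights come from expected counts under the grammar itself): refined rules are collapsed onto their coarse symbols, with each parent subcategory weighted by how often it is expected to occur.

```python
import numpy as np

def project(rule_probs, subcat_weights):
    """
    rule_probs:     {(A, B, C): (kA, kB, kC) array P(A_x -> B_y C_z | A_x)}
    subcat_weights: {A: (kA,) array of estimated frequencies of the A_x}
    Returns coarse rule probabilities {(A, B, C): P(A -> B C | A)}.
    """
    coarse = {}
    for (A, B, C), P in rule_probs.items():
        w = subcat_weights[A] / subcat_weights[A].sum()
        # sum out the child subcategories, then average the parent subcategories
        coarse[(A, B, C)] = float(np.dot(w, P.sum(axis=(1, 2))))
    return coarse
```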
Bracket Posteriors (after G0)
Bracket Posteriors (after G1)
Bracket Posteriors (Final Chart)
Bracket Posteriors (Movie)
Bracket Posteriors (Best Tree)
Parse Selection
[Figure: a single unsplit parse corresponds to many derivations over the split symbols (subcategories -1, -2, ... at each node)]
Computing the most likely unsplit tree is NP-hard (the sketch below illustrates the sum-versus-max distinction):
- Settle for the best derivation.
- Rerank an n-best list.
- Use an alternative objective function / variational approximation.
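A minimal sketch of why this matters, reusing the hypothetical Node / rule_probs / lex_probs representation from the EM sketch earlier: the probability of an unsplit tree is the sum over all subcategory assignments (derivations), while Viterbi search only maximizes over them, and the two scores can rank trees differently.

```python
import numpy as np

def tree_scores(node, rule_probs, lex_probs):
    """Return (sum_score, max_score), both per subcategory of `node`:
    sum_score marginalizes over descendant subcategories (parse probability),
    max_score keeps only the single best derivation below this node."""
    if node.word is not None:
        p = lex_probs[(node.label, node.word)]
        return p, p
    left, right = node.children
    left_sum, left_max = tree_scores(left, rule_probs, lex_probs)
    right_sum, right_max = tree_scores(right, rule_probs, lex_probs)
    R = rule_probs[(node.label, left.label, right.label)]
    sum_score = np.einsum('xyz,y,z->x', R, left_sum, right_sum)
    max_score = (R * left_max[None, :, None] * right_max[None, None, :]).max(axis=(1, 2))
    return sum_score, max_score
```

Exact maximization of the summed score over all unsplit trees is what is NP-hard; the derivation score, by contrast, factorizes and can be maximized with ordinary Viterbi parsing.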
Efficiency Results
- Berkeley Parser: 15 min (implemented in Java)
- Charniak & Johnson ’05 Parser: 19 min (implemented in C)
Accuracy Results
all
F1
ENG
Charniak&Johnson ‘05 (generative)
90.1
89.6
This Work
90.6
90.1
GER
Dubey ‘05
76.3
-
This Work
80.8
80.1
Chiang et al. ‘02
80.0
76.6
This Work
86.3
83.4
CHN
≤ 40 words
F1
Parsing German Shared Task
- Two Pass Parsing
  - Determine constituency structure (F1: 85/94)
  - Assign grammatical functions
- One Pass Approach
  - Treat categories + grammatical functions as labels (see the sketch below)
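A tiny sketch of the one-pass label encoding under an assumed treebank interface (the function, the "-" separator, and the example tags are hypothetical): each node's category and grammatical function are concatenated into a single label before training, so the latent-variable grammar refines them jointly.

```python
def merge_label(category, function):
    """e.g. merge_label("NP", "SB") -> "NP-SB"; nodes without a grammatical
    function keep their bare category label."""
    return f"{category}-{function}" if function else category
```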
Development Set Results
Shared Task Results
Part-of-speech splits
Linguistic Candy
Conclusions
- Split & Merge Learning
  - Hierarchical Training
  - Adaptive Splitting
  - Parameter Smoothing
- Hierarchical Coarse-to-Fine Inference
  - Projections
  - Marginalization
- Multi-lingual Unlexicalized Parsing
Thank You!
The parser is available at http://nlp.cs.berkeley.edu