Parsing German with Latent Variable Grammars
Slav Petrov and Dan Klein
UC Berkeley

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar:
- Parent annotation [Johnson ’98]
- Head lexicalization [Collins ’99, Charniak ’00]
- Automatic clustering?

Previous Work: Manual Annotation [Klein & Manning ’03]
Manually split categories:
- NP: subject vs. object
- DT: determiners vs. demonstratives
- IN: sentential vs. prepositional
Advantages: fairly compact grammar; linguistic motivations
Disadvantages: performance leveled out; manually annotated

  Model                    F1
  Naïve Treebank Grammar   72.6
  Klein & Manning ’03      86.3

Previous Work: Automatic Annotation Induction [Matsuzaki et al. ’05, Prescher ’05]
Label all nodes with latent variables; same number k of subcategories for all categories.
Advantages: automatically learned
Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

  Model                    F1
  Klein & Manning ’03      86.3
  Matsuzaki et al. ’05     86.7

Overview
Learning: Hierarchical Training, Adaptive Splitting, Parameter Smoothing
Inference: Coarse-to-Fine Decoding, Variational Approximation
German Analysis
[Petrov, Barrett, Thibaux & Klein in ACL ’06] [Petrov & Klein in NAACL ’07]

Learning Latent Annotations
EM algorithm:
- Brackets are known
- Base categories are known
- Only induce subcategories
Just like Forward-Backward for HMMs.
[Diagram: forward/backward passes over a tree with latent subcategories X1–X7 for the example sentence “He was right”]

[Plot: parsing accuracy (F1) vs. total number of grammar symbols, from the starting point k = 1 up to k = 16 subcategories per category; k = 16 reaches the limit of computational resources]

Refinement of the DT tag
[Diagram: the DT tag split into DT-1, DT-2, DT-3, DT-4]

Hierarchical Refinement of the DT tag
[Diagram: the DT tag refined by repeated binary splits]

Hierarchical Estimation Results
[Plot: parsing accuracy (F1) vs. total number of grammar symbols for flat vs. hierarchical training]

  Model                    F1
  Baseline                 87.3
  Hierarchical Training    88.4

Refinement of the , tag
Splitting all categories the same amount is wasteful.

The DT tag revisited
Oversplit?

Adaptive Splitting
Want to split complex categories more.
Idea: split everything, roll back the splits which were least useful.

Adaptive Splitting
Evaluate the loss in likelihood from removing each split:

  (data likelihood with the split reversed) / (data likelihood with the split)

No loss in accuracy when 50% of the splits are reversed.
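The roll-back criterion above can be made concrete with a short sketch. The snippet below is illustrative only: the function names, data layout, and the way per-node ratios are accumulated are assumptions for exposition, not the Berkeley Parser's code. It scores one candidate merge at one node by swapping that node's split contribution for a merged one (inside scores averaged by subcategory frequency, outside scores summed), then ranks splits by their accumulated ratios and rolls back the weakest half.

```python
# Illustrative sketch of the adaptive-splitting roll-back criterion.
# Names and data structures are assumptions and do not mirror the
# Berkeley Parser's internals.

def merge_ratio_at_node(in1, in2, out1, out2, p1, p2, sentence_likelihood):
    """Ratio of the sentence likelihood with one split reversed at a single
    node to the likelihood with the split kept.

    in1, in2   : inside scores of the two subcategories at this node
    out1, out2 : outside scores of the two subcategories at this node
    p1, p2     : relative frequencies of the two subcategories (p1 + p2 = 1)
    """
    # Contribution of this node under the split grammar.
    split_part = in1 * out1 + in2 * out2
    # Reversing the split: inside scores averaged by frequency, outside summed.
    merged_part = (p1 * in1 + p2 * in2) * (out1 + out2)
    merged_likelihood = sentence_likelihood - split_part + merged_part
    return merged_likelihood / sentence_likelihood  # < 1 means merging loses likelihood


def splits_to_roll_back(split_ratios, fraction=0.5):
    """split_ratios maps each candidate split to the product of its per-node
    ratios over the treebank; the splits with the highest ratios lose the
    least likelihood and are the ones rolled back."""
    ranked = sorted(split_ratios, key=split_ratios.get, reverse=True)
    return ranked[: int(len(ranked) * fraction)]
```

Accumulated over all nodes and sentences, the half of the splits whose ratios stay closest to one are the 50% merged back in the results that follow.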
Adaptive Splitting Results
[Plot: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, and 50% merging]

  Model                    F1
  Previous                 88.4
  With 50% Merging         89.5

Number of Phrasal Subcategories
[Bar chart: number of subcategories (0–35) per phrasal category, from ROOT, DL, CO, CH, … up to S, AP, PP, NP, VP]

Number of Lexical Subcategories
[Bar chart: number of subcategories (0–35) per part-of-speech tag, from NE, VVFIN, ADJA, NN, … down to PRELAT, PTKNEG, APZR]

Smoothing
Heavy splitting can lead to overfitting.
Idea: smoothing allows us to pool statistics.
Linear smoothing.

Result Overview
[Plot: parsing accuracy (F1) vs. total number of grammar symbols for flat training, hierarchical training, 50% merging, and 50% merging and smoothing]

  Model                    F1
  Previous                 89.5
  With Smoothing           90.7

Coarse-to-Fine Parsing [Goodman ’97, Charniak & Johnson ’05]
[Diagram: treebank → coarse grammar (NP, VP, …) → refined grammars (NP-1, NP-12, NP-17, VP-6, VP-31, …; lexicalized variants such as NP-apple, NP-dog, NP-cat, NP-eat, VP-run)]

Hierarchical Pruning
Consider the span 5 to 12:
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: …

Intermediate Grammars
[Diagram: learning proceeds through a chain of intermediate grammars, X-Bar = G0 → G1 → … → G6 = G; the DT tag is refined from DT to DT1–DT2, then DT1–DT4, then DT1–DT8]

State Drift (DT tag)
[Diagram: during EM, the DT subcategories drift; words such as “the”, “that”, “this”, “That”, “This”, “some”, “these” gradually regroup across subcategories]

Projected Grammars
[Diagram: each intermediate grammar is recovered from the final grammar by a projection, X-Bar = G0 = π0(G), G1 = π1(G), …, G6 = G]

Bracket Posteriors
[Figures: bracket posteriors after G0, after G1, the final chart, a movie over all passes, and the best tree]

Parse Selection
[Diagram: one parse corresponds to many derivations over the split subcategories]
Computing the most likely unsplit tree is NP-hard:
- Settle for the best derivation.
- Rerank an n-best list.
- Use an alternative objective function / variational approximation.

Efficiency Results
  Berkeley Parser:                15 min (implemented in Java)
  Charniak & Johnson ’05 parser:  19 min (implemented in C)

Accuracy Results
        Model                                 ≤ 40 words F1   all F1
  ENG   Charniak & Johnson ’05 (generative)   90.1            89.6
        This Work                             90.6            90.1
  GER   Dubey ’05                             76.3            -
        This Work                             80.8            80.1
  CHN   Chiang et al. ’02                     80.0            76.6
        This Work                             86.3            83.4

Parsing German Shared Task
Two-pass parsing:
- Determine constituency structure (F1: 85/94)
- Assign grammatical functions
One-pass approach:
- Treat categories + grammatical functions as labels

Development Set Results
Shared Task Results
Part-of-speech Splits
Linguistic Candy

Conclusions
Split & Merge Learning:
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing
Hierarchical Coarse-to-Fine Inference:
- Projections
- Marginalization
Multi-lingual Unlexicalized Parsing

Thank You!
The parser is available at http://nlp.cs.berkeley.edu
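As a concrete footnote to the Hierarchical Pruning slides above, here is a minimal, self-contained sketch of one coarse-to-fine pruning pass. The data layout (a dict keyed by span and label), the 0-based subcategory indexing, and the posterior threshold are all assumptions for exposition, not the Berkeley Parser's actual representation.

```python
# Hedged sketch of one hierarchical-pruning step: keep the chart items whose
# posterior under the coarser grammar exceeds a threshold, and allow only
# their refinements (each surviving subcategory split in two) in the next,
# twice-as-refined pass.

def next_pass_allowed(posteriors, threshold=1e-5):
    """posteriors: {(start, end, label): posterior under the current grammar}
    Returns the set of (start, end, label) items allowed in the next pass."""
    allowed = set()
    for (start, end, label), p in posteriors.items():
        if p <= threshold:
            continue                      # prune: this item is skipped entirely
        base, _, idx = label.partition("-")
        i = int(idx) if idx else 0        # unsplit symbols count as subcategory 0
        # Each surviving subcategory i is split into 2i and 2i + 1.
        allowed.add((start, end, f"{base}-{2 * i}"))
        allowed.add((start, end, f"{base}-{2 * i + 1}"))
    return allowed


# Tiny example for the span (5, 12) from the slide: QP is pruned away,
# while NP and VP survive and are split in two for the next pass.
coarse_posteriors = {(5, 12, "QP"): 1e-7, (5, 12, "NP"): 0.6, (5, 12, "VP"): 0.2}
print(next_pass_allowed(coarse_posteriors))
```

Repeating this step once per projection level reproduces the coarse → split-in-two → split-in-four → split-in-eight progression shown on the Hierarchical Pruning slide.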