Learning Accurate, Compact, and Interpretable Tree Annotation
Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein


The Game of Designing a Grammar

- Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  - Parent annotation [Johnson ’98]
  - Head lexicalization [Collins ’99, Charniak ’00]
  - Automatic clustering?


Previous Work: Manual Annotation [Klein & Manning ’03]

- Manually split categories:
  - NP: subject vs. object
  - DT: determiners vs. demonstratives
  - IN: sentential vs. prepositional
- Advantages: fairly compact grammar; linguistic motivations
- Disadvantages: F1 performance leveled off

    Model                                      F1
    Naïve Treebank Grammar                     72.6
    Klein & Manning ’03 (manually annotated)   86.3


Previous Work: Automatic Annotation Induction [Matsuzaki et al. ’05, Prescher ’05]

- Label all nodes with latent variables; same number k of subcategories for all categories.
- Advantages: automatically learned
- Disadvantages: the grammar gets too large; most categories are oversplit while others are undersplit.

    Model                   F1
    Klein & Manning ’03     86.3
    Matsuzaki et al. ’05    86.7


Previous Work Is Complementary

- Manual annotation: allocates splits where needed and yields a compact grammar, but is very tedious and misses features.
- Automatic annotation: automatically learned and captures many features, but splits uniformly and the grammar gets large.
- This work: automatic annotation that allocates splits where they are needed.


Learning Latent Annotations

- EM algorithm (the inside pass is sketched below, after the Adaptive Splitting slides):
  - Brackets are known
  - Base categories are known
  - Only induce subcategories
- Just like Forward-Backward for HMMs, but over trees.

  [Figure: the parse tree of “He was right .”, each node carrying a latent subcategory variable X1 ... X7, with forward/backward (inside/outside) passes over the tree.]


Overview

  [Figure: parsing accuracy (F1) vs. total number of grammar symbols (50 to 1650) for flat training with k = 1, 2, 4, 8, 16 subcategories per category; accuracy grows with k, but k = 16 already reaches the limit of computational resources.]

- This talk: hierarchical training, adaptive splitting, parameter smoothing.


Refinement of the DT tag

  [Figure: the DT tag split in one step into DT-1 ... DT-4, contrasted with hierarchical refinement in which each subcategory is split in two at every stage.]


Hierarchical Estimation Results

  [Figure: parsing accuracy (F1) vs. total number of grammar symbols (100 to 1700), flat vs. hierarchical training.]

    Model                   F1
    Baseline                87.3
    Hierarchical Training   88.4


Refinement of the , tag

- Splitting all categories the same amount is wasteful: the , tag needs essentially no splitting.


The DT tag revisited

- Oversplit?


Adaptive Splitting

- Want to split complex categories more.
- Idea: split everything, then roll back the splits which were least useful.
- Evaluate the loss in likelihood from removing each split (sketched below):

    loss = (data likelihood with the split reversed) / (data likelihood with the split)

- No loss in accuracy when 50% of the splits are reversed.
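For concreteness, here is a minimal sketch of the inside ("backward") pass that the EM step on the Learning Latent Annotations slide needs for a single tree with known brackets and base categories. The tree encoding, the `rule_probs`/`lex_probs` tables, and `n_sub` are hypothetical stand-ins for illustration, not the authors' data structures; the outside ("forward") pass and the expected-count accumulation follow the same pattern.

```python
def inside_scores(node, rule_probs, lex_probs, n_sub):
    """Inside scores over latent subcategories for one known parse tree.

    Assumed encodings (illustrative only):
      - a leaf is a pair (tag, word); an internal node is (label, left, right)
      - rule_probs[(A, x, B, y, C, z)] = P(A-x -> B-y C-z)
      - lex_probs[(T, x, word)]        = P(T-x -> word)
      - n_sub[label] = number of subcategories of `label`
    Returns, for each subcategory x of the node's label, the probability of
    the yield below this node given that the node is label-x.
    """
    if len(node) == 2:                                   # preterminal (tag, word)
        tag, word = node
        return [lex_probs.get((tag, x, word), 0.0) for x in range(n_sub[tag])]

    label, left, right = node
    in_left = inside_scores(left, rule_probs, lex_probs, n_sub)
    in_right = inside_scores(right, rule_probs, lex_probs, n_sub)
    scores = []
    for x in range(n_sub[label]):
        total = 0.0
        for y, s_left in enumerate(in_left):
            for z, s_right in enumerate(in_right):
                # Sum over subcategory choices of the two children.
                total += (rule_probs.get((label, x, left[0], y, right[0], z), 0.0)
                          * s_left * s_right)
        scores.append(total)
    return scores
```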
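The adaptive-splitting criterion above (the likelihood ratio for reversing a split) can be approximated locally from inside/outside scores. The sketch below treats each occurrence of the split symbol independently; the per-occurrence score tuples and the subcategory frequencies p1/p2 are assumed bookkeeping for illustration, not the authors' implementation.

```python
import math

def split_loss(occurrences, p1, p2):
    """Approximate loss in log-likelihood from reversing one split.

    `occurrences` is assumed to hold, for every tree node labelled with the
    split symbol, a tuple (in1, in2, out1, out2, tree_lik): the inside and
    outside scores of the two subcategories and the likelihood of that tree.
    `p1` and `p2` are the relative frequencies of the two subcategories.
    """
    log_loss = 0.0
    for in1, in2, out1, out2, tree_lik in occurrences:
        # Contribution of this node to the tree likelihood with the split kept ...
        with_split = in1 * out1 + in2 * out2
        # ... and with the split reversed: pool the inside scores using the
        # subcategory frequencies and sum the outside scores.
        merged = (p1 * in1 + p2 * in2) * (out1 + out2)
        # Assume the merge affects only this node of the tree.
        new_tree_lik = tree_lik - with_split + merged
        log_loss += math.log(tree_lik) - math.log(new_tree_lik)
    return log_loss   # splits with the smallest loss are the ones to roll back
```

Ranking splits by this loss and rolling back the cheapest ones corresponds to the "50% of the splits reversed" setting reported above.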
Adaptive Splitting Results

  [Figure: parsing accuracy (F1) vs. total number of grammar symbols (100 to 1700) for flat training, hierarchical training, and hierarchical training with 50% merging.]

    Model               F1
    Previous            88.4
    With 50% Merging    89.5


Number of Phrasal Subcategories

  [Figure: bar chart of the number of subcategories learned for each phrasal category; NP, VP, and PP receive by far the most subcategories, while categories such as X, NAC, LST, and ROOT receive almost none.]


Number of Lexical Subcategories

  [Figure: bar chart of the number of subcategories learned for each part-of-speech tag; open-class tags such as NNP, JJ, NNS, and NN receive the most subcategories, the verb tags, IN, RB, and DT a moderate number, and tags such as POS, TO, and , remain essentially unsplit.]


Smoothing

- Heavy splitting can lead to overfitting.
- Idea: smoothing allows us to pool statistics across the subcategories of a category.
- Linear smoothing (a small sketch follows after the conclusions).


Result Overview

  [Figure: parsing accuracy (F1) vs. total number of grammar symbols (100 to 1100) for flat training, hierarchical training, 50% merging, and 50% merging with smoothing.]

    Model             F1
    Previous          89.5
    With Smoothing    90.7


Final Results

    Parser                     F1 (≤ 40 words)   F1 (all words)
    Klein & Manning ’03        86.3              85.7
    Matsuzaki et al. ’05       86.7              86.1
    Collins ’99                88.6              88.2
    Charniak & Johnson ’05     90.1              89.6
    This Work                  90.2              89.7


Linguistic Candy

- Proper nouns (NNP):

    NNP-14   Oct.      Nov.        Sept.
    NNP-12   John      Robert      James
    NNP-2    J.        E.          L.
    NNP-1    Bush      Noriega     Peters
    NNP-15   New       San         Wall
    NNP-3    York      Francisco   Street

- Personal pronouns (PRP):

    PRP-0    It        He          I
    PRP-1    it        he          they
    PRP-2    it        them        him

- Relative adverbs (RBR):

    RBR-0    further   lower       higher
    RBR-1    more      less        More
    RBR-2    earlier   Earlier     later

- Cardinal numbers (CD):

    CD-7     one       two         Three
    CD-4     1989      1990        1988
    CD-11    million   billion     trillion
    CD-0     1         50          100
    CD-3     1         30          31
    CD-9     78        58          34


Conclusions

- New ideas: hierarchical training, adaptive splitting, parameter smoothing.
- State-of-the-art parsing performance: improves from the X-Bar initializer (63.4 F1) to 90.2 F1.
- Linguistically interesting grammars to sift through.

Thank You!
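Backup note on the linear smoothing mentioned above: a minimal sketch of the idea, assuming each base category's refined rule probabilities are stored as one dict per subcategory. The data layout and the value of `alpha` are illustrative assumptions, not the exact recipe behind the reported numbers.

```python
def smooth_linearly(rule_probs, alpha=0.01):
    """Shrink each subcategory's rule probabilities towards the mean over all
    sibling subcategories of the same base category, so that rarely observed
    subcategories can pool statistics with their siblings.

    `rule_probs` is assumed to map subcategory -> {production: probability};
    the mixing weight `alpha` is an illustrative choice.
    """
    subcats = list(rule_probs)
    productions = {prod for probs in rule_probs.values() for prod in probs}
    # Average probability of each production across the sibling subcategories.
    mean = {prod: sum(rule_probs[x].get(prod, 0.0) for x in subcats) / len(subcats)
            for prod in productions}
    # Interpolate each subcategory's estimate with that average.
    return {x: {prod: (1 - alpha) * rule_probs[x].get(prod, 0.0) + alpha * mean[prod]
                for prod in productions}
            for x in subcats}
```

Because the average of the sibling distributions is itself a distribution, the interpolated estimates remain properly normalized.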
petrov@eecs.berkeley.edu


Other Things We Tried

- X-Bar vs. structurally annotated grammar: the X-Bar grammar starts at lower performance, but provides more flexibility.
- Better smoothing: we tried different (hierarchical) smoothing methods; all worked about the same.
- (Linguistically) constraining the rewrite possibilities between subcategories: hurts performance. EM automatically learns that most subcategory combinations are meaningless: ≥ 90% of the possible rewrites have 0 probability (a simple check is sketched below).
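To make the last point concrete, here is a small check one could run over a learned grammar to measure how many of the possible subcategory-refined rewrites actually carry probability mass. The grammar encoding below is a hypothetical stand-in, not the format the authors used.

```python
def zero_rewrite_fraction(refined_rule_probs, n_sub):
    """Fraction of possible refined binary rewrites that have zero probability.

    Assumed encoding: `refined_rule_probs` maps (A, x, B, y, C, z) to the
    probability of the refined rule A-x -> B-y C-z, and `n_sub` maps each
    base symbol to its number of subcategories.
    """
    base_rules = {(a, b, c) for (a, _, b, _, c, _) in refined_rule_probs}
    # Every combination of subcategories is a possible refined rewrite.
    possible = sum(n_sub[a] * n_sub[b] * n_sub[c] for a, b, c in base_rules)
    nonzero = sum(1 for p in refined_rule_probs.values() if p > 0.0)
    return 1.0 - nonzero / possible
```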