Improved Inference for Unlexicalized Parsing
Slav Petrov and Dan Klein

Unlexicalized Parsing [Petrov et al. '06]
Hierarchical, adaptive refinement: each treebank symbol is split and re-split where the data demands it, e.g. DT -> DT1, DT2 -> DT1 ... DT4 -> DT1 ... DT8.
- 91.2 F1 score on the Dev Set (1600 sentences)
- 1,140 nonterminal symbols
- 531,200 rewrites
- 1621 min parsing time

Coarse-to-Fine Parsing [Goodman '97, Charniak & Johnson '05]
From the treebank we get a coarse grammar (NP, VP, ...) and a refined grammar over split symbols (NP-1, NP-12, NP-17, VP-6, VP-31, ...; intuitively NP-apple, NP-dog, NP-cat, VP-run, VP-eat, ...). Parse with the coarse grammar first and prune before parsing with the refined grammar.
Prune? For each chart item X[i,j], compute its posterior probability and prune when it falls below a threshold:
    P_IN(X, i, j) * P_OUT(X, i, j) / P_IN(root, 0, n) < threshold
E.g. consider the span 5 to 12: if only QP, NP, and VP survive the coarse pass, only their refinements are considered in the refined pass. (A code sketch of the pruning test appears after the Reranking Results table below.)
1621 min -> 111 min (no search error).

Multilevel Coarse-to-Fine Parsing [Charniak et al. '06]
Add more rounds of pre-parsing, with grammars even coarser than X-bar (symbols clustered down to a handful of classes A, B, ...).

Hierarchical Pruning
Consider again the span 5 to 12, now passed through a whole hierarchy of grammars:
- coarse:         ... QP NP VP ...
- split in two:   ... QP1 QP2 NP1 NP2 VP1 VP2 ...
- split in four:  ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
- split in eight: ...
After each pass, items whose posterior falls below the threshold are pruned before the next, more refined pass.

Intermediate Grammars
Training already produces a sequence of grammars X-Bar = G0, G1, G2, G3, G4, G5, G6 = G, one per split round (DT -> DT1, DT2 -> DT1 ... DT4 -> DT1 ... DT8), so these can serve as the pruning hierarchy.
1621 min -> 111 min -> 35 min (no search error).

State Drift (the DT tag)
Across EM rounds the substates drift: a substate that starts out covering {the, that, this} may later cover {That, This}, while {some, these} ends up grouped elsewhere. The intermediate grammars from training are therefore not true refinements of the final grammar, which makes them unreliable for pruning.

Projected Grammars
Instead of the grammars from training, project the final grammar G down to each level: G0 ≈ π0(G), G1 ≈ π1(G), ..., G5 ≈ π5(G), with G6 = G. The projection πi merges exactly the substates that level i does not distinguish.

Estimating Projected Grammars
Nonterminals? Easy:
    NP1, NP2 in G  ->  NP in π(G)
    VP1, VP2 in G  ->  VP in π(G)
    S1, S2 in G    ->  S in π(G)
Rules? [Corazza & Satta '06]
    Rules in G:                    Rules in π(G):
    S1 -> NP1 VP1   0.20
    S1 -> NP1 VP2   0.12
    S1 -> NP2 VP1   0.02
    S1 -> NP2 VP2   0.03           S -> NP VP   0.56
    S2 -> NP1 VP1   0.11
    S2 -> NP1 VP2   0.05
    S2 -> NP2 VP1   0.08
    S2 -> NP2 VP2   0.12
The projected probability is the expected relative frequency of S -> NP VP in the infinite tree distribution that G defines over the treebank; it is not a plain sum of the split-rule probabilities.

Calculating Expectations (sketched in code below)
Nonterminals: let ck(X) be the expected count of symbol X in trees of depth up to k. It satisfies the recurrence
    c_{k+1}(X) = [X = root] + Σ_{Y -> α} P(Y -> α) · c_k(Y) · (occurrences of X in α)
and converges within 25 iterations (a few seconds).
Rules: the expected count of a rule is c(X -> γ) = c(X) · P(X -> γ); the projected rule probabilities are these expected counts, summed over the projection classes and renormalized.

Parsing Times
1621 min -> 111 min -> 35 min -> 15 min (no search error).
Share of total parsing time per pass: X-Bar=G0 60%, G1 12%, G2 7%, G3 6%, G4 6%, G5 5%, G=G6 4%.

Bracket Posteriors
[Figure slides: bracket posterior heat maps after the G0 pass and after the G1 pass, a movie over all passes, the best tree, and the final chart.]

Parse Selection
The refined grammar defines a distribution over derivations: trees whose nodes carry substate labels (e.g. NP-1 vs. NP-2). Many derivations collapse to the same unsplit parse, and computing the most likely unsplit tree is NP-hard. Options:
- Settle for the best derivation.
- Rerank an n-best list.
- Use an alternative objective function.

Parse Risk Minimization [Titov & Henderson '06] (sketched in code below)
Minimize the expected loss according to our beliefs:
    T_P* = argmin_{T_P} Σ_{T_T} P(T_T | sentence) · L(T_P, T_T)
where T_T is the true tree, T_P the predicted tree, and L a loss function (0/1, precision, recall, F1). Use an n-best candidate list and approximate the expectation with samples.

Reranking Results
Objective              Precision  Recall  F1    Exact
BEST DERIVATION
Viterbi Derivation     89.6       89.4    89.5  37.4
RERANKING
Precision (sampled)    91.1       88.1    89.6  21.4
Recall (sampled)       88.2       91.3    89.7  21.5
F1 (sampled)           90.2       89.3    89.8  27.2
Exact (sampled)        89.5       89.5    89.5  25.8
Exact (non-sampled)    90.8       90.8    90.8  41.7
Exact/F1 (oracle)      95.3       94.4    95.0  63.9
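A minimal sketch of the pruning test from the Coarse-to-Fine slide above. Everything here is illustrative: the dict-based inside/outside charts, the (symbol, i, j) keys, and the threshold value are assumptions made for exposition, not the Berkeley Parser's actual data structures.

```python
def prune_chart(inside, outside, n, root="ROOT", threshold=1e-4):
    """Keep only chart items X[i,j] whose posterior probability
    P_IN(X,i,j) * P_OUT(X,i,j) / P_IN(root,0,n) reaches the threshold.
    `inside` and `outside` map (symbol, i, j) -> probability."""
    z = inside[(root, 0, n)]  # total probability of the sentence
    kept = set()
    for (sym, i, j), p_in in inside.items():
        posterior = p_in * outside.get((sym, i, j), 0.0) / z
        if posterior >= threshold:
            kept.add((sym, i, j))  # survives into the next, finer pass
    return kept
```

In hierarchical pruning the surviving items license the next pass: only the refinements of a kept coarse item (e.g. QP1 and QP2 for a kept QP over the span 5 to 12) are scored by the next grammar.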
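The expectation computation from the Calculating Expectations slide, under the same caveat: the parent-indexed `rules` dict and the symbol names are assumptions, while the fixed-point recurrence for ck(X) and the renormalization follow the slides and [Corazza & Satta '06].

```python
from collections import defaultdict

def expected_symbol_counts(rules, root="ROOT", iterations=25):
    """Fixed-point iteration for c(X), the expected number of occurrences
    of X in a tree drawn from the grammar; iteration k yields the slide's
    c_k(X), the expectation over trees of depth up to k.
    `rules`: parent -> list of (children_tuple, probability)."""
    c = defaultdict(float)
    for _ in range(iterations):
        new_c = defaultdict(float)
        new_c[root] = 1.0  # the root occurs exactly once
        for parent, expansions in rules.items():
            for children, p in expansions:
                for child in children:
                    new_c[child] += c[parent] * p
        c = new_c
    return c

def project_rules(rules, proj, c):
    """Projected rule probabilities as expected relative frequencies:
    expected count of the projected rule divided by the expected count
    of its projected parent. `proj` maps split symbols to unsplit ones,
    e.g. proj["S1"] == proj["S2"] == "S"."""
    rule_counts = defaultdict(float)
    parent_counts = defaultdict(float)
    for parent, expansions in rules.items():
        parent_counts[proj[parent]] += c[parent]
        for children, p in expansions:
            key = (proj[parent], tuple(proj[ch] for ch in children))
            rule_counts[key] += c[parent] * p
    return {rule: cnt / parent_counts[rule[0]]
            for rule, cnt in rule_counts.items()}
```

Note how the rules of S1 and S2 are pooled with weights c(S1) and c(S2): the projected probability of S -> NP VP depends on how often each substate is expected to occur, not only on the split-rule probabilities listed in the table.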
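A toy sketch of the risk-minimization objective. Assumptions for brevity: parses are represented as sets of (label, start, end) brackets, `candidates` is the n-best list, and `samples` are trees drawn from the parser's posterior.

```python
def f1_loss(pred, gold):
    """1 - bracket F1, with parses as sets of (label, start, end) tuples."""
    if not pred and not gold:
        return 0.0
    match = len(pred & gold)
    precision = match / len(pred) if pred else 0.0
    recall = match / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 1.0
    return 1.0 - 2 * precision * recall / (precision + recall)

def min_risk_parse(candidates, samples, loss=f1_loss):
    """Pick the candidate T_P minimizing (1/N) * sum_i loss(T_P, T_i),
    the sampled approximation of sum_{T_T} P(T_T | sentence) * L(T_P, T_T)."""
    return min(candidates,
               key=lambda tp: sum(loss(tp, tt) for tt in samples) / len(samples))
```

Swapping `loss` for a 0/1, precision, or recall loss corresponds to the other objectives in the Reranking Results table above.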
Dynamic Programming [Matsuzaki et al. '05]
Approximate the posterior parse distribution à la [Goodman '98]: maximize the number of expected correct rules.

Dynamic Programming Results
Objective              Precision  Recall  F1    Exact
BEST DERIVATION
Viterbi Derivation     89.6       89.4    89.5  37.4
DYNAMIC PROGRAMMING
Variational            90.7       90.9    90.8  41.4
Max-Rule-Sum           90.5       91.3    90.9  40.4
Max-Rule-Product       91.2       91.1    91.2  41.4

Final Results (Efficiency)
- Berkeley Parser: 15 min, 91.2 F-score, implemented in Java.
- Charniak & Johnson '05 Parser: 19 min, 90.7 F-score, implemented in C.

Final Results (Accuracy)
Language  Parser                               ≤40 words F1  all F1
ENG       Charniak & Johnson '05 (generative)  90.1          89.6
ENG       This Work                            90.6          90.1
ENG       Charniak & Johnson '05 (reranked)    92.0          91.4
GER       Dubey '05                            76.3          -
GER       This Work                            80.8          80.1
CHN       Chiang et al. '02                    80.0          76.6
CHN       This Work                            86.3          83.4

Conclusions
- Hierarchical coarse-to-fine inference: projections and marginalization.
- Multi-lingual unlexicalized parsing.

Thank You!
Parser available at http://nlp.cs.berkeley.edu
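Backup: a simplified sketch of the Max-Rule-Product objective from the Dynamic Programming slide, scoring each unsplit tree by the product of its rules' posterior probabilities. The helpers `lex_posterior(A, i)` and `rule_posterior(A, B, C, i, k, j)` are assumed to return posteriors of unsplit preterminals and binary rules, with the substates already marginalized out via inside/outside scores; these helpers and the brute-force loop over symbol triples are illustrative, not the actual Berkeley Parser code.

```python
import math
from collections import defaultdict

def max_rule_product(n, symbols, lex_posterior, rule_posterior, root="ROOT"):
    """Viterbi search over unsplit binary trees, maximizing the sum of
    log rule posteriors (equivalently, their product)."""
    best = defaultdict(lambda: -math.inf)  # (A, i, j) -> best log score
    back = {}                              # backpointers for the best tree
    for i in range(n):                     # width-1 spans: preterminals
        for a in symbols:
            p = lex_posterior(a, i)
            if p > 0.0:
                best[(a, i, i + 1)] = math.log(p)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for b in symbols:
                    if best[(b, i, k)] == -math.inf:
                        continue
                    for c in symbols:
                        if best[(c, k, j)] == -math.inf:
                            continue
                        for a in symbols:
                            p = rule_posterior(a, b, c, i, k, j)
                            if p <= 0.0:
                                continue
                            score = (math.log(p) + best[(b, i, k)]
                                     + best[(c, k, j)])
                            if score > best[(a, i, j)]:
                                best[(a, i, j)] = score
                                back[(a, i, j)] = (b, c, k)
    return best[(root, 0, n)], back
```

Because each cell maximizes over rule posteriors rather than summing derivation probabilities, this remains a polynomial dynamic program over unsplit symbols even though the single most likely unsplit tree is NP-hard to find; Max-Rule-Product simply optimizes a different objective, and per the table above a more accurate one.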