Factors Affecting the Accuracy of Korean Parsing

Tagyoung Chung, Matt Post, Daniel Gildea
Department of Computer Science, University of Rochester

1 Korean

2 Complicating factors

• Rich morphology
  – 㥋㟃㨋㣨㥽㐹㣥㖦㒫? = negative + see + passive + honorific + past + future + formal + indicative + interrogative
• Scrambling
  – 㩕㨋 㜔㛽㦂㐳 㫾㧺 㩳㥽㖰 .
    John-NOM Mary-DAT book-ACC give-PAST-DEC .
• Null anaphora
  – 㩞㥕㥱? "Did you like it?"

3 Korean treebank

• Newswire text (LDC2000T45)
• 5K sentences, 132K words, 14K unique morphemes
• Tokenized and allomorph-neutralized
• Penn Treebank-like annotations (Han et al., 2001)

4 Parsing with probabilistic context-free grammars

5 Initial experiments

                     F1      F1≤40   types   tokens
  Korean             52.78   56.55    6.6K    194K
  English (§02–04)   72.20   73.29    7.5K    147K
  English (§02–21)   71.61   72.74     23K    950K

• Standard PCFG
• All NULL elements and function tags removed
• Nonterminal-to-terminal ratio is similar
• Data sparsity may not be the main problem

6 Initial observations

• KTB sentences are much longer: mean length of 48 words vs. 23 for the PTB
• Ambiguous rules:
    NP → NP NP NP NP            (6)
    NP → NP NP NP NP NP         (3)
    NP → NP NP NP NP NP NP      (2)
    NP → NP NP NP NP NP NP NP   (2)
  NP → NP NP NP NP occurs only once in the PTB
• Coarse labels: nonterminals (43 vs. 72), preterminals (33 vs. 44)

7 Function tags

  SBJ    subject with nominative case marker
  OBJ    complement with accusative case marker
  COMP   complement with adverbial postposition
  ADV    NP that functions as an adverbial phrase
  VOC    noun with vocative case marker
  LV     NP coupled with a "light" verb construction

[Figure: NP-SBJ node dominating NPR NNC PAU over 㣙㧗㭐 㞇㡐 㧹]

• Mark grammatical functions of NP and S nodes
• Show child nodes' morphological information
• (S → NP-SBJ S) is twice as common as (S → NP-OBJ S)

8 Parsing with function tags

                     w/o function tags    w/ function tags
                     F1      F1≤40        F1      F1≤40
  Korean             52.78   56.55        56.18   60.21
  English (§02–04)   72.20   73.29        70.50   71.78
  English (§02–21)   71.61   72.74        72.82   74.05

• Evaluated against the same test set (without function tags)
• Nonterminals are too coarse without function tags
• Further improvements?

9 Latent annotations

• Learning a refined grammar improves English parsing:
  parent annotation (Johnson, 1998), lexicalization (Collins, 1999)
• Petrov et al. (2006) introduced automatic learning of latent annotations using split-merge EM:
  1. Split a symbol into two subcategories using EM
  2. Merge it back if the loss in likelihood from merging is small
  3. Apply additive smoothing
  4. Repeat

10 Parsing with latent annotations

                     w/ function tags     latent annotation
                     F1      F1≤40        F1      F1≤40
  Korean             56.18   60.21        79.93   82.04
  English (§02–04)   70.50   71.78        85.21
  English (§02–21)   72.82   74.05        89.21

• After five split-merge cycles
• Not directly comparable, but a clear improvement

11 NULL elements

• NULL elements are prevalent in the KTB
• Zero pronouns are especially common (1.8 per sentence):
  dropped wherever pragmatically inferable
• Do NULL elements affect parsing?
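One way to probe this question is to compare parsers trained with the empty elements against references that have them removed. A minimal sketch of the stripping step, assuming a Penn-style convention where empty categories (zero pronouns, traces) hang under a `-NONE-` preterminal; the nested-list tree encoding and the labels below are illustrative, not the KTB's actual file format:

```python
# Prune NULL elements (empty categories) from a parse tree.
# Trees are nested lists: [label, child1, child2, ...]; leaves are strings.
# Assumed convention: empty categories sit under a "-NONE-" preterminal,
# as in the English PTB; the KTB marks zero pronouns and traces similarly.

def strip_nulls(tree):
    """Return a copy of `tree` with -NONE- subtrees removed,
    or None if the whole subtree becomes empty."""
    if isinstance(tree, str):          # leaf (terminal word)
        return tree
    label, children = tree[0], tree[1:]
    if label == "-NONE-":              # empty category: drop it
        return None
    kept = [c for c in (strip_nulls(ch) for ch in children) if c is not None]
    if not kept:                       # all children were empty
        return None
    return [label] + kept

# A zero-pronoun subject, as in the "Did you like it?" example:
tree = ["S",
        ["NP-SBJ", ["-NONE-", "*pro*"]],
        ["VP", ["VV", "like"], ["EFN", "Q"]]]
print(strip_nulls(tree))
# → ['S', ['VP', ['VV', 'like'], ['EFN', 'Q']]]
```

Running the same pruning over the reference trees yields the NULL-free evaluation set described here: the parser sees NULL elements in training and parsing, but is scored against trees without them.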
  – Train the parser with NULL elements
  – Parse text with NULL elements
  – Evaluate against the reference without NULL elements

12 Parsing with NULL elements

  English (§02–21)                      F1      F1≤40
    coarse                              71.61   72.74
    w/ NULLs                            73.29   74.38
  Korean
    w/ verb ellipses                    52.85   56.52
    w/ traces                           55.88   59.42
    w/ relative construction markers    56.74   59.87
    w/ zero pronouns                    57.56   61.17
    latent (5) w/ NULLs                 89.56   91.03

• Adding relative construction markers and zero pronouns has the largest impact
• Latent annotation with NULLs produces the best result

13 Parsing with tree substitution grammars

14 Tree substitution grammars

[Figure: an SBARQ tree for "who … ?" decomposed into tree fragments, e.g. SBARQ → WHNP SQ "?" with WHNP rewriting as (WP who)]

• Nonterminals can be rewritten as tree fragments of any size
• Learning a TSG can be a challenge

15 Spinal grammar

• Chiang (2000) proposed a heuristic for learning TAG:
  a node with a different head from its parent becomes a new root
• Learning a TSG with a spinal grammar requires head rules
• Created head rules that maximally project important morphemes

[Figure: example KTB tree for "Doctor Schwartz was discharged afterwards", cut into spines; the verb 㲩㑌 (discharge) with its PASSIVE + PAST + DEC morphemes projects the S–VP–VV spine]

16 Bayesian learning of TSGs

• Similar techniques were independently proposed by Cohn et al. (2009), O'Donnell et al. (2009), and Post and Gildea (2009)
  – A DP prior prevents learning unnecessarily large rules
  – Gibbs sampling makes the space complexity manageable
  – The algorithm visits every node and decides to join or split according to the probability given by the sampler

17 Parsing with the TSG model

                        F1      F1≤40   size
  CFG (coarse)          52.78   56.55   5.4K
  spinal (head left)    59.49   63.33    49K
  spinal (head right)   66.05   69.96    29K
  spinal (head rules)   66.28   70.61    29K
  induced               68.93   73.79    16K

[Figure: example induced fragments, e.g. NP → NPR NNC NNU]

• The TSG shows improvement over the CFG
• Head rules show a modest improvement over the simple baselines
• The induced TSG is the best and lends itself to further analysis
• English parsing experiments show a similar trend

18 Word order

• Korean shows long-distance scrambling (Rambow and Lee, 1994):
  it is permissible, but is it common?
• The KTB is from newswire, which maintains a more rigid word order
  – SOV is the most common word order, but OSV is also permitted
  – Analysis of the KTB shows SOV sentences are 63.5 times more numerous
  – However, order is not completely fixed even in formal writing
• Free word order does not apply to all constituents:
  morphemes always agglutinate in a fixed order

19 Conclusion

• KTB nonterminals are underspecified:
  refined annotations bring considerable improvement
• The prevalence of NULL elements is a challenge; possible solutions include:
  – Automatically inserting NULL elements
  – Special annotation for parents/siblings of deleted nodes
• Word order may not be a huge problem, but further investigation is needed
• Potential implications for machine translation

20 Questions?
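The spinal-extraction heuristic from slide 15 (a node whose head differs from its parent's becomes a new root) can be sketched as follows. The nested-list tree encoding, the toy head rules, and the example labels are illustrative assumptions, not the actual head rules used in the experiments:

```python
# Sketch of spinal elementary-tree extraction (Chiang, 2000): cut the
# parse tree at every edge leading to a non-head child, so that each
# elementary tree is a spine following head children down to one word.
# Trees are nested lists [label, child, ...]; leaves are word strings.

def head_index(children, preferred):
    """Pick the head child: the one matching the preferred label,
    falling back to the leftmost child (a toy head rule)."""
    labels = [c[0] for c in children]
    return labels.index(preferred) if preferred in labels else 0

def extract_spines(tree, head_rules):
    """Return the spines of `tree` as lists of labels ending in a word."""
    spines = []
    def walk(node):
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            return [label, children[0]]          # preterminal over a word
        h = head_index(children, head_rules.get(label))
        for i, child in enumerate(children):
            if i != h:                           # non-head child: new root
                spines.append(walk(child))
        return [label] + walk(children[h])       # extend the head spine
    spines.append(walk(tree))
    return spines

# Toy example in the spirit of slide 15 (labels and rules are illustrative):
head_rules = {"S": "VP", "VP": "VV", "NP-SBJ": "NPR"}
tree = ["S",
        ["NP-SBJ", ["NPR", "Schwartz"]],
        ["VP", ["VV", "discharge"]]]
print(extract_spines(tree, head_rules))
# → [['NP-SBJ', 'NPR', 'Schwartz'], ['S', 'VP', 'VV', 'discharge']]
```

The points where spines are cut off become substitution sites of the resulting TSG; the head rules described in the deck differ from this toy fallback in that they maximally project important morphemes such as the verb stem.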