Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies
David A. Smith and Jason Eisner
Center for Language and Speech Processing, Department of Computer Science, Johns Hopkins University

Synchronous Grammars
- Example: "Im Anfang war das Wort" / "In the beginning was the word"
- Synchronous grammars elegantly model P(T1, T2, A)
- Conditionalizing for alignment, translation, training?
- Observe parallel trees? Impute trees/links? Project known trees...

Projection
- Example: "Im Anfang war das Wort" / "In the beginning was the word"
- Train with bitext: parse one side, align words, project dependencies
- Many-to-one links? Non-projective and circular dependencies?
- Proposals in Hwa et al., Quirk et al., etc.

Divergent Projection
- Example: "Auf diese Frage habe ich leider keine Antwort bekommen" (plus NULL) / "I did not unfortunately receive an answer to this question"
- Link configurations illustrated: null, head-swapping, monotonic, siblings

Free Translation
- Bad dependencies: "Tschernobyl könnte dann etwas später an die Reihe kommen" (plus NULL) / "Then we could deal with Chernobyl some time later"
- Parent-ancestors?

Dependency Menagerie
- (figure: taxonomy of source-side dependency configurations)

Overview
- Divergent & sloppy projection
- Modeling motivation
- Quasi-Synchronous Grammars (QG)
- Basic parameterization
- Modeling experiments
- Alignment experiments

QG by Analogy
- HMM: noisy channel generating states
- MEMM: direct generative model of states
- CRF: undirected, globally normalized

Words with Senses
- Now consider senses in a particular (German) sentence:
  "Ich habe die Veröffentlichung über ... mit ... präsentiert"
  "I have presented the paper about ... with ..."
- English "paper" takes its sense from the aligned German node "Veröffentlichung" (not the literal "das Papier"): I really mean "conference paper".

Quasi-Synchronous Grammar
- QG: a target-language grammar that generates translations of a particular source-language sentence
- A direct, conditional model of translation: P(T2, A | T1)
- This grammar can be CFG, TSG, TAG, etc.

Generating a QCFG from T1
- U = target-language grammar nonterminals
- V = nodes of the given source tree T1
- Binarized QCFG, with A, B, C ∈ U and α, β, γ ∈ 2^V:
  <A, α> ⇒ <B, β> <C, γ>
  <A, α> ⇒ w

Present Modeling Restrictions
- |α| ≤ 1
- Dependency grammars (one node per word)
- Tie parameters that depend on α, β, γ
- "Model 1" property: reuse of "senses". Why?

Modeling Assumptions
- Example: "Im Anfang war das Wort" / "In the beginning was the word"
- Tie parameters for all tokens of "im"
- At most one sense per English word; allow sense "reuse"
- In dependency grammar: one node per word
- Dependency relations + "none of the above"

QCFG Generative Story
- Observed source: "Auf diese Frage habe ich leider keine Antwort bekommen" (plus NULL)
- Target: "I did not unfortunately receive an answer to this question"
- Factors include P(parent-child), P(breakage), P(I | ich), and P(PRP | no left children of did); see the configuration sketch below
- Complexity: O(m^2 n^3)

Training the QCFG
- Rough surrogates for translation performance:
  - How can we best model the target given the source?
  - How can we best match human alignments?
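To make the configuration taxonomy above concrete, here is a minimal, hypothetical Python sketch (not from the talk) of classifying the source-side configuration induced by one target dependency edge, given the source nodes that its head and child align to. The child-to-parent dictionary representation, the toy tree, and the exact c-command test are assumptions for illustration, not the authors' definitions.

```python
# Hypothetical sketch: classify the source-side configuration that one target
# dependency edge induces, given the source nodes its head and child align to.
# The source tree T1 is a dict mapping each node to its parent (root omitted).
# Category names follow the talk's taxonomy; the exact c-command test below
# is an assumption, not the authors' definition.

def proper_ancestors(node, parent):
    """Yield the proper ancestors of node, nearest first."""
    node = parent.get(node)
    while node is not None:
        yield node
        node = parent.get(node)

def configuration(head_src, child_src, parent):
    """Configuration of the source nodes aligned to a target head and child."""
    a, b = head_src, child_src
    if a is None or b is None:
        return "NULL-aligned"
    if a == b:
        return "same node"
    if parent.get(b) == a:
        return "parent-child"      # monotonic: dependency direction preserved
    if parent.get(a) == b:
        return "child-parent"      # head-swapping
    if parent.get(parent.get(b)) == a:
        return "grandparent"
    if parent.get(a) is not None and parent.get(a) == parent.get(b):
        return "siblings"
    if parent.get(a) in set(proper_ancestors(b, parent)):
        return "c-command"         # assumed reading of the c-command breakage
    return "none of the above"

# Toy source tree (abstract node names, not a real parse):
#   a -> b -> {c, d},  a -> e
toy_parent = {"b": "a", "c": "b", "d": "b", "e": "a"}
print(configuration("b", "c", toy_parent))  # parent-child
print(configuration("c", "b", toy_parent))  # child-parent
print(configuration("c", "d", toy_parent))  # siblings
print(configuration("a", "c", toy_parent))  # grandparent
print(configuration("e", "c", toy_parent))  # c-command (e's parent a dominates c)
print(configuration("c", "e", toy_parent))  # none of the above (directional reading)
```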
Experimental Setup
- German-English Europarl from SMT05; 1k, 10k, and 100k sentence pairs
- German side parsed with the Stanford parser
- EM training of monolingual/bilingual parameters
- For efficiency, select alignments in training (not test) from the IBM Model 4 union

Cross-Entropy Results
- (bar chart: cross-entropy at 1k, 10k, and 100k pairs for NULL, +parent-child, +child-parent, +same node, +all breakages, +siblings, +grandparent, +c-command; full numbers in the table at the end)

AER Results
- (bar chart: AER at 1k, 10k, and 100k pairs for parent-child, +child-parent, +same node, +all breakages, +siblings, +grandparent, +c-command; full numbers in the table at the end)

AER Comparison
- (bar chart comparing IBM Model 4 German-English, QG German-English, and IBM Model 4 English-German)

Conclusions
- Strict isomorphism hurts for both modeling translations and aligning bitext
- Breakages beyond local nodes help most
- "None of the above" beats simple head-swapping and 2-to-1 alignments
- Insignificant gains from a further breakage taxonomy

Continuing Research
- Senses of more than one word should help, while maintaining O(m^2 n^3)
- Further refining monolingual features on monolingual data
- Comparison to other synchronizers
- A decoder in progress uses the same direct model of P(T2, A | T1), globally normalized and discriminatively trained

Thanks
- David Yarowsky, Sanjeev Khudanpur, Noah Smith, Markus Dreyer, David Chiang, our reviewers, and the National Science Foundation

Synchronous Grammar as QG
- Target nodes correspond to 1 or 0 source nodes
- For every rule <X0, α0> ⇒ <X1, α1> ... <Xk, αk>:
  (∀ i ≠ j) αi ≠ αj, unless αi = NULL
  (∀ i > 0) αi is a child of α0 in T1, unless αi = NULL
- STSG and STAG operate on derivation trees
- Cf. Gildea's clone operation as a quasi-synchronous move

Say What You've Said: Projection
- Synchronous grammars can explain the source-target relation, but may need fancy formalisms and are harder to learn
- Align as many fragments as possible: explain fragmentariness when target-language requirements override
- Some regular phenomena: head-swapping, c-command (STAG), traces
- Pipeline: monolingual parser, word alignment, project to the other language
- Empirical model vs. decoding

P(T2, A | T1) via Synchronous Dependency Grammar
- How do you train? Just look at your synchronous corpus ... oops.
- Just look at your parallel corpus and infer the synchronous trees ... oops.
- Just look at your parallel corpus aligned by Giza and project dependencies over to infer synchronous tree fragments.
- But how do you project over many-to-one links? How do you resolve non-projective links in the projected version?
- And can't we use syntax to align better than Giza did, anyway?
- Deal with incompleteness in the alignments and unknown words (?)

Talking Points
- Get the advantages of a synchronous grammar without being so darn rigid/expensive: conditional distribution, alignment, and decoding all taking syntax into account
- What is the generative process? How are the probabilities determined from parameters in a way that combines monolingual and cross-lingual preferences?
- How are these parameters trained? Did it work?
- What are the most closely related ideas, and why is this one better?

Cross-Entropy Results (full table)

Configuration     CE at 1k   CE at 10k   CE at 100k
NULL                 60.86       53.28        46.94
+parent-child        43.82       22.40        13.44
+child-parent        41.27       21.73        12.62
+same node           41.01       21.50        12.38
+all breakages       35.63       18.72        11.27
+siblings            34.59       18.59        11.21
+grandparent         34.52       18.55        11.17
+c-command           34.46       18.59        11.27

AER Results (full table)

Configuration     AER at 1k   AER at 10k   AER at 100k
parent-child          40.69        39.03         33.62
+child-parent         43.17        39.78         33.79
+same node            43.22        40.86         34.38
+all breakages        37.63        30.51         25.99
+siblings             37.87        33.36         29.27
+grandparent          36.78        32.73         28.84
+c-command            37.04        33.51         27.45
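For reference on the AER numbers reported above, here is a minimal sketch (not part of the talk) of the standard Alignment Error Rate computation in the usual Och-and-Ney sense, with sure links S and possible links P (S ⊆ P); the representation of links as (source index, target index) pairs is an assumption.

```python
# Hypothetical helper: Alignment Error Rate (AER) in the usual Och & Ney sense.
# Links are (source_index, target_index) pairs; sure links are assumed to be a
# subset of the possible links.

def aer(predicted, sure, possible):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)."""
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure          # enforce S ⊆ P
    if not predicted and not sure:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))

# Recovering exactly the sure links (and nothing spurious) gives AER = 0.0.
print(aer(predicted={(0, 0), (1, 2)},
          sure={(0, 0), (1, 2)},
          possible={(0, 0), (1, 2), (2, 2)}))
```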