Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies

David A. Smith and Jason Eisner
Center for Language and Speech Processing
Department of Computer Science
Johns Hopkins University
Synchronous Grammars

[Figure: aligned dependency trees for “Im Anfang war das Wort” and “In the beginning was the word”]

Synchronous grammars elegantly model P(T1, T2, A), the joint distribution over source tree, target tree, and word alignment.

Conditionalizing for alignment or for translation (see the worked note below)
Training?
  Observe parallel trees?
  Impute trees/links?
  Project known trees…
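A worked note (my own elaboration of the slide's point, not part of the deck): conditionalizing the joint synchronous model gives the alignment and translation distributions.

% Joint synchronous model over source tree T1, target tree T2, alignment A: P(T_1, T_2, A)
% Conditionalizing for alignment (both trees observed):
P(A \mid T_1, T_2) = \frac{P(T_1, T_2, A)}{\sum_{A'} P(T_1, T_2, A')}
% Conditionalizing for translation (only the source tree observed):
P(T_2, A \mid T_1) = \frac{P(T_1, T_2, A)}{\sum_{T_2', A'} P(T_1, T_2', A')}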
Projection

[Figure: the parse of “Im Anfang war das Wort” projected onto “In the beginning was the word” via word alignments]

Train with bitext
Parse one side
Align words
Project dependencies
Many-to-one links?
Non-projective and circular dependencies?
Proposals in Hwa et al., Quirk et al., etc.
Divergent Projection

[Figure: projecting the parse of “Auf diese Frage habe ich leider keine Antwort bekommen” (plus NULL) onto “I did not unfortunately receive an answer to this question”; the projected links include monotonic, head-swapping, sibling, and null configurations]
Free Translation

[Figure: “Tschernobyl könnte dann etwas später an die Reihe kommen” (plus NULL) translated freely as “Then we could deal with Chernobyl some time later”; projection yields bad dependencies (parent-ancestors?)]
Dependency Menagerie
Overview

Divergent & Sloppy Projection
Modeling Motivation
Quasi-Synchronous Grammars (QG)
Basic Parameterization
Modeling Experiments
Alignment Experiments
QG by Analogy

[Figure: source/target sequence diagrams for each model]

HMM: noisy channel generating states
MEMM: direct generative model of states
CRF: undirected, globally normalized
Words with Senses

[Figure: the “senses” are now words of a particular (German) sentence, “Ich habe die Veröffentlichung / das Papier über … mit … präsentiert”, aligned to “I have presented the paper about … with …”; the intended sense of “paper” is Veröffentlichung (“I really mean ‘conference paper’”)]
Quasi-Synchronous Grammar

QG: a target-language grammar that generates translations of a particular source-language sentence.
A direct, conditional model of translation as P(T2, A | T1)
This grammar can be CFG, TSG, TAG, etc.
Generating QCFG from T1

U = target-language grammar nonterminals
V = nodes of the given source tree T1
Binarized QCFG: A, B, C ∈ U; α, β, γ ∈ 2^V (representation sketched in code below)
  <A, α> ⇒ <B, β> <C, γ>
  <A, α> ⇒ w

Present modeling restrictions
  |α| ≤ 1
  Dependency grammars (one node per word)
  Tie parameters that depend on α, β, γ
  “Model 1” property: reuse of senses. Why?
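A minimal sketch (my own Python illustration, not the authors' code) of one way to represent binarized QCFG rules under the |α| ≤ 1 restriction; all names are hypothetical.

from dataclasses import dataclass
from typing import Optional, Tuple

# A "sense" is an aligned node of the source tree T1 (here just an index),
# or None for NULL; the |alpha| <= 1 restriction means we store at most one
# such node per target nonterminal.
SrcNode = Optional[int]

@dataclass(frozen=True)
class QNonterminal:
    symbol: str     # target-language nonterminal A, B, C in U
    sense: SrcNode  # aligned source node alpha in V, or None

@dataclass(frozen=True)
class BinaryRule:             # <A, alpha> => <B, beta> <C, gamma>
    lhs: QNonterminal
    rhs: Tuple[QNonterminal, QNonterminal]

@dataclass(frozen=True)
class LexicalRule:            # <A, alpha> => w
    lhs: QNonterminal
    word: str

# Example: <NP, node 3> => <DT, NULL> <NN, node 3>  (sense "reuse" is allowed)
rule = BinaryRule(QNonterminal("NP", 3),
                  (QNonterminal("DT", None), QNonterminal("NN", 3)))
lex = LexicalRule(QNonterminal("NN", 3), "word")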
Modeling Assumptions

[Figure: “Im Anfang war das Wort” / “In the beginning was the word”, annotated with the assumptions below]

Tie parameters for all tokens of “im”
At most 1 sense per English word
Allow sense “reuse”
In dependency grammar: one node per word
Dependency Relations

[Figure: taxonomy of configurations between the source nodes of a target parent and its target child: parent-child, child-parent, same node, grandparent, siblings, c-command, …]
+ “none of the above”
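To make the taxonomy concrete, here is a small classifier sketch (my own illustration, not the authors' code); c-command is deliberately omitted and falls into “none of the above” here.

from typing import Dict, Optional

def configuration(src_of_parent: Optional[int],
                  src_of_child: Optional[int],
                  parent: Dict[int, Optional[int]]) -> str:
    """Classify the relation between the source nodes aligned to a target parent
    and its target child; `parent` maps each node of T1 to its parent (root -> None)."""
    a, b = src_of_parent, src_of_child
    if a is None or b is None:
        return "NULL"
    if a == b:
        return "same node"
    if parent.get(b) == a:
        return "parent-child"
    if parent.get(a) == b:
        return "child-parent"
    if parent.get(b) is not None and parent.get(parent[b]) == a:
        return "grandparent"
    if parent.get(a) is not None and parent.get(a) == parent.get(b):
        return "siblings"
    return "none of the above"

# Toy source tree: 2 is the root, with children 1 and 3; 3 has child 4.
tree = {1: 2, 3: 2, 4: 3, 2: None}
print(configuration(2, 1, tree))   # parent-child
print(configuration(1, 3, tree))   # siblings
print(configuration(2, 4, tree))   # grandparent
print(configuration(1, 4, tree))   # none of the above (a c-command case in the full taxonomy)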
QCFG Generative Story

[Figure: generating “I did not unfortunately receive an answer to this question” against the observed source tree of “Auf diese Frage habe ich leider keine Antwort bekommen” (plus NULL); annotated with the factors P(parent-child), P(breakage), P(I | ich), and P(PRP | no left children of “did”)]

Parsing and alignment run in O(m²n³) time (m source-tree nodes, n target words).
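A hedged sketch of how one attachment step in this story factors, in my own notation for the slide's three kinds of factors (monolingual dependency, cross-lingual configuration, lexical translation):

% t_c, t_p : target child and parent words;  s_c, s_p : their senses (source nodes or NULL)
P(\text{attach } t_c \text{ under } t_p) \approx
  \underbrace{P(t_c \mid t_p)}_{\text{monolingual dependency}} \times
  \underbrace{P\big(\text{config}(s_p, s_c)\big)}_{\text{parent-child, breakage, \dots}} \times
  \underbrace{P(t_c \mid s_c)}_{\text{translation, e.g. } P(I \mid ich)}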
Training the QCFG

Rough surrogates for translation performance:
  How can we best model the target given the source?
  How can we best match human alignments?
German-English Europarl from SMT05
  1k, 10k, 100k sentence pairs
  German parsed with the Stanford parser
  EM training of monolingual/bilingual parameters (toy E/M sketch below)
  For efficiency, select alignments in training (not test) from the IBM Model 4 union
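The EM here is over the QCFG's own parameters; as a self-contained stand-in (not the authors' trainer), this toy IBM-Model-1-style loop shows the same E-step/M-step alternation for a bilingual word-translation table, with made-up data.

from collections import defaultdict

# Tiny made-up bitext, standing in for Europarl sentence pairs.
bitext = [
    ("das wort".split(), "the word".split()),
    ("das buch".split(), "the book".split()),
]

t = defaultdict(lambda: 1.0)   # P(target word | source word), uniform start

for _ in range(25):
    counts = defaultdict(float)
    totals = defaultdict(float)
    # E-step: expected alignment counts under the current parameters
    for src, tgt in bitext:
        for e in tgt:
            z = sum(t[(e, f)] for f in src)
            for f in src:
                p = t[(e, f)] / z
                counts[(e, f)] += p
                totals[f] += p
    # M-step: renormalize expected counts into new translation probabilities
    for (e, f), c in counts.items():
        t[(e, f)] = c / totals[f]

print(t[("word", "wort")])   # rises toward 1.0 as EM sharpens the table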
Cross-Entropy Results

[Bar chart: cross-entropy at 1k, 10k, and 100k sentence pairs for NULL, +parent-child, +child-parent, +same node, +all breakages, +siblings, +grandparent, +c-command; full numbers in the table at the end of the deck]
AER Results

[Bar chart: alignment error rate at 1k, 10k, and 100k sentence pairs for parent-child, +child-parent, +same node, +all breakages, +siblings, +grandparent, +c-command; full numbers in the table at the end of the deck]
AER Comparison

[Bar chart comparing AER for IBM Model 4 German-English, QG German-English, and IBM Model 4 English-German]
Conclusions

Strict isomorphism hurts for
  modeling translations
  aligning bitext
Breakages beyond local nodes help most
  “None of the above” beats simple head-swapping and 2-to-1 alignments
  Insignificant gains from further breakage taxonomy
Continuing Research

Senses of more than one word should help
  Maintaining O(m²n³)
Further refining monolingual features on monolingual data
Comparison to other synchronizers
Decoder in progress uses the same direct model of P(T2, A | T1)
  Globally normalized and discriminatively trained
Thanks

David Yarowsky
Sanjeev Khudanpur
Noah Smith
Markus Dreyer
David Chiang
Our reviewers
The National Science Foundation
Synchronous Grammar as QG

Target nodes correspond to 1 or 0 source nodes
∀ <X0, α0> ⇒ <X1, α1> … <Xk, αk>
  (∀ i ≠ j) αi ≠ αj unless αi = NULL
  (∀ i > 0) αi is a child of α0 in T1, unless αi = NULL
STSG, STAG operate on derivation trees
Cf. Gildea’s clone operation as a quasi-synchronous move
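As a concrete reading of these restrictions (my own sketch with hypothetical names), a rule <X0, α0> ⇒ <X1, α1> … <Xk, αk> can be checked like this:

from typing import Dict, List, Optional

def is_strictly_synchronous(alphas: List[Optional[int]],
                            children: Dict[int, List[int]]) -> bool:
    """alphas = [a0, a1, ..., ak]: source nodes (or None for NULL) of the rule
    <X0, a0> => <X1, a1> ... <Xk, ak>; children maps each node of T1 to its children."""
    a0, rest = alphas[0], alphas[1:]
    non_null = [a for a in rest if a is not None]
    # (for all i != j) ai != aj unless ai = NULL
    if len(non_null) != len(set(non_null)):
        return False
    # (for all i > 0) ai is a child of a0 in T1, unless ai = NULL
    allowed = set(children.get(a0, [])) if a0 is not None else set()
    return all(a in allowed for a in non_null)

# Source node 2 has children [1, 3] in T1:
tree_children = {2: [1, 3]}
print(is_strictly_synchronous([2, 1, 3], tree_children))   # True
print(is_strictly_synchronous([2, 5, 3], tree_children))   # False: 5 is not a child of 2
print(is_strictly_synchronous([2, 3, 3], tree_children))   # False: repeated non-NULL sense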
Say What You’ve Said

Projection

Synchronous grammars can explain the source-target relation
  May need fancy formalisms, which are harder to learn
  Align as many fragments as possible: explain fragmentariness when target-language requirements override
  Some regular phenomena: head-swapping, c-command (STAG), traces
Monolingual parser
Word alignment
Project to other language
Empirical model vs. decoding
P(T2, A | T1) via synchronous dependency grammar
  How do you train?
  Just look at your synchronous corpus … oops.
  Just look at your parallel corpus and infer the synchronous trees … oops.
  Just look at your parallel corpus aligned by Giza and project dependencies over to infer synchronous tree fragments.
  But how do you project over many-to-one links? How do you resolve non-projective links in the projected version? And can’t we use syntax to align better than Giza did, anyway?
  Deal with incompleteness in the alignments, unknown words (?)
Talking Points

Get advantages of a synchronous grammar without being so darn rigid/expensive: conditional distribution, alignment, decoding all taking syntax into account
What is the generative process?
How are the probabilities determined from parameters in a way that combines monolingual and cross-lingual preferences?
How are these parameters trained?
Did it work?
What are the most closely related ideas, and why is this one better?
Cross-Entropy Results

Configuration     CE at 1k   CE at 10k   CE at 100k
NULL                 60.86       53.28        46.94
+parent-child        43.82       22.40        13.44
+child-parent        41.27       21.73        12.62
+same node           41.01       21.50        12.38
+all breakages       35.63       18.72        11.27
+siblings            34.59       18.59        11.21
+grandparent         34.52       18.55        11.17
+c-command           34.46       18.59        11.27
AER Results

Configuration     AER at 1k   AER at 10k   AER at 100k
parent-child          40.69        39.03         33.62
+child-parent         43.17        39.78         33.79
+same node            43.22        40.86         34.38
+all breakages        37.63        30.51         25.99
+siblings             37.87        33.36         29.27
+grandparent          36.78        32.73         28.84
+c-command            37.04        33.51         27.45