Learning Transfer Rules for Machine
Translation with Limited Data
Thesis Defense
Katharina Probst
Committee:
Alon Lavie (Chair)
Jaime Carbonell
Lori Levin
Bonnie Dorr, University of Maryland
Introduction (I)
• Why has Machine Translation been applied to only a few
language pairs?
– Bilingual corpora available only for a few language pairs
(English-French, Japanese-English, etc.)
– Natural Language Processing tools available only for a
few languages (English, German, Spanish, Japanese, etc.)
– Scaling to other languages is often difficult, time-consuming, and knowledge-intensive
• What can we do to change this?
2
Introduction (II)
• This thesis presents a framework for automatic inference
of transfer rules
• Transfer rules capture syntactic and morphological
mappings between languages
• Learned from small, word-aligned training corpus
• Rules are learned for unbalanced language pairs, where
more data and tools are available for one language (L1)
than for the other (L2)
3
Training Data Example
SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt
    [the interest the widespread in the election]
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3)
        (<PP> (PREP in-4)
         (<NP> (DET the-5) (N election-6))))
[Diagram: the corresponding parse tree of the English NP]
4
Transfer Rule Formalism
;;L2: h &niin h rxb b h bxirwt                  (training example)
;;L1: the widespread interest in the election
NP::NP                                          (rule type)
["h" N "h" Adj PP] -> ["the" Adj N PP]          (component sequences)
((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)   (component alignments)
 ((Y3 num) = (X2 num))                          (agreement constraint)
 ((X2 num) = sg)                                (value constraints)
 ((X2 gen) = m))
5
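As an illustration only (not part of the thesis), the rule above can be viewed as a small data structure. A minimal Python sketch, with invented field names, might look like this:

from dataclasses import dataclass, field

@dataclass
class TransferRule:
    # One transfer rule: a context-free backbone plus unification constraints.
    rule_type: str        # e.g. "NP::NP" (source and target constituent types)
    source_seq: list      # L2 component sequence, e.g. ['"h"', 'N', '"h"', 'Adj', 'PP']
    target_seq: list      # L1 component sequence, e.g. ['"the"', 'Adj', 'N', 'PP']
    alignments: list = field(default_factory=list)   # (x_index, y_index) pairs, 1-based
    constraints: list = field(default_factory=list)  # agreement and value constraints

# The NP rule shown above, written out as data:
np_rule = TransferRule(
    rule_type="NP::NP",
    source_seq=['"h"', 'N', '"h"', 'Adj', 'PP'],
    target_seq=['"the"', 'Adj', 'N', 'PP'],
    alignments=[(1, 1), (2, 3), (3, 1), (4, 2), (5, 4)],
    constraints=["((Y3 num) = (X2 num))", "((X2 num) = sg)", "((X2 gen) = m)"],
)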
Research Goals (I)
1. Develop a framework for learning transfer rules from
bilingual data
• Training corpus: set of sentences/phrases in one
language with translation into other language
(= bilingual corpus), word-aligned
• Rules include a) a context-free backbone and b)
unification constraints
2. Improve the grammaticality of MT output with
automatically learned rules
• Learned rules improve translation quality in run-time
system
6
Research Goals (II)
3. Learn rules in the absence of a parser for one of the
languages
• Infer syntactic knowledge about minor language
using a) projection from major language, b) analysis
of word alignments, c) morphology information, and
d) bilingual dictionary
4. Combine a set of different knowledge sources in a
meaningful way
• Resources (parser, morphology modules, dictionary,
etc.) often disagree
• Combine conflicting knowledge sources
7
Research Goals (III)
5. Address limited-data scenarios with 'frugal' techniques
• "Unbalanced" language pairs with little or no
bilingual data
• Training corpus is small (~120 sentences and
phrases), but carefully designed
6. Push MT research in the direction of incorporating
syntax into statistical-based systems
• Infer highly involved linguistic information,
incorporate with statistical decoder in hybrid system
8
Thesis Statement (I)
• Given bilingual, word-aligned data, and given a parser
for one of the languages in the translation pair, we can
learn a set of syntactic transfer rules for MT.
• The rules consist of a context-free backbone and
unification constraints, learned in two separate stages.
• The resulting rules form a syntactic translation grammar
for the language pair and are used in a statistical transfer
system to translate unseen examples.
9
Thesis Statement (II)
• The translation quality of a run-time system that uses the
learned rules is
– superior to a system that does not use the learned
rules
– comparable to the performance using a small manual
grammar written by an expert
on Hebrew->English and Hindi->English translation
tasks.
• The thesis presents a new approach to learning transfer
rules for Machine Translation in that the system learns
syntactic models from text in a novel way and in a rich
hypothesis space, aiming at emulating a human
grammar writer.
10
Talk Overview
• Setting the Stage: related work, system overview,
training data
• Rule Learning
– Step 1: Seed Generation
– Step 2: Compositionality
– Step 3: Unification Constraints
• Experimental Results
• Conclusion
11
Related Work: MT Overview
[Diagram: MT approaches by depth of analysis of the source language —
semantics-based MT analyzes meaning, syntax-based MT analyzes structure,
statistical MT and EBMT analyze the word sequence — before generating the
target language.]
12
Related Work (I)
• Traditional transfer-based MT: analysis, transfer,
generation (Hutchins and Somers 1992, Senellart et al.
2001)
• Data-driven MT:
– EBMT: store database of examples, possibly
generalized (Sato and Nagao 1990, Brown 1997)
– SMT: usually noisy channel model: translation model
+ target language model (Vogel et al. 2003, Och and
Ney 2002, Brown 2004)
• Hybrid (Knight et al. 1995, Habash and Dorr 2002)
13
Related Work (II)
• Structure/syntax for MT
– EBMT (Alshawi et al. 2000, Watanabe et al. 2002)
– SMT (Yamada and Knight 2001, Wu 1997)
– Other approaches (Habash and Dorr 2002, Menezes
and Richardson 2001)
• Learning from elicited data / small datasets (Nirenburg
1998, McShane et al 2003, Jones and Havrilla 1998)
14
Training Data Example
SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt
    [the interest the widespread in the election]
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3)
        (<PP> (PREP in-4)
         (<NP> (DET the-5) (N election-6))))
[Diagram: the corresponding parse tree of the English NP]
15
Transfer Rule Formalism
;;L2: h &niin h rxb b h bxirwt                  (training example)
;;[the interest the widespread in the election]
;;L1: the widespread interest in the election
NP::NP                                          (rule type)
["h" N "h" Adj PP] -> ["the" Adj N PP]          (component sequences)
((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)   (component alignments)
 ((Y3 num) = (X2 num))                          (agreement constraint)
 ((X2 num) = sg)                                (value constraints)
 ((X2 gen) = m))
16
Training Data Collection
• Elicitation Corpora
– Generally designed to cover major linguistic
phenomena
– Bilingual user translates and word aligns
• Structural Elicitation Corpus
– Designed to cover a wide variety of structural
phenomena (Probst and Lavie 2004)
– 120 sentences and phrases
– Targeting specific constituent types: AdvP, AdjP, NP,
PP, SBAR, S with subtypes
– Translated into Hebrew, Hindi
17
Resources
• L1 parses: Either from statistical parser (Charniak 1999),
or use data from Penn Treebank
• L1 morphology: Can be obtained or created (I created
one for English)
• L1 language model: Trained on a large amount of
monolingual data
• L2 morphology: If available, use morphology module. If
not, use automated techniques, such as (Goldsmith
2001) or (Probst 2003).
• Bilingual lexicon: gives word-level correspondences,
created from training data or previously existing
18
Development and
Testing Environment
• Syntactic transfer engine: takes rules and lexicon and
produces all possible partial translations
• Statistical decoder: uses word-to-word probabilities and
TL language model to extract best combination of partial
translations (Vogel et al. 2003)
19
System Overview
[Diagram: At training time, the bilingual training data, L1 parses &
morphology, L2 morphology, and the bilingual lexicon feed the Rule Learner,
which produces the learned rules. At run time, the Transfer Engine applies
the learned rules and the bilingual lexicon to the L2 test data and builds a
lattice; the Statistical Decoder uses the L1 language model to select the
final translation.]
20
Overview of Learning Phases
1. Seed Generation: create initial guesses at rules based
on specific training examples
2. Compositionality: add context-free structure to rules so
that rules can combine
3. Constraint Learning: learn appropriate unification
constraints
21
Seed Generation
• “Training example in rule format”
• Produce rules that closely reflect training examples
• But: generalize to POS level when words are 1-1
aligned
• Rules are fully functional, but little generalization
• Seed rules are intended as input for later two learning
phases
22
Seed Generation –
Sample Learned Rule
;;L2: TKNIT H @IPWL H HTNDBWTIT
;;[   plan  the care  the voluntary]
;;L1: THE VOLUNTARY CARE PLAN
;;C-Structure: (<NP> (DET the-1)
   (<ADJP> (ADJ voluntary-2))
   (N care-3) (N plan-4))
NP::NP [N "H" N "H" ADJ] -> ["THE" ADJ N N]
(
(X1::Y4)
(X3::Y3)
(X5::Y2)
)
23
Seed Generation Algorithm
• For a given training example, produce a seed rule
• For all 1-1 aligned words, enter the POS tag (e.g. “N”)
into component sequences
– Get POS tags from morphology module and parse
– Hypothesis: on unseen data, any words of this POS
can fill this slot
• For all not 1-1 aligned words, put actual words in
component sequences
• The L2 and L1 types are the parse root's label
• Derive alignments from the training example (see the
sketch below)
24
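A rough Python sketch of the seed-generation step described above (an illustration under assumed data structures, not the system's actual code; the real algorithm also compares the L1 and L2 POS tags before generalizing):

def generate_seed_rule(example):
    # `example` is assumed to carry: l2_words, l1_words, l2_pos, l1_pos,
    # alignment as 1-based (l2_index, l1_index) pairs, and parse_root
    # (the root label of the L1 parse).
    l2_links = [0] * (len(example.l2_words) + 1)
    l1_links = [0] * (len(example.l1_words) + 1)
    for i, j in example.alignment:
        l2_links[i] += 1
        l1_links[j] += 1
    # Words that participate in exactly one alignment link on both sides.
    one_to_one_l2 = {i for i, j in example.alignment if l2_links[i] == 1 and l1_links[j] == 1}
    one_to_one_l1 = {j for i, j in example.alignment if l2_links[i] == 1 and l1_links[j] == 1}

    def component(words, pos_tags, one_to_one):
        seq = []
        for idx, word in enumerate(words, start=1):
            if idx in one_to_one:
                seq.append(pos_tags[idx - 1])      # generalize to the POS tag
            else:
                seq.append('"%s"' % word)          # keep the literal word
        return seq

    return {
        "type": "%s::%s" % (example.parse_root, example.parse_root),
        "l2_seq": component(example.l2_words, example.l2_pos, one_to_one_l2),
        "l1_seq": component(example.l1_words, example.l1_pos, one_to_one_l1),
        "alignments": list(example.alignment),
    }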
Compositionality
• Generalize seed rules to reflect structure
• Infer a partial constituent grammar for L2
• Rules map mixture of
– Lexical items (LIT)
– Parts of speech (PT)
– Constituents (NT)
• Analyze L1 parse to find generalizations
• Produced rules are context-free
25
Compositionality Example
;;L2: $ BTWK H M&@PH HIH $M
;;[ that inside the envelope was name]
;;L1: THAT INSIDE THE ENVELOPE WAS A NAME
;;C-Structure: (<SBAR> (SUBORD that-1)
   (<SINV> (<PP> (PREP inside-2)
     (<NP> (DET the-3) (N envelope-4)))
    (<VP> (V was-5))
    (<NP> (DET a-6) (N name-7))))
SBAR::SBAR
[SUBORD PP V NP] -> [SUBORD PP V NP]
(
(X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4)
)
26
Basic Compositionality
Algorithm
• Traverse parse tree in order to partition sentence
• For each sub-tree, if there is a previously learned rule that
can account for the subtree and its translation, introduce a
compositional element
• Compositional element: the subtree's root label for both L1
and L2
• Adjust alignments
• Note: preference for maximal generalization, because the
tree is traversed from the top (see the sketch below)
27
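A minimal sketch of the top-down traversal just described, assuming hypothetical helpers extract_chunks (the L1/L2 chunks covered by a subtree, via the word alignments), can_translate (whether previously learned rules translate the L2 chunk into the L1 chunk), and a replace_span method on the rule:

def compositionalize(seed_rule, parse_root, example, learned_rules):
    # Replace spans of the seed rule by constituent labels where lower-level
    # rules already account for the corresponding chunk pair.
    def visit(node):
        l1_chunk, l2_chunk = extract_chunks(node, example)
        if node is not parse_root and can_translate(l2_chunk, l1_chunk, learned_rules):
            # Introduce a compositional element: the subtree's root label on
            # both sides, then stop; the covered subtree is not traversed.
            seed_rule.replace_span(l1_chunk, l2_chunk, label=node.label)
            return
        for child in node.children:        # otherwise continue top-down
            visit(child)

    visit(parse_root)
    return seed_rule    # alignments would also be adjusted in the real system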
Maximum
Compositionality
• Assume that lower-level rules exist; this assumption is
correct if the training data is completely compositional
• Introduce compositional elements for direct children of
parse root node
• Results in higher level of compositionality, thus higher
generalization power
• Can overgeneralize, but because of strong decoder
generally preferable
28
Other Advanced
Compositionality
Techniques
• Techniques that allow generalizing words that are not 1-1
aligned to the POS level
• Techniques that enhance the dictionary based on
training data
• Techniques that deal with noun compounds
• Rule filters to ensure that no learned rules violate axioms
29
Constraint Learning
• Annotate context-free compositional rules with
unification constraints
a) limit applicability of rules to certain contexts
(thereby limiting parsing ambiguity)
b) ensure the passing of a feature value from source
to target language (thereby limiting transfer
ambiguity)
c) disallow certain target language outputs (thereby
limiting generation ambiguity)
• Value constraints and agreement constraints are
learned separately
30
Constraint Learning Overview
1. Introduce basic constraints: use morphology
module(s) and parses to introduce constraints
for words in the training example (see the sketch
below)
2. Create agreement constraints (where
appropriate) by merging basic constraints
3. Retain appropriate value constraints: help in
restricting a rule to some contexts or
restricting output
31
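For step 1, a small sketch of how basic value constraints might be read off a morphology module (the morphology interface here is an assumption, not the actual one):

def basic_constraints(example, morph_features):
    # morph_features(word) is assumed to return a feature dictionary,
    # e.g. morph_features("ILDIM") -> {"num": "pl", "gen": "m"}.
    constraints = []
    for i, word in enumerate(example.l2_words, start=1):      # X side = L2
        for feature, value in morph_features(word).items():
            constraints.append("((X%d %s) = %s)" % (i, feature, value))
    for j, word in enumerate(example.l1_words, start=1):      # Y side = L1
        for feature, value in morph_features(word).items():
            constraints.append("((Y%d %s) = %s)" % (j, feature, value))
    return constraints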
Constraint Learning –
Agreement Constraints (I)
• For example: in an NP, do the adjective and the noun
agree in number?
• In Hebrew, "the good boys":
  Correct:   H            ILDIM     @WBIM
             the.det.def  boy.pl.m  good.pl.m
             "the good boys"
  Incorrect: H            ILDIM     @WB
             the.det.def  boy.pl.m  good.sg.m
             "the good boys"
32
Constraint Learning –
Agreement Constraints (II)
• E.g. number in a determiner and the corresponding noun
• Use a likelihood ratio test to determine what value
constraints can be merged into agreement constraints
• The log-likelihood ratio is defined by proposing
distributions that could have given rise to the data:
– Null Hypothesis: The values are independently
distributed.
– Alternative Hypothesis: The values are not
independently distributed.
• For sparse data, use heuristic test: if more evidence for
than against agreement constraint
33
Constraint Learning –
Agreement Constraints (III)
• Collect all instances in the training data where an
adjective and a noun mark for number
• Count how often the feature value is the same, how
often different
• Feature values are distributed by
– two multinomial distributions (if they are independent,
i.e. the null hypothesis)
– one multinomial distribution (if they should agree, i.e.
the alternative hypothesis)
• Compute the log-likelihood under each scenario and perform
the LL ratio or heuristic test (see the sketch below)
• Generalize to the cross-lingual case
34
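By way of illustration only, the decision could be sketched as below. The counts are assumed to have been collected as described above; for simplicity this sketch compares agreement against chance at 0.5, whereas the thesis compares one joint multinomial against two independent multinomials:

import math

def prefers_agreement(count_same, count_diff, min_n=10, threshold=3.84):
    # count_same / count_diff: how often the two feature values were equal /
    # different across the collected training instances.
    n = count_same + count_diff
    if n == 0:
        return False
    if n < min_n:
        # Sparse data: heuristic test, more evidence for than against agreement.
        return count_same > count_diff
    p_same = count_same / n
    ll_alt = (count_same * math.log(max(p_same, 1e-12)) +
              count_diff * math.log(max(1.0 - p_same, 1e-12)))
    ll_null = n * math.log(0.5)
    # Likelihood-ratio statistic against a rough chi-square cutoff (p = 0.05, 1 df).
    return 2.0 * (ll_alt - ll_null) > threshold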
Constraint Learning –
Value Constraints
;;L2: ild @wb
;;[ boy good]
;;L1: a good boy
NP::NP [N ADJ] -> ["A" ADJ N]
(...
((X1 NUM) = SG)
((X2 NUM) = SG)
...)

;;L2: ildim t@wbim
;;[ boys good]
;;L1: good boys
NP::NP [N ADJ] -> [ADJ N]
(...
((X1 NUM) = PL)
((X2 NUM) = PL)
...)

Retain value constraints to distinguish
35
Constraint Learning –
Value Constraints
• Retain those value constraints that determine the
structure of the L2 translation
• If there are two rules with
– different L2 component sequences
– the same L1 component sequence
– that differ in only a value constraint
• Retain the value constraint to distinguish them
36
Constraint Learning –
Sample Learned Rule
;;L2: ANI AIN@LIGN@I
;;[
I
intelligent]
;;L1: I AM INTELLIGENT
S::S
[NP ADJP] -> [NP “AM” ADJP]
(
(X1::Y1) (X2::Y3)
((X1 NUM) = (X2 NUM))
((Y1 NUM) = (X1 NUM))
((Y1 PER) = (X1 PER))
(Y0 = Y2)
)
37
Dimensions of Evaluation
• Learning Phases / Settings: default, Seed Generation
only, Compositionality, Constraint Learning
• Evaluation: rule-based evaluation + pruning
• Test Corpora: TestSet, TestSuite
• Run-time Settings: length limit
• Portability: Hindi→English translation
38
Test Corpora
• Test corpora:
1. Test Corpus: Newspaper text (Haaretz): 65
sentences, 1 reference translation
2. Test Suite: specific phenomena: 138 sentences, 1
reference translation
3. Hindi: 245 sentences, 4 reference translations
• Compare: statistical system only, system with manually
written grammar, system with learned grammar
• Manually written grammar: written by expert within about
a month (both Hebrew and Hindi)
39
Test Corpus Evaluation,
Default Settings (I)
Grammar                              BLEU     METEOR
No Grammar                           0.0565   0.3019
Manual Grammar                       0.0817   0.3241
Learned Grammar (With Constraints)   0.078    0.3293
40
Test Corpus Evaluation,
Default Settings (II)
Learned grammar performs statistically significantly better
than baseline
• Performed one-tailed paired t-test
• BLEU with resampling:
t-value: 81.98, p-value:0 (df=999)
→ Significant at 100% confidence level
Median of differences: -0.0217 with 95% confidence
interval [-0.0383,-0.0056]
• METEOR:
t-value: 1.73, p-value: 0.044 (df=61)
→ Significant at higher than 95% confidence level
41
Test Corpus Evaluation,
Default Settings (III)
42
Test Corpus Evaluation,
Different Settings (I)
Grammar                              BLEU     METEOR
No Grammar                           0.0565   0.3019
Manual Grammar                       0.0817   0.3241
Learned Grammar (Seed Generation)    0.0741   0.3239
Learned Grammar (Compositionality)   0.0777   0.3360
Learned Grammar (With Constraints)   0.078    0.3293
43
Test Corpus Evaluation,
Different Settings (II)
System times in seconds, lattice sizes:
Grammar                              Transfer Engine (s)   Lattice size (MB)   Decoder (s)
Learned Grammar (Compositionality)   54.98                 187                 3123.38
Learned Grammar (With Constraints)   33.28                 140                 2287.47
→ ~ 20% reduction in lattice size!
44
Evaluation with
Rule Scoring (I)
• Estimate the translation power of the rules
• Use training data: most training examples are actually
unseen data for a given rule
• Match each arc against the reference translation
• A rule's score is the average of all its arcs' scores
• Order the rules by precision score, prune
• Goal of rule scoring: limit run-time
• Note trade-off with decoder power (see the sketch below)
45
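A rough sketch of the scoring-and-pruning loop (translate_arcs and arc_matches_reference are hypothetical helpers standing in for the transfer engine and the matching step):

def score_and_prune(rules, training_data, keep_fraction=0.75):
    # Score each rule by the precision of its arcs, then keep the best fraction.
    scored = []
    for rule in rules:
        arc_scores = []
        for example in training_data:
            for arc in translate_arcs(rule, example):   # partial translations
                arc_scores.append(
                    1.0 if arc_matches_reference(arc, example.reference) else 0.0)
        if arc_scores:                                   # rule fired at least once
            scored.append((sum(arc_scores) / len(arc_scores), rule))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # order by precision score
    cutoff = int(len(scored) * keep_fraction)
    return [rule for _, rule in scored[:cutoff]]         # prune the rest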
Evaluation with
Rule Scoring (II)
Grammar                  BLEU     ModBLEU   METEOR
No Grammar               0.0565   0.1362    0.3019
Manual Grammar           0.0817   0.1546    0.3241
Learned Grammar (25%)    0.0565   0.1362    0.3019
Learned Grammar (50%)    0.0592   0.1389    0.3075
Learned Grammar (75%)    0.0800   0.1533    0.3296
Learned Grammar (full)   0.078    0.1524    0.3293
46
Evaluation with Rule
Scoring (III)
Grammar                  TrEngine   LatticeSize   Decoder
Learned Grammar (25%)    1.02       330342        22.55
Learned Grammar (50%)    1.81       13431206      189.89
Learned Grammar (75%)    5.91       29242597      397.06
Learned Grammar (full)   33.28      149713589     2287.47
47
Test Suite Evaluation (I)
• Test suite designed to target specific constructions
– Conjunctions of PPs
– Adverb phrases
– Reordering of adjectives and nouns
– AdjP embedded in NP
– Possessives
–…
• Designed in English, translated into Hebrew
• 138 sentences, one reference translation
48
Test Suite Evaluation (II)
Grammar           BLEU     METEOR
Baseline          0.0746   0.4146
Manual grammar    0.1179   0.4471
Learned Grammar   0.1199   0.4655
49
Test Suite Evaluation (III)
Learned grammar performs statistically significantly better than
baseline
• Performed one-tailed paired t-test
• BLEU with resampling:
t-value: 122.53, p-value:0 (df=999)
→ Statistically significantly better at 100% confidence level
Median of differences: -0.0462 with 95% confidence
interval [-0.0721,-0.0245]
• METEOR:
t-value: 47.20, p-value: 0.0 (df=137)
→ Statistically significantly better at 100% confidence level
50
Test Suite Evaluation (IV)
51
Hindi-English
Portability Test (I)
Grammar           BLEU     METEOR
Baseline          0.1003   0.3659
Manual grammar    0.1052   0.3676
Learned Grammar   0.1033   0.3685
52
Hindi-English
Portability Test (II)
Learned grammar performs statistically significantly better than
baseline
• Performed one-tailed paired t-test
• BLEU with resampling:
t-value: 37.20, p-value:0 (df=999)
→ Statistically significantly better at 100% confidence level
Median of differences: -0.0024 with 80% confidence
interval [-0.0052,0.0001]
• METEOR:
t-value: 1.72, p-value: 0.043 (df=244)
→ Statistically significantly better at higher than 95%
confidence level
53
Hindi-English
Portability Test (III)
54
Discussion of Results
• Performance superior to standard SMT system
• Learned grammar comparable to manual grammar
• Learned grammar: higher METEOR score, indicating
that it is more general
• Constraints: slightly lower performance in exchange for
higher run-time efficiency
• Pruning: slightly lower performance in exchange for
higher run-time efficiency
55
Conclusions and
Contributions
1. Framework for learning transfer rules from bilingual
data
2. Improvement of translation output in hybrid transfer
and statistical system
3. Addressing limited-data scenarios with 'frugal'
techniques
4. Combining different knowledge sources in a
meaningful way
5. Pushing MT research in the direction of incorporating
syntax into statistical-based systems
6. Human-readable rules that can be improved by an
expert
56
Summary
“Take a bilingual word-aligned corpus, and learn transfer
rules with constituent transfer and unification
constraints.”
“Is it a big corpus?”
“Ahem. No.”
“Do I have a parser for both languages?”
“No, just for one.”
“… So I can use a dictionary, morphology modules, a
parser … But these are all imperfect resources. How
do I combine them?”
“We can do it!”
“Ok.”
57
Thank you!
58
Additional Slides
59
References (I)
Ayan, Fazil, Bonnie J. Dorr, and Nizar Habash. Application
of Alignment to Real-World Data: Combining Linguistic
and Statistical Techniques for Adaptable MT.
Proceedings of AMTA-2004.
Baldwin, Timothy and Aline Villavicencio. 2002. Extracting
the Unextractable: A case study on verb-particles.
Proceedings of CoNLL-2002.
Brown, Ralf D., A Modified Burrows-Wheeler Transform for
Highly-Scalable Example-Based Translation,
Proceedings of AMTA-2004.
Charniak, Eugene, Kevin Knight and Kenji Yamada. 2003.
Syntax-based Language Models for Statistical Machine
Translation. Proceedings of MT-Summit IX.
60
References (II)
Hutchins, John W. and Harold L. Somers. 1992. An
Introduction to Machine Translation. Academic Press,
London.
Jones, Douglas and R. Havrilla. Twisted Pair Grammar:
Support for Rapid Development of Machine Translation
for Low Density Languages. Proceedings of AMTA-98.
Menezes, Arul and Stephen D. Richardson. A best-first
alignment algorithm for automatic extraction of transfer
mappings from bilingual corpora. Proceedings of the
Workshop on Data-driven Machine Translation at ACL-2001.
Nirenburg, Sergei. Project Boas: A Linguist in the Box as a
Multi-Purpose Language Resource. Proceedings of
LREC-98.
61
References (III)
Orasan, Constantin and Richard Evans. 2001. Learning to
identify animate references. Proceedings of CoNLL-2001.
Probst, Katharina. 2003. Using ‘smart’ bilingual projection
to feature-tag a monolingual dictionary. Proceedings of
CoNLL-2003.
Probst, Katharina and Alon Lavie. A Structurally Diverse
Minimal Corpus for Eliciting Structural Mappings
between Languages. Proceedings of AMTA-04.
Probst, Katharina and Lori Levin. 2002. Challenges in
Automated Elicitation of a Controlled Bilingual Corpus.
Proceedings of TMI-02.
62
References (IV)
Senellart, Jean, Mirko Plitt, Christophe Bailly, and
Francoise Cardoso. 2001. Resource Alignment and
Implicit Transfer. Proceedings of MT-Summit VIII.
Vogel, Stephan and Alicia Tribble. 2002. Improving
Statistical Machine Translation for a Speech-to-Speech
Translation Task. Proceedings of ICSLP-2002.
Watanabe, Hideo, Sadao Kurohashi, and Eiji Aramaki.
2000. Finding Structural Correspondences from Bilingual
Parsed Corpus for Corpus-based Translation.
Proceedings of COLING-2000.
63
Log-likelihood test for
agreement constraints (I)
• create list of all possible index pairs that should be
considered for an agreement constraint:
• L1 only constraints:
– list of all head-head pairs that ever occur with the
same feature (not necessarily same value), and all
head-nonheads in the same constituent that occur
with the same feature (not necessarily same value).
– For example, possible agreement constraint: Num
agreement between Det and N in a NP where the Det
is a dependent of N
• L2 only constraints: same as L1 only constraints above.
• L2→L1 constraints: all situations where two aligned
indices mark the same feature
64
Log-likelihood test for
agreement constraints (II)
• Hypothesis 0: The values are independently distributed.
• Hypothesis 1: The values are not independently
distributed.
• Under the null hypothesis:
• Under the alternative hypothesis:
where ind is 1 if v_{i1} = v_{i2} and 0 otherwise.
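The two likelihood formulas were rendered as images and did not survive extraction; a plausible reconstruction from the surrounding definitions (an assumption, not necessarily the thesis's exact formulation) is:

% Hedged reconstruction: under H0 the two feature values come from independent
% multinomials; under H1 the second value is tied to the first via the indicator.
P_{H_0}(v_{i_1}, v_{i_2}) = P(v_{i_1})\, P(v_{i_2})
\qquad
P_{H_1}(v_{i_1}, v_{i_2}) = P(v_{i_1})\, \mathit{ind}(v_{i_1}, v_{i_2}),
\quad
\mathit{ind} = \begin{cases} 1 & \text{if } v_{i_1} = v_{i_2} \\ 0 & \text{otherwise} \end{cases}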
65
Log-likelihood test for
agreement constraints (III)
• i1 and i2 are drawn from a multinomial distribution,
where c_{vi} is the number of times the value v_i was
encountered for the given feature (e.g. PERS), and k is
the number of possible values for the feature (e.g. 1st,
2nd, 3rd).
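The formula itself was an image; a hedged reconstruction of the maximum-likelihood multinomial estimate implied by the definitions above:

P(v_i) = \frac{c_{v_i}}{\sum_{j=1}^{k} c_{v_j}}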
• If there is strong evidence against Hypothesis 0 (i.e. the
values are not independent), introduce an agreement constraint
• For cases where there is not enough evidence either
way (n < 10), use the heuristic test
66
Lexicon Enhancement for Hebrew
Adverbs (I)
• Example 1: "B" "$MX" → "happily"
• Example 2: "IWTR" "GBWH" → "taller"
• These are not necessarily in the dictionary
• Both processes are productive
• How can we add these and similar entries to the lexicon?
Automatically?
67
Lexicon Enhancement for
Hebrew Adverbs (II)
For all 1-2 (L1-L2) alignments in training data {
  1. extract all cases of at least 2 instances
     where one word is constant (constant word:
     w_L2c, non-constant word w_L2v, non-constant word w_L1v)
  2. For each word w_L2v {
     2.1. Get all L1 translations
     2.2. Find the closest match w_L1match to w_L1v
     2.3. Learn replacement rule w_L1match -> w_L1v }
  3. For each word w_L2POS of same POS as w_L2c {
     3.1. For each possible translation w_L1POS {
        3.1.1. Apply all possible replacement rules
               w_L1POS -> w_L1POSmod
        3.1.2. For each applied replacement rule,
               insert into lexicon entry:
               ["w_L2c" w_L2POS] -> [w_L1POSmod] } } }
68
Lexicon Enhancement for Hebrew
Adverbs (III)
• Example: B $MX -> happily
• Possible translations of $MX:
– joy
– happiness
• Use edit distance to find that happiness is
wL1match for happily
• Learn replacement rule ness->ly
69
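A small Python sketch of the two steps in this example, using difflib's similarity matching to stand in for the edit-distance comparison (the thesis's exact matching criterion is not restated here):

import difflib

def closest_translation(observed_l1_word, candidate_translations):
    # Pick the dictionary translation closest to the observed L1 word.
    matches = difflib.get_close_matches(observed_l1_word, candidate_translations,
                                        n=1, cutoff=0.0)
    return matches[0] if matches else None

def suffix_replacement_rule(matched, observed):
    # Derive a suffix rewrite such as "ness" -> "ly" from a matched word pair.
    prefix = 0
    while prefix < min(len(matched), len(observed)) and matched[prefix] == observed[prefix]:
        prefix += 1
    return matched[prefix:], observed[prefix:]

# Example from the slide: "B $MX" -> "happily"; dictionary translations of $MX:
match = closest_translation("happily", ["joy", "happiness"])   # -> "happiness"
rule = suffix_replacement_rule(match, "happily")                # -> ("ness", "ly")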
Lexicon Enhancement for Hebrew
Adverbs (IV)
• For all L2 Nouns in the dictionary, get all
possible L1 translations, and apply the
replacement rule
• If replacement rule can be applied, add lexicon
entry
• Examples of new adverbs added to lexicon:
ADV::ADV |: ["B" "$APTNWT"] -> ["AMBITIOUSLY"]
ADV::ADV |: ["B" "$BIRWT"] -> ["BRITTLELY"]
ADV::ADV |: ["B" "$G&WN"] -> ["MADLY"]
ADV::ADV |: ["B" "$I@TIWT"] -> ["METHODICALLY"]70
Lexicon Enhancement for Hebrew
Comparatives
• Same process as for adverbs
• Examples of new comparatives added to
lexicon:
ADJ::ADJ |: ["IWTR" "MLA"] -> ["FULLER"]
ADJ::ADJ |: ["IWTR" "MPGR"] -> ["SLOWER"]
ADJ::ADJ |: ["IWTR" "MQCH"] -> ["HEATER"]
• All words are checked in the BNC
• Comment: automatic process, thus far from
perfect
71
Some notation
• SL: Source Language, language to be translated from
• TL: Target Language, language to be translated into
• L1: language for which abundant information is available
• L2: language for which less information is available
• (Here:) SL = L2 = Hebrew, Hindi
• (Here:) TL = L1 = English
• POS: part of speech, e.g. noun, adjective, verb
• Parse: structural (tree) analysis of a sentence
• Lattice: list of partial translations, arranged by length and
start index
72
Training Data Example
SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt
Alignment:((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse:
(<NP>
(DET the-1)(ADJ widespread-2)(N interest-3)
(<PP> (PREP in-4)
(<NP> (DET the-5)(N election-6))))
73
Seed Generation
Algorithm
for all training examples {
for all 1-1 aligned words {
get the L1 POS tag from the parse
get the L2 POS tag from the morphology
module and the dictionary
if the L1 POS and the L2 POS tags are not
the same, leave both words lexicalized }
for all other words {
leave the words lexicalized }
create rule word alignments from training example
set L2 type and L1 type to be the parse root’s label }
74
Taxonomy of Structural
Mappings (I)
• Non-terminals (NT):
– used in two rule parts:
• type definition of a rule (both for SL and TL,
meaning X0 and Y0),
• constituent sequences for both languages.
– any label that can be the type of a rule
– describe higher-level structures such as sentences
(S), noun phrases (NP), or prepositional phrases (PP).
– can be filled with more than one word: filled by other
rules.
75
Taxonomy of Structural
Mappings (II)
• Pre-terminals (PT):
– used only in the constituent sequences of the rules,
not as X0 or Y0 types.
– filled with only one word, except phrasal lexicon
entries: filled by lexical entries, not by other grammar
rules.
• Terminals (LIT):
– lexicalized entries in the constituent sequences
– can be used on both the x- and the y-side
– can only be filled by the specified terminal itself.
76
Taxonomy of Structural
Mappings (III)
• NTs must not be aligned 1-0 or 0-1
• PTs must not be aligned 1-0 or 0-1.
• Any word in the bilingual training pair must participate in
exactly one LIT, PT, or NT.
• An L1 NT is assumed to translate into the same NT in L2.
77
Taxonomy of Structural
Mappings (IV)
• Transformation I (SL type into SL component sequence):
  NT → (NT | PT | LIT)+
• Transformation II (SL type into TL type):
  NTi → NTi (same type of NT)
• Transformation III (TL type into TL component sequence):
  NT → (NT | PT | LIT)+
• Transformation IV (SL components into TL components):
  NTi → NTi+ (same type of NT)
  PT → PT+
  LIT → ε
  ε → LIT
78
Basic Compositionality
Pseudocode
traverse parse top-down
for each node i in parse {
extract the subtree rooted at i
extract the L1 chunk cL1 rooted at i and the
corresponding L2 chunk cL2 (using alignments)
if transfer engine can translate cL1 into cL2 using
previously learned rules {
introduce compositional element:
replace POS sequence for cL1 and
cL2 with label of node i
adjust alignments }
do not traverse already covered subtree } }
79
Co-Embedding
Resolution, Iterative Type
Learning
• Problem: looking for previously learned rules
– Must determine optimal learning ordering
• Co-Embedding Resolution:
– Tag each training example with depth of tree, i.e.
how many embedded elements
– Then learn lowest to highest
• Iterative Type Learning:
– Some types (e.g. PPs) are frequently embedded in
others (e.g. NP)
– Pre-determine the order in which types are learned
80
Compositionality – Sample
Learned Rules (II)
;;L2: RQ AM H RKBT TGI&
;;L1: ONLY IF THE TRAIN ARRIVES
;;C-Structure:(<SBAR>
(<ADVP> (ADV only-1))
(SUBORD if-2)
(<S> (<NP> (DET the-3)(N train-4))
(<VP> (V arrives-5))))
SBAR::SBAR
[ADVP SUBORD S] -> [ADVP SUBORD S]
(
(X1::Y1) (X2::Y2) (X3::Y3)
)
81
Taxonomy of Constraints
(I)
Parameter            Possible Values
value or agreement   value, agreement
level                POS, constituent, POS/constituent
language             L2, L1, L2→L1
constrains head      head, non-head, head+non-head
82
Co-Embedding
Resolution, Iterative Type
Learning
find highest co-embedding score in training data
find the number of types to learn, ntypes
for (i = 0; i < maxco-embedding; i++) {
for (j = 0; j < ntypes; j++) {
for all training examples with co-embedding
score i and of type j {
perform Seed Generation
perform Compositionality Learning } } }
83
Taxonomy of Constraints
(II)
Subtype   Value/Agreement   Language   Level       Comment
1         value             x          POS         Group1-2
2         value             x          const       Group1-2
3         value             x          POS/const   can't exist
4         value             y          POS         Group4
5         value             y          const       Group5
6         value             y          POS/const   can't exist
7         value             xy         POS         can't exist
8         value             xy         const       can't exist
9         value             xy         POS/const   can't exist
10        agreement         x          POS         Group10-12
11        agreement         x          const       Group10-12
12        agreement         x          POS/const   Group10-12
13        agreement         y          POS         Group13-15
14        agreement         y          const       Group13-15
15        agreement         y          POS/const   Group13-15
16        agreement         xy         POS         Group16-18
17        agreement         xy         const       Group16-18
18        agreement         xy         POS/const   Group16-18
84
Taxonomy of Constraints
(III)
Subtype    Value/Agr   Language   Level
1,2        value       x          POS or const
4,5        value       y          POS or const
10,11,12   agreement   x          POS or const or POS/const
13,14,15   agreement   y          POS or const or POS/const
16,17,18   agreement   xy         POS or const or POS/const
85
Constraint Learning –
Sample Learned Rules (II)
;;L2: H ILD AKL KI HWA HIH R&B
;;L1: THE BOY ATE BECAUSE HE WAS HUNGRY
S::S [NP V SBAR] -> [NP V SBAR]
(
(X1::Y1) (X2::Y2) (X3::Y3)
(X0 = X2)
((X1 GEN) = (X2 GEN))
((X1 NUM) = (X2 NUM))
((Y1 NUM) = (X1 NUM))
((Y2 TENSE) = (X2 TENSE))
((Y3 NUM) = (X3 NUM))
((Y3 TENSE) = (X3 TENSE))
(Y0 = Y2))
86
Evaluation with Different
Length Limits (I)
Grammar           1        2        3        4        5        6
No Grammar        0.171    0.2962   0.3016   0.3012   0.3019   0.3019
Manual Grammar    0.1744   0.297    0.3141   0.3182   0.3232   0.3241
Learned Grammar   0.171    0.2995   0.3072   0.3252   0.3282   0.3293
87
Evaluation with Different
Length Limits (II)
(METEOR score)
88
Discussion of Results:
Comparison of Translations
(back to Hebrew-English)
No grammar: the doctor helps to patients his
Learned grammar: the doctor helps to his patients
Reference translation: The doctor helps his patients
No grammar: the soldier writes many letters to the family of he
Learned grammar: the soldier writes many letters to his family
Reference translation: The soldier writes many letters to his
family
89
Time Complexity of
Algorithms
• Seed Generation: O(n)
• Compositionality:
  – Basic: O(n * max(tree_depth))
  – Maximum Compositionality: O(n * max(num_children))
• Constraint Learning: O(n * max(num_basic_constraints))
• Practically: no issue
90
If I had 6 more months…
• Application to larger datasets
  – Training data enhancement to obtain training
    examples at different levels (NPs, PPs, etc.)
  – More emphasis on rule scoring (more noise)
  – More emphasis on context learning: constraints
• Constraint learning as a version space learning problem
• Integrate rules into the statistical system more directly,
  without producing a full lattice
91