Translating from Morphologically Complex Languages: A Paraphrase-Based Approach
Preslav Nakov & Hwee Tou Ng
ACL’2011 : Preslav Nakov & Hwee Tou Ng
Overview
• Statistical machine translation (SMT) systems
  - typically assume that the word is the basic token-unit of translation
• Problem
  - data sparseness for languages with rich morphology
• Our solution
  - a paraphrase-based approach to translating morphological variants
Introduction
Morphology in Statistical Machine Translation (SMT)
• Traditionally, the word has been the basic token-unit of translation
  - the earliest SMT models (the IBM models) were proposed for French and English, which have little morphology
• Most subsequent models remain word-atomic
  - phrase-based
  - hierarchical
  - treelet
  - syntactic
Morphology in Statistical Machine Translation (SMT)
Word as an atomic token-unit of translation
• Fine for languages with little morphology:
  - English, French, Spanish
  - Chinese (almost no morphology)
• Inadequate for morphologically rich languages:
  - Arabic, Turkish, Finnish
    - word inflections
    - word-attached clitics
  - German
    - compounds
The Case of Malay
• The Malay language
  - rich derivational morphology
  - but poor in
    - word inflections (unlike Arabic, Turkish, Finnish)
    - word-attached clitics (unlike Arabic, Turkish, Finnish)
    - concatenated compounds (unlike German, Finnish)
• Problem: classic methods do not work for Malay
• Solution: paraphrasing techniques
  - word-level
  - phrase-level
  - sentence-level
Related Work
Two general lines of research
1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation
  - stemming (Yang and Kirchhoff, 2006)
  - lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007)
  - direct clustering (Talbot and Osborne, 2006)
  - factored models (Koehn and Hoang, 2007)
2. Word segmentation
  - compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006)
  - clitics attached to the preceding word (Habash and Sadat, 2006)
  - morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009)
These approaches do not work well for Malay:
  - it has very little inflectional morphology, if any
  - compounds are not concatenated
  - clitics are rare
Malay Morphology
The Malay Language
• Malay
  - Austronesian language
  - ~180M speakers
  - official in Malaysia, Indonesia, Singapore, and Brunei
  - two major standard versions (mutually intelligible)
    - Bahasa Malaysia (lit. ‘language of Malaysia’)
    - Bahasa Indonesia (lit. ‘language of Indonesia’)
The Malay Language
• Malay: an agglutinative language
  - very rich derivational morphology
  - but nearly non-existent inflectional morphology
• Inflectionally, Malay is like Chinese:
  - no grammatical gender, number, or tense
  - verbs are not marked for person, etc.
Malay Morphology
• New word formation processes
  - affixation
  - compounding
  - reduplication
• Other morphological processes
  - clitic attachment
New Word Formation Processes in Malay
• Affixation: attaching affixes, which are not words, to a word
  - prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)
  - suffixes (e.g., ajar → ajaran/‘teachings’)
  - circumfixes (e.g., ajar → pengajaran/‘lesson’)
  - infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)
• Compounding: putting two or more existing words together
  - e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api
  - typically not concatenated
• Reduplication: word repetition
  - e.g., pelajar-pelajar/‘students’
Clitics in Malay
• Examples
  - duduk/‘sit down’ + lah → duduklah/‘please, sit down’
  - kereta + nya → keretanya/‘his car’
• Notes:
  - Clitics are not affixes.
  - Clitic attachment is NOT
    - a word inflection process
    - a word derivation process
Translating Malay Morphology
A Paraphrase-based Approach to Translating from Malay
Paraphrase-based Approach to Morphology
• Given a complex Malay word, we generate
  - morphologically simpler words from which it can be derived
  - alternative word segmentations
• We treat these forms as potential paraphrases of the original word.
• We use paraphrasing techniques at three levels:
  - word-level
  - phrase-level
  - sentence-level
Generating Simpler Morphological Variants
• Given a complex Malay word, we generate
  1. words obtainable by affix stripping
     - e.g., pelajaran → pelajar, ajaran, ajar
  2. words that are part of a compound word
     - e.g., kerjasama → kerja, sama
  3. words appearing on either side of a dash
     - e.g., adik-beradik → adik, beradik
  4. words without clitics
     - e.g., keretanya → kereta
  5. clitic-segmented word sequences
     - e.g., keretanya → kereta nya
  6. dash-segmented wordforms
     - e.g., aceh-nias → aceh - nias
  7. combinations of the above.
adik-beradiknya → adik-beradiknya, adik-beradik nya, adik-beradik, beradiknya, beradik nya, beradik, adik nya, adik
berpelajaran → berpelajaran, pelajaran, pelajar, ajaran, ajar
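The variant-generation rules can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the affix and clitic lists, the toy lexicon, and the function name `simpler_variants` are all assumptions made for the example, and compound splitting (rule 2) is left out.

```python
# Illustrative resources only: these tiny lists and the toy lexicon are assumptions.
CLITICS = ("nya", "lah")
PREFIXES = ("ber", "pel")
SUFFIXES = ("an",)
LEXICON = {"adik", "beradik", "adik-beradik", "kereta",
           "pelajaran", "pelajar", "ajaran", "ajar"}


def simpler_variants(word, lexicon=LEXICON):
    """Return the word plus simpler or re-segmented forms reachable from it."""
    seen = {word}
    queue = [word]
    while queue:
        w = queue.pop()
        found = set()
        for clitic in CLITICS:
            base = w[: -len(clitic)]
            if w.endswith(clitic) and base in lexicon:
                found.add(base)                # word without the clitic
                found.add(f"{base} {clitic}")  # clitic-segmented sequence
        if "-" in w:
            # words appearing on either side of a dash
            found.update(part for part in w.split("-") if part)
        for prefix in PREFIXES:
            if w.startswith(prefix) and w[len(prefix):] in lexicon:
                found.add(w[len(prefix):])     # prefix stripped
        for suffix in SUFFIXES:
            if w.endswith(suffix) and w[: -len(suffix)] in lexicon:
                found.add(w[: -len(suffix)])   # suffix stripped
        for variant in found - seen:
            seen.add(variant)
            if " " not in variant:             # only single tokens are expanded further
                queue.append(variant)
    return seen
```

With this toy lexicon, `simpler_variants("berpelajaran")` yields the five forms listed for berpelajaran. For adik-beradiknya the sketch recovers most of the listed forms but not cross-combinations such as adik nya, since segmented sequences are not expanded further.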
Word-Level Paraphrases
• Given a dev/test sentence:
  1. We generate a list of variants {w’} for each Malay word w.
  2. We add them to the sentence, thus forming a lattice.
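The two steps above could be sketched as follows. This is a simplified illustration under stated assumptions: `build_lattice`, `variants_of`, and `weight_of` are hypothetical names, the lattice is represented as a flat list of arcs, and a multi-token variant such as kereta nya is kept as a single arc label rather than expanded into intermediate nodes.

```python
def build_lattice(sentence, variants_of, weight_of):
    """Return arcs (start_node, end_node, label, weight) for a tokenized sentence.

    variants_of(w) -> set of variant strings for word w (hypothetical helper)
    weight_of(v, w) -> weight for variant v of word w (hypothetical helper)
    """
    arcs = []
    for i, w in enumerate(sentence.split()):
        arcs.append((i, i + 1, w, 1.0))           # the original word keeps weight 1.0
        for v in sorted(variants_of(w) - {w}):    # each variant becomes a parallel arc
            arcs.append((i, i + 1, v, weight_of(v, w)))
    return arcs
```

A decoder that accepts lattice input would need these arcs converted to its own input format; that conversion is not shown here.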
Word-Level Paraphrases (cont.)
• The lattice requires a weight for each arc.
  - We set the weight to 1.0 for the original word w.
  - For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English.
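One common way to estimate a pivoting probability is to sum over the English words e aligned to w: Pr(w’|w) = Σ_e Pr(e|w) · Pr(w’|e), with each factor estimated by relative frequency from word alignments. The sketch below assumes that formulation, which may differ in detail from the estimate used in this work; the function name `make_pivot_estimator` and the input format (a list of Malay-English alignment links) are assumptions for illustration.

```python
from collections import defaultdict


def make_pivot_estimator(alignment_links):
    """Build pr(w_prime, w) from (malay_word, english_word) alignment links."""
    count = defaultdict(int)
    malay_total = defaultdict(int)
    english_total = defaultdict(int)
    for m, e in alignment_links:
        count[(m, e)] += 1
        malay_total[m] += 1
        english_total[e] += 1

    def pr(w_prime, w):
        total = 0.0
        for (m, e), c in count.items():
            if m == w:
                p_e_given_w = c / malay_total[w]                        # Pr(e|w)
                p_wp_given_e = count.get((w_prime, e), 0) / english_total[e]  # Pr(w'|e)
                total += p_e_given_w * p_wp_given_e
        return total

    return pr
```

The returned function can serve as the arc-weight estimate for a variant w’ of an input word w.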
Word-Level Paraphrases (cont.)
• Estimating the probability Pr(w’|w) (formula not preserved in the slide text)
Sentence-Level Paraphrases
• Word-level paraphrases in the dev/test input need matching phrases in the phrase table
• Paraphrase the training data at the sentence level:
  - for each paraphrasable word w and for each of its paraphrases w’, we create a version of the sentence with w substituted by w’
• Pair each paraphrased sentence with the original target sentence
Paraphrased bi-text:
  dia mahu membeli keretanya . || she wants to buy his car .
  dia mahu beli keretanya . || she wants to buy his car .
  dia mahu membeli kereta . || she wants to buy his car .
  dia mahu membeli kereta nya . || she wants to buy his car .
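A minimal sketch of this substitution procedure, assuming one word is replaced at a time and the original sentence pair is kept, as in the example bi-text. The names `paraphrase_bitext` and `variants_of` are hypothetical.

```python
def paraphrase_bitext(bitext, variants_of):
    """Expand (source, target) pairs with one-word-substituted source variants.

    variants_of(w) -> set of paraphrases for word w (hypothetical helper)
    """
    paraphrased = []
    for source, target in bitext:
        paraphrased.append((source, target))          # keep the original pair
        words = source.split()
        for i, w in enumerate(words):
            for v in sorted(variants_of(w)):
                if v != w:
                    new_source = " ".join(words[:i] + [v] + words[i + 1:])
                    paraphrased.append((new_source, target))  # same target sentence
    return paraphrased
```

Given variants beli for membeli, and kereta and kereta nya for keretanya, this produces the four sentence pairs of the example.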
Sentence-Level Paraphrases (cont.)
• We build two phrase tables
  - T_orig from the original training bi-text
  - T_par from the paraphrased bi-text
• We merge these tables
  1. Keep all entries from T_orig.
  2. Add those phrase pairs from T_par that are not in T_orig.
  3. Add extra features:
     - F1: 1 if the entry came from T_orig, 0.5 otherwise.
     - F2: 1 if the entry came from T_par, 0.5 otherwise.
     - F3: 1 if the entry was in both tables, 0.5 otherwise.
• The feature weights are set using MERT, and the number of features is optimized on the development set.
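The merging steps might look like the sketch below. It assumes each phrase table is a dictionary from (source phrase, target phrase) to a list of scores, and it reads "came from" as "is present in", so an entry found in both tables gets F1 = F2 = F3 = 1. Both the data layout and that reading are assumptions, as is the name `merge_phrase_tables`.

```python
def merge_phrase_tables(t_orig, t_par):
    """Merge two phrase tables, preferring T_orig scores and appending F1, F2, F3."""
    merged = {}
    for pair in set(t_orig) | set(t_par):
        in_orig = pair in t_orig
        in_par = pair in t_par
        # Steps 1 and 2: T_orig entries win; T_par only fills in missing pairs.
        scores = t_orig[pair] if in_orig else t_par[pair]
        f1 = 1.0 if in_orig else 0.5              # entry is in T_orig
        f2 = 1.0 if in_par else 0.5               # entry is in T_par
        f3 = 1.0 if in_orig and in_par else 0.5   # entry is in both tables
        merged[pair] = list(scores) + [f1, f2, f3]
    return merged
```

Using 0.5 rather than 0 for the "otherwise" case follows the feature definitions above.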
Phrase-Level Paraphrases
• We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:
  - 1, for phrase pairs coming from T_orig
  - max_p Pr(p’|p), for phrase pairs coming from T_par,
    where p’ is a paraphrase of some original Malay phrase p
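A short sketch of how this feature could be computed. The helper names are hypothetical: `originals_of(p_prime)` is assumed to return the original Malay phrases that p’ paraphrases, and `pr(p_prime, p)` is assumed to return a phrase-level pivoting probability estimated elsewhere.

```python
def pivot_feature(source_phrase, target_phrase, t_orig, originals_of, pr):
    """Return 1 for T_orig phrase pairs, otherwise the best Pr(p'|p) over originals p."""
    if (source_phrase, target_phrase) in t_orig:
        return 1.0
    # For paraphrased entries, take the maximum over all originating phrases p.
    return max((pr(source_phrase, p) for p in originals_of(source_phrase)),
               default=0.0)
```

The `default=0.0` fallback for phrases with no known original is a choice made for this sketch only.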
Experiments and Evaluation
Data
• Training bi-text: 350K sentence pairs
  - English: 10.4M words
  - Malay: 9.7M words
• Development bi-text: 2,000 sentence pairs
  - English: 63.4K words
  - Malay: 58.5K words
• Testing bi-text: 1,420 sentences
  - Malay: 28.8K words
  - English: 32.8K, 32.4K, and 32.9K words (3 reference translations)
• LM: 49.8M English words
Evaluation Results: BLEU
Detailed BLEU
Improvement for all n-grams used in BLEU
Evaluation With 5 Measures
Consistent improvement for 5 measures
Example Translations
Conclusion
• Presented a novel approach to translating from a morphologically complex language
  - uses paraphrases at three levels of translation
    - word-level
    - phrase-level
    - sentence-level
• Demonstrated the potential of the approach on Malay
  - derivationally rich
  - but almost no inflectional morphology
Future Work
• Improve the paraphrasing models
  - use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010)
• Try phrase table paraphrasing instead of sentence-level paraphrasing (Nakov, 2008)
• Try other
  - morphologically complex languages
  - SMT models
The presented work is supported by research grant POD0713875.