Translating from Morphologically Complex Languages: A Paraphrase-Based Approach
Preslav Nakov & Hwee Tou Ng
ACL 2011

Overview
- Statistical machine translation (SMT) systems typically assume that the word is the basic token-unit of translation.
- Problem: data sparseness for languages with rich morphology.
- Our solution: a paraphrase-based approach to translating morphological variants.

Introduction

Morphology in Statistical Machine Translation (SMT)
- Traditionally, the word has been the basic token-unit of translation.
- The earliest SMT models (the IBM models) were proposed for French and English, which have little morphology.
- Most subsequent models remain word-atomic:
  - phrase-based
  - hierarchical
  - treelet
  - syntactic

Morphology in Statistical Machine Translation (SMT) (cont.)
- The word as an atomic token-unit of translation is
  - fine for languages with little morphology: English, French, Spanish, Chinese (almost no morphology)
  - inadequate for morphologically rich languages: Arabic, Turkish, Finnish
    - word inflections
    - word-attached clitics
    - German compounds

The Case of Malay
- The Malay language has rich derivational morphology, but is poor in
  - word inflections (unlike Arabic, Turkish, Finnish)
  - word-attached clitics (unlike Arabic, Turkish, Finnish)
  - concatenated compounds (unlike German, Finnish)
- Problem: classic methods do not work for Malay.
- Solution: paraphrasing techniques at three levels
  - word-level
  - phrase-level
  - sentence-level

Related Work

Related Work
Two general lines of research:
1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation
   - stemming (Yang and Kirchhoff, 2006)
   - lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007)
   - direct clustering (Talbot and Osborne, 2006)
   - factored models (Koehn and Hoang, 2007)
2. Word segmentation
   - compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006)
   - clitics attached to the preceding word (Habash and Sadat, 2006)
   - morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009)
These approaches do not work well for Malay:
- it has very little inflectional morphology, if any
- compounds are not concatenated
- clitics are rare

Malay Morphology

The Malay Language
- Malay is an Austronesian language
  - ~180M speakers
  - official in Malaysia, Indonesia, Singapore, and Brunei
  - two major standard versions (mutually intelligible)
    - Bahasa Malaysia (lit. ‘language of Malaysia’)
    - Bahasa Indonesia (lit. ‘language of Indonesia’)

The Malay Language (cont.)
- Malay is an agglutinative language
  - very rich derivational morphology
  - but nearly non-existent inflectional morphology
- Inflectionally, Malay is like Chinese: no grammatical gender, number or tense; verbs are not marked for person, etc.

Malay Morphology
- New word formation processes
  - affixation
  - compounding
  - reduplication
- Other morphological processes
  - clitic attachment

New Word Formation Processes in Malay
- Affixation: attaching affixes, which are not words, to a word
  - prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)
  - suffixes (e.g., ajar → ajaran/‘teachings’)
  - circumfixes (e.g., ajar → pengajaran/‘lesson’)
  - infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)
- Compounding: putting two or more existing words together
  - e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api
  - typically not concatenated
- Reduplication: word repetition
  - e.g., pelajar-pelajar/‘students’

Clitics in Malay
- Examples
  - duduk/‘sit down’ + lah → duduklah/‘please, sit down’
  - kereta + nya → keretanya/‘his car’
- Notes
  - Clitics are not affixes.
  - Clitic attachment is NOT
    - a word inflection process
    - a word derivation process

Translating Malay Morphology
A Paraphrase-based Approach to Translating from Malay

Paraphrase-based Approach to Morphology
- Given a complex Malay word, we generate
  - morphologically simpler words from which it can be derived
  - alternative word segmentations
- We treat these forms as potential paraphrases of the original word.
- We use paraphrasing techniques at three levels:
  - word-level
  - phrase-level
  - sentence-level

Generating Simpler Morphological Variants
Given a complex Malay word, we generate
1. words obtainable by affix stripping, e.g., pelajaran → pelajar, ajaran, ajar
2. words that are part of a compound word, e.g., kerjasama → kerja, sama
3. words appearing on either side of a dash, e.g., adik-beradik → adik, beradik
4. words without clitics, e.g., keretanya → kereta
5. clitic-segmented word sequences, e.g., keretanya → kereta nya
6. dash-segmented wordforms, e.g., aceh-nias → aceh - nias
7. combinations of the above.
Examples:
- adik-beradiknya → adik-beradiknya, adik-beradik nya, adik-beradik, beradiknya, beradik nya, beradik, adik nya, adik
- berpelajaran → berpelajaran, pelajaran, pelajar, ajaran, ajar

Word-Level Paraphrases
Given a dev/test sentence:
1. We generate a list of variants {w’} for each Malay word w.
2. We add them to the sentence, thus forming a lattice.

Word-Level Paraphrases (cont.)
- The lattice requires a weight for each arc.
- We set 1.0 for the original word w.
- For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English.

Word-Level Paraphrases (cont.)
- Estimating the probability Pr(w’|w): (formula not preserved)

Sentence-Level Paraphrases
- Word-level paraphrases of the dev/test data need matching phrases.
- We therefore paraphrase the training data at the sentence level:
  - for each paraphrasable word w and for each of its paraphrases w’, we create a version of the sentence with w substituted by w’
  - we pair each paraphrased sentence with the original target
- Paraphrased bi-text:
  dia mahu membeli keretanya . || she wants to buy his car .
  dia mahu beli keretanya . || she wants to buy his car .
  dia mahu membeli kereta . || she wants to buy his car .
  dia mahu membeli kereta nya . || she wants to buy his car .

Sentence-Level Paraphrases (cont.)
- We build two phrase tables
  - T_orig from the original training bi-text
  - T_par from the paraphrased bi-text
- We merge these tables
  1. Keep all entries from T_orig.
  2. Add those phrase pairs from T_par that are not in T_orig.
  3. Add extra features:
     - F1: 1 if the entry came from T_orig, 0.5 otherwise.
     - F2: 1 if the entry came from T_par, 0.5 otherwise.
     - F3: 1 if the entry was in both tables, 0.5 otherwise.
- The feature weights are set using MERT, and the number of features is optimized on the development set.

Phrase-Level Paraphrases
We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:
- 1, for phrase pairs coming from T_orig
- max_p Pr(p’|p), for phrase pairs coming from T_par, where p’ is a paraphrase of some original Malay phrase p

Experiments and Evaluation

Data
- Training bi-text: 350K sentence pairs
  - English: 10.4M words
  - Malay: 9.7M words
- Development bi-text: 2,000 sentence pairs
  - English: 63.4K words
  - Malay: 58.5K words
- Testing bi-text: 1,420 sentences
  - Malay: 28.8K words
  - English: 32.8K, 32.4K, and 32.9K words (3 reference translations)
- LM: 49.8M English words

Evaluation Results: BLEU
(results table not preserved)

Detailed BLEU
- Improvement for all n-grams used in BLEU

Evaluation With 5 Measures
- Consistent improvement for 5 measures

Example Translations
(examples not preserved)

Conclusion

Conclusion
- Presented a novel approach to translating from a morphologically complex language
  - uses paraphrases at three levels of translation
    - word-level
    - phrase-level
    - sentence-level
- Demonstrated the potential of the approach to Malay
  - derivationally rich
  - but almost no inflectional morphology

Future Work
- Improve the paraphrasing models
  - use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010)
- Try phrase table paraphrasing instead of sentence-level paraphrasing (Nakov, 2008)
- Try other
  - morphologically complex languages
  - SMT models

The presented work is supported by research grant POD0713875.
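
Sketch: generating simpler morphological variants
The numbered list under "Generating Simpler Morphological Variants" can be illustrated with a small Python sketch. It is not the authors' implementation. The affix, clitic, and lexicon inventories below are tiny hypothetical stand-ins, the function names are made up for illustration, and item 7 (combinations) is left out to keep the sketch short.

```python
# Hypothetical, deliberately tiny inventories (assumptions for illustration only).
CLITICS = ("nya", "lah")
PREFIXES = ("ber", "pel")
SUFFIXES = ("an",)
LEXICON = {"ajar", "ajaran", "pelajar", "pelajaran", "kereta", "kerja", "sama"}


def strip_affixes(word, lexicon=LEXICON):
    """Return lexicon words reachable from `word` by stripping one affix at a time."""
    found, frontier = set(), [word]
    while frontier:
        current = frontier.pop()
        candidates = [current[len(p):] for p in PREFIXES if current.startswith(p)]
        candidates += [current[:-len(s)] for s in SUFFIXES if current.endswith(s)]
        for cand in candidates:
            # The lexicon check keeps only attested words (and rejects "").
            if cand in lexicon and cand not in found:
                found.add(cand)
                frontier.append(cand)
    return found


def simple_variants(word, lexicon=LEXICON):
    """Return variants of `word` as token tuples (items 1 to 6 of the list)."""
    out = {(w,) for w in strip_affixes(word, lexicon)}        # 1. affix stripping
    for i in range(1, len(word)):                              # 2. compound parts
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            out.update({(left,), (right,)})
    if "-" in word:
        parts = [p for p in word.split("-") if p]
        out.update((p,) for p in parts)                        # 3. either side of a dash
        out.add(tuple(" - ".join(parts).split()))              # 6. dash-segmented form
    for clitic in CLITICS:
        host = word[: -len(clitic)]
        if word.endswith(clitic) and host in lexicon:
            out.add((host,))                                   # 4. clitic removed
            out.add((host, clitic))                            # 5. clitic segmented
    return out


print(sorted(strip_affixes("berpelajaran")))
print(sorted(simple_variants("keretanya")))
```

The lexicon filter is a design choice of this sketch: it stops blind affix stripping from producing non-words, at the cost of missing forms the lexicon does not contain.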
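
Sketch: lattice arc weights via word-level pivoting
The slides state that Pr(w’|w) is estimated by word-level pivoting over English, but the formula itself is not preserved in this transcript. The sketch below assumes the common pivoting form, Pr(w’|w) = sum over English words e of Pr(e|w) * Pr(w’|e); the authors' exact estimator may differ. All probabilities are made-up toy values, and `pivot_prob` and `lattice_arcs` are hypothetical names.

```python
# Toy lexical translation probabilities (hypothetical numbers, not from the paper).
P_E_GIVEN_W = {"keretanya": {"car": 0.6, "his": 0.4}}
P_W_GIVEN_E = {
    "car": {"kereta": 0.7, "keretanya": 0.3},
    "his": {"nya": 0.8, "keretanya": 0.2},
}


def pivot_prob(w, w_prime, p_e_given_w=P_E_GIVEN_W, p_w_given_e=P_W_GIVEN_E):
    """Assumed pivoting estimate: sum_e Pr(e|w) * Pr(w'|e)."""
    return sum(
        p_e * p_w_given_e.get(e, {}).get(w_prime, 0.0)
        for e, p_e in p_e_given_w.get(w, {}).items()
    )


def lattice_arcs(w, variants):
    """Arcs for one lattice position: the original word gets weight 1.0,
    each paraphrase gets its pivoting probability."""
    return [(w, 1.0)] + [(v, pivot_prob(w, v)) for v in variants]


for arc in lattice_arcs("keretanya", ["kereta"]):
    print(arc)
```

Variants that share no English pivot with the original word receive weight 0.0 in this sketch; a real system would have to decide whether to drop or smooth such arcs.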
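
Sketch: sentence-level paraphrasing of the training bi-text
The procedure on the "Sentence-Level Paraphrases" slide (one substitution per paraphrased sentence, each paired with the original target) can be sketched as follows. The variant dictionary is supplied by hand here and `paraphrase_bitext` is a hypothetical name; in the described approach the variants would come from the morphological variant generator.

```python
def paraphrase_bitext(source, target, variants):
    """Return the original pair plus one pair per (word, paraphrase) substitution.

    `variants` maps a source word to a list of paraphrases; a paraphrase may
    contain several tokens (e.g. a clitic-segmented form).
    """
    tokens = source.split()
    pairs = [(source, target)]
    for i, word in enumerate(tokens):
        for variant in variants.get(word, []):
            new_tokens = tokens[:i] + variant.split() + tokens[i + 1:]
            pairs.append((" ".join(new_tokens), target))
    return pairs


# Hand-written variants reproducing the example on the slide.
VARIANTS = {"membeli": ["beli"], "keretanya": ["kereta", "kereta nya"]}

for src, tgt in paraphrase_bitext(
    "dia mahu membeli keretanya .", "she wants to buy his car .", VARIANTS
):
    print(src, "||", tgt)
```

Only one word is substituted per generated sentence, which mirrors the wording of the slide and keeps the paraphrased bi-text from growing combinatorially.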
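
Sketch: merging the two phrase tables
The three merging steps listed for T_orig and T_par can be sketched as below. Phrase tables are modelled as plain dictionaries from (source phrase, target phrase) to a list of scores, which is a simplification and not the on-disk format of any particular toolkit. One interpretation is assumed and may not match the authors' setup: "came from" refers to the table whose scores are kept, so an entry present in both tables gets F1 = 1, F2 = 0.5, F3 = 1.

```python
def merge_phrase_tables(t_orig, t_par):
    """Merge two phrase tables and append the indicator features F1, F2, F3."""
    merged = {}
    # Step 1: keep all entries from T_orig (their scores win on overlap).
    for pair, scores in t_orig.items():
        f3 = 1.0 if pair in t_par else 0.5
        merged[pair] = list(scores) + [1.0, 0.5, f3]
    # Step 2: add pairs that occur only in T_par.
    for pair, scores in t_par.items():
        if pair not in t_orig:
            merged[pair] = list(scores) + [0.5, 1.0, 0.5]
    return merged


# Toy tables with a single made-up score per entry.
T_ORIG = {("keretanya", "his car"): [0.4], ("membeli", "buy"): [0.7]}
T_PAR = {("keretanya", "his car"): [0.3], ("kereta nya", "his car"): [0.5]}

for pair, scores in merge_phrase_tables(T_ORIG, T_PAR).items():
    print(pair, scores)
```

The extra columns let the tuning step (MERT on the slides) learn how much to trust entries that exist only because of paraphrasing.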