Ibrahim Badr, Rabih Zbib, James Glass

Introduction
- Experiments on English-to-Arabic SMT in two domains: newswire text and spoken travel conversations.
- Explore the effect of Arabic segmentation on translation quality.
- Propose various schemes for recombining the segmented Arabic output (not trivial!).
- Apply (basic) factored translation models.

Arabic Morphology
- Arabic is a morphologically rich language.
- Nouns and adjectives inflect for gender (M, F), number (sg, du, pl), and case (Nom, Acc, Gen); all combinations are possible:
  لاعب (a player, M), لاعبة (a player, F), لاعبان (two players, M), لاعبتان (two players, F), لاعبون (players, M, Pl, Nom), لاعبين (players, M, Pl, Acc or Gen).
- In addition to gender and number, verbs inflect for tense, voice, and person:
  لعبوا (played, 3rd person M Pl), يلعبون (play, present, 3rd person M Pl), سيلعبون (will play, 3rd person M Pl).
- Additional prefixes: the conjunction و, the determiner ال, the prepositions ب (with, in), ل (to, for), and لل (for the); prefixes can stack, e.g. وبالأخبار (and with the news).
- Additional suffixes:
  - Possessive pronouns attach to nouns: هم (their), كُم (your, Pl, M), كُنّ (your, Pl, F), ...
  - Object and subject pronouns attach to verbs: ني (me), هن (them, F), وا (they).
  - Prefixes and suffixes combine, e.g. وبسياراتهم (and with their cars).
- The result: many surface forms share the same lemma!

Arabic Segmentation
- Use MADA for the morphological decomposition of the Arabic text.
- Typical normalization: ى → ي and أ, آ, إ → ا.
- Two proposed segmentation schemes (sketched in code after the Factored Model & Data slide):
  - S1: split off all the clitics listed on the previous slide, except the plural and subject-pronoun morphemes.
  - S2: same as S1, but the split clitics are glued into a single prefix and a single suffix, so that word = prefix + stem + suffix.
- Example: ولأولاده (and for his kids)
  - S1: و + ل + أولاد + ه
  - S2: ول + أولاد + ه

Arabic Recombination
- The segmented output needs to be recombined, and this is not trivial:
  a) Letter ambiguity: because we normalized ى → ي, e.g. مدى + ه → مداه, but في + ه → فيه.
  b) Word ambiguity: some words can be grammatically recombined in more than one way: لكن + ي → لكني (#1) or لكنني (#2).
- Two proposed recombination schemes (see the second sketch below):
  1. R: manually defined recombination rules. Resolve (a) by picking the most frequent stem form in the non-normalized data; resolve (b) by picking the most frequent grammatical form.
  2. T: a table of (surface form, decomposed word) pairs derived from the training set; if a decomposition maps to more than one surface form, choose randomly. The table can also help recombine words that were segmented incorrectly.

Factored Model & Data
- Factors (see the third sketch below for the factored data format):
  - English side: surface form + POS.
  - Arabic side: surface form + POS&clitics.
  - Build a 3-gram LM on the surface forms and a 7-gram LM on the POS&clitics factor.
  - Generation model: surface + POS&clitics → surface.
- Data: newswire & spoken dialogue (travel).
  - Training data:
    - Newswire: LDC corpora of ~3M, ~1.6M, and ~600K words (avg. sentence length: 33 En, 25 Ar, 36 segmented Ar).
    - Spoken dialogue: IWSLT 2007, 200K words (avg. sentence length: 9 En, 8 Ar, 10 segmented Ar).
  - Language model data:
    - Newswire: ~3M words of the Arabic side + 30M words from the Arabic Gigaword corpus.
    - Spoken dialogue: the 200K words of the Arabic side.
  - Tuning and test sets (1 English reference):
    - Newswire: 2000 tune, 2000 test (chosen randomly from the same source as the training data).
    - Spoken dialogue: 500 tune, 500 test.
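To make the two segmentation schemes concrete, here is a minimal Python sketch. It assumes the clitic analysis (prefixes, stem, suffixes) has already been produced by a morphological analyzer such as MADA; the function names and the "+" markers are illustrative, not the authors' actual tooling.

    # Minimal sketch of the S1 and S2 segmentation schemes (illustrative,
    # not the authors' code). The clitic analysis is assumed to come from
    # a morphological analyzer such as MADA.

    def normalize(word):
        """Typical orthographic normalization: ى -> ي and أ/آ/إ -> ا."""
        table = str.maketrans({"ى": "ي", "أ": "ا", "آ": "ا", "إ": "ا"})
        return word.translate(table)

    def segment_s1(prefixes, stem, suffixes):
        """S1: every split clitic becomes its own token, marked with '+'."""
        return [p + "+" for p in prefixes] + [stem] + ["+" + s for s in suffixes]

    def segment_s2(prefixes, stem, suffixes):
        """S2: glue the split clitics into one prefix token and one suffix
        token, so that word = prefix + stem + suffix."""
        out = ["".join(prefixes) + "+"] if prefixes else []
        out.append(stem)
        if suffixes:
            out.append("+" + "".join(suffixes))
        return out

    # The example from the slides: ولأولاده "and for his kids"
    print(segment_s1(["و", "ل"], "أولاد", ["ه"]))  # و+ ل+ أولاد +ه
    print(segment_s2(["و", "ل"], "أولاد", ["ه"]))  # ول+ أولاد +ه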
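The recombination step can be sketched the same way: try the table T first, then fall back to the manual rules R. The table entries, the frequency counts, and the single rule shown are toy stand-ins for the paper's full resources; in tuning scheme T2, the same pass would be applied to every n-best hypothesis before scoring.

    # Sketch of the T+R recombination scheme (toy data, illustrative only).
    import random

    # T: (decomposed word -> observed surface forms) harvested from training data.
    recomb_table = {
        "لكن +ي": ["لكني", "لكنني"],  # word ambiguity: both are grammatical
    }

    # Frequencies of stem spellings in the non-normalized data (toy counts).
    stem_counts = {"مدى": 40, "مدي": 3, "في": 1000, "فى": 2}

    def recombine_r(stem, suffix):
        """R: rule-based fallback. A stem-final ي may be a normalized ى;
        pick the spelling that is more frequent in the non-normalized data,
        and spell ى as ا before a suffix."""
        if stem.endswith("ي"):
            with_alef_maqsura = stem[:-1] + "ى"
            if stem_counts.get(with_alef_maqsura, 0) > stem_counts.get(stem, 0):
                return stem[:-1] + "ا" + suffix   # مدي +ه -> مداه
        return stem + suffix                      # في +ه -> فيه

    def recombine(stem, suffix):
        """T+R: use the table if the decomposed word was seen, else the rules."""
        key = stem + " +" + suffix
        if key in recomb_table:
            return random.choice(recomb_table[key])  # slides: pick randomly
        return recombine_r(stem, suffix)

    print(recombine("لكن", "ي"))  # T hit: لكني or لكنني
    print(recombine("مدي", "ه"))  # R fallback: مداه
    print(recombine("في", "ه"))   # R fallback: فيه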
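Finally, for the factored setup, Moses expects the factors of each token to be joined with "|". A minimal formatting sketch; the POS&clitics tag strings are invented for illustration and are not MADA's actual tag set.

    # Sketch: write one token sequence of factored training data
    # in Moses' surface|POS&clitics format (tags are placeholders).
    def to_factored(tokens, tags):
        return " ".join(w + "|" + t for w, t in zip(tokens, tags))

    print(to_factored(["فيه", "لاعبون"], ["PREP+PRON_3MS", "NOUN_MP_NOM"]))
    # -> فيه|PREP+PRON_3MS لاعبون|NOUN_MP_NOM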
Setup & Recombination
- Setup:
  - Use GIZA++ for alignment (for both unsegmented and segmented Arabic); use a maximum phrase length (MAXPHR) of 15 for segmented Arabic!
  - Decode using Moses.
  - SRILM language models:
    - Newswire: 4-gram (unsegmented Ar), 6-gram (segmented Ar).
    - Spoken dialogue: 3-gram (unsegmented Ar), 4-gram (segmented Ar).
  - Tune with MERT, optimizing for BLEU. Define two tuning schemes for segmented Arabic:
    - T1: use segmented Arabic for the references.
    - T2: use unsegmented Arabic for the references; recombine the n-best list before scoring.
- Recombination results:
  - Evaluated on the newswire training and test sets (sentence error rate!).
  - T was trained on the training set.
  - Baseline: glue prefixes and suffixes back on.
  - T+R: if the word was seen in training, use T; otherwise fall back to R.

Translation Results: News
Results for newswire (BLEU):
- Segmentation helps, but the gain diminishes as the training-data size increases (the model becomes less sparse).
- Segmentation scheme S2 is slightly better than S1.
- Tuning scheme T2 performs better than T1.
- Factored models perform best for the largest system (at a higher computational cost!).

Translation Results: Spoken Dialogue
Results for spoken dialogue (BLEU):
- S2 performs slightly better than S1.
- T1 is better than T2.

Conclusions
- Recombination based on both the training data and rules performs best.
- Segmentation helps, but the gain diminishes as the training-data size increases.
- Recombining the segmented output during tuning helps.
- Factored models perform best for the "Large" system.
- What next: explore the effect of syntactic reordering on English-to-Arabic MT. See: Syntactic Phrase Reordering for English-to-Arabic Statistical Machine Translation, Badr et al., EACL 2009.