Ibrahim Badr, Rabih Zbib, James Glass

Introduction
- Experiments on English-to-Arabic SMT in two domains: newswire text and spoken travel conversations.
- Explore the effect of Arabic segmentation on translation quality.
- Propose various schemes for recombining the segmented Arabic output (not trivial!).
- Apply (basic) factored translation models.

Arabic Morphology
- Arabic is a morphologically rich language.
- Nouns and adjectives inflect for gender (M, F), number (sg, du, pl), and case (Nom, Acc, Gen); all combinations are possible:
  لاعب (a player, M), لاعبة (a player, F), لاعبان (two players, M), لاعبتان (two players, F), لاعبون (players, M, Pl, Nom), لاعبين (players, M, Pl, Acc or Gen).
- In addition to gender and number, verbs inflect for tense, voice, and person:
  لعبوا (played, 3rd person M Pl), يلعبون (play, present, 3rd person M Pl), سيلعبون (will play, 3rd person M Pl).
- Additional prefixes: the conjunction و, the determiner ال, the prepositions ب (with, in), ل (to, for), and لل (for the); prefixes can stack, e.g. وبالأخبار (and with the news).
- Additional suffixes:
  - Possessive pronouns attach to nouns: هم (their), كُم (your, Pl, M), كُنّ (your, Pl, F), ...
  - Object and subject pronouns attach to verbs: ني (me), هن (them, F), وا (they).
  - Prefixes and suffixes combine, e.g. وبسياراتهم (and with their cars).
- The result: many surface forms share the same lemma!

Arabic Segmentation
- Use MADA for the morphological decomposition of the Arabic text.
- Typical normalization: ى → ي and أ, آ, إ → ا.
- Two proposed segmentation schemes (sketched in code after the Factored Model & Data slide):
  - S1: split off all the clitics listed on the previous slide, except the plural and subject-pronoun morphemes.
  - S2: same as S1, but the split clitics are glued into a single prefix and a single suffix, so that word = prefix + stem + suffix.
- Example: ولأولاده (and for his kids)
  - S1: و + ل + أولاد + ه
  - S2: ول + أولاد + ه

Arabic Recombination
- The segmented output needs to be recombined, and this is not trivial:
  a) Letter ambiguity: because we normalized ى → ي, e.g. مدى + ه → مداه, but في + ه → فيه.
  b) Word ambiguity: some words can be grammatically recombined in more than one way: لكن + ي → لكني (#1) or لكنني (#2).
- Two proposed recombination schemes (see the second sketch below):
  1. R: manually defined recombination rules. Resolve (a) by picking the most frequent stem form in the non-normalized data; resolve (b) by picking the most frequent grammatical form.
  2. T: a table of (surface form, decomposed word) pairs derived from the training set; if a decomposition maps to more than one surface form, choose randomly. The table can also help recombine words that were segmented incorrectly.

Factored Model & Data
- Factors (see the third sketch below for the factored data format):
  - English side: surface form + POS.
  - Arabic side: surface form + POS&clitics.
  - Build a 3-gram LM on the surface forms and a 7-gram LM on the POS&clitics factor.
  - Generation model: surface + POS&clitics → surface.
- Data: newswire & spoken dialogue (travel).
  - Training data:
    - Newswire: LDC corpora of ~3M, ~1.6M, and ~600K words (avg. sentence length: 33 En, 25 Ar, 36 segmented Ar).
    - Spoken dialogue: IWSLT 2007, 200K words (avg. sentence length: 9 En, 8 Ar, 10 segmented Ar).
  - Language model data:
    - Newswire: ~3M words of the Arabic side + 30M words from the Arabic Gigaword corpus.
    - Spoken dialogue: the 200K words of the Arabic side.
  - Tuning and test sets (1 English reference):
    - Newswire: 2000 tune, 2000 test (chosen randomly from the same source as the training data).
    - Spoken dialogue: 500 tune, 500 test.
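To make the two segmentation schemes concrete, here is a minimal Python sketch. It assumes the clitic analysis (prefixes, stem, suffixes) has already been produced by a morphological analyzer such as MADA; the function names and the "+" markers are illustrative, not the authors' actual tooling.

    # Minimal sketch of the S1 and S2 segmentation schemes (illustrative,
    # not the authors' code). The clitic analysis is assumed to come from
    # a morphological analyzer such as MADA.

    def normalize(word):
        """Typical orthographic normalization: ى -> ي and أ/آ/إ -> ا."""
        table = str.maketrans({"ى": "ي", "أ": "ا", "آ": "ا", "إ": "ا"})
        return word.translate(table)

    def segment_s1(prefixes, stem, suffixes):
        """S1: every split clitic becomes its own token, marked with '+'."""
        return [p + "+" for p in prefixes] + [stem] + ["+" + s for s in suffixes]

    def segment_s2(prefixes, stem, suffixes):
        """S2: glue the split clitics into one prefix token and one suffix
        token, so that word = prefix + stem + suffix."""
        out = ["".join(prefixes) + "+"] if prefixes else []
        out.append(stem)
        if suffixes:
            out.append("+" + "".join(suffixes))
        return out

    # The example from the slides: ولأولاده "and for his kids"
    print(segment_s1(["و", "ل"], "أولاد", ["ه"]))  # و+ ل+ أولاد +ه
    print(segment_s2(["و", "ل"], "أولاد", ["ه"]))  # ول+ أولاد +ه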
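The recombination step can be sketched the same way: try the table T first, then fall back to the manual rules R. The table entries, the frequency counts, and the single rule shown are toy stand-ins for the paper's full resources; in tuning scheme T2, the same pass would be applied to every n-best hypothesis before scoring.

    # Sketch of the T+R recombination scheme (toy data, illustrative only).
    import random

    # T: (decomposed word -> observed surface forms) harvested from training data.
    recomb_table = {
        "لكن +ي": ["لكني", "لكنني"],  # word ambiguity: both are grammatical
    }

    # Frequencies of stem spellings in the non-normalized data (toy counts).
    stem_counts = {"مدى": 40, "مدي": 3, "في": 1000, "فى": 2}

    def recombine_r(stem, suffix):
        """R: rule-based fallback. A stem-final ي may be a normalized ى;
        pick the spelling that is more frequent in the non-normalized data,
        and spell ى as ا before a suffix."""
        if stem.endswith("ي"):
            with_alef_maqsura = stem[:-1] + "ى"
            if stem_counts.get(with_alef_maqsura, 0) > stem_counts.get(stem, 0):
                return stem[:-1] + "ا" + suffix   # مدي +ه -> مداه
        return stem + suffix                      # في +ه -> فيه

    def recombine(stem, suffix):
        """T+R: use the table if the decomposed word was seen, else the rules."""
        key = stem + " +" + suffix
        if key in recomb_table:
            return random.choice(recomb_table[key])  # slides: pick randomly
        return recombine_r(stem, suffix)

    print(recombine("لكن", "ي"))  # T hit: لكني or لكنني
    print(recombine("مدي", "ه"))  # R fallback: مداه
    print(recombine("في", "ه"))   # R fallback: فيه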
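Finally, for the factored setup, Moses expects the factors of each token to be joined with "|". A minimal formatting sketch; the POS&clitics tag strings are invented for illustration and are not MADA's actual tag set.

    # Sketch: write one token sequence of factored training data
    # in Moses' surface|POS&clitics format (tags are placeholders).
    def to_factored(tokens, tags):
        return " ".join(w + "|" + t for w, t in zip(tokens, tags))

    print(to_factored(["فيه", "لاعبون"], ["PREP+PRON_3MS", "NOUN_MP_NOM"]))
    # -> فيه|PREP+PRON_3MS لاعبون|NOUN_MP_NOM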
Setup & Recombination
- Setup:
  - Use GIZA++ for alignment (for both unsegmented and segmented Arabic); use a maximum phrase length (MAXPHR) of 15 for segmented Arabic!
  - Decode using Moses.
  - SRILM language models:
    - Newswire: 4-gram (unsegmented Ar), 6-gram (segmented Ar).
    - Spoken dialogue: 3-gram (unsegmented Ar), 4-gram (segmented Ar).
  - Tune with MERT, optimizing for BLEU. Define two tuning schemes for segmented Arabic:
    - T1: use segmented Arabic for the references.
    - T2: use unsegmented Arabic for the references; recombine the n-best list before scoring.
- Recombination results:
  - Evaluated on the newswire training and test sets (sentence error rate!).
  - T was trained on the training set.
  - Baseline: glue prefixes and suffixes back on.
  - T+R: if the word was seen in training, use T; otherwise fall back to R.

Translation Results: News
Results for newswire (BLEU):
- Segmentation helps, but the gain diminishes as the training-data size increases (the model becomes less sparse).
- Segmentation scheme S2 is slightly better than S1.
- Tuning scheme T2 performs better than T1.
- Factored models perform best for the largest system (at a higher computational cost!).

Translation Results: Spoken Dialogue
Results for spoken dialogue (BLEU):
- S2 performs slightly better than S1.
- T1 is better than T2.

Conclusions
- Recombination based on both the training data and rules performs best.
- Segmentation helps, but the gain diminishes as the training-data size increases.
- Recombining the segmented output during tuning helps.
- Factored models perform best for the "Large" system.
- What next: explore the effect of syntactic reordering on English-to-Arabic MT. See: Syntactic Phrase Reordering for English-to-Arabic Statistical Machine Translation, Badr et al., EACL 2009.