Ibrahim Badr, Rabih Zbib, James Glass

 Experiments on English-to-Arabic SMT.
 Two domains: news text and spoken travel conversations.
 Explore the effect of Arabic segmentation on translation quality.
 Propose several schemes for recombining the segmented Arabic (not trivial!).
 Apply (basic) factored translation models.
Arabic Morphology
 Arabic is a morphologically rich language.
 Nouns and adjectives inflect for gender (M, F), number (sg, du, pl) and
case (Nom, Acc, Gen); all combinations are possible:
لاعب (a player, M), لاعبة (a player, F), لاعبان (two players, M),
لاعبتان (two players, F), لاعبون (players, M, pl, Nom), لاعبين (players, M, pl, Acc or Gen)
In addition to gender and number, verbs inflect for tense, voice, and person:
لعبوا (play, past, pl M 3P), يلعبون (play, present, pl M 3P), ستلعبون (play, future, pl M 2P)
• Additional prefixes: conjunction و, determiner أل, prepositions ب (with, in), ل (to),
لل (for) .. وباالبخبا
• Additional suffixes:
- possessive pronouns (attach to nouns): هم (their), كُم (your, pl, M), كُن (your, pl, F), …
- object and subject pronouns (attach to verbs): ني (me), هُن (them), و (they)
Many surface forms share the same lemma!
Arabic segmentation
 Use MADA for morphological decomposition of Arabic text.
 Typical normalization: ى is mapped to ي; أ, آ and إ are mapped to ا.
 Two proposed segmentation schemes:
S1: Split all clitics mentioned on the previous slide, except plural
and subject pronoun morphemes.
S2: Same as S1, but the split clitics are glued into one prefix and one suffix:
word = prefix + stem + suffix
Example: ولأولاده (and for his kids)
S1: و + ل + أولاد + ه
S2: ول + أولاد + ه
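The difference between S1 and S2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Analysis` container, the function names, and the `+` attachment markers are hypothetical and do not reflect MADA's actual output format; only the normalization map and the one-token-per-clitic (S1) versus glued prefix/suffix (S2) behaviour come from the slide.

```python
from dataclasses import dataclass, field
from typing import List

# Normalization from the slide, written with escapes to avoid display problems:
NORMALIZE = str.maketrans({
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
})


@dataclass
class Analysis:
    """Hypothetical container for one word's analysis (not MADA's real format)."""
    stem: str
    prefixes: List[str] = field(default_factory=list)
    suffixes: List[str] = field(default_factory=list)


def normalize(text: str) -> str:
    """Apply the character normalization listed on the slide."""
    return text.translate(NORMALIZE)


def segment_s1(a: Analysis) -> List[str]:
    """S1: every split clitic is its own token ('+' marks the attachment side)."""
    return [p + "+" for p in a.prefixes] + [a.stem] + ["+" + s for s in a.suffixes]


def segment_s2(a: Analysis) -> List[str]:
    """S2: clitics are glued into at most one prefix token and one suffix token."""
    tokens = []
    if a.prefixes:
        tokens.append("".join(a.prefixes) + "+")
    tokens.append(a.stem)
    if a.suffixes:
        tokens.append("+" + "".join(a.suffixes))
    return tokens


# The slide's example word, already analysed by hand (an assumption of this sketch).
word = Analysis(stem="أولاد", prefixes=["و", "ل"], suffixes=["ه"])
print(segment_s1(word))  # four tokens: two prefixes, stem, suffix
print(segment_s2(word))  # three tokens: glued prefix, stem, suffix
```

S2 yields fewer tokens per word than S1, so the segmented sentences are shorter.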
Arabic Recombination
Segmented output needs recombination!
Why is it not trivial?
a) Letter ambiguity: we normalized ى to ي, so stems ending in either letter look alike, yet recombine differently:
مدى + ه becomes مداه
في + ه becomes فيه
b) Word ambiguity: some words can be grammatically recombined in more than
one way:
لكن + ي becomes either (1) لكني or (2) لكنني
 Propose two recombination schemes:
1. R: manually defined recombination rules.
Resolve (a): pick the most frequent stem form in the non-normalized data.
Resolve (b): pick the most frequent grammatical form.
2. T: a table derived from the training set, with entries (surface, decomposed word).
If a decomposed word has more than one surface form, choose one at random.
Can help in recombining words that were segmented incorrectly.
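A minimal sketch of the table-based scheme T, assuming the training data is available as (surface, segmented tokens) pairs. The function names, the `+` clitic markers, and the use of a seeded random generator are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def build_table(pairs):
    """Map each segmented token sequence to the set of surface forms seen with it."""
    table = defaultdict(set)
    for surface, segmented in pairs:
        table[tuple(segmented)].add(surface)
    return table


def recombine_with_table(segmented, table, rng=random):
    """Return a surface form for a segmented word, or None if it was never seen.

    When several surface forms were observed, one is picked at random,
    as described on the slide.
    """
    candidates = table.get(tuple(segmented))
    if not candidates:
        return None
    return rng.choice(sorted(candidates))  # sorted only to make seeding reproducible


# Toy training pairs taken from the slide's examples.
pairs = [
    ("فيه", ["في", "+ه"]),
    ("لكني", ["لكن", "+ي"]),
    ("لكنني", ["لكن", "+ي"]),
]
table = build_table(pairs)
print(recombine_with_table(["في", "+ه"], table))
print(recombine_with_table(["لكن", "+ي"], table, rng=random.Random(0)))
```

Because the table is keyed on the segmentation as it was actually produced, it also returns the right surface form for words the segmenter split incorrectly, as long as the same incorrect split occurred in training.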
Factored Model & Data
 Factors:
- Factors on the English side: surface form + POS
- Factors on the Arabic side: surface form + POS&clitics
- Build a 3-gram LM on the surface form and a 7-gram LM on the POS&clitics.
- Generation model: surface + POS&clitics → surface.
 Data: newswire & spoken dialogue (travel)
- Training data:
Newswire: LDC: ~3M, ~1.6M, ~600K words (avg. sentence length: 33 En, 25 Ar, 36 SegAr)
Spoken dialogue: IWSLT (2007), 200K words (avg. sentence length: 9 En, 8 Ar, 10 SegAr)
- LM:
Newswire: ~3M words (Ar side) + 30M from Arabic Gigaword
Spoken dialogue: 200K words (Ar side)
- Tuning and test sets (1 En ref):
Newswire: 2000 tune, 2000 test (chosen randomly, same source as the training data)
Spoken dialogue: 500 tune, 500 test
Setup & Recombination
 Use GIZA++ for alignment (both unsegmented and segmented Arabic); use MAXPHR = 15 for SegAr!
 Decode using Moses.
 SRILM:
- Newswire: 4-gram (UnsegAr), 6-gram (SegAr).
- Spoken: 3-gram (UnsegAr), 4-gram (SegAr).
MERT for tuning, optimizing for BLEU.
Define 2 tuning schemes for SegAr:
- T1: use SegAr as the reference.
- T2: use UnsegAr as the reference; recombine the n-best list before scoring.
Recombination Results:
- Tested on the newswire training and test sets (sentence error!).
- T was trained on the training set.
- Baseline: glue prefixes and suffixes.
- T+R: if the word was seen, use T; otherwise use R.
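The T+R back-off and the sentence-level error measure can be sketched as follows. Several things here are assumptions: the table is taken to map a segmented token tuple to one already-chosen surface form, the manual rules R are not given on the slide and are stood in for by the baseline glue function, and "sentence error" is read as the fraction of sentences containing at least one wrongly recombined word.

```python
def glue(tokens):
    """Baseline: drop the '+' clitic markers and concatenate the pieces."""
    return "".join(t.strip("+") for t in tokens)


def recombine_t_plus_r(tokens, table, rules=glue):
    """T+R: use the table if the segmented word was seen, otherwise fall back to rules.

    `rules` is a placeholder for the manually defined rules R; the default
    (plain gluing) is only the baseline, not the authors' rule set.
    """
    surface = table.get(tuple(tokens))
    return surface if surface is not None else rules(tokens)


def sentence_error_rate(hypotheses, references):
    """Fraction of sentences whose recombined words differ from the reference."""
    wrong = sum(1 for hyp, ref in zip(hypotheses, references) if hyp != ref)
    return wrong / len(references)


# Toy table with one seen word; the second word is unseen and falls back to gluing.
table = {("في", "+ه"): "فيه"}
sentence = [("في", "+ه"), ("كتاب", "+ه")]
recombined = [recombine_t_plus_r(word, table) for word in sentence]
print(recombined)
print(sentence_error_rate([recombined], [recombined]))
```

Passing a different callable as `rules` is enough to compare the baseline, R alone (with an empty table), and T+R under the same error measure.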
Translation Results: News
Results for Newswire (BLEU):
Segmentation helps, but the gain diminishes as the training data size increases (less sparse).
Segmentation S2 is slightly better than S1.
Tuning scheme T2 performs better than T1.
Factored models perform best for the largest system (at a higher cost!).
Translation Results: Spoken
 Results for Spoken dialogue (BLEU):
 S2 performs slightly better than S1
 T1 is better than T2
- Recombination based on both the training data and rules performs best.
- Segmentation helps, but the gain diminishes as the training data size increases .
- Recombining the segmented output during tuning helps.
- Factored models perform best for the “Large” system.
- What next: explore the effect of syntactic reordering on English-to-Arabic MT:
Syntactic Phrase Reordering for English-to-Arabic Statistical Machine
Translation, Badr et al., EACL 2009.