Segmentation for English-to-Arabic Statistical Machine Translation
Ibrahim Badr, Rabih Zbib, James Glass

Introduction
 Experiments on English-to-Arabic SMT.
 Two domains: newswire text and spoken travel conversations.
 Explore the effect of Arabic segmentation on translation quality.
 Propose various schemes for recombining the segmented Arabic output (not trivial!).
 Apply (basic) factored translation models.
Arabic Morphology
 Arabic is a morphologically rich language.
 Nouns and adjectives inflect for gender (M, F), number (sg, du, pl), and case (Nom, Acc, Gen); all combinations are possible:
لاعب (a player, M), لاعبة (a player, F), لاعبان (two players, M),
لاعبتان (two players, F), لاعبون (players, M, pl, Nom), لاعبين (players, M, pl, Acc or Gen)

 In addition to gender and number, verbs inflect for tense, voice, and person:
لعبوا (played, past, 3MP), يلعبون (play, present, 3MP), سيلعبون (will play, future, 3MP)
• Additional prefixes: conjunction و (and), determiner ال (the), prepositions ب (with, in), ل (to, for), لل (for the); these can stack, e.g. وبالأخبار (and with the news)
• Additional suffixes:
- possessive pronouns (attach to nouns): هم (their), كُم (your, pl M), كُن (your, pl F), …
- object and subject pronouns (attach to verbs): ني (me), هُن (them, F), وا (they)
e.g. وبسياراتهم (and with their cars)

 Many surface forms share the same lemma!
Arabic Segmentation
 Use MADA for morphological decomposition of the Arabic text.
 (Typical) normalization: ى → ي, and أ/آ/إ → ا.
 Two proposed segmentation schemes (see the sketch below):
S1: split off all clitics mentioned on the previous slide, except plural and subject-pronoun morphemes.
S2: same as S1, but the split clitics are glued into one prefix and one suffix: word = prefix + stem + suffix.
Example: ولأولاده (and for his kids)
S1: و + ل + أولاد + ه
S2: ول + أولاد + ه
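As a rough illustration of the two schemes, here is a minimal Python sketch of the normalization step and of the S1/S2 output formats. It assumes the clitics have already been identified by a morphological analyzer (MADA, in the paper); the function names and the hand-segmented example are ours, not MADA output.

```python
# Toy illustration of the S1/S2 output formats (NOT the MADA analyzer itself).
# Clitic identification is assumed to have been done by the analyzer already.

def normalize(word: str) -> str:
    """Typical Arabic normalization: Alef variants -> bare Alef, ى -> ي."""
    table = str.maketrans({"أ": "ا", "آ": "ا", "إ": "ا", "ى": "ي"})
    return word.translate(table)

def to_s1(prefixes, stem, suffixes):
    """S1: every split clitic becomes a separate token."""
    return prefixes + [stem] + suffixes

def to_s2(prefixes, stem, suffixes):
    """S2: glue all prefixes into one token and all suffixes into one token,
    so each word is at most prefix + stem + suffix."""
    out = []
    if prefixes:
        out.append("".join(prefixes))
    out.append(stem)
    if suffixes:
        out.append("".join(suffixes))
    return out

# Example from the slide: ولأولاده "and for his kids"
prefixes, stem, suffixes = ["و", "ل"], "أولاد", ["ه"]
print(to_s1(prefixes, stem, suffixes))  # ['و', 'ل', 'أولاد', 'ه']
print(to_s2(prefixes, stem, suffixes))  # ['ول', 'أولاد', 'ه']
```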
Arabic Recombination
• Segmented output needs recombination!
• Why it is not trivial:
a) Letter ambiguity: we normalized ى → ي, so different surface forms yield the same normalized segmentation:
مداه → مدى + ه (normalized: مدي + ه)
فيه → في + ه
After normalization both stems end in ي, so the correct recombined surface form (…اه vs. …يه) is ambiguous.
b) Word ambiguity: some words can be grammatically recombined in more than one way:
لكن + ي → لكني (#1) or لكنني (#2)
 Propose two recombination schemes:
1. R: manually defined recombination rules.
Resolve (a): pick the most frequent stem form in the non-normalized data.
Resolve (b): pick the most frequent grammatical form.
2. T: a table of (surface form, decomposed word) pairs derived from the training set;
if more than one surface form matches, choose randomly.
The table can also recombine words that were segmented incorrectly (see the sketch below).
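A minimal sketch of T with an R-style fallback, assuming segments are marked with '+' (prefixes as pre+, suffixes as +suf). The simple marker-stripping rule below is a stand-in for the authors' full manual rule set, which also resolves the (a) and (b) ambiguities.

```python
# T+R recombination sketch: look the decomposed word up in a table built from
# the training data (T); if unseen, fall back to a rule-based recombiner (R).
import random
from collections import defaultdict

def build_table(pairs):
    """T: map each decomposed (segmented) word to its observed surface forms.
    `pairs` are (surface, decomposed) pairs taken from the training set."""
    table = defaultdict(list)
    for surface, decomposed in pairs:
        table[decomposed].append(surface)
    return table

def recombine_by_rules(decomposed: str) -> str:
    """R: simplified rule -- strip the '+' segment markers and join.
    The real rules also handle the ى/ي and لكني/لكنني ambiguities."""
    return decomposed.replace("+ ", "").replace(" +", "")

def recombine(decomposed: str, table) -> str:
    if decomposed in table:
        # More than one observed surface form: the slide says choose randomly.
        return random.choice(table[decomposed])
    return recombine_by_rules(decomposed)

# Usage: the table recovers the correct form even after ى -> ي normalization.
table = build_table([("مداه", "مدي +ه"), ("فيه", "في +ه")])
print(recombine("مدي +ه", table))   # مداه (seen in training -> T)
print(recombine("كتاب +ه", table))  # كتابه (unseen -> rule fallback R)
```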
Factored Model & Data
 Factors:
- English side: surface form + POS.
- Arabic side: surface form + combined POS&clitics tag.
- Build a 3-gram LM on surface forms and a 7-gram LM on the POS&clitics tags.
- Generation model: surface + POS&clitics → surface.
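For concreteness, a small sketch of the factored representation: Moses expects the factors of each token joined by '|'. The tag names below are hypothetical placeholders, not actual MADA/POS tags.

```python
# Sketch of a factored corpus line as fed to Moses: surface form plus a
# POS(&clitic) tag per token, joined by Moses' '|' factor separator.
# The tags here are illustrative placeholders.

def to_factored(tokens):
    """tokens: list of (surface, pos_and_clitics) pairs -> one factored line."""
    return " ".join(f"{surface}|{tag}" for surface, tag in tokens)

line = to_factored([("و", "conj"), ("ل", "prep"), ("أولاد", "noun"), ("ه", "poss_pron")])
print(line)  # و|conj ل|prep أولاد|noun ه|poss_pron
```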
 Data: newswire & spoken dialogue (travel)
- Training data:
Newswire: LDC corpora of ~3M, ~1.6M, and ~600K words (avg. sentence length: 33 En, 25 Ar, 36 segmented Ar).
Spoken dialogue: IWSLT 2007, 200K words (avg. sentence length: 9 En, 8 Ar, 10 segmented Ar).
- LM data:
Newswire: ~3M-word Arabic side + 30M words from Arabic Gigaword.
Spoken dialogue: 200K-word Arabic side.
- Tuning and test sets (single reference):
Newswire: 2000 tuning, 2000 test sentences (chosen randomly, same source as the training data).
Spoken dialogue: 500 tuning, 500 test sentences.
Setup & Recombination
Setup:
 Use GIZA++ for alignment (both unsegmented and segmented Arabic); maximum phrase length (MAXPHR) of 15 for segmented Arabic!
 Decode using MOSES.
 SRILM language models:
- Newswire: 4-gram (unsegmented Ar), 6-gram (segmented Ar).
- Spoken: 3-gram (unsegmented Ar), 4-gram (segmented Ar).
 MERT for tuning, optimizing BLEU. Two tuning schemes for segmented Arabic (see the sketch below):
- T1: use segmented Arabic references.
- T2: use unsegmented Arabic references; recombine the n-best list before scoring.
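A sketch of what T2 does, under the assumption that segments carry '+' markers as above; recombine_word and bleu are passed in as stand-ins for the word-level T+R recombiner and the scorer used inside MERT.

```python
# Tuning scheme T2: recombine each n-best hypothesis into unsegmented Arabic
# before scoring against the unsegmented reference.

def group_words(tokens):
    """Group segmented tokens into decomposed words: a token ending in '+'
    is a prefix of what follows; one starting with '+' is a suffix."""
    words = []
    for tok in tokens:
        if words and (tok.startswith("+") or words[-1][-1].endswith("+")):
            words[-1].append(tok)
        else:
            words.append([tok])
    return [" ".join(w) for w in words]

def score_nbest_t2(nbest, unseg_ref, recombine_word, bleu):
    """nbest: segmented hypothesis strings for one source sentence."""
    scores = []
    for hyp in nbest:
        surface = " ".join(recombine_word(w) for w in group_words(hyp.split()))
        scores.append(bleu(surface, unseg_ref))
    return scores

# Example with trivial stand-ins for the real components:
strip = lambda w: w.replace("+ ", "").replace(" +", "")
match = lambda hyp, ref: float(hyp == ref)
print(score_nbest_t2(["و+ ل+ أولاد +ه"], "ولأولاده", strip, match))  # [1.0]
```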
Recombination Results:
- Evaluated on the newswire training and test sets (metric: sentence error!).
- T was trained on the training set.
- Baseline: simply glue prefixes and suffixes back on.
- T+R: if the word was seen in training, use T; otherwise fall back to R.
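A minimal sketch of the sentence-error metric, assuming it simply counts the fraction of sentences whose recombined form fails to match the original unsegmented Arabic.

```python
# Sentence error for the recombination experiment: fraction of sentences whose
# recombined form differs from the original (pre-segmentation) Arabic sentence.

def sentence_error_rate(recombined, originals):
    assert len(recombined) == len(originals)
    errors = sum(hyp != ref for hyp, ref in zip(recombined, originals))
    return errors / len(originals)

print(sentence_error_rate(["ولأولاده"], ["ولأولاده"]))  # 0.0
```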
Translation Results: News
 Results for newswire (BLEU):
• Segmentation helps, but the gain diminishes as the training data size increases (less sparse model).
• Segmentation S2 is slightly better than S1.
• Tuning scheme T2 performs better than T1.
• Factored models perform best for the largest system (at a higher cost!).
Translation Results: Spoken Dialogue
 Results for spoken dialogue (BLEU):
 S2 performs slightly better than S1.
 T1 is better than T2.
Conclusions:
- Recombination based on both the training data and rules (T+R) performs best.
- Segmentation helps, but the gain diminishes as the training data size increases.
- Recombining the segmented output during tuning helps.
- Factored models perform best for the “Large” system.
- What next: explore the effect of syntactic reordering on English-to-Arabic MT:
Syntactic Phrase Reordering for English-to-Arabic Statistical Machine Translation, Badr et al., EACL 2009.