Machine Translation: Word Alignment
Stephan Vogel, Spring Semester 2011

Overview
- IBM 3: fertility
- IBM 4: relative distortion
- Acknowledgement: these slides are based on slides by Hermann Ney and Franz Josef Och

Fertility Models
- Basic concept: each word in one language can generate multiple words in the other language
  - deseo – I would like
  - übermorgen – the day after tomorrow
  - departed – fuhr ab
- The same word can generate different numbers of words -> probability distribution over fertilities $\Phi$
- Alignment is a function -> fertility exists only on one side
- In my terminology: target words have fertility, i.e. each target word can cover multiple source words
  (others say the source word generates multiple target words)
- Some source words are aligned to the NULL word, i.e. the NULL word has fertility
- Many target words are not aligned, i.e. have fertility 0

The Generative Story
[Diagram: the target words $e_0, \dots, e_5$ are assigned fertilities (here 1, 2, 0, 1, 3, 0), each word then generates that many words ($f_{01}, f_{11}, f_{12}, f_{31}, f_{41}, f_{42}, f_{43}$), and a permutation finally places them in the source positions $f_1, \dots, f_7$.]

Fertility Model
- Alignment model: $\Pr(f_1^J \mid e_0^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_0^I)$
- Select a fertility for each English word: $\Phi(e_i)$
- For each English word select a tablet of French words: $\tilde f_{i\kappa}$, $\kappa = 1, \dots, \Phi(e_i)$
- Select a permutation for the entire sequence of French words: $\pi: (i, \kappa) \mapsto j = \pi_{i\kappa}$
- Sum over all realizations:
  $\Pr(f_1^J, a_1^J \mid e_0^I) = \sum_{(\tilde f, \pi) \in (f_1^J, a_1^J)} \Pr(\tilde f, \pi \mid e_0^I)$

Fertility Model: Constraints
- Fertility is bound to the alignment: $\Phi_i = \Phi(e_i) = \sum_{j=1}^{J} \delta(i, a_j)$
- Permutation: for all $i$ and $\kappa = 1, \dots, \Phi_i$: $a_{\pi_{i\kappa}} = i$
- French words: $\tilde f_{i\kappa} = f_{\pi_{i\kappa}}$

Fertility Model
- Decomposition into factors:
  $\Pr(\tilde f, \pi \mid e_0^I) = \Pr(\Phi_0^I \mid e_0^I) \cdot \Pr(\tilde f \mid \Phi_0^I, e_0^I) \cdot \Pr(\pi \mid \tilde f, \Phi_0^I, e_0^I)$
- Apply the chain rule to each factor and limit the dependencies:
  - Fertility generation (IBM 3, 4, 5): $\Pr(\Phi_0^I \mid e_0^I) = p(\Phi_0 \mid e_0, \Phi_1^I) \cdot \prod_{i=1}^{I} p(\Phi_i \mid e_i)$
  - Word generation (IBM 3, 4, 5): $\Pr(\tilde f \mid \Phi_0^I, e_0^I) = \prod_{i=0}^{I} \prod_{\kappa=1}^{\Phi_i} p(\tilde f_{i\kappa} \mid e_i)$
  - Permutation generation (only IBM 3): $\Pr(\pi \mid \tilde f, \Phi_0^I, e_0^I) = \frac{1}{\Phi_0!} \prod_{i=1}^{I} \prod_{\kappa=1}^{\Phi_i} p(\pi_{i\kappa} \mid i, I, J)$
- Note: the factor $1/\Phi_0!$ results from the special model for $i = 0$.

Fertility Model: Some Issues
- The permutation model cannot guarantee that $\pi$ is a permutation
  -> words can be stacked on top of each other
  -> this leads to deficiency
- Position $i = 0$ is not a real position
  -> special alignment and fertility model for the empty word

Fertility Model: Empty Position
- Alignment assumptions for the empty position $i = 0$:
  - Uniform position distribution for each of the $\Phi_0$ French words generated from $e_0$
  - Place these French words only after all other words have been placed
- Alignment model for the positions aligned to the empty position:
  - One position: $p(\pi_{0\kappa} = j \mid i = 0, I, J)$ is 0 if $j$ is already occupied and uniform over the still vacant positions otherwise
  - All positions: $\prod_{\kappa=1}^{\Phi_0} p(\pi_{0\kappa} \mid i = 0, I, J) = \frac{1}{\Phi_0} \cdot \frac{1}{\Phi_0 - 1} \cdots \frac{1}{1} = \frac{1}{\Phi_0!}$

Fertility Model: Empty Position
- Fertility model for the words generated by $e_0$, i.e. by the empty position
- We assume that each word generated by a real target word requires the empty word with probability $1 - p_0$
- Probability that exactly $\Phi_0$ of the $J'$ words require the empty word:
  $p(\Phi_0 \mid J', e_0) = \binom{J'}{\Phi_0}\, p_0^{J' - \Phi_0}\, [1 - p_0]^{\Phi_0}$
  with $J' = \sum_{i=1}^{I} \Phi_i$ and $J = \Phi_0 + J'$
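The decomposition above can be read off directly in code. The following is a minimal sketch (not from the slides) of scoring one realization, i.e. tablet plus permutation, consistent with a given alignment under IBM Model 3; the lexicon table `t`, fertility table `n`, distortion table `d`, and NULL parameter `p1` are assumed to be already estimated, and all names are illustrative.

```python
from math import comb, factorial

def ibm3_realization_prob(f, e, a, t, n, d, p1):
    """Probability of one realization (tablet + permutation) under IBM Model 3.

    Multiplies the three factors from the decomposition above:
    fertility generation, word generation, permutation generation.

    f  : source sentence, list of J words
    e  : target sentence, list of I+1 words, e[0] is the empty (NULL) word
    a  : alignment, a[j] = i  <=>  f[j] is generated by e[i] (0 = NULL)
    t  : lexicon table,    t[(f_word, e_word)]
    n  : fertility table,  n[(phi, e_word)] for the real words e_1 .. e_I
    d  : distortion table, d[(j, i, I, J)] with 1-based positions
    p1 : probability that a word generated by a real target word
         "requires" an additional NULL-generated word (p0 = 1 - p1)
    """
    J, I = len(f), len(e) - 1

    # Fertilities phi_0 .. phi_I induced by the alignment
    phi = [0] * (I + 1)
    for i in a:
        phi[i] += 1
    J_real = J - phi[0]            # J' = words generated by real target words

    # Fertility generation: binomial model for the empty word, n(phi_i | e_i) otherwise
    p0 = 1.0 - p1
    prob = comb(J_real, phi[0]) * (p1 ** phi[0]) * (p0 ** (J_real - phi[0]))
    for i in range(1, I + 1):
        prob *= n[(phi[i], e[i])]

    # Word generation: lexicon probabilities t(f_j | e_{a_j})
    for j in range(J):
        prob *= t[(f[j], e[a[j]])]

    # Permutation generation: distortion for real positions, 1/phi_0! for NULL words
    for j in range(J):
        if a[j] != 0:
            prob *= d[(j + 1, a[j], I, J)]
    prob /= factorial(phi[0])

    return prob
```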
Deficiency
- The distortion model for real words is deficient
- The distortion model for the empty word is non-deficient
- Deficiency can be reduced by aligning more words to the empty word:
  the training corpus likelihood can be increased by aligning more words with the empty word
- Play with $p_0$!

IBM 4: 1st-Order Distortion Model
- Introduce more detailed dependencies into the alignment (permutation) model
- First-order dependency along the e-axis
[Figure: alignment lattices contrasting the HMM and IBM 4 distortion dependencies.]

Inverted Alignment
- Consider alignments $B: i \mapsto B_i \subset \{1, \dots, j, \dots, J\}$
- Dependency along the I axis: jumps along the J axis
- Two first-order models, $p_{=1}(\Delta j \mid \dots)$ for aligning the first word in a set and $p_{>1}(\Delta j \mid \dots)$ for aligning the remaining words
- We skip the math :-)

Characteristics of Alignment Models

  Model | Alignment | Fertility | E-step  | Deficient
  IBM1  | uniform   | no        | exact   | no
  IBM2  | 0-order   | no        | exact   | no
  HMM   | 1-order   | no        | exact   | no
  IBM3  | 0-order   | yes       | approx. | yes
  IBM4  | 1-order   | yes       | approx. | yes
  IBM5  | 1-order   | yes       | approx. | no

Consideration: Overfitting
- Training on data always carries the danger of overfitting:
  the model describes the training data in too much detail but does not perform well on unseen test data
- Solution: smoothing
  - Lexicon: distribute some of the probability mass from seen events to unseen events
    (for $p(f \mid e)$, do this for each $e$); for unseen $e$: uniform distribution or ???
  - Distortion: interpolate with a uniform distribution
    $p'(a_j \mid a_{j-1}, I) = (1 - \alpha)\, p(a_j \mid a_{j-1}, I) + \alpha \cdot 1/I$
  - Fertility: for many languages 'longer word' = 'more content', e.g. compounds or agglutinative morphology;
    train a model $p(\Phi \mid g(e))$ for fertility given the word length $g(e)$ and interpolate it with the word-specific model $p(\Phi \mid e)$
  - Interpolate the fertility estimates based on word frequency: for a frequent word use the word model, for a low-frequency word bias towards the length model

Extension: Using Manual Dictionaries
- Adding manual dictionaries
  - Simple method 1: add as bilingual data
  - Simple method 2: interpolate the manual with the trained dictionary
  - Use constrained GIZA (Gao, Nguyen, Vogel, WMT 2010)
  - Can put a higher weight on word pairs from the dictionary (Och, ACL 2000)
  - Not so simple: "But dictionaries are data too" (Brown et al., HLT 93)
- Problem: manual dictionaries do not contain inflected forms
  - Possible solution: generate additional word forms (Vogel and Monson, LREC 04)

Extension: Using POS
- Use POS in the distortion model
  - We had: $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, J, I)$
  - Now we condition on the word class of the previously aligned target word:
    $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, C(e_{a_{j-1}}), J, I)$
  - Available in GIZA++: automatic clustering of the vocabulary into word classes with mkcls (default: 50 classes)
- Use POS as a second 'lexicon' model (e.g. Zhao et al., ACL 2005)
  - Train $p(C(f) \mid C(e))$, starting with an initial model trained with IBM1 on word classes only
  - Align sentence pairs using $p(C(f) \mid C(e))$ and $p(f \mid e)$
  - Update both distributions from the Viterbi path
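As a small illustration of the word-class extension just described (a sketch under assumed data structures, not GIZA++ code): the distortion table is simply keyed on the class of the previously aligned target word, so counts are collected and normalized per conditioning context. The function and variable names are illustrative.

```python
from collections import defaultdict

def estimate_class_distortion(aligned_corpus, word_class):
    """Relative-frequency estimate of p(a_j | a_{j-1}, C(e_{a_{j-1}}), J, I).

    aligned_corpus : iterable of (e, f, a) triples; e and f are word lists,
                     a[j] = i means f[j] is aligned to target position i
                     (positions 1-based, 0 = NULL)
    word_class     : dict mapping a target word to its word-class id (e.g. from mkcls)
    """
    counts = defaultdict(float)
    totals = defaultdict(float)
    for e, f, a in aligned_corpus:
        I, J = len(e), len(f)
        prev = None                          # previously aligned (non-NULL) target position
        for j in range(J):
            i = a[j]
            if i == 0:
                continue                     # NULL links do not enter the distortion model
            if prev is not None:
                c = word_class[e[prev - 1]]  # class of the previously aligned target word
                counts[(i, prev, c, J, I)] += 1.0
                totals[(prev, c, J, I)] += 1.0
            prev = i
    # Normalize the counts within each conditioning context
    return {key: cnt / totals[key[1:]] for key, cnt in counts.items()}
```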
And Much More …
- Add fertilities to the HMM model
- Symmetrize during training, i.e. update the lexicon probabilities based on the symmetrized alignment
- Benefit from shorter sentence pairs:
  - Split long sentences based on an initial alignment and retrain
  - Extract phrase pairs and add reliable ones to the training data
- And then all the work on discriminative word alignment

Alignment Results
(precision, recall, and AER in %; the metrics are made precise in the sketch at the end of this section)

Arabic-English
  Alignment | Correct | Wrong   | Missing | Precision | Recall | AER
  IBM4 S2T  | 202,898 |  72,488 | 134,097 | 73.7      | 60.2   | 33.7
  IBM4 T2S  | 232,840 | 106,441 | 104,155 | 68.6      | 69.1   | 31.1
  Combined  | 244,814 |  89,652 |  92,178 | 73.2      | 72.6   | 27.1

Chinese-English
  Alignment | Correct | Wrong   | Missing | Precision | Recall | AER
  IBM4 S2T  | 186,620 | 172,865 | 341,183 | 51.9      | 35.4   | 57.9
  IBM4 T2S  | 299,744 | 151,478 | 228,059 | 66.4      | 56.8   | 38.8
  Combined  | 296,312 | 140,929 | 231,491 | 67.8      | 56.1   | 38.6

- Imbalance between wrong and missing links -> imbalance between precision and recall
- Chinese is harder: many missing links -> low recall
- One direction seems harder; this is related to which side has more words, since the alignment models generate one link per source word

Unaligned Words (in %)

Arabic-English
  Alignment        | NULL Alignment | Not Aligned
  Manual Alignment | 8.58           | 11.84
  IBM4 S2T         | 3.49           | 30.02
  IBM4 T2S         | 5.33           | 15.72
  Combined         | 5.53           |  7.70

Chinese-English
  Alignment        | NULL Alignment | Not Aligned
  Manual Alignment | 7.80           | 11.90
  IBM4 S2T         | 5.46           | 23.84
  IBM4 T2S         | 6.41           | 34.53
  Combined         | 9.80           | 14.64

- NULL alignment is explicit and part of the model; non-alignment just happens
- This is serious: the alignment model neglects up to 1/3 of the target words
- Alignment is very asymmetric, therefore combination

Alignment Errors for Most Frequent Words (CH-EN)
[Chart: alignment errors for the most frequent Chinese-English words.]

Sentence Length Distribution
- Sentences are often unbalanced:
  - wrong sentence alignment
  - bad translations
  - but also language divergences
- One may want to remove unbalanced sentence pairs
- The sentence length model is very weak
[Table: target sentence length distribution for source sentences of length 10.]

Summary: Word Alignment Models
- Alignment is (mathematically) a function, i.e. many source words map to one target word, but not the other way round
  - Symmetry is obtained by training in both directions
- Model IBM1: word-word probabilities
  - Simple training with Expectation-Maximization
- Model IBM2: position alignment
  - Training also with EM
- Model HMM: relative positions (first-order model)
  - Training with the Viterbi or Forward-Backward algorithm
- Alignment errors reflect restrictions in the generative alignment models
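For reference, the precision, recall, and AER numbers in the alignment-results table above can be reproduced from the link counts with the following sketch (assuming, as the numbers suggest, that no sure/possible distinction is made, so the reference links serve as both sets).

```python
def alignment_scores(hypothesis, reference):
    """Precision, recall, and alignment error rate (AER) for sets of alignment links.

    hypothesis, reference : sets of (source_pos, target_pos) link pairs.
    With every reference link treated as both 'sure' and 'possible',
    AER reduces to 1 - 2|A & S| / (|A| + |S|).
    """
    correct = len(hypothesis & reference)
    precision = correct / len(hypothesis)
    recall = correct / len(reference)
    aer = 1.0 - 2.0 * correct / (len(hypothesis) + len(reference))
    return precision, recall, aer

# Example: the Arabic-English IBM4 S2T row from the table above has
# correct = 202,898, wrong = 72,488, missing = 134,097, so
# |A| = 275,386 and |S| = 336,995, giving precision 73.7, recall 60.2, AER 33.7.
```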