(Statistical) Approaches to Word Alignment
11-734 Advanced Machine Translation Seminar
Sanjika Hewavitharana
Language Technologies Institute, Carnegie Mellon University
02/02/2006

Word Alignment Models
- We want to learn how to translate words and phrases
- We can learn this from parallel corpora
  - Typically we work with sentence-aligned corpora (available from LDC, etc.)
  - For specific applications new data collection is required
- Model the associations between the two languages:
  - Word-to-word mapping -> lexicon
  - Differences in word order -> distortion model
  - 'Wordiness', i.e. how many words are needed to express a concept -> fertility
- Statistical translation is based on word alignment models

Alignment Example
Observations:
- Often 1-1
- Often monotone
- Some 1-to-many
- Some 1-to-nothing

Word Alignment Models
- IBM1 – lexical probabilities only
- IBM2 – lexicon plus absolute position
- IBM3 – plus fertilities
- IBM4 – inverted relative position alignment
- IBM5 – non-deficient version of Model 4
- HMM – lexicon plus relative position
- BiBr – Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
- Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]

Notation
Source language:
- f: source (French) word
- J: length of the source sentence
- j: position in the source sentence; j = 1, 2, ..., J
- f_1^J = f_1 ... f_j ... f_J: source sentence
Target language:
- e: target (English) word
- I: length of the target sentence
- i: position in the target sentence; i = 1, 2, ..., I
- e_1^I = e_1 ... e_i ... e_I: target sentence

SMT – Principle
Translate a 'French' string f_1^J = f_1 ... f_j ... f_J into an 'English' string e_1^I = e_1 ... e_i ... e_I.
Bayes' decision rule for translation:
  \hat{e}_1^I = \arg\max_{e_1^I} Pr(e_1^I | f_1^J) = \arg\max_{e_1^I} Pr(e_1^I) Pr(f_1^J | e_1^I)
- Based on the noisy channel model
- We will call f the source and e the target

Alignment as Hidden Variable
- 'Hidden alignments' capture the word-to-word correspondences:
  A \subseteq {(j, i) | j = 1, ..., J; i = 1, ..., I}
- Number of connections: J * I (each source word with each target word)
- Number of alignments: 2^{JI}
Restricted alignment:
- Each source word has exactly one connection, i.e. the alignment is a function
- a_j = i: position i of the target word e_i connected to source position j
- Number of alignments is now: I^J
- a_1^J = a_1 ... a_j ... a_J: whole alignment

Relationship between Translation Model and Alignment Model
  Pr(f_1^J | e_1^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)
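To make the sum over hidden alignments concrete, here is a minimal sketch (not part of the original slides) that enumerates all I^J restricted alignments of a toy sentence pair and accumulates Pr(f_1^J, a_1^J | e_1^I), assuming a uniform position probability 1/I and lexicon probabilities invented purely for illustration:

    from itertools import product

    # Toy sketch: Pr(f|e) as a brute-force sum over all I^J restricted alignments.
    # The lexicon probabilities below are invented for illustration only.
    t = {("la", "the"): 0.7, ("la", "house"): 0.05,
         ("maison", "the"): 0.05, ("maison", "house"): 0.8}

    def marginal_prob(f_words, e_words):
        J, I = len(f_words), len(e_words)
        total = 0.0
        for a in product(range(I), repeat=J):   # one tuple (a_1, ..., a_J) per alignment
            p = 1.0
            for j, i in enumerate(a):           # a_j = i: f_j is aligned to e_i
                p *= (1.0 / I) * t.get((f_words[j], e_words[i]), 1e-9)
            total += p                          # accumulate Pr(f, a | e)
        return total                            # = Pr(f | e)

    print(marginal_prob(["la", "maison"], ["the", "house"]))

The models below replace this exponential enumeration with factored forms that can be computed efficiently.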
Empty Position (Null Word)
- Sometimes a source word has no correspondence
- The alignment function aligns each source word to one target word, i.e. it cannot skip a source word
- Solution: introduce the empty position 0 with the null word e_0
- 'Skip' source word f_j by aligning it to e_0
- The target sentence is extended to: e_0^I = e_0 ... e_i ... e_I
- The alignment is extended to: a_0^J = a_0 ... a_j ... a_J

Translation Model
Sum over all possible alignments:
  Pr(f_1^J | e_0^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)
  Pr(f_1^J, a_1^J | e_0^I) = Pr(J | e_0^I) \cdot Pr(f_1^J, a_1^J | J, e_0^I)
                           = Pr(J | e_0^I) \cdot Pr(a_1^J | J, e_0^I) \cdot Pr(f_1^J | a_1^J, J, e_0^I)
3 probability distributions:
- Length: Pr(J | e_0^I)
- Alignment: Pr(a_1^J | J, e_0^I) = \prod_{j=1}^{J} Pr(a_j | a_1^{j-1}, J, e_0^I)
- Lexicon: Pr(f_1^J | a_1^J, J, e_0^I) = \prod_{j=1}^{J} Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I)

Model Assumptions
Decompose the interaction into pairwise dependencies:
- Length: source length depends only on the target length (very weak):
  Pr(J | e_0^I) = p(J | I)
- Alignment:
  - Zero-order model: target position depends only on the source position:
    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)
  - First-order model: target position depends only on the previous target position:
    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
- Lexicon: source word depends only on the aligned target word:
  Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})

IBM Model 1
- Length: source length depends only on the target length: Pr(J | e_0^I) = p(J | I)
- Alignment: assume a uniform probability for the position alignment:
  p(i | j, I, J) = 1 / (I + 1)
- Lexicon: source word depends only on the aligned target word:
  Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})
- Alignment probability:
  Pr(f_1^J | e_1^I) = p(J | I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i | j, J, I) \, p(f_j | e_i)
                    = \frac{p(J | I)}{(I + 1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j | e_i)

IBM Model 1 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
- Step 1: Pick the length of f_1^J (all lengths are equally probable; p(J | I) is a constant)
- Step 2: Pick an alignment a_1^J with probability 1 / (I + 1)^J
- Step 3: Pick the French words with probability
  Pr(f_1^J | a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j | e_{a_j})
Final result:
  Pr(f_1^J | e_1^I) = \frac{p(J | I)}{(I + 1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j | e_i)
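The final product-of-sums form can be evaluated directly, without enumerating the (I+1)^J alignments. A minimal sketch, assuming a toy lexicon table t[(f, e)] and treating p(J | I) as a constant (both are placeholders, not values from the slides):

    # IBM Model 1: Pr(f|e) = p(J|I) / (I+1)^J * prod_j sum_{i=0..I} t(f_j | e_i)
    def model1_prob(f_words, e_words, t, p_J_given_I=1.0):
        e_ext = ["NULL"] + list(e_words)            # e_0 = NULL, then e_1 ... e_I
        I, J = len(e_words), len(f_words)
        prob = p_J_given_I / (I + 1) ** J
        for f in f_words:                           # product over source positions j
            prob *= sum(t.get((f, e), 1e-9) for e in e_ext)   # sum over i = 0 ... I
        return prob

The same sum-inside-a-product structure is what makes it feasible to collect EM counts over all alignments in the training procedure described next.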
IBM Model 1 – Training
- Parameters of the model: p(f | e) = t(f | e)
- Training data: parallel sentence pairs (f_1^J, e_1^I)
- We adjust the parameters so that they maximize
  \sum_{(f_1^J, e_1^I)} \log Pr(f_1^J | e_1^I)
- Normalized for each e: \sum_f t(f | e) = 1
- EM algorithm used for the estimation:
  - Initialize the parameters uniformly
  - Collect counts for each (f, e) pair in the corpus
  - Re-estimate the parameters using the counts
  - Repeat for several iterations
- The model is simple enough to compute over all alignments
- The parameters do not depend on the initial values

IBM Model 1 Training – Pseudo Code
  # Accumulation (over corpus)
  For each sentence pair
    For each source position j
      Sum = 0.0
      For each target position i
        Sum += p(fj|ei)
      For each target position i
        Count(fj,ei) += p(fj|ei) / Sum
  # Re-estimate probabilities (over count table)
  For each target word e
    Sum = 0.0
    For each source word f
      Sum += Count(f,e)
    For each source word f
      p(f|e) = Count(f,e) / Sum
  # Repeat for several iterations

IBM Model 2
The only difference from Model 1 is in the alignment probability.
- Length: source length depends only on the target length: Pr(J | e_0^I) = p(J | I)
- Alignment: the target position depends on the source position (in addition to the source and target lengths):
  Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)
- Model 1 is a special case of Model 2, with p(a_j | j, J, I) = 1 / (I + 1)
- Lexicon: source word depends only on the aligned target word:
  Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})

IBM Model 2 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
- Step 1: Pick the length of f_1^J (all lengths are equally probable; p(J | I) is a constant)
- Step 2: Pick an alignment a_1^J with probability \prod_{j=1}^{J} p(a_j | j, J, I)
- Step 3: Pick the French words with probability
  Pr(f_1^J | a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j | e_{a_j})
Final result:
  Pr(f_1^J | e_1^I) = p(J | I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i | j, J, I) \, p(f_j | e_i)

IBM Model 2 – Training
- Parameters of the model: p(f | e) = t(f | e), p(a_j | j, J, I) = a(a_j | j, J, I)
- Training data: parallel sentence pairs (f_1^J, e_1^I)
- We maximize \sum_{(f_1^J, e_1^I)} \log Pr(f_1^J | e_1^I) w.r.t. the translation and alignment parameters
- EM algorithm used for the estimation:
  - Initialize the alignment parameters uniformly, and the translation probabilities from Model 1
  - Accumulate counts, re-estimate the parameters
- The model is simple enough to compute over all alignments
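Because Models 1 and 2 factorize over source positions, the most probable (Viterbi) alignment can be read off position by position. A minimal sketch for Model 2, assuming dictionaries t[(f, e)] and a[(i, j, J, I)] like those produced by the EM training above (the names and key shapes are illustrative, not from the slides):

    # Viterbi alignment under Model 2: for each source position j pick the target
    # position i (0 = NULL) that maximizes a(i | j, J, I) * t(f_j | e_i).
    def viterbi_model2(f_words, e_words, t, a):
        e_ext = ["NULL"] + list(e_words)
        J, I = len(f_words), len(e_words)
        alignment = []
        for j, f in enumerate(f_words, start=1):
            best_i = max(range(I + 1),
                         key=lambda i: a.get((i, j, J, I), 1e-9) * t.get((f, e_ext[i]), 1e-9))
            alignment.append(best_i)                # a_j = best_i
        return alignment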
Fertility-based Alignment Models
- Models 3-5 are based on fertility
- Fertility: the number of source words connected with a target word e_i:
  \phi_i = \sum_{j=1}^{J} \delta(a_j, i)
- \phi_1^I = \phi_1 ... \phi_i ... \phi_I: fertility values of e_1^I
- p(\phi | e): probability that e is connected with \phi source words
- Alignment: defined in the reverse direction (target to source)
- p(j | i, J, I): probability of French position j given that the English position is i

IBM Model 3 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
Step 1: Choose (I + 1) fertilities \phi_0^I with probability Pr(\phi_0^I | e_1^I):
  Pr(\phi_0^I | e_0^I) = p(\phi_0 | \sum_{i=1}^{I} \phi_i) \cdot \prod_{i=1}^{I} p(\phi_i | e_i)
                       = \binom{J - \phi_0}{\phi_0} p_1^{\phi_0} (1 - p_1)^{J - 2\phi_0} \cdot \prod_{i=1}^{I} p(\phi_i | e_i)
Step 2: For each e_i and for k = 1, ..., \phi_i, choose a position \pi_{i,k} \in {1, ..., J} and a French word f_{i,k} with probability
  \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) \, p(f_{i,k} | e_i)
For a given alignment there are \prod_i \phi_i! orderings of the chosen positions, so that
  Pr(f_1^J, a_1^J | e_0^I) = \binom{J - \phi_0}{\phi_0} p_1^{\phi_0} (1 - p_1)^{J - 2\phi_0} \cdot \prod_{i=1}^{I} \phi_i! \, p(\phi_i | e_i) \cdot \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) \, p(f_{i,k} | e_i)

IBM Model 3 – Example [Knight 99]
  [e]:                        e_0  Mary did not slap the green witch
  [choose fertility]:         1    1    0   1   3    1   1     1
                              Mary not slap slap slap the green witch
  [fertility for e_0]:        Mary not slap slap slap NULL the green witch
  [choose translation]:       Mary no daba una bofetada a la verde bruja
  [choose target positions]:  Mary no daba una bofetada a la bruja verde
                         j:   1    2  3    4   5         6 7  8     9
                         a_j: 1    3  4    4   4         0 5  7     6

IBM Model 3 – Training
- Parameters of the model: p(f | e) = t(f | e), p(j | i, J, I) = d(j | i, J, I), p(\phi | e) = n(\phi | e), p_1
- EM algorithm used for the estimation:
  - It is not possible to compute exact EM updates
  - Initialize n, d, p uniformly, and the translation probabilities from Model 2
  - Accumulate counts, re-estimate the parameters
- Cannot efficiently compute over all alignments; only the Viterbi alignment is used
- Model 3 is deficient: probability mass is wasted on impossible translations

IBM Model 4
- Tries to model the re-ordering of phrases
- p(j | i, J, I) is replaced with two sets of parameters:
  - one for placing the first word (head) of a group of words
  - one for placing the rest of the words relative to the head
- Deficient: the alignment can generate source positions outside of the sentence length J
- Model 5 removes this deficiency

HMM Alignment Model
- Idea: relative position model [Vogel 96]
  (figure: alignment as a path over target positions vs. source positions)

HMM Alignment
- First-order model: the target position depends on the previous target position (captures the movement of entire phrases):
  Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
- Alignment probability:
  Pr(f_1^J | e_1^I) = p(J | I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) \, p(f_j | e_{a_j})
- The alignment depends on the relative position:
  p(i | i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}
- Maximum approximation:
  Pr(f_1^J | e_1^I) \approx p(J | I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) \, p(f_j | e_{a_j})

IBM2 vs. HMM [Vogel 96]
  (figure: comparison of IBM2 and HMM alignments)

Enhancements to HMM & IBM Models
- HMM model with empty word: add I empty words to the target side
- Model 6:
  - IBM 4 predicts the distance between subsequent target positions
  - HMM predicts the distance between subsequent source positions
  - Model 6 is a log-linear combination of the IBM 4 and HMM models:
    p_6(f, a | e) = \frac{p_4(f, a | e) \cdot p_{HMM}(f, a | e)}{\sum_{a', f'} p_4(f', a' | e) \cdot p_{HMM}(f', a' | e)}
- Smoothing:
  - Alignment probabilities: interpolate with a uniform distribution
  - Fertility probabilities: depend on the number of letters in a word
- Symmetrization: heuristic postprocessing to combine the alignments from both translation directions (a minimal sketch follows the results below)

Experimental Results [Och 03]
- Refined models perform better:
  - Models 4, 5, 6 are better than Model 1 or the Dice-coefficient model
  - The HMM is better than IBM 2
- Alignment quality depends on the training method and the bootstrap scheme used:
  - IBM 1 -> HMM -> IBM 3 is better than IBM 1 -> IBM 2 -> IBM 3
- Smoothing and symmetrization have a significant effect on alignment quality
- More alignments in training yield better results
- Using word classes: improvement for large corpora but not for small corpora
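As a concrete (simplified) illustration of the symmetrization idea mentioned above: align in both directions and combine the two link sets. The intersection/union below is the simplest such heuristic; the refined methods in [Och 03] grow the intersection with neighbouring links from the union.

    # forward[j-1] = i : source position j aligned to target position i (0 = NULL)
    # backward[i-1] = j : target position i aligned to source position j (0 = NULL)
    def symmetrize(forward, backward):
        f_links = {(j, i) for j, i in enumerate(forward, start=1) if i != 0}
        b_links = {(j, i) for i, j in enumerate(backward, start=1) if j != 0}
        intersection = f_links & b_links            # high precision
        union = f_links | b_links                   # high recall
        return intersection, union

    # e.g. symmetrize([1, 3, 4, 4, 4, 0, 5, 7, 6], backward_alignment)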
References:
- Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2.
- Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING '96: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, August, pp. 836-841.
- Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51.
- Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf