(Statistical) Approaches to Word Alignment
11-734 Advanced Machine Translation Seminar
Sanjika Hewavitharana
Language Technologies Institute, Carnegie Mellon University
02/02/2006

Word Alignment Models
- We want to learn how to translate words and phrases
- We can learn this from parallel corpora
  - Typically we work with sentence-aligned corpora (available from LDC, etc.)
  - For specific applications new data collection is required
- Model the associations between the two languages:
  - Word-to-word mapping -> lexicon
  - Differences in word order -> distortion model
  - 'Wordiness', i.e. how many words are needed to express a concept -> fertility
- Statistical translation is based on word alignment models

Alignment Example
Observations:
- Often 1-1
- Often monotone
- Some 1-to-many
- Some 1-to-nothing

Word Alignment Models
- IBM1 – lexical probabilities only
- IBM2 – lexicon plus absolute position
- IBM3 – plus fertilities
- IBM4 – inverted relative position alignment
- IBM5 – non-deficient version of Model 4
- HMM – lexicon plus relative position
- BiBr – Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
- Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]

Notation
Source language:
- f: source (French) word
- J: length of the source sentence
- j: position in the source sentence; j = 1, 2, ..., J
- f_1^J = f_1 ... f_j ... f_J: source sentence
Target language:
- e: target (English) word
- I: length of the target sentence
- i: position in the target sentence; i = 1, 2, ..., I
- e_1^I = e_1 ... e_i ... e_I: target sentence

SMT – Principle
Translate a 'French' string f_1^J = f_1 ... f_j ... f_J into an 'English' string e_1^I = e_1 ... e_i ... e_I.
Bayes' decision rule for translation:
  \hat{e}_1^I = \arg\max_{e_1^I} Pr(e_1^I | f_1^J) = \arg\max_{e_1^I} Pr(e_1^I) Pr(f_1^J | e_1^I)
- Based on the noisy channel model
- We will call f the source and e the target

Alignment as Hidden Variable
- 'Hidden alignments' capture the word-to-word correspondences:
  A \subseteq {(j, i) | j = 1, ..., J; i = 1, ..., I}
- Number of connections: J * I (each source word with each target word)
- Number of alignments: 2^{JI}
Restricted alignment:
- Each source word has exactly one connection, i.e. the alignment is a function
- a_j = i: position i of the target word e_i connected to source position j
- Number of alignments is now: I^J
- a_1^J = a_1 ... a_j ... a_J: whole alignment

Relationship between Translation Model and Alignment Model
  Pr(f_1^J | e_1^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)
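To make the sum over hidden alignments concrete, here is a minimal sketch (not part of the original slides) that enumerates all I^J restricted alignments of a toy sentence pair and accumulates Pr(f_1^J, a_1^J | e_1^I), assuming a uniform position probability 1/I and lexicon probabilities invented purely for illustration:

    from itertools import product

    # Toy sketch: Pr(f|e) as a brute-force sum over all I^J restricted alignments.
    # The lexicon probabilities below are invented for illustration only.
    t = {("la", "the"): 0.7, ("la", "house"): 0.05,
         ("maison", "the"): 0.05, ("maison", "house"): 0.8}

    def marginal_prob(f_words, e_words):
        J, I = len(f_words), len(e_words)
        total = 0.0
        for a in product(range(I), repeat=J):   # one tuple (a_1, ..., a_J) per alignment
            p = 1.0
            for j, i in enumerate(a):           # a_j = i: f_j is aligned to e_i
                p *= (1.0 / I) * t.get((f_words[j], e_words[i]), 1e-9)
            total += p                          # accumulate Pr(f, a | e)
        return total                            # = Pr(f | e)

    print(marginal_prob(["la", "maison"], ["the", "house"]))

The models below replace this exponential enumeration with factored forms that can be computed efficiently.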
Empty Position (Null Word)
- Sometimes a source word has no correspondence
- The alignment function aligns each source word to one target word, i.e. it cannot skip a source word
- Solution: introduce the empty position 0 with the null word e_0
- 'Skip' source word f_j by aligning it to e_0
- The target sentence is extended to: e_0^I = e_0 ... e_i ... e_I
- The alignment is extended to: a_0^J = a_0 ... a_j ... a_J

Translation Model
Sum over all possible alignments:
  Pr(f_1^J | e_0^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)
  Pr(f_1^J, a_1^J | e_0^I) = Pr(J | e_0^I) \cdot Pr(f_1^J, a_1^J | J, e_0^I)
                           = Pr(J | e_0^I) \cdot Pr(a_1^J | J, e_0^I) \cdot Pr(f_1^J | a_1^J, J, e_0^I)
3 probability distributions:
- Length: Pr(J | e_0^I)
- Alignment: Pr(a_1^J | J, e_0^I) = \prod_{j=1}^{J} Pr(a_j | a_1^{j-1}, J, e_0^I)
- Lexicon: Pr(f_1^J | a_1^J, J, e_0^I) = \prod_{j=1}^{J} Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I)

Model Assumptions
Decompose the interaction into pairwise dependencies:
- Length: source length depends only on the target length (very weak):
  Pr(J | e_0^I) = p(J | I)
- Alignment:
  - Zero-order model: target position depends only on the source position:
    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)
  - First-order model: target position depends only on the previous target position:
    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
- Lexicon: source word depends only on the aligned target word:
  Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})

IBM Model 1
- Length: source length depends only on the target length: Pr(J | e_0^I) = p(J | I)
- Alignment: assume a uniform probability for the position alignment:
  p(i | j, I, J) = 1 / (I + 1)
- Lexicon: source word depends only on the aligned target word:
  Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})
- Alignment probability:
  Pr(f_1^J | e_1^I) = p(J | I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i | j, J, I) \, p(f_j | e_i)
                    = \frac{p(J | I)}{(I + 1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j | e_i)

IBM Model 1 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
- Step 1: Pick the length of f_1^J (all lengths are equally probable; p(J | I) is a constant)
- Step 2: Pick an alignment a_1^J with probability 1 / (I + 1)^J
- Step 3: Pick the French words with probability
  Pr(f_1^J | a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j | e_{a_j})
Final result:
  Pr(f_1^J | e_1^I) = \frac{p(J | I)}{(I + 1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j | e_i)
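The final product-of-sums form can be evaluated directly, without enumerating the (I+1)^J alignments. A minimal sketch, assuming a toy lexicon table t[(f, e)] and treating p(J | I) as a constant (both are placeholders, not values from the slides):

    # IBM Model 1: Pr(f|e) = p(J|I) / (I+1)^J * prod_j sum_{i=0..I} t(f_j | e_i)
    def model1_prob(f_words, e_words, t, p_J_given_I=1.0):
        e_ext = ["NULL"] + list(e_words)            # e_0 = NULL, then e_1 ... e_I
        I, J = len(e_words), len(f_words)
        prob = p_J_given_I / (I + 1) ** J
        for f in f_words:                           # product over source positions j
            prob *= sum(t.get((f, e), 1e-9) for e in e_ext)   # sum over i = 0 ... I
        return prob

The same sum-inside-a-product structure is what makes it feasible to collect EM counts over all alignments in the training procedure described next.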
IBM Model 1 – Training
- Parameters of the model: p(f | e) = t(f | e)
- Training data: parallel sentence pairs (f_1^J, e_1^I)
- We adjust the parameters so that they maximize
  \sum_{(f_1^J, e_1^I)} \log Pr(f_1^J | e_1^I)
- Normalized for each e: \sum_f t(f | e) = 1
- EM algorithm used for the estimation:
  - Initialize the parameters uniformly
  - Collect counts for each (f, e) pair in the corpus
  - Re-estimate the parameters using the counts
  - Repeat for several iterations
- The model is simple enough to compute over all alignments
- The parameters do not depend on the initial values

IBM Model 1 Training – Pseudo Code
  # Accumulation (over corpus)
  For each sentence pair
    For each source position j
      Sum = 0.0
      For each target position i
        Sum += p(fj|ei)
      For each target position i
        Count(fj,ei) += p(fj|ei) / Sum
  # Re-estimate probabilities (over count table)
  For each target word e
    Sum = 0.0
    For each source word f
      Sum += Count(f,e)
    For each source word f
      p(f|e) = Count(f,e) / Sum
  # Repeat for several iterations

IBM Model 2
The only difference from Model 1 is in the alignment probability.
- Length: source length depends only on the target length: Pr(J | e_0^I) = p(J | I)
- Alignment: the target position depends on the source position (in addition to the source and target lengths):
  Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)
- Model 1 is a special case of Model 2, with p(a_j | j, J, I) = 1 / (I + 1)
- Lexicon: source word depends only on the aligned target word:
  Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})

IBM Model 2 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
- Step 1: Pick the length of f_1^J (all lengths are equally probable; p(J | I) is a constant)
- Step 2: Pick an alignment a_1^J with probability \prod_{j=1}^{J} p(a_j | j, J, I)
- Step 3: Pick the French words with probability
  Pr(f_1^J | a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j | e_{a_j})
Final result:
  Pr(f_1^J | e_1^I) = p(J | I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i | j, J, I) \, p(f_j | e_i)

IBM Model 2 – Training
- Parameters of the model: p(f | e) = t(f | e), p(a_j | j, J, I) = a(a_j | j, J, I)
- Training data: parallel sentence pairs (f_1^J, e_1^I)
- We maximize \sum_{(f_1^J, e_1^I)} \log Pr(f_1^J | e_1^I) w.r.t. the translation and alignment parameters
- EM algorithm used for the estimation:
  - Initialize the alignment parameters uniformly, and the translation probabilities from Model 1
  - Accumulate counts, re-estimate the parameters
- The model is simple enough to compute over all alignments
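Because Models 1 and 2 factorize over source positions, the most probable (Viterbi) alignment can be read off position by position. A minimal sketch for Model 2, assuming dictionaries t[(f, e)] and a[(i, j, J, I)] like those produced by the EM training above (the names and key shapes are illustrative, not from the slides):

    # Viterbi alignment under Model 2: for each source position j pick the target
    # position i (0 = NULL) that maximizes a(i | j, J, I) * t(f_j | e_i).
    def viterbi_model2(f_words, e_words, t, a):
        e_ext = ["NULL"] + list(e_words)
        J, I = len(f_words), len(e_words)
        alignment = []
        for j, f in enumerate(f_words, start=1):
            best_i = max(range(I + 1),
                         key=lambda i: a.get((i, j, J, I), 1e-9) * t.get((f, e_ext[i]), 1e-9))
            alignment.append(best_i)                # a_j = best_i
        return alignment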
Fertility-based Alignment Models
- Models 3-5 are based on fertility
- Fertility: the number of source words connected with a target word e_i:
  \phi_i = \sum_{j=1}^{J} \delta(a_j, i)
- \phi_1^I = \phi_1 ... \phi_i ... \phi_I: fertility values of e_1^I
- p(\phi | e): probability that e is connected with \phi source words
- Alignment: defined in the reverse direction (target to source)
- p(j | i, J, I): probability of French position j given that the English position is i

IBM Model 3 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
Step 1: Choose (I + 1) fertilities \phi_0^I with probability Pr(\phi_0^I | e_1^I):
  Pr(\phi_0^I | e_0^I) = p(\phi_0 | \sum_{i=1}^{I} \phi_i) \cdot \prod_{i=1}^{I} p(\phi_i | e_i)
                       = \binom{J - \phi_0}{\phi_0} p_1^{\phi_0} (1 - p_1)^{J - 2\phi_0} \cdot \prod_{i=1}^{I} p(\phi_i | e_i)
Step 2: For each e_i and for k = 1, ..., \phi_i, choose a position \pi_{i,k} \in {1, ..., J} and a French word f_{i,k} with probability
  \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) \, p(f_{i,k} | e_i)
For a given alignment there are \prod_i \phi_i! orderings of the chosen positions, so that
  Pr(f_1^J, a_1^J | e_0^I) = \binom{J - \phi_0}{\phi_0} p_1^{\phi_0} (1 - p_1)^{J - 2\phi_0} \cdot \prod_{i=1}^{I} \phi_i! \, p(\phi_i | e_i) \cdot \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) \, p(f_{i,k} | e_i)

IBM Model 3 – Example [Knight 99]
  [e]:                        e_0  Mary did not slap the green witch
  [choose fertility]:         1    1    0   1   3    1   1     1
                              Mary not slap slap slap the green witch
  [fertility for e_0]:        Mary not slap slap slap NULL the green witch
  [choose translation]:       Mary no daba una bofetada a la verde bruja
  [choose target positions]:  Mary no daba una bofetada a la bruja verde
                         j:   1    2  3    4   5         6 7  8     9
                         a_j: 1    3  4    4   4         0 5  7     6

IBM Model 3 – Training
- Parameters of the model: p(f | e) = t(f | e), p(j | i, J, I) = d(j | i, J, I), p(\phi | e) = n(\phi | e), p_1
- EM algorithm used for the estimation:
  - It is not possible to compute exact EM updates
  - Initialize n, d, p uniformly, and the translation probabilities from Model 2
  - Accumulate counts, re-estimate the parameters
- Cannot efficiently compute over all alignments; only the Viterbi alignment is used
- Model 3 is deficient: probability mass is wasted on impossible translations

IBM Model 4
- Tries to model the re-ordering of phrases
- p(j | i, J, I) is replaced with two sets of parameters:
  - one for placing the first word (head) of a group of words
  - one for placing the rest of the words relative to the head
- Deficient: the alignment can generate source positions outside of the sentence length J
- Model 5 removes this deficiency

HMM Alignment Model
- Idea: relative position model [Vogel 96]
  (figure: alignment as a path over target positions vs. source positions)

HMM Alignment
- First-order model: the target position depends on the previous target position (captures the movement of entire phrases):
  Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
- Alignment probability:
  Pr(f_1^J | e_1^I) = p(J | I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) \, p(f_j | e_{a_j})
- The alignment depends on the relative position:
  p(i | i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}
- Maximum approximation:
  Pr(f_1^J | e_1^I) \approx p(J | I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) \, p(f_j | e_{a_j})

IBM2 vs. HMM [Vogel 96]
  (figure: comparison of IBM2 and HMM alignments)

Enhancements to HMM & IBM Models
- HMM model with empty word: add I empty words to the target side
- Model 6:
  - IBM 4 predicts the distance between subsequent target positions
  - HMM predicts the distance between subsequent source positions
  - Model 6 is a log-linear combination of the IBM 4 and HMM models:
    p_6(f, a | e) = \frac{p_4(f, a | e) \cdot p_{HMM}(f, a | e)}{\sum_{a', f'} p_4(f', a' | e) \cdot p_{HMM}(f', a' | e)}
- Smoothing:
  - Alignment probabilities: interpolate with a uniform distribution
  - Fertility probabilities: depend on the number of letters in a word
- Symmetrization: heuristic postprocessing to combine the alignments from both translation directions (a minimal sketch follows the results below)

Experimental Results [Och 03]
- Refined models perform better:
  - Models 4, 5, 6 are better than Model 1 or the Dice-coefficient model
  - The HMM is better than IBM 2
- Alignment quality depends on the training method and the bootstrap scheme used:
  - IBM 1 -> HMM -> IBM 3 is better than IBM 1 -> IBM 2 -> IBM 3
- Smoothing and symmetrization have a significant effect on alignment quality
- More alignments in training yield better results
- Using word classes: improvement for large corpora but not for small corpora
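As a concrete (simplified) illustration of the symmetrization idea mentioned above: align in both directions and combine the two link sets. The intersection/union below is the simplest such heuristic; the refined methods in [Och 03] grow the intersection with neighbouring links from the union.

    # forward[j-1] = i : source position j aligned to target position i (0 = NULL)
    # backward[i-1] = j : target position i aligned to source position j (0 = NULL)
    def symmetrize(forward, backward):
        f_links = {(j, i) for j, i in enumerate(forward, start=1) if i != 0}
        b_links = {(j, i) for i, j in enumerate(backward, start=1) if j != 0}
        intersection = f_links & b_links            # high precision
        union = f_links | b_links                   # high recall
        return intersection, union

    # e.g. symmetrize([1, 3, 4, 4, 4, 0, 5, 7, 6], backward_alignment)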
References:
- Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2.
- Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING '96: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, August, pp. 836-841.
- Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51.
- Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf