Approaches to Word Alignment

(Statistical) Approaches to Word Alignment
11-734 Advanced Machine Translation Seminar
Sanjika Hewavitharana
Language Technologies Institute
Carnegie Mellon University
02/02/2006
Word Alignment Models
 We want to learn how to translate words and phrases
 Can learn it from parallel corpora
 Typically work with sentence aligned corpora
 Available from LDC, etc
 For specific applications new data collection required
 Model the associations between the different languages
 Word to word mapping -> lexicon
 Differences in word order -> distortion model
 ‘Wordiness’, i.e. how many words to express a concept -> fertility
 Statistical translation is based on word alignment models
Alignment Example
Observations:
 Often 1-1
 Often monotone
 Some 1-to-many
 Some 1-to-nothing
Word Alignment Models
 IBM1 – lexical probabilities only
 IBM2 – lexicon plus absolute position
 IBM3 – plus fertilities
 IBM4 – inverted relative position alignment
 IBM5 – non-deficient version of Model 4
 HMM – lexicon plus relative position
 BiBr – Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
 Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]
Notation
 Source language
   f : source (French) word
   J : length of source sentence
   j : position in source sentence; j = 1, 2, ..., J
   f_1^J = f_1 ... f_j ... f_J : source sentence
 Target language
   e : target (English) word
   I : length of target sentence
   i : position in target sentence; i = 1, 2, ..., I
   e_1^I = e_1 ... e_i ... e_I : target sentence
SMT - Principle
 Translate a ‘French’ string f_1^J = f_1 ... f_j ... f_J
   into an ‘English’ string e_1^I = e_1 ... e_i ... e_I
 Bayes’ decision rule for translation:
     \hat{e}_1^I = argmax_{e_1^I} { Pr(e_1^I | f_1^J) }
                 = argmax_{e_1^I} { Pr(e_1^I) Pr(f_1^J | e_1^I) }
 Based on Noisy channel model
 We will call f source and e target
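As a minimal illustration of the decision rule (not a real decoder), assuming hypothetical scoring functions lm_logprob(e) for Pr(e) and tm_logprob(f, e) for Pr(f | e), applied to a pre-generated candidate list:

    def decode(f, candidates, lm_logprob, tm_logprob):
        # Bayes' decision rule in log space: argmax_e { log Pr(e) + log Pr(f | e) }
        return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))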
Alignment as Hidden Variable
 ‘Hidden alignments’ to capture word-to-word correspondences
     A ⊆ {(j, i) | j = 1, ..., J; i = 1, ..., I}
 Number of connections: J * I (each source word with each target word)
 Number of alignments: 2^{JI}
 Restricted alignment
   Each source word has one connection – a function
   i = a_j : position i of e_i which is connected to j
   Number of alignments is now: I^J
   a_1^J = a_1 ... a_j ... a_J : whole alignment
 Relationship between Translation Model and Alignment Model
     Pr(f_1^J | e_1^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)
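To make the counting concrete, a tiny illustrative Python sketch (function name chosen here) that enumerates the restricted alignments of a toy sentence pair:

    from itertools import product

    def restricted_alignments(J, I):
        # Each source position j picks exactly one target position i (a function j -> i).
        return list(product(range(1, I + 1), repeat=J))

    print(len(restricted_alignments(J=3, I=2)))   # I**J = 2**3 = 8
    # Unrestricted alignments (any subset of the J*I connections) would number 2**(J*I) = 64.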
Empty Position (Null Word)
 Sometimes a word has no correspondence
 Alignment function aligns each source word to one target word, i.e. cannot skip a source word
 Solution:
   Introduce empty position 0 with null word e_0
   ‘Skip’ source word f_j by aligning it to e_0
   Target sentence is extended to: e_0^I = e_0 ... e_i ... e_I
   Alignment is extended to: a_0^J = a_0 ... a_j ... a_J
Translation Model
 Sum over all possible alignments
     Pr(f_1^J | e_0^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)

     Pr(f_1^J, a_1^J | e_0^I) = Pr(J | e_0^I) Pr(f_1^J, a_1^J | J, e_0^I)
                              = Pr(J | e_0^I) Pr(a_1^J | J, e_0^I) Pr(f_1^J | a_1^J, J, e_0^I)
 3 probability distributions:
   Length:    Pr(J | e_0^I)
   Alignment: Pr(a_1^J | J, e_0^I) = \prod_{j=1}^{J} Pr(a_j | a_1^{j-1}, J, e_0^I)
   Lexicon:   Pr(f_1^J | a_1^J, J, e_0^I) = \prod_{j=1}^{J} Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I)
Model Assumptions
Decompose interaction into pairwise dependencies
 Length: Source length only dependent on target length (very weak)
     Pr(J | e_0^I) = p(J | I)
 Alignment:
 Zero order model: target position only dependent on source position
Pr(a j | a1j 1 , J , e0I )  p(a j | j, J , I )
 First order model: target position only dependent on previous target
position
Pr(a j | a1j 1 , J , e0I )  p(a j | a j 1 , J , I )
 Lexicon: source word only dependent on aligned word
Pr( f j | f1 j 1 , a1J , J , e0I )  p( f j | ea j )
IBM Model 1
 Length: Source length only dependent on target length
     Pr(J | e_0^I) = p(J | I)
 Alignment: Assume uniform probability for position alignment
     p(i | j, J, I) = 1 / (I + 1)
 Lexicon: source word only dependent on aligned word
     Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})
 Alignment probability
     Pr(f_1^J | e_1^I) = p(J | I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i | j, J, I) p(f_j | e_i)
                       = p(J | I) / (I + 1)^J \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j | e_i)
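A minimal sketch of the second form above, assuming a translation table t keyed by (f, e) pairs and a constant epsilon standing in for p(J | I) (illustrative names, not from the references):

    import math

    def ibm1_log_prob(f_sent, e_sent, t, epsilon=1.0):
        # Pr(f_1^J | e_1^I) = p(J|I) / (I+1)^J * prod_j sum_i p(f_j | e_i)
        e_with_null = [None] + list(e_sent)            # e_0 = empty word
        I, J = len(e_sent), len(f_sent)
        log_p = math.log(epsilon) - J * math.log(I + 1)
        for f in f_sent:
            log_p += math.log(sum(t.get((f, e), 1e-12) for e in e_with_null))
        return log_p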
IBM Model 1 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
 Step 1: Pick the length of f_1^J
   All lengths are equally probable; p(J | I) is a constant
 Step 2: Pick an alignment a_1^J with probability 1 / (I + 1)^J
 Step 3: Pick the French words with probability
     Pr(f_1^J | a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j | e_{a_j})
 Final Result:
     Pr(f_1^J | e_1^I) = p(J | I) / (I + 1)^J \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j | e_i)
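The three steps can also be read as a sampler; a sketch under the same assumptions (table t keyed by (f, e), including entries for the empty word represented as None, and the length J treated as already drawn from p(J | I)):

    import random

    def ibm1_generate(e_sent, t, J):
        e_with_null = [None] + list(e_sent)               # e_0 = empty word
        I = len(e_sent)
        # Step 2: pick an alignment uniformly, each a_j in {0, ..., I}
        a = [random.randint(0, I) for _ in range(J)]
        # Step 3: pick each French word from p(. | e_{a_j})
        f = []
        for a_j in a:
            e = e_with_null[a_j]
            words, probs = zip(*[(fw, p) for (fw, ew), p in t.items() if ew == e])
            f.append(random.choices(words, weights=probs, k=1)[0])
        return f, a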
IBM Model 1 – Training
 Parameters of the model: p(f | e) = t(f | e)
 Training data: parallel sentence pairs (f_1^J, e_1^I)
 We adjust the parameters to maximize \sum_{(f_1^J, e_1^I)} \log Pr(f_1^J | e_1^I)
 Normalized for each e: \sum_f t(f | e) = 1
 EM Algorithm used for the estimation
   Initialize the parameters uniformly
   Collect counts for each (f, e) pair in the corpus
   Re-estimate parameters using counts
   Repeat for several iterations
 Model simple enough to compute over all alignments
 Parameters do not depend on the initial values
IBM Model 1 Training – Pseudo Code
# Accumulation (over corpus)
For each sentence pair
    For each source position j
        Sum = 0.0
        For each target position i
            Sum += p(fj|ei)
        For each target position i
            Count(fj,ei) += p(fj|ei)/Sum
# Re-estimate probabilities (over count table)
For each target word e
    Sum = 0.0
    For each source word f
        Sum += Count(f,e)
    For each source word f
        p(f|e) = Count(f,e)/Sum
# Repeat for several iterations
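A runnable Python rendering of the pseudocode above, assuming the corpus is a list of tokenized (source, target) sentence pairs and p is a dict keyed by (f, e) pairs (the empty word is omitted, as in the pseudocode):

    from collections import defaultdict

    def ibm1_em_iteration(corpus, p):
        count = defaultdict(float)                        # Count(f, e)
        # Accumulation (over corpus)
        for f_sent, e_sent in corpus:
            for f in f_sent:                              # each source position j
                total = sum(p[(f, e)] for e in e_sent)
                for e in e_sent:                          # each target position i
                    count[(f, e)] += p[(f, e)] / total
        # Re-estimate probabilities (over count table), normalizing per target word e
        norm = defaultdict(float)
        for (f, e), c in count.items():
            norm[e] += c
        return {(f, e): c / norm[e] for (f, e), c in count.items()}

    # Repeat for several iterations, starting from a uniform p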
IBM Model 2
Only difference from Model 1 is in the alignment probability
 Length: Source length only dependent on target length
     Pr(J | e_0^I) = p(J | I)
 Alignment: Target position depends on the source position
   (in addition to the source length and target length)
     Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)
 Model 1 is a special case of Model 2, where p(a_j | j, J, I) = 1 / (I + 1)
 Lexicon: source word only dependent on aligned word
     Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})
IBM Model 2 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
 Step 1: Pick the length of f_1^J
   All lengths are equally probable; p(J | I) is a constant
 Step 2: Pick an alignment a_1^J with probability \prod_{j=1}^{J} p(a_j | j, J, I)
 Step 3: Pick the French words with probability
     Pr(f_1^J | a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j | e_{a_j})
 Final Result:
     Pr(f_1^J | e_1^I) = p(J | I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i | j, J, I) p(f_j | e_i)
IBM Model 2 – Training
 Parameters of the model: p(f | e) = t(f | e)
                          p(a_j | j, J, I) = a(a_j | j, J, I)
 Training data: parallel sentence pairs (f_1^J, e_1^I)
 We maximize \sum_{(f_1^J, e_1^I)} \log Pr(f_1^J | e_1^I) w.r.t. the translation and alignment parameters
 EM Algorithm used for the estimation
 Initialize alignment parameters uniformly, and
translation probabilities from Model 1
 Accumulate counts, re-estimate parameters
 Model simple enough to compute over all alignments
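A sketch of one count-accumulation / re-estimation pass, with the same assumed table layout (t keyed by (f, e) as in Model 1, a keyed by (i, j, J, I)):

    from collections import defaultdict

    def ibm2_em_iteration(corpus, t, a):
        t_count, a_count = defaultdict(float), defaultdict(float)
        for f_sent, e_sent in corpus:
            e_null = [None] + list(e_sent)
            J, I = len(f_sent), len(e_sent)
            for j, f in enumerate(f_sent, start=1):
                denom = sum(a.get((i, j, J, I), 1.0 / (I + 1)) * t.get((f, e), 1e-12)
                            for i, e in enumerate(e_null))
                for i, e in enumerate(e_null):
                    post = a.get((i, j, J, I), 1.0 / (I + 1)) * t.get((f, e), 1e-12) / denom
                    t_count[(f, e)] += post               # posterior of link (j, i)
                    a_count[(i, j, J, I)] += post
        # Re-estimate: normalize counts per conditioning context
        t_norm, a_norm = defaultdict(float), defaultdict(float)
        for (f, e), c in t_count.items():
            t_norm[e] += c
        for (i, j, J, I), c in a_count.items():
            a_norm[(j, J, I)] += c
        new_t = {(f, e): c / t_norm[e] for (f, e), c in t_count.items()}
        new_a = {k: c / a_norm[k[1:]] for k, c in a_count.items()}
        return new_t, new_a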
Fertility-based Alignment Models
 Models 3-5 are based on Fertility
 Fertility: Number of source words connected with a target word
     \phi_i = \sum_j \delta(a_j, i) : fertility of e_i
     \phi_1^I = \phi_1 ... \phi_i ... \phi_I : fertility values of e_1^I
     p(\phi | e) = probability that e is connected with \phi source words
 Alignment: Defined in the reverse direction (target to source)
     p(j | i, J, I) = probability of French position j given English position is i
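The fertilities of a given alignment can be read off directly; a small sketch (using the alignment from the Model 3 example a few slides below):

    from collections import Counter

    def fertilities(a, I):
        # a[j-1] = a_j, the target position linked to source position j (0 = empty word)
        counts = Counter(a)
        return [counts.get(i, 0) for i in range(I + 1)]   # [phi_0, phi_1, ..., phi_I]

    print(fertilities([1, 3, 4, 4, 4, 0, 5, 7, 6], I=7))  # [1, 1, 0, 1, 3, 1, 1, 1]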
IBM Model 3 – Generative Process
To generate a French string f_1^J from an English string e_1^I:
 Step 1: Choose (I+1) fertilities \phi_0^I with probability Pr(\phi_0^I | e_0^I)

     Pr(\phi_0^I | e_0^I) = p(\phi_0 | \phi_1^I) \prod_{i=1}^{I} p(\phi_i | e_i)
                          = p(\phi_0 | \sum_{i=1}^{I} \phi_i) (1/\phi_0!) \prod_{i=1}^{I} p(\phi_i | e_i)
                          = \binom{J - \phi_0}{\phi_0} (1 - p_1)^{J - 2\phi_0} p_1^{\phi_0} (1/\phi_0!) \prod_{i=1}^{I} p(\phi_i | e_i)
IBM Model 3 – Generative Process
 Step 2: For each e_i, for k = 1, ..., \phi_i, choose a position \pi_{i,k} ∈ {1, ..., J} and a French word f_{i,k} with probability

     \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) p(f_{i,k} | e_i)

 For a given alignment, there are \prod_{i=0}^{I} \phi_i! orderings

     Pr(f_1^J, a_1^J | e_0^I) = p(\phi_0^I | e_0^I) \prod_{i=0}^{I} \phi_i! \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) p(f_{i,k} | e_i)
                              = \binom{J - \phi_0}{\phi_0} (1 - p_1)^{J - 2\phi_0} p_1^{\phi_0} \prod_{i=1}^{I} p(\phi_i | e_i) \phi_i! \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} | i, I, J) p(f_{i,k} | e_i)
IBM Model 3 – Example
[Knight 99]

  e_0  Mary did not slap the green witch                             [e]
       fertilities: Mary 1, did 0, not 1, slap 3, the 1, green 1, witch 1   [choose fertility]
  Mary not slap slap slap the green witch
  Mary not slap slap slap NULL the green witch                       [fertility for e_0]
  Mary no daba una botefada a la verde bruja                         [choose translation]
  Mary no daba una botefada a la bruja verde                         [choose target positions j]
       j:    1 2 3 4 5 6 7 8 9
       a_j:  1 3 4 4 4 0 5 7 6
IBM Model 3 – Training
 Parameters of the model: p(f | e) = t(f | e)
                          p(j | i, J, I) = d(j | i, J, I)
                          p(\phi | e) = n(\phi | e)
                          p_1
 EM Algorithm used for the estimation
 Not possible to compute exact EM updates
 Initialize n,d,p uniformly, and translation probabilities from Model 2
 Accumulate counts, re-estimate parameters
 Cannot efficiently compute over all alignments
 Only Viterbi alignment is used
 Model 3 is deficient
 Probability mass is wasted on impossible translations
IBM Model 4
 Try to model re-ordering of phrases
 p(j | i, J, I) is replaced with two sets of parameters:
 One for placing the first word (head) of a group of words
 One for placing rest of the words relative to the head
 Deficient
 Alignment can generate source positions outside of sentence length J
 Model 5 removes this deficiency
HMM Alignment Model
 Idea: relative position model
[Figure: alignment as a path over source and target positions]
[Vogel 96]
HMM Alignment
 First order model: target position dependent on previous target position
   (captures movement of entire phrases)
     Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
 Alignment probability:
     Pr(f_1^J | e_1^I) = p(J | I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(f_j | e_{a_j})
 Alignment depends on relative position
     p(i | i', I) = c(i - i') / \sum_{i''=1}^{I} c(i'' - i')
 Maximum approximation:
     Pr(f_1^J | e_1^I) ≈ p(J | I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j | a_{j-1}, I) p(f_j | e_{a_j})
IBM2 vs HMM
[Vogel 96]
Enhancements to HMM & IBM Models
 HMM model with empty word
 Adding I empty words to the target side
 Model 6
 IBM 4: predicts distance between subsequent target positions
 HMM: predicts distance between subsequent source positions
 Model 6: A log-linear combination of IBM 4 and HMM Models
     p_6(f, a | e) = p_4(f, a | e) p_HMM(f, a | e) / \sum_{a', f'} p_4(f', a' | e) p_HMM(f', a' | e)
 Smoothing
 Alignment prob. – Interpolate with uniform dist.
 Fertility prob. – Depends on the number of letters in a word
 Symmetrization
 Heuristic postprocessing to combine alignments in both directions
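The symmetrization step is only described at a high level here; intersection and union of the two directional link sets are the usual building blocks of such heuristics (the exact growing heuristic is not specified on this slide):

    def symmetrize(src2tgt, tgt2src):
        # src2tgt, tgt2src: sets of (j, i) links from the two directional alignments,
        # already mapped into the same (source position, target position) convention
        intersection = src2tgt & tgt2src      # high precision
        union = src2tgt | tgt2src             # high recall
        return intersection, union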
Experimental Results
[Franz 03]
 Refined models perform better
 Models 4,5,6 better than Model 1 or Dice coefficient model
 HMM better than IBM 2
 Alignment quality depends on the training method and the bootstrap scheme used
 IBM 1->HMM->IBM 3 better than IBM 1->IBM 2->IBM 3
 Smoothing and Symmetrization have a significant effect on alignment
quality
 Using more alignments in training yields better results
 Using word classes
 Improvement for large corpora but not for small corpora
References:
 Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation. Computational Linguistics, vol. 19, no. 2.
 Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING, The 16th Int. Conf. on Computational Linguistics, Copenhagen, Denmark, August, pp. 836-841.
 Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51.
 Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf.