
Machine Translation
Word Alignment
Stephan Vogel
Spring Semester 2011
Overview
- IBM 3: Fertility
- IBM 4: Relative Distortion

Acknowledgement: These slides are based on slides by Hermann Ney and Franz Josef Och
Fertility Models
- Basic concept: each word in one language can generate multiple words in the other language
  deseo – I would like
  übermorgen – the day after tomorrow
  departed – fuhr ab
- The same word can generate different numbers of words -> probability distribution over fertilities \Phi (see the sketch below)
- Alignment is a function -> fertility only on one side
  - In my terminology: target words have fertility, i.e. each target word can cover multiple source words
  - Others say the source word generates multiple target words
- Some source words are aligned to the NULL word, i.e. the NULL word has fertility
- Many target words are not aligned, i.e. have fertility 0
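To make the idea concrete, here is a minimal sketch of a fertility table in Python; the words and probabilities are invented for illustration, not trained values:

```python
# Toy fertility distributions p(phi | e): for each word, a distribution over
# how many words it generates in the other language (numbers invented).
fertility = {
    "deseo":       {2: 0.2, 3: 0.7, 4: 0.1},    # "I would like"
    "übermorgen":  {3: 0.2, 4: 0.8},            # "the day after tomorrow"
    "house":       {0: 0.05, 1: 0.90, 2: 0.05}, # usually exactly one word
}

def p_fertility(word, phi):
    """Probability that `word` generates exactly phi words."""
    return fertility.get(word, {}).get(phi, 0.0)

print(p_fertility("übermorgen", 4))   # 0.8
```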
The Generative Story
[Figure: the generative story. The English words e0 … e5 first undergo fertility generation (here 1, 2, 0, 1, 3, 0), then word generation fills each word's tablet (f01; f11, f12; f31; f41, f42, f43), and finally permutation generation places these seven words into the French positions f1 … f7.]
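A toy walk through the three generation steps in Python; the tables are invented placeholders, the empty word is left out, and the permutation is drawn uniformly at random, which is cruder than the real IBM-3 permutation model:

```python
import random

def generate(e_words, fert_table, trans_table):
    """Toy IBM-3 style generative story: fertility -> words -> permutation."""
    # 1. Fertility generation: each e_i decides how many words it produces.
    phis = [random.choice(fert_table[e]) for e in e_words]
    # 2. Word generation: each e_i fills a tablet with translations.
    tablets = [[random.choice(trans_table[e]) for _ in range(phi)]
               for e, phi in zip(e_words, phis)]
    # 3. Permutation generation: the generated words are placed into the
    #    J target positions (here simply a random order).
    french = [w for tablet in tablets for w in tablet]
    random.shuffle(french)
    return phis, tablets, french

fert_table = {"e1": [1, 2], "e2": [2], "e3": [0], "e4": [1], "e5": [2, 3]}
trans_table = {"e1": ["f_a"], "e2": ["f_b", "f_c"], "e3": ["f_d"],
               "e4": ["f_e"], "e5": ["f_f", "f_g"]}
print(generate(["e1", "e2", "e3", "e4", "e5"], fert_table, trans_table))
```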
Fertility Model
Alignment model:

  Pr(f_1^J | e_0^I) = \sum_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)

Select a fertility for each English word:

  \Phi(e_i)

For each English word select a tablet of French words:

  \tilde{f}_{i\kappa}, \quad \kappa = 1, ..., \Phi(e_i)

Select a permutation for the entire sequence of French words:

  \pi : (i, \kappa) \mapsto j = \pi_{i\kappa}

Sum over all realizations:

  Pr(f_1^J, a_1^J | e_0^I) = \sum_{(\tilde{f}, \pi) \in (f_1^J, a_1^J)} Pr(\tilde{f}, \pi | e_0^I)
Fertility Model: Constraints
Fertility bound to the alignment:

  \Phi_i = \Phi(e_i) = \sum_{j=1}^{J} \delta(i, a_j)

Permutation:

  \pi_{i\kappa} = j \Rightarrow a_{\pi_{i\kappa}} = i, \quad \kappa = 1, ..., \Phi_i

French words:

  \tilde{f}_{i\kappa} = f_{\pi_{i\kappa}}
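In code the first constraint is just a count over the alignment function; a minimal sketch that reproduces the fertilities from the generative-story figure:

```python
from collections import Counter

# Alignment function a_j for j = 1..7 (0 = empty word), as in the figure.
a = [0, 1, 1, 3, 4, 4, 4]
I = 5                                            # English words e_1 .. e_5

counts = Counter(a)
phi = [counts.get(i, 0) for i in range(I + 1)]   # Phi_i = sum_j delta(i, a_j)
print(phi)                                       # [1, 2, 0, 1, 3, 0]
```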
Fertility Model
Decomposition into factors:

  Pr(\tilde{f}, \pi | e_0^I) = Pr(\Phi_0^I | e_0^I) \cdot Pr(\tilde{f} | \Phi_0^I, e_0^I) \cdot Pr(\pi | \tilde{f}, \Phi_0^I, e_0^I)

Apply the chain rule to each factor and limit the dependencies:

Fertility generation (IBM 3, 4, 5):

  Pr(\Phi_0^I | e_0^I) = p(\Phi_0 | \sum_{i=1}^{I} \Phi_i, e_0^I) \cdot \prod_{i=1}^{I} p(\Phi_i | e_i)

Word generation (IBM 3, 4, 5):

  Pr(\tilde{f} | \Phi_0^I, e_0^I) = \prod_{i=0}^{I} \prod_{\kappa=1}^{\Phi_i} p(\tilde{f}_{i\kappa} | e_i)

Permutation generation (only IBM 3):

  Pr(\pi | \tilde{f}, \Phi_0^I, e_0^I) = \frac{1}{\Phi_0!} \prod_{i=1}^{I} \prod_{\kappa=1}^{\Phi_i} p(\pi_{i\kappa} | i, I, J)

Note: the factor 1/\Phi_0! results from the special model for i = 0.
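A minimal sketch of these three factors in Python; `n`, `t` and `d` are hypothetical parameter dictionaries (not GIZA++ data structures), and the empty-word fertility model is passed in as a function because it gets its own treatment on the next slides:

```python
import math

def ibm3_prob(e, tablets, pi, n, t, d, p_fert0):
    """Pr(f~, pi | e_0^I) as the product of the three factors above.
    e[0] is the empty word; tablets[i] are the words generated by e[i];
    pi[i] are their target positions (1-based)."""
    I = len(e) - 1
    J = sum(len(tab) for tab in tablets)
    phi = [len(tab) for tab in tablets]

    # Fertility generation: special model for i = 0, table n(phi | e_i) else.
    prob = p_fert0(phi[0], sum(phi[1:]))
    for i in range(1, I + 1):
        prob *= n[e[i]][phi[i]]

    # Word generation: t(f | e) for every word in every tablet.
    for i in range(I + 1):
        for f in tablets[i]:
            prob *= t[e[i]][f]

    # Permutation generation (IBM 3): d(j | i, I, J) for real words,
    # and the 1/phi_0! factor for the words generated by the empty word.
    for i in range(1, I + 1):
        for j in pi[i]:
            prob *= d[(j, i, I, J)]
    return prob / math.factorial(phi[0])
```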
Fertility Model: Some Issues
- The permutation model cannot guarantee that \pi is a permutation
  -> words can be stacked on top of each other
  -> this leads to deficiency
- Position i = 0 is not a real position
  -> special alignment and fertility model for the empty word
Fertility Model: Empty Position
- Alignment assumptions for the empty position i = 0
  - Uniform position distribution for each of the \Phi_0 French words generated from e_0
  - Place these French words only after all other words have been placed
- Alignment model for the positions aligned to the empty position:
  - One position:

      p(\pi_{0\kappa} = j | i = 0, I, J) = 0 if j is occupied, \frac{1}{\Phi_0 - \kappa + 1} if j is vacant

  - All positions:

      \prod_{\kappa=1}^{\Phi_0} p(\pi_{0\kappa} | i = 0, I, J) = \prod_{\kappa=1}^{\Phi_0} \frac{1}{\Phi_0 - \kappa + 1} = \frac{1}{\Phi_0!}
Fertility Model: Empty Position
- Fertility model for the words generated by e_0, i.e. by the empty position
  - We assume that each word from f_1^J requires the empty word with probability [1 - p_0]
  - Probability that exactly \Phi_0 of the J' words require the empty word (a numerical check follows below):

      p(\Phi_0 | J', e_0^I) = \binom{J'}{\Phi_0} p_0^{J' - \Phi_0} [1 - p_0]^{\Phi_0}

    with J' = \sum_{i=1}^{I} \Phi_i and J = \Phi_0 + J'
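A quick numerical check of this binomial (the value of p_0 is invented; math.comb gives the binomial coefficient):

```python
from math import comb

def p_null_fertility(phi0, j_prime, p0):
    """p(Phi_0 | J') = C(J', Phi_0) * p0^(J' - Phi_0) * (1 - p0)^Phi_0."""
    return comb(j_prime, phi0) * p0 ** (j_prime - phi0) * (1 - p0) ** phi0

# Toy setting: J' = 10 real words, p0 = 0.9.
for phi0 in range(4):
    print(phi0, round(p_null_fertility(phi0, 10, 0.9), 4))
# 0 0.3487, 1 0.3874, 2 0.1937, 3 0.0574 -- lowering p0 shifts mass
# towards more NULL-generated words ("play with p0" on the next slide).
```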
Deficiency
- The distortion model for real words is deficient
- The distortion model for the empty word is non-deficient
- Deficiency can be reduced by aligning more words to the empty word
- The training corpus likelihood can be increased by aligning more words with the empty word
- Play with p_0!
IBM 4: 1st Order Distortion Model
- Introduce more detailed dependencies into the alignment (permutation) model
- First-order dependency along the e-axis
[Figure: first-order dependencies in the HMM (along the j-axis) vs. IBM 4 (along the e-axis).]
Inverted Alignment
- Consider alignments

    B : i -> B_i \subseteq \{1, ..., j, ..., J\}

- Dependency along the I axis: jumps along the J axis
- Two first-order models (see the sketch after this list)

    p_{=1}(\Delta j | ...) and p_{>1}(\Delta j | ...)

  for aligning the first word in a set and for aligning the remaining words
- We skip the math :-)
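Skipping the math, the gist of the two models can still be sketched: the first word of a cept jumps relative to the previous cept, the remaining words jump relative to their predecessor within the cept. The tables below are invented, and the "center of the previous cept" bookkeeping is simplified:

```python
def cept_score(positions, prev_center, d_first, d_rest):
    """Score the target positions of one cept (sorted ascending)."""
    # First word of the cept: jump relative to the previous cept's center.
    score = d_first.get(positions[0] - prev_center, 1e-10)
    # Remaining words: jump relative to the previous word of the same cept.
    for prev, cur in zip(positions, positions[1:]):
        score *= d_rest.get(cur - prev, 1e-10)
    return score

# Toy first-order jump tables favouring small forward jumps.
d_first = {-1: 0.1, 0: 0.2, 1: 0.4, 2: 0.2, 3: 0.1}
d_rest = {1: 0.7, 2: 0.2, 3: 0.1}
print(cept_score([4, 5, 7], prev_center=3, d_first=d_first, d_rest=d_rest))
# 0.4 * 0.7 * 0.2 = 0.056
```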
Characteristics of Alignment Models
Model | Alignment | Fertility | E-step | Deficient
IBM1  | Uniform   | No        | Exact  | No
IBM2  | 0-order   | No        | Exact  | No
HMM   | 1-order   | No        | Exact  | No
IBM3  | 0-order   | Yes       | Approx | Yes
IBM4  | 1-order   | Yes       | Approx | Yes
IBM5  | 1-order   | Yes       | Approx | No
Consideration: Overfitting
- Training on data always carries the danger of overfitting
  - The model describes the training data in too much detail
  - But it does not perform well on unseen test data
- Solution: smoothing
  - Lexicon: distribute some of the probability mass from seen events to unseen events (for p(f | e), do this for each e)
  - For unseen e: uniform distribution or ???
  - Distortion: interpolate with a uniform distribution (see the sketch after this list)

      p'(a_j | a_{j-1}, I) = (1 - \alpha) p(a_j | a_{j-1}, I) + \alpha \cdot 1/I

  - Fertility: for many languages 'longer word' = 'more content'
    - E.g. compounds or agglutinative morphology
    - Train a model p(\Phi | g(e)) for fertility given the word length g(e) and interpolate it with p(\Phi | e)
    - Interpolate the fertility estimates based on word frequency: for a frequent word rely on the word model, for a low-frequency word bias towards the length model
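For the distortion case, the interpolation above is a one-liner; a minimal sketch with an invented alpha:

```python
def smoothed_distortion(p, a_j, a_prev, I, alpha=0.1):
    """p'(a_j | a_prev, I) = (1 - alpha) * p(a_j | a_prev, I) + alpha / I."""
    return (1.0 - alpha) * p.get((a_j, a_prev, I), 0.0) + alpha / I

p = {(3, 2, 10): 0.5}                      # toy trained jump probability
print(smoothed_distortion(p, 3, 2, 10))    # 0.46
print(smoothed_distortion(p, 9, 2, 10))    # 0.01  (unseen, but not zero)
```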
Extension: Using Manual Dictionaries
- Adding manual dictionaries
  - Simple method 1: add as bilingual data
  - Simple method 2: interpolate the manual with the trained dictionary
  - Use constrained GIZA (Gao, Nguyen, Vogel, WMT 2010)
  - Can put a higher weight on word pairs from the dictionary (Och, ACL 2000)
  - Not so simple: "But dictionaries are data too" (Brown et al, HLT 93)
- Problem: manual dictionaries do not contain inflected forms
- Possible solution:
  - Generate additional word forms (Vogel and Monson, LREC 04)
Extension: Using POS
- Use POS in the distortion model
  - We had:

      Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)

  - Now we condition on the word class of the previously aligned target word (see the sketch after this list):

      Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, C(e_{a_{j-1}}), J, I)

  - Available in GIZA++
    - Automatic clustering of the vocabulary into word classes with mkcls
    - Default: 50 classes
- Use POS as a 2nd 'lexicon' model (e.g. Zhao et al, ACL 2005)
  - Train p(C(f) | C(e)), starting with an initial model trained with IBM1 on word classes only
  - Align sentence pairs using p(C(f) | C(e)) and p(f | e)
  - Update both distributions from the Viterbi path
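A minimal sketch of the class-conditioned distortion lookup; the class map and probabilities are invented (GIZA++ would get the classes from mkcls):

```python
# Toy word classes, as mkcls might produce them (class 0 = unknown).
word_class = {"the": 1, "house": 2, "green": 3}

# Distortion table now keyed by the class of the previously aligned e word:
# p(a_j | a_{j-1}, C(e_{a_{j-1}}), J, I)
d = {(3, 2, 2, 7, 6): 0.4, (4, 2, 2, 7, 6): 0.3}

def distortion(a_j, a_prev, e_prev, J, I):
    c = word_class.get(e_prev, 0)
    return d.get((a_j, a_prev, c, J, I), 1e-10)

print(distortion(3, 2, "house", 7, 6))   # 0.4
```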
And Much More …
- Add fertilities to the HMM model
- Symmetrize during training, i.e. update the lexicon probabilities based on the symmetrized alignment
- Benefit from shorter sentence pairs
  - Split long sentences based on an initial alignment and retrain
  - Extract phrase pairs and add reliable ones to the training data
- And then all the work on discriminative word alignment
Alignment Results
Arabic-English

Alignment  | Correct | Wrong   | Missing | Precision (%) | Recall (%) | AER (%)
IBM4 S2T   | 202,898 |  72,488 | 134,097 | 73.7          | 60.2       | 33.7
IBM4 T2S   | 232,840 | 106,441 | 104,155 | 68.6          | 69.1       | 31.1
Combined   | 244,814 |  89,652 |  92,178 | 73.2          | 72.6       | 27.1

Chinese-English

Alignment  | Correct | Wrong   | Missing | Precision (%) | Recall (%) | AER (%)
IBM4 S2T   | 186,620 | 172,865 | 341,183 | 51.9          | 35.4       | 57.9
IBM4 T2S   | 299,744 | 151,478 | 228,059 | 66.4          | 56.8       | 38.8
Combined   | 296,312 | 140,929 | 231,491 | 67.8          | 56.1       | 38.6

- Unbalanced wrong vs. missing counts -> unbalanced precision and recall (see the check after this list)
- Chinese is harder; many missing links -> low recall
- One direction seems harder: related to which side has more words
- The alignment models generate one link per source word
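The precision, recall and AER columns follow directly from the three counts; a quick check for the Arabic-English IBM4 S2T row (assuming a single reference link set, i.e. sure = possible links, so AER = 1 - F1):

```python
def prf_aer(correct, wrong, missing):
    precision = correct / (correct + wrong)
    recall = correct / (correct + missing)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, 1.0 - f1   # AER = 1 - F1 under sure = possible

p, r, aer = prf_aer(202_898, 72_488, 134_097)    # Arabic-English, IBM4 S2T
print(f"{100*p:.1f} {100*r:.1f} {100*aer:.1f}")  # 73.7 60.2 33.7
```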
Unaligned Words
Arabic-English

Alignment        | NULL Alignment (%) | Not Aligned (%)
Manual Alignment | 8.58               | 11.84
IBM4 S2T         | 3.49               | 30.02
IBM4 T2S         | 5.33               | 15.72
Combined         | 5.53               |  7.70

Chinese-English

Alignment        | NULL Alignment (%) | Not Aligned (%)
Manual Alignment | 7.80               | 11.90
IBM4 S2T         | 5.46               | 23.84
IBM4 T2S         | 6.41               | 34.53
Combined         | 9.80               | 14.64

- NULL alignment is explicit, part of the model; unaligned words just happen
- This is serious: the alignment model neglects up to 1/3 of the target words
- The alignment is very asymmetric, hence the combination of both directions
Alignment Errors for Most Frequent Words (CH-EN)
Sentence Length Distribution
- Sentence pairs are often unbalanced in length
  - Wrong sentence alignment
  - Bad translations
  - But also language divergences
- May want to remove unbalanced sentence pairs
- The sentence length model is very weak
Target length | 5 | 6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19
Count         | 5 | 9 | 19 | 34 | 43 | 47 | 31 | 21 | 16 |  7 |  2 |  4 |  2 |  0 |  1

Table: target sentence length distribution for source sentence length 10
Summary
- Word Alignment Models
  - Alignment is (mathematically) a function, i.e. many source words can map to one target word, but not the other way round
  - Symmetry by training in both directions
- Model IBM1
  - Word-word probabilities
  - Simple training with Expectation-Maximization
- Model IBM2
  - Position alignment
  - Training also with EM
- Model HMM
  - Relative positions (first-order model)
  - Training with the Viterbi or Forward-Backward algorithm
- Alignment errors reflect restrictions in the generative alignment models