Machine Translation
Minimum Error Rate Training
Stephan Vogel
Spring Semester 2011
Stephan Vogel - Machine Translation
Overview
 Optimization approaches
 Simplex
 MER
 Avoiding local minima
 Additional considerations
 Tuning towards different metrics
 Tuning on different development sets
Tuning the SMT System
 We use different models in an SMT system
 Models have simplifications
 Trained on different amounts of data
 => Models have different levels of reliability
and scores have different ranges
 => Give different weight to different Models
Q = c1 Q1 + c2 Q2 + … + cn Qn
 Find optimal scaling factors (feature weights) c1 … cn
 Optimal means: Highest score for chosen evaluation metric M
i.e. find (c1, …, cn) such that M(argmin_e {Q(e,f)}) is high
 Metric M is our objective function
Problems
 The surface of the objective function is not nice
 Not convex -> local minima (actually, many local minima)
 Not differentiable -> gradient descent methods not readily applicable
 There may be dangerous areas (‘boundary cliffs’)
 Example:
 Tune on dev set with short reference translations
 Optimization leads towards short translations
 New test set has long reference translations
 Translations are now too short -> length penalty
[Figure: objective surface with a boundary cliff; labels: small change, big effect]
Brute Force Approach – Manual Tuning
 Decode with different scaling factors
 Get feeling for range of good values
 Get feeling for importance of models
 LM is typically most important
 Sentence length (word count feature) to balance shortening effect of LM
 Word reordering is more or less effective depending on language
 Narrow down range in which scaling factors are tested
 Essentially multi-linear optimization
 Works well for a small number of models
 Time-consuming (CPU-wise) if decoding takes a long time
Automatic Tuning
 Many algorithms to find (near) optimal solutions available
 Simplex
 Powell (line search)
 MIRA (Margin Infused Relaxed Algorithm)
 Specially designed minimum error training (Och 2003)
 Genetic algorithm
 Note: models are not improved, only their combination
 Note: some parameters change performance of decoder, but
are not in Q
 Number of alternative translations
 Beam size
 Word reordering restrictions
Automatic Tuning on N-best List
 Optimization algorithms need many iterations: too expensive
to run full translations
 => Use n-best lists
 e.g. for each of 500 source sentences 1000 translations
 Changing the scaling factors results in re-ranking the n-best lists
 Evaluate new 1-best translations
 Apply any of the standard optimization techniques
 Advantage: much faster
 Can pre-calculate the counts (e.g. n-gram matches) for each
translation to speed up evaluation
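The re-ranking step can be sketched as follows. This is a minimal illustration, assuming each n-best entry carries its feature scores and a precomputed error count, and assuming scores are costs so that the model-best hypothesis has the lowest Q. The names `total_score` and `rerank_error` and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

# One n-best entry: (feature scores Q1..Qn, precomputed error count).
# Assumption: scores are costs, so the model-best hypothesis has the lowest Q.
Hyp = Tuple[Sequence[float], int]


def total_score(weights: Sequence[float], feats: Sequence[float]) -> float:
    """Q = c1*Q1 + ... + cn*Qn."""
    return sum(c * q for c, q in zip(weights, feats))


def rerank_error(weights: Sequence[float], nbest_lists: List[List[Hyp]]) -> int:
    """Sum the error counts of the new 1-best hypotheses under `weights`."""
    total = 0
    for nbest in nbest_lists:
        best = min(nbest, key=lambda h: total_score(weights, h[0]))
        total += best[1]
    return total


# Toy data (hypothetical): two sentences, two features per hypothesis.
nbest_lists = [
    [((2.0, 5.0), 8), ((3.0, 4.0), 5), ((6.0, 1.0), 4)],
    [((1.0, 4.0), 2), ((4.0, 2.0), 5)],
]
print(rerank_error((1.0, 0.5), nbest_lists))  # error of 1-best under first weights
print(rerank_error((1.0, 3.0), nbest_lists))  # error after changing the second weight
```

No decoding is involved: changing the weights only changes which stored hypothesis is picked, so any optimizer can call `rerank_error` as its objective function.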
Simplex (Nelder-Mead)
 Start with n+1 random configurations
 Get 1-best translation for each configuration -> objective
function
 Sort points xk according to objective function:
f(x1) < f(x2) < … < f(xn+1)
 Calculate x0 as center of gravity for x1 … xn
 Replace worst point with a point reflected through the centroid
xr = x0 + r * (x0 – xn+1)
Demo
[Figure: simplex demo; objective values at the evaluated points: 8, 6, 9, 7, 12, 9, 11]
 Obviously, we need to change the size of the simplex to
enforce convergence
 Also, want to adjust the step size
 If new point is best point – increase step size
 If new point is worse than x1 … xn – decrease step size
Expansion and Contraction
 Reflection:
Calculate xr = x0 + r * (x0 – xn+1)
if f(x1) <= f(xr) < f(xn) replace xn+1 with xr; Next iteration
 Expansion:
If reflected point is better than best, i.e. f(xr) < f(x1)
Calculate xe = x0 + e * (x0 – xn+1)
If f(xe) < f(xr) then replace xn+1 with xe else replace xn+1 with xr
Next iteration
else Contract
 Contraction:
Reflected point f(xr) >= f(xn)
Calculate xc = xn+1 + c * (x0 – xn+1)
If f(xc) <= f(xn+1) then replace xn+1 with xc else Shrink
 Shrinking:
For all xk, k = 2 … n+1: xk = x1 + s * (xk – x1)
Next iteration
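The four moves can be put together in a short sketch. The coefficient values r = 1, e = 2, c = 0.5, s = 0.5 are commonly used defaults and are an assumption here, as are the fixed iteration count and the function name `nelder_mead`. The objective is minimized, matching the ordering f(x1) < … < f(xn+1).

```python
from typing import Callable, List

Point = List[float]


def nelder_mead(f: Callable[[Point], float], simplex: List[Point],
                r: float = 1.0, e: float = 2.0, c: float = 0.5, s: float = 0.5,
                iterations: int = 200) -> Point:
    """Minimize f starting from n+1 points in n dimensions."""
    pts = [list(p) for p in simplex]
    n = len(pts) - 1

    def move(a: Point, b: Point, t: float) -> Point:
        # Returns a + t * (b - a).
        return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

    for _ in range(iterations):
        pts.sort(key=f)                       # f(x1) <= ... <= f(xn+1)
        best, worst = pts[0], pts[-1]
        # x0: center of gravity of x1 .. xn (worst point excluded)
        x0 = [sum(p[i] for p in pts[:-1]) / n for i in range(n)]
        xr = move(x0, worst, -r)              # x0 + r * (x0 - xn+1)
        if f(best) <= f(xr) < f(pts[-2]):
            pts[-1] = xr                      # reflection
        elif f(xr) < f(best):
            xe = move(x0, worst, -e)          # x0 + e * (x0 - xn+1)
            pts[-1] = xe if f(xe) < f(xr) else xr   # expansion
        else:
            xc = move(worst, x0, c)           # xn+1 + c * (x0 - xn+1)
            if f(xc) <= f(worst):
                pts[-1] = xc                  # contraction
            else:                             # shrink all points towards x1
                pts = [best] + [move(best, p, s) for p in pts[1:]]
    return min(pts, key=f)


def bowl(p: Point) -> float:
    # Toy objective (hypothetical) with its minimum at (1, -2).
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


solution = nelder_mead(bowl, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(solution)
```

For tuning, `f` would be the metric error of the re-ranked 1-best translations, and each point would be one vector of scaling factors.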
Changing the Simplex
[Figure: the four simplex moves (reflection, expansion, contraction, shrinking), drawn relative to the centroid x0, the worst point xn+1, and the best point x1]
Powell Line Search
 Select directions in search space, then
Loop until convergence
Loop over directions d
Perform line search for direction d until convergence
 Many variants
 Select directions
 Easiest is to use the model scores
 Or combine multiple scores
 Step size in line search
 MER (Och 2003) is line search along models with smart
selection of steps
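The nested loops can be sketched as below. This is a simplified variant, not full Powell: the directions are fixed (here the coordinate axes, i.e. one direction per model) and the line search just tries a small set of candidate step sizes. The step sizes, tolerance and function names are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Point = List[float]


def line_search(f: Callable[[Point], float], x: Point, d: Point,
                steps: Sequence[float]) -> Point:
    """Return the best point among x and x + t*d for the candidate steps t."""
    best = list(x)
    for t in steps:
        cand = [xi + t * di for xi, di in zip(x, d)]
        if f(cand) < f(best):
            best = cand
    return best


def direction_search(f: Callable[[Point], float], start: Point,
                     directions: List[Point],
                     steps: Sequence[float] = (-1.0, -0.1, -0.01, 0.01, 0.1, 1.0),
                     tol: float = 1e-9, max_rounds: int = 100) -> Point:
    x = list(start)
    for _ in range(max_rounds):               # loop until convergence
        before = f(x)
        for d in directions:                  # loop over directions d
            while True:                       # line search along d until convergence
                nxt = line_search(f, x, d, steps)
                if f(x) - f(nxt) <= tol:
                    break
                x = nxt
        if before - f(x) <= tol:
            break
    return x


def objective(p: Point) -> float:
    # Toy objective (hypothetical) with coupled coordinates, minimum at (1, -2).
    a, b = p[0] - 1.0, p[1] + 2.0
    return a * a + b * b + 0.5 * a * b


axes = [[1.0, 0.0], [0.0, 1.0]]
result = direction_search(objective, [0.0, 0.0], axes)
print(result)
```

Because the two coordinates are coupled, one pass over the directions is not enough, which is why the outer loop repeats until a full round brings no further improvement.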
Minimum Error Training
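 For each hypothesis we have
Q = Σk ck * Qk
 Select one model k and split off its contribution
Q = ck Qk + Σn≠k cn * Qn = ck Qk + QRest
[Figure: total model score plotted over ck for one hypothesis (metric score WER = 8): a straight line with offset QRest; the individual model score Qk gives the slope]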
Minimum Error Training
 Source sentence 1
 Depending on scaling factor ck, different hyps are in 1-best position
 Set ck to have metric-best hyp also being model-best
[Figure: model score over ck for h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); best hyp along ck: h11, h12, h13 with errors 8, 5, 4]
Minimum Error Training
 Select minimum number of evaluation points
 Calculate intersection point
 Keep only if hyps are minimum at that point
 Choose evaluation points between intersection points
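The steps above can be sketched for a single sentence. Each hypothesis is a line QRest + ck * Qk, and the model-best hypothesis at a given ck is assumed to be the one with the lowest score. The slopes and offsets below are hypothetical values chosen so that the best hypothesis changes from h11 to h12 to h13; only the WER values 8, 5, 4 come from the example.

```python
from itertools import combinations
from typing import List, Tuple

# One hypothesis as a line over ck: (slope Qk, offset QRest, error count).
Line = Tuple[float, float, int]


def score(line: Line, ck: float) -> float:
    return line[1] + ck * line[0]


def breakpoints(lines: List[Line], eps: float = 1e-9) -> List[float]:
    """ck values at which the model-best (lowest-score) hypothesis changes."""
    points = set()
    for a, b in combinations(lines, 2):
        if a[0] == b[0]:
            continue                          # parallel lines never intersect
        ck = (b[1] - a[1]) / (a[0] - b[0])    # intersection point of a and b
        lowest = min(score(l, ck) for l in lines)
        if score(a, ck) <= lowest + eps:      # keep only if a and b are minimal there
            points.add(ck)
    return sorted(points)


def evaluation_points(bps: List[float]) -> List[float]:
    """One ck per interval: left of, between, and right of the breakpoints."""
    if not bps:
        return [0.0]
    mids = [(lo + hi) / 2 for lo, hi in zip(bps, bps[1:])]
    return [bps[0] - 1.0] + mids + [bps[-1] + 1.0]


def error_profile(lines: List[Line]) -> List[Tuple[float, int]]:
    """(ck, error of the 1-best hypothesis at ck) for each interval."""
    profile = []
    for ck in evaluation_points(breakpoints(lines)):
        best = min(lines, key=lambda l: score(l, ck))
        profile.append((ck, best[2]))
    return profile


# h11, h12, h13 with WER 8, 5, 4; slopes and offsets are made up.
lines = [(5.0, 0.0, 8), (3.0, 2.0, 5), (1.0, 8.0, 4)]
print(breakpoints(lines))
print(error_profile(lines))
```

The intersection of h11 and h13 is discarded because h12 scores lower at that point, so the metric only needs to be evaluated once per remaining interval.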
[Figure: model score over ck for h11 (WER = 8), h12 (WER = 5), h13 (WER = 4), with the intersection points marked; best hyp along ck: h11, h12, h13 with errors 8, 5, 4]
Minimum Error Training
 Source sentence 1, now different error scores
 Optimization would find a different ck
 => Different metrics lead to different scaling factors
[Figure: model score over ck for h11 (WER = 8), h12 (WER = 2), h13 (WER = 4); best hyp along ck: h11, h12, h13 with errors 8, 2, 4]
Minimum Error Training
 Sentence 2
 Best ck in a different range
 No matter which ck, h22 would never be 1-best
[Figure: model score over ck for h21 (WER = 2), h22 (WER = 0), h23 (WER = 5); best hyp along ck: h21 (error 2), then h23 (error 5)]
Minimum Error Training
 Multiple sentences
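With several sentences, the per-sentence error profiles are added up over the merged intervals, and ck is chosen from the interval with the lowest total. A minimal sketch, assuming each sentence is already reduced to its breakpoints and per-interval errors; the breakpoint positions 1.0, 2.0, 3.0 are hypothetical, while the errors are those of the two example sentences.

```python
import bisect
from typing import List, Tuple

# Per sentence: sorted breakpoints on ck and the error in each interval,
# so len(errors) == len(breakpoints) + 1.
Profile = Tuple[List[float], List[int]]


def error_at(profile: Profile, ck: float) -> int:
    """Error of the sentence's 1-best hypothesis at the given ck."""
    bps, errors = profile
    return errors[bisect.bisect_right(bps, ck)]


def summed_errors(profiles: List[Profile]) -> List[Tuple[float, int]]:
    """(representative ck, total error) for every merged interval."""
    cuts = sorted({b for bps, _ in profiles for b in bps})
    if not cuts:
        points = [0.0]
    else:
        mids = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
        points = [cuts[0] - 1.0] + mids + [cuts[-1] + 1.0]
    return [(ck, sum(error_at(p, ck) for p in profiles)) for ck in points]


profiles = [
    ([1.0, 3.0], [8, 5, 4]),   # sentence 1: h11, h12, h13
    ([2.0], [2, 5]),           # sentence 2: h21, h23 (h22 is never 1-best)
]
totals = summed_errors(profiles)
best_ck, best_error = min(totals, key=lambda t: t[1])
print(totals)
print(best_ck, best_error)
```

The totals over the four merged intervals are 10, 7, 10, 9, so the optimizer would pick a ck from the second interval.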
[Figure: model score over ck for both sentences: h11 (WER = 8), h12 (WER = 5), h13 (WER = 4), h21 (WER = 2), h22 (WER = 0), h23 (WER = 5); summed error of the best hyps over successive ck intervals: 10, 7, 10, 9]
Iterate Decoding - Optimization
 N-best list is (very restricted) substitute for search space
 With updated feature weights we may have generated other (better)
translations
 Some of the hyps in the n-best list would have been pruned
 Iterate
 Re-translate with new feature weights
 Merge new translations with old translations (increases stability)
 Run optimizer over larger n-best lists
 Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)
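The outer loop can be sketched as follows. The decoder and the optimizer are passed in as functions; `toy_decode`, `toy_optimize`, the search space and the weight grid are hypothetical stand-ins so that the sketch runs on its own, not part of any real decoder.

```python
from typing import Callable, List, Set, Tuple

Weights = Tuple[float, float]
Hyp = Tuple[str, Tuple[float, float], int]   # (text, feature costs, error count)


def model_cost(w: Weights, h: Hyp) -> float:
    return sum(c * q for c, q in zip(w, h[1]))


def tune(decode: Callable[[Weights], List[List[Hyp]]],
         optimize: Callable[[List[Set[Hyp]], Weights], Tuple[Weights, int]],
         weights: Weights, max_iters: int = 10, epsilon: float = 1e-4) -> Weights:
    pool: List[Set[Hyp]] = []
    prev_error = float("inf")
    for _ in range(max_iters):                   # at most k iterations
        new_lists = decode(weights)              # re-translate with new weights
        if not pool:
            pool = [set() for _ in new_lists]
        added = 0
        for merged, nbest in zip(pool, new_lists):
            added += len(set(nbest) - merged)
            merged.update(nbest)                 # merge new with old translations
        if added == 0:                           # no new translations
            break
        weights, error = optimize(pool, weights)  # optimize over the larger lists
        if prev_error - error < epsilon:         # improvement too small
            break
        prev_error = error
    return weights


# Hypothetical stand-ins: one sentence, 2-best decoding, grid over the second weight.
SEARCH_SPACE = [[("a", (1.0, 4.0), 3), ("b", (2.0, 2.0), 2), ("c", (4.0, 0.5), 0)]]


def toy_decode(w: Weights, n: int = 2) -> List[List[Hyp]]:
    return [sorted(sent, key=lambda h: model_cost(w, h))[:n] for sent in SEARCH_SPACE]


def toy_optimize(pool: List[Set[Hyp]], w: Weights) -> Tuple[Weights, int]:
    grid = [(1.0, c) for c in (0.25, 1.0, 4.0)]

    def err(cand: Weights) -> int:
        return sum(min(hyps, key=lambda h: model_cost(cand, h))[2] for hyps in pool)

    best = min(grid, key=err)
    return best, err(best)


final_weights = tune(toy_decode, toy_optimize, (1.0, 0.25))
print(final_weights)
```

In this toy run the error-free hypothesis "c" is not in the first 2-best list; it only shows up after re-decoding with the weights from the first optimization round, which is the reason for iterating.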
Avoiding Local Minima
 Optimization can get stuck in local minimum
 Remedies
 Fiddle around with the parameters of your optimization algorithm
 Larger n-best list -> more evaluation points
 Combine with Simulated Annealing type approach (Smith & Eisner, 2007)
 Restart multiple times
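Restarting can be wrapped around any local optimizer. The sketch below uses a deliberately simple one-dimensional descent and a toy objective with a local minimum at -2 and the global minimum at 2; both are hypothetical and only serve to show the effect.

```python
import random
from typing import Callable, Optional


def local_search(f: Callable[[float], float], x: float,
                 step: float = 0.1, max_steps: int = 1000) -> float:
    """Greedy 1-D descent: move left or right as long as f improves."""
    for _ in range(max_steps):
        nxt = min((x - step, x + step), key=f)
        if f(nxt) >= f(x):
            break
        x = nxt
    return x


def with_restarts(f: Callable[[float], float],
                  optimize: Callable[[Callable[[float], float], float], float],
                  restarts: int, low: float, high: float, seed: int = 0) -> float:
    """Run `optimize` from several random start points and keep the best result."""
    rng = random.Random(seed)
    best: Optional[float] = None
    for _ in range(restarts):
        x = optimize(f, rng.uniform(low, high))
        if best is None or f(x) < f(best):
            best = x
    return best


def objective(x: float) -> float:
    # Two basins: local minimum at x = -2 (value 1), global minimum at x = 2 (value 0).
    return min((x + 2.0) ** 2 + 1.0, (x - 2.0) ** 2)


stuck = local_search(objective, -3.0)
best = with_restarts(objective, local_search, restarts=20, low=-4.0, high=4.0)
print(stuck, best)
```

A single run started at -3 ends in the local minimum near -2, while the restarted search ends near 2.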
Random Restarts
 Comparison Simplex/Powell (Alok, unpublished)
 Comparison Simplex/ext. Simplex/MER (Bing Zhao, unpubl.)
 Observations:
 Alok: Simplex ‘jumpier’ than Powell
 Bing: Simplex better than MER
 Both: you need many restarts
Optimizing NOT Towards References
 Ideally, we want system output identical to reference
translations
 But there is no guarantee that the system can generate
reference translations (under realistic conditions)
 E.g. we restrict reordering window
 We have unknown words
 Reference translations may have words unknown to the system
 Instead of forcing decoder towards reference translations
optimize towards best translations generated by the system
 Find hypotheses with best metric score
 Use those as pseudo references
 Optimize towards the pseudo references
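Selecting the pseudo references is a simple pass over the n-best lists. A minimal sketch, assuming every hypothesis already has a sentence-level error against the true reference (lower is better); the function name and the toy hypotheses are hypothetical.

```python
from typing import List, Tuple

# (translation, error against the true reference; lower is better)
Hyp = Tuple[str, int]


def pseudo_references(nbest_lists: List[List[Hyp]]) -> List[str]:
    """Per sentence, pick the hypothesis with the best metric score."""
    return [min(nbest, key=lambda h: h[1])[0] for nbest in nbest_lists]


nbest_lists = [
    [("the house small", 2), ("the small house", 1), ("small the house", 3)],
    [("he go home", 1), ("home he goes", 2)],
]
refs = pseudo_references(nbest_lists)
print(refs)
```

Tuning would then score hypotheses against `refs` instead of the original references, so the target is always something the system can actually produce.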
Optimizing Towards Different Metrics
 Automatic metrics have different characteristics
 Optimizing towards one does not mean that other metric scores will
also go up
 In particular, different metrics prefer shorter or longer translations
Typically: TER < BLEU < METEOR (< means ‘shorter translation’)
 Mauser et al (2007) on Ch-En NIST 2005 test set
 Reasonably well behaved
 Resulting length of translation differs by more than 15%
Generalization to other Test Sets
 Optimize on one set, test on multiple other sets
 Again Mauser et al, Ch-En
 Shown is behavior over
Simplex optimization iterations
 Nice, nearly parallel development
of metric scores
 However, we have also observed brittle behavior
 Esp. when ratio src_length / ref_length is very different between dev
and eval test sets
Large Weight = Important Feature?
 Assume we have cLM = 1.0, cTM = 0.55, cWC = 3.2
 Which feature is most important?
 Cannot say!!!
 We want to re-rank the n-best lists
 Feature weights scale feature values such that they can compete
 Example:
 Variation in LM and TM larger than for WC
 Need large weight for WC to make small differences effective
      QLM  QTM  QWC    Q
H1     22   83    7  112
H2     29   77    8  116
H3     26   85    9  120
 To know if feature is important, remove it and look at drop in
metric score
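The scaling effect in the example can be checked with a few lines of arithmetic. The sketch assumes that the weights cLM = 1.0, cTM = 0.55, cWC = 3.2 are applied to the feature values of H1 to H3, and compares how far each weighted feature varies across the three hypotheses.

```python
from typing import Dict

weights: Dict[str, float] = {"LM": 1.0, "TM": 0.55, "WC": 3.2}
feats: Dict[str, Dict[str, float]] = {
    "H1": {"LM": 22, "TM": 83, "WC": 7},
    "H2": {"LM": 29, "TM": 77, "WC": 8},
    "H3": {"LM": 26, "TM": 85, "WC": 9},
}


def weighted_spread(name: str) -> float:
    """Weight times (max - min) of one feature over all hypotheses."""
    values = [h[name] for h in feats.values()]
    return weights[name] * (max(values) - min(values))


spreads = {name: weighted_spread(name) for name in weights}
for name, value in spreads.items():
    print(f"{name}: raw range x weight = {value:.2f}")
```

The raw ranges are 7, 8 and 2, but after weighting they become 7.0, 4.4 and 6.4: the large WC weight only brings the small word-count differences up to the scale of the other features and says nothing about importance by itself.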
Open Issues
 Should not all optimizers get the same results, if done right?
 The models are the same, it’s just finding the right mix
 If local minima can be avoided, then similarly good optima should be
found
 How to stay safe
 Avoid good optima close to ‘cliffs’
 Different configurations give very similar metric scores, pick one which
is more stable
 One hat fits all?
 Why one set of feature weights?
 How about different sets for
 Good/bad translations (tuning on tail: mixed results so far)
 Short/long sentences
 Begin/middle/end of sentence
 ...
Summary
 Optimizing system by modifying scaling factors (feature
weights)
 Different optimization approaches can be used
 Simplex, Powell most common
 MERT (Och) is similar to Powell, with pre-calculation of grid points
 Many local optima, avoid getting stuck early
 Most effective: many restarts
 Generalization
 To unseen test data: mostly ok, sometimes selection of dev set has big
impact (length penalty!)
 To different metrics: reasonably stable (metrics are reasonably
correlated in most cases)
 Still open questions => more research needed