Machine Translation
Minimum Error Rate Training
Stephan Vogel
Spring Semester 2011

Stephan Vogel - Machine Translation

Overview
- Optimization approaches
  - Simplex
  - MER
- Avoiding local minima
- Additional considerations
  - Tuning towards different metrics
  - Tuning on different development sets

Tuning the SMT System
- We use different models in the SMT system
  - Models have simplifications
  - Trained on different amounts of data
- => Models have different levels of reliability, and scores have different ranges
- => Give different weights to different models
    Q = c_1*Q_1 + c_2*Q_2 + … + c_n*Q_n
- Find optimal scaling factors (feature weights) c_1 … c_n
- Optimal means: highest score for the chosen evaluation metric M,
  i.e. find (c_1, …, c_n) such that M(argmin_e Q(e, f)) is high
- Metric M is our objective function

Problems
- The surface of the objective function is not nice
  - Not convex -> local minima (actually, many local minima)
  - Not differentiable -> gradient descent methods not readily applicable
- There may be dangerous areas (‘boundary cliffs’)
  - Tune on dev set with short reference translations
  - Optimization leads towards short translations
  - New test set has long reference translations
  - Translations are now too short -> length penalty
Figure: example of a small change with a big effect

Brute Force Approach – Manual Tuning
- Decode with different scaling factors
  - Get a feeling for the range of good values
  - Get a feeling for the importance of the models
    - LM is typically most important
    - Sentence length (word count feature) to balance the shortening effect of the LM
    - Word reordering is more or less effective depending on the language
- Narrow down the range in which scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time

Automatic Tuning
- Many algorithms to find (near) optimal solutions are available
  - Simplex
  - Powell (line search)
  - MIRA (Margin
    Infused Relaxed Algorithm)
  - Specially designed minimum error training (Och 2003)
  - Genetic algorithm
- Note: models are not improved, only their combination
- Note: some parameters change the performance of the decoder, but are not in Q
  - Number of alternative translations
  - Beam size
  - Word reordering restrictions

Automatic Tuning on N-best List
- Optimization algorithms need many iterations; running full translations each time is too expensive
- => Use n-best lists, e.g. 1000 translations for each of 500 source sentences
- Changing the scaling factors re-ranks the n-best lists
- Evaluate the new 1-best translations
- Apply any of the standard optimization techniques
- Advantage: much faster
  - Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation

Simplex (Nelder-Mead)
- Start with n+1 random configurations
- Get the 1-best translation for each configuration -> objective function
- Sort points x_k according to the objective function: f(x_1) <= f(x_2) <= … <= f(x_{n+1})
- Calculate x_0 as the center of gravity of x_1 … x_n
- Replace the worst point with a point reflected through the centroid:
    x_r = x_0 + r * (x_0 - x_{n+1})

Demo
Figure: simplex demo (values shown: 8, 6, 9, 7, 12, 9, 11)
- Obviously, we need to change the size of the simplex to enforce convergence
- Also, we want to adjust the step size
  - If the new point is the best point: increase the step size
  - If the new point is worse than x_1 … x_n: decrease the step size

Expansion and Contraction
- Reflection:
    Calculate x_r = x_0 + r * (x_0 - x_{n+1})
    If f(x_1) <= f(x_r) < f(x_n): replace x_{n+1} with x_r; next iteration
- Expansion: if the reflected point is better than the best, i.e.
    f(x_r) < f(x_1):
    Calculate x_e = x_0 + e * (x_0 - x_{n+1})
    If f(x_e) < f(x_r): replace x_{n+1} with x_e, else replace x_{n+1} with x_r
    Next iteration
    Else: contract
- Contraction: the reflected point has f(x_r) >= f(x_n)
    Calculate x_c = x_{n+1} + c * (x_0 - x_{n+1})
    If f(x_c) <= f(x_{n+1}): replace x_{n+1} with x_c, else shrink
- Shrinking:
    For all x_k, k = 2 … n+1: x_k = x_1 + s * (x_k - x_1)
    Next iteration

Changing the Simplex
Figure: reflection, expansion, contraction, and shrinking of the simplex, drawn relative to x_0, x_1 and x_{n+1}

Powell Line Search
- Select directions in search space, then
    Loop until convergence
      Loop over directions d
        Perform line search for direction d until convergence
- Many variants
  - Select directions
    - Easiest is to use the model scores
    - Or combine multiple scores
  - Step size in line search
- MER (Och 2003) is line search along models with smart selection of steps

Minimum Error Training
- For each hypothesis we have Q = sum_k c_k*Q_k
- Select one model k: Q = c_k*Q_k + sum_{n != k} c_n*Q_n = c_k*Q_k + Q_Rest
- The individual model score Q_k gives the slope
Figure: total model score as a line over c_k, with slope Q_k and offset Q_Rest; metric score of the hypothesis: WER = 8

Minimum Error Training
- Source sentence 1
- Depending on the scaling factor c_k, different hyps are in 1-best position
- Set c_k such that the metric-best hyp is also the model-best
Figure: model score vs. c_k for h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); best hyp along the c_k axis: h11, h12, h13, with errors 8, 5, 4

Minimum Error Training
- Select a minimum number of evaluation points
  - Calculate intersection points
  - Keep a point only if the intersecting hyps are minimum at that point
  - Choose evaluation points between intersection points
Figure: model score vs. c_k for h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); best hyp along the c_k axis: h11, h12, h13, with errors 8, 5, 4

Minimum Error Training
- Source sentence 1, now with different error scores
- Optimization would find a different c_k
- => Different metrics lead to different scaling factors
Figure: model score vs. c_k for h11 (WER = 8), h12 (WER = 2), h13 (WER = 4); best hyp:
  h11, h12, h13 along the c_k axis, with errors 8, 2, 4

Minimum Error Training
- Sentence 2
- Best c_k in a different range
- No matter which c_k, h22 would never be 1-best
Figure: model score vs. c_k for h21 (WER = 2), h22 (WER = 0), h23 (WER = 5); best hyp along the c_k axis: h21, h23, with errors 2, 5

Minimum Error Training
- Multiple sentences
Figure: model score vs. c_k for h21 (WER = 2), h22 (WER = 0), h23 (WER = 5) and h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); error totals shown along the c_k axis: 10, 7, 10, 9

Iterate Decoding - Optimization
- The n-best list is a (very restricted) substitute for the search space
- With updated feature weights the decoder might have generated other (better) translations
- Some of the hyps in the n-best list would have been pruned
- Iterate
  - Re-translate with the new feature weights
  - Merge new translations with old translations (increases stability)
  - Run the optimizer over the larger n-best lists
- Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)

Avoiding Local Minima
- Optimization can get stuck in a local minimum
- Remedies
  - Fiddle around with the parameters of your optimization algorithm
  - Larger n-best list -> more evaluation points
  - Combine with a Simulated Annealing type approach (Smith & Eisner, 2007)
  - Restart multiple times

Random Restarts
- Comparison Simplex/Powell (Alok, unpublished)
- Comparison Simplex/ext. Simplex/MER (Bing Zhao, unpubl.)
- Observations:
  - Alok: Simplex ‘jumpier’ than Powell
  - Bing: Simplex better than MER
  - Both: you need many restarts

Optimizing NOT Towards References
- Ideally, we want system output identical to the reference translations
- But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  E.g.
  - we restrict the reordering window
  - we have unknown words
  - reference translations may have words unknown to the system
- Instead of forcing the decoder towards the reference translations, optimize towards the best translations generated by the system
  - Find hypotheses with the best metric score
  - Use those as pseudo references
  - Optimize towards the pseudo references

Optimizing Towards Different Metrics
- Automatic metrics have different characteristics
- Optimizing towards one does not mean that other metric scores will also go up
- Esp. different metrics prefer shorter or longer translations
  - Typically: TER < BLEU < METEOR (< means ‘shorter translation’)
- Mauser et al (2007) on Ch-En NIST 2005 test set
  - Reasonably well behaved
  - Resulting length of translation differs by more than 15%

Generalization to other Test Sets
- Optimize on one set, test on multiple other sets
- Again Mauser et al, Ch-En
  - Shown is the behavior over Simplex optimization iterations
  - Nice, nearly parallel development of metric scores
- However, we had also observed brittle behavior
  - Esp. when the ratio src_length / ref_length is very different between dev and eval test sets

Large Weight = Important Feature?
- Assume we have c_LM = 1.0, c_TM = 0.55, c_WC = 3.2
- Which feature is most important?
- Cannot say!!!
- We want to re-rank the n-best lists
- Feature weights scale feature values such that they can compete
- Example: variation in LM and TM is larger than for WC
  - Need a large weight for WC to make small differences effective

          Q_LM  Q_TM  Q_WC    Q
    H1      22    83     7  112
    H2      29    77     8  116
    H3      26    85     9  120

- To know if a feature is important, remove it and look at the drop in metric score

Open Issues
- Should not all optimizers get the same results, if done right?
  - The models are the same, it’s just finding the right mix
  - If local minima can be avoided, then similarly good optima should be found
- How to stay safe
  - Avoid good optima close to ‘cliffs’
  - Different configurations give very similar metric scores, pick one which is more stable
- One hat fits all?
  - Why one set of feature weights?
  - How about different sets for
    - Good/bad translations (tuning on tail: mixed results so far)
    - Short/long sentences
    - Begin/middle/end of sentence
    - ...

Summary
- Optimizing the system by modifying scaling factors (feature weights)
- Different optimization approaches can be used
  - Simplex, Powell most common
  - MERT (Och) is similar to Powell, with pre-calculation of grid points
- Many local optima, avoid getting stuck early
  - Most effective: many restarts
- Generalization
  - To unseen test data: mostly ok, sometimes selection of dev set has big impact (length penalty!)
  - To different metrics: reasonably stable (metrics are reasonably correlated in most cases)
- Still open questions => more research needed
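
A small sketch for the "Large Weight = Important Feature?" slide: it takes the feature values H1 to H3 and the weights c_LM = 1.0, c_TM = 0.55, c_WC = 3.2 from that slide and compares the spread of each feature before and after scaling. The helper names are made up for illustration, and the re-ranking assumes that Q is a cost (lower is better), in line with the argmin on the tuning slide.

```python
# Feature values and weights as given on the slide; helper names are hypothetical.
features = {
    "H1": {"LM": 22, "TM": 83, "WC": 7},
    "H2": {"LM": 29, "TM": 77, "WC": 8},
    "H3": {"LM": 26, "TM": 85, "WC": 9},
}
weights = {"LM": 1.0, "TM": 0.55, "WC": 3.2}


def spread(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


def weighted_score(feats, weights):
    """Q = sum over models of c_k * Q_k for one hypothesis."""
    return sum(weights[name] * value for name, value in feats.items())


# Variation of each feature across the n-best list, unscaled and scaled.
raw_spread = {k: spread([f[k] for f in features.values()]) for k in weights}
weighted_spread = {k: weights[k] * raw_spread[k] for k in weights}

# Re-rank the list; assumption: Q is a cost, so the smallest Q is the 1-best.
ranking = sorted(features, key=lambda h: weighted_score(features[h], weights))

print(raw_spread)
print(weighted_spread)
print(ranking)
```

The unscaled WC values vary far less than the LM and TM values; after scaling, the three spreads are of similar size, which is the sense in which the weights let the features compete.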
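
The reflection, expansion, contraction, and shrinking steps from the Simplex slides can be sketched in a few lines. This is a minimal illustration, not the implementation used in any decoder: the coefficient defaults (r = 1, e = 2, c = 0.5, s = 0.5) are common textbook choices rather than values from the slides, the stopping rule is simply a fixed number of iterations, and the objective here is a toy quadratic standing in for the metric error of the re-ranked 1-best translations.

```python
def nelder_mead(f, start_points, r=1.0, e=2.0, c=0.5, s=0.5, iterations=300):
    """Minimize f over n dimensions, starting from n+1 points."""
    points = [list(p) for p in start_points]
    n = len(points) - 1
    for _ in range(iterations):
        points.sort(key=f)  # f(x_1) <= ... <= f(x_{n+1})
        best, second_worst, worst = points[0], points[-2], points[-1]
        # x_0: center of gravity of all points except the worst one
        x0 = [sum(p[i] for p in points[:-1]) / n for i in range(n)]
        # Reflection: x_r = x_0 + r * (x_0 - x_{n+1})
        xr = [m + r * (m - w) for m, w in zip(x0, worst)]
        if f(best) <= f(xr) < f(second_worst):
            points[-1] = xr
        elif f(xr) < f(best):
            # Expansion: x_e = x_0 + e * (x_0 - x_{n+1})
            xe = [m + e * (m - w) for m, w in zip(x0, worst)]
            points[-1] = xe if f(xe) < f(xr) else xr
        else:
            # Contraction: x_c = x_{n+1} + c * (x_0 - x_{n+1})
            xc = [w + c * (m - w) for m, w in zip(x0, worst)]
            if f(xc) <= f(worst):
                points[-1] = xc
            else:
                # Shrinking: x_k = x_1 + s * (x_k - x_1) for k = 2 ... n+1
                points = [best] + [
                    [b + s * (v - b) for b, v in zip(best, p)] for p in points[1:]
                ]
    return min(points, key=f)


def toy_objective(p):
    """Stand-in objective with its minimum at (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


best_point = nelder_mead(toy_objective, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(best_point)
```

On the real tuning problem the surface is neither smooth nor convex, so a run like this would be restarted from several random initial simplexes.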
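
The line search from the Minimum Error Training slides can be illustrated on the two example sentences. Each hypothesis is a line c_k*Q_k + Q_Rest over c_k; the sketch computes the pairwise intersection points, picks evaluation points between them, and sums the errors of the model-best hypotheses. The WER values are those on the slides, while the slopes and offsets are hypothetical numbers chosen so that h22 is never 1-best. The sketch assumes the model score is a cost (lowest score is 1-best) and uses a simple quadratic enumeration of intersections rather than the more efficient envelope computation of Och (2003).

```python
from itertools import combinations

# Each hypothesis: (slope Q_k, offset Q_Rest, error). Slopes and offsets are
# hypothetical; the error values are the WER numbers from the slides.
sentence_1 = [(3.0, 0.0, 8), (1.0, 2.0, 5), (-1.0, 6.0, 4)]  # h11, h12, h13
sentence_2 = [(2.0, 0.0, 2), (1.0, 4.0, 0), (0.0, 5.0, 5)]   # h21, h22, h23
sentences = [sentence_1, sentence_2]


def candidate_points(sentences):
    """Evaluation points between all pairwise intersection points."""
    cuts = set()
    for hyps in sentences:
        for (s1, o1, _), (s2, o2, _) in combinations(hyps, 2):
            if s1 != s2:  # parallel lines never intersect
                cuts.add((o2 - o1) / (s1 - s2))
    cuts = sorted(cuts)
    if not cuts:
        return [0.0]
    mids = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    # One point left of the first cut, one right of the last cut.
    return [cuts[0] - 1.0] + mids + [cuts[-1] + 1.0]


def total_error(sentences, ck):
    """Sum of the errors of the model-best hypothesis of each sentence."""
    return sum(min(hyps, key=lambda h: h[0] * ck + h[1])[2] for hyps in sentences)


def line_search(sentences):
    """Return the evaluation point with the lowest total error."""
    return min(candidate_points(sentences), key=lambda ck: total_error(sentences, ck))


best_ck = line_search(sentences)
print(best_ck, total_error(sentences, best_ck))
```

Because the 1-best hypothesis only changes at intersection points, one evaluation per interval is enough; the error is constant in between.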