Machine Translation
Minimum Error Rate Training
Stephan Vogel
Spring Semester 2011
Stephan Vogel - Machine Translation

Overview
• Optimization approaches
  • Simplex
  • MER
• Avoiding local minima
• Additional considerations
  • Tuning towards different metrics
  • Tuning on different development sets

Tuning the SMT System
• We use different models in the SMT system
  • Models have simplifications
  • Trained on different amounts of data
• => Models have different levels of reliability, and scores have different ranges
• => Give different weights to different models:
    Q = c_1 Q_1 + c_2 Q_2 + … + c_n Q_n
• Find optimal scaling factors (feature weights) c_1 … c_n
• Optimal means: highest score for the chosen evaluation metric M,
  i.e. find (c_1, …, c_n) such that M(argmin_e Q(e, f)) is high
• Metric M is our objective function

Problems
• The surface of the objective function is not nice
  • Not convex -> local minima (actually, many local minima)
  • Not differentiable -> gradient descent methods not readily applicable
• There may be dangerous areas ('boundary cliffs'), where a small change has a big effect
• Example:
  • Tune on dev set with short reference translations
  • Optimization leads towards short translations
  • New test set has long reference translations
  • Translations are now too short -> length penalty

Brute Force Approach - Manual Tuning
• Decode with different scaling factors
  • Get a feeling for the range of good values
  • Get a feeling for the importance of the models
    • LM is typically most important
    • Sentence length (word count feature) to balance the shortening effect of the LM
    • Word reordering is more or less effective depending on language
• Narrow down the range in which scaling factors are tested
• Essentially multi-linear optimization
• Works well for a small number of models
• Time-consuming (CPU-wise) if decoding takes a long time

Automatic Tuning
• Many algorithms are available to find (near-)optimal solutions
  • Simplex
  • Powell (line search)
  • MIRA (Margin Infused Relaxed Algorithm)
  • Specially designed minimum error training (Och 2003)
  • Genetic algorithm
• Note: models are not improved, only their combination
• Note: some parameters change the performance of the decoder, but are not in Q
  • Number of alternative translations
  • Beam size
  • Word reordering restrictions

Automatic Tuning on N-best List
• Optimization algorithms need many iterations; running full translations each time is too expensive
• => Use n-best lists, e.g. 1000 translations for each of 500 source sentences
• Changing the scaling factors results in re-ranking the n-best lists
• Evaluate the new 1-best translations
• Apply any of the standard optimization techniques
• Advantage: much faster
  • Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation

Simplex (Nelder-Mead)
• Start with n+1 random configurations
• Get the 1-best translation for each configuration -> objective function
• Sort points x_k according to the objective function: f(x_1) < f(x_2) < … < f(x_{n+1})
• Calculate x_0 as center of gravity of x_1 … x_n
• Replace the worst point with a point reflected through the centroid:
    x_r = x_0 + r * (x_0 - x_{n+1})

Demo
[Figure: simplex demo, with objective function values shown at the sampled points]
• Obviously, we need to change the size of the simplex to enforce convergence
• Also, we want to adjust the step size
  • If the new point is the best point: increase step size
  • If the new point is worse than x_1 … x_n: decrease step size

Expansion and Contraction
• Reflection: calculate x_r = x_0 + r * (x_0 - x_{n+1})
    if f(x_1) <= f(x_r) < f(x_n): replace x_{n+1} with x_r; next iteration
• Expansion: if the reflected point is better than the best, i.e. f(x_r) < f(x_1):
    calculate x_e = x_0 + e * (x_0 - x_{n+1})
    if f(x_e) < f(x_r) then replace x_{n+1} with x_e
    else replace x_{n+1} with x_r
    next iteration
  else contract
• Contraction: reflected point has f(x_r) >= f(x_n)
    calculate x_c = x_{n+1} + c * (x_0 - x_{n+1})
    if f(x_c) <= f(x_{n+1}) then replace x_{n+1} with x_c
    else shrink
• Shrinking: for all x_k, k = 2 … n+1:
    x_k = x_1 + s * (x_k - x_1)
    next iteration

Changing the Simplex
[Figure: four panels showing reflection, expansion, and contraction of x_{n+1} relative to the centroid x_0, and shrinking towards x_1]

Powell Line Search
• Select directions in search space, then
    loop until convergence
      loop over directions d
        perform line search for direction d until convergence
• Many variants
  • Select directions
    • Easiest is to use the model scores
    • Or combine multiple scores
  • Step size in line search
• MER (Och 2003) is line search along models with smart selection of steps

Minimum Error Training
• For each hypothesis we have Q = Σ_k c_k * Q_k
• Select one model k:
    Q = c_k Q_k + Σ_{n≠k} c_n * Q_n = c_k Q_k + Q_Rest
[Figure: total model score of one hypothesis plotted against c_k; a straight line with intercept Q_Rest whose slope is given by the individual model score Q_k; the line is labeled with the hypothesis's metric score, WER = 8]

Minimum Error Training
• Source sentence 1
• Depending on the scaling factor c_k, different hyps are in 1-best position
• Set c_k so that the metric-best hyp is also the model-best
[Figure: model score against c_k for h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); best hyp per range of c_k: h11, h12, h13, with errors 8, 5, 4]

Minimum Error Training
• Select minimum number of evaluation points
  • Calculate intersection points
  • Keep an intersection point only if the hyps are minimum at that point
  • Choose evaluation points between intersection points
[Figure: same plot as before; h11 (WER = 8), h12 (WER = 5), h13 (WER = 4); best hyp per range of c_k: h11, h12, h13, with errors 8, 5, 4]

Minimum Error Training
• Source sentence 1, now with different error scores
• Optimization would find a different c_k
• => Different metrics lead to different scaling factors
[Figure: model score against c_k for h11 (WER = 8), h12 (WER = 2), h13 (WER = 4); best hyp per range of c_k: h11, h12, h13, with errors 8, 2, 4]

Minimum Error Training
• Sentence 2
• Best c_k in a different range
• No matter which c_k, h22 would never be 1-best
[Figure: model score against c_k for h21 (WER = 2), h22 (WER = 0), h23 (WER = 5); best hyp per range of c_k: h21, h23, with errors 2, 5]

Minimum Error Training
• Multiple sentences
[Figure: model score against c_k for both sentences: h11 (WER = 8), h12 (WER = 5), h13 (WER = 4), h21 (WER = 2), h22 (WER = 0), h23 (WER = 5); summed error of the 1-best hyps per range of c_k: 10, 7, 10, 9]

Iterate Decoding - Optimization
• N-best list is a (very restricted) substitute for the search space
• With updated feature weights we may have generated other (better) translations
• Some of the hyps in the n-best list would have been pruned
• Iterate
  • Re-translate with new feature weights
  • Merge new translations with old translations (increases stability)
  • Run optimizer over the larger n-best lists
  • Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)

Avoiding Local Minima
• Optimization can get stuck in a local minimum
• Remedies
  • Fiddle around with the parameters of your optimization algorithm
  • Larger n-best list -> more evaluation points
  • Combine with a Simulated Annealing type approach (Smith & Eisner, 2007)
  • Restart multiple times

Random Restarts
• Comparison Simplex/Powell (Alok, unpublished)
• Comparison Simplex/ext. Simplex/MER (Bing Zhao, unpublished)
• Observations:
  • Alok: Simplex 'jumpier' than Powell
  • Bing: Simplex better than MER
  • Both: you need many restarts

Optimizing NOT Towards References
• Ideally, we want system output identical to the reference translations
• But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  • E.g. we restrict the reordering window
  • We have unknown words
  • Reference translations may have words unknown to the system
• Instead of forcing the decoder towards the reference translations, optimize towards the best translations generated by the system
  • Find hypotheses with the best metric score
  • Use those as pseudo references
  • Optimize towards the pseudo references

Optimizing Towards Different Metrics
• Automatic metrics have different characteristics
• Optimizing towards one does not mean that other metric scores will also go up
• Esp. different metrics prefer shorter or longer translations
  Typically: TER < BLEU < METEOR (< means 'shorter translation')
• Mauser et al (2007) on Ch-En NIST 2005 test set
  • Reasonably well behaved
  • Resulting length of translation differs by more than 15%

Generalization to other Test Sets
• Optimize on one set, test on multiple other sets
• Again Mauser et al, Ch-En
  • Shown is the behavior over Simplex optimization iterations
  • Nice, nearly parallel development of metric scores
• However, we have also observed brittle behavior
  • Esp. when the ratio src_length / ref_length is very different between dev and eval test sets

Large Weight = Important Feature?
• Assume we have c_LM = 1.0, c_TM = 0.55, c_WC = 3.2
• Which feature is most important?
• Cannot say!!!
• We want to re-rank the n-best lists
• Feature weights scale feature values such that they can compete
• Example:
  • Variation in LM and TM is larger than for WC
  • Need a large weight for WC to make small differences effective

          Q_LM   Q_TM   Q_WC   Q
    H1    22     83     7      112
    H2    29     77     8      116
    H3    26     85     9      120

• To know if a feature is important, remove it and look at the drop in metric score

Open Issues
• Should not all optimizers get the same results, if done right?
  • The models are the same, it's just finding the right mix
  • If local minima can be avoided, then similarly good optima should be found
• How to stay safe
  • Avoid good optima close to 'cliffs'
  • Different configurations give very similar metric scores; pick one which is more stable
• One hat fits all?
  • Why one set of feature weights?
  • How about different sets for
    • Good/bad translations (tuning on tail: mixed results so far)
    • Short/long sentences
    • Begin/middle/end of sentence
    • ...

Summary
• Optimizing the system by modifying scaling factors (feature weights)
• Different optimization approaches can be used
  • Simplex, Powell most common
  • MERT (Och) is similar to Powell, with pre-calculation of grid points
• Many local optima, avoid getting stuck early
  • Most effective: many restarts
• Generalization
  • To unseen test data: mostly ok, sometimes selection of dev set has big impact (length penalty!)
  • To different metrics: reasonably stable (metrics are reasonably correlated in most cases)
• Still open questions => more research needed
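
Worked example: MER line search along one scaling factor

The Minimum Error Training slides describe a line search over a single scaling factor c_k: each hypothesis is a straight line c_k * Q_k + Q_Rest, intersection points are kept only if both hypotheses are minimal there, and the metric is evaluated once between neighboring intersection points. The following is a minimal Python sketch of that idea, not Och's implementation. The WER values are taken from the slides; the slopes (Q_k) and intercepts (Q_Rest) are made-up numbers chosen only so that the hypotheses become 1-best in the order shown on the slides. The function names, the `margin` parameter, and the treatment of the model score as a cost (lower is better, matching the argmin used earlier) are assumptions of this sketch. Errors are simply summed over sentences, as in the multiple-sentences slide.

```python
from itertools import combinations

# Each hypothesis: (Q_k, Q_rest, error). Model cost at weight c is Q_k * c + Q_rest.
# WER values are from the slides; Q_k and Q_rest are hypothetical illustration values.
NBEST = [
    [(3.0, 0.0, 8), (2.0, 1.0, 5), (1.0, 4.0, 4)],   # sentence 1: h11, h12, h13
    [(2.0, 0.0, 2), (1.5, 2.0, 0), (1.0, 2.0, 5)],   # sentence 2: h21, h22, h23
]

def best_hyp(hyps, c):
    """Index of the 1-best hypothesis (lowest model cost) at weight c."""
    return min(range(len(hyps)), key=lambda i: hyps[i][0] * c + hyps[i][1])

def breakpoints(hyps, tol=1e-9):
    """Intersection points at which the two crossing hypotheses are both minimal."""
    points = set()
    for (s1, r1, _), (s2, r2, _) in combinations(hyps, 2):
        if s1 == s2:
            continue  # parallel lines never cross
        c = (r2 - r1) / (s1 - s2)
        cost = s1 * c + r1
        # Keep the crossing only if no other hypothesis is cheaper at this c.
        if all(s * c + r >= cost - tol for s, r, _ in hyps):
            points.add(c)
    return points

def total_error(nbest, c):
    """Sum of the errors of the 1-best hypotheses of all sentences at weight c."""
    return sum(hyps[best_hyp(hyps, c)][2] for hyps in nbest)

def line_search(nbest, margin=1.0):
    """Evaluate one point per interval between breakpoints; return (best c, error)."""
    cuts = sorted(set().union(*(breakpoints(hyps) for hyps in nbest)))
    if not cuts:
        candidates = [0.0]
    else:
        midpoints = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
        candidates = [cuts[0] - margin] + midpoints + [cuts[-1] + margin]
    best_c = min(candidates, key=lambda c: total_error(nbest, c))
    return best_c, total_error(nbest, best_c)

best_c, err = line_search(NBEST)
print(best_c, err)
```

With these illustrative numbers the breakpoints are c_k = 1, 2, 3, the summed errors in the four ranges are 10, 7, 10, 9 (the values in the multiple-sentences figure), and the search returns c_k = 1.5 with a total error of 7. Hypothesis h22, despite WER = 0, contributes no breakpoint because it is never 1-best. For a corpus-level metric such as BLEU, the per-sentence errors would have to be replaced by accumulated counts; that case is not covered here.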
