Tuning SMT
June 3, 2014

Overview
• Brief recap of SMT
• A new approach
• Tuning based on MERT
• Tuning based on PRO
• Tuning based on MIRA

SMT and the generative/noisy channel model
• We want to translate from f (source) to e (target)
• Noisy channel model
  • p(f | e) is the translation model
  • p(e) is the language model
• Search for the best target: e* = argmax_e p(f | e) * p(e)
• Assume independence assumptions and use MLE to define the parameters of the models
• Enhanced version where p(f | e) is decomposed into:
  • Phrase-based model
  • Distortion model

SMT and discriminative model

Example of features

Log-linear model, an example
• We have identified a set of "basic features" for each (f, e)
  • The language model p(e) is feature h1 for (e, f)
  • The phrase model p(pm) is feature h2 for (e, f)
  • The distortion model p(d) is feature h3 for (e, f)
• We assume that we have defined the weights
• P(e_i | f) gets a score and a probability
  • score = h1_i*w1 + h2_i*w2 + h3_i*w3
  • p(e_i | f) = (1/Z) * exp(h1_i*w1 + h2_i*w2 + h3_i*w3)
• e* = argmax_e p(e | f)

Log-linear model, questions?
• What is the meaning of the scores?
• How do we define good features?
  • Feature selection or feature engineering process
• How do we define good weights (or parameters)?
  • Parameter tuning / supervised machine learning

Parameter Tuning, an overview
• Optimize the weights in the log-linear model
• Assume that we have defined features
• Examples of features are the phrase translation model, language model, reordering model, backward phrase translation probability, etc.
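The log-linear scoring above can be sketched in a few lines of Python. The feature values and weights below are made-up numbers purely for illustration; in practice the weights are exactly what tuning has to find:

```python
import math

# Hypothetical feature values h(e, f) for three candidate translations,
# ordered [language model, phrase model, distortion], as log-probabilities.
hypotheses = {
    "e1": [-2.1, -1.3, -0.5],
    "e2": [-2.5, -0.9, -0.7],
    "e3": [-3.0, -1.1, -0.2],
}
weights = [0.5, 0.3, 0.2]  # assumed weights w1..w3 (these are what tuning sets)

def score(feats, w):
    # linear model score: h1*w1 + h2*w2 + h3*w3
    return sum(h_i * w_i for h_i, w_i in zip(feats, w))

scores = {e: score(h, weights) for e, h in hypotheses.items()}
Z = sum(math.exp(s) for s in scores.values())            # partition function
probs = {e: math.exp(s) / Z for e, s in scores.items()}  # p(e | f)
best = max(scores, key=scores.get)                       # e* = argmax_e p(e | f)
```

Note that the argmax only depends on the linear scores, not on the normalization by Z; tuning (MERT, PRO, MIRA below) is the search for the `weights` vector that makes this argmax agree with an automatic evaluation metric.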
• Metrics to evaluate translation quality automatically
• Tuning set
• Online or batch
  • Online optimizes the weights after processing each sentence
  • Batch optimizes the weights after processing the whole data set
• Algorithms/methods to perform tuning
  • Minimum Error Rate Training (MERT)
  • Pairwise Ranking Optimization (PRO)
  • Margin Infused Relaxed Algorithm (MIRA)

Automatic evaluation criteria of translation quality

Tuning Set
• Limited number of sentences (1000-2000)
• Each sentence has been translated into a corresponding n-best list

Minimum Error Rate Training (MERT)

MERT overview in Moses
[Diagram: models + tuning set → Decoder → n-best list → Scorer/Optimizer (inner loop); the outer loop starts from initial weights and iterates until it yields the optimal weights]

MERT summary
• Relatively simple and very established
• Moses/MERT supports different automatic evaluation metrics:
  • BLEU
  • Translation Edit Rate (TER)
  • Position-independent Error Rate (PER)
  • Cover Disjoint Error Rate (CDER)
• MERT only optimizes based on the best hypothesis
• Does not scale well (searches all directions)
• Can support up to 15-30 features (Moses)
• Batch optimization

Pairwise Ranking Optimization (PRO)
• The purpose of PRO is to support scalability
• Optimization by ranking pairs of translations
• Feature space based on the difference of features for each translation pair
• Linear binary classification
• Define a gold scoring function to rank translations
  • BLEU+1 is used to score each sentence
• Only impacts the optimization function ("inner loop")
• Batch optimization

PRO definition (1)
• Define the relation:
  • g(e1) > g(e2) ⇔ h(e1) > h(e2)
  • g(e) is the gold evaluation metric
  • h(e) is the model score w^T · x(e, f)
    • where w is the weight vector
    • x(e, f) is the feature vector
• h(e1) − h(e2) > 0 ⇔ w^T · x(e1, f) − w^T · x(e2, f) > 0
  ⇔ w^T · (x(e1, f) − x(e2, f)) > 0

PRO definition (2)
• Define a binary classifier using a new feature space
• w^T · x_d > 0 is a binary classifier
• The difference vector x(e_i, f) − x(e_k, f) corresponds to "+" (e_i is the better translation)
• The difference vector x(e_l, f) − x(e_t, f) corresponds to "−" (e_t is the better translation)
• A sampling function reduces the number of
difference vectors; otherwise there is too much training data

Summary of the PRO process
1. Generate the n translation hypotheses for each sentence
2. Calculate the gold scores for each translation
3. Sample the score differences to define training data
4. Feed the training data to a linear classifier
5. The weights generated by the classifier are the optimal weights
6. Go back to the decoder with the new weights
7. Go to 1

Margin Infused Relaxed Algorithm (MIRA)

MIRA Summary
• Tunes the translation system with a very large number of features
• Basic MIRA is online
• There is a version of MIRA that is batch based
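As a rough illustration of the online setting, here is a minimal sketch of a MIRA-style (passive-aggressive) weight update for one sentence, comparing an oracle hypothesis against the model's current best. The function name, the clipping constant `C`, and all numeric feature/score values are assumptions for illustration, not the Moses implementation:

```python
def mira_update(w, oracle_feats, pred_feats, oracle_score, pred_score, C=0.01):
    """One clipped passive-aggressive update: move w toward the oracle
    hypothesis just enough to separate it from the current best, capped by C."""
    delta = [o - p for o, p in zip(oracle_feats, pred_feats)]
    margin = sum(w_i * d for w_i, d in zip(w, delta))  # h(oracle) - h(prediction)
    loss = (oracle_score - pred_score) - margin        # metric gap minus model margin
    norm_sq = sum(d * d for d in delta)
    if loss <= 0 or norm_sq == 0:
        return w                                       # already separated: no update
    alpha = min(C, loss / norm_sq)                     # clipped step size
    return [w_i + alpha * d for w_i, d in zip(w, delta)]

# Made-up feature vectors and BLEU+1-style gold scores for one sentence:
w = [0.5, 0.3, 0.2]
w_new = mira_update(w,
                    oracle_feats=[-2.0, -1.0, -0.4],
                    pred_feats=[-2.4, -0.8, -0.9],
                    oracle_score=0.6, pred_score=0.3)
```

Each per-sentence update is cheap, which is what lets MIRA handle feature sets far larger than the 15-30 features MERT can cope with.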