The State of the Art in Phrase-Based Statistical Machine Translation (SMT)
Roland Kuhn, George Foster, Nicola Ueffing
February 2007

Tutorial Plan
A. Overview
B. Details & research topics
NOTE: the best overall reference for SMT hasn’t been published yet - Philipp Koehn’s « Statistical Machine Translation » (to be published by Cambridge University Press). Some of the material presented here is from a draft of that book.

Tutorial Plan
A. Overview
   The MT Task & Approaches to it
   Examples of SMT output
   SMT Research: Culture, Evaluations, & Metrics
   SMT History: IBM Models
   Phrase-based SMT
   Phrase-Based Search
   Loglinear Model Combination
   Target Language Model P(T)
   Flaws of Phrase-based, Loglinear Systems
   PORTAGE: a Typical SMT System

The MT Task & Approaches to it
• Core MT task: translate a sentence from a source language S to a target language T
• Conventional expert system approach: hire experts to write rules for translating S to T
• Statistical approach: using a bilingual text corpus (lots of S sentences & their translations into T), train a statistical translation model that will map each new S sentence into a T sentence

[Diagram: expert system vs. statistical system]
Expert system: experts + manually coded rules (If « … » then … Else …)
Statistical system: bilingual parallel corpus (S, T) + machine learning, giving statistical rules such as P(but | mais) = 0.7, P(however | mais) = 0.3, P(where | où) = 1.0, …
S: Mais où sont les neiges d’antan?
Expert system output
T: But where are the snows of yesteryear?
Statistical system output
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
T3: Hey - where did the old snow go? P = 0.18
…

“Expert” vs. “Statistical” systems: expert systems
• Expert systems incorporate deep linguistic knowledge
• They still yield top performance for well-studied language pairs in non-specialized domains
• Computationally cheap (compared to statistical MT)
BUT
• Brittle
• Expensive to maintain (messy software engineering)
• Expensive to port to new semantic domains or new language pairs
• Typically yield only one T sentence for each S sentence

“Expert” vs. “Statistical” systems: statistical systems
• More E-text, better algorithms, stronger machines → quality of SMT output approaching that of expert systems
• Statistical approach has beaten expert systems in related areas, e.g., automatic speech recognition
• SMT is robust (does well on frequent phenomena)
• Easy to maintain
• Easily ported to new semantic domains or new language pairs - IF training corpora are available
• For each S sentence, yields many T sentences (each with a probabilistic score) - useful for semi-supervised translation

Structure of Typical SMT System
[Diagram: structure of a typical SMT system]
Offline training: bilingual parallel corpus (S, T) → phrase translation model; target text (plus optional extra LM training corpora) → target language model
Translation: S: Mais où sont les neiges d’antan? → Preprocessor → mais où sont les neiges d’ antan ? → Decoder (uses phrase translation model, target language model, other knowledge sources) → initial N-best hypotheses → Reordering → Postprocessor → final N-best hypotheses
Initial N-best hypotheses
T1: however where are the snows #d’ antan# P = 0.22
T2: but where are the snows #d’ antan# P = 0.21
T3: but where did the #d’ antan# snow go P = 0.13
…
Final N-best hypotheses
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
…

Commercial Systems
• Systran, the biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring the possibility of adding a statistical component to their system.
• Important exception: LanguageWeaver, a new company based on SMT (closely linked to researchers at ISI, U. Southern California)
• Google has a superb SMT research team - but online, they still mainly use Systran (probably because of the computational cost of online SMT). They seem to be gradually swapping in SMT systems for language pairs with lower traffic.

Examples of SMT output
Chinese → English output:

REF: Hong Kong citizens jumped for joy when they knew Beijing's bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid for the 2008 Olympic Games, very happy.

REF: The U.S. delegation includes a China expert from Stanford University, two Senate foreign policy aides and a former State Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising members from the Stanford University, one of the Chinese experts, two of the Senate foreign policy as well as assistant who was responsible for dealing with Pyongyang authorities of the former State Department officials.
PORTAGE Nov. 2006: The US delegation included members from Stanford University and an expert on China, two Senate foreign policy, and one who is responsible for dealing with Pyongyang authorities, a former State Department officials.

REF: Kuwait foreign minister Mohammad Al Sabah and visiting Jordan foreign minister Muasher jointly presided the first meeting of the joint higher committee of the two countries on that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day and visiting Jordan Foreign Secretary maasher cochaired the section about the two countries mixed Committee at the inaugural meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and visiting Jordanian Foreign Minister of Malaysia, cochaired by the two countries, the joint commission met for the first time.

REF: The Beagle 2 was scheduled to land on Mars on Christmas Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on Mars, but it signals is still unpredictable.

And a silly English → German example from Google (Jan. 25, 2007):
the hotel has a squash court → das Hotel hat ein Kürbisgericht (think “zucchini tribunal”)
* but this kind of error (perfect syntax, never-seen word combination) isn’t typical of a statistical system, so this was probably a rule-based system

SMT Research: Culture, Evaluations, & Metrics
Culture
• SMT research is very engineering-oriented; driven by performance in NIST & other evaluations (see later slides) → if a heuristic yields a big improvement in BLEU scores & a wonderful new theoretical approach doesn’t, expect the former to get much more attention than the latter
• Advantages of SMT culture: open-minded to new ideas that can be tested quickly; researchers who count have working systems with reasonably well-written software (so they can participate in evaluations)
• Disadvantages of SMT culture: closed-minded to ideas not tested in a working system → if you have a brilliant theory that doesn’t show a BLEU score improvement in a reasonable baseline system, don’t expect SMT researchers to read your paper!

The NIST MT Evaluations
• Since 2001, the US National Institute of Standards & Technology (NIST) has been evaluating MT systems
• Participants include MIT, IBM, CMU, RWTH, Hong Kong UST, ATR, IRST, others … - and NRC: NRC’s system is called PORTAGE (in NIST evaluation 2005 & 2006).
• Main NIST language pairs: Chinese→English, Arabic→English
• Semantic domains: news stories & multigenre
• Training corpora released each fall, test corpus each spring; participants have 1 working week to submit target sentences
• NIST evaluates systems comparatively. In 2005 (http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html) & 2006 (http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html), statistical systems beat expert systems according to the BLEU metric

Other MT Evaluations
• WPT/WMT: usually organized each spring by Philipp Koehn & Christoph Monz - smaller training corpora than NIST, European language pairs. In 2006, evaluated on French ↔ English, German ↔ English, Spanish ↔ English. http://www.statmt.org/wmt06/proceedings/
• TC-STAR: evaluation for spoken language translation. In 2006, evaluated on Chinese → English (one direction only) and Spanish ↔ English. http://www.elda.org/tcstar-workshop/2006eval.htm
• IWSLT: evaluation for spoken language translation. In 2006, evaluated on Arabic → English, Chinese → English, Italian → English, Japanese → English. http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html

GALE Project
• Huge DARPA-sponsored project: $50 million per year for 5 years. Three consortia: BBN-led « Agile », IBM-led « Rosetta », SRI-led « Nightingale ».
• NRC team is in MT working group of Nightingale.
[Diagram: GALE pipeline] (Arabic or Chinese) speech → automatic speech recognition (ASR) → (Arabic or Chinese) transcriptions; transcriptions and (Arabic or Chinese) documents → machine translation (MT) → English text → Distillation (IR/database component)

What is BLEU?
• Human evaluation of automatic translation quality is hard & expensive. The BLEU metric (invented at IBM) compares MT output with human-generated reference translations via N-gram matches.
• N-gram precision = #(N-grams in MT output seen in ref.) / #(N-grams in MT output)
• Example (from P. Koehn):
  REF = Israeli officials are responsible for airport security
  Sys A = Israeli officials responsibility of airport safety (2-gram match « Israeli officials », 1-gram match « airport »)
  Sys B = airport security Israeli officials are responsible (2-gram match « airport security », 4-gram match « Israeli officials are responsible »)
• Sys A: 1-gram precision = 3/6 (Israeli, officials, airport); 2-gram precision = 1/5 (Israeli officials); 3-gram precision = 0/4; 4-gram precision = 0/3.
  Sys B: 1-gram precision = 6/6; 2-gram precision = 4/5; 3-gram precision = 2/4; 4-gram precision = 1/3.
• BLEU-N multiplies together the N N-gram precisions - the higher the value, the better the translation. But one could cheat by having very few words in the MT output - hence the brevity penalty.

$\text{BLEU-}N = \text{brevity-penalty} \cdot \prod_{i=1}^{N} \text{precision}_i^{\lambda_i}$, where $\text{brevity-penalty} = \min(1,\ \text{output-length} / \text{ref-length})$.
Usually, we set N = 4 and all $\lambda_i = 1$, so we have
$\text{BLEU-4} = \min(1,\ \text{output-length} / \text{ref-length}) \cdot \prod_{i=1}^{4} \text{precision}_i$.
• If any MT output has no N-grams matching the ref., for some N = 1, …, 4, BLEU-4 is zero. So, normally compute BLEU over a whole test set of at least a hundred or so sentences.
• Multiple references: if an N-gram has K occurrences in the output, look for a single ref. that has K or more copies of that N-gram. If such a single ref. is found, that N-gram has matched K times. If not, look for the ref. that has the highest number of copies (L) of that N-gram; use L in the precision calculation. Ref-length = closest length.

Does BLEU correlate with human judgment?
[Figure: quality score (0 = terrible, 3 = excellent) plotted by translator identity]
* BLEU kind of correlates with human judgment; works best with multiple references.
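The N-gram precisions and the simplified BLEU-4 from the preceding slides can be checked with a short sketch. It assumes a single reference, whitespace tokenization, and the slides' own simplifications (brevity penalty min(1, output-length/ref-length), all exponents equal to 1); the function names are illustrative and not taken from any evaluation toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(output, reference, n):
    """Return (matches, total) for the n-grams of `output` against one reference.

    Counts are clipped: an n-gram seen K times in the output is credited at
    most as many times as it occurs in the reference.
    """
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in out_counts.items())
    return matches, sum(out_counts.values())


def bleu(output, reference, max_n=4):
    """Simplified BLEU-N as on the slides: brevity penalty times the product
    of the n-gram precisions (all exponents set to 1)."""
    score = min(1.0, len(output) / len(reference))  # brevity penalty
    for n in range(1, max_n + 1):
        matches, total = ngram_precision(output, reference, n)
        score *= matches / total if total else 0.0
    return score


REF = "Israeli officials are responsible for airport security".split()
SYS_A = "Israeli officials responsibility of airport safety".split()
SYS_B = "airport security Israeli officials are responsible".split()

for name, hyp in (("Sys A", SYS_A), ("Sys B", SYS_B)):
    precisions = [ngram_precision(hyp, REF, n) for n in range(1, 5)]
    print(name, precisions, round(bleu(hyp, REF), 4))
```

On a single sentence, Sys A scores zero because it has no matching 3-grams, which is why BLEU is normally computed over a whole test set. Note that standard BLEU implementations differ from this sketch (geometric mean of precisions, exponential brevity penalty).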
Why BLEU Is Controversial
• If a system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score.
• Proponents of the expert system approach argue that BLEU is biased against this approach, & favours SMT
• Partial confirmation:
  1. In the NIST 2006 Arabic-to-English evaluation, the AppTek hybrid system (rule-based + SMT system) did best according to human evaluators, but not according to BLEU.
  2. In the 2006 WMT evaluation, Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).

Other Automatic Metrics
• SMT systems need an automatic metric for tuning (must try out thousands of variants). Automatic metrics compare MT output with human-generated reference translations.
• Rivals of BLEU:
  * translation edit rate (TER) - how many edit ops to match references? http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
  * METEOR - compares MT output with references in a way that’s less dependent on word choice (via stemming, WordNet, etc.). Gaining credibility: correlates better than BLEU with human scores. However, METEOR is only defined for translation into English. http://www.cs.cmu.edu/~alavie/METEOR/

Manual Metrics
• Human evaluation of SMT is preferable to automatic evaluation, but much slower & more expensive. Can’t use it for system tuning.
• Ask humans to rank systems by adequacy and fluency. Adequacy: does MT output convey the same meaning as the source? Fluency: does MT output look like normal target-language text (good syntax & idiom)?
• Metrics based on human postediting of MT output. E.g., HTER.
• Metrics based on human understanding of MT output. Related to adequacy, but less subjective.
  E.g., Lincoln Labs metric: give English output of Arabic MT system to a unilingual English analyst, then test him with the standard « Defense Language Proficiency Test » (see Jones05).

Who Uses Which Metric When?
• Many groups use BLEU for automatic system tuning
• NIST, WPT/WMT, TC-STAR, & other evaluations often have BLEU as official metric, with some human reality checks. Koehn & Monz WPT/WMT: participants do human fluency/adequacy evaluations - nice analyses!
• Many « expert/rule-based MT » researchers hate BLEU (can become an excuse not to evaluate a system competitively)
• In theory, manual metrics should be related to the MT task: e.g., adequacy for browsing/gisting, Lincoln Labs metric for intelligence community, HTER if MT output will be post-edited. So why is HTER GALE’s official metric?
  HTER = Human Translation Edit Rate: MT output hand-edited by humans; measure # of operations performed.

SMT History: IBM Models
• In the late 1980s, members of IBM’s speech recognition group applied statistical learning techniques to bilingual corpora. These American researchers worked mainly with the Canadian Hansard - bilingual transcription of parliamentary proceedings.
• These researchers quit IBM around 1991 for a hedge fund, Renaissance Technologies - they are now very rich!
• Renewed interest in their work sparked the revival of research into statistical learning for MT that occurred from the late 1990s onward. The newer « phrase-based » approach still partially relies on IBM models.
• The IBM approach used Bayes’s Theorem to define the « Fundamental Equation » of MT (Brown et al. 1993)

Fundamental Equation of MT
The best-fit translation of a source-language (French) sentence S into a target-language (English) sentence T is:
$\hat{T} = \arg\max_{T}\ [P(T) \cdot P(S \mid T)]$
where the argmax is the search task, P(T) is the language model, and P(S|T) is the word translation model.
Job of language model: ensure well-formed target-language T
Job of translation model: ensure T could have generated S
Search task: find T maximizing the product P(T)*P(S|T)

• The IBM researchers defined five statistical translation models (numbered in order of complexity)
• Each defines a mechanism for generation of text in one language (e.g., French or foreign = F) from another (e.g., English = E)
• Most general many-to-many case is not covered by IBM models; in this forbidden case, a group of E words generates a group of F words, e.g.:
  The poor don’t have any money ↔ Les pauvres sont démunis
• The IBM models only allow one-to-many generation, e.g.:
  And → Ø; the → Le; program → programme; has → a; been → été; implemented → mis en application
• IBM models 1 & 2 - all lengths for F sentence equally likely
• Model 1 is « bag of words » - word order in F & E doesn’t matter
• In model 2, chance that an E word generates given F word(s) depends on position
• IBM models 3, 4, & 5 are fertility-based

[Figure: IBM model 1, « bag of words »: f1, f2, …, fM are drawn from e1, e2, …, eL with uniform probability; label P(L→M)]
[Figure: IBM model 2, « position-dependent bag of words »: f1, f2, …, fM are drawn from e1, e2, …, eL with position-dependent probabilities P(1→1), …, P(1→M), P(2→1), …, P(2→M), …, P(L→1), …, P(L→M)]

Parameters:
φ(ei) = fertility of ei = prob. that ei will produce 0, 1, 2 … words in F
t(f|ei) = probability that ei can generate f
Π(j | i, k) = distortion prob. = prob. that kth word generated by ei ends up in pos. j of F

[Figure: IBM model 3: each of e1, e2, …, eL (and Ø) has a fertility φ(ei), generates F words via t, and the distortion model Π, with P(1→1), P(1→2), …, P(M→M), places them in positions f1, f2, …, fM]
NOTE (for the model 4 figure that follows): phrases can be broken up, but with lower prob. than in model 3
[Figure: IBM model 4: same fertilities φ(ei) and translation probabilities t as model 3, but the distortion model Π moves the F words generated by one E word as a group (phrase)]
IBM model 5: cleaned-up version of model 4 (e.g., two F words can’t be given same position)

Phrase-based SMT
Four key ideas
• phrase-based models (Och04, Koehn03, Marcu02)
• dynamic programming search algorithms (Koehn04)
• loglinear model combination (Och02)
• error-driven learning (Och03)

Example: « cul de sac »
Phrase-based approach introduced around 1998 by Franz Josef Och & others (Ney, Wong, Marcu): many-words-to-many-words (improvement on IBM one-to-many)
word-based translation = « ass of bag » (N. Am.), « arse of bag » (British)
phrase-based translation = « dead end » (N. Am.), « blind alley » (British)
This knowledge is stored in a phrase table: a collection of conditional probabilities of the form P(S|T) = backward phrase table, or P(T|S) = forward phrase table.
Recall Bayes: $\hat{T} = \arg\max_{T}\ [P(T) \cdot P(S \mid T)]$ → backward table essential, forward table used for heuristics.
Tables for French → English:
backward, P(S|T):
  p(sac|bag) = 0.9
  p(sacoche|bag) = 0.1
  …
  p(cul de sac|dead end) = 0.7
  p(impasse|dead end) = 0.3
  …
forward, P(T|S):
  p(bag|sac) = 0.5
  p(hand bag|sac) = 0.2
  …
  p(ass|cul) = 0.5
  p(dead end|cul de sac) = 0.85
  …

Overall Phrase Pair Extraction Algorithm
1. Run a sentence aligner on a parallel bilingual corpus (won’t go over this)
2. Run word aligner (e.g., one based on IBM models) on each aligned sentence pair - see next slide.
3. From each aligned sentence pair, extract all phrase pairs with no external links - see two slides ahead.

Symmetrized Word Alignment using IBM Models
Alignments produced by IBM models are asymmetrical: source words have at most one connection, but target words may have many connections.
To improve quality, use symmetrization heuristic (Och00):
1. Perform two separate alignments, one in each translation direction.
2. Take intersection of links as starting point.
3. Add neighbouring links from union until all words are covered.
[Figure: three word-alignment grids for « I want to go home » / « Je veux aller chez moi »: S = English, T = French; S = French, T = English; and the symmetrized alignment]

« Diag-And » phrase extraction
Input: aligned sentence pair
Output: set of consistent phrases
[Figure: word alignment of « Je l’ ai vu à la télévision » with « I saw him on television »]
Extract all phrase pairs with no external links, for example:
Good pairs: (Je, I) (Je l’ ai vu, I saw him) (ai vu, saw) (l’ ai vu à la, saw him on)
Bad pairs: (Je l’ ai vu, I saw) (l’ ai vu à, saw him on) (la télévision, television)

Phrase-Based Search
Generative process:
1. Split source sentence into “phrases” (N-grams).
2. Translate each source phrase (one-to-one).
3. Permute target phrases to get final translation.
Much simpler and more intuitive than the IBM process, but the price of this is no provision for gaps, e.g., ne VERB pas
Example: Je l’ ai vu à la télévision → (steps 1 & 2) I him saw on television → (step 3) I saw him on television
*** NOTE: XRCE’s Matrax does handle gaps

Order: target hypotheses grow left → right, from source segments consumed in any order
[Diagram: search over Source: s1 s2 s3 s4 s5 s6 s7 s8 s9]
• Segmentation and backward table P(S|T): p(s2 s3 | t8), p(s2 s3 | t5 t3), …, p(s3 s4 | t4 t9), …
• Pick s2 s3 first, then phrase translation: Tgt hyp: t8| … or Tgt hyp: t5 t3| …
• Pick s3 s4 first, then phrase translation: Tgt hyp: t4 t9| …
• From Tgt hyp: t8| …, pick s5 s6 s7, then phrase translation: Tgt hyp: t8| t6 t2| …
• Language model P(T): scores growing target hypotheses left → right
• Phrase table: 1. suggests possible segments 2. supplies phrase translation scores

Loglinear Model Combination
Previous slides show a basic system that ranks hypotheses by P(S|T)*P(T). Now let’s introduce an alignment/reordering variable A (aligns T & S phrases).
We want
$\hat{T} = \arg\max_{T} P(T \mid S) \approx \arg\max_{T,A} P(T, A \mid S) = \arg\max_{T,A}\ f_1(T,A,S)^{\lambda_1} \cdot f_2(T,A,S)^{\lambda_2} \cdots f_M(T,A,S)^{\lambda_M} = \arg\max_{T,A} \exp\Big(\sum_i \lambda_i \log f_i(T,A,S)\Big)$.
The $f_i$ now typically include not only functions related to P(S|T) and the language model P(T), but also to A (« distortion »), P(T|S), length(T), etc. The $\lambda_i$ serve as reliability weights. This change in score computation doesn’t fundamentally change the search algorithm.

Advantages
Very flexible! Anyone can devise dozens of features.
• E.g., if lots of mismatched brackets in output, include a feature function that outputs +1 if no mismatched brackets, -1 if there are mismatched brackets.
• So lots of new features are being tried in a somewhat haphazard way.
• But systems are steadily improving - outputs from NIST 2006 look much better than those from NIST 2002. SMT is not good enough to replace human translators, but good enough for, e.g., most Web browsing. Using 1000 machines and massive quantities of data, Google got 45.4 BLEU for Arabic to English, 35.0 for Chinese to English - very high scores!

Typical Loglinear Components for SMT Decoding
• Joint counts C(S,T) from phrase extraction yield estimates P(S|T) stored in the “backward” phrase table and estimates P(T|S) stored in the “forward” phrase table. These are typically relative frequency estimates (but we’ve looked at smoothed variants).
• Distortion model D(T,A,S) assigns a score to the amount of phrase reordering incurred in going from S to hypothesis T. Can be based purely on displacement, or be lexicalized (identity of words in S & T is important).
• Length model L(T,S) scores the probability that a hypothesis of length |T| is generated from a source of length |S|.
• Language model P(T) gives the probability of word sequence T in the target language - see next few slides.
NOTE: these are just for decoding - you can use lots more components for N-best/lattice reordering!
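A minimal sketch of loglinear scoring, to make the exp(Σ λi log fi) form concrete. The feature names, feature values, and weights below are all hypothetical, chosen only for illustration; a real decoder computes the fi from its phrase table, language model, and distortion model and applies them during search rather than to a fixed list.

```python
import math


def loglinear_score(features, weights):
    """Sum of lambda_i * log f_i over the feature functions of one hypothesis."""
    return sum(weights[name] * math.log(value) for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest loglinear score."""
    return max(hypotheses, key=lambda hyp: loglinear_score(hypotheses[hyp], weights))


# Hypothetical feature values f_i(T, A, S) for three candidate translations.
HYPOTHESES = {
    "but where are the snows of yesteryear ?": {"tm": 0.020, "lm": 0.0010, "distortion": 0.9},
    "however , where are yesterday's snows ?": {"tm": 0.030, "lm": 0.0002, "distortion": 0.9},
    "hey , where did the old snow go ?": {"tm": 0.005, "lm": 0.0020, "distortion": 0.5},
}

# Two hypothetical weight settings (the lambda_i).
BALANCED = {"tm": 1.0, "lm": 1.0, "distortion": 0.5}
WEAK_LM = {"tm": 1.0, "lm": 0.2, "distortion": 0.5}

print(best_hypothesis(HYPOTHESES, BALANCED))
print(best_hypothesis(HYPOTHESES, WEAK_LM))
```

With these made-up numbers, lowering the language-model weight changes which hypothesis wins, which is why the λi need to be tuned rather than guessed.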
Target Language Model P(T)
The Stupidest Thing Noam Chomsky Ever Said
« It must be recognized that the notion of a ‘probability of a sentence’ is an entirely useless one, under any interpretation of this term ». Chomsky, 1969.

• Language model helps generate fluent output by
  1. assigning higher probability to correct word order - e.g., $P_{LM}$(the house is small) >> $P_{LM}$(small the is house)
  2. assigning higher probability to correct word choices - e.g., $P_{LM}$(I am going home) >> $P_{LM}$(I am going house)
• Almost everyone in both the SMT and ASR (automatic speech recognition) communities uses N-gram language models. Start with
  $P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_i \mid w_1, \ldots, w_{i-1}) \cdots P(w_m \mid w_1, \ldots, w_{m-1})$,
  then limit the window to N words. E.g., for N = 3, trigram LM:
  $P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_i \mid w_{i-2}, w_{i-1}) \cdots P(w_m \mid w_{m-2}, w_{m-1})$.
• Estimation is done by relative frequency on a large corpus:
  $P(w_i \mid w_{i-2}, w_{i-1}) \approx f(w_i \mid w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / \sum_{w} C(w_{i-2}, w_{i-1}, w)$.
  E.g., in the Europarl corpus, see 225 trigrams starting « the red … »: C(the red cross) = 123, C(the red tape) = 31, C(the red army) = 9, C(the red card) = 7, C(the red ,) = 5 (and 50 other trigrams). So estimate P(cross | the red) = 123/225 = 0.547.
• But need to reserve probability mass for unseen events - maybe never saw « the red planet » in Europarl, but don’t want to have estimate P(planet | the red) = 0. Also, want estimates whose variance isn’t too high. Smoothing techniques are used to solve both problems. E.g., could linearly smooth trigrams with bigrams & unigrams:
  $P(w_i \mid w_{i-2}, w_{i-1}) \approx \lambda \cdot f(w_i \mid w_{i-2}, w_{i-1}) + \mu \cdot f(w_i \mid w_{i-1}) + (1 - \lambda - \mu) \cdot f(w_i)$, with $\lambda$ and $\mu$ both between 0 and 1.

Measuring Language Model Quality
• Perplexity: metric that measures predictive power of an LM on new data as an average branching factor. E.g., a model that says « any digit 0, …, 9 has equal probability of occurrence » will yield a perplexity of 10.0 on a digit sequence generated randomly from these 10 digits.
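The relative-frequency estimate, the linear smoothing, and the digit perplexity figure can all be reproduced in a few lines. This is a sketch under stated assumptions: the trigram counts are the Europarl figures quoted on the slide, while the bigram and unigram frequencies for « planet » and the weights lam and mu are hypothetical. Perplexity is computed here as the inverse geometric mean of the per-word probabilities.

```python
import math
import random

# Europarl counts of trigrams starting "the red", as quoted on the slide.
# The 50 other trigrams account for the remaining 50 of the 225 occurrences.
TRIGRAM_COUNTS = {"cross": 123, "tape": 31, "army": 9, "card": 7, ",": 5}
OTHER_TRIGRAMS = 50
TOTAL = sum(TRIGRAM_COUNTS.values()) + OTHER_TRIGRAMS  # 225


def relative_frequency(word):
    """f(word | the red) = C(the red word) / sum over w of C(the red w)."""
    return TRIGRAM_COUNTS.get(word, 0) / TOTAL


def interpolate(f_tri, f_bi, f_uni, lam, mu):
    """Linear smoothing: lam*f_tri + mu*f_bi + (1 - lam - mu)*f_uni."""
    return lam * f_tri + mu * f_bi + (1.0 - lam - mu) * f_uni


def perplexity(probabilities):
    """Inverse geometric mean of the per-word probabilities."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


print(round(relative_frequency("cross"), 3))  # 123/225
print(relative_frequency("planet"))           # unseen trigram, so 0.0

# Hypothetical lower-order frequencies and weights, for illustration only.
p_planet = interpolate(relative_frequency("planet"), f_bi=0.001, f_uni=0.0001,
                       lam=0.6, mu=0.3)
print(p_planet > 0.0)  # smoothing gives the unseen trigram some mass

# A model giving every digit probability 0.1, scored on random digits.
rng = random.Random(0)
digits = [rng.randrange(10) for _ in range(1000)]
print(perplexity([0.1 for _ in digits]))
```

The perplexity of the uniform digit model does not depend on which digits are drawn, only on the constant per-word probability of 0.1.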
• Perplexity of LM measured on corpus W = (w1 … wN) is
  $\text{Perp}_{LM}(W) = \Big(\prod_{i=1}^{N} P(w_i \mid LM)\Big)^{-1/N}$ = 1/(geometric average per-word prob.)
  The better the LM is as a model for W, the less « surprised » it is by words of W → higher estimated prob. → lower entropy.
  Typical perplexities for well-trained English trigram LMs with lexica of about 25K words for various dictation domains: Perp(radiology) = 20, Perp(emergency medicine) = 60, Perp(journalism) = 105, Perp(general English) = 247.

• « A Bit of Progress in Language Modeling » (Goodman01) is a good summary of the state of the art in N-gram language modeling.
• Consistently superior method: Kneser-Ney. Intuition: if « Francisco » & « stew » are each seen $10^3$ times in our corpus of $10^6$ words, and neither « eggplant Francisco » nor « eggplant stew » is seen, which should be higher, P(Francisco|eggplant) or P(stew|eggplant)?
  Interpolation answer: $P(w_i \mid w_{i-1}) \approx \lambda \cdot f(w_i \mid w_{i-1}) + (1 - \lambda) \cdot f(w_i)$. So P(Francisco|eggplant) ≈ $\lambda \cdot 0 + (1 - \lambda) \cdot 10^{-3}$ = P(stew|eggplant).
  Kneser-Ney answer: no, « Francisco » only occurs after « San », but the 1,000 occurrences of « stew » are preceded by 100 different words. So when (wi-1 wi) has never been seen before, wi = « stew » is more probable than wi = « Francisco » → P(stew|eggplant) >> P(Francisco|eggplant).
• Kneser-Ney formula (for bigrams - easily extended to N-grams):
  $P_{KN}(w_i \mid w_{i-1}) = \dfrac{\max[C(w_{i-1} w_i) - D,\ 0]}{C(w_{i-1})} + \gamma(w_{i-1}) \cdot \dfrac{\#\{v \mid C(v\, w_i) > 0\}}{\sum_{w} \#\{v \mid C(v\, w) > 0\}}$,
  where D is a discount factor smaller than 1, $\gamma(w_{i-1})$ is a normalization constant, $\#\{v \mid C(v\, w_i) > 0\}$ is the number of different words that precede $w_i$ in the training corpus, and $\sum_{w} \#\{v \mid C(v\, w) > 0\}$ is the number of different bigrams in the training corpus.

Flaws of Phrase-based, Loglinear Systems
• Loglinear feature function combination is too flexible! Makes it easy not to think about theoretical properties of models.
• The IBM models were true models: given arbitrary source sentence S and target sentence T, could estimate non-zero P(T|S).
  Phrase-based “models” are not models: in general, for T which is a good translation of S, they give P(T|S) = 0. They don’t guarantee existence of an alignment between T and S. Thus, the only translations T’ to which a phrase-based system is guaranteed to assign P(T’|S) > 0 are T’ output by the same system.
• This has practical consequences: in general, a phrase-based MT system can’t be used for analyzing pre-existing translations. This rules out many useful forms of assistance to human translators - e.g., spotting potential errors in translations based on regions of low P(T|S).

PORTAGE: A Typical SMT System
1. Sentence-align a big bilingual corpus
2. On each sentence pair, use IBM models to align words
3. Build phrase tables from word alignments via “diag-and” or similar heuristic (Koehn03). Backwards phrase table gives P(S|T) (& is implicit segmentation model).
4. Build language model (LM) for target language: estimates P(T), based on n-grams in T
5. P(S|T) and P(T) are sufficient for decoding, but one often adds other loglinear feature functions such as a distortion penalty
6. Use (Och03) method to find good weights λi for loglinear features
7. Optionally, include reordering step: i.e., decoder outputs many hypotheses (via N-best list or lattice) which are rescored by a larger set of feature functions

Core Engine
[Diagram: PORTAGE core engine]
• « Small » set of information sources, for Canoe decoder: LM (at least one language model), TM (at least 1 phrase translation model), DM (at least one distortion model), NM (number-of-words model)
• A1, A2, A3: any # of additional info. sources (feature functions), for rescorer only
• Weights for « small » set give weighted « small » info wLM*LM, wTM*TM, …, wNM*NM, used by the Canoe decoder
• Source sentence: mais où sont les neiges d’ antan ?
• N-best hypotheses from Canoe decoder:
  H1: hey , where did the old snow go ? P = 0.41
  H2: yet where are yesterday’s snows ? P = 0.33
  H3: but where are the snows of yesteryear ? P = 0.18
  …
• « Large » set of information sources, for Rescorer: weights for « large » set give weighted « large » info kLM*LM, kTM*TM, …, kA3*A3
• Rescored N-best:
  H1: but where are the snows of yesteryear ? P = 0.53
  H2: however , where are yesterday’s snows ? P = 0.20
  …

Training Core Components of PORTAGE
[Diagram: training data flow]
• Preprocessing: raw parallel corpus (src-lang text, tgt-lang text) → src preproc., tgt. preproc. → sentence aligner → clean, aligned parallel corpus
• Clean, aligned parallel corpus → IBM training (models 1 & 2) → phrase pair extraction → phrase translation model (PT)
• Additional monolingual corpora (tgt-lang text) → lang. model builder → language model (LM)
• Other small set models: model3, …, modelK
• dev1 corpus (src, tgt), small set info only → decoder wt optimizer → small set wts w1, …, wK
• Extra models for large set: modelK+1, …, modelM
• dev2 corpus (src, tgt), large set info → rescorer wt optimizer → large set wts w1’, …, wM’

Canoe Optimization of Weights (COW)
Purpose: find weights [w1, …, ws] on « small » set of information sources (N around 100)
[Diagram: COW loop]
• Dev corpus for COW (D sentences):
  D source-language sentences: S1: hé quoi ? S2: charmante élise , vous devenez mélancolique . …. SD: la fin .
  D target-language ref. translations: T1: what’s this ? T2: charming élise , you’re becoming melancholy . …. TD: the end .
• Canoe decoder uses « small » set of information sources I1, I2, …, IS. First call to Canoe: initial weights [w1i, w2i, …, wsi]. 2nd & subsequent calls to Canoe: new weights [w1r, w2r, …, wsr] from « rescore-train ».
• Canoe output: list of D N-best hyp. H1(S1): what’s up ? … HN(S1): are you OK ? H1(S2): cute élise , you’re bummed out . … … HN(SD): all done .
• Expanded list passed to rescore-train: the N-best list on the first call; the union of old & new hypotheses on 2nd & subsequent calls, i.e. H1(S1), …, HN(S1), … (>N hyp. for S1), H1(S2), …, HN(S2), … (>N hyp. for S2), …, H1(SD), …, HN(SD), … (>N hyp. for SD)
• Rescore_train: K random wt. vectors W1 = [w11, w21, …, ws1], …, WK = [w1K, w2K, …, wsK]; each is optimized with Powell’s alg. using BLEU scoring (based on top hyp.) against the ref. translations, yielding Ŵ1, …, ŴK and from these Ŵ

Rescoring = Finding Weights on « Large » Info. Set for Rescorer (N around 1000)
[Diagram: rescorer weight training]
• « Large » set: I1, I2, …, IS, IS+1, …, IL (feature functions); the « small » set is I1, …, IS
• Weights for « small » set fixed by previous COW step: weighted « small » info w1*I1 … wS*IS, used by the Canoe decoder
• Dev corpus for « large » wts (D sent):
  D source-language sentences: S1: hé quoi ? S2: charmante élise , vous devenez mélancolique . …. SD: la fin .
  D target-language ref. translations: T1: what’s this ? T2: charming élise , you’re becoming melancholy . …. TD: the end .
• Canoe output: list of D N-best hyp. H1(S1): what’s up ? … HN(S1): are you OK ? H1(S2): cute élise , you’re bummed out . … … HN(SD): all done .
• Rescore_train: initial weights [w1i, w2i, …, wLi]; K random wt. vectors W1 = [w11, w21, …, wL1], …, WK = [w1K, w2K, …, wLK]; each is optimized with Powell’s alg. using BLEU scoring (based on top hyp.), yielding Ŵ1, …, ŴK and from these Ŵ
• Result: final « large » wts [w1f, w2f, …, wLf], giving weighted « large » info w1*I1 … wL*IL

Tutorial Plan
B. Details & research topics
   Named entities
   Large-scale discriminative training (George Foster)
   Decoding for SMT (prepared by Nicola Ueffing)
   Hierarchical models (George Foster)
   System combination

Named entity recognition & transliteration
Chinese Example
« Secretary-General Wong appeared with Larry Ellison, Chief Executive Officer of Oracle Corporation, at a press conference to announce Oracle’s investment of $100 million dollars in a new research centre in Szechuan Province ».
Personal names: “Wong”, “Larry Ellison”.
Titles: “Secretary-General”, “Chief Executive Officer”.
Organization name: “Oracle Corporation”.
Place name: “Szechuan Province”.
Recognition problem: detect these entities in a continuous stream of ideograms.
Transliteration problem: when ideograms are used phonetically (esp. for non-Chinese names like “Larry Ellison”), become aware of that, & map them onto Latin characters.

Made-up Chinese Transliteration Example
How to translate “唐纳德·拉姆斯菲尔德”?
唐 [táng] (surname) - Tang Dynasty
纳 (F納) [nà] receive, accept, enjoy, pay, sew
德 [dé] virtue
拉 [lā] pull, drag, haul
姆 [mǔ] nurse
斯 [sī] thus (now used mostly for sound)
菲 [fěi] 菲薄 humble
尔 (F爾) [ěr] (archaic) you
德 [dé] virtue
“After receiving virtue from the Tang Dynasty, you thus pulled the humble nurse away from virtue” (????).
No: « tang na de la mu si fei er de » = DONALD RUMSFELD.
Actual Chinese→English example generated by PORTAGE:
“Outgoing president Iliescu has also congratulated Basescu.”
“Outgoing president of Iraq, has also been made to the road to the public.”

Named entity recognition & transliteration
Other Examples
Arabic→English: Muammar Ghadafy = Moammar Khaddafi = Muamar Qadafy = …; Azeddine = Elzedine = Alsuddin = Ahzudin = … (depending on region, pronounced differently & thus transliterated into Latin alphabet differently)
English→French (Google Translate, Jan. 24, 2007):
“The Englishman John Snow thought cholera was transmitted by small, living organisms.”
“Le choléra de pensée de neige de John d'Anglais a été transmis par la petite, organique matière.”

System Combination
Introduction
• Different systems make different errors – why not combine information?
• This worked well for ASR …
• But, because of reordering, synonyms, etc., system combination is not as easy for MT!
• RWTH (Aachen) is an SMT powerhouse and has recently been working on parallel system combination (Evgeny Matusov).
• NRC has been working on serial system combination.
Both teams are now getting good results.

System Combination
Parallel System Combination (RWTH Aachen)
• Hypotheses from different systems are aligned; some word reordering allowed; use of synonyms
• Generate confusion network; choices at each position are scored with system weights and word confidence scores
• N-best consensus translations are generated from the confusion network & rescored with various information sources
• A year ago, results unimpressive.
Since then, added new information sources (e.g., LMs trained on N-best lists from contributing systems) that encourage preservation of original phrases. Nice preliminary Arabic results: improvement of 2 to 3 BLEU points over the best individual system in the combination.

System Combination
Example of RWTH Parallel Combination
Ref: Chinese president directs unprecedented criticism at leaders of Hong Kong.
Best System: Chinese president slams unprecedented leaders to Hong Kong.
System Comb.: Chinese president sends unprecedented criticism of the leaders of Hong Kong.

System Combination
Serial System Combination (NRC)
• Use SMT to correct mistakes made by another method (e.g., a rule-based one)
Source text → MT1 → Initial target text → SMT → Final target text
Training Procedure
• Use MT1 to produce an initial target translation of the source half of a parallel human-translated corpus, giving a corpus of MT1 target output in parallel with good target versions of the same sentences; use this parallel corpus of (MT1 target || human target) sentences to train SMT.
• Even better: if humans can post-edit MT1 output, use MT1 target in parallel with the corrected target as the SMT training corpus.

System Combination
Discussion & Future Work
• Parallel combination is probably best for similar systems of good quality, serial combination for systems that are very different
• Future work for serial combination: allow SMT both direct & indirect (via MT1) access to source text. Could do this using, e.g.:
  Rescoring
  Parallel phrasetables
  Parallel LMs
  Parallel decoding
  … (etc.)
[Figure labels: Source text, MT1, Initial target text, SMT, Final target text]

References (1)
Best overall reference
Philipp Koehn, « Statistical Machine Translation », University of Edinburgh (textbook to appear 2007 or 2008, Cambridge University Press).
Papers (NOTE: short summary of key papers available from Kuhn/Foster)
Brown93 Peter F. Brown, Stephen A.
Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-312, June 1993.
Chomsky69 Noam Chomsky. Quine’s Empirical Assumptions. In Words and Objections – Essays on the Work of W.V. Quine (ed. D. Davidson and J. Hintikka). Dordrecht, Netherlands, 1969.
Foster06 George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006.
Germann01 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.

References (2)
Goodman01 Joshua Goodman. A Bit of Progress in Language Modeling (extended version). Microsoft Research Technical Report, Aug. 2001. Downloadable from research.microsoft.com/~joshuago/publications.htm
Jones05 Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine Generated Text: Studies in Speech Recognition and Machine Translation. In Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005 (Special Session on Human Language Technology: Applications and Challenge of Speech Processing).
Knight99 Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion, 25(4), 1999.
Koehn04 Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington D.C., October 2004. Springer-Verlag.
KoehnDec03 Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description). USC Information Sciences Institute, Dec. 2003.
References (3)
KoehnMay03 Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada, May 2003.
Marcu02 Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, 2002.
OchJHU04 Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf), Feb. 2004.
Och04 Franz Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004.
Och03 Franz Josef Och. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, July 2003.

References (4)
Och02 Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002.
Och01 Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search Algorithm for Statistical Machine Translation. In Proc. Data-Driven Machine Translation Workshop, July 2001.
Och00 Franz Josef Och and Hermann Ney. A Comparison of Alignment Models for Statistical Machine Translation. Int. Conf. on Computational Linguistics (COLING), Saarbrücken, Germany, August 2000.
Papineni01 Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of Machine Translation.
Technical Report RC22176, IBM, September 2001.
Ueffing02 Nicola Ueffing, Franz Josef Och, and Hermann Ney. Generation of Word Graphs in Statistical Machine Translation. Empirical Methods in Natural Language Processing, July 2002.
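
Supplementary sketch: weight optimization on fixed N-best lists
The weight search on the COW and rescoring slides (K random weight vectors, each refined and scored on the top hypothesis) can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration, not PORTAGE code: the function names are hypothetical, the two feature values per hypothesis are invented, an exact-match rate stands in for BLEU, a simple coordinate grid search stands in for Powell’s algorithm, and keeping the best of the K results is one plausible reading of « Ŵ1, …, ŴK → Ŵ ».

```python
import random

# Toy dev set: one N-best list per source sentence, as (hypothesis, features).
# The feature values are invented for illustration only.
NBEST = [
    [("what's up ?", [-1.0, -3.0]), ("what's this ?", [-2.0, -1.0])],
    [("cute élise , you're bummed out .", [-1.0, -4.0]),
     ("charming élise , you're becoming melancholy .", [-3.0, -1.0])],
]
REFS = ["what's this ?", "charming élise , you're becoming melancholy ."]


def rescore(nbest, weights):
    """Return the (text, features) pair with the highest weighted feature sum."""
    return max(nbest, key=lambda hyp: sum(w * f for w, f in zip(weights, hyp[1])))


def match_rate(nbest_lists, refs, weights):
    """Stand-in for BLEU: fraction of top hypotheses identical to the reference."""
    tops = [rescore(nbest, weights)[0] for nbest in nbest_lists]
    return sum(top == ref for top, ref in zip(tops, refs)) / len(refs)


def coordinate_search(start, objective, grid, sweeps=2):
    """Stand-in for Powell's algorithm: try grid values one weight at a time."""
    best, best_score = list(start), objective(start)
    for _ in range(sweeps):
        for i in range(len(best)):
            for value in grid:
                trial = best[:i] + [value] + best[i + 1:]
                score = objective(trial)
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score


def optimize_weights(nbest_lists, refs, num_features, restarts=3, seed=0):
    """Search from several random weight vectors and keep the best result."""
    rng = random.Random(seed)
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]

    def objective(weights):
        return match_rate(nbest_lists, refs, weights)

    best, best_score = None, -1.0
    for _ in range(restarts):
        start = [rng.random() for _ in range(num_features)]
        weights, score = coordinate_search(start, objective, grid)
        if score > best_score:
            best, best_score = weights, score
    return best, best_score


if __name__ == "__main__":
    weights, score = optimize_weights(NBEST, REFS, num_features=2)
    print("weights:", weights, "match rate:", score)
```

In COW proper, the decoder would be rerun with the new weights and its N-best output merged into the expanded list before the next search; the sketch covers only the search over a fixed list.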