Statistical Machine Translation (SMT) – Basic Ideas
Stephan Vogel
MT Class, Spring Semester 2011
Stephan Vogel - Machine Translation

Overview
- Deciphering foreign text – an example
- Principles of SMT
- Data processing

Deciphering Example: Apinaye – English
- Apinaye belongs to the Ge family of Brazil
- Spoken by 800 people (according to SIL, 1994)
  http://www.ethnologue.com/show_family.asp?subid=90784
  http://www.language-museum.com/a/apinaye.php
- Example from the Linguistic Olympics 2008, see http://www.naclo.cs.cmu.edu

Parallel Corpus (some characters adapted)
  1  Kukre kokoi            The monkey eats
  2  Ape kra                The child works
  3  Ape kokoi rats         The big monkey works
  4  Ape mi mets            The good man works
  5  Ape mets kra           The child works well
  6  Ape punui mi pinjets   The old man works badly
- Can we translate a new sentence?

Deciphering Example
- Can we build a lexicon from these sentence pairs?
- Observations:
  - Frequent words: Apinaye 'Ape' (5); English 'The' (6) and 'works' (5); 'Kukre' occurs only once
  - Aha! -> first guess: 'Ape' – 'works'
  - 'monkey' occurs in sentences 1 and 3; 'child' in 2 and 5; 'man' in 4 and 6
  - Words have different distributions over the corpus: do we find words with similar distributions on the Apinaye side?

… Vocabularies
- Apinaye (9 word types): kukre, kokoi, ape, kra, rats, mi, mets, punui, pinjets
- English (11 word types): The, monkey, eats, child, works, big, good, man, well, old, badly
- Observations: 9 Apinaye words, 11 English words
- Expectations:
  - English words without a translation?
  - Apinaye words corresponding to more than one English word?
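The frequency-based first guess above ('Ape' ~ 'works') can be reproduced with a few lines of Python. This is a minimal sketch of the counting step; the corpus literal and variable names are my own:

```python
from collections import Counter

# Toy parallel corpus from the slides: (Apinaye, English) sentence pairs.
corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# Count how often each word occurs on each side of the corpus.
ap_freq = Counter(w for ap, _ in corpus for w in ap.split())
en_freq = Counter(w for _, en in corpus for w in en.split())

print(ap_freq["Ape"])    # 5
print(en_freq["works"])  # 5
print(en_freq["The"])    # 6
```

With frequencies alone, 'Ape' (5) could pair with either 'The' (6) or 'works' (5); the content-word argument on the next slides is needed to break the tie.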
… Word Frequencies
- Vocabularies with frequencies:
    Apinaye            English
    kukre    1         The     6
    kokoi    2         monkey  2
    ape      5         eats    1
    kra      2         child   2
    rats     1         works   5
    mi       2         big     1
    mets     2         good    1
    punui    1         man     2
    pinjets  1         well    1
                       old     1
                       badly   1
- Suggestions:
  - 'ape' (5) could align to 'The' (6) or 'works' (5)
  - More likely that the content word 'works' has a match, i.e. 'ape' = 'works'
  - Other word pairs are difficult to predict – too many similar frequencies

… Location in Corpus
- Vocabularies with sentence occurrences:
    Apinaye                English
    kukre    1             The     1 2 3 4 5 6
    kokoi    1 3           monkey  1 3
    ape      2 3 4 5 6     eats    1
    kra      2 5           child   2 5
    rats     3             works   2 3 4 5 6
    mi       4 6           big     3
    mets     4 5           good    4
    punui    6             man     4 6
    pinjets  6             well    5
                           old     6
                           badly   6
- Observations:
  - Same sentences: 'kukre' – 'eats', 'kokoi' – 'monkey', 'ape' – 'works', 'kra' – 'child', 'rats' – 'big', 'mi' – 'man'
  - 'mets' (4 and 5) =? 'good' (4) and 'well' (5); makes sense
  - 'punui' and 'pinjets' match 'old' and 'badly' – but which is which?

… Location in Sentence
- Corpus with alignments (English position - Apinaye position, 0 = NULL):
    Kukre kokoi            The monkey eats           1-0 2-2 3-1
    Ape kra                The child works           1-0 2-2 3-1
    Ape kokoi rats         The big monkey works      1-0 2-3 3-2 4-1
    Ape mi mets            The good man works        1-0 2-3 3-2 4-1
    Ape mets kra           The child works well      1-0 2-3 3-1 4-2
    Ape punui mi pinjets   The old man works badly   1-0 2-??? 3-3 4-1 5-???
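The location-in-corpus heuristic — pair words whose sets of sentence numbers coincide, leaving ambiguous cases like 'punui'/'pinjets' unresolved — can be sketched as follows. The helper names and the explicit uniqueness check are my own framing:

```python
from collections import defaultdict

corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

def occurrences(sentences):
    # Map each (lowercased) word to the set of sentence numbers it occurs in.
    occ = defaultdict(set)
    for i, sent in enumerate(sentences, 1):
        for w in sent.split():
            occ[w.lower()].add(i)
    return occ

ap_occ = occurrences(ap for ap, _ in corpus)
en_occ = occurrences(en for _, en in corpus)

# First-guess lexicon: pair words whose occurrence sets match uniquely
# in both directions. 'mets' has no exact match, and 'punui'/'pinjets'
# both occur only in sentence 6, so those stay unresolved.
lexicon = {}
for a, sa in ap_occ.items():
    en_matches = [e for e, se in en_occ.items() if se == sa]
    ap_rivals = [x for x, sx in ap_occ.items() if sx == sa]
    if len(en_matches) == 1 and len(ap_rivals) == 1:
        lexicon[a] = en_matches[0]

print(lexicon)
# {'kukre': 'eats', 'kokoi': 'monkey', 'ape': 'works',
#  'kra': 'child', 'rats': 'big', 'mi': 'man'}
```

This recovers exactly the six unambiguous pairs from the slide; resolving 'punui'/'pinjets' needs the position and POS evidence that follows.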
- Observations:
  - The first English word ('The') does not align; we say it aligns to the NULL word
  - The Apinaye verb is in first position
  - The last English word aligns to the 1st or 2nd Apinaye position
  - English -> Apinaye: reverse word order (not strictly so in sentence pair 5)
- Hypothesis: the alignment for the last sentence pair is 1-0 2-4 3-3 4-1 5-2, i.e. 'pinjets' – 'old' and 'punui' – 'badly'

… POS Information
- Corpus with POS tags:
    Kukre kokoi            V N           The monkey eats           Det N V
    Ape kra                V N           The child works           Det N V
    Ape kokoi rats         V N Adj       The big monkey works      Det Adj N V
    Ape mi mets            V N Adj       The good man works        Det Adj N V
    Ape mets kra           V Adv N       The child works well      Det N V Adv
    Ape punui mi pinjets   V ??? N ???   The old man works badly   Det Adj N V Adv
- Observations:
  - The English determiner ('The') does not align; perhaps there are no determiners in Apinaye
  - English Verb Adverb -> Apinaye Verb Adverb: no reordering
  - English Adjective Noun -> Apinaye Noun Adjective: reordering
- Hypothesis: 'pinjets' is Adj, making it N Adj; 'punui' is Adv (consistent with the alignment hypothesis)

Translate New Sentences: Ap - En
  Source sentence:         Ape rats mi mets
  Lexical information:     works big man good/well
  Reordering information:  The good man works big
  Better lexical choice:   The good man works hard
  Compare: Ape mi mets -> The good man works

  Source sentence:         Kukre rats kokoi punui
  Lexical information:     eats big monkey badly
  Reordering information:  The bad monkey eats big
  Better lexical choice:   The bad monkey eats a lot

Translate New Sentences: En - Ap
  Source sentence:         The old monkey eats a lot
  Lexical information:     NULL pinjets kokoi kukre rats
  Reordering information:  kukre rats kokoi pinjets

  Or:
  Delete words:            old monkey eats a lot
  Rephrase:                old monkey eats big
  Reorder:                 eats big monkey old
  Lexical information:     kukre rats kokoi pinjets

  Source sentence:         The big child works a long time
  Delete plus rephrase:    big child works big
  Reorder:                 works big child big
  Lexical information:     Ape rats kra rats
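The Ap -> En direction can be mimicked with the hypothesized lexicon plus the reverse-order observation. This is a toy sketch under my own simplifying assumptions (plain reversal, 'The' always prepended for the NULL word); it only handles sentences where full reversal suffices:

```python
# Lexicon as hypothesized in the lecture, including the resolved
# 'punui'/'pinjets' pair.
lexicon = {
    "kukre": "eats", "kokoi": "monkey", "ape": "works", "kra": "child",
    "rats": "big", "mi": "man", "mets": "good", "punui": "badly",
    "pinjets": "old",
}

def translate_ap_en(sentence):
    # Gloss each Apinaye word, reverse the order (English -> Apinaye
    # roughly reverses word order), and prepend the NULL-aligned 'The'.
    glosses = [lexicon[w.lower()] for w in sentence.split()]
    return "The " + " ".join(reversed(glosses))

print(translate_ap_en("Kukre kokoi"))  # The monkey eats
print(translate_ap_en("Ape mi mets"))  # The good man works
```

Note that 'Ape rats mi mets' would come out as 'The good man big works' under plain reversal; the slide's 'The good man works big' already needs finer-grained reordering than this sketch encodes.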
Overview
- Deciphering foreign text – an example
- Principles of SMT
- Data processing

Principles of SMT
- We will use the same approach – learning from data
  - Build translation models using frequency, co-occurrence, word position, etc.
  - Use the models to translate new sentences
- Not manually, but fully automatically: the training will be automatic
- There is still lots of manual work left: designing models, preparing data, running experiments, etc.

Statistical versus Grammar-Based
- Often statistical and grammar-based MT are seen as alternatives, even opposing approaches – wrong!!!
- The dichotomies are:
  - Use probabilities || everything is equally likely, yes/no decisions
  - Rich (deep) structure || no or only flat structure
- Both dimensions are continuous
- Examples:
  - EBMT: no/little structure and heuristics
  - SMT: (initially only) flat structure and probabilities
  - XFER: deep(er) structure and heuristics

                     No Probs            Probs
    Flat structure   EBMT                SMT
    Deep structure   XFER, Interlingua   Holy Grail

- Goal: structurally rich probabilistic models
  - statXFER: deep structure and probabilities
  - Syntax-augmented SMT: deep structure and probabilities

Statistical Machine Translation
- A translator translates source text
- Use machine learning techniques to extract useful knowledge
  - Translation model: word and phrase translations
  - Language model: how likely words follow each other in a particular sequence
- The translation system (decoder) uses these models to translate new sentences
  [Diagram: source sentence -> decoder with translation model and target language model -> translation]
- Advantages:
  - Can quickly train for new languages
  - Can adapt to new domains
- Problems:
  - Need parallel data
  - All words, even punctuation, are treated equally
  - Difficult to pinpoint the causes of errors

Tasks in SMT
- Modelling: build statistical models which capture
  characteristic features of translation equivalences and of the target language
- Training: train the translation model on a bilingual corpus, the language model on a monolingual corpus
- Decoding: find the best translation for new sentences according to the models
- Evaluation:
  - Subjective evaluation: fluency, adequacy
  - Automatic evaluation: WER, Bleu, etc.
- And all the nitty-gritty stuff:
  - Text preprocessing, data cleaning
  - Parameter tuning (minimum error rate training)

Noisy Channel View
- "French is actually English, which has been garbled during transmission; recover the correct, original English"
- The speaker speaks English
- A noisy channel distorts it into French
- You hear French, but need to recover the English

Bayesian Approach
- Select the translation which has the highest probability:
    ê = argmax { p(e | f) } = argmax { p(e) p(f | e) }
- p(e) is the source model, p(f | e) the channel model; the argmax is the search process

SMT Architecture
- p(e) – language model
- p(f | e) – translation model

Log-Linear Model
- In practice:
    ê = argmax { log p(e) + log p(f | e) }
- Translation model (TM) and language model (LM) may be of different quality:
  - simplifying assumptions
  - trained on different amounts of data
- Give different weights to both models:
    ê = argmax { w1 * log p(e) + w2 * log p(f | e) }
- Why not add more features?
    ê = argmax { w1 * h1(e, f) + ... + wn * hn(e, f) }
- Note: we don't need the normalization constant for the argmax

Overview
- Deciphering foreign text – an example
- Principles of SMT
- Data processing

Corpus Statistics
- We want to know how much data we have
  - Corpus size: not file size, not documents, but words and sentences
  - Why is file size not important?
  - Vocabulary: number of word types
- We want to know some distributions
  - How many words are seen only once? Why is this interesting?
  - Does it help to increase the corpus?
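The corpus-size and singleton questions above reduce to a single pass of counting over the data. A minimal sketch (the function and field names are mine):

```python
from collections import Counter

def corpus_stats(sentences):
    # Token count, type count (vocabulary size), and number of
    # singletons (word types seen exactly once).
    freq = Counter(w for s in sentences for w in s.split())
    tokens = sum(freq.values())
    types = len(freq)
    singletons = sum(1 for c in freq.values() if c == 1)
    return {"sentences": len(sentences), "tokens": tokens,
            "types": types, "singletons": singletons}

stats = corpus_stats([
    "The monkey eats",
    "The child works",
    "The big monkey works",
])
print(stats)
# {'sentences': 3, 'tokens': 10, 'types': 6, 'singletons': 3}
```

Even on this tiny sample, half the vocabulary is singletons, which is why the singleton percentage is a standard number to report alongside corpus and vocabulary size.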
… Corpus Statistics
- How long are the sentences?
- Does it matter if we have many short or fewer, but longer sentences?

All Simple, Basic, Important
- Important: when you publish, these numbers are important
  - To be able to interpret the results, e.g. what works on small corpora may not work on large corpora
  - To make them comparable to other papers
- Basic: no deep thinking, nothing fancy
- Simple: a few unix commands, a few simple scripts
  - wc, grep, sed, sort, uniq
  - perl, awk (my favorite), perhaps python, …
- Let's look at some data!

BTEC Spa-Eng Corpus Statistics
- Corpus and vocabulary size
- Percentage of singletons
- Number of unknown words, out-of-vocabulary (OOV) rate
- Sentence length balance
- Text normalization
  - Spoken language forms: I'll, we're, but also I will, we are
- Note: this was shown online

Tokenization
- Punctuation is attached to words
  - Example: 'you' 'you,' 'you.' 'you?' are all different strings, i.e. all are different words
- Tokenization can be tricky
  - What about punctuation in numbers?
  - What about abbreviations?
- Numbers are not just numbers
  - Percentages: 1.2%
  - Ordinals: 1st, 2.
  - Ranges: 2000-2006, 3:1
  - And more: (A5-0104/1999)

GigaWord Corpus
- Distributed by LDC
- Collection of newspapers: NYT, Xinhua News, …
- More than 3 billion words
- How large is the vocabulary?
- Some observations in the vocabulary:
  - Number of entries with digits
  - Number of entries with special characters
  - Number of strange 'words'
- Some observations in the corpus:
  - Sentences with lots of numbers
  - Sentences with lots of punctuation
  - Sentences with very long words
- Note: this was shown online

And Then the More Interesting Stuff
- POS tagging
- Parsing
  - For syntax-based MT systems
  - How parallel are the parse trees?
- Word segmentation
- Morphological processing
- In all these tasks the central problem is: how to make the corpus more parallel?
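A minimal tokenizer along the lines discussed under 'Tokenization' — split punctuation off words, but keep common number patterns together — could look like this. The regex is my own sketch; codes like (A5-0104/1999) would still need extra rules:

```python
import re

# Match, in order of preference: numbers with internal . , : - / and an
# optional trailing % (1.2%, 2000-2006, 3:1), then plain words, then
# any single non-space punctuation character.
TOKEN = re.compile(r"\d+(?:[.,:\-/]\d+)*%?|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Do you see him, you?"))
# ['Do', 'you', 'see', 'him', ',', 'you', '?']
print(tokenize("Growth was 1.2% in 2000-2006."))
# ['Growth', 'was', '1.2%', 'in', '2000-2006', '.']
```

Splitting 'you,' into 'you' and ',' collapses the four surface strings 'you' 'you,' 'you.' 'you?' into a single vocabulary entry, which is exactly the point of tokenizing before counting or training.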