COMS 4705: Natural Language Processing, Fall 2010
Machine Translation
Dr. Nizar Habash
Center for Computational Learning Systems, Columbia University

Why (Machine) Translation?
Languages in the world
• 6,800 living languages
• 600 with a written tradition
• 95% of the world's population speaks 100 languages
Translation market
• $26 billion global market (2010)
• Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

Machine Translation Science Fiction
• The Star Trek Universal Translator: an "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable the "translation matrix"….

Machine Translation Reality
• http://www.medialocate.com/

Machine Translation Reality
• Google currently offers translation between the following languages, over 3,000 language pairs:
Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish
• "BBC found similar support"!!!

Why Machine Translation?
• Full translation
– Domain specific, e.g., weather reports
• Machine-aided translation
– Requires post-editing
• Cross-lingual NLP applications
– Cross-language IR
– Cross-language summarization
• Testing grounds
– Extrinsic evaluation of NLP tools, e.g., parsers, POS taggers, tokenizers, etc.

Road Map
• Multilingual Challenges for MT
• MT Approaches
• MT Evaluation

Multilingual Challenges
• Orthographic variations
– Ambiguous spelling: undiacritized Arabic كتب الاولاد اشعارا vs. diacritized كَتَبَ الأولادُ أشعاراً 'the boys wrote poems'
– Ambiguous word boundaries
• Lexical ambiguity
– Bank: بنك (financial) vs. ضفة (river)
– Eat: German essen (human) vs. fressen (animal)
Multilingual Challenges
Morphological Variations
• Affixational (prefix/suffix) vs. templatic (root + pattern) morphology
– English marks write / kill / do → written / killed / done with affixes
– Arabic uses a pattern around the root: كتب / قتل / فعل → مكتوب / مقتول / مفعول
• Tokenization (aka segmentation + normalization)
– English: And the cars → and + the + cars
– Arabic: والسيارات (one word) → w+ Al+ SyArAt (conjunction + article + plural noun)
– French: Et les voitures → et + le + voitures

Multilingual Challenges
Syntactic Variations
• Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف
  gloss: read the-student the-diligent a-book about China in the-classroom
• English: the diligent student is reading a book about China in the classroom
• Chinese: 这位勤奋的学生在教室读一本关于中国的书
  gloss: this quant diligent de student in classroom read one quant about China de book
• Word-order contrasts:
               Arabic           English                 Chinese
  Subj-Verb    V Subj, Subj V   Subj V                  Subj … V
  Verb-PP      V … PP           V … PP                  PP V
  Adjectives   N Adj            Adj N                   Adj de N
  Possessives  N Poss           Poss 's N, N of Poss    Poss de N
  Relatives    N Rel            N Rel                   Rel de N

Road Map
• Multilingual Challenges for MT
• MT Approaches
• MT Evaluation

MT Approaches
The MT Pyramid
[diagram: analysis climbs from source words to source syntax to source meaning; generation descends from target meaning to target syntax to target words; translating directly at the word level is gisting]

MT Approaches
Gisting Example
• Source: Sobre la base de dichas experiencias se estableció en 1988 una metodología.
• Word-for-word gist: Envelope her basis out speak experiences them settle at 1988 one methodology.
• Human translation: On the basis of these experiences, a methodology was arrived at in 1988.

MT Approaches
The MT Pyramid: Transfer
[diagram: the same pyramid, adding transfer between source syntax and target syntax]

MT Approaches
Transfer Example
• A transfer lexicon maps source-language structure to target-language structure:
– Spanish: poner [:subj X] [:obj mantequilla] [:mod en [:obj Y]]
– English: butter [:subj X] [:obj Y]
– e.g., "X puso mantequilla en Y" ↔ "X buttered Y"

MT Approaches
The MT Pyramid: Interlingua
[diagram: the pyramid with interlingua at the apex, linking source meaning and target meaning]

MT Approaches
Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

MT Approaches
The MT Pyramid: Resources
[diagram: interlingual lexicons at the meaning level, transfer lexicons at the syntax level, dictionaries/parallel corpora at the word level]

MT Approaches
Statistical vs. Rule-based
[diagram: statistical approaches operate near the bottom of the pyramid (words), rule-based approaches higher up (syntax, meaning)]

Statistical MT
Noisy Channel Model
(Portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf; slide based on Kevin Knight's http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)

Statistical MT
Automatic Word Alignment
• GIZA++
– A statistical machine translation toolkit used to train word alignments
– Uses Expectation-Maximization with various constraints to bootstrap alignments
• Example: Maria no dio una bofetada a la bruja verde ↔ Mary did not slap the green witch

Statistical MT
IBM Model (Word-based Model)
(http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf; slide courtesy of Kevin Knight, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)

Phrase-Based Statistical MT
• Example: Morgen fliege ich nach Kanada zur Konferenz → Tomorrow I will fly to the conference in Canada
  (Morgen → Tomorrow, fliege ich → I will fly, nach Kanada → in Canada, zur Konferenz → to the conference)
• Foreign input is segmented into phrases
– A "phrase" is any sequence of words
• Each phrase is probabilistically translated into English
– P(to the conference | zur Konferenz)
– P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
• See [Koehn et al., 2003] for an intro. This is state-of-the-art!
(Slide courtesy of Kevin Knight, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)
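To make the phrase-based picture above concrete, here is a minimal sketch of scoring candidate translations against a toy phrase table. The German–English phrases, their probabilities, the segmentation, and the helper name `score` are all invented for illustration; real systems (e.g., the setup in Koehn et al., 2003) also combine language model, reordering, and length features rather than phrase probabilities alone.

```python
# Minimal sketch of phrase-based scoring with a toy phrase table.
# The entries and probabilities below are illustrative assumptions;
# real tables are estimated from word-aligned parallel text.

import math

# P(english_phrase | foreign_phrase), e.g. P(to the conference | zur Konferenz)
phrase_table = {
    ("Morgen",         "Tomorrow"):          0.7,
    ("fliege ich",     "I will fly"):        0.4,
    ("nach Kanada",    "in Canada"):         0.5,
    ("zur Konferenz",  "to the conference"): 0.6,
    ("zur Konferenz",  "into the meeting"):  0.1,
}

def score(phrase_pairs):
    """Log-probability of a candidate, taken as the product of its phrase
    translation probabilities (LM and reordering features omitted)."""
    return sum(math.log(phrase_table[(f, e)]) for f, e in phrase_pairs)

# Two candidate phrase-by-phrase translations of
# "Morgen fliege ich nach Kanada zur Konferenz":
cand1 = [("Morgen", "Tomorrow"), ("fliege ich", "I will fly"),
         ("zur Konferenz", "to the conference"), ("nach Kanada", "in Canada")]
cand2 = [("Morgen", "Tomorrow"), ("fliege ich", "I will fly"),
         ("zur Konferenz", "into the meeting"), ("nach Kanada", "in Canada")]

print(score(cand1) > score(cand2))  # True: "to the conference" is preferred
```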
Word Alignment Induced Phrases
• Sentence pair: Maria no dio una bofetada a la bruja verde ↔ Mary did not slap the green witch
• Phrase pairs consistent with the word alignment, growing from single words to the whole sentence:
– (Maria, Mary), (no, did not), (dio una bofetada, slap), (la, the), (bruja, witch), (verde, green)
– (a la, the), (dio una bofetada a, slap the)
– (Maria no, Mary did not), (no dio una bofetada, did not slap), (dio una bofetada a la, slap the), (bruja verde, green witch)
– (Maria no dio una bofetada, Mary did not slap), (a la bruja verde, the green witch), …
– (Maria no dio una bofetada a la bruja verde, Mary did not slap the green witch)
(Slides courtesy of Kevin Knight, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt)

Advantages of Phrase-Based SMT
• Many-to-many mappings can handle non-compositional phrases
• Local context is very useful for disambiguating
– "interest rate" …
– "interest in" …
• The more data, the longer the learned phrases
– Sometimes whole sentences

MT Approaches
Statistical vs. Rule-based vs. Hybrid
[diagram: the MT pyramid again, with hybrid systems combining statistical (word-level) and rule-based (syntax- and meaning-level) components]

MT Approaches
Practical Considerations
• Resource availability
– Parsers and generators: input/output compatibility
– Translation lexicons: word-based vs. transfer/interlingua
– Parallel corpora: domain of interest; bigger is better
• Time availability
– Statistical training, resource building

Road Map
• Multilingual Challenges for MT
• MT Approaches
• MT Evaluation

MT Evaluation
• More art than science
• Wide range of metrics/techniques
– interface, …, scalability, …, faithfulness, …, space/time complexity, … etc.
• Automatic vs. human-based
– Dumb machines vs. slow humans
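Before turning to the evaluation examples, here is a minimal sketch of how phrase pairs like those in the word-alignment example above can be enumerated: a brute-force version of the consistency criterion used in phrase-based SMT (cf. Koehn et al., 2003). The alignment points and the helper name `consistent` are assumptions chosen to mirror the Maria / bruja verde example; production extraction code is considerably more efficient.

```python
# Minimal sketch: extract phrase pairs consistent with a word alignment
# (brute force over all span pairs; real implementations are smarter).

src = "Maria no dio una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()

# Assumed alignment points (src index, tgt index), for illustration only:
# Maria-Mary, no-did/not, dio/una/bofetada-slap, la-the, bruja-witch, verde-green
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)}

def consistent(f_lo, f_hi, e_lo, e_hi):
    """A span pair is consistent if no alignment point crosses its boundary
    and it contains at least one alignment point."""
    inside = any(f_lo <= i <= f_hi and e_lo <= j <= e_hi for (i, j) in alignment)
    if not inside:
        return False
    for (i, j) in alignment:
        if (f_lo <= i <= f_hi) != (e_lo <= j <= e_hi):  # crosses the boundary
            return False
    return True

phrase_pairs = set()
for f_lo in range(len(src)):
    for f_hi in range(f_lo, len(src)):
        for e_lo in range(len(tgt)):
            for e_hi in range(e_lo, len(tgt)):
                if consistent(f_lo, f_hi, e_lo, e_hi):
                    phrase_pairs.add((" ".join(src[f_lo:f_hi + 1]),
                                      " ".join(tgt[e_lo:e_hi + 1])))

print(("dio una bofetada", "slap") in phrase_pairs)    # True
print(("bruja verde", "green witch") in phrase_pairs)  # True
```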
Human-based Evaluation Example
Accuracy Criteria
• 5: contents of the original sentence conveyed (might need minor corrections)
• 4: contents of the original sentence conveyed, BUT errors in word order
• 3: contents of the original sentence generally conveyed, BUT errors in relationships between phrases, tense, singular/plural, etc.
• 2: contents of the original sentence not adequately conveyed; portions of the original sentence incorrectly translated; missing modifiers
• 1: contents of the original sentence not conveyed; missing verbs, subjects, objects, phrases or clauses

Human-based Evaluation Example
Fluency Criteria
• 5: clear meaning; good grammar, terminology and sentence structure
• 4: clear meaning, BUT bad grammar, bad terminology or bad sentence structure
• 3: meaning graspable, BUT ambiguities due to bad grammar, bad terminology or bad sentence structure
• 2: meaning unclear, BUT inferable
• 1: meaning absolutely unclear

Fluency vs. Accuracy
[chart plotting MT use cases along fluency and accuracy axes; labels include FAHQ MT, Prof. MT, Info. MT, conMT]

Automatic Evaluation Example
Bleu Metric (Papineni et al., 2001)
• Bleu = BiLingual Evaluation Understudy
• Modified n-gram precision with length penalty
• Quick, inexpensive and language independent
• Correlates highly with human evaluation
• Bias against synonyms and inflectional variations

Automatic Evaluation Example
Bleu Metric
• Test sentence: colorless green ideas sleep furiously
• Gold-standard references:
– all dull jade ideas sleep irately
– drab emerald concepts sleep furiously
– colorless immature thoughts nap angrily
• Unigram precision = 4 / 5 = 0.8 (only "green" is unmatched)
• Bigram precision = 2 / 4 = 0.5 ("ideas sleep" and "sleep furiously" match)
• Bleu score = (a1 × a2 × … × an)^(1/n) = (0.8 × 0.5)^(1/2) = 0.6325, i.e., 63.25 on a 0–100 scale

Metrics MATR Workshop
• Workshop at the AMTA conference, 2008
– Association for Machine Translation in the Americas
• Evaluating evaluation metrics
• Compared 39 metrics
– 7 baselines and 32 new metrics
– Various measures of correlation with human judgment
– Different conditions: text genre, source language, number of references, etc.

Interested in MT?
• Contact me (habash@cs.columbia.edu)
• Research courses, projects
• Languages of interest:
– English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, …
• Topics
– Statistical, hybrid MT: phrase-based MT with linguistic extensions; component improvements or full-system improvements
– MT evaluation
– Multilingual computing

Thank You
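As a supplement to the Bleu example above, here is a minimal sketch of modified (clipped) n-gram precision combined by geometric mean, reproducing the numbers on that slide. The function names are hypothetical; full Bleu also multiplies in a brevity penalty (equal to 1 here, since candidate and closest reference have the same length) and typically uses n-grams up to length 4, not just 2.

```python
# Minimal sketch of the Bleu computation from the example slide:
# modified (clipped) n-gram precision, combined by geometric mean.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand_counts = ngrams(candidate, n)
    # Clip each candidate n-gram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "colorless green ideas sleep furiously".split()
references = ["all dull jade ideas sleep irately".split(),
              "drab emerald concepts sleep furiously".split(),
              "colorless immature thoughts nap angrily".split()]

p1 = modified_precision(candidate, references, 1)  # 4/5 = 0.8
p2 = modified_precision(candidate, references, 2)  # 2/4 = 0.5
bleu = (p1 * p2) ** 0.5                            # geometric mean of p1 and p2
print(p1, p2, round(bleu, 4))                      # 0.8 0.5 0.6325
```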