Lecture 40 of 42
NLP and Philosophical Issues
Discussion: Machine Translation (MT)

Friday, 01 December 2006
William H. Hsu
Department of Computing and Information Sciences, KSU

KSOL course page: http://snipurl.com/v9v3
Course web site: http://www.kddresearch.org/Courses/Fall-2006/CIS730
Instructor home page: http://www.cis.ksu.edu/~bhsu

Reading for Next Class: Sections 22.1, 22.6-7, Russell & Norvig 2nd edition

CIS 490 / 730: Artificial Intelligence Friday, 01 Dec 2006 Computing & Information Sciences Kansas State University

(Hidden) Markov Models: Review

• Definition of Hidden Markov Models (HMMs)
  – Stochastic state transition diagram (HMMs: states, aka nodes, are hidden)
  – Compare: probabilistic finite state automaton (Mealy/Moore model)
  – Annotated transitions (aka arcs, edges, links)
    • Output alphabet (the observable part)
    • Probability distribution over outputs
• Forward Problem: One Step in ML Estimation
  – Given: model h, observations (data) D
  – Estimate: P(D | h)
• Backward Problem: Prediction Step
  – Given: model h, observations D
  – Maximize: P(h(X) = x | h, D) for a new X
• Forward-Backward (Learning) Problem
  – Given: model space H, data D
  – Find: h ∈ H such that P(h | D) is maximized (i.e., MAP hypothesis)
• HMMs Also A Case of LSQ (f Values in [Roth, 1999])

[Figure: three-state HMM diagram (states 1, 2, 3); arcs annotated with transition probabilities and output distributions over the symbols A-H]

NLP Hierarchy: Review

[Figure: NLP hierarchy with levels Speech Acts, Discourse Labeling, Parsing / POS Tagging, Lexical Analysis, Natural Language]

• Problem Definition
  – Given: m sentences containing untagged words
  – Example: “The can will rust.”
  – Label (one per word, out of ~30-150): vj ∈ s = (art, n, aux, vi)
  – Representation: labeled examples <(w1, w2, …, wn), s>
  – Return: classifier f: X → V that tags x = (w1, w2, …, wn)
  – Applications: WSD, dialogue acts (e.g., “That sounds OK to me.” → ACCEPT)
• Solution Approaches: Use Transformation-Based Learning (TBL) [Brill, 1995]
  – TBL: mistake-driven algorithm that produces sequences of rules
    • Each rule of the form (ti, v): a test condition (constructed attribute) and a tag
    • ti: “w occurs within k words of wi” (context words); collocations (windows)
  – For more info: see [Roth, 1998], [Samuel, Carberry, Vijay-Shanker, 1998]
• Recent Research
  – E. Brill’s page: http://www.cs.jhu.edu/~brill/
  – K. Samuel’s page: http://www.eecis.udel.edu/~samuel/work/research.html

Statistical Machine Translation

Kevin Knight
USC/Information Sciences Institute
USC/Computer Science Department

Spanish/English Parallel Corpora: Review

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .

Data for Statistical MT and data preparation

Ready-to-Use Online Bilingual Data

[Figure: chart of millions of words (English side) of available parallel text by year, 1994-2004, for Chinese/English, Arabic/English, and French/English; the vertical axis runs from 0 to 180, and the final version of the chart asks "One Billion?"]

• + 1m-20m words for many language pairs
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)

From No Data to Sentence Pairs

• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
  – Suppose one billion words of parallel data were sufficient
  – At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
  – De-formatting
  – Remove strange characters
  – Character code conversion
  – Document alignment
  – Sentence alignment
  – Tokenization (also called Segmentation)

Sentence Alignment

English:
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.

Spanish:
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.

After alignment:
1. The old man is happy. He has fished many times. <-> El viejo está feliz porque ha pescado muchas veces.
2. His wife talks to him. <-> Su mujer habla con él.
3. The sharks await. <-> Los tiburones esperan.

Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).

Tokenization (or Segmentation)

English
• Input (some byte stream): "There," said Bob.
• Output (7 “tokens” or “words”): " There , " said Bob .
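
As a rough illustration of the English case above, a minimal Python sketch follows. The regular expression and the function name `tokenize` are assumptions made for this example and are not part of the lecture; production tokenizers treat abbreviations, numbers, and hyphens with more care.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one
    # punctuation character, so each mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize('"There," said Bob.')
print(tokens)  # ['"', 'There', ',', '"', 'said', 'Bob', '.']
```

This kind of rule relies on whitespace and punctuation, so it does not carry over to languages written without spaces between words, where segmentation needs a lexicon or a statistical model.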
Chinese
• Input (byte stream): 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。
• Output: 美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯 富 商拉登 等发 出 的 电子邮件。

MT Evaluation

• Manual:
  – SSER (subjective sentence error rate)
  – Correct/Incorrect
  – Error categorization
• Testing in an application that uses MT as one sub-component
  – Question answering from foreign language documents
• Automatic:
  – WER (word error rate)
  – BLEU (Bilingual Evaluation Understudy)

BLEU Evaluation Metric (Papineni et al, ACL-2002)

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

• N-gram precision (score is between 0 & 1)
  – What percentage of machine n-grams can be found in the reference translation?
  – An n-gram is a sequence of n words
  – Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
• Brevity penalty
  – Can’t just type out single word “the” (precision 1.0!)
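
The two "cheats" named in the bullets above can be checked with a small sketch of clipped n-gram precision. The function names and the toy sentences are invented for this illustration; only the clipping idea (a reference n-gram may be matched no more often than it occurs in the reference) comes from the slide. The sketch handles a single reference and leaves out the brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(machine, reference, n):
    """Fraction of machine n-grams found in the reference, where each
    reference n-gram can be used at most as often as it occurs there."""
    machine_counts = Counter(ngrams(machine, n))
    reference_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, reference_counts[gram])
                  for gram, count in machine_counts.items())
    total = sum(machine_counts.values())
    return matched / total if total else 0.0

reference = "the cat is on the mat".split()  # toy reference, not from the slides
print(clipped_precision("the the the the the".split(), reference, 1))     # 0.4
print(clipped_precision("the".split(), reference, 1))                     # 1.0
print(clipped_precision("the cat sat on the mat".split(), reference, 2))  # 0.6
```

Repeating "the" five times scores only 0.4 because the reference contains "the" twice, while the one-word output still reaches 1.0, which is the case the brevity penalty exists to punish.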
*** Amazingly hard to “game” the system (i.e., find a way to change machine output so that BLEU goes up, but quality doesn’t)

BLEU Evaluation Metric (Papineni et al, ACL-2002)

• BLEU4 formula (counts n-grams up to length 4)

  exp(1.0 * log p1 + 0.5 * log p2 + 0.25 * log p3 + 0.125 * log p4 - max(words-in-reference / words-in-machine - 1, 0))

  p1 = 1-gram precision
  p2 = 2-gram precision
  p3 = 3-gram precision
  p4 = 4-gram precision

Multiple Reference Translations

Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .

Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

BLEU Tends to Predict Human Judgments

[Figure: scatter plot of NIST score (variant of BLEU) against human judgments, both axes from -2.5 to 2.5, with linear fits: Adequacy R2 = 88.0%, Fluency R2 = 90.2%]

Slide from G.
Doddington (NIST)

Word-Based Statistical MT

Statistical MT Systems

• Spanish/English bilingual text -> statistical analysis -> Translation Model P(s|e) (Spanish to broken English)
• English text -> statistical analysis -> Language Model P(e) (broken English to English)
• Example: Spanish “Que hambre tengo yo” -> broken English candidates “What hunger have I”, “Hungry I am so”, “I am so hungry”, “Have I that hunger”, … -> English “I am so hungry”
• Decoding algorithm: argmax over e of P(e) * P(s|e)

Three Problems for Statistical MT

• Language model
  – Given an English string e, assigns P(e) by formula
  – good English string -> high P(e)
  – random word sequence -> low P(e)
• Translation model
  – Given a pair of strings <f,e>, assigns P(f | e) by formula
  – <f,e> look like translations -> high P(f | e)
  – <f,e> don’t look like translations -> low P(f | e)
• Decoding algorithm
  – Given a language model, a translation model, and a new sentence f, find translation e maximizing P(e) * P(f | e)

The Classic Language Model: Word N-Grams

Goal of the language model: choose among
• He is on the soccer field / He is in the soccer field
• Is table the on cup the / The cup is on the table
• Rice shrine / American shrine / Rice company / American company

Generative approach:
  w1 = START
  repeat until END is generated:
    produce word w2 according to a big table P(w2 | w1)
    w1 := w2

P(I saw water on the table) = P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table)

Probabilities can be learned from online English text.

Translation Model?

Generative story:
  Mary did not slap the green witch
  -> Source-language morphological analysis
  -> Source parse tree
  -> Semantic representation
  -> Generate target structure
  Maria no dió una bofetada a la bruja verde

What are all the possible moves and their associated probability tables?

The Classic Translation Model: Word Substitution/Permutation [IBM Model 3, Brown et al., 1993]

Generative approach:
  Mary did not slap the green witch
    n(3|slap) (fertility)
  Mary not slap slap slap the green witch
    P-Null (NULL insertion)
  Mary not slap slap slap NULL the green witch
    t(la|the) (word translation)
  Maria no dió una bofetada a la verde bruja
    d(j|i) (distortion)
  Maria no dió una bofetada a la bruja verde

Probabilities can be learned from raw bilingual text.
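
One way such probabilities can be learned from raw bilingual text is expectation-maximization (EM). The sketch below is a simplification and an assumption of this note: it trains IBM Model 1 (word translation probabilities only, with no NULL word, fertility, or distortion), not the Model 3 shown on the slide, and the function name `train_model1` and the three-sentence toy corpus layout are illustrative choices.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM for IBM Model 1 without a NULL word.
    pairs: list of (french_tokens, english_tokens).
    Returns t where t[(f, e)] estimates P(f | e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform start: every P(french-word | english-word) equally likely.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links from e
        for fs, es in pairs:
            for f in fs:
                # E-step: share one unit of count for f among the
                # English words in the same sentence pair.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float,
                        {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

pairs = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = train_model1(pairs)
french = sorted({f for fs, _ in pairs for f in fs})
for e in ["the", "house", "blue", "flower"]:
    print(e, "->", max(french, key=lambda f: t[(f, e)]))
```

Each iteration redistributes the expected link counts, so words that keep co-occurring reinforce each other while competing explanations lose probability mass.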

Statistical Machine Translation

Toy corpus:
  … la maison … la maison bleue … la fleur …
  … the house … the blue house … the flower …

EM training, step by step:
1. All word alignments equally likely. All P(french-word | english-word) equally likely.
2. “la” and “the” observed to co-occur frequently, so P(la | the) is increased.
3. “house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle).
4. Settling down after another iteration.
5. Inherent hidden structure revealed by EM training!

For details, see:
• “A Statistical MT Tutorial Workbook” (Knight, 1999).
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++

Statistical Machine Translation

  … la maison … la maison bleue … la fleur …
  … the house … the blue house … the flower …

• Learned translation table, e.g.: P(juste | fair) = 0.411, P(juste | correct) = 0.027, P(juste | right) = 0.020, …
• New French sentence -> possible English translations, to be rescored by language model

Decoding for “Classic” Models

• Of all conceivable English word strings, find the one maximizing P(e) x P(f | e)
• Decoding is an NP-complete challenge (Knight, 1999)
• Several search strategies are available
• Each potential English output is called a hypothesis.

The Classic Results

(Foreign Original) la politique de la haine .
(Reference Translation) politics of hate .
(IBM4+N-grams+Stack) the policy of the hatred .

(Foreign Original) nous avons signé le protocole .
(Reference Translation) we did sign the memorandum of agreement .
(IBM4+N-grams+Stack) we have signed the protocol .

(Foreign Original) où était le plan solide ?
(Reference Translation) but where was the solid plan ?
(IBM4+N-grams+Stack) where was the economic base ?

the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and

Flaws of Word-Based MT

• Multiple English words for one French word
  – IBM models can do one-to-many (fertility) but not many-to-one
• Phrasal Translation
  – “real estate”, “note that”, “interest in”
• Syntactic Transformations
  – Verb at the beginning in Arabic
  – Translation model penalizes any proposed re-ordering
  – Language model not strong enough to force the verb to move to the right place

Phrase-Based Statistical MT

  Morgen fliege ich nach Kanada zur Konferenz
  Tomorrow I will fly to the conference in Canada

• Foreign input segmented into phrases
  – “phrase” is any sequence of words
• Each phrase is probabilistically translated into English
  – P(to the conference | zur Konferenz)
  – P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered

See [Koehn et al, 2003] for an intro.
This is state-of-the-art!

Advantages of Phrase-Based

• Many-to-many mappings can handle non-compositional phrases
• Local context is very useful for disambiguating
  – “Interest rate” …
  – “Interest in” …
• The more data, the longer the learned phrases
  – Sometimes whole sentences

How to Learn the Phrase Translation Table?

• One method: “alignment templates” (Och et al, 1999)
• Start with word alignment, build phrases from that.

  Maria no dió una bofetada a la bruja verde
  Mary did not slap the green witch

• This word-to-word alignment is a by-product of training a translation model like IBM-Model-3.
• This is the best (or “Viterbi”) alignment.

IBM Models are 1-to-Many

• Run IBM-style aligner both directions, then merge:
  – E->F best alignment
  – F->E best alignment
  – MERGE: Union or Intersection

How to Learn the Phrase Translation Table?
• Collect all phrase pairs that are consistent with the word alignment

  Maria no dió una bofetada a la bruja verde
  Mary did not slap the green witch

[Figure: word alignment grid with one example phrase pair highlighted]

Consistent with Word Alignment

[Figure: three candidate phrase pairs over “Maria no dió” and “Mary did not slap”; one is marked consistent and two are marked inconsistent (x)]

Phrase alignment must contain all alignment points for all the words in both phrases!

Word Alignment Induced Phrases

  Maria no dió una bofetada a la bruja verde
  Mary did not slap the green witch

(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap the)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the) (bruja verde, green witch)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)

Phrase Pair Probabilities

• A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
  – We hope so!
• So, now we have a vast list of phrase pairs and their frequencies – how to assign probabilities?
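
The consistency rule behind the phrase pairs listed above can be sketched in a few lines. Everything named here is an assumption of this note: the function `extract_phrase_pairs`, the index-based alignment, and in particular the choice to leave “a” unaligned, since the alignment figure did not survive extraction. The sketch also keeps only the tightest English span for each Spanish span, so it is a simplified version of the usual extraction procedure.

```python
def extract_phrase_pairs(f_words, e_words, links, max_len=None):
    """Collect (foreign phrase, English phrase) pairs consistent with a
    word alignment. links is a set of (f_index, e_index) pairs."""
    max_len = max_len or len(f_words)
    pairs = set()
    for f_start in range(len(f_words)):
        for f_end in range(f_start, min(f_start + max_len, len(f_words))):
            # English positions linked to any word in the foreign span.
            e_linked = [e for f, e in links if f_start <= f <= f_end]
            if not e_linked:
                continue  # a span with no links yields no phrase pair
            e_start, e_end = min(e_linked), max(e_linked)
            # Consistency: no English word inside the span may be
            # linked to a foreign word outside the span.
            if any(e_start <= e <= e_end and not f_start <= f <= f_end
                   for f, e in links):
                continue
            pairs.add((" ".join(f_words[f_start:f_end + 1]),
                       " ".join(e_words[e_start:e_end + 1])))
    return pairs

f_words = "Maria no dió una bofetada a la bruja verde".split()
e_words = "Mary did not slap the green witch".split()
# Assumed alignment as (f_index, e_index); "a" (index 5) is unaligned.
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
         (6, 4), (7, 6), (8, 5)}
pairs = extract_phrase_pairs(f_words, e_words, links)
print(("bruja verde", "green witch") in pairs)  # True
print(("Maria no", "Mary did") in pairs)        # False
```

The second check fails for the reason the slide gives: “no” is linked to both “did” and “not”, so a phrase pair that includes “no” must include both English words.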

Phrase Pair Probabilities

• Basic idea:
  – No EM training
  – Just relative frequency: P(f-f-f | e-e-e) = count(f-f-f, e-e-e) / count(e-e-e)
• Important refinements:
  – Smooth using word probs P(f | e) for individual words connected in the word alignment
    • Some low count phrase pairs now have high probability, others have low probability
  – Discount for ambiguity
    • If phrase e-e-e can map to 5 different French phrases, due to the ambiguity of unaligned words, each pair gets a 1/5 count
  – Count BAD events too
    • If phrase e-e-e doesn’t map onto any contiguous French phrase, increment event count(BAD, e-e-e)

Advanced Training Methods

Basic Model, Revisited

  argmax_e P(e | f)
    = argmax_e P(e) x P(f | e) / P(f)
    = argmax_e P(e) x P(f | e)

  argmax_e P(e)^2.4 x P(f | e)
    … works better!

  argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1
    Rewards longer hypotheses, since these are unfairly punished by P(e)

  argmax_e P(e)^2.4 x P(f | e) x length(e)^1.1 x KS^3.7 …

• Lots of knowledge sources vote on any given hypothesis.
• “Knowledge source” = “feature function” = “score component”.
• Feature function simply scores a hypothesis with a real value.
(May be binary, as in “e has a verb”).
• Problem: How to set the exponent weights?

MT Pyramid

[Figure: MT pyramid. The SOURCE side and the TARGET side each rise through the levels words, phrases, syntax, semantics, meeting at interlingua at the top]

Why Syntax?

• Need much more grammatical output
• Need accurate control over re-ordering
• Need accurate insertion of function words
• Word translations need to depend on grammatically-related words

Yamada/Knight 01: Modeling and Training

[Figure: the English parse tree, Parse Tree(E), for “he adores listening to music” is transformed by Reorder, Insert, and Translate steps; Take Leaves then yields Sentence(J)]

Sentence(J): Kare ha ongaku wo kiku no ga daisuki desu

Japanese/English Reorder Table

  Original Order    Reordering       P(reorder|original)
  PRP VB1 VB2       PRP VB1 VB2      0.074
                    PRP VB2 VB1      0.723
                    VB1 PRP VB2      0.061
                    VB1 VB2 PRP      0.037
                    VB2 PRP VB1      0.083
                    VB2 VB1 PRP      0.021
  VB TO             VB TO            0.107
                    TO VB            0.893
  TO NN             TO NN            0.251
                    NN TO            0.749

For French/English, useful parameters like P(N ADJ | ADJ N).
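
As a worked example of how the reorder table is used, the sketch below multiplies the probabilities of the three reorderings that the example tree for “he adores listening to music” appears to undergo (one per internal node). Two things are assumptions of this note: the pairing of the last four probabilities with the VB TO and TO NN rows, which is read off the flattened table in row order and should be checked against Yamada & Knight (2001), and the dictionary layout itself. The product does not depend on which of the two rows gets which pair of values.

```python
import math

# P(reorder | original), read from the flattened table in row order
# (an assumption; verify the VB TO / TO NN rows against the paper).
reorder_table = {
    ("PRP", "VB1", "VB2"): {
        ("PRP", "VB1", "VB2"): 0.074, ("PRP", "VB2", "VB1"): 0.723,
        ("VB1", "PRP", "VB2"): 0.061, ("VB1", "VB2", "PRP"): 0.037,
        ("VB2", "PRP", "VB1"): 0.083, ("VB2", "VB1", "PRP"): 0.021,
    },
    ("VB", "TO"): {("VB", "TO"): 0.107, ("TO", "VB"): 0.893},
    ("TO", "NN"): {("TO", "NN"): 0.251, ("NN", "TO"): 0.749},
}

# Reorderings applied at the three internal nodes of the example tree.
applied = [
    (("PRP", "VB1", "VB2"), ("PRP", "VB2", "VB1")),
    (("VB", "TO"), ("TO", "VB")),
    (("TO", "NN"), ("NN", "TO")),
]

# The reordering of the whole tree is scored as a product over nodes.
p = math.prod(reorder_table[original][new] for original, new in applied)
print(round(p, 3))  # 0.484
```

Scoring each node independently is what keeps the table small: it needs one row per child sequence rather than one entry per whole tree.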

Casting Syntax MT Models As Tree Transducer Automata [Graehl & Knight 04]

[Figure: four tree-transducer rule examples: Non-local Re-Ordering (English/Arabic); Non-constituent Phrasal Translation (English/Spanish), “there are two men” <-> “hay dos hombres”; Lexicalized Re-Ordering (English/Chinese); Long-distance Re-Ordering (English/Japanese)]

Summary

• Phrase-based models are state-of-the-art
  – Word alignments
  – Phrase pair extraction & probabilities
  – N-gram language models
  – Beam search decoding
  – Feature functions & learning weights
• But the output is not English
  – Fluency must be improved
  – Better translation of person names, organizations, locations
  – More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
  – Need good accuracy across a variety of domains/languages

Available Resources

• Bilingual corpora
  – 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
  – Lots of French/English, Spanish/French/English, LDC
  – European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
  – 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
• Sentence alignment
  – Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
  – Xiaoyi Ma, LDC (Champollion)
• Word alignment
  – GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
  – GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
  – Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
  – Shared task, NAACL-HLT’03 workshop
• Decoding
  – ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
  – ISI Pharaoh phrase-based decoder
• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)

Some Papers Referenced on Slides

• ACL
  – [Och, Tillmann, & Ney, 1999]
  – [Och & Ney, 2000]
  – [Germann et al, 2001]
  – [Yamada & Knight, 2001, 2002]
  – [Papineni et al, 2002]
  – [Alshawi et al, 1998]
  – [Collins, 1997]
  – [Koehn & Knight, 2003]
  – [Al-Onaizan & Knight, 2002]
  – [Och & Ney, 2002]
  – [Och, 2003]
  – [Koehn et al, 2003]
• AMTA
  – [Soricut et al, 2002]
  – [Al-Onaizan & Knight, 1998]
• EMNLP
  – [Marcu & Wong, 2002]
  – [Fox, 2002]
  – [Munteanu & Marcu, 2002]
• AI Magazine
  – [Knight, 1997]
• www.isi.edu/~knight
  – [MT Tutorial Workbook]
• EACL
  – [Cmejrek et al, 2003]
• Computational Linguistics
  – [Brown et al, 1993]
  – [Knight, 1999]
  – [Wu, 1997]
• AAAI
  – [Koehn & Knight, 2000]
• IWNLG
  – [Habash, 2002]
• MT Summit
  – [Charniak, Knight, Yamada, 2003]
• NAACL
  – [Koehn, Marcu, Och, 2003]
  – [Germann, 2003]
  – [Graehl & Knight, 2004]
  – [Galley, Hopkins, Knight, Marcu, 2004]

Terminology

• Simple Bayes, aka Naïve Bayes
  – Zero counts: case where an attribute value never occurs with a label in D
  – No match approach: assign a c/m probability to P(xik | vj)
  – m-estimate aka Laplace approach: assign a Bayesian estimate to P(xik | vj)
• Learning in Natural Language Processing (NLP)
  – Training data: text corpora (collections of representative documents)
  – Statistical Queries (SQ) oracle: answers queries about P(xik, vj) for x ~ D
  – Linear Statistical Queries (LSQ) algorithm: classification using f(oracle response)
    • Includes: Naïve Bayes, BOC
    • Other examples: Hidden
Markov Models (HMMs), maximum entropy
  – Problems: word sense disambiguation, part-of-speech tagging
  – Applications
    • Spelling correction, conversational agents
    • Information retrieval: web and digital library searches

Summary Points

• More on Simple Bayes, aka Naïve Bayes
  – More examples
  – Classification: choosing between two classes; general case
  – Robust estimation of probabilities: SQ
• Learning in Natural Language Processing (NLP)
  – Learning over text: problem definitions
  – Statistical Queries (SQ) / Linear Statistical Queries (LSQ) framework
    • Oracle
    • Algorithms: search for h using only (L)SQs
  – Bayesian approaches to NLP
    • Issues: word sense disambiguation, part-of-speech tagging
    • Applications: spelling; reading/posting news; web search, IR, digital libraries
• Next Week: Section 6.11, Mitchell; Pearl and Verma
  – Read: Charniak tutorial, “Bayesian Networks without Tears”
  – Skim: Chapter 15, Russell and Norvig; Heckerman slides