Machine Translation
CS 4705

Current Translation Systems
Source (English): Senator John F. Kerry is casting himself as an underdog in the fight of his political life as he seeks to jump-start his presidential campaign today with a major speech outlining the first 100 days of a Kerry presidency. He will promise health care legislation as his first bill to Congress, new limits on government lobbying, and a rebuke of President Bush's military doctrine of preemptive war.
Machine translation (Spanish): Senador Juan F. Kerry es el bastidor mismo pues un oprimido en la lucha de su vida política como él busca saltar-empieza su campaña presidencial hoy con un discurso importante que contornea los primeros 100 días de una presidencia del kerry. Él prometerá la legislación del cuidado médico como su primera cuenta al congreso, nuevos límites en el gobierno cabildeando, y una reprimenda de la doctrina militar de presidente Bush de la guerra con derecho preferente.
Back-translation (English): Senator Juan F. Kerry is the same frame because pressing in the fight of its political life as he looks for jump-begins its presidential campaign today with an important speech that it skirts the first 100 days of a presidency of kerry. He will promise the legislation of the medical care as his first account to the congress, new limits in the government lobbying, and a reprimand of the military doctrine of president Bush of the war with preferred right.

Goals for MT
• Aids for human translators
  – Online translating dictionaries
  – Access to aligned examples
• Rough translation plus post-editing
• Full translation in
  – Limited domains
  – With additional resources such as previous source/target document pairs
  – Spoken dialogue systems
• Cross-language IR: translate the query terms and search
• A related area: Language Identification

A Simple Approach to MT
Le petit chat ont mangé le melon / The small cat ate melon.
Les petits chats / The small cats
Le chat brun / The brown cat
Les petits pois / The peas
ont mangé / ate
le melon / melon

How do languages differ?
• How easy is it to segment text into words?
  – English vs. Mandarin, Korean, Chinese
• How many morphemes per word?
  – Turkish vs. Vietnamese
• How easy is it to segment words into morphemes?
  – Russian vs. Turkish
• Word order
  – SVO: English, Romance, Mandarin
  – SOV: Japanese
  – VSO: Arabic, Hebrew
  – Free: Warlpiri (and German, Hindi, Korean?)
• Where are relationships marked? (head or non-head)
  – (e.g. possessive) Hungarian vs. English
  – prepositions vs. postpositions (English vs. Japanese)
• Does the language mark gender? Number? Deference? Animacy? Kinship relations?
• Lexical gaps (It. Convento/Eng. Monastery, Convent)
• Socio-cultural idiosyncrasies:
  – dates (U.S. vs. Europe)
  – numbers (e.g. 80: French quatre-vingts vs. English eighty)
  – telephone numbers

Sapir/Whorfianism
• Linguistic determinism: our language determines our thought
• Linguistic relativism: those who speak different languages perceive reality quite differently
• "We dissect nature along lines laid down by our native languages. The categories and types that we isolate from the world of phenomena we do not find there because they stare every observer in the face; on the contrary, the world is presented in a kaleidoscopic flux of impressions which has to be organized by our minds - and this means largely by the linguistic systems in our minds. We cut nature up, organize it into concepts, and ascribe significances as we do, largely because we are parties to an agreement to organize it in this way - an agreement that holds throughout our speech community and is codified in the patterns of our language. The agreement is, of course, an implicit and unstated one, but its terms are absolutely obligatory; we cannot talk at all except by subscribing to the organization and classification of data which the agreement decrees." (Whorf 1940, pp. 213-14)

Transfer Approaches to MT
• Require deep linguistic knowledge of the specific source and target languages
  – Model differences: contrastive approach
• Analysis --> Transfer --> Generation
  – Syntactic parse produces parse tree for source
  – Syntactic transformation maps source tree to target tree
  – Lexical transfer substitutes target leaves for source leaves
  – Handling agreement, particles, inflection
• NP --> Det Adj N {English}: the brown cats
• NP --> Det N Adj {French}: les chats bruns

Weaknesses of Transfer Approaches
• Require transfer rules for each pair of languages
  – Good syntactic parsers
  – Good word-sense disambiguation

Interlingua Approaches
• Idea: extract meaning from the source and represent it in a language-independent form
  – One set of transformation rules for each language to be translated
  – A well-chosen ontology:
    • Capable of representing any meaning that any language can convey
    • Often, thematic role-based
• Semantic analysis --> semantic representation --> target generation
  – The cats are eating the melon/Les chats mangent le melon.

    EVENT: eating
      AGENT: cats
        NUMBER: PL
        DEFINITENESS: DEF
      PATIENT: melon
        NUMBER: SG
        DEFINITENESS: DEF
      ASPECT: PROGRESSIVE
      TENSE: PRESENT

Drawbacks of Interlingua Approaches
• Ontology development is time-consuming
  – Possible only for limited domains
  – Requires full disambiguation even when the current source/target don't make distinctions
  – Adding another language may require major redesign

Direct Translation
• A 'robust' approach to translation: do what you can without complex structures
• Resembles multi-stage transduction
  Source: The cats ate the melon.
  Morph: The cats/N+PL ate/V+Past the melon/N+SG.
  Lexical Transfer: The chat PL manger Past+PL the melon SG
  Generation: Les chats ont mangé le melon.

Statistical Approaches: The Noisy Channel Model Again
[Diagram: source, noisy channel, decoder]
  – Decoder input: the observed sentence S, treated as a noisy "translation"; output: the most probable original sentence T in language L2
  – Decoding:

\[
\hat{T} = \arg\max_{T \in L_2} P(T \mid S)
        = \arg\max_{T \in L_2} \frac{P(S \mid T)\, P(T)}{P(S)}
        = \arg\max_{T \in L_2} P(S \mid T)\, P(T)
\]

  – P(S) is the same for every candidate T, so it drops out of the maximization

How do we develop such models?
• Ngram language models: to quantify fluency
• Parallel corpora, e.g. Hansards, to quantify faithfulness
  – Alignment problem: words and sentences

• Mr. Michel Guimond (Beauport-Montmorency-Orléans, BQ): moved that Bill C-369, an act to amend the Criminal Code (gaming and betting), be read the second time and referred to a committee.
• He said: Mr. Speaker, on March 12, I spoke before the Sub-committee on Private Members' Business to introduce a private member's bill which would have made it possible to open casinos on cruise ships sailing on the St. Lawrence and the Great Lakes.
• This bill reflected, not some fantasy of the federal member for Beauport-Montmorency-Orléans, but a need expressed after long consultations with port administrators, community organizations and municipalities along the St. Lawrence. A number of municipal councils have even gone so far as to pass resolutions in support of Bill C-369, not the least of these being Quebec City, Beauport, in my riding, Charlesbourg and Ancienne-Lorette. I also consulted with ship owners, organizations promoting navigation on the St. Lawrence, and tourist associations.

• M. Michel Guimond (Beauport-Montmorency-Orléans, BQ) propose: Que le projet de loi C-369, Loi modifiant le Code criminel (jeux et paris), soit maintenant lu une deuxième fois et renvoyé à un comité.
• -Monsieur le Président, le 12 mars dernier, je me présentais devant le Sous-comité des affaires émanant des députés pour présenter un projet de loi d'initiative privée qui aurait permis l'ouverture des casinos à bord des bateaux de croisière qui naviguent sur le Saint-Laurent et les Grands Lacs.
• Ce projet de loi n'était pas une fantaisie du député fédéral de Beauport-Montmorency-Orléans, mais bien un besoin exprimé, après une longue consultation, auprès des dirigeants des ports, des organismes du milieu et des municipalités environnantes du Saint-Laurent. Plusieurs conseils municipaux sont même allés jusqu'à adopter des résolutions appuyant le projet de loi C-369. Je n'en nomme que quelques-unes, et non les moindres: la Ville de Québec, la Ville de Beauport, dans mon comté, la Ville de Charlesbourg et la Ville de l'Ancienne-Lorette. Mes consultations se sont aussi orientées vers les armateurs, les organismes promouvant la navigation sur le Saint-Laurent et les associations touristiques.

Evaluating Translation Systems
• Human judgments: expensive
• Objective measures: may not accord with humans
• Solution: find objective measures that correlate highly with human judgments
• IBM BLEU metric:
  – Modified ngram precision: count the ngrams in the candidate that occur in any reference translation, clip each count at the maximum number of times that ngram occurs in any single reference translation, sum the clipped counts, and divide by the total (unclipped) number of ngrams in the candidate
      Ref1: The cat is on the mat.
      Ref2: There is a cat on the mat.
      Cand1: the the the the the the the (2/7)
  – Combining ngram scores
  – Penalizing brief translations
• BLEU closely predicts human judgments
• Reference translations can be re-used for subsequent system evaluation

Future Directions
• Phrase translation
• Example Based MT
• Serial translation, e.g. manuals
• Speech to Speech Translation

What is NLP?
• Syntax vs semantics vs discourse-pragmatics
• Knowledge-rich, linguistic vs robust, statistical approaches
• Text vs speech oriented research
• Generation vs understanding
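
Worked example: modified ngram precision
The clipping step described under "Evaluating Translation Systems" can be sketched in a few lines of Python. This is only an illustration, not the reference BLEU implementation: the function names (tokenize, ngrams, modified_precision) and the simple lowercase, period-stripping tokenizer are assumptions made for this sketch, and it covers only the modified precision for a single ngram order, without the combination of ngram scores or the brevity penalty. The candidate used is seven repetitions of "the", scored against the two reference sentences from the slide.

```python
from collections import Counter


def tokenize(text):
    # Assumed toy tokenizer: lowercase, drop periods, split on whitespace.
    return text.lower().replace(".", "").split()


def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples so they can be counted.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Return (clipped_count, total_count) for ngrams of order n."""
    cand_counts = Counter(ngrams(candidate, n))

    # For each ngram, the largest count observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Clip each candidate count at that per-ngram maximum, then sum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total


refs = [tokenize("The cat is on the mat."),
        tokenize("There is a cat on the mat.")]
cand = tokenize("the the the the the the the")

clipped, total = modified_precision(cand, refs, n=1)
print(f"{clipped}/{total}")  # prints 2/7
```

Without clipping, every "the" in the candidate would count as a match and the precision would be 7/7; clipping at the two occurrences of "the" in Ref1 is what brings the score down to 2/7.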