Machine Translation
Jan Odijk
Utrecht, March 7, 2011

Overview
• Lexicons
• Statistical MT
• MT: What is (perhaps) possible
• Conclusions

Lexicons
• “What is not difficult at all: large dictionaries with many difficult words and technical terms” (Steven Krauwer, previous lecture)
• I disagree

Lexicons
• True if you know the words and terms in advance
• But new words and terms (usually with different translations) are created all the time in science, technology and industry
• So you must have techniques to find (identify, extract) such new words/terms and their translations as automatically as possible
  – To tune the lexicons to specific domains
  – To continuously extend them

Lexicons
• Many terms are multiword expressions
  – With some internal variation
  – Not always contiguous
  – This requires special treatment in the lexicon and in the grammar (* marks the words that inflect, e.g. for plural)
    • House* of representatives (Chambre* des représentants)
    • Patatas* fritas* (French fries*)
    • Chômeur* (Unemployed person*)

Lexicons
• Modern formal grammars depend highly on lexical properties
• They have very general rule schemata, which are filled in by properties of lexical items
  – E.g. a word of category X and its complements form an XPhrase
  – E.g. mass nouns can occur without an article in the singular;
  – count nouns can occur with een (“a”) in the singular

Lexicons
• Properties of lexical items
  – E.g. which complements a verb takes
    • E.g. a direct object noun phrase, also an indirect object, predicate, prepositional complement, etc.
    • E.g. an infinitival complement, with or without te, with or without om, with or without a subject, etc.
  – With which preposition it can be combined
    • Kijken naar (look at), zorgen voor (take care of), houden van (love)
  – Nouns: mass or count?
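
The lexical properties listed above only help a computer if they are stored in a formal way. The following is a minimal sketch of what such entries could look like, not the representation of any actual MT system. The class names `Noun` and `Verb`, the function `singular_np_allowed`, and the classification of boek (book) as count and meubilair (furniture) as mass are assumptions made for illustration. The article rule is also simplified: the slide only says that mass nouns can occur bare and count nouns can occur with een, while the sketch treats these as the only possibilities.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Noun:
    """Hypothetical noun entry: a noun may be mass, count, or both."""
    lemma: str
    mass: bool
    count: bool


@dataclass(frozen=True)
class Verb:
    """Hypothetical verb entry with the preposition it combines with."""
    lemma: str
    preposition: Optional[str] = None


def singular_np_allowed(noun: Noun, article: Optional[str]) -> bool:
    """Simplified rule: bare singular needs a mass noun, 'een' needs a count noun."""
    if article is None:
        return noun.mass
    if article == "een":
        return noun.count
    raise ValueError(f"article not covered by this sketch: {article!r}")


# Illustrative entries; the noun classifications are assumptions.
BOEK = Noun("boek", mass=False, count=True)
MEUBILAIR = Noun("meubilair", mass=True, count=False)

# Verb/preposition pairs taken from the slide.
VERBS = {
    verb.lemma: verb
    for verb in (
        Verb("kijken", "naar"),
        Verb("zorgen", "voor"),
        Verb("houden", "van"),
    )
}

if __name__ == "__main__":
    print(singular_np_allowed(BOEK, "een"))      # count noun with "een"
    print(singular_np_allowed(MEUBILAIR, None))  # bare mass noun
    print(VERBS["kijken"].preposition)
```

A noun such as wijn or vis would probably need both flags set, which is exactly why assigning these properties consistently is hard.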
Lexicons
• Traditional dictionaries do not contain such information (or only very rarely)
• And what is available is not represented in a formal manner
• So computers cannot use this information directly

Lexicons
• It is very difficult to assign such properties correctly in a systematic manner
  – It requires very good knowledge of syntax
  – Often the phenomena are not understood well enough
  – Words often have multiple options with different meanings and translations
  – Try it yourself for lopen; innemen
  – Count/Mass: vis (fish); wijn (wine); bestek (cutlery); meubilair (furniture)

Lexicons
• It is very difficult to assign such properties correctly in a systematic manner (cont.)
  – Lexicographers are not trained to assign such properties
  – It must be done for many words
  – Consistency within one person is hard to achieve
  – Consistency among multiple people is even harder

Lexicon: Semantics
• Selection restrictions with a type system, to approximate the modeling of world knowledge
  – Requires sophisticated syntactic analysis
    • Boek (book): info (legible)
    • Uur (hour): time unit, duration
    • Vergadering (meeting): event, duration
    • Lezen (read): subject=human; object=info (legible)
    • Durational adjunct must be a duration phrase

Lexicon: Semantics
• Selection restrictions
  – Pak (1) (suit): clothes
  – Pak (2) (package): entity
  – Dragen (1) (wear): subj=animate; object=clothes
  – Dragen (2) (carry): subj=animate; object=entity
  – Schoen (shoe): clothes
  – Entity > clothes
  – Identity preferred over subsumption
  – Homogeneous object preferred over heterogeneous one

Lexicon: Semantics
• Selection restrictions (readings ranked: 1 = preferred, 2 = less preferred, * = excluded)
  – Hij draagt een bruin pak
    • He wears a brown suit (1: clothes=clothes)
    • He carries a brown package (1: entity=entity)
    • He carries a brown suit (2: entity > clothes)
    • *He wears a brown package (clothes ¬> entity)
  – Hij draagt een bruin pak en zwarte schoenen
    • He wears a brown suit and black shoes (1: homogeneous and clothes=clothes)
    • He carries a brown suit and black shoes (2: homogeneous but entity > clothes)
    • He carries a brown package and black shoes (2: inhomogeneous but entity=entity)
    • *He wears a brown package and black shoes (clothes ¬> entity)

Statistical MT
• Statistical MT
  – Derives the MT system automatically
  – From statistics taken from
    • Aligned parallel corpora (translation model)
    • Monolingual target language corpora (language model)
• Being worked on since the early 1990s

Statistical MT
• Plus:
  – No or very limited grammar development
  – Includes language and world knowledge automatically (but implicitly)
  – Based on actually occurring data
  – Currently many experimental and commercial systems
• Minus:
  – Requires large aligned parallel corpora
  – Unclear how much linguistics will be needed anyway
  – Probably restricted to very limited domains only

Statistical MT
• Google Translate (statistical MT)
  – Hij draagt een pak. √ He wears a suit.
  – Hij draagt schoenen. √ He wears shoes.
  – Hij draagt bruine schoenen en een pak. √ He wears a suit and brown shoes. (!!)
  – Hij draagt het pakket. √ He carries the package.
  – Hij heeft een pak aan. *He has a suit.
  – Voert uw bedrijf sloten uit? *Does your company locks out?

Hybrid MT
• Euromatrix, esp. “the Euromatrix”, and
  – Successor project EuromatrixPlus
  – …
  – Efficient inclusion of linguistic knowledge into statistical machine translation
  – The development and testing of hybrid architectures for the integration of rule-based and statistical approaches

Hybrid MT
• META-NET 2010-2013 (EU funding)
  – Building a community with shared vision and strategic research agenda
  – Building META-SHARE, an open resource exchange facility
  – Building bridges to neighbouring technology fields
    • Bringing more Semantics into Translation
    • Optimising the Division of Labour in Hybrid MT
    • Exploiting the Context for Translation
    • Empirical Base for Machine Translation

Hybrid MT
• PACO-MT 2008-2011
• Investigates hybrid approach to MT
  – Rule-based and statistical
  – Uses existing parser for source language analysis
  – Uses statistical n-gram language models for generation
  – Uses statistical approach to transfer

MT: What is (perhaps) possible
• Cross-Language Information Retrieval
• Low Quality MT for Gist extraction
• MT and Speech Technology
• Controlled Language
• Limited Domain
• Interaction with author
• Combinations of the above
• Computer-aided translation

MT: What is (perhaps) possible
• Cross-Language Information Retrieval (CLIR)
  – Input query: in own language
  – Input query translated into target languages
  – Search in target language documents
  – Results in target language
• Translation of individual words only
• Growing need (growing multilingual Web)
• No perfect translation required

MT: What is (perhaps) possible
• Low quality MT for Gist extraction
• Low quality but still useful
• If the document turns out to be interesting, a high quality human translation can be requested (has to be paid for)

MT: What is (perhaps) possible
• CLIR
  – Fills a growing need in the market
  – Is technically feasible
  – Creates need for translation of found documents
    • Solved partially by low quality MT
    • Potentially creates need for more human translation
    • Stimulates (funds) research into more sophisticated MT

MT: What is (perhaps) possible
• Combine MT (statistical or rule-based) with OCR technology
  – Take a picture of a text with your phone
  – Text is OCR-ed
  – Text is translated
  – (usually a short and simple text)
• Linguatec Shoot & Translate
• Word Lens

MT: What is (perhaps) possible
• Combine MT (statistical or rule-based) with Speech technology
  – Complicates the problem on the one hand, but
  – Speech technology (ASR) is currently restricted to very limited domains (makes MT simpler)
  – Many useful applications for speech technology currently in the market
    • Directory assistance
    • Tourist information
    • Tourist communication
    • Call centers
    • Navigation
    • Hotel reservations
  – Some will profit from built-in automatic translation

MT: What is (perhaps) possible
• Large EC FP6 project TC-STAR (2004-)
  – (http://www.tc-star.org/)
  – Research into improved speech technology (ASR and TTS)
  – Research into statistical MT
  – Research in combining both (speech-to-speech translation)
  – In a few selected limited domains

MT: What is (perhaps) possible
• Commercial Speech2Speech Translation
• Jibbigo
  – http://www.jibbigo.com
    • Speech-to-speech translation (iPhone, Android)
    • http://www.phonedog.com/2009/10/30/iphone-appjibbigo-speech-translator
• Talk to Me (Android phones)

MT: What is (perhaps) possible
• Controlled Language
  – Authoring system limits vocabulary and syntax of document authors
  – Often desirable in companies to get consistent documentation (e.g. aircraft maintenance manuals)
    • AECMA Simplified English
    • GIFAS Rationalized French
  – Makes MT easier (language well-defined)

MT: What is (perhaps) possible
• Limited Domain
  – Translation of
    • Weather reports (TAUM-Meteo, Canada)
    • Avalanche warnings (Switzerland)
  – Fast adaptation to domain/company-specific vocabulary and terminology

MT: What is (perhaps) possible
• Interaction with author
  – No fully automatic translation
  – Document author resolves
    • Ambiguities unresolved by the system
    • In a dialogue between the author and the system in the source language
• Approach taken in the Rosetta project (Philips)
• Will only work if
  – The number of unresolved ambiguities is low
  – The questions to resolve an ambiguity are clear

MT: What is (perhaps) possible
• Hij droeg een bruin pak
  – Wat bedoelt u met “pak”? (What do you mean by “pak”?)
    • (1) kostuum (suit)
    • (2) pakket (package)
• Hij droeg een bruin pak
  – Wat bedoelt u met “dragen (droeg)”? (What do you mean by “dragen (droeg)”?)
    • (1) aan of op hebben (kleding) (to have on, of clothing)
    • (2) bij zich hebben (bijv. in de hand) (to have with you, e.g. in your hand)

MT: What is (perhaps) possible
• Combinations of the above

MT: What is (perhaps) possible
• Computer-aided translation
  – For end-users
  – For professional translators/localization industry
• Limited functionality
  – Specific terminology
• Bootstrap translation automatically
  – Human revision and correction (post-editing)
• Only if
  – MT quality is such that it reduces effort
  – The system is fully integrated in the workflow system

Conclusions
• MT is really very difficult!
• Even making a lexicon for an MT system is very difficult (and a lot of work)
• Statistical MT yields practical systems that are relatively quick to produce (but low-quality)
  – Provided you have huge amounts of data
• Focus of research is on hybrid systems (mixed statistically based/knowledge based) (PACO-MT, META-NET, …)

Conclusions
• Several constrained versions do yield usable technology with state-of-the-art MT
• In some cases this even creates additional needs for MT and human translation

  – Try it yourself for lopen; innemen
  – Count/Mass: vis; wijn; bestek; meubilair
    • http://www.vandale.nl/

MT Evaluation
• Evaluation depends on the purpose of MT and how it is used
  – application, domain, controlled language
• Many aspects can be evaluated
  – functionality, efficiency, usability, reliability, maintainability, portability
  – translation quality
  – embedding in work flow
    • post-editing options/tools

MT Evaluation
• Focus here:
  – does the system yield good translations according to human judgement
  – in the context of developing a system
• Again, many aspects:
  – fidelity (how close), correctness, adequacy, informativeness, intelligibility, fluency
  – and many ways to measure these aspects

MT Evaluation
• Test suite
  – Reference =
    • list of (carefully selected) sentences
    • with their translations (ordered by score)
  – translations judged correct by human (usually developer)
  – upon every update of the system, the output of the new system is compared to the reference
    • if different: system has to be adapted, or reference has to be adapted
• Advantages
  – focus on specific translation problems possible
  – excellent for regression testing
  – manual judgement needed only once for each new output
    • other comparisons are automatic
• Disadvantages
  – not really independent
  – particularly suited for pure rule-based systems
  – human judgement needed if output differs from reference

MT Evaluation
• Comparison against
  – translation corpus
  – independently created by human translators
  – possibly multiple equivalently correct translations of a sentence
• Advantages
  – truly independent
  – also suited for data-driven systems
• Disadvantage
  – requires human judgement (every time there is a system update)
    • high effort by highly skilled people, high costs, requires a lot of time
  – human judgement is not easy (unless there is a perfect match)
• Useful
  – for a one-time evaluation of a stable system
  – not for evaluation during development

MT Evaluation
• Edit-Distance (Word Accuracy)
  – metric to determine closeness of translations automatically
  – the least number of edit operations to turn the translated sentence into the reference sentence
  – Alshawi et al. 1998

MT Evaluation
• $\mathrm{WA} = 1 - \frac{d + s + i}{\max(r, c)}$
• d = number of deletions
• s = number of substitutions
• i = number of insertions
• r = reference sentence length
• c = candidate sentence length
• easy to calculate using the Levenshtein distance algorithm (dynamic programming)
• various extensions have been proposed

MT Evaluation
• Advantages
  – fully automatic given a reference set
• Disadvantages
  – penalizes candidates if a synonym is used
  – penalizes swaps of words and blocks of words too much

MT Evaluation
• BLEU (method to automate MT Evaluation)
  – the closer a machine translation is to a professional human translation, the better it is
  – BiLingual Evaluation Understudy
• Required:
  – corpus of good quality human reference translations
  – a “closeness” metric

MT Evaluation
• Two candidate translations from Chinese source
  – C1: It is a guide to action which ensures that the military always obeys the commands of the party
  – C2: It is to insure the troops forever hearing the activity guidebook that party direct
• Intuitively: C1 is better than C2

MT Evaluation
• Three reference translations
  – R1: It is a guide to action that ensures that the military will forever heed Party commands
  – R2: It is the guiding principle which guarantees the military forces always being under the command of the Party
  – R3: It is the practical guide for the army always to heed the directions of the party

MT Evaluation
• Basic idea:
  – a good candidate translation shares many words and phrases with reference translations
  – comparing n-gram matches can be used to rank candidate translations
• n-gram: a sequence of n word occurrences
  – in BLEU n = 1, 2, 3, 4
  – 1-grams give a measure of adequacy
  – longer n-grams give a measure of fluency

MT Evaluation
• For unigrams:
  – count the number of matching unigrams
    • in all references
  – divide by the total number of unigrams (in the candidate sentence)

MT Evaluation
• Problem
  – C1: the the the the the the the (= 7/7 = 1)
  – R1: the cat is on the mat
• Solution:
  – clip the matching count (7) by the maximum reference count (2), which gives 2 ($Count_{clip}$)
  – modified unigram precision = 2/7 = 0.29

MT Evaluation
• Example (unigrams, references R1-3 as above)
  – C1: It is a guide to action which ensures that the military always obeys the commands of the party (17/18 = 0.94)
  – C2: It is to insure the troops forever hearing the activity guidebook that party direct (8/14 = 0.57)

MT Evaluation
• Example (bigrams, references R1-3 as above)
  – C1: It is a guide to action which ensures that the military always obeys the commands of the party (10/17 = 0.59)
  – C2: It is to insure the troops forever hearing the activity guidebook that party direct (1/13 = 0.08)

MT Evaluation
• Extend to a full multi-sentence corpus
• compute n-gram matches sentence by sentence
• sum the clipped n-gram counts for all candidates
• divide by the number of candidate n-grams in the test corpus
• $p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')}$

MT Evaluation
• Combining n-gram precision scores
• weighted linear average works reasonably
  – $\sum_{n=1}^{N} w_n p_n$
• but: n-gram precision decays exponentially with n (so take the log to compensate for this)
  – $\exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$
• weights in BLEU: $w_n = 1/N$

MT Evaluation
• BLEU is a precision measure
  – #(C ∩ R) / #C
• Recall is difficult to define because of multiple reference translations
  – e.g. $\#(C \cap R_s) / \#R_s$
    • where $R_s = \bigcup_i R_i$
  – will not work

MT Evaluation
• C1: I always invariably perpetually do
• C2: I always do
• R1: I always do
• R2: I invariably do
• R3: I perpetually do
• Recall of C1 over R1-3 is better than that of C2
• but C2 is a better translation

MT Evaluation
• But without Recall:
  – C1: of the
  – compared with R1-3 of the “guide to action” example
  – modified unigram precision = 2/2
  – modified bigram precision = 1/1
  – which is the wrong result

MT Evaluation
• Length
  – n-gram precision penalizes translations longer than the reference
  – but not translations shorter than the reference
  – Add Brevity Penalty (BP)

MT Evaluation
• $b_i$ = best match length = reference sentence length closest to candidate sentence i's length (e.g. reference lengths 12, 15, 17 and candidate length 12 give $b_i = 12$)
• r = test corpus effective reference length = $\sum_i b_i$
• c = total length of candidate translation corpus

MT Evaluation
• BP =
  – 1 if $c > r$
  – $e^{(1 - r/c)}$ if $c \le r$
  – computed over the corpus
  – not sentence by sentence and averaged
• $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$

MT Evaluation
• BLEU:
  – claim: BLEU closely matches human judgement
    • when averaged over a test corpus
    • not necessarily on individual sentences
    • shown extensively in Papineni et al. 2001
  – multiple reference translations are desirable
    • to cancel out translation styles of individual translators
    • (e.g. East Asian economy vs. economy of East Asia)

MT Evaluation
• Variants on BLEU
  – NIST
    • http://www.nist.gov/speech/tests/mt/doc/ngramstudy.pdf
    • different weights
    • different BP
  – ROUGE (Lin and Hovy 2003)
    • for text summarization
    • Recall-Oriented Understudy for Gisting Evaluation

MT Evaluation
• Main Advantage of BLEU
  – automatic evaluation
    • good for use during development
    • particularly useful for data-based systems
• Disadvantage
  – defined for a whole test corpus
  – not for individual sentences
  – just measures difference with reference
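
The Word Accuracy metric from the edit-distance slides can be sketched in a few lines. This is a minimal illustration, not the implementation of Alshawi et al.: tokenization by whitespace, the function names `edit_distance` and `word_accuracy`, and the convention that two empty sentences score 1.0 are all assumptions. The Levenshtein distance gives the minimum total d + s + i directly, so the three operation counts are not tracked separately.

```python
def edit_distance(candidate, reference):
    """Minimum number of word deletions, substitutions and insertions
    needed to turn `candidate` into `reference` (dynamic programming)."""
    previous = list(range(len(reference) + 1))
    for i, cand_word in enumerate(candidate, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if cand_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a candidate word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def word_accuracy(candidate: str, reference: str) -> float:
    """WA = 1 - (d + s + i) / max(r, c), on whitespace tokens."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    longest = max(len(ref_tokens), len(cand_tokens))
    if longest == 0:
        return 1.0  # assumption: two empty sentences count as identical
    return 1 - edit_distance(cand_tokens, ref_tokens) / longest


if __name__ == "__main__":
    print(word_accuracy("I always do", "I always do"))
    print(word_accuracy("I always invariably perpetually do", "I always do"))
    # A synonym is penalized like any other substitution:
    print(word_accuracy("I invariably do", "I always do"))
```

The last call shows the disadvantage named on the slides: a candidate that is a valid translation according to R2 still loses a third of its score against R1.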
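
The BLEU definition built up over the preceding slides (clipped n-gram counts, corpus-level modified precision, brevity penalty) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the reference implementation of Papineni et al.: tokens are whitespace-separated and case-sensitive, ties for the best match length go to the shorter reference, and a corpus with no matches at some n-gram order simply scores 0. The names `clipped_counts` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_counts(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one sentence."""
    cand_counts = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


def bleu(candidates, references_per_candidate, max_n=4):
    """Corpus-level BLEU = BP * exp(sum_n w_n log p_n) with w_n = 1/max_n."""
    pairs = list(zip(candidates, references_per_candidate))
    c = sum(len(cand) for cand, _ in pairs)
    if c == 0:
        return 0.0
    # Effective reference length: per sentence, the reference length closest
    # to the candidate length (assumption: ties go to the shorter reference).
    r = sum(
        min((len(ref) for ref in refs), key=lambda n_ref: (abs(n_ref - len(cand)), n_ref))
        for cand, refs in pairs
    )
    log_precision = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, refs in pairs:
            m, t = clipped_counts(cand, refs, n)
            matched += m
            total += t
        if matched == 0:
            return 0.0  # log(0) is undefined; this sketch scores such a corpus 0
        log_precision += math.log(matched / total) / max_n
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision)


# Example sentences from the slides.
C1 = "It is a guide to action which ensures that the military always obeys the commands of the party".split()
C2 = "It is to insure the troops forever hearing the activity guidebook that party direct".split()
REFS = [
    "It is a guide to action that ensures that the military will forever heed Party commands".split(),
    "It is the guiding principle which guarantees the military forces always being under the command of the Party".split(),
    "It is the practical guide for the army always to heed the directions of the party".split(),
]

if __name__ == "__main__":
    print(clipped_counts(C1, REFS, 1), clipped_counts(C2, REFS, 1))
    print(clipped_counts(C1, REFS, 2), clipped_counts(C2, REFS, 2))
    print(bleu([C1], [REFS]), bleu([C2], [REFS]))
```

On the slide examples, `clipped_counts` reproduces the fractions 17/18 and 8/14 for unigrams and 10/17 and 1/13 for bigrams. C2 has no matching trigram, so its single-sentence score collapses to 0, which illustrates why BLEU is meant to be computed over a whole test corpus rather than per sentence.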