Machine Translation: Introduction
Jan Odijk
LOT Winterschool, Amsterdam, January 2011

Overview
• MT: What is it?
• MT: What is not possible (yet?)
• MT: Why is it so difficult?
• MT: Can we make it possible?
• MT: Evaluation
• MT: What is (perhaps) possible
• Conclusions

MT: What is it?
• Input: text in a source language
• Output: text in a target language that is a translation of the input text

MT: What is it?
• [Diagram: the translation pyramid. Direct translation maps Input straight to Output; transfer maps Analyzed input to Analyzed output; an Interlingua sits at the top.]

MT: System Types
• Direct:
  – earliest systems (1950s): direct word-to-word translation
  – recent statistical MT systems
• Transfer:
  – almost all research and commercial systems up to 1990
• Interlingual

MT: System Types
• Interlingual
  – a few research systems in the 1980s
    • Rosetta (Philips), based on Montague Grammar: semantic derivation trees of attuned grammars
    • Distributed Language Translation (DLT, BSO): (enriched) Esperanto
  – sometimes logical representations
• Hybrid interlingual/transfer
  – transfer for lexicons; interlingua for rules

Rule-Based Systems
• Most systems:
  – explicit source-language grammar
  – the parser yields an analysis of the source-language input
  – a transfer component turns it into a target-language structure
  – no explicit grammar of the target language (except morphology)

Rule-Based Systems
• Some systems (Eurotra):
  – explicit source- and target-language grammars (sometimes reversible)
  – the parser yields an analysis of the source-language input
  – a transfer component turns it into a target-language structure
  – generation of the translation by the target-language grammar

Rule-Based Systems
• Some systems (Rosetta, DLT):
  – explicit source- and target-language grammars (in some cases reversible)
  – the parser yields an interlingual representation
  – generation of the translation by the target-language grammar, from the interlingual representation

MT: Is it difficult?
• FAHQT: Fully Automatic High-Quality Translation
  – fully automatic: no human intervention
  – high quality: close or equal to human translation
• Even acceptable quality is difficult to achieve

MT: Why is it so difficult?
• Ambiguity
  – real
  – temporary
• Computational complexity
• Complexity of language
• Divergences
• Language competence vs. language use
• Requires large and rich lexicons

MT: Why is it so difficult?
• De jongen sloeg het meisje met de gitaar ('The boy hit the girl with the guitar')
• Hij heeft boeken gelezen
• Hij heeft uren gelezen
  – He has been reading books
  – *He has been reading for books
  – *He has been reading hours
  – He has been reading for hours

MT: Why is it so difficult?
• Not only uren ('hours'):
  – dagen, de hele dag, weken, … (words expressing units of time)
• But also:
  – de hele vergadering, meeting, bijeenkomst, les, … (words expressing events)

MT: Why is it so difficult?
• Hij draagt een bruin pak
  – dragen: 'wear' or 'carry'
  – pak: 'suit' or 'package'
• Hij draagt een bruin pak en zwarte schoenen ('… and black shoes')
• Hij draagt een bruin pak onder zijn arm ('… under his arm')

MT: Why is it so difficult?
• Voert uw bedrijf sloten uit?
  – uitvoeren: 'execute' or 'export'?
  – bedrijf: 'act' or 'company'?
  – sloten: 'ditches' or 'locks'?

MT: Why is it so difficult?
• Temporary ambiguity
  – Hij heeft boeken gelezen
    • heeft: main or auxiliary verb?
    • boeken: noun or verb?
  – Voert uw bedrijf sloten uit?
    • voert: form of voeren or of uitvoeren?
    • bedrijf: noun or verb form?
    • sloten uit: noun + particle, or a PP 'out of ditches/locks'?

Why is MT difficult? (Summary)
• Ambiguity of natural language
  – requires modeling of knowledge of the world/situation
    • by rule systems, and/or
    • by statistics

MT: Why is it so difficult?
• Computational complexity
  – high demands on processing capacity
  – high demands on memory
• Complexity of language
  – many different construction types
  – all interacting with each other

Why is MT difficult?
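To make the complexity point above concrete, one standard illustration (not from the slides) is that the number of possible binary bracketings of an n-word sentence alone, a crude lower bound on structural ambiguity, grows with the Catalan numbers:

```python
# Number of binary parse trees over n words = the Catalan number C(n-1).
# Illustrative only: a crude lower bound on how structural ambiguity explodes.
def catalan(n: int) -> int:
    c = 1
    for k in range(n):                  # C(k+1) = C(k) * 2(2k+1) / (k+2)
        c = c * 2 * (2 * k + 1) // (k + 2)
    return c

for words in (2, 5, 10, 15):
    print(words, "words:", catalan(words - 1), "bracketings")
# 15 words already allow millions of bracketings, before any lexical ambiguity.
```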
• Divergences between languages
  – require deep syntactic analysis
  – or very sophisticated statistical techniques

Divergences: Category mismatches
• Simple category mismatches
  – woonachtig (zijn) vs. reside (Adj vs. Verb)
  – zich ergeren vs. (be) annoyed (Verb vs. Adj)
  – verliefd vs. in love (Adj vs. Prep+Noun)
  – kunnen vs. (be) able
  – kunnen vs. (be) possible
  – door- vs. continue (to)

Divergences: Category mismatches
• More complex category mismatches
  – graag vs. like (Adv vs. Verb)
    • hij zwemt graag vs. he likes to swim
  – toevallig vs. happen
    • hij viel toevallig vs. he happened to fall

Divergences: Category mismatches
• Phrasal category mismatches
  – de zieke vrouw vs. the woman who is ill (*the ill woman)
  – I expect her to leave vs. ik verwacht dat zij vertrekt
  – She is likely to come vs. het is waarschijnlijk dat zij komt

Conflational divergences
• Prepositional complements
  – houden van vs. love
• Existential er vs. Ø
  – er passeerde een auto vs. a car passed
• Verbal particles
  – blow (something) up vs. volar

Conflational divergences
• Reflexive verbs
  – zich scheren vs. shave
• Composed vs. simple tense forms
  – he will do it vs. lo hará
• Split vs. composed negatives
  – he does not see anyone vs. hij ziet niemand

Functional divergences
• I like these apples vs. me gustan estas manzanas
• se venden manzanas aquí vs. hier verkoopt men appels
• er werd door de toeschouwers gejuicht vs. the spectators were cheering

Divergences: MWEs
• Semi-fixed MWEs
  – nuclear power plant vs. kerncentrale
• Flexible idioms
  – de plaat poetsen vs. bolt
  – de pijp uit gaan vs. to kick the bucket

Divergences: MWEs
• Semi-idioms (collocations)
  – zware shag vs. strong tobacco
• Semi-idioms (support verbs)
  – aandacht besteden aan vs. pay attention to

MT: Why is it so difficult?
• Language competence vs. language use
  – earlier systems implemented an idealized reality,
  – not actually occurring language use
  – in some cases:
    • a focus on theoretically interesting, difficult constructions
    • that do occur in reality,
    • but other constructions are more important to deal with in practical systems

MT: Why is it so difficult?
• Large and rich lexicons
  – existing human-oriented dictionaries are not suited as such
  – all information must be available in a formalized way
  – much more information is needed than in a traditional dictionary

MT: Why is it so difficult?
• Multi-word expressions (MWEs)
  – appear in current dictionaries only in a very informal way
  – no standards on how to represent them lexically
  – many different types, requiring different treatment in the grammar
  – huge numbers!
  – domain- and company-specific terminology often consists of MWEs

MT: Can we make it possible?
• Probably not,
• but we can still improve significantly
  – lexicons
  – selection restrictions
  – approximating analyses
• Statistical MT

MT: Can we make it possible?
• Large and rich lexicons
  – widely accepted and used (de facto) standards
  – methods and tools to adapt quickly to domain- or company-specific vocabulary
  – better treatment of MWEs, and standards for the lexical representation of MWEs

MT: Can we make it possible?
• Selection restrictions, with a type system, to approximate modeling of world knowledge
  – requires sophisticated syntactic analysis
  – boek: info (legible)
  – uur: time unit, duration
  – vergadering: event, duration
  – lezen: subject = human; object = info (legible)
  – a durational adjunct must be a duration phrase

MT: Can we make it possible?
• Selection restrictions
  – pak (1) ('suit'): clothes
  – pak (2) ('package'): entity
  – dragen (1) ('wear'): subject = animate; object = clothes
  – dragen (2) ('carry'): subject = animate; object = entity
  – schoen ('shoe'): clothes
  – entity > clothes (subsumption)
  – identity preferred over subsumption
  – a homogeneous object preferred over a heterogeneous one

MT: Can we make it possible?
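A minimal sketch of how selection restrictions plus a small type hierarchy could rank verb/noun readings as in the preference scheme above. All names and the dictionary encoding are illustrative assumptions, not the actual Rosetta mechanism:

```python
# Toy type hierarchy from the slide: "clothes" is subsumed by "entity".
SUBSUMES = {"entity": {"entity", "clothes"}, "clothes": {"clothes"}}

# Verb senses with the object type they select, and noun senses with their type.
VERB_SENSES = {"dragen": {"wear": "clothes", "carry": "entity"}}
NOUN_SENSES = {"pak": {"suit": "clothes", "package": "entity"},
               "schoen": {"shoe": "clothes"}}

def readings(verb_nl: str, noun_nl: str):
    """Rank verb-sense/noun-sense pairs: identity of types beats subsumption."""
    results = []
    for verb, selected in VERB_SENSES[verb_nl].items():
        for noun, noun_type in NOUN_SENSES[noun_nl].items():
            if noun_type == selected:
                results.append((1, verb, noun))   # identity: preferred
            elif noun_type in SUBSUMES[selected]:
                results.append((2, verb, noun))   # subsumption: dispreferred
            # otherwise ruled out (e.g. *wear a package: clothes does not subsume entity)
    return sorted(results)

print(readings("dragen", "pak"))
```

With these toy entries, 'wear a suit' and 'carry a package' come out rank 1, 'carry a suit' rank 2, and 'wear a package' is excluded, matching the starred example on the next slide.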
• Selection restrictions
  – Hij draagt een bruin pak
    • He wears a brown suit (1: clothes = clothes)
    • He carries a brown package (1: entity = entity)
    • He carries a brown suit (2: entity > clothes)
    • *He wears a brown package (clothes ¬> entity)
  – Hij draagt een bruin pak en zwarte schoenen
    • He wears a brown suit and black shoes (1: homogeneous, and clothes = clothes)
    • He carries a brown suit and black shoes (2: homogeneous, but entity > clothes)
    • He carries a brown package and black shoes (2: inhomogeneous, but entity = entity)
    • *He wears a brown package and black shoes (clothes ¬> entity)

MT: Can we make it possible?
• Approximating analyses
  – ignore certain ambiguities to begin with
  – use only a limited amount of relevant information
  – cut off the analysis when there are too many alternatives
  – this is what all practical systems currently do
  – we need new ways of doing this without affecting quality too seriously

MT: Can we make it possible?
• Statistical MT
  – derives an MT system automatically from statistics taken from
    • aligned parallel corpora (→ translation model)
    • monolingual target-language corpora (→ language model)
  – worked on since the early 1990s

MT: Can we make it possible?
• Plus:
  – no or very limited grammar development
  – includes language and world knowledge automatically (but implicitly)
  – based on actually occurring data
  – currently many experimental and commercial systems
• Minus:
  – requires large aligned parallel corpora
  – unclear how much linguistics will be needed anyway
  – probably restricted to very limited domains only

MT: Can we make it possible?
• Google Translate (statistical MT)
  – Hij draagt een pak. √ He wears a suit.
  – Hij draagt schoenen. √ He wears shoes.
  – Hij draagt bruine schoenen en een pak. √ He wears a suit and brown shoes. (!!)
  – Hij draagt het pakket. √ He carries the package.
  – Hij heeft een pak aan. * He has a suit.
  – Voert uw bedrijf sloten uit? * Does your company locks out?

MT: Can we make it possible?
• Euromatrix, esp. “the Euromatrix”
  – lists data and tools for European language pairs
  – goals:
    • translation systems for all pairs of EU languages
    • organization, analysis and interpretation of a competitive annual international evaluation of machine translation
    • the provision of open-source machine translation technology, including research tools, software and data
    • a systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs
    • efficient inclusion of linguistic knowledge into statistical machine translation
    • the development and testing of hybrid architectures for the integration of rule-based and statistical approaches
  – successor project: EuromatrixPlus

MT: Can we make it possible?
• META-NET 2010-2013 (EU funding)
  – building a community with a shared vision and strategic research agenda
  – building META-SHARE, an open resource exchange facility
  – building bridges to neighbouring technology fields
    • bringing more semantics into translation
    • optimising the division of labour in hybrid MT
    • exploiting the context for translation
    • an empirical base for machine translation

MT: Can we make it possible?
• PACO-MT 2008-2011
• Investigates a hybrid approach to MT
  – rule-based and statistical
  – uses an existing parser for source-language analysis
  – uses statistical n-gram language models for generation
  – uses a statistical approach to transfer

MT Evaluation
• Evaluation depends on the purpose of the MT system and on how it is used
  – application, domain, controlled language
• Many aspects can be evaluated
  – functionality, efficiency, usability, reliability, maintainability, portability
  – translation quality
  – embedding in the workflow
    • post-editing options/tools

MT Evaluation
• Focus here:
  – does the system yield good translations according to human judgement?
  – in the context of developing a system
• Again, many aspects:
  – fidelity (how close), correctness, adequacy, informativeness, intelligibility, fluency
  – and many ways to measure these aspects

MT Evaluation
• Test suite
  – reference =
    • a list of (carefully selected) sentences
    • with their translations (ordered by score)
  – translations judged correct by a human (usually the developer)
  – upon every update of the system, the output of the new system is compared to the reference
    • if different: the system has to be adapted, or the reference has to be adapted
• Advantages
  – a focus on specific translation problems is possible
  – excellent for regression testing
  – manual judgement is needed only once for each new output; other comparisons are automatic
• Disadvantages
  – not really independent
  – particularly suited to pure rule-based systems
  – human judgement is needed whenever the output differs from the reference

MT Evaluation
• Comparison against
  – a translation corpus
  – independently created by human translators
  – possibly multiple, equivalently correct translations of a sentence
• Advantages
  – truly independent
  – also suited to data-driven systems
• Disadvantages
  – requires human judgement (every time there is a system update)
    • high effort by highly skilled people, high costs, requires a lot of time
  – human judgement is not easy (unless there is a perfect match)
• Useful
  – for a one-time evaluation of a stable system
  – not for evaluation during development

MT Evaluation
• Edit distance (Word Accuracy)
  – a metric to determine the closeness of translations automatically
  – the least number of edit operations needed to turn the translated sentence into the reference sentence
  – Alshawi et al. 1998

MT Evaluation
• WA = 1 − ((d + s + i) / max(r, c))
  – d = number of deletions
  – s = number of substitutions
  – i = number of insertions
  – r = reference sentence length
  – c = candidate sentence length
• easy to calculate using the Levenshtein distance algorithm (dynamic programming)
• various extensions have been proposed

MT Evaluation
• Advantages
  – fully automatic, given a reference set
• Disadvantages
  – penalizes candidates that use a synonym
  – penalizes swaps of words and of blocks of words too heavily

MT Evaluation
• BLEU (a method to automate MT evaluation)
  – the closer a machine translation is to a professional human translation, the better it is
  – BiLingual Evaluation Understudy
• Required:
  – a corpus of good-quality human reference translations
  – a "closeness" metric

MT Evaluation
• Two candidate translations of a Chinese source
  – C1: It is a guide to action which ensures that the military always obeys the commands of the party
  – C2: It is to insure the troops forever hearing the activity guidebook that party direct
• Intuitively, C1 is better than C2

MT Evaluation
• Three reference translations
  – R1: It is a guide to action that ensures that the military will forever heed Party commands
  – R2: It is the guiding principle which guarantees the military forces always being under the command of the Party
  – R3: It is the practical guide for the army always to heed the directions of the party

MT Evaluation
• Basic idea:
  – a good candidate translation shares many words and phrases with the reference translations
  – comparing n-gram matches can be used to rank candidate translations
• n-gram: a sequence of n word occurrences
  – in BLEU, n = 1, 2, 3, 4
  – 1-grams give a measure of adequacy
  – longer n-grams give a measure of fluency

MT Evaluation
• For unigrams:
  – count the number of matching unigrams (in all references)
  – divide by the total number of unigrams in the candidate sentence

MT Evaluation
• Problem
  – C1: the the the the the the the (= 7/7 = 1)
  – R1: the cat is on the mat
• Solution:
  – clip the matching count (7) by the maximum reference count (2): CountClip
  – modified unigram precision = 2/7 = 0.29

MT Evaluation
• Example (unigrams)
  – C1: It is a guide to action which ensures that the military always obeys the commands of the party (17/18 = 0.94)
  – R1-R3: as above

MT Evaluation
• Example (unigrams)
  – C2: It is to insure the troops forever hearing the activity guidebook that party direct (8/14 = 0.57)
  – R1-R3: as above

MT Evaluation
• Example (bigrams)
  – C1: It is a guide to action which ensures that the military always obeys the commands of the party (10/17 = 0.59)
  – R1-R3: as above

MT Evaluation
• Example (bigrams)
  – C2: It is to insure the troops forever hearing the activity guidebook that party direct (1/13 = 0.08)
  – R1-R3: as above

MT Evaluation
• Extend to a full multi-sentence corpus:
  – compute n-gram matches sentence by sentence
  – sum the clipped n-gram counts over all candidate sentences
  – divide by the total number of n-grams in the candidate corpus
  – p_n = ∑_{C ∈ Candidates} ∑_{n-gram ∈ C} Count_clip(n-gram) / ∑_{C′ ∈ Candidates} ∑_{n-gram′ ∈ C′} Count(n-gram′)

MT Evaluation
• Combining the n-gram precision scores
  – a weighted linear average works reasonably well: ∑_{n=1}^{N} w_n p_n
  – but n-gram precision decays exponentially with n, so take logs to compensate: exp(∑_{n=1}^{N} w_n log p_n)
  – weights in BLEU: w_n = 1/N

MT Evaluation
• BLEU is a precision measure
  – #(C ∩ R) / #C
• Recall is difficult to define because there are multiple reference translations
  – e.g. #(C ∩ Rs) / #Rs, where Rs = ∪_i R_i
  – will not work

MT Evaluation
• C1: I always invariably perpetually do
• C2: I always do
• R1: I always do
• R2: I invariably do
• R3: I perpetually do
• The recall of C1 over R1-R3 is better than that of C2, but C2 is the better translation

MT Evaluation
• But without recall:
  – C1: of the
  – compared with R1-R3 as before
  – modified unigram precision = 2/2
  – modified bigram precision = 1/1
  – which is the wrong result

MT Evaluation
• Length
  – n-gram precision penalizes translations longer than the reference
  – but not translations shorter than the reference
  – so add a Brevity Penalty (BP)

MT Evaluation
• b_i = best-match length = the reference sentence length closest to the length of candidate sentence i (e.g. reference lengths 12, 15, 17 and candidate length 12 give b_i = 12)
• r = test corpus effective reference length = ∑_i b_i
• c = total length of the candidate translation corpus

MT Evaluation
• BP
  – computed over the corpus, not sentence by sentence and then averaged
  – BP = 1 if c > r
  – BP = e^(1 − r/c) if c ≤ r
• BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n)

MT Evaluation
• BLEU:
  – claim: BLEU closely matches human judgement
    • when averaged over a test corpus
    • not necessarily on individual sentences
    • shown extensively in Papineni et al. 2001
  – multiple reference translations are desirable
    • to cancel out the translation styles of individual translators
    • (e.g. East Asian economy vs. economy of East Asia)

MT Evaluation
• Variants of BLEU
  – NIST
    • http://www.nist.gov/speech/tests/mt/doc/ngramstudy.pdf
    • different weights
    • a different BP
  – ROUGE (Lin and Hovy 2003)
    • for text summarization
    • Recall-Oriented Understudy for Gisting Evaluation

MT Evaluation
• Main advantage of BLEU
  – automatic evaluation
    • good for use during development
    • particularly useful for data-based systems
• Disadvantages
  – defined for a whole test corpus, not for individual sentences
  – just measures the difference with a reference

MT: What is (perhaps) possible
• Cross-language information retrieval
• Low-quality MT for gist extraction
• MT and speech technology
• Controlled language
• Limited domain
• Interaction with the author
• Combinations of the above
• Computer-aided translation

MT: What is (perhaps) possible
• Cross-Language Information Retrieval (CLIR)
  – input query: in one's own language
  – input query translated into the target languages
  – search in target-language documents
  – results in the target language
• Translation of individual words only
• A growing need (the growing multilingual Web)
• No perfect translation required

MT: What is (perhaps) possible
• Low-quality MT for gist extraction
  – low quality but still useful
  – if interesting, a high-quality human translation can be requested (has to be paid for)

MT: What is (perhaps) possible
• CLIR
  – fills a growing need in the market
  – is technically feasible
  – creates a need for translation of the documents found
    • partially solved by low-quality MT
    • potentially creates a need for more human translation
    • stimulates (funds) research into more sophisticated MT

MT: What is (perhaps) possible
• Combine MT (statistical or rule-based) with OCR technology
  – take a picture of a text with your phone
  – the text is OCRed
  – the text is translated (usually a short and simple text)
  – Linguatec Shoot & Translate
  – Word Lens

MT: What is (perhaps) possible
• Combine MT (statistical or rule-based) with speech technology
  – complicates the problem on the one hand, but
  – speech technology (ASR) is currently restricted to very limited domains (which makes the MT simpler)
  – many useful applications of speech technology are currently on the market
    • directory assistance, tourist information
    • tourist communication, call centers
    • navigation, hotel reservations
  – some will profit from built-in automatic translation

MT: What is (perhaps) possible
• Large EC FP6 project TC-STAR (2004-)
  – http://www.tc-star.org/
  – research into improved speech technology (ASR and TTS)
  – research into statistical MT
  – research into combining both (speech-to-speech translation)
  – in a few selected limited domains

MT: What is (perhaps) possible
• Commercial speech-to-speech translation
  – Jibbigo
    • http://www.jibbigo.com
    • speech-to-speech translation (iPhone, Android)
    • http://www.phonedog.com/2009/10/30/iphone-appjibbigo-speech-translator
  – Talk to Me (Android phones)

MT: What is (perhaps) possible
• Controlled language
  – an authoring system limits the vocabulary and syntax available to document authors
  – often desirable in companies to get consistent documentation (e.g. aircraft maintenance manuals)
    • AECMA Simplified English
    • GIFAS Rationalized French
  – makes MT easier (the language is well defined)

MT: What is (perhaps) possible
• Limited domain
  – translation of
    • weather reports (TAUM-Météo, Canada)
    • avalanche warnings (Switzerland)
  – fast adaptation to domain-/company-specific vocabulary and terminology

MT: What is (perhaps) possible
• Interaction with the author
  – no fully automatic translation
  – the document author resolves
    • ambiguities unresolved by the system
    • in a dialogue between the author and the system, in the source language
  – the approach taken in the Rosetta project (Philips)
  – will only work if
    • the number of unresolved ambiguities is low
    • the questions asked to resolve an ambiguity are clear

MT: What is (perhaps) possible
• Hij droeg een bruin pak
  – Wat bedoelt u met "pak"? ('What do you mean by "pak"?')
    • (1) kostuum ('suit')
    • (2) pakket ('package')
• Hij droeg een bruin pak
  – Wat bedoelt u met "dragen (droeg)"? ('What do you mean by "dragen (droeg)"?')
    • (1) aan of op hebben ('to have on', clothing)
    • (2) bij zich hebben ('to carry with one', e.g. in the hand)

MT: What is (perhaps) possible
• Combinations of the above

MT: What is (perhaps) possible
• Computer-aided translation
  – for end users
  – for professional translators / the localization industry
• Limited functionality
  – specific terminology
• Bootstrap the translation automatically
  – human revision and correction (post-editing)
• Only if
  – the MT quality is such that it reduces effort
  – the system is fully integrated into the workflow system

Conclusions
• FAHQT is not possible (yet?)
• MT is really very difficult!
• Several constrained versions do yield usable technology with state-of-the-art MT
• In some cases this even potentially creates additional needs for MT and for human translation

Conclusions
• Statistical MT yields practical, relatively quick-to-produce systems (but of low quality)
• More research and lots of hard work are needed to get better systems
• This will probably require hybrid systems (mixed statistics-based/knowledge-based); the focus of research is here (PACO-MT, META-NET, …)
• It needs to be financed by niches where current state-of-the-art MT yields usable technology and there is a market.
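As an appendix to the evaluation section: the Word Accuracy formula WA = 1 − (d + s + i) / max(r, c) can be sketched with the standard word-level Levenshtein dynamic program. This is illustrative code under that definition, not the exact implementation of Alshawi et al.:

```python
def word_accuracy(candidate: str, reference: str) -> float:
    """WA = 1 - (d + s + i) / max(r, c), where d + s + i is the word-level
    Levenshtein distance between candidate and reference."""
    c_words, r_words = candidate.split(), reference.split()
    # dp[i][j] = edit distance between the first i candidate words
    # and the first j reference words
    dp = [[0] * (len(r_words) + 1) for _ in range(len(c_words) + 1)]
    for i in range(len(c_words) + 1):
        dp[i][0] = i
    for j in range(len(r_words) + 1):
        dp[0][j] = j
    for i in range(1, len(c_words) + 1):
        for j in range(1, len(r_words) + 1):
            sub = 0 if c_words[i - 1] == r_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    dist = dp[len(c_words)][len(r_words)]
    return 1 - dist / max(len(r_words), len(c_words))

print(word_accuracy("he wears a brown suit", "he wears a brown suit"))  # identical: 1.0
```

Swapping two words costs one deletion plus one insertion here, which is exactly the over-penalization of word-order swaps noted on the slides.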
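Likewise, the BLEU ingredients from the evaluation section (clipped "modified" n-gram precision, the geometric mean exp(∑ w_n log p_n) with w_n = 1/N, and the brevity penalty) can be sketched as follows. This is a simplified single-segment illustration; real BLEU accumulates counts and lengths over a whole test corpus:

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, references, n):
    """Clip each candidate n-gram count by its maximum count in any reference."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    # Brevity penalty: compare candidate length c with best-match reference length r
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of p_1..p_N with uniform weights w_n = 1/N
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to 0
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(round(modified_precision(cand, refs, 1), 2))  # 2/7 = 0.29, as on the slide
```

Without clipping, the degenerate candidate above would score a unigram precision of 7/7 = 1; with clipping it drops to 2/7, which is the whole point of the modified precision.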