Machine Translation activities at WIPO
Bruno Pouliquen, Christophe Mazenc
PATENTSCOPE workshop, June 2013

Agenda
1. History of machine translation activities at WIPO
2. Cross Lingual Search
3. Integration of third party MT engines
4. Development of in-house MT engines tuned for specific tasks
5. Strategy

1. History of MT activities at WIPO

MT at WIPO: history
Why is WIPO interested in Machine Translation?
► The International Bureau (IB) of the PCT is responsible for translating titles, abstracts, drawing legends, search reports, written opinions and IPRPs for the published PCT applications. (This represents xx million words translated per year.)
► WIPO disseminates multilingual patent information through its portal, PATENTSCOPE.
► Multilingual functions are required to enable the largest number of users worldwide to search and browse patent applications in many different languages.

MT at WIPO: an overview
Mid 2007: international RFP to implement “cross lingual search” functions in PATENTSCOPE
End of 2008: project failure due to the supplier’s inability to deliver a quality product
2009: first Statistical Machine Translation experiments performed in-house; development of a first engine to translate titles from English to French
2009-2010: development of the PATENTSCOPE CLIR system in 5 languages (EN, FR, DE, ES, JA)
Summer 2010: integration of Google Translate in PATENTSCOPE to translate result lists, descriptions and claims
March 2011: development and deployment of WIPO’s first own MT system, tuned for patent titles and abstracts (TAPTA)
April 2011: extension of CLIR to cover Chinese, Korean, Russian and Portuguese
August 2011: release of the PCT corpus COPPA
November 2011: integration of KIPO’s machine translation system in PATENTSCOPE for the KO-EN language pairs (until December 2012)
December 2011: integration of Microsoft Translator into PATENTSCOPE
January 2012: extension of CLIR to cover Dutch, Italian and Swedish
April 2012: PATENTSCOPE CLIR functionality integrated into Minesoft’s PATBASE through a web service
June 2012: transfer of MT know-how to the UN and ITU
October 2012: UN, ITU and WIPO Marks systems in production
November 2012: extension of TAPTA to cover Japanese and German
February 2013: evaluation results: TAPTA better than Microsoft and Google (titles + abstracts in all language pairs; similar results at the UN)
June 2013: outsourcing contract using TAPTA for the EN=>FR language pair

2.
CLIR (Cross-Lingual Information Retrieval)

WIPO’s cross-lingual search: principle
► Free tool available at http://patentscope.wipo.int/search/clir/clir.jsp?interfaceLanguage=en
► Enter a search query in any of EN, DE, ES, FR, JA, RU, ZH, PT, IT, DU (Dutch) or SE (Swedish) and it will be expanded into the other languages (keyword translation)
► Automatic or supervised mode
► Balance between precision and recall set by the user
► Disambiguation by technical domain and by selection of appropriate synonyms
► Built from bilingual dictionaries extracted statistically, without supervision, from patent corpora

Interface: Cross-lingual (CLIR) - Automatic

CLIR: automatically enriched query
(EN_TI:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21) OR EN_AB:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21))
OR (DE_TI:("Hörgeräte" OR "Hörhilfegeräten") OR DE_AB:("Hörgeräte" OR "Hörhilfegeräten"))
OR (ES_TI:("audífonos") OR ES_AB:("audífonos"))
OR (FR_TI:("audioprothèses" OR "appareils de correction auditive" OR "production d'appareils auditifs") OR FR_AB:("audioprothèses" OR "appareils de correction auditive" OR "production d'appareils auditifs"))
OR (JA_TI:("穴形補聴器") OR JA_AB:("穴形補聴器"))
OR (KO_TI:("보청") OR KO_AB:("보청"))
OR (PT_TI:("audiofone" OR "auxìlio de audição") OR PT_AB:("audiofone" OR "auxìlio de audição"))
OR (RU_TI:("слуха протезно"~22 OR "прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR "слухоулучшающих спидом"~22) OR RU_AB:("слуха протезно"~22 OR "прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR "слухоулучшающих спидом"~22))
OR (ZH_TI:("助听器") OR ZH_AB:("助听器"))

Why use PATENTSCOPE CLIR?
A) Search full-text collections simultaneously in many foreign languages without knowing them (not English-centric)
B) Significantly increase the number of relevant results without significantly increasing the number of irrelevant ones
   3356 results in English titles or abstracts for hearing AND aids
   3825 results obtained with CLIR searching in titles or abstracts in all languages
C) Have confidence in your searches: no black box. Users have access to the CLIR-generated boolean queries (albeit complex) and have full control over them
D) Have a responsive system even for complex queries: the query in the previous slide executes in less than 1/2 sec in PATENTSCOPE

What next?
► Improve terminology coverage of already supported languages
► Add other languages (Arabic)?
► Condition to add a language: more than 200,000 (ideally 500,000) titles, and if possible abstracts, available in the language with associated high-quality translations in English

3. Integrated third-party MT engines

9 interface languages: Deutsch | English | Español | Français | 日本語 | 한국어 | Português | Русский | 中文

Integrated 3rd party MT: principles
► Uses free MT services available on the internet (so far Google Translate and Microsoft Translator)
► Translates from the source language(s) to the language set by the user in the graphical interface
► Translates result lists, descriptions and claims only when requested by the user
► 65 languages supported using Google Translate!
► Quality of Google Translate improved for patent texts thanks to EPO sharing patent corpora with Google

Search Results – machine translate

Description – machine translate

4.
Development of in-house MT engines tuned for specific tasks

In-house MT engines
► MT system-building expertise developed in-house since 2009
► Corpus-based approach: started with the PCT corpus of titles and abstracts
► Uses open source Statistical Machine Translation: Moses (WIPO is a committer with a specific branch)
► First system developed: Translation Assistant for Patent Titles and Abstracts (TAPTA), publicly available at https://www3.wipo.int/patentscope/translate
► Same system (trained on different corpora) developed for the United Nations, for ITU and for translation of Madrid Trademarks goods and services

TAPTA
Hovering the mouse over a segment on the left highlights the corresponding segment on the right (and vice versa)

How well does it work?
TAPTA better than Google and Microsoft for abstracts
English->French (15 abstracts*): TAPTA BLEU 46.9 / Google 45.9 / Google-EPO 45.8 / Microsoft 36.7
German->English (11 titles & abstracts*): TAPTA BLEU 38.3 / Google 37.8 / Microsoft 26.8
Human evaluation, adequacy/fluency: TAPTA 79%, Google 65%, Microsoft 67%
English->Japanese (1000 segments, title & abstract*): TAPTA BLEU 25.4 / Google BLEU 22.3
English->Chinese (1000 segments, title & abstract*): TAPTA BLEU 22 / Google BLEU 17.5
(*) from recent patent applications (published in March 2013), compared to one reference

Also in United Nations
Aims at assisting UN translators when translating UN official documents from AR, ES, FR, RU, ZH into EN (both directions)

BLEU scores
Language pair   Google   Bing     TAPTA
ar-en           55.25    n/a[1]   51.17
en-ar           44.10    33.74    28.94
en-es           61.81    53.39    46.86
en-fr           51.23    45.58    42.19
en-ru           50.85    39.67    38.96
en-zh           43.17    34.16    32.77
es-en           60.32    52.54    49.18
fr-en           53.36    46.46    43.39
ru-en           58.56    47.71    47.09
zh-en           42.31    36.55    30.60

Findings
► Customized MT engines built on narrow language domains outperform state-of-the-art general-purpose MT engines
► TAPTA automatic evaluations are better than Google Translate on patent titles and abstracts (BLEU scores)
► Size of corpora matters, as well as quality of sentence-pair alignments
► Building customized SMT engines is sustainable and does not require large human, IT and financial resources

Bibliography
► TAPTA: A user-driven translation system for patent documents based on domain-aware Statistical Machine Translation. B. Pouliquen, C. Mazenc, A. Iorio. In proceedings of the European Association for Machine Translation conference, May 2011, Leuven, Belgium.
► COPPA, CLIR and TAPTA: three tools to assist in overcoming the Patent language barrier at WIPO. B. Pouliquen, C. Mazenc. In proceedings of Machine Translation Summit 2011, September 2011, Xiamen, China.
► Statistical Machine Translation prototype using UN parallel documents. B. Pouliquen, C. Mazenc, C. Elizalde, J. Garcia-Verdugo. In proceedings of the 16th EAMT conference, 28-30 May 2012, Trento, Italy (forthcoming).
► Large-scale multiple language translation accelerator at the United Nations. B. Pouliquen, C. Elizalde, M. Junczys-Dowmunt, C. Mazenc, J. Garcia-Verdugo. In proceedings of Machine Translation Summit 2013, Nice, France.

5. Strategy

WIPO’s MT strategy
► Make best use of state-of-the-art technologies available in open source and further promote their development
► Adapt these technologies to the patent domain (using patent corpora, patent classification, …) for practical use cases
► Develop patent MT systems and make them available to the largest number of users to bridge the language barrier (notably in patent searching)
► Cooperate with interested offices by sharing experience, corpora and software solutions
► Adopt barrier-free dissemination of patent corpora when possible, to foster research in MT for patent texts
► Investigate cloud technologies to be able to ramp up to industrial internet solutions
► TAPTA: extend coverage (languages, claims, descriptions)

Questions?
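
Appendix: sketch of the CLIR enriched query

The automatically enriched CLIR query shown in section 2 follows a regular pattern: for each language, the translated keywords are OR-ed together and searched in both the title (TI) and abstract (AB) fields, and some multi-word variants carry a "~N" suffix that appears to be a proximity setting. The Python sketch below only reproduces that textual pattern. It is not WIPO's implementation: the function names, the data layout and the hard-coded term lists are assumptions made for illustration, and in the real system the terms come from the statistically extracted bilingual dictionaries.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A term is (text, slop). slop=None means an exact quoted phrase;
# an integer is rendered as the "~N" suffix seen in the enriched query.
# This layout is hypothetical, chosen only for the sketch.
Term = Tuple[str, Optional[int]]


def format_term(text: str, slop: Optional[int] = None) -> str:
    """Quote a term and append the ~N suffix when a slop is given."""
    quoted = f'"{text}"'
    return f"{quoted}~{slop}" if slop is not None else quoted


def field_clause(lang: str, field: str, terms: Sequence[Term]) -> str:
    """Build one fielded clause such as EN_TI:("a" OR "b"~21)."""
    body = " OR ".join(format_term(text, slop) for text, slop in terms)
    return f"{lang}_{field}:({body})"


def build_clir_query(
    expansions: Dict[str, List[Term]],
    fields: Sequence[str] = ("TI", "AB"),
) -> str:
    """OR together one parenthesised group per language, each group
    searching the same terms in every requested field."""
    groups = []
    for lang, terms in expansions.items():
        clauses = " OR ".join(field_clause(lang, f, terms) for f in fields)
        groups.append(f"({clauses})")
    return " OR ".join(groups)


# Hard-coded expansions taken from the example query; a real system
# would look these up in its bilingual dictionaries.
example = {
    "EN": [("hearing aids", None), ("hearing prosthetic", 21)],
    "DE": [("Hörgeräte", None), ("Hörhilfegeräten", None)],
    "ZH": [("助听器", None)],
}
print(build_clir_query(example))
```

Keeping the terms as data and generating the query text from them mirrors the "no black box" point made earlier: the generated boolean query stays readable and can be edited by the user before it is run.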