Machine Translation
activities at WIPO
Bruno Pouliquen,
Christophe Mazenc
Patentscope workshop
June 2013
Agenda
1. History of machine translation activities at WIPO
2. Cross Lingual Search
3. Integration of third party MT engines
4. Development of in-house MT engines tuned for
specific tasks
5. Strategy
1. History of MT activities
at WIPO
MT at WIPO: history
Why is WIPO interested in Machine Translation?
The IB of the PCT is responsible for translating titles,
abstracts, drawing legends, search reports, written
opinions and IPRPs for the published PCT applications.
(This represents xx million words translated per year)
WIPO disseminates multilingual patent information
through its portal, PATENTSCOPE. Multilingual
functions are required to enable the largest number of
users worldwide to search and browse patent
applications in many different languages
MT at WIPO: an overview
Mid 2007: International RFP to implement “Cross Lingual
Search” functions in PATENTSCOPE
End of 2008: project failure due to the supplier’s inability
to deliver a quality product
2009: First Statistical Machine Translation experiments
performed in-house. Development of a first engine to
translate titles from English to French
2009-2010: development of the PATENTSCOPE CLIR
system in 5 languages (EN, FR, DE, ES, JA)
Summer 2010: Integration of Google Translate in
PATENTSCOPE to translate result lists, descriptions and
claims
March 2011: development and deployment of WIPO’s
first own MT system tuned for patent titles and
abstracts (TAPTA)
April 2011: extension of CLIR to cover the Chinese,
Korean, Russian and Portuguese languages
August 2011: release of PCT corpus: COPPA
November 2011: Integration of KIPO’s machine
translation system in PATENTSCOPE (for the KO<->EN
language pairs) (until December 2012)
December 2011: Integration of Microsoft Translate into
PATENTSCOPE
January 2012: extension of CLIR to cover the Dutch,
Italian, and Swedish languages
April 2012: PATENTSCOPE CLIR functionality
integrated into Minesoft’s PATBASE through a web
service
June 2012: transfer of MT knowledge to the UN and
ITU
October 2012: UN, ITU and WIPO Marks systems in production
November 2012: extension of TAPTA to cover Japanese
and German
February 2013: evaluation results: TAPTA better than
Microsoft and Google (title + abstract in all language pairs,
similar results at the UN)
June 2013: Outsourcing contract using TAPTA for the
EN=>FR language pair
2. CLIR
(Cross Lingual Information Retrieval)
WIPO’s Cross-lingual search:
principle
► Free tool available at
http://patentscope.wipo.int/search/clir/clir.jsp?interfaceLanguage=en
► Enter a search query in any of EN, DE, ES, FR, JA, RU,
ZH, PT, IT, NL or SV and it will be expanded into the
other languages (keyword translation)
► Automatic or supervised mode
► Balance between precision and recall set by the user
► Disambiguation by technical domains and by selection
of appropriate synonyms
► Built from bilingual dictionaries extracted statistically
from patent corpora without supervision
Interface: Cross-lingual (CLIR) - Automatic
CLIR: automatically enriched query
(EN_TI:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR
"auditory prosthetic"~21) OR EN_AB:("hearing aids" OR "hearing
prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21)) OR
(DE_TI:("Hörgeräte" OR "Hörhilfegeräten") OR DE_AB:("Hörgeräte" OR
"Hörhilfegeräten")) OR (ES_TI:("audífonos") OR ES_AB:("audífonos")) OR
(FR_TI:("audioprothèses" OR "appareils de correction auditive" OR "production
d'appareils auditifs") OR FR_AB:("audioprothèses" OR "appareils de correction
auditive" OR "production d'appareils auditifs")) OR (JA_TI:("穴形補聴器") OR
JA_AB:("穴形補聴器")) OR (KO_TI:("보청") OR KO_AB:("보청")) OR
(PT_TI:("audiofone" OR "auxìlio de audição") OR PT_AB:("audiofone" OR
"auxìlio de audição")) OR (RU_TI:("слуха протезно"~22 OR "прослушивания
протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR
"прослушивания спидом"~22 OR "прослушивания наведения"~22 OR
"слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR
"слухоулучшающих спидом"~22) OR RU_AB:("слуха протезно"~22 OR
"прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха
наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания
наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой
протезно"~22 OR "слухоулучшающих спидом"~22)) OR (ZH_TI:("助听器")
OR ZH_AB:("助听器"))
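The structure of the enriched query above can be reproduced with a few lines of code. The following is a minimal sketch, not PATENTSCOPE's implementation: the function names and the hand-written `expansions` dictionary are hypothetical (the real system draws its translations from statistically extracted dictionaries), and the `~21` suffix is simply copied from the sample query as an optional per-term value.

```python
def quote(term, slop=None):
    """Quote a phrase; append ~slop when a proximity value is given."""
    phrase = '"{}"'.format(term)
    return phrase if slop is None else "{}~{}".format(phrase, slop)


def field_clause(field, terms):
    """Build FIELD:("a" OR "b"~21) from a list of (term, slop) pairs."""
    return "{}:({})".format(field, " OR ".join(quote(t, s) for t, s in terms))


def build_clir_query(expansions, fields=("TI", "AB")):
    """Combine per-language term lists into one Boolean query.

    expansions maps a language code to a list of (term, slop) pairs,
    where slop is None for an exact phrase. Each language is searched
    in its own title (TI) and abstract (AB) fields.
    """
    groups = []
    for lang, terms in expansions.items():
        clauses = [field_clause("{}_{}".format(lang, f), terms) for f in fields]
        groups.append("(" + " OR ".join(clauses) + ")")
    return " OR ".join(groups)


# Hypothetical hand-written expansions, taken from the sample query.
expansions = {
    "EN": [("hearing aids", None), ("hearing prosthetic", 21)],
    "DE": [("Hörgeräte", None), ("Hörhilfegeräten", None)],
    "ZH": [("助听器", None)],
}
print(build_clir_query(expansions))
```

Because the output is an ordinary Boolean query, the user can still edit it by hand before running the search.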
Why use PATENTSCOPE CLIR?
A) Search full-text collections simultaneously in many foreign
languages without knowing them (not English-centric)
B) Significantly increase the number of relevant results without
significantly increasing the number of irrelevant results
3356 results in English titles or abstracts for hearing AND aids
3825 results obtained with CLIR searching in titles or abstracts
in all languages
C) Have confidence in your searches:
No black box: users can see the Boolean queries generated by CLIR
(albeit complex) and have full control over them
D) Have a responsive system even for complex queries:
the query in the previous slide executes in less than half a second in
PATENTSCOPE
What next?
Improve terminology coverage of already supported
languages
Add other languages (Arabic)?
Condition for adding a language:
more than 200,000 (ideally 500,000) titles, and if
possible abstracts, available in that language with
associated high-quality translations in English
3. Integrated third-party
MT engines
9 Interface languages:
Deutsch |English|Español |Français |日本語 | 한국어 |Português |Русский |中文 |
Integrated 3rd party MT:
principles
► Use free MT services available on the internet (so far
Google Translate and Microsoft Translate)
► Translate from the source language(s) to the
language set by the user in the graphical interface
► Translate result lists, descriptions and claims only
when requested by the user
► 65 languages supported using Google Translate!
► Quality of Google Translate improved for patent texts
thanks to EPO sharing patent corpora with Google
Search Results – machine translate
Description – machine translate
4. Development of in-house
MT engines tuned for specific
tasks
In-house MT engines
MT system-building expertise developed in-house since
2009
Corpus-based approach: started with the PCT corpus of titles
and abstracts
Uses open-source Statistical Machine Translation software:
Moses (WIPO is a committer with a specific branch)
First system developed: Translation Assistant for Patent
Titles and Abstracts (TAPTA, publicly available at
https://www3.wipo.int/patentscope/translate)
Same system (trained on different corpora) developed
for the United Nations, for ITU and for the translation of
Madrid trademark goods and services
TAPTA
Hovering the mouse over a segment on the left highlights
the corresponding segment on the right (and
vice versa)
How well does it work?
TAPTA better than Google and Microsoft for abstracts
English->French (15 abstracts*):
TAPTA BLEU 46.9 / Google 45.9 / Google-EPO 45.8 / Microsoft 36.7
German->English (11 titles & abstracts*):
TAPTA BLEU 38.3 / Google 37.8 / Microsoft 26.8
Human evaluation: adequacy/fluency (TAPTA 79%, Google 65%, Microsoft 67%)
English->Japanese (1000 segments, title & abstract*):
TAPTA BLEU 25.4 / Google 22.3
English->Chinese (1000 segments, title & abstract*):
TAPTA BLEU 22 / Google 17.5
(*) from recent patent applications (published in March 2013), compared to one reference
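For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score against a single reference can be computed: clipped n-gram precisions for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It is an illustration only, not the evaluation script behind the figures above. The function names are hypothetical, and whitespace tokenization is an assumption that would not work for unsegmented Japanese or Chinese text.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per segment.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # n-grams produced by the system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # An n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Made-up segments for illustration only.
system_output = ["a hearing aid with a rechargeable battery"]
reference = ["a hearing aid having a rechargeable battery"]
print(round(corpus_bleu(system_output, reference), 1))
```

Scores depend on tokenization and on the number of references, so figures are only comparable when every system is scored on the same test set in the same way, as in the comparisons above.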
Also at the United Nations
Aims to assist UN translators
when translating UN official
documents between EN and
AR, ES, FR, RU, ZH (both
directions)
BLEU scores
Language pair   Google   Bing     TAPTA
ar-en           55.25    n/a[1]   51.17
en-ar           44.10    33.74    28.94
en-es           61.81    53.39    46.86
en-fr           51.23    45.58    42.19
en-ru           50.85    39.67    38.96
en-zh           43.17    34.16    32.77
es-en           60.32    52.54    49.18
fr-en           53.36    46.46    43.39
ru-en           58.56    47.71    47.09
zh-en           42.31    36.55    30.60
Findings
Customized MT engines built on narrow language
domains outperform state-of-the-art general-purpose MT
engines
TAPTA scores better than Google Translate in automatic
evaluations on patent titles and abstracts (BLEU scores)
Size of corpora matters, as does the quality of sentence-pair alignments
Building customized SMT engines is sustainable and
does not require large human, IT and financial resources
Bibliography
TAPTA: A user-driven translation system for patent documents based
on domain-aware Statistical Machine Translation, B. Pouliquen, C.
Mazenc, A. Iorio, in Proceedings of the European Association for Machine
Translation conference, May 2011, Leuven, Belgium
COPPA, CLIR and TAPTA: three tools to assist in overcoming the
Patent language barrier at WIPO, B. Pouliquen, C. Mazenc, in Proceedings
of Machine Translation Summit 2011, September 2011, Xiamen, China
Statistical Machine Translation prototype using UN parallel
documents, B. Pouliquen, C. Mazenc, C. Elizalde, J. Garcia-Verdugo, in
Proceedings of the 16th EAMT conference, 28-30 May 2012, Trento, Italy
(forthcoming) Large-scale multiple language translation
accelerator at the United Nations, B. Pouliquen, C. Elizalde, M.
Junczys-Dowmunt, C. Mazenc, J. Garcia-Verdugo, in Proceedings of
Machine Translation Summit 2013, Nice, France
5. Strategy
WIPO’s MT strategy
Make the best use of state-of-the-art technologies available in open
source and further promote their development
Adapt these technologies to the patent domain (using patent
corpora, patent classification, ...) for practical use cases
Develop patent MT systems and make them available to the largest
number of users to bridge the language barrier (notably in patent
searching)
Cooperate with interested offices by sharing experience, corpora
and software solutions
Adopt barrier-free dissemination of patent corpora when possible
to foster research in MT for patent texts
Investigate cloud technologies to be able to ramp up to industrial
internet solutions
TAPTA: extend coverage (languages, claims, descriptions)
Questions?