Project Periodic Report

advertisement
This document is part of the Project “Machine Translation Enhanced Computer Assisted Translation (MateCat)”, funded by the
7th Framework Programme of the European Commission through grant agreement no.: 287688.
Project Periodic Report
Grant Agreement number: 287688
Project acronym:
MateCat
Project title:
Machine Translation Enhanced Computer Assisted Translation
Funding Scheme:
STREP
Date of latest version of Annex I: June 30th, 2011
□
2nd
x
3rd
□
4th
□
Periodic report:
1st
Period covered:
from November 1st 2012 to October 31st 2013
Project coordinator:
Name: Marcello Federico
Organisation: Fondazione Bruno Kessler (FBK)
Phone: +39 0461 314 552
Fax: +39 0461 314 591
E-mail: federico@fbk.eu
Project website address: http://www.matecat.com/
Consortium:
Fondazione Bruno Kessler (FBK), Italy
Université du Maine (LE MANS), France
The University of Edinburgh (UEDIN), UK
Translated S.r.l. (TRANSLATED), Italy
Machine Translation Enhanced Computer Assisted Translation
Project Periodic Report, year 2
TABLE OF CONTENTS
1
Publishable summary ........................................................................ 3
Progress report ................................................................................................................................ 4
Research and development ............................................................................................................. 4
Field test .......................................................................................................................................... 5
Dissemination .................................................................................................................................. 5
Expected Results and Impact .......................................................................................................... 6
2/48
Machine Translation Enhanced Computer Assisted Translation
Project Periodic Report, year 2
1
Publishable summary
MateCat aims at improving the integration of machine translation and human translation
within the so-called computer-aided translation (CAT) framework.
CAT tools represent nowadays the dominant technology in the translation industry. They
provide translators with text editors that can manage several document formats and suitably
arrange their content into segments -- i.e. sentences --, ready to be translated. Most
importantly, CAT tools provide access to dictionaries, to translation memories (TMs), and
more recently to machine translation (MT) engines. A TM is basically a repository of translated
segments. During translation, the CAT tool queries the TM to search for exact or fuzzy matches
of the current source segment. These matches are proposed to the translator as translation
suggestions. Once a segment is translated, its source and target texts are added to the TM for
future queries. Recently, when no good matches are found in the TM, suggestions from an MT
engine are also supplied to the translator.
Recent studies have shown that post-editing suggestions from a statistical MT engine can
substantially improve productivity of professional translators. MateCat leverages the growing
interest and expectations in statistical MT by advancing the state of the art along directions
that will hopefully accelerate its adoption by the translation industry. In particular, MateCat is
investigating the integration of MT into the CAT working process along three main research
directions:
● Self-tuning MT, i.e. methods to train statistical MT engines for specific domains or
translation projects;
● User adaptive MT, i.e. methods to quickly adapt statistical MT from user corrections
and feedback.
● Informative MT, i.e. supply users with additional information to enhance their
productivity and work experience.
These new features have been integrated in a new Web-based CAT tool: the MateCat Tool. The
MateCat Tool provides both a professional work environment integrating enhanced MT and a
research platform to run post-editing experiments and to measure user productivity.
3/48
Machine Translation Enhanced Computer Assisted Translation
Project Periodic Report, year 2
Progress in MateCat is systematically measured through field tests evaluating the utility and
usability of the new MT features. Field tests are run with professional translators performing
real translation projects with the MateCat Tool.
Progress report
Research and development
During the second year of the project the consortium focused on research and development
activities covering equally well all work-packages foreseen in the workplan.
Work on self-tuning MT focused on consolidating the project adaptation pipeline and started
investigating new alternative adaptation methods involving continuous space models and
suffix array data structures. The expectation is that both research strands will lead to better
domain and project adaptation methods to be available for the field test of Year 3.
Work on user-adaptive MT developed, implemented and evaluated several original on-line
adaptation methods, based both on generative and discriminative models (D2.1). In particular,
generative methods based on cache models were integrated in the Moses engine and finally in
the MT server. The whole pipeline for on-line learning was optimized for efficiency, given the
close to real-time needs of the MateCat tool, and for concurrent use by multiple users.
Work on informative MT concentrated on the development of MT quality estimation models.
A state-of-the-art architecture was integrated in the MT server for on-line use within the
MateCat Tool. Further, work on terminology help was conducted which leads to an extension
of the Moses decoder in order to properly handle terminology suggestions, and to the
development of software and benchmarks for automatic terminology extraction (D3.1). A
major effort was put to improve the MateCat Tool that has been deployed in the second field
test in mid October 2013. Enhancements have been introduced in the user interface, which
includes new functions –e.g. authentication, project management, preview, concordance
search, etc.-, and supports new MT features –i.e. quality estimation and user adaptation. The
second version of the MateCat Tool has been released internally in September 2013 (D4.2) and
will be released publicly by the end of 2013.
4/48
Machine Translation Enhanced Computer Assisted Translation
Project Periodic Report, year 2
Field test
During the month of October, the consortium ran the second field test to evaluate the impact
of self-tuning MT, user adaptive MT and informative MT (D5.4). While the first two
components were evaluated in terms of user productivity, the informative MT component was
evaluated with respect to its usability. During the second field test, MT engines trained on the
information technology (IT) and legal domains were tested on two translation directions:
English to Italian and English to French. In fact, the outcome of an intermediate field test run
in early summer suggested to replace English-German with English-French, as performance
on German are still not adequate for a post-editing task. The field test was run with 16
professional translators, four for each domain and translation direction, on two real translation
projects provided by the industrial partner. Indeed, the IT translation project resulted (a
posteriori) much harder than those carried out in past experiments. A careful follow-up study
revealed that the field test project consisted in a collection of technical manuals with very
heterogeneous content. This diversity let emerge unforeseen weaknesses in the project and
online adaptation methods, which, especially for French, resulted in quite inconsistent results.
On the contrary, results on the legal domain confirmed for both translation directions
performance improvements as measured in previous lab tests. Hence, the field test confirmed
in this case that project and online MT adaptation could significantly improve the quality of
MT suggestions and thus the productivity of human post-editors.
Dissemination
During the second year, the MateCat consortium has increased its dissemination activities to
promote the outcomes of the project among the scientific, industrial and user communities.
Activities were carried out to raise awareness on the MateCat project among the target
audience. The consortium was present at several international workshops and conferences,
such as the TriKonf Translation Conference, a TAUS Translation Technology Showcase, the
MT Summit 2013, the annual meeting of the Association for Computational Linguistics (ACL),
the International Workshop on Spoken Language Translation (IWSLT), the Machine
Translation Marathon, and the conference of the Association for Machine Translation in the
Americas.
Printed promotional material and the social network Twitter have also supported the
communication. Posters and flyer are available for download on the MateCat web site.
The User Group has been growing since the start of the project thanks to the dissemination
activities and the network established by the consortium. In particular, the involvement of the
5/48
Machine Translation Enhanced Computer Assisted Translation
Project Periodic Report, year 2
User Group in testing MateCat has significantly increased over the course of the second year
in connection with technical progress made by the project.
Expected Results and Impact
The project has continued to deliver good results both on the research and the industrial side.
Activities on self-tuning MT have consolidated the processing pipeline, and started new
investigation that will likely provide an improved pipeline during Year 3. Research on useradaptive and informative MT has developed and successfully evaluated and integrated new
methods for MT online adaptation, context aware translation, and MT quality estimation. The
second version of the MateCat Tool was timely released, which now supports the latest MT
features developed by the project. Expected results for the next year are:
● Self-tuning MT: adoption of continuous space models with support of incremental
project adaptation, adoption of suffix-array based models supporting online-learning;
● User-adaptive MT: improved translation accuracy by exploiting context information of
previous segments, and user corrections and feedback;
● Informative MT: tighter integration of quality estimation supporting online learning,
integration of terminology and concordance help with alignment information;
● MateCat Tool: support of user-adaptive and informative MT, management of texts with
mark-up.
To improve the impact of the MateCat Tool, the industrial partner has started to deploy the
MateCat Tool for a substantial percentage of its internal translation projects. So far, about 2
million of words have been translated with the MateCat tool. This extensive use of the tool has
required a very intensive bug fixing activity, which has definitely improved the reliability of the
code. Dissemination activities have been significantly increased, through technical papers,
invited talks, and tutorials. The industrial partner has also started to promote the tool in
industrial venues, and has involved an important IT player in an intensive testing activity,
which included same customization of the MateCat Tool for localization projects. So far about
600 people have registered on the MateCat website to try the Private Beta.
6/48
Download