This document is part of the Project “Machine Translation Enhanced Computer Assisted Translation (MateCat)”, funded by the 7th Framework Programme of the European Commission through grant agreement no.: 287688.

Project Periodic Report

Grant Agreement number: 287688
Project acronym: MateCat
Project title: Machine Translation Enhanced Computer Assisted Translation
Funding Scheme: STREP
Date of latest version of Annex I: June 30th, 2011
Periodic report: 1st □ 2nd x 3rd □ 4th □
Period covered: from November 1st 2012 to October 31st 2013

Project coordinator:
Name: Marcello Federico
Organisation: Fondazione Bruno Kessler (FBK)
Phone: +39 0461 314 552
Fax: +39 0461 314 591
E-mail: federico@fbk.eu
Project website address: http://www.matecat.com/

Consortium:
Fondazione Bruno Kessler (FBK), Italy
Université du Maine (LE MANS), France
The University of Edinburgh (UEDIN), UK
Translated S.r.l. (TRANSLATED), Italy

Machine Translation Enhanced Computer Assisted Translation
Project Periodic Report, year 2

TABLE OF CONTENTS
1 Publishable summary
Progress report
Research and development
Field test
Dissemination
Expected Results and Impact

1 Publishable summary

MateCat aims at improving the integration of machine translation and human translation within the so-called computer-aided translation (CAT) framework. CAT tools are now the dominant technology in the translation industry. They provide translators with text editors that can manage several document formats and arrange their content into segments (i.e. sentences) ready to be translated. Most importantly, CAT tools provide access to dictionaries, to translation memories (TMs), and more recently to machine translation (MT) engines.

A TM is essentially a repository of translated segments. During translation, the CAT tool queries the TM to search for exact or fuzzy matches of the current source segment. These matches are proposed to the translator as translation suggestions. Once a segment is translated, its source and target texts are added to the TM for future queries. More recently, when no good matches are found in the TM, suggestions from an MT engine are also supplied to the translator. Recent studies have shown that post-editing suggestions from a statistical MT engine can substantially improve the productivity of professional translators.

MateCat leverages the growing interest and expectations in statistical MT by advancing the state of the art along directions that are expected to accelerate its adoption by the translation industry. In particular, MateCat is investigating the integration of MT into the CAT working process along three main research directions:
● Self-tuning MT, i.e. methods to train statistical MT engines for specific domains or translation projects;
● User-adaptive MT, i.e. methods to quickly adapt statistical MT from user corrections and feedback;
● Informative MT, i.e. methods to supply users with additional information to enhance their productivity and work experience.
These new features have been integrated in a new Web-based CAT tool: the MateCat Tool. The MateCat Tool provides both a professional work environment integrating enhanced MT and a research platform to run post-editing experiments and to measure user productivity.

Progress in MateCat is systematically measured through field tests evaluating the utility and usability of the new MT features. Field tests are run with professional translators performing real translation projects with the MateCat Tool.

Progress report

Research and development

During the second year of the project, the consortium focused on research and development activities that covered all work packages foreseen in the work plan evenly.

Work on self-tuning MT focused on consolidating the project adaptation pipeline and started investigating alternative adaptation methods involving continuous space models and suffix array data structures. Both research strands are expected to lead to better domain and project adaptation methods, available for the Year 3 field test.

Work on user-adaptive MT developed, implemented and evaluated several original on-line adaptation methods, based on both generative and discriminative models (D2.1). In particular, generative methods based on cache models were integrated in the Moses engine and finally in the MT server. The whole pipeline for on-line learning was optimized for efficiency, given the close to real-time needs of the MateCat Tool, and for concurrent use by multiple users.

Work on informative MT concentrated on the development of MT quality estimation models. A state-of-the-art architecture was integrated in the MT server for on-line use within the MateCat Tool.
Further, work on terminology help led to an extension of the Moses decoder to properly handle terminology suggestions, and to the development of software and benchmarks for automatic terminology extraction (D3.1).

A major effort went into improving the MateCat Tool, which was deployed in the second field test in mid-October 2013. The user interface was enhanced with new functions (e.g. authentication, project management, preview and concordance search) and with support for new MT features (i.e. quality estimation and user adaptation). The second version of the MateCat Tool was released internally in September 2013 (D4.2) and will be released publicly by the end of 2013.

Field test

During October 2013, the consortium ran the second field test to evaluate the impact of self-tuning MT, user-adaptive MT and informative MT (D5.4). While the first two components were evaluated in terms of user productivity, the informative MT component was evaluated with respect to its usability.

During the second field test, MT engines trained on the information technology (IT) and legal domains were tested on two translation directions: English to Italian and English to French. The outcome of an intermediate field test run in early summer suggested replacing English-German with English-French, as performance on German is still not adequate for a post-editing task.

The field test was run with 16 professional translators, four for each domain and translation direction, on two real translation projects provided by the industrial partner. The IT translation project turned out, a posteriori, to be much harder than those carried out in past experiments. A careful follow-up study revealed that the field test project consisted of a collection of technical manuals with very heterogeneous content.
This diversity exposed unforeseen weaknesses in the project and online adaptation methods, which, especially for French, produced quite inconsistent results. In contrast, results on the legal domain confirmed, for both translation directions, the performance improvements measured in previous lab tests. Hence, the field test confirmed in this case that project and online MT adaptation could significantly improve the quality of MT suggestions and thus the productivity of human post-editors.

Dissemination

During the second year, the MateCat consortium increased its dissemination activities to promote the outcomes of the project among the scientific, industrial and user communities.

Activities were carried out to raise awareness of the MateCat project among the target audience. The consortium was present at several international workshops and conferences, such as the TriKonf Translation Conference, a TAUS Translation Technology Showcase, the MT Summit 2013, the annual meeting of the Association for Computational Linguistics (ACL), the International Workshop on Spoken Language Translation (IWSLT), the Machine Translation Marathon, and the conference of the Association for Machine Translation in the Americas. Printed promotional material and the social network Twitter have also supported the communication. Posters and flyers are available for download on the MateCat web site.

The User Group has been growing since the start of the project thanks to the dissemination activities and the network established by the consortium. In particular, the involvement of the User Group in testing MateCat has significantly increased over the course of the second year in connection with technical progress made by the project.

Expected Results and Impact

The project has continued to deliver good results on both the research and the industrial side.
Activities on self-tuning MT have consolidated the processing pipeline and started new investigations that will likely provide an improved pipeline during Year 3. Research on user-adaptive and informative MT has developed, successfully evaluated and integrated new methods for MT online adaptation, context-aware translation, and MT quality estimation. The second version of the MateCat Tool was released on schedule and now supports the latest MT features developed by the project.

Expected results for the next year are:
● Self-tuning MT: adoption of continuous space models with support of incremental project adaptation, adoption of suffix-array based models supporting online learning;
● User-adaptive MT: improved translation accuracy by exploiting context information of previous segments, and user corrections and feedback;
● Informative MT: tighter integration of quality estimation supporting online learning, integration of terminology and concordance help with alignment information;
● MateCat Tool: support of user-adaptive and informative MT, management of texts with mark-up.

To improve the impact of the MateCat Tool, the industrial partner has started to deploy it for a substantial percentage of its internal translation projects. So far, about 2 million words have been translated with the MateCat Tool. This extensive use of the tool has required very intensive bug fixing, which has improved the reliability of the code.

Dissemination activities have been significantly increased through technical papers, invited talks, and tutorials. The industrial partner has also started to promote the tool in industrial venues, and has involved an important IT player in an intensive testing activity, which included some customization of the MateCat Tool for localization projects. So far, about 600 people have registered on the MateCat website to try the Private Beta.