Free/Open-Source Machine Translation: a three-day tutorial at the Centre for Next Generation Localisation

Mikel L. Forcada
Universitat d’Alacant, E-03071 Alacant, Spain

Centre for Next Generation Localisation
April 14–16, 2009

1 Objectives

• To become acquainted with existing free/open-source machine translation (FOSMT) software
• To understand rule-based FOSMT using Apertium as an example
• To experience FOSMT development in the Apertium platform
• To understand corpus-based FOSMT using Moses as an example (*)
• To experience FOSMT development with Moses (*)
• To become aware of free/open-source resources which can be used to build FOSMT systems
• To understand how FOSMT can be used to generate new human-language technology tools

We will work on objectives marked with (*) if time allows.

2 Audience

PhD students and postdoctoral researchers in the areas of language technologies: machine translation, speech, natural language processing, digital content management (information retrieval, information extraction, adaptive hypermedia) and localization. Translators with basic to intermediate ICT skills may also attend.

3 Duration, structure, resources needed

Five two-hour sessions spread over two and a half days, as follows:

• Tuesday, April 14, 10:00–12:00 and 14:00–16:00, laboratory L1.14
• Wednesday, April 15, 10:00–12:00 and 14:00–16:00, laboratory L1.01
• Thursday, April 16, 10:00–12:00, laboratory L1.14

Both labs are in the School of Computing.

We’ll have two kinds of sessions:

• “Classroom” sessions will be organized to be as participative as possible, to accommodate different backgrounds.
• Laboratory sessions are designed to give participants a grasp of what FOSMT development may look like.

4 Tentative list of blocks & contents

1. Free/open-source software
   • Free software: the four basic freedoms. The ambiguity of the word free. Adoption of “open-source”. Free/open-source software (FOSS) versus freeware, shareware, and cost-free services on the net.
   • Copyleft as an addition to free licenses, and the creation and securing of a commons. Copylefted and non-copylefted licenses.
   • Analysis of typical non-free licenses used in academic and commercial settings.
   • FOSS and science.
   • The shift toward FOSS-based business models: advantages and vulnerabilities (Behlendorf 1999).
   • FOSS project management. Roles. Releases. Repositories, modification control. “Bazaar-style” versus “cathedral-style” development.

2. Free/open-source machine translation (FOSMT)
   • The three basic components of an MT system in development: engine, data, tools (ED&T). Basic tools (compiler), auxiliary tools (evaluation tools, dictionary management tools). FOSMT as MT in which ED&T are all free/open-source.
   • Free/open-source rule-based MT and corpus-based MT.
   • Copyleft applied to linguistic data. Pooling and the commons (Streiter et al. 2006). Minor languages (Forcada 2006).
   • Advantages of FOSMT: the advanced-user/developer continuum. In particular, advantages for minor languages.
   • Generation of resources for other language technologies (LT).
   • Using standard or well-documented formats: interoperability, transferability.
   • FOSMT development scenarios.
   • Examples of existing FOSMT software.
   • FOSMT project management.

3. Rule-based FOSMT: Apertium
   (a) Description
      • Background (interNOSTRUM, Universia)
      • Rationale
      • Apertium as a shallow-transfer machine translation platform: engine, data, tools (Armentano-Oller et al. 2006).
      • Language-pair data
      • Funding
      • The Apertium community as an example of FOSS development (repositories: the trunk, the incubator; roles).
      • Apertium tools: apertium-dixtools, apertium-transfertools, apertium-tagger-training-tools.
      • Apertium-based applications (Tinylex, WordPress plugin, Pidgin plugin, OpenOffice.org plugins, etc.)
      • Apertium as a research platform (Sánchez-Martínez and Forcada 2007, Sánchez-Martínez et al. 2008).
   (b) Laboratory: installing and modifying Apertium
      • Installing Apertium from the latest sources on a virtual Linux machine.
      • Changing the data for a language pair: vocabularies and, optionally, transfer rules.

4. Corpus-based FOSMT: Moses (*)
   (a) Statistical machine translation
      • Statistical MT (SMT)
      • The data: sentence-aligned corpora
      • Training: the statistical models
      • The SMT engine or “decoder”
   (b) Description of Moses and related software
      • Training: Giza++
      • Language models: irstlm
      • Tuning: MERT
      • Decoding: Moses
   (c) Laboratory: installing and running Moses
      • Installing Moses and all of the auxiliary software (Giza++, language model, etc.).
      • Training a toy SMT system with Moses.

5. Additional material (*)
   (a) Evaluation of machine translation
      • FOSS for evaluation: IQMT, etc.
      • Human evaluation: postediting and the traditional adequacy/fluency/informativeness criteria
      • Automatic evaluation: BLEU, etc. Criticism of automatic evaluation.
   (b) Free/open-source resources that can be used to build FOSMT systems
      • Morphological analysers, taggers: Freeling
      • Data useful for morphological analysers: Wiktionary (start of Breton, Icelandic and Faroese)
      • Free bilingual dictionaries: DACCO, etc.
      • Parallel corpora: Europarl, OPUS, etc.
      • ReTraTos: Perl tools to build bilingual dictionaries from aligned corpora
      • Free text: Wikipedia (easy access to free “raw” text)
      • Bitext tools: bitextor, tagaligner, etc.
   (c) Related software
      • Non-MT FOSS for translation: OmegaT (translation memories), etc.

Blocks marked with (*) will be dealt with if time allows.

4.1 References

• Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada, M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez, F., Sánchez-Martínez, F., Scalco, M.
(2006) “Open-source Portuguese–Spanish machine translation”, in Lecture Notes in Computer Science 3960, 50–59 (Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006, May 13–17, 2006, Itatiaia, Rio de Janeiro, Brazil). Available online: http://www.dlsi.ua.es/~fsanchez/pub/pdf/armentano06.pdf
• Behlendorf, B. (1999) “Open source as a business strategy”, in DiBona, C., Ockman, S., Stone, M., eds., Open Sources: Voices from the Open Source Revolution, O’Reilly. Available online: http://oreilly.com/catalog/opensources/book/toc.html
• Forcada, M.L. (2006) “Open source machine translation: an opportunity for minor languages”, in LREC-2006: Fifth International Conference on Language Resources and Evaluation. 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy, 23 May 2006; pp. 1–6. Available online: http://www.mt-archive.info/LREC-2006-Forcada.pdf, http://www.dlsi.ua.es/~mlf/docum/forcada06p2.pdf
• Sánchez-Martínez, F., Forcada, M.L. (2007) “Automatic induction of shallow-transfer rules for open-source machine translation”, in TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde, Sweden, 7–9 September 2007; pp. 181–190. Available online: http://www.mt-archive.info/TMI-2007-Sanchez-Martinez.pdf, http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07c.pdf
• Sánchez-Martínez, F., Pérez-Ortiz, J.A., Forcada, M.L. (2008) “Using target-language information to train part-of-speech taggers for machine translation”, Machine Translation 22(1–2), 29–66.
• Streiter, O., Scannell, K.P., Stuflesser, M. (2006) “Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers”, Machine Translation 20(4), 267–289. Draft available online: http://borel.slu.edu/pub/mt.pdf