Free/Open-Source Machine
Translation: a three-day tutorial at the
Centre for Next Generation Localisation
Mikel L. Forcada
Universitat d’Alacant, E-03071 Alacant, Spain
Centre for Next Generation Localisation
April 14–16, 2009
Objectives
• To become acquainted with existing free/open-source machine translation (FOSMT) software
• To understand rule-based FOSMT using Apertium as an example
• To experience FOSMT development in the Apertium platform
• To understand corpus-based FOSMT using Moses as an example (*)
• To experience FOSMT development with Moses (*)
• To become aware of free/open-source resources which can be used
to build FOSMT systems
• To understand how FOSMT can be used to generate new human-language technology tools
We will work on objectives marked with (*) if time allows.
Audience
PhD students and postdoctoral researchers in the areas of language technologies: machine translation, speech, natural language processing, digital content management (information retrieval, information extraction,
adaptive hypermedia) and localization.
The tutorial may also be attended by translators with basic to intermediate ICT skills.
Duration, structure, resources needed
Five two-hour sessions spread over two and a half days as follows:
• Tuesday, April 14, 10:00–12:00 and 14:00–16:00, laboratory L1.14
• Wednesday, April 15, 10:00–12:00 and 14:00–16:00, laboratory L1.01
• Thursday, April 16, 10:00–12:00, laboratory L1.14
Both labs are in the School of Computing. We’ll have two kinds of sessions:
• “Classroom” sessions will be organized to be as participative as possible, so that they can adapt to different backgrounds.
• Laboratory sessions are designed for participants to get a grasp of
what FOSMT development may look like.
Tentative list of blocks & contents
1. Free/open-source software
• Free software: The four basic freedoms. The ambiguity
of the word free. Adoption of “open-source”. Free/open-source software (FOSS) versus freeware, shareware,
and cost-free services on the net.
• Copyleft as an addition to free licenses, and the creation and securing of a commons. Copylefted and non-copylefted licenses.
• Analysis of typical non-free licenses used in academic
and commercial settings.
• FOSS and science.
• The shift toward FOSS-based business models: advantages and vulnerabilities (Behlendorf 1999)
• FOSS project management. Roles. Releases. Repositories, modification control.
“Bazaar-style” versus
2. Free/open-source machine translation (FOSMT)
• The three basic components of an MT system in development: engine, data, tools (ED&T). Basic tools (compiler), auxiliary tools (evaluation tools, dictionary management tools). FOSMT as MT in which ED&T are all free/open-source.
• Free/open-source rule-based MT and corpus-based MT
• Copyleft applied to linguistic data. Pooling and the commons (Streiter et al. 2006). Minor languages (Forcada 2006).
• Advantages of FOSMT: The advanced-user/developer
continuum. In particular, advantages for minor languages.
• Generation of resources for other language technologies (LT).
• Using standard or well-documented formats: interoperability, transferability.
• FOSMT development scenarios
• Examples of existing FOSMT software.
• FOSMT project management.
3. Rule-based FOSMT: Apertium
(a) Description
• Background (interNOSTRUM, Universia)
• Rationale
• Apertium as a shallow-transfer machine translation platform: engine, data, tools (Armentano et al. 2006).
• Language-pair data
• Funding
• The Apertium community as an example of FOSS development (repositories: the trunk, the incubator; roles).
• Apertium tools: apertium-dixtools, apertium-transfer-tools, apertium-tagger-training-tools.
• Apertium-based applications (Tinylex, Wordpress plugin,
Pidgin plugin, plugins, etc.)
• Apertium as a research platform (Sánchez-Martínez and
Forcada 2007, Sánchez-Martínez et al. 2008).
(b) Laboratory: installing and modifying Apertium
• Installing Apertium from the latest sources on a virtual
Linux machine.
• Changing the data for a language pair: vocabularies and,
optionally, transfer rules.
4. Corpus-based FOSMT: Moses (*)
(a) Statistical machine translation
• Statistical MT (SMT)
• The data: sentence-aligned corpora.
• Training: the statistical models
• The SMT engine or “decoder”
(b) Description of Moses and related software:
• Training: Giza++
• Language models: irstlm
• Tuning: MERT
• Decoding: Moses
(c) Laboratory: installing and running Moses.
• Installing Moses and all of the auxiliary software (Giza++,
language model, etc.).
• Training a toy SMT system with Moses.
5. Additional material (*)
(a) Evaluation of machine translation
• FOSS for evaluation: IQMT, etc.
• Human evaluation: postediting and the traditional adequacy/fluency/informativeness
• Automatic evaluation: BLEU, etc. Criticism of automatic
(b) Free/open-source resources that can be used to build FOSMT
• Morphological analysers, taggers: Freeling
• Data useful for morphological analysers: Wiktionary (start
of Breton, Icelandic and Faroese)
• Free bilingual dictionaries: DACCO, etc.
• Parallel corpora: Europarl, OPUS, etc.
• ReTraTos: Perl tools to build bilingual dictionaries from
aligned corpora
• Free text: Wikipedia (easy access to free “raw” text)
• Bitext tools: bitextor, tagaligner, etc.
(c) Related software
Non-MT FOSS for translation: OmegaT (translation memories), etc.
Blocks marked with (*) will be dealt with if time allows.
References
• Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada,
M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez, G., Sánchez-Martínez, F., Scalco, M. (2006) “Open-source
Portuguese–Spanish machine translation”, in Lecture Notes in Computer Science 3960, 50–59 (Computational Processing of the Portuguese
Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR
2006, May 13–17, 2006, Itatiaia, Rio de Janeiro, Brazil). Available online: ~fsanchez/pub/pdf/armentano06.
• Behlendorf, Bruce (1999) “Open source as a business strategy”, in
DiBona, C., Ockman, S., Stone, M., eds. Open Sources: Voices from the
Open Source Revolution, O’Reilly. Available online: http://oreilly.
• Forcada, Mikel (2006) “Open source machine translation: an opportunity for minor languages”, in LREC-2006: Fifth International Conference on Language Resources and Evaluation. 5th SALTMIL Workshop
on Minority Languages: “Strategies for developing machine translation for
minority languages”, Genoa, Italy, 23 May 2006; pp. 1–6. Available online: ~mlf/docum/forcada06p2.pdf
• Sánchez-Martínez, F., Forcada, Mikel L. (2007) “Automatic induction of shallow-transfer rules for open-source machine translation”,
in TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde [Sweden],
7–9 September 2007; pp. 181–190. Available online: http://www., http:
• Sánchez-Martínez, F., Pérez-Ortiz, J.A., Forcada, M.L. (2008) “Using target-language information to train part-of-speech taggers for machine translation”, Machine Translation 22(1–2):29–66.
• Streiter, O., Scannell, K.P., Stuflesser, M. (2006) “Implementing NLP projects for noncentral languages: instructions for funding
bodies, strategies for developers”, Machine Translation 20(4):267–289.
Draft available online: