Free/Open-Source Machine
Translation: a three-day tutorial at the
Centre for Next Generation Localisation
Mikel L. Forcada
Universitat d’Alacant, E-03071 Alacant, Spain
Centre for Next Generation Localisation
April 14–16, 2009
1 Objectives
• To become acquainted with existing free/open-source machine translation (FOSMT) software
• To understand rule-based FOSMT using Apertium as an example
• To experience FOSMT development in the Apertium platform
• To understand corpus-based FOSMT using Moses as an example (*)
• To experience FOSMT development with Moses (*)
• To become aware of free/open-source resources which can be used
to build FOSMT systems
• To understand how FOSMT can be used to generate new human-language technology tools
We will work on objectives marked with (*) if time allows.
2 Audience
PhD students and postdoctoral researchers in the areas of language technologies: machine translation, speech, natural language processing, digital content management (information retrieval, information extraction,
adaptive hypermedia) and localization.
Translators with basic to intermediate ICT skills may also attend.
3 Duration, structure, resources needed
Five two-hour sessions spread over two and a half days, as follows:
• Tuesday, April 14, 10:00–12:00 and 14:00–16:00, laboratory L1.14
• Wednesday, April 15, 10:00–12:00 and 14:00–16:00, laboratory L1.01
• Thursday, April 16, 10:00–12:00, laboratory L1.14
Both labs are in the School of Computing. We’ll have two kinds of sessions:
• “Classroom” sessions will be as participative as possible, to accommodate participants’ different backgrounds.
• Laboratory sessions are designed for participants to get a grasp of
what FOSMT development may look like.
4 Tentative list of blocks & contents
1. Free/open-source software
• Free software: The four basic freedoms. The ambiguity
of the word free. Adoption of “open-source”. Free/open-source
software (FOSS) versus freeware, shareware,
and cost-free services on the net.
• Copyleft as an addition to free licenses, and the creation and securing of a commons. Copylefted and non-copylefted licenses.
• Analysis of typical non-free licenses used in academic
and commercial settings.
• FOSS and science.
• The shift toward FOSS-based business models: advantages and vulnerabilities (Behlendorf 1999)
• FOSS project management. Roles. Releases. Repositories, modification control. “Bazaar-style” versus “cathedral-style” development.
2. Free/open-source machine translation (FOSMT)
• The three basic components of a MT system in development: engine, data, tools (ED&T). Basic tools (compiler), auxiliary tools (evaluation tools, dictionary management tools). FOSMT as MT in which ED&T are all
free/open-source.
• Free/open-source rule-based MT and corpus-based MT
• Copyleft applied to linguistic data. Pooling and the commons (Streiter et al. 2006). Minor languages (Forcada
2006).
• Advantages of FOSMT: The advanced-user/developer
continuum. In particular, advantages for minor languages.
• Generation of resources for other language technologies (LT).
• Using standard or well-documented formats: interoperability, transferability.
• FOSMT development scenarios
• Examples of existing FOSMT software.
• FOSMT project management.
3. Rule-based FOSMT: Apertium
(a) Description
• Background (interNOSTRUM, Universia)
• Rationale
• Apertium as a shallow-transfer machine translation platform: engine, data, tools (Armentano-Oller et al. 2006).
• Language-pair data
• Funding
• The Apertium community as an example of FOSS development (repositories: the trunk, the incubator; roles).
• Apertium tools: apertium-dixtools, apertium-transfer-tools, apertium-tagger-training-tools.
• Apertium-based applications (Tinylex, Wordpress plugin,
Pidgin plugin, OpenOffice.org plugins, etc.)
• Apertium as a research platform (Sánchez-Martínez and
Forcada 2007, Sánchez-Martínez et al. 2008).
(b) Laboratory: installing and modifying Apertium
• Installing Apertium from the latest sources on a virtual
Linux machine.
• Changing the data for a language pair: vocabularies and,
optionally, transfer rules.
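To give a rough idea of what the vocabulary data for a language pair can look like, the following Python sketch reads a toy fragment written in the style of an Apertium bilingual dictionary. The two entries are invented for illustration and the fragment is simplified (real dictionaries also declare an alphabet and symbol definitions); the helper names `side` and `read_entries` are hypothetical and not part of Apertium.

```python
import xml.etree.ElementTree as ET

# Toy fragment in the style of an Apertium bilingual dictionary.
# Invented for illustration; not taken from any released language pair.
TOY_BIDIX = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>casa<s n="n"/><s n="f"/></l><r>house<s n="n"/></r></p></e>
    <e><p><l>gat<s n="n"/><s n="m"/></l><r>cat<s n="n"/></r></p></e>
  </section>
</dictionary>
"""


def side(element):
    """Return (lemma, [tags]) for one side (<l> or <r>) of an entry."""
    lemma = (element.text or "").strip()
    tags = [s.get("n") for s in element.findall("s")]
    return lemma, tags


def read_entries(xml_text):
    """Return a list of ((left lemma, tags), (right lemma, tags)) pairs."""
    root = ET.fromstring(xml_text)
    return [(side(p.find("l")), side(p.find("r"))) for p in root.iter("p")]


for (src, src_tags), (tgt, tgt_tags) in read_entries(TOY_BIDIX):
    print(src, src_tags, "->", tgt, tgt_tags)
```

Adding vocabulary in the laboratory session would amount to adding further entries of this kind and recompiling the language-pair data.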
4. Corpus-based FOSMT: Moses (*)
(a) Statistical machine translation
• Statistical MT (SMT)
• The data: sentence-aligned corpora.
• Training: the statistical models
• The SMT engine or “decoder”
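As an illustration of how a statistical model can be trained from a sentence-aligned corpus, here is a minimal Python sketch of expectation-maximization for word-translation probabilities in the style of IBM Model 1. The choice of Model 1 is an assumption made for the example (the outline does not name a specific model), the three-sentence corpus is a toy, and the sketch omits the NULL word and everything else a real training pipeline would include.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t(target_word | source_word) by EM over sentence pairs.

    corpus: list of (source_tokens, target_tokens) pairs.
    Returns a dict mapping (target_word, source_word) to a probability.
    """
    # Start from a constant score for every co-occurring word pair;
    # the E-step normalizes, so the constant itself does not matter.
    t = {(f, e): 1.0 for src, tgt in corpus for e in src for f in tgt}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of links from e
        for src, tgt in corpus:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


toy_corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_model1(toy_corpus)
for (f, e), p in sorted(t.items(), key=lambda item: (item[0][1], -item[1])):
    print(f"t({f} | {e}) = {p:.3f}")
```

Even on three sentence pairs, the probabilities drift toward the intuitive pairings because each word that appears in several contexts constrains the others.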
(b) Description of Moses and related software:
• Training: Giza++
• Language models: irstlm
• Tuning: MERT
• Decoding: Moses
(c) Laboratory: installing and running Moses.
• Installing Moses and all of the auxiliary software (Giza++,
language model, etc.).
• Training a toy SMT system with Moses.
5. Additional material (*)
(a) Evaluation of machine translation
• FOSS for evaluation: IQMT, etc.
• Human evaluation: post-editing and the traditional adequacy/fluency/informativeness criteria
• Automatic evaluation: BLEU, etc. Criticism of automatic
evaluation.
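To make the idea behind BLEU-style automatic evaluation concrete, the Python sketch below computes a simplified score: the geometric mean of modified n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is a sketch under stated assumptions, not the reference implementation: it works on a single sentence with a single reference, tokenizes on whitespace and applies no smoothing, whereas BLEU is normally computed over a whole test set. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-sentence, single-reference BLEU-style score."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))
print(simple_bleu("the cat sat on the mat today", reference))
print(simple_bleu("the feline rested upon the rug", reference))
```

The last call hints at one common criticism of automatic evaluation: a candidate that shares no word sequences with the reference gets a score of zero, whether or not a human would accept it as a translation.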
(b) Free/open-source resources that can be used to build FOSMT
systems
• Morphological analysers, taggers: Freeling
• Data useful for morphological analysers: Wiktionary (used as a starting
point for Breton, Icelandic and Faroese)
• Free bilingual dictionaries: DACCO, etc.
• Parallel corpora: Europarl, OPUS, etc.
• ReTraTos: Perl tools to build bilingual dictionaries from
aligned corpora
• Free text: Wikipedia (easy access to free “raw” text)
• Bitext tools: bitextor, tagaligner, etc.
(c) Related software
• Non-MT FOSS for translation: OmegaT (translation memories), etc.
Blocks marked with (*) will be dealt with if time allows.
4.1 References
• Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada,
M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez, F., Sánchez-Martínez, F., Scalco, M. (2006) “Open-source
Portuguese–Spanish machine translation” in Lecture Notes in Computer Science 3960, 50–59 (Computational Processing of the Portuguese
Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR
2006, May 13–17, 2006, Itatiaia, Rio de Janeiro, Brazil). Available online: http://www.dlsi.ua.es/~fsanchez/pub/pdf/armentano06.pdf
• Behlendorf, Bruce (1999) “Open source as a business strategy”, in
DiBona, C., Ockman, S., Stone, M., eds. Open Sources: Voices from the
Open Source Revolution, O’Reilly. Available online: http://oreilly.com/catalog/opensources/book/toc.html
• Forcada, Mikel (2006) “Open source machine translation: an opportunity for minor languages” in LREC-2006: Fifth International Conference on Language Resources and Evaluation. 5th SALTMIL Workshop
on Minority Languages: “Strategies for developing machine translation for
minority languages”, Genoa, Italy, 23 May 2006; pp. 1–6. Available online: http://www.mt-archive.info/LREC-2006-Forcada.pdf,
http://www.dlsi.ua.es/~mlf/docum/forcada06p2.pdf
• Sánchez-Martínez, F., Forcada, Mikel L. (2007) “Automatic induction of shallow-transfer rules for open-source machine translation”
in TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde [Sweden],
7–9 September 2007; pp. 181–190. Available online: http://www.mt-archive.info/TMI-2007-Sanchez-Martinez.pdf,
http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07c.pdf
• Sánchez-Martínez, F., Pérez-Ortiz, J.A., Forcada, M.L. (2008) “Using target-language information to train part-of-speech taggers for machine translation”, Machine Translation 22(1–2), 29–66.
• Streiter, O., Scannell, K.P., Stuflesser, M. (2006) “Implementing NLP projects for noncentral languages: instructions for funding
bodies, strategies for developers”, Machine Translation 20(4), 267–289.
Draft available online: http://borel.slu.edu/pub/mt.pdf.