Moses for Mere Mortals – Tutorial
A translation chain for the real world
Maria José Machado
João Luís Rosas
Hilário Leal Fontes
Table of Contents
Moses for Mere Mortals – Tutorial
A. Purpose
A.1. Moses for Mere Mortals (the Linux scripts)
A.2. Windows add-ins
B. Requirements
B.1. System requirements
B.2. Software
C. Overview
D. Installation for new Moses users
D.1. Demonstration corpus
D.2. First steps with the scripts
E. Using your own corpora
F. create script
G. Understanding the directory structure of $mosesdir
H. Names of the files and of the languages
I. Files needed for the training and what you need to know before training your own corpus
I.1. Where to put the files to be processed
I.2. Need of strictly aligned corpora files
I.3. Do not use spaces in file names
I.4. Corpus files
I.4.1. Using TMX files to create Moses corpus files
I.5. Test files
I.6. Language model file
I.7. Tuning files
I.7.1. maxruns parameter
I.8. Evaluation (testing) files
I.9. Recaser files
J. make-test script
J.1. The ominous control characters
K. train script
K.1. Description of some important parameters
K.2. Greatly increasing the training speed
K.3. Controlling tuning
K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of a training already done in previous trainings
K.5. Training of an inverted corpus
K.6. Isolating a training from all the other ones
K.7. NIST scorer errors
L. translate script
L.1. Speed
L.2. Reusing tuning weights (only for advanced users)
M. score script
M.1. Two types of scores
M.2. Problems with the NIST scorers
N. Utilities
N.1. transfer-training-to-another-location script
O. Improving quality and speed
P. Deleting trained corpora
P.1. You want to erase all the trainings that you have done
P.2. You want to erase just some of all the trainings that you have done
P.2.1. Scenario 1: More than one Moses installation available
P.2.2. Scenario 2: Single Moses installation available
Q. How to contribute
R. Thanks
S. Author
APPENDIX: default parameters of each of the scripts
1) create script
2) make-test script
3) train script
4) translate script
5) score script
6) transfer-training-to-another-location script
A. Purpose
Moses-for-Mere-Mortals builds a translation chain prototype with Moses + IRSTLM + RandLM +
MGIZA. Very large corpora can therefore be processed. Its main aims are the following:
1) To help build a translation chain for the real world (especially if there is a tight schedule for
production);
2) To guide the first steps of users that are just beginning to use Moses;
3) To enable a quick evaluation of Moses;
4) To integrate machine translation and translation memories.
Even though the main thrust of this work centers on Linux (the operating system where Moses runs),
translators usually work on an MS Windows environment. Therefore, two add-ins (collectively called
“Windows add-ins”) help to make the bridge from Windows to Linux and then back from Linux to
Windows (windows-add-ins-1.043.tar.gz package).
For a brief overview of the whole project, please see Overview.
Moses for Mere Mortals also has some original features:
1) Removes control characters from the input files (these can crash a training);
2) Extracts from the corpus files 2 test files with randomly selected, non-consecutive segments, which are then erased from the corpus files;
3) Ensures that a new training does not interfere with the files of a previous training;
4) Reuses as much as possible the files created in previous trainings (thus saving time);
5) Detects the inversion of corpora (e.g., from en-pt to pt-en), allowing a much quicker training than that of the original language pair (and also checks that the inverse training is correct);
6) Stops with an informative message if any of the phases of training (language model building, recaser training, corpus training, memory-mapping, tuning or training test) doesn't produce the expected results;
7) Can limit the duration of tuning;
8) Gets the BLEU and NIST scores of a translation (either for the whole document or for each segment of it);
9) Transfers your trainings to someone else or to another Moses installation in the same computer;
10) All the mkcls, GIZA and MGIZA parameters can now be controlled through parameters of the
train script;
11) Selected parameters of the Moses scripts and the Moses decoder can now be controlled through
the train and translate scripts;
12) Can score in one go all the Moses translation files present in one directory.
NOTE: given the fact that this tutorial can be read from top to bottom, or, more probably, by jumping
selectively to random sections, we have repeated points that we consider important in several different
sections.
A.1. Moses for Mere Mortals (the Linux scripts)
Moses for Mere Mortals, the Linux component of this work, automates the installation, the creation of
a representative set of test files, the training, the translation and even the scoring tasks. It also comes
with a demonstration corpus (too small for doing justice to the qualitative results that can be achieved
with Moses, but capable of giving a realistic view of the relative duration of the steps involved). For
building very large Moses corpora using your own translation memories (*.TMX files), please use the
Extract_TMX_Corpus program included in the windows-add-ins-1.043.tar.gz package. If you want to
transfer Moses translations to a *.TMX translation memory tool (e.g., SDL Trados Translator's
Workbench), you can use Moses2TMX, a program included in the same package. Together, these 2
open source programs make the link between Linux (where the corpora are trained and the translations
are made) and Windows (where most translators actually use Moses translations).
After trying the demonstration corpus, users of these scripts should be able to get results immediately with the real corpora they are interested in.
These scripts also avoid the destruction of previously trained corpora by the subsequent training
of a new corpus and simultaneously try to reuse the relevant parts of previous trainings in the
subsequent training of a new corpus.
It is also possible to train corpora where every word is presented together with, for instance, its respective lemma and/or part-of-speech tag (“factored training”). The present scripts do not cover this type of training. If you are interested in morphologically rich languages, and especially if one of them is the target language, you may be interested in factored training, a field that is evolving rapidly at the present time.
Moses-for-Mere-Mortals scripts are based on instructions from several sources, especially the
http://www.dlsi.ua.es/~mlf/fosmt-moses.html and the http://www.statmt.org/moses_steps.html web
pages and the Moses, IRSTLM, RandLM, giza-pp and MGIZA documentation, as well as on research
on the available literature on Moses, namely the Moses mailing list
(http://news.gmane.org/gmane.comp.nlp.moses.user). The comments transcribe parts of the manuals of
all the tools used.
Moses MT System is an open source project under the guidance of Philipp Koehn, University of
Edinburgh, and is supported by the European Commission Framework Programme on Research and
Technological Development and others.
For information on the general concepts of Statistical Machine Translation, see “Statistical Machine
Translation" by Philipp Koehn, Cambridge University Press, 2010.
A.2. Windows add-ins
Translators usually do not have corpora large enough to get excellent machine translation (MT) results, and the results they do get with MT depend heavily on training with a corpus that is highly representative of the domains they are interested in. The synergy between machine
translation and translation memories is not often stressed, but it seems to us that it naturally leads to
better results: machine translation can be enriched with the human translations stored in the translation
memories; and translation memories, if they do not have a certain “match” percentage for a given
segment, can be complemented with machine translation segments.
Therefore, the Windows add-ins address this synergy between MT and translation memories (namely
those in the widely used TMX format). Translation memories are not, however, an obligatory part of
this pack: the Linux scripts work with any perfectly aligned corpora files in UTF-8 format with Linux
line endings.
In the windows-add-ins-1.043.tar.gz package you will find a README file with instructions for each
of the 2 Windows add-ins:
1) Extract_TMX_Corpus, which converts a whole directory of TMX files into a Moses corpus that
can be used for training;
2) Moses2TMX, which converts a batch of Moses translations and their corresponding source
language documents into TMX files.
They therefore will not be mentioned in the rest of this Help/Tutorial file.
B. Requirements
B.1. System requirements
Moses for Mere Mortals (MMM) has been tested with the following 64-bit Linux distributions:
• Ubuntu 10.04 LTS (http://old-releases.ubuntu.com/releases/10.04/)
• Ubuntu 10.10 (http://old-releases.ubuntu.com/releases/10.10/)
• Ubuntu 11.04 (http://old-releases.ubuntu.com/releases/11.04/)
If you are new to Linux, this could be a good time to ask for the help of a friend. The scripts should also work in other Linux distributions with slight changes, but they have not been tested in any other distribution.
• Computer:
  – Minimum 2 GB RAM (preferably much more)
  – Preferably a fast multiprocessor computer
  – Disk space: as a rough rule, the disk space needed for corpus training is approximately 100 times the size of the corpus (source plus target files).
B.2. Software
In order to use the scripts, the following packages need to be installed in Ubuntu (64-bit):
1) automake
2) bison
3) boost-build
4) build-essential
5) flex
6) help2man
7) libboost-all-dev
8) libgc-dev
9) libpthread-stubs0-dev
10) libtool
11) zlibc
12) zlib1g-dev
13) gawk
14) tofrodos
You can install them by selecting the System menu and then the Administration > Synaptic Package
Manager command.
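Alternatively, if you prefer the command line, all of them can be installed in one go with apt-get (a sketch; the package names are the ones listed above):

sudo apt-get install automake bison boost-build build-essential flex \
  help2man libboost-all-dev libgc-dev libpthread-stubs0-dev libtool \
  zlibc zlib1g-dev gawk tofrodos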
C. Overview
These scripts have only been tested in Ubuntu 10.04 LTS, 10.10 and 11.04 64-bit environments.
Note: So as to avoid having to rewrite this Help file every time that a script version changes, in what follows the version numbers of the scripts have been omitted (for example, in most places we write create-* (or just create) instead of create-1.34 in order to refer to the create-1.34 script).
1) You should start using these scripts by opening the create script and changing some important parameters there, so as to make your installation adequate for your hardware (see section F. create script).
2) Then launch the create script, which will download and compile ***all*** the Moses packages
and the packages upon which Moses depends. Since some compilation errors do not stop
compilation but can lead to an unusable tool, the create script checks, at the end of the
compilation of each tool, whether the appropriate files are present or not.
3) Select a set of corpus files (2 strictly aligned files in UTF-8 format, one in the source language
and another in the target language) for Moses training and launch the make-test script in order
to create a representative set of segments that will be used for testing the corpus that you will
train and in order to erase those segments from the corpus files that will be used for training.
This step is not obligatory but is highly recommended.
NOTE 1: Even if you do not have your own corpus for training, you can use the demo corpus
that comes with these scripts (you do not need to do anything to have this corpus: the create
script takes care of this and the other scripts are configured to use it “straight from the box”). It
is highly recommended that new Moses users start using these scripts using the demo corpus.
NOTE 2: If you do not have text files in UTF-8 format (necessary for Moses), but you do have translation memories in *.TMX format, you can use the Extract_TMX_Corpus tool (included in the windows-add-ins-1.043.tar.gz package) to create the adequate UTF-8 files from your TMX files.
4) Next, the train script trains a corpus composed exclusively of plain words (“non-factored training”). This script does include some advanced features, like memory-mapping (for saving memory resources, which is essential for processing large corpora), tuning (in order to try to get qualitatively better results) and the ability to change certain parameters that can either reduce the processing time or increase the quality of the results.
5) The translate script then translates one or more documents placed in a specific directory.
6) Finally, the score script allows you to score Moses translations against human translations of the
same text, giving BLEU and NIST scores for either the whole document or for each segment of
the document (depending on the settings that you define).
7) If you want to transfer your trained corpora to someone else or to another Moses installation (even one in the same computer), the transfer-training-to-another-location script helps you to do that.
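Putting these steps together, a typical session might look as follows (a sketch only: the actual script file names include version numbers, and each script should be opened and its parameters set before it is launched):

bash create      # 1) compile Moses and the packages it depends on
bash make-test   # 2) create representative test files from the corpus files
bash train       # 3) train the corpus (language model, tuning, training test)
bash translate   # 4) translate the files placed in translation_input
bash score       # 5) score Moses translations against reference translations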
Users are expected to open the scripts and to change their parameters according to their needs (the
parameters are preceded by comments that explain their purpose and sometimes even their allowable
values and limits; many of those comments are citations of the Help of the several packages that Moses
uses and of the Moses manual). These parameters are set at the top of each script in a clearly defined
section.
D. Installation for new Moses users
D.1. Demonstration corpus
Moses-for-Mere-Mortals comes with a small demonstration corpus so that you can quickly see what the
scripts can do. This corpus is automatically installed by the create script (no need for you to do
anything) and is used by the other scripts, which are already configured to use it (you will have to open
the translate script and to change its $logfile parameter, however, as described in section L).
The corpus was extracted from DGT-TM Acquis available from the Commission's Joint Research
Centre website (please note that only European Community legislation printed in the paper edition of
the Official Journal of the European Union is deemed authentic).
It is small (200 000 segments in the Portuguese and English languages) and the results of its processing
cannot be seen as representative of the quality Moses can achieve (especially if you consider that
IRSTLM and RandLM are intended to process corpora with several to many millions of segments).
However, a small corpus like this will reveal facts about Moses (like the relative length of time needed for each of its steps) and it is therefore highly recommended that you start using the scripts with their default settings. Later on, you can change them so that the actual work you are interested in gets done.
If you don't change the default settings of the create and the train scripts, Moses will train this
Portuguese-English corpus, which involves 300 000 segments for language model building and 200
000 segments for corpus training.
This corpus had a BLEU score of 0.7115 and a NIST score of 11.3007 with a 5-gram language model.
Even though small, it took some time to train in a machine with an Intel i7 720-QM processor and 8 GB
of DDR3 RAM (2h 21m 05s without tuning and 7h 14m 00s with tuning limited to a maximum of 10
iterations). The training of a corpus with 6.6 million segments in this same machine took 2 days 22 h
59m (without tuning).
D.2. First steps with the scripts
For a quick introduction to Moses-for-Mere-Mortals, please consult the QUICK-START-GUIDE.doc.
E. Using your own corpora
You should follow the same steps as described above for new users, but, before launching each of the scripts, you should change the values of the parameters you are interested in (as you have already done in section D before using the translate script).
So as to ease your task, the description of each of the scripts in this Help/Tutorial file will start with a section entitled “Vital parameters”, which tells you the bare minimum of parameters that you should change in case you want to train a corpus different from the demonstration corpus.
In fact, in order to train your own corpora, you have to define your own settings and to choose your
own corpora files and the languages you are interested in. You might also want to change the
parameters of Moses or of the packages it uses. In order to do that, before launching the scripts, open
them and set the variables defined between
###################################################################################
# The values of the variables that follow should be filled according to your needs: #
###################################################################################
and
###################################################################################
# End of parameters that you should fill #
###################################################################################
Each parameter is preceded by a comment that describes its function and, in some cases, even states the allowable and default values you can use. These comments often consist of extracts of the Help files, READMEs or manuals of the several packages used.
Please refer to the sections that follow, which describe each of the scripts and some important info in
more detail.
F. create script
Vital parameters: mosesdir, mosesnumprocessors
This is a shell script that creates (compiles) Moses and the packages upon which it depends.
1. Go to the Moses-for-Mere-Mortals-1.16/scripts directory and open the create script. At the top of the script, you can change several variables that allow you to better adapt it to your own requirements:
• $mosesdir: this is the base directory where Moses will be installed (default value: $HOME/moses-irstlm-randlm). You can change both its name (in this case, “moses-irstlm-randlm”) and its location (in this case, “$HOME”). This variable is defined in all the scripts and its value MUST be the same in all of them if you want them to be able to work together (and you do want that!).
• $mosesnumprocessors: the number of processors of your computer that you want Moses to use (Moses will be compiled to make better use of them). The default value is 1, but nowadays you can easily find computers with 2 or 4 processors.
2. Save your changes. Do not change any other parameter for the time being (later on, after you have run all these scripts, you can start your own experiments).
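For illustration, these two variables might look like this at the top of the create script (the names are the ones used by the scripts; the exact layout in the script may differ):

mosesdir=$HOME/moses-irstlm-randlm   # base directory; MUST be identical in all scripts
mosesnumprocessors=2                 # processors Moses will be compiled to use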
This script also creates some important directories:
• $mosesdir/corpora_for_training: this is the directory where the corpora and the other input files for training are located; there you'll already find several files that can be used to run the training demo; you should place here all the files needed for training the corpora you are interested in.
• $mosesdir/corpora_trained: this is the directory where the files created during corpus training are saved; it is advisable not to touch this directory, since you can destroy the training of one or several corpora;
• $mosesdir/logs: this is the directory where the training summary files that describe how the training was done are saved. These files contain the values of several variables that will be used by the translate script (and which this latter script will extract for you). These files are very important because they are the only way to indicate to the translate script the trained corpus you want to use for translation;
• $mosesdir/tools: this is the directory where both Moses and all the other tools (giza-pp, irstlm and so on) will be installed; this directory does not change during training and you should not change it.
G. Understanding the directory structure of $mosesdir
The directory structure indicated in the figure below results from the execution of all the scripts. These directories are created at the time they are needed; after installation, for example, not all of them will exist yet.
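The main directories are the following (a plain-text sketch based on the descriptions in this tutorial):

$mosesdir (default: $HOME/moses-irstlm-randlm)
├── corpora_for_training        input files for training
├── corpora_trained             files created by training (do not touch)
├── logs                        training summary files (“logfiles”)
├── tools                       Moses and all the other packages
├── translation_input           documents to be translated
├── translation_output          “normal” Moses translations
├── translation_files_for_tmx   translations suited for TMX memories
├── translation_reference       reference (human) translations
└── translation_scoring         BLEU and NIST score files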
1. Once you have installed Moses with create, you should put the corpus files that you want to train in
the $mosesdir/corpora_for_training directory. You should also place here the files used for creating
the language model (if different), for training recasing (if different), for tuning (if any), and for testing
the results of training (if any).
2. If you then use the train script for training that corpus, a directory $mosesdir/corpora_trained will
be created to store the trained files.
NOTE: $mosesdir/corpora_trained is a vital directory and you are strongly urged not to change
any of its contents because it has a complex structure that mixes files from several trainings to
allow reusing of the work done in previous trainings.
If you change the contents, you risk destroying not just one, but several trainings. You can,
however, use a specific training already done by referring to its logfile (see below) in the
$mosesdir/logs directory. It is also possible to isolate a training from all others (please refer to
section N.1 in order to learn how to do that).
3. At the end of the training, a training summary file (“logfile”) will be created in the $mosesdir/logs
directory. This file is very important because its name will be used in the translate script to indicate
the trained corpus you want to use for translation.
4. Once a corpus is trained, you can start using it to get actual translations with the translate script. Place the documents to be translated (you can translate one or many documents at the same time) in the $mosesdir/translation_input directory (created by the train script) and then launch the translate script. Before the translation starts, you should also make sure that the files placed in the translation_input directory are indeed adequate for the trained corpus that this script uses (for instance, the languages used for training should match those of the translation); otherwise, you will waste time translating files that shouldn't have been translated with that trained corpus (e.g., because their language is not adequate).
5. Translation can have 2 types of outputs:
a) a “normal” Moses translation, if you set the $translate_for_tmx parameter to a value different from 1 (default: 0); or
b) a Moses translation especially suited for making translation memories, if you set the
$translate_for_tmx parameter to 1.
The “normal” translation will be located in the $mosesdir/translation_output directory.
The translation intended to build TMX translation memories will appear, together with the
corresponding modified input file, in the $mosesdir/translation_files_for_tmx directory.
In both cases, the translation will have the name of the source document plus an ending that corresponds to the destination language and a final “suffix”, “.moses”. This avoids confusion between source document and translation and between Moses translation and reference translation.
6. A new script (score) enables you to place a reference (that is, human) translation in the
translation_reference directory and get the BLEU and NIST scores of the corresponding Moses
translations in the translation_scoring directory.
Again, you have 2 choices:
a) get a score for the whole document, if the parameter $score_line_by_line is different from 1; or
b) get a score line by line, with the segments ordered by ascending BLEU score, if the parameter
$score_line_by_line is equal to 1.
7. The contents of the $mosesdir/tools directory should not be changed, since it includes all the files
needed for Moses to work.
H. Names of the files and of the languages
The names of the files and languages, which are used to create some directory names, should not include spaces or symbol characters, like asterisks, backslashes or question marks. Try to stick to letters, numbers, and the dash, dot and underscore if you want to avoid surprises. Avoid using a dash as the first character of a file name, because some Linux commands will treat it as a switch. If your files start with a dot, they'll become hidden files.
I. Files needed for the training and what you need to know before training
your own corpus
I.1. Where to put the files to be processed
All the files that are going to be mentioned should be put in $mosesdir/corpora_for_training (as described above, mosesdir is the base Moses system directory, whose default value is $HOME/moses-irstlm-randlm).
I.2. Need of strictly aligned corpora files
Be very sure that the corpus files you use are strictly aligned. Otherwise, you risk getting quite puzzling errors. At the very least, check the number of lines of the source language file and of the target language file, which should be exactly the same. In order to do this, type in your terminal the commands:
wc -l {name_of_source_language_file}
and
wc -l {name_of_target_language_file}
You may also want to check that the last line of the source language file does correspond to the last line
of the target language file. You can even do more detailed checks.
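For example, with the corpus files used throughout this tutorial (substitute your own file names):

wc -l corpus1000.pt corpus1000.en   # the two line counts must be identical
tail -n 1 corpus1000.pt             # the last line of the source file...
tail -n 1 corpus1000.en             # ...should be the translation of this one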
I.3. Do not use spaces in file names
The name of all the files used in these scripts should not include spaces.
I.4. Corpus files
At the very least, you need 2 parallel corpus files for training (one source language file and one target language file, where the line number of a segment in the source language file is exactly equal to the line number of its translation in the target language file; that is, those 2 files should be strictly aligned). The way the script works requires that both files share a common $corpusbasename (a variable that you set in the train script), that is, a “prefix” that identifies the corpus in question (e.g., “corpus1000”). To each file, you should also add a “suffix” that identifies the language in question (e.g., “en”, "fr" or “pt”). The name of the corpus files should be exclusively composed of these “prefixes” and “suffixes”. If you want to create a corpus1000 for the pt and en languages (pt being the source language and en the target language in all the examples that follow), you therefore need the following files:
• corpus1000.pt
• corpus1000.en
I.4.1. Using TMX files to create Moses corpus files
If you do not have separate and strictly aligned files (one in the source language and another in the target language), but you do have *.TMX translation memory files, you can use Extract_TMX_Corpus (included in the windows-add-ins-1.043.tar.gz package) to transform such *.TMX files into files that Moses can use (notably, it converts them to UTF-8 format). This program also does some cleaning of the corpus files that will prevent Moses errors during the training phase.
I.5. Test files
The make-test script allows you to use these 2 files to obtain a representative sample of their segments; it creates 2 new corpus files from which the lines corresponding to the segments extracted for testing have been erased. These segments can still occur in those new corpus files, since they may occur in more than one line. The new corpus files created by this script (whose names are given at the end of its execution) should then be the files used for setting the $corpusbasename. If you had started it with the example files indicated above, it would create 4 new files:
• corpus1000.for_train.pt (a file used for corpus training)
• corpus1000.for_train.en (a file used for corpus training)
• corpus1000.for_test.pt (a file used for testing the trained corpus)
• corpus1000.for_test.en (a file used for testing the trained corpus)
I.6. Language model file
You also need a single file for building the language model (this file must contain segments only in the target language). If you want, you can use the target language file used for the training, in which case you do not need a new file for this purpose. However, you might want to use a (possibly much bigger) file for building the language model, since this is a very important building block that costs comparatively little time to process and affects the fluency of the translation. If you want to use a specific file for this purpose, you need to create a file whose “prefix” is arbitrary (e.g., “145000”), but whose “suffix” is obligatorily the abbreviation of the target language already used in the corpus files (see point I.4). Therefore, continuing our example, if you want to use a file with 145000 segments for building the language model, you could name it:
• 145000.en
I.7. Tuning files
The tuning of a corpus should in principle lead to better translation results. For that, 2 conditions must be met: 1) the tuning parameter must be set to 1 in the train script (this is its default value); 2) you need two files: one for the source language and one for the target language.
Again, you can use the files described in point I.4, in which case you do not need any new files. However, given that tuning is a very, very long process (perhaps the longest of all), you might want to use a set of files with a ***much smaller size*** than those described in point I.4. However, files that are too small generally lead to a large number of tuning iterations and to less improvement, and files that are too big take more time to process. You have to strike a balance between these two extremes (1 000 to 2 000 segments is good enough). Such files should have a common arbitrary “prefix” and as “suffix” the abbreviations of the languages already described in point I.4. Therefore, continuing our example, you could use 2 much smaller files named:
• 100tuning.pt
• 100tuning.en
I.7.1. maxruns parameter
Tuning is a phase that can easily take more time than all the others put together.
Furthermore, you can't easily estimate its duration beforehand, since the number of its
runs is highly variable from corpus to corpus.
Therefore, a modified mert-moses-new.pl script (mert-moses-new-modif.pl) introduces
some changes to the original Moses script so that the user can control the number of
tuning runs through the parameter $maxruns of the train script.
A value of -1 means that an unlimited number of runs is allowed. Any positive number
>= 2 means that tuning should be stopped after that run number. The default value is 10.
Good values can be between 5 and 10 (choose the lower end of this spectrum if you
want to speed up things; but you will risk a worse tuning).
I.8. Evaluation (testing) files
If you want to evaluate the training of your trained corpus, you'll need 3 files: one for the source language, another for the human translation in the target language, and the Moses translation file (in order to create this latter file during training, the runtrainingtest parameter should be set to 1, which is its default value, in the train script).
You can use the files described in point I.4 for that purpose, in which case you need no new files. You can use files especially created for the purpose, or you can use the make-test script to create them (see note below).
Such files should have a common arbitrary “prefix” and as “suffix” the abbreviations of the languages already described in point I.4. Therefore, continuing our example, you could use 2 much smaller files named:
• 1000.for_test.pt
• 1000.for_test.en
NOTE: Since the choice of the segments used for the evaluation can considerably affect its results, a script (make-test) was made that divides the corpus files into X sectors and randomly chooses Y segments in each of those sectors (based on randomly chosen segment line numbers). The resulting test files are probably more representative of the several contexts and vocabularies of your corpus than a set of consecutive segments would be.
I.9. Recaser files
You also need a (possibly much bigger) file in the target language for recaser training. This is a comparatively quick process and you can therefore invest in it. The recaser training file has a “prefix” that is arbitrary (e.g., “10000recaser”), but a “suffix” that is obligatorily the abbreviation of the target language already used in the corpus files (see point I.4). Therefore, continuing our example, if you want to use a file with 10000 segments for recaser training, you could name it:
• 10000recaser.en
J. make-test script
Vital parameters: lang1, lang2, mosesdir, basefilename
This script assumes that a Moses installation has already been done with the create script. It uses two
aligned input files, one in the source and another in the target language, whose $basefilename should
be equal and which differ by a “suffix” that indicates the abbreviations of their respective languages, as
indicated in point I.4 (e.g., 200000.pt and 200000.en, the $basefilename being, in this case, “200000”).
The script produces 2 new sets of files:
1) 2 files (one in the source language and another in the target language) used for testing the trained corpus; for that, it divides the corpus into X $sectors (a parameter that you can define) and then randomly selects Y $segments (another parameter you can define) in each sector. All the selected segments will have different line numbers (no line can be chosen more than once). This procedure offers a better guarantee that the segments used for testing the trained corpus are representative of all the styles and contexts of the corpus being used than they would be if you arbitrarily chose the same number of consecutive segments somewhere in the input files. These files have a basefilename equal to $basefilename.for_test and will later be used by the train script (e.g., 200000.for_test.pt and 200000.for_test.en).
2) 2 files (one in the source language and another in the target language) that are equal to the starting files, except that the segments used for creating the 2 test files have been erased from them. These files have a basefilename equal to $basefilename.for_train and will later be used by the train script (e.g., 200000.for_train.pt and 200000.for_train.en).
NOTE: if you want to compare the relative results of a change in training parameters, you
should execute the training test before and after the change in parameters with the same set of
test files (run make-test-* just once and use the test files it creates to test both trainings).
J.1. The ominous control characters
The Moses clean-corpus-n.perl script does not erase control characters, which has led, in our experiments, to a crash during the memory-mapping of a reordering table (with a control-K character), in one case several days after the training had started. Problems with this character have also been described in the Moses mailing list. Therefore, the make-test script (as well as the train script) also substitutes a space for all instances of the control-G, control-H, control-L, control-M and control-K characters in all the files.
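If you want to clean a file yourself, the same substitution can be done in a terminal with the standard tr command (illustrative only; the make-test and train scripts already take care of this for you):

# Replace control-G, control-H, control-K, control-L and control-M with spaces
# (the second argument is five spaces, one per character):
tr '\a\b\v\f\r' '     ' < corpus1000.pt > corpus1000.clean.pt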
K. train script
Vital parameters: mosesdir, lang1, lang2, corpusbasename, lmbasename, tuningbasename,
testbasename, recaserbasename, reuse, paralleltraining, memmapping, tuning, runtrainingtest, lngmdl,
Gram, mgizanumprocessors, maxruns
This script assumes that a Moses installation has already been done with the create script and can
optionally use the files created by the make-test script.
Even though it might not cover all the features that you might like to play with (namely those necessary for factored training, which could involve the use of a lemmatizer and/or a part-of-speech tagger), this script does allow you to train a corpus, to memory-map your training files (so that Moses uses less RAM resources), to do tuning, to do a training test (also with memory-mapping), and to get the NIST and BLEU scores of that test. It also makes available all the parameters used by IRSTLM, RandLM, mkcls, GIZA and MGIZA, as well as selected parameters used by the Moses decoder and the Moses scripts. These parameters are all set by default to the values they receive when you use the Moses scripts. If you are new to Moses, do not change them.
If your computer has more than 1 processor, you should also change here the mgizanumprocessors parameter (set by default to 1) to the actual number of processors of your computer that you want to use with MGIZA. Just open the train script, search for the name of this parameter, change it and save your changes. This can lead to a much higher training speed.
At the very least, this script will build a language model and will train a corpus (except if they already exist for the same parameters that you choose, in which case it will not rebuild them).
Other steps are optional: memory-mapping, tuning and testing. In order to set the steps that will be
executed, you have to change the parameters at the top of the script. The role of those parameters is
also indicated there (in comments that precede each one of them).
The directory structure created by these scripts ensures 3 things: 1) no training will interfere with the files of a previous training; 2) a later training will reuse as much as possible the files created in previous trainings; and 3) the training of an inverse corpus, like the training of a pt-en corpus after the training of an en-pt corpus with the same corpus files, is also sped up because it uses some files that were already created in the first of the trainings.
At the end of the training, a logfile (training summary) file will be created in the $mosesdir/logs
directory. It includes details about the duration of the several phases of training, values that will be used
when you translate files based on this trained corpus, a list of the main input files, a list of all the files
created during training, a list of all the parameters used and the score of the trained corpus test (if a test
was done). The name of this file is most important because it is used by the translate script to select the
trained corpus that you want to use for translation.
K.1. Description of some important parameters
1. reuse: if set to 1, any step already done in a previous training will not be redone;
2. memmapping: if set to 1, memory-mapping of the corpus will be done (this will reduce your RAM requirements);
3. tuning: if set to 1, tuning will be done; this should in principle lead to better results, but that does not always happen in practice; tuning can easily take more time than all the other steps combined;
4. runtrainingtest: if set to 1, a test of the training will be done with scoring;
5. lngmdl: package chosen to build the language model (in these scripts, you have 2 choices: 1 = IRSTLM; 5 = RandLM);
6. Gram: n-gram order; can influence the results significantly; the higher, the better (maximal value: 9), but the execution time will suffer; normally between 3 and 9 (default: 5);
7. maxruns: maximal number of runs made during tuning (default: 10; -1 means no limit).
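As an illustration, these parameters might be set as follows at the top of the train script (names as used in this tutorial; the exact layout in the script may differ):

reuse=1              # reuse steps already done in previous trainings
memmapping=1         # memory-map the corpus to reduce RAM requirements
tuning=1             # do tuning (can take longer than all other steps combined)
runtrainingtest=1    # test the training and score it
lngmdl=1             # 1 = IRSTLM; 5 = RandLM
Gram=5               # n-gram order of the language model (3 to 9)
maxruns=10           # maximal number of tuning runs; -1 = unlimited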
K.2. Greatly increasing the training speed
The parameter mgizanumprocessors determines the number of processors of your computer that will be used during phase 2 (MGIZA) of training. Since this is the longest phase of training, you should set this parameter to the number of processors (actual or virtual) of your computer that you want to use.
K.3. Controlling tuning
Tuning is a phase that can easily take more time than all the others put together. Furthermore, you can't
easily estimate its duration beforehand, since the number of its runs is highly variable from corpus to
corpus.
Therefore, a modified mert-moses-new.pl script (mert-moses-new-modif.pl) introduces some changes to
the original Moses script so that the user can control the number of tuning runs (iterations) through the
parameter $maxruns of the train script.
A value of -1 means that an unlimited number of runs is allowed. Any positive number >= 2 means that
tuning should be stopped after that run number. The default value is 10. Good values can be between 5
and 10 (choose the lower end of this spectrum if you want to speed up things; but you will risk a worse
tuning).
K.4. Avoiding the destruction of a previous training by a subsequent
training and reusing parts of a training already done in previous
trainings
In order to guarantee that the training of a corpus doesn't destroy files from a previously trained corpus
and in order to ensure that each training reuses as much as possible the files already created in previous
trainings, a complex (and confusing) directory structure was created. This, however, implies that the
files of all the previous trainings are dispersed in the $mosesdir/corpora_trained directory. As already
stated, this is a directory which you shouldn't change, since, by doing that, you can destroy not just one
but several (or all) trainings.
However, a $mosesdir/logs directory exists where you can find a summary of every training you made
that describes, among other things, the parameters it used and the files it created. In order to use a
trained corpus for translation you just have to copy the name of its log file into the $logfile parameter
of the translate script. Nothing else is necessary for translation and indeed you can ignore where the
trained corpus files are for all practical purposes.
K.5. Training of an inverted corpus
A particular case of reusing previously made files is the training of an inverse corpus of a corpus
already trained (suppose en->pt when the pt->en trained corpus already exists). In that case, and if all
the training parameters stay equal in both instances, this script detects this situation and uses the
tokenized, cleaned and lowercased files of the previous training, as well as phases 1 and 2 (MGIZA,
one of the longest steps of training) of the previous training. This can lead to savings of up to 25% of
the execution time.
K.6. Isolating a training from all the other ones
You might feel tempted to isolate a training from all the other ones, perhaps because it is very
important for you. However, so that a given training does not erase any part of a previous one and so
that the scripts can reuse as much as possible the steps already done in previous trainings, the files of
the several trainings are interspersed in the $mosesdir/corpora_trained directory. There is,
nevertheless, an easy way to isolate a given training that you are going to make from all the other
preceding ones, if you insist on that:
1) Rename the $mosesdir/corpora_trained directory to $mosesdir/corpora_trained-bak;
2) Rename the $mosesdir/logs directory to $mosesdir/logs-bak;
3) Do the training of the corpus you want to isolate from all others (this will create new
$mosesdir/corpora_trained and $mosesdir/logs directories that will just contain the trained
corpus data);
4) Move the newly created $mosesdir/corpora_trained and the $mosesdir/logs directories to a
safe place (outside $mosesdir);
5) Rename the $mosesdir/corpora_trained-bak directory back to $mosesdir/corpora_trained;
6) Rename the $mosesdir/logs-bak directory to $mosesdir/logs.
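As shell commands, steps 1) to 6) might look as follows (a minimal sketch, assuming the default $mosesdir and a hypothetical /safe/place destination directory):

mosesdir=$HOME/moses-irstlm-randlm
mv "$mosesdir/corpora_trained" "$mosesdir/corpora_trained-bak"   # step 1
mv "$mosesdir/logs" "$mosesdir/logs-bak"                         # step 2
# step 3: run the train script for the corpus you want to isolate
mv "$mosesdir/corpora_trained" /safe/place/                      # step 4
mv "$mosesdir/logs" /safe/place/
mv "$mosesdir/corpora_trained-bak" "$mosesdir/corpora_trained"   # step 5
mv "$mosesdir/logs-bak" "$mosesdir/logs"                         # step 6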
In order to be able to reuse the training that you isolated from all others, simply repeat steps 1) and 2) and move the corpora_trained and logs directories that you isolated back into $mosesdir.
K.7. NIST scorer errors
If you get an error message from the NIST scorer at the end of the training, please see section M.2 and
correct the situation as described there.
L. translate script
Vital parameters: mosesdir, logfile, translate_for_tmx (if this latter parameter is set to 1, then look
also minseglen, othercleanings, improvesegmentation, removeduplicates)
This script assumes that Moses with IRSTLM and RandLM has been compiled with create and that a
trained model exists already for the type of language pair you want to translate (which requires you to
have already run the train script).
This script translates, using the trained model that you chose in its $logdir and $logfile parameters, the
file or files that you yourself have put in $mosesdir/translation_input. In fact, it will translate in one
step all the files that are there.
It is very important that you fill in the $logdir and $logfile parameters correctly, since they are the only way of telling the script which trained corpus you want to use for translation. By default, $logdir has the value “$HOME/moses-irstlm-randlm/logs” and, if you haven't changed this parameter, you just have to go to this directory, identify the file that corresponds to the corpus you want to use and copy its name (omit the path!) into $logfile.
Translation can have 2 types of outputs:
1) A “normal” Moses translation, if the $translate_for_tmx parameter is set to 0 (default value). The
“normal” translation will be located in the $mosesdir/translation_output directory.
or
2) A Moses translation especially suited for being used with a translation memory tool, if you set the
$translate_for_tmx parameter to 1. This type of translation will be located, together with the modified
input file, in the $mosesdir/translation_files_for_tmx directory. It is especially interesting for those
who use machine translation together with translation memories (notably those who just use MT
segments when there are no translation memory segments above a certain “match” percentage).
By default, translate_for_tmx=0, which means it will do a "normal" translation. This type of translation fully respects the formatting of the original text, and therefore keeps long paragraphs, which, according to some sources, do not lead to the best results. That, however, didn't stop us from getting very respectable BLEU and NIST scores, as you can see for yourself if you try the demo or your own corpora.
If, and only if, you set $translate_for_tmx to 1, then the values assigned to other parameters will be
taken into account:
a) $minseglen: if set to a value different from -1 and greater than 0, all segments with a length of less
than minseglen will be erased; if set to -1, no segments will be erased whatever their length; default
value: -1;
b) $othercleanings: if set to 1, tabulation signs will be replaced by newlines and lines composed only
of digits, spaces and parentheses will be removed; default value: 1;
c) $improvesegmentation: if set to 1, replaces any of the characters [:;.!?] followed by a space by that
character followed by a newline, deletes empty lines and substitutes double spaces by a single space;
default value: 1;
d) $removeduplicates: if set to 1, removes duplicated segments; default value: 1.
If you want to do a scoring of the Moses translation (for that, you need to have a reference – human –
translation) and if $translate_for_tmx is set to 1, then you MUST set $minseglen = -1,
$othercleanings = 0, $improvesegmentation = 0 and $removeduplicates = 0 (so that the source
document and the reference translation have the same number of segments).
The names of the output files will be equal to those placed in $mosesdir/translation_input except for
a suffix that is appended to them with the abbreviation of the target language. Therefore, if you input
the file 100.pt you will get a translated 100.pt.en.moses file (if “en” is the abbreviation of the target
language).
Furthermore, both the source document and the Moses translation are also changed so that:
1) Some of the named entities defined in the TMX specification are duly created (e.g., “<” → “&lt;”);
2) “ / “ → “/”;
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The source file name MUST observe the following convention:
Source file: <textbasename>.<abbreviation of source language> (ex: 100.en)
!!! The <textbasename> should NOT include the character "." !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
L.1. Speed
Especially with very large trained corpora (several million segments), translation can be slow. According to the Moses manual, to get faster performance than the default Moses settings at roughly the same translation quality, use the parameters $searchalgorithm=1 (default: 0), $cubepruningpoplimit=2000 (default: 1000) and $stack=2000 (default: 100).
You can also try to reduce both of the latter 2 parameters to values of 500 or less (say, 100) and experiment to determine whether they significantly change the translation quality.
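In the translate script, these settings might look like this (illustrative values; parameter names as described above):

searchalgorithm=1          # 0 = default search algorithm; 1 = cube pruning
cubepruningpoplimit=2000   # default: 1000
stack=2000                 # default: 100; try 500 or even 100 for more speed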
L.2. Reusing tuning weights (only for advanced users)
Since tuning is a very long phase and since its only useful product is a set of weights that it transfers to the moses.ini file, you could perhaps invest in a single long tuning for each pair of languages that you are interested in and then copy those weights from such a moses.ini to every other moses.ini that will be created for that same language pair, a trick that can save a very big amount of time.
If the files used for tuning are representative of the types of documents you are interested in, they
should in principle lead to better results than the default values used when no tuning is done (that is not
always the case).
You might be interested in doing this only if the score script shows a significant increase in translation
quality after tuning is made. In practice, you could first train a corpus without tuning, translate a
representative text and then score that translation with the scoring script. Then, you could retrain the
same corpus with tuning and translate that same representative text and score it (since the scripts reuse
the previously made steps, the previous training will be reused and you will just do a new tuning and a
new training test if no other parameters are changed). You can repeat this for several representative
texts. If the scores obtained with tuning are significantly higher than those obtained without tuning,
then you can use the tuning weights for all the similar corpora of that language pair.
Depending on your settings, you can have a moses.ini file in $mosesdir/corpora_trained/model,
$mosesdir/corpora_trained/memmaps, and $mosesdir/corpora_trained/evaluation. If you want to
use the tuning weights, you should change those weights in all those moses.ini after the training is
made. The weights in question are listed in the [weight-d], [weight-l], [weight-t] and [weight-w]
sections of the $mosesdir/corpora_trained/tuning/.../moses.weight-reused.ini file.
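For orientation, the weight sections in such a moses.ini file look roughly as follows (old-style moses.ini format; the values shown are hypothetical, so copy the actual ones from your moses.weight-reused.ini):

[weight-d]
0.3
[weight-l]
0.5
[weight-t]
0.2
0.2
0.2
0.2
0.2
[weight-w]
-1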
M. score script
Vital parameters: mosesdir, scoreTMXdocuments, score_line_by_line (if this latter parameter is set to 1, then you should also set the remove_equal parameter), tokenize and lowercase
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
!!! THIS SCRIPT SHOULD NOT BE USED WITH DOCUMENTS TRANSLATED WITH THE
translate script WITH ITS $translate_for_tmx PARAMETER SET TO 1 ***UNLESS*** the
$othercleanings, $improvesegmentation and $removeduplicates parameters of that script were all set to
0 and $minseglen was set to -1 (this processing changes the order of the segments and can also make
the source document have a number of segments that is different from the number of segments of the
reference translation, namely because it can delete some segments and/or add some new ones) !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In some cases, you might want to score the translations you get with the translate script against
reference (human) translations that might be available. This script scores all the Moses translation files
put in the $mosesdir/translation_output (if the $scoreTMXdocuments parameter is set to 0) or in the
$mosesdir/translation_files_for_tmx (if the $scoreTMXdocuments parameter is set to 1) directory. For
that to happen correctly, it is very important that you respect the conventions that follow. The source
document that was translated should also observe the naming criteria defined in the translate script.
If you have done a “normal” translation, you should set the $scoreTMXdocuments parameter to 0 (its
default value).
The $scoreTMXdocuments parameter should be set to 1 if you have previously translated the text
with $translate_for_tmx set to 1 in the translate script, indicating that you have done a translation
especially suited for making translation memories.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The source file and the reference translation file names MUST observe the following conventions:
Source file: <scorebasename>.<abbreviation of source language> (ex: 100.en)
Reference translation file: <scorebasename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
!!! The <scorebasename> should NOT include the character "." !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The script can be useful for determining whether some parameter changes lead to improvements, or even for indirectly assessing the satisfaction of the human users of translations made with your trained corpus (for that, just treat their translation, once it is done, as the reference translation and score the Moses translation against it).
For a score to be done, you need to have a source document, a reference (human) translation of the source document and a Moses translation of the source document for each score that you intend to get.
The reference translations should all be put in the $mosesdir/translation_reference directory.
As seen in section L, you could have done a “normal” translation (if $translate_for_tmx = 0 in the
translate script) or a translation especially suited for TMX translation memories (if
$translate_for_tmx = 1).
The source document should be placed in the $mosesdir/translation_input directory, for normal translations, or in $mosesdir/translation_files_for_tmx, for translations suited for TMX. The scripts will put the Moses translation in the right location ($mosesdir/translation_output, for normal translations, and $mosesdir/translation_files_for_tmx, for translations suited for TMX).
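As an illustration, for a “normal” translation of a source file 100.en into Portuguese, and assuming the output naming example given in section L, the three files involved in one score would be:
$mosesdir/translation_input/100.en (source document)
$mosesdir/translation_reference/100.pt.ref (reference translation)
$mosesdir/translation_output/100.en.pt.moses (Moses translation)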
M.1. Two types of scores
1) This script produces a NIST and BLEU score for a whole Moses translation if you set its
$score_line_by_line parameter to a value different from 1 (default: 0).
2) In some cases, you might want to detect types of segments that are systematically very well or very
incorrectly translated by Moses. In such cases, you want a BLEU or a NIST score of each segment
translated by Moses. For that, you have to set the $score_line_by_line parameter to 1. In that case, a
more detailed output file is produced with 6 fields per line:
1) BLEU score;
2) NIST score;
3) Number of the segment in the source document;
4) Source language segment;
5) Reference (human) translation of that segment;
6) Moses translation of that segment.
Furthermore, when $score_line_by_line is set to 1, this script sorts the segments in ascending order of
BLEU score.
In our opinion, setting the $tokenize and $lowercase parameters to 1 leads to results that better reflect human judgment; their default values are therefore set to 1.
When the $score_line_by_line parameter is set to 1 and the $remove_equal parameter to 1, repeated segments will not appear in the scoring report. You can, however, easily determine how many there are by subtracting the number of segments contained in the scoring report from the total number of segments of the scored text (for instance, if the scored text has 1000 segments and the report lists 950, then 50 segments were removed).
M.2. Problems with the NIST scorers
The more recent scorers freely provided by NIST (mteval-v13a-20091001.tar.gz and mteval-v12.pl) stopped with an error in some of our texts (for instance, when “ & ” and the Unicode U+001E character appeared in the texts to be scored). We have noticed that the older mteval-v11b.pl scorer (the one used by this script) didn't cause us so many problems. But even this scorer still has occasional problems with some characters (namely the Unicode U+001E character). The Unicode U+001E character is the “information separator two”, and our converter of Word documents to UTF-8 texts has a tendency to convert non-breaking hyphens into this character. We solve this problem with the following code (which replaces all occurrences of the U+001E character with a hyphen):
sed -e 's/\x1E/\-/g' inputfilename > outputfilename
If you are faced with the same problem (training OK, but scorer error), you first have to discover which character in your texts is causing the error; you can then use code like the above to substitute another, still suitable, character for the offending one.
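A possible way to hunt for such characters, assuming a GNU grep that supports Perl-style regular expressions, is to list the lines containing ASCII control characters other than tab, newline and carriage return:
# illustrative sketch: print the numbers of the lines with stray control characters
grep -nP '[\x00-\x08\x0B\x0C\x0E-\x1F]' inputfilename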
We aren't sure about the cause of these problems. They might be due to our converter to UTF-8 or to
the text files themselves.
N. Utilities
N.1. transfer-training-to-another-location script
Vital parameters: mosesdirmine, newusername, mosesdirotheruser
This script should be used on the computer on which the training that is to be transferred was made (not on the target computer).
You might want to exchange trained corpora with other persons or transfer a training already made to a
different location. In several of the training files, though, your own $mosesdir directory is written
literally (e.g., “/home/mary/moses-irstlm-randlm”).
This script creates in the $mosesdir that contains the trainings that you want to transfer
($mosesdirmine parameter) a new subdirectory
$mosesdirmine/corpora_trained_for_another_location/$newusername and places there a copy of
the $mosesdirmine/corpora_trained and $mosesdirmine/logs that you want to transfer.
In these 2 latter subdirectories, the string that referred literally to the trained corpora's initial location is replaced by the correct string that will enable them to be used by another user and/or in another location
that you specify in $newusername and $mosesdirotheruser (since you can have several Moses
installations in the same computer).
This script will copy all the trainings contained in the $mosesdirmine/corpora_trained directory.
Your original trainings are not affected by this operation (they are not erased).
You can then copy these 2 subdirectories (prepared for being transferred) to the new location or to the new computer where you want them to be used (you have to copy them there manually, for instance via a USB key or an external hard disk, into the $mosesdir directory where you want them to be used, which corresponds to the $mosesdirotheruser directory).
After you have transferred the corpora to their intended location, you can safely erase the $mosesdirmine/corpora_trained_for_another_location directory.
The $mosesdirmine parameter is the value of your $mosesdir (by default, $HOME/moses-irstlm-randlm) whose trainings you want to transfer. The $mosesdirotheruser parameter is the value of the $mosesdir to which you want to transfer your training. The $newusername parameter is the Linux login name of the user to whom you want to transfer your training (if you keep your own login, that means that you are transferring the training to another Moses installation on your own computer).
Since reading and writing to disk can lead to errors, we strongly suggest that you make a backup of the
$mosesdirotheruser directory before transferring the
$mosesdirmine/corpora_trained_for_another_location/$newusername/corpora_trained and the
$mosesdirmine/corpora_trained_for_another_location/$newusername/logs subdirectories to it and
especially before erasing or overwriting anything.
Please note that you should copy just the corpora_trained and logs subdirectories to the $mosesdirotheruser directory (not the whole $mosesdirmine/corpora_trained_for_another_location/$newusername directory).
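As a minimal sketch, assuming both installations are visible on the same machine and $mosesdirmine, $newusername and $mosesdirotheruser are set as in the script (cp -r merges into existing directories, overwriting files with the same names, hence the backup advice above):
cp -r $mosesdirmine/corpora_trained_for_another_location/$newusername/corpora_trained $mosesdirotheruser/
cp -r $mosesdirmine/corpora_trained_for_another_location/$newusername/logs $mosesdirotheruser/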
If the $mosesdirotheruser directory that is going to receive the new corpora_trained and logs subdirectories already has some trainings, when you merge them there you will be alerted that the subdirectories with the same name will be overwritten by the new ones. You should accept that “overwriting”: given the structure of those directories, the files already present there should not disappear. Even so, we again stress that it is much safer to make a backup of the contents of $mosesdirotheruser before attempting this operation, and we highly recommend that you do so (any error might destroy the previous trainings already present there!). This is something you do at your own risk.
O. Improving quality and speed
These scripts, especially the train and the translate scripts, allow you to control more than 80
parameters that influence either quality or speed and relate to both Moses and the packages it uses.
Please refer to the comments that precede the parameters, especially those of the train and translate
scripts, in order to learn more about them. They were often extracted from the manuals of both Moses
and the packages upon which Moses depends. Such parameters are organized in what we hope is a logical way, and special care was taken to specify simultaneous changes that should be made to several parameters that work together.
According to the Moses manual, you should first try the (less numerous) parameters of the translate script. If you want to reset the parameters that you changed to their default values, you can consult the Appendix of the present document to get those values.
P. Deleting trained corpora
Please remember that Moses for Mere Mortals reuses as much as possible the steps done in previous trainings. Therefore, if, for instance, you always use the same target language and have always trained your corpora with the same language model, Moses for Mere Mortals in fact uses just a ***single*** $mosesdir/corpora_trained/lm directory for all your trainings. If you erase that directory, ***all*** your trained corpora will cease to work.
Please consider also that if in the future you need any of the steps that you now erase, they will have to be redone.
In general, be very, very careful with deleting, especially if you do not know the directory structure and the files needed for each of the training steps. The fact that Moses for Mere Mortals already reuses steps previously done in other corpora trainings in itself ensures that you are saving disk space.
P.1. You want to erase all the trainings that you have done
That's really easy. Just delete the $mosesdir/corpora_trained and the $mosesdir/logs directories. The next time you use the train script, it will re-create these 2 directories.
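For instance, assuming $mosesdir is set in the shell to the base directory of your installation:
# be careful: this really erases ***all*** your trainings
rm -rf $mosesdir/corpora_trained $mosesdir/logs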
P.2. You want to erase just some of all the trainings that you have done
There are 2 ways to delete corpora. The first one, though more accurate, requires you to have 2 Moses installations (you can have as many Moses installations as you want, each one in a $mosesdir with a different name). The second one is much riskier and isn't as effective, but it does not require you to have more than one Moses installation. In both cases, we strongly recommend that you carefully back up the corpora_trained and the logs subdirectories of the $mosesdir which will be changed. As you know, if you accidentally change the $mosesdir/corpora_trained or the $mosesdir/logs directories, you can lose one, several or even all of the trainings you have done.
P.2.1. Scenario 1: More than one Moses installation available
This is by far the least risky and most efficient method of deleting unwanted trainings.
Tip: If you do not have more than one Moses installation (that is, more than one $mosesdir), you can easily create a new one by running the create script and setting its $mosesdir parameter to a value different from the $mosesdir value that was used to create your present Moses installation (after its execution, you will have 2 different Moses installations).
1) Prepare the $mosesdir where you want to delete a trained corpus (let's call it $mosesdirstart)
to be transferred to another location, by using the transfer-training-to-another-location script.
2) As you know (see section N), this script processes all the trained corpora of $mosesdirstart, that is, the ones you want to delete and also the ones you do not want to delete, and creates 2 new directories: $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs. Delete in these 2 directories, respectively, the subdirectories and the log files that correspond to the corpora that you want to delete (the most voluminous files are in the $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained/model and $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained/lm directories).
NOTE: this is a very sensitive step. If, as explained in the beginning of this section, you delete any file that is still used by trained corpora that you want to retain, those trained corpora will cease to work. But since you are acting on a copy of your corpora (made by the transfer-* script), you really can play a bit, and even learn, as long as you do step 5) before doing step 6).
3) Let's call the $mosesdir that will receive the trained corpora that you do not want to delete $mosesdirfinal. Just to play it safe, back up its $mosesdirfinal/corpora_trained and $mosesdirfinal/logs subdirectories.
4) Now you just have to move the
$mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained
and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs to,
respectively, $mosesdirfinal/corpora_trained and $mosesdirfinal/logs.
5) In order to verify that everything went well, make a small translation with one of the trained corpora that you want to keep in $mosesdirfinal, as well as a translation with one of the corpora that you have now manually transferred there. This is a most important step that you should never skip unless you want to risk some, and possibly even all, of your work.
6) If no problems were detected in the previous step, delete $mosesdirstart.
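A minimal sketch of steps 3) and 4) above, assuming $mosesdirstart, $mosesdirfinal and $newusername are set in the shell ($prepared is just a convenience variable introduced here; cp is used instead of mv so that the prepared copies remain in place until step 6)):
prepared=$mosesdirstart/corpora_trained_for_another_location/$newusername
# step 3): backup copies of the receiving subdirectories
cp -r $mosesdirfinal/corpora_trained $mosesdirfinal/corpora_trained.bak
cp -r $mosesdirfinal/logs $mosesdirfinal/logs.bak
# step 4): merge the prepared subdirectories into the receiving installation
cp -r $prepared/corpora_trained $mosesdirfinal/
cp -r $prepared/logs $mosesdirfinal/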
P.2.2. Scenario 2: Single Moses installation available
A more convoluted, much riskier and less efficient way is also available, but it is only intended for real experts (in computers, in Moses ***and*** in Moses for Mere Mortals). If you make any mistake and you haven't backed up your files, you risk destroying your work.
The log file of the training that you want to suppress (located in the $mosesdir/logs subdirectory) contains, at its very end, a list of the files used in that training. You can erase the files that use the most space and that are surely not required for any other training, erasing only ***some*** of the files listed in the log file, located in the following subdirectories of $mosesdir/corpora_trained:
1) lm
2) evaluation
3) memmaps
4) model
5) tuning
The expert side of things lies in the fact that, among such files, you should only erase those that are not being used by a trained corpus that you want to retain. Since there are many possible scenarios, that question is beyond the scope of the present Tutorial.
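To inspect the list of files recorded at the end of a training log, something like the following can be used (the file name below is just a placeholder for the log file you find in $mosesdir/logs):
tail -n 100 $mosesdir/logs/name-of-your-log-file.txt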
Q. How to contribute
You are actively encouraged to improve Moses for Mere Mortals. Comments, criticisms and further scripts or documentation that make the process of using Moses more user-friendly are most welcome.
If we accept your work, we will fully acknowledge its author (and only its author), and we propose that at the very beginning of it you write:
#copyright {year}, {your name}
#licenced according to the {name of the licence} licence
If you propose a significant change to an existing script that does not involve our work, the names of all of its authors will be mentioned in it and the licence will have to be agreed upon.
Code that respects the structure of our scripts (and that preferably changes them the way you want) will be reviewed much more quickly and will cause us much less work. Other code, unless it is very well documented and its purpose seems very interesting to us, risks not being reviewed for sheer lack of time, unless it can work with the present Moses for Mere Mortals directory structure or itself creates a complete and different structure that is immediately functional. Please understand that we also have professional time pressures that may stop us from working as much as we would like on Moses for Mere Mortals for some time. Even if we did not agree to include your code in our work, you would of course still be free to use Moses for Mere Mortals and to distribute it, according to the terms of the GPL licence.
We expect you to test your code, but we also reserve the right to propose changes if it nevertheless doesn't pass the tests that we make and to which our own work is also subjected. We may not always have time to immediately test all your proposals, but you can expect a very quick and brief answer acknowledging the receipt of your code, as well as the expected schedule of its testing (except during holidays or times of particular work pressure).
If you want to belong to our team, please send an e-mail to joao.luis.rosas@gmail.com describing
yourself and the ideas that you would like to implement. We will answer every such proposal. We can't
promise we will accept every proposal, but we are actively looking for more developers.
R. Thanks
Special thanks:
• Maria José Machado, whose suggestions and research have significantly influenced the score
script. She helped in the evaluation of Moses output in general and organized, together with
Hilário, a comparative evaluation, made by professional translators, of the qualitative results of
Google, Moses and a rule-based MT engine. She suggested a deep restructuring of the present
Help-Tutorial file and is a co-author of it.
• Hilário Leal Fontes, who made very helpful suggestions about the functionality of several
scripts and made comprehensive tests. He is also the author of the nonbreaking_prefix.pt script
(for the Portuguese language). He has compiled the corpora that were used to train Moses and
to test these scripts, including 2 very large corpora with 6.6 and 12 million segments. He has
also revised the Help/Tutorial file.
• Tom Hoar, for consolidating previous documentation into the Quick-Start-Guide.doc to help
users to get up to speed very quickly.
• Manuel Tomas Carrasco Benitez, whose Xdossier application was used to create a pack of the
Moses-for-Mere-Mortals files.
• Gary Daine, who made helpful remarks and who contributed code for Extract_TMX_Corpus.
• Authors of the http://www.dlsi.ua.es/~mlf/fosmt-moses.html (Mikel Forcada and Francis
Tyers) and the http://www.statmt.org/moses_steps.html pages. These pages have helped me a
lot in the first steps with Moses.
• Authors of the documentation of Moses, giza-pp, MGIZA, IRSTLM and RandLM; some of the
comments of the present scripts describing the various parameters include extracts of them.
• European Commission's Joint Research Centre and Directorate-General for Translation for the
DGT-TM Acquis - freely available on the JRC website and providing aligned corpora of about 1
million segments of Community law texts in 22 languages - which was used in the
demonstration corpus. Please note that only European Community legislation printed in the
paper edition of the Official Journal of the European Union is deemed authentic.
S. Author
João Luís Amorim de Castro Rosas
joao.luis.rosas@gmail.com
The author wishes to stress that the very, very long (unimaginable) working hours and the numerous extremely relevant suggestions of Hilário Leal Fontes and Maria José Machado, who tested this software in an actual translation environment, were an immense contribution and also created a very pleasurable working environment (despite the stress we all suffered ;-) ). These scripts would not be the same, and would in fact be much worse, without their help, which made them reflect, and try to answer, the practical problems of professional translators.
APPENDIX: default parameters of each of the scripts
NOTE: the lines starting with the symbol “#” are comments that explain the role of the parameters; the
parameters are indicated in bold. The vital parameters of each script (those that you probably will want
to change if you train your own corpora) are indicated in bold red.
1) create script:
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Please set $mosesnumprocessors to the number of processors of your computer !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Number of processors in your computer
mosesnumprocessors=1
#Install small demo corpus: 1 = Install; Any other value = Do not install (!!! this will install a very
small corpus that can be used to see what the scripts and Moses can do; if dodemocorpus is set to 1, this
series of scripts will be able to use the demo corpus without you having to change their settings !!!)
dodemocorpus=1
#Remove the downloaded compressed packages and some directories no longer needed once the
installation is done; 1 = remove the downloaded packages; any other value = do not remove those
packages
removedownloadedpackges=1
2) make-test script:
#Base path of Moses installation
mosesdir=$HOME/moses-irstlm-randlm
#Source language abbreviation
lang1=pt
#Target language abbreviation
lang2=en
#Number of sectors in which each input file will be cut
totalnumsectors=100
#Number of segments pseudorandomly searched in each sector
numsegs=10
#Name of the source language file used for creating one of the test files (!!! omit the path; the name
should not include spaces !!!)
basefilename=200000
3) train script:
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#NOTE 1: The corpus that you want to train, together with the respective tuning files (if different), the
testing files (if different), the file used for recasing, and the file used to build the language model (if
different) should be placed in $mosesdir/corpora_for_training !!!
#NOTE 2: After the script is executed, you will find a summary of what has been done (the corpus
summary file) in $mosesdir/logs
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#====================1. LANGUAGES
========================================
#Abbreviation of language 1 (source language)
lang1=pt
#Abbreviation of language 2 (target language)
lang2=en
#====================2. FILES =============================================
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the files should not include spaces or special characters, like asterisks, backslashes or question marks. Try to stick with letters, numbers, and the dash, dot, and underscore if you want to avoid Bash surprises. Avoid using a dash as the first character of a file name, because most Linux commands will treat it as a switch. If your files start with a dot, they'll become hidden files !!!
The $corpusbasename, $lmbasename and $recaserbasename parameters that follow MUST be defined.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Basename of the corpus placed in $mosesdir/corpora_for_training (the example that follows refers to
the 2 files 200000.for_train.en and 200000.for_train.pt, whose basename is 200000.for_train)
corpusbasename=200000.for_train
#Basename of the file used to build the language model (LM), placed in
$mosesdir/corpora_for_training (!!! this is a file in the target language !!!)
lmbasename=300000
#Basename of the recaser training file, placed in $mosesdir/corpora_for_training
recaserbasename=300000
#Basename of the tuning corpus, placed in $mosesdir/corpora_for_training
tuningbasename=800
#Basename of the test set files (used for testing the trained corpus), placed in
$mosesdir/corpora_for_training
testbasename=200000.for_test
#===================== 3. TRAINING STEPS ===================================
#--------------------------------------------------------------------------------------------------------------------------
#Reuse all relevant files that have already been created in previous trainings: 1 = Do; Any other value = Don't
reuse=1
#--------------------------------------------------------------------------------------------------------------------------
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#NOTE 1: If in doubt, leave the settings that follow as they are; you will do a full training with
memory mapping, tuning, a training test and scoring of the training test of the demo corpus; the results
will appear in $mosesdir/corpora_trained and a log file will be available in $mosesdir/logs.
#NOTE 2: You can also proceed step by step (e.g., first doing just LM building and corpus training and
then testing), so as to better control the whole process.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Do parallel corpus training: 1= Do ; Any other value=Don't !!!
paralleltraining=1
#Number of the first training step (possible values: 1-9); choose 1 for a completely new corpus
firsttrainingstep=1
#Number of the last training step (possible values: 1-9); choose 9 for a completely new corpus
lasttrainingstep=9
#Do memory mapping: 1 = Do ; Any other value = Don't
memmapping=1
#Do tuning: 1= Do ; Any other value=Don't; can lead, but does not always lead, to better results; takes
much more time
tuning=1
#Do a test (with scoring) of the training: 1 = Do ; Any other value = Don't
runtrainingtest=1
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# If you are new to Moses, stop here for the time being
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#=================== 4. LANGUAGE MODEL PARAMETERS ======================
# Use IRSTLM (1) or RandLM (5)
lngmdl=1
#Order of ngrams - the higher the better, but more memory required (choose between 3 and 9; good
value: 5)
Gram=5
#---------------------------*** 4.1. IRSTLM PARAMETERS ***-----------------------------------------------
# Distributed language model: 1 = Yes; Any other value = No (splits the file used to build the language model into parts, processes each part separately and finally merges the parts)
distributed=1
# Number of parts to split dictionary into balanced n-gram prefix lists (in the creation of a distributed
language model); default: 5; !!! Only used if distributed = 1 !!!
dictnumparts=20
# Smoothing possible values: witten-bell (default); kneser-ney, improved-kneser-ney
s='witten-bell'
# Quantize LM (IRSTLM user manual, p. 4: "Reduces memory consumption at the cost of some loss of performance") 1 = Do; Any other value = Don't. May induce some accuracy loss. Reduces the size of the LM.
quantize=0
# Memory-mapping of the LM. 1 = Do; Any other value = Don't. Avoids the creation of the binary LM
directly in RAM (allows bigger LM at the cost of lower speed; often necessary when LM file is very
big) !!!
lmmemmapping=1
#------------------------------------*** 4.2. RandLM PARAMETERS ***---------------------------------------
# The format of the input data. The following formats are supported: for a CountRandLM, "corpus" (tokenised text corpora, one sentence per line); for a BackoffRandLM, "arpa" (an ARPA backoff language model)
inputtype=corpus
# The false positive rate of the randomised data structure on an inverse log scale so '-falsepos 8'
produces a false positive rate of 1/2^8
falsepos=8
# The quantisation range used by the model. For a CountRandLM, quantisation is performed by taking
a logarithm. The base of the logarithm is set as 2^{1/'values'}. For a BackoffRandLM, a binning
quantisation algorithm is used. The size of the codebook is set as 2^{'values'}
values=8
#====================== 5. TRAINING PARAMETERS ===========================
#------------------------------------- *** 5.1. TRAINING STEP 1 ***--------------------------------------------
#********** mkcls options
#Number of mkcls iterations (default: 2)
nummkclsiterations=2
#Number of word classes
numclasses=50
#--------------------------------------*** 5.2. TRAINING STEP 2 ***--------------------------------------------
#.................................. 5.2.1. MGIZA parameters ..................................
#Number of processors of your computer that will be used by MGIZA (if you use all the processors available, the training will be considerably sped up)
mgizanumprocessors=1
#........ 5.2.2. GIZA parameters .......................................................................
#maximum sentence length; !!! never exceed 101 !!!
ml=101
#No. of iterations:
#------------------
#number of iterations for Model 1
model1iterations=5
#number of iterations for Model 2
model2iterations=0
#number of iterations for HMM (substitutes model 2)
hmmiterations=5
#number of iterations for Model 3
model3iterations=3
#number of iterations for Model 4
model4iterations=3
#number of iterations for Model 5
model5iterations=0
#number of iterations for Model 6
model6iterations=0
#
#parameters for various heuristics in GIZA++ for efficient training:
#-----------------------------------------------------------------
#Counts increment cutoff threshold
countincreasecutoff=1e-06
#Counts increment cutoff threshold for alignments in training of fertility models
countincreasecutoffal=1e-05
#minimal count increase
mincountincrease=1e-07
#relative cutoff probability for alignment-centers in pegging
peggedcutoff=0.03
#Probability cutoff threshold for lexicon probabilities
probcutoff=1e-07
#probability smoothing (floor) value
probsmooth=1e-07
#parameters for describing the type and amount of output:
#----------------------------------------------------------
#0: detailed alignment format; 1: compact alignment format
compactalignmentformat=0
#dump frequency of Model 1
model1dumpfrequency=0
#dump frequency of Model 2
model2dumpfrequency=0
#dump frequency of HMM
hmmdumpfrequency=0
#output: dump of transfer from Model 2 to 3
transferdumpfrequency=0
#dump frequency of Model 3/4/5
model345dumpfrequency=0
#for printing the n best alignments
nbestalignments=0
#1: do not write any files
nodumps=1
#1: write alignment files only
onlyaldumps=1
#0: not verbose; 1: verbose
verbose=0
#number of the sentence for which a lot of information should be printed (negative: no output)
verbosesentence=-10
#smoothing parameters:
#--------------------
#f-b-trn: smoothing factor for HMM alignment model (can be ignored by -emSmoothHMM)
emalsmooth=0.2
#smoothing parameter for IBM-2/3 (interpolation with constant)
model23smoothfactor=0
#smoothing parameter for alignment probabilities in Model 4
model4smoothfactor=0.4
#smoothing parameter for distortion probabilities in Model 5 (linear interpolation with constant)
model5smoothfactor=0.1
#smoothing for fertility parameters (good value: 64): weight for word-length-dependent fertility parameters
nsmooth=4
#smoothing for fertility parameters (default: 0): weight for word-independent fertility parameters
nsmoothgeneral=0
#parameters modifying the models:
#-------------------------------
#1 = only 3-dimensional alignment table for IBM-2 and IBM-3
compactadtable=1
#0 = IBM-3/IBM-4 as described in (Brown et al. 1993); 1: distortion model of empty word is deficient; 2: distortion model of empty word is deficient (differently; setting this parameter also helps to avoid that during IBM-3 and IBM-4 training too many words are aligned with the empty word)
deficientdistortionforemptyword=0
#d_{=1}: &1:l, &2:m, &4:F, &8:E, d_{>1}&16:l, &32:m, &64:F, &128:E)
depm4=76
#d_{=1}: &1:l, &2:m, &4:F, &8:E, d_{>1}&16:l, &32:m, &64:F, &128:E)
depm5=68
#lextrain: dependencies in the HMM alignment model. &1: sentence length; &2: previous class; &4:
previous position; &8: French position; &16: French class
emalignmentdependencies=2
#f-b-trn: probability for empty word
emprobforempty=0.4
#parameters modifying the EM-algorithm:
#-------------------------------------
#fixed value for parameter p_0 in IBM-5 (if negative then it is determined in training)
m5p0=-1
manlexfactor1=0
manlexfactor2=0
manlexmaxmultiplicity=20
#maximum fertility for fertility models
maxfertility=10
#fixed value for parameter p_0 in IBM-3/4 (if negative then it is determined in training)
p0=0.999
#0: no pegging; 1: do pegging
pegging=0
#-------------- *** 5.3. TRAINING SCRIPT PARAMETERS ***---------------------------------------------
#Heuristic used for word alignment; possible values: intersect (intersection seems to be a synonym), union, grow, grow-final, grow-diag, grow-diag-final-and (default value), srctotgt, tgttosrc (Moses manual, pp. 72, 144)
alignment=grow-diag-final-and
#Reordering model; possible values: msd-bidirectional-fe (default), msd-bidirectional-f, msd-fe, msd-f,
monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe, monotonicity-f (Moses
manual, p. 77)
reordering=msd-bidirectional-fe
#Minimum length of the sentences (used by clean)
MinLen=1
#Maximum length of the sentences (used by clean)
MaxLen=60
#Maximum length of phrases entered into phrase table (max: 7; choose a lower value if phrase size
length is an issue; good value for most purposes: 3)
MaxPhraseLength=5
#-------------- *** 5.4. DECODER PARAMETERS ***-------------------------------------------------------
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Only used in the training evaluation, and only if tuning = 0 !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#***** QUALITY TUNING:
# Weights for phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good
translations of each other
weight_t=1
# Weights for language model (good values: 0.1-1; default: 1); ensures that output is fluent in target
language
weight_l=1
# Weights for reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence
weight_d=1
# Weights for word penalty (good values: -3 to 3; default: 0; negative values favor large output;
positive values favour short output); ensures translations do not get too long or too short
weight_w=0
#-----------------------------------------
# Use Minimum Bayes Risk (MBR) decoding (1 = Do; Any other value = do not); instead of
outputting the translation with the highest probability, MBR decoding outputs the translation that is
most similar to the most likely translations.
mbr=0
# Number of translation candidates considered. MBR decoding uses by default the top 200 distinct
candidate translations to find the translation with minimum Bayes risk
mbrsize=200
# Scaling factor used to adjust the translation scores (default = 1.0)
mbrscale=1.0
# Adds walls around punctuation ,.!?:;". 1= Do; Any other value = do not. Specifying reordering
constraints around punctuation is often a good idea.
monotoneatpunctuation=0
#***** SPEED TUNING:
# Fixed limit for how many translation options are retrieved for each input phrase (0 = no limit;
positive value = number of translation options per phrase)
ttablelimit=20
# Use the relative scores of hypothesis for pruning, instead of a fixed limit (0= no pruning; decimal
value = more pruning)
beamthreshold=0
# Threshold for constructing hypotheses based on estimated cost (default: 0 = not used). During the beam
search, many hypotheses are created that are too bad to be even entered on a stack. For many of them,
it is even clear before the construction of the hypothesis that it would be not useful. Early discarding of
such hypotheses hazards a guess about their viability. This is based on correct score except for the
actual language model costs which are very expensive to compute. Hypotheses that, according to this
estimate, are worse than the worst hypothesis of the target stack, even given an additional specified
threshold as cushion, are not constructed at all. This often speeds up decoding significantly. Try
threshold factors between 0.5 and 1
earlydiscardingthreshold=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#To get faster decoding than the default Moses setting at roughly the same quality, use the
parameter settings $searchalgorithm=1, $cubepruningpoplimit=2000 and $stack=2000. With cube
pruning, the size of the stack has little impact on performance, so it should be set rather high. The
speed/quality trade-off is mostly regulated by the -cube-pruning-pop-limit, i.e. the number of
hypotheses added to each stack
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Search algorithm; cube pruning is faster than the traditional search at comparable levels of search
errors; 0 = default; 1 = turns on cube pruning
searchalgorithm=0
# Number of hypotheses added to each stack; only a fixed number of hypotheses are generated for each
span; default is 1000, higher numbers slow down the decoder, may result in better quality
cubepruningpoplimit=1000
# Reduce size of hypothesis stack, that keeps the best partial translations (=beam); default: 100
stack=100
# Maximum phrase length (default: 20)
maxphraselen=20
# ****** SPEED AND QUALITY TUNING
# Minimum number of hypotheses from each coverage pattern; you may also require that a minimum
number of hypotheses is added for each word coverage (they may be still pruned out, however). This is
done using the switch -cube-pruning-diversity, which sets the minimum. The default is 0
cubepruningdiversity=0
# Distortion (reordering) limit in maximum number of words (0 = monotone; -1 = unlimited ; any other
positive value = maximal number of words; default:6)); limiting distortion often increases speed and
quality
distortionlimit=6
#================= 6. TUNING PARAMETERS =================================
# Maximum number of runs of tuning ( -1 = no limit; Any positive number = maximum number of
runs)
maxruns=10
4) translate script:
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the files to be translated should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# The source file name and the reference translation file name MUST observe the following conventions:
# Source file: <basename>.<abbreviation of source language> (ex: 100.en)
# Reference translation file: <basename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Even if you are using the demonstration corpus, you have to fill the $logfile parameter so that the script can be executed !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Name of the log file of the corpus to be used (time-saving tip: copy and paste it here; the default
directory of the log files is $mosesdir/logs); example of a possible name of a log file: pt-en.C200000.for_train-60-1.LM-300000.MM-1.day-18-01-10-time-14-08-50.txt)
logfile=
#Create a translation report when translations are finished; 1 = Do; Any other value = Do not
create_translation_report=1
#--------------------------------------*** TMX OPTIONS ***----------------------------------------------------
#Process both the document to be translated and the Moses translation so that the machine translation can best be used in TMX translation memories
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!!! If you set this parameter to 1, you SHOULD NOT use the score script unless the $othercleanings,
$improvesegmentation and $removeduplicates parameters are all set to 0 and $minseglen is set to -1,
since this processing changes the order of the segments and can also make the source document have a
number of segments that is different from the number of segments of the reference translation (namely
because it can delete some segments and/or add some new ones) !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
translate_for_tmx=0
#Minimal length of sentences; -1=any length; any other value=segments with less than $minseglen will
be erased ( !!! only active if translate_for_tmx =1 !!!)
minseglen=-1
#Substitute tabulation signs by newlines and remove lines composed only of digits, spaces and
parentheses ( !!! only active if translate_for_tmx = 1 !!!)
othercleanings=1
# Substitute any of the characters [:;.!?] followed by a space by that character followed by a newline;
delete empty lines; substitute doublespaces by one space ( !!! only active if translate_for_tmx = 1 !!!)
improvesegmentation=1
#Sort segments and remove those segments that are identical ( !!! only active if translate_for_tmx =1
!!! )
removeduplicates=1
#---------------------------------- *** MOSES DECODER PARAMETERS ***------------------------------
#***** QUALITY TUNING:
# Weights for phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good
translations of each other
weight_t=1
# Weights for language model (good values: 0.1-1; default: 1); ensures that output is fluent in target
language
weight_l=1
# Weights for reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence
weight_d=1
# Weights for word penalty (good values: -3 to 3; default: 0; negative values favor large output;
positive values favour short output); ensures translations do not get too long or too short
weight_w=0
#-----------------------------------------
# Use Minimum Bayes Risk (MBR) decoding (1 = Do; Any other value = do not); instead of
outputting the translation with the highest probability, MBR decoding outputs the translation that is
most similar to the most likely translations.
mbr=0
# Number of translation candidates considered. MBR decoding uses by default the top 200 distinct
candidate translations to find the translation with minimum Bayes risk
mbrsize=200
# Scaling factor used to adjust the translation scores (default = 1.0)
mbrscale=1.0
# Adds walls around punctuation ,.!?:;". 1= Do; Any other value = do not. Specifying reordering
constraints around punctuation is often a good idea.
monotoneatpunctuation=0
#***** SPEED TUNING:
# Fixed limit for how many translation options are retrieved for each input phrase (0 = no limit;
positive value = number of translation options per phrase)
ttablelimit=20
# Use the relative scores of hypothesis for pruning, instead of a fixed limit (0= no pruning; decimal
value = more pruning)
beamthreshold=0
# Threshold for constructing hypotheses based on estimated cost (default: 0 = not used). During the beam
search, many hypotheses are created that are too bad to be even entered on a stack. For many of them,
it is even clear before the construction of the hypothesis that it would be not useful. Early discarding of
such hypotheses hazards a guess about their viability. This is based on correct score except for the
actual language model costs which are very expensive to compute. Hypotheses that, according to this
estimate, are worse than the worst hypothesis of the target stack, even given an additional specified
threshold as cushion, are not constructed at all. This often speeds up decoding significantly. Try
threshold factors between 0.5 and 1
earlydiscardingthreshold=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#To get faster decoding than the default Moses setting at roughly the same quality, use the
parameter settings $searchalgorithm=1, $cubepruningpoplimit=2000 and $stack=2000. With cube
pruning, the size of the stack has little impact on performance, so it should be set rather high. The
speed/quality trade-off is mostly regulated by the -cube-pruning-pop-limit, i.e. the number of
hypotheses added to each stack
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Search algorithm; cube pruning is faster than the traditional search at comparable levels of search
errors; 0 = default; 1 = turns on cube pruning
searchalgorithm=0
# Number of hypotheses added to each stack; only a fixed number of hypotheses are generated for each
span; default is 1000, higher numbers slow down the decoder, may result in better quality
cubepruningpoplimit=1000
# Reduce size of hypothesis stack, that keeps the best partial translations (=beam); default: 100
stack=100
# Maximum phrase length (default: 20)
maxphraselength=20
# ****** SPEED AND QUALITY TUNING
# Minimum number of hypotheses from each coverage pattern; you may also require that a minimum
number of hypotheses is added for each word coverage (they may be still pruned out, however). This is
done using the switch -cube-pruning-diversity, which sets the minimum. The default is 0
cubepruningdiversity=0
# Distortion (reordering) limit in maximum number of words (0 = monotone; -1 = unlimited ; any other
positive value = maximal number of words; default:6)); limiting distortion often increases speed and
quality
distortionlimit=6
5) score script:
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! THIS SCRIPT SHOULD NOT BE USED WITH DOCUMENTS TRANSLATED WITH THE translate script WITH ITS $translate_for_tmx PARAMETER SET TO 1 IF ITS $remove_equal PARAMETER IS SET TO A VALUE DIFFERENT FROM 0 !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the source and target reference translation files should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The source file name and the reference translation file name MUST observe the following conventions:
Source file: <basename>.<abbreviation of source language> (ex: 100.en)
Reference translation file: <basename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
!!! The <basename> should NOT include the character "." !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Base directory of your Moses installation (made with create)
mosesdir=$HOME/moses-irstlm-randlm
#Scores documents prepared for TMX translation memories. If this parameter is set to 1, the script will look for the source ($s) and Moses translation ($m) documents in the $mosesdir/translation_files_for_tmx directory; if not set to 1, it will look for the $s document in the $mosesdir/translation_input directory and for the $m document in $mosesdir/translation_output; in both cases, it will look for the reference translation ($r) document in $mosesdir/translation_reference
scoreTMXdocuments=0
#Create a report where each segment gets its own score; 0 = score the whole document; 1 = score each
segment
score_line_by_line=0
#Remove Moses translation segments that are equal to reference translation segments and whose BLEU
score is zero (!!! Only active if score_line_by_line=1 !!!)
remove_equal=1
#Tokenize the source document and the reference and the Moses translation
tokenize=1
#Lowercase the source document and the reference and the Moses translation
lowercase=1
6) transfer-training-to-another-location script:
#Base dir of the Moses system (e.g., $HOME/moses-irstlm-randlm) whose trainings you want to transfer
mosesdirmine=$HOME/moses-irstlm-randlm
#Login name of the user to whom the trained corpora will be transferred
newusername=john
#Base dir of the Moses system of the user to whom the trained corpora will be transferred
mosesdirotheruser=/media/1.5T/moses-irstlm-randlm