Moses for Mere Mortals – Tutorial
A translation chain for the real world

Maria José Machado
João Luís Rosas
Hilário Leal Fontes

Table of Contents

A. Purpose
A.1. Moses for Mere Mortals (the Linux scripts)
A.2. Windows add-ins
B. Requirements
B.1. System requirements
B.2. Software
C. Overview
D. Installation for new Moses users
D.1. Demonstration corpus
D.2. First steps with the scripts
E. Using your own corpora
F. create script
G. Understanding the directory structure of $mosesdir
H. Names of the files and of the languages
I. Files needed for the training and what you need to know before training your own corpus
I.1. Where to put the files to be processed
I.2. Need for strictly aligned corpora files
I.3. Do not use spaces in file names
I.4. Corpus files
I.4.1. Using TMX files to create Moses corpus files
I.5. Test files
I.6. Language model file
I.7. Tuning files
I.7.1. maxruns parameter
I.8. Evaluation (testing) files
I.9. Recaser files
J. make-test script
J.1. The ominous control characters
K. train script
K.1. Description of some important parameters
K.2. Greatly increasing the training speed
K.3. Controlling tuning
K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of a training already done in previous trainings
K.5. Training of an inverted corpus
K.6. Isolating a training from all the other ones
K.7. NIST scorer errors
L. translate script
L.1. Speed
L.2. Reusing tuning weights (only for advanced users)
M. score script
M.1. Two types of scores
M.2. Problems with the NIST scorers
N. Utilities
N.1. transfer-training-to-another-location script
O. Improving quality and speed
P. Deleting trained corpora
P.1. You want to erase all the trainings that you have done
P.2. You want to erase just some of the trainings that you have done
P.2.1. Scenario 1: More than one Moses installation available
P.2.2. Scenario 2: Single Moses installation available
Q. How to contribute
R. Thanks
S. Author
APPENDIX: default parameters of each of the scripts
1) create script
2) make-test script
3) train script
4) translate script
5) score script
6) transfer-training-to-another-location script

A. Purpose

Moses for Mere Mortals builds a translation chain prototype with Moses + IRSTLM + RandLM + MGIZA, which means that very large corpora can be processed. Its main aims are the following:
1) To help build a translation chain for the real world (especially if there is a tight schedule for production);
2) To guide the first steps of users who are just beginning to use Moses;
3) To enable a quick evaluation of Moses;
4) To integrate machine translation and translation memories.

Even though the main thrust of this work centers on Linux (the operating system where Moses runs), translators usually work in an MS Windows environment. Therefore, two add-ins (collectively called "Windows add-ins") make the bridge from Windows to Linux and then back from Linux to Windows (windows-add-ins-1.043.tar.gz package). For a brief overview of the whole project, please see section C (Overview).
Moses for Mere Mortals also has some original features:
1) It removes control characters from the input files (these can crash a training);
2) It extracts from the corpus files 2 test files with randomly selected, non-consecutive segments, which are then erased from the corpus files;
3) A new training does not interfere with the files of a previous training;
4) A new training reuses as much as possible the files created in previous trainings (thus saving time);
5) The inversion of a corpus (e.g., from en-pt to pt-en) is detected, allowing a much quicker training than that of the original language pair (the scripts also check that the inverse training is correct);
6) It stops with an informative message if any of the phases of training (language model building, recaser training, corpus training, memory-mapping, tuning or training test) doesn't produce the expected results;
7) It can limit the duration of tuning;
8) It can get the BLEU and NIST scores of a translation (either for the whole document or for each segment of it);
9) It can transfer your trainings to someone else or to another Moses installation in the same computer;
10) All the mkcls, GIZA and MGIZA parameters can now be controlled through parameters of the train script;
11) Selected parameters of the Moses scripts and of the Moses decoder can now be controlled through the train and translate scripts;
12) It can score in one go all the Moses translation files present in one directory.

NOTE: given that this tutorial can be read from top to bottom or, more probably, by jumping selectively to random sections, we have repeated points that we consider important in several different sections.

A.1. Moses for Mere Mortals (the Linux scripts)

Moses for Mere Mortals, the Linux component of this work, automates the installation, the creation of a representative set of test files, the training, the translation and even the scoring tasks. It also comes with a demonstration corpus (too small to do justice to the qualitative results that can be achieved with Moses, but capable of giving a realistic view of the relative duration of the steps involved).

For building very large Moses corpora using your own translation memories (*.TMX files), please use the Extract_TMX_Corpus program included in the windows-add-ins-1.043.tar.gz package. If you want to transfer Moses translations to a *.TMX translation memory tool (e.g., SDL Trados Translator's Workbench), you can use Moses2TMX, a program included in the same package. Together, these 2 open source programs make the link between Linux (where the corpora are trained and the translations are made) and Windows (where most translators actually use Moses translations).

Users of these scripts, after having tried the demonstration corpus, should be able to immediately get results with the real corpora they are interested in. These scripts also avoid the destruction of previously trained corpora by the subsequent training of a new corpus and simultaneously try to reuse the relevant parts of previous trainings in the training of a new corpus.

It is also possible to train corpora where every word is presented together with, for instance, its respective lemma and/or part-of-speech tag ("factored training"). The present scripts do not cover this type of training. If you are interested in morphologically rich languages, and especially if one of them is the target language, you should be interested in factored training, a field that is evolving a lot at the present time.
Moses-for-Mere-Mortals scripts are based on instructions from several sources, especially the http://www.dlsi.ua.es/~mlf/fosmt-moses.html and http://www.statmt.org/moses_steps.html web pages and the Moses, IRSTLM, RandLM, giza-pp and MGIZA documentation, as well as on research on the available literature on Moses, namely the Moses mailing list (http://news.gmane.org/gmane.comp.nlp.moses.user). The comments in the scripts transcribe parts of the manuals of all the tools used.

The Moses MT System is an open source project under the guidance of Philipp Koehn, University of Edinburgh, and is supported by the European Commission Framework Programme on Research and Technological Development and others. For information on the general concepts of Statistical Machine Translation, see "Statistical Machine Translation" by Philipp Koehn, Cambridge University Press, 2010.

A.2. Windows add-ins

Translators usually do not have corpora large enough to get excellent machine translation (MT) results, and the results they get with MT are highly dependent on training it with a corpus that is highly representative of the domains they are interested in. The synergy between machine translation and translation memories is not often stressed, but it seems to us that it naturally leads to better results: machine translation can be enriched with the human translations stored in the translation memories; and translation memories, if they do not have a certain "match" percentage for a given segment, can be complemented with machine translation segments. Therefore, the Windows add-ins address this synergy between MT and translation memories (namely those in the widely used TMX format). Translation memories are not, however, an obligatory part of this pack: the Linux scripts work with any perfectly aligned corpora files in UTF-8 format with Linux line endings.

In the windows-add-ins-1.043.tar.gz package you will find a README file with instructions for each of the 2 Windows add-ins:
1) Extract_TMX_Corpus, which converts a whole directory of TMX files into a Moses corpus that can be used for training;
2) Moses2TMX, which converts a batch of Moses translations and their corresponding source language documents into TMX files.
They will therefore not be mentioned in the rest of this Help/Tutorial file.

B. Requirements

B.1. System requirements

Moses for Mere Mortals (MMM) has been tested with the following Linux (64-bit) distributions:
Ubuntu 10.04 LTS (http://old-releases.ubuntu.com/releases/10.04/)
Ubuntu 10.10 (http://old-releases.ubuntu.com/releases/10.10/)
Ubuntu 11.04 (http://old-releases.ubuntu.com/releases/11.04/)
If you are new to Linux, this could be a good time to ask for the help of a friend. The scripts should also work in other Linux distributions with slight changes, but they have not been tested in any other distribution.

Computer: minimum 2 GB RAM (preferably much more); preferably a fast multiprocessor computer.
Disk space: as a rough rule, the disk space needed for corpus training is approximately 100 times the size of the corpus (source plus target files).

B.2. Software

In order to use the scripts, the following packages need to be installed in Ubuntu (64-bit):
1) automake
2) bison
3) boost-build
4) build-essential
5) flex
6) help2man
7) libboost-all-dev
8) libgc-dev
9) libpthread-stubs0-dev
10) libtool
11) zlibc
12) zlib1g-dev
13) gawk
14) tofrodos
You can install them by selecting the System menu and then the Administration > Synaptic Package Manager command.
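If you prefer the command line, the same packages can be installed in one go with apt-get (a sketch equivalent to the Synaptic steps above; run it in a terminal):

sudo apt-get install automake bison boost-build build-essential flex \
    help2man libboost-all-dev libgc-dev libpthread-stubs0-dev libtool \
    zlibc zlib1g-dev gawk tofrodos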
C. Overview

These scripts have only been tested in Ubuntu 10.04 LTS, 10.10 and 11.04 64-bit environments.

Note: so as to avoid having to rewrite this Help file every time a script version changes, in what follows the version numbers of the scripts have been omitted (for example, in most places we write create-* (or just create) instead of create-1.34 in order to refer to the create-1.34 script).

1) You should start using these scripts by opening the create script and changing some important parameters there that will adapt your installation to your hardware (see section F. create script).

2) Then launch the create script, which will download and compile ***all*** the Moses packages and the packages upon which Moses depends. Since some compilation errors do not stop compilation but can lead to an unusable tool, the create script checks, at the end of the compilation of each tool, whether the appropriate files are present or not.

3) Select a set of corpus files (2 strictly aligned files in UTF-8 format, one in the source language and another in the target language) for Moses training and launch the make-test script in order to create a representative set of segments that will be used for testing the corpus that you will train, and in order to erase those segments from the corpus files that will be used for training. This step is not obligatory but is highly recommended.

NOTE 1: Even if you do not have your own corpus for training, you can use the demo corpus that comes with these scripts (you do not need to do anything to get this corpus: the create script takes care of that, and the other scripts are configured to use it "straight from the box"). It is highly recommended that new Moses users start using these scripts with the demo corpus.

NOTE 2: If you do not have text files in UTF-8 format (necessary for Moses), but you do have translation memories in *.TMX format, you can use the Extract_TMX_Corpus tool (included in the windows-add-ins-1.043.tar.gz package) to create the adequate UTF-8 files from your TMX files.

4) Next, the train script trains a corpus composed exclusively of plain words ("non-factored training"). This script does include some advanced features, like memory-mapping (for saving memory resources, which is essential for processing large corpora), tuning (in order to try to get qualitatively better results) and the ability to change certain parameters that can either reduce the processing time or increase the quality of the results.

5) The translate script then translates one or more documents placed in a specific directory.

6) Finally, the score script allows you to score Moses translations against human translations of the same text, giving BLEU and NIST scores for either the whole document or for each segment of the document (depending on the settings that you define).

7) If you want to transfer your trained corpora to someone else or to another Moses installation (even one in the same computer), the transfer-training-to-another-location script helps you to do that.

Users are expected to open the scripts and to change their parameters according to their needs (the parameters are preceded by comments that explain their purpose and sometimes even their allowable values and limits; many of those comments are citations of the Help of the several packages that Moses uses and of the Moses manual). These parameters are set at the top of each script in a clearly defined section.
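In summary, a typical session follows the order of the steps above. A minimal sketch, assuming the scripts are launched from a terminal with bash (script names are shown without version numbers, following the convention above; the actual file names carry version numbers, e.g. create-1.34, and each script must have its parameters edited before being launched):

cd ~/Moses-for-Mere-Mortals-1.16/scripts   # directory name taken from section F
bash create      # step 2: compile Moses and the packages it depends on (run once)
bash make-test   # step 3 (recommended): carve test segments out of the corpus
bash train       # step 4: build the language model and train the corpus
bash translate   # step 5: translate the files in $mosesdir/translation_input
bash score       # step 6: score Moses translations against reference translations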
D. Installation for new Moses users

D.1. Demonstration corpus

Moses-for-Mere-Mortals comes with a small demonstration corpus so that you can quickly see what the scripts can do. This corpus is automatically installed by the create script (no need for you to do anything) and is used by the other scripts, which are already configured to use it (you will have to open the translate script and change its $logfile parameter, however, as described in section L).

The corpus was extracted from the DGT-TM Acquis available from the Commission's Joint Research Centre website (please note that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic). It is small (200 000 segments in the Portuguese and English languages) and the results of its processing cannot be seen as representative of the quality Moses can achieve (especially if you consider that IRSTLM and RandLM are intended to process corpora with several to many millions of segments). However, even a small corpus like this will reveal facts about Moses (like the relative length of time needed for each of its steps), and it is therefore highly recommended that you start using the scripts with their default settings. Later on, you can change those settings so that the actual work you are interested in gets done.

If you don't change the default settings of the create and the train scripts, Moses will train this Portuguese-English corpus, which involves 300 000 segments for language model building and 200 000 segments for corpus training. This corpus had a BLEU score of 0.7115 and a NIST score of 11.3007 with a 5-gram language model. Even though small, it took some time to train on a machine with an Intel i7 720-QM processor and 8 GB of DDR3 RAM (2h 21m 05s without tuning and 7h 14m 00s with tuning limited to a maximum of 10 iterations). The training of a corpus with 6.6 million segments on this same machine took 2 days 22h 59m (without tuning).

D.2. First steps with the scripts

For a quick introduction to Moses-for-Mere-Mortals, please consult the QUICK-START-GUIDE.doc.

E. Using your own corpora

You should follow the same steps as described above for new users, but, before launching each of the scripts, you should change the values of the parameters you are interested in (as you have already done in section D before using the translate script). To ease your task, the description of each of the scripts in this Help/Tutorial file starts with a section entitled "Vital parameters", which lists the strict minimum of parameters that you should change in case you want to train a corpus different from the demonstration corpus. In fact, in order to train your own corpora, you have to define your own settings and to choose your own corpora files and the languages you are interested in. You might also want to change the parameters of Moses or of the packages it uses.
In order to do that, before launching the scripts, open them and set the variables defined between

###################################################################################
# The values of the variables that follow should be filled according to your needs: #
###################################################################################

and

###################################################################################
# End of parameters that you should fill                                          #
###################################################################################

Each parameter is preceded by a comment that describes its function and, in some cases, even states its allowable and default values. These comments often consist of extracts of the Help files, READMEs or manuals of the several packages used. Please refer to the sections that follow, which describe each of the scripts and some important information in more detail.

F. create script

Vital parameters: mosesdir, mosesnumprocessors

This is a shell script that creates (compiles) Moses and the packages upon which it depends.

1. Go to the Moses-for-Mere-Mortals-1.16/scripts directory and open the create script. At the top of the script, you can change several variables that allow you to better adapt it to your own requirements:

$mosesdir: this is the base directory where Moses will be installed (default value: $HOME/moses-irstlm-randlm). You can change both its name (in this case, "moses-irstlm-randlm") and its location (in this case, "$HOME"). This variable is defined in all the scripts and its value MUST be the same in all of them if you want them to be able to work together (and you do want that!).

$mosesnumprocessors: the number of processors of your computer that you want Moses to use (Moses will be compiled to make better use of them). The default value is 1, but nowadays you can easily find computers with 2 and 4 processors.

2. Save your changes. Do not change any other parameter for the time being (later on, after you have run all these scripts, you can start your own personal experiments).

This script also creates some important directories:

$mosesdir/corpora_for_training: this is the directory where the corpora and the other input files for training are located; there you'll already find several files that can be used to run the training demo; you should place here all the files needed for training the corpora you are interested in.

$mosesdir/corpora_trained: this is the directory where the files created during corpus training are saved; it is advisable not to touch this directory, since you can destroy the training of one or several corpora.

$mosesdir/logs: this is the directory where the training summary files that describe how each training was done are saved. These files contain the values of several variables that will be used by the translate script (which will extract them for you). They are very important because they are the only way to indicate to the translate script the trained corpus you want to use for translation.

$mosesdir/tools: this is the directory where both Moses and all the other tools (giza-pp, irstlm and so on) will be installed; this directory does not change during training and you should not change it.

G. Understanding the directory structure of $mosesdir

The directory structure sketched below results from the execution of all the scripts. These directories are created at the time they are needed.
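A plain-text rendering of that structure, reconstructed from the directories named throughout this tutorial (your installation will only contain the directories created by the scripts you have already run):

$mosesdir/
    corpora_for_training/        (input corpora and other files used for training)
    corpora_trained/             (files created by the train script; do not touch)
    logs/                        (training summary files used by the translate script)
    tools/                       (Moses and all the other compiled tools)
    translation_input/           (documents to be translated)
    translation_output/          ("normal" Moses translations)
    translation_files_for_tmx/   (translations prepared for TMX translation memories)
    translation_reference/       (reference (human) translations for scoring)
    translation_scoring/         (BLEU/NIST scoring reports)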
After installation, for example, not all of them will exist yet.

1. Once you have installed Moses with create, you should put the corpus files that you want to train in the $mosesdir/corpora_for_training directory. You should also place here the files used for creating the language model (if different), for training recasing (if different), for tuning (if any), and for testing the results of training (if any).

2. If you then use the train script for training that corpus, a directory $mosesdir/corpora_trained will be created to store the trained files.

NOTE: $mosesdir/corpora_trained is a vital directory and you are strongly urged not to change any of its contents, because it has a complex structure that mixes files from several trainings in order to allow the reuse of work done in previous trainings. If you change its contents, you risk destroying not just one, but several trainings. You can, however, use a specific training already done by referring to its logfile (see below) in the $mosesdir/logs directory. It is also possible to isolate a training from all others (please refer to section K.6 in order to learn how to do that).

3. At the end of the training, a training summary file ("logfile") will be created in the $mosesdir/logs directory. This file is very important because its name will be used in the translate script to indicate the trained corpus you want to use for translation.

4. Once a corpus is trained, you can start using it to get actual translations with the translate script. Place the documents to be translated (you can translate one or many documents at the same time) in the $mosesdir/translation_input directory (created by the train script) and then launch the translate script. Before the translation starts, you should also make sure that the files placed in the translation_input directory are indeed adequate for the trained corpus that this script uses (for instance, the languages used for training should match those of the documents to be translated; otherwise, you will waste time translating files with a trained corpus that is not adequate for them).

5. Translation can have 2 types of output: a) a "normal" Moses translation, if you set the $translate_for_tmx parameter to a value different from 1 (default: 0); or b) a Moses translation especially suited for making translation memories, if you set the $translate_for_tmx parameter to 1. The "normal" translation will be located in the $mosesdir/translation_output directory. The translation intended to build TMX translation memories will appear, together with the corresponding modified input file, in the $mosesdir/translation_files_for_tmx directory. In both cases, the translation will have the name of the source document plus an ending that corresponds to the target language and a final ".moses" suffix. This avoids confusion between source document and translation and between Moses translation and reference translation.

6. A further script (score) enables you to place a reference (that is, human) translation in the translation_reference directory and get the BLEU and NIST scores of the corresponding Moses translations in the translation_scoring directory. Again, you have 2 choices: a) get a score for the whole document, if the parameter $score_line_by_line is different from 1; or b) get a score line by line, with the segments ordered by ascending BLEU score, if the parameter $score_line_by_line is equal to 1.

7. The contents of the $mosesdir/tools directory should not be changed, since it includes all the files needed for Moses to work.
H. Names of the files and of the languages

The names of the files and of the languages, which are used to create some directory names, should not include spaces or symbol characters, like asterisks, backslashes or question marks. Try to stick to letters, numbers, and the dash, dot and underscore if you want to avoid surprises. Avoid using a dash as the first character of a file name, because some Linux commands will treat it as a switch. If a file name starts with a dot, it will become a hidden file.

I. Files needed for the training and what you need to know before training your own corpus

I.1. Where to put the files to be processed

All the files that are going to be mentioned should be put in $mosesdir/corpora_for_training (as described above, mosesdir is the base Moses system directory, whose default value is $HOME/moses-irstlm-randlm).

I.2. Need for strictly aligned corpora files

Be very sure that the corpus files you use are strictly aligned. Otherwise, you risk getting quite puzzling errors. At the very least, check the number of lines of the source language file and of the target language file, which should be strictly the same. In order to do this, type the following commands in your terminal:

wc -l {name_of_source_language_file}
wc -l {name_of_target_language_file}

You may also want to check that the last line of the source language file does correspond to the last line of the target language file. You can even do more detailed checks.

I.3. Do not use spaces in file names

The names of the files used in these scripts should not include spaces.

I.4. Corpus files

At the very least, you need 2 parallel corpus files for training (one source language file and one target language file, where the line number of a segment in the source language file is exactly equal to the line number of its translation in the target language file; that is, those 2 files should be strictly aligned). The way the script works requires that both files share a common $corpusbasename (a variable that you set in the train script), that is, a "prefix" that identifies the corpus in question (e.g., "corpus1000"). To each file, you should also add a "suffix" that identifies the language in question (e.g., "en", "fr" or "pt"). The names of the corpus files should be exclusively composed of these "prefixes" and "suffixes".

If you want to create a corpus1000 corpus for the pt and en languages (pt being the source language and en the target language in all the examples that follow), you therefore need the following files:

corpus1000.pt
corpus1000.en

I.4.1. Using TMX files to create Moses corpus files

If you do not have separate and strictly aligned files (one in the source language and another in the target language), but you do have *.TMX translation memory files, you can use Extract_TMX_Corpus (included in the windows-add-ins-1.043.tar.gz package) to transform such *.TMX files into files that Moses can use (namely, it converts them to UTF-8 format). This program also does some cleaning of the corpus files that will prevent Moses errors during the training phase.

I.5. Test files

The make-test script allows you to use these 2 files to obtain a representative sample of their segments; it creates 2 new corpus files from which the lines corresponding to the segments extracted for testing have been erased. Those segments can still occur in the new corpus files, since they may occur in more than one line.
The new corpus files created by this script (whose names are given at the end of its execution) should then be the files used for setting $corpusbasename. If you had run it with the example files indicated above, it would create 4 new files:

corpus1000.for_train.pt (a file used for corpus training)
corpus1000.for_train.en (a file used for corpus training)
corpus1000.for_test.pt (a file used for testing the trained corpus)
corpus1000.for_test.en (a file used for testing the trained corpus)

I.6. Language model file

You also need a single file for building the language model (this file must contain segments only in the target language). If you want, you can use the target language file used for the training, in which case you do not need a new file for this purpose. However, you might want to use a (possibly much bigger) file for building the language model, since this is a very important building block that costs comparatively little time to process and affects the fluency of the translation. If you want to use a specific file for this purpose, you need to use a file, created by you, whose "prefix" is arbitrary (e.g., "145000"), but whose "suffix" is obligatorily the abbreviation of the target language already used in the corpus files (see point I.4). Therefore, continuing our example, if you want to use a file with 145 000 segments for building the language model, you could name it:

145000.en

I.7. Tuning files

The tuning of a corpus should in principle lead to better translation results. For that, 2 conditions must be met: 1) the tuning parameter must be set to 1 in the train script (this is its default value); 2) you need two files: one for the source language and one for the target language. Again, you can use the files described in point I.4, in which case you do not need any new files. However, given that tuning is a very, very long process (perhaps the longest of all), you might want to use a set of files with a ***much smaller size*** than those described in point I.4. On the other hand, files that are too small generally lead to a big number of tuning iterations and to less improvement, while files that are too big take more time to process. You have to strike a balance between these two extremes (1 000 to 2 000 segments is generally enough). Such files should have a common arbitrary "prefix" and, as "suffix", the abbreviations of the languages already described in point I.4. Therefore, continuing our example, you could use 2 much smaller files named:

100tuning.pt
100tuning.en

I.7.1. maxruns parameter

Tuning is a phase that can easily take more time than all the others put together. Furthermore, you can't easily estimate its duration beforehand, since the number of its runs is highly variable from corpus to corpus. Therefore, a modified mert-moses-new.pl script (mert-moses-new-modif.pl) introduces some changes to the original Moses script so that the user can control the number of tuning runs through the $maxruns parameter of the train script. A value of -1 means that an unlimited number of runs is allowed. Any positive number >= 2 means that tuning should be stopped after that run number. The default value is 10. Good values can be between 5 and 10 (choose the lower end of this spectrum if you want to speed things up, but you will risk a worse tuning).
I.8. Evaluation (testing) files

If you want to evaluate the training of your corpus, you'll need 3 files: one for the source language, another for the human translation in the target language, and the Moses translation file (in order to create this latter file during training, the runtrainingtest parameter of the train script should be set to 1, which is its default value). You can use the files described in point I.4 for that purpose, in which case you need no new files. You can also use files especially created for the purpose, or you can use the make-test script to create them (see note below). Such files should have a common arbitrary "prefix" and, as "suffix", the abbreviations of the languages already described in point I.4. Therefore, continuing our example, you could use 2 much smaller files named:

corpus1000.for_test.pt
corpus1000.for_test.en

NOTE: Since the choice of the segments used for the evaluation can considerably affect its results, a script (make-test) was made that divides the corpus files into X sectors and randomly chooses Y segments in each of those sectors (based on randomly chosen segment line numbers). The resulting test files are probably more representative of the several contexts and vocabularies of your corpus than a set of consecutive segments would be.

I.9. Recaser files

You also need a (possibly much bigger) file in the target language for recaser training. This is a comparatively quick process, so you can afford to invest in it. The recaser training file has an arbitrary "prefix" (e.g., "10000recaser"), but a "suffix" that is obligatorily the abbreviation of the target language already used in the corpus files (see point I.4). Therefore, continuing our example, if you want to use a file with 10 000 segments for recaser training, you could name it:

10000recaser.en

J. make-test script

Vital parameters: lang1, lang2, mosesdir, basefilename

This script assumes that a Moses installation has already been done with the create script. It uses two aligned input files, one in the source and another in the target language, whose $basefilename should be equal and which differ by a "suffix" that indicates the abbreviations of their respective languages, as indicated in point I.4 (e.g., 200000.pt and 200000.en, the $basefilename being, in this case, "200000"). The script produces 2 new sets of files:

1) 2 files (one in the source language and another in the target language) used for testing the trained corpus; for that, it divides the corpus into X $sectors (a parameter that you can define) and then randomly selects Y $segments (another parameter you can define) in each sector. All the selected segments will have different line numbers (no line can be chosen more than once). This procedure offers a better guarantee that the segments used for testing the trained corpus are representative of all the styles and contexts of the corpus than if you arbitrarily chose the same number of consecutive segments somewhere in the input files. These files have a basefilename equal to $basefilename.for_test and will later be used by the train script (e.g., 200000.for_test.pt and 200000.for_test.en).

2) 2 files (one in the source language and another in the target language) that are equal to the starting files, except that the segments used for creating the 2 test files have been erased from them. These files have a basefilename equal to $basefilename.for_train and will later be used by the train script (e.g., 200000.for_train.pt and 200000.for_train.en).
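As an illustration, the vital parameters at the top of the make-test script could be filled like this for the example just given (variable names as listed above; check the comments in the script itself for the remaining parameters, such as $sectors and $segments):

mosesdir=$HOME/moses-irstlm-randlm   # must be the same in all the scripts
lang1=pt                             # abbreviation of the source language
lang2=en                             # abbreviation of the target language
basefilename=200000                  # common prefix of 200000.pt and 200000.en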
NOTE: if you want to compare the relative results of a change in training parameters, you should execute the training test before and after the change in parameters with the same set of test files (run make-test just once and use the test files it creates to test both trainings).

J.1. The ominous control characters

The Moses clean-corpus-n.perl script does not erase control characters, which has led, in our experiments, to a crash during the memory-mapping of a reordering table (with a control-K character), in one case several days after the training had started. Problems with this character have also been described in the Moses mailing list. Therefore, the make-test script (as well as the train script) substitutes a space for all instances of the control-G, control-H, control-L, control-M and control-K characters in all the files.
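The substitution just described amounts to something like the following (a sketch using sed, not a transcription of the scripts; \x07, \x08, \x0B, \x0C and \x0D are the hexadecimal codes of control-G, control-H, control-K, control-L and control-M):

sed -e 's/\x07/ /g' -e 's/\x08/ /g' -e 's/\x0B/ /g' -e 's/\x0C/ /g' -e 's/\x0D/ /g' inputfilename > outputfilename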
K. train script

Vital parameters: mosesdir, lang1, lang2, corpusbasename, lmbasename, tuningbasename, testbasename, recaserbasename, reuse, paralleltraining, memmapping, tuning, runtrainingtest, lngmdl, Gram, mgizanumprocessors, maxruns

This script assumes that a Moses installation has already been done with the create script and can optionally use the files created by the make-test script. Even though it might not cover all the features that you might like to play with (namely those necessary for factored training, which could involve the use of a lemmatizer and/or a part-of-speech tagger), this script does allow you to train a corpus, to memory-map your training files (so that Moses uses less RAM), to do tuning, to do a training test (also with memory-mapping), and to get the NIST and BLEU scores of that test. It also makes available all the parameters used by IRSTLM, RandLM, mkcls, GIZA and MGIZA, as well as selected parameters used by the Moses decoder and the Moses scripts. These parameters are all set by default to the values they receive when you use the Moses scripts. If you are new to Moses, do not change them.

If your computer has more than 1 processor, you should also change the mgizanumprocessors parameter (set by default to 1) to the actual number of processors of your computer that you want to use with MGIZA. Just open the train script, search for the name of this parameter, change it and save your changes. This can lead to a much higher training speed.

At the very least, this script will build a language model and train a corpus (unless they already exist for the same parameters that you chose, in which case it will not rebuild them). The other steps are optional: memory-mapping, tuning and testing. In order to set the steps that will be executed, you have to change the parameters at the top of the script. The role of those parameters is also indicated there (in comments that precede each one of them).

The directory structure created by these scripts ensures 3 things: 1) no training will interfere with the files of a previous training; 2) a later training will reuse as much as possible the files created in previous trainings; and 3) the training of an inverse corpus (like the training of a pt-en corpus after the training of an en-pt corpus with the same corpus files) is also speeded up, because it reuses some files that were already created in the first of the trainings.

At the end of the training, a logfile (training summary) file will be created in the $mosesdir/logs directory. It includes details about the duration of the several phases of training, values that will be used when you translate files based on this trained corpus, a list of the main input files, a list of all the files created during training, a list of all the parameters used and the score of the trained corpus test (if a test was done). The name of this file is most important because it is used by the translate script to select the trained corpus that you want to use for translation.

K.1. Description of some important parameters

1. reuse: if set to 1, any step already done in a previous training will not be redone;
2. memmapping: if set to 1, memory-mapping of the corpus will be done (this will reduce your RAM requirements);
3. tuning: if set to 1, tuning will be done; this should in principle lead to better results, but that does not always happen in practice; tuning can easily take more time than all the other steps combined;
4. runtrainingtest: if set to 1, a test of the training will be done, with scoring;
5. lngmdl: package chosen to build the language model (in these scripts, you have 2 choices: 1 = IRSTLM; 5 = RandLM);
6. Gram: n-gram order; can influence the results significantly; the higher, the better (maximal value: 9), but the execution time will suffer; normally between 3 and 9 (default: 5);
7. maxruns: maximal number of tuning iterations (default: 10; -1 means no limit).
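To make this concrete, here is an illustrative filling of the vital parameters at the top of the train script for the running pt→en example of section I (variable names as listed above; not all vital parameters are shown, e.g. paralleltraining, and the defaults shipped with the scripts are configured for the demo corpus, so check the comments in the script itself):

mosesdir=$HOME/moses-irstlm-randlm
lang1=pt
lang2=en
corpusbasename=corpus1000.for_train   # corpus files created by make-test (point I.5)
lmbasename=145000                     # language model file (point I.6)
tuningbasename=100tuning              # tuning files (point I.7)
testbasename=corpus1000.for_test      # evaluation files (point I.8)
recaserbasename=10000recaser          # recaser training file (point I.9)
reuse=1               # reuse steps already done in previous trainings
memmapping=1          # memory-map the corpus to reduce RAM requirements
tuning=1              # do tuning (can take longer than all other steps combined)
runtrainingtest=1     # test the training and score it
lngmdl=1              # 1 = IRSTLM; 5 = RandLM
Gram=5                # n-gram order (normally between 3 and 9)
mgizanumprocessors=2  # number of processors to use with MGIZA
maxruns=10            # maximal number of tuning runs; -1 = no limit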
K.2. Greatly increasing the training speed

The mgizanumprocessors parameter determines the number of processors of your computer that will be used during phase 2 (MGIZA) of training. Since this phase is the longest one of training, you should set this parameter to the number of processors (actual or virtual) of your computer that you want to use.

K.3. Controlling tuning

Tuning is a phase that can easily take more time than all the others put together. Furthermore, you can't easily estimate its duration beforehand, since the number of its runs is highly variable from corpus to corpus. Therefore, a modified mert-moses-new.pl script (mert-moses-new-modif.pl) introduces some changes to the original Moses script so that the user can control the number of tuning runs (iterations) through the $maxruns parameter of the train script. A value of -1 means that an unlimited number of runs is allowed. Any positive number >= 2 means that tuning should be stopped after that run number. The default value is 10. Good values can be between 5 and 10 (choose the lower end of this spectrum if you want to speed things up, but you will risk a worse tuning).

K.4. Avoiding the destruction of a previous training by a subsequent training and reusing parts of a training already done in previous trainings

In order to guarantee that the training of a corpus doesn't destroy files from a previously trained corpus, and in order to ensure that each training reuses as much as possible the files already created in previous trainings, a complex (and confusing) directory structure was created. This, however, implies that the files of all the previous trainings are dispersed in the $mosesdir/corpora_trained directory. As already stated, this is a directory that you shouldn't change, since, by doing so, you can destroy not just one but several (or all) trainings. However, a $mosesdir/logs directory exists where you can find a summary of every training you made, describing, among other things, the parameters it used and the files it created. In order to use a trained corpus for translation, you just have to copy the name of its log file into the $logfile parameter of the translate script. Nothing else is necessary for translation, and indeed you can ignore where the trained corpus files are for all practical purposes.

K.5. Training of an inverted corpus

A particular case of reusing previously made files is the training of the inverse of a corpus already trained (for example, en->pt when the pt->en trained corpus already exists). In that case, and if all the training parameters stay equal in both instances, this script detects the situation and reuses the tokenized, cleaned and lowercased files of the previous training, as well as phases 1 and 2 (MGIZA, one of the longest steps of training) of the previous training. This can lead to savings of up to 25% of the execution time.

K.6. Isolating a training from all the other ones

You might feel tempted to isolate a training from all the other ones, perhaps because it is very important for you. However, so that a given training does not erase any part of a previous one, and so that the scripts can reuse as much as possible the steps already done in previous trainings, the files of the several trainings are interspersed in the $mosesdir/corpora_trained directory. There is, nevertheless, an easy way to isolate a given training that you are going to make from all the preceding ones, if you insist on that (see the sketch below):

1) Rename the $mosesdir/corpora_trained directory to $mosesdir/corpora_trained-bak;
2) Rename the $mosesdir/logs directory to $mosesdir/logs-bak;
3) Do the training of the corpus you want to isolate from all others (this will create new $mosesdir/corpora_trained and $mosesdir/logs directories that will contain just the data of that training);
4) Move the newly created $mosesdir/corpora_trained and $mosesdir/logs directories to a safe place (outside $mosesdir);
5) Rename the $mosesdir/corpora_trained-bak directory back to $mosesdir/corpora_trained;
6) Rename the $mosesdir/logs-bak directory back to $mosesdir/logs.

In order to be able to reuse the training that you isolated, simply repeat steps 1) and 2) and move the corpora_trained and logs directories that you isolated back to $mosesdir.
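The isolation procedure above, expressed as shell commands (a sketch assuming the default $mosesdir; "/safe/place" is a placeholder for any directory outside $mosesdir):

mosesdir=$HOME/moses-irstlm-randlm
mv "$mosesdir/corpora_trained" "$mosesdir/corpora_trained-bak"   # step 1
mv "$mosesdir/logs" "$mosesdir/logs-bak"                         # step 2
# step 3: run the train script for the corpus you want to isolate
mv "$mosesdir/corpora_trained" /safe/place/                      # step 4
mv "$mosesdir/logs" /safe/place/
mv "$mosesdir/corpora_trained-bak" "$mosesdir/corpora_trained"   # step 5
mv "$mosesdir/logs-bak" "$mosesdir/logs"                         # step 6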
K.7. NIST scorer errors

If you get an error message from the NIST scorer at the end of the training, please see section M.2 and correct the situation as described there.

L. translate script

Vital parameters: mosesdir, logfile, translate_for_tmx (if this latter parameter is set to 1, then also look at minseglen, othercleanings, improvesegmentation and removeduplicates)

This script assumes that Moses with IRSTLM and RandLM has been compiled with create and that a trained model already exists for the language pair you want to translate (which requires you to have already run the train script). This script translates, using the trained model that you chose through its $logdir and $logfile parameters, the file or files that you yourself have put in $mosesdir/translation_input. In fact, it will translate in one step all the files that are there. It is very important that you fill in the $logdir and $logfile parameters correctly, since they are the only way of telling the script which trained corpus you want to use for translation.

By default, $logdir has the value "$HOME/moses-irstlm-randlm/logs" and, if you haven't changed this parameter, you just have to go to this directory, identify the file that corresponds to the corpus you want to use and copy its name (omit the path!) into $logfile.

Translation can have 2 types of output:

1) A "normal" Moses translation, if the $translate_for_tmx parameter is set to 0 (default value). The "normal" translation will be located in the $mosesdir/translation_output directory.

or

2) A Moses translation especially suited for use with a translation memory tool, if you set the $translate_for_tmx parameter to 1. This type of translation will be located, together with the modified input file, in the $mosesdir/translation_files_for_tmx directory. It is especially interesting for those who use machine translation together with translation memories (notably those who only use MT segments when there are no translation memory segments above a certain "match" percentage).

By default, translate_for_tmx=0, which means the script will do a "normal" translation. This type of translation fully respects the formatting of the original text, and therefore keeps long paragraphs, which, according to some sources, do not lead to the best results. That, however, didn't stop us from getting very respectable BLEU and NIST scores, as you can see for yourself if you try the demo or your own corpora.

If, and only if, you set $translate_for_tmx to 1, the values assigned to the following parameters will be taken into account:

a) $minseglen: if set to a value different from -1 and greater than 0, all segments with a length of less than minseglen will be erased; if set to -1, no segments will be erased, whatever their length; default value: -1;
b) $othercleanings: if set to 1, tabulation characters will be replaced by newlines, and lines composed only of digits, spaces and parentheses will be removed; default value: 1;
c) $improvesegmentation: if set to 1, replaces any of the characters [:;.!?] followed by a space by that character followed by a newline, deletes empty lines and substitutes double spaces by a single space; default value: 1;
d) $removeduplicates: if set to 1, removes duplicated segments; default value: 1.

If you want to score the Moses translation (for that, you need to have a reference – human – translation) and if $translate_for_tmx is set to 1, then you MUST set $minseglen = -1, $othercleanings = 0, $improvesegmentation = 0 and $removeduplicates = 0 (so that the source document and the reference translation have the same number of segments).

The names of the output files will be equal to those placed in $mosesdir/translation_input, except for a suffix that is appended to them with the abbreviation of the target language. Therefore, if you input the file 100.pt, you will get a translated 100.pt.en.moses file (if "en" is the abbreviation of the target language). Furthermore, both the source document and the Moses translation are also changed so that:
1) Some of the named entities defined in the TMX specification are duly created (e.g., "<" → "&lt;");
2) " / " → "/".

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The source file name MUST observe the following convention:
Source file: <textbasename>.<abbreviation of source language> (ex: 100.en)
!!! The <textbasename> should NOT include the character "." !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
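An illustrative filling of the vital parameters at the top of the translate script (variable names as above; the logfile name shown is a placeholder — copy the real name from your $mosesdir/logs directory):

mosesdir=$HOME/moses-irstlm-randlm
logfile=name-of-your-training-summary-file   # placeholder; just the name, no path
translate_for_tmx=0    # 0 = "normal" translation; 1 = translation for TMX building
# The following are only taken into account when translate_for_tmx=1:
minseglen=-1           # -1 = keep all segments, whatever their length
othercleanings=1
improvesegmentation=1
removeduplicates=1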
L.1. Speed

Especially with very large trained corpora (several million segments), translation can be slow. According to the Moses manual, to get faster performance than the default Moses settings at roughly the same quality, use the parameters $searchalgorithm=1 (default: 0), $cubepruningpoplimit=2000 (default: 1000) and $stack=2000 (default: 100). You can also try to reduce the latter 2 parameters to values of 500 or less (say, 100) and experiment to determine if they significantly change the translation quality.
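In terms of the translate script's parameters, the setting quoted from the Moses manual amounts to the following (values as stated above; treat them as a starting point for your own experiments):

searchalgorithm=1         # default: 0; 1 selects cube pruning
cubepruningpoplimit=2000  # default: 1000; try reducing to 500 or less
stack=2000                # default: 100; try reducing to 500 or less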
M. score script

Vital parameters: mosesdir, scoreTMXdocuments, score_line_by_line (if this latter parameter is set to 1, then you should also set the remove_equal parameter), tokenize and lowercase

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
!!! THIS SCRIPT SHOULD NOT BE USED WITH DOCUMENTS TRANSLATED WITH THE translate script WITH ITS $translate_for_tmx PARAMETER SET TO 1 ***UNLESS*** the $othercleanings, $improvesegmentation and $removeduplicates parameters of that script were all set to 0 and $minseglen was set to -1 (this processing changes the order of the segments and can also make the source document have a number of segments that is different from the number of segments of the reference translation, namely because it can delete some segments and/or add some new ones) !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In some cases, you might want to score the translations you get with the translate script against reference (human) translations that may be available. This script scores all the Moses translation files placed in the $mosesdir/translation_output directory (if the $scoreTMXdocuments parameter is set to 0) or in the $mosesdir/translation_files_for_tmx directory (if the $scoreTMXdocuments parameter is set to 1). For that to happen correctly, it is very important that you respect the conventions that follow. The source document that was translated should also observe the naming criteria defined in the translate script. If you have done a “normal” translation, you should set the $scoreTMXdocuments parameter to 0 (its default value). The $scoreTMXdocuments parameter should be set to 1 if you have previously translated the text with $translate_for_tmx set to 1 in the translate script, indicating that you have done a translation especially suited for making translation memories.

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The source file and the reference translation file names MUST observe the following conventions:
Source file: <basename>.<abbreviation of source language> (ex: 100.en)
Reference translation file: <basename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
!!! The <basename> should NOT include the character "." !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be useful for determining whether some parameter changes lead to improvements, or even for indirectly assessing the satisfaction of the human users of translations made with your trained corpus (for that, just treat their translation, once it is done, as the reference translation and score the Moses translation against it). For each score that you intend to get, you need to have a source document, a reference (human) translation of that source document and a Moses translation of that source document. As seen in section L, you could have done a “normal” translation (if $translate_for_tmx = 0 in the translate script) or a translation especially suited for TMX translation memories (if $translate_for_tmx = 1). The source document should be placed in the $mosesdir/translation_input directory, for normal translations, or in $mosesdir/translation_files_for_tmx, for translations suited for TMX. The reference translations should all be put in the $mosesdir/translation_reference directory. The scripts will put the Moses translation in the right location ($mosesdir/translation_output, for normal translations, and $mosesdir/translation_files_for_tmx, for translations suited for TMX).
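A minimal sketch of the file placement for a “normal” (non-TMX) scoring run, assuming the default $mosesdir, English as source and Portuguese as target (the file names are just the examples used above):

# place the source document and the human reference translation
cp 100.en $HOME/moses-irstlm-randlm/translation_input/
cp 100.pt.ref $HOME/moses-irstlm-randlm/translation_reference/
# the translate script itself places its output in $HOME/moses-irstlm-randlm/translation_output/
# (100.en.pt.moses, following the naming pattern described in section L)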
M.1. Two types of scores

1) This script produces a NIST and a BLEU score for a whole Moses translation if you set its $score_line_by_line parameter to a value different from 1 (default: 0).

2) In some cases, you might want to detect types of segments that are systematically very well or very badly translated by Moses. In such cases, you want a BLEU or a NIST score for each segment translated by Moses. For that, you have to set the $score_line_by_line parameter to 1. In that case, a more detailed output file is produced, with 6 fields per line: 1) BLEU score; 2) NIST score; 3) number of the segment in the source document; 4) source language segment; 5) reference (human) translation of that segment; 6) Moses translation of that segment. Furthermore, when $score_line_by_line is set to 1, this script sorts the segments in ascending order of BLEU score.

In our opinion, the $tokenize and $lowercase parameters lead to results that better reflect human judgment. Their default values are set to 1.

When $score_line_by_line is set to 1 and the $remove_equal parameter is set to 1, repeated segments will not appear in the scoring report. You can, however, easily determine how many there are by subtracting the total number of segments contained in the scoring report from the total number of segments of the scored text.

M.2. Problems with the NIST scorers

The more recent scorers freely provided by NIST (mteval-v13a-20091001.tar.gz and mteval-v12.pl) stopped with an error on some of our texts (for instance, when “ & ” or the Unicode U+001E character appeared in the texts to be scored). We have noticed that the older mteval-v11b.pl scorer (the one used by this script) didn't cause us so many problems. But even this latter scorer still has occasional problems with some characters (namely the Unicode U+001E character). The Unicode U+001E character is the “information separator two”, and our converter of Word documents to UTF-8 texts has a tendency to convert non-breaking hyphens into this character. We solve this problem with the following code (which replaces all the occurrences of the U+001E character with a hyphen):

sed -e 's/\x1E/\-/g' inputfilename > outputfilename

If you are faced with the same problem (training OK, but scorer error), you first have to discover which character in your texts is causing the error; you can then use code like the above to substitute the character in question by another, still suitable, character. We aren't sure about the cause of these problems. They might be due to our converter to UTF-8 or to the text files themselves.
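To locate a problematic character before substituting it, something along these lines can help. This is a sketch using GNU grep (assuming it was compiled with PCRE support, as is usual on Linux distributions); U+001E is just the example discussed above – replace it with whatever character you suspect:

# count the lines that contain the U+001E control character
grep -cP '\x1E' inputfilename
# show those lines, with their line numbers
grep -nP '\x1E' inputfilename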
N. Utilities

N.1. transfer-training-to-another-location script

Vital parameters: mosesdirmine, newusername, mosesdirotheruser

This script should be used on the computer where the training to be transferred was made (not on the target computer). You might want to exchange trained corpora with other persons, or to transfer a training already made to a different location. In several of the training files, though, your own $mosesdir directory is written literally (e.g., “/home/mary/moses-irstlm-randlm”). This script creates, in the $mosesdir that contains the trainings that you want to transfer (the $mosesdirmine parameter), a new subdirectory, $mosesdirmine/corpora_trained_for_another_location/$newusername, and places there a copy of the $mosesdirmine/corpora_trained and $mosesdirmine/logs directories that you want to transfer. In these 2 latter subdirectories, the string that referred literally to the initial location of the trained corpora is replaced by the correct string that will enable them to be used by the other user and/or in the other location that you specify in $newusername and $mosesdirotheruser (since you can have several Moses installations on the same computer). This script will copy all the trainings contained in the $mosesdirmine/corpora_trained directory. Your original trainings are not affected by this operation (they are not erased).

You can then copy these 2 subdirectories (prepared for being transferred) to the new location or to the new computer where you want them to be used (you have to copy them there manually, for instance by copying them to a USB key or to an external hard disk and then copying them to the $mosesdir directory where you want them to be used, which corresponds to the $mosesdirotheruser directory). After you have transferred the corpora to their intended location, you can safely erase the $mosesdirmine/corpora_trained_for_another_location directory.

The $mosesdirmine parameter is the value of your $mosesdir (by default, $HOME/moses-irstlm-randlm) whose trainings you want to transfer. The $mosesdirotheruser parameter is the value of the $mosesdir to which you want to transfer your trainings. The $newusername parameter is the Linux login name of the user to whom you want to transfer your trainings (if you keep your own login, that means that you are transferring the trainings to another Moses installation on your own computer).

Since reading and writing to disk can lead to errors, we strongly suggest that you make a backup of the $mosesdirotheruser directory before transferring the $mosesdirmine/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirmine/corpora_trained_for_another_location/$newusername/logs subdirectories to it, and especially before erasing or overwriting anything. Please note that you should copy just the corpora_trained and logs subdirectories to the $mosesdirotheruser directory (not the $mosesdirmine/corpora_trained_for_another_location/$newusername directory itself). If the $mosesdirotheruser directory that is going to receive the new corpora_trained and logs subdirectories already has some trainings, when you merge them there you will be alerted that the subdirectories with the same name will be overwritten by the new ones. Even though we again stress that it is much safer to make a backup of the contents of $mosesdirotheruser before attempting this operation, and highly recommend that you do that (any error might destroy the previous trainings already present there!), you should accept that “overwriting”. In fact, given the structure of those directories, the files already present there should not disappear. But this is something that you do at your own risk.
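A minimal sketch of the manual copy step, using the example values of this script's parameters given in the Appendix ($newusername=john, $mosesdirotheruser=/media/1.5T/moses-irstlm-randlm); adapt the paths to your own case:

# copy only the two prepared subdirectories into the target $mosesdir;
# existing subdirectories are merged and same-named files are overwritten
cp -r $HOME/moses-irstlm-randlm/corpora_trained_for_another_location/john/corpora_trained /media/1.5T/moses-irstlm-randlm/
cp -r $HOME/moses-irstlm-randlm/corpora_trained_for_another_location/john/logs /media/1.5T/moses-irstlm-randlm/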
O. Improving quality and speed

These scripts, especially the train and the translate scripts, allow you to control more than 80 parameters that influence either quality or speed and that relate both to Moses and to the packages it uses. Please refer to the comments that precede the parameters, especially those of the train and translate scripts, in order to learn more about them. They were often extracted from the manuals of Moses and of the packages upon which Moses depends. These parameters are organized in what seems a logical way, and special care was taken to indicate the simultaneous changes that should be made to several parameters that work together. According to the Moses manual, you should first try the (less numerous) parameters of the translate script. In case you want to reset the parameters that you changed to their default values, you can consult the Appendix of the present document.

P. Deleting trained corpora

Please remember that Moses for Mere Mortals reuses as much as possible the steps done in previous trainings. Therefore, if, for instance, you always use the same target language and have always trained your corpora with the same language model, Moses for Mere Mortals in fact uses a ***single*** $mosesdir/corpora_trained/lm directory for all your trainings. If you erase that directory, ***all*** your trained corpora will cease to work. Please consider also that, if in the future you need some of the steps that you now want to erase, they will have to be redone. In a general way, be very, very careful with deleting, especially if you do not know the directory structure and the files needed for each of the training steps. The fact that Moses for Mere Mortals reuses steps previously done in other corpora trainings is in itself an assurance that you are saving disk space.

P.1. You want to erase all the trainings that you have done

That's really easy. Just delete the $mosesdir/corpora_trained and the $mosesdir/logs directories. The next time you use the train script, it will re-create these 2 directories.
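In concrete terms (a sketch, assuming the default $mosesdir; double-check the path before running, since this is irreversible):

# erase ALL trainings; the train script will re-create these 2 directories
rm -rf $HOME/moses-irstlm-randlm/corpora_trained $HOME/moses-irstlm-randlm/logs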
P.2. You want to erase just some of the trainings that you have done

There are 2 ways to delete corpora. The first one, though more accurate, requires you to have 2 Moses installations (you can have as many Moses installations as you want, each one in a $mosesdir with a different name). The second one is much riskier and isn't as effective, but it does not require you to have more than one Moses installation. In both cases, we strongly recommend that you carefully back up the corpora_trained and logs subdirectories of the $mosesdir that will be changed. As you know, if you change the $mosesdir/corpora_trained or the $mosesdir/logs directories by mistake, you can lose one, several or even all of the trainings you have done.

P.2.1. Scenario 1: more than one Moses installation available

This is by far the least risky and most efficient method of deleting unwanted trainings. Tip: if you do not have more than one Moses installation (that is, more than one $mosesdir), you can easily create a new one by running the create script and setting its $mosesdir parameter to a value different from the $mosesdir value that was used to create your present Moses installation (after its execution, you will have 2 different Moses installations).

1) Prepare the $mosesdir where you want to delete a trained corpus (let's call it $mosesdirstart) to be transferred to another location, by using the transfer-training-to-another-location script.

2) As you know (see section N), this script processes all the trained corpora of $mosesdirstart, that is, the ones you want to delete and also the ones you do not want to delete, and creates 2 new directories: $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs. Delete, in these 2 directories respectively, the subdirectories and the log files that correspond to the corpora that you want to delete (the most voluminous files are in the $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained/model and $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained/lm directories). NOTE: this is a very sensitive step. If, as explained in the beginning of this section, you delete any file that is still used by trained corpora that you want to retain, those trained corpora will cease to work. But since you are acting on a copy of your corpora (made by the transfer-* script), you really can play a bit, and even learn, as long as you do step 5) below before doing step 6).

3) Let's call the $mosesdir that will receive the trained corpora that you do not want to delete $mosesdirfinal. Just to play safe, back up its $mosesdirfinal/corpora_trained and $mosesdirfinal/logs subdirectories.

4) Now you just have to move $mosesdirstart/corpora_trained_for_another_location/$newusername/corpora_trained and $mosesdirstart/corpora_trained_for_another_location/$newusername/logs to, respectively, $mosesdirfinal/corpora_trained and $mosesdirfinal/logs.

5) In order to verify that everything went well, make a small translation with one of the trained corpora that you want to keep in $mosesdirfinal, as well as a translation with one of the corpora that you have now manually transferred there. This is a most important step that you should never skip, unless you want to risk some, and possibly even all, of your work.

6) If no problems were detected in the previous step, delete $mosesdirstart.

P.2.2. Scenario 2: single Moses installation available

A more convoluted, much riskier and less efficient way is also available, but it is only intended for persons who are real experts (in computers, in Moses ***and*** in Moses for Mere Mortals). If you make any mistake and you haven't backed up your files, you risk destroying your work. The log file of the training that you want to suppress (located in the $mosesdir/logs subdirectory) contains, at its very end, a list of the files used in that training. You can erase the files that use more space and that are surely not required for any other training if you erase some, and only ***some***, of the files listed in the log file, located in the following subdirectories of $mosesdir/corpora_trained:

1) lm
2) evaluation
3) memmaps
4) model
5) tuning

The expert side of things lies in the fact that, from such files, you should only erase those that are not being used by a trained corpus that you want to retain. Since there are many possible cases, that question is beyond the scope of the present Tutorial.
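To inspect the list of files used by a training before erasing anything, you can simply look at the end of its log file. A sketch (the log file name below is just the example given in section L; adjust the number of lines as needed):

# show the last lines of a training log, where the list of used files is located
tail -n 60 $HOME/moses-irstlm-randlm/logs/pt-en.C200000.for_train-60-1.LM-300000.MM-1.day-18-01-10-time-14-08-50.txt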
Q. How to contribute

You are actively encouraged to improve Moses for Mere Mortals. Comments, criticisms and further scripts or documentation that make the process of using Moses more user-friendly are gladly welcome. If we accept your work, we will fully acknowledge its author (and only him/her), and we propose that at the very beginning of it you write:

#copyright {year}, {your name}
#licenced according to the {name of the licence} licence

If you propose a significant change to an existing script that does not involve our work, the names of all of the authors will be mentioned on it and the licence will have to be agreed upon. Code that respects the structure of our scripts (and that preferably changes them the way you want) will be reviewed much more quickly and will cause us much less work. Other code, unless it is very well documented and its purpose seems very interesting to us, risks not being reviewed for sheer lack of time, unless it can itself work with the present Moses for Mere Mortals directory structure or creates a complete and different structure that is immediately functional. Please understand that we also have professional time pressures that may stop us from working as much as we would want on Moses for Mere Mortals for some time. Even if we do not agree to include your code in our work, you are of course still free to use Moses for Mere Mortals and to distribute it, according to the terms of the GPL licence. We expect you to test your code, but we also reserve the right to propose changes if it nevertheless doesn't pass the tests that we run and to which our own work is also subjected. We may not always have time to immediately test all your proposals, but you can expect a very quick and brief answer acknowledging the receipt of your code, as well as the expected schedule of its testing (except during holidays or times of particular work pressure). If you want to belong to our team, please send an e-mail to joao.luis.rosas@gmail.com describing yourself and the ideas that you would like to implement. We will answer every such proposal. We can't promise to accept every proposal, but we are actively looking for more developers.

R. Thanks

Special thanks:

Maria José Machado, whose suggestions and research have significantly influenced the score script. She helped in the evaluation of Moses output in general and organized, together with Hilário, a comparative evaluation, made by professional translators, of the qualitative results of Google, Moses and a rule-based MT engine. She suggested a deep restructuring of the present Help/Tutorial file and is a co-author of it.

Hilário Leal Fontes, who made very helpful suggestions about the functionality of several scripts and made comprehensive tests. He is also the author of the nonbreaking_prefix.pt script (for the Portuguese language). He has compiled the corpora that were used to train Moses and to test these scripts, including 2 very large corpora with 6.6 and 12 million segments. He has also revised the Help/Tutorial file.

Tom Hoar, for consolidating previous documentation into the Quick-Start-Guide.doc to help users get up to speed very quickly.

Manuel Tomas Carrasco Benitez, whose Xdossier application was used to create a pack of the Moses-for-Mere-Mortals files.

Gary Daine, who made helpful remarks and who contributed code for Extract_TMX_Corpus.

The authors of the http://www.dlsi.ua.es/~mlf/fosmt-moses.html (Mikel Forcada and Francis Tyers) and http://www.statmt.org/moses_steps.html pages. These pages helped me a lot in the first steps with Moses.

The authors of the documentation of Moses, giza-pp, MGIZA, IRSTLM and RandLM; some of the comments of the present scripts describing the various parameters include extracts of it.

The European Commission's Joint Research Centre and Directorate-General for Translation, for the DGT-TM Acquis – freely available on the JRC website and providing aligned corpora of about 1 million segments of Community law texts in 22 languages – which was used in the demonstration corpus. Please note that only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.

S. Author

João Luís Amorim de Castro Rosas
joao.luis.rosas@gmail.com

The author wishes to stress that the very, very long (unimaginable) working hours and the numerous extremely relevant suggestions of Hilário Leal Fontes and Maria José Machado, who tested this software in an actual translation environment, were an immense contribution and also created a very pleasurable working environment (despite the stress we all suffered ;-) ).
These scripts would not be the same, and would in fact be much worse, without their help, which made them reflect, and try to answer, the practical problems of professional translators.

APPENDIX: default parameters of each of the scripts

NOTE: the lines starting with the symbol “#” are comments that explain the role of the parameters; the parameters are indicated in bold. The vital parameters of each script (those that you will probably want to change if you train your own corpora) are indicated in bold red.

1) create script:

#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Please set $mosesnumprocessors to the number of processors of your computer !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Number of processors in your computer
mosesnumprocessors=1
#Install small demo corpus: 1 = Install; Any other value = Do not install (!!! this will install a very small corpus that can be used to see what the scripts and Moses can do; if dodemocorpus is set to 1, this series of scripts will be able to use the demo corpus without you having to change their settings !!!)
dodemocorpus=1
#Remove the downloaded compressed packages and some directories no longer needed once the installation is done: 1 = remove the downloaded packages; any other value = do not remove those packages
removedownloadedpackges=1

2) make-test script:

#Base path of Moses installation
mosesdir=$HOME/moses-irstlm-randlm
#Source language abbreviation
lang1=pt
#Target language abbreviation
lang2=en
#Number of sectors in which each input file will be cut
totalnumsectors=100
#Number of segments pseudorandomly searched in each sector
numsegs=10
#Name of the source language file used for creating one of the test files (!!! omit the path; the name should not include spaces !!!)
basefilename=200000

3) train script:

#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#NOTE 1: The corpus that you want to train, together with the respective tuning files (if different), the testing files (if different), the file used for recasing and the file used to build the language model (if different), should be placed in $mosesdir/corpora_for_training !!!
#NOTE 2: After the script is executed, you will find a summary of what has been done (the corpus summary file) in $mosesdir/logs
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#====================1. LANGUAGES ========================================
#Abbreviation of language 1 (source language)
lang1=pt
#Abbreviation of language 2 (target language)
lang2=en
#====================2. FILES =============================================
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the files should not include spaces or special characters like asterisks, backslashes or question marks. Try to stick to letters, numbers, the dash, the dot and the underscore if you want to avoid Bash surprises. Avoid using a dash as the first character of a file name, because most Linux commands will treat it as a switch. If your files start with a dot, they'll become hidden files !!! The $corpusbasename, $lmbasename and $recaserbasename parameters that follow OBLIGATORILY have to be defined.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Basename of the corpus placed in $mosesdir/corpora_for_training (the example that follows refers to the 2 files 200000.for_train.en and 200000.for_train.pt, whose basename is 200000.for_train)
corpusbasename=200000.for_train
#Basename of the file used to build the language model (LM), placed in $mosesdir/corpora_for_training (!!! this is a file in the target language !!!)
lmbasename=300000
#Basename of the recaser training file, placed in $mosesdir/corpora_for_training
recaserbasename=300000
#Basename of the tuning corpus, placed in $mosesdir/corpora_for_training
tuningbasename=800
#Basename of the test set files (used for testing the trained corpus), placed in $mosesdir/corpora_for_training
testbasename=200000.for_test
#===================== 3. TRAINING STEPS ===================================
#---------------------------------------------------------------------------------------------------------------------------
#Reuse all relevant files that have already been created in previous trainings: 1 = Do; Any other value = Don't
reuse=1
#---------------------------------------------------------------------------------------------------------------------------
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#NOTE 1: If in doubt, leave the settings that follow as they are; you will do a full training with memory mapping, tuning, a training test and scoring of the training test of the demo corpus; the results will appear in $mosesdir/corpora_trained and a log file will be available in $mosesdir/logs.
#NOTE 2: You can also proceed step by step (e.g., first doing just the LM building and the corpus training, and then the testing), so as to better control the whole process.
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Do parallel corpus training: 1 = Do; Any other value = Don't
paralleltraining=1
#Number of the first training step (possible values: 1-9); choose 1 for a completely new corpus
firsttrainingstep=1
#Number of the last training step (possible values: 1-9); choose 9 for a completely new corpus
lasttrainingstep=9
#Do memory mapping: 1 = Do; Any other value = Don't
memmapping=1
#Do tuning: 1 = Do; Any other value = Don't; can lead, but does not always lead, to better results; takes much more time
tuning=1
#Do a test (with scoring) of the training: 1 = Do; Any other value = Don't
runtrainingtest=1
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# If you are new to Moses, stop here for the time being
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#=================== 4. LANGUAGE MODEL PARAMETERS ======================
# Use IRSTLM (1) or RandLM (5)
lngmdl=1
#Order of n-grams – the higher the better, but more memory is required (choose between 3 and 9; good value: 5)
Gram=5
#---------------------------*** 4.1. IRSTLM PARAMETERS ***------------------------------------------
# Distributed language model: 1 = Yes; Any other value = No (splits the file used to build the language model into parts, processes each part separately and finally merges the parts)
distributed=1
# Number of parts to split the dictionary into balanced n-gram prefix lists (in the creation of a distributed language model); default: 5; !!! Only used if distributed = 1 !!!
dictnumparts=20
# Smoothing; possible values: witten-bell (default), kneser-ney, improved-kneser-ney
s='witten-bell'
# Quantize LM (IRSTLM user manual, p. 4: "Reduces memory consumption at the cost of some loss of performance"): 1 = Do; Any other value = Don't. May induce some accuracy loss. Reduces the size of the LM.
quantize=0
# Memory-mapping of the LM: 1 = Do; Any other value = Don't. Avoids the creation of the binary LM directly in RAM (allows a bigger LM at the cost of lower speed; often necessary when the LM file is very big) !!!
lmmemmapping=1
#------------------------------------*** 4.2. RandLM PARAMETERS ***---------------------------------------
# The format of the input data. The following formats are supported: for a CountRandLM, 'corpus' (tokenised text corpora, one sentence per line); for a BackoffRandLM, 'arpa' (an ARPA backoff language model)
inputtype=corpus
# The false positive rate of the randomised data structure on an inverse log scale, so '-falsepos 8' produces a false positive rate of 1/2^8
falsepos=8
# The quantisation range used by the model. For a CountRandLM, quantisation is performed by taking a logarithm; the base of the logarithm is set as 2^{1/'values'}. For a BackoffRandLM, a binning quantisation algorithm is used; the size of the codebook is set as 2^{'values'}
values=8
#====================== 5. TRAINING PARAMETERS ===========================
#------------------------------------- *** 5.1. TRAINING STEP 1 ***---------------------------------------------
#********** mkcls options
#Number of mkcls iterations (default: 2)
nummkclsiterations=2
#Number of word classes
numclasses=50
#--------------------------------------*** 5.2. TRAINING STEP 2 ***---------------------------------------------
#.................................................. 5.2.1. MGIZA parameters ..................................................
#Number of processors of your computer that will be used by MGIZA (if you use all the processors available, the training will be considerably speeded up)
mgizanumprocessors=1
#.................................................. 5.2.2. GIZA parameters ..................................................
#Maximum sentence length; !!! never exceed 101 !!!
ml=101
#Number of iterations:
#----------------------
#Number of iterations for Model 1
model1iterations=5
#Number of iterations for Model 2
model2iterations=0
#Number of iterations for the HMM (substitutes Model 2)
hmmiterations=5
#Number of iterations for Model 3
model3iterations=3
#Number of iterations for Model 4
model4iterations=3
#Number of iterations for Model 5
model5iterations=0
#Number of iterations for Model 6
model6iterations=0
#
#Parameters for various heuristics in GIZA++ for efficient training:
#-------------------------------------------------------------------
#Counts increment cutoff threshold
countincreasecutoff=1e-06
#Counts increment cutoff threshold for alignments in training of fertility models
countincreasecutoffal=1e-05
#Minimal count increase
mincountincrease=1e-07
#Relative cutoff probability for alignment-centers in pegging
peggedcutoff=0.03
#Probability cutoff threshold for lexicon probabilities
probcutoff=1e-07
#Probability smoothing (floor) value
probsmooth=1e-07
#Parameters describing the type and amount of output:
#-----------------------------------------------------
#0: detailed alignment format; 1: compact alignment format
compactalignmentformat=0
#Dump frequency of Model 1
model1dumpfrequency=0
#Dump frequency of Model 2
model2dumpfrequency=0
#Dump frequency of the HMM
hmmdumpfrequency=0
#Output: dump of transfer from Model 2 to Model 3
transferdumpfrequency=0
#Dump frequency of Models 3/4/5
model345dumpfrequency=0
#For printing the n best alignments
nbestalignments=0
#1: do not write any files
nodumps=1
#1: write alignment files only
onlyaldumps=1
#0: not verbose; 1: verbose
verbose=0
#Number of a sentence for which a lot of information should be printed (negative: no output)
verbosesentence=-10
#Smoothing parameters:
#----------------------
#f-b-trn: smoothing factor for the HMM alignment model (can be ignored by -emSmoothHMM)
emalsmooth=0.2
#Smoothing parameter for IBM-2/3 (interpolation with constant)
model23smoothfactor=0
#Smoothing parameter for alignment probabilities in Model 4
model4smoothfactor=0.4
#Smoothing parameter for distortion probabilities in Model 5 (linear interpolation with constant)
model5smoothfactor=0.1
#Smoothing for fertility parameters (good value: 64): weight for wordlength-dependent fertility parameters
nsmooth=4
#Smoothing for fertility parameters (default: 0): weight for word-independent fertility parameters
nsmoothgeneral=0
#Parameters modifying the models:
#---------------------------------
#1 = only 3-dimensional alignment table for IBM-2 and IBM-3
compactadtable=1
#0 = IBM-3/IBM-4 as described in (Brown et al. 1993); 1: distortion model of empty word is deficient; 2: distortion model of empty word is deficient (differently); setting this parameter also helps to avoid that, during IBM-3 and IBM-4 training, too many words are aligned with the empty word
deficientdistortionforemptyword=0
#d_{=1}: &1:l, &2:m, &4:F, &8:E; d_{>1}: &16:l, &32:m, &64:F, &128:E
depm4=76
#d_{=1}: &1:l, &2:m, &4:F, &8:E; d_{>1}: &16:l, &32:m, &64:F, &128:E
depm5=68
#lextrain: dependencies in the HMM alignment model (&1: sentence length; &2: previous class; &4: previous position; &8: French position; &16: French class)
emalignmentdependencies=2
#f-b-trn: probability for empty word
emprobforempty=0.4
#Parameters modifying the EM algorithm:
#---------------------------------------
#Fixed value for the parameter p_0 in IBM-5 (if negative, it is determined in training)
m5p0=-1
manlexfactor1=0
manlexfactor2=0
manlexmaxmultiplicity=20
#Maximum fertility for fertility models
maxfertility=10
#Fixed value for the parameter p_0 in IBM-3/4 (if negative, it is determined in training)
p0=0.999
#0: no pegging; 1: do pegging
pegging=0
#-------------- *** 5.3. TRAINING SCRIPT PARAMETERS ***----------------------------------------------
#Heuristic used for word alignment; possible values: intersect (intersection seems to be a synonym), union, grow, grow-final, grow-diag, grow-diag-final-and (default value), srctotgt, tgttosrc (Moses manual, p. 72, 144)
alignment=grow-diag-final-and
#Reordering model; possible values: msd-bidirectional-fe (default), msd-bidirectional-f, msd-fe, msd-f, monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe, monotonicity-f (Moses manual, p. 77)
reordering=msd-bidirectional-fe
#Minimum length of the sentences (used by clean)
MinLen=1
#Maximum length of the sentences (used by clean)
MaxLen=60
#Maximum length of phrases entered into the phrase table (max: 7; choose a lower value if the size of the phrase table is an issue; good value for most purposes: 3)
MaxPhraseLength=5
#-------------- *** 5.4. DECODER PARAMETERS ***--------------------------------------------------------
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Only used in the training evaluation, and only if tuning = 0 !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#***** QUALITY TUNING:
# Weight for the phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good translations of each other
weight_t=1
# Weight for the language model (good values: 0.1-1; default: 1); ensures that the output is fluent in the target language
weight_l=1
# Weight for the reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence
weight_d=1
# Weight for the word penalty (good values: -3 to 3; default: 0; negative values favour longer output; positive values favour shorter output); ensures that translations do not get too long or too short
weight_w=0
#------------------------------------------
# Use Minimum Bayes Risk (MBR) decoding (1 = Do; Any other value = Don't); instead of outputting the translation with the highest probability, MBR decoding outputs the translation that is most similar to the most likely translations
mbr=0
# Number of translation candidates to consider; MBR decoding uses by default the top 200 distinct candidate translations to find the translation with minimum Bayes risk
mbrsize=200
# Scaling factor used to adjust the translation scores (default: 1.0)
mbrscale=1.0
# Adds walls around the punctuation characters ,.!?:;". 1 = Do; Any other value = Don't. Specifying reordering constraints around punctuation is often a good idea.
monotoneatpunctuation=0
#***** SPEED TUNING:
# Fixed limit for how many translation options are retrieved for each input phrase (0 = no limit; positive value = number of translation options per phrase)
ttablelimit=20
# Use the relative scores of the hypotheses for pruning, instead of a fixed limit (0 = no pruning; decimal value = more pruning)
beamthreshold=0
# Threshold for constructing hypotheses based on estimated cost (default: 0 = not used). During the beam search, many hypotheses are created that are too bad to be even entered on a stack. For many of them, it is clear even before the construction of the hypothesis that it would not be useful. Early discarding of such hypotheses hazards a guess about their viability. This guess is based on the correct scores, except for the actual language model costs, which are very expensive to compute. Hypotheses that, according to this estimate, are worse than the worst hypothesis of the target stack, even given an additional specified threshold as cushion, are not constructed at all. This often speeds up decoding significantly. Try threshold factors between 0.5 and 1
earlydiscardingthreshold=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#To get faster performance than the default Moses setting at roughly the same quality, use the parameter settings $searchalgorithm=1, $cubepruningpoplimit=2000 and $stack=2000. With cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by the -cube-pruning-pop-limit, i.e. the number of hypotheses added to each stack
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Search algorithm; cube pruning is faster than the traditional search at comparable levels of search errors; 0 = default; 1 = turns on cube pruning
searchalgorithm=0
# Number of hypotheses added to each stack; only a fixed number of hypotheses are generated for each span; default: 1000; higher numbers slow down the decoder but may result in better quality
cubepruningpoplimit=1000
# Size of the hypothesis stack that keeps the best partial translations (= beam); default: 100
stack=100
# Maximum phrase length (default: 20)
maxphraselen=20
# ****** SPEED AND QUALITY TUNING
# Minimum number of hypotheses from each coverage pattern; you may also require that a minimum number of hypotheses is added for each word coverage (they may still be pruned out, however). This is done using the switch -cube-pruning-diversity, which sets the minimum. Default: 0
cubepruningdiversity=0
# Distortion (reordering) limit in maximum number of words (0 = monotone; -1 = unlimited; any other positive value = maximal number of words; default: 6); limiting distortion often increases speed and quality
distortionlimit=6
#================= 6. TUNING PARAMETERS =================================
# Maximum number of runs of tuning (-1 = no limit; any positive number = maximum number of runs)
maxruns=10

4) translate script:

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the files to be translated should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# The source file name and the reference translation file name MUST observe the following conventions:
# Source file: <basename>.<abbreviation of source language> (ex: 100.en)
# Reference translation file: <basename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Full path of the base directory location of your Moses system
mosesdir=$HOME/moses-irstlm-randlm
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! Even if you are using the demonstration corpus, you have to fill in the $logfile parameter so that the script can be executed !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Name of the log file of the corpus to be used (time-saving tip: copy and paste it here; the default directory of the log files is $mosesdir/logs); example of a possible name of a log file: pt-en.C200000.for_train-60-1.LM-300000.MM-1.day-18-01-10-time-14-08-50.txt
logfile=
#Create a translation report when the translations are finished: 1 = Do; Any other value = Don't
create_translation_report=1
#--------------------------------------*** TMX OPTIONS ***-----------------------------------------------------
#Process both the document to be translated and the Moses translation so that the machine translation can best be used in TMX translation memories
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#!!! If you set this parameter to 1, you SHOULD NOT use the score script unless the $othercleanings, $improvesegmentation and $removeduplicates parameters are all set to 0 and $minseglen is set to -1, since this processing changes the order of the segments and can also make the source document have a number of segments that is different from the number of segments of the reference translation (namely because it can delete some segments and/or add some new ones) !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
translate_for_tmx=0
#Minimal length of sentences: -1 = any length; any other value = segments with a length of less than $minseglen will be erased (!!! only active if translate_for_tmx = 1 !!!)
minseglen=-1
#Substitute tabulation signs by newlines and remove lines composed only of digits, spaces and parentheses (!!! only active if translate_for_tmx = 1 !!!)
othercleanings=1
#Substitute any of the characters [:;.!?] followed by a space by that character followed by a newline; delete empty lines; substitute double spaces by a single space (!!! only active if translate_for_tmx = 1 !!!)
improvesegmentation=1
#Sort segments and remove those segments that are identical (!!! only active if translate_for_tmx = 1 !!!)
removeduplicates=1
#---------------------------------- *** MOSES DECODER PARAMETERS ***-------------------------------
#***** QUALITY TUNING:
# Weight for the phrase translation table (good values: 0.1-1; default: 1); ensures that the phrases are good translations of each other
weight_t=1
# Weight for the language model (good values: 0.1-1; default: 1); ensures that the output is fluent in the target language
weight_l=1
# Weight for the reordering model (good values: 0.1-1; default: 1); allows reordering of the input sentence
weight_d=1
# Weight for the word penalty (good values: -3 to 3; default: 0; negative values favour longer output; positive values favour shorter output); ensures that translations do not get too long or too short
weight_w=0
#------------------------------------------
# Use Minimum Bayes Risk (MBR) decoding (1 = Do; Any other value = Don't); instead of outputting the translation with the highest probability, MBR decoding outputs the translation that is most similar to the most likely translations
mbr=0
# Number of translation candidates to consider; MBR decoding uses by default the top 200 distinct candidate translations to find the translation with minimum Bayes risk
mbrsize=200
# Scaling factor used to adjust the translation scores (default: 1.0)
mbrscale=1.0
# Adds walls around the punctuation characters ,.!?:;". 1 = Do; Any other value = Don't. Specifying reordering constraints around punctuation is often a good idea.
monotoneatpunctuation=0
#***** SPEED TUNING:
# Fixed limit for how many translation options are retrieved for each input phrase (0 = no limit; positive value = number of translation options per phrase)
ttablelimit=20
# Use the relative scores of the hypotheses for pruning, instead of a fixed limit (0 = no pruning; decimal value = more pruning)
beamthreshold=0
# Threshold for constructing hypotheses based on estimated cost (default: 0 = not used). During the beam search, many hypotheses are created that are too bad to be even entered on a stack. For many of them, it is clear even before the construction of the hypothesis that it would not be useful. Early discarding of such hypotheses hazards a guess about their viability. This guess is based on the correct scores, except for the actual language model costs, which are very expensive to compute. Hypotheses that, according to this estimate, are worse than the worst hypothesis of the target stack, even given an additional specified threshold as cushion, are not constructed at all. This often speeds up decoding significantly. Try threshold factors between 0.5 and 1
earlydiscardingthreshold=0
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#To get faster performance than the default Moses setting at roughly the same quality, use the parameter settings $searchalgorithm=1, $cubepruningpoplimit=2000 and $stack=2000. With cube pruning, the size of the stack has little impact on performance, so it should be set rather high. The speed/quality trade-off is mostly regulated by the -cube-pruning-pop-limit, i.e. the number of hypotheses added to each stack
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Search algorithm; cube pruning is faster than the traditional search at comparable levels of search errors; 0 = default; 1 = turns on cube pruning
searchalgorithm=0
# Number of hypotheses added to each stack; only a fixed number of hypotheses are generated for each span; default: 1000; higher numbers slow down the decoder but may result in better quality
cubepruningpoplimit=1000
# Size of the hypothesis stack that keeps the best partial translations (= beam); default: 100
stack=100
# Maximum phrase length (default: 20)
maxphraselength=20
# ****** SPEED AND QUALITY TUNING
# Minimum number of hypotheses from each coverage pattern; you may also require that a minimum number of hypotheses is added for each word coverage (they may still be pruned out, however). This is done using the switch -cube-pruning-diversity, which sets the minimum. Default: 0
cubepruningdiversity=0
# Distortion (reordering) limit in maximum number of words (0 = monotone; -1 = unlimited; any other positive value = maximal number of words; default: 6); limiting distortion often increases speed and quality
distortionlimit=6

5) score script:

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! THIS SCRIPT SHOULD NOT BE USED WITH DOCUMENTS TRANSLATED WITH THE translate script WITH ITS $translate_for_tmx PARAMETER SET TO 1 IF ITS $remove_equal PARAMETER IS SET TO A VALUE DIFFERENT FROM 0 !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the source and target reference translation files should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# The source file name and the reference translation file name MUST observe the following conventions:
# Source file: <basename>.<abbreviation of source language> (ex: 100.en)
# Reference translation file: <basename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
# !!! The <basename> should NOT include the character "." !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Base directory of your Moses installation (made with create)
mosesdir=$HOME/moses-irstlm-randlm
#Scores documents prepared for TMX translation memories. If this parameter is set to 1, the script will look for the documents $s and $m in the $mosesdir/translation_files_for_tmx directory; if not set to 1, it will look for the $s document in the $mosesdir/translation_input directory and for the $m document in $mosesdir/translation_output; in both cases, it will look for the $r document in $mosesdir/translation_reference
scoreTMXdocuments=0
#Create a report where each segment gets its own score: 0 = score the whole document; 1 = score each segment
score_line_by_line=0
#Remove Moses translation segments that are equal to reference translation segments and whose BLEU score is zero (!!! only active if score_line_by_line = 1 !!!)
remove_equal=1
#Tokenize the source document, the reference translation and the Moses translation
tokenize=1
#Lowercase the source document, the reference translation and the Moses translation
lowercase=1

6) transfer-training-to-another-location script:

#Base directory of the Moses system (e.g., $HOME/moses-irstlm-randlm) whose trainings you want to transfer
mosesdirmine=$HOME/moses-irstlm-randlm
#Login name of the user to whom the trained corpora will be transferred
newusername=john
#Base directory of the Moses system of the user to whom the trained corpora will be transferred
mosesdirotheruser=/media/1.5T/moses-irstlm-randlm