Estonian specific enhancements which could be used in statistical and hybrid MT systems
R&D report. Draft, Dec 15th 2014

Project name: Linguistic Knowledge in Estonian Machine Translation
Project acronym: EKT63

1. Introduction

We describe our work done in 2014 (morphology toolkit integration, bilingual dictionaries, etc.) and list ideas for subsequent years.

2. State of the art

Recent advances in MT

Since the beginning of this century, statistical machine translation (SMT) has become the dominant approach in machine translation. At first it was applied mostly to widely spoken languages, e.g., English, Spanish or French; as language resources have grown, it has become more popular for smaller languages, including the languages of the Baltic countries: Estonian, Latvian and Lithuanian (e.g. Fishel et al. 2007, Skadiņa 2008). Recently SMT has been accepted as a practical approach by the European Commission, where MT@EC¹ is being implemented with the aim of providing easy access to European Public Services information across Europe in the user's mother tongue.

The most popular approach used in SMT systems is phrase-based SMT (Koehn et al. 2003), where sequences of words (called phrases) are extracted automatically from parallel texts. This approach has shown rather good results for languages with simple morphology, but for languages with rich morphology and rather free word order it has shown several important deficiencies that in many cases make the output of a general SMT system incomprehensible even for gisting purposes. Manual analysis of English-Latvian SMT output (Skadiņa et al. 2012) has revealed the main problems of phrase-based SMT when it is applied to a morphologically rich language: almost half of the sentences contain incorrect word forms, and in many cases the word order is incorrect. As a result, a lot of research has been performed to incorporate linguistic knowledge into SMT.
Among these approaches, the most popular are factored models (Koehn and Hoang 2007), which allow different morphosyntactic properties to be incorporated. Application of factored models to translation into Baltic languages (Latvian and Lithuanian) has shown some improvements in translation quality (e.g. Skadiņš et al. 2010). Similarly, Clifton and Sarkar (2011) achieve the best published result on the English-Finnish task using an unsupervised morphological segmenter with a supervised post-processing merge step.

¹ http://ec.europa.eu/isa/actions/02-interoperability-architecture/2-8action_en.htm

Another proposal is to apply syntax-based models. The Moses framework (Koehn et al. 2007) supports tree-to-string, string-to-tree and tree-to-tree models. One limitation of this approach is the availability of a parser. Initial experiments with tree-to-string models in the Moses framework for English-Latvian have shown no improvements in translation quality. This could be explained by the lack of a Latvian phrase structure parser. Several other frameworks, including Joshua (Weese et al. 2011), Jane (Freitag et al. 2014) and cdec (Dyer et al. 2010), support syntax-based models as well.

Besides linguistically based tree models, Chiang (2007) has proposed hierarchical models in which trees are obtained automatically from parallel texts. This approach has shown good results in terms of the BLEU score for the English-Lithuanian, French-Lithuanian and Russian-Lithuanian language pairs. However, training and decoding speed decreased significantly.

A comprehensive overview of hybrid approaches to machine translation is provided by Thurmair (2009). It includes work on pre- and post-processing (Stymne 2011). Syntactic preordering (Xu et al. 2009, Isozaki et al. 2010) aims to rearrange source language sentences into an order closer to that of the target language. It is useful for language pairs with different syntactic structures and word order.
Post-processing can be used to correct MT output and improve translation quality (Stymne 2011, Mareček et al. 2011). Domain adaptation and the specific treatment of terminology, multiword units and named entities have also been researched and have shown improvements in translation quality. Recently research has turned to deep neural networks (e.g. Le et al. 2012; Auli et al. 2013).

Estonian specific approaches

http://masintolge.ut.ee/info/references.php

[Fishel et al 2007] Estonian-English Statistical Machine Translation: the First Results. Main problems found: wrong order of phrases and sparse data.

[Kirik 2008] Unsupervised Morphology in Statistical Machine Translation. Bachelor thesis. Using factored word alignment usually improves translation quality, but using factored word reordering lowers it. The best factor schema is created by separating the word into a stem factor and its rightmost suffix, and using the stem factor for word alignment training.

[Fishel et al 2010] Linguistically Motivated Unsupervised Segmentation for Machine Translation. (link)

[Kirik 2010] Language model based improvements in statistical machine translation. Master thesis. Factored MT: an additional (secondary) LM using POS tags, word frequencies in training corpora and word forms. No strategies were found that improve translation quality.

Any major paper not listed?

3. Integration of morphological knowledge into the Estonian-English-Estonian MT system

Best practice of using morphology tools in MT

The phrase-based approach in SMT allows source words to be translated differently depending on their context by translating whole phrases, whereas the target language model allows target phrases to be matched at their boundaries. However, most phrases in inflectionally rich languages can be inflected for case, number, tense, mood and other morphosyntactic properties, producing a considerable number of variations. Estonian belongs to the class of inflected languages that are complex from the point of view of morphology.
There are over 2000 different morphology tags for Estonian. With Estonian as the target language for SMT, the high inflectional variation of the target language increases data sparseness at the boundaries of translated phrases, where a language model over surface forms might be inadequate for estimating the probability of the target sentence reliably.

Following the approach of English-Czech factored SMT (Bojar et al., 2009; Tamchyna & Bojar, 2013) and English-Latvian/Lithuanian factored SMT (Skadiņš et al., 2010), the most promising method of incorporating linguistic knowledge in SMT is to use morphology in factored SMT models (Koehn & Hoang, 2007). We have improved word alignment by calculating it over lemmas instead of surface forms, and we have introduced an additional language model over disambiguated morphological part-of-speech tags in the English-Estonian system. An additional language model over morphosyntactic part-of-speech tags can be built in order to improve inter-phrase consistency (Skadiņš et al., 2010; 2014). The tags contain morphological properties generated by a statistical part-of-speech tagger. The order of the tag LM was increased to 7, as the tag data has a significantly smaller vocabulary.

So far we have evaluated the applicability of these methods for English-Estonian SMT only in small-scale experiments; large-scale experiments using all collected parallel training data will be carried out by the end of the project.

Estonian part of speech tagger for MT

As factored SMT requires a part-of-speech tagger, we have created an Estonian part-of-speech tagger. There are two conceptually different types of POS-taggers:
1) POS-taggers that perform POS-tag guessing
2) POS-taggers that perform POS-tag disambiguation.
POS-taggers based on the disambiguation methodology first perform morphological analysis of each token of a text and then disambiguate the POS-tag by analysing the surrounding context (the context may include several tokens around the token being disambiguated and the morphological information of those tokens). Guessers, on the other hand, rely solely on the text tokens and, depending on the machine learning algorithm applied, possibly also on the POS-tags of previously tagged tokens. That is, while a disambiguation-based POS-tagger has a relatively small set of possible POS-tags to select from for each token, a guessing-based POS-tagger has to select the correct tag (i.e., the most likely tag in a particular context) from all possible tags. Both methods can be combined into hybrid methods that handle out-of-vocabulary words better.

The POS-tagger developed in the project is based on the disambiguation methodology. The workflow of POS-tagging a single document is depicted in the figure below.

[Figure: POS-tagging workflow. Input data (plaintext) -> Sentence breaking -> Tokenisation -> Morphological analysis -> Morphological disambiguation -> Output data (POS-tagged)]

The workflow is as follows:
1. First, a document is broken down into sentences. For sentence breaking we use Tilde's proprietary solutions developed prior to the project.
2. Then, each sentence is tokenised. For tokenisation we use Tilde's proprietary solutions developed prior to the project.
3. Next, each token is morphologically analysed using Vabamorf², the freely available morphological analyser by Filosoft. In order to use the morphological analyser, we developed a wrapper (a tool that integrates Vabamorf) that passes the output of Tilde's tokeniser to Vabamorf and converts the Vabamorf output into a format compliant with the POS-tagger's input format.
An example of the output format of the morphological analyser for the sentence “Eesti Vabariik on riik Põhja-Euroopas .” is as follows:

Eesti           Eesti          N--sg---------------------f-  0  0
Eesti           Eesti          N--sn---------------------f-  0  0
Eesti           eesti          G-------------------------f-  0  0
Vabariik        vabariik       N--sn---------------------f-  0  1
on              ole            Vp-s--3--i-a----a---------l-  0  2
on              ole            Vp-p--3--i-a----a---------l-  0  2
riik            riik           N--sn---------------------l-  0  3
Põhja-Euroopas  Põhja-Euroopa  N--ss---------------------u-  0  4
.               .              T---------------------------  0  5

The example shows that there are 3 possible morphological analyses for the token “Eesti”, one analysis for “Vabariik”, 2 for “on”, and 1 for each of the remaining three tokens. The task of the POS-tagger's disambiguation module is to select the most probable tag from the given tags.

4. Finally, as the last step, the disambiguation is performed by a disambiguation module that is based on the averaged perceptron (Rosenblatt, 1958) methodology³ and uses a pre-trained POS-tagging model.

The POS-tagger provides two output formats: a TreeTagger⁴ compliant output format and a Moses⁵ factored data compliant output format. Examples are as follows:

² Vabamorf is available online at: https://github.com/Filosoft/vabamorf
³ More information about the perceptron algorithm can be found online at: http://en.wikipedia.org/wiki/Perceptron. The averaged perceptron algorithm is described online at: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/.

1. TreeTagger format example.

Eesti           N  Eesti          N--sg---------------------f-
Vabariik        N  vabariik       N--sn---------------------f-
on              V  ole            Vp-s--3--i-a----a---------l-
riik            N  riik           N--sn---------------------l-
Põhja-Euroopas  N  Põhja-Euroopa  N--ss---------------------u-
.               T  .              T---------------------------

2. Moses factored data format example.
Eesti|Eesti|N--sg---------------------f- Vabariik|vabariik|N--sn---------------------f- on|ole|Vp-s--3--i-a----a---------l- riik|riik|N--sn---------------------l- Põhja-Euroopas|Põhja-Euroopa|N--ss---------------------u- .|.|T---------------------------

In order to train the Estonian POS-tagging model, we used the morphologically annotated corpus of Estonian created by the University of Tartu⁶. As the corpus was annotated using a tagset different from the one used by the morphological analyser, we transformed the tags of the annotated data into the tags that the morphological analyser provides for the tokens. For POS-tagger model training purposes, for each token in the annotated data we identified the correct POS-tag and the correct lemma, and added the other possible analyses that the morphological analyser provides for the token. During the semi-automatic training data preparation process approximately 2,500 sentences were discarded because they contained tokens for which the morphological analyser did not provide an analysis that could be uniquely matched to the annotated data tags (possible explanations are mismatches between the annotation guidelines used in the corpus creation process and the morphological analyser, as well as annotation mistakes in the morphologically annotated corpus). After the transformation of the annotated data, a total of 42,287 sentences (585,965 tokens) remained in the annotated training data corpus.

When the training data was ready, we trained the POS-tagging model in several iterative steps, improving the feature set (i.e., the context length, different morphological parameter configurations, etc.) used by the averaged perceptron learning algorithm. In each iteration, the POS-tagging model was evaluated using 10-fold cross-validation. The resulting POS-tagger achieves a precision of 97.51±0.08% (99% confidence interval). This is a state-of-the-art result for POS-tagging of Estonian texts.

⁴ TreeTagger can be acquired online at: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
⁵ More information about the Moses factored data format can be found online at: http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining.
⁶ The morphologically annotated corpus is available online at: http://www.cl.ut.ee/korpused/morfkorpus/.

4. Parallel corpora and trained MT systems

This section describes the general-domain MT system Tilde built in 2014 and compares it with some other MT systems.

At Tilde we have been building Estonian MT systems regularly for at least three years, trying to improve them each time. We have built both general-domain and some domain-specific MT systems, and we have documented and presented our experience (Skadiņš et al., 2014).

This year our goal was to build the best possible general-domain English-Estonian MT system by using the best data set available together with our current technology. We train an SMT system with the LetsMT MT platform (Vasiļjevs et al., 2012), which is based on the Moses toolkit (Koehn et al., 2007). For this system we did not use any technological advancements beyond those of the previous system; they are in the pipeline and will be examined in future system builds in 2015. In addition to the formerly used data, we used some new public corpora as well as data processed within the EOPC project. We can still see a correlation between the amount of training data used and the quality of the MT system.

We use both publicly available corpora collected by other institutions and corpora collected by us. The most important sources of data are:
1) Publicly available parallel corpora: the Europarl corpus, DGT-TM, JRC-Acquis, ECDC and other corpora available from the Joint Research Centre, and the OPUS corpus, which includes data from the European Medicines Agency, the European Central Bank, the EU constitution and others.
2) Parallel corpora collected by us: national legislation, standards, technical documents and product descriptions widely available on the web (some examples: www.ceresit.net, www.europenikon.com), EU brochures from EU Bookshop, news portals and many more.
3) Monolingual corpora collected by us: mainly data crawled from the web (state institutions, portals, newspapers etc.).

See Table 1 for the amount of data used in the training of our SMT systems.

We used the BLEU metric for automatic evaluation on a general-domain evaluation corpus, which is a mixture of texts from different domains representing the expected translation needs of a typical user. The corpus includes fiction, business letters, IT texts, news and magazine articles, legal documents, popular science texts, manuals and EU legal texts, and it contains 512 parallel sentences in English and Estonian.

Table 1. Amount of training data and results of the automatic evaluation

Language direction               Parallel corpus,   Monolingual corpus,   BLEU score
                                 sentences          sentences
English – Estonian, 2012         10.5 M             28.3 M                23.78
English – Estonian v2.1, 2013    12.5 M             33.1 M                24.22
English – Estonian v2.2, 2014    17.8 M             37.1 M                24.48

The summary of the automatic evaluation results in comparison with Google Translator is presented in Figure 1.

[Figure 1: bar chart “en-et systems – BLEU Scores” comparing Tilde and Google for en-et (16.11.2012), en-et v2.1 (17.09.2013) and en-et v2.2 (08.12.2014). Data labels recoverable from the chart: 23.78, 24.22, 24.48, 21.45, 20.69.]

Figure 1. Our MT systems automatically compared to Google

For human evaluation of the systems we used ranking of translated sentences relative to each other. This is the official determinant of translation quality used in the Workshop on Statistical Machine Translation shared tasks.
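As an illustration of how ranking judgements can be turned into percentages for a system comparison, the following minimal Python sketch computes, for each system, the share of comparisons in which evaluators preferred its translation. The data layout, the function name `preference_shares` and the handling of ties are assumptions made for this example only; the report does not specify the exact procedure or tooling that was used.

```python
from collections import Counter


def preference_shares(judgements):
    """Return the share of judgements received by each label.

    `judgements` holds one label per evaluated sentence pair: the name of
    the system whose translation was preferred, or "tie" (a hypothetical
    layout chosen for this sketch).
    """
    if not judgements:
        raise ValueError("no judgements given")
    counts = Counter(judgements)
    total = len(judgements)
    # Each share is the fraction of all judgements carrying that label.
    return {label: count / total for label, count in counts.items()}


# Toy data for illustration only; these are not results from the report.
sample = ["system_a", "system_b", "system_a", "tie"]
shares = preference_shares(sample)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

Counting ties as a separate label keeps the shares summing to one; whether ties should be excluded or split between the systems is a design choice that depends on the evaluation protocol.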
The summary of the human evaluation results in comparison with Google Translator is presented in Figure 2.

[Figure 2: bar chart “en-et systems – Human Evaluation” comparing Tilde and Google for en-et v2.1 (17.09.2013) and en-et v2.2 (08.12.2014). Data labels recoverable from the chart: 49%, 47%, 45%, 43%.]

Figure 2. Our MT systems compared to Google Translator by human evaluation

Training a new English-Estonian MT system has proven worthwhile: the results improve on the previous versions of Tilde MT systems and compare favourably with the competitor.

5. R&D plans for 2015

As future directions for quality improvements we see:
- Continuing parallel data collection (including (i) advanced content crawling methods and (ii) targeted MT output post-editing to create new MT training data)
- More efficient use of Estonian morphology knowledge in SMT (including (i) improved word alignments using morphology and language specific data pre-processing rules, (ii) improved language modelling)
- Better treatment of non-translatable tokens (e-mails, web addresses, numbers, brackets, quotes, tags etc.)

References

Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint Language and Translation Modeling with Recurrent Neural Networks. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 763-770, Columbus, Ohio, USA. Association for Computational Linguistics.

O. Bojar, D. Mareček, V. Novák et al. 2009. English-Czech MT in 2008. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece. Association for Computational Linguistics.

Ann Clifton and Anoop Sarkar. 2011. Combining Morpheme-based Machine Translation with Postprocessing Morpheme Prediction. ACL 2011.

C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. 2010. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proceedings of ACL, July 2010.

M. Freitag, M. Huck, and H. Ney. 2014. Jane: Open Source Machine Translation System Combination. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, April 2014.

Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 244–251.

Philipp Koehn, Franz Josef Och, Daniel Marcu. 2003. Statistical phrase based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, demonstration session.

Hai-Son Le, Alexandre Allauzen, and Francois Yvon. 2012. Continuous Space Translation Models with Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montreal, Canada. Association for Computational Linguistics.

David Mareček, Rudolf Rosa, Petra Galuščáková and Ondřej Bojar. 2011. Two-step translation with grammatical post-processing. In Proceedings of WMT 2011, EMNLP 6th Workshop on Statistical Machine Translation, Edinburgh, UK, pp. 426–432.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.

Skadiņa I., Brālītis E. 2008. Experimental Statistical Machine Translation System for Latvian. Proceedings of the 3rd Baltic Conference on HLT, Vilnius, 2008, pp. 281-286.

Skadiņa I., K. Levāne-Petrova, G. Rābante. 2012. Linguistically Motivated Evaluation of English-Latvian Statistical Machine Translation. Human Language Technologies – The Baltic Perspective: Proceedings of the Fifth International Conference Baltic HLT 2012, IOS Press, Frontiers in Artificial Intelligence and Applications, Vol. 247, pp. 221-229.

Skadiņš, R., Goba, K., & Šics, V. (2010). Improving SMT for Baltic Languages with Factored Models. In Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 2192 (pp. 125–132). Riga: IOS Press.

Skadiņš, R., Šics, V., & Rozis, R. (2014). Building the World’s Best General Domain MT for Baltic Languages. In A. Utka, G. Grigonytė, J. Kapočiūtė-Dzikienė, & J. Vaičenonienė (Eds.), Frontiers in Artificial Intelligence and Applications: Volume 286. Human Language Technologies – The Baltic Perspective - Proceedings of the Sixth International Conference Baltic HLT 2014 (pp. 141–148). Kaunas, Lithuania: IOS Press. doi:10.3233/978-1-61499-442-8-141

Sara Stymne. 2011. Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages. In Proceedings of the ACL 2011 Student Session, pages 12-17. June 19-24, 2011, Portland, Oregon, USA.

Tamchyna, A., & Bojar, O. (2013). No Free Lunch in Factored Phrase-Based Machine Translation. In Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, Volume 7817 (pp. 210-223). Springer Berlin Heidelberg.

Gregor Thurmair. 2009. Comparing different architectures of hybrid machine translation systems. MT Summit XII: proceedings of the twelfth Machine Translation Summit, August 26-30, 2009, Ottawa, Ontario, Canada, pp. 340-347.

Vasiļjevs, A., Skadiņš, R., & Tiedemann, J. (2012). LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation. In Proceedings of the ACL 2012 System Demonstrations (pp. 43–48). Jeju Island, Korea: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P12-3008

Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post, and Adam Lopez. 2011. Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor. In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011).

Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In North American Chapter of the Association for Computational Linguistics, pages 245–253.