Estonian-specific enhancements that can be used in statistical and hybrid MT systems. R&D report.
Draft, 15 December 2014
Project name: Linguistic Knowledge in Estonian Machine Translation
Project acronym: EKT63
1. Introduction
We describe our work done in 2014 (morphology toolkit integration, bilingual dictionaries, etc.) and list ideas for the subsequent years.
2. State of the art
Recent advances in MT
Since the beginning of this century, statistical machine translation (SMT) has become the dominant approach in machine translation. While at the beginning it was applied mostly to widely spoken languages, e.g., English, Spanish or French, with the growth of language resources it has become more popular for smaller languages as well, including the languages of the Baltic countries – Estonian, Latvian and Lithuanian (e.g. Fishel et al. 2007, Skadiņa and Brālītis 2008). Recently SMT has been accepted as a practical approach by the European Commission, whose MT@EC service (http://ec.europa.eu/isa/actions/02-interoperability-architecture/2-8action_en.htm) aims at providing easy access to European public services information across Europe in the user's mother tongue.
The most popular approach used in SMT systems is phrase-based SMT (Koehn et al. 2003), where sequences of words (called phrases) are retrieved automatically from parallel texts. This approach has shown rather good results for languages with simple morphology, while for languages with rich morphology and rather free word order it has shown several important deficiencies that in many cases make the output of a general SMT system incomprehensible even for gisting purposes. Manual analysis of English-Latvian SMT output (Skadiņa et al. 2012) has revealed the main problems of phrase-based SMT when applied to a morphologically rich language: almost half of the sentences contain incorrect word forms, and in many cases the word order is incorrect.
As a result, a lot of research has been performed to incorporate linguistic knowledge into SMT. Among the proposed methods, the most popular are factored models (Koehn and Hoang 2007), which allow incorporating different morphosyntactic properties. Application of factored models to translation into the Baltic languages (Latvian and Lithuanian) has shown some improvements in translation quality (e.g. Skadiņš et al. 2010). Similarly, Clifton and Sarkar (2011) achieved the best published result on the English-Finnish task using an unsupervised morphological segmenter with a supervised post-processing merge step.
Another proposal is to apply syntax-based models. The Moses framework (Koehn et al. 2007) supports tree-to-string, string-to-tree and tree-to-tree models. One limitation of this approach is the availability of a parser. Initial experiments with tree-to-string models in the Moses framework for English-Latvian have shown no improvements in translation quality. This could be explained by the lack of a Latvian phrase structure parser. Several other frameworks, including Joshua (Weese et al. 2011), Jane (Freitag et al. 2014) and cdec (Dyer et al. 2010), support syntax-based models as well.
Besides linguistically based tree models, Chiang (2007) has proposed hierarchical models where trees are obtained automatically from parallel texts. This approach has shown good results for the English-Lithuanian, French-Lithuanian and Russian-Lithuanian language pairs in terms of the BLEU score. However, training and decoding are significantly slower.
A comprehensive overview of hybrid approaches to machine translation is provided by Thurmair (2009). It includes work on pre- and post-processing (Stymne 2011). Syntactic pre-ordering (Xu et al. 2009, Isozaki et al. 2010) aims to rearrange source language sentences into an order closer to that of the target language. It is useful for languages with different syntactic structures and word order. Post-processing can be used to correct MT output and improve translation quality (Stymne 2011, Mareček et al. 2011).
Domain adaptation and the specific treatment of terminology, multiword units and named entities have also been researched and have shown improvements in translation quality.
Recently research has turned to deep neural networks (e.g. Le et al. 2012; Auli et al. 2013).
Estonian-specific approaches
Prior work on Estonian MT (see also http://masintolge.ut.ee/info/references.php):
- [Fishel et al. 2007] Estonian-English Statistical Machine Translation: the First Results. Main problems found: wrong order of phrases and sparse data.
- [Kirik 2008] Unsupervised Morphology in Statistical Machine Translation. Bachelor's thesis. Using factored word alignment usually improves translation quality, but using factored word reordering lowers it. The best factor scheme is created by separating a word into a stem factor and its rightmost suffix, and using the stem factor for word alignment training.
- [Fishel et al. 2010] Linguistically Motivated Unsupervised Segmentation for Machine Translation. (link)
- [Kirik 2010] Language model based improvements in statistical machine translation. Master's thesis. Factored MT: an additional (secondary) LM using POS tags, word frequencies in training corpora and word forms. No strategies were found that improve translation quality.
Any major paper not listed?
3. Integration of morphology knowledge into the English-Estonian MT system
Best practice of using morphology tools in MT
The phrase-based approach in SMT allows translating source words differently depending on their context by translating whole phrases, whereas the target language model allows matching target phrases at their boundaries. However, most phrases in inflectionally rich languages can be inflected for case, number, tense, mood and other morphosyntactic properties, producing a considerable number of variants.
Estonian belongs to the class of inflected languages that are complex from the point of view of morphology. There are over 2000 different morphology tags for Estonian. With Estonian as the target language for SMT, the high inflectional variation of the target language increases data sparseness at the boundaries of translated phrases, where a language model over surface forms might be inadequate to estimate the probability of the target sentence reliably. Following the approach of English-Czech factored SMT (Bojar et al., 2009; Tamchyna & Bojar, 2013) and English-Latvian/Lithuanian factored SMT (Skadiņš et al., 2010), the most promising method to incorporate linguistic knowledge in SMT is to use morphology in factored SMT models (Koehn & Hoang, 2007). We have improved word alignment by calculating it over lemmas instead of surface forms, and we introduced an additional language model over disambiguated morphological part-of-speech tags in the English-Estonian system. Such an additional language model over morphosyntactic part-of-speech tags can be built in order to improve inter-phrase consistency (Skadiņš et al., 2010; 2014). The tags contain morphological properties generated by a statistical part-of-speech tagger. The order of the tag LM was increased to 7, as the tag data has a significantly smaller vocabulary.
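To make the idea concrete, the following is a minimal sketch – not our actual pipeline – of how a 7-gram language model over morphological tags can be built and queried with the KenLM toolkit that ships with Moses. The file names are assumptions.

# Build a 7-gram tag LM from one tag sequence per line (hypothetical file):
#   lmplz -o 7 --discount_fallback < tags.train.txt > tags.7gram.arpa

import kenlm

tag_lm = kenlm.Model('tags.7gram.arpa')

# Disambiguated tag sequence for "Eesti Vabariik on riik Põhja-Euroopas .";
# a higher (less negative) log10 score means the morphological tag sequence
# is more fluent according to the tag LM.
tags = ('N--sg---------------------f- N--sn---------------------f- '
        'Vp-s--3--i-a----a---------l- N--sn---------------------l- '
        'N--ss---------------------u- T--------------------------')
print(tag_lm.score(tags, bos=True, eos=True))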
At the moment we have evaluated the applicability of these methods for English-Estonian SMT using only small-scale experiments; the large-scale experiments using all collected parallel training data will be done by the end of the project.
Estonian part-of-speech tagger for MT
As a part-of-speech tagger is necessary for factored SMT, we have created an Estonian part-of-speech tagger. There are two conceptually different types of POS-taggers:
1) POS-taggers that perform POS-tag guessing;
2) POS-taggers that perform POS-tag disambiguation.
The POS-taggers that are based on the disambiguation methodology perform morphological analysis of each token of a text and then perform POS-tag disambiguation by analysing the surrounding context (the context may include several tokens around the token being disambiguated as well as morphological information of the surrounding tokens). The guessers, on the other hand, rely solely on the text tokens and, depending on the machine learning algorithms applied, possibly also on the POS-tags of previously tagged tokens. That is, while disambiguation-based POS-taggers have a relatively small set of possible POS-tags to select from when disambiguating a token, guessing-based POS-taggers have to select the correct tag (i.e., the most likely tag in a particular context) from all possible tags; the difference is sketched below. Both methods can be combined into hybrid methods that handle out-of-vocabulary words better.
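The following minimal Python sketch illustrates the difference between the two tagger types; score() stands in for any trained scoring model, and the tagsets and context handling are illustrative assumptions, not our project code.

def tag_by_disambiguation(token, candidate_tags, context, score):
    # Disambiguation: the morphological analyser has already narrowed the
    # choice down to the analyses it produced for this token.
    return max(candidate_tags, key=lambda t: score(token, t, context))

def tag_by_guessing(token, full_tagset, context, score):
    # Guessing: the model must consider every tag in the tagset, e.g. all
    # 2000+ Estonian morphology tags mentioned above.
    return max(full_tagset, key=lambda t: score(token, t, context))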
The POS-tagger developed in the project is based on the disambiguation methodology. The workflow of
POS-tagging a single document is depicted in the figure below.
Input data (plaintext) → Sentence breaking → Tokenisation → Morphological analysis → Morphological disambiguation → Output data (POS-tagged)
The workflow is as follows:
1. At first, a document is broken down into sentences. For sentence breaking we use Tilde’s
proprietary solutions developed prior to the project.
2. Then, each sentence is tokenised. For tokenisation we use Tilde’s proprietary solutions
developed prior to the project.
3. Next, each token is morphologically analysed using the freely available Filosoft morphological analyser Vabamorf (https://github.com/Filosoft/vabamorf). In order to use the morphological analyser, we developed a wrapper (a tool that integrates Vabamorf) that allows passing Tilde's tokeniser output data to Vabamorf and converts Vabamorf output data into a format compliant with the POS-tagger's input data. An example of the output format of the morphological analyser for the sentence "Eesti Vabariik on riik Põhja-Euroopas ." is as follows:
Eesti           Eesti           N--sg---------------------f-    0    0
Eesti           Eesti           N--sn---------------------f-    0    0
Eesti           eesti           G-------------------------f-    0    0
Vabariik        vabariik        N--sn---------------------f-    0    1
on              ole             Vp-s--3--i-a----a---------l-    0    2
on              ole             Vp-p--3--i-a----a---------l-    0    2
riik            riik            N--sn---------------------l-    0    3
Põhja-Euroopas  Põhja-Euroopa   N--ss---------------------u-    0    4
.               .               T--------------------------     0    5
The example shows that there are 3 possible morphological analyses for the token "Eesti", one analysis for "Vabariik", 2 for "on", and 1 for each of the remaining three tokens. The task of the POS-tagger's disambiguation module is to select the most probable tag from the given tags.
4. Finally, as the last step, the disambiguation is performed by a disambiguation module that is based on the averaged perceptron methodology (Rosenblatt, 1958; a readable description of the averaged perceptron is available at https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/) and uses a pre-trained POS-tagging model.
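As an illustration, the following is a compact averaged-perceptron sketch in the spirit of the blog post cited above; the feature templates are illustrative assumptions and not our actual feature set.

from collections import defaultdict

class AveragedPerceptron:
    # A minimal averaged-perceptron sketch for choosing one analysis per
    # token from the analyser's candidates (illustrative, not project code).

    def __init__(self):
        self.weights = defaultdict(float)
        self._totals = defaultdict(float)  # running weight sums for averaging
        self._stamps = defaultdict(int)    # step at which a weight last changed
        self.step = 0

    def features(self, token, tag, prev_tag):
        # Simple context features: the word itself, the previous tag,
        # and the word's 3-character suffix, each paired with the tag.
        return (('bias', tag), ('word', token.lower(), tag),
                ('prev', prev_tag, tag), ('suffix3', token[-3:], tag))

    def predict(self, token, candidates, prev_tag):
        # Disambiguation: score only the analyser's candidate tags.
        return max(candidates, key=lambda t: sum(
            self.weights[f] for f in self.features(token, t, prev_tag)))

    def _bump(self, feat, delta):
        # Accumulate the area under the weight curve before changing it,
        # so the final weights can be averaged over all update steps.
        self._totals[feat] += (self.step - self._stamps[feat]) * self.weights[feat]
        self._stamps[feat] = self.step
        self.weights[feat] += delta

    def update(self, token, candidates, prev_tag, gold):
        self.step += 1
        guess = self.predict(token, candidates, prev_tag)
        if guess != gold:  # reward gold features, penalise the wrong guess
            for f in self.features(token, gold, prev_tag):
                self._bump(f, +1.0)
            for f in self.features(token, guess, prev_tag):
                self._bump(f, -1.0)

    def average(self):
        # Call once after training: replace weights with their averages.
        for f in list(self.weights):
            total = self._totals[f] + (self.step - self._stamps[f]) * self.weights[f]
            self.weights[f] = total / max(self.step, 1)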
The POS-tagger provides two output formats – a TreeTagger compliant output format (TreeTagger: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) and a Moses factored data compliant output format (see http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining). Examples are as follows:
1. TreeTagger format example.

Eesti           N    Eesti           N--sg---------------------f-
Vabariik        N    vabariik        N--sn---------------------f-
on              V    ole             Vp-s--3--i-a----a---------l-
riik            N    riik            N--sn---------------------l-
Põhja-Euroopas  N    Põhja-Euroopa   N--ss---------------------u-
.               T    .               T--------------------------
2. Moses factored data format example.

Eesti|Eesti|N--sg---------------------f- Vabariik|vabariik|N--sn---------------------f- on|ole|Vp-s--3--i-a----a---------l- riik|riik|N--sn---------------------l- Põhja-Euroopas|Põhja-Euroopa|N--ss---------------------u- .|.|T--------------------------
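For illustration, a minimal sketch of consuming the Moses factored format shown above, where each whitespace-separated token carries surface form, lemma and morphological tag separated by '|':

line = ('Eesti|Eesti|N--sg---------------------f- '
        'Vabariik|vabariik|N--sn---------------------f- '
        'on|ole|Vp-s--3--i-a----a---------l- '
        'riik|riik|N--sn---------------------l- '
        'Põhja-Euroopas|Põhja-Euroopa|N--ss---------------------u- '
        '.|.|T--------------------------')

for token in line.split():
    surface, lemma, tag = token.split('|')
    print(surface, lemma, tag[0])  # tag[0] is the coarse POS (N, V, T, ...)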
In order to train the Estonian POS-tagging model, we used the morphologically annotated corpus of Estonian created by the University of Tartu (http://www.cl.ut.ee/korpused/morfkorpus/). As the corpus was annotated using a different tagset than the one used by the morphological analyser, we transformed the tags of the annotated data into the tags provided for the tokens by the morphological analyser. For POS-tagger model training purposes, for each token in the annotated data we identified the correct POS-tag and the correct lemma, and kept the other possible analyses provided by the morphological analyser for the token.
During the semi-automatic training data preparation process, approximately 2,500 sentences were discarded because they contained tokens for which the morphological analyser did not provide an analysis that could be uniquely matched to the annotated data tags (possible explanations are mismatches between the annotation guidelines used in the corpus creation process and the morphological analyser, as well as annotation mistakes in the morphologically annotated corpus). After transformation of the annotated data, a total of 42,287 sentences (585,965 tokens) remained in the annotated training data corpus.
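The following sketch illustrates the matching and discarding logic described above; the helper names (analyse, tag_map) are hypothetical stand-ins for the actual tools.

def match_analysis(gold_lemma, gold_tag, analyses, tag_map):
    # Return the unique analyser analysis matching the gold annotation
    # after mapping the corpus tag into the analyser's tagset, else None.
    mapped = tag_map.get(gold_tag)
    hits = [a for a in analyses if a.lemma == gold_lemma and a.tag == mapped]
    return hits[0] if len(hits) == 1 else None

def prepare(sentences, analyse, tag_map):
    kept, discarded = [], 0
    for sent in sentences:  # sent: list of (token, gold_lemma, gold_tag)
        rows = []
        for token, lemma, tag in sent:
            analyses = analyse(token)  # e.g. a Vabamorf wrapper call
            gold = match_analysis(lemma, tag, analyses, tag_map)
            if gold is None:           # no unique match: drop whole sentence
                rows = None
                break
            rows.append((token, gold, analyses))
        if rows is None:
            discarded += 1
        else:
            kept.append(rows)
    return kept, discarded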
When the training data was ready, we trained the POS-tagging model in several iterative steps by improving the feature set (i.e., the context length, different morphological parameter configurations, etc.) used by the averaged perceptron learning algorithm. In each iteration, the POS-tagging model was evaluated using 10-fold cross-validation. The resulting POS-tagger achieves a precision of 97.51±0.08% (at a 99% confidence level). This is a state-of-the-art result for POS-tagging of Estonian texts.
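For reference, a minimal sketch of this evaluation protocol, with hypothetical train() and evaluate() stand-ins; the confidence interval uses the normal approximation over per-fold precision values.

import math, random, statistics

def cross_validate(sentences, train, evaluate, folds=10, seed=0):
    data = sentences[:]
    random.Random(seed).shuffle(data)
    scores = []
    for k in range(folds):
        test = data[k::folds]                                 # fold k
        train_set = [s for i, s in enumerate(data) if i % folds != k]
        model = train(train_set)
        scores.append(evaluate(model, test))                  # per-fold precision
    mean = statistics.mean(scores)
    # 99% confidence interval via the normal approximation (z = 2.576).
    half = 2.576 * statistics.stdev(scores) / math.sqrt(folds)
    return mean, half  # e.g. 0.9751 ± 0.0008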
4. Parallel corpora and trained MT systems
This section accounts for the general-domain MT systems Tilde has built in 2014 and their comparison with some other MT systems. At Tilde we have been building Estonian MT systems on a regular basis for at least 3 years, trying to improve them each time. We have built both general-domain and domain-specific MT systems, and we have documented and presented our experience (Skadiņš et al., 2014).
This year our goal was to build the best possible general-domain English-Estonian MT system by making use of the best data set available with our current technology.
We trained an SMT system with the LetsMT platform (Vasiļjevs et al., 2012), which is based on the Moses toolkit (Koehn et al., 2007). In this MT system training we did not use any newer technological advancements in comparison to the previous system – they are in the pipeline and will be examined in future system builds in 2015.
We made use of some new public corpora as well as data processed within the EOPC project, in addition to the formerly used data. We can still see a correlation between the amount of training data used and the quality of the MT system.
We use both publicly available corpora collected by other institutions and corpora collected by us. The
most important sources of data are:
1) Publicly available parallel corpora – the Europarl corpus, DGT-TM, JRC-Acquis, ECDC and other corpora available from the Joint Research Centre, and the OPUS corpus, which includes data from the European Medicines Agency, the European Central Bank, the EU constitution and others.
2) Parallel corpora collected by us – national legislation, standards, technical documents and product descriptions widely available on the web (some examples: www.ceresit.net, www.europe-nikon.com), EU brochures from the EU Bookshop, news portals and many more.
3) Monolingual corpora collected by us – mainly data crawled from the web (state institutions, portals, newspapers, etc.).
See Table 1 for the amounts of data used in the training of our SMT systems.
We used the BLEU metric for automatic evaluation, using a general-domain evaluation corpus that represents a mixture of texts from different domains, reflecting the expected translation needs of a typical user. The corpus includes texts from fiction, business letters, IT texts, news and magazine articles, legal documents, popular science texts, manuals and EU legal texts, and it contains 512 parallel sentences in English and Estonian.
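As an illustration, a minimal sketch of corpus-level BLEU scoring using NLTK; we may equally well use a different BLEU implementation (e.g. Moses' multi-bleu.perl), and the file names here are assumptions.

from nltk.translate.bleu_score import corpus_bleu

def read_tokenised(path):
    with open(path, encoding='utf-8') as f:
        return [line.split() for line in f]

hypotheses = read_tokenised('system.et')               # MT output, one sentence per line
references = [[r] for r in read_tokenised('ref.et')]   # one reference per sentence

# corpus_bleu expects, per sentence, a list of reference token lists.
print('BLEU: %.2f' % (100 * corpus_bleu(references, hypotheses)))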
Table 1. Amount of training data and results of the automatic evaluation

Language direction               Parallel       Monolingual    BLEU score
                                 (sentences)    (sentences)
English – Estonian, 2012         10.5 M         28.3 M         23.78
English – Estonian v2.1, 2013    12.5 M         33.1 M         24.22
English – Estonian v2.2, 2014    17.8 M         37.1 M         24.48
The summary of automatic evaluation results in comparison with Google Translate is presented in Figure 1.
[Figure 1: bar chart of en-et BLEU scores. Tilde: 23.78 (en-et, 16.11.2012), 24.22 (en-et v2.1, 17.09.2013), 24.48 (en-et v2.2, 08.12.2014); Google: 20.69 and 21.45.]
Figure 1. Our MT systems automatically compared to Google
For human evaluation of the systems we used ranking of translated sentences relative to each other. This is the official determinant of translation quality used in the Workshop on Statistical Machine Translation shared tasks; a sketch of how such pairwise rankings are summarised follows Figure 2. The summary of human evaluation results in comparison with Google Translate is presented in Figure 2:
[Figure 2: bar chart of en-et human evaluation results for en-et v2.1 (17.09.2013) and en-et v2.2 (08.12.2014), Tilde vs Google; values shown: 49%, 47%, 45%, 43%.]
Figure 2. Our MT systems compared to Google Translator by human evaluation
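As an illustration of how such rankings are summarised, the following sketch computes per-system win percentages from pairwise judgments; the judgment format is an assumption, not our actual evaluation tooling.

from collections import Counter

def win_rates(judgments):
    # judgments: iterable of (winner, loser) system-name pairs,
    # one per human ranking decision (ties excluded).
    wins, comparisons = Counter(), Counter()
    for winner, loser in judgments:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    return {sys: wins[sys] / comparisons[sys] for sys in comparisons}

print(win_rates([('tilde', 'google'), ('google', 'tilde'), ('tilde', 'google')]))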
The work of training a new English-Estonian MT system has proven to be worthwhile, as the results obtained are better both in comparison with the previous versions of Tilde MT systems and in comparison with the competitor.
5. R&D plans for 2015
As future directions for quality improvements we see:
- Continuing parallel data collection (including (i) advanced content crawling methods and (ii) targeted MT output post-editing to create new MT training data)
- More efficient use of Estonian morphology knowledge in SMT (including (i) improved word alignment using morphology and language-specific data pre-processing rules, and (ii) improved language modelling)
- Better treatment of non-translatable tokens (e-mails, web addresses, numbers, brackets, quotes, tags, etc.); a sketch of one possible treatment follows this list.
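As an illustration of the last item, a minimal sketch of protecting non-translatable tokens with placeholders before translation and restoring them afterwards; the placeholder scheme and patterns are illustrative assumptions.

import re

PATTERNS = re.compile(r"""
    [\w.+-]+@[\w-]+\.[\w.]+     # e-mail addresses
  | (?:https?://|www\.)\S+      # web addresses
  | <[^>]+>                     # markup tags
""", re.VERBOSE)

def protect(text):
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return f'__NT{len(saved) - 1}__'  # hypothetical placeholder scheme
    return PATTERNS.sub(stash, text), saved

def restore(text, saved):
    return re.sub(r'__NT(\d+)__', lambda m: saved[int(m.group(1))], text)

masked, saved = protect('Kirjuta aadressil info@tilde.ee või vaata www.letsmt.eu')
# ... translate `masked` with the SMT system here ...
print(restore(masked, saved))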
References
Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint Language and Translation
Modeling with Recurrent Neural Networks. Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing.
Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical
machine translation. In Proceedings of the 46th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies, pages 763-770, Columbus, Ohio,
USA. Association for Computational Linguistics.
O. Bojar, D. Mareček, V. Novák et al. 2009. English-Czech MT in 2008. In Proceedings of the Fourth
Workshop on Statistical Machine Translation, Athens, Greece. Association for Computational
Linguistics.
Ann Clifton and Anoop Sarkar. 2011. Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction. ACL 2011.
C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik.
cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free
Translation Models. In Proceedings of ACL, July, 2010
M. Freitag, M. Huck, and H. Ney. Jane: Open Source Machine Translation System Combination. In
Conference of the European Chapter of the Association for Computational Linguistics (EACL),
Gothenburg, Sweden, April 2014.
Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010. Head finalization: A simple
reordering rule for SOV languages. In Proceedings of the Joint Fifth Workshop on Statistical
Machine Translation and MetricsMATR, pages 244–251.
Philipp Koehn, Franz Josef Och, Daniel Marcu (2003). Statistical phrase based translation. In Proceedings
of the Joint Conference on Human Language Technologies and the Annual Meeting of the North
American Chapter of the Association of Computational Linguistics.
Koehn, Philipp and Hoang, Hieu (2007): Factored Translation Models, Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning (EMNLP-CoNLL)
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra
Constantin, Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation.
ACL 2007, demonstration session.
Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous Space Translation Models with
Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montreal, Canada. Association for
Computational Linguistics.
David Mareček, Rudolf Rosa, Petra Galuščáková and Ondřej Bojar: Two-step translation with grammatical
post-processing. In Proceedings of WMT 2011, EMNLP 6th Workshop on Statistical Machine
Translation, Edinburgh, UK, pp. 426–432, 2011
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in
the brain. Psychological review, 65(6), 386.
Skadiņa I., Brālītis E. 2008. Experimental Statistical Machine Translation System for Latvian. //
Proceedings of the 3rd Baltic Conference on HLT, Vilnius, 2008, 281-286.
Skadiņa I., K. Levāne-Petrova, G.Rābante. 2012. Linguistically Motivated Evaluation of English-Latvian
Statistical Machine Translation. // Human Language Technologies – The Baltic Perspective Proceedings of the Fifth International Conference Baltic HLT 2012, IOS Press, Frontiers in Artificial
Intelligence and Applications, Vol. 247, pp. 221-229.
Skadiņš, R., Goba, K., & Šics, V. (2010). Improving SMT for Baltic Languages with Factored Models. In
Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial
Intelligence and Applications, Vol. 219 (pp. 125–132). Riga: IOS Press.
Skadiņš, R., Šics, V., & Rozis, R. (2014). Building the World’s Best General Domain MT for Baltic
Languages. In A. Utka, G. Grigonytė, J. Kapočiūtė-Dzikienė, & J. Vaičenonienė (Eds.), Frontiers in
Artificial Intelligence and Applications: Volume 286. Human Language Technologies – The Baltic
Perspective - Proceedings of the Sixth International Conference Baltic HLT 2014 (pp. 141–148).
Kaunas, Lithuania: IOS Press. doi:10.3233/978-1-61499-442-8-141
Sara Stymne. 2011. Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages.
In Proceedings of the ACL 2011 Student Session. Pages 12-17. June 19-24, 2011. Portland,
Oregon, USA.
Tamchyna, A., & Bojar, O. (2013). No Free Lunch in Factored Phrase-Based Machine Translation. In
Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science
Volume 7817 (pp. 210-223). Springer Berlin Heidelberg.
Gregor Thurmair, 2009: Comparing different architectures of hybrid machine translation systems. MT
Summit XII: proceedings of the twelfth Machine Translation Summit, August 26-30, 2009, Ottawa,
Ontario, Canada; pp.340-347.
Vasiļjevs, A., Skadiņš, R., & Tiedemann, J. (2012). LetsMT!: Cloud-Based Platform for Do-It-Yourself
Machine Translation. In Proceedings of the ACL 2012 System Demonstrations (pp. 43–48). Jeju
Island, Korea: Association for Computational Linguistics. Retrieved from
http://www.aclweb.org/anthology/P12-3008
Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post, and Adam Lopez. 2011. Joshua 3.0:
Syntax-based Machine Translation with the Thrax Grammar Extractor. In Proceedings of WMT 2011.
Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve
smt for subject-object-verb languages. In North American Chapter of the Association for
Computational Linguistics, pages 245–253.