Literature Review: Machine Translation

Aurelia Drummer (DRMAUR002)
Department of Computer Science, University of Cape Town
Abstract
This paper provides a general overview of machine translation. It covers a brief
history and the basic principles of machine translation followed by a description
of a number of language models and a conceptual introduction to statistical
machine translation.
1 Introduction
Machine translation (MT) is the use of computerised systems to translate text from one language, the source language (SL), into equivalent text in a second language, the target language (TL). This can be done with or without the help of human translators and editors, although the goal is to have minimal human aid [7].

MT systems aim to achieve quality translations that are both semantically equivalent to the source text and syntactically correct.

There have been many approaches to MT that rely on language analysis to varying degrees. Statistical machine translation (SMT) is one of the newest approaches and has been the focus of the last decade of MT research. SMT systems are based on the statistical analysis of large bodies of parallel text, as opposed to other MT systems, which rely more heavily on human knowledge for their development [10].

This review covers a brief history of MT in general, various approaches to SMT, and the difficulties and motivation for its use in South Africa.

2 History
The concept of automatic translation of text has been around since the 17th century [7]. Of course, at this time the idea consisted only of rudimentary word-for-word translation of texts and was entirely hypothetical.

With the development of modern computer systems as well as advances in linguistic theory, automatic translation has become a reality. Serious work on MT began in the late 1940s and it continues to be an important and practical field of research.

From 1947 until the late 1960s much work was done on MT systems [7]. Advances in linguistics allowed researchers to develop systems that relied heavily on language analysis and linguistic understanding of the specific languages involved.

In 1966 the Automatic Language Processing Advisory Committee (ALPAC) concluded that MT could not produce quality translations as quickly or affordably as human translators [1, 17]. Their report resulted in a decline in MT research in the United States for the decade that followed. Research continued in other parts of the world, especially in Canada, Japan, and Europe.

Experiments in statistical machine translation began at IBM in the late 1980s [17]. The feasibility of SMT systems increased in the 1990s as the Internet and large language databases made more parallel text available. In 2006 an open-source SMT tool called Moses was released and is currently the most complete SMT software available [12].

3 Basics of Machine Translation
In general, MT systems comprise one or more sets of rules that govern the transformation of a sentence in the SL into a translationally equivalent sentence in the TL [13].

Ideally, an MT system should both preserve the meaning of the source sentence and follow the grammatical and syntactic rules of the TL when generating the target sentence. This is generally done in a number of steps, such as the direct translation of words, determining how many words in the TL correspond to a word or phrase in the SL (this is known as the fertility of the word) [13], and deciding between different translations of an ambiguous or tensed word.

In order to achieve this, MT systems make use of large bilingual (or multilingual) dictionaries that provide word-for-word translation between the languages, along with a set of rules or a model that determines the ordering and fertility of the words in the target sentence [7]. These models can be determined by automatic or manual linguistic analysis of the relevant languages.

SMT systems aim to assist these models by resolving semantic ambiguity according to the results of statistical analysis of large bodies of already translated texts, or parallel texts. This allows for the development of general SMT systems that, when provided with enough parallel text, can be trained relatively quickly to translate between any pair of languages [13].

The common models used in MT systems can be categorised as follows.

3.1 Finite State Transducers
Finite state transducers (FSTs) are a variation of finite state automata (FSAs). An FSA consists of a set of states, a set of labels, and a set of transitions. FSAs read and write labels as they transition between the states.

FSTs extend this concept by having two sets of labels, one for the SL and one for the TL. As one label is read in from the source text, the appropriate transition is made and a label from the TL set is written to the output [13]. The different types of FST models are classified according to the types of labels in the sets.

Usually FST systems are made up of multiple FSTs joined together to provide both word or phrase translation and the reordering of the words or phrases [13].
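
The read-transition-write behaviour of a finite state transducer can be sketched in a few lines of Python. This is only a minimal illustration: the state names, the four-word English-to-Afrikaans vocabulary, and the names `TOY_FST` and `transduce` are assumptions made for the example and do not come from any of the systems cited here.

```python
from typing import Dict, List, Tuple

# (current state, source label) -> (next state, target label)
Transitions = Dict[Tuple[str, str], Tuple[str, str]]

# Hypothetical toy English-to-Afrikaans transducer. The states (S, D, N) and
# the vocabulary are illustrative assumptions, not part of a real system.
TOY_FST: Transitions = {
    ("S", "the"): ("D", "die"),
    ("D", "dog"): ("N", "hond"),
    ("D", "cat"): ("N", "kat"),
    ("N", "sees"): ("S", "sien"),
}


def transduce(transitions: Transitions, start: str, source: List[str]) -> List[str]:
    """Read SL labels one at a time, follow the matching transition,
    and write the TL label attached to that transition."""
    state = start
    target: List[str] = []
    for label in source:
        # Raises KeyError when no transition exists for this state and label.
        state, written = transitions[(state, label)]
        target.append(written)
    return target


print(" ".join(transduce(TOY_FST, "S", "the dog sees the cat".split())))
# prints: die hond sien die kat
```

A practical system would chain several such transducers, feeding the output labels of one into the next, so that translation and reordering are handled by separate machines.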

3.1.1 Word-for-word Models
Word-for-word FST models are the simplest MT models. Translation is done on the word level and then the words in the TL are reordered according to the syntax of the TL. In these models the labels in the FST are words. In general the first FST in the system will take individual words and duplicate them according to their fertility. The next FST will translate the words from the SL to words in the TL. The last FST will reorder the words [13].

Word-for-word models are no longer commonly used as they cannot compete with phrase-based or syntax-based models [12].

3.1.2 Phrase-Based Models
Phrase-based models offer an improvement on word-for-word models by segmenting the source sentence into phrases instead of simply by word [12]. These SL phrases must be matched or aligned with a phrase in the TL. The phrases are translated and then the sentence is reordered on a phrase level. Each of these steps will be performed by a separate FST in the system.

3.2 Synchronous Context-Free Grammars
A synchronous context-free grammar is a type of syntax-based model. Syntax-based models segment the sentences to be translated into syntactic parts, such as noun phrases, verb phrases, prepositions, etc. [18]. These segments can then be translated and reordered. This approach is intuitive because the structure of a language (and thus the reordering of words when manually translating a sentence) is defined in terms of the parts of speech and not by specific words or common phrases.

Synchronous context-free grammars (SCFGs) are an extension of context-free grammars (CFGs). A CFG consists of a set of non-terminals and a set of terminals along with a mapping from each non-terminal to one or more sequences of terminals and non-terminals. Instead of a single grammar, SCFGs have two grammars that are defined at the same time [13]. In an SCFG each non-terminal is mapped to two sets of sequences, one for each language. This system allows for the reordering of multiple phrases, single phrases, and words.

4 Use of Statistics
Bayesian statistical methods are often used to supplement the above models [3]. Probabilities can be used when an SMT system needs to decide between a number of translations of a given word or phrase. Bayes' theorem is given by the formula below.

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \qquad (1)$$

In statistical translation systems, A and B would be the words, phrases, or even sentences in the SL and TL respectively. The probabilities P(B|A), P(A), and P(B) are calculated by examining the frequency of A and B in the parallel texts [13].
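
As a small illustration of Equation 1, the sketch below estimates P(A), P(B), and P(B|A) from frequency counts and combines them with Bayes' theorem. The aligned word pairs are invented for the example (an assumption, not data from any cited corpus), and `bayes_posterior` is a hypothetical helper name. Exact fractions are used so that the arithmetic can be checked by hand.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Tuple

# Hypothetical aligned (SL word, TL word) pairs standing in for a parallel text.
PAIRS: List[Tuple[str, str]] = (
    [("bank", "bank")] * 6 + [("bank", "oewer")] * 2 + [("shore", "oewer")] * 2
)


def bayes_posterior(pairs: List[Tuple[str, str]], a: str, b: str) -> Fraction:
    """Return P(A=a | B=b) computed as P(B|A) * P(A) / P(B) from pair counts."""
    total = len(pairs)
    count_a = Counter(src for src, _ in pairs)  # frequency of each SL word
    count_b = Counter(tgt for _, tgt in pairs)  # frequency of each TL word
    count_ab = Counter(pairs)                   # frequency of each aligned pair

    p_a = Fraction(count_a[a], total)
    p_b = Fraction(count_b[b], total)
    # Raises ZeroDivisionError if `a` never occurs in the pairs.
    p_b_given_a = Fraction(count_ab[(a, b)], count_a[a])
    return p_b_given_a * p_a / p_b


# P(B|A) = 2/8, P(A) = 8/10, P(B) = 4/10, so P(A|B) = (2/8 * 8/10) / (4/10)
print(bayes_posterior(PAIRS, "bank", "oewer"))
# prints: 1/2
```

With these toy counts the result matches the direct estimate, 2 of the 4 occurrences of "oewer", which is a quick sanity check that the three probabilities were counted consistently.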

5 Evaluation of Translations
Evaluation of the various MT systems was initially done manually by
subjective human evaluators. However, a number of automatic tools have
been developed recently. These tools
provide metrics or measurements for
evaluating the accuracy of a translation [8]. In general the accuracy of a
machine-translated text is determined
by comparing it to the same text that has been translated by a human.

Approaches to automatic evaluation are briefly described below.

5.1 Word Error Rates
Word error rates (WER) are calculated by counting the number of changes that need to be made to an automatically translated text to transform it into an existing translation of the same source sentence. These changes include substitution, addition, and deletion of words [14]. Some variations of WER evaluation do not take the position of the words in the sentence into account; these evaluation approaches are called position-independent error rates.

5.2 NIST
The NIST evaluation system uses N-grams (sequences that are N words long) in the translated text to determine its accuracy. The N-grams in the automatically translated text are compared to those in the manual translation of the same source sentence. The number of N-grams that the sentences have in common determines the score of the translation [5].

5.3 BLEU
Similarly to NIST evaluations, BLEU uses N-grams to calculate a score for an automatically translated sentence. However, BLEU uses a geometric function (a geometric mean of the N-gram precisions) to calculate the final accuracy score [15].

6 Application in South Africa
The South African constitution recognises 11 official languages. English is the most internationally used South African official language and thus it is the most commonly used language in the commercial and official domains. However, it is only the 5th most common home language [16]. The constitution guarantees equal status to all official languages. This should include making good quality translations of official documents available in each of the 11 official languages.

Education is also a major problem in a country with a large number of languages. Textbooks, educational tools, and literature should be made available to as many people as possible and thus in as many languages as possible.

These problems require a great deal of translation and it would be ideal for the majority of this translation to be done automatically.

Unfortunately, SMT systems require large parallel texts, and there is not enough data to create a sufficiently good SMT system for South African languages. A great deal more manual translation would need to be done if a purely statistical system were to be used. It may be possible for SMT systems to be improved upon with further analysis of the structure of African languages and the application of this analysis in the adaptation of one of the existing translation models.

7 Conclusion
There has been vast progress in the field of machine translation over the last 60 years. Statistical systems are showing promise as the increasingly global and digital world provides copious amounts of readily available parallel text. This is not the case in developing countries such as South Africa.

References
[1] ALPAC. Languages and machines: Computers in translation and linguistics. A report by the Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Academy of Sciences, 1966.
[2] D Arnold, L Balkan, S Meijer, R Humphreys, and L Sadler. Machine translation: An introductory guide. NCC Blackwell, 1994.
[3] Adam L Berger, Peter F Brown, Stephen A Della Pietra, Vincent J
Della Pietra, John R Gillett, John D Lafferty, Robert L Mercer, Harry
Printz, and Luboš Ureš. The candide system for machine translation. In
Proceedings of the workshop on Human Language Technology, pages 157–
162. Association for Computational Linguistics, 1994.
[4] Alexandra Birch, Phil Blunsom, and Miles Osborne. A quantitative analysis of reordering phenomena. In Proceedings of the Fourth Workshop on
Statistical Machine Translation, pages 197–205. Association for Computational Linguistics, 2009.
[5] George Doddington. Automatic evaluation of machine translation quality
using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, HLT ’02,
pages 138–145, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
[6] F. J. Och and H. Ney. Statistical machine translation. In EAMT Workshop,
pages 39–46, 2005.
[7] John Hutchins. Machine translation: History and general principles. pages
2322–2332, 1994.
[8] John Hutchins. Evaluation of machine translation and translation tools.
In: Survey of the State of the Art in Human Language Technology, pages
418–419, 1997.
[9] John Hutchins. Machine translation and computer-based translation tools:
What's available and how it's used. A New Spectrum of Translation Studies.
University of Valladolid, 2003.
[10] John Hutchins. The history of machine translation in a nutshell. Retrieved
December, 20:2009, 2005.
[11] Kevin Knight. Automating knowledge acquisition for machine translation.
AI Magazine, 18(4):81, 1997.
[12] Philipp Koehn. Statistical machine translation. http://www.statmt.org/.
[13] Adam Lopez. Statistical machine translation. ACM Computing Surveys
(CSUR), 40(3):8, 2008.
[14] Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 160–167. Association for Computational Linguistics, 2003.
[15] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a
method for automatic evaluation of machine translation. In Proceedings of
the 40th annual meeting on association for computational linguistics, pages
311–318. Association for Computational Linguistics, 2002.
[16] SouthAfrica.info. The languages of South Africa. http://www.southafrica.info/about/people/language.htm, 2012. [Online; accessed 23-April-2013].
[17] TAUS. A translation automation timeline. http://www.translationautomation.com/timeline/a-translationautomation-timeline, 2013. [Online; accessed 27-April-2013].
[18] Kenji Yamada and Kevin Knight. A syntax-based statistical translation
model. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics, 2001.