Literature Review: Machine Translation

Aurelia Drummer (DRMAUR002)
Department of Computer Science, University of Cape Town

Abstract

This paper provides a general overview of machine translation. It covers a brief history and the basic principles of machine translation, followed by a description of a number of language models and a conceptual introduction to statistical machine translation.

1 Introduction

Machine translation (MT) is the use of computerised systems to translate text from one language, the source language (SL), into equivalent text in a second language, the target language (TL). This can be done with or without the help of human translators and editors, although the goal is to have minimal human aid [7].

MT systems aim to achieve quality translations that are both semantically equivalent to the source text and syntactically correct.

There have been many approaches to MT that rely on language analysis to varying degrees. Statistical machine translation (SMT) is one of the newest approaches and has been the focus of the last decade of MT research. SMT systems are based on the statistical analysis of large bodies of parallel text, as opposed to other MT systems, which rely more heavily on human knowledge for their development [10].

This review covers a brief history of MT in general, various approaches to SMT, and the difficulties and motivation for its use in South Africa.

2 History

The concept of automatic translation of text has been around since the 17th century [7]. At that time the idea consisted only of rudimentary word-for-word translation of texts and was entirely hypothetical.

With the development of modern computer systems and advances in linguistic theory, automatic translation has become a reality. Serious work on MT began in the late 1940s and it continues to be an important and practical field of research.

From 1947 until the late 1960s much work was done on MT systems [7]. Advances in linguistics allowed researchers to develop systems that relied heavily on language analysis and linguistic understanding of the specific languages involved.

In 1966 the Automatic Language Processing Advisory Committee (ALPAC) concluded that MT could not produce quality translations as quickly or affordably as human translators [1, 17]. Their report resulted in a decline in MT research in the United States for the decade that followed. Research continued in other parts of the world, especially in Canada, Japan, and Europe.

Experiments in statistical machine translation began at IBM in the late 1980s [17]. The feasibility of SMT systems increased in the 1990s as the Internet and large language databases made more parallel text available. In 2006 an open-source SMT tool called Moses was released and is currently the most complete SMT software available [12].

3 Basics of Machine Translation

In general, MT systems consist of one or more sets of rules that govern the transformation of a sentence in the SL into a translationally equivalent sentence in the TL [13].

Ideally, an MT system should both preserve the meaning of the source sentence and follow the grammatical and syntactic rules of the TL when generating the target sentence. This is generally done in a number of steps, such as the direct translation of words, determining how many words in the TL correspond to a word or phrase in the SL (this is known as the fertility of the word) [13], and deciding between different translations of an ambiguous or tensed word.

In order to achieve this, MT systems make use of large bilingual (or multilingual) dictionaries that provide word-for-word translation between the languages, along with a set of rules or a model that determines the ordering and fertility of the words in the target sentence [7]. These models can be determined by automatic or manual linguistic analysis of the relevant languages.

SMT systems aim to assist these models by resolving semantic ambiguity according to the results of statistical analysis of large bodies of already translated texts, known as parallel texts. This allows for the development of general SMT systems that, when provided with enough parallel text, can be trained relatively quickly to translate between any pair of languages [13].

The common models used in MT systems can be categorised as follows.

3.1 Finite State Transducers

Finite state transducers (FST) are a variation of finite state automata (FSA). An FSA consists of a set of states, a set of labels, and a set of transitions. An FSA reads labels as it transitions between the states.

FSTs extend this concept by having two sets of labels, one for the SL and one for the TL. As one label is read in from the source text, the appropriate transition is made and a label from the TL set is written to the output [13]. The different types of FST models are classified according to the types of labels in the sets.

Usually FST systems are made up of multiple FSTs joined together to provide both word or phrase translation and the reordering of the words or phrases [13].

3.1.1 Word-for-word Models

Word-for-word FST models are the simplest MT models. Translation is done on the word level and then the words in the TL are reordered according to the syntax of the TL. In these models the labels in the FST are words. In general the first FST in the system will take individual words and duplicate them according to their fertility. The next FST will translate the words from the SL into words in the TL. The last FST will reorder the words [13].

Word-for-word models are no longer commonly used as they cannot compete with phrase-based or syntax-based models [12].

3.1.2 Phrase-Based Models

Phrase-based models offer an improvement on word-for-word models by segmenting the source sentence into phrases instead of single words [12]. These SL phrases must be matched or aligned with a phrase in the TL. The phrases are translated and then the sentence is reordered on a phrase level. Each of these steps will be performed by a separate FST in the system.

3.2 Synchronous Context-Free Grammars

A synchronous context-free grammar is a type of syntax-based model. Syntax-based models segment the sentences to be translated into syntactic parts, such as noun phrases, verb phrases, prepositions, etc. [18]. These segments can then be translated and reordered. This approach is intuitive because the structure of a language (and thus the reordering of words when manually translating a sentence) is defined in terms of the parts of speech and not by specific words or common phrases.

Synchronous context-free grammars (SCFG) are an extension of context-free grammars (CFG). A CFG consists of a set of non-terminals and a set of terminals, along with a mapping from each non-terminal to one or more sequences of terminals and non-terminals.

Instead of a single grammar, SCFGs have two grammars that are defined at the same time [13]. In an SCFG each non-terminal is mapped to two sets of sequences, one for each language. This system allows for the reordering of multiple phrases, single phrases, and words.

4 Use of Statistics

Bayesian statistical methods are often used to supplement the above models [3]. Probabilities can be used when an SMT system needs to decide between a number of translations of a given word or phrase. Bayes' theorem is given by the formula below.

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \tag{1}$$

In statistical translation systems A and B would be the words, phrases, or even sentences in the SL and TL respectively. The probabilities P(B|A), P(A), and P(B) are calculated by examining the frequency of A and B in the parallel texts [13].

5 Evaluation of Translations

Evaluation of the various MT systems was initially done manually by subjective human evaluators. However, a number of automatic tools have been developed recently. These tools provide metrics or measurements for evaluating the accuracy of a translation [8]. In general the accuracy of a machine-translated text is determined by comparing it to the same text that has been translated by a human.

Approaches to automatic evaluation are briefly described below.

5.1 Word Error Rates

Word error rates (WER) are calculated by counting the number of changes that need to be made to an automatically translated text to transform it into an existing translation of the same source sentence. These changes include substitution, addition, and deletion of words [14]. Some variations of WER evaluation do not take the position of the words in the sentence into account. These evaluation approaches are called position-independent error rates.

5.2 NIST

The NIST evaluation system uses N-grams (sequences that are N words long) in the translated text to determine its accuracy. The N-grams in the automatically translated text are compared to those in the manual translation of the same source sentence. The number of N-grams that the sentences have in common determines the score of the translation [5].

5.3 BLEU

Similarly to NIST evaluations, BLEU uses N-grams to calculate a score for an automatically translated sentence. However, BLEU combines the N-gram precisions using a geometric mean to calculate the final accuracy score [15].

6 Application in South Africa

The South African constitution recognises 11 official languages. English is the most internationally used South African official language and thus it is the most commonly used language in the commercial and official domains. However, it is only the fifth most common home language [16]. The constitution guarantees equal status to all official languages. This should include making good quality translations of official documents available in each of the 11 official languages.

Education is also a major problem in a country with a large number of languages. Textbooks, educational tools, and literature should be made available to as many people as possible and thus in as many languages as possible.

These problems require a great deal of translation and it would be ideal for the majority of this translation to be done automatically.

Unfortunately, SMT systems require large parallel texts. There is not enough data to create a sufficiently good SMT system for South African languages. A great deal more manual translation would need to be done if a purely statistical system were to be used. It may be possible for SMT systems to be improved upon with further analysis of the structure of African languages and the application of this analysis in the adaptation of one of the existing translation models.

7 Conclusion

There has been vast progress in the field of machine translation over the last 60 years. Statistical systems are showing promise as the increasingly global and digital world provides copious amounts of readily available parallel text. This is not the case in developing countries such as South Africa.

References

[1] ALPAC.