LINKÖPINGS UNIVERSITET Institutionen för datavetenskap Kognitionsvetenskapliga programmet Artificiell Intellligens, HKGBB0, ht 2005 Machine Translation - Corpus Linguistics Joel Hinz joehi914 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz Abstract Due to the importance of communication between two or more people, companies, or even nations, the need for good translators has long been obvious around the world, and since humans are faulty, it is not surprising that many attempts to automate the process of translation have been made throughout history. This is called machine translation. Corpus linguistics, a branch of the machine translation tree, utilises statistical methods to analyse text samples – corpora – and makes conclusions based on the results. The goal of this paper is to identify and account for the problems and possibilities associated with this kind of statistical approach to translation, as well as give a brief view of the history of the topic and glimpse at projects currently in research. 1 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz Content 1. Introduction 1.1 History________________________________________ 1.2 Possibilities_________________________________ 1.3 Some things MT cannot do _________________________ 3 3 3 4 2. Corpora___________________________________ 5 2.1 Some things MT cannot do________________________ 5 2.2 Methods used in corpora______________________________ 6 3.0 Corpora in machine translation________________ 8 3.1 What corpora can offer MT ____________________________ 8 3.2 The Linköping Translation Corpus (LTC)________________ 8 3.3 3.2 The PLUG Word Aligner (PWA)____________________ 10 4.0 Conclusion______________________________ 11 References________________________________ 12 2 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz 1.0 Introduction 1.1 History The skill of translation is by nature a very old one, and that fact alone makes it hard to argue that a particular person or people first came up with the idea of automating the craft. However, the first serious (and quite optimistic) attempts were conducted in the 1950s, and IBM held the first public machine translation (hereafter MT) system demonstration in 19541. Although today’s linguists have to tackle different problems than the linguists of the time and their approaches vary substantially, the conclusions that can be drawn from the results are still roughly the same: a lexicon is enough to translate individual words and those words can make a native speaker of the language the text is being translated into understand the general meaning of the text – but since languages differ in syntax and use different grammar the translations are more often than not just rough and not usable for anything other than getting the overall meaning across (there are, as will be shown later, exceptions from this). Apart from statistical and example-based methods such as corpus linguistics (CL) several ways of making MT feasible have been proposed, among which for instance grammar based and dictionary-entry based methods, but in this paper focus will be solely on CL. 1.2 Possibilities Considering the abundance of spoken and written languages, it is easy to hype up the possibilities of MT systems. There are limitations, of course (this topic covered in 1.3), but the steady shortage of decent translators in itself makes MT a must – as Doug Arnold and co-authors state in Machine Translation - An Introductory Guide2: “it seems as though automation of translation is a social and political necessity for modern societies 1 2 Wikipedia Arnold, Machine Translation - An Introductory Guide (web version), 2001 3 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz which do not wish to impose a common language on their members”3. Therefore it is important to realise what MT is capable of and what it is not. Disregarding the future for a moment, MT can today be a great tool for humans4, for instance by quickly providing a first translation draft that can often be quite timeconsuming when done manually – one of its primary assets, if not the, is definitely speed. It could also be used to quickly translate a large number of texts of which only a handful actually will require in-depth translation. A person could then go through the chunks of drafts and decide which are relevant enough to delegate to a human translator. What the future will bring is very much a matter of debate and more highly so one of speculation. The science-fictionesque dream of 100 % perfect, instant translations between any languages is certainly not realistic today, but it may even be that it’s impossible in principle, too – as reported by Yehoshua Bar-Hillel in 19595. In any case, it is quite realistic to hope for further advancements in the fields, and even though perhaps no-one will ever find that elusive, “magic” formula, the tools of today might be enhanced. 1.3 Some things MT cannot do It is important to realize that the goal of most (maybe all) people working on MT is not complete and perfect translations of any given text. For instance, literary texts often require other skills than just interpreting6 since their authors often try to convey feelings through sentences, unlike technical documentations, manuals et cetera. Another topic currently under research is speech-to-speech translation, but, as Arnold states7, “In general, there are many open research problems to be solved before MT systems will become close to the abilities of human translators.” 3 Arnold, p.4 Arnold, p.8 5 According to Hutchins, Machine Translation: Past, Present, Future, 1986 6 Arnold, p.6 7 Arnold, p.11 4 4 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz 2.0 Corpora 2.1 What is a corpus? Being a statistical method, CL requires quick automatic data processing (imagine going trough some 20 000 words with no help of a computer) and as such has grown in popularity somewhat recently8. McEnery & Wilson list four characteristics of the modern corpus9 as sampling and representativeness, finite size, machine-readable form, and a standard reference. I will attempt to describe each. Of course, any text sample could be called a corpus, corpus being Latin for “body” and thus meaning “body of text”. But for a corpus to be useful, it has to be representative. Therefore, texts to be included in a corpus must be carefully selected, just as no statistician who wishes to be taken seriously includes only 80-year-olds in a survey supposed to represent a whole people. As put but McEnery & Wilson10: “What we are looking for is a broad range of authors and genres which, when taken together, may be considered to ‘average out’ and provide a reasonably accurate picture of the entire language population in which we are interested.” A finite corpus has the advantage of consisting of qualitative data, whereas an infinite one would change constantly thus clouding the samples somewhat but enabling new texts to be added that maybe reflect reality better. Both types exist, but it is implied that a corpus is of finite size. Machine-readable form simply means that it should be available for a machine to read. This is almost always the case today. A standard reference is also fairly selfexplanatory, meaning that “a corpus constitutes a standard reference for the language variety that it represents”11, and therefore it can be used by researchers in successive studies. 8 McEnery & Wilson, web supplement to Corpus Linguistics, 2nd ed., 2001, s.1 p.12. For a good overview of the history of CL, also see Leech, The state of the art in corpus linguistics, 1991 9 McEnery & Wilson, s.2 p.1.1-4 10 McEnery & Wilson, s.2 p.1.1 11 McEnery & Wilson, s.2 p.1.4 5 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz 2.2 Methods used in corpora There are a number of methods that can be utilised in the construction and use of a corpus. Some of them are somewhat contradictory to each other and some override others, and as such they require selection when a corpus is being built. First, a decision should be made whether as to create a qualitative or quantitative corpus. They both have their advantages and disadvantages. A qualitative corpus is complete and detailed in the sense that it understands every single occurrence and meaning of a word since it doesn’t discriminate between words that appear many times and words that appear only once. For instance, the word “ball” could mean a sphere or it could be a dance party. The problem is that qualitative corpora are not extendable due to the results not necessarily being statistically reliable – a found significance may be due to chance. A quantitative corpus, however, classifies every word and it counts occurrences. That way it can be used to form statistical models, and it is directly comparable to other corpora of the same kind, meaning the results can be generalised, but it classifies words on an Aristotelian basis – either a word is of a class or else it isn’t, no in-betweens. Quantitative analysis is really an ideal method, but it is not always achievable as words tend to have multiple meanings. Another decision the linguist has to make is what the kind of texts the corpus should represent. Since a sample could be of infinite length in theory, this was a problem before computers could handle large bodies of text, although today it is somewhat less important. The sample can, or perhaps even should12, consist of multiple types of texts (or genres), and if it is not representative of its population conclusions about the population based on the sample may not be valid. Still, the classification of genres are done by human linguists and as such could be affected by believes of those linguists. One relatively easy way of working quantitatively is by counting frequency of given words (or shorter sequences), either by counting the separate words or by counting their types. For example: the words “hunt”, “hunted” and “hunter” could be classified as separate words, or as three instances of the lexeme “hunt”. 12 McEnery & Wilson, s.3 p.4 6 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz The approach of frequency counting is rather botched, though. It cannot be used to compare separate texts to each other – a text with 47 instances of the word “cow” may speak more or less about cows than another text, also with 47 instances of it, especially if one of them is much larger than the other. Proportions allows comparisons since it counts the ratio of the words and not just the how many times they appear. 14 instances out of 200 is obviously more (ratio of 0.07) than 14 out of 2 000 (ratio of 0.007). Statistical methods are also always subject to significance tests. Two of the more common are the chi2 test and the t-test (recommended reading: http://en.wikipedia.org/wiki/Chi-square and http://en.wikipedia.org/wiki/T-test). Again using the example of the number of cows in a text, can we always be sure that the ratio difference is big enough to assume that it’s due to “cow” being a more important word in one of the texts? The answer is no. Using significance tests, result comparisons can be evaluated and the chance of an occurrence being significant given a percentage, usually with 95 or 99 % certainty although other values are possible to calculate with too. Associations are another potentially highly relevant part of a corpus, in the form of collocations. The word “queen”, when thought upon, may trigger not only an image of a queen but also one of a playing card (like the queen of hearts). There is a problem with this, namely that it is hard to know whether an occurrence is because of chance or because of a real collocation. To possibly avoid this, pairs of words are given scores and the higher the score the bigger the relevance between them. A comparison is made between the probability of the pairing being a result of chance and the probability that they belong together (like “chopping” and “block” may form “chopping block”). Collocations are useful for identifying passages of words that often appear together, and also for discriminating between different meanings for the same word. It also works the other way around, making it easier to discover different words that have roughly the same meaning – but can also detect the minor differences between them. In order to compare a bigger number of samples (the methods mentioned earlier are useful only when comparing a few samples) multivariate techniques should be applied, by which it is possible to take statistical similarities and summarise them, thus “creating” a smaller sample. Some of the most common methods are factor analysis, 7 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz multidimensional scaling and cluster analysis. All of them utilise cross tabulation (a good explanation and examples available at http://en.wikipedia.org/wiki/Cross_tab) before starting work on the results. This really is a topic on its own and I will therefore not attempt to explain them in depth. Finally, log-linear models can be used to trace the origin of a collocation or the cause of two or more words standing together. In this method, all combinations of a removing words from a phrase are tried, and the one with the worst results (i.e. lowest maintained meaning) is considered the cause. For instance, with four words 15 combinations will be tried (1-2-3-4, 1-2-3, 1-2-4, 1-3-4, 2-3-4, 1-2, 1-3, 1-4, 2-3, 2-4, 3-4, 1, 2, 3, and 4). 3.0 Corpora in machine translation 3.1 What corpora can offer MT There is not one universal way of using corpus-based machine translating, but there seems to be a tendency to rely more on bilingual parallel texts. Guidère claims that “a bilingual corpus is richer in information about the language than a monolingual corpus”13. I believe that explanation is often best executed by examples, and therefore I have chosen two examples through which I intend to show a typical example of CL at work in MT. 3.1 The Linköping Translation Corpus (LTC) The LTC is a Swedish corpus that consists of translations from English source material. Most of the texts are either fiction or software manuals, but there are some dialogue too – making a grand total of slightly more than 1’500’000 words. The translations have been conducted either by humans or by IBM’s translation memory tool, or in one case completely by automation, as shown by table 1. 13 Guidère 8 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz Table 1, copied from Merkel14 The LTC includes data for word type/token ratio, sentence type/token ratio, average number of words per sentence, number of repeated sentences, and recurrent sentence rate15. The string level data can then be used to make better versions of – or train if you will – statistical models for machine translation. It can also be used to find collocations which in turn can help statistical models even further. This is enough to make observations, as Merkel does, but not necessarily enough to be a direct help in actual translations. For that purpose a word aligner is needed, which I intend to show through example too in the next section. The differences, or perhaps rather resemblances, between the source and goal texts in the corpus showed how the goal texts and the originals in all of the above-mentioned aspects were quite alike albeit not extremely so16. The conclusions drawn after experimenting with the LTC were that “a great deal of information can be extracted”, and that the simple statistical methods used were actually quite successful.17 14 M. Merkel, Comparing source and target texts in a translation corpus, p.1 15 Merkel, p.1 16 Merkel, p.2 17 Merkel, p.5 9 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz 3.2 The PLUG Word Aligner (PWA) The PLUG Word Aligner is a joint effort between the universities of Uppsala and Linköping, and contains the Linköping Word Aligner and the Uppsala Word Aligner in the same package, created in 1997-2000. They are both automatic alignment tools for bilingual parallel texts, but for the sake of ease I will concentrate on the Linköping variant, LWA. The goal of a word aligner such as the LWA is to “find link instances in a bitext and to generate a non-probabilistic translation lexicon from the link instances”18, meaning that it searches for pairs in two texts (a bitext is a text and the translated version of it, compared sentence-for-sentence) and tries to create a lexicon of translations based on them. This lexicon can then be used. Table 2, copied from Hillertz19 18 Ahrenberg et. al., A System for Incremental and Interactive Word Linking, p.2 19 Hillertz, Korpusbaserad maskinöversättning, p.11 10 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz An easy example would be the sentence in table 2, taken from Merkel20. It shows how words are linked to each other (the capitalised words are the linked ones), sometimes word-by-word and sometimes through smaller phrases, like “set up” that consists of two words in English but only one in Swedish. LWA utilises an iterative algorithm, so that every link is detected and translated. Every time a match and translation is made, the linked words are removed from the next search. When no links exist in the looped text, the process comes to a halt. Alternatively, or as a complement, a human can review proposed links and adjust or correct them before letting the loop continue to make the linking and thus in effect the lexicon better. 4.0 Conclusion In this paper, I have accounted for what a corpus is, and how it can influence machine translation. I have also shown examples on research projects utilising this, and gone through a brief history of machine translation. Personally, I find corpus linguistics a very enjoyable topic, and I am glad I got the opportunity to study it more intensely than I had originally planned. The use of machine translation is obvious and I trust that it will be researched properly in the future. I know from experience how much time a complete translation can take, and I for one welcome any assisting tools. As for current projects, there are many that seem interesting. I chose the ones in Linköping for really no other reason than that I reside there. I don’t think the goal should be to produce a perfect translator, it may well be impossible and even if it isn’t, chances are that would take way to long anyway. Instead, MT has proven a great tool for humans (both translators and other), and though it is good, there is still much room for improvement, making it a suitable for research – and let’s not neglect the economical value of speedier translating! 20 Merkel, Understanding and Enhancing Translation by Parallel Text Processing, p.123 11 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz References LITERATURE W. J. Hutchins, Machine Translation: Past, Present, Future, 1986 M. Merkel, Understanding and Enhancing Translation by Parallel Text Processing, 1999 INTERNET D. J. Arnold, L. Balkan, R. Lee Humphreys, S. Meijer & L. Sadler. Machine Translation: An Introductory Guide http://www.essex.ac.uk/linguistics/clmt/MTbook (051018) Wikipedia pages on corpus linguistics and machine translation, author unknown http://en.wikipedia.org/wiki/Corpus_linguistics (051016) http://en.wikipedia.org/wiki/Machine_translation (051021) McEnery & Wilson, web supplement to Corpus Linguistics, 2nd ed., 2001 http://bowland-files.lancs.ac.uk/monkey/ihe/linguistics/contents.htm (051019) M. Guidère, Toward Corpus-Based Machine Translation for Standard Arabic, 2001 http://accurapid.com/journal/19mt.htm (051020) M. Merkel, Comparing source and target texts in a translation corpus http://www.ida.liu.se/~magme/publications/merkel-comparing.pdf (051012) L. Ahrenberg, M. Andersson, M. Merkel, A System for Incremental and Interactive Word Linking http://www.ida.liu.se/~magme/publications/ahrenberg-lrec-2002-new.pdf (051013) 12 LINKÖPINGS UNIVERSITET Kognitionsvetenskapliga programmet Machine Translation - Corpus Linguistics Joel Hinz Hillertz, Korpusbaserad maskinöversättning http://www.ida.liu.se/~HKGBB0/studentpapper-02/anna-hillertz.pdf (051018) 13