LINKÖPINGS UNIVERSITET Institutionen för datavetenskap Kognitionsvetenskapliga programmet Artificiell Intelligens, HKGBB0, ht 2005

Machine Translation - Corpus Linguistics
Joel Hinz
Because communication between people, companies, and even nations is so important, the
need for good translators has long been obvious around the world. Since humans are
fallible, it is not surprising that many attempts have been made throughout history to
automate the process of translation. This is called machine translation. Corpus
linguistics, one branch of the machine translation tree, uses statistical methods to analyse
text samples (corpora) and draws conclusions from the results. The goal of this
paper is to identify and account for the problems and possibilities associated with this
kind of statistical approach to translation, to give a brief overview of the history of the
topic, and to take a glimpse at projects currently under research.
1.0 Introduction
1.1 History
1.2 Possibilities
1.3 Some things MT cannot do
2.0 Corpora
2.1 What is a corpus?
2.2 Methods used in corpora
3.0 Corpora in machine translation
3.1 What corpora can offer MT
3.2 The Linköping Translation Corpus (LTC)
3.3 The PLUG Word Aligner (PWA)
4.0 Conclusion
1.0 Introduction
1.1 History
Translation is by nature a very old skill, which makes it hard to credit any particular
person or people with the idea of automating the craft. However, the first serious (and
quite optimistic) attempts were made in the 1950s, and IBM held the first public
demonstration of a machine translation (hereafter MT) system in 1954 [1]. Although today’s
linguists tackle different problems than the linguists of that time, and their approaches
vary substantially, the conclusions that can be drawn from the results are still roughly
the same. A lexicon is enough to translate individual words, and those words can let a
native speaker of the target language understand the general meaning of the text. But since
languages differ in syntax and grammar, the translations are more often than not rough and
unusable for anything other than getting the overall meaning across (there are, as will be
shown later, exceptions to this).
Apart from statistical and example-based methods such as corpus linguistics (CL),
several other ways of making MT feasible have been proposed, for instance grammar-based
and dictionary-based methods. This paper, however, focuses solely on CL.
1.2 Possibilities
Considering the abundance of spoken and written languages, it is easy to hype up the
possibilities of MT systems. There are limitations, of course (covered in section 1.3),
but the steady shortage of decent translators in itself makes MT a must. As Doug Arnold
and co-authors state in Machine Translation - An Introductory Guide [2]: “it seems as
though automation of translation is a social and political necessity for modern societies
which do not wish to impose a common language on their members” [3]. Therefore it is
important to realise what MT is capable of and what it is not.
[2] Arnold, Machine Translation - An Introductory Guide (web version), 2001
Disregarding the future for a moment, MT can already be a great tool for humans [4],
for instance by quickly providing a first translation draft, which is often quite
time-consuming to produce manually. Speed is one of its primary assets, if not the primary
one. MT could also be used to quickly translate a large number of texts of which only a
handful will actually require in-depth translation. A person could then go through the
drafts and decide which are relevant enough to hand over to a human translator.
What the future will bring is very much a matter of debate, and even more so one
of speculation. The science-fictionesque dream of 100 % perfect, instant translation
between any languages is certainly not realistic today, and it may even be
impossible in principle, as argued by Yehoshua Bar-Hillel in 1959 [5]. In any case, it
is quite realistic to hope for further advancements in the field, and even though perhaps
no-one will ever find that elusive, “magic” formula, the tools of today might be
1.3 Some things MT cannot do
It is important to realise that the goal of most (maybe all) people working on MT is not
complete and perfect translation of any given text. Literary texts, for instance, often
require skills other than just interpreting [6], since their authors often try to convey
feelings through their sentences, unlike technical documentation, manuals et cetera. Another
topic currently under research is speech-to-speech translation, but, as Arnold states [7], “In
general, there are many open research problems to be solved before MT systems will
become close to the abilities of human translators.”
[3] Arnold, p.4
[4] Arnold, p.8
[5] According to Hutchins, Machine Translation: Past, Present, Future, 1986
[6] Arnold, p.6
[7] Arnold, p.11
2.0 Corpora
2.1 What is a corpus?
Being a statistical method, CL requires fast automatic data processing (imagine going
through some 20 000 words without the help of a computer), and as such it has grown in
popularity only fairly recently [8]. McEnery & Wilson list four characteristics of the modern
corpus [9]: sampling and representativeness, finite size, machine-readable form, and status
as a standard reference. I will attempt to describe each.
Of course, any text sample could be called a corpus, corpus being Latin for
“body” and thus meaning “body of text”. But for a corpus to be useful, it has to be
representative. The texts to be included in a corpus must therefore be carefully selected,
just as no statistician who wishes to be taken seriously would include only 80-year-olds in
a survey supposed to represent a whole population. As put by McEnery & Wilson [10]: “What we are
looking for is a broad range of authors and genres which, when taken together, may be
considered to ‘average out’ and provide a reasonably accurate picture of the entire
language population in which we are interested.”
A finite corpus has the advantage of providing stable data that can be counted and
compared reliably, whereas an open-ended corpus changes constantly, which clouds the
samples somewhat but allows new texts to be added that may reflect reality better. Both
types exist, but unless stated otherwise a corpus is assumed to be of finite size.
Machine-readable form simply means that the corpus should be available in a form that a
machine can read. This is almost always the case today. A standard reference is also fairly
self-explanatory, meaning that “a corpus constitutes a standard reference for the language
variety that it represents” [11], and therefore it can be used by researchers in successive
[8] McEnery & Wilson, web supplement to Corpus Linguistics, 2nd ed., 2001, s.1 p.12. For a good overview
of the history of CL, also see Leech, The state of the art in corpus linguistics, 1991
[9] McEnery & Wilson, s.2 p.1.1-4
[10] McEnery & Wilson, s.2 p.1.1
[11] McEnery & Wilson, s.2 p.1.4
2.2 Methods used in corpora
A number of methods can be used in the construction and use of a corpus. Some of them
contradict each other and some override others, so a selection has to be made when a
corpus is being built.
First, a decision should be made whether to create a qualitative or a quantitative
corpus. Both have their advantages and disadvantages. A qualitative corpus is
complete and detailed in the sense that it accounts for every single occurrence and
meaning of a word, since it does not discriminate between words that appear many times
and words that appear only once. For instance, the word “ball” could mean a sphere or it
could mean a dance party. The problem is that findings from qualitative corpora cannot be
generalised, because the results are not necessarily statistically reliable: an apparent
pattern may be due to chance. A quantitative corpus, however, classifies every word and
counts the occurrences. That way it can be used to form statistical models, and it is
directly comparable to other corpora of the same kind, meaning the results can be
generalised. The drawback is that it classifies words on an Aristotelian basis: a word
either belongs to a class or it does not, with no in-betweens. Quantitative analysis is
really an ideal method, but it is not always achievable, as words tend to have multiple
meanings.
Another decision the linguist has to make is what kind of texts the corpus
should represent. Since a sample could in theory be of infinite length, this was a problem
before computers could handle large bodies of text, although today it is somewhat less
important. The sample can, or perhaps even should [12], consist of multiple types of texts
(or genres), and if it is not representative of its population, conclusions about the
population based on the sample may not be valid. Still, the classification of genres is
done by human linguists and as such could be affected by the beliefs of those linguists.
One relatively easy way of working quantitatively is to count the frequency of
given words (or short sequences of words), either by counting each word form separately or
by counting their types. For example, the words “hunt”, “hunted” and “hunter” could be
classified as three separate words, or as three instances of the lexeme “hunt”.
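The two ways of counting can be sketched in a few lines of Python. This is only an illustration: the lemma table and the sample sentence are made up for the example, and a real corpus tool would use a proper lemmatiser rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical, hand-written lemma table (an assumption for this example).
# It follows the paper's grouping of "hunted" and "hunter" under "hunt".
LEMMAS = {"hunted": "hunt", "hunter": "hunt"}


def count_word_forms(tokens):
    """Count every distinct word form separately."""
    return Counter(tokens)


def count_lexemes(tokens, lemmas=LEMMAS):
    """Count occurrences per lexeme by mapping each form to its lemma."""
    return Counter(lemmas.get(token, token) for token in tokens)


tokens = "the hunter hunted where others hunt".split()
form_counts = count_word_forms(tokens)
lexeme_counts = count_lexemes(tokens)

print("word forms:", dict(form_counts))
print("lexemes:   ", dict(lexeme_counts))
```

Counted as word forms, “hunt” appears once in the sample; counted as a lexeme, it appears three times.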
[12] McEnery & Wilson, s.3 p.4
Raw frequency counting is a rather crude approach, though. It cannot be used
to compare separate texts to each other: a text with 47 instances of the word “cow” may
say more or less about cows than another text, also with 47 instances of it, especially if
one of them is much longer than the other. Proportions allow comparisons, since they measure
the ratio of a word to the size of the text and not just how many times it appears. 14
instances out of 200 words (a ratio of 0.07) is obviously more than 14 out of 2 000 (a
ratio of 0.007).
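The arithmetic behind the proportions can be written out as a small Python function. The function name is chosen for this example only; the numbers are the ones used in the text.

```python
def relative_frequency(occurrences, total_words):
    """Return the share of a text that a given word accounts for."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return occurrences / total_words


# Same raw count (14), very different text sizes.
ratio_short_text = relative_frequency(14, 200)
ratio_long_text = relative_frequency(14, 2000)

print(f"short text: {ratio_short_text:.3f}")
print(f"long text:  {ratio_long_text:.3f}")
```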
Statistical methods are also always subject to significance tests. Again
using the example of the number of cows in a text, can we always be sure that the
difference in ratio is big enough to assume that it is due to “cow” being a more important
word in one of the texts? The answer is no. Using significance tests, the compared results
can be evaluated and a difference accepted as significant at a given level of confidence,
usually 95 or 99 %, although other levels are possible too.
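As an illustration of such a test, the sketch below applies a chi-squared test to the cow example. The choice of the chi-squared test is an assumption made for this example, not necessarily the test the paper had in mind, and the text sizes (1 000 and 5 000 words) are invented. The critical values 3.841 and 6.635 are the standard chi-squared thresholds for one degree of freedom at the 95 % and 99 % levels.

```python
CRITICAL_95 = 3.841  # chi-squared critical value, 1 degree of freedom, 95 %
CRITICAL_99 = 6.635  # chi-squared critical value, 1 degree of freedom, 99 %


def chi_squared_2x2(hits_a, size_a, hits_b, size_b):
    """Chi-squared statistic for a word's frequency in two texts.

    The 2x2 table has one row per text and the columns
    "occurrences of the word" and "all other words".
    """
    a, b = hits_a, size_a - hits_a
    c, d = hits_b, size_b - hits_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    return n * (a * d - b * c) ** 2 / denominator


# 47 instances of "cow" in a 1 000-word text and in a 5 000-word text
# (hypothetical text sizes).
statistic = chi_squared_2x2(47, 1000, 47, 5000)
print("significant at 95 %:", statistic > CRITICAL_95)
print("significant at 99 %:", statistic > CRITICAL_99)

# Same proportion in both texts: no difference to detect.
same_ratio = chi_squared_2x2(47, 1000, 235, 5000)
```

With these invented sizes the difference is significant at both levels, while two texts with the same proportion of “cow” give a statistic of zero.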
Associations are another potentially highly relevant part of a corpus, in the form
of collocations. The word “queen” may bring to mind not only an image of
a queen but also one of a playing card (like the queen of hearts). The problem is that it
is hard to know whether two words occur together by chance or
because of a real collocation. To address this, pairs of words are given scores, and
the higher the score, the stronger the association between them. A comparison is made
between the probability of the pairing being a result of chance and the probability that
the words belong together (as “chopping” and “block” may form “chopping block”).
Collocations are useful for identifying sequences of words that often appear together, and
also for discriminating between different meanings of the same word. It also works the
other way around: collocations make it easier to discover different words that have roughly
the same meaning, and to detect the minor differences between them.
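One common way of scoring word pairs is pointwise mutual information (PMI), which compares how often a pair is observed with how often it would be expected if the two words were independent. The paper does not say which score it has in mind, so PMI is an assumption here, and the sample sentence and the `min_count` threshold are made up for the illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs with pointwise mutual information.

    PMI = log2( P(pair) / (P(left) * P(right)) ). Pairs seen fewer than
    min_count times are skipped, because PMI overrates very rare pairs.
    """
    if len(tokens) < 2:
        return {}
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (left, right), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


text = ("the chopping block and the old chopping block "
        "stood by the old door and the cook")
scores = pmi_scores(text.split())
best_pair = max(scores, key=scores.get)
print(best_pair)
```

In this toy text “chopping block” gets the highest score, because “chopping” never appears without “block”, whereas “the” combines with many different words.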
In order to compare a larger number of samples (the methods mentioned earlier
are useful only when comparing a few samples), multivariate techniques should be
applied, by which it is possible to take statistical similarities and summarise them, thus
“creating” a smaller sample. Some of the most common methods are factor analysis,
multidimensional scaling and cluster analysis. All of them start from a cross tabulation of
the data before work on the results begins. This really is a topic of its own and I will
therefore not attempt to explain the techniques in depth.
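A cross tabulation in this context is simply a table of counts with one row per text and one column per word. A minimal sketch, with two invented sample texts and a function name chosen for this example, could look like this:

```python
from collections import Counter


def cross_tabulate(texts, words):
    """Build a table of word counts: one row per text, one column per word."""
    table = {}
    for name, text in texts.items():
        counts = Counter(text.lower().split())
        table[name] = {word: counts[word] for word in words}
    return table


# Invented sample texts for the illustration.
texts = {
    "farm": "the cow and the calf",
    "city": "the tram and the bus",
}
table = cross_tabulate(texts, ["the", "cow", "tram"])
for name, row in table.items():
    print(name, row)
```

The multivariate techniques named above would take a table like this as their input; they are not implemented here.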
Finally, log-linear models can be used to trace the origin of a collocation or the
cause of two or more words standing together. In this method, all combinations of
removing words from a phrase are tried, and the one with the worst results (i.e. the least
preserved meaning) is considered the cause. For instance, with four words 15
combinations will be tried (1-2-3-4, 1-2-3, 1-2-4, 1-3-4, 2-3-4, 1-2, 1-3, 1-4, 2-3, 2-4, 3-4,
1, 2, 3, and 4).
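The figure of 15 is the number of non-empty subsets of four items (2^4 - 1). The enumeration can be reproduced with Python's standard `itertools.combinations`; the function name and the sample phrase are chosen for this example, and the sketch only lists the combinations, it does not fit a log-linear model.

```python
from itertools import combinations


def word_subsets(words):
    """List every non-empty subset of word positions, largest first."""
    positions = range(1, len(words) + 1)
    return [
        combo
        for size in range(len(words), 0, -1)
        for combo in combinations(positions, size)
    ]


subsets = word_subsets(["a", "big", "chopping", "block"])
print(len(subsets))
print(["-".join(str(p) for p in combo) for combo in subsets])
```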
3.0 Corpora in machine translation
3.1 What corpora can offer MT
There is no single universal way of doing corpus-based machine translation, but there
seems to be a tendency to rely more on bilingual parallel texts. Guidère claims that “a
bilingual corpus is richer in information about the language than a monolingual
corpus” [13]. I believe that explanation is often best done by example, and I have
therefore chosen two projects through which I intend to show typical uses of CL in MT.
3.2 The Linköping Translation Corpus (LTC)
The LTC is a Swedish corpus consisting of translations of English source material.
Most of the texts are either fiction or software manuals, but there is some dialogue too,
making a grand total of slightly more than 1 500 000 words. The translations were
made either by humans or with IBM’s translation memory tool, or in one case
completely automatically, as shown in table 1.
Table 1, copied from Merkel [14]
The LTC includes data on word type/token ratio, sentence type/token ratio, average
number of words per sentence, number of repeated sentences, and recurrent sentence
rate [15]. The string-level data can then be used to improve (or train, if you
will) statistical models for machine translation. It can also be used to find collocations,
which in turn can help statistical models even further.
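To make these measures concrete, the sketch below computes them for a tiny invented sample. The exact definitions used for the LTC are not given in this paper, so the ones in the code are assumptions: a ratio is types divided by tokens, a repeated sentence is any sentence occurrence whose text appears more than once, and the recurrent sentence rate is the share of such occurrences.

```python
from collections import Counter


def corpus_statistics(sentences):
    """Compute simple string-level statistics for a list of sentences.

    The definitions are assumptions made for this example, not
    necessarily the ones used for the LTC.
    """
    tokens = [word for s in sentences for word in s.lower().split()]
    sentence_counts = Counter(s.lower() for s in sentences)
    # Count every occurrence of a sentence that appears more than once.
    repeated = sum(c for c in sentence_counts.values() if c > 1)
    return {
        "word_type_token_ratio": len(set(tokens)) / len(tokens),
        "sentence_type_token_ratio": len(sentence_counts) / len(sentences),
        "avg_words_per_sentence": len(tokens) / len(sentences),
        "repeated_sentences": repeated,
        "recurrent_sentence_rate": repeated / len(sentences),
    }


# Invented sample in the style of a software manual.
sample = [
    "click the ok button",
    "click the cancel button",
    "click the ok button",
]
stats = corpus_statistics(sample)
for name, value in stats.items():
    print(name, value)
```

Software manuals tend to repeat whole sentences, which is why measures of this kind are interesting for translation memory tools.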
This is enough to make observations, as Merkel does, but not necessarily enough
to be of direct help in actual translation. For that purpose a word aligner is needed, which
I will also illustrate by example in the next section. A comparison of the source and
target texts in the corpus showed that the translations and the originals were quite alike,
although not extremely so, in all of the above-mentioned aspects [16].
The conclusions drawn after experimenting with the LTC were that “a great deal
of information can be extracted”, and that the simple statistical methods used were
actually quite successful [17].
[14] M. Merkel, Comparing source and target texts in a translation corpus, p.1
[15] Merkel, p.1
[16] Merkel, p.2
[17] Merkel, p.5
3.3 The PLUG Word Aligner (PWA)
The PLUG Word Aligner is a joint effort between the universities of Uppsala and
Linköping, created in 1997-2000, and contains the Linköping Word Aligner and the Uppsala
Word Aligner in the same package. Both are automatic alignment tools for
bilingual parallel texts, but for the sake of simplicity I will concentrate on the Linköping
variant, LWA.
The goal of a word aligner such as the LWA is to “find link instances in a bitext
and to generate a non-probabilistic translation lexicon from the link instances” [18], meaning
that it searches for pairs of corresponding words in two texts (a bitext is a text and its
translation, aligned sentence by sentence) and tries to create a lexicon of translations
based on them. This lexicon can then be used as a translation resource.
Table 2, copied from Hillertz [19]
[18] Ahrenberg et al., A System for Incremental and Interactive Word Linking, p.2
[19] Hillertz, Korpusbaserad maskinöversättning, p.11
An easy example would be the sentence in table 2, taken from Merkel [20]. It shows
how words are linked to each other (the capitalised words are the linked ones), sometimes
word by word and sometimes through short phrases, like “set up”, which consists of two
words in English but only one in Swedish.
LWA uses an iterative algorithm, so that links are detected in successive passes.
Every time a match is made, the linked words are excluded from the next
search. When no more links can be found in the text, the process comes to a halt. Alternatively,
or as a complement, a human can review the proposed links and adjust or correct them before
letting the loop continue, which improves the linking and thus in effect the lexicon.
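The link-and-remove loop described above can be illustrated with a much simplified sketch. It is not the LWA implementation: the scoring measure (the Dice coefficient over sentence co-occurrence), the threshold, the function names and the tiny English-Swedish bitext are all assumptions made for this example, and only single-word links are handled, so a phrase such as “set up” would not be linked as a unit.

```python
from collections import Counter


def dice_scores(bitext):
    """Score every source/target word pair by sentence co-occurrence."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_set, tgt_set = set(src), set(tgt)
        src_counts.update(src_set)
        tgt_counts.update(tgt_set)
        pair_counts.update((s, t) for s in src_set for t in tgt_set)
    return {
        (s, t): 2 * count / (src_counts[s] + tgt_counts[t])
        for (s, t), count in pair_counts.items()
    }


def iterative_link(bitext, threshold=0.9):
    """Repeatedly link the best-scoring pair and remove it from the search."""
    remaining = [(list(src), list(tgt)) for src, tgt in bitext]
    lexicon = {}
    while True:
        scores = dice_scores(remaining)
        if not scores:
            break  # nothing left to link
        # Ties are broken alphabetically so the result is deterministic.
        (s, t), best = max(scores.items(), key=lambda item: (item[1], item[0]))
        if best < threshold:
            break  # no sufficiently reliable link remains
        lexicon[s] = t
        remaining = [
            ([w for w in src if w != s], [w for w in tgt if w != t])
            for src, tgt in remaining
        ]
    return lexicon


# Invented sentence-aligned bitext (English source, Swedish target).
bitext = [
    (["a", "dog", "sleeps"], ["en", "hund", "sover"]),
    (["a", "cat", "sleeps"], ["en", "katt", "sover"]),
    (["a", "dog", "runs"], ["en", "hund", "springer"]),
]
lexicon = iterative_link(bitext)
print(lexicon)
```

A human review step, as mentioned in the text, would fit between two passes of the loop: the proposed pair could be shown to a reviewer before it is added to the lexicon.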
4.0 Conclusion
In this paper, I have accounted for what a corpus is and how it can influence machine
translation. I have also shown examples of research projects that use corpora, and gone
through a brief history of machine translation.
Personally, I find corpus linguistics a very enjoyable topic, and I am glad I got the
opportunity to study it more closely than I had originally planned. The usefulness of
machine translation is obvious and I trust that it will be researched properly in the
future. I know from experience how much time a complete translation can take, and I for one
welcome any assisting tools.
As for current projects, there are many that seem interesting. I chose the ones in
Linköping for really no other reason than that I reside there.
I don’t think the goal should be to produce a perfect translator. It may well be
impossible, and even if it isn’t, chances are it would take far too long anyway. Instead,
MT has proven a great tool for humans (both translators and others), and though it is good,
there is still much room for improvement, making it a suitable topic for research. And let’s
not neglect the economic value of speedier translation!
[20] Merkel, Understanding and Enhancing Translation by Parallel Text Processing, p.123
W. J. Hutchins, Machine Translation: Past, Present, Future, 1986
M. Merkel, Understanding and Enhancing Translation by Parallel Text Processing, 1999
D. J. Arnold, L. Balkan, R. Lee Humphreys, S. Meijer & L. Sadler. Machine Translation:
An Introductory Guide (051018)
Wikipedia pages on corpus linguistics and machine translation, author unknown (051016) (051021)
McEnery & Wilson, web supplement to Corpus Linguistics, 2nd ed., 2001 (051019)
M. Guidère, Toward Corpus-Based Machine Translation for Standard Arabic, 2001 (051020)
M. Merkel, Comparing source and target texts in a translation corpus (051012)
L. Ahrenberg, M. Andersson, M. Merkel, A System for Incremental and Interactive Word
Linking (051013)
Hillertz, Korpusbaserad maskinöversättning (051018)